SUPPORTING FAITHFUL AND SAFE LIVE MALWARE ANALYSIS
by
Hao Shi
A Dissertation Presented to the
FACULTY OF THE USC GRADUATE SCHOOL
UNIVERSITY OF SOUTHERN CALIFORNIA
In Partial Fulfillment of the
Requirements for the Degree
DOCTOR OF PHILOSOPHY
(COMPUTER SCIENCE)
August 2017
Copyright 2017 Hao Shi
Dedication
To my dear parents, sister, and wife.
Acknowledgments
During my Ph.D. studies, many people have supported and enlightened me, and they deserve my sincere thanks.

In the first place, I would like to express my deepest gratitude and appreciation to my advisor, Professor Jelena Mirkovic, for her support, instruction, and encouragement throughout my entire time as a Ph.D. student. She always provides insightful suggestions on my work and has inspired me so much with her knowledge and passion for research. It has been an honor to be her student, and under her guidance I have sharpened my critical thinking, problem-solving, and teamwork skills; I believe these abilities will be a lifelong asset. I appreciate all the time and ideas she has contributed to my research, as well as the support and motivation she has provided me during the tough times of my Ph.D. journey.
I want to thank Prof. Ramesh Govindan, Prof. Viktor Prasanna, Prof. Minlan Yu, and Prof. Nenad Medvidovic for taking time out of their busy schedules to serve on my qualifying exam and dissertation committee. Their suggestions and advice have been invaluable to me.
I would like to thank my friends and colleagues at USC and ISI: Xiyue Deng, Ted Faber, Alefiya Hussain, Liang Zhu, Chengjie Zhang, Xue Cai, Lin Quan, Zi Hu, Hang Guo, Abdulla Alwabel, Lihang Zhao, Weiwei Chen, and many others. They have been very supportive, and the friendships we have established have enriched both my professional and personal life throughout my Ph.D. experience. I would also like to express my gratitude to Joe Kemp, Alba Regalado, and Jeanine Yamazaki for their assistance with administrative tasks at ISI.
Lastly, I would like to thank my family for their love and encouragement. I would
also like to give special thanks to my wife, Yimeng Lin, for her patience and willingness
to help me proofread all my research papers, as well as her faithful support during these
years.
Table of Contents
Dedication ii
Acknowledgments iii
List of Tables viii
List of Figures x
Abstract xii
1 Introduction 1
1.1 Thesis Statement . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3
1.2 Demonstrating Thesis Statement . . . . . . . . . . . . . . . . . . . . . 3
1.3 Structure of the Dissertation . . . . . . . . . . . . . . . . . . . . . . . 6
2 Finding System State Differences between Virtual and Physical Machines
using Cardinal Pill Testing 7
2.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7
2.2 Anti-Virtualization Techniques . . . . . . . . . . . . . . . . . . . . . . 10
2.3 Related Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11
2.3.1 Pill Testing . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11
2.3.2 Timing and String Attacks . . . . . . . . . . . . . . . . . . . . 12
2.4 Cardinal Pill Testing . . . . . . . . . . . . . . . . . . . . . . . . . . . 13
2.4.1 Testing Architecture . . . . . . . . . . . . . . . . . . . . . . . 13
2.4.2 Behavioral Model of Instruction . . . . . . . . . . . . . . . . . 16
2.4.3 Generating Test Cases for Intel x86 Instruction Set . . . . . . . 17
2.5 Detected Pills . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 27
2.5.1 Evaluation Process . . . . . . . . . . . . . . . . . . . . . . . . 27
2.5.2 Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 28
2.5.3 Exploring Undefined Behavior Model . . . . . . . . . . . . . . 35
2.6 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 37
3 Hiding Debuggers from Malware using Apate 39
3.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 39
3.1.1 Contributions . . . . . . . . . . . . . . . . . . . . . . . . . . . 40
3.2 Attack Vectors . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 42
3.2.1 Attacks On Debugging Principles . . . . . . . . . . . . . . . . 42
3.2.2 Detecting Debugger Traces . . . . . . . . . . . . . . . . . . . . 48
3.2.3 Completeness . . . . . . . . . . . . . . . . . . . . . . . . . . . 49
3.3 Apate . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 50
3.3.1 Overview . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 50
3.3.2 Handling Anti-Debugging . . . . . . . . . . . . . . . . . . . . 53
3.3.3 Attacks Against Apate . . . . . . . . . . . . . . . . . . . . . . 56
3.3.4 Uses of Apate . . . . . . . . . . . . . . . . . . . . . . . . . . . 57
3.4 Evaluation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 58
3.4.1 Anti-Debugging is Prevalent . . . . . . . . . . . . . . . . . . . 59
3.4.2 Apate Outperforms Other Debuggers . . . . . . . . . . . . . . 62
3.4.3 Apate Detects Known Vectors . . . . . . . . . . . . . . . . . . 65
3.4.4 Apate Deceives Malware . . . . . . . . . . . . . . . . . . . . . 69
3.4.5 Apate Comparable to MALT . . . . . . . . . . . . . . . . . . . 72
3.5 Limitations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 73
3.6 Related Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 74
3.7 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 76
4 Hiding Virtual Machines from Malware using VM Cloak 77
4.1 Related Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 78
4.1.1 Pill Hiding . . . . . . . . . . . . . . . . . . . . . . . . . . . . 78
4.2 Generating Hiding Rules from Cardinal Pills . . . . . . . . . . . . . . . 79
4.3 Integrating Hiding Rules with Existing Frameworks . . . . . . . . . . . 80
4.4 Integrating Hiding Rules with Apate – VM Cloak . . . . . . . . . . . . 81
4.5 Evaluation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 82
4.5.1 Anti-VM is Popular . . . . . . . . . . . . . . . . . . . . . . . 82
4.5.2 VM Cloak Deceives Known Malware . . . . . . . . . . . . . . 85
4.6 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 90
5 Supporting Live and Safe Malware Analysis using Pandora 91
5.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 92
5.2 Related Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 95
5.2.1 Network Containment Policies in Malware Analysis . . . . . . 95
5.2.2 Analysis on Malware’s Network Behavior . . . . . . . . . . . . 96
5.2.3 Machine Learning Techniques in Malware Analysis . . . . . . . 97
5.2.4 Fidelity of Malware Analysis . . . . . . . . . . . . . . . . . . . 98
5.2.5 Internet Host Mimicking . . . . . . . . . . . . . . . . . . . . . 99
5.3 Pandora . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 99
5.3.1 Overview . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 100
5.3.2 Logic Execution of Pandora . . . . . . . . . . . . . . . . . . . 101
5.3.3 Containment Policies for Network Traffic . . . . . . . . . . . . 102
5.3.4 Service Impersonators . . . . . . . . . . . . . . . . . . . . . . 105
5.4 Implementing Pandora on DeterLab Testbed . . . . . . . . . . . . . . . 105
5.4.1 DeterLab Testbed . . . . . . . . . . . . . . . . . . . . . . . . . 106
5.4.2 Minimizing Artifacts . . . . . . . . . . . . . . . . . . . . . . . 106
5.4.3 System Restore . . . . . . . . . . . . . . . . . . . . . . . . . . 107
5.5 Evaluation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 107
5.5.1 Overview of Malware Samples . . . . . . . . . . . . . . . . . . 108
5.5.2 Pandora Exposes More Malware Behavior . . . . . . . . . . . . 109
5.5.3 Pandora Enables Safe Malware Analysis . . . . . . . . . . . . . 109
5.5.4 Execution Overhead . . . . . . . . . . . . . . . . . . . . . . . 110
5.6 Pandora Facilitates Novel Malware Research . . . . . . . . . . . . . . 110
5.6.1 Overview of Malware’s Communication Patterns. . . . . . . . . 113
5.6.2 Case Analysis – Forwarding HTTP trac. . . . . . . . . . . . . 118
5.6.3 Case Analysis – Mimicking FTP Service. . . . . . . . . . . . . 118
5.6.4 Classifying Malware by Its Network Behavior . . . . . . . . . . 119
5.7 Discussion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 125
5.8 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 125
6 Discussion: State-of-the-art Malware Analysis Frameworks 127
6.1 OS Instrumentation . . . . . . . . . . . . . . . . . . . . . . . . . . . . 127
6.2 Virtual Machine Instrumentation . . . . . . . . . . . . . . . . . . . . . 129
6.3 Bare-metal Instrumentation . . . . . . . . . . . . . . . . . . . . . . . . 129
7 Conclusion 131
Bibliography 134
List of Tables
2.1 Test Cases Generated for aaa's Defined Behavioral Model . . . . . . . 19
2.2 Instruction Grouping . . . . . . . . . . . . . . . . . . . . . . . . . . . 22
2.3 Results Overview . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 29
2.4 Pills per Instruction Category (User-space) . . . . . . . . . . . . . . . . 30
2.5 Details of pills with regard to the resource being different in the final
state—in some cases multiple resources will differ so the same pill may
appear in different rows . . . . . . . . . . . . . . . . . . . . . . . . . . 30
2.6 Pills using Undefined/Defined Resources . . . . . . . . . . . . . . . . . 32
2.7 Undefined eflags Behaviors . . . . . . . . . . . . . . . . . . . . . . . 33
2.8 Axiom Pills . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 35
3.1 Classification of attack vectors, and their handling in Apate . . . . . . . 43
3.2 Data Sets for Evaluation . . . . . . . . . . . . . . . . . . . . . . . . . 59
3.3 Classifications of Unknown Malware Samples . . . . . . . . . . . . . . 60
3.4 Popularity of Anti-debugging Techniques . . . . . . . . . . . . . . . . 61
3.5 Spectrum of Anti-debugging Techniques . . . . . . . . . . . . . . . . . 62
3.6 Attack vectors, handled by different debuggers . . . . . . . . . . . . . 65
3.7 Results of Known Malware Samples . . . . . . . . . . . . . . . . . . . 66
3.8 Running a Packed Executable . . . . . . . . . . . . . . . . . . . . . . . 73
4.1 Example Hiding Rule Generation for aaa Instruction . . . . . . . . . . 80
4.2 Spectrum of Anti-VM Techniques . . . . . . . . . . . . . . . . . . . . 85
5.1 Containment Policies of Pandora . . . . . . . . . . . . . . . . . . . . . 104
5.2 Minimizing Artifacts of DeterLab . . . . . . . . . . . . . . . . . . . . 107
5.3 Concise Tagging of Malware Samples . . . . . . . . . . . . . . . . . . 109
5.4 NetDigest of a Session . . . . . . . . . . . . . . . . . . . . . . . . . . 111
5.5 Top 12 Application Protocols used by Malware . . . . . . . . . . . . . 113
5.6 Destination Ports of Malware’s First Packet . . . . . . . . . . . . . . . 114
5.7 Destination Ports of Malware’s Follow-up Packet after DNS Query . . . 115
5.8 Top 10 Domains Queried by Malware . . . . . . . . . . . . . . . . . . 116
5.9 Popularity of Top-level Domains Queried by Malware . . . . . . . . . . 117
5.10 Features of Malware Network Behavior . . . . . . . . . . . . . . . . . 120
5.11 Results of Classification on Testing Set . . . . . . . . . . . . . . . . . . 123
List of Figures
2.1 Logic Execution of Cardinal Pill Testing . . . . . . . . . . . . . . . . . 14
2.2 Defined Behavioral Model of Instruction . . . . . . . . . . . . . . . . . 16
2.3 Building Defined Behavioral Model for aaa Instruction . . . . . . . . . 18
2.4 Test Case Template (in MASM assembly) . . . . . . . . . . . . . . . . 20
3.1 Multilevel Inward-Jumping Sequence . . . . . . . . . . . . . . . . . . 48
3.2 Overview of Apate’s Operation . . . . . . . . . . . . . . . . . . . . . . 51
3.3 Anti-debugging Techniques in Each Sample . . . . . . . . . . . . . . . 60
3.4 API Attack - Checking Parent Process ID . . . . . . . . . . . . . . . . 64
3.5 Anti-debugging Techniques of Sample 1 . . . . . . . . . . . . . . . . . 67
3.6 Anti-debugging Techniques of Sample 2 . . . . . . . . . . . . . . . . . 67
3.7 Anti-debugging Techniques of Sample 3 . . . . . . . . . . . . . . . . . 69
3.8 Anti-debugging Techniques of Sample 4 . . . . . . . . . . . . . . . . . 70
4.1 Anti-VM Techniques of Sample 1 . . . . . . . . . . . . . . . . . . . . 87
4.2 Anti-VM Techniques of Sample 2 . . . . . . . . . . . . . . . . . . . . 88
4.3 Anti-VM Techniques of Sample 3 . . . . . . . . . . . . . . . . . . . . 90
5.1 Overview of Pandora . . . . . . . . . . . . . . . . . . . . . . . . . . . 100
5.2 Logic Execution of Pandora . . . . . . . . . . . . . . . . . . . . . . . 102
5.3 Flow Handling . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 103
5.4 Ranks of Domains from alexa.com . . . . . . . . . . . . . . . . . . . 116
5.5 Popularity of Top-level Domains . . . . . . . . . . . . . . . . . . . . . 117
5.6 Example NetDigest (md5: 0155ddfa6feb24c018581084f4a499a8) . 119
5.7 Classification Precision under Different Number of Sessions . . . . . . 124
Abstract
Current malware analysis relies heavily on the use of virtual machines and debuggers for safety and functionality. However, these analysis tools produce a variety of artifacts, which can be detected by malware and thus hinder malware analysis. First, a virtual machine usually executes multiple host instructions to simulate a single guest instruction. This inevitably introduces delay and often deviations from the instruction's semantics as defined in the instruction set manual. Existing research approaches that uncover differences between VMs and physical machines use randomized testing, and thus cannot completely enumerate these differences. Second, debuggers modify malware code and handle it in specific ways. These artifacts can be used by malware to detect debuggers. When malware detects the presence of such entities by using anti-VM and anti-debugging techniques, it may hide its original, malicious purposes. For example, malware may behave like a normal program, exit prematurely, escape from the analysis environment, or even crash the system. Third, current malware analysis usually examines malware in an isolated system without network access. However, recent malware tends to rely on networking to function properly. For example, bot clients have to fetch commands and payloads from bot masters on the Internet. Therefore, the current isolated analysis setting runs the risk of incomplete analysis. Other research efforts adopt no network restrictions at all, allowing malware to contact Internet hosts freely. These approaches may cause substantial damage to others if malware launches an attack, and thus are discouraged for legal and ethical reasons.
In this dissertation, we propose three pieces of work to address the above challenges. First, we propose cardinal pill testing, which aims to enumerate the differences between a given VM and a physical machine through carefully designed tests. Cardinal pill testing finds five times more pills by running fifteen times fewer tests than previous approaches.
Second, we propose Apate, a debugger plug-in, which systematically hides the
debugger from malware. In this piece of work, we enumerate anti-debugging attacks
and classify them into 6 categories and 16 sub-categories. We then develop techniques
that detect all these anti-debugging attacks and defeat them by hiding debugger artifacts. We develop novel techniques for handling attacks in two out of our six attack
categories. For three out of the remaining four categories, prior research has sketched
ideas for attack handling, but we implement and evaluate them.
Apate can also be used to detect and defeat cardinal pills, thanks to its extensible
design. Based on Apate, we propose a framework, called VM Cloak, to handle the
anti-VM attacks in malware. This is achieved by monitoring each executed malware
command, detecting potential pills, and modifying at run time the command’s outcomes
to match those that a physical machine would generate.
Finally, we propose a malware analysis framework, Pandora, to support live and safe malware experimentation. Pandora enforces a mixed containment policy for malware's network traffic, which provides the necessary network input for malware's normal execution and limits potential damage to Internet hosts to a minimal level. Depending on the nature of network flows, we may forward them to Internet hosts, redirect them to our service impersonators, fake replies to them, rate-limit them, or drop them. Pandora executes malware using a set of machines, which can be either physical or virtual, with or without Apate and VM Cloak. In order to help defenders better understand malware's network behavior, we propose a model, called NetDigest, to extract the session dynamics from network packets.
Chapter 1
Introduction
Malicious software, or malware, has existed for decades and causes huge economic losses to governments, business owners, and individuals across the world [HE+09, SGH+11, CGKP11]. In current malware analysis, defenders usually set up an isolated environment for malware execution. This isolated execution platform typically includes a virtual machine (VM), which is equipped with a variety of analysis functionalities, such as system call tracing [DR+08], system object tainting [SBY+08], fast system recovery [KVK11], and others [KVK14, BKK06, Int17b]. Debuggers are also among the most popular tools utilized by defenders, because they provide a convenient way for defenders to investigate malware's instructions and understand their semantics.
The adversaries are also aware of the above analysis settings and have devised a variety of ways to detect the analysis environment. Once malware discovers the existence of the analysis platform, it may behave like a normal program, exit prematurely, escape from the analysis tools, or even crash the system. Therefore, it is critical to handle the anti-analysis behaviors of malware in order to reveal its real, malicious purposes.
Anti-VM techniques of malware. Virtual machines differ from physical machines [PBB11] in several aspects, and malware can easily detect these differences. One source of differences is the method that virtual machines use to emulate guest instructions. The execution of an assembly instruction in a VM may cause a different side effect than in a physical machine, which can be detected by malware. These artifacts exist, to some extent, in all current virtualization techniques, including software emulation [Bel05], hardware-assisted virtualization [DR+08], para-virtualization [BDF+03], and interpretation [Law96]. The attack that exploits this type of virtualization defect is known as a semantic attack. Malware can also perform anti-VM attacks by searching for the artifacts left behind by the installation of virtual machines. For example, the virtualized CPU device name usually contains the keyword "virt" or a specific VM brand (e.g., "QEMU"). Similar names exist in hard drives, CD-ROM drives, and many other devices. Attacks that look for these artifacts are known as string attacks. Finally, malware may experience apparent execution delay, because virtual machines need to execute multiple host instructions for a single guest instruction. Malware can measure the time elapsed to run a code block; if the elapsed time exceeds a predefined threshold, a VM is detected. This type of attack is known as a timing attack.
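To make the timing check concrete, here is a minimal sketch in C using the MSVC __rdtsc intrinsic; the timed block and the cycle threshold are illustrative placeholders, not values taken from real malware.

#include <intrin.h>
#include <stdio.h>

int main(void)
{
    /* Read the time-stamp counter before and after a short code block. */
    unsigned __int64 t0 = __rdtsc();
    volatile int x = 0;
    for (int i = 0; i < 100; i++)
        x += i;                              /* the block being timed */
    unsigned __int64 t1 = __rdtsc();

    /* Emulation or single-stepping inflates the cycle count; the
     * threshold below is purely illustrative. */
    if (t1 - t0 > 10000)
        puts("suspiciously slow: possible VM or debugger");
    else
        puts("timing looks native");
    return 0;
}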
Anti-debugging techniques. While debuggers aid malware analysis, they also introduce plenty of artifacts. For example, if a binary is launched by a debugger, the binary's parent process will be the debugger instead of explorer, as it would be by default. Furthermore, in order to set a software breakpoint in malware's binary, a debugger has to replace the opcode byte at the breakpoint with a special one (0xcc). This byte raises a breakpoint exception upon execution, so the debugger can capture the exception and gain control over the malware. Malware can detect these artifacts with simple checks, such as calling NtQueryInformationProcess() or IsDebuggerPresent(), or scanning its binary code for the 0xcc byte. Debuggers are also vulnerable to timing attacks, because active debugging of malware, such as single-stepping, leads to substantial execution delay.
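As a rough illustration (not a catalogue of real malware code), the following Windows C sketch performs two of the checks just described: it calls IsDebuggerPresent() and scans the start of its own code for the 0xcc breakpoint byte; the 64-byte scan window is an arbitrary choice for the example.

#include <windows.h>
#include <stdio.h>

int main(void)
{
    /* API check: reads the BeingDebugged flag from the process PEB. */
    if (IsDebuggerPresent())
        puts("debugger detected via IsDebuggerPresent()");

    /* Scan the start of our own code for the 0xcc byte that debuggers
     * write when planting a software breakpoint. */
    const unsigned char *code = (const unsigned char *)&main;
    for (int i = 0; i < 64; i++) {
        if (code[i] == 0xCC) {
            puts("software breakpoint byte (0xcc) found in code");
            break;
        }
    }
    return 0;
}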
Understanding malware’s network behavior. After we handle the anti-VM and
anti-debugging techniques in malware, we believe malware will exhibit its real, mali-
cious behavior. For example, spammers need network access to send out emails to
a great number of hosts on the Internet; bot clients must communicate with their bot
2
masters to receive commands; key-loggers have to send collected information back to
attackers, etc. Without access to network, malware will not function as expected and
thus cannot be fully analyzed. Unfortunately, current malware analysis frameworks are
not able to support safe and live malware experimentation. For example, the research
eorts in [PDZ
+
14,KVK14,BHL
+
08] prevent malware from sending any packets to the
Internet, which limits their analysis scope if malware could not receive desired network
input. A few other work, such as [RD
+
11, GLB12, BO
+
07], allow malware to commu-
nicate with the hosts on the Internet without any restrictions. While these approaches
are capable of revealing malware’s network behavior, they fail to consider the potential
damage that malware may cause to the Internet hosts. For example, during the execution
of malware samples, they may actively participate in DDoS attacks or in sending spams
to victims.
1.1 Thesis Statement
Comprehensive analysis of possible evasive malware actions enables 1) near-complete
detection and hiding of virtual machine and debugger artifacts and 2) safe live malware
experimentation via fine-grained containment policies.
1.2 Demonstrating Thesis Statement
We undertake comprehensive analysis of system state and network traffic to reveal more malware behaviors and gain a better understanding of their purpose. We demonstrate the success of this approach through three research studies.
First, we propose cardinal pill testing to enumerate the differences in system state between a virtual and a physical machine. We group the instructions defined by the manual into distinct categories according to their semantics. For each instruction category, we carefully devise the ranges of its arguments that lead down different execution paths. We then select random values within these ranges for testing each instruction. We evaluate the performance of our cardinal pill testing and compare our results to those of red pill testing [MPR+09, MPFR+10]. Our evaluation shows that we use 15 times fewer tests and discover 5 times more pills than red pill testing. In addition, our testing is significantly more efficient: 47.6% of our test cases yield a pill, compared to only 0.6% of red pill tests.
Second, in order to handle anti-debugging techniques, we propose Apate – a framework for systematic debugger hiding, which is meant to be integrated with existing debuggers. In this piece of work, we enumerate the modifications to system state when a debugger is present, based on a variety of sources, such as [SH12, BBN12, Fer11, Fal07, Ope07, CAM+08, ZLS+15]. For example, debuggers change the memory of a debuggee process, leave detectable names, modify the default exception handling procedure, and alter many other system states. Then, we classify anti-debugging attacks into 6 categories and 16 subcategories. For two out of the six attack categories, we propose novel handling techniques. For three out of the remaining four categories, prior research has sketched ideas for attack handling, but they were not implemented or tested. We implement Apate as an extension to a popular debugger – WinDbg – and perform extensive evaluations. Our evaluation data sets include 881 unknown samples captured in the wild, 79 unit tests, 4 known binaries that have already been examined, and others. Apate outperforms its competitors, such as OllyDbg [Yus13] and ScyllaHide [Scy16], by handling at least 39% more anti-debugging attacks in the related data sets.
Because Apate is designed as a general malware analysis framework, it can also be
utilized to hide virtual machines from malware. Starting with our set of cardinal pills, we
extract the desired system state after executing an instruction into a knowledge base. For
each pill, we save the instruction name, argument values, and corresponding execution
results learned from a physical machine. Whenever Apate detects one of these instructions during malware execution, it replaces the execution results with those from the knowledge base. As a result, malware observes the same outcome as on a physical machine, and thus cannot detect the presence of the virtual machine.
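The hiding step can be pictured as a table lookup. The sketch below is our own simplified illustration rather than Apate's actual interface: a hypothetical knowledge-base entry maps a pill's instruction and inputs to the outputs observed on the physical machine, and the results produced by the VM are overwritten with those values.

#include <stdint.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical knowledge-base entry: a pill and the outputs that the
 * physical machine (Oracle) produced for it. */
typedef struct {
    char     mnemonic[16];     /* e.g., "aaa" */
    uint32_t input_eax;        /* simplified: a single input register */
    uint32_t oracle_eax;       /* eax value observed on the Oracle */
    uint32_t oracle_eflags;    /* eflags value observed on the Oracle */
} PillEntry;

/* When the VM finishes a detected pill, overwrite its results with the
 * Oracle's values so that malware sees physical-machine behavior. */
static int hide_pill(const PillEntry *kb, size_t n, const char *mnemonic,
                     uint32_t input_eax, uint32_t *eax, uint32_t *eflags)
{
    for (size_t i = 0; i < n; i++) {
        if (strcmp(kb[i].mnemonic, mnemonic) == 0 &&
            kb[i].input_eax == input_eax) {
            *eax    = kb[i].oracle_eax;
            *eflags = kb[i].oracle_eflags;
            return 1;                 /* result replaced */
        }
    }
    return 0;                         /* not a known pill: keep VM result */
}

int main(void)
{
    /* Illustrative entry; the numbers are made up, not real pill data. */
    PillEntry kb[] = { { "aaa", 0x0000001b, 0x00000101, 0x00000016 } };
    uint32_t eax = 0, eflags = 0;     /* pretend these are the VM's results */
    if (hide_pill(kb, 1, "aaa", 0x0000001b, &eax, &eflags))
        printf("replaced: eax=%08x eflags=%08x\n", eax, eflags);
    return 0;
}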
Third, with the goal of supporting live and safe malware experimentation, we propose a malware analysis framework, called Pandora, that enforces a mixed network containment policy. The goal of the policy is to provide the necessary network input for malware, such that malware will exhibit malicious activities instead of stalling when network access is forbidden. The policy should also prevent malware from causing substantial damage to hosts on the Internet, such as participating in DDoS attacks and spreading infections. To achieve these goals, we apply different routing rules to malware's network flows based on the nature of these flows. We may forward the flows to the Internet, redirect them to service impersonators within our control, fake replies to them, or drop the flows. In addition, we employ two rate-limit rules: 1) no more than 10 packets will be sent per second, and 2) no more than 10 distinct IP addresses will be contacted within a malware run. Finally, we perform extensive analysis of malware's network traffic by extracting the NetDigest from malware's packets to describe its conversations.
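As a rough sketch of how the two rate-limit rules could be enforced (an illustration of the idea, not Pandora's implementation), the code below tracks a per-second packet budget and the set of distinct destination IPs contacted during a run.

#include <stdint.h>

#define MAX_PKTS_PER_SEC  10
#define MAX_DISTINCT_IPS  10

/* Hypothetical per-run limiter state. */
typedef struct {
    long     current_sec;              /* second currently being counted */
    int      pkts_this_sec;            /* packets allowed in that second */
    uint32_t ips[MAX_DISTINCT_IPS];    /* distinct destinations contacted */
    int      ip_count;
} RunLimiter;

/* Returns 1 if the packet may leave containment, 0 if it must be dropped. */
static int allow_packet(RunLimiter *rl, long now_sec, uint32_t dst_ip)
{
    if (now_sec != rl->current_sec) {  /* new second: reset the budget */
        rl->current_sec = now_sec;
        rl->pkts_this_sec = 0;
    }
    if (rl->pkts_this_sec >= MAX_PKTS_PER_SEC)
        return 0;                      /* more than 10 packets this second */

    int known = 0;
    for (int i = 0; i < rl->ip_count; i++)
        if (rl->ips[i] == dst_ip) { known = 1; break; }
    if (!known) {
        if (rl->ip_count >= MAX_DISTINCT_IPS)
            return 0;                  /* would exceed 10 distinct IPs */
        rl->ips[rl->ip_count++] = dst_ip;
    }

    rl->pkts_this_sec++;
    return 1;
}

In the framework described above, a per-flow routing decision (forward, impersonate, fake reply, or drop) would be made before such a limiter is consulted.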
1.3 Structure of the Dissertation
This dissertation is organized along our three main studies, which take advantage of careful enumeration of system states and network behavior. In Chapter 2, we discuss the details of devising test cases for assembly instructions. The differences between a virtual machine and a physical machine are presented in this chapter. In Chapter 3, we present the design of Apate and its execution logic. The details of each anti-debugging attack and our handling policies are investigated. In Chapter 4, we apply Apate to handling anti-VM techniques, using the knowledge base acquired in Chapter 2. The spectrum of currently popular anti-VM techniques is presented. Finally, we devise a transparent malware analysis environment, Pandora, in Chapter 5. We show that Pandora can help understand malware's network behavior under our mixed network containment policy.
Chapter 2
Finding System State Differences
between Virtual and Physical Machines
using Cardinal Pill Testing
There are a variety of differences between a virtual and a physical machine; for example, the execution of certain instructions produces distinct eflags values. Malware can exploit these discrepancies to detect the existence of virtual machines and then take evasive actions, such as exiting or even crashing the VM. In this chapter, we propose a testing framework called cardinal pill testing to enumerate these differences in both user and kernel space.
2.1 Introduction
Today’s malware analysis [BB07, JMG
+
09, KW
+
11, YJZ
+
12, SRL12] relies on virtual
machines to facilitate fine-grained dissection of malware functionalities (e.g., Anu-
bis [BKK06], TEMU [SBY
+
08], and Bochs [Law96]). For example, virtual machines
can be used for taint analysis, OS-level information retrieval, and in-depth behavioral
analysis. Use of VMs also protects the host through isolating it from malware’s destruc-
tive actions.
Malware authors have devised a variety of evasive behaviors to hinder automated and manual analysis of their code, such as anti-dumping, anti-debugging, anti-virtualization, and anti-intercepting [Fer09, Fer06]. Kirat et al. [KVK14] detect 5,835 malware samples (out of 110,005) that exhibit evasive behaviors. The studies in [BBN12, LKMC11] show that anti-virtualization and anti-debugging techniques have become the most popular methods of evading malware analysis. Chen et al. [CAM+08] find in 2008 that 2.7% and 39.9% of 6,222 malware samples exhibit anti-virtualization and anti-debugging behaviors, respectively. In 2011, Lindorfer et al. [LKMC11] detect evasion behavior in 25.6% of 1,686 malicious binaries. In 2012, Branco et al. [BBN12] analyze 4 million samples and observe that 81.4% of them employ anti-virtualization and 43.21% employ anti-debugging.
Upon detecting a virtual environment or the presence of a debugger, malicious code can alter its execution path to appear benign, exit the program, crash the system, or even escape the virtual machine. Therefore, it is critically important to devise methods that handle anti-virtualization and anti-debugging, to support future malware analysis. In this chapter, we focus only on anti-virtualization handling.

We observe that malware can differentiate between a physical and a virtual machine due to numerous subtle differences that arise from their implementations. Let us call the physical machine an Oracle. Malware samples can execute sets of instructions with carefully chosen inputs (aka pills), and compare their outputs with the outputs that would be observed on an Oracle. Any difference leads to detection of VM presence. In addition to these semantic attacks, there are two other approaches to anti-virtualization – timing and string attacks (see Section 2.2). Our work focuses heavily on detecting and handling semantic attacks as they are the most complex. Our solution, however, also handles timing and string attacks.
Semantic attacks are successful because there are many differences between VMs and physical machines, and existing research in VM detection [MPR+09, MPFR+10, KYH+09] uses randomized tests that cannot fully enumerate these differences. We observe that when malware runs within a VM, all its actions are visible to the VM and all the responses are within the VM's control. If the differences between a physical machine and a VM could be enumerated, the VM or the debugger could use this knowledge to provide the expected behaviors when malware commands are executed, thus hiding the VM's presence. This is akin to kernel rootkit functionality, where the rootkit hides its presence by intercepting instructions that seek to examine processes, files, and network activity, and provides the replies that an uncompromised system would produce.
In this chapter, we propose cardinal pill testing, an approach that attempts to enumerate all the differences between a physical machine and a virtual machine that stem from their differences in instruction execution. These differences can be used for CPU semantic attacks (see Section 2.2). Our contributions include the following:

1. We improve on the previously proposed red pill testing [MPR+09, MPFR+10] by devising tests that carefully traverse the operand space and explore execution paths in instructions with a minimal set of test cases. For user-space testing, we use 15 times fewer tests and discover 5 times more pills than red pill testing. Our testing is also more efficient: 47.6% of our test cases yield a pill, compared to only 0.6% of red pill tests. In total, we discover between 7,487 and 9,255 pills, depending on the virtualization technology and the physical machine being tested. For kernel-space testing, instructions show a much higher yield rate (pills per test case) than user-space ones: 83.5%–85.5% vs. 38.5%–47.7%, depending on the virtualization mode of the VM.

2. We find two root causes of pills: (1) failure of virtual machines to strictly adhere to the CPU design specification, and (2) vagueness of the CPU design specification that leads to different implementations in physical machines. Only 2% of our pills stem from the second phenomenon.
All the scripts and test cases used in our study will be publicly released at our project
website (https://steel.isi.edu/Projects/cardinal/).
2.2 Anti-Virtualization Techniques
Anti-virtualization techniques can be classified into three broad categories [CAM+08, KYH+09]:
Semantic Attacks. Malware targets certain CPU instructions that have different effects when executed on virtual and on real hardware. For instance, the cpuid instruction in the Intel IA-32 architecture returns the tsc bit with value 0 under the Ether [DR+08] hypervisor, but outputs 1 on a physical machine [PBB11]. As another example found in our experiments, when moving the hex value 7fffffffh to floating point register mm1, the resulting st1 register is correctly populated as SNaN (signaling not-a-number) on a physical machine, but holds a random number in a QEMU-virtualized machine. Malware executes these pills and checks their output to identify the presence of a VM.
Timing Attacks. Malware measures the time needed to run an instruction sequence, assuming that an operation takes longer in a virtual machine than in a physical machine [Fer06]. Contemporary virtualization technologies (dynamic translation [Bel05], bytecode interpretation [Law96], and hardware assistance [DR+08]) all add significant delays to instruction execution. This method can also be used to detect debuggers, because single-stepping through code adds large delays.

String Attacks. VMs leave a variety of traces inside guest systems that can be used to detect their presence. For instance, QEMU assigns the "QEMU Virtual CPU" string to the emulated CPU and similar aliases to other virtualized devices such as the hard drive and CD-ROM. A simple query of the Windows registry will reveal the VM's presence [CAM+08].
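A related string check reads the CPU brand string directly with the cpuid instruction instead of going through the registry. The sketch below uses the MSVC __cpuid intrinsic; the substrings it looks for are illustrative, not an exhaustive list of what malware checks.

#include <intrin.h>
#include <stdio.h>
#include <string.h>

int main(void)
{
    int  regs[4];
    char brand[49] = { 0 };

    /* CPUID leaves 0x80000002-0x80000004 return the 48-byte processor
     * brand string (a robust check would first query the maximum
     * supported extended leaf via 0x80000000). */
    for (int i = 0; i < 3; i++) {
        __cpuid(regs, 0x80000002 + i);
        memcpy(brand + 16 * i, regs, sizeof(regs));
    }

    /* Emulated CPUs often advertise strings such as "QEMU Virtual CPU". */
    if (strstr(brand, "QEMU") || strstr(brand, "Virtual"))
        printf("VM-like brand string: %s\n", brand);
    else
        printf("brand string: %s\n", brand);
    return 0;
}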
The main focus of our work is on handling semantic attacks, as they are the most complex category to explore and enumerate. String attacks can be handled through enumeration and hiding of VM traces, which can be done by comprehensively listing and comparing files, processes, and Windows registry entries with and without virtualization. Also, timing attacks can be handled by systematically lying about the VM clock.
2.3 Related Work
In this section we discuss the work related to handling of semantic attacks (pill testing
and pill hiding) as well as handling of other anti-virtualization techniques.
2.3.1 Pill Testing
Martignoni et al. present the initial red pill work in EmuFuzzer [MPR+09]. They propose red pill testing – a method that performs a random exploration of a CPU instruction set and its parameter space to look for pills. Testing is performed by iterating through the following steps: (1) initialize input parameters in the guest VM, (2) duplicate the content of user-mode registers and process memory on the host, (3) execute a test case, and (4) compare the resulting states of register contents, memory, and exceptions raised—if there are any differences, the test case is a pill. In their follow-up work KEmuFuzzer [MPFR+10], the authors extend the state definition to include kernel-space memory, and test cases are embedded in the kernel to facilitate testing of privileged instructions. However, the authors test boundary and random values for explicit input parameters but do not examine implicit parameters, while we attempt to evaluate implicit parameters as well.
In their recent work [MMP+12], they use symbolic execution to translate the code of a high-fidelity emulator (Bochs) and then generate test cases that investigate all discovered code paths. Those test cases are used to test a lower-fidelity emulator such as QEMU. While this symbolic analysis can automatically detect the differences between a high-fidelity and a low-fidelity model, it is difficult to evaluate how accurately their high-fidelity model resembles a physical machine. In addition, the authors exclude test generation for floating-point instructions, since their symbolic execution engine does not support them. In our work, we use instruction semantics to carefully craft test cases that explore all code paths. We also use bare-metal physical machines as the Oracle, which improves the fidelity of the tests and helps us discover more pills.
Other works [SLC+11, LKMC11, BCK+10] focus on detecting anti-virtualization functions of malware by profiling and comparing malware behavior in virtual and physical machines. They do not uncover the details of the anti-virtualization methods that each individual binary employs, and they can only detect the anti-virtualization checks deployed by their malware samples, while we detect many more differences that could be exploited in future anti-virtualization checks.
2.3.2 Timing and String Attacks
Timing Attack. To feed malware with the correct time information, Vasudevan et al. [VY06] replace the rdtsc instruction with a mov instruction that stores the value of their internal processor counter to the eax register. However, it is unclear how they maintain the internal processor counter. In addition, malware can query a variety of time sources besides using rdtsc to fetch the time-stamp counter. The authors of [VY05] apply a clock patch, resetting the time-stamp counter to a value that mimics a latency close to that of normal execution. This work claims that it also performs the same reset on the real-time clock, since malware could use the real-time clock. Nevertheless, the details of the clock resetting are unclear, and an enumeration of the different time sources is not provided.
String Attacks. Chen et al. [CAM+08] propose that malware may mark a system as "suspicious" if it finds that certain tools are installed with well-known names and in well-known locations, such as "VMWare" and "OllyDbg". However, they do not provide a systematic method to hide the presence of these strings. Vasudevan et al. [VY06] merely state that they overwrite memory data that leaks the presence of debuggers with values copied from physical machines. The details of the memory data and the overwriting method are not mentioned.
2.4 Cardinal Pill Testing
In this section, we first introduce our testing infrastructure, which enables the evaluation of the same test cases on different pairs of virtual and physical machines. Then, we discuss the fundamental intuition behind our test case generation model, independent of the Instruction Set Architecture (ISA). Finally, we apply our generation model to the Intel x86 instruction set and describe how we group instructions to automate test case generation as much as possible.
2.4.1 Testing Architecture
Our testing architecture consists of three physical machines: a master, a slave hosting a virtual machine (VM), and a bare-metal slave that serves as a reference (the Oracle). The slaves are connected to the master by serial wires. The master generates test cases (Section 2.4.2) and schedules their execution on the slaves. On both slaves, we configure a daemon that helps the master set up a specific test case in each testing round.

The execution logic of our cardinal pill testing is illustrated in Figure 2.1. The master maintains a debugger that issues commands to, and transfers data back from, the slaves. The Oracle and the VM have the same test case set and the same daemon; we only show one pair of test case and daemon in Figure 2.1 for clarity. We set the slaves in kernel debugging mode so that they can be completely frozen when necessary. At the beginning, the master reboots the slave (either VM or Oracle) to obtain fresh system state. After the slave is online, the daemon signals its readiness to the master, which then evaluates test cases, one per round.

Figure 2.1: Logic Execution of Cardinal Pill Testing
We define the state of a physical or virtual machine as the set of all user and kernel registers, plus the data stored in the parts of the code, data, and stack segments that our test case accesses for reading or writing. In addition, the state also includes any potential exceptions that may be thrown.
During each round, the master interacts with the slave through three main phases. In the first phase, it issues a test case name to the daemon that resides on the slave; the daemon then asks the slave system to load this test case from its local disk. Afterwards, the system allocates memory, handles, and other resources needed by the test case program. When this system loading completes, the test case executes an interrupt instruction (int 3), which notifies the master and halts the slave. At this point, the master saves the raw state of the slave locally. We use this raw state to identify axiom pills (see Section 2.4.2) instead of discarding it [MPR+09, MPFR+10].

In the second phase, the master releases the slave to execute the test case's initialization code and raise the second interrupt. Instead of using the same initial system state for all test cases, we carefully tailor registers and memory for each test case, such that all possible exceptions and semantic branches can be evaluated (Section 2.4.2). The master copies the resulting initial state and releases the slave again.

In the third phase, the slave executes the actual instruction being tested and raises the last interrupt. The master stores this final state and uses it to determine whether the tested instruction, along with the initial state, is a cardinal pill (see Section 2.5.1). A test case may drive the slave into an infinite loop, or crash itself or its OS. To detect this, we set an execution time limit for each test case, so that the master can detect incapacitated slaves and restore them.
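Conceptually, the master's comparison step reduces to a structural equality check over the two saved final states. The sketch below is a simplified illustration; the fields shown are a small subset of the real state, which also covers kernel registers and all accessed memory.

#include <string.h>

/* Simplified snapshot of the state the master saves at each interrupt. */
typedef struct {
    unsigned int  gp_regs[8];          /* eax..edi */
    unsigned int  eflags;
    unsigned int  exception_code;      /* 0 if no exception was raised */
    unsigned char accessed_mem[256];   /* bytes the test case read or wrote */
} Snapshot;

/* A test case is a pill if the Oracle and the VM, started from the same
 * initial state, end in different final states. */
static int is_pill(const Snapshot *oracle_final, const Snapshot *vm_final)
{
    return memcmp(oracle_final->gp_regs, vm_final->gp_regs,
                  sizeof(oracle_final->gp_regs)) != 0
        || oracle_final->eflags != vm_final->eflags
        || oracle_final->exception_code != vm_final->exception_code
        || memcmp(oracle_final->accessed_mem, vm_final->accessed_mem,
                  sizeof(oracle_final->accessed_mem)) != 0;
}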
Finally, when evaluating test cases with user-space commands, we can set up the next test case as soon as the previous one has completed. After evaluating test cases with kernel-space commands, and after evaluating test cases that crash the OS, we must reboot the system before proceeding with testing.

Figure 2.2: Defined Behavioral Model of Instruction
2.4.2 Behavioral Model of Instruction
In a modern computer architecture, the program’s instructions are usually executed in a
pipeline style: fetching instructions from memory, decoding registers and memory loca-
tions used in the instruction, and executing the instruction. This pipeline can be modeled
as a directed, multi-source, and multi-destination graph, as shown in Figure 2.2. Each
source node stands for the input parameters that are demanded by the instruction. They
may be explicitly required by the instruction (solid line) in its mnemonic or implicitly
needed by its specification (dashed line). These parameters will be examined by cer-
tain condition checks and may go through some intermediate processing (Intermediate
Action). Finally, the execution of the instruction may end up with dierent operations
(Action 1 or Action 2) depending on the intermediate checks and actions. At parameter
fetching and action, exceptions may occur due to a variety of causes. For example, the
memory location of a source may be inaccessible (memory page not present or address
out of range). The intermediate action may cause an overflow, which will throw an
overflow exception. Furthermore, the purpose of the instruction itself may be to raise an
exception.
In most cases, the behavioral model of an instruction does not specify how certain
registers will be updated, because they are not consumed or produced by the instruction.
This incomplete specification leaves room for different implementations by different vendors. We found in our evaluation that these registers may still be modified by the CPU. We call these modifications undefined behaviors. Because we do not know the logic behind the undefined behaviors, there is no sound methodology to completely evaluate them other than exhaustive search. But exhaustive search is impractical, because the space of instruction parameters is prohibitively large. We briefly discuss our attempt to infer the semantics of undefined behaviors, and thus reduce the need for exhaustive search, in Section 2.5.3.
The goal of VMs is to faithfully virtualize the behavioral model for each instruction
of the ISA that they are emulating, including both normal and abnormal execution paths.
Based on these observations, we set up the following goals of our test case generation
algorithm:
- For defined behaviors of a given instruction, all execution branches should be evaluated. All flag bits that are read explicitly or implicitly, or updated with results, must be considered.
- All potential exceptions must be raised, such as memory access errors and invalid input arguments.
- Undefined behaviors should be investigated to reveal undocumented implementation specifics.
In the following section, we illustrate how we generate test cases based on the
defined behavioral models of Intel x86 instructions.
2.4.3 Generating Test Cases for Intel x86 Instruction Set
The Intel x86 instruction set is one of the most complex ISAs; it contains about 1,000 instructions, and most of them incorporate multiple execution paths. We first illustrate our test case generation approach on an example instruction – aaa – and then describe our general approach.

IF 64-Bit Mode
THEN
    #UD;
ELSE
    IF ((AL AND 0FH) > 9) or (AF = 1)
    THEN
        AL <- AL + 6;
        AH <- AH + 1;
        AF <- 1;
        CF <- 1;
        AL <- AL AND 0FH;
    ELSE
        AF <- 0;
        CF <- 0;
        AL <- AL AND 0FH;
    FI;
FI;
(a) Specification

(b) Defined Behavioral Model: a flow graph whose condition nodes test 64-bit mode (Cond 1), AL AND 0FH > 9 (Cond 2), and AF = 1 (Cond 3), leading to the #UD, THEN, and ELSE sink actions.

Figure 2.3: Building Defined Behavioral Model for aaa Instruction
An Example – aaa. The aaa instruction adjusts the sum of two unpacked binary coded decimal (BCD) values to create an unpacked BCD result. Its specification from the Intel manual is shown in Figure 2.3a, and the corresponding behavioral model is illustrated in Figure 2.3b. This instruction has no explicit parameters but needs four implicit parameters: the system mode and the al, ah, and af registers. The al and ah are 8-bit registers, and af is a one-bit flag in the eflags register. The behavioral model contains one intermediate action (and 0fh) and three sink actions (the #ud, then, and else nodes).

We aim to generate a minimal set of test cases that explores all possible code paths in this instruction's defined behavioral model. In addition, we want to cover all the boundary values that are used in the condition checks. Therefore, we generate the test cases shown in Table 2.1. In the first test case, we set the mode to 64 bit, so a #ud exception will be thrown. The second test sets both the al and ah registers to contain the minimal value, and sets af to 0. This test case evaluates the ELSE action at the end. The following test cases populate al, ah, and af to evaluate all possible execution paths and boundary values used in the instruction's defined behavioral model.

Table 2.1: Test Cases Generated for aaa's Defined Behavioral Model
No.  Mode/Cond 1   AL/Cond 2   AF/Cond 3   AH     Testing Goals
1    64 bit/true   N/A         N/A         N/A    bound(Cond 1), #ud exception
2    32 bit/false  0/false     0/false     0      min(AL), min(AH), and ELSE
3    32 bit/false  0/false     0/false     0ffh   min(AL), max(AH), and ELSE
4    32 bit/false  0/false     1/true      0c9h   min(AL), rand(AH), and THEN
5    32 bit/false  9/false     0/false     58h    bound(Cond 2), rand(AH), and ELSE
6    32 bit/false  9/false     1/true      0a6h   bound(Cond 2), rand(AH), THEN
7    32 bit/false  0ffh/true   0/false     0      max(AL), min(AH), and THEN
8    32 bit/false  0ffh/true   0/false     30h    max(AL), rand(AH), and THEN
9    32 bit/false  0ffh/true   0/false     0ffh   max(AL), max(AH), and THEN
10   32 bit/false  0ffh/true   1/true      0b3h   max(AL), rand(AH), and THEN
11   32 bit/false  8/false     0/false     8ah    bound(AL), rand(AH), and THEN
12   32 bit/false  10/true     1/true      3fh    bound(AL), rand(AH), and THEN
13   32 bit/false  3/false     0/false     07fh   rand(AL), rand(AH), and ELSE
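To make the defined behavior concrete, the sketch below re-implements the 32-bit aaa semantics from Figure 2.3a in C, as a reference model one could compare test outputs against; the struct and function names are ours, not part of the testing framework.

#include <stdint.h>
#include <stdio.h>

/* Hypothetical register state for the reference model. */
typedef struct {
    uint8_t al, ah;
    int     af, cf;      /* adjust and carry flags */
} AaaState;

/* 32-bit aaa semantics, transcribed from the specification in Figure 2.3a. */
static void aaa_reference(AaaState *s)
{
    if (((s->al & 0x0F) > 9) || s->af) {   /* Cond 2 or Cond 3 */
        s->al = (uint8_t)(s->al + 6);      /* THEN action */
        s->ah = (uint8_t)(s->ah + 1);
        s->af = 1;
        s->cf = 1;
    } else {                               /* ELSE action */
        s->af = 0;
        s->cf = 0;
    }
    s->al &= 0x0F;                         /* both branches mask AL */
}

int main(void)
{
    AaaState s = { 0xFF, 0x00, 0, 0 };     /* test case 7 from Table 2.1 */
    aaa_reference(&s);
    printf("al=%02x ah=%02x af=%d cf=%d\n", s.al, s.ah, s.af, s.cf);
    return 0;
}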
Test Case Template under the Windows x86 Platform. We use the Windows x86 platform for demonstration purposes because it is the most popular OS targeted by malware. After we derive the minimal test case set that explores all execution paths of an instruction's behavioral model, the next step is to map the set to concrete binaries that can be executed. To achieve this goal, we first compose a test case template that describes the initialization work that is the same for all test cases, as shown in Figure 2.4. This program notifies the master in Figure 2.1 and then halts the slave as soon as it enters the main function (line 2), so the master can save the state. The same interaction happens at lines 27, 29, and 38, after the test case completes a certain step. Then the program installs a structured exception handler for the Windows system (lines 4–7). If an exception occurs, the program jumps directly to line 31, so we can save the system state before exception handling.
1 main proc
2 int 3 ; Raw State
3
4 push offset handler ; install SEH
5 assume fs:nothing
6 push fs:[0]
7 mov fs:[0], esp
8
9 ;; populate reg and memory
10 mov eax, 0000001bh
11 mov ebx, 00001000h
12 ...
13 ;; double precision floating-point
14 mov eax, 00403080h
15 mov dword ptr [eax], 0h
16 mov dword ptr [eax+4], 7ff00000h ; +Infi
17 ...
18 ;; single precision floating-point
19 mov eax, 0040318ch
20 mov dword ptr [eax], 0ff801234h ; SNaN
21 ...
22 ;; double-extended precision FP
23 ...
24 ;; unsupported double-extended precision
25 ...
26 [state_init] ; specific init
27 int 3 ; Initial State
28 [testing_insn] ; instruction in test
29 int 3 ; Final State
30 call ExitProcess
31 handler:
32 ;; push exception information onto stack
33 mov edx, [esp + 4] ; excep_record
34 mov ebx, [esp + 0ch] ; context
35 push dword ptr [edx] ; excep_code
36 ...
37 push dword ptr [edx + 0c0h] ; eflags
38 int 3 ; Final State (exception)
39 mov eax, 1h
40 call ExitProcess
41 main endp
42 end main
Example initialization for aaa:
mov ah, 46h
sahf ; set AF to 0
mov al, 0 ; populate AL
mov ah, 0 ; populate AH
Example testing for aaa:
aaa
Figure 2.4: Test Case Template (in MASM assembly)
From lines 9 to 25, we perform general-purpose initialization. Registers and memory are populated with pre-defined values, including all floating point and integer formats. This step occurs in all test cases, and the carefully chosen, frequently used values stored in the registers minimize the need for specific initialization. Afterwards, the specific initialization (line 26) makes tailored modifications to these values if needed for a given test case. For example, eax is set to 1bh at line 10 for all test cases. One particular test case may need the value 0ffh in this register and will update it at line 26. The actual instruction is tested at line 28, where all defined and undefined behaviors are evaluated in various test cases.

Now we describe an example of mapping the second test case of aaa in Table 2.1 to our test case template. The placeholder [state_init] at line 26 will be replaced by the four instructions shown in the upper block in Figure 2.4. The sahf instruction transfers bits 0–7 of ah into the eflags register, which correctly sets af to 0. Since aaa does not take any explicit parameters, [testing_insn] at line 28 becomes aaa in all test cases for this instruction. When compiling test cases, we disable linker optimization and use a fixed base address. This eases the interaction between the master and the slaves, and does not affect the testing outcome. In our testing, we find that physical machines also set or reset the sf, zf, and pf flags. These flags are not defined for the aaa instruction in the manual; hence, this is undefined behavior of aaa.
Extending to the Intel x86 Instruction Set. In this section, we describe how we apply our test case generation method to the entire Intel x86 instruction set. We manually analyze the instruction execution flows defined in the Intel manuals [Int17a], group the instructions into semantically identical classes, and classify all possible input parameter values into ranges that lead to distinct execution flows. We then draw random parameter values from each range.

The IA-32 CPU architecture contains about 1,000 instruction codes. In our test design strategy, a human must reason about each code to identify its inputs and outputs and how to populate them to test all execution behaviors. To reduce the scale of this human-centric operation, we first group the instructions into six categories: arithmetic, data movement, logic, flow control, miscellaneous, and kernel. The arithmetic and logic categories are subdivided into general-purpose and FPU categories based on the type of their operands. We then define parameter ranges to test per category, and adjust them to fit finer instruction semantics as described below. This grouping greatly reduces the human time investment and the chances of human error. It took one person from our team two months to devise all test cases. Table 2.2 shows the number of different mnemonics, examples, and parameter ranges we evaluate for each category.

Table 2.2: Instruction Grouping
Category   | Insn. Count | Example Instructions                | Parameter Coverage
arithmetic | 48          | aaa, add, imul, shl, sub            | min, max, boundary values, randoms in different ranges
arithmetic | 336         | addpd, vminss, fmul, fsqrt, roundpd | infinity, normal, denormal, 0, SNaN, QNaN, QNaN floating-point indefinite, randoms
data mov   | 232         | cmova, fild, in, pushad, vmaskmovps | valid/invalid address, condition flags, different input ranges
logic      | 64          | and, bound, cmp, test, xor          | min, max, boundary values, >, =, <, flag bits
logic      | 128         | andpd, vcomiss, pmaxsb, por, xorps  | infinity, normal, denormal, 0, SNaN, QNaN, QNaN FP indefinite, >, =, <, flag bits
flow ctrl  | 64          | call, enter, jbe, loopne, rep stos  | valid/invalid destination, condition flags, privileges
misc       | 34          | clflush, cpuid, mwait, pause, ud2   | analyze manually and devise dedicated input
kernel     | 52          | arpl, int, lds, lgdt, ltr, wbinvd   | devise parameter values covering all input ranges and boundaries if applicable
Arithmetic Group. We classify instructions in this group into two subgroups, depending on whether they work solely on integer registers (general-purpose group) or also on floating point registers (FPU group). The instructions in the FPU group include instructions with x87 FPU, MMX, SSE, and other extensions.

Based on the argument types and sizes, branch conditions, and the number of arguments, we divide both subgroups into finer partitions. For example, aaa, aas, daa, and das in the general-purpose subgroup all compare the al register (holding one BCD argument, 8 bits long) with 0fh and check the adjustment flag af in the eflags register. This decides the output of the instruction. To test instructions in this set, we initialize the al register to the minimal (00h), maximal (0ffh), and boundary (0fh) values, and to random values in different ranges ([01h, 0eh], [10h, 0feh]). We also set af to 0 and 1 for different al values.
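As a toy illustration of this sampling strategy (not the project's actual test generator), the sketch below emits the minimum, maximum, and boundary values of al plus one random value from each interior range.

#include <stdio.h>
#include <stdlib.h>

/* Random value in [lo, hi], inclusive. */
static unsigned char rand_in(unsigned char lo, unsigned char hi)
{
    return (unsigned char)(lo + rand() % (hi - lo + 1));
}

int main(void)
{
    /* Values for al in the aaa subgroup: minimum, maximum, the 0fh
     * boundary, and one random value from each interior range. */
    unsigned char samples[] = {
        0x00,                    /* min(AL)               */
        0xFF,                    /* max(AL)               */
        0x0F,                    /* boundary in Cond 2    */
        rand_in(0x01, 0x0E),     /* random in [01h, 0eh]  */
        rand_in(0x10, 0xFE),     /* random in [10h, 0feh] */
    };
    for (size_t i = 0; i < sizeof(samples); i++)
        printf("al = %02xh (tested with af = 0 and af = 1)\n", samples[i]);
    return 0;
}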
If a mnemonic takes two parameters, we select at least three value pairs to ensure
that a greater-than, equal-to, and less-than relationship between them is satisfied in our
test set. For the FPU subgroup, the parameter ranges are separated based on the sign,
biased exponent, and significand, which splits all possible values into 10 domains:infi,
normal,denormal, 0, SNaN, QNaN, and QNaN floating-point indefinite. We sample
values from all these ranges to test behaviors in the arithmetic FPU group. For example,
fadd, fsub, fmul, and fdiv each use one operand that can be specified using four
dierent addressing modes; one of them is m64fp, which stands for a double precision
float stored in memory. These instructions add/sub/mul/div the st(0) register with the
operand’s value and store the result in st(0). In addition, they also read control bits
in the mxcsr register and fdiv checks the divide-by-zero exception. In our test cases,
we generate values for the two floating point operands from the 10 identified ranges and
permute the relevant bits in the mxcsr register. Because instructions in this subgroup
can also access memory to read operands, we devise additional test cases to evaluate
the memory management unit. We place the m64fp argument in and out of the valid
address space of a data segment, into a segment with and a segment without required
privileges, and into a segment that is paged in and a segment that is paged out of memory.
By combining these test cases together, all potential memory access exceptions can be
raised along with all potential arithmetic exceptions.
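For readers unfamiliar with how such operand domains are materialized, the following hedged C sketch (assuming IEEE-754 binary64 operands on a little-endian x86 host) shows one representative bit pattern per domain; the exact patterns chosen for the real test cases may differ:

/* One sample double per operand domain, built directly from its bit pattern. */
#include <stdint.h>
#include <stdio.h>
#include <string.h>

static double from_bits(uint64_t bits) {
    double d;
    memcpy(&d, &bits, sizeof d);   /* reinterpret the 64-bit pattern as a double */
    return d;
}

int main(void) {
    struct { const char *domain; uint64_t bits; } samples[] = {
        {"+infinity",          0x7FF0000000000000ULL},
        {"-infinity",          0xFFF0000000000000ULL},
        {"normal",             0x3FF0000000000000ULL},  /* 1.0                    */
        {"denormal",           0x0000000000000001ULL},  /* smallest subnormal     */
        {"+0",                 0x0000000000000000ULL},
        {"-0",                 0x8000000000000000ULL},
        {"SNaN",               0x7FF0000000000001ULL},  /* quiet bit clear        */
        {"QNaN",               0x7FF8000000000001ULL},  /* quiet bit set          */
        {"QNaN FP indefinite", 0xFFF8000000000000ULL}   /* the real indefinite    */
    };
    for (size_t i = 0; i < sizeof samples / sizeof samples[0]; i++)
        printf("%-20s bits=0x%016llx value=%g\n", samples[i].domain,
               (unsigned long long)samples[i].bits, from_bits(samples[i].bits));
    return 0;
}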
Data Movement. Data movement instructions move data between registers, memory,
and peripheral devices and usually do not modify flag bits. There are several execu-
tion branches that we explore in tests. The source and the destination operands may
be located outside segment limits. If the effective address is valid but paged out, a
page-fault exception will be thrown. If alignment checking is enabled and an unaligned
memory reference is made while the current privilege level is 3, alignment exceptions
will be thrown. Some instructions also check direction and conditional flags, and a few
others validate the format of floating point values. All these input parameters and the
states that influence an instruction’s execution outcome must be tested.
For example, we group 30 conditional movement instructions cmovcc r32, r/m32
of distinct cc together because they move 32 bit signed or unsigned integers from the
second operand (32 bit register or memory) to the first operand (32 bit register). The
cc conditions are determined by the cf, zf, sf, of and pf flags. To access arguments
outside the segment limit, we compile our test cases with the fixed base (Section 2.4.3).
The starting addresses for code, data, and stack segment are 401000h, 403000h, and
12e000h respectively, and each has a size of 4KB. It is difficult to test page faults
directly because the Windows system does not provide APIs for page swap-out. To
work around this, we run other memory-consuming programs between test cases that
use memory operands to force the values to be paged out of memory. In our evaluation,
we find that this strategy works well and we successfully raise page faults when we need
to test them. To raise the alignment checking exception, we store instruction operands
at unaligned memory addresses. We permute the condition bits in the same way as we
do for arithmetic instructions.
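A minimal sketch of the memory-pressure workaround described above is shown below; the 1 GB figure is an illustrative choice rather than the value used in our experiments:

/* Run between test cases to push the previous test's memory operands out of RAM. */
#include <stdio.h>
#include <stdlib.h>

int main(void) {
    size_t bytes = (size_t)1 << 30;          /* 1 GB of memory pressure */
    unsigned char *p = malloc(bytes);
    if (!p) { perror("malloc"); return 1; }
    /* Touch every page so the OS must actually commit (and evict other) pages. */
    for (size_t off = 0; off < bytes; off += 4096)
        p[off] = (unsigned char)off;
    puts("memory pressure applied; run the next page-fault test case now");
    free(p);
    return 0;
}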
Logic Group. Logic instructions test relationships and properties of operands and set
flag registers correspondingly. We divide these instructions into general-purpose and
FPU depending on whether they use the eflags register only (general-purpose) or they use
both eflags and mxcsr registers (FPU). We also partition this group based on the flag
bits they read and argument types and sizes. When designing test cases, in addition to
testing min, max, and boundary values for each parameter, for instructions that compare
two parameters, we also generate test cases where these parameters satisfy larger-than,
equal-to, and less-than conditions.
For example, one of the subgroups has bt, btc, btr, and bts instructions because
all of them select a bit from the first operand at the bit-position designated by the second
operand, and store the value of the bit in the carry flag. The only difference is how they
change the selected bit: btc complements; btr clears it to 0; and bts sets it to 1. The
first argument in this subgroup of instructions may be a register or a memory address
of size 16, 32, or 64, and the second must be a register or an immediate number of the
same size. If the operand size is 16, for example, we generate four input combinations
(choosing the first and the second argument from 0h, 0ffffh values), and we repeat
this for cf = 0 and cf = 1. Furthermore, we produce three random number combina-
tions that satisfy less-than, equal-to, and greater-than relationships. While the operand
relationship does not influence execution in this case, it does for other subgroups, e.g.,
the one containing cmp.
In the FPU subgroup, we apply similar rules to generate floating point operands. We
generate test cases to populate the mxcsr register, which has control, mask, and status
flags. The control bits specify how to control underflow conditions and how to round
the results of SIMD floating-point instructions. The mask bits control the generation of
exceptions such as the denormal operation and invalid operation. We use ldmxcsr to
load mxcsr and test instruction behaviors under these scenarios.
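As a hedged illustration (using the compiler intrinsics _mm_getcsr/_mm_setcsr rather than a raw ldmxcsr, and assuming the compiler emits SSE code for the division), a test that runs an instruction under a modified mxcsr might look like this:

/* Populate mxcsr, run one SSE operation under it, then read the status flags back. */
#include <stdio.h>
#include <xmmintrin.h>

int main(void) {
    unsigned int saved = _mm_getcsr();

    unsigned int csr = saved;
    csr |= (1u << 15) | (1u << 6);              /* set FTZ (bit 15) and DAZ (bit 6) */
    csr = (csr & ~(3u << 13)) | (1u << 13);     /* rounding control = round down    */
    _mm_setcsr(csr);

    volatile float a = 1.0f, b = 3.0f;
    volatile float c = a / b;                   /* inexact: should set the PE status flag (bit 5) */
    printf("result=%.8f mxcsr after=0x%08x\n", c, _mm_getcsr());

    _mm_setcsr(saved);                          /* restore the original control word */
    return 0;
}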
Flow Control. Similar to logic instructions, flow control instructions also test condition
codes. Upon satisfying jump conditions, test cases start execution from another place.
For short or near jumps, test cases do not need to switch the program context; but for far
jumps, they must switch stacks, segments, and check privilege requirements.
The largest subgroup in this category is the conditional jump jcc, which accounts
for 53% of flow control instructions. Instructions in this group check the state of one
or more of the status flags in the eflags register (cf, of, pf, sf, and zf) and if the
required condition is satisfied, they perform a jump to the target instruction specified by
the destination operand. A condition code (cc) is associated with each instruction to
indicate the condition being tested for. In our test cases, we vary the status flags and set
the relative destination addresses to the minimal and maximal offset sizes of byte, word,
or double word as designated by mnemonic formats. For example, ja rel8 jumps to a
short relative destination specified by rel8 if cf = 0 and zf = 0. We permute cf and
zf values in our tests, and generate the destination address by choosing boundary and
random values from the ranges [0, 7fh] and [8fh, 0ffh].
For far jumps like jmp ptr16:16, the destination may be a conforming or non-
conforming code segment or a call gate. There are several exceptions that can occur. If
the code segment being accessed is not present, a #NP (not present) exception will be
thrown. If the segment selector index is outside descriptor table limits, an exception #GP
(general protection) will signal the invalid operand. We devise both valid and invalid
destination addresses to raise all these exceptions in our test cases.
Miscellaneous. Instructions in this group provide unique functionalities and we manu-
ally devise test cases for each of them that evaluate all defined and undefined behaviors,
and raise all exceptions.
Kernel instructions. Kernel instructions are supposed to run under ring 0 and each
of them accomplishes specific tasks. For example, arpl adjusts the rpl of a segment
selector that has been passed to the operating system by an application program to match
the privilege level of the application program. The int instruction raises a numbered
interrupt, and ltr loads the source operand into the segment selector field of the task
register. For this category, we devise parameter values that can cover all input ranges
and boundaries where applicable.
2.5 Detected Pills
We use two physical machines in our tests as Oracles: (O1) an Intel Xeon E3-1245
V2 3.40GHz CPU, 2 GB memory, with Windows 7 Pro x86, and (O2) Xeon W3520
2.6GHz, 512MB memory, with Windows XP x86 SP3. The VM host has the same hard-
ware and guest system as the first Oracle, but it has 16 GB memory, and runs Ubuntu
12.04 x64. We test QEMU (with VT-x and TCG) and Bochs, which are the most popu-
lar virtual machines deploying different virtualization technologies: hardware-assisted,
dynamic translation, and interpretation, respectively. We allocate to them the same
amount of memory as in the Oracle. We test QEMU versions 0.14.0-rc2 (Q1, used by EmuFuzzer),
1.3.1 (Q2), 1.6.2 (Q3), and 1.7.0 (Q4), and Bochs version 2.6.2. The master has an Intel
i7 CPU and installs WinDbg 6.12 to interact with the slaves. For test case compila-
tion, we use MASM 10 and turn off all optimizations. Our user-space test cases take
around 10 seconds to run on a physical machine and 15 – 30 seconds to run on a VM.
The kernel-space test cases need about 5 minutes per case, because they need a system
reboot.
Counting dierent addressing modes, there are 1,769 instructions defined in Intel
manual [Int17a]. Out of these, there are 958 unique mnemonics. Following our test
generation strategy (Section 2.4.2), we generate 19,412 and 593 test cases for user-space
and kernel-space instructions respectively.
2.5.1 Evaluation Process
We classify system states into user registers, exception registers, kernel registers, and
user memory. The user registers contain general registers such as eax and esi. The
exception registers are eip, esp, and ebp. The differences in the exception registers
imply differences in the exceptions being raised. The kernel registers are used by the
system and include gdtr, idtr, and others. In our evaluation of user-space test cases,
we do not populate kernel registers in the initialization step because this may crash the
system or lead it to an unstable status. We simply use the default values for kernel
registers after system reboot. The contents of kernel registers are saved as part of our
states and compared to detect differences between physical and virtual machines.
For each test case, we first examine whether the user registers, exception registers,
and memory are the same in the Oracle and the VM in the initial state. If they are
different, it means that the VM fails to virtualize the initialization instructions (line 26
in Figure 2.4) to match their implementation in the Oracle. We mark this test case as
“fatal”. If the initial values in these locations agree with each other, we then compare
the final states. A test case will be tagged as a pill when user registers, kernel registers,
exception registers, or memory in the final states are different.
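The classification logic can be summarized by the following conceptual C sketch; the snapshot layout and function names are hypothetical simplifications, not the comparison code actually used in this work:

/* Classify one test case by comparing Oracle and VM snapshots. */
#include <stdbool.h>
#include <string.h>

typedef struct {
    unsigned int  user_regs[8];    /* eax, ebx, ..., esi, edi              */
    unsigned int  excp_regs[3];    /* eip, esp, ebp                        */
    unsigned int  kernel_regs[4];  /* gdtr, idtr, ... (truncated view)     */
    unsigned char memory[4096];    /* snapshot of the data segment         */
} snapshot_t;

typedef enum { VERDICT_OK, VERDICT_PILL, VERDICT_FATAL } verdict_t;

/* Kernel registers are not populated during initialization, so the initial
 * comparison only looks at user-visible state. */
static bool user_visible_equal(const snapshot_t *a, const snapshot_t *b) {
    return memcmp(a->user_regs, b->user_regs, sizeof a->user_regs) == 0 &&
           memcmp(a->excp_regs, b->excp_regs, sizeof a->excp_regs) == 0 &&
           memcmp(a->memory,    b->memory,    sizeof a->memory)    == 0;
}

verdict_t classify(const snapshot_t *oracle_init,  const snapshot_t *vm_init,
                   const snapshot_t *oracle_final, const snapshot_t *vm_final) {
    if (!user_visible_equal(oracle_init, vm_init))
        return VERDICT_FATAL;          /* VM failed to virtualize the initialization */
    if (memcmp(oracle_final, vm_final, sizeof *oracle_final) != 0)
        return VERDICT_PILL;           /* any differing resource in the final state  */
    return VERDICT_OK;
}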
2.5.2 Results
In this section, we discuss our evaluation results from multiple perspectives. (1) We
show the detailed spectrum of the pills found by our test cases. (2) By comparing our
pills with those found by [MPR+09], we show that our approach discovers many more pills
using far fewer test cases. (3) We investigate the root causes of our pills and find four more
categories that are not captured by [MPR+09, MMP+12]. (4) In order to identify the
pills that are persistent across distinct physical and virtual machines, we generate 2,915
additional test cases for 13 selected instructions. (5) Finally, we discuss axiom pills, which
are found by comparing raw states upon system loading.
Pill Spectrum. Table 2.3 shows the results of testing several virtual machines against
Oracle 1 (O1). The second column "pills" shows the number of pills for different VMs.
Both QEMU (TCG) and Bochs exhibit moderate transparency – almost half of the test
cases report different states between O1 and VMs. For Q2 (VT-x), 38.5% of user-space
Table 2.3: Results Overview

User-space testing (19,412 test cases)
VMs       | Pills         | Crash     | Fatal
Q1 (TCG)  | 9,255 / 47.7% | 7 / <0.1% | 1,378 / 7%
Q2 (TCG)  | 9,201 / 47.4% | 7 / <0.1% | 1,376 / 7.1%
Q1 (VT-x) | 7,523 / 38.7% | 2 / <0.1% | 3 / <0.1%
Q2 (VT-x) | 7,478 / 38.5% | 2 / <0.1% | 0 / 0%
Bochs     | 8,958 / 46.1% | 2 / <0.1% | 950 / 4.9%

Kernel-space testing (593 test cases)
VMs       | Pills       | Crash    | Fatal
Q1 (TCG)  | 495 / 83.5% | 14 / 2%  | 3 / 0.5%
Q2 (TCG)  | 496 / 83.6% | 14 / 2%  | 3 / 0.5%
Q1 (VT-x) | 506 / 85.3% | 2 / 0.3% | 3 / 0.5%
Q2 (VT-x) | 506 / 85.3% | 2 / 0.3% | 3 / 0.5%
Bochs     | 507 / 85.5% | 2 / 0.3% | 3 / 0.5%
test cases result in pills, but there were no fatal cases. The pills we find for Q2 (VT-x)
occur because QEMU does not preserve the fidelity provided by hardware. Therefore,
we should be careful when using hardware-assisted VMs for fidelity purposes. Their
transparency depends on how they utilize the hardware extension.
The third column “crash” counts test cases that crash the system. For QEMU (TCG),
one test case crashes the Oracle 1 and another one crashes the virtual machine. Another
five crash both of them. For QEMU (VT-x) and Bochs, two test cases crash both the
physical and the virtual machine. The number of fatal test cases is shown in the last
column. All of them are related to FPU movement instructions. In some test cases
that use denormals, SNaN, or QNaN values, the virtual machines could not populate
the operand register as required. We note that no fatal test cases are found for VT-x
technology.
For kernel-space test cases, we observe that the yield rates of the pills are much
higher than those of user-space test cases. A closer investigation reveals that most of the
pills are due to differences in exceptions. We believe the higher yield rates of kernel-
space test cases are due to the fact that setting up exceptions in kernel mode requires
more work, which is more error-prone for VMs.
Table 2.4: Pills per Instruction Category (User-space)

Category             | Q1 (TCG) | Q2 (TCG) | Q1 (VT-x) | Q2 (VT-x) | Bochs | Total tests
arithmetic (general) | 877      | 872      | 633       | 626       | 920   | 2,702
arithmetic (FPU)     | 4,525    | 4,486    | 3,619     | 3,603     | 4,245 | 6,743
data movement        | 1,788    | 1,780    | 1,539     | 1,524     | 1,804 | 4,394
logic (general)      | 371      | 365      | 345       | 346       | 363   | 2,185
logic (FPU)          | 1,446    | 1,447    | 1,132     | 1,127     | 1,362 | 2,192
flow control         | 164      | 166      | 172       | 169       | 171   | 1,017
miscellaneous        | 84       | 85       | 83        | 83        | 93    | 179
total                | 9,255    | 9,201    | 7,523     | 7,478     | 8,958 | 19,412
Table 2.5: Details of pills with regard to the resource being different in the final state—in
some cases multiple resources will differ, so the same pill may appear in different rows

Category           | Q2 (TCG) | Q2 (VT-x) | Bochs
user register      | 2,416    | 34        | 1,671
exception register | 1,578    | 21        | 1,566
kernel register    | 8,398    | 7,457     | 8,572
data content       | 46       | 9         | 20
Table 2.4 shows the breakdown of pills per instruction category for user-space test
cases in Table 2.2. The FPU arithmetic, FPU logic and data movement categories con-
tain the most pills—around 83%. Table 2.5 shows the breakdown of the pills with regard
to the resource that is different between a physical and a virtual machine in the final state.
Most pills occur due to differences in the kernel registers.
Comparison with EmuFuzzer Pills. EmuFuzzer [MPR+09] generates 3 million user-
space test cases and the authors randomly select 10% of the test cases to test in different
virtual machines. Because they do not publish the entire test case set, we cannot directly
compare our test cases with theirs, but instead we only compare the percentage of the
user-space pills found by them and by us. EmuFuzzer publishes 20,113 red pills for
QEMU 0.14.0-rc2, which is about 7% of the tested cases. Out of our 19,412 test cases
we find 9,255 pills, which is a 47.7% yield – an order of magnitude higher than Emu-
Fuzzer. Overall, we find five times more pills while running 300,000 / 19,412 ≈ 15 times
fewer tests than EmuFuzzer. This illustrates the significant advantage of careful genera-
tion of operand values in tests over random fuzzing.
We further wanted to compare our pills with pills found by [MMP+12]. The Hi-Fi
tests for Lo-Fi emulators [MMP+12] generate 610,516 test cases, out of which 60,770
(9.95%) show different behaviors in QEMU, and 15,219 (2.49%) show different behav-
iors in Bochs. Since the tests used for [MMP+12] are not publicly released, we could
not compare against them.
Root Causes of Pills. The differences detected by a pill can be due to registers, mem-
ory or exceptions that an instruction was supposed to modify, according to the Intel
manual [Int17a]. We call these instruction targets defined resources. However, there are
a number of instructions defined in the Intel manual that may write to some registers
(or to select flags) but the semantics of these writes are not defined by the manual. We
say that these instructions affect undefined resources. For instance, the aas instruction
should set the af and cf flags to 1 if there is a decimal borrow; otherwise, they should
be cleared to 0. The of, sf, zf, and pf flags are listed as affected by the instruction but
their values are undefined in the manual. Thus, the af and cf flags are defined resources
for the instruction aas, but the of, sf, zf, and pf flags are undefined.
Table 2.6 shows the number of pills that result from differences in undefined and
defined resources for each instruction category compared to Oracle 1. We note that
a small number of pills that relate to general-purpose arithmetic and logic instruc-
tions occur because of different handling of undefined resources by physical and virtual
machines. These comprise roughly 2% of all the pills we found.
For pills originating from defined resources in both user and kernel space, we ana-
lyze their root causes and compare them against those found by the symbolic execution
method [MMP+12]. We find all root causes listed in [MMP+12] that are related to
general-purpose instructions and QEMU’s memory management unit.
Table 2.6: Pills using Undefined/Defined Resources

Category             | Q2 (TCG) | Q2 (VT-x) | Bochs
arithmetic (general) | 195/677  | 0/626     | 194/726
arithmetic (FPU)     | 0/4,486  | 0/3,603   | 0/4,245
data mov             | 0/1,780  | 0/1,524   | 0/1,804
logic (general)      | 23/342   | 0/346     | 20/343
logic (FPU)          | 0/1,447  | 0/1,127   | 0/1,362
flow ctrl            | 0/166    | 0/169     | 0/171
misc                 | 0/85     | 0/83      | 0/93
kernel insn.         | 0/496    | 0/506     | 0/507
Because the symbolic execution engine in [MMP+12] does not support FPU instruc-
tions, we discover additional root causes that are not captured by the symbolic exe-
cution method. First, we find that QEMU does not correctly update 6 flags and 8
masks in the mxcsr register when no exception happens, including invalid opera-
tion flag, denormal flag, precision mask, overflow mask. It also fails to update 7
flags in fpsw status register such as stack fault, error summary status, and FPU busy.
Second, QEMU fails to throw five types of exceptions when it should, which are:
float multiple traps, float multiple faults, access violation, invalid lock sequence, and
privileged instruction. Third, QEMU tags FPU registers differently from Oracles. For
example, it sets the fptw tag word to "zero" when it should be "empty", and sets it to "spe-
cial” when “zero” is observed in Oracles. Finally, the floating-point instruction pointer
(fpip, fpipsel) and the data pointer (fpdp, fpdpsel) are not set correctly in certain
scenarios.
Identifying Persistent Pills. Differences found in our tests between an Oracle and a
virtual machine may not be present if we used a different Oracle or a different virtual
machine, i.e., a difference may stem more from an implementation bug specific to that
CPU or VM version than from an implementation difference that persists across ver-
sions. Furthermore, outdated CPUs may not support all instruction set extensions that
are available in recent ones. Finally, recent releases of VM software usually fix certain
Table 2.7: Undefined eflags Behaviors
Instruction OF SF ZF AF PF CF
aaa
0 0 ZF (ax) PF (al + 6) orPF (al) 0
0 0 ZF (al) PF (al) 0
aad
F F F
0 0 0
aam 0 0 0
aas
0 0 ZF (ax) PF (al + 6 oral) 0
0 0 ZF (al) PF (al) 0
and, or, xor, test 0
bsf,bsr
I I I I I
0 0 F 0 0
bt,bts,btr,btc I I I I
daa,das 0
div,idiv I I I I I I
mul,imul
I I I I
F F 0 F
F 0 0 F
rcl,rcr,rol,ror
I
F
OF
†
sal, sar, shl, shr, shld, shrd
I I
R 0
0 F
†
1-bit rotation
bugs and add new features, which may both create new differences and remove the old
differences between this VM and physical machines. We hypothesize that transient pills
are not useful to malware authors because they cannot predict under which hardware or
under which virtual machine their program will run, and we assume that they would like
to avoid false positives and false negatives.
To find pills that persist across hardware and VM changes, we perform our testing
on multiple hardware and VM platforms. We select 13 general instructions that can
be executed in all x86 platforms (aaa, aad, aas, bsf, bsr, bt, btc, btr, bts, imul,
mul, shld, shrd) and generate 2,915 test cases for them to capture more pills that are
caused by modification of undefined resources. We evaluate this set on the two physical
machines (Oracle 1 and Oracle 2), three different QEMU versions (Q2, Q3, and Q4),
and Bochs. We find 260 test cases that result in different values in the eflags register in
Oracle 1 and Oracle 2 and will thus lead to transient pills. Bochs’ behavior for these
test cases is identical to the behavior of Oracle 2. Out of the remaining 2,655 test cases,
we find 989 persistent pills that generate different results in the three QEMU virtual
machines when compared to the physical machines. They are all related to undefined
resources. Bochs performs surprisingly well and does not have a single pill for these
particular test cases. Thus, we could not find persistent pills that would detect a VM for
any given VM/physical machine pair in our tests, but we found pills that can differentiate
between any of the QEMU VM versions and configurations that we tested, and any of
the physical machines we tested.
We further investigate the persistence of pills that are caused by modifications
to undefined resources, across different physical platforms. We select five physical
machines with different CPU models in DeterLab [BBB+04]. Out of 218 pills that were
found for Oracle 1 and Q2 (TCG), we were able to map 212 pills to all five physical
machines (others involved instructions that did not exist in some of our CPU architec-
tures). Fifty of those were persistent pills—the undefined resources were set to the same
values in physical machines. We conclude that modifications to undefined resources can
lead to pills that are not only numerous but also persistent in both physical and virtual
machines. This further illustrates the need to understand the semantics of these modifi-
cations as this would help enumerate the pills and devise hiding rules for them without
exhaustive tests.
Axiom Pills. In addition to comparing final states across different platforms we also
compare raw states upon system loading. We define an axiom pill as a register or mem-
ory value whose raw state is consistently different between a physical machine and a
given virtual machine. This pill can be used to accurately diagnose the presence of the
given virtual machine. We select 15% of our test cases and evaluate them on Oracle 2,
Table 2.8: Axiom Pills

Reg  | O1         | Q1 (TCG)  | Q2 (TCG)  | Q1 (VT-x)  | Q2 (VT-x)  | Bochs
edx  | vary       | vary      | vary      | 0ffffffffh | 0ffffffffh | vary
dr6  | 0ffff0ff0h | 0         | 0         | 0ffff0ff0h | 0ffff0ff0h | 0ffff0ff0h
dr7  | 400h       | 0         | 0         | 400h       | 400h       | 400h
cr0  | 8001003bh  | 8001003bh | 8001003bh | 8001003bh  | 8001003bh  | 0e001003bh
cr4  | 406f9h     | 6f8h      | 6f8h      | 6f8h       | 6f8h       | 6f9h
gdtr | vary       | 80b95000h | 80b95000h | 80b95000h  | 80b95000h  | 80b95000h
idtr | vary       | 80b95400h | 80b95400h | 80b95400h  | 80b95400h  | 80b95400h
Q2, Q3 and Bochs. The axiom pills are shown in Table 2.8. For example, the value of
0ffffffffh in the edx register can be used to diagnose the presence of Q2 (VT-x).
2.5.3 Exploring Undefined Behavior Model
Our test cases were designed to explore effects of input parameters on defined resources.
We thus claim that our test cases cover all specified execution branches for all instruc-
tions defined in Intel manuals. Our test pills should thus include all possible individual
pills that can be detected for defined resources.
We now explore the pills stemming from modifications to undefined resources, to
evaluate their impact on the completeness of our pill sets and to attempt to devise seman-
tics of these modifications. The only undefined resources from the Intel manual are the
flags in eflags.
We analyze the instructions that affect one or more flags in the eflags register in
an undefined manner. We generate additional test cases for each instruction to explore
the semantics of modifications to undefined resources in each CPU. Although the exact
semantics differ across CPU models, we consider four semantics of flag modifications
that are the superset of behaviors we observed across tested hardware and software
machines: a flag might be (1) cleared, (2) remain intact, (3) set according to the ALU
output at the end of an instruction’s execution, or (4) set according to an ALU output of
an intermediate operation.
We run our test cases on a physical or virtual machine in the following manner. For
each instruction, we set an undefined flag and execute an operation that yields a result
inconsistent with the flag being set; for example, zf is set while the result is 0. If the flag
remains set we conclude that the instruction does not modify it. Similarly, we can test if
the flag is set according to the final result. If none of these tests yield a positive result,
we go through the sub-operations in a given instruction’s implementation as defined in
the CPU manual, and discover which one modifies the flag. For example: aaa adds 6
to al and 1 to ah if the last four bits are greater than 9 or if af is set. The instruction
affects of, sf, zf and pf in an undefined manner. We find that in some machines, zf
and pf are set according to the final result, while in others, pf is set according to an
intermediate operation, which is al = al + 6.
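The test for pf can be expressed as the following C sketch; the helper names and the observed_pf value are illustrative, and in practice the observation comes from running aaa on the machine under test:

/* Decide which semantics an observed pf value for aaa is consistent with. */
#include <stdbool.h>
#include <stdio.h>

/* pf is set when the low byte of the result has an even number of 1 bits */
static int parity_flag(unsigned char b) {
    int ones = 0;
    for (int i = 0; i < 8; i++) ones += (b >> i) & 1;
    return (ones % 2) == 0;
}

const char *classify_pf(unsigned char al, int af, int observed_pf) {
    bool adjust = ((al & 0x0f) > 9) || af;            /* aaa's adjustment condition */
    unsigned char intermediate = (unsigned char)(al + 6);
    unsigned char final = adjust ? (unsigned char)(intermediate & 0x0f)
                                 : (unsigned char)(al & 0x0f);
    if (observed_pf == parity_flag(final))        return "final result";
    if (observed_pf == parity_flag(intermediate)) return "intermediate al + 6";
    return "neither (cleared, intact, or other semantics)";
}

int main(void) {
    /* illustrative values only; observed_pf = 1 is a made-up observation */
    printf("pf follows: %s\n", classify_pf(0x0c, 0, 1));
    return 0;
}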
Table 2.7 shows different semantics for each instruction, which are consistent across
5 different CPU models. Empty cells represent defined resources for a given instruction.
Character “I” means the flag value is intact while “F” means that the flag is set according
to the final result. Otherwise, the flag is set to the value in the cell.
To detect pills between a given virtual machine and one or many physical machines,
we repeat the same tests on the virtual machine and look for differences in instruction
execution semantics. If many physical machines are compared to a virtual machine, we
look for such differences where physical machines consistently handle a given instruc-
tion in a way that is different from how it is handled in a virtual machine. For example,
in Table 2.7, the instruction aad either clears the of, af and cf flags or sets them according to
the final result. If a virtual machine were to leave these flags intact, we could use this
behavior as a pill.
Our test methodology will discover all test pills (and thus all possible individual
pills) related to modifications of undefined resources by user-space instructions for a
given physical/virtual machine pair. Since the semantics of undefined resource modi-
fications vary greatly between physical CPU architectures as well as between various
virtual machines and their versions, all possible test pills cannot be discovered in a gen-
eral case.
To summarize, our testing reveals pills that stem from instruction modifications to
user-space or kernel-space registers. These modifications can further occur on defined
or on undefined resources for a given instruction. We claim we detect all test pills
(and thus all the individual pills) that relate to modifications of defined resources. We
can claim that because we fully understand semantics of these modifications, and all
physical machines we tested strictly adhere to the instructions’ semantics as specified
in the manual. We cannot claim completeness for pills that relate to modifications of
undefined resources because physical machine behaviors differ widely for those.
2.6 Conclusion
Virtualization is crucial for malware analysis, both for functionality and for safety. Con-
temporary malware aggressively checks if it is being run in VMs and applies evasive
behaviors that hinder its analysis. Existing works on detection and hiding of differences
between virtual and physical machines apply ad-hoc or semi-manual testing to iden-
tify these differences and hide them from malware. Such approaches cannot be widely
deployed and do not guarantee completeness.
In this chapter, we first propose cardinal pill testing that requires moderate manual
action per CPU architecture, to identify ranges for input parameters for each instruction.
It then automatically devises tests to enumerate the differences between any pair of
physical and virtual machines. This testing is much more efficient and comprehensive
than state-of-the-art red pill testing. It finds five times more pills running fifteen times
fewer tests. We further claim that for instructions that affect defined resources, cardinal
pill testing identifies all possible test pills, i.e., it is complete. Other categories contain
instructions whose behavior is not fully specified by the Intel manual, which has led to
different implementations of these instructions in physical and virtual machines. Such
instructions need understanding of the implementation semantics to enumerate all the
pills and devise the hiding rules. However, these pills cannot be exploited by attackers
because they are not persistent.
Chapter 3
Hiding Debuggers from Malware using
Apate
3.1 Introduction
Debuggers enable detailed analysis of malware’s behaviors and the resulting system
states, including disassembling of the binary code, capturing of the system calls and the
exceptions, etc. Malware authors have strong incentives to make this analysis as difficult
as possible, thus prolonging the time before defenders can fully analyze their malware
and develop signatures for filtering. Malware binaries thus exhibit evasive behaviors
[BBN12, LKMC11, CAM+08, MPR+09], which aim to detect or disrupt the analysis in
VMs or in debuggers. Many contemporary malware samples use evasive behaviors.
Kirat et al. [KVK14] detect 5,835 malware samples (out of 110,005) that exhibit eva-
sive behaviors. Branco et al. [BBN12] find that 43.21% of 4 million samples exhibit
certain anti-debugging behaviors and 81.4% exhibit anti-virtualization behaviors. Lin-
dorfer et al. [LKMC11] detect evasive behaviors in 25.6% of 1,686 malicious binaries.
Chen et al. [CAM+08] find that 39.9% and 2.7% of 6,222 malware samples exhibit anti-
debugging and anti-virtualization behaviors respectively. In this chapter, we focus only
on detecting and defeating anti-debugging.
In anti-debugging, malware detects debuggers by detecting artifacts used to imple-
ment core debugger functionalities, such as breakpoints or tracing [Fer11, Fal07, SH12].
Malware also tries to evade debuggers by implementing anti-disassembly approaches
such as code encryption or instruction overlapping, or by diverting control flow outside
of the debugger.
Popular debuggers today, such as IDA [HR17], WinDbg [Win17], and Olly-
Dbg [Yus13], are all vulnerable to anti-debugging. There are extensions to these
debuggers which can detect some attack vectors but not the others. Similarly, research
approaches against anti-debugging (Section 3.6) cover a small subset of attack vectors,
and may do so by reimplementing core debugger functionalities in novel ways (e.g., by
using VM introspection) [KV08, WWW15]. Furthermore, research debuggers do not
offer a full range of functionalities that commercial debuggers do, and do not have an
established user base. Thus, they would see a hard path to adoption.
We propose Apate – a framework for systematic debugger hiding, which is meant
to be integrated with existing debuggers. Apate handles 58–465% more attack vectors
than its close competitors. Further, its integration with popular debuggers enables wide
adoption by their current users.
3.1.1 Contributions
Our first contribution lies in systematic investigation of known and possible anti-
debugging attack vectors from a variety of sources [SH12, BBN12, Fer11, Fal07, Ope07,
CAM+08, ZLS+15]. Our final set contains 79 attack vectors, 12 of which are novel vec-
tors identified by us. We abstract the 79 attack vectors into 6 broad categories, and 16
subcategories, which enables us to devise defense approaches per category.
As the second contribution, we develop novel techniques for handling attacks in
two out of our six categories (suppressible exceptions and timing). For three out of
the remaining four categories, prior research has sketched ideas for attack handling, but
they were not implemented or tested. We have worked out the details and implemented
these handlers for the first time. Only in one attack category, we adopt a mature solution
which has already been used in prior work. Our debugger-hiding approaches jointly
handle all 79 attack vectors, while commercial debuggers and research solutions handle
only 22%–67% of those attack vectors [HR17, New14, Scy16, Deb15, Win17, Yus13,
Rce12, aad12, SH12, BBN12, Fer11, Fal07, Ope07].
Our third contribution is the design and implementation of the debugger-hiding
framework, called Apate, which actively monitors malware execution and hides debug-
gers at run time. The main novelty of our framework is that it augments existing debug-
gers to hide them from malware, and that it can be easily adapted to handle new attacks
as they emerge. It is also debugger-agnostic and OS-agnostic. While the implemen-
tations of some specific attack vectors and defense mechanisms depend on our imple-
mentation platform (Windows with WinDbg), the basic attack and defense strategies
are portable to other OSes and debuggers.
Apate uses three techniques to hide debuggers: (1) single-stepping based execution,
(2) just-in-time instruction disassembly and analysis, and (3) instruction pattern detec-
tion and system state modification. Each instruction is disassembled just as it is ready to
be executed, and analyzed against attack vectors stored in an attack library. If a match
is found, corresponding countermeasures drawn from a defense library are carried out.
Since both libraries are extensible, novel attack vectors and defenses can be added in the
future.
We perform extensive evaluation of Apate with five datasets to measure its ability
to detect anti-debugging, and to deceive malware by hiding the debugger’s presence.
Our evaluations show that Apate outperforms other state-of-the-art debuggers by a wide
margin in all the data sets. It also successfully deceives malware by hiding the pres-
ence of the debugger – malware functionalities under Apate-enhanced WinDbg remain
identical to those in debugger-free runs.
3.2 Attack Vectors
In this section, we provide our categorization of attack vectors which malware can use
to detect and evade debuggers. We start by surveying known techniques from published
work and then abstract them into two broad groups: (1) attacking the debugging princi-
ples (Section 3.2.1) and (2) detecting the traces of debugger presence (Section 3.2.2). In
addition, we sub-divide these groups into six categories and sixteen sub-categories. For
each category, we enumerate all possible ways in which an attack could be conducted,
using information from Windows and Intel manuals. We discuss completeness of our
attack enumeration in Section 3.2.3.
We arrive at 79 attack vectors (see Table 3.1), including 12 novel vectors identified
by us (highlighted as blue text). Most attack vectors exploit the complex interactions
between applications and OS, which are not easily detected or circumvented. While
some details of the vectors such as the specific system APIs they use, are related to
our implementation platform (Windows), the basic attack principles apply to any oper-
ating system. We chose to focus on Windows OS, because most malware targets this
platform [Flo14].
3.2.1 Attacks On Debugging Principles
Attacks in this category exploit interactions between a debugger and its debuggee in an
active debugging session. They detect mechanisms employed by debuggers for code
analysis, such as breakpoints, exception handling, code disassembly, etc. The challenge
in handling these attacks, is that core debugger functionalities must be preserved.
Table 3.1: Classification of attack vectors, and their handling in Apate
(columns: Sub-category | Representative Attacks | Apate's Handling | Handling Novelty)

Breakpoints
  Software read    | 0xcc scan | Keep a copy and feed the original byte | This is proposed by [Fer11] but details are worked out and implemented by us
  Software write   | WriteProcessMemory(), mov | Update Apate's copy |
  Hardware read    | N/A in Apate | N/A in Apate |

Exception
  Suppressible     | EXCEPTION_INVALID_HANDLE, EXCEPTION_HANDLE_NOT_CLOSABLE | Consume the exception | This is a new handling approach proposed by us
  Non-suppressible | all other exceptions | 1. Monitor handler installation and add a breakpoint at its entry; 2. Pass the exception to malware; 3. Monitor the handler's completion and add a breakpoint at the resume address |
  Special cases    | Single-stepping (clear the trap flag to disable it) | Raise the exception | This is proposed by [Fer11, Fal07, Ope07] but details are worked out and implemented by us
                   | int 2d (debuggers use a different resume address than a native run) | Use the correct exception resume address |
                   | int 3 (debuggee intentionally raises a software breakpoint exception) | Modify our single-stepping exception to fake it and pass it to the debuggee |

Flow control
  Callbacks        | CallMaster(), TLS, MouseProc(), EnumDateFormats(), EnumDateFormatsEx(), EnumSystemLocale(), EnumSystemCodePages(), EnumSystemLanguageGroups(), EnumSystemGeoID(), EnumTimeFormats() | Add breakpoints at the entry points of the callback functions | This is a new handling approach proposed by us
  Direct hiding    | ZwSetInformationThread(), NtSetInformationThread() | Skip the APIs | This handling is proposed and implemented by [Rce12, Scy16]; we do the same in Apate
  Multi-threading  | CreateThread() | Set breakpoints at entries |
  Self-debugging   | Child process debugs the parent process | Set DebugPort to 0 |

Interaction
  Hijacking        | BlockInput(), SwitchDesktop() † | Skip the APIs |
  Timing           | GetLocalTime(), GetTickCount(), KiGetTickCount(), timeGetTime(), QueryPerformanceCounter(), rdtsc | Maintain a high-fidelity time source | This is a new handling approach proposed by us

Anti-disassembly
  Instruction overlap | Embed one instruction in another | Single-stepping/tracing | This is proposed by [Fer11, Fal07, Ope07] but details are worked out and implemented by us
  Self-modifying   | xor code or copy from data section | Single-stepping/tracing |

Traces
  Indirect read    | int 2e, NtQueryInformationProcess() | Modify debuggee states after calling/skipping these APIs |
  Direct read      | BeingDebugged, Heap, NtGlobalFlag, ProcessHeapFlags, ProcessHeapForceFlags | Overwrite with correct values when launching client |
                   | segment selector registers (gs, fs, cs, ds) †, eflags manipulation (popf/popfd/pop ss) † | Maintain a copy and feed the original value | This is a new handling approach proposed by us

† Patterns consisting of multiple instructions/API calls
3.2.1.1 Breakpoint Attacks
Breakpoint attacks seek to detect/evade breakpoints which are the debugging mechanism
used to closely examine the debuggee's behavior. This category contains three subcategories:
software read, software write and hardware read.
Contemporary processors such as Intel and AMD support two types of breakpoints
— software and hardware breakpoints. To add a software breakpoint in a debuggee,
a debugger replaces the opcode byte at the breakpoint address with a 0xcc byte (dis-
assembled as an int 3 instruction). When this instruction is executed, a breakpoint
exception will be raised and passed to the debugger by Windows. The debugger then
restores the original opcode byte at the breakpoint address, and sets the trap flag (the
8th bit of eflags register). This trap flag will cause a single-stepping exception after
the processor executes exactly one instruction in the debuggee, and will then be auto-
matically cleared by the processor. Similarly, this exception will also be passed to the
debugger by Windows, and the debugger obtains a chance to set the breakpoint for the
next round.
To add a hardware breakpoint, a debugger saves the breakpoint address in a debug
register rather than modifying a debuggee’s code. During execution, the processor
actively compares the instruction pointer (ip) with the debug registers. If the ip matches
one of the addresses in the debug registers, the processor will yield control to the debug-
ger.
Both software and hardware breakpoints can be detected or evaded by malware.
To detect a software breakpoint, malware may scan its code for 0xcc byte or it could
evade by overwriting its code. These attacks fall into software read and software write
subcategories in Table 3.1.
To detect a hardware breakpoint (subcategory hardware read in Table 3.1), mal-
ware can read the debug registers. In practice, this check cannot be performed in user
space because of the privilege constraints enforced by the processor. To overcome this
problem, malware can intentionally raise an exception. Windows will create an excep-
tion record on the stack, which includes the contents of all registers. Malware then
examines this copy of debug registers in its exception handling routine. If any register is
not zero, presence of a debugger is revealed. In our evaluation, popular debuggers were
all vulnerable to breakpoint attacks.
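For concreteness, a hedged sketch of this debug-register check (written for MSVC on Windows; the exception code and messages are illustrative) is shown below:

/* Raise an exception on purpose and inspect the dr0-dr3 copies that Windows
 * places in the exception's CONTEXT record. */
#include <windows.h>
#include <stdio.h>

static int hw_bp_seen = 0;

/* The exception filter receives a CONTEXT snapshot that includes dr0-dr3. */
static int check_debug_registers(EXCEPTION_POINTERS *ep) {
    CONTEXT *ctx = ep->ContextRecord;
    if (ctx->Dr0 || ctx->Dr1 || ctx->Dr2 || ctx->Dr3)
        hw_bp_seen = 1;                       /* hardware breakpoints are armed */
    return EXCEPTION_EXECUTE_HANDLER;
}

int main(void) {
    __try {
        /* raise a user-defined exception so that it reaches our handler */
        RaiseException(0xE0000001, 0, 0, NULL);
    } __except (check_debug_registers(GetExceptionInformation())) {
        /* nothing to do: the filter already inspected the context */
    }
    printf("hardware breakpoints %s\n", hw_bp_seen ? "detected" : "not seen");
    return 0;
}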
3.2.1.2 Exception Attacks
Exception attacks leverage the way exceptions are handled by the OS in the presence
of a debugger, to detect or evade the debugging environment. This category contains
the following subcategories: suppressible exception, non-suppressible exception and
special-case attacks. Suppressible exception attacks have not been previously discussed
in literature.
Single-stepping is one of the key mechanisms used by debuggers to step through the
debuggee code. It is implemented by setting the trap flag, which raises a single-stepping
exception after the next instruction is executed.
Whenever an exception occurs in a debuggee, it will be handled in the following
manner: 1) Windows will give the debugger a first chance to handle it. 2) If the debugger
consumes the exception, the debuggee will continue the execution at either the current
or the next instruction, depending on the nature of the exception. Otherwise, Windows
will pass the exception to the debuggee and check if the debuggee has registered any
exception handler. 3) If the debuggee does not have any exception handlers, Windows
will provide the debugger a second chance to handle the exception. 4) Finally, if the
debugger still does not handle it, Windows will check if the debuggee has registered a
default exception handler, and will pass the exception to it. 5) If no default exception
handler is available, Windows will terminate the debuggee.
Malware can misuse the above exception handling mechanism to detect debuggers.
In suppressible exception attacks, malware raises an exception, which Windows does
not pass to applications. Windows will, however, pass such exceptions to the debug-
ger, which may pass them to debuggee during exception handling. Malware registers
a custom handler for these suppressible exceptions, and detects presence of a debugger
if the handler is invoked. Conversely, non-suppressible exceptions are always passed
to applications by Windows. The presence of a debugger can be detected if a non-
suppressible exception is consumed by the debugger.
There are also certain special cases of exception attacks, which require us to per-
form additional handling (details in Section 3.3). In the single-stepping attack, malware
clears the trap flag so that the debugger will no longer receive the single-stepping excep-
tion. Consequently, the debuggee will run freely inside the debugger. In the int 2d attack,
Windows sets its resume address based on the current instruction pointer (ip) and eax.
However, some debuggers may only use ip and therefore continue from a wrong place.
In the int 3 attack, a debuggee intentionally raises a breakpoint exception.
In our evaluation, popular debuggers could address some but not all of the exception
attacks. This still holds true even if we regarded a union of their capabilities.
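A minimal MSVC sketch of the int 3 special case, in which a silently consumed breakpoint exception betrays the debugger, could look like this:

/* If a debugger swallows the breakpoint exception, our handler never runs. */
#include <windows.h>
#include <intrin.h>
#include <stdio.h>

int main(void) {
    int debugger_present = 1;        /* assume present until proven otherwise */
    __try {
        __debugbreak();              /* emits an int 3 */
    } __except (EXCEPTION_EXECUTE_HANDLER) {
        debugger_present = 0;        /* we received the exception ourselves */
    }
    printf("debugger %s\n", debugger_present ? "suspected" : "not detected");
    return 0;
}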
3.2.1.3 Flow Control Attacks
Attacks in this category abuse the implicit flow control mechanism that is available in
Windows operating systems, with the goal of executing out-of-debugger. This category
includes callback, direct hiding, multi-threading, and self-debugging subcategories.
The implicit flow control is typically implemented through callbacks, such as
CallMaster(), enumeration functions, thread local storage (TLS), and many others.
These callbacks usually take a function address as a parameter [Fer11]. When a debug-
ger steps over a callback, the execution flow will be transferred to the function specified
as its parameter. Malware exploits this in a callback attack, by registering a callback
function and performing its malicious activities there, unseen by the debugger. Callback
attacks using some APIs have been previously discussed in literature, but we discover
eight new APIs that can be misused for these attacks (shown in blue in Table 3.1).
In direct hiding attacks, malware calls certain system APIs to decouple itself from a
debugger. In multi-threading attacks, malware hides malicious behaviors by launching
different threads which run outside of the debugger. In self-debugging, malware spawns
a child process which attempts to debug its parent. Because any given process can only
be debugged by one debugger, the child process will fail, revealing the presence of the
debugger.
In our evaluation, none of the popular commercial debuggers could handle callback
or self-debugging attacks, but they could handle direct hiding and multi-threading.
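As a hedged illustration of the multi-threading vector (the payload here is a benign placeholder, not malicious code):

/* Work placed in a separate thread runs off the debugger's single-stepped path. */
#include <windows.h>
#include <stdio.h>

static DWORD WINAPI payload(LPVOID arg) {
    (void)arg;
    puts("work performed in a separate thread");   /* stands in for hidden behavior */
    return 0;
}

int main(void) {
    HANDLE h = CreateThread(NULL, 0, payload, NULL, 0, NULL);
    if (h == NULL) return 1;
    WaitForSingleObject(h, INFINITE);   /* a debugger stepping main() never enters payload() */
    CloseHandle(h);
    return 0;
}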
3.2.1.4 Interaction Attacks
In this category of attacks, malware interferes with communication channels between a
user and a debugger, or it attempts to detect a debugger by slow execution. This category
includes hijacking and timing attacks.
In hijacking attacks, malware uses system APIs to hijack a defender’s mouse, key-
board, or screen. Once successful, the effect will remain until the malware process exits.
In timing attacks, malware aims to detect substantial time delays introduced by inter-
active debugging. Malware can either use the system APIs or assembly instructions to
query the time information.
In our evaluation, none of the popular debuggers could handle all the interaction
attacks.
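A timing check of this kind might be sketched as follows; the thresholds are arbitrary illustrations, not values taken from real malware:

/* Compare cycle and millisecond deltas around trivial work against thresholds. */
#include <windows.h>
#include <intrin.h>
#include <stdio.h>

int main(void) {
    unsigned long long t0 = __rdtsc();
    DWORD g0 = GetTickCount();

    volatile int sink = 0;
    for (int i = 0; i < 1000; i++) sink += i;   /* trivial work to time */

    unsigned long long cycles = __rdtsc() - t0;
    DWORD millis = GetTickCount() - g0;

    if (cycles > 10000000ULL || millis > 100)   /* single-stepping inflates both */
        puts("execution looks instrumented (debugger suspected)");
    else
        puts("timing looks native");
    return 0;
}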
Byte no.:        0  1  2  3  4  5  6  7  8  9  10
Bytes:          66 b8 eb 05 31 c0 74 fa e8 (real code follows)
First decoding:  mov ax, 5ebh | xor eax, eax | jz -6 | fake call (e8 ...)
Hidden decoding: bytes 2-3 (eb 05) also decode as jmp 5, which jumps to the real code at byte 9
Figure 3.1: Multilevel Inward-Jumping Sequence
3.2.1.5 Anti-Disassembly Attacks
In this category of attacks, malware misleads the debuggers to incorrectly disassemble
its code, through instruction overlapping or self-modifying code.
In instruction overlapping, malware packs some of its instructions within oth-
ers. Figure 3.1 shows an example where bytes 2, 3 and 8 belong to multiple instruc-
tions [SH12]. The first instruction in this sequence is a 4-byte mov. The last 2 bytes
have been highlighted because they are both part of this instruction, and also form
another instruction to be executed later. Disassemblers will translate the mov, xor, and
jz instructions first, followed by the instruction beginning with 0xe8 (the first opcode
for a call instruction). However, the call instruction will never be executed, because
jz will always transfer the control flow to byte 2. The next instruction to be executed is
jmp 5 which jumps to byte 9.
In self-modifying code (a.k.a. packing [LH07]), malware encrypts its code, e.g., by
xor-ing it with a key. At run time, malware decrypts the code.
In our evaluation, most popular debuggers could not render the correct disassembly
when debugging through the packed malware.
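The packing idea can be sketched in a few lines of Windows C; here the "payload" is a single ret instruction that is xor-decrypted into executable memory at run time (an illustrative toy, not an actual packer):

/* Store code encrypted, decrypt it only at run time, then jump into it. */
#include <windows.h>
#include <stdio.h>

int main(void) {
    const unsigned char key = 0x5a;
    const unsigned char packed[] = { 0xc3 ^ 0x5a };   /* encrypted ret instruction */

    unsigned char *buf = VirtualAlloc(NULL, sizeof packed,
                                      MEM_COMMIT | MEM_RESERVE, PAGE_READWRITE);
    if (!buf) return 1;

    for (size_t i = 0; i < sizeof packed; i++)        /* decrypt at run time */
        buf[i] = packed[i] ^ key;

    DWORD old;
    VirtualProtect(buf, sizeof packed, PAGE_EXECUTE_READ, &old);
    ((void (*)(void))buf)();                          /* jump into the decrypted code */

    puts("decrypted stub executed");
    VirtualFree(buf, 0, MEM_RELEASE);
    return 0;
}

A static disassembly of the on-disk file only ever sees the encrypted byte, which is why just-in-time disassembly is needed to follow such code.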
3.2.2 Detecting Debugger Traces
In addition to interfering with or analyzing the debugger’s execution, malware can
attempt to detect or circumvent debuggers by looking for traces of their presence in
the file system and memory. Malware can read the file system or memory directly
(direct-read subcategory) or via APIs (indirect-read subcategory).
In direct read attacks, malware looks for debugger traces in memory and regis-
ters using assembly code. While some direct read attack vectors have been discovered
before, we discover two new ones (shown in blue in Table 3.1).
In indirect read attacks, malware uses a Windows API call to detect a debugger.
Some of these APIs are designed for debugger detection; others are designed for differ-
ent purposes but can be re-purposed to detect debuggers. For example, when malware
calls IsDebuggerPresent(), Windows returns a non-zero value if the Process Envi-
ronment Block (PEB) contains the field BeingDebugged. Another example is calling
FindWindow() with a well-known debugger name such as “OLLYDBG” and “WinD-
bgFrameClass". This API will place a non-zero value in the eax register if there is a match.
In our evaluation, none of the commercial debuggers could handle all the trace
attacks.
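A hedged sketch combining an indirect read with a direct PEB read (the fs-based access is 32-bit x86 only; on x64 the PEB would be reached through gs:[60h]) is shown below:

/* Look for debugger traces via APIs and via a direct PEB.BeingDebugged read. */
#include <windows.h>
#include <intrin.h>
#include <stdio.h>

int main(void) {
    int hits = 0;

    if (IsDebuggerPresent()) {
        hits++;
        puts("indirect read: IsDebuggerPresent() reports a debugger");
    }
    if (FindWindowA("OLLYDBG", NULL) || FindWindowA("WinDbgFrameClass", NULL)) {
        hits++;
        puts("indirect read: a well-known debugger window class exists");
    }
#if defined(_M_IX86)
    {
        unsigned char *peb = (unsigned char *)(ULONG_PTR)__readfsdword(0x30); /* TEB -> PEB */
        if (peb[2]) {                                  /* PEB.BeingDebugged byte */
            hits++;
            puts("direct read: PEB.BeingDebugged is set");
        }
    }
#endif

    printf("%d debugger trace(s) found\n", hits);
    return 0;
}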
3.2.3 Completeness
We aimed to be as comprehensive as possible when enumerating possible anti-
debugging techniques. Starting from our sixteen sub-categories and the Windows API
manual [API17], we have identified 79 possible attack vectors. We cannot prove that
this list is complete, but we believe it is close to complete for documented Windows OS
and WinDbg functionalities, for the following reasons. First, the attacks in the subcate-
gories that use Windows APIs only require enumeration of these APIs to be complete,
which we have done using the Windows API manual [API17]. Similarly, we have used
Microsoft's official publications [Cod17, RSL12] to learn about exception types and
their handling, and have investigated possible attacks for each type and handling method.
Second, attacks that read or write system registers (e.g., eflags) can only do so through
a handful of Intel x86 commands, for which Intel’s x86 manual provides the compre-
hensive reference [Int17a]. We have used this reference to enumerate all attack vectors
that read or write system registers. Finally, traces left by WinDbg are well understood
and documented in [Win17]. This leaves a few possible sources of incompleteness – (1)
missed known references (e.g., publications with low ranks in search engines) due to
the huge space of the knowledge base and (2) undocumented information (e.g., system
APIs private to an OS vendor). We discuss these limitations in Section 3.5.
3.3 Apate
In this Section, we propose our Apate framework. Apate is a collection of debugger-
hiding techniques, which systematically defeat our 79 attack vectors to hide a debugger
from malware. The high-level principles of these techniques can be applied to different
debuggers and OSes, such as consuming suppressible exceptions. For a concrete attack
that targets a specific platform, we need to address it in the corresponding environment.
For example, malware may use GetTickCount() in Windows and date() in Linux to
query the current time, necessitating different implementations of attack detection in Apate.
We present the overview of Apate’s operation in Section 3.3.1. The details of its
handling of anti-debugging techniques are given in Section 3.3.2. We discuss possible
attacks on Apate in Section 3.3.3, and discuss how to use Apate in Section 3.3.4.
3.3.1 Overview
In this Section we give a high-level overview of Apate’s operation. We assume Apate is
integrated with a debugger, thus our description corresponds to this joint entity.
Figure 3.2 shows the general operation of Apate when executing a binary. We first
pre-process the debuggee by parsing its portable executable (PE) header [Mic17]. From
[Figure 3.2 (flowchart) omitted from this transcript. It depicts: a preprocess stage (read main entry, get TLS callbacks, read import table); disassembly of only one instruction from the current instruction pointer; matching against the Attack Vector Library (e.g., SEH attack: push an offset, push fs:[0], mov fs:[0], esp; timing attack: rdtsc, GetTickCount()); dispatch to the Vector Handler Library (set breakpoint, update time stamp counter, restore memory, ...); handling of calls to system APIs (exploitable: step over or skip, modify states after execution; non-exploitable: step over) and of general instructions (add, sub, mul, ...); execution of the instruction; then analysis and execution of the next instruction.]
Figure 3.2: Overview of Apate's Operation
the header, we extract the entry point and TLS callbacks (if any). We then add software
breakpoints at these locations. The user-space code of the debuggee will start from one
of these locations, which allows Apate to get control of the program from the beginning.
Next, we start from the entry point discovered in the pre-process stage, and single-
step through each instruction in the debuggee. This single-stepping helps us thwart
anti-disassembly attacks, and is achieved by setting the trap flag in the eflags register.
The cost of single-stepping lies in the additional time it takes to analyze malware. Apate
is 2.4–2.8 times slower than other debuggers (Table 3.7), but single-stepping greatly aids its
detection of anti-debugging. Apate detects between 58% and 465% more attack vectors
than other debugger-hiding approaches.
When Apate receives its first chance to handle the single-stepping exception, we
disassemble and analyze the instruction that is about to be executed. Based on the
instruction’s semantics, we make a decision on its handling policy. We have classified
all the instructions in the Intel x86 instruction set as either calls (conditional and uncon-
ditional jumps) or general instructions (everything else). If an instruction is a call, we
check if its destination resides in the user or the kernel space. A user-space destination
means that the debuggee is calling its sub-routines, so Apate single-steps into the call,
which allows the defenders to analyze the entire set of malware functionalities. If the
call is invoking a system API, Apate checks whether this API is in its list of possibly
exploitable APIs and may step over it or skip it (see Section 3.3.2). If an instruction is
a general one (e.g., add and sub), Apate single-steps it as it is, unless it is part of one
of our 79 attack vectors. If this is the case, a vector-specific handling will be invoked.
Finally, we may need to modify the debuggee state after executing the instruction to
hide the Apate’s presence.
While we believe we were comprehensive in our enumeration of attack vectors
known to date (plus 12 new vectors discovered by us), future malware attacks may
devise new vectors. New vectors can be easily added to Apate’s attack vector library,
and new handlers for these vectors can be added to Apate’s vector handler library.
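The overall dispatch can be pictured with the following conceptual C sketch; the types and helper functions are hypothetical stand-ins for debugger callbacks, not Apate's actual implementation:

/* Conceptual per-instruction loop: disassemble, match, handle, single-step. */
#include <stdbool.h>
#include <stddef.h>

typedef struct {
    unsigned int address;
    char mnemonic[16];
    bool is_call;
    bool targets_kernel_api;
} insn_t;

typedef struct {
    bool (*matches)(const insn_t *);   /* attack-vector library entry */
    void (*handle)(insn_t *);          /* corresponding countermeasure */
} vector_t;

/* hypothetical hooks into the underlying debugger */
extern insn_t disassemble_at_ip(void);
extern void   single_step(const insn_t *);
extern void   step_over_or_skip(const insn_t *);
extern void   fix_debuggee_state(const insn_t *);
extern bool   debuggee_running(void);

void apate_main_loop(const vector_t *library, size_t n) {
    while (debuggee_running()) {
        insn_t insn = disassemble_at_ip();      /* just-in-time disassembly */

        for (size_t i = 0; i < n; i++)          /* match against known vectors */
            if (library[i].matches(&insn))
                library[i].handle(&insn);

        if (insn.is_call && insn.targets_kernel_api)
            step_over_or_skip(&insn);           /* possibly exploitable system API */
        else
            single_step(&insn);                 /* user-space code: step into it */

        fix_debuggee_state(&insn);              /* hide any traces left behind */
    }
}

Because the library is just an array of match/handle pairs in this view, adding a new attack vector amounts to appending one more entry.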
3.3.2 Handling Anti-Debugging
In Section 3.2, we discussed sixteen subcategories of attacks that aim to either detect or
evade debuggers. In this section, we illustrate how we handle fifteen of these attacks in
Apate (we skip hardware breakpoints as there is a limited number of hardware debug
registers, limiting their use). This is also summarized in the fourth column in Table 3.1,
while the fifth column points out our novel contributions to attack handling. Out of six-
teen attack-vector subcategories, we propose novel handlers for five. For another seven
subcategories, prior literature [Fer11,Fal07,Ope07] has sketched ideas for the handlers,
but we are the first that have worked out the necessary details and implemented and
tested these handlers. Working out the details required looking through Windows man-
uals and identifying all possible ways that an attack could be performed. For the remain-
ing four attack-vector subcategories, we borrow handlers proposed and implemented by
others.
Breakpoint attacks. Apate only uses software breakpoints, which replace the opcode
of the selected malware instruction with a 0xcc byte. This byte will raise an exception
upon execution, and Windows will first allow Apate to handle this exception. To thwart
software read and software write attacks, Apate performs several actions. First, when a
software breakpoint is set, Apate records the breakpoint address in a lookup table called
breakpoint table, along with the original opcode. Second, during an active debugging
session, it monitors the debuggee’s access to its code section and compares the target of
each read and write instruction against the contents of the address field in the breakpoint
table. On a match from a read instruction, Apate returns the original opcode value from
the table, whereas in a write instruction, Apate updates the original opcode value in the
table. Our handling of breakpoint attacks has been proposed in [Fer11] but the authors
have not implemented this countermeasure.
An interesting case occurs when the malware’s instruction at the breakpoint address
is already an int 3. This scenario requires special handling, which we discuss in Sec-
tion 3.3.3.
Exception Attacks. Handling exception attacks is challenging. Malware can exploit
structured exception handler (SEH), vectored exception handler (VEH), or unhandled
exception filter (UEF) to set up exception handlers. After the debuggee sets a handler,
it can raise exceptions explicitly (int instruction) or implicitly (e.g., write to a non-
writable memory). When handling an exception, Apate will pass non-suppressible
exceptions to the debuggee and consume suppressible exceptions. Before passing the
non-suppressible exceptions, Apate also sets a breakpoint at the handler entry.
When an exception handler completes, Windows will direct the debuggee to the
return address that is saved in its exception record. Malware may tamper with the return
address. To handle this, Apate records the location of the return address on the stack
during its first chance of handling the exception. Apate steps into all debuggee’s excep-
tion handlers and single-steps through their instructions. When a ret is encountered,
Apate fetches the return address from the stored location which may have been modified
by malware, and sets a breakpoint at that location. This allows Apate to keep control of
the execution, even when malware attempts to escape.
To our best knowledge, none of the current research works differentiate between
suppressible and non-suppressible exceptions. Further, Apate sets breakpoints at handler
entry and fetches the return address right before the handler returns; these techniques
enable Apate to fully observe malware behavior and thwart escape attempts, and they are
not used in prior work.
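The following C++ sketch outlines this handling logic; the two stub functions stand in for the debug-engine plumbing, and all names are ours rather than Apate's.

#include <cstdint>
#include <cstdio>

// Stubs standing in for the debug-engine plumbing (hypothetical names).
static void SetSoftwareBreakpoint(uint32_t addr) { std::printf("breakpoint at %08x\n", addr); }
static uint32_t ReadDebuggeeDword(uint32_t addr) { (void)addr; return 0x401234; }

// Per-exception state that this style of handling needs to remember.
struct PendingException {
    uint32_t handler_entry;     // entry of the debuggee's SEH/VEH/UEF handler
    uint32_t return_addr_slot;  // stack slot holding the handler's return address
};

// First chance: consume suppressible exceptions; pass the rest to the
// debuggee, but plant a breakpoint at the handler entry so we can step into it.
static bool HandleFirstChance(bool suppressible, const PendingException& pe) {
    if (suppressible) return true;            // consumed, the debuggee never sees it
    SetSoftwareBreakpoint(pe.handler_entry);
    return false;                             // passed to the debuggee
}

// When single-stepping reaches the handler's ret: re-read the return address
// (malware may have modified it) and trap there, so control is never lost.
static void HandleHandlerReturn(const PendingException& pe) {
    SetSoftwareBreakpoint(ReadDebuggeeDword(pe.return_addr_slot));
}

int main() {
    PendingException pe{0x401000, 0x0012ff60};
    HandleFirstChance(false, pe);   // non-suppressible: breakpoint at handler entry
    HandleHandlerReturn(pe);        // ret reached: breakpoint at the (possibly tampered) return address
}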
Flow Control Attacks. Attacks in this subcategory must be handled carefully; oth-
erwise, malware can escape the debugger. For callback and multi-threading attacks,
Apate will insert a software breakpoint at the entry of the callbacks or at the start address
of the thread. These breakpoints will transfer the control to Apate, enabling defenders to
fully analyze malware. Apate will skip the execution of the APIs in direct hiding sub-
category, by adding the size of the current instruction to ip. To bypass self-debugging
check, Apate sets the EPROCESS->DebugPort field to 0. This allows another debugger to
attach to the same process as Apate, which is similar to the handling in [Tul08].
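As a minimal illustration of the direct-hiding handling, the sketch below advances the instruction pointer past the API call and fakes a benign return value; the CpuContext layout and names are hypothetical, not Apate's actual interface.

#include <cstdint>
#include <cstdio>

// Hypothetical register snapshot; only the fields used here are shown.
struct CpuContext {
    uint32_t eip;  // instruction pointer
    uint32_t eax;  // return-value register on x86
};

// Skip a call to a "direct hiding" API: jump over the call instruction and
// pretend the API succeeded without finding anything suspicious.
void SkipHidingApiCall(CpuContext& ctx, uint32_t call_insn_size) {
    ctx.eip += call_insn_size;
    ctx.eax = 0;
}

int main() {
    CpuContext ctx{0x401000, 0xdeadbeef};
    SkipHidingApiCall(ctx, 6);  // e.g., skipping a 6-byte indirect call
    std::printf("eip=%08x eax=%08x\n", ctx.eip, ctx.eax);
}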
Interaction Attacks. Apate will skip execution of system calls in the hijacking sub-
category. If the API changes any system state, Apate mimics this effect to create an
impression of faithful execution for malware.
The timing attacks can be very complex, and malware may detect the inconsistent
timing by querying different sources. To defeat these attacks, Apate maintains a software
time counter and applies it to adjust the return values of time queries, which may be
misused for timing attacks. We update our time counter by adding a small delta which
reflects the CPU cycles for each malware instruction that has been executed. We also add
a small, randomly chosen offset to the final value of the time counter, which can defeat
attempts to detect identical timing of repeated runs. Previous works [VY06, DR+08]
only add a constant value to their time sources, which can be detected by malware if
they measure whether the elapsed time is the same. Apate does not handle attacks when
malware queries external time sources. There is no accurate way to detect and handle
this query [ZLS+15], and we leave it for future work.
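The C++ sketch below illustrates such a software time counter; the class and method names are our own, and the per-instruction deltas and jitter range are illustrative values rather than Apate's actual parameters.

#include <cstdint>
#include <cstdio>
#include <random>

// Illustrative software time counter: each executed instruction advances the
// counter by a small per-instruction delta, and every value handed back to
// the debuggee gets a small random offset so repeated runs never observe
// identical timings.
class SoftTimeCounter {
public:
    explicit SoftTimeCounter(uint64_t start) : ticks_(start), rng_(std::random_device{}()) {}

    // Called after every single-stepped instruction; delta approximates the
    // CPU cycles the instruction would cost on real hardware.
    void OnInstructionExecuted(uint64_t delta) { ticks_ += delta; }

    // Called when the debuggee queries time (rdtsc, GetTickCount(), ...).
    uint64_t ReadTime() {
        std::uniform_int_distribution<uint64_t> jitter(0, 15);
        return ticks_ + jitter(rng_);
    }

private:
    uint64_t ticks_;
    std::mt19937_64 rng_;
};

int main() {
    SoftTimeCounter clock(1000000);
    clock.OnInstructionExecuted(3);   // e.g., a mov
    clock.OnInstructionExecuted(25);  // e.g., a div
    std::printf("%llu\n", static_cast<unsigned long long>(clock.ReadTime()));
}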
Anti-Disassembly. Apate disassembles each instruction just before it is being executed,
instead of disassembling all instructions at the start of the debugging session. Therefore,
our disassembly is exactly the same as the instructions executed by the debuggee, which
means that Apate can handle both self-modifying and packed code. This is similar to
the work in [VY06].
Debugger Traces. To hide the debugger traces, we have enumerated memory locations
and registers that may be used to store them, and the Windows APIs that access these
locations. To handle indirect read attacks, Apate compares each debuggee’s instruction
and its parameters with this list of APIs and locations. If there is a match, Apate provides
a fake reply, which hides the debugger’s presence.
Instead of using Windows APIs, malware may read memory locations and registers
directly to look for debugger traces. To handle direct read attacks, Apate detects access
to our list of locations which may contain debugger traces, and overwrites these traces
with values that hide the debugger's presence. This does not affect the accuracy of
the debugger's execution. The handling of indirect and direct reads was proposed in [Fer11,
Fal07, Ope07], but the details were worked out and implemented by us. We further propose
and implement two novel handlers for our newly discovered attacks, which use cs and
ds registers.
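The following C++ sketch illustrates the lookup-based fake-reply mechanism just described; the table entries shown are only small example fragments, and the data structures are our own illustration rather than Apate's lists.

#include <cstdint>
#include <map>
#include <optional>
#include <string>

// Illustrative lookup tables for trace hiding.
struct TraceHiding {
    // Indirect reads: APIs whose result reveals a debugger, and the
    // debugger-free result to report to the debuggee instead.
    std::map<std::string, uint32_t> api_fakes{
        {"IsDebuggerPresent", 0},
        {"CheckRemoteDebuggerPresent", 0},
    };
    // Direct reads: memory locations (e.g., PEB fields) holding traces, and
    // the clean byte a debugger-free run would contain.
    std::map<uint32_t, uint8_t> location_fakes;

    std::optional<uint32_t> FakeApiResult(const std::string& api) const {
        auto it = api_fakes.find(api);
        if (it == api_fakes.end()) return std::nullopt;
        return it->second;                 // substitute the reported result
    }
    std::optional<uint8_t> FakeMemoryRead(uint32_t addr) const {
        auto it = location_fakes.find(addr);
        if (it == location_fakes.end()) return std::nullopt;
        return it->second;                 // overwrite the trace being read
    }
};

int main() {
    TraceHiding hiding;
    hiding.location_fakes[0x7ffde002] = 0;            // e.g., PEB.BeingDebugged (address illustrative)
    auto r = hiding.FakeApiResult("IsDebuggerPresent");
    return r.has_value() ? static_cast<int>(*r) : -1; // returns the faked 0
}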
3.3.3 Attacks Against Apate
There are several possible attacks on Apate, which could lead to malware detecting its
presence. We have developed special handlers for these attacks, which we describe
below.
Our Apate framework sets the trap flag each time it single-steps an instruction. If
malware reads the trap flag, it can detect the debugger’s presence. Similarly, malware
may clear the trap flag and check it afterwards to detect a debugger. All reads and writes
of the trap flag occur through a few dedicated instructions, listed in “direct read” subcat-
egory of Table 3.1 (pushf/pushfd/popf/popfd, pop ss). These reads and writes are
detected by Apate. We handle the attacks by creating a “debuggee-only” version of the
trap flag. Malware reads and writes manipulate this copy.
If malware sets the trap flag and Apate consumes the corresponding single-stepping
exception, malware can detect the debugger’s presence. To handle this case correctly,
Apate needs to consume the single-stepping exceptions generated by itself, but pass
those raised by the debuggee. Apate detects this case by checking for the presence of the
value 1 in the debuggee-only version of the trap flag. If the single-stepping exception is
intentionally raised by the debuggee, Apate will faithfully pass it to the debuggee.
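The C++ sketch below illustrates this debuggee-only (shadow) copy of the trap flag; the class and its interface are our own simplification of the mechanism described above.

#include <cstdint>

// Illustrative shadow trap flag: the real TF bit is owned by the debugger for
// single-stepping, and the debuggee only ever observes and modifies this copy.
class ShadowTrapFlag {
public:
    // popf/popfd, pop ss: the debuggee writes EFLAGS, so update only the shadow bit.
    void OnFlagsWrite(uint32_t eflags_written) { shadow_tf_ = (eflags_written >> 8) & 1; }

    // pushf/pushfd: the debuggee reads EFLAGS; splice the shadow bit in, which
    // hides the trap flag set for single-stepping.
    uint32_t OnFlagsRead(uint32_t real_eflags) const {
        return (real_eflags & ~0x100u) | (shadow_tf_ << 8);
    }

    // On a single-step exception: if the shadow bit is 1, the debuggee set TF
    // itself and expects the exception, so it is passed through; otherwise the
    // exception was caused by the debugger's own single-stepping and is consumed.
    bool ShouldPassSingleStepToDebuggee() const { return shadow_tf_ == 1; }

private:
    uint32_t shadow_tf_ = 0;  // bit 8 of EFLAGS, as the debuggee believes it to be
};

int main() {
    ShadowTrapFlag shadow;
    uint32_t seen = shadow.OnFlagsRead(0x302);             // real EFLAGS has TF set by the debugger
    bool pass = shadow.ShouldPassSingleStepToDebuggee();
    return static_cast<int>(seen & 0x100) + (pass ? 1 : 0); // debuggee sees TF clear; exception consumed
}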
The next attack is specific to WinDbg, which is our chosen integration platform.
WinDbg engine has a special handling for the software breakpoint exception, which is
intentionally raised by the debuggee (int 3). WinDbg will lose control when single-
stepping this instruction. Since WinDbg is closed-source software, we could not diag-
nose the reason behind this occurrence. To work around this problem, when Apate
single-steps the instruction preceding int 3, it will modify the single-stepping excep-
tion record on the stack to transform it into a software breakpoint record. Specifically,
we change the exception code to be EXCEPTION_BREAKPOINT and also update the excep-
tion address to be the beginning of the int 3 instruction. This enables Apate to retain
control and step into the exception handler for int 3.
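As a sketch of this workaround, the C++ fragment below rewrites a single-step exception record so it appears as a software-breakpoint exception; in Apate the record lives on the stack during exception dispatch, while here it is simply passed in by pointer, and the function name is ours.

#include <windows.h>
#include <cstdio>

// When the instruction about to be single-stepped is the debuggee's own int 3,
// rewrite the single-step exception record as a breakpoint record at that address.
void RewriteSingleStepAsBreakpoint(EXCEPTION_RECORD* rec, void* int3_address) {
    if (rec->ExceptionCode == EXCEPTION_SINGLE_STEP) {
        rec->ExceptionCode    = EXCEPTION_BREAKPOINT;   // 0x80000003
        rec->ExceptionAddress = int3_address;
    }
}

int main() {
    EXCEPTION_RECORD rec = {};
    rec.ExceptionCode = EXCEPTION_SINGLE_STEP;
    RewriteSingleStepAsBreakpoint(&rec, reinterpret_cast<void*>(0x00401005));
    std::printf("code=%08lx addr=%p\n", rec.ExceptionCode, rec.ExceptionAddress);
}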
3.3.4 Uses of Apate
Apate can be used to automatically single-step through each instruction in malware bina-
ries, and record disassembled instructions and system traces, which can be analyzed to
detect malicious behavior [CSK+10, MSF+08]. In this use case, Apate compares each
instruction against attack vectors in its library, and applies countermeasures automati-
cally where needed. Single-stepping guarantees that all anti-debugging checks will be
detected and handled, but it does slow down the analysis (up to 2.8× in our tests). This
slow-down, however, may not be problematic, as malware analysis is frequently per-
formed in automated, bulk fashion.
In addition, Apate can also be used to assist interactive debugging, where the users
use single-stepping only when they desire to closely examine a portion of malware code.
In this case, the overhead introduced by Apate's single-stepping is so small that users
cannot notice it.
3.4 Evaluation
In this section, we compare Apate against several mainstream debuggers, using five
data sets (Table 3.2). We first use Apate in Section 3.4.1, to execute 881 unknown
malware samples to evaluate the spectrum of the anti-debugging techniques present in
these samples, and motivate our focus on handling anti-debugging. We find that 60% of
samples implement anti-debugging, with most samples using between 1 and 100 checks.
Therefore, anti-debugging is a prevalent and important problem to study.
Next, we design tests in Section 3.4.2, each implementing one of our 79 attack vec-
tors, and evaluate how many of these can be passed by popular debuggers versus Apate.
Apate outperforms other debuggers by 58–465%.
We then analyze 4 known malware samples with heavy anti-debugging checks in
Section 3.4.3. We demonstrate that Apate can detect all known anti-debugging checks
mentioned in the references, as well as a few checks that those references missed.
To prove that Apate effectively hides a debugger from malware, we randomly select
20 malware samples and evaluate if they exhibit the same activities within Apate as
in debugger-free runs (Section 3.4.4). We discover that Apate can successfully hide
debugger’s presence in all 20 cases.
We cannot directly compare Apate with contemporary research solutions as their
code or datasets are not publicly available. Instead, in Section 3.4.5 we reproduce tests
against packers from [ZLS+15], which enables us to compare Apate's results to one
related work – MALT. Apate defeats all ten tested packers, which is the same as what
MALT achieves. We compare Apate to other research solutions in Section 3.6.
Table 3.2: Data Sets for Evaluation
Name | Num | Goal | Findings
Unknown malware | 881 | Find spectrum of anti-debugging techniques | 1) Malware uses 0–10 distinct anti-debugging checks; 2) A single check can be used up to 695,219 times.
Vector tests | 79 | Evaluate popular debuggers vs. Apate | 1) Apate addresses all the attacks; 2) The second-best debugger solves only 50.
Known malware | 4 | Demonstrate practical use of Apate | Apate finds all the anti-debugging techniques in the samples.
Unknown malware | 20 | Prove Apate effectively hides a debugger from malware | The samples show the same malicious behaviors in Apate as in a physical machine.
Packed binary | 10 | Prove Apate outperforms other research solutions | Apate overcomes all the anti-debugging techniques provided by commercial packers.
In our tests, we use Windows 7 Pro x86 with SP1 (retail build) and we integrate
Apate with WinDbg v6.3 x86. The physical machine has Intel Xeon CPU E3-1245 V2
@ 3.40 GHz, with 4 GB memory, and a hard drive of 1 TB.
3.4.1 Anti-Debugging is Prevalent
For these tests, we select 1,131 binaries from Open Malware [Geo17] that are captured
from 2006 to 2015. These samples are then sent to a malware analysis website Virus-
Total [Tot17], which uses about 20–50 anti-virus products to analyze each binary. We
retain those binaries that are detected as malicious by more than 50% anti-virus prod-
ucts, and this leaves us with 881 samples. Each binary is automatically single-stepped
for a maximum of 20 minutes under Apate. Some works [PDZ+14] run samples for up
to several hours; however, their goal is to explore all execution paths, while our goal is
to detect the anti-debugging checks, which usually occur at the beginning of the run.
Table 3.3 shows the classification of the samples, with the second column showing
the number of binaries in each category.
Overview. Figure 3.3 shows the number of the unique and the total anti-debugging
checks that are detected in each sample. The samples are sorted based on the number
of checks detected, and the x-axis shows the rank of the sample. In our data set, 354
samples do not adopt any anti-debugging techniques, and 527 or 60% utilize at least one
Table 3.3: Classifications of Unknown Malware Samples
Tag # Samples Percent
Adware 7 1%
Backdoor 20 2%
Bot 83 9%
Downloader 16 2%
Trojan 310 35%
Worm 18 2%
Generic 427 48%
Total 881 100%
Figure 3.3: Anti-debugging Techniques in Each Sample (unique and total number of checks per sample; the total is plotted on a log10 scale)
check. Most samples (464/881 = 53%) use between 1 and 100 checks. However, there
is a heavy tail, with the highest-ranked instance using 695,219 checks, although most of the
checks are just different instances of the same attack vector. Overall, there are between
1 and 10 unique attack vectors employed per sample, with 90% of samples utilizing six
or fewer vectors.
Table 3.4 shows the popularity of anti-debugging techniques in our malware samples
over time. The analysis time is the year when the samples are submitted to VirusTotal
and analyzed. We use this information to approximately date the samples. Malware may
Table 3.4: Popularity of Anti-debugging Techniques
Analysis Year # Total # w/ Anti-debug Percent
2006 22 16 73%
2007 17 10 59%
2008 36 29 81%
2009 66 42 64%
2010 31 19 61%
2011 74 46 62%
2012 128 88 69%
2013 285 136 48%
2014 182 125 69%
2015 40 16 40%
Total 881 527 60%
have been submitted to VirusTotal much later than it was released into the wild, so our
approach may underestimate its age.
We find that the percentage of malware using some anti-debugging technique fluc-
tuates between 40% and 81% across the years, showing no clear upward or downward
trend (with the average of 60%). In some years our number of samples was very low
and thus our results may be biased for these years.
Spectrum of anti-debugging techniques. Table 3.5 displays the details of anti-
debugging techniques which are detected in our samples. The third column shows the
number (and, where interesting, the percentage) of samples that apply a particular anti-
debugging check. The last column shows the maximum number of times the given check
was used by a single sample. We highlight only a few major findings.
In the “Traces” category, checking the trap flag is the most popular anti-debugging
technique, used in 15% of the samples. Our results indicate that 83 samples read or
write to the trap flag to detect debuggers, and one sample conducts 102,162 instances of
the trap flag attack. In the APIs category, int 2e attack is the most popular debugger
detection technique, adopted by 2% of our samples. This instruction is actually a system
call and does not raise any exceptions (see Section 3.4.3). The maximum number of calls
Table 3.5: Spectrum of Anti-debugging Techniques
Cat. | Details | Samples | Max
Traces | Trap flag | 83/15% | 102,162
Traces | CheckRemoteDebuggerPresent() | 10 | 1
APIs | int 2e | 11/2% | 1
APIs | CreateFileA() | 5 | 5
APIs | OutputDebugString() | 3 | 90
APIs | FindWindow("OLLYDBG") | 1 | 1
Soft. bp/int | int 3 | 45/8.5% | 6
Soft. bp/int | int 1 | 1 | 1
Except. | ACCESS_VIOLATION | 116/22% | 776
Except. | PRIVILEGED_INSTRUCTION | 13 | 2
Except. | ILLEGAL_INSTRUCTION | 11 | 2
Except. | INTEGER_DIVIDE_BY_ZERO | 3 | 1
Interact. | GetTickCount() | 141/27% | 695,219
Interact. | QueryPerformanceCounter() | 82 | 1
Interact. | rdtsc | 14 | 216,120
Interact. | GetLocalTime() | 8 | 156
Interact. | BlockInput() | 6 | 1
Implicit flow | SEH | 308/58% | 148
Implicit flow | UnhandledExceptionFilter() | 40 | 9
Implicit flow | TLS callback | 16 | 2
Implicit flow | AddVectoredExceptionHandler() | 1 | 278
Disassem. | Self-modifying | 145/27% | 3,423
Disassem. | Instruction overlapping | 125 | 805
to a particular API in a sample is 90 instances (OutputDebugString()). We believe
that the low popularity of API-based debugger detection techniques is due to the high
time cost that is incurred. It is faster for malware to check the debugger traces directly
rather than to invoke APIs. In addition, API usage is easily detected in a debugger.
3.4.2 Apate Outperforms Other Debuggers
Using our enumeration of attack vectors in Table 3.1, we design 79 test cases, one per
vector. Each test case attempts to detect the debuggers using only one particular
attack vector.
Example Test Case. Figure 3.4 shows a test case, which exploits its parent process. In
Windows, an application is typically launched by clicking its icon on the desktop. Since
the icon is rendered by a GUI shell (Explorer), the parent process of the application will
be “explorer.exe”. However, if the application is launched by a debugger, the parent
process will be the debugger instead of Explorer. Malware can use this distinction to
detect the debugger’s presence.
At line 3, the debuggee retrieves the handle to the Shell’s desktop window. Then
the identifier of the process that created the handle is saved on the stack after the call
at line 7. Line 13 retrieves the information about the debuggee and saves it into the
buffer declared at line 1. Finally, two process IDs are compared against each other at
line 16, and the debugger is detected if they are different. The unit test returns 0 (line
19) if no debugger is detected or 1 (line 22) otherwise. However, there is a potential
false positive if the system has multiple shells running. In this case, the parent process
IDs of two applications are different if they are launched by two separate shells, even
when no debugger is present. Malware authors may opt to accept this false positive, as
the chances of it under normal operations are small.
Evaluation Method. Table 3.6 gives the number of test cases in each attack category,
and the rest of the table lists the numbers of test cases handled by each debugger. Dif-
ferent test cases need to be evaluated in specific ways such as single-stepping, setting a
breakpoint in the code, free execution, etc.
We compare Apate to several popular debuggers: WinDbg, IDA Pro, OllyDbg, and
Immunity Debugger [Deb15]. Where possible, we evaluate both a basic version of a
debugger and any extensions that aim to handle anti-debugging. IDA Pro’s version
is 6.6, with two highly-ranked debugger-hiding plugins: Stealth v1.3.3 [New14] and
ScyllaHide v1.2 [Scy16]. We evaluate OllyDbg 2.01 and two debugger-hiding plugins:
OllyExt v1.8 [Rce12] and ScyllaHide v1.2. Since the aadp v0.2.1 [aad12] plugin only
works in OllyDbg v1, we switch to the latest v1.10 when testing aadp. Each test case
takes a few seconds to evaluate in each debugger.
1 .data buffer db 60 dup(0)
2 .code
3 main: call GetShellWindow
4 push eax
5 push esp ; output: process ID
6 push eax ; input: handle of Shell
7 call GetWindowThreadProcessId
8 push 0
9 push 18h
10 push offset buffer
11 push 0 ; ProcessBasicInformation
12 push -1 ; the debuggee
13 call NtQueryInformationProcess
14 pop eax
15 mov ebx, offset buffer
16 cmp [ebx + 14h], eax
17 jne debugger_detected
18 push 0
19 call ExitProcess
20 debugger_detected:
21 push 1
22 call ExitProcess
23 end main
Figure 3.4: API Attack - Checking Parent Process ID
Results. We find that all basic versions of the debuggers can only handle a limited
number of attack vectors. WinDbg achieves the best performance, identifying 22 out of
79 vectors, while OllyDbg and IDA Pro are able to handle 21 and 17 respectively. Plug-
ins substantially improve the debuggers’ robustness. For example, IDA Pro with Stealth
and ScyllaHide extensions can handle 2.5× as many anti-debugging techniques (43 vs. 17)
as the basic IDA Pro. Apate can handle all 79 test cases. Compared to the second-best
debugger, OllyDbg with OllyExt, Apate outperforms it by (79 - 50)/50 = 58%. With
regard to the basic WinDbg, we handle 260% more test cases. The largest difference lies
between IDA Pro/Immunity Debugger and Apate, where Apate handles 465% more test
cases.
Table 3.6: Attack vectors, handled by different debuggers
Category | Test Cases | IDA Pro | IDA Pro/Stealth | IDA Pro/ScyllaHide | OllyDbg | OllyDbg/OllyExt
Traces | 9 | 0 | 5 | 5 | 0 | 5
APIs | 21 | 4 | 15 | 16 | 5 | 18
Hard. bp | 1 | 0 | 1 | 1 | 0 | 1
Soft. bp | 5 | 0 | 0 | 0 | 2 | 2
Except. | 18 | 8 | 14 | 13 | 8 | 13
Interact. | 9 | 0 | 2 | 2 | 0 | 4
Imp. flow | 14 | 5 | 6 | 6 | 5 | 6
Disassem. | 2 | 0 | 0 | 0 | 1 | 1
Total | 79 | 17 (22%) | 43 (54%) | 43 (54%) | 21 (27%) | 50 (63%)

Category | Test Cases | OllyDbg/ScyllaHide | OllyDbg/aadp | ImmDbg | WinDbg | Apate
Traces | 9 | 5 | 3 | 0 | 4 | 9
APIs | 21 | 18 | 8 | 4 | 7 | 21
Hard. bp | 1 | 0 | 0 | 0 | 0 | 1
Soft. bp | 5 | 2 | 1 | 0 | 0 | 5
Except. | 18 | 13 | 2 | 9 | 5 | 18
Interact. | 9 | 3 | 3 | 0 | 0 | 9
Imp. flow | 14 | 6 | 5 | 4 | 5 | 14
Disassem. | 2 | 1 | 0 | 0 | 1 | 2
Total | 79 | 48 (61%) | 22 (28%) | 17 (22%) | 22 (28%) | 79 (100%)
3.4.3 Apate Detects Known Vectors
In this section, we evaluate Apate using four malware samples (Table 3.7), which are
known to employ heavy anti-debugging techniques and have been manually analyzed
by others. We also compare the performance of Apate with OllyDbg/OllyExt, its closest
competitor from the previous evaluation. For brevity, we denote OllyDbg/OllyExt with
just “OllyExt”. In our evaluation, we set both Apate and OllyExt to automatically single-
step through the samples until they exit. At the same time, we record the assembly
code that has been executed. To prevent risks to the Internet, we disable all network
connectivity.
Table 3.7: Results of Known Malware Samples
No. | Apate | OllyExt | Anti-debugging Techniques | OllyExt's Failure Points
1 | 176 min | 63 min | SEH attack, anti-disassembly, int 2e attack, int 3 attack, trap flag attack (reading), self-modifying | Anti-disassembly, int 2e attack, trap flag attack
2 | 71 min | 29 min | BeingDebugged flag (XP, Win7), ProcessHeap flag (XP), NtGlobalFlag (XP, Win7) | ProcessHeap flag (XP)
3 | 76 min | 32 min | TLS callback, FindWindow(), OutputDebugStringA() | OutputDebugStringA()
4 | 65 min | 26 min | QueryPerformanceCounter(), Exception attack, GetTickCount(), rdtsc | Exception handler entry and return address, rdtsc
Sample 1. (Md5: 79f24cefd98c162565da71b4aa01e87b) This sample is a bot and
has been analyzed by the Honeynet Project [Wer10]. We illustrate its anti-debugging
techniques in Figure 3.5. At the beginning, it installs an SEH handler and then writes
to a non-writable memory location. This will cause an exception and the execution
flow will jump to the handler. This handler decrements a variable, initialized to
0x15000, by 1 and checks whether it has reached 0. If the condition does not hold, the malware
returns to the faulting instruction and executes it again. Otherwise, the malware mod-
ifies the return address to transfer the control flow to another location. Following this
execution flow, the malware performs “int 2e”, “int 3”, and trap flag attacks. For this
sample, Apate detects and defeats all its anti-debugging techniques that are mentioned
in the reference [Wer10]. However, OllyExt fails to detect three attacks (last column of
Table 3.7).
Sample 2. (Md5: 7faafc7e4a5c736ebfee6abbbc812d80) Samples 2–4 are taken
from [SH12] and this sample is an HTTP reverse backdoor. This malware checks three
flags in the PEB: BeingDebugged, NtGlobalFlag, and ProcessHeap, shown in Fig-
ure 3.6. The former two exist in both Windows XP and Windows 7 systems, but the last
one is only present in Windows XP. We evaluate this sample in Windows XP SP3 and
notice that OllyExt cannot hide the ProcessHeap flag. As a result, the malware detects the
debugger and removes itself under OllyExt. Apate can hide all these flags, which leads
the malware to believe it is running natively.
Figure 3.5: Anti-debugging Techniques of Sample 1 (flowchart of the SEH setup, the write to non-writable memory, and the subsequent int 2e, int 3, and trap flag attacks)
Figure 3.6: Anti-debugging Techniques of Sample 2 (flowchart of the BeingDebugged, NtGlobalFlag, and ForceFlags checks; the sample exits if any check indicates a debugger)
Sample 3. (Md5: e88b0d6398970e74de1de457b971003f) This malware hides
its anti-debugging techniques in the TLS callback as shown in Figure 3.7. The TLS
callback function accepts a parameter from Windows that informs the function when
it is being called. For example, the value of 1 is used when the process is starting
up, 2 when a thread is starting up, and 3 when the process is being terminated. The
TLS callback function sets a parameter based on some anti-debugging checks, whose
value indicates presence or absence of a debugger. This sample performs different anti-
debugging techniques according to the parameter value passed in. During the process
creation, the callback calls the FindWindow() API with the class name “OLLYDBG”.
The malware uses this function to check whether OllyDbg is running with its default
window’s title. If the title is found, the sample will exit. This malware also employs two
other anti-debugging checks that are carried out in the TLS callback when a thread is
starting up. The first check exploits the function OutputDebugString(),
and the second one checks the BeingDebugged flag. The malware initially calls
SetLastError() to save a last-error code (0x3039) in the thread local storage. After-
wards, OutputDebugString() is invoked, which sends a string to a potential debugger
for display. If there is no debugger attached, an error code is set by the function. Finally,
the malware retrieves the error code by calling GetLastError(). If these two error
codes are the same, it means that a debugger is present. In our testing, OllyExt can-
not detect this anti-debugging check. The second check reads the BeingDebugged flag
from the PEB structure. OllyExt is able to handle this second attack.
Sample 4. (Md5: 3612702fb6e5c1f756c116d9fce34677) This malware sample
belongs to the reverse shell category and features heavy timing attacks, as shown in
Figure 3.8. The malware first calls QueryPerformanceCounter(), which retrieves the
current value of the performance counter. A custom SEH handler is then installed and
immediately triggered by dividing a number by 0. Next, the same function is called
again and the current time is saved in t2. If the difference between t2 and t1 is larger
than 0x4b0, the malware will generate a wrong salt to decrypt a string. This causes
Figure 3.7: Anti-debugging Techniques of Sample 3 (flowchart of the TLS callback: FindWindow() at process startup; SetLastError()/OutputDebugString()/GetLastError() and the BeingDebugged flag at thread startup)
the malware to exit prematurely. If this timing check passes, GetTickCount() and
rdtsc are used to perform the following timing attacks. Apate successfully handles all
these attacks; however, OllyExt has two limitations when analyzing this sample. First,
it cannot follow the implicit flow of the exception handling. Although it can prompt
whether the users wish to pass the exception to the malware, it can neither capture the
handler entry when jumping to the handler nor obtain the return address when leaving
from the handler. Second, OllyExt does not pass the third timing check.
Time cost. In our evaluation, Apate performs 2.4–2.8× slower than OllyExt. This is
expected, because Apate considers more anti-debugging checks for each instruction and
handles more attack vectors, which improves the accuracy of malware analysis.
3.4.4 Apate Deceives Malware
Our previous evaluations demonstrate that Apate detects more anti-debugging checks
than its competitors. In this section, we evaluate if Apate can successfully hide a debug-
ger’s presence. For chosen twenty samples, we compare a sample’s functionalities in
Figure 3.8: Anti-debugging Techniques of Sample 4 (flowchart of the timing checks using QueryPerformanceCounter(), GetTickCount(), and rdtsc, interleaved with SEH manipulation)
a native run (without any debugger) with its functionalities when run under Apate. If
these two match, we conclude that the debugger was successfully hidden. The scope of our
evaluation is necessarily limited because: (1) there is no ground truth about which anti-
debugging checks are possible, and thus we cannot quantify our coverage; (2) there is
no well-understood measure of malware functionalities, and thus we have to define our
own; (3) we analyze a small number of samples, since our recording and comparison of
malware functionalities is time-consuming and manual.
We employ persistent file and network activities of malware as a proxy for measur-
ing its malicious functionalities. We regard them as the manifestations of a malware’s
attempt to either obtain some information (about the system or from the Internet) or
to share some information with the attacker. We term these activities “core malware
utilities”. In our tests, we monitor file and network activities under a native run, and
compare them to the same activities under a debugger. We test Apate and OllyExt under
these conditions. Thus we can both evaluate the completeness of Apate, and contrast it
with its close competitor. We note that all our tests are done in a closed environment.
Since some malware performs connectivity checks and may abort if these fail, our tests
do not fully expose all malware functionalities. Yet, this was a trade-o we had to make
to protect the Internet from attacks that may be launched by tested malware.
For file access, we record file creations, deletions and modifications. We save the
hard drive into raw disk images before and after running malware, and extract the file
information using the SleuthKit framework [Tec17, KVK14]. By comparing the file
meta data, we obtain the created, deleted, and modified files for each malware sample.
For network access, we record the destination IP addresses and ports of traffic on our
network interface, and then drop the packets to contain risk to Internet hosts from our
experiments.
Native functionality. All file and network activities are noisy, because regular OS
operation overlaps malware’s activities. We aim to identify the activities that are invari-
ant in each malware run, as well as those that do not occur during regular OS operation.
To achieve this goal, we first observe base OS activities six times without malware. We
create a union of all the created, deleted, and modified files, and all network commu-
nications – U_base. Next, we run malware natively three times and create an inter-
section of the activities found in all three native runs – I_native. We define the set difference
S_sig = I_native \ U_base as malware's signature. This selects only those activities that occur
in all native runs but never in base runs. In our evaluation, we look for items from
this signature to determine if malware performs the same malicious activities with and
without Apate.
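For concreteness, the following C++ sketch computes this signature as a set difference over activity descriptions; representing each activity as a plain string is our own simplification, and the example entries are hypothetical.

#include <algorithm>
#include <iostream>
#include <iterator>
#include <set>
#include <string>

// S_sig = I_native \ U_base: activities seen in every native run but never in
// any base (malware-free) run.
using ActivitySet = std::set<std::string>;

ActivitySet Signature(const ActivitySet& i_native, const ActivitySet& u_base) {
    ActivitySet s_sig;
    std::set_difference(i_native.begin(), i_native.end(),
                        u_base.begin(), u_base.end(),
                        std::inserter(s_sig, s_sig.begin()));
    return s_sig;
}

int main() {
    ActivitySet u_base   = {"modify C:\\Windows\\Prefetch\\A.pf"};
    ActivitySet i_native = {"create C:\\evil.dll", "connect 10.0.0.5:80",
                            "modify C:\\Windows\\Prefetch\\A.pf"};
    for (const auto& a : Signature(i_native, u_base)) std::cout << a << "\n";  // prints the two malware-only activities
}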
Evaluation Method. We test 20 random malware samples that make moderate use of
anti-debugging techniques as found by Apate. We set both Apate and OllyExt to single-
step through the samples automatically, similar to the method used in Section 3.4.3.
During our evaluation, we try to make the testing conditions as fair as possible for each
platform. After launching the malware in the native run, we wait 10 minutes before
powering off the system and taking a snapshot of the disk image. Under OllyExt and
Apate, we allow the malware to execute until it terminates itself, then we wait 10 min-
utes before collecting the data. For OllyExt, we adapt its tracing mode so that it performs
single-stepping. In addition, we disable its “command emulation” and “using hardware
breakpoints for stepping”. Therefore, OllyExt and Apate operate under the same debug-
ging principles. In our evaluation, the time needed to evaluate one sample (including
saving hard drives and extracting activities, which accounts for up to 93% of the total time)
under Apate is about 8–10 hours, while the time it takes for OllyExt is about 5–6
hours.
Results. Malware functionalities under Apate were identical to those in native runs for
all twenty samples. We thus conclude that Apate successfully hides the debugger’s pres-
ence from malware. In OllyExt runs, however, nine out of twenty malware samples show
reduced activity, both with regard to file accesses and network traffic. This indicates that
malware has detected OllyExt in 45% of the cases.
3.4.5 Apate Comparable to MALT
Nowadays, malware authors widely adopt packers to obfuscate the malware’s binary
code, and many packers provide anti-debugging functions. Zhang et al. evaluate a
Table 3.8: Running a Packed Executable
Packers | Apate | MALT† | OllyExt
UPX 3.91w | OK/15 m | OK | OK/7 m
PELock v1.0694 | OK/13 m | OK | Div. by 0/5 m
Themida v2.3.4.14 | OK/14 m | OK | Priv. insn./6 m
ASPack v2.38 | OK/9 m | OK | OK/5 m
VMProtect 3.0.6 | OK/7 m | OK | Search SFX/4 m
eXPressor v1.8.0.1 | OK/18 m | OK | OK/8 m
PECompact v3.02.2 | OK/16 m | OK | Access viol./7 m
Obsidium v1.5 | OK/17 m | OK | Access viol./8 m
Armadillo v2.01 | OK/12 m | OK | int 3/5 m
RLPack v1.21 | OK/19 m | OK | OK/6 m
Total | 10 | 10 | 4
† Results taken from MALT [ZLS+15]
recent research solution to anti-debugging, MALT [ZLS+15], against ten popular pack-
ers. We cannot directly compare Apate with MALT because its code is not public, but
we repeat the same test for indirect comparison, and we show OllyExt’s results in the
same settings.
We apply ten popular packing tools (Table 3.8) to pack a custom application which
pops up a window upon execution. We enable all anti-debugging functions in the pack-
ing tools. If our custom executable successfully opens a window when being debugged,
we conclude that malware fails to detect the debugger.
Table 3.8 shows the results obtained from Apate, MALT and OllyExt runs. Apate and
OllyExt both single-step through binaries automatically. We further show the execution
times for Apate and OllyExt. Apate (and MALT) successfully open the window for all
ten packed binaries under each packer, while OllyExt can only handle four of them.
3.5 Limitations
While Apate surpasses other debuggers in our tests, there are some limitations that we
need to address in our future work. First, our tests prove that Apate can defeat every
attack vector in our library, but it is possible that there are some combinations of vectors,
or some vectors we have not discovered, which Apate will not be able to handle. If new
anti-debugging checks are devised in the future, Apate’s library of attack vectors and
handlers can be extended accordingly. Our future work will lie in standardizing these
extensions and evaluating human burden.
Second, in our attack vector enumeration, we did not consider the use of undocu-
mented APIs or undocumented system objects. To mitigate this problem, we may treat
all the undocumented APIs as the malware’s own functions and step into them, but this
will introduce substantial overhead. We plan to quantify this overhead in our future
work.
3.6 Related Work
In this Section, we review related research approaches, which address anti-debugging in
malware.
Covert Debugging. The goal of Apate is to help the defenders identify and over-
come anti-debugging techniques in malware. Zhang et al. [ZLS+15] present a debugging
framework MALT that employs System Management Mode (SMM) of CPU to transpar-
ently study evasive malware. MALT installs the debugging functionalities in the entities
provided by BIOS. The paper lists 20 out of our 79 anti-debugging techniques but it
is unclear how many they can handle. Vasudevan et al. [VY06] propose a framework
“Cobra” that divides the debuggee’s code into blocks based on branching instructions.
To overcome anti-analysis checks, the authors take two approaches. First, Cobra scans
for instructions that betray the real state of the program, and replaces them with cus-
tom functions. Second, Cobra maintains a copy of memory that mimics the system
states without debugger presence, and feeds this state to malware upon queries. From
its design, we conclude that Cobra does not analyze certain exceptions or system APIs
for anti-debugging checks, and thus could not handle suppressible exceptions, enumer-
ation functions, or indirect read attacks. Thus Cobra would miss approximately 50% of
our attack vectors.
Other researchers focus on how to covertly set breakpoints without modifying the
debuggee’s code or the debug registers. The works in [VY05, Vas09, QS07] propose
utilizing memory page protection to implement the breakpoint functionality. While this
approach can solve certain breakpoint attacks, implicit control flow attacks, and self-
modifying code, it does not help against debugger trace detection, interaction attacks,
or calls to exploitable APIs. These unhandled attacks occupy about 60% of our attack
vectors.
Virtual Machine Frameworks. Some works approach the anti-debugging problem
by shifting the debugging functionalities into the virtual machine, under the assump-
tion that in-guest modules cannot detect out-of-guest systems. For example, Hyper-
Dbg [FPMM10] proposes a VM framework based on hardware-assisted virtualization
technology. They launch a VM on the fly when an event of interest occurs, and perform
debugging within the VM. However, this framework is highly tied to specific hardware
and the operating system, and offers limited debugging functionalities compared to pop-
ular debuggers. Ether [DR+08] also develops a virtual machine based on hardware vir-
tualization and incorporates certain debugging functions in the VM. However, Pek et
al. [PBB11] prove that Ether can still be detected.
Anti-debugging Techniques. Some studies discuss how to classify anti-debugging
techniques but they do not provide a systematic framework such as Apate, nor do they
offer handlers for anti-debugging checks. Kirat et al. [KV15] propose MalGene, an
automated technique for extracting analysis evasion signatures. MalGene leverages a
bioinformatic algorithm to locate evasive behavior in system call sequences. While they
are capable of efficiently extracting evasion signatures, there is no systematic enumera-
tion of attack vectors, and we cannot compare MalGene directly to our vectors. In addi-
tion, MalGene does not assign fine-grained semantics of each signature, while Apate
enumerates the meaning of each attack vector. Branco et al. [BBN12], Ferrie [Fer11],
and others [Fal07, Ope07, CAM+08] provide lists of anti-debugging strategies found in
real-world malware. For some of the attacks, they provide ad-hoc solutions, but none of
them proposes a universal framework that is adaptive to new variants. They list 20–60
out of our attack vectors, and propose handlers for 80% of these.
3.7 Conclusion
Debuggers are essential tools in malware analysis. Malware often applies anti-
debugging checks to detect and evade debuggers. In this work, we enumerated a spec-
trum of anti-debugging checks, and categorized them into six categories and sixteen
subcategories. Within each subcategory, we enumerated possible attack vectors, result-
ing in 79 vectors. We then proposed the Apate framework, which deploys single-stepping,
and per-vector detection and handling, to hide popular debuggers from malware. We
integrated Apate with WinDbg, and tested it extensively. Apate outperforms com-
mercial debuggers by a wide margin. While we cannot directly compare it with research
solutions as many are not publicly released, our literature-based analysis of their func-
tionalities finds that Apate handles more attack vectors and is more deployable than
other research solutions.
Chapter 4
Hiding Virtual Machines from
Malware using VM Cloak
In this chapter, we discuss how our cardinal pills can be integrated into Apate to improve
the transparency of virtual machines. Our main contributions are the following:
1. We propose VM Cloak – an Apate plug-in, which hides VM presence. VM Cloak
monitors malware execution of each instruction, and modifies malware states after
the execution if the instruction matches a pill. The modified states match the states
that would be produced if the instruction were executed on a physical machine.
2. We implement VM Cloak and evaluate it through two data sets. We first ran-
domly select and analyze 319 malware samples captured in the wild, to evaluate
how frequently anti-VM techniques are used by contemporary malware. Then we
perform closer evaluation using three known samples that have been demonstrated
to show heavy anti-VM behavior. We show that malware, run under VM Cloak
and within a VM, exhibits the same file and network activities as malware run on a
bare metal machine. This proves that VM Cloak successfully hides the VM from
malware.
3. Our VM Cloak system implements detection and hiding of all three classes of
attacks: semantic, timing, and string attacks, but our intellectual contributions
focus mostly on semantic attacks.
4.1 Related Work
4.1.1 Pill Hiding
Dinaburg et al. [DR+08] aim to build a transparent malware analyzer, Ether, by
implementing analysis functionalities out of the guest using Intel VT-x extensions for
hardware-assisted virtualization. However, nEther [PBB11] finds that Ether still has
significant differences in instruction handling when compared to physical machines,
and thus anti-VM attacks are still possible, i.e., Ether does not achieve complete trans-
parency.
Kang et al. [KYH+09] propose an automated technique to dynamically modify the
execution of a whole-system emulator to fool a malware sample’s anti-emulation checks.
They first collect two execution traces of a malware sample: one reference trace that the
authors believe passes all its anti-VM checks and contains real, malicious behavior,
and the other trace in which the sample fails certain anti-VM checks. For example, a
physical machine or a high-fidelity VM can be used to generate the reference trace, and
a low-fidelity VM produces a trace that shows anti-VM behavior. Then, the authors
use a trace matching algorithm to locate the point where emulated execution diverges.
Finally, they compare the states of the reference system and the VM to create a dynamic
state modification that repairs the dierences. But these VM modifications are specific
to a particular malware sample, while our work handles anti-VM attacks in a universal
way, across different malware samples.
Other works specify a variety of anti-VM techniques but they do not propose a sys-
tematic framework to detect and handle all the attacks. For example, Ferrie [Fer06]
shows some attacks against VMware, VirtualPC, Bochs, QEMU, and other VM prod-
ucts. While the attacks are effective in detecting the VMs, no methodology is illustrated
to protect the VMs from being detected.
4.2 Generating Hiding Rules from Cardinal Pills
We now describe how to generate hiding rules from cardinal pills, which will be used to
hide virtual machines from malware. We define a rule as an instruction, the values of its
parameters, and the system state modification after executing this instruction. While our
cardinal pills are specific to Intel x86 architecture, they are not specific to any OS, VM,
or debugger. Most malware analysis frameworks can use our pills to detect anti-VM
attacks launched by malware. In general, an analysis platform can actively monitor each
instruction that has been executed by a malware sample. If one instruction along with
its parameters matches a cardinal pill, the platform will overwrite the “write” locations
of the instruction (such as register and memory) with the values learned in a physical
machine.
For example, the No. 8 test case in Table 2.1 turns out to be a cardinal pill, as shown
in Table 4.1. We first extract the condition of the pill as: insn = aaa, al = 0ffh,
ah = 30h, and eflags = 246h. Then the hiding rule based on the O1 physical machine
will be: al = 5, ah = 32h, and eflags = 217h. Other cardinal pills for aaa will
be parsed in a similar way, and the combined set will be the final hiding rule for this
instruction. For some other instructions, the resulting exceptions may be different as
well. For example, an instruction may not throw an exception in a VM but may throw it
in an Oracle, or an instruction may throw different exceptions in these two platforms. In
this case, we must throw the correct exception in a VM to match the Oracle. Whenever
Table 4.1: Example Hiding Rule Generation for aaa Instruction
Parameter | Before Execution (AL, AH, eflags) | After Execution (AL, AH, eflags, Exception)
Q1 (TCG) | 0ffh, 30h, 246h | 5, 32h, 257h, None
O1 | 0ffh, 30h, 246h | 5, 32h, 217h, None
a VM executes one of the pill conditions, the resulting system state will be modified
correspondingly based on the rule set.
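For illustration, the C++ sketch below shows one way such a hiding rule could be represented and applied; the data layout is our own, and the values are taken from the aaa example in Table 4.1.

#include <cstdint>
#include <string>

// Only the registers this example rule touches are modeled.
struct MachineState {
    uint8_t  al, ah;
    uint32_t eflags;
};

// A hiding rule: an instruction, the pre-execution state that triggers it,
// and the post-execution state observed on a physical machine (the Oracle).
struct HidingRule {
    std::string  insn;
    MachineState condition;
    MachineState oracle_after;
};

// After the VM executes `insn`, overwrite the VM-produced state with the
// state a physical machine would have produced. Returns true on a match.
bool ApplyRule(const HidingRule& r, const std::string& insn,
               const MachineState& before, MachineState& after) {
    if (insn != r.insn || before.al != r.condition.al ||
        before.ah != r.condition.ah || before.eflags != r.condition.eflags)
        return false;
    after = r.oracle_after;
    return true;
}

int main() {
    HidingRule aaa_rule{"aaa", {0xff, 0x30, 0x246}, {0x05, 0x32, 0x217}};
    MachineState before{0xff, 0x30, 0x246};
    MachineState vm_after{0x05, 0x32, 0x257};     // what the TCG-based VM produced (eflags differ)
    ApplyRule(aaa_rule, "aaa", before, vm_after);  // vm_after now matches the Oracle
}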
4.3 Integrating Hiding Rules with Existing Frameworks
Our hiding rules can be easily integrated with existing frameworks. For example, the
infrastructure proposed in [KYH+09] may use our cardinal pills to improve the coverage
of its application scenarios. In this work, the authors first collect one execution trace
of a malware sample from a high-fidelity system (Ether) for reference, and then collect
another trace from a low-fidelity system (QEMU). They assume that the anti-VM checks
of the sample fail in the reference system but succeed in the low-fidelity environment.
Therefore, there are certain diverging points where the two traces show different behaviors.
Then the authors devise a Dynamic State Modification (DSM) infrastructure for the low-
fidelity system to repair the differences automatically. This is achieved by changing the
malware sample’s observation of its environment to match the observations it makes on
the reference system.
However, the DSM infrastructure is specific to a particular malware sample: the
modification method cannot be applied to other samples using the same anti-VM checks.
This limit can be improved by utilizing our cardinal pills as follows. During the active
execution of a malware sample, DSM can monitor each instruction that the sample has
executed. If the instruction together with its parameters match a cardinal pill, DSM will
overwrite the VM state with the values observed in the physical machines. This way,
the DSM module can be applied to more than one malware sample. Unfortunately, the
DSM infrastructure is not available to the public, so we could not use it with our cardinal
pills.
4.4 Integrating Hiding Rules with Apate – VM Cloak
In this section, we describe how to use our cardinal pills in Apate to hide VM’s presence
from malware. We design VM Cloak as a plug-in to Apate, which operates in single-
stepping mode to avoid anti-disassembly behaviors in malware, such as packing and
code overwriting. VM Cloak monitors each instruction before its execution and takes
hiding actions if it deems that the instruction can be used to differentiate between VMs
and physical machines. Now, we detail how VM Cloak can be used to handle different
anti-VM techniques, as described in Section 2.2: the semantic, the timing and the string
attacks.
Semantic/Cardinal Pill Attacks. For this category of attack, VM Cloak tries to match
each instruction against the hiding rules. Upon a successful match, VM Cloak retrieves
the expected state and reenacts it, overwriting registers and memory where needed.
Timing Attacks. VM Cloak maintains a software time counter and if it detects an
instruction that reads system time, it returns a value using this time counter. We update
our time counter by adding a small delta for each malware instruction that has been
executed, and we make the delta’s value vary with the complexity of the instruction. We
also add a small, random offset to the final value of the time counter before returning
the value to the application. This serves to defeat attempts to detect VM Cloak by
running the same code twice and detecting exactly the same passage of time. VM Cloak
maintains a list of instructions and system APIs that can be exploited by malware to
query the time information, such as rdtsc and GetTickCount(). Whenever malware
uses these methods, VM Cloak will replace the returned values with expected ones. We
currently cannot handle the cases when malware queries external time sources, as this is
an open research problem [ZLS+15].
String Attacks. VM Cloak monitors each instruction for use of APIs that query reg-
istry and files. If the values being read match a list of known VM-revealing strings
(such as “vmware”, “vbox”, and “qemu”), we overwrite these strings with values from
physical machines. We admit that it is hard to guarantee the completeness of this list
as well as the list we maintain for timing attacks, because newly released VM products
may introduce fresh strings and upcoming OS versions may provide new APIs for time
queries. However, it is easy to extend VM Cloak to cover these new detection signals
when they become available.
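The C++ sketch below illustrates the string-attack handling just described; the substring list is a fragment, and the replacement value is an arbitrary stand-in for a string observed on a physical machine rather than VM Cloak's actual output.

#include <algorithm>
#include <array>
#include <cctype>
#include <cstdio>
#include <string>

// If a value returned by a registry or file query contains a VM-revealing
// substring, replace it before it reaches the debuggee.
static const std::array<std::string, 3> kVmStrings = {"vmware", "vbox", "qemu"};

static std::string ToLower(std::string s) {
    std::transform(s.begin(), s.end(), s.begin(),
                   [](unsigned char c) { return static_cast<char>(std::tolower(c)); });
    return s;
}

// Returns the string the debuggee should observe.
std::string SanitizeQueryResult(const std::string& value) {
    const std::string lowered = ToLower(value);
    for (const auto& sig : kVmStrings)
        if (lowered.find(sig) != std::string::npos)
            return "Generic Hardware";   // value as it might appear on a physical machine
    return value;                        // nothing VM-revealing, pass through unchanged
}

int main() {
    std::printf("%s\n", SanitizeQueryResult("VMware Virtual disk SCSI Disk Device").c_str());
}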
4.5 Evaluation
In this section, we evaluate VM Cloak using two data sets: 1) unknown samples captured
in the wild and 2) known malware samples that employ heavy anti-VM techniques.
4.5.1 Anti-VM is Popular
We randomly select 527 malware binaries from Open Malware [Geo17] that are captured
in 2016. These samples are then sent to a malware analysis website VirusTotal [Tot17],
which uses approximately 50 anti-virus products to analyze each binary. We retain
those binaries that are labeled as malicious by more than 50% anti-virus products, and
this leaves us with 319 samples.
4.5.1.1 Methodology
Each binary is automatically analyzed under VM Cloak for a maximum of 20 minutes.
Some malware analysis approaches [PDZ+14] run samples for up to several hours; how-
ever, their goal is to explore all execution paths of malware, while our goal is to detect
their anti-VM checks, which usually occur at the beginning of malware execution. For
each instruction that malware has executed, we check whether the instruction name and
parameters match one of our hiding rules. If a match is detected, we log this instruction
as an anti-VM attack found in the sample, and modify the state as specified by the hiding
rule.
4.5.1.2 Results
In our data set, 252 out of 319 (79%) samples show at least one anti-VM attack. The
spectrum of anti-VM techniques is shown in Table 4.2. For the semantic attack
category, the most popular instruction is the in instruction. This instruction copies the
value from the I/O port specified by the second operand (source operand) to the first
operand (destination operand), and is usually used by VMs to set up the communication
channel between the host and the guest system. Therefore, it behaves differently in a
virtual and a physical machine and can be exploited by malware.
We observe that certain kernel registers are also popular in semantic attacks, such
as local and global descriptor table registers, task register, and interrupt descriptor table
register, which can be retrieved respectively using sldt, sgdt, str, and sidt instruc-
tions. This is because these registers are already set to fixed values by the host system,
and VMs have to save these values and replace them with values needed by the VM.
This behavior can be used by malware to detect VMs. The cpuid instruction returns a
value that describes the processor features. After executing this instruction with eax =
1, the 31st bit of ecx on a physical machine will be equal to 0, while this bit is 1 on a
VM. The smsw instruction stores the machine status word (bits 0 through 15 of control
register cr0) into the destination operand. Sometimes, the pe bit of cr0 is not set in
a VM. The movdqa instruction moves a double quad-word from the source operand to
the destination operand. This instruction can operate on an XMM register and a 128-bit
memory location, or between two XMM registers. The destination may be populated
with a random value by a VM if the source operand is not available in certain cases,
while the destination operand is untouched in a physical machine.
For string attacks, we find only one API (RegEnumKeyExA()) that is adopted by
malware. This function enumerates the subkeys of the specified open registry key. In
this case, malware attempts to find if “vmwa” exists, which is the prefix of a VM product
– VMWare.
For timing attacks, we discover four APIs or instructions that malware uses to
query date and time information. The prevalent API is GetTickCount(), which
retrieves the number of milliseconds that have elapsed since the system was started,
up to 49.7 days. Malware may call this API several times to check the elapsed time
of executing certain code block. If the elapsed time exceeds a reasonable threshold,
malware will detect a VM. This attack strategy can also be used to detect debug-
gers. Similarly, QueryPerformanceCounter() fetches the current value of the per-
formance counter, which is a high resolution (1us) time stamp that can be used for
time-interval measurements. The rdtsc instruction reads the timestamp counter regis-
ter, and GetLocalTime() obtains the current local date and time. All these instructions
can be exploited by malware.
4.5.1.3 Performance Overhead
In order to evaluate the performance overhead introduced by VM Cloak, we pro-
grammed WinDbg (without VM Cloak) to single step the malware binary without any
Table 4.2: Spectrum of Anti-VM Techniques
Category | Instruction | Samples
Semantic | in | 87/35%
Semantic | sldt | 65/26%
Semantic | str | 49/19%
Semantic | cpuid | 35/14%
Semantic | smsw | 16/6%
Semantic | sgdt | 10/4%
Semantic | sidt | 8/3%
Semantic | movdqa | 6/2%
Category | API/Instruction | Samples
String | RegEnumKeyExA("vmwa") | 3/1%
Timing | GetTickCount() | 51/20%
Timing | QueryPerformanceCounter() | 27/11%
Timing | rdtsc | 8/3%
Timing | GetLocalTime() | 2/1%
additional operations. We find that VM Cloak performs 10.4–15.8 times slower than
vanilla WinDbg. This is expected, because VM Cloak considers anti-VM checks for
each instruction and fixes system state if necessary, which improves the accuracy of
malware analysis.
4.5.2 VM Cloak Deceives Known Malware
We wanted to test VM Cloak with samples that have been known to have anti-VM
behaviors, i.e., with ground truth. However, the number of such samples that are publicly
available is limited and we were only able to find three. We now test if VM Cloak can
hide VM presence from malware for these three samples.
4.5.2.1 Methodology
We test the three selected malware samples within VM Cloak and under a VM, and we com-
pare their behavior with the behavior they exhibit on a physical machine, without any
VM or debugger. The scope of our evaluation is necessarily limited because: (1) there
is no ground truth about which anti-VM checks are possible; (2) there is no well-
understood measure of malware functionalities and thus we have to define our own; (3)
we analyze a small number of samples, since only these samples had detailed analysis
published by other researchers.
We define malware behavior as a union of file and network activities. While malware
may exhibit other behaviors, such as running calculations, invoking other applications,
etc., we regard file and network activities as crucial for malware to export any knowledge
to an external destination or to receive external input (e.g., from a bot master). Thus, if
a sample exhibits the same pattern of file and network activities within a VM (hidden by
VM Cloak) and in a native run, we will conclude that we were able to successfully hide
VM presence from malware. For file activities, we record file creations, deletions and
modifications. We save the hard drive into raw disk images before and after running mal-
ware, and extract the file information using the SleuthKit framework [Tec17, KVK14].
By comparing the file meta data, we obtain the list of created, deleted, and modified files
for each malware sample. For network activities, we record the destination IPs and ports
of traffic on our network interface, but we do not actually route the packets, to preserve
the Internet from any harm.
Sample 1 (md5: 6bdc203bdfbb3fd263dadf1653d52039). This sample is pro-
vided and analyzed by [SH12], and Figure 4.1 shows the anti-VM techniques that are
employed. The sample employs three semantic attacks, including the sidt, str, and sldt
instructions. The sidt instruction stores the contents of idtr register into a memory
location. The IDTR is 6 bytes, and the fifth byte offset contains the start of the base mem-
ory address. If this byte is equal to 0xef, the signature of “VMware” is detected. This
sample will terminate itself and remove itself from disk if this anti-VM check succeeds. To
handle this attack, we overwrite the byte with the value of 0xff that is learned from an
Oracle.
Figure 4.1: Anti-VM Techniques of Sample 1 (flowchart of the sidt, str, and sldt checks)
Similarly, str and sldt store the task register and the local descriptor table register,
whose contents are different in a virtual machine. We update their values with those
found in Oracles, so malware cannot detect the VMs.
Sample 2 (md5: 7a2e485d1bea00ee5907e4cc02cb2552). This sample has been
analyzed by [SH12], and it uses one semantic attack and three string attacks, as shown
in Figure 4.2. The in instruction is a privileged instruction; it will raise an excep-
tion if administrator rights are not granted. However, VMware uses virtual I/O ports for
communication between the virtual machine and the host to support functionalities like
copying and pasting between the two systems. The port can be queried and compared
with a magic number to identify the use of VMware by using the in instruction. To handle
this attack, we intentionally raise an exception after the execution of this instruction and
overwrite the returned bytes with random ones.
Next, this malware uses three system APIs to query possible strings containing cer-
tain VM brands, such as "VMware". The RegEnumKeyExA() API is used to enumerate all
registry entries under "SYSTEM\CurrentControlSet\Control\Device". The sample
compares the first six characters (after changing them to lowercase) of each subkey
Figure 4.2: Anti-VM Techniques of Sample 2 (flowchart of the in instruction check and the string checks via RegEnumKeyExA(), GetAdaptersInfo(), and Process32Next())
name to the string “vmware”. The GetAdaptersInfo() retrieves the MAC addresses
of Ethernet or wireless interfaces. The addresses are then compared against known ones
for VMware, such as "005056h" and "000C29h". Finally, Process32Next() queries
the names of all processes. For each process name, the malware hashes it into a number
and then checks if this number is equal to the hash value of "vmware". This, however,
does not deter VM Cloak. We handle these string attacks by monitoring the calls to
these APIs and replacing the returned strings with random characters.
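The MAC-address check, for example, reduces to a prefix comparison; the sketch below shows the idea (the adapter list would come from GetAdaptersInfo(), and here it is a plain Python list for illustration). Because VM Cloak rewrites the strings returned by the API, the comparison never matches.

VMWARE_OUI_PREFIXES = ("005056", "000C29")     # prefixes named above

def looks_like_vmware(mac_addresses):
    # True if any MAC begins with a VMware-assigned OUI
    cleaned = (m.replace(":", "").replace("-", "").upper() for m in mac_addresses)
    return any(mac.startswith(p) for mac in cleaned for p in VMWARE_OUI_PREFIXES)

print(looks_like_vmware(["00:50:56:ab:cd:ef"]))   # True  -> the sample terminates
print(looks_like_vmware(["3a:91:7f:ab:cd:ef"]))   # False -> the sample proceeds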
Sample 3 (md5: 2c1a7509b389858310ffbc72ee64d501). This sample is analyzed
by [0xE13], as shown in Figure 4.3. It performs both string and timing attacks. First, it
checks the names of processes by comparing their CRC32-hashes to predefined ones
of “vmwareuser.exe”, “vboxservice.exe”, “vboxtray.exe”, and many others. If this
check passes, the sample will use GetModuleHandleA() to query the existence of
"sbiedll.dll", an artifact created by the Sandboxie sandbox. The third string attack is
reading the registry keys in "SYSTEM\CurrentControlSet\Services\Disk\Enum".
If any subkey string starts with “vmwa”, “vbox”, or “qemu”, the sample will terminate
and exit. Finally, the sample uses two consecutive rdtsc instructions to measure the
execution time for one push eax instruction. If the time exceeds 0x200, the sample
will exit immediately.
4.5.2.2 Results
We evaluated our VM Cloak by running these three selected samples under VMware and
QEMU, with VM Cloak, and comparing the malware behavior in this environment with
its behavior on an Oracle. We found that all three samples exhibited the same activities
when VMware and QEMU were hidden by VM Cloak as when the samples were run in
Oracles. However, all three samples showed early termination when executed in VMware
and QEMU without VM Cloak. We conclude that VM Cloak successfully hides
VMware and QEMU from malware.
Performance Overhead. In our evaluation, it takes us 34 hours to complete the
full analysis of a sample. Most of the time (86%) is spent in saving hard drives and
extracting file and network activities.
Figure 4.3: Anti-VM Techniques of Sample 3
4.6 Conclusion
In this chapter, we propose a pill hiding framework called VM Cloak – a debugger plug-
in that examines each instruction for possible anti-VM checks, and overwrites register
and memory states with expected values to hide VM presence. We implement VM Cloak
as a WinDbg plug-in and show through small-scale evaluation that it successfully hides
VMware and QEMU from malware. We believe that cardinal pill testing and VM Cloak
have many advantages – they are agnostic to VM and physical machine choices, enable
comprehensive detection and hiding of VM checks, and are easily adoptable.
Chapter 5
Supporting Live and Safe Malware
Analysis using Pandora
Current malware analysis platforms have two main shortcomings. First, the instrumentation
approaches inevitably introduce artifacts that are detectable by malware, such as
virtual machines [BKK06, SBY+08], debuggers [SM17], hardware-assisted virtualization
[DR+08], FPGAs [SHL16], and others. If malware successfully detects the artifacts
left by the instrumentation methods, it may behave like a normal program, exit prematurely,
or delete itself. This prevents malware's real, malicious purposes from being
exposed by defenders, and thus prolongs malware's life cycle. Second, while there is
extensive research work focusing on the interaction between malware and operating
systems [PDZ+14, XZGL14, VY06, MKK07], the analysis of malware's network behavior
is limited [RD+11]. Current network containment policies either cannot fully
understand malware's network activities, or may potentially cause substantial damage
to Internet hosts. For example, most research efforts apply a completely isolated policy,
which cannot reveal malware's networking behavior if malware needs input from Internet
servers to trigger its malicious activities. On the contrary, some other work analyzes
malicious samples without any network restrictions, which may allow the binaries to
participate in DoS attacks.
In this chapter, we propose a malware analysis framework, called Pandora, to
address the aforementioned limitations. First, Pandora presents a high-fidelity environment
to malware, because we do not employ any virtual machines, debuggers, or
other noticeable instrumentation approaches. We analyze malware on physical machines
by taking advantage of the infrastructure provided by DeterLab [BBB+04]. Second,
we devise customized network containment policies for different network traffic
generated by malware. For example, we allow malware's HTTP/HTTPS packets to
reach their hosts on the Internet, because these are well-known, application-level protocols
and we are able to control their potential damage. We drop malware's packets targeting
non-standard ports, such as 65520, since their malicious purposes are not clear. In
addition, we also limit the number of packets each sample can send and the number of
distinct IP addresses each binary can contact during a time period. This restriction
prevents malware from launching DoS attacks or spreading itself. Therefore, our mixed
containment policies can help understand malware's network activities while minimizing
the potential damage of malicious network traffic.
5.1 Introduction
In recent years, malware has become increasingly profit-oriented [HE+09, SGH+11,
CGKP11]. For example, Advanced Persistent Threat (APT) malware [LCL13] targets
a particular category of people and attempts to steal commercial or national security
information from its targets. Keyloggers intercept keystrokes while users are typing
credentials, which can later be used to access bank accounts. Botnets can infect a
variety of Internet hosts that may be used to mine bitcoins or launch DDoS attacks on
demand [ES14]. Advertising-supported malware (adware) lures victims into installing
malicious software by offering monetary incentives. Later, the adware fetches more
advertisements from the master server.
Much of malware functionality today relies on a functional network. For example,
APT and keyloggers have to transfer the collected information back to the master server.
The DDoS attacks are only possible when botnet clients can receive commands
from their bot masters and generate network traffic to the victims. Unsurprisingly then,
contemporary malware is becoming environment-sensitive. It tests its environment and
proceeds only if this test is passed. The tests may include checking for the presence of
virtual machines (VM-sensitive malware), debuggers (debugger-sensitive malware), or
unlimited network connectivity (network-sensitive malware).
While there are sophisticated approaches to hide VMs and debuggers from malware
(e.g., [SAM14, SM17]), current approaches to handling malware communication
requests have been very crude. On one hand, one could allow malware to operate
with unlimited connectivity, while enforcing some rate limits and diverting outgoing
emails sent by malware to a sinkhole. This is a common approach among top security
researchers [KW+11, JMG+09], but it does not guarantee safety from malicious actions.
On the other hand, due to ethical and legal considerations, malware analysis is often limited
to a fully contained environment [SHL16, PDZ+14]. However, this greatly hinders
analysis, as malware may stall if its outgoing requests are filtered. Therefore, malware
may not completely exhibit its malicious activities when analyzed in an isolated
environment.
In this work, we propose a framework, called Pandora, to investigate how malware
communicates with the Internet in a high-fidelity and secure manner. Our main contri-
butions are the following:
1. Design a malware analysis framework (Pandora) that can analyze malware’s net-
work behavior automatically in batch mode. Pandora includes a gateway that
enforces several packet routing policies, including packet forwarding, dropping,
and rate limiting. The gateway can also fake replies to malware when viable and
necessary.
2. Implement Pandora on the DeterLab [BBB+04] testbed without using any virtual
machines, debuggers, or other instrumentation tools. Our implementation
leaves minimal artifacts that may be detected by malware, compared to other
malware analysis frameworks. Pandora is also able to automatically restore hard
drives without human intervention. Furthermore, Pandora scales to analyze
multiple malware samples in parallel, which greatly improves the analysis
efficiency.
3. Perform extensive analysis of the network traces generated by malware. In order
to understand the dynamics of malware's network packets, we devise a list of 83
machine learning features to describe the nature of network traces from different
aspects. Then, we apply both supervised classification and unsupervised clustering
algorithms to label malware samples based on this list of features. Finally, we
propose a network behavior model for malware, called NetDigest, to describe the
high-level semantics of malware's network activities, based on the conversations
between malware and Internet hosts. This model can help us understand who the
malware is contacting and for what purposes.
4. Evaluate Pandora using 1,049 malicious samples, of which 541 (52%) exhibit
network behavior. Our evaluation shows that the list of features can help clas-
sification algorithms achieve about 80% accuracy in labeling malware samples.
We also build the NetDigest for all the samples and illustrate one concrete exam-
ple. Our findings can help inform researchers on how to better handle malware
communication requests.
5.2 Related Work
In this section, we compare our work to important research efforts that are closely related
to Pandora.
5.2.1 Network Containment Policies in Malware Analysis
Due to legal and ethical considerations, most malware analysis frameworks adopt a
completely contained policy for malware's network communication, such as [PDZ+14,
VY06, KVK11, KVK14, XZGL14]. When malware relies on a functional network to
download malicious payload, these research efforts are not able to fully explore malware's
purposes. In our work, we allow malware to contact Internet hosts when we
believe it is necessary for malware to function properly.
Some research work allows malware to communicate with Internet hosts without
any limitations, which fails to consider the harm that malware may cause to victims.
For example, Sandnet [RD+11] executes a sample for up to one hour, while a study in
2015 [Net15] finds that the average duration of a reflection attack is 20 minutes. The
work in [GLB12] limits malware's execution time to 5 minutes, but the authors run each
sample as many as 25 times. We argue that this type of network containment policy
can cause unexpected damage to Internet hosts; therefore, we adopt a more restrictive
policy that forbids certain types of network traffic.
A few other research efforts apply a mixed network containment policy, but their
policies still cannot provide the required input for malware. For example, the authors
of [BO+07] adopt a virtual machine to analyze malware samples. The virtual machine
is partially firewalled so that the external impact of any immediate attack behaviors
(e.g., scanning, DDoS, and spam) is minimized during the 5-minute execution time.
While this policy can prevent malware from infecting other machines, it is unclear
whether malware samples are allowed to download payload from the Internet. The GQ
work [KW+11] designs a framework that can conveniently apply different network containment
policies, including packet drop, forward, rate limit, redirect, and others. However,
they do not discuss how to design reasonable network containment policies for
malware analysis. In our work, we clearly define the containment policies for different
network traffic (Section 5.3.3).
5.2.2 Analysis on Malware’s Network Behavior
Most malware analysis work focuses on the system traces that are introduced by
malware execution, such as system calls [RHW+08, RTWH11, PDZ+14]. Nevertheless,
the analysis of the semantics of network behavior is limited.
The framework Sandnet [RD+11] provides a detailed, statistical analysis of malware's
network traffic. The authors give an overview of the popularity of each protocol
that malware employs, including DNS, HTTP, IRC, and SMTP. In addition, they measure
the details of each protocol, such as DNS message error rate and HTTP requests and
responses. However, they do not attempt to understand the high-level semantics of malware's
network conversations. We seek to understand the purpose and dynamics of
malware's network packets, especially the HTTP/HTTPS traffic.
Morales et al. [MABXS10] define seven network activities based on heuristics. For
example, they describe one of the behaviors as "A process performs a NetBIOS name
request on a domain name that is not part of a DNS or rDNS query". While this text
narrative is easy to understand, it is limited in expressing why malware sends out this
type of network traffic. It is also difficult to extend these descriptive rules to new
network behavior in the future. In [BCH+09a], the authors propose a scalable clustering
approach to identify and group malware samples that exhibit similar behavior. The malware
activities they measure include both system calls and network packets. However,
their model of network behavior merely includes the names of downloaded files, IRC
channels, and email subjects. We consider a much broader range of network features,
such as domains queried by malware, geographical location of the contacted IPs, ICMP
echo requests, and many others. Bailey et al. [BO+07] propose a classification
technique that describes malware behavior in terms of system state changes (e.g., files
written, processes created). In terms of network activity, this work merely checks the
ports that malware scans or visits. On the contrary, we seek to understand the details of
malware's network activities and to interact with them when possible.
5.2.3 Machine Learning Techniques in Malware Analysis
Machine learning techniques are sometimes applied in malware analysis, especially in
the realm of tagging, or labeling, malware samples. For example, supervised classification
algorithms, such as decision trees [DF00], support vector machines [BV93], and multi-layer
perceptrons [RHW88], first learn from a training set that has already been
tagged. Then, they apply the learned knowledge to predict the labels for an additional testing
set. On the contrary, unsupervised clustering algorithms, including DBScan [EKS+96],
Birch [ZRL96], k-Means [AV07], and hierarchical clustering [BB00], do not need any a
priori knowledge about a set of malware samples. These algorithms directly cluster
malware samples into different groups based on the mathematical distance between
individual samples.
Current applications of machine learning algorithms in malware analysis usually
rely on system information, such as system calls [BO+07], malware execution
traces [RTWH11], and system objects [BCH+09b]. However, work that considers
network behavior as the main input to machine learning methods is limited. For
example, Morales et al. [MABXS10] propose 7 malicious network behaviors for malware
clustering. The KDD Cup 99 data [UC99] provides a listing of 33 features defined
for network connection records. The features consist of three categories: basic features
of individual TCP connections, content features within a connection suggested
by domain knowledge, and traffic features computed using a two-second time window.
However, this listing is tailored for intrusion detection and is not suitable for malware
classification or clustering.
In this work, we present a detailed list of 83 features for malware's network communication.
These features are extracted from different perspectives of network traffic,
such as the number of distinct IP addresses, the diversity of network packets, etc. To the
best of our knowledge, this is the most detailed list to date.
5.2.4 Fidelity of Malware Analysis
In the current practice of malware analysis, analysts usually utilize virtual machines and
debuggers to execute malware. However, environment-sensitive malware can detect
these environments and abort operations. Many techniques have been developed to hide
virtual machines or debuggers from malware (e.g., [DR+08, PBB11, SHL16, SAM14,
SM17]). Nevertheless, all of these research efforts have certain limitations. For example,
malware can query network time to measure the time needed to execute a block of code.
If the time elapsed exceeds a predefined threshold, malware will believe it is being analyzed
inside a sandbox and refuse to show malicious behavior anymore. Furthermore, if
malware encrypts its network packets, all of the above approaches have difficulties
handling the encryption. In Pandora, we run malware on bare-metal machines and thus
do not need to hide VMs or debuggers. We also allow malware to access the Internet under
Pandora's strict surveillance, so there is no need to dig out and modify the time information
hidden deep in the network packets. However, there are some artifacts introduced
by our framework; we discuss how to minimize them in Section 5.4.2.
5.2.5 Internet Host Mimicking
Some researchers attempt to learn from the network packets generated by malware, and
then play the role of Internet servers. For example, Graziano et al. [GLB12] propose a
malware analysis system to examine the communication patterns of malware. They first
run a sample multiple times to collect enough network traces, without any networking
restrictions. Then, they build a finite state machine (FSM) based on the sessions in the
traces. This FSM can help answer malware’s messages by determining the current state
of malware. Finally, the sample is executed in a completely contained environment,
and the authors reply to malware’s messages using the FSM. However, there are several
disadvantages of this learning and replaying procedure. First, a sample is executed
several times with no restrictions, which can potentially cause huge damages to Internet
hosts. The servers may also be alarmed by this abnormal behavior of malware and thus
cease to respond to this sample. Second, the FSM is sample-specific, which means it is
dicult to extend to other malicious binaries and thus is not scalable.
5.3 Pandora
In this section, we describe our malware analysis framework called Pandora, which aims
to support analysis of network-sensitive malware. Our high-level goal is to analyze the
network behavior of malware in a faithful, automatic way, while minimizing the negative
effects on Internet hosts. There are three main challenges we need to
address in order to achieve this goal:
1. Some of malware's network traffic may cause damage to hosts on the Internet, so
we need to devise a containment policy that selectively allows outgoing network
packets that do not cause permanent damage and are necessary for malware's
functionality.
Figure 5.1: Overview of Pandora (Malware Execution Environment, Gateway with forward, redirect, fake-reply, drop, and rate-limit policies, service impersonators for HTTP, HTTPS, FTP, and SMTP, Storage, and Coordinator)
2. Current malware widely employs anti-analysis techniques to thwart researchers'
defenses [SAM14, CAM+08, LKMC11]. Malware can actively detect
the artifacts introduced by the analysis environment, such as virtual machines or
debuggers. Our analysis framework should be transparent to malware.
3. It is very likely that malware will contaminate the host operating system. We want
to restore a clean state quickly and efficiently before analyzing a new malware
sample.
In the following sections, we will explain how we address these challenges.
5.3.1 Overview
Figure 5.1 shows the overall design of Pandora, which consists of five main entities.
The Malware Execution Environment (MEE) is the location where malware samples
are executed. This environment provides all the required system resources for malware
execution, such as access to the Internet. All network packets from this execution environment
are routed to the Gateway. The Gateway examines the nature of each packet
and decides whether it should be forwarded. A flow may be forwarded to the Internet
or to a service impersonator within our control, dropped, or rate limited. All the
network packets observed by the Gateway are saved in the Storage entity. The Storage
entity also saves malware samples and system information that is used to recover
MEE after it is infected by malware. Finally, the Coordinator is responsible for issuing
commands to the above entities, controlling the timing and nature of their activities.
5.3.2 Logic Execution of Pandora
We now illustrate how Pandora is used to analyze one malware sample, as shown in
Figure 5.2. First, the Coordinator starts the Storage and Gateway services, which takes
place only once for analyzing all samples. Next, MEE is loaded with a fresh OS
image from the Storage service, so each malware sample is executed in a vanilla environment.
During this step, the Coordinator keeps checking the status of MEE in a
loop, until MEE has been restored and is running. In the following "Prepare
system" stage, the Coordinator updates the routing table of MEE, such that all outgoing
network traffic is sent to the Gateway. This ensures that the Gateway observes
all the network packets generated by malware, and no packet is sent to the Internet
without the Gateway's approval. During the normal execution of an operating system,
some packets are usually sent out by background processes, such as time synchronization
and checking for and downloading updates. The Coordinator disables the related system
services to remove this background noise. After the system is prepared, one selected
malware sample is loaded on MEE from the Storage service ("Copy sample") and
executed.
During “Malware execution”, the sample may need to communicate with Internet
hosts. For example, malware may query a domain name first and then send a SYN packet
to its corresponding IP at port 80. For these types of traffic, the Gateway forwards the
Figure 5.2: Logic Execution of Pandora
requests from malware to the Internet, provided that these requests do not exceed specific
rate limits. The Gateway also faithfully forwards the responses from the Internet to malware.
The details of the traffic handling policy are discussed in Section 5.3.3.
Finally, the Coordinator terminates "Malware execution" when the maximum analysis
time for a sample has been reached. After this termination, the system state of MEE
is restored ("Restore system state") and the next analysis cycle begins.
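A minimal sketch of the Coordinator's per-sample cycle is shown below, assuming key-based ssh access to the testbed nodes; the node names, the exact DeterLab reload command, and the Windows-side commands are illustrative placeholders rather than our verbatim scripts.

import subprocess, time

def ssh(host, cmd):
    # run one remote command and wait for it to finish
    return subprocess.run(["ssh", host, cmd], check=False)

def analyze_sample(sample, mee="mee", gateway="gateway", ops="users"):
    ssh(ops, "os_load MEE")                                  # Restore system state
    while ssh(mee, "exit").returncode != 0:                  # Check if ready
        time.sleep(30)
    ssh(mee, "route add 0.0.0.0 mask 0.0.0.0 10.1.1.2")      # Prepare system (route via Gateway)
    subprocess.run(["scp", sample, f"{mee}:C:/sample.exe"])  # Copy sample
    ssh(mee, "start C:/sample.exe")                          # Malware execution
    time.sleep(5 * 60)                                       # run for 5 minutes
    ssh(mee, "taskkill /IM sample.exe /F")                   # Terminate
    subprocess.run(["scp", f"{gateway}:/tmp/trace.pcap", f"{sample}.pcap"])  # Save pcap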
5.3.3 Containment Policies for Network Traffic
We now briefly discuss the risks of letting any malware packet or flow out into the
Internet. Based on its ultimate purpose, we classify malware flows into the following
categories: benign (e.g., well-formed requests to public servers at a low rate), e-mail
(spam or phishing), scan, denial of service, exploit, and C&C (command and control).
Figure 5.3: Flow Handling
Potential harm to Internet hosts depends on the flow's category. Spam, scans, and denial
of service are harmful only in large quantities – letting a few such packets out will
usually not cause severe damage to their targets, but it may generate complaints from
their administrators. On the other hand, binary and text-based exploits are destructive,
even in a single flow. The C&C and benign communications are not harmful and must
be let out to produce the malware behavior we are interested in.
The challenge of handling outside communication attempts lies in the fact that
a flow's purpose is usually not known a priori. For example, a SYN packet to port 80
could be the start of a benign flow (e.g., a Web page download to check connectivity), a
C&C flow (to report infection and receive commands for future activities), an exploit
against a vulnerable Web server, a scan, or a part of a denial-of-service attack. We thus
have to decide how to handle a flow based on incomplete information, and
revise this decision when more information is available. Our initial decision depends on
how essential we believe the flow is to the malware's continued operation, how easy it
is to fake the flow's replies from within our analysis environment, and how risky it may
be to let the flow out into the Internet. Essential flows whose replies we can fake are
redirected to our impersonators. Essential flows whose replies we cannot fake and which
are not risky are let out into the Internet, and closely observed lest they exhibit risky
behavior in the future. Non-essential flows and essential but risky flows are dropped.
Figure 5.3 illustrates our flow handling.
Table 5.1: Containment Policies of Pandora
Goal                     | Action   | Targeted Services
Elicit malware behavior  | Forward  | DNS, HTTP, HTTPS
Elicit malware behavior  | Redirect | FTP, SMTP, ICMP echo
Restrict forwarded flows | Drop     | Other services
Restrict forwarded flows | Limit    | Number of non-responsive flows
Traffic that we let out can be misused for scanning or DDoS. We actively monitor for
these activities and enforce limits on the number of non-responsive flows that a sample
can initiate. We define a non-responsive flow as a flow that receives no replies from
the Internet. For example, a TCP SYN to port 80 that does not receive a TCP SYN-ACK
would be part of a non-responsive flow. Similarly, a DNS query that receives no reply
is a non-responsive flow. Non-responsive flows will be present if a sample participates
in DDoS attacks or if it scans Internet hosts. If the sample exceeds its allowance of
non-responsive flows, we abort this sample's analysis.
Table 5.1 summarizes Pandora's containment policies: our initial decisions and revision
rules. We consider DNS, HTTP, and HTTPS flows as essential and non-risky flows whose
replies we cannot fake. We make this determination because many benign and C & C
flows use these services to obtain additional malware executables, report data to the bot
master and receive commands. Among our samples, DNS is used by 62%, HTTP by
35%, and HTTPS by 10% of samples (Section 5.5).
We consider FTP, SMTP and ICMP flows as essential flows whose replies we can
fake. We forward these to our corresponding service impersonators (Figure 5.1). These
are machines in our analysis environment that run the given service, and are configured
to provide generic replies to service requests. We redirect ICMP Echo requests to our
service impersonators and fake positive replies. We drop other ICMP traffic. The details
of how we mimic FTP and SMTP services are provided in Section 5.3.4.
All other traffic is considered non-essential or risky and is dropped by the Gateway.
We intend to develop support for more protocols and further policies in
future work.
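A minimal sketch of the Gateway's per-flow decision, combining Table 5.1 with the non-responsive-flow allowance, is shown below; the port numbers map to the services in the table, while the allowance value and the flow representation are illustrative assumptions.

FORWARD_PORTS  = {53, 80, 443}     # DNS, HTTP, HTTPS  -> Internet
REDIRECT_PORTS = {21, 25}          # FTP, SMTP         -> impersonators
MAX_NONRESPONSIVE = 50             # per-sample allowance (illustrative)

def decide(flow, state):
    # returns 'forward', 'redirect', 'drop', or 'abort' for a new flow
    if state["nonresponsive"] > MAX_NONRESPONSIVE:
        return "abort"                                   # likely scan or DDoS
    if flow["proto"] == "icmp" and flow["type"] == "echo-request":
        return "redirect"                                # fake a positive reply
    if flow["dport"] in FORWARD_PORTS:
        return "forward"
    if flow["dport"] in REDIRECT_PORTS:
        return "redirect"
    return "drop"                                        # non-essential or risky

def on_flow_timeout(flow, state):
    # count flows that never saw a reply (e.g., a SYN with no SYN-ACK)
    if flow["replies"] == 0:
        state["nonresponsive"] += 1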
5.3.4 Service Impersonators
Our FTP service impersonator is a customized, permissive FTP service that positively
authenticates any user name and password supplied. This setting can handle all
potential connection requests from malware. If malware tries to download a file, we
create one with the same extension, such as .exe, .doc, .jpg, and others. We
save uploaded files for further analysis. For the SMTP service, we set up an email server
that replies with a "250 OK" message to malware's requests. Our ICMP impersonator
sends positive replies to any ICMP echo request.
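A minimal sketch of the SMTP impersonator is shown below: it greets the client and answers every command positively, which is enough for most samples to proceed; the deployed impersonator also records the message content for later analysis.

import socket

def run_smtp_impersonator(host="0.0.0.0", port=25):
    srv = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    srv.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
    srv.bind((host, port))
    srv.listen(5)
    while True:
        conn, _ = srv.accept()
        with conn:
            conn.sendall(b"220 mail.example.com ESMTP ready\r\n")
            while data := conn.recv(4096):
                cmd = data[:4].upper()
                if cmd == b"DATA":
                    conn.sendall(b"354 End data with <CR><LF>.<CR><LF>\r\n")
                elif cmd == b"QUIT":
                    conn.sendall(b"221 Bye\r\n")
                    break
                else:
                    conn.sendall(b"250 OK\r\n")   # accept HELO, MAIL, RCPT, and body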
5.4 Implementing Pandora on DeterLab Testbed
In this section, we discuss the details of implementing Pandora on a public testbed,
DeterLab [BBB+04]. We chose this deployment platform for three reasons. First, the DeterLab
testbed provides convenient ways of experimenting directly on physical machines,
instead of virtual machines. This helps us analyze environment-sensitive malware
in a faithful setting. Second, it has automated ways to restore clean state on
machines between experimental runs, which include OS reload and modification of
network settings. Finally, we need many physical machines to analyze malware samples
in parallel, and DeterLab provides them.
5.4.1 DeterLab Testbed
The DeterLab testbed [BBB+04] is based on the Emulab technology [WLS+02] that
enables remote experimentation and automated setup. An experimenter gains exclusive
access and sudoer privileges to a set of physical machines and may connect them into
custom topologies. The machines run an operating system and applications of the user's
choice. Traffic between machines in the experiment is fully contained; it does not
affect other experiments on the testbed, nor can it get out into the Internet. In our experiments,
we leverage a special functionality of the DeterLab testbed, called "risky experiment
management", which allows exchange of some user-specified traffic between an
experiment and the Internet. We specify that all DNS, HTTP, and HTTPS traffic should
be let out.
We map the Coordinator, MEE, and Gateway to individual physical machines on
DeterLab, and we co-locate the Storage with the Gateway on the same physical machine.
We implement all of the service impersonators on a single physical machine. Each
machine has a 3GHz Intel processor, 2GB of RAM, one 36GB disk, and five Gigabit network
interface cards.
5.4.2 Minimizing Artifacts
To hide the fact that our machines reside within DeterLab from environment-sensitive
malware, we modify the system strings shown in Table 5.2. For example, we replace the
default value ("Netbed User") of "Registered User" with a random name, e.g., "Jack
Linch".
Table 5.2: Minimizing Artifacts of DeterLab
Key Name        | Default Value in DeterLab | Our Modification
Registered User | "Netbed User"             | Random name, e.g., "Jack Linch"
Computer Name   | "pc.isi.deterlab.net"     | Random name, e.g., "Jack's PC"
Workgroup       | "EMULAB"                  | "WORKGROUP"
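Inside MEE, such strings can be rewritten with the standard winreg module, as in the sketch below; the registry locations used here (RegisteredOwner under the Windows NT CurrentVersion key, and the ComputerName key) are the usual Windows locations and are our assumption rather than a quotation of the exact modification script.

import winreg

def set_value(root, path, name, value):
    with winreg.OpenKey(root, path, 0, winreg.KEY_SET_VALUE) as key:
        winreg.SetValueEx(key, name, 0, winreg.REG_SZ, value)

set_value(winreg.HKEY_LOCAL_MACHINE,
          r"SOFTWARE\Microsoft\Windows NT\CurrentVersion",
          "RegisteredOwner", "Jack Linch")                  # was "Netbed User"
set_value(winreg.HKEY_LOCAL_MACHINE,
          r"SYSTEM\CurrentControlSet\Control\ComputerName\ComputerName",
          "ComputerName", "JACKS-PC")                       # was pc.isi.deterlab.net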
5.4.3 System Restore
After analyzing each malware sample, Pandora needs to restore MEE’s hard drive with a
clean snapshot. We take advantage of the OS setup functionality provided by DeterLab
to implement this function. After we perform certain OS optimizations at the beginning,
the modified OS can be conveniently saved into a snapshot using the imaging function
of DeterLab. This step takes a few minutes but is carried out only once for our malware
analysis experimentation. Later, when we need to restore the system after analyzing
each individual malware sample, we issue a single command (os load) from the Coordinator
to restore MEE from the previously saved snapshot. This command copies the
snapshot from the Storage service and overwrites MEE's local hard drive. However, since
this restore process expands the whole disk image, it incurs some performance
overhead. To speed up system recovery, we limit the primary
partition of the operating system to 4GB. We quantify this overhead in
Section 5.5 and discuss possible improvements in Section 5.7.
5.5 Evaluation
In this section, we show our evaluation of Pandora along several dimensions: (1) benefit
to malware analysis by exposing more malware behaviors than full-containment analysis,
(2) safety, and (3) runtime overhead.
5.5.1 Overview of Malware Samples
The discovery dates of malware are critical, because the Internet hosts contacted by older
samples have a higher probability of being taken down by defenders. If we execute
these samples, they may not be able to exchange network packets with their targeted
Internet hosts and thus will exhibit reduced activity. Therefore, we randomly select
29,319 samples that were discovered after January 2017 from OpenMalware [Geo17].
In order to obtain some ground truth about the purposes of these samples, we submit
their md5 hashes to VirusTotal [Tot17] and retrieve 28,495 valid reports. Each report
contains the analysis results of about 50-60 anti-virus (AV) products for a single sample.
We keep the samples that were labeled as malicious by more than 50% of the AV products.
This leaves us with 19,007 samples.
Concise Tagging. Each AV product tags a binary with a vendor-specific label, for
example, "worm.win32.allaple.e.", "trojan.waski.a", "malicious confidence 100% (d)",
or just benign. As demonstrated in [BO+07], AV vendors disagree not only on which tag
to assign to a binary, but also on how many unique tags exist. To overcome this limitation,
we devise a mapping from vendor-specific tags into concise, generic tags. We first take
the union of all the tags assigned by the AV products (74,443 in total), and then extract
several common categories out of them, such as worm, trojan, virus, etc. Finally, we
tag the sample with the concise category that the majority of the AV products assign to
it. Table 5.3 shows the breakdown of our samples over our concise tags.
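A minimal sketch of this mapping step is shown below: each vendor label is matched against a small set of category keywords and the majority category wins; the keyword list is illustrative, not our full mapping.

from collections import Counter

CATEGORIES = ["worm", "trojan", "virus", "downloader", "adware",
              "riskware", "backdoor", "bot", "ransom"]

def concise_tag(vendor_labels):
    # vendor_labels: e.g. ["worm.win32.allaple.e", "trojan.waski.a", ...]
    votes = Counter()
    for label in vendor_labels:
        low = label.lower()
        for cat in CATEGORIES:
            if cat in low:
                votes[cat] += 1
                break
    return votes.most_common(1)[0][0] if votes else "unknown"

print(concise_tag(["worm.win32.allaple.e", "Worm:Win32/Allaple", "trojan.waski.a"]))
# -> worm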
Selecting Samples with Network Behavior. We randomly select 2,994 out of the
19,007 samples, covering all the categories in Table 5.3, to analyze with Pandora.
We run them in Pandora and log all the network packets generated by MEE using
tcpdump. We keep the samples that exhibit network activity, which leaves 1,737
Table 5.3: Concise Tagging of Malware Samples
Categories Samples Categories Samples
Virus 6,126/32% Riskware 409/2%
Trojan 6,040/32% Backdoor 197/1%
Worm 4,227/22% Bot 45/<1%
Downloader 984/5% Ransomware 17/<1%
Adware 962/5% Total 19,007
binaries. Among these samples, 1,354 (78%) send more network packets under Pandora
than under full containment, which demonstrates that our live policies help elicit more of
malware's network behavior. In the following sections, we investigate the network activities
of these 1,737 samples.
5.5.2 Pandora Exposes More Malware Behavior
We measure malware’s network behavior using the number of flows generated. Out
of 1,737 samples that exhibit any network behavior, 1,354 (78%) launch more flows
with Pandora than in the baseline case. This demonstrates that Pandora exposes more
malware behavior and thus facilitates better malware analysis.
5.5.3 Pandora Enables Safe Malware Analysis
During twelve weeks of our experimentation, we received no abuse complaints. We
also analyzed 203 IP blacklists from 56 well-known maintainers (e.g., [Fee17]), which
contain 178 million IPs and 34,618 /16 prefixes for our experimentation period. Our
external IP was not in any of the blacklists. This data shows that malware analysis in
Pandora is safe and no harmful traffic is allowed out.
5.5.4 Execution Overhead
In our evaluation process, the “Restore system state” step takes about 10 minutes, while
“Prepare system” and “Deploy one sample” need only a few seconds. After each sample
is launched, we allow it to run for 5 minutes before we terminate it. Therefore, we
spend roughly 20 minutes analyzing each sample, and most of that time is spent restoring
the system state. To improve analysis efficiency, we allocate four physical
machines for MEE and are thus able to analyze four samples in parallel. This achieves
an amortized cost of 5 minutes for executing each malicious binary.
5.6 Pandora Facilitates Novel Malware Research
In the previous section we have shown that Pandora exposes more malware behaviors
than full-containment analysis. We now explore what new research questions can be
answered with this additional data.
We focus on the question of classifying unknown malware based on its communica-
tion patterns. Current malware classification relies on binary analysis. Yet, this approach
has a few challenges. First, malware may use packing or encryption to obfuscate its
code, thus defeating binary analysis. Second, malware may be environment-sensitive
and may not exhibit its interesting behavior and code if run in a virtual machine or debugger,
which are usually used for binary analysis. We thus explore malware classification
based on its communication behavior, reasoning that malware may obfuscate its code
but it must exhibit certain key behaviors to achieve its basic functionality. For example,
a scanner must scan its targets and cannot significantly change this behavior without
jeopardizing its functionality.
Our approach starts with the tcpdump logs of a malware sample's traffic. From this data
we build a concise representation of malware communication, which we call NetDigest.
Table 5.4: NetDigest of a Session
Protocol | [Attribute: Value]
All      | [LocalPort: integer]‡, [NumPktSent: integer]‡, [NumPktRecv: integer]‡, [PktSentTS: float list]‡, [PktRecvTS: float list]‡, [PayloadSize: integer list]†
DNS      | [Server: IP address]‡, [QueryType: domain]‡, [CNAME: CNAME]‡, [ResponseType: ResponseValue]‡
HTTP/FTP | [Server: IP address]‡, [Proactive: boolean]‡, [GotResponse: boolean]‡, [Code: integer]‡, [Download: file type]†, [Upload: file type]†
SMTP     | [Server: IP address]‡, [EmailTitle: string]*, [Recipients: string]*, [BodyLength: integer]*, [ContainAttachment: boolean]*, [AttachmentType: string]*
ICMP     | [Requested IP: IP address]*, [Times: count]*
‡ Occur exactly once   † May have zero or more occurrences   * Have at least one occurrence
First, we split malware's traffic into different sessions/connections based on the local
ports used by malware and the external IP address and port number. All packets sent
from and received by the same local port, with the same Internet host and port, belong to
a single session. Second, for each session, we extract the application protocol employed
and devise a list of [Attribute: Value] pairs for this protocol, as shown in Table 5.4.
The first row of Table 5.4 shows the information that we will extract for all types
of application protocols. For example, “LocalPort” denotes the local IP port used by
malware, which is an integer. This attribute appears only once for a single session, as
directly derived from the definition of a session. The “NumPktSent” means the total
number of packets sent by malware in an individual session. The “PktSentTS” is a list
of Unix epoch time of all the packets sent by malware. Finally, we also maintain a list
of each packet’s payload size when possible.
For example, the DNS protocol has one attribute, "Server", whose value is the IP address
to which the query is sent. For the domain queried by malware, the
QueryType can be an address record (A), mail exchange record (MX), pointer record (PTR),
or others. For the response sent back by the DNS server, we first save its canonical name, if
any, in a [CNAME: CNAME] pair. Then, we extract the ResponseType and corresponding
values in a [ResponseType: ResponseValue] pair. For example, the ResponseType
may be an "A" record, and the ResponseValue then contains a list of IP addresses.
For an HTTP or FTP session, we first take note of the server’s IP address in the
[Server IP: IP address] pair. Then, we use boolean values to mark whether this session is
initiated by malware ("Proactive") and whether malware receives any response from the
Internet host ("GotResponse"). If the outside server replies to malware, we classify the following
packets as “Download” or “Upload” depending on the direction of the packets, from the
perspective of malware. We also extract the file type being transferred.
For an SMTP message, we extract the server IP address, Email title, recipients, and
body length. We also use a boolean value to note whether the message has an attachment
and save the attachment’s file type in a string.
For the ICMP protocol, we extract the destination IP address into the [Requested IP:
IP address] field. We also save the number of requests in [Times: count] field.
After we build the lists of attribute-value pairs for all the sessions produced by a
malware sample, we sort the lists based on their first timestamps. The final, sorted list
of session abstractions is called the NetDigest of the sample. In the following paragraphs,
we first give an overview of all malware's communication patterns, mined from the
NetDigests, and then give more details about one selected case.
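A minimal sketch of the session-splitting step is shown below; packets are assumed to be pre-parsed dicts (timestamp, local port, remote IP/port, direction, payload length), and the protocol-specific attributes of Table 5.4 would be filled in by per-protocol parsers that are omitted here.

from collections import defaultdict

def build_netdigest(packets):
    sessions = defaultdict(list)
    for pkt in packets:
        key = (pkt["lport"], pkt["rip"], pkt["rport"])     # one session per key
        sessions[key].append(pkt)
    digest = []
    for (lport, rip, rport), pkts in sessions.items():
        sent = [p for p in pkts if p["dir"] == "out"]
        recv = [p for p in pkts if p["dir"] == "in"]
        digest.append({
            "FirstTS": min(p["ts"] for p in pkts),
            "LocalPort": lport, "Server": rip, "RemotePort": rport,
            "NumPktSent": len(sent), "NumPktRecv": len(recv),
            "PayloadSize": [p["len"] for p in pkts],
        })
    return sorted(digest, key=lambda s: s["FirstTS"])      # order by first timestamp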
Table 5.5: Top 12 Application Protocols used by Malware
Protocols Samples Protocols Samples
DNS 1081/62% 1042 65/4%
ICMP echo 818/47% 799 33/2%
HTTP 600/35% 6892 25/1%
65520 237/14% 11110 17/1%
HTTPS 173/10% 11180 17/1%
SMTP 75/4% FTP 12/1%
5.6.1 Overview of Malware’s Communication Patterns.
Table 5.5 shows the top 12 application protocols used by all of the 1,737 samples. We
find that DNS and ICMP echo request are the two most popular protocols, which are
used by 62% and 47% of the samples. DNS is used by malware for functionality – to
resolve the IPs of the domains that malware wishes to contact – while ICMP is likely
used to test reachability, either to detect if malware is running in a contained environ-
ment or to identify live hosts that may later be infected, if vulnerable. HTTP and HTTPS
are the most popular and known protocols that can be used to transfer payload, which
are used by 35% and 10% of the samples.
Table 5.6 shows the destination ports that malware uses during its first attempt to
connect with Internet hosts. Here as well, DNS and ICMP are the two most popular
protocols, used by 926 (53%) and 629 (36%) of the samples, respectively. DNS is
likely used for functionality – to resolve the IPs of the domains that malware wishes to
contact – while ICMP is likely used to test reachability and detect if malware is being
run in a contained environment. In some cases, ICMP is also used to scan Internet hosts.
For example, in our evaluation, several malware samples send ICMP echo requests to
Internet hosts after being launched, without a preceding DNS query. In addition, these
samples engage in no follow-up activities with the hosts that send ICMP echo replies.
Table 5.6: Destination Ports of Malware’s First Packet
Destination Port Samples
DNS 926/53%
ICMP 629/36%
1042 63/4%
HTTP 25/1%
6892 22/1%
HTTPS 19/1%
11110 14/1%
FTP 12/1%
1034 8/<1%
4899 7/<1%
9999 5/<1%
36355 1/<1%
78 1/<1%
5517 1/<1%
8080 1/<1%
Kerberos 1/<1%
55107 1/<1%
5500 1/<1%
Total 1737
Table 5.7 shows the follow-up behavior of malware samples that use a DNS query as
their first packet. There are 314 (18%) out of 1,737 samples that send packets to HTTP
servers, while 287 (17%) samples keep querying the same or different domains. In
our evaluation, port 65520 is the most popular non-standard port. This port is exploited
by several well-known malware samples; after successful infection, these samples open
a back door on the compromised computers [BC07]. We also find 13 samples that do
not take any follow-up actions after the DNS query.
Table 5.8 shows the top 10 domains queried by all samples. The third column shows
the domains' ranks retrieved from alexa.com, which sorts websites based on their popularity.
About 14% and 11% of samples query zief.pl and google.com, respectively.
We also notice that the queried domains are either very popular (e.g., google.com) or
very obscure (e.g., zief.pl).
Table 5.7: Destination Ports of Malware’s Follow-up Packet after DNS Query
Follow-up Port Samples
HTTP 314/18%
DNS 287/17%
65520 211/12%
HTTPS 55/3%
799 17/1%
ICMP 12/1%
6667 4/<1%
3333 2/<1%
587 2/<1%
3080 1/<1%
2668 1/<1%
8000 1/<1%
1199 1/<1%
1177 1/<1%
8080 1/<1%
1255 1/<1%
6688 1/<1%
2016 1/<1%
No follow-up 13/1%
Total 926
Samples in our data set query a total of 5,548 different domains, among which
zief.pl (14%) and google.com (11%) are the most popular. We look these
domains up on alexa.com, which has records for 341 (6%) of them, as shown in
Figure 5.4. We find that 1% of the domains are ranked within the top 10,000, while 94%
of the domains are not recorded by alexa at all. The domains ranked within the top 10,000
are mostly web portals, such as youtube.com, facebook.com, and
baidu.com. For the domains that have no alexa record, we manually check 20 of
them, and none has a valid DNS record. This suggests that malware uses portal
websites to test network reachability and uses private servers for file transfer or C&C
communication.
Table 5.8: Top 10 Domains Queried by Malware
Domain Samples Rank
zief.pl 244/14% 20,965,412
google.com 187/11% 1
buzzrin.de 91/5% 1,918,353
msrl.com 73/4% No record
tembel.org 73/4% No record
cygnus.com 73/4% 677,416
develooper.com 73/4% 6,058,804
secureserver.net 73/4% 622
ide.com 73/4% 8,611,029
cpan.org 73/4% 29,382
Figure 5.4: Ranks of Domains from alexa.com (y-axis: Alexa rank; x-axis: domains)
We classify the queried names based on their top-level domain, e.g., .com or .net.
We find a total of 72 distinct top-level domains, as shown in Figure 5.5. The top 5 of
these domains are shown in Table 5.9. The .com domain is the most popular top-level domain,
queried by 540 (31%) samples. The third column in Table 5.9 shows the top
3 queried domains in each top-level category. The 72 top-level domains include 53 country-code
top-level domains, which show that Poland, Germany, and the Netherlands are the top three
Figure 5.5: Popularity of Top-level Domains (y-axis: percentage of samples; x-axis: top-level domains)
Table 5.9: Popularity of Top-level Domains Queried by Malware
Top-level Domain Samples Second-level Domain (Top 3) Samples
.com 540/31%
google.com 187/35%
msrl.com 73/14%
ide.com 73/14%
.pl 293/17%
zief.pl 244/83%
brenz.pl 26/9%
ircgalaxy.pl 22/8%
.net 235/14%
secureserver.net 73/31%
surf.net 68/29%
aol.net 65/28%
.de 179/10%
buzzrin.de 91/51%
lst.de 65/36%
loewis.de 41/23%
.org 153/9%
tembel.org 73/48%
cpan.org 73/48%
python.org 70/46%
countries preferred by malware. This hints that current malware mostly targets victims
in Europe and that malware authors likely reside on this continent.
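The top-level-domain breakdown itself is a simple counting step, sketched below; domains_per_sample is assumed to map a sample's md5 to the set of domains seen in its DNS queries.

from collections import Counter

def tld_popularity(domains_per_sample):
    counts = Counter()
    for domains in domains_per_sample.values():
        tlds = {"." + d.rstrip(".").rsplit(".", 1)[-1] for d in domains}
        counts.update(tlds)                    # count each TLD once per sample
    return counts

print(tld_popularity({"md5a": {"zief.pl", "google.com"},
                      "md5b": {"brenz.pl"}}).most_common(2))
# -> [('.pl', 2), ('.com', 1)]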
5.6.2 Case Analysis – Forwarding HTTP Traffic.
Figure 5.6 shows the NetDigest of a sample tagged as Trojan by AV products. At the
beginning, this sample queries a domain (ic-dc.deliverydlcenter.com) using the
default DNS server provided by Pandora. Our DNS server replies with the real record
– no canonical name, but a list of IP addresses: 52.85.83.81, 52.85.83.112,
etc. Then, this sample downloads a picture and blob files from the first IP address
returned. However, for the remaining Internet hosts, this sample just establishes connections
with them but does not download or upload any information. For example,
the second domain (www.1-ads.com) suggests that it is an advertising website,
but no payload is downloaded from this website (session starting at timestamp
1488068896.977464). In addition, some IPs are unreachable at the time of our execution,
such as 52.85.83.112, 52.85.83.132, and 52.85.83.4. We also notice that
this sample shows network activity at intervals of about 20 seconds.
5.6.3 Case Analysis – Mimicking FTP Service.
We select a sample (md5: 54d3389cd53160dbc79b3d3c2ce151a2) that attempts to
communicate with an FTP server at 199.231.188.109:21. This sample directly connects
to this IP without performing any DNS query first. We manually carry out an rDNS
query, which shows the IP is bound to server.questerhost.in. The binary launches
a session using the username johan and the password GetUpEarlyAt09. Then,
it changes the remote directory to incoming and uploads two files of 13KB in size. We
save these files and upload them to VirusTotal for analysis. However, the format and
the content of these files cannot be determined by VirusTotal. Finally, we use the above
credentials to access the real server, located in Secaucus, New Jersey, USA (shown by
GeoLite [Min17]). The connection fails with a "530 Login authentication failed" error.
1488068895.052901: DNS - [Server: 10.1.1.3], [A: ic-dc.deliverydlcenter.com],
[CNAME: N/A], [A: 52.85.83.81, 52.85.83.112,
52.85.83.132, 52.85.83.4, 52.85.83.96, 52.85.83.56,
52.85.83.32, 52.85.83.37]
1488068895.154335: HTTP - [Server: 52.85.83.81], [Proactive: True], [GotResponse: True],
[Download: blob], [Download: .png], [Download: blob]
1488068895.948346: HTTP - [Server: 52.85.83.81], [Proactive: True], [GotResponse: True]
1488068896.767094: DNS - [Server: 10.1.1.3], [A: www.1-1ads.com], [CNAME: n135adserv.com],
[A: 212.124.124.178]
1488068896.977464: HTTP - [Server: 212.124.124.178], [Proactive: True], [GotResponse: True]
1488069110.044756: DNS - [Server: 10.1.1.3], [A: ic-dc.deliverydlcenter.com], [CNAME: N/A],
[A: 52.85.83.56, 52.85.83.112, 52.85.83.96, 52.85.83.37,
52.85.83.81, 52.85.83.4, 52.85.83.132, 52.85.83.32]
1488069110.049507: DNS - [Server: 10.1.1.3], [A: ic-dc.deliverydlcenter.com], [CNAME: N/A],
[A: 52.85.83.32, 52.85.83.37, 52.85.83.56, 52.85.83.112,
52.85.83.96, 52.85.83.132, 52.85.83.4, 52.85.83.81]
1488069110.338822: HTTP - [Server: 52.85.83.81], [Proactive: True], [GotResponse: False]
1488069110.342816: HTTP - [Server: 52.85.83.81], [Proactive: True], [GotResponse: False]
1488069131.273458: HTTP - [Server: 52.85.83.112], [Proactive: True], [GotResponse: False]
1488069131.277206: HTTP - [Server: 52.85.83.112], [Proactive: True], [GotResponse: False]
1488069152.304031: HTTP - [Server: 52.85.83.132], [Proactive: True], [GotResponse: False]
1488069152.308025: HTTP - [Server: 52.85.83.132], [Proactive: True], [GotResponse: False]
1488069173.334854: DNS - [Server: 10.1.1.3], [A: ic-dc.deliverydlcenter.com], [CNAME: N/A],
[A: 52.85.83.32, 52.85.83.132, 52.85.83.96, 52.85.83.81,
52.85.83.4, 52.85.83.56, 52.85.83.112, 52.85.83.37]
1488069173.338605: DNS - [Server: 10.1.1.3], [A: ic-dc.deliverydlcenter.com], [CNAME: N/A],
[A:52.85.83.32, 52.85.83.132, 52.85.83.112, 52.85.83.56,
52.85.83.81, 52.85.83.4, 52.85.83.37, 52.85.83.96]
1488069173.381571: HTTP - [Server: 52.85.83.4], [Proactive: True], [GotResponse: False]
1488069173.383566: HTTP - [Server: 52.85.83.4], [Proactive: True], [GotResponse: False]
Figure 5.6: Example NetDigest (md5: 0155ddfa6feb24c018581084f4a499a8)
With our FTP service impersonator, we are able to extract the user name and pass-
word used by malware and capture its uploading behavior, without routing the malware’s
network packets to the Internet.
5.6.4 Classifying Malware by Its Network Behavior
In this section, we apply machine learning techniques to classify malware samples based
on their network behaviors. Our goal is to first learn network behaviors that are indica-
tive of samples that bear specific concise labels (e.g., worm) and then use these behaviors
to label unknown malware.
Extracting Features. We start with 83 selected features, extracted from the malware's
NetDigest, as shown in Table 5.10.
Table 5.10: Features of Malware Network Behavior
Category | Subgroup   | Features (83 in total)
Packet   | Header     | Distinct number of: IPs, countries, continent, and local ports
Packet   | Payload    | Total size in bytes; sent/received: total number, minimum, maximum, mean, and standard variance
Packet   | Statistics | Sent/received packets: total number, rate; sent/received time interval: min, max, mean, and standard variance
Session  | Direction  | Proactive (initiated by malware) or passive (initiated by Internet servers)
Session  | Result     | Succeeded or failed
Session  | Statistics | Total number of SYNs sent; number of sessions per IP: minimum, maximum, mean, and standard variance
Protocol | DNS        | Number of distinct domains queried by malware
Protocol | HTTP       | Number of codes received: 200, 201, 204, 301, 302, 304, 307, 400, 401, 403, 404, 405, 409, 500, 501, 503; method: GET, POST, HEAD
Protocol | ICMP       | Total number of packets; number per IP: min, max, mean, and standard variance
Protocol | Other      | Ports: total number of distinct ports, top three used
Content  | Files      | php, htm, exe, zip, gzip, ini, gif, jpg, png, js, swf, xls, xlsx, doc, docx, ppt, pptx, blob
Content  | Host info  | OS id, build number, system language, NICs
Content  | Registry   | Startup entries, hardware/software configuration, group policy
Content  | Keyword    | Number of: "mailto", "ads", "install", "download", "email"
We abstract malware's network traffic into four broad categories: Packet, Session,
Protocol, and Content. We divide the Packet category into three subgroups:
Header, Payload, and Statistics. In the Header subgroup, we count the number of distinct IPs
to which malware's packets have been sent. In addition, we also look up the geographical
locations of the IPs in the GeoLite [Min17] database, including the countries and
continents they reside in. The calculation of these features is based on the observation
that certain classes of malware target Internet hosts in different countries. In the Payload
subgroup, we calculate the total size of all payload in bytes. Furthermore, we compute
statistics for both sent and received payload, including total number,
minimum, maximum, mean, and standard variance. In the Statistics subgroup, we perform
similar calculations, but the objects are packet quantity and the time interval between
consecutive packets, again for packets sent and received by malware.
For the Session category, we consider all packets that are exchanged between malware
and a single IP address, and we divide them into different sessions
according to the local ports used by malware. For each session, we determine whether its
direction is proactive or passive, depending on whether the malware initiates the session
or not. We say the Result of a session is successful if malware initiates the session and
receives any response from the host. For the Statistics subgroup of the Session category, we calculate
the number of TCP SYN packets, which can be used to detect SYN flood attacks. We
also record the number of sessions per IP. This feature can reveal the properties of each
IP. For example, in our evaluation, we find that one sample launches one short session
with the first IP and then initiates multiple sessions with the second one. This network
behavior indicates that the first IP serves as a master while the second behaves like a file
server.
For the Protocol category, we extract features for different types of application protocols.
For example, we record the number of distinct domains queried by malware
in its DNS query and response packets. For the HTTP protocol, we calculate the number
of packets carrying specific HTTP status codes, such as 200 (OK), 302 (Found),
404 (Not Found), 500 (Internal Server Error), and many others. Some malware samples
behave differently based on the status codes returned. Because ICMP packets do not
carry IP port information, we perform the same statistical calculations on them as above.
For non-standard IP ports, we maintain the set of distinct port numbers and
explicitly record the top three ports targeted by each malware sample. We extract
this feature because certain malware classes tend to use a uniform port for
communication.
For the Content category, we investigate the payload content carried in HTTP packets,
because this is the main application protocol we allow in Pandora. We use regular
expressions to check whether links to certain file types are present in the payload. The file types
we consider include php, htm, exe, zip, gif, jpg, xls, and others, which are usually
exploited by malware to hide malicious code. Sometimes the content does not contain
any meaningful words, so we tag it as blob. We also examine whether the payload content
contains host information and Windows registry entries that are typically reported to bot
masters. Finally, we count the appearances of selected keywords that are closely
tied to the purposes of malware. For example, if "ads" dominates the keyword list, the
malware is likely used for adware propagation.
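As an illustration, the sketch below computes a handful of the Packet and Session features from the NetDigest sessions described earlier; the geolocation lookup and the remaining feature groups are omitted for brevity.

import statistics
from collections import Counter

def basic_features(digest):
    per_ip = Counter(s["Server"] for s in digest)
    return {
        "distinct_ips":         len(per_ip),
        "distinct_local_ports": len({s["LocalPort"] for s in digest}),
        "pkts_sent":            sum(s["NumPktSent"] for s in digest),
        "pkts_recv":            sum(s["NumPktRecv"] for s in digest),
        "payload_bytes":        sum(sum(s["PayloadSize"]) for s in digest),
        "sessions_per_ip_max":  max(per_ip.values(), default=0),
        "sessions_per_ip_mean": statistics.mean(per_ip.values()) if per_ip else 0,
    }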
Classification Results. We select three popular classification methods from the machine
learning area – decision trees [DF00], support vector machines [BV93], and multi-layer
perceptrons [RHW88]. We implement these algorithms and standard data pre-processing (data
scaling and feature selection) with the Python package scikit-learn [PVG+11].
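A minimal sketch of this setup with scikit-learn is shown below; X is assumed to be the matrix of the 83 features and y the concise tags, both already extracted.

from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import make_pipeline
from sklearn.tree import DecisionTreeClassifier
from sklearn.svm import SVC
from sklearn.neural_network import MLPClassifier

def evaluate(X, y):
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)
    models = {
        "decision tree":          DecisionTreeClassifier(),
        "support vector":         make_pipeline(StandardScaler(), SVC()),
        "multi-layer perceptron": make_pipeline(StandardScaler(), MLPClassifier(max_iter=1000)),
    }
    for name, model in models.items():
        model.fit(X_tr, y_tr)                               # train on 80% of the samples
        print(name, "accuracy:", model.score(X_te, y_te))   # test on the remaining 20%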
Not all of the 1,737 samples that exhibit network behavior are suitable for classification.
For example, a sample may query a domain that has no valid record at the time of
execution and then abort. This behavior does not generate enough data for classification.
In order to provide suitable samples for classification, we select the samples that send
out HTTP, HTTPS, or ICMP packets. These types of traffic are either allowed to reach
the Internet or handled by Pandora's
122
Table 5.11: Results of Classification on Testing Set
Algorithms Rank 1 Rank 2 Rank 3
Decision Tree 242/89% 257/94% 259/95%
Support Vector 231/85% 259/95% 265/97%
Multi-layer Perception 231/85% 257/94% 262/96%
fake replies (Table 5.1). In other words, these samples receive what they need from the
Internet to show their complete network behavior. After applying this filtering rule, we
have 1,354 binaries left for classification.
We use 80% of this data set for training and the remaining 20% of samples for testing.
The results are shown in Table 5.11. The second column (“Rank 1”) counts the samples for
which each algorithm produces the same tag as the one produced by the majority of AV
products. The decision tree performs better (89%) than support vector classification
(85%) and the multi-layer perceptron (85%). In addition to tagging each sample with the
most popular label (Table 5.3), we also keep a record of the popularity of all possible
labels. We say the outcome generated by a classification method is “Rank 2” if the
resulting tag is either the most or the second most popular one. The third column in
Table 5.11 shows the “Rank 2” results for the classification methods. We observe that
the classification precision improves to 94%–95% under this criterion. Similarly, the
last column shows the results for “Rank 3”. Since malware may behave like a virus and a
worm at the same time, it makes sense for a sample to have multiple tags.
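Stated precisely, a sample counts as correct at Rank k if the predicted tag appears among the k most popular AV labels for that sample; the sketch below, with hypothetical predictions and label lists, shows the computation.

    def rank_k_accuracy(predictions, ranked_labels, k):
        # predictions: classifier output, one tag per sample.
        # ranked_labels: per sample, AV labels sorted from most to least popular.
        hits = sum(1 for pred, labels in zip(predictions, ranked_labels)
                   if pred in labels[:k])
        return hits / len(predictions)

    preds = ["worm", "trojan", "adware"]
    labels = [["worm", "virus"], ["virus", "trojan"], ["trojan", "spyware", "adware"]]
    print(rank_k_accuracy(preds, labels, 1))   # 1/3 correct at Rank 1
    print(rank_k_accuracy(preds, labels, 2))   # 2/3 correct at Rank 2
    print(rank_k_accuracy(preds, labels, 3))   # 3/3 correct at Rank 3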
In order to investigate the root causes of why some samples are not classified as
“Rank 1” in Table 5.11, we manually examine their pcap traces. We find that all these
samples show limited network behavior that is not sufficient for classification. For
example, one sample queries a domain and then establishes a connection with the HTTP
server. However, no payload is downloaded or uploaded during the conversation, which
could happen in any malware category. Next, we filter out the samples that
fail to launch enough sessions and reapply the classification algorithms. The evaluation
results are shown in Figure 5.7.

Figure 5.7: Classification Precision under Different Number of Sessions. (The figure plots classification precision, left y-axis, and the number of training plus testing samples, right y-axis, against the threshold on the number of sessions, for the decision tree, support vector, and multi-layer perceptron classifiers.)
The x-axis of Figure 5.7 shows the session-count threshold used to filter out samples:
only the samples with at least that many sessions are kept for classification. The left
y-axis shows the classification precision for each algorithm – the “Rank 1” column in
Table 5.11. The right y-axis shows the number of samples retained under a chosen
threshold. Overall, all of the classification methods perform well and are stable, except
for the multi-layer perceptron when the session quantity is between 5 and 8. After
investigating these sessions, we find that they do not have enough distinguishing feature
values for the multi-layer perceptron; the small variance of the input is further reduced
by the intermediate calculations (hidden layers) of the algorithm [PVG+11].
Based on the typical performance of machine learning techniques applied to malware
analysis [PES01], we conclude that our feature extraction method can lead to
high-precision malware classification.
5.7 Discussion
With regard to the network containment policy, we currently forward HTTP/HTTPS traffic
and fake replies for ICMP echo requests. For DNS, FTP, and SMTP packets, we redirect them
to our service impersonators. However, the impersonators will stall if malware attempts
to download payload that we do not have. To overcome this limitation, we could replay the
conversation between the malware and the impersonator against the real server on the
Internet, and then forward the payload retrieved from the Internet host to the malware.
Currently, malware analysis usually focuses on the OS behavior introduced by malware and
on the assembly instructions inside the malware itself; analysis of malware's network
behavior using live malware experimentation remains limited. While our work addresses
this limitation, a deeper understanding of malware's purposes could be gained if the
research efforts in these three areas were integrated. For example, we could combine all
the disk and network activities generated by malware to build a timeline of these events,
which would render a more comprehensive picture of the malware.
5.8 Conclusion
In this work, we propose a malware analysis framework called Pandora that helps collect
and analyze the network behavior of malware. We present a mixed containment policy for
malware's network communication, which provides malware with the necessary network input
while minimizing its potential damage to the Internet. We implement Pandora on the
DeterLab testbed and evaluate its effectiveness using 1,737 malicious binaries captured
in the wild. We propose a concise representation of a malware's network behavior, called
NetDigest. We use this representation and Pandora to analyze contemporary malware
behavior and discuss trends. We then extract 83 features from malware's network behavior
and use these features to train classifiers of malware purpose. These classifiers achieve
high (85%–89%) classification accuracy. Pandora can help defenders better understand
malware's network behavior, as 58% of samples increase their network activities under our
framework.
Chapter 6
Discussion: State-of-the-art Malware
Analysis Frameworks
In current practice of malware analysis, the analysis functionality (e.g., debugging and
trace collection) can be performed online (in the same physical machine where malware is
executed) or offline (in a different physical machine). No matter which option is
selected, there are two main challenges that researchers need to address. First, malware
is expected to contaminate the system, so it is necessary to restore system state, such
as disk contents, after each analysis cycle. Traditional reboot and installation of a
fresh OS copy may take a couple of hours, which slows down malware analysis. Second,
since one must instrument software or hardware to add analysis functionality, it is
critical to hide any artifacts of the instrumentation, because malware can detect them
and evade analysis. Therefore, researchers have been searching for an instrumentation
framework that can 1) restore system state efficiently and 2) minimize its exposure to
malware.
6.1 OS Instrumentation
The most direct approach is to design the analysis framework as an extension to an
operating system that runs on bare metal. However, it is difficult to handle both the
system restore and the artifact hiding challenges in this type of instrumentation. For
example, BareBox [KVK11] proposes a malware analysis framework based on a fast and
rebootless system restore technique. For memory isolation, the authors divide the
physical memory into two partitions, one for the OS and the other for malware execution.
For disk restore, they use two identical disks in a main and mirror configuration. When
saving a snapshot, they redirect all write operations to the mirror disk, so the contents
of the main disk are preserved. While these techniques help restore the system within a
few seconds, BareBox has very limited support for artifact hiding. For example, malware
can perform a string attack, e.g., enumerating process names to match against “BareBox”.
Since BareBox runs at the same privilege level as the OS, malware running at ring 0 can
always detect BareBox. In addition, the authors do not provide a list of anti-debugging
attacks that BareBox can handle.
Debuggers [Yus13, VY06, HR17] can also be classified into this category, since debuggers
need to instrument the OS. These frameworks propose debugging strategies that aim to
analyze malware in a fine-grained, transparent, and faithful way. However, it is not easy
to hide debuggers from malware, and malware authors have devised a variety of
anti-debugging techniques. Generally, malware can attack either the debugging principles
or the artifacts introduced by debuggers. For example, in order to set up a software
breakpoint at an instruction, a debugger replaces the first opcode byte of the
instruction with a 0xcc byte. This raises a breakpoint exception upon execution, which
the debugger then captures. To perform a breakpoint attack, malware may scan its own code
for this special byte, or calculate a hash value over its code and compare it to a
predefined value. In addition, malware can attack debuggers by detecting their exception
handling, flow control, disassembling, and many other principles used for analysis. The
focus of our VM Cloak, in contrast, is to address the virtualization transparency
problem.
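To make the breakpoint attack concrete, the sketch below shows, in Python and over a plain byte buffer, the two checks a sample may perform over its own code region; real malware would implement the same logic in native code, and the code bytes and hash here are purely illustrative.

    import hashlib

    # INT3, the single-byte opcode (0xCC) that debuggers write into the
    # debuggee to plant a software breakpoint.
    INT3 = 0xCC

    def scan_for_breakpoints(code):
        # First check: look for 0xCC bytes inside the code region.
        return any(b == INT3 for b in code)

    def integrity_check(code, expected_sha256):
        # Second check: hash the code region and compare with a value computed
        # before distribution; a planted breakpoint changes the hash.
        return hashlib.sha256(code).hexdigest() == expected_sha256

    original = bytes.fromhex("5589e58b45085dc3")   # a tiny, unpatched code stub
    patched = bytes.fromhex("5589e5cc45085dc3")    # same stub with 0xCC planted
    baseline = hashlib.sha256(original).hexdigest()

    print(scan_for_breakpoints(patched))           # True  -> debugger suspected
    print(integrity_check(patched, baseline))      # False -> code was modified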
6.2 Virtual Machine Instrumentation
The use of VM monitors is the most popular method for analyzing malware, because it
provides an environment in which it is easy to toggle security settings and monitor
malware's activities, such as file modifications and network communication. In addition,
the VM environment is isolated from other processes in the operating system, which makes
it easy to recover from malware's actions. Dinaburg et al. [DR+08] build a VM hypervisor,
called Ether, based on the hardware-assisted virtualization technique (VT-x) available in
recent Intel processors. This type of VMM has better virtualization fidelity than
traditional emulation methods, for example, code translation (e.g., QEMU in TCG mode) and
software emulation (e.g., Bochs). They place certain analysis functionalities, such as
setting breakpoints and tracing, into the hypervisor, believing that malware cannot
detect the existence of Ether and thus the analysis modules inside it. While this
approach achieves a high level of transparency, malware can still detect the VM's
presence. In nEther [PBB11], Pék et al. find that Ether still has significant differences
in instruction handling when compared to physical machines, and thus anti-VM attacks are
still possible. Other mainstream VMMs, such as [BKK06, SBY+08, FPMM10], suffer from
similar challenges.
6.3 Bare-metal Instrumentation
This category of instrumentation introduces the fewest artifacts among the malware
analysis methods: it instruments on-board hardware with analysis functionality. Spensky
et al. [SHL16] (LO-PHI) modify a Xilinx ML507 development board, which provides the
ability to passively monitor memory and disk activities through its physical interfaces.
Since they do not use any VMs or debuggers, there are no artifacts at that level that
malware may detect. However, malware could attempt to detect the presence of this
particular development board and avoid it, assuming that it is used for analysis.
BareCloud [KVK14] replaces analysis of the local disk by using the iSCSI protocol to
attach remote disks to the local system. After each run of malware, the authors extract
file activities from the remote disk and restore it through a copy-on-write technique.
While this disk restore method improves system recovery efficiency, the actual evaluation
overhead is not mentioned in the paper. The system management mode (SMM) is used by Zhang
et al. in MalT [ZLS+15] to implement debugging functionality. SMM is a special-purpose
CPU mode in all x86 processors. The authors run malware on one physical target machine
and employ SMM to communicate with the debugging client on another physical machine.
While SMM executes, Protected Mode is essentially paused; the OS and hypervisor,
therefore, are unaware of code execution in SMM. However, MalT is designed to run on a
single-core system, which is not easy to extend to a multi-core environment. The authors
argue that MalT can debug a process by pinning it to a specific core, while allowing the
other cores to execute the rest of the system normally. This changes thread scheduling
for the debugged process by effectively serializing its threads, and can be used by
malware for detection.
Chapter 7
Conclusion
Malicious software is becoming more and more sophisticated as its creators try to detect
malware analysis environments. First, malware authors can detect virtual machines by
running certain instructions that are known to differ semantically when executed in a
virtual versus a physical machine. Second, malware can attack debuggers when it is being
analyzed by such tools. For example, malware can detect the code modifications made by
the debugger, manipulate the default exception handling procedure, and disable user
input. Once these attacks succeed, malware may behave like a benign binary, exit
prematurely, delete itself, or even crash the analysis tools. Third, recent malware
relies on functional networking to behave as expected; however, present malware analysis
frameworks treat this requirement at a very coarse level. Most research efforts
completely forbid malware from communicating with Internet hosts, which risks incomplete
analysis. Some defenders allow malware to send out network packets without any
restrictions, which may cause substantial harm to the Internet.
In order to address the above challenges, we undertake comprehensive analysis of system
state and network traffic to reveal more malware behaviors and gain a better
understanding of their purpose. We demonstrate the success of this approach through three
research studies – cardinal pill testing, Apate, and Pandora.
First, we propose cardinal pill testing to enumerate the differences in system state
between a virtual and a physical machine. We group the instructions defined by an
architecture manual into distinct categories according to their semantics. For each
instruction category, we carefully devise the ranges of its arguments that lead down
different execution paths, and then select random values within these ranges for testing
each instruction. We evaluate the performance of our cardinal pill testing and compare
our results to those of red pill testing [MPR+09, MPFR+10]. Our evaluation shows that we
use 15 times fewer tests and discover 5 times more pills than red pill testing. In
addition, our testing is significantly more efficient: 47.6% of our test cases yield a
pill, compared to only 0.6% of red pill tests. While our implementation of cardinal pill
testing uses Intel x86 as an example, the testing methodology can be applied to other
instruction set architectures, such as PowerPC, ARM, etc.
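The sketch below illustrates the idea of drawing random operand values from semantically distinct argument ranges; the ranges, the division example, and the test-case format are hypothetical simplifications, not the actual cardinal pill test suite.

    import random

    # Illustrative operand ranges for a division instruction: each range is
    # chosen to exercise a different execution path (normal result,
    # divide-by-zero exception, quotient overflow).
    DIV_RANGES = {
        "normal": [(1, 2**31 - 1)],
        "divide_by_zero": [(0, 0)],
        "overflow_risk": [(1, 1)],      # divisor 1 combined with a large dividend
    }

    def generate_tests(ranges, tests_per_range=3, seed=0):
        rng = random.Random(seed)
        cases = []
        for path, intervals in ranges.items():
            for _ in range(tests_per_range):
                lo, hi = rng.choice(intervals)
                cases.append({"path": path, "divisor": rng.randint(lo, hi)})
        return cases

    for case in generate_tests(DIV_RANGES):
        print(case)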
Second, in order to handle anti-debugging techniques, we propose Apate – a framework for
systematic debugger hiding, which is meant to be integrated with existing debuggers. In
this piece of work, we enumerate the modifications to system state made when a debugger
is present, based on a variety of sources, such as [SH12, BBN12, Fer11, Fal07, Ope07,
CAM+08, ZLS+15]. For example, debuggers change the memory of a debuggee process, leave
detectable names, modify the default exception handling procedure, and alter much other
system state. We compile a list of 79 attack vectors and classify them into 6 categories
and 16 subcategories. For two out of the six attack categories, we propose novel handling
techniques. For three out of the remaining four categories, prior research has sketched
ideas for attack handling, but they were not implemented or tested. We implement Apate as
an extension to a popular debugger, WinDbg, and perform extensive evaluations. Our
evaluation data sets include 881 unknown samples captured in the wild, 79 unit tests, 4
known binaries that have already been examined, and others. Apate outperforms its
competitors by a wide margin on the related data sets. For example, the second best
debugger, OllyDbg [Yus13], can handle only 63% of the 79 anti-debugging attacks, while
Apate handles all of them.
Because Apate is designed as a general malware analysis framework, it can also be used to
hide virtual machines from malware. Starting with our set of cardinal pills, we extract
into a knowledge base the desired system state after executing each instruction. For each
pill, we save the instruction name, the argument values, and the corresponding execution
results learned from a physical machine; we define the combination of this information as
a hiding rule. Whenever Apate detects one of these instructions during malware execution,
it replaces the execution results with those from the knowledge base. As a result,
malware observes exactly the outcome it would see on a physical machine, and thus cannot
detect the presence of a virtual machine.
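A minimal sketch of this hiding-rule lookup is given below; the rule table, its keys, and the recorded values are hypothetical placeholders, since the real knowledge base is populated by cardinal pill testing on a physical machine.

    # Hypothetical hiding rules: (instruction, arguments) -> result recorded
    # on a physical machine. Real rules are learned by cardinal pill testing.
    HIDING_RULES = {
        ("cpuid", (0x40000000,)): {"eax": 0, "ebx": 0, "ecx": 0, "edx": 0},
        ("rdtsc", ()): {"delta_max": 150},
    }

    def apply_hiding_rule(instruction, args, vm_result):
        # Called after single-stepping a pill candidate: replace the
        # VM-produced result with the recorded physical-machine result,
        # if a rule exists for this instruction and argument combination.
        return HIDING_RULES.get((instruction, tuple(args)), vm_result)

    vm_result = {"eax": 0x564D5868}   # a hypervisor-revealing value observed in the VM
    print(apply_hiding_rule("cpuid", [0x40000000], vm_result))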
Third, with the goal of supporting live and safe malware experimentation, we propose a
malware analysis framework, called Pandora, that enforces a mixed network containment
policy. The goal of the policy is to provide the necessary network input for malware,
such that malware exhibits its malicious activities instead of stalling when network
access is forbidden. The policy should also prevent malware from causing substantial harm
to hosts on the Internet, such as participating in DDoS attacks and spreading infections.
To achieve these goals, we apply different routing rules to malware's network flows: we
may forward the flows to the Internet, redirect them to service impersonators within our
control, fake replies, or drop the flows. For example, we forward DNS and HTTP/HTTPS
traffic to the Internet, because these types of flows are the most popular among malware.
We redirect SMTP messages to our service impersonator, because even a single spam email
may cause damage to its recipients. In addition, we employ two rate-limit rules: 1) no
more than 10 packets are sent per second and 2) no more than 10 distinct IP addresses are
contacted within a malware run. Finally, we perform extensive analysis of malware's
network traffic by applying supervised classification and extracting digests of the
network conversations.
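A simplified sketch of this containment decision is given below; the class name, the protocol labels, and the per-protocol mapping follow the summary above, while the rate-limit handling mirrors the two rules just stated. It is an illustration, not Pandora's actual routing implementation.

    from collections import defaultdict

    MAX_PPS = 10        # at most 10 packets sent per second
    MAX_DST_IPS = 10    # at most 10 distinct destination IPs per malware run

    class ContainmentPolicy:
        def __init__(self):
            self.pkts_in_second = defaultdict(int)   # second -> packets sent
            self.dst_ips = set()

        def decide(self, protocol, dst_ip, timestamp):
            # Rate limits apply before any per-protocol rule.
            second = int(timestamp)
            self.dst_ips.add(dst_ip)
            self.pkts_in_second[second] += 1
            if self.pkts_in_second[second] > MAX_PPS or len(self.dst_ips) > MAX_DST_IPS:
                return "drop"
            if protocol in ("DNS", "HTTP", "HTTPS"):
                return "forward"        # allowed to reach the Internet
            if protocol in ("FTP", "SMTP"):
                return "impersonate"    # redirect to a service impersonator
            if protocol == "ICMP_ECHO":
                return "fake_reply"
            return "drop"

    policy = ContainmentPolicy()
    print(policy.decide("HTTP", "93.184.216.34", 12.3))    # forward
    print(policy.decide("SMTP", "198.51.100.7", 12.7))     # impersonate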
Bibliography
[0xE13] 0xEBFE. Fooled by Andromeda. http://0xebfe.net/blog/2013/03/30/
fooled-by-andromeda/, 2013.
[aad12] aadp. Anti-Anti-Debugger Plugins. https://code.google.com/p/aadp/,
2012.
[API17] Windows API. Windows API Index. https://msdn.microsoft.com/
en-us/library/windows/desktop/ff818516%28v=vs.85%29.aspx, 2017.
[AV07] David Arthur and Sergei Vassilvitskii. k-means++: The Advantages of Careful Seeding. In Proceedings of the eighteenth annual ACM-SIAM symposium on Discrete algorithms, 2007.
[BB00] Doug Beeferman and Adam Berger. Agglomerative Clustering of a Search
Engine Query Log. In Proceedings of the sixth ACM SIGKDD interna-
tional conference on Knowledge discovery and data mining, 2000.
[BB07] Paul Barford and Mike Blodgett. Toward Botnet Mesocosms. In Proceed-
ings of the First Conference on First Workshop on Hot Topics in Under-
standing Botnets (HotBots), 2007.
[BBB+04] R. Bajcsy, T. Benzel, M. Bishop, et al. Cyber Defense Technology Networking and Evaluation. Commun. ACM, 47(3), 2004.
[BBN12] Rodrigo Rubira Branco, Gabriel Negreira Barbosa, and Pedro Drimel
Neto. Scientific but Not Academical Overview of Malware Anti-
Debugging, Anti-Disassembly and Anti-VM Technologies. In Black Hat,
2012.
[BC07] Henry Bell and Eric Chien. Malware analysis report. https:
//www.symantec.com/security_response/writeup.jsp?docid=
2007-041117-2623-99, 2007.
[BCH+09a] Ulrich Bayer, Paolo Milani Comparetti, Clemens Hlauschek, Christopher Kruegel, and Engin Kirda. Scalable, Behavior-Based Malware Clustering. In NDSS, 2009.
[BCH+09b] Ulrich Bayer, Paolo Milani Comparetti, Clemens Hlauschek, Christopher Kruegel, and Engin Kirda. Scalable, Behavior-Based Malware Clustering. In NDSS, 2009.
[BCK+10] D. Balzarotti, M. Cova, C. Karlberger, et al. Efficient Detection of Split Personalities in Malware. In Network and Distributed System Security (NDSS), 2010.
[BDF+03] Paul Barham, Boris Dragovic, Keir Fraser, Steven Hand, Tim Harris, Alex Ho, Rolf Neugebauer, Ian Pratt, and Andrew Warfield. Xen and the Art of Virtualization. In Proceedings of the 19th ACM Symposium on Operating Systems Principles, 2003.
[Bel05] Fabrice Bellard. QEMU, a Fast and Portable Dynamic Translator. In
USENIX ATC, 2005.
[BHL+08] David Brumley, Cody Hartwig, Zhenkai Liang, James Newsome, Dawn Song, and Heng Yin. Automatically Identifying Trigger-based Behavior in Malware. In Botnet Detection, 2008.
[BKK06] Ulrich Bayer, Christopher Kruegel, and Engin Kirda. TTAnalyze: A Tool
for Analyzing Malware. In European Institute for Computer Antivirus
Research (EICAR) Annual Conference, 2006.
[BO+07] Michael Bailey, Jon Oberheide, et al. Automated Classification and Analysis of Internet Malware. In RAID, 2007.
[BV93] I Boser and V Vapnik. Automatic Capacity Tuning of Very Large VC-
dimension Classifiers. Advances in neural information processing systems,
1993.
[CAM+08] Xu Chen, Jon Andersen, Z. Morley Mao, et al. Towards an Understanding of Anti-virtualization and Anti-debugging Behavior in Modern Malware. In IEEE International Conference on Dependable Systems and Networks with FTCS and DCC (DSN), 2008.
[CGKP11] Juan Caballero, Chris Grier, Christian Kreibich, and Vern Paxson. Mea-
suring Pay-per-Install: The Commoditization of Malware Distribution. In
Usenix security symposium, 2011.
[Cod17] Exception Code. Exception Code. https://msdn.microsoft.com/en-us/library/cc704588%28d=lightweight,l=en-us,v=PROT.10%29.aspx, 2017.
[CSK+10] Paolo Milani Comparetti, Guido Salvaneschi, Engin Kirda, Clemens Kolbitsch, Christopher Kruegel, and Stefano Zanero. Identifying Dormant Functionality in Malware Programs. In Proceedings of the 2010 IEEE Symposium on Security and Privacy, 2010.
[Deb15] Immunity Debugger. Immunity Debugger. http://debugger.
immunityinc.com/, 2015.
[DF00] Glenn De’ath and Katharina E Fabricius. Classification and regression
trees: a powerful yet simple technique for ecological data analysis. Ecol-
ogy, 2000.
[DR+08] Artem Dinaburg, Paul Royal, et al. Ether: Malware Analysis via Hardware Virtualization Extensions. In Proceedings of the 15th ACM Conference on Computer and Communications Security (CCS), 2008.
[EKS+96] Martin Ester, Hans-Peter Kriegel, Jörg Sander, Xiaowei Xu, et al. A Density-based Algorithm for Discovering Clusters in Large Spatial Databases with Noise. In KDD, 1996.
[ES14] Ittay Eyal and Emin Gün Sirer. Majority is Not Enough: Bitcoin Mining is Vulnerable. In International Conference on Financial Cryptography and Data Security, 2014.
[Fal07] Nicolas Falliere. Windows Anti-Debug Reference. http://www.
symantec.com/connect/articles/windows-anti-debug-reference,
2007.
[Fee17] Master Feeds. Bambenek Consulting Feeds. http://osint.
bambenekconsulting.com/feeds/, 2017.
[Fer06] Peter Ferrie. Attacks on Virtual Machine Emulators. Symantec Security
Response, 2006.
[Fer09] Peter Ferrie. Anti-Unpacker Tricks. http://vpn23.homelinux.org/
Anti-Unpackers.pdf, 2009.
[Fer11] Peter Ferrie. The “Ultimate” Anti-Debugging Reference. http://
pferrie.host22.com/, 2011.
[Flo14] Cristian Florian. Most vulnerable operating systems
and applications in 2014. http://www.gfi.com/blog/
most-vulnerable-operating-systems-and-applications-in-2014/,
2014.
[FPMM10] Aristide Fattori, Roberto Paleari, Lorenzo Martignoni, and Mattia Monga.
Dynamic and Transparent Analysis of Commodity Production Systems. In
Automated Software Engineering, 2010.
[Geo17] ISC Tech Georgia. Open Malware. http://oc.gtisc.gatech.edu/, 2017.
[GLB12] Mariano Graziano, Corrado Leita, and Davide Balzarotti. Towards Net-
work Containment in Malware Analysis Systems. In Proceedings of the
28th Annual Computer Security Applications Conference. ACM, 2012.
[HE+09] Thorsten Holz, Markus Engelberth, et al. Learning More About the Underground Economy: A Case-Study of Keyloggers and Dropzones. In ESORICS, 2009.
[HR17] Hex-Rays. IDA: multi-processor disassembler and debugger. https://
www.hex-rays.com/products/ida/, 2017.
[Int17a] Intel. Intel 64 and IA-32 Architectures Software Developers
Manuals. http://www.intel.com/content/www/us/en/processors/
architectures-software-developer-manuals.html, 2017.
[Int17b] Intel. Pin - A Dynamic Binary Instrumentation
Tool. https://software.intel.com/en-us/articles/
pin-a-dynamic-binary-instrumentation-tool, 2017.
[JMG+09] John P. John, Alexander Moshchuk, Steven D. Gribble, et al. Studying Spamming Botnets Using Botlab. In Proceedings of the 6th USENIX Symposium on Networked Systems Design and Implementation (NSDI), 2009.
[KV08] Johannes Kinder and Helmut Veith. Jakstab: A Static Analysis Platform
for Binaries. In Proceedings of the 20th International Conference on Com-
puter Aided Verification, 2008.
[KV15] Dhilung Kirat and Giovanni Vigna. MalGene: Automatic Extraction of
Malware Analysis Evasion Signature. In Proceedings of the 22Nd ACM
SIGSAC Conference on Computer and Communications Security, 2015.
[KVK11] Dhilung Kirat, Giovanni Vigna, and Christopher Kruegel. BareBox: Efficient Malware Analysis on Bare-metal. In ACSAC, pages 403–412, 2011.
[KVK14] Dhilung Kirat, Giovanni Vigna, and Christopher Kruegel. BareCloud:
Bare-metal Analysis-based Evasive Malware Detection. In 23rd USENIX
Security Symposium, 2014.
[KW+11] Christian Kreibich, Nicholas Weaver, et al. GQ: Practical Containment for Measuring Modern Malware Systems. In Proceedings of the 2011 ACM SIGCOMM Conference on Internet Measurement Conference (IMC), 2011.
[KYH+09] Min Gyung Kang, Heng Yin, Steve Hanna, et al. Emulating Emulation-resistant Malware. In Proceedings of the First ACM Workshop on Virtual Machine Security (VMSec), 2009.
[Law96] Kevin P. Lawton. Bochs: A Portable PC Emulator for Unix/X. Linux
Journal, (29es), 1996.
[LCL13] Shun-Te Liu, Yi-Ming Chen, and Shiou-Jing Lin. A novel search engine
to uncover potential victims for apt investigations. In IFIP International
Conference on NPC, 2013.
[LH07] Robert Lyda and James Hamrock. Using Entropy Analysis to Find
Encrypted and Packed Malware. 2007.
[LKMC11] Martina Lindorfer, Clemens Kolbitsch, and Paolo Milani Comparetti.
Detecting Environment-Sensitive Malware. In Proceedings of the 14th
International Conference on Recent Advances in Intrusion Detection
(RAID), 2011.
[MABXS10] Jose Andre Morales, Areej Al-Bataineh, Shouhuai Xu, and Ravi Sandhu.
Analyzing and Exploiting Network Behaviors of Malware. In Interna-
tional Conference on Security and Privacy in Communication Systems,
2010.
[Mic17] Microsoft. Microsoft PE and COFF Specification. https://msdn.
microsoft.com/en-us/windows/hardware/gg463119.aspx, 2017.
[Min17] Max Mind. GeoLite Legacy Downloadable Databases. http://dev.
maxmind.com/geoip/legacy/geolite/, 2017.
[MKK07] Andreas Moser, Christopher Kruegel, and Engin Kirda. Exploring Multi-
ple Execution Paths for Malware Analysis. In Security and Privacy, 2007.
SP’07. IEEE Symposium on, 2007.
[MMP+12] Lorenzo Martignoni, Stephen McCamant, Pongsin Poosankam, Dawn Song, and Petros Maniatis. Path-exploration Lifting: Hi-fi Tests for Lo-fi Emulators. In Proceedings of the 17th International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS), pages 337–348, 2012.
[MPFR+10] Lorenzo Martignoni, Roberto Paleari, Giampaolo Fresi Roglia, et al. Testing System Virtual Machines. In Proceedings of the 19th International Symposium on Software Testing and Analysis (ISSTA), 2010.
[MPR+09] Lorenzo Martignoni, Roberto Paleari, Giampaolo Fresi Roglia, et al. Testing CPU Emulators. In Proceedings of the 18th International Symposium on Software Testing and Analysis (ISSTA), 2009.
[MSF+08] Lorenzo Martignoni, Elizabeth Stinson, Matt Fredrikson, Somesh Jha, and John C. Mitchell. A Layered Architecture for Detecting Malicious Behaviors. In Proceedings of the 11th International Symposium on Recent Advances in Intrusion Detection, 2008.
[Net15] Arbor Networks. Arbor Networks ATLAS Data Shows the Aver-
age DDoS Attack Size Increasing. https://www.arbornetworks.com/
arbor-networks-atlas-data-shows-the-average-ddos-attack-size-increasing,
2015.
[New14] Jan Newger. IDAStealth Plugin. https://github.com/nihilus/
idastealth, 2014.
[Ope07] OpenRCE. OpenRCE Anti Reverse Engineering Techniques Database.
http://www.openrce.org/reference_library/anti_reversing, 2007.
[PBB11] Gábor Pék, Boldizsár Bencsáth, and Levente Buttyán. nEther: In-guest Detection of Out-of-the-guest Malware Analyzers. In Proceedings of the Fourth European Workshop on System Security (EuroSec), 2011.
[PDZ+14] Fei Peng, Zhui Deng, Xiangyu Zhang, Dongyan Xu, Zhiqiang Lin, and Zhendong Su. X-Force: Force-Executing Binary Programs for Security Applications. In 23rd USENIX Security Symposium, 2014.
[PES01] Leonid Portnoy, Eleazar Eskin, and Sal Stolfo. Intrusion Detection with Unlabeled Data Using Clustering. In Proceedings of the ACM CSS Workshop on Data Mining Applied to Security (DMSA-2001), 2001.
[PVG+11] F. Pedregosa, G. Varoquaux, A. Gramfort, V. Michel, B. Thirion, O. Grisel, M. Blondel, P. Prettenhofer, R. Weiss, V. Dubourg, J. Vanderplas, A. Passos, D. Cournapeau, M. Brucher, M. Perrot, and E. Duchesnay. Scikit-learn: Machine Learning in Python. Journal of Machine Learning Research, pages 2825–2830, 2011.
[QS07] Danny Quist and V Smith. Covert debugging circumventing software
armoring techniques. Black hat briefings USA, 2007.
[Rce12] Ferrit Rce. OllyExt. https://forum.tuts4you.com/files/file/
715-ollyext/, 2012.
[RD+11] Christian Rossow, Christian J. Dietrich, et al. Sandnet: Network Traffic Analysis of Malicious Software. In BADGERS ’11, 2011.
[RHW88] David E. Rumelhart, Geoffrey E. Hinton, and Ronald J. Williams. Learning Representations by Back-propagating Errors. Cognitive Modeling, 1988.
[RHW+08] Konrad Rieck, Thorsten Holz, Carsten Willems, Patrick Düssel, and Pavel Laskov. Learning and Classification of Malware Behavior. In International Conference on Detection of Intrusions and Malware, and Vulnerability Assessment, 2008.
[RSL12] Mark E. Russinovich, David A. Solomon, and Alex Ionescu. Windows Internals (6th Edition). Microsoft Press, 2012.
[RTWH11] Konrad Rieck, Philipp Trinius, Carsten Willems, and Thorsten Holz. Auto-
matic Analysis of Malware Behavior using Machine Learning. Journal of
Computer Security, 2011.
[SAM14] Hao Shi, Abdulla Alwabel, and Jelena Mirkovic. Cardinal Pill Testing of
System Virtual Machines. In 23rd USENIX Security Symposium (USENIX
Security 14), 2014.
[SBY+08] Dawn Song, David Brumley, Heng Yin, et al. BitBlaze: A New Approach to Computer Security via Binary Analysis. In ICISS, 2008.
[Scy16] ScyllaHide. ScyllaHide. https://bitbucket.org/NtQuery/scyllahide,
2016.
[SGH+11] Brett Stone-Gross, Thorsten Holz, et al. The Underground Economy of Spam: A Botmaster’s Perspective of Coordinating Large-Scale Spam Campaigns. In LEET, 2011.
[SH12] Michael Sikorski and Andrew Honig. Practical Malware Analysis: The
Hands-On Guide to Dissecting Malicious Software. No Starch Press, 2012.
[SHL16] Chad Spensky, Hongyi Hu, and Kevin Leach. LO-PHI: Low-Observable
Physical Host Instrumentation for Malware Analysis. In NDSS, 2016.
[SLC+11] Ming-Kung Sun, Mao-Jie Lin, Michael Chang, et al. Malware Virtualization-Resistant Behavior Detection. In Proceedings of the 2011 IEEE 17th International Conference on Parallel and Distributed Systems (ICPADS), 2011.
[SM17] Hao Shi and Jelena Mirkovic. Hiding Debuggers from Malware with
Apate. In SAC, 2017.
[SRL12] Chengyu Song, Paul Royal, and Wenke Lee. Impeding Automated Mal-
ware Analysis with Environment-Sensitive Malware. In Proceedings of the
7th USENIX Conference on Hot Topics in Security (HotSec), 2012.
[Tec17] Basis Technology. The Sleuth Kit. http://www.sleuthkit.org/, 2017.
[Tot17] Virus Total. VirusTotal. https://www.virustotal.com/en/, 2017.
[Tul08] Joshua Tully. An Anti-Reverse Engineering
Guide. http://www.codeproject.com/Articles/30815/
An-Anti-Reverse-Engineering-Guide/, 2008.
[UC99] Irvine UC. KDD Cup 1999 Data. http://kdd.ics.uci.edu/databases/
kddcup99/kddcup99.html, 1999.
[Vas09] Amit Vasudevan. Reinforced Stealth Breakpoints. In Proceedings of the
4th IEEE Conference on Risks in Internet Systems (CRiSIS), 2009.
[VY05] Amit Vasudevan and Ramesh Yerraballi. Stealth Breakpoints. In Pro-
ceedings of the 21st Annual Computer Security Applications Conference,
ACSAC, 2005.
[VY06] A. Vasudevan and R. Yerraballi. Cobra: fine-grained Malware Analysis
using Stealth Localized-executions. In Security and Privacy, 2006 IEEE
Symposium on, 2006.
[Wer10] Tillmann Werner. Waledac’s Anti-Debugging Tricks. http://www.
honeynet.org/node/550, 2010.
[Win17] WinDbg Windows. WinDbg. https://msdn.microsoft.com/en-us/
windows/hardware/hh852365.aspx, 2017.
[WLS+02] Brian White, Jay Lepreau, Leigh Stoller, Robert Ricci, Shashi Guruprasad, Mac Newbold, Mike Hibler, Chad Barb, and Abhijeet Joglekar. An Integrated Experimental Environment for Distributed Systems and Networks. In Proceedings of the USENIX Symposium on Operating System Design and Implementation, 2002.
[WWW15] Shuai Wang, Pei Wang, and Dinghao Wu. Reassembleable disassembling.
In 24th USENIX Security Symposium (USENIX Security 15), 2015.
[XZGL14] Zhaoyan Xu, Jialong Zhang, Guofei Gu, and Zhiqiang Lin. GoldenEye: Efficiently and Effectively Unveiling Malware’s Targeted Environment. In Proceedings of the 17th International Symposium on Research in Attacks, Intrusions and Defenses, 2014.
[YJZ+12] Lok-Kwong Yan, Manjukumar Jayachandra, Mu Zhang, et al. V2E: Combining Hardware Virtualization and Software Emulation for Transparent and Extensible Malware Analysis. In Proceedings of the 8th ACM SIGPLAN/SIGOPS Conference on Virtual Execution Environments (VEE), 2012.
[Yus13] Oleh Yuschuk. OllyDbg. http://www.ollydbg.de, 2013.
[ZLS+15] Fengwei Zhang, Kevin Leach, Angelos Stavrou, Haining Wang, and Kun Sun. Using Hardware Features for Increased Debugging Transparency. In Proceedings of the 36th IEEE Symposium on Security and Privacy, May 2015.
[ZRL96] Tian Zhang, Raghu Ramakrishnan, and Miron Livny. BIRCH: An Efficient Data Clustering Method for Very Large Databases. In ACM SIGMOD Record, 1996.