STUDYING MALWARE BEHAVIOR SAFELY AND EFFICIENTLY

by

Xiyue Deng

A Dissertation Presented to the
FACULTY OF THE USC GRADUATE SCHOOL
UNIVERSITY OF SOUTHERN CALIFORNIA
In Partial Fulfillment of the
Requirements for the Degree
DOCTOR OF PHILOSOPHY
(COMPUTER SCIENCE)

May 2022

Copyright 2022 Xiyue Deng

Dedication

To my dearest family and all those who are surviving in the face of difficulties.

Acknowledgments

My Ph.D. studies have been a long journey, and I could not have made it through without the support and enlightenment of many people, who deserve my sincerest thanks.

Firstly, I would like to express my wholehearted gratitude and appreciation to my research advisor, Professor Jelena Mirkovic, for her continuous guidance, support, criticism, and encouragement throughout my entire Ph.D. study. She has set a high bar for the integrity of scientific research, and continued to lead us through difficulties during this process with insightful suggestions and passionate encouragement. It has been a great honor to have the opportunity to become one of her Ph.D. students, and under her guidance I have learned how to solve problems with critical thinking and to diligently search for solutions using scientific approaches. More specifically, I want to sincerely thank Prof. Mirkovic for her patient guidance and motivational criticism during difficult times over the years, which have made me stronger and better prepared for challenging tasks during my Ph.D. study, and which I will benefit from as a lifelong asset. My gratitude also goes to the other supervisors in our STEEL lab, Genevieve Bartlett and Christophe Hauser, for their help and inspiration, which also influenced me greatly during my study.
They have been supporting me and other Ph.D. students tirelessly, and are a crucial source of any successes we achieved during this time.

I would like to thank Prof. Clifford Neuman, Prof. William G. J. Halfond, Prof. Bhaskar Krishnamachari, Prof. Milind Tambe, and Prof. Sandeep Gupta for taking time out of their busy schedules to serve on my qualifying exam and dissertation committee. Their suggestions and advice have been a great inspiration to me.

I would also like to thank my friends and colleagues at USC and ISI: Hao Shi, Simon Woo, Chengjie Zhang, Xun Fan, Lin Quan, Xue Cai, Lihang Zhao, Zi Hu, Liang Zhu, Hang Guo, Congxing Cai, Weiwei Chen, Abdulla Alwabel, Vinod Sharma, Sivaram Ramanathan, Rajat Tandon, Nicolaas Weideman, and many others. The friendship we have established is one of the strongest sources of courage that led me through my entire Ph.D. study, especially during difficult times. I would also like to express my special gratitude to Lizsl De Leon, Alba Regalado, Joseph Kemp, Jeeanine Yamazaki, and Matt Binkley for their assistance and administrative support at USC and ISI, which has greatly improved our research life.

Lastly, I would like to thank my parents for their continuous support and encouragement, our cat Phurphy for being with me and brightening our life, and especially my wife Yi Lao for her company, support, patience, and love.

Table of Contents

Dedication
Acknowledgments
List of Tables
List of Figures
Abstract

Chapter 1: Introduction
  1.1 Thesis Statement
  1.2 Demonstrating the Thesis Statement
  1.3 Structure of the Dissertation

Chapter 2: Safe Live Malware Analysis using Fantasm
  2.1 Introduction
  2.2 Related Work
  2.3 Fantasm
    2.3.1 Goals
    2.3.2 Partial Containment
  2.4 Experimentation Goals, Environment and Design
    2.4.1 Experimentation Goals
    2.4.2 Experimentation Environment
    2.4.3 Experiment Design
    2.4.4 Malware Dataset
  2.5 Results
    2.5.1 Partial Containment Exposes More Malware Behavior
    2.5.2 Malware Communication Patterns
    2.5.3 Summarizing Malware Communication
    2.5.4 Classifying Malware by Its Network Behavior
  2.6 Conclusions

Chapter 3: Malware Behavior Analysis using High-Level Behavior
  3.1 Introduction
  3.2 Background and Related Work
    3.2.1 Static Analysis
    3.2.2 Dynamic Analysis on Host
    3.2.3 Dynamic Analysis of Network Behavior
  3.3 Capturing Malware Network Behavior
    3.3.1 Analysis Environment: Fantasm
      3.3.1.1 Impersonators
  3.4 High-level Behaviors
    3.4.1 High-level Behavior Signature
  3.5 Experiment and Evaluation
    3.5.1 Experiment Setup
    3.5.2 Gathering High-Level Behavior
  3.6 Discussion and Future Work
  3.7 Conclusion

Chapter 4: Analyze Malware Similarity Using High-Level Behavior
  4.1 Introduction
  4.2 Background and Related Work
    4.2.1 Static Analysis
    4.2.2 Dynamic Analysis on Host
    4.2.3 Dynamic Analysis of Network Behavior
  4.3 Capturing Malware Network Behavior
    4.3.1 Analysis Environment
  4.4 Sample Embedding
    4.4.1 Application Data Unit (ADU)
    4.4.2 Payload Byte Frequency
    4.4.3 Malware Similarity
  4.5 Evaluation of Contemporary Malware
    4.5.1 Experiment Setup
  4.6 Clustering Flow Embeddings Using Machine Learning
    4.6.1 Clustering by ADU Sequence
    4.6.2 Clustering by Payload Sizes
    4.6.3 Sample Diversity
  4.7 Discussion and Future Work
  4.8 Conclusion

Chapter 5: Polymorphic Malware Detection using High-Level Behavior
  5.1 Introduction
  5.2 Background and Related Work
    5.2.1 Static Binary Analysis
    5.2.2 Dynamic Binary Analysis
    5.2.3 Dynamic Analysis of Network Behavior
    5.2.4 Contemporary Polymorphic Malware Analysis
  5.3 Capturing Malware Network Behavior
    5.3.1 Analysis Environment
  5.4 Embedding the Samples
    5.4.1 Application Data Unit (ADU)
    5.4.2 Payload Byte Frequency
  5.5 Polymorphic Malware Detection
  5.6 Evaluation
    5.6.1 Experiment Setup
    5.6.2 Cross-verifying Our Findings
    5.6.3 Calibrating Clustering Parameters
    5.6.4 Network vs Local Behavior
    5.6.5 Large Malware Clusters
  5.7 Discussion and Future Work
  5.8 Conclusion

Chapter 6: Summary: Our Contributions
Chapter 7: Conclusion
Bibliography

List of Tables

2.1 Flow policies for partial containment
2.2 Minimizing artifacts of DeterLab
2.3 Concise tagging of malware samples
2.4 Top 12 application protocols used by malware, and the number and percentage of samples that use them
2.5 Popularity of domains in malware DNS queries
2.6 NetDigest of a session
2.7 Features extracted from a malware's NetDigest for classification purposes
2.8 Classification results: Rank 1 – our label was the top label assigned by AV products; Rank 2 – our top label was in the top 2 labels assigned by AV products; Rank 3 – our top label was among the top 3 assigned by AV products
3.1 Flow policies in Fantasm
3.2 High-level behavior list
3.3 Percentage of our selected 999 malware samples which exhibit the given high-level behavior
3.4 Percentage of malware having the given number of behaviors, among all malware which exhibited some network activity
3.5 Percentage of malware exhibiting given combinations of behaviors
4.1 Features selected for flow and sample embedding
4.2 ADU transformation from packet sequence
4.3 High-level summary of clustering result using ADU
4.4 High-level summary of clustering result using payload sizes
4.5 Notable behaviors of HTTP Downloader
5.1 Clustering results using different parameter combinations
5.2 Comparison of clustering by local and by network behavior
5.3 Top 10 most reused DLL files
5.4 Top 3 polymorphic malware groups categorized by ADU features
5.5 Prominent potentially polymorphic malware groups categorized by ADU features

List of Figures

2.1 Flow handling: how we decide if an outgoing flow will be let out, redirected to our impersonators, or dropped
2.2 Ranks of domains from alexa.com
2.3 Popularity of top-level domains in our observed malware communications
2.4 Example NetDigest (md5: 0155ddfa6feb24c018581084f4a499a8)
2.5 Classification precision as number of sessions grows
2.6 Number of samples as the limit on number of sessions grows

Abstract

Malware defenses today deploy manual or semi-automated analysis in debuggers and virtual machines to derive code-based signatures, and then use these signatures to detect malware as it traverses the network or infects a vulnerable host. However, malware is becoming ever more sophisticated at bypassing these defenses. Contemporary malware attempts to detect when it is being analyzed in a debugger or virtual machine, and changes its behavior, preventing signature derivation. Malware also mutates its code, generating new polymorphic samples, which evade existing code signatures. Such countermeasures have set a higher bar for modern malware detection and analysis.

In this dissertation, we propose safe and efficient ways to analyze and encode malware network behaviors, which can be observed in bare-metal environments, and can reliably indicate a malware sample's true purpose, in spite of code polymorphism. First, we propose Fantasm, a safe live analysis environment, which does not use debuggers or virtual machines. Fantasm deceives malware into believing that it runs on an unconstrained host, while carefully curtailing its malicious activities for safe analysis. Fantasm is built upon the DeterLab testbed, and uses a series of security mechanisms to limit malware interaction with the live Internet, and decoy services, which mimic responses malware samples expect, to reveal interesting network behaviors. We show how network behaviors, observable in Fantasm, can be used to classify malware into VirusTotal categories, without any binary code analysis.

Second, we propose an embedding of malware's high-level network behavior, which we observe in Fantasm. Our embedding summarizes important features of the network traffic malware exchanges with Internet hosts or our decoy services. This embedding enables us to encode interesting patterns of communication from malware's traffic, and identify common patterns among many samples, which could be used for malware detection.
It also enables fast and accurate classification of malware samples via machine learning.

Finally, we further develop our malware sample embedding to detect polymorphic malware. We show how our embedding, based on the network behavior of malware, successfully identifies malware with identical network and local behaviors but different binary code. This approach complements traditional malware detection based on binary code signatures.

Chapter 1: Introduction

Malicious computer software has been a huge threat to the stability and integrity of computing infrastructures all over the world since its early appearance in the 1980s [60]. As the use of computer software has become prevalent in virtually all aspects of daily life, for both business and leisure, malware has also evolved in both number and diversity. Damage caused by malware has also greatly increased over time [24, 54, 8].

Researchers have been tirelessly working on malware detection and analysis for decades. This line of research produces the most commonly used malware defense – binary signatures. Such signatures can be loaded into firewalls and antivirus software to prevent malware infections. It is also standard practice to analyze malware in a sandbox-like environment, such as a debugger or a virtual machine, both to isolate malware from the host and to gain greater observability and control of malware activities.

Although detection and analysis techniques keep improving, malware creators have also been developing countermeasures that seek to evade these defenses. Two of the most frequently used countermeasures are: (1) malware detects the presence of the analysis environment (debugger or VM) and stops running, and (2) malware transforms its code to evade signature-based detection, while preserving its functionality. Therefore, it is crucial to combat the evasion attempts and polymorphism of modern malware.

Evasion techniques of malware.
Modern malware analysis usually employs sandboxing techniques that give more control to researchers and provide sufficient safety to study malware. However, the differences between a sandbox environment and a bare-metal physical machine can be detected by malware, which then becomes dormant. Common sandboxing techniques include using a debugger or a virtual machine, both of which have different run-time behavior compared to a physical machine. There is a myriad of anti-virtualization, anti-debugger, and anti-assembly methods malware uses to detect debuggers and virtual machines [7]. To handle evasive techniques in malware, we need to be able to analyze malware on bare-metal machines instead of sophisticated environments. Such analysis will likely be unable to observe malware's detailed local behavior on the infected host, since malware can compromise this observation. However, such analysis can observe malware's network behavior, which is observable from a different vantage point than the infected host. Analysis of malware's network behavior can provide useful information for understanding malware, and even for filtering, to complement binary signatures.

Malware behavior. According to the AV-TEST institute, there are over one billion different malware samples, and over 450,000 new malware samples are discovered daily [58]. Such a high malware production rate suggests that new malware samples may be created by transforming existing binary code, to avoid most signature detection techniques, or by reusing components of malicious building blocks to create new combinations of malware. Thus, it is infeasible to try to identify malware solely by its binary signature. On the other hand, studying malware network activity can be very useful for understanding its purpose, or even deriving network-based signatures for detection.
First, much malware nowadays relies on network connectivity to achieve its purpose, so analysis of network communication should be able to provide useful hints about malware purpose. Second, while malware can easily change its binary code, it is less likely to change its network behavior – doing so would lead to additional delays in malware achieving its goal, and it may require additional setup of remote hosts. We believe that studying malware network behavior can provide important insight into malware's purpose.

Malware similarity and genealogy. Another important aspect to study is the identification of similar malware samples. As we hypothesize that most new malware samples are really versions of existing code, it is very likely that we observe similar network activity from different malware samples that are created from the same source with minimal modifications. By studying malware network communication patterns, one can potentially quantify this similarity and hence identify similar malware, or malware variants that may share the same ancestor, which then enables more ways to identify and categorize malware. This can also help focus defensive efforts on prevention or detection of the most popular malware communication patterns.

Malware polymorphism. Advanced malware takes advantage of interpreted engines that ease the process of transforming malware binary code at runtime. Such malware is called polymorphic malware or metamorphic malware [2]. Polymorphic malware is usually wrapped in shell code or encrypted; at runtime it unwraps or decrypts itself and executes. After it finishes, it tries to re-wrap or re-encrypt itself with a different hash or key, to be transformed into different binary code. Metamorphic malware, on the other hand, directly transforms its code into completely different code, while preserving its functionality. This makes it almost impossible to detect metamorphic malware with current signature-based approaches.
Using observations and adequate encodings of malware's network behavior, it can be possible to identify polymorphic and metamorphic malware even when it changes its code.

1.1 Thesis Statement

Analysis of malware network behavior (1) can be done safely through live malware experimentation in fine-grained containment, (2) enables identification of malware purpose via behavior feature embedding, and (3) enables study of malware similarity, genealogy, and polymorphism.

1.2 Demonstrating the Thesis Statement

In this thesis we undertake a careful study of malware network behavior and explore its use for malware classification, detecting similar malware, and detecting polymorphic malware.

First, we develop Fantasm: a safe live-experimentation platform with a fine-grained containment policy. Fantasm runs on DeterLab [6], in an experimental environment consisting of bare-metal machines with programmable network topology configuration. We monitor malware's network behavior from a network vantage point. We design a fine-grained network containment policy that is just open enough to facilitate malware activities, while guaranteeing safety to Internet hosts. We show through analysis of about 3,000 malware samples that Fantasm can capture useful network communications of the malware being analyzed, without causing safety issues. The Fantasm environment is available for public use by DeterLab testbed users. So far it has been used by three research groups, from the National University of Singapore, the University of Colorado Boulder, and the University of Massachusetts Lowell.

Second, we use Fantasm to collect network traces of thousands of malware samples. We then develop two embeddings of network communications, which consist of features that can be used to identify similarities and differences between malware samples. We use one of these encodings – NetDigest – to classify malware into categories used by VirusTotal [62], achieving almost 90% accuracy.
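To make the NetDigest-style pipeline concrete, the sketch below shows the general shape of such a classifier: summarize a sample's network sessions into a fixed-length feature vector, then assign the sample to the nearest category. The specific features and the nearest-centroid classifier here are illustrative assumptions for exposition; the dissertation's actual feature set and classifier are described in Chapter 2 (Table 2.7).

```python
from collections import Counter
import math

def netdigest_features(sessions):
    """Summarize a sample's network sessions into a fixed-length vector.
    The features here (protocol mix, destination diversity, mean payload
    size) are illustrative stand-ins for the real NetDigest features."""
    total = max(len(sessions), 1)
    protos = Counter(s["proto"] for s in sessions)   # missing keys count as 0
    dests = len({s["dst"] for s in sessions})        # distinct destinations
    mean_payload = sum(s["bytes"] for s in sessions) / total
    return [
        protos["DNS"] / total,
        protos["HTTP"] / total,
        protos["ICMP"] / total,
        dests / total,
        math.log1p(mean_payload),                    # compress size range
    ]

def nearest_centroid(vec, centroids):
    """Assign the sample to the closest category centroid (Euclidean)."""
    def dist(a, b):
        return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))
    return min(centroids, key=lambda label: dist(vec, centroids[label]))
```

For example, a sample whose traffic is half DNS lookups and half HTTP downloads would map close to a "downloader"-like centroid, while a sample emitting many unanswered probes to distinct hosts would map elsewhere.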
Third, we show how we can use our second malware embedding for malware similarity analysis. We develop malware similarity measures based on our embedding. We use this similarity metric to cluster the malware samples we analyzed, to identify groups with similar or identical behavior, and to identify the most popular network behavior patterns. Such patterns can be used to develop network-level signatures for malware defense.

Last, we show how we can use our second malware embedding for polymorphic malware analysis. We cluster malware with high mutual similarity, and investigate whether such clusters with the same network behavior also exhibit the same local behavior, and thus could be labeled as polymorphic. We show that using our malware embedding provides an advantage over observing only local behavior, and helps us identify network communication patterns, which could be useful for malware defense.

1.3 Structure of the Dissertation

This dissertation is organized along our four main studies, which perform safe live malware analysis based on network behavior patterns. In Chapter 2, we discuss the development of the Fantasm platform to ensure safe and productive live malware analysis. We present the mechanisms to combat malware evasion and pave the way for malware network behavior analysis. We also study malware behavior by capturing malware network communication, and develop a way to normalize different types of communication patterns into a structure called NetDigest. We show how we can use NetDigest to accurately classify malware. In Chapter 3, we develop a set of network communication patterns that malware can use to achieve its goals (e.g., scanning, file download, file upload, etc.). We use these patterns to identify popular network behaviors of malware and their combinations. In Chapter 4, we develop a malware embedding, which encodes the dynamics and contents of a sample's communication with outside hosts.
Further, we propose a way to mathematically quantify malware similarity using our embedding. We use this embedding and similarity metric to study malware similarity and identify malware genealogy patterns. In Chapter 5, we apply our embedding to the identification of potentially polymorphic malware samples. We compare identification based on network behaviors versus identification based on local behaviors, and show the advantages of our approach. We discuss our contributions in Chapter 6, and then conclude in Chapter 7.

Chapter 2: Safe Live Malware Analysis using Fantasm

In this chapter we introduce Fantasm, a live malware experimentation environment for safe and effective malware analysis, to combat malware evasion and facilitate malware behavior studies.

2.1 Introduction

Malware today evolves at an amazing pace. Kaspersky Lab [28] reports that more than 300,000 new malware samples are found each day. While many have analyzed malware binaries to understand their purpose [13, 5], little has been done on analyzing and understanding malware communication patterns [48, 38]. Specifically, we do not know how much malware needs outside connectivity, and what impact limited connectivity has on malware's functionality. We further do not understand which application and transport protocols are used by contemporary malware, and what the purpose of this communication is. Understanding these issues is necessary for two reasons. First, much malware analysis occurs in full containment, due to legal and ethical reasons. If communication is essential to malware, then analyzing it in full containment makes what defenders observe very different from how malware behaves in the wild. Second, understanding malware communication patterns may be useful for understanding its functionality, even when malware code is obfuscated or encrypted.

We hypothesize that communication may be essential to malware for multiple reasons.
First, contemporary malware is becoming environment-sensitive, and may test its environment before it reveals its functionality [5, 31]. If a constrained environment is detected, malware may modify or abort its behavior. Second, much of malware functionality today relies on a functional network [24, 54]. Malware often downloads binaries needed for its functionality from the Internet, or connects to a command and control channel to receive instructions on its next activity [59]. Without network access, such malware is an empty shell, containing no useful code. Third, malware functionality itself may require network access. Advanced persistent threats [32] and keyloggers collect sensitive information on users' computers, but need network access to transfer it to the attacker. DDoS attack tools, scanners, spam and phishing malware require network access to send malicious traffic to their targets. Without connectivity, such malware will become dormant.

We test our hypothesis by analyzing 2,994 contemporary malware samples, chosen to represent a wide variety of functional behaviors (e.g., key loggers, ransomware, bots, etc.). We analyze each sample under full and under partial containment, for five minutes, and record all network traffic. Our partial containment is designed to carefully allow select malware communication attempts into the Internet, when we believe this is necessary to reveal more interesting behaviors. All traffic is monitored for signs of malicious intent (e.g., DDoS or scanning) and quickly aborted if these are detected. This way we can guarantee safety to the Internet from our experimentation. We find that 58% of samples exhibit some network behavior, and that 78% of these samples exhibit more network behaviors when run under our partial containment than when run under full containment, which means they are environment-sensitive. Most malware samples send DNS, ICMP ECHO and HTTP traffic, and contact obscure destinations rather than popular servers.
The likely purpose of these malware communication attempts is command and control communication, and new binary download. We further show that malware's network behaviors can be used to determine its purpose with 85–89% accuracy. We also show that our partial containment is safe for the Internet. In twelve weeks of running, we have received no abuse complaints and our IP addresses have not been blacklisted.

All the code developed in our work and the materials used in our evaluation are available at our project website: https://steel.isi.edu/Projects/evasive/

2.2 Related Work

In this section, we summarize related work on understanding malware behaviors. Most malware analysis works focus on analyzing system traces and malware binaries [46, 47]. There are fewer efforts on analyzing the semantics of malware's network behavior.

The Sandnet article [48] provides a detailed, statistical analysis of malware's network traffic. The authors give an overview of the popularity of each protocol that malware employs. However, they do not attempt to understand the high-level semantics of malware's network conversations, and this is the contribution we make. Our work also updates results from [48] with communication patterns of contemporary malware. For example, we observe that ICMP ECHO has become the second most popular protocol used by malware.

Morales et al. [38] define seven network activities based on heuristics and analyze malware for the prevalence of these behaviors. Yet this work does not provide insight into a malware sample's purpose (e.g., worm, scanner, etc.), and it may miss behaviors other than those seven select ones. Our work complements this work and covers a richer set of behaviors, composed out of some basic communication patterns discussed in Section 2.5.3.

2.3 Fantasm

In this section, we describe the goals for our Fantasm system, our partial containment rules, and how we ensure safety to the Internet from our experimentation.
2.3.1 Goals

Our goal in designing the Fantasm system was to support safe and productive malware experimentation. Safe means that we wanted to ensure that we do no harm to other Internet hosts with our experiments. Productive means that we wanted to ensure that as many outgoing communication requests launched by malware as possible receive a reply, so that malware may move on to its next activity.

2.3.2 Partial Containment

One could achieve safety in full containment, without letting any traffic out of the environment. But because malware is environment-sensitive, this would not lead to productive experimentation. One could also experiment in an open environment, where all the traffic is let out. But this would not be safe, since the analysis environment could become a source of harmful scans, DDoS attacks and worm infections, which harm other Internet hosts. Due to ethical considerations, no organization would support such analysis for long.

To meet our goals, we decided to experiment with malware in partial containment, where we selectively decide which malware flows to allow to reach into the Internet, based on our assessment of their potential risk to the Internet, which is conformant with the ethical principles for information and communication technology research [19]. We also attempt to handle each outgoing flow in full containment first, by impersonating remote servers and crafting generic replies. This further reduces the amount of traffic we must let out and improves experimentation safety. We now explain how we assessed this risk and how we enforced the containment rules.

Based on a malware flow's purpose, we distinguish between the following flow categories: benign (e.g., well-formed requests to public servers at a low rate), e-mail (spam or phishing), scan, denial of service, exploit, and C&C (command and control). Potential harm to Internet hosts depends on the flow's category.
Spam, scans and denial of service are harmful only in large quantities – letting a few such packets out will usually not cause severe damage to their targets, but it may generate complaints from their administrators. On the other hand, binary and text-based exploits are destructive, even in a single flow. The C&C and benign communications are not harmful and usually must be let out to achieve productive malware experimentation.

The challenge of handling the outside communication with a fixed set of rules lies in the fact that the flow’s purpose is usually not known a priori. For example, a SYN packet to port 80 could be the start of a benign flow (e.g., a Web page download to check connectivity), a C&C flow (to report infection and receive commands for future activities), an exploit against a vulnerable Web server, a scan, or a part of a denial-of-service attack. We thus have to make a decision on how to handle a flow based on incomplete information, and revise this decision when more information is available. Our initial decision depends on how essential we believe the flow is to the malware’s continued operation, how easy it is for us to fabricate responses without letting the flow out of our analysis environment, and how risky it may be to let the flow out into the Internet.

For essential flows whose replies are predictable, we develop generic services that provide these predictable responses and do not allow these flows into the Internet. We call these services “impersonators”. Essential flows whose replies are not predictable, and which are not risky, are let out into the Internet, and closely observed lest they exhibit risky behavior in the future. Non-essential flows, and essential but risky flows, are dropped. Figure 2.1 illustrates our flow handling.

Figure 2.1: Flow handling: how we decide if an outgoing flow will be let out, redirected to our impersonators or dropped. [flowchart omitted: Essential? → Can we fake? → Risky? → Internet / Impersonator / Drop]

Traffic that we let out could be misused for scanning or DDoS if we let it out in any quantity. We actively monitor for these activities and enforce limits on the number of suspicious flows that a sample can initiate. We define a suspicious flow as a flow which receives no replies from the Internet. For example, a TCP SYN to port 80 that does not receive a TCP SYN-ACK would be a part of a suspicious flow. Similarly, a DNS query that receives no reply is a suspicious flow. Suspicious flows will be present if a sample participates in DDoS attacks or if it scans Internet hosts. If the sample exceeds its allowance of suspicious flows, we abort this sample’s analysis.

We summarize our initial decisions and revision rules in Table 2.1.

Goal                       Action    Targeted Services
Elicit malware behavior    Forward   DNS, HTTP, HTTPS
                           Redirect  FTP, SMTP, ICMP ECHO
Restrict forwarded flows   Drop      Other services
                           Limit     Number of suspicious flows

Table 2.1: Flow policies for partial containment.

We consider DNS, HTTP and HTTPS flows as essential and non-risky flows whose replies we cannot fake. We make this determination because many benign and C&C flows use these services to obtain additional malware executables, report data to the bot master, and receive commands. Among our samples, DNS is used by 62%, HTTP by 35%, and HTTPS by 10% of samples (Section 2.4).

We consider FTP, SMTP and ICMP flows as essential flows with predictable replies. We forward these to our corresponding impersonators (Figure 2.1). These are machines in our analysis environment that run the given service, and are configured to provide generic replies to service requests. We redirect ICMP ECHO requests to our service impersonators and fake positive replies. We drop other ICMP traffic.
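The containment logic above – the routing decision of Figure 2.1 combined with the suspicious-flow allowance – could be sketched in Python as follows. This is a hypothetical illustration: the class and function names, the service groupings (taken from Table 2.1), and the allowance value are assumptions, not Fantasm’s actual code.

```python
# Sketch of the Gateway's partial-containment logic (illustrative, not Fantasm's code).

FORWARD = {"DNS", "HTTP", "HTTPS"}          # essential, replies unpredictable, low risk
IMPERSONATE = {"FTP", "SMTP", "ICMP_ECHO"}  # essential, replies predictable

class Gateway:
    def __init__(self, allowance=100):      # the allowance value is an assumption
        self.allowance = allowance          # max unanswered ("suspicious") flows
        self.suspicious = 0

    def decide(self, service):
        """Route one outgoing flow: 'internet', 'impersonator', or 'drop'."""
        if service in IMPERSONATE:
            return "impersonator"           # fake the reply inside the testbed
        if service in FORWARD:
            return "internet"               # let out, but keep observing
        return "drop"                       # non-essential or too risky

    def flow_finished(self, got_reply):
        """Count flows with no Internet reply; False means abort the sample."""
        if not got_reply:
            self.suspicious += 1
        return self.suspicious <= self.allowance
```

A sample that keeps initiating unanswered flows (e.g., a scanner) quickly exhausts its allowance, at which point its analysis is aborted, matching the revision rule described above.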
Our FTP service impersonator is a customized, permissive FTP service that positively authenticates when any user name and password are supplied. This setting can handle all potential connection requests from malware. If malware tries to download a file, we create one with the same extension, such as .exe, .doc, .jpg, and others. We save uploaded files for further analysis. For the SMTP service, we set up an e-mail server that replies with a “250 OK” message to any request. Our ICMP impersonator sends positive replies to any ICMP ECHO request.

2.4 Experimentation Goals, Environment and Design

In this section, we discuss our experimentation goals, environment, and experiment design.

2.4.1 Experimentation Goals

We wanted to observe and analyze communication patterns of malware. This necessitated identification of a relatively recent, representative set of malware binaries and running them in partial containment, while recording their communication. We further needed a way to quickly and automatically restore the “clean state” of machines between malware samples.

2.4.2 Experimentation Environment

We experiment with malware samples in the DeterLab testbed [6]. DeterLab enables remote experimentation and automated setup. An experimenter gains exclusive access and sudoer privileges to a set of physical machines and may connect them into custom topologies. The machines run an operating system and applications of a user’s choice. Experimental traffic is usually fully contained, and does not affect other experiments on the testbed, nor can it get out into the Internet. In our experiments, we leverage a special functionality in the DeterLab testbed, called “risky experiment management”, which allows exchange of some user-specified traffic between an experiment and the Internet. We specify that all DNS, HTTP and HTTPS traffic should be let out. We run malware samples on several machines in a DeterLab experiment, which we will call Inmates.
We hijack the default route on Inmates and make all their traffic to the Internet pass through a special machine in our experiment, called the Gateway. This Gateway implements our partial containment rules. We implement all of the service impersonators on a single physical machine. Each machine has a 3 GHz Intel processor, 2 GB of RAM, one 36 GB disk, and 5 Gigabit network interface cards.

To hide the fact that our machines reside within DeterLab from environment-sensitive malware, we modify the system strings shown in Table 2.2. For example, we replace the default value (“Netbed User”) of “Registered User” with a random name, e.g., “Jack Linch”. Therefore, malware will not detect the existence of DeterLab by searching for such strings.

Key Name         Default in DeterLab     Our Modification
Registered User  “Netbed User”           Random name, e.g., “Jack Linch”
Computer Name    “pc.isi.deterlab.net”   Random name, e.g., “Jack’s PC”
Workgroup        “EMULAB”                “WORKGROUP”

Table 2.2: Minimizing artifacts of DeterLab.

2.4.3 Experiment Design

We run each malware sample under a given containment strategy (full or partial) for five minutes and record all network traffic at the Gateway. After analyzing each malware sample, we must restore Inmates to a clean state. We take advantage of the OS setup functionality provided by DeterLab to implement this function. We first perform certain OS optimizations to reduce the size of the OS image and thus shorten the time needed to load the image when restoring clean state. This modified OS is saved into a snapshot using the disk imaging function of DeterLab. This step takes a few minutes but is carried out only once for our experimentation. Later, whenever we need to restore the system after analyzing a malware sample, we reload the OS image using DeterLab’s os_load command. Our environment could also be used to study the behavior of benign code, but this is outside of the scope of this research.
2.4.4 Malware Dataset

We obtained a recent set of malware samples by downloading 29,319 malware samples between March 4th and March 17th, 2017 from OpenMalware [26]. In order to obtain a balanced dataset, we establish ground truth about the purposes of these samples by submitting their md5 hashes to VirusTotal [62]. We retrieve 28,495 valid reports. Each report contains the analysis results of about 50–60 anti-virus (AV) products for a given sample. We keep the samples that were labeled as malicious by more than 50% of AV products. This leaves us with 19,007 samples.

Categories   Samples     Categories   Samples
Virus        6,126/32%   Riskware     409/2%
Trojan       6,040/32%   Backdoor     197/1%
Worm         4,227/22%   Bot          45/<1%
Downloader   984/5%      Ransomware   17/<1%
Adware       962/5%      Total        19,007

Table 2.3: Concise Tagging of Malware Samples

Concise Tagging. Each AV product tags a binary with a vendor-specific label, for example, “worm.win32.allaple.e”, “trojan.waski.a”, “malicious_confidence_100% (d)”, or just “benign”. As demonstrated in [4], AV vendors disagree not only on which tag to assign to a binary, but also on how many unique tags exist. To overcome this limitation, we devise a translation service that translates vendor-specific tags into nine concise, generic tags, such as worm, trojan, virus, etc. We learn the translation rules by first taking a union of all the tags assigned by the AV products (74,443 in total), and then manually extracting common keywords out of them that signify a given concise category. Finally, we tag the sample with the concise category that is assigned by the majority of the AV products. Table 2.3 shows the breakdown of our samples over our concise tags. We then randomly select 2,994 out of the 19,007 samples, trying to select an equal number of samples from each category, to achieve diversity and form a representative malware set. We continue working with this malware set.
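The concise-tagging step could be sketched as a keyword vote over vendor labels. The keyword list and the first-match rule below are illustrative assumptions; the actual translation rules were learned manually from the full union of AV tags.

```python
# Hypothetical sketch of concise tagging: map vendor-specific AV labels to
# generic categories via keyword matching, then take the majority category.
from collections import Counter

# Illustrative keywords for the nine concise categories of Table 2.3.
KEYWORDS = ["worm", "trojan", "virus", "downloader", "adware",
            "riskware", "backdoor", "bot", "ransom"]

def concise_tag(av_labels):
    """Translate a list of vendor labels into the majority concise category."""
    votes = Counter()
    for label in av_labels:
        low = label.lower()
        for kw in KEYWORDS:
            if kw in low:
                votes[kw] += 1
                break               # count each vendor label at most once
    return votes.most_common(1)[0][0] if votes else None
```

For instance, a sample labeled “worm.win32.allaple.e” by most vendors would receive the concise tag worm, while a sample with no matching keywords (e.g., all vendors report “benign”) would receive no tag.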
2.5 Results

In our evaluation, out of the 2,994 malware samples in our malware set, 1,737 samples exhibited some network activity during a run. The remaining samples may be dormant, waiting for some trigger, or may simply exhibit too low a communication frequency for us to observe given our experiment duration (5 minutes).

2.5.1 Partial Containment Exposes More Malware Behavior

We measure the quantity of observable malware behavior by counting the number of network flows recorded during experimentation. Out of the 1,737 samples that exhibit any network behavior, 1,354 (78%) generate more flows under partial containment than under full containment. This supports our hypothesis that network connectivity is essential for malware functionality, and that most malware samples are environment-sensitive.

The 1,737 samples that exhibit network behavior generated 9,304,083 outgoing flows during our 5-minute experimentation interval. Out of these 9,304,083 flows, our impersonators could fake replies to 9,270,831 (99.64%) of them. We had to let 2,295 flows (0.02%) out into the Internet because we could not fake their replies and they were deemed essential. Finally, 30,957 flows (0.33%) were dropped because we did not have an impersonator for their protocol, and they were deemed too risky to be let out. We hope to develop more impersonators in the future, and thus further reduce risk to the Internet.

As a proof of how safe our experimentation was, during the twelve weeks that we ran, we received no abuse complaints. We also analyzed 203 IP blacklists from 56 well-known maintainers (e.g., [33]), which contain 178 million IPs and 34,618 /16 prefixes for our experimentation period. Our external IP was not in any of the blacklists, which further supports our claim that no harmful traffic was let out.

2.5.2 Malware Communication Patterns

Table 2.4 shows the top 12 application protocols used by our malware dataset.
Protocols   Samples    Protocols  Samples
DNS         1081/62%   1042       65/4%
ICMP echo   818/47%    799        33/2%
HTTP        600/35%    6892       25/1%
65520       237/14%    11110      17/1%
HTTPS       173/10%    11180      17/1%
SMTP        75/4%      FTP        12/1%

Table 2.4: Top 12 application protocols used by malware, and the number and percentage of samples that use them.

DNS is used by 62% of samples, and its primary use seemed to be resolving the IPs of the domains that malware wishes to contact. ICMP was used by 47% of samples, likely to test reachability, either to detect if malware is running in a contained environment or to identify live hosts that may later be infected, if vulnerable. HTTP (35% of samples) and HTTPS (10% of samples) are likely used to retrieve new binaries, as we find many of these connections going out to file-hosting services. Port 65520 is mostly used by a virus that infects executable files and opens a back door on the compromised computers. The SMTP protocol is used to spread spam.

Samples in our malware dataset queried a total of 5,548 different domains, among which zief.pl (14%) and google.com (11%) are the most popular. We query these domains from alexa.com∗, which has records for 341 (6%) domains, as shown in Figure 2.2. We find that only 1% of the domains have ranks lower than 10,000, 5% have higher ranks, and 94% of domains are not recorded by alexa. For the domains whose rank is lower than 10,000, most are web portals, such as YouTube, and many are file storage services, like Dropbox. We manually check 20 domains that have no record in alexa, and none had a valid DNS record. This suggests that malware may use portal websites either to test network reachability or for file transfer, and it may use private servers for file transfer or for C&C communication. We classify the queried names based on their top-level domain, e.g., .com or .net. We find a total of 72 distinct top-level domains, as shown in Figure 2.3.
The top 3 of these domains are shown in Table 2.5.

∗ In our future work we will look to use a more robust representation of popular domains, like that proposed by Metcalf et al. in [35].

Figure 2.2: Ranks of domains from alexa.com. [plot omitted]

Figure 2.3: Popularity of top-level domains in our observed malware communications. [plot omitted]

Top-level  Samples    Second-level      Samples
.com       540/31%    google.com        187/35%
                      msrl.com          73/14%
                      ide.com           73/14%
.pl        293/17%    zief.pl           244/83%
                      brenz.pl          26/9%
                      ircgalaxy.pl      22/8%
.net       235/14%    secureserver.net  73/31%
                      surf.net          68/29%
                      aol.net           65/28%

Table 2.5: Popularity of domains in malware DNS queries.

The .com is the most popular top-level domain, which is queried by 540 (31%) samples. The third column in Table 2.5 shows the top 3 queried domains in each top-level category. These domains contain 53 country codes, with Poland, Germany, and the Netherlands being the top three countries. This means that malware in our dataset predominantly targeted European victims.

2.5.3 Summarizing Malware Communication

We now explore how to summarize malware communication so we can further investigate common patterns in how malware uses the Internet. Our goal was to create a concise and human-readable digest of malware’s communication starting from recorded tcpdump logs. We call this representation NetDigest.

We start by splitting a malware’s traffic into flows based on the communicating IP address and port number pairs, and the transport protocol. We call each such flow a “session”. Then, for each session, we extract the application protocol employed and devise a list of {attribute: value} pairs for this protocol, as shown in Table 2.6. The first row of Table 2.6 shows the information that we will extract for all types of application protocols.
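The session-splitting step described above could be sketched as follows. The per-packet dict representation and its field names are assumptions for illustration, not the actual NetDigest code.

```python
# Sketch of splitting a packet trace into "sessions" keyed by the remote
# endpoint, local port, and transport protocol, then ordering sessions by
# their first timestamp as NetDigest does. (Illustrative representation.)
from collections import defaultdict

def split_sessions(packets):
    """Group packets into sessions; return sessions sorted by first timestamp."""
    sessions = defaultdict(list)
    for pkt in packets:
        key = (pkt["remote_ip"], pkt["remote_port"],
               pkt["local_port"], pkt["proto"])
        sessions[key].append(pkt)
    return sorted(sessions.values(), key=lambda s: min(p["ts"] for p in s))
```

Each resulting session would then be summarized into the per-protocol {attribute: value} list of Table 2.6.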
For example, “LocalPort” denotes the local IP port used by malware, which is an integer. This attribute appears only once for a single session, and is derived from the definition of a session. “NumPktSent” means the total number of packets sent by malware in an individual session. “PktSentTS” is a list of the Unix epoch times of all the packets sent by malware. Finally, we also maintain a list of each packet’s payload size.

The DNS protocol has one attribute, “Server”, which has the value of the IP address that the query is sent to. For the domain queried by malware, the QueryType can be an address record (A), mail exchange record (MX), pointer record (PTR), or others. For the response sent back by the DNS server, we first save its canonical name, if any, in a CNAME field. Then, we extract the response type and corresponding values and assign them to the ResponseType field.

For an HTTP or FTP session, we first take note of the server’s IP address in the ServerIP field. Then, we use boolean values to denote if this session is initiated by malware (“Proactive”) and if malware receives any response from the Internet host (“GotResponse”).

Protocol   [Attribute: Value]
All        [LocalPort: integer]‡, [NumPktSent: integer]‡, [NumPktRecv: integer]‡, [PktSentTS: float_list]‡, [PktRecvTS: float_list]‡, [PayloadSize: integer_list]†
DNS        [Server: IP_address]‡, [QueryType: string]‡, [CNAME: string]‡, [ResponseType: IP_address list]‡
HTTP/FTP   [Server: IP_address]‡, [Proactive: boolean]‡, [GotResponse: boolean]‡, [Code: integer]‡, [Download: file_type]†, [Upload: file_type]†
SMTP       [Server: IP_address]‡, [EmailTitle: string]∗, [Recipients: string]∗, [BodyLength: integer]∗, [ContainAttachment: boolean]∗, [AttachmentType: string]∗
ICMP       [RequestIP: IP_address]∗, [NumRequests: integer]∗

‡ Occurs exactly once. † May have zero or more occurrences. ∗ Has at least one occurrence.

Table 2.6: NetDigest of a session.
If the outside server replies to malware, we classify the following packets as “Download” or “Upload” based on the direction of the bulk volume of data. We also extract the file type being transferred. For an SMTP message, we extract the server IP address, e-mail title, recipients, and body length. We also use a boolean value to note whether the message has an attachment and save the attachment’s file type in a string. For the ICMP protocol, we extract the destination IP address into the RequestIP field. We also save the number of requests in the NumRequests field.

After we build the lists of attribute-value pairs for all the sessions produced by a malware sample, we sort the lists based on their first timestamps. The final, sorted list of session abstractions is called the NetDigest.

1488068895.052901: DNS - [Server: 10.1.1.3], [A: ic-dc.deliverydlcenter.com], [CNAME: N/A], [A: 52.85.83.81, 52.85.83.112, 52.85.83.132, 52.85.83.4, 52.85.83.96, 52.85.83.56, 52.85.83.32, 52.85.83.37]
1488068895.154335: HTTP - [Server: 52.85.83.81], [Proactive: True], [GotResponse: True], [Download: blob], [Download: .png], [Download: blob]
1488068895.948346: HTTP - [Server: 52.85.83.81], [Proactive: True], [GotResponse: True]
1488068896.767094: DNS - [Server: 10.1.1.3], [A: www.1-1ads.com], [CNAME: n135adserv.com], [A: 212.124.124.178]
1488068896.977464: HTTP - [Server: 212.124.124.178], [Proactive: True], [GotResponse: True]
1488069110.044756: DNS - [Server: 10.1.1.3], [A: ic-dc.deliverydlcenter.com], [CNAME: N/A], [A: 52.85.83.56, 52.85.83.112, 52.85.83.96, 52.85.83.37, 52.85.83.81, 52.85.83.4, 52.85.83.132, 52.85.83.32]
1488069110.049507: DNS - [Server: 10.1.1.3], [A: ic-dc.deliverydlcenter.com], [CNAME: N/A], [A: 52.85.83.32, 52.85.83.37, 52.85.83.56, 52.85.83.112, 52.85.83.96, 52.85.83.132, 52.85.83.4, 52.85.83.81]
1488069110.338822: HTTP - [Server: 52.85.83.81], [Proactive: True], [GotResponse: False]
1488069110.342816: HTTP - [Server: 52.85.83.81], [Proactive: True], [GotResponse:
False]
1488069131.273458: HTTP - [Server: 52.85.83.112], [Proactive: True], [GotResponse: False]
1488069131.277206: HTTP - [Server: 52.85.83.112], [Proactive: True], [GotResponse: False]
1488069152.304031: HTTP - [Server: 52.85.83.132], [Proactive: True], [GotResponse: False]
1488069152.308025: HTTP - [Server: 52.85.83.132], [Proactive: True], [GotResponse: False]
1488069173.334854: DNS - [Server: 10.1.1.3], [A: ic-dc.deliverydlcenter.com], [CNAME: N/A], [A: 52.85.83.32, 52.85.83.132, 52.85.83.96, 52.85.83.81, 52.85.83.4, 52.85.83.56, 52.85.83.112, 52.85.83.37]
1488069173.338605: DNS - [Server: 10.1.1.3], [A: ic-dc.deliverydlcenter.com], [CNAME: N/A], [A: 52.85.83.32, 52.85.83.132, 52.85.83.112, 52.85.83.56, 52.85.83.81, 52.85.83.4, 52.85.83.37, 52.85.83.96]
1488069173.381571: HTTP - [Server: 52.85.83.4], [Proactive: True], [GotResponse: False]
1488069173.383566: HTTP - [Server: 52.85.83.4], [Proactive: True], [GotResponse: False]

Figure 2.4: Example NetDigest (md5: 0155ddfa6feb24c018581084f4a499a8).

One sample NetDigest is shown in Figure 2.4, for a sample tagged as Trojan by AV products. At the beginning, this sample queries a domain (ic-dc.deliverydlcenter.com) using the default DNS server that is part of our impersonator set. Our DNS server acts as a recursive resolver and obtains and returns the actual mapping. Then, this sample downloads a picture and blob files from the first IP address returned. However, for the remaining Internet hosts, this sample just establishes connections with them but does not download or upload any information. For example, the second domain (www.1-1ads.com) suggests that it is an advertising website, but no payload is downloaded from this website (session starting at timestamp 1488068896.977464). In addition, some IPs were unreachable at the time of our execution, such as 52.85.83.112.

2.5.4 Classifying Malware by Its Network Behavior

We now explore if unknown malware could be classified based on its communication patterns.
Current malware classification relies on binary analysis. Yet this approach has a few challenges. First, malware may use packing or encryption to obfuscate its code, thus defeating binary analysis. Second, malware may be environment-sensitive and may not exhibit interesting behavior and code if run in a virtual machine or debugger, which are usually used for binary analysis. We thus explore malware classification based on its communication behavior, reasoning that malware may obfuscate its code but must exhibit certain key behaviors to achieve its basic functionality. For example, a scanner must scan its targets and cannot significantly change this behavior without jeopardizing its functionality.

In our classification, we divide our malware set into a training and a testing set. We then apply machine learning to learn associations on the training set between some features of malware communication, which we describe next, and our concise labels denoting malware purpose. Finally, we attempt to classify the malware in the testing set and report our success rate.

Extracting Features. We start with 83 select features, extracted out of the malware’s NetDigest, as shown in Table 2.7. We abstract malware’s network traffic into four broad categories: Packet, Session, Protocol, and Content.

We divide the Packet category into three subgroups: Header, Payload, and Statistics. In the Header subgroup, we count the number of distinct IPs that a sample’s packets have been sent to. In addition, we also look up the geographical locations of the IPs in the GeoLite [34] database, including the countries and continents they reside in. We chose these features because it is known that certain classes of malware target Internet hosts in different countries. In the Payload subgroup, we calculate the total size of the payload in bytes.
Furthermore, we compute the following statistics for both sent and received volume, the packet counts, and the packet timing: minimum, maximum, mean, and standard deviation.

Categories  Subgroups   Features (83 in total)
Packet      Header      Distinct number of: IPs, countries, continents, and local ports
            Payload     Total size in bytes; sent/received: total number, minimum, maximum, mean, and standard deviation
            Statistics  Sent/received packets: total number, rate; sent/received time interval: min, max, mean, and standard deviation
Session     Direction   Proactive (initiated by malware) or passive (initiated by Internet servers)
            Result      Succeeded or failed
            Statistics  Total number of SYNs sent; number of sessions per IP: minimum, maximum, mean, and standard deviation
Protocol    DNS         Number of distinct domains queried by malware
            HTTP        Number of replies received per reply code: 200, 201, 204, 301, 302, 304, 307, 400, 401, 403, 404, 405, 409, 500, 501, 503; method: GET, POST, HEAD
            ICMP        Total number of packets; number per IP: min, max, mean, and standard deviation
            Other       Ports: total number of distinct ports, top three used
Content     Files       php, htm, exe, zip, gzip, ini, gif, jpg, png, js, swf, xls, xlsx, doc, docx, ppt, pptx, blob
            Host info   OS id, build number, system language, NICs
            Registry    Startup entries, hardware/software configuration, group policy
            Keyword     Number of: “mailto”, “ads”, “install”, “download”, “email”

Table 2.7: Features extracted from a malware’s NetDigest for classification purposes.

For the Session category, we consider all packets that are exchanged between malware and a single IP address. We divide these packets into different sessions according to the local ports used by malware. For each session, we determine if its direction is proactive or passive, depending on whether the malware initiates the session or not. We say the Result of a session is successful if malware initiates the session and receives any response from the host.
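As one example of the Statistics features above, the minimum, maximum, mean, and standard deviation of inter-packet gaps for one direction of a session might be computed as follows. The zero-padding for sessions with fewer than two packets is an assumption for illustration.

```python
# Sketch of timing statistics over one direction of a session:
# min/max/mean/stdev of the gaps between consecutive packet timestamps.
import statistics

def timing_features(timestamps):
    """Return (min, max, mean, stdev) of inter-packet gaps, or zeros if too few."""
    if len(timestamps) < 2:
        return (0.0, 0.0, 0.0, 0.0)       # not enough packets to form a gap
    gaps = [b - a for a, b in zip(timestamps, timestamps[1:])]
    return (min(gaps), max(gaps),
            statistics.mean(gaps),
            statistics.pstdev(gaps))      # population stdev over observed gaps
```

The same pattern applies to the payload-size and packet-count statistics of Table 2.7.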
We further calculate the number of TCP SYN packets, which can be used to detect SYN flood attacks. We also record the number of sessions per IP, which can be useful to further establish communication purpose. For example, in our evaluation, we find that one sample launches one short session with the first IP and then initiates multiple sessions with the second one for download. This network behavior indicates that the first IP serves as a master, directing the malware sample to the second, which acts as a file server.

For the Protocol category, we extract features for different types of application protocols. For example, for DNS we summarize the number of distinct domains queried by malware in their DNS query and response packets. For HTTP, we count the number of packets carrying specific HTTP status codes, such as 200 (OK). Some malware samples behave differently based on the returned status code. For ICMP, we calculate the minimum, maximum, average, and standard deviation of packet counts. For non-standard IP ports, we maintain a set of distinct port numbers and calculate the top three ports targeted by each malware sample.

For the Content category, we investigate the payload content carried in HTTP packets, because this is the top application protocol used by malware in our experiments. We use regular expressions to extract files from hyperlinks in HTTP content, and interpret their extensions. Sometimes the content is binary, and we tag it as blob. We also attempt to identify, using regular expressions, if the payload contains host information and Windows registry entries that are typically reported to bot masters. Finally, we collect the frequencies of select keywords that may indicate a malware purpose, such as “ads”.

Classification Results. We investigate three popular classification methods from machine learning – decision trees [14], support vector machines [23], and multi-layer perceptrons [50]. We implement these algorithms and standard data pre-processing (data scaling and feature selection) through the Scikit Python package [41]. We use 80% of this data set for training and the remaining 20% of samples for testing. The results are shown in Table 2.8.

Algorithms              Rank 1    Rank 2    Rank 3
Decision Tree           242/89%   257/94%   259/95%
Support Vector          231/85%   259/95%   265/97%
Multi-layer Perceptron  231/85%   257/94%   262/96%

Table 2.8: Classification results: Rank 1 – our label was the top label assigned by AV products; Rank 2 – our top label was in the top 2 labels assigned by AV products; Rank 3 – our top label was among the top 3 assigned by AV products.

Since malware today has very versatile functionality, it may be possible that a sample exhibits behavior that matches multiple labels. We denote as “Rank 1” the case when our chosen label matches the top concise label chosen by the majority of AV products. When it matches one of the top two labels, we denote this as “Rank 2”, and if it matches one of the top three labels, we denote it as “Rank 3”. Our Rank 1 success rate ranged from 85% (support vector machines and multi-layer perceptrons) to 89% (decision trees), which is very good performance. When we allow for a match with either of the top two labels (Rank 2), our success rate climbs to 94–95%. And if we count a match with any of the top three labels as a success (Rank 3), our rate climbs to 95–97%. Based on the typical performance of applying machine learning techniques in malware analysis [42], we conclude that our NetDigest representation can lead to very accurate malware classification, based only on observed communication patterns.

Figure 2.5: Classification precision as the number of sessions grows. [plot omitted]
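The Rank-k criterion of Table 2.8 can be sketched as a small scoring function. The data shapes here – one predicted concise label per sample, and an ordered list of the AV products’ top labels – are illustrative assumptions.

```python
# Sketch of the Rank-k success metric: a prediction counts as correct if it
# appears among the top k concise labels assigned by the AV products.

def rank_k_accuracy(predictions, av_top_labels, k):
    """Fraction of samples whose predicted label is in the AVs' top-k labels."""
    hits = sum(1 for pred, tops in zip(predictions, av_top_labels)
               if pred in tops[:k])
    return hits / len(predictions)
```

Evaluating the same predictions at k = 1, 2, 3 yields the three columns of Table 2.8.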
We further investigated the root causes of our misclassifications in Rank 1 that later became successes under the Rank 2 or Rank 3 criteria. Toward this goal, we manually examined pcap traces of the related samples. We find that all these samples exhibit limited network behavior that was not sufficient for classification. For example, one sample queries a domain and then establishes a connection with the HTTP server. However, no payload is downloaded or uploaded, and thus this behavior may match any malware category.

To investigate the relationship between classification accuracy and the number of sessions observed in malware communication, we perform several iterations of the classification experiment. In each iteration, we filter out samples that launched fewer than N sessions. We then divide the remaining samples into training and testing sets in an 80%/20% ratio, train on the training set, perform the classification on the testing set, and report the success rate. We vary N from 1 to 25. The evaluation results are shown in Figure 2.5. The x-axis of Figure 2.5 denotes our limit on the number of sessions in a given run, N, and the y-axis shows the classification success rate for each algorithm, corresponding to our Rank 1 criterion, on the testing set. Figure 2.6 shows the number of samples that generated N or fewer sessions in the training and the testing set together. Overall, all three classification methods performed well and were stable, except for the multi-layer perceptron when the session quantity is between 5 and 8. After investigating these sessions, we found that they do not have enough distinguishing feature values for the multi-layer perceptron algorithm. The small variance of the input is further reduced by the intermediate calculations (hidden layers) of the algorithm [41].

Figure 2.6: Number of samples as the limit on the number of sessions grows. [plot omitted]
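The per-threshold experiment described above could be sketched as a loop over N. Here `train_and_score` stands in for the actual train-and-evaluate step, and the sample representation is an assumption for illustration.

```python
# Sketch of the session-count experiment: for each threshold N, keep only
# samples with at least N sessions, split 80%/20%, and record accuracy.
# (Illustrative; the real experiment trains a classifier in train_and_score.)

def run_threshold_experiment(samples, train_and_score, max_n=25):
    """samples: list of (num_sessions, features, label); returns {N: score}."""
    results = {}
    for n in range(1, max_n + 1):
        kept = [s for s in samples if s[0] >= n]   # filter by session count
        split = int(0.8 * len(kept))               # 80%/20% train/test split
        results[n] = train_and_score(kept[:split], kept[split:])
    return results
```

In the real experiment the score would be the Rank 1 success rate of the trained classifier on the testing portion.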
The classification success rate increased slightly as the limit on the number of sessions increased, from 88% at 1 session to 93% at 25 sessions. Thus longer observations increase classification accuracy, but not by a lot.

2.6 Conclusions

In this work, we investigate how essential Internet connectivity is for malware functionality. We find that 58% of diverse malware samples initiate network connections within the first five minutes, and that 78% of these samples become dormant in full containment. We further provide a breakdown of popular communication patterns and some evidence as to the purpose of these communications. Finally, we show that malware communication behaviors can be used for relatively accurate (85–89%) inference of a sample’s purpose. As future work, we will extend our framework to include analysis of system-level activities for better understanding of a malware’s purpose, and will seek to improve our generic impersonators to further reduce the cases when traffic must be let outside of the analysis environment.

Chapter 3

Malware Behavior Analysis Using High-Level Behavior

In this chapter, we introduce malware embedding for studying malware similarity, and categorize popular malware behaviors.

3.1 Introduction

The contemporary state of Internet security is grave, thanks to the proliferation of malware. Malware is an increasing threat, not only because of its increased capability and sophistication, but also because of its stealth, aimed at avoiding detection and analysis.

Because anti-virus suites usually apply signature-based malware detection, malware designers have invested enormous effort into changing their binaries to avoid detection. From simple instruction transformation, such techniques have evolved through junk code injection and code obfuscation to polymorphic and metamorphic engines that can transform malware code into millions of variants [51]. Such obfuscation techniques have greatly undermined traditional signature-based malware detection methods.
We simply cannot keep pace with proliferating malware variants. Notable attempts have been made by researchers to improve signature analysis by running AV suites in the cloud [63], detecting encryption decoders [44], detecting runtime signatures for polymorphic malware [66], etc. However, as new obfuscation techniques emerge, signature analysis alone cannot save us.

Besides static analysis, researchers have used dynamic analysis to overcome code obfuscation. This includes tracking disk changes, analyzing dynamic call graphs, as well as monitoring malware execution using debuggers and virtual machines. However, modern malware is often equipped with anti-debugging and anti-virtualization capabilities [52, 53], making its analysis in a controlled environment difficult. Yet a controlled environment is needed to observe a malware's actions on the compromised host, and record sufficient details for analysis.

To complement contemporary static and dynamic code analysis, we propose to study malware behavior by observing and interpreting its network activity. Based on our prior research [18], we believe that network access is essential for malware to achieve its purpose. We also believe that it would be difficult for malware to significantly alter its network behavior and still achieve its purpose. Studying network behavior thus may offer an opportunity both to understand what malware is trying to do in an analysis environment, and to detect malware on compromised hosts. In this thesis we pursue the first direction, by investigating how we can infer the malware's ultimate purpose by analyzing its network activity.

Live malware analysis, with unrestricted network access, is risky, because malware may inflict damage on other Internet hosts during analysis, and we would become unwitting accomplices. We leverage our Fantasm system for safe and productive live malware analysis [18] to minimize this risk.
Just having network traffic records of a malware run is not enough to understand a malware's purpose, because such data is very rich and very unstructured. To overcome this challenge, we first extract a set of features from network traffic records, and then devise patterns that help us group these features into higher-level behaviors, such as file download, e-mail sending, scanning, etc. Next we use a set of these high-level behaviors to label malware by its purpose. We note that malware could have more than one purpose. For example, it could both scan for vulnerable hosts and send unsolicited emails. Thus a malware sample could have more than one label.

We perform our analysis on 999 malware samples, randomly selected from the Georgia Tech Apiary project [1]. We observe that the most common activities are scanning, propagating, and downloading. Interestingly, of all samples that show more than one behavior, scanning and propagating show up most frequently together, which suggests these activities usually work in combination and complement each other.

This chapter is structured as follows. We discuss the background of this work and contemporary research in Section 3.2. In Section 3.3 we explain our methodology. We define our classification criteria in Section 3.4, and describe our experiment setup and evaluation in Section 3.5. We discuss the use of the results and possible future work in Section 3.6 and conclude in Section 3.7.

3.2 Background and Related Work

In this section we discuss contemporary malware analysis methods and the rationale behind our research methods, and also survey related work.

3.2.1 Static Analysis

Various static analysis methods like CTPL [29], Generic Virus Scanner [12], etc. have been used to analyze the binary code of malware without running it, and identify portions that could be used for signature generation. These portions of code exhibit malicious activity, and are unlikely to be present in benign binaries.
The identified portion is then synthesized to become the signature of this kind of malware. Once the signature is obtained, researchers can build anti-virus tools to find the same kind of malware in the future by scanning all suspicious binaries.

Signature-based malware detection has been the most widely used method and has been quite successful. However, malware designers have also been working on countermeasures over the years to undermine such techniques. From junk code generation, malware encryption, and oligomorphism, to polymorphic and metamorphic malware [51], such techniques have evolved significantly against signature-based detection techniques. On the other hand, researchers have also been improving signature detection methods, such as detecting decryption routines, runtime signature detection, etc. Overall, this is a losing race for the researchers, as it will always be cheaper to generate new malware variants than to analyze them. We can become faster, but we can never win. Our proposed direction shows promise because malware cannot change its network behavior with as much freedom as it can change its code. We thus believe that we will be able to analyze and detect those malware samples that static analysis misses.

3.2.2 Dynamic Analysis on Host

Instead of looking for a binary signature to detect malware presence, dynamic analysis builds a signature of a malware's interaction with its host. This signature may include memory access and file access patterns, as well as system call patterns. Those patterns that are likely to be present in malware but not in benign binaries can be used to develop a behavioral signature for malware detection [20].

Dynamic analysis complements static analysis, and is able to overcome malware code obfuscation. Willems et al. [64] proposed CWSandbox, which combines static and dynamic techniques for analyzing malware on a contained host. Guarnieri et al.
[22]'s Cuckoo sandbox follows similar concepts as CWSandbox and additionally provides a network packet sink. Both works utilize virtual machines to isolate the study environment from the host. However, stealthy malware has another set of techniques to evade dynamic analysis – it detects debuggers and virtual machines, which are often used to speed up and facilitate dynamic analysis, and modifies its behavior. This leads to incorrect or unusable signatures of stealthy malware. Chen et al. [11] found that 39.9% and 2.7% of 6,222 malware samples exhibit anti-debugging and anti-virtualization behaviors respectively. In 2012, Branco et al. [7] analyzed 4 million samples and observed that 81.4% of them exhibited anti-virtualization behavior and 43.21% exhibited anti-debugging behavior.

Our work helps analyze malware that would otherwise be missed by static and dynamic analysis, because it relies on observations of malware's network behavior. Malware can be run on bare metal machines, thus evading detection of the analysis environment.

3.2.3 Dynamic Analysis of Network Behavior

There are a few efforts on analyzing the semantics of malware network behavior. Sandnet [49] provided a detailed, statistical analysis of malware network traffic. The authors surveyed protocol popularity employed by malware. However, they did not go further to understand the semantics of the network communication, while we do.

Morales et al. [38] studied seven network activities, selected using heuristics. These activities could be used to identify malicious activities. Morales et al. also report on the prevalence of those behaviors in contemporary malware. Their chosen activities are not as useful for understanding a malware sample's purpose or type, and are very limited. We complement this work by extending the set of activities being observed. Nari et al.
[40] proposed an automated malware classification system also focusing on malware network behavior, which generates protocol flow dependency graphs based on the IP addresses being contacted by malware. Our work improves this effort by systematizing detection of different malware behaviors using different network behavior patterns, and different network traffic contents.

3.3 Capturing Malware Network Behavior

With the prevalence of high-speed Internet access, malware has begun to rely more and more on the network to achieve its ultimate purpose [24, 54]. Malware often downloads binaries needed for its functionality from the Internet, or connects into a command and control channel to receive instructions on its next activity [25]. Advanced persistent threats [57] and keyloggers collect sensitive information on users' computers, but need network access to transfer it to the attacker. DDoS attack tools, scanners, spam and phishing malware require network access to send malicious traffic to their targets.

We propose to study malware's network activities because they have become essential for modern malware. The first step in this study includes capturing malware's network traffic in an environment which is transparent to malware but also minimizes risk to Internet hosts.

3.3.1 Analysis Environment: Fantasm

In our prior work, we have developed a semi-contained network experiment environment, called Fantasm [18]. Fantasm runs on a testbed with full Internet access, and carefully constrains this access to achieve productive malware analysis and minimize risk to outside hosts. In our analysis, we run malware on a bare metal Windows host, without any virtualization or debugger. We capture and analyze all network traffic between this machine and the outside using a separate Linux host, sitting in between the Windows host and the Internet. Both hosts are controlled by our Fantasm framework.
Fantasm makes decisions on which communications to impersonate, i.e., intercept and answer itself, which to forward, and which to drop. This decision is made by taking into account each outgoing flow separately, making an initial decision, and revising it later if subsequent observations require this. We define a flow as a unique combination of destination IP address, destination port, and protocol. Each flow is initially regarded as non-essential, and it is dropped. If this leads to the abortion of malware activity, we stop our analysis, restart it, and regard that specific flow as essential. Fantasm then considers if it can fake replies to this outgoing connection in a way that would be indistinguishable from the actual replies, should the flow be allowed into the Internet. If Fantasm has an impersonator for the given destination and the given service, it will intercept the communication and fake the response. Otherwise, it will evaluate if the outgoing flow may be risky to let out. If so, the flow will be dropped. Otherwise, it will be let out into the Internet. Table 3.1 illustrates the criteria we use to determine if a flow is risky or not, and if it can be impersonated. Note that even flows considered not risky and let out are still subject to further monitoring and may be aborted if they misbehave.

Flows which are let out could become part of scanning or DDoS. To minimize the risk of unwitting participation in attacks, we actively monitor for these activities and enforce limits on the number of suspicious flows that a sample can initiate. We define a suspicious flow as a flow which receives no replies from the Internet. Many scan and DDoS flows will be classified as suspicious. If the analyzed malware sample exceeds its allowance of suspicious flows (10 in our current implementation), we abort its analysis.
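The per-flow decision and the suspicious-flow allowance described above can be sketched as follows. This is an illustrative sketch, not Fantasm's implementation: the flow representation, the class name, and the simplification that the protocol alone determines the decision (omitting the initial drop-and-retry test for essential flows) are assumptions.

```python
# Policy table mirroring Table 3.1: protocol -> action for an outgoing flow.
RISK_POLICY = {
    "DNS": "forward", "HTTP": "forward", "HTTPS": "forward",  # not risky
    "FTP": "impersonate", "SMTP": "impersonate",
    "ICMP_ECHO": "impersonate",                               # risky, fakeable
}

SUSPICIOUS_FLOW_LIMIT = 10  # allowance before the sample's analysis is aborted


class FlowPolicy:
    """Sketch of Fantasm's per-flow decisions.
    A flow is a (dst_ip, dst_port, protocol) tuple."""

    def __init__(self):
        self.suspicious = 0

    def decide(self, flow):
        _, _, proto = flow
        # Risky protocols without an impersonator are dropped.
        return RISK_POLICY.get(proto, "drop")

    def record_no_reply(self):
        """Called when a forwarded flow receives no reply from the Internet.
        Returns False once the suspicious-flow allowance is exceeded,
        signaling that the sample's analysis should be aborted."""
        self.suspicious += 1
        return self.suspicious <= SUSPICIOUS_FLOW_LIMIT
```

In this sketch a caller would invoke `decide` for each new outgoing flow and `record_no_reply` for each forwarded flow that stays unanswered, aborting the run when the latter returns False.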
  Service or Protocol             Label
  DNS, HTTP, HTTPS                Not risky
  FTP, SMTP, ICMP_ECHO            Risky, can be impersonated
  Other services or protocols     Risky, cannot be impersonated

Table 3.1: Flow policies in Fantasm

3.3.1.1 Impersonators

To handle simple outgoing requests with predictable replies, we built dummy services that can produce these replies within our analysis environment, called impersonators. Several protocols used by popular services are good candidates to be handled by impersonators, such as ICMP, SMTP, and FTP. ICMP_ECHO messages have a predictable reply, which can be either positive or negative. Our impersonator provides a positive reply to each ICMP_ECHO request. SMTP actions by malware are usually used for spamming purposes, and require us only to fake a successful receipt. We do this by setting up an email server which replies with a "250 OK" message to any request. Our FTP service impersonator is a customized, permissive FTP service which accepts any user name and password combination.

In addition to impersonating servers for these three types of traffic, we set up our Fantasm host to act as a DNS caching proxy for all DNS requests by malware. This enables us to observe malware's DNS traffic and to cache replies, thus again minimizing outgoing requests.

3.4 High-level Behaviors

To analyze captured network traces, and reason about malware behavior, we need to impose some structure on trace data. We do this by identifying certain network behavior patterns which we believe will be commonly used for a given malware purpose. We call these patterns high-level behaviors. We seek to identify behaviors which are common for the following malware purposes:

• Downloading. This type of malware activity is prevalent and usually occurs immediately upon infection, since many malware samples are just empty downloaders to defeat static and dynamic analysis [3].

• Reporting.
This type of activity usually suggests that malware samples are submitting information gathered from the victim to their command & control server, as well as requesting more information from it.

• Scanning. This is usually used to collect vulnerable host identities, to use as potential victims in future attacks.

• Spamming. Such malware samples send spam emails that usually contain phishing materials like malicious URLs or attachments.

• C&C communication. Malware samples maintain a connection with certain public services, such as IRC, to receive C&C commands.

• Propagating. This type of malware takes advantage of file transfer protocols (e.g. FTP, Samba, NFS, etc.) to propagate binaries to other hosts.

Table 3.2 summarizes the high-level behaviors which we use to infer the above mentioned malware purposes.

3.4.1 High-level Behavior Signature

Fantasm extracts many features from network traces. These features could be used by machine learning to learn how to classify malware behaviors into high-level categories, but this would require manual labeling of many samples. Instead, we can use a simpler set of parameters and behavior patterns to achieve identification of high-level behaviors in a deterministic fashion.
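A deterministic rule-based mapping of this kind might look like the following sketch. The flow feature names (`proto`, `dst_port`, `method`, `syn_only`, etc.) are hypothetical stand-ins for the parameters Fantasm extracts, and the rules paraphrase the detection parameters of Table 3.2.

```python
def detect_behaviors(flow):
    """Map one observed flow (a hypothetical dict of extracted features)
    to the set of high-level behaviors it matches."""
    behaviors = set()
    proto, port = flow.get("proto"), flow.get("dst_port")
    if proto == "HTTP" and flow.get("method") == "GET" and flow.get("bytes_in", 0) > 0:
        behaviors.add("downloading")   # HTTP GET plus received payload
    if proto == "HTTP" and flow.get("method") == "POST":
        behaviors.add("reporting")     # data submitted to a remote server
    if proto == "ICMP_ECHO" or (flow.get("syn_only") and flow.get("no_reply")):
        behaviors.add("scanning")      # probing host availability or open ports
    if port in (25, 465, 587):
        behaviors.add("spamming")      # connection to a public mail server port
    if proto == "IRC":
        behaviors.add("c2")            # command & control channel
    if proto in ("FTP", "SMB", "NFS"):
        behaviors.add("propagating")   # file-transfer protocols
    return behaviors
```

Because a sample's flows are evaluated independently and the results unioned, one sample can accumulate several labels, matching the multi-purpose samples discussed above.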
  Action               Network Activity Features                  Detection parameters
  Downloading          Connection to remote server;               HTTP GET request for a specific URL;
                       receiving payload for further action       receive payload (binary or text)
  Reporting            Connection to remote server;               HTTP POST to remote server with
                       submit data to remote server               parameters or payload
  Scanning             Detecting availability of other hosts;     ICMP ECHO packets to other hosts;
                       detecting open ports of hosts              TCP SYN or UDP packets to ports
  Spamming             Connection to public mail server;          TCP connection to port 25, 465, or 587;
                       sending spam email                         text email with URL or attachments
  C&C communication    HTTP, IRC connection to remote server;     IRC protocol / ports
                       receive instruction
  Propagating          Try to transfer data over to other hosts   FTP, Samba, NFS protocol / ports

Table 3.2: High-level behavior list.

We utilize information including protocol, port, packet length, as well as some application-level headers. We also observe the sequences of network behaviors to identify patterns that denote a specific high-level behavior. For example, if malware attempts to connect to an external host on port 25, 465 or 587 and send email, we will infer that this is part of the "spamming" category. A detailed description of the network patterns and parameters we use is shown in Table 3.2.

3.5 Experiment and Evaluation

We now use the patterns identified in the previous section to study the prevalence of our selected high-level behaviors in contemporary malware.

3.5.1 Experiment Setup

Our experiment environment is built using the Fantasm platform [18], which allows for safe and productive live malware experimentation. Fantasm is built on the DeterLab testbed [37], which is a public testbed for cyber-security research and education. DeterLab allows users to request a number of physical nodes, connect them into custom network topologies, and install custom OS and applications on them. Users are granted root access to the machines in their experiments.
Fantasm utilizes this infrastructure to construct a LAN environment with a node running Windows to host the malware, and another node running Ubuntu Linux acting as the gateway. Fantasm provides the necessary services for impersonators, and monitors network activities by capturing all network packets using tcpdump. One round of analysis for a given malware sample consists of the following steps:

• Enable service network monitoring on the Linux gateway.

• Reload the operating system on the Windows node, and set up the necessary network configurations.

• Deploy and start the malware binary, and continue running it for a given time period (we used 5 minutes).

• Kill the malware process and save the captured network trace.

This setting has the advantage that it is immune to current analysis evasion attempts by malware, because it does not use a debugger or a virtual machine. By reloading the OS for each run, it ensures that each sample is analyzed in an environment free from any artifacts of the previous analysis rounds.

The malware samples we used in the study were provided by the Georgia Tech Apiary project [1], which is a repository with a daily collection of malware samples. We randomly selected 999 malware samples captured in Mar. 2016 for this research. Next we submitted each sample to VirusTotal [63] to determine the type of malware, and to ensure that the sample is recognized as malicious. We then analyzed each selected sample in our experiment environment, using the method described above. After the analysis, we saved traces with all captured communications to and from the Windows node.

  Behavior             Percentage
  Downloading          10.9%
  Reporting            5.6%
  Scanning             28.5%
  Spamming             2.2%
  C&C communication    0.2%
  Propagating          11.5%
  Not Detected         57%

Table 3.3: Percentage of our selected 999 malware samples which exhibit the given high-level behavior.
  Count of behaviors    Ratio
  1                     67.6%
  2                     28.2%
  3                     3.4%
  4                     0.7%
  5 or more             0%

Table 3.4: Percentage of malware having the given number of behaviors, among all malware which exhibited some network activity.

3.5.2 Gathering High-Level Behavior

Our analysis program extracted all useful parameters to identify high-level behaviors as described in Table 3.2. The program first generates raw results for all parameters, such as protocol, destination ports, HTTP request type, remote hosts, as well as session sequence patterns. We then automatically identify the patterns listed in Table 3.2, to detect high-level behaviors. Our results are shown in Table 3.3 and Table 3.4.

Table 3.3 shows the prevalence of different behaviors in our malware samples. Around 57% of the samples do not show any network activity. Possible reasons for inactivity include the shortness of the observation period, malware waiting for an external trigger, or outdated (dormant) malware. We plan to establish the exact causes of inactivity in our future work.

Of the malware samples which exhibit some network activity, 28.5% engage in scanning, 11.5% in propagating themselves further, and 10.9% in downloading new binaries. This suggests that most malware focuses on finding vulnerable hosts, downloading the actual malicious code, and propagating further. Reporting shows up in 5.6% of the samples, while spamming shows up in 2.2% of the samples.

  Behavior Combinations                                Ratio
  Scanning + Propagating                               43.1%
  Downloading + Propagating                            16.5%
  Scanning + Spamming                                  12.2%
  Downloading + Scanning                               7.9%
  Downloading + Scanning + Propagating                 5.7%
  Downloading + Uploading                              4.3%
  Uploading + Scanning                                 2.2%
  Downloading + Uploading + Scanning + Propagating     2.2%
  Downloading + Uploading + Propagating                1.4%
  Scanning + Spamming + Propagating                    1.4%
  Uploading + Scanning + Propagating                   1.4%
  Uploading + Propagating                              0.7%
  Downloading + Uploading + Scanning                   0.7%

Table 3.5: Percentage of malware exhibiting given combinations of behaviors.
Other types of behaviors, such as C&C communication, are rarely seen, which suggests we may need more samples, or longer observations, to study these behaviors.

We were further interested in combinations of behaviors which show up together. Table 3.4 shows the percentage of samples which show one or more behaviors. More than 30% of the samples which exhibit some network activity engage in more than one behavior, and around 4% of the samples engage in more than two behaviors.

Table 3.5 shows the most common behavior combinations. Scanning and propagating show up most frequently together (43.1%), which is not surprising. This combination suggests that most propagating malware samples are also equipped with scanning capability to find potential victims. The next most common combination is downloading with propagating, which suggests that this kind of malware sample usually downloads a new payload, and then propagates further. Also frequent is scanning with spamming, suggesting spam malware samples also scan for potential victims before sending spam.

3.6 Discussion and Future Work

Currently we have defined patterns which can be used to detect six types of high-level behaviors. We believe there are more high-level behaviors that are useful for understanding malware, and we will seek to define them (and the associated patterns) in our future work. Our experiment results also suggest that behavior combinations are common. We believe this finding can be useful for the following future directions in malware analysis.

Malware detection. Most low-level malware behaviors also occur in benign software, which makes them unreliable for malware detection purposes. However, combinations of these low-level behaviors may be unique to malware, and could be useful for detection.
For example, software that scans other hosts and then copies data over is unlikely to be benign, although each of these actions separately could be undertaken by benign software (e.g., probing several servers could look like scanning, if the servers are unresponsive, and data can be uploaded to a cloud for legitimate reasons). As we extend our malware analysis to more samples, we expect to find more behavior combinations which can be used for malware detection.

Polymorphic malware. Modern malware may take advantage of a polymorphic engine to encode itself and evade signature-based detection. Through network behavior analysis and behavior combination patterns, we may detect similar malicious behaviors exhibited by different binary samples. These samples could be polymorphic versions of the same code. Thus our work can be useful to detect polymorphic malware.

Malware genealogy. With longer analysis on more samples, we hope to build a more fine-grained representation of high-level malware behaviors. Such information can then be used to identify malware samples which have similar behaviors, and thus may be related. For example, if two samples send emails with the same attachments, use the same or similar phishing URLs, or contact the same C&C server, they are likely to be related. Similarly, if two samples exhibit similar spamming behavior (e.g., using the same or similar content) but one has self-propagating behavior too, while the other does not, these samples are likely to be related. Our high-level behaviors can be useful to study malware genealogy and to identify malware families. Similarly, when malware downloads a new binary, which then attempts to perform some malicious activity, these two samples work in collaboration for a common goal. We hope to leverage our current work to build information on such related malware in the future.

Malware evading network analysis.
With the existence of techniques to evade signature-based malware detection, we can naturally assume that malware designers will also look to evade our network-communication-based analysis. There are several strategies that malware could use: (a) appear to be dormant and await a specific trigger, (b) interleave malicious activities with benign traffic, (c) attempt to detect the analysis environment, e.g., by attempting to connect to an FTP server with a non-existing username and password (this will succeed in our analysis but fail in the real world).

We leave the handling of these evasion techniques for our future work, but sketch some possible solutions here. For trigger-based malware, we could leverage existing work on trigger detection, such as using static analysis and symbolic execution [12]. For malware which interleaves its malicious behavior with benign ones, we need to improve our pattern detection to consider subsets of activities, and not just sequences. For malware which detects our impersonators, we plan to make our impersonators more sophisticated and harder to detect. There is, however, a trade-off here between impersonator sophistication and scalability. For example, our FTP impersonator could check, for each username/password input, if this input can be used to access the actual FTP server in the real Internet. This information could be cached for the next attempt with the same input. Such an approach would guarantee that our impersonator always provides the correct reply to malware, but it would inundate us with testing and caching potentially many username/password combinations. This approach clearly does not scale. In our future work we will investigate where to draw the line between trying to improve the sophistication and stealthiness of our impersonators, and scalability.

3.7 Conclusion

In this work we propose patterns of malware's network behaviors which can be used to identify some high-level behaviors, such as scanning or spamming.
We then study the prevalence of these high-level behaviors in contemporary malware. We find that scanning, propagating, and downloading are the most prevalent behaviors. We also find that some behavior combinations are quite popular, such as scanning and propagating, downloading and propagating, and scanning and spamming. We discuss how our identification of high-level behaviors could be useful for malware detection, identification of polymorphic malware, and identification of malware samples which may be versions of each other, or share some common code.

Chapter 4

Analyze Malware Similarity Using High-Level Behavior

In this chapter, we propose a different malware embedding, which encodes the dynamics and contents of malware communication. We use this embedding to study malware similarity.

4.1 Introduction

The Internet is facing increasing threats due to the proliferation of malware. A study by Kaspersky Lab suggests an estimated daily increase of over 360,000 samples in the wild in 2017 [27]. Such a high malware production rate suggests that new malware may be created by transforming existing code to evade signature detection. Malware designers have invested enormous efforts to change their binaries to avoid detection and analysis by defenders. From simple instruction transformation, such techniques have evolved through junk code injection and code obfuscation to polymorphic and metamorphic engines that can transform malware code into millions of variants [51]. Such obfuscation techniques have greatly undermined traditional signature-based malware detection methods, and they create an enormous workload for code analysis. Yet much of the new malware variants could simply be old malware in a new package, or new malware assembled from pieces of old malware. Clearly, as new malware samples emerge, code analysis and signature-based filtering cannot keep pace.

Besides static analysis, researchers have used dynamic analysis to overcome code obfuscation.
Dynamic analysis includes tracking disk changes, analyzing dynamic call graphs, as well as monitoring malware execution using debuggers and virtual machines. However, modern malware is often equipped with anti-debugging and anti-virtualization capabilities [52, 53], making dynamic code analysis in a controlled environment difficult.

To complement contemporary static and dynamic code analysis, we propose to study malware behavior by observing and interpreting its network activity. Much of today's malware relies on network connectivity to achieve its purpose, such as sending reports to the malware author, joining a botnet, sending spam and phishing emails, etc. We hypothesize that it would be difficult for malware to significantly alter its network behavior and still achieve its purpose. Studying network behavior thus may offer an opportunity both to understand what malware is trying to do in an analysis environment, and to develop behavior or network traffic signatures useful for malware defense.

Live malware analysis, with unrestricted network access, is risky, because malware may inflict damage on other Internet hosts during analysis, and analysts would become unwitting accomplices. We leverage the Fantasm system for safe and productive live malware analysis [18] to minimize this risk. The system enables us to emulate unrestricted connectivity from the malware's point of view, while it protects the Internet from unwanted traffic.

Just having network traffic records of a malware run is not enough to understand a malware's purpose, because such data is very rich and unstructured. To overcome this challenge, we develop an embedding for a malware sample – a set of flow-record features observed during the analysis. We then perform a two-pronged analysis. First, we devise patterns over the embeddings that help us detect the presence and features of high-level network behaviors, such as file download, e-mail sending, scanning, etc.
We use a set of these high-level behaviors to infer a malware sample's purpose. We note that malware could have more than one purpose. For example, it could both scan for vulnerable hosts and send unsolicited emails. Thus a malware sample could have more than one label. Second, we use the embeddings to identify malware samples that have the same or very similar network behaviors, and study code reuse and malware genealogy.

We perform our analysis on 8,172 malware samples, randomly selected from the Georgia Tech Apiary project [1]. 63.9% of the samples exhibit more than one behavior, which speaks to the multi-purpose nature of contemporary malware. We further cluster malware flows based on their embeddings and identify groups that have similar or even identical behaviors. We find that only 14 clusters are responsible for over 80% of data-carrying flows, which appear in about 70% of malware samples. Apart from these flows, the samples exhibit diverse other network behaviors, ranging from repeatedly querying google.com, to uploading private user information, to utilizing a Bittorrent tracker to scan for potential victims. We then show how our clusters can be used to devise behavioral signatures for malware defenses.

4.2 Background and Related Work

Malware analysis has received a continuous increase of research interest [61, 10, 45]. In this section, we discuss contemporary malware analysis methods and the rationale behind our approach. We also survey related work.

4.2.1 Static Analysis

A common approach to malware detection is analyzing binaries for code signatures – sequences of binary code that are present in malware and are not common in benign binaries. Various static analysis methods like CTPL [29], Generic Virus Scanner [12], etc. have been used to analyze the binary code of malware without running it, and identify portions that could be used for signature generation. The identified portion is then synthesized to become the signature of this kind of malware.
Signature-based malware detection has been the most widely used method and has been quite successful. However, malware designers have also been working on countermeasures over the years to undermine such techniques. From junk code generation, malware encryption, and oligomorphism, to polymorphic and metamorphic malware [51], such techniques have evolved significantly. Our research complements signature-based detection by identifying common network-level behaviors of malware. These behaviors can be used to develop behavioral signatures of malware, which can be used to detect malware running on compromised hosts. In other words, signature-based detection can prevent malware infections, and behavior signatures can detect infections that bypass signature-based detection.

4.2.2 Dynamic Analysis on Host

Dynamic analysis builds signatures of a malware's interaction with its host. Such signatures may include memory access and file access patterns, as well as system call patterns. Those patterns that are likely to be present in malware but not in benign binaries can be used to develop a behavioral signature for malware detection [20].

Dynamic analysis complements static analysis and can overcome malware code obfuscation. Willems et al. [64] proposed CWSandbox, which combines static and dynamic techniques for analyzing malware on a contained host. Guarnieri et al. [22]'s Cuckoo sandbox follows similar concepts as CWSandbox and additionally provides a network packet sink. Both works utilize virtual machines to isolate the study environment from the host. However, stealthy malware has another set of techniques to evade dynamic analysis – it detects debuggers and virtual machines, which are often used to speed up and facilitate dynamic analysis, and modifies its behavior. This leads to incorrect or unusable signatures of stealthy malware. Chen et al. [11] found that 39.9% and 2.7% of 6,222 malware samples exhibit anti-debugging and anti-virtualization behaviors respectively.
In 2012, Branco et al. [7] analyzed 4 million samples and observed that 81.4% of them exhibited anti-virtualization behavior and 43.21% exhibited anti-debugging behavior. Our work complements dynamic analysis by providing another set of features, based on network behavior of malware, that can be used for detection. Our approach allows for feature collection using the network and does not require virtualization (binaries can be run on bare metal machines).

4.2.3 Dynamic Analysis of Network Behavior

There are a few efforts on analyzing the semantics of malware network behavior. Sandnet [49] provides a detailed, statistical analysis of malware network traffic, and surveys popular protocols employed by malware. Our work aims to understand popular network behavior patterns and the similarity in network behaviors between different malware samples. Morales et al. [38] studied several network activities, selected using heuristics, which include: (1) NetBIOS name request, (2) failed network connections after DNS or rDNS queries, (3) ICMP-only activity with no replies or with error replies, (4) TCP activity followed by ICMP replies, etc. These activities can be used to detect the likely presence of malware. Morales et al. also report on the prevalence of those behaviors in contemporary malware. While their chosen activities are useful for malware detection, our high-level behavior analysis focuses on understanding malware purposes (e.g., spamming vs. scanning). Our work thus complements the work by Morales et al. Nari et al. [40] proposed an automated malware classification system also focusing on malware network behavior, which generates a protocol flow dependency graph based on the IP addresses being contacted by malware. Our work improves on this effort by systematizing detection of different malware behaviors using different network behavior patterns, and includes other information from network flows such as packet size, packet contents, etc. Lever et al.
[30] experimented with 26.8 million samples collected over five years and showed several findings, including that dynamic analysis traces are susceptible to noise and should be carefully curated, that Internet services are increasingly filled with potentially unwanted programs, and that network traffic provides the earliest indicator of infection. As a slight downside, the data used for this long period came from different sources, may not have been collected in a well-controlled environment, and could contain noise that is hard to identify and remove. In contrast, our experiment is performed in a well-controlled experiment environment, which makes the data much easier to sanitize.

4.3 Capturing Malware Network Behavior

Contemporary malware relies more and more on the network to achieve its ultimate purpose [24, 54]. Malware often downloads binaries needed for its functionality from the Internet or connects to a command and control channel to receive instructions on its next activity [25]. Advanced persistent threats [57] and key-loggers collect sensitive information on users' computers but need network access to transfer it to the attacker. DDoS attack tools, scanners, spam, and phishing malware require network access to send malicious traffic to their targets. We study malware's network activities because they have become essential for modern malware. The first step in this study includes capturing malware's network traffic in an environment which is transparent to malware, but also minimizes risk to Internet hosts from adversarial malware actions.

4.3.1 Analysis Environment

We leverage the experimentation platform called Fantasm [18]. Fantasm is built on the DeterLab testbed [37], which is a public testbed for cyber-security research and education. DeterLab allows users to request several physical nodes, connect them into custom network topologies, and install custom OS and applications on them. Users are granted root access to the machines in their experiments.
Fantasm runs on DeterLab with full Internet access, and carefully constrains this access to achieve productive malware analysis and minimize risk to outside hosts. In our analysis, we run malware on a bare-metal Windows host, without any virtualization or debugger. We capture and analyze all network traffic between this machine and the outside using a separate Linux host, sitting in between the Windows host and the Internet. Both hosts are controlled by the Fantasm framework. Fantasm makes decisions on which communications to impersonate, i.e., intercept and answer itself, which to forward, and which to drop. This decision is made by taking into account each outgoing flow separately, making an initial decision, and revising it later if subsequent observations require this. Fantasm defines a flow as a unique combination of destination IP address, destination port, and protocol. Each flow is initially regarded as non-essential, and it is dropped. If this causes the malware's activity to abort, Fantasm stops the analysis, restarts it, and regards that specific flow as essential. Fantasm then considers if it can fake replies to this outgoing connection in a way that would be indistinguishable from the actual replies, should the flow be allowed into the Internet. If Fantasm has an impersonator for the given destination and the given service, it will intercept the communication and fake the response. Otherwise, it will evaluate if the outgoing flow is risky, i.e., potentially harmful to other Internet hosts. If so, the flow will be dropped. Otherwise, it will be let out into the Internet. Table 3.1 illustrates the criteria used by Fantasm to determine if a flow is risky or not and if it can be impersonated. Note that even flows considered not risky and let out are still subjected to further monitoring and may be aborted if they misbehave, because they could become part of scanning or DDoS.
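To make the per-flow decision process above concrete, here is a minimal Python sketch of the policy; the function and predicate names (decide_flow, is_risky) and the example services are our own illustrative assumptions, not Fantasm's actual implementation:

```python
from collections import namedtuple

# A flow, as Fantasm defines it: destination IP, destination port, protocol.
Flow = namedtuple("Flow", ["dst_ip", "dst_port", "protocol"])

def decide_flow(flow, essential, impersonators, is_risky):
    """Sketch of the per-flow policy: drop non-essential flows, impersonate
    services we can fake, drop risky flows, forward the rest (monitored)."""
    if flow not in essential:
        return "drop"            # default: non-essential flows are dropped
    if (flow.dst_port, flow.protocol) in impersonators:
        return "impersonate"     # answer locally with a fake service
    if is_risky(flow):
        return "drop"            # potentially harmful to outside hosts
    return "forward"             # let out, but keep monitoring

# Hypothetical example: DNS can be impersonated; port 25 (SMTP) is risky.
impersonators = {(53, "udp")}
risky = lambda f: f.dst_port == 25
dns = Flow("10.0.0.1", 53, "udp")
smtp = Flow("192.0.2.7", 25, "tcp")
web = Flow("192.0.2.9", 80, "tcp")
essential = {dns, smtp, web}

print(decide_flow(dns, essential, impersonators, risky))   # impersonate
print(decide_flow(smtp, essential, impersonators, risky))  # drop
print(decide_flow(web, essential, impersonators, risky))   # forward
```

In the real system the "essential" set is learned by restarting the analysis whenever dropping a flow aborts the malware's activity.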
To minimize the risk of unwitting participation in attacks, Fantasm actively monitors for these activities and enforces limits on the number of suspicious flows that a sample can initiate. A suspicious flow receives no replies from the Internet. Many scan and DDoS flows will be classified as suspicious. If the analyzed malware sample exceeds its allowance of suspicious flows (10 in the current implementation), Fantasm aborts its analysis.

4.4 Sample Embedding

Once each sample is analyzed in Fantasm, we extract flow-level details of the malware's communication with the Internet from the captured traffic traces and create an embedding for each flow and each sample. Our selected flow features can be categorized into three broad categories:

• Header information. This information includes destination address, port, and transport protocol. We use this information to detect when different malware samples contact the same server, or the same destination port (and thus may leverage the same service at the destination server).

• Flow dynamics. This includes a sequence of application-data-units (ADUs, see Section 4.4.1) exchanged in each direction of the flow, which corresponds to request and response sizes on the flow. We use these flow dynamics to detect malware flows with similar communication patterns.

• Payload information. We use a frequency-based technique (see Section 4.4.2) to generate an embedding of the flow's content, which can be used for a fast comparison between flows.

To create an embedding for an entire sample we use all its flow embeddings. A detailed list of features selected for each category is provided in Table 4.1.

Feature Category     Feature Selected              Data Type
Header information   Source/Destination address    string
                     Source/Destination port       number
                     Protocol                      string
Flow dynamics        Application data units        list
Payload information  Byte frequency table          dict

Table 4.1: Features selected for flow and sample embedding

We next introduce two metrics we will use to closely compare malware samples.
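The embedding of Table 4.1 can be represented as a simple per-flow record. This is only an organizational sketch with hypothetical field names (the dissertation does not prescribe a data structure), and encoding ADU direction via the sign of the size is our own convention:

```python
from dataclasses import dataclass, field
from typing import Dict, List

@dataclass
class FlowEmbedding:
    """One flow's embedding, mirroring the feature categories of Table 4.1."""
    src_addr: str     # header information
    dst_addr: str
    src_port: int
    dst_port: int
    protocol: str
    # Flow dynamics: ADU sizes; sign encodes direction (+incoming/-outgoing),
    # an assumption of this sketch.
    adus: List[int] = field(default_factory=list)
    # Payload information: byte value -> frequency in the payload.
    byte_freq: Dict[int, float] = field(default_factory=dict)

# A sample's embedding is simply the collection of its flow embeddings:
flow = FlowEmbedding("10.0.0.2", "192.0.2.5", 49152, 80, "tcp",
                     adus=[-50, 110, -160, 180], byte_freq={0x47: 0.12})
sample_embedding = [flow]
print(len(sample_embedding), flow.protocol, flow.adus[0])
```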
4.4.1 Application Data Unit (ADU)

Flow dynamics include sequences of application data units with their sizes and directions. An application data unit, or ADU, is an aggregation of a flow by direction, which combines all adjacent packets transmitted in the same direction, while maintaining the boundaries of direction shifts. Intuitively, ADU dynamics seek to encode the lengths of requests and responses in a connection between a malware sample and a remote host. A transformation of a packet trace to ADUs is illustrated in Table 4.2.

(a) Packet sequence           (b) ADU sequence
Pkt ID  Direction  Pkt size   Ref. ID  Direction  ADU size
1       incoming   50         (1+2)    incoming   110
2       incoming   60         (3)      outgoing   50
3       outgoing   50         (4)      incoming   100
4       incoming   100        (5+6)    outgoing   160
5       outgoing   70         (7+8)    incoming   180
6       outgoing   90
7       incoming   80
8       incoming   100

Table 4.2: ADU transformation from packet sequence

The ADU sequence is useful to detect similar flows across different malware samples based on their communication dynamics. For example, two different samples may download the same file from two different servers, and the contents may be encrypted with two different keys. However, the ADU sequences of these two flows should be very similar, enabling us to detect that these two flows have a similar or same purpose.

4.4.2 Payload Byte Frequency

Payload usually stores application-level data. Not all malware flows carry a payload, but if it is present it usually carries high-level logic, such as new instructions or binaries that are important for new functionality in malware. Hence it is important to study payload contents. On the other hand, payload information usually does not have a specific structure, as different malware may organize its data differently. We thus need a way to quickly summarize and compare payloads that may have very different formats. We transform each flow's payload into a dictionary that encodes each byte's frequency.
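As a minimal sketch (our own, using the data of Table 4.2 as a check), the packet-to-ADU aggregation of Section 4.4.1 can be implemented as:

```python
def to_adus(packets):
    """Aggregate a packet sequence into application data units (ADUs):
    adjacent packets in the same direction are merged, and a new ADU
    starts whenever the direction shifts (cf. Table 4.2)."""
    adus = []
    for direction, size in packets:
        if adus and adus[-1][0] == direction:
            adus[-1] = (direction, adus[-1][1] + size)  # extend current ADU
        else:
            adus.append((direction, size))              # direction shift: new ADU
    return adus

# Packet sequence from Table 4.2(a):
pkts = [("in", 50), ("in", 60), ("out", 50), ("in", 100),
        ("out", 70), ("out", 90), ("in", 80), ("in", 100)]
print(to_adus(pkts))
# [('in', 110), ('out', 50), ('in', 100), ('out', 160), ('in', 180)]
```

The output reproduces the ADU sequence of Table 4.2(b).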
Keys to this dictionary are all possible byte values, 0–255, and the values stored are counts of how many times the given byte value was present in the payload. Finally, we divide each count by the total payload size to arrive at the frequency of byte values. This encoding has two advantages: first, it has a fixed and much smaller size than the actual payload; second, it simplifies our similarity comparison between flows and samples.

4.4.3 Malware Similarity

Another useful application of behavior signatures is to study overlap in behaviors between different malware samples, e.g., to detect polymorphic or metamorphic malware, and to understand how malware functionality evolves and how common it is across samples. Obviously, our behavior signatures can yield a wide range of similarity measures, depending on how we define what "similar" means, and what weight we assign to features during the comparison. Our calculation of sample similarity depends on two distance measures:

1. Inverse Jaccard score for ADU comparison: For a pair of flows, their ADU sequences are compared by calculating the Jaccard score, taking ADU sizes as the values in the set. The Jaccard score between two ADU sequences X and Y is defined in Equation 4.1.

J(X, Y) = |X ∩ Y| / |X ∪ Y|    (4.1)

The flow similarity based on the original Jaccard score ranges from 0 to 1, with higher values denoting higher similarity. We convert this score into a distance-type metric, the "inverse Jaccard score" J̄S = 1 − JS, with lower values denoting lower distance and thus higher similarity.

2. Kullback-Leibler divergence for payload comparison: We compare payloads of flows P and Q by calculating the Kullback-Leibler divergence (KL measure) between their payload frequencies, as described in Equation 4.2,

D_KL(P || Q) = Σ_{b ∈ BY} P(b) log(P(b) / Q(b))    (4.2)

where BY is the set of all possible byte values and P(b), Q(b) are the frequencies of value b in the payloads of the flows P and Q, respectively.
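A sketch of the two flow-distance measures of Equations 4.1 and 4.2; note that the smoothing constant eps is our own addition — the text does not specify how zero frequencies in Q are handled, though the KL term is undefined there:

```python
import math

def inverse_jaccard(x, y):
    """Inverse Jaccard score (Eq. 4.1 converted to a distance):
    1 - |X ∩ Y| / |X ∪ Y|, treating ADU sizes as set elements.
    Lower values mean higher similarity."""
    x, y = set(x), set(y)
    return 1.0 - len(x & y) / len(x | y)

def kl_divergence(p, q, eps=1e-9):
    """KL divergence (Eq. 4.2) between two byte-frequency maps
    (dicts: byte value -> frequency). Lower means more similar."""
    return sum(pb * math.log(pb / q.get(b, eps))
               for b, pb in p.items() if pb > 0)

# ADU sequences sharing 4 of 6 distinct sizes -> distance 1 - 4/6
print(inverse_jaccard([110, 50, 100, 160, 180], [110, 50, 100, 70, 180]))
# Identical payload distributions -> zero divergence
print(kl_divergence({65: 0.5, 66: 0.5}, {65: 0.5, 66: 0.5}))  # 0.0
```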
The KL measure ranges from 0 to ∞, with lower values denoting higher similarity, i.e., it is already a distance-type metric. When comparing two samples for similarity, we calculate their distance from flow distance measures in the following way. For each flow in sample A, we compare it with each flow in sample B, and take the lowest distance measure, i.e.,

d(f_A) = d(f_A, B) = min_{f_B} d(f_A, f_B)    (4.3)

where d is either the J̄S or the KL measure. We repeat this process for each flow in sample B. Finally, we calculate the average distance, which is the average over all flows, i.e.,

d_avg(A, B) = avg_{f_X ∈ {f_A, f_B}} d(f_X)    (4.4)

4.5 Evaluation of Contemporary Malware

We now use features and patterns identified or defined in the previous sections to study malware sample similarity and the prevalence of some selected high-level behaviors in contemporary malware.

4.5.1 Experiment Setup

We build our experiment environment using the Fantasm platform [18] as introduced in Section 4.3.1. Fantasm utilizes the DeterLab infrastructure to construct a LAN environment with a node running Windows to host the malware, and another node running Ubuntu Linux acting as the gateway. In addition, Fantasm provides necessary services for impersonators, and monitors
The malware samples we used in the study were provided by the Georgia Tech Apiary project [1]. We selected malware samples captured throughout 2018 for this research. Next, we submitted each sample to VirusTotal [63] to determine the type of malware, and ensure that the sample is recognized as malicious, from which we randomly picked 8,217 samples. We then analyzed each selected sample in our experiment environment, using the method described above. After the analysis, we have saved traces with all captured communications to and from the Windows node. 4.6 ClusteringFlowEmbeddingsUsingMachineLearning To evaluate the eectiveness of using ow embeddings and payload byte frequencies on study- ing malware behavior, we cluster ows based on their metrics and see how well they perform on 57 dierentiating trac behavior. We choose to use OPTICS algorithm (Ordering Points To Iden- tify the Clustering Structure) from Scikit-Learn, which is an algorithm for nding density-based clusters in spatial data. The parameters for OPTICS we use are: = 2, and min_samples=100. We cluster malware ows based on two features from our embedding: 10-ADUsequence, which is useful to identify common communication patterns, andtotalpayloadsize,payloadsizeofprint- able characters, and payload size of all other characters. We choose to use these higher-level fea- tures instead of our payload frequency maps because they are not sensitive to small payload changes. In total, we have analyzed 200,236 ows from the 6,595 malware samples out of the 8,172 samples we have selected. The remaining 1,577 samples do not successfully exchange payload with an external host. They send only DNS queries to the local resolver but do not initiate any further contact with the outside. In most cases, this happens because malware appears dormant during our analysis. We exclude these samples from further analysis. 
4.6.1 Clustering by ADU Sequence

The clustering based on ADUs results in 53 clusters and one group of unclustered flows, containing 8.71% of the flows. We exclude unclustered flows from further analysis. Based on the behavior, the flow clusters can be further grouped into one of the following larger categories:

• HTTP Downloading – incoming traffic volume is larger than outgoing and the application protocol is HTTP.
• HTTP Uploading – outgoing traffic volume is larger than incoming and the application protocol is HTTP.
• UDP Uploading – outgoing traffic volume is larger than incoming and the transport protocol is UDP.
• ICMP Scanning – various external hosts are contacted using ICMP.
• HTTPS Downloading – incoming traffic volume is larger than outgoing and the application protocol is HTTPS.
• HTTPS Uploading – outgoing traffic volume is larger than incoming and the application protocol is HTTPS.
• Unestablished – the flow attempted to connect to an external host but there was no reply or it did not finish the 3-way handshake.

We summarize high-level findings based on these categories in Table 4.3. In this table, we gather the number of clusters that belong to each behavior category, the ratio of clustered flows in the category, the ratio of samples whose flows belong to the category, as well as the average J̄S distance for all flows within the category.

Behavior           Cluster  Effective   Sample      Avg flow J̄S       Avg sample J̄S
                   Count    Flow Ratio  Span Ratio  distance           distance
                                                    (lower is better)  (lower is better)
HTTP Downloading   17       2.86%       16.46%      0.1542             0.1717
HTTP Uploading     2        0.21%       1.91%       0.1571             0.2017
UDP Uploading      4        1.17%       1.87%       0.0001             0.1901
ICMP Scanning      5        65.02%      2.04%       0.0                0.2120
HTTPS Downloading  7        4.00%       2.07%       0.0749             0.1379
HTTPS Uploading    8        2.90%       2.19%       0.1621             0.1931
Unestablished      9        23.92%      67.57%      0.0034             0.3844

Table 4.3: High-level summary of clustering result using ADU.
The largest category is ICMP scanning, containing 5 clusters, 65.02% of total flows, and 2.04% of samples. The second-largest category is unestablished flows, containing 9 clusters, 23.92% of flows, and 67.57% of samples*. This suggests that most malware samples have many unsuccessful connections. These two types of flows cover 14 clusters, 88.94% of all flows, and span 69.64% of the samples. The HTTPS downloading category consists of 7 clusters, 4% of the flows, and 2.07% of the samples; meanwhile, the HTTPS uploading category contains 8 clusters, 2.90% of the flows, and 2.19% of the samples. The HTTP downloading category contains 17 clusters, 2.86% of the flows, and 16.46% of the samples, while HTTP uploading contains 5 clusters, 0.21% of the flows, and 1.91% of samples. This suggests that more malware samples utilize unencrypted HTTP traffic compared to encrypted HTTPS traffic, which allows for their payload analysis and possibly easier detection and filtering. The rarest observed behavior is UDP uploading, relating to 2 clusters, 0.53% of the flows, and 0.24% of the samples. As Table 4.3 shows, the average J̄S distance of flows within categories is very low, which means that flows have similar dynamics (similar sizes of the first 10 ADUs). The ADU sequences of clusters could thus be used as behavioral signatures for malware detection.

4.6.2 Clustering by Payload Sizes

We next cluster flows by payload sizes, resulting in 37 clusters and one group of unclustered flows containing 14.16% of flows. Note that while we cluster by the total, printable, and non-printable character lengths of the payload, the average KL distance for flows in each category is very low (ranging from 0.0 to 0.16). This means that flows in the same category have very similar payloads, not just in length but also with regard to the actual byte values. We again categorize the clusters using the same criteria as we used for ADU clusters.
The categories and their coverage of clusters, flows, and samples, as well as the average KL distance between flows in each category, are shown in Table 4.4.

* A sample can have multiple flows and thus can appear in multiple categories.

Behavior           Cluster  Effective   Sample      Avg flow KL        Avg sample KL
                   Count    Flow Ratio  Span Ratio  distance           distance
                                                    (lower is better)  (lower is better)
HTTP Downloading   18       2.05%       9.78%       0.0194             2.557
HTTP Uploading     2        0.28%       2.90%       0.0738             0.1397
UDP Uploading      1        0.42%       1.11%       0.2289             2.480
HTTPS Downloading  1        0.09%       0.94%       0.0161             2.879
HTTPS Uploading    5        1.92%       5.26%       0.1283             0.3195
ICMP Scanning      4        69.29%      0.12%       0.0                0.1732
Unestablished      9        25.86%      68.81%      0.0034             2.149

Table 4.4: High-level summary of clustering result using payload sizes.

We see a similar trend as in the ADU sequence-based clusters. ICMP is the dominating type of flows (69.29%) and is shown in only 0.12% of the samples, with a 0.0 average KL distance because there is no variation in the payload. This is followed by unestablished flows, which contain 25.86% of the flows from 68.81% of the samples with an average KL distance of 0.0034, which reinforces the result from the ADU analysis that most samples contain failed connection attempts. We also find that more HTTP traffic is detected than HTTPS traffic for both downloading and uploading flows. Payload size also detects more HTTPS uploading flows, 1.92% of all effective flows from 5.26% of the samples, which covers more samples than its ADU equivalent. UDP is still the rarest type of flow detected here, with 0.42% of the effective flows from 1.11% of the samples. As mentioned above, the average KL distance is calculated based on the payload frequency map, and the resulting distances are still very low, suggesting very high similarity within each cluster. This demonstrates the effectiveness of using the payload frequency map to detect payload patterns. We now take a closer look at the payload of flows in dominant clusters. We skip unestablished and HTTPS-encrypted flows as we cannot analyze their payload.
We illustrate our findings through three examples. From ICMP ping packets, we observe three dominant types of payload:

• "Babcdefghijklmnopqrstuvwabcdefghi".
• "\u0000" repeated for 28 bytes.
• "b\u0004\u0000EEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEE" (with the first character being either "b", "o", or "\u0601").

This suggests that the 150 malware samples in our dataset that perform ICMP scanning may reuse only three different code segments to generate scan packets. The ICMP packet payloads we identified can be used as payload signatures for malware detection. The unencrypted HTTP traffic also shows several dominant behaviors that are interesting. We show them in Table 4.5. One type of traffic tries to query and access the root of google.com using "GET / HTTP/1.1". It receives an "HTTP/1.1 301 Moved Permanently" response but ignores it and keeps sending the same request repeatedly. Such behavior is unusual for benign code and could be used as a behavioral signature to detect malware samples. Another interesting type of HTTP traffic is a GET request that tries to retrieve a potentially malicious payload from a malicious website but uses google.com as the Referrer in the header. We suspect that this may be a way to circumvent detection by some network defense systems by masquerading as an innocent redirection from a Google query. Since benign code could exhibit similar behavior we cannot use this pattern for malware detection, but it could be used to identify suspicious flows and send them to a more sophisticated defense for further scrutiny. Another common HTTP flow behavior attempts to query many websites using fixed-length and random-looking domain names, such as qexylup.com or vowyzuk.com, with the endpoints being /login.php or /key.bin. These patterns could be used as behavioral signatures for malware detection.
HTTP Behavior                    Request Example                         Flow ratio over
                                                                        HTTP Downloader
Query google.com                 GET / HTTP/1.1                          35.83%
                                 Host: google.com
Access potential malicious       GET /.../ddos.bss HTTP/1.1              14.17%
payload on compromised website   Host: migsel.com (Migsel bike parts)
                                 Referrer: google.com
Access potential malicious       GET /key.bin HTTP/1.1                   39.76%
website                          Host: qexylup.com
                                 Referrer: google.com

Table 4.5: Notable behaviors of HTTP Downloader

For UDP flows, we found a type of payload similar to what is used in BitTorrent tracker searching, and can confirm that some flows even use the canonical BitTorrent port 6881 for communication. The flows in this cluster have a low average KL distance of 0.2289, indicating high payload similarity. Thus their payloads could be used to devise a payload signature for malware detection.

4.6.3 Sample Diversity

We analyze sample diversity based on our findings from flows. We found that 5,223 samples contain flows from different cluster groups in our previous results, which constitutes 63.9% of all samples. We also listed the average sample J̄S distance in Table 4.3 and the average sample KL distance in Table 4.4. We notice that for samples, both the average J̄S distance and the average KL distance are higher than their flow counterparts, because those samples may contain flows of different types. We manually inspected some of those samples, and find that some types of flows are more likely to be found in the same sample, such as HTTP GET requests for different URLs, HTTP uploading followed by HTTP downloading of potentially malicious payload, etc. These findings suggest that many samples exhibit multiple high-level behaviors and may be multi-purposed.

4.7 Discussion and Future Work

Our clustering results show that many flows are very similar and that we can use information about their ADU behavior and payload to devise behavioral and payload signatures for contemporary malware. Since the malware ecosystem changes rapidly, the signatures we devise today will likely be obsolete tomorrow.
However, our methodology can be used with contemporary malware samples to identify future clusters of behaviors and payloads and to help defenses keep track of malware evolution. Our current results are still bounded by time and computing power restrictions. Given more time and better infrastructure, we can foresee several future directions to continue our research.

Increase the analyzed sample repository. We would like to examine more malware samples and expand our analysis, to balance out samples in each high-level behavior group. We would also like to perform a longitudinal study of malware evolution over time, to quantify how much dominant behaviors change.

Understand sample genealogy. Our results can currently help us identify samples that share similar or identical flows, suggesting that these samples may have a common author or that they may share code. We would like to extend our analysis to map out the evolution of malware samples, e.g., which sample came first, how did a specific behavior (e.g., contacting a C&C channel) change over time, etc.

Malware detection. Most high-level malware behaviors also occur in benign software, which makes them unreliable for malware detection purposes. However, combinations of these high-level behaviors may be unique to malware and could be useful for detection. For example, software that scans other hosts and then copies data over is unlikely to be benign, although each of these actions separately could be undertaken by benign software (e.g., probing several servers could look like scanning if servers are unresponsive, and data can be uploaded to a cloud for legitimate reasons). As we extend our malware analysis to more samples, we expect to find more behavior combinations, which can be used for malware detection.

4.8 Conclusion

In this work, we propose to use malware's network traffic patterns to identify high-level behaviors of malware.
We define the "application data unit" to study malware traffic behavior and the "payload frequency map" to study the payload behavior of malware. We then cluster flows from the malware samples based on these features and form high-level behavior groups. The results show that each behavior group exhibits a significant behavior pattern. We confirm that flows in each cluster are very similar based on our distance metrics, suggesting that malware may be created by modifying existing code. We then look deeper into the actual behavior patterns within each cluster and find features that can be used to devise behavioral signatures for malware defense. We further analyze the sample-to-flow mapping and confirm that 63.9% of the samples contain flows that belong to different behavior clusters, suggesting that contemporary malware is more likely to be multi-purposed.

Chapter 5

Polymorphic Malware Detection Using High-Level Behavior

In this chapter, we further improve malware embedding for polymorphic malware detection. We normalize our embedding for malware samples with different flow patterns to identify identical network behavior with good results, and show that it provides better behavior understanding compared with local-behavior-based patterns.

5.1 Introduction

The Internet is facing increasing threats due to the proliferation of malware. A study by Purple Sec suggests an estimated 230,000 new malware samples daily [43]. Such a high malware production rate suggests that new malware may be created by transforming existing code to evade signature detection. Malware designers invest large effort to change their binaries to avoid detection and analysis by defenders. From simple instruction transformation, such techniques have evolved through junk code injection and code obfuscation, to polymorphic and metamorphic engines that can transform malware code into millions of variants [51].
Such obfuscation techniques have greatly undermined traditional signature-based malware detection methods, and they create an enormous workload for code analysis. Yet much of the new malware variants could simply be old malware in a new package, or new malware assembled from pieces of old malware. Clearly, as new malware samples emerge, code analysis and signature-based filtering cannot keep pace. Besides static analysis, researchers have used dynamic analysis to overcome code obfuscation. Dynamic analysis includes tracking disk changes, analyzing dynamic call graphs, as well as monitoring malware execution using debuggers and virtual machines. However, modern malware is often equipped with anti-debugging and anti-virtualization capabilities [52, 53], making dynamic code analysis in a controlled environment difficult. To complement contemporary static and dynamic code analysis, we propose to study malware behavior by observing and interpreting its network activity. Much of today's malware relies on network connectivity to achieve its purpose, such as sending reports to the malware author, joining a botnet, downloading malicious code, sending spam and phishing emails, etc. We hypothesize that it would be difficult for malware to significantly alter its network behavior and still achieve its purpose. Studying network behavior thus may offer an opportunity to both understand what malware is trying to do in an analysis environment, and to develop behavior or network traffic signatures useful for malware defense. In our previous chapters, we have proposed the Fantasm environment for safe and effective live malware analysis [18], and we analyzed thousands of malware samples to identify partial code reuse [16]. In this work, we turn our focus to detecting polymorphic malware, i.e., samples that have identical network behaviors but different binary code. We analyze 8,172 malware samples randomly selected from the Georgia Tech Apiary project [1].
For each sample, we gather information from anti-virus suite results from VirusTotal [63], as well as our own embedding, which encodes network behavior information. We compare the local behavior reported by VirusTotal and the remote network behavior collected by ourselves, and show that our embedding provides better information for identifying polymorphic malware. By clustering the 6,595 samples that show some network activity, using features defined through our embedding, we find that over 90% of clusters contain potentially polymorphic malware, with up to 80% of the clusters identifying truly polymorphic malware. This indicates the added benefit of network behavior encoding, over code analysis, for malware classification and malware defense. In Section 5.3 we describe our environment for safe and effective malware experimentation, which we use to collect information about the malware's network behaviors. In Section 5.4 we detail our embedding of network behavior into a feature vector for machine learning. In Section 5.5 we describe our method for detection of polymorphic malware using our embeddings. We show our findings in Section 5.6. We discuss future work in Section 5.7 and conclude in Section 5.8.

5.2 Background and Related Work

Malware analysis has received increased research interest over the years [61, 10, 45]. In this section, we discuss contemporary malware analysis methods and the rationale behind our approach.

5.2.1 Static Binary Analysis

A common approach to malware detection is analyzing binaries for code signatures – sequences of binary code that are present in malware and are not common in benign binaries [29, 12]. Such signatures can be used for malware detection, e.g., when binary code is downloaded over the network. Signature-based malware detection has been the most widely used method and has been quite successful. However, malware designers have also been working on countermeasures over the years to undermine such techniques. From junk code generation, malware encryption and
From junk code generation, malware encryption and oligomorphism, to polymorphic and metamorphic malware [51], such techniques have evolved significantly. Researchers have also been improving signature detection methods, such as detecting decryption routines, performing runtime signature detection, etc. It is difficult for researchers to gain the advantage in this race, as it will always be cheaper to generate new, obfuscated malware variants than to analyze them. Our research complements signature-based detection by identifying common network-level behaviors of malware. These behaviors can be used to develop behavioral signatures of malware, which can be used to detect malware that bypasses code-based defenses and runs on compromised hosts. In other words, signature-based detection can prevent malware infections, and behavior signatures can detect infections that bypass signature-based detection.

5.2.2 Dynamic Binary Analysis

Dynamic binary analysis builds behavioral signatures of a malware's interaction with its host. Such signatures may include memory access and file access patterns, as well as system call patterns. The patterns that are prevalent in malware but not in benign binaries can be used to develop a behavioral signature for malware detection [20]. Dynamic analysis complements static analysis, and can overcome malware code obfuscation [64, 22]. However, stealthy malware has another set of techniques to evade dynamic analysis – it detects debuggers and virtual machines, which are often used to speed up and facilitate dynamic analysis, and modifies its behavior to hide its true purpose [11, 7]. Our work complements binary analysis by providing another set of features, based on the network behavior of malware, that can be used for detection and behavior analysis.
Our approach allows for feature collection using the network and does not require virtualization (binaries can be run on bare-metal machines), which overcomes the aforementioned evasion techniques.

5.2.3 Dynamic Analysis of Network Behavior

There are a few efforts on analyzing the semantics of malware network behavior. Sandnet [49] provides a detailed, statistical analysis of malware network traffic, and surveys popular protocols employed by malware. Morales et al. [38] studied several network activities, selected using heuristics, which include: (1) NetBIOS name requests, (2) failed network connections after DNS or rDNS queries, (3) ICMP-only activity with no replies or with error replies, (4) TCP activity followed by ICMP replies, etc. These activities can be used to detect the likely presence of malware. Nari et al. [40] proposed an automated malware classification system also focusing on malware network behavior, which generates a protocol flow dependency graph based on the IP addresses being contacted by malware. Lever et al. [30] experimented with 26.8 million samples collected over five years and presented several findings, including that dynamic analysis traces are susceptible to noise and should be carefully curated, that Internet services are increasingly filled with potentially unwanted programs, and that network traffic provides the earliest indicator of infection. Our work complements the prior efforts by using a larger and more generic set of features to encode a malware's network behavior. In our prior work [16] we have shown how this encoding can be used to study a malware sample's genealogy and trends in malware code. In this work we focus on using the same encoding to detect polymorphic malware.

5.2.4 Contemporary Polymorphic Malware Analysis

Cesare et al. [9] proposed Malwise, a classification system for packed and polymorphic malware, which used string-based detection techniques for polymorphic malware behavior detection with high efficiency.
However, this approach is susceptible to more sophisticated polymorphic malware generation mechanisms, like shell code transformation and encryption, which make the generated malware samples undetectable by the same string-based signature. Our work focuses on runtime network behavior and hence is not affected by this issue. Tajoddin et al. [55] proposed HM3alD, a polymorphic malware detection system using a program behavior-aware hidden Markov model. This system is based on local behavior analysis. The authors also admitted that this approach is very dependent on program topology, and changing the topology could greatly affect its effectiveness. Our system instead uses network behavior patterns, and is hence immune to evasive malware that can detect a local analysis environment; it is also more generic, in that it can analyze most malware samples from the data set.

5.3 Capturing Malware Network Behavior

Contemporary malware relies more and more on the network to achieve its ultimate purpose [24, 54]. Malware often downloads binaries needed for its functionality from the Internet, or connects to a command and control channel to receive instructions on its next activity [25]. Advanced persistent threats [57] and key-loggers collect sensitive information on users' computers, but need network access to transfer it to the attacker. DDoS attack tools, scanners, spam, and phishing malware require network access to send malicious traffic to their targets. We study malware network activities because they have become essential for modern malware. The first step in this study includes capturing malware's network traffic in an environment that is transparent to malware, and that also minimizes risk to Internet hosts from adversarial malware actions.

5.3.1 Analysis Environment

We leverage the experimentation platform, called Fantasm, described in [18]. Fantasm is built on the DeterLab testbed [37], which is a public testbed for cyber-security research and education.
DeterLab allows users to request several physical nodes, connect them into custom network topologies, and install custom OS and applications on them. Users are granted root access to the machines in their experiments. Fantasm runs on DeterLab with full Internet access, and carefully constrains this access to achieve productive malware analysis and minimize risk to outside hosts. In our analysis, we run malware on a bare-metal Windows XP host, without any virtualization or debugger. We capture and analyze all network traffic between this machine and the outside, using a separate Linux host, which resides between the Windows host and the Internet. Both hosts are controlled by the Fantasm framework. Fantasm makes decisions on which communications to impersonate, i.e., intercept and answer itself, which to forward, and which to drop. This decision is made by taking into account each outgoing flow separately, making an initial decision, and revising it later if subsequent observations require this. Fantasm defines a flow as a unique combination of destination IP address, destination port, and protocol. Each flow is initially regarded as non-essential, and it is dropped. If this leads to the abortion of malware activity, Fantasm stops analysis, restarts it, and regards that specific flow as essential. Fantasm then considers if it can fake replies to this outgoing connection in a way that would be indistinguishable from the actual replies, should the flow be allowed into the Internet. If Fantasm has an impersonator for the given destination and the given service, it will intercept the communication and fake the response. Otherwise, it will evaluate if the outgoing flow is risky, i.e., potentially harmful to other Internet hosts. If so, the flow will be dropped. Otherwise, it will be let out into the Internet. Table 3.1 illustrates the criteria used by Fantasm to determine if a flow is risky or not, and if it can be impersonated.
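The per-flow decision process described above can be sketched as follows. The actual risk and impersonation criteria live in Table 3.1 and are not reproduced here; the concrete rules below (e.g., treating SMTP as risky, DNS and HTTP as impersonated services) are illustrative assumptions, as are all function and field names.

```python
# Illustrative sketch of Fantasm's per-flow decision logic (impersonate /
# forward / drop). The service list and risky ports below are assumptions
# made for illustration; the real criteria are given in Table 3.1.
from collections import namedtuple

Flow = namedtuple("Flow", ["dst_ip", "dst_port", "proto"])

# Hypothetical: services for which an impersonator exists (e.g., DNS, HTTP).
IMPERSONATED_SERVICES = {("udp", 53), ("tcp", 80)}
# Hypothetical: destination ports considered risky to let out (e.g., SMTP).
RISKY_PORTS = {25, 445}

def decide(flow, essential):
    """Return 'drop', 'impersonate', or 'forward' for one outgoing flow."""
    if not essential:
        # Every flow starts as non-essential and is dropped; if dropping it
        # aborts the malware's activity, analysis restarts with essential=True.
        return "drop"
    if (flow.proto, flow.dst_port) in IMPERSONATED_SERVICES:
        return "impersonate"
    if flow.dst_port in RISKY_PORTS:
        return "drop"
    return "forward"
```

For example, under these assumed rules an essential flow to TCP port 25 would still be dropped as risky, while an essential flow to TCP port 80 would be answered by the HTTP impersonator.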
Fantasm monitors a given sample's communication with the Internet, and limits the number of suspicious flows – flows that receive no replies – that a sample can initiate. Many scan and DDoS flows will be classified as suspicious. If the analyzed malware sample exceeds its allowance of suspicious flows (10 in the current implementation), Fantasm aborts the experiment and stops its analysis.

5.4 Embedding the Samples

Once each sample is analyzed in Fantasm, we extract flow-level details from the captured traffic traces and create an embedding for each flow and each sample. We reuse the flow features as defined in [16]. Our selected flow features can be categorized into three broad categories:

• Header information. This includes destination address, port, and transport protocol. We use this information to detect when different malware samples contact the same server, or the same destination port (and thus may leverage the same service at the destination server).

• Flow dynamics. This includes a sequence of application data units (ADUs, see Section 5.4.1) exchanged in each direction of the flow, which corresponds to request and response sizes on the flow. We use these features to detect malware flows with similar communication patterns.

• Payload information. We use a frequency-based technique [16] to encode the flow's payload into a compressed format, which can be used for fast comparison between flows (see Section 5.4.2).

An embedding for an entire sample is the union of its flow embeddings. A detailed list of features selected for each category is provided in Table 4.1. We next introduce two metrics we use to closely compare malware samples.
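The flow-dynamics and payload encodings above can be sketched as follows. The minimal packet representation (a direction flag plus payload bytes) is an assumption made for illustration, and the function names are ours.

```python
# Sketch of the flow-dynamics (ADU) and payload (byte-frequency) encodings.
# Assumption for illustration: each packet is a (direction, payload) pair,
# with direction being "out" or "in".

def adu_sequence(packets):
    """Collapse adjacent same-direction packets into application data units,
    keeping a boundary wherever the direction shifts."""
    adus = []
    for direction, payload in packets:
        if adus and adus[-1][0] == direction:
            adus[-1][1] += len(payload)             # extend the current ADU
        else:
            adus.append([direction, len(payload)])  # direction shift: new ADU
    return [(d, size) for d, size in adus]

def byte_frequency(payload):
    """256-entry vector: count of each byte value 0-255 in the payload,
    divided by the total payload size."""
    counts = [0] * 256
    for b in payload:
        counts[b] += 1
    total = len(payload) or 1
    return [c / total for c in counts]
```

For example, the packet sequence [("out", b"GET /"), ("out", b" HTTP"), ("in", b"404")] collapses into the ADU sequence [("out", 10), ("in", 3)]: two adjacent outgoing packets merge into one 10-byte request unit, followed by a 3-byte response unit.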
5.4.1 Application Data Unit (ADU)

Flow dynamics include sequences of application data units with their sizes and directions. An application data unit, or ADU, is an aggregation of a flow by direction, which combines all adjacent packets sent in the same direction, while maintaining the boundaries where the direction shifts. Intuitively, ADU dynamics seek to encode the lengths of requests and responses in a connection between a malware sample and a remote host. A transformation of a packet trace to ADUs is illustrated in Table 4.2. The ADU sequence is useful to detect similar flows across different malware samples based on their communication dynamics. For example, two different samples may download the same file from two different servers, and the contents may be encrypted with two different keys. This will make their payload information different. However, the ADU sequences of these two flows should be very similar, enabling us to detect that these two flows have a similar or same purpose.

5.4.2 Payload Byte Frequency

The payload usually stores application-level data. Not all malware flows carry a payload, but if it is present it usually carries high-level logic, such as new instructions or binaries that are important for new functionality in malware. Hence it is important to study payload contents. On the other hand, payload information usually does not have a specific structure, as different malware may organize its data differently. We thus need a way to quickly summarize and compare payloads that may have very different formats. We transform each flow's payload into a dictionary that encodes each byte's frequency. Keys to this dictionary are all possible byte values, 0–255, and the values being stored are the counts of how many times the given byte value was present in the payload. Finally, we divide each count by the total payload size to arrive at the frequency of byte values. This encoding has two advantages: First, it has a fixed and much smaller size than the actual payload.
Second, it simplifies our similarity comparison between flows and samples.

5.5 Polymorphic Malware Detection

Polymorphic malware transforms its binary form without changing its behavior. This enables malware to avoid being detected by most AV suites, because it evades static signature detection. A typical malware changes its binary form through a polymorphic engine. A common approach used by a polymorphic engine is packing, which transforms a binary by encrypting or compressing it, and includes shell code to reverse this transformation at runtime. In contrast, there also exists a type of malware called metamorphic malware, which permanently transforms its binary code into another form while maintaining its behavior, by replacing its assembly instructions with instructions that have equivalent functionality. From the attacker's point of view, achieving metamorphic transformation is harder than achieving polymorphism – the first requires transformation of the assembly code, which is non-trivial, while the second just requires encryption with a unique key. Therefore we focus on polymorphic malware in this work. We identify potential polymorphic malware samples using clustering over parts of our encoding of network behaviors. In Section 5.6 we experiment with the ADU sequence feature and with the byte frequency feature, and show that byte frequency works better in identifying polymorphic malware. We do not use header information for clustering, as a sample could easily change its communication endpoint (e.g., C&C server) and port to evade signature-based network detection. In other cases, the change of communication endpoint occurs because malware contacts a cloud-based service and each sample may communicate with a set of different IP addresses. Since the ADU sequence feature has variable length, depending on the number of flows and the number of ADUs in each flow, we limit the number of flows to 50 and the number of ADUs per flow to 10.
For shorter flows and smaller samples, we pad their embeddings with zeros. Finally, to be robust to flow or ADU reordering, we concatenate the embeddings for each flow in a sample, and sort the resulting embedding before clustering. Similarly, for the byte frequency feature, we limit the number of flows per sample to 50, pad shorter samples with zeros, concatenate their flow embeddings, and then sort to produce the final embedding for the sample. We use the OPTICS algorithm (Ordering Points To Identify the Clustering Structure) from Scikit-Learn for clustering, which is an algorithm for finding density-based clusters in spatial data. OPTICS uses Euclidean distance between the vectors that are being clustered, and two parameters – max_eps, which denotes the maximum distance of a sample from its cluster, and min_samples, which denotes the smallest allowed cluster size.

5.6 Evaluation

We now use the features and patterns identified or defined in the previous sections to study the prevalence of polymorphic malware in contemporary malware. It is difficult to evaluate the accuracy of our polymorphic malware identification, because there is no public polymorphic malware dataset. Instead, we use a collection of randomly selected malware samples from the Georgia Tech Apiary project [1]. We use our approach to identify clusters of malware that we believe are polymorphic, and then we use observations of the malware's local behavior to double-check the quality of our findings. This process is detailed in Section 5.6.2.

5.6.1 Experiment Setup

We build our experiment environment using the Fantasm platform [18] as introduced in Section 5.3.1. Fantasm utilizes the DeterLab infrastructure to construct a LAN environment with a node running Windows to host the malware, and another node running Ubuntu Linux acting as the gateway. In addition, Fantasm provides the necessary services for impersonators, and monitors the network activities by capturing all network packets using tcpdump.
One round of analysis for a given malware sample consists of the following steps:

• Enable service network monitoring on the Linux gateway
• Reload the operating system on the Windows node and set up the necessary network configurations
• Deploy and start the malware binary and keep it running for a given period (we used 5 minutes)
• Kill the malware process and save the captured network trace.

This setting has the advantage that it is immune to current analysis evasion attempts by malware, because it does not use a debugger or a virtual machine. By reloading the OS for each run, it ensures that each sample is analyzed in an environment free from any artifacts of previous analysis rounds. We selected malware samples captured throughout 2018 for this research. Next, we submitted each sample to VirusTotal [63] to determine the type of malware, and to ensure that the sample is recognized as malicious. We randomly selected 8,172 samples for our evaluation. We then analyzed each selected sample in our experiment environment, using the method described above. After the analysis, we saved traces with all captured communications to and from the Windows node. In total, we analyzed 6,595 malware samples out of the 8,172 samples we had selected. The remaining 1,577 samples did not successfully exchange payload with an external host. They sent only DNS queries to the local resolver but did not initiate any further contact with the outside. Since this low level of network communication is not sufficient to establish a malware sample's purpose, we exclude these samples from further analysis.

5.6.2 Cross-verifying Our Findings

To evaluate the quality of our clustering and the usefulness of our network behavior features for identification of polymorphic malware, we leverage malware's local behavior. We reuse the analysis results from VirusTotal to collect information about a sample's local behavior.
Those reports include file system accesses such as files created, opened, deleted, etc., hosts files changed to manipulate DNS resolution, system mutexes created or opened, processes the malware spawned, Windows registry entries changed, runtime DLLs accessed, system services used, etc. To compare local behaviors we need to decide which parameter to use for comparison. We find that most local information is potentially unstable, because it can be modified easily by malware. For example, the names of accessed local files may be modified even though they were actually the same file being changed by different malware samples; using those filenames as a metric for finding polymorphic malware samples would therefore be unreliable. The same trick may also be applied by malware to other types of information. We then focus on those parameters that cannot be changed easily by malware while keeping its functionality. As a result, we select the set of runtime DLLs accessed as a stable representation of local malware behavior. This feature belongs to the host system, which malware leverages for its own purpose, and thus it is difficult for malware to manipulate this feature while preserving its functionality. When we evaluate the accuracy of a cluster that OPTICS produced using our network behavior embedding, we count how many different local behaviors (different DLL sets) we observe in each cluster. We label those clusters that have a single DLL set as truly polymorphic. On the other hand, if a cluster has more than one DLL set, but fewer than the number of samples in the cluster, we call this cluster potentially polymorphic. Finally, if each sample in the cluster has a different DLL set, we say that this cluster is not polymorphic.

5.6.3 Calibrating Clustering Parameters

We first perform clustering experiments to identify the best-performing values of max_eps and min_samples.
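How these pieces fit together can be sketched as follows: the fixed-length per-sample embedding from Section 5.5 (here for the byte frequency feature), the OPTICS invocation shown as a comment with Scikit-Learn's interface, and the cluster-labeling rule from Section 5.6.2. All function names are ours, and the flow limit of 50 follows the text.

```python
# Sketch of the clustering pipeline: per-sample embedding (Section 5.5)
# plus the DLL-set labeling rule (Section 5.6.2). Names are ours.

MAX_FLOWS = 50  # flow limit per sample, as in the text

def sample_embedding(flow_freqs):
    """flow_freqs: one 256-entry byte-frequency vector per flow.
    Pad to MAX_FLOWS flows with zeros, concatenate, then sort so the
    embedding is robust to flow reordering."""
    flows = list(flow_freqs[:MAX_FLOWS])
    flows += [[0.0] * 256] * (MAX_FLOWS - len(flows))
    return sorted(v for freq in flows for v in freq)

# Clustering the embeddings (illustrative; requires scikit-learn):
#   from sklearn.cluster import OPTICS
#   labels = OPTICS(max_eps=20, min_samples=2).fit_predict(embeddings)

def label_cluster(dll_sets):
    """dll_sets: one frozenset of accessed DLL names per sample in a cluster."""
    distinct = len(set(dll_sets))
    if distinct == 1:
        return "truly polymorphic"        # one local behavior, many binaries
    if distinct < len(dll_sets):
        return "potentially polymorphic"  # some samples share local behavior
    return "not polymorphic"              # every sample behaves differently
```

Every sample thus maps to a vector of identical length (50 × 256 entries), so Euclidean distance between any two samples is well defined regardless of how many flows each one produced.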
Our accuracy measure is the percentage of clusters that are labeled as truly polymorphic, because all samples in those clusters access one set of local DLLs at runtime. We perform the experiment under both the full containment and the partial containment policy using Fantasm.

ADU Sequence (Partial Containment):

max_eps | min_samples | Cluster Count | Truly Polymorphic | Potentially Polymorphic | Samples Clustered
500     | 50          | 27            | 3 (11.1%)         | 27 (100%)               | 3,220 (48.8%)
200     | 20          | 55            | 10 (18.2%)        | 55 (100%)               | 3,553 (53.9%)
100     | 10          | 93            | 22 (23.7%)        | 92 (98.9%)              | 3,907 (59.2%)
50      | 5           | 164           | 55 (33.5%)        | 163 (99.3%)             | 4,192 (63.6%)
20      | 2           | 420           | 310 (73.8%)       | 405 (96.4%)             | 4,693 (71.2%)
10      | 2           | 408           | 302 (74.0%)       | 394 (96.5%)             | 4,640 (70.3%)
5       | 2           | 393           | 288 (73.2%)       | 379 (96.4%)             | 4,595 (69.7%)

ADU Sequence (Full Containment):

max_eps | min_samples | Cluster Count | Truly Polymorphic | Potentially Polymorphic | Samples Clustered
500     | 50          | 0             | 0 (0%)            | 0 (0%)                  | 0 (0.0%)
200     | 20          | 5             | 0 (0%)            | 4 (100%)                | 1,266 (19.2%)
100     | 10          | 8             | 0 (0%)            | 8 (100%)                | 1,401 (21.2%)
50      | 5           | 21            | 3 (14.3%)         | 21 (100%)               | 2,183 (33.1%)
20      | 2           | 50            | 26 (52.0%)        | 50 (100%)               | 3,061 (46.4%)
10      | 2           | 107           | 73 (68.2%)        | 106 (99.1%)             | 3,888 (59.0%)
5       | 2           | 237           | 177 (74.7%)       | 233 (98.3%)             | 4,757 (72.1%)

(a) Clustering results for the ADU sequence feature using different parameter combinations under each containment policy.
Byte Frequency (Partial Containment):

max_eps | min_samples | Cluster Count | Truly Polymorphic | Potentially Polymorphic | Samples Clustered
500     | 50          | 10            | 3 (30%)           | 10 (100%)               | 3,355 (50.8%)
200     | 20          | 21            | 7 (33.3%)         | 21 (100%)               | 3,493 (53.0%)
100     | 10          | 51            | 17 (33.3%)        | 50 (98.0%)              | 3,834 (58.1%)
50      | 5           | 94            | 42 (44.4%)        | 91 (96.8%)              | 3,969 (60.2%)
20      | 2           | 320           | 262 (81.9%)       | 302 (94.4%)             | 4,343 (65.9%)
10      | 2           | 285           | 228 (80.0%)       | 269 (94.4%)             | 4,230 (64.2%)
5       | 2           | 235           | 188 (80.0%)       | 219 (93.2%)             | 4,067 (61.7%)

Byte Frequency (Full Containment):

max_eps | min_samples | Cluster Count | Truly Polymorphic | Potentially Polymorphic | Samples Clustered
500     | 50          | 0             | 0 (0%)            | 0 (0%)                  | 0 (0%)
200     | 20          | 5             | 0 (0%)            | 5 (100%)                | 1,241 (18.8%)
100     | 10          | 8             | 0 (0%)            | 8 (100%)                | 957 (14.5%)
50      | 5           | 7             | 0 (0%)            | 7 (100%)                | 742 (11.3%)
20      | 2           | 8             | 0 (0%)            | 8 (100%)                | 331 (5.0%)
10      | 2           | 19            | 0 (0%)            | 19 (100%)               | 527 (8.0%)
5       | 2           | 54            | 6 (11.1%)         | 52 (96.3%)              | 748 (11.3%)

(b) Clustering results for the byte frequency feature using different parameter combinations under each containment policy.

Table 5.1: Clustering results using different parameter combinations.

The results are shown in Table 5.1. Under full containment, we observe that the quality of data is much worse than when running under our partial containment policy. Consequently, fewer clusters are produced using the same clustering parameters. These clusters cover far fewer samples, and many are not truly polymorphic. For results obtained under partial containment, we observe that the rate of detecting truly polymorphic clusters improves as we decrease max_eps and min_samples, with diminishing improvements after reaching a certain threshold. When we use ADU sequences, the best results are achieved for max_eps=10 and min_samples=2. These settings produce 408 clusters, out of which 74% are truly polymorphic and an additional 22.5% are potentially polymorphic. These settings cluster 4,640 samples, or 70% of all samples. For comparison, the same settings under full containment produce only 107 clusters, out of which 68.2% are truly polymorphic. These clusters cover 3,888 samples, or 59%. When we use byte frequency, the best results are achieved for max_eps=20 and min_samples=2. These settings produce 320 clusters, out of which 81.9% are truly polymorphic and an additional 12.5% are potentially polymorphic, which achieves the highest truly polymorphic percentage
When we use byte frequency best results are achieved for max_eps=20 and min_samples=2. These settings produce 320 clusters, out of which 81.9% are truly polymorphic and additional 12.5% are potentially polymorphic, which achieves the highest truly polymorphic percentage 80 Categorization Total Unclustered sub-categorization Cluster Sample With a single sub-pattern With multiple sub-patterns Count Count Pattern Count Sample Count Pattern Count Sample Count Network behavior (byte freq) with 320 2252 262 1,086(25%) 58 3,257 (75%) local behavior sub-pattern Local behavior with 1,322 0 798 1,449 (22.0%) 524 5,146 (78.0%) network behavior sub-pattern (byte freq) Table 5.2: Comparison of clustering by local and by network behavior across all settings that we have tested.. These settings cluster 4,343 samples or 65% of all samples. For comparison, the same settings in full containment produce only 8 clusters, out of which none are truly polymorphic. These clusters cover only 331 samples or 5%. In the rest of the chapter we use max_eps=20 and min_samples=2 and we use byte frequency to identify clusters of polymorphic malware samples. We use our partial containment results. 5.6.4 Networkvslocalbehavior Since we use local behavior to evaluate quality of our polymorphic malware identication, it may seem that sets of local DLLs could be used, independently of network features to identify polymorphic malware. We explore this direction in this section. We cluster all samples based on their DLL sets, grouping samples with identical sets into the same cluster. We then sub-cluster the samples in each cluster based on their network behavior, usingmax_eps=20 andmin_samples=2 and clustering over byte frequency feature. The results are shown in the second row in Table 5.2, and compared with our network-behavior based clustering, shown in the rst row of the same table. 
Clustering first on local behavior (DLL sets) leaves no unclustered samples, but only 22% of clustered samples and 60.4% of clusters exhibit the same network behaviors. This reflects the fact that many DLL sets are not unique to a single malware sample or malware purpose, but rather are used broadly by many samples for a variety of purposes. Table 5.3 shows the top 10 reused DLLs, which are used in 64.2% to 91.7% of all DLL behavior groups and 65.9% to 90.4% of all samples.

DLL file      | Reuse In Clusters | Reuse In Samples
rpcrt4.dll    | 1,212 (91.7%)     | 5,962 (90.4%)
advapi32.dll  | 1,171 (88.6%)     | 5,552 (83.7%)
shell32.dll   | 1,055 (79.8%)     | 4,772 (72.3%)
mswsock.dll   | 946 (71.6%)       | 4,880 (74.0%)
secur32.dll   | 928 (70.2%)       | 4,347 (65.9%)
comctl32.dll  | 914 (69.1%)       | 4,067 (61.7%)
ole32.dll     | 895 (67.7%)       | 4,346 (65.9%)
rasadhlp.dll  | 875 (66.2%)       | 4,747 (72.0%)
dnsapi.dll    | 873 (66.0%)       | 4,737 (71.8%)
wshtcpip.dll  | 850 (64.2%)       | 4,405 (66.8%)

Table 5.3: Top 10 most reused DLL files

The utility of some DLL files may be clear, e.g., mswsock.dll and dnsapi.dll are clearly used to initiate network activity, while some other DLL files are more general purpose, such as advapi32.dll, secur32.dll, comctl32.dll, etc., and hence do not provide a clear indication of the actual behavior. The high percentage of reuse of general-purpose DLL files makes using accessed DLL files to study behavior patterns more challenging. Also, a DLL access set does not provide a clear local behavior pattern, unlike our embedding, which maps directly to human-understandable network behavior. On the other hand, clustering based on network behaviors leaves 34.1% of samples unclustered, but produces more coherent clusters, with 25% of clustered samples and 81.9% of clusters exhibiting the same local behaviors (accessing the same DLL sets).

5.6.5 Large Malware Clusters

We now take a closer look into the largest clusters identified by our network-behavior-based clustering, shown in Table 5.4.
For each cluster, we list the domain names queried through DNS, the network communication protocols, the accessed remote IP addresses, and the sample count.

Group | Domain        | Comm. Methods | IP/Proto/Port Info                                                  | Sample Count
1     | migsel.com    | GET           | 95.128.128.129, TCP/80                                              | 115
2     | www.baidu.com | TCP           | 104.193.88.77, TCP/80                                               | 95
3     | N/A           | ICMP          | 72.30.35.10, ICMP; 98.137.246.8, ICMP; 98.138.219.232, ICMP (etc.)  | 93

Table 5.4: Top 3 polymorphic malware groups categorized by ADU features.

Group | Domain              | Comm. Methods | IP/Proto/Port Info            | Sample Count | DLL Patterns
1     | Accessing zief.pl   | N/A           | 148.81.111.121, TCP/65520     | 611          | 91
2     | Accessing aa.org    | N/A           | 157.122.62.205, TCP/1379      | 175          | 25
3     | google.com          | GET           | 172.217.4.174, TCP/80 (etc.)  | 21           | 7
4     | google.com          | GET           | 172.217.4.142, TCP/80 (etc.)  | 15           | 5

Table 5.5: Prominent potentially polymorphic malware groups categorized by ADU features.

We first pick the 3 largest truly polymorphic clusters and take a closer look at their behavior. The top cluster contains 115 samples, all of which try to access the domain migsel.com with the following GET request:

GET /system/classes/alive.php?
key=Blackshades%5FKey&
pcuser=<anonymized>&
pcname=PC118&
hwid=C405FD41&
country=United+States

This request looks like a keep-alive heartbeat packet, while providing information about the victim machine as part of a potential botnet. The authors tried to access migsel.com through a web browser at the time of the writing of this chapter, with success. The domain name points to an online shopping web site and seems to function normally. On further inspection, all requests to the above URL received HTTP 404 replies. One possible explanation is that the web site was compromised some time ago with a botnet control server and has since been fixed, so that all bots still trying to contact this server would fail, as observed in our experiment. The samples from the second largest cluster initiate a proprietary TCP connection to www.baidu.com through port 80, which is unusual.
The packet contents seem to be in a binary format, but some text is still recognizable and contains system information like OS version, CPU frequency, etc. This behavior is similar to a backdoor malware, which reports victim system information to a malware control center. Similar to the previous cluster, those requests were answered with “HTTP/1.1 400 Bad Request”, which suggests there was an HTTP server running on port 80. This suggests that there might have been a period of time when the baidu.com server was compromised to serve other purposes and has since been fixed. This cluster consists of 95 samples. The third largest cluster, consisting of 93 samples, sent out a huge number of ICMP packets trying to detect the availability of given IP addresses. Such a sample may initiate as many as 15,000+ ICMP packets during our 5-minute experiment duration. This suggests that ICMP is still a major way of detecting potential victims. We then take a look at potentially polymorphic groups of interest. We first inspected the top potentially polymorphic groups and found that 3 out of the top 5 clusters have their network communication related to zief.pl, a well-known malicious website that has since been taken down, on an unusual TCP port, 65520. The largest cluster consists of 611 samples, which also show 91 different DLL access patterns. Similarly, the second largest potentially polymorphic group tries to access aa.org on another unusual TCP port, 1379. As it turns out, the malware was taken down from the website, so all TCP SYN packets sent by the malware received no response. The other clusters of interest we observed consist of samples that tried to contact google.com with only a simple “GET /” HTTP request, without trying to submit anything. We believe such malware may use google.com for a network availability check.
Also note that while all samples from groups 3 and 4 access google.com, they are clustered differently not because they access different IP addresses, but because their flow counts differ. On the other hand, samples in the same cluster may access different google.com front-end servers due to differences in geolocation. Regardless, our embedding features still cluster them into the same cluster, demonstrating their robustness in identifying similar underlying network communication patterns. Our further inspection of the samples in the resulting clusters shows that our embedding mechanism is effective at identifying common network behavior, and at the same time retains human-readable information to ease further analysis of network behaviors. Now we compare the network behavior our system captured between truly polymorphic clusters and potentially polymorphic clusters. We observe that for truly polymorphic malware, the network activity accesses lesser-known domains, or performs simple tasks like sending ICMP packets. Such malware samples are either more single-purposed or specialized to perform a set of specific tasks, as defined by a malicious control center residing on a malicious or compromised web site. On the other hand, malware samples from potentially polymorphic clusters are more likely to access a public service or a well-known malicious service. As we detect different DLL set access patterns, we suppose that such samples may perform different local malicious activities while sharing the same network activity. This suggests malware component reuse, which combines different single-purpose malicious modules to form new malware samples. As the statistics in Table 5.2 show, more samples fall into clusters with more than one DLL set. This suggests that potential malware module reuse is very common in samples encountered in the wild. As shown previously, our clustering mechanism based on network behavior embeddings can identify, or provide evidence of, the existence of such polymorphic malware samples.
5.7 Discussion and Future Work

Our clustering results show that many malware samples are very similar with regard to the ADU sequences and byte frequencies of the flows they contain. In addition to detecting polymorphic malware, these features could be used to form behavioral signatures for malware detection. Since the malware ecosystem changes rapidly, the signatures we devise today will likely be obsolete tomorrow. However, our methodology can be used with contemporary malware samples to identify future clusters of behaviors and payloads and to help defenses keep track of malware evolution. Our current results are still bounded by time and computing power restrictions. Given more time and better infrastructure, we foresee several future directions for our research.

Increase analyzed sample repository. We would like to examine more malware samples and expand our analysis, to balance out samples in each high-level behavior group. We would also like to perform a longitudinal study of malware evolution over time, to quantify how much dominant behaviors change.

Understand sample genealogy. Our results can currently help us identify samples that share similar or identical flows, suggesting that these samples may have a common author or that they may share code. We would like to extend our analysis to map out the evolution of malware samples, e.g., which sample came first, how a specific behavior (e.g., contacting a C&C channel) changed over time, etc.

Malware detection. Most high-level malware behaviors also occur in benign software, which makes them unreliable for malware detection purposes. However, combinations of these high-level behaviors may be unique to malware and could be useful for detection.
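One way to operationalize such behavior combinations is a simple rule over per-sample behavior flags. In the sketch below the flag names and the two combinations are hypothetical examples, not signatures derived in this work.

```python
# Hypothetical per-sample behavior flags; each alone can be benign,
# but certain combinations rarely occur in legitimate software.
SUSPICIOUS_COMBINATIONS = [
    {"scans_hosts", "copies_data_out"},    # reconnaissance followed by exfiltration
    {"scans_hosts", "spreads_copies"},     # reconnaissance followed by propagation
]

def combination_suspicious(behaviors: set) -> bool:
    """True when a sample exhibits every flag of any suspicious combination."""
    return any(combo <= behaviors for combo in SUSPICIOUS_COMBINATIONS)
```

A sample flagged only with copies_data_out (e.g., a backup tool uploading to a cloud) passes the check, while one that also scans other hosts is flagged.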
For example, software that scans other hosts and then copies data over is unlikely to be benign, although each of these actions separately could be undertaken by benign software (e.g., probing several servers could look like scanning if the servers are unresponsive, and data can be uploaded to a cloud for legitimate reasons). As we extend our malware analysis to more samples, we expect to find more behavior combinations, which can be used for malware detection.

Malware Evading Network Analysis. Given the existence of techniques to evade signature-based malware detection, we can naturally assume that malware designers will also look to evade our network-communication-based analysis. There are several strategies that malware could use: (a) appear dormant and await a specific trigger; (b) interleave malicious activities with benign traffic to avoid detection. We leave the handling of these evasion techniques for future work, but sketch some possible solutions here. For trigger-based malware, we could leverage existing work on trigger detection, such as using static analysis and symbolic execution [12]. Since our methodology analyzes each flow separately, malware that interleaves its malicious behavior with benign behavior will still generate flows that we can recognize as malicious. Miramirkhani et al. [36] showed that malware authors could detect that a sample is being analyzed by noticing a lack of user-generated files or registry entries. Our current analysis environment would thus be easily detected. In the future we plan to address these challenges by providing artificial artifacts of human-user presence in our analysis environment, such as command line history, registry entries, files in user directories, recently opened file lists in popular applications, etc.

5.8 Conclusion

In this work, we propose to use clustering over malware’s network traffic patterns to identify polymorphic malware.
We show that clustering over the application data unit and byte frequency features of malware’s network traffic produces a high percentage of clusters containing samples whose flows are very similar to each other and whose local behaviors are identical. We show that using local DLL access patterns to identify polymorphic malware samples has limited capability, because common DLL files are heavily reused, while our embedding-based pattern clustering results in over 90% of potentially polymorphic clusters and up to 80% of truly polymorphic clusters, and in the meantime provides human-understandable network patterns that help explain the underlying behaviors of the malware.

Chapter 6
Summary: Our Contributions

In this chapter we summarize the state of the art in malware analysis and malware trend and similarity detection. We then highlight our contributions to this body of work. Researchers have explored multiple methods to analyze malware [13, 5, 31, 48, 38, 40, 65]. However, existing work has several limitations. First, fully contained environments can be detected by malware, which then modifies its behavior. Analysis environments are currently subject to existing anti-debugger, anti-virtual-machine, and anti-disassembly techniques deployed by malware to detect that it is being analyzed. Our work addresses this problem by analyzing malware network behaviors on bare metal machines, without any additional instrumentation. Second, full containment of network activity makes malware abort its actions, because malware relies on network connectivity to download malicious software, upload its findings, and receive commands from its C&C network. Our research has uncovered that in 78% of cases malware reduces its activity if it cannot reach the Internet. Alternatively, malware could be analyzed without any containment policy. This would be unsafe, as malware can try to spread to other Internet hosts, perform vulnerability scanning of other hosts, or generate harmful traffic (DDoS or spam/phishing emails) to other hosts. Researchers would be complicit in such harmful actions. Our Fantasm framework [18] addresses these issues through a flexible, carefully crafted containment policy. It iteratively analyzes malware, identifying essential network communications that must be allowed to keep malware analysis productive. Such communications are first simulated, using our impersonators within a contained environment. Impersonators try to trick malware into believing it is communicating with Internet hosts. If impersonation attempts fail, malware is allowed to send traffic to the outside, but this communication is closely observed, and it is aborted if it is too aggressive or one-sided (a one-sided nature indicates scanning or DDoS). We show in our evaluation of Fantasm that it leads to productive malware analysis and does not generate safety complaints. There are also many attempts to encode network traffic for malware analysis [38, 40]. These efforts focus on answering a specific research question, such as identifying popular protocols in malware communication or popular remote hosts being contacted. Our two embeddings [18, 16] focus on abstracting the specifics of individual communications, while encoding their dynamics and content (ADU and payload byte frequency), and port and protocol information along with message sizes (NetDigest). Clustering over these sets of features helps us classify malware accurately [18] based on NetDigest, and helps us identify malware samples with similar network behaviors [16]. In [17] we propose using our improved network embedding that focuses on detecting malware samples with almost identical behaviors, which can indicate polymorphic malware.
We find that existing research on polymorphic malware detection mostly focuses on binary analysis [56, 39, 9, 21, 55], attempting to identify an invariant portion of the polymorphic malware binary for detection, in order to auto-generate signatures that adapt to transformed polymorphic binaries. Our work is novel in this regard, as it provides a new, effective way to identify polymorphic malware based on its network behavior. Unfortunately, there is no public repository of polymorphic malware, and thus we cannot directly compare with existing work.

Chapter 7
Conclusion

Malicious software has been becoming more and more sophisticated, engaging in evasive behaviors when being analyzed and changing its binary code to evade signature-based detection. Our work improves state-of-the-art malware analysis by creating Fantasm [18], a safe malware experimentation environment with flexible, partial containment of malware’s network activity. In this thesis we show that Fantasm leads to safe and productive malware analysis, and enables the capture of network traffic traces generated by malware. These traces can then be used to accurately classify malware [18, 15], to detect communication trends [16], and to detect polymorphic malware [17]. We hope our work can complement code-based malware analysis to offer new insights into malware goals and behavior trends and to offer new data for malware detection.

Bibliography

[1] Apiary. http://apiary.gtri.gatech.edu/. Accessed: 2018-05-09.
[2] Raj Badhwar. “Polymorphic and Metamorphic Malware”. In: The CISO’s Next Frontier. Springer, 2021, pp. 279–285.
[3] Mark Baggett. “Effectiveness of antivirus in detecting metasploit payloads”. In: SANS Institute (2008).
[4] Michael Bailey, Jon Oberheide, Jon Andersen, Zhuoqing Morley Mao, Farnam Jahanian, and Jose Nazario. “Automated classification and analysis of internet malware”. In: RAID. Vol. 4637. Springer. 2007, pp. 178–197.
[5] Davide Balzarotti, Marco Cova, Christoph Karlberger, Engin Kirda, Christopher Kruegel, and Giovanni Vigna. “Efficient Detection of Split Personalities in Malware.” In: NDSS. 2010.
[6] Terry Benzel. “The Science of Cyber-Security Experimentation: The DETER Project”. In: Annual Computer Security Applications Conference (ACSAC). 2011.
[7] Rodrigo Rubira Branco, Gabriel Negreira Barbosa, and Pedro Drimel Neto. “Scientific but not academical overview of malware anti-debugging, anti-disassembly and anti-vm technologies”. In: Black Hat (2012).
[8] Juan Caballero, Chris Grier, Christian Kreibich, and Vern Paxson. “Measuring pay-per-install: the commoditization of malware distribution.” In: Usenix security symposium. Vol. 13. 2011.
[9] Silvio Cesare, Yang Xiang, and Wanlei Zhou. “Malwise—an effective and efficient classification system for packed and polymorphic malware”. In: IEEE Transactions on Computers 62.6 (2012), pp. 1193–1206.
[10] S Sibi Chakkaravarthy, D Sangeetha, and V Vaidehi. “A Survey on malware analysis and mitigation techniques”. In: Computer Science Review 32 (2019), pp. 1–23.
[11] Xu Chen, Jon Andersen, Z Morley Mao, Michael Bailey, and Jose Nazario. “Towards an understanding of anti-virtualization and anti-debugging behavior in modern malware”. In: Dependable Systems and Networks With FTCS and DCC, 2008. DSN 2008. IEEE International Conference on. IEEE. 2008, pp. 177–186.
[12] Mihai Christodorescu and Somesh Jha. Static analysis of executables to detect malicious patterns. Tech. rep. University of Wisconsin–Madison, Department of Computer Sciences, 2006.
[13] Paolo Milani Comparetti, Guido Salvaneschi, Engin Kirda, Clemens Kolbitsch, Christopher Kruegel, and Stefano Zanero. “Identifying dormant functionality in malware programs”. In: IEEE Symposium on Security and Privacy. 2010, pp. 61–76.
[14] Glenn De’ath and Katharina E Fabricius. “Classification and regression trees: a powerful yet simple technique for ecological data analysis”. In: Ecology 81.11 (2000), pp. 3178–3192.
[15] Xiyue Deng and Jelena Mirkovic. “Malware analysis through high-level behavior”. In: 11th USENIX Workshop on Cyber Security Experimentation and Test (CSET 18). 2018.
[16] Xiyue Deng and Jelena Mirkovic. “Malware Behavior Through Network Trace Analysis”. In: International Networking Conference. Springer. 2020, pp. 3–18.
[17] Xiyue Deng and Jelena Mirkovic. “Polymorphic Malware Behavior Through Network Trace Analysis”. In: 2022 14th International Conference on COMmunication Systems & NETworkS (COMSNETS). IEEE. 2022, pp. 138–146.
[18] Xiyue Deng, Hao Shi, and Jelena Mirkovic. “Understanding Malware’s Network Behaviors using Fantasm”. In: Proceedings of LASER 2017 Learning from Authoritative Security Experiment Results (2017), p. 1.
[19] David Dittrich and Erin Kenneally. “The Menlo Report: Ethical principles guiding information and communication technology research”. In: US Department of Homeland Security (2012).
[20] Manuel Egele, Theodoor Scholte, Engin Kirda, and Christopher Kruegel. “A survey on automated dynamic malware-analysis techniques and tools”. In: ACM computing surveys (CSUR) 44.2 (2012), p. 6.
[21] James B Fraley and Marco Figueroa. “Polymorphic malware detection using topological feature extraction with data mining”. In: SoutheastCon, 2016. IEEE. 2016, pp. 1–7.
[22] Claudio Guarnieri, Allessandro Tanasi, Jurriaan Bremer, and Mark Schloesser. The cuckoo sandbox. https://cuckoosandbox.org/. 2012.
[23] Isabelle Guyon, B Boser, and Vladimir Vapnik. “Automatic capacity tuning of very large VC-dimension classifiers”. In: Advances in neural information processing systems. 1993, pp. 147–155.
[24] Thorsten Holz, Markus Engelberth, and Felix Freiling. “Learning more about the underground economy: A case-study of keyloggers and dropzones”. In: Computer Security–ESORICS (2009), pp. 1–18.
[25] Kurt Thomas, Danny Yuxing Huang, David Wang, Elie Bursztein, Chris Grier, Thomas J Holt, Christopher Kruegel, Damon McCoy, Stefan Savage, and Giovanni Vigna.
“Framing dependencies introduced by underground commoditization”. In: Workshop on the Economics of Information Security. 2015.
[26] ISC Tech Georgia, Open Malware. http://oc.gtisc.gatech.edu/. 2017.
[27] Kaspersky. Kaspersky Lab detects 360,000 new malicious files daily. https://www.kaspersky.com/about/press-releases/2017_kaspersky-lab-detects-360000-new-malicious-files-daily. Accessed: 2018-09-23. 2017.
[28] Kaspersky Lab, 323,000 New Malware Samples Found Each Day. http://www.darkreading.com/vulnerabilities---threats/kaspersky-lab-323000-new-malware-samples-found-each-day/d/d-id/1327655. 2016.
[29] Johannes Kinder, Stefan Katzenbeisser, Christian Schallhart, and Helmut Veith. “Detecting malicious code by model checking”. In: International Conference on Detection of Intrusions and Malware, and Vulnerability Assessment. Springer. 2005, pp. 174–187.
[30] Chaz Lever, Platon Kotzias, Davide Balzarotti, Juan Caballero, and Manos Antonakakis. “A lustrum of malware network communication: Evolution and insights”. In: 2017 IEEE Symposium on Security and Privacy (SP). IEEE. 2017, pp. 788–804.
[31] Martina Lindorfer, Clemens Kolbitsch, and Paolo Milani Comparetti. “Detecting environment-sensitive malware”. In: Recent Advances in Intrusion Detection. Springer. 2011, pp. 338–357.
[32] Shun-Te Liu, Yi-Ming Chen, and Shiou-Jing Lin. “A novel search engine to uncover potential victims for apt investigations”. In: IFIP International Conference on Network and Parallel Computing. 2013, pp. 405–416.
[33] Master Feeds, Bambenek Consulting Feeds. http://osint.bambenekconsulting.com/feeds/. 2017.
[34] MaxMind, GeoLite Legacy Downloadable Databases. http://dev.maxmind.com/geoip/legacy/geolite/. 2017.
[35] Leigh B. Metcalf, Dan Ruef, and Jonathan M. Spring. “Open-source Measurement of Fast-flux Networks While Considering Domain-name Parking”. In: Proceedings of the Learning from Authoritative Security Experiment Results Workshop. 2017.
[36] Najmeh Miramirkhani, Mahathi Priya Appini, Nick Nikiforakis, and Michalis Polychronakis. “Spotless sandboxes: Evading malware analysis systems using wear-and-tear artifacts”. In: 2017 IEEE Symposium on Security and Privacy (SP). IEEE. 2017, pp. 1009–1024.
[37] Jelena Mirkovic and Terry Benzel. “Deterlab testbed for cybersecurity research and education”. In: Journal of Computing Sciences in Colleges 28.4 (2013), pp. 163–163.
[38] Jose Andre Morales, Areej Al-Bataineh, Shouhuai Xu, and Ravi Sandhu. “Analyzing and exploiting network behaviors of malware”. In: International Conference on Security and Privacy in Communication Systems. 2010, pp. 20–34.
[39] Fahad Bin Muhaya, Muhammad Khurram Khan, and Yang Xiang. “Polymorphic malware detection using hierarchical hidden markov model”. In: 2011 IEEE Ninth International Conference on Dependable, Autonomic and Secure Computing. IEEE. 2011, pp. 151–155.
[40] Saeed Nari and Ali A Ghorbani. “Automated malware classification based on network behavior”. In: Computing, Networking and Communications (ICNC), 2013 International Conference on. IEEE. 2013, pp. 642–647.
[41] Fabian Pedregosa, Gaël Varoquaux, Alexandre Gramfort, Vincent Michel, Bertrand Thirion, Olivier Grisel, Mathieu Blondel, Peter Prettenhofer, Ron Weiss, Vincent Dubourg, Jake Vanderplas, Alexandre Passos, David Cournapeau, Matthieu Brucher, Matthieu Perrot, and Édouard Duchesnay. “Scikit-learn: Machine learning in Python”. In: Journal of Machine Learning Research 12.Oct (2011), pp. 2825–2830.
[42] Leonid Portnoy, Eleazar Eskin, and Sal Stolfo. “Intrusion detection with unlabeled data using clustering”. In: Proceedings of ACM CSS Workshop on Data Mining Applied to Security. 2001.
[43] PurpleSec. 2021 Cyber Security Statistics. https://purplesec.us/resources/cyber-security-statistics/. 2012.
[44] Babak Bashari Rad, Maslin Masrom, and Suhaimi Ibrahim. “Camouflage in malware: from encryption to metamorphism”.
In: International Journal of Computer Science and Network Security 12.8 (2012), pp. 74–83.
[45] Chandni Raghuraman, Sandhya Suresh, Suraj Shivshankar, and Radhika Chapaneri. “Static and dynamic malware analysis using machine learning”. In: First International Conference on Sustainable Technologies for Computational Intelligence. Springer. 2020, pp. 793–806.
[46] Konrad Rieck, Thorsten Holz, Carsten Willems, Patrick Düssel, and Pavel Laskov. “Learning and classification of malware behavior”. In: International Conference on Detection of Intrusions and Malware, and Vulnerability Assessment. 2008, pp. 108–125.
[47] Konrad Rieck, Philipp Trinius, Carsten Willems, and Thorsten Holz. “Automatic analysis of malware behavior using machine learning”. In: Journal of Computer Security 19.4 (2011), pp. 639–668.
[48] Christian Rossow, Christian J Dietrich, Herbert Bos, Lorenzo Cavallaro, Maarten Van Steen, Felix C Freiling, and Norbert Pohlmann. “Sandnet: Network traffic analysis of malicious software”. In: Proceedings of the First Workshop on Building Analysis Datasets and Gathering Experience Returns for Security. ACM. 2011, pp. 78–88.
[49] Christian Rossow, Christian J Dietrich, Herbert Bos, Lorenzo Cavallaro, Maarten Van Steen, Felix C Freiling, and Norbert Pohlmann. “Sandnet: Network traffic analysis of malicious software”. In: Proceedings of the First Workshop on Building Analysis Datasets and Gathering Experience Returns for Security. ACM. 2011, pp. 78–88.
[50] David E Rumelhart, Geoffrey E Hinton, and Ronald J Williams. “Learning representations by back-propagating errors”. In: Cognitive modeling 5.3 (1988), p. 1.
[51] Ashu Sharma and Sanjay Kumar Sahay. “Evolution and detection of polymorphic and metamorphic malwares: A survey”. In: arXiv preprint arXiv:1406.7061 (2014).
[52] Hao Shi, Abdulla Alwabel, and Jelena Mirkovic. “Cardinal Pill Testing of System Virtual Machines.” In: USENIX Security Symposium. 2014, pp. 271–285.
[53] Hao Shi and Jelena Mirkovic.
“Hiding debuggers from malware with Apate”. In: Proceedings of the Symposium on Applied Computing. ACM. 2017, pp. 1703–1710.
[54] Brett Stone-Gross, Thorsten Holz, Gianluca Stringhini, and Giovanni Vigna. “The Underground Economy of Spam: A Botmaster’s Perspective of Coordinating Large-Scale Spam Campaigns.” In: LEET 11 (2011), pp. 4–4.
[55] Asghar Tajoddin and Saeed Jalili. “HM3alD: Polymorphic malware detection using program behavior-aware hidden Markov model”. In: Applied Sciences 8.7 (2018), p. 1044.
[56] Ke Tang, Ming-Tian Zhou, and Zhi-Hong Zuo. “An enhanced automated signature generation algorithm for polymorphic malware detection”. In: Journal of Electronic Science and Technology 8.2 (2010), pp. 114–121.
[57] Colin Tankard. “Advanced persistent threats and how to monitor and deter them”. In: Network security 2011.8 (2011), pp. 16–19.
[58] AV-TEST. AV-TEST Malware statistics. https://web.archive.org/web/20211231082316/https://www.av-test.org/en/statistics/malware/. Accessed: 2022-01-07. 2022.
[59] Kurt Thomas, Danny Yuxing Huang, David Wang, Elie Bursztein, Chris Grier, Thomas J Holt, Christopher Kruegel, Damon McCoy, Stefan Savage, and Giovanni Vigna. “Framing dependencies introduced by underground commoditization”. In: Proceedings (online) of the Workshop on Economics of Information Security. 2015.
[60] Fred Touchette. “The evolution of malware”. In: Network Security 2016.1 (2016), pp. 11–14.
[61] Daniele Ucci, Leonardo Aniello, and Roberto Baldoni. “Survey of machine learning techniques for malware analysis”. In: Computers & Security 81 (2019), pp. 123–147.
[62] VirusTotal. https://www.virustotal.com/en/. 2017.
[63] VirusTotal. VirusTotal-Free online virus, malware and URL scanner. https://www.virustotal.com/en. 2012.
[64] Carsten Willems, Thorsten Holz, and Felix Freiling. “Toward automated dynamic malware analysis using cwsandbox”. In: IEEE Security & Privacy 5.2 (2007).
[65] L Xue and Guozi Sun.
“Design and implementation of a malware detection system based on network behavior”. In: Security and Communication Networks 8.3 (2015), pp. 459–470.
[66] Ilsun You and Kangbin Yim. “Malware obfuscation techniques: A brief survey”. In: Broadband, Wireless Computing, Communication and Applications (BWCCA), 2010 International Conference on. IEEE. 2010, pp. 297–300.
Abstract
Malware defenses today deploy manual or semi-automated analysis in debuggers and virtual machines to derive code-based signatures, and then use these signatures to detect malware as it traverses the network or infects a vulnerable host. However, malware is becoming ever more sophisticated at bypassing these defenses. Contemporary malware attempts to detect when it is being analyzed in a debugger or virtual machine, and changes its behavior, preventing signature derivation. Malware also mutates its code, generating new polymorphic samples, which evade existing code signatures. Such countermeasures have set a higher bar for modern malware detection and analysis.

In this dissertation, we propose safe and efficient ways to analyze and encode malware network behaviors, which can be observed in bare metal environments, and can reliably indicate a malware sample’s true purpose, in spite of code polymorphism. First, we propose Fantasm, a safe live analysis environment, which does not use debuggers or virtual machines. Fantasm deceives malware that it runs on an unconstrained host, while carefully curtailing its malicious activities for safe analysis. Fantasm is built upon the DeterLab testbed, and uses a series of security mechanisms to limit malware interaction with the live Internet, and decoy services, which mimic responses malware samples expect, to reveal interesting network behaviors. We show how network behaviors, observable in Fantasm, can be used to classify malware into VirusTotal categories, without any binary code analysis.

Second, we propose an embedding of malware’s high-level network behavior, which we observe in Fantasm. Our embedding summarizes important features of the network traffic malware exchanges with Internet hosts or our decoy services. This embedding enables us to encode interesting patterns of communication from malware’s traffic, and identify common patterns among many samples, which could be used for malware detection. It also enables fast and accurate classification of malware samples via machine learning.

Finally, we further develop our malware sample embedding to detect polymorphic malware. We show how our embedding based on the network behavior of malware successfully identifies malware with identical network and local behaviors, but different binary code. This approach complements traditional malware detection based on binary code signatures.
Linked assets
University of Southern California Dissertations and Theses
Conceptually similar
Supporting faithful and safe live malware analysis
Enabling symbolic execution string comparison during code-analysis of malicious binaries
Toward understanding mobile apps at scale
Leveraging programmability and machine learning for distributed network management to improve security and performance
Automatic detection and optimization of energy optimizable UIs in Android applications using program analysis
Detection, localization, and repair of internationalization presentation failures in web applications
Improving binary program analysis to enhance the security of modern software systems
Detecting SQL antipatterns in mobile applications
Inferring mobility behaviors from trajectory datasets
Mining and modeling temporal structures of human behavior in digital platforms
Defending industrial control systems: an end-to-end approach for managing cyber-physical risk
Parasocial consensus sampling: modeling human nonverbal behaviors from multiple perspectives
When AI helps wildlife conservation: learning adversary behavior in green security games
Behavior understanding from speech under constrained conditions: exploring sparse networks, transfer and unsupervised learning
Improving efficiency, privacy and robustness for crowd-sensing applications
Resource underutilization exploitation for power efficient and reliable throughput processor
Behavior-based approaches for detecting cheating in online games
Protecting online services from sophisticated DDoS attacks
Computational modeling of human behavior in negotiation and persuasion: the challenges of micro-level behavior annotations and multimodal modeling
A function-based methodology for evaluating resilience in smart grids
Asset Metadata
Creator: Deng, Xiyue (author)
Core Title: Studying malware behavior safely and efficiently
School: Viterbi School of Engineering
Degree: Doctor of Philosophy
Degree Program: Computer Science
Degree Conferral Date: 2022-05
Publication Date: 01/28/2022
Defense Date: 01/18/2022
Publisher: University of Southern California (original), University of Southern California. Libraries (digital)
Tag: containment, malware behavior, network, OAI-PMH Harvest, Security
Format: application/pdf (imt)
Language: English
Contributor: Electronically uploaded by the author (provenance)
Advisor: Mirkovic, Jelena (committee chair), Gupta, Sandeep (committee member), Halfond, William G. J. (committee member), Neuman, Clifford (committee member)
Creator Email: dengxiyue@gmail.com, xiyueden@isi.edu
Permanent Link (DOI): https://doi.org/10.25549/usctheses-oUC110582765
Unique identifier: UC110582765
Legacy Identifier: etd-DengXiyue-10363
Document Type: Dissertation
Rights: Deng, Xiyue
Type: texts
Source: 20220201-usctheses-batch-910 (batch), University of Southern California (contributing entity), University of Southern California Dissertations and Theses (collection)
Access Conditions: The author retains rights to his/her dissertation, thesis or other graduate work according to U.S. copyright law. Electronic access is being provided by the USC Libraries in agreement with the author, as the original true and official version of the work, but does not grant the reader permission to use the work if the desired use is covered by copyright. It is the author, as rights holder, who must provide use permission if such use is covered by copyright. The original signature page accompanying the original submission of the work to the USC Libraries is retained by the USC Libraries and a copy of it may be obtained by authorized requesters contacting the repository e-mail address given.
Repository Name: University of Southern California Digital Library
Repository Location: USC Digital Library, University of Southern California, University Park Campus MC 2810, 3434 South Grand Avenue, 2nd Floor, Los Angeles, California 90089-2810, USA
Repository Email: cisadmin@lib.usc.edu