Improving Binary Program Analysis to Enhance the Security of
Modern Software Systems
by
Nicolaas Weideman
A Dissertation Presented to the
FACULTY OF THE USC GRADUATE SCHOOL
UNIVERSITY OF SOUTHERN CALIFORNIA
In Partial Fulfillment of the
Requirements for the Degree
DOCTOR OF PHILOSOPHY
(COMPUTER SCIENCE)
May 2024
Copyright 2024 Nicolaas Weideman
Acknowledgements
First, I would like to thank my advisors, Dr. Jelena Mirkovic and Dr. Christophe Hauser, for their
unwavering support throughout my PhD journey. Dr. Mirkovic’s expertise in advising PhD students is unparalleled, and I could not have succeeded without her guidance. Dr. Hauser introduced
me to the fascinating field of binary program analysis, which continues to captivate me as much
as it did on the first day. His knowledge and insights in this field were invaluable in helping me
complete my PhD.
I would also like to extend my gratitude to everyone else I have worked with at USC during
my PhD, including Dr. Genevieve Bartlett, Dr. Luis Garcia, and Dr. Erik Kline. Each of you has
taught me valuable lessons about conducting research. Additionally, I am grateful to the staff of
the Information Sciences Institute and USC for their support with administrative tasks.
A special thanks to all the students of the STEEL lab, past and present. Thank you, Rajat and
Sivaram, for showing me the ropes of being a PhD student. Thank you, Sima, Wei-Cheng, Tristan,
Will, and Dipsy, for joining me on the journey of obtaining a PhD.
I would also like to acknowledge the members of the USC Capture the Flag (CTF) team.
Participating in CTFs provided a welcome distraction from research and taught me many technical
skills that I would not have encountered in my research work.
Lastly, I want to thank my friends and family. Your support has meant the world to me.
Table of Contents
Acknowledgements . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . ii
Table of Contents . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . iii
List of Tables . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . vi
List of Figures . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . viii
Abstract . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . ix
Chapter 1: Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1
1.1 Thesis statement . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4
1.2 Demonstrating the thesis statement . . . . . . . . . . . . . . . . . . . . . . . . . . 5
1.3 Structure of the dissertation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6
Chapter 2: Harm-DoS: Hash Algorithm Replacement for Mitigating Denial-of-Service
Vulnerabilities in Binary Executables . . . . . . . . . . . . . . . . . . . . . . . . . . 7
2.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8
2.2 Scope . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10
2.2.1 Fast hash algorithms . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10
2.2.2 Hash-collision vulnerabilities . . . . . . . . . . . . . . . . . . . . . . . . . 12
2.2.3 Attacker model . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12
2.3 Challenges and requirements . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12
2.4 Approach overview . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 15
2.5 Vulnerability diagnosis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 17
2.5.1 Hash function template-matching . . . . . . . . . . . . . . . . . . . . . . . 18
2.5.2 Constant-mnemonic pair discovery . . . . . . . . . . . . . . . . . . . . . . 19
2.6 Pre-patch examination . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 20
2.6.1 Filtering inlined hash functions . . . . . . . . . . . . . . . . . . . . . . . . 21
2.6.1.1 Duplicate candidate algorithm detection . . . . . . . . . . . . . . 21
2.6.1.2 Template-match size difference . . . . . . . . . . . . . . . . . . 22
2.6.2 Symbolic hash function analyses . . . . . . . . . . . . . . . . . . . . . . . 23
2.6.2.1 Symbolic signature detection . . . . . . . . . . . . . . . . . . . 23
2.6.2.2 Symbolic input-output matching . . . . . . . . . . . . . . . . . . 25
2.6.2.3 Symbolic case sensitivity checking . . . . . . . . . . . . . . . . 26
2.6.2.4 Symbolic memory access analysis . . . . . . . . . . . . . . . . . 27
2.7 Hash transplant . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 28
2.7.1 Replacement hash function construction . . . . . . . . . . . . . . . . . . . 28
2.7.1.1 Candidate replacement hash algorithms . . . . . . . . . . . . . . 28
2.7.2 Replacing the hash function . . . . . . . . . . . . . . . . . . . . . . . . . 29
2.8 Post-patch examination . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 30
2.8.1 Symbolic patch memory access analysis . . . . . . . . . . . . . . . . . . . 31
2.8.2 Symbolic preimage calculation . . . . . . . . . . . . . . . . . . . . . . . . 31
2.9 Implementation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 32
2.10 Experimental results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 32
2.10.1 Full-scale analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 33
2.10.1.1 Discovery . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 33
2.10.1.2 Patching . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 34
2.10.2 Ground truth analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 35
2.10.2.1 Test case verification . . . . . . . . . . . . . . . . . . . . . . . . 37
2.10.3 Case study . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 38
2.11 Limitations and future work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 40
2.12 Related work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 40
2.13 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 43
Chapter 3: Diamonds: Automatic Discovery and Selective Mitigation of Spectre Vulnerabilities in Binary Executables . . . . . . . . . . . . . . . . . . . . . . . . . . . . 44
3.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 44
3.2 Background . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 46
3.2.1 Spectre vulnerabilities . . . . . . . . . . . . . . . . . . . . . . . . . . . . 46
3.2.2 State-of-the-art mitigation . . . . . . . . . . . . . . . . . . . . . . . . . . 48
3.2.2.1 Clang . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 48
3.2.2.2 Gcc . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 50
3.3 Our approach . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 50
3.3.1 Vulnerability discovery . . . . . . . . . . . . . . . . . . . . . . . . . . . . 50
3.3.1.1 R1: Speculative execution window . . . . . . . . . . . . . . . . 51
3.3.1.2 R2: Input-reachable . . . . . . . . . . . . . . . . . . . . . . . . 52
3.3.1.3 R3: Conditional branch . . . . . . . . . . . . . . . . . . . . . . 52
3.3.1.4 R4: Mappable memory . . . . . . . . . . . . . . . . . . . . . . 53
3.3.2 Vulnerability mitigation . . . . . . . . . . . . . . . . . . . . . . . . . . . . 54
3.4 Implementation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 55
3.5 Evaluation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 56
3.5.1 Setting parameters . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 56
3.5.2 Spectre discovery and mitigation . . . . . . . . . . . . . . . . . . . . . . . 57
3.5.3 Comparison to speculative load hardening . . . . . . . . . . . . . . . . . . 60
3.5.3.1 Identifying superfluous mitigation points . . . . . . . . . . . . . 62
3.5.3.2 Investigating superfluous mitigation . . . . . . . . . . . . . . . . 62
3.5.3.3 Eliminating superfluous mitigation . . . . . . . . . . . . . . . . 64
3.6 Related work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 65
3.7 Threats to validity . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 67
3.7.1 Alias analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 67
3.7.2 Control-flow analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 68
3.8 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 68
Chapter 4: Data Flows in You: Benchmarking and Improving Static Data-flow Analysis on Binary Executables . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 69
4.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 70
4.2 Binary data-flow analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 71
4.2.1 Definitions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 72
4.2.1.1 Degree of data-flow . . . . . . . . . . . . . . . . . . . . . . . . 73
4.2.1.2 Data-flow scope . . . . . . . . . . . . . . . . . . . . . . . . . . 73
4.2.1.3 Data-flow channel . . . . . . . . . . . . . . . . . . . . . . . . . 74
4.2.2 Our scope . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 75
4.2.3 Data-flow analysis implementations . . . . . . . . . . . . . . . . . . . . . 75
4.3 Improving static data-flow analysis . . . . . . . . . . . . . . . . . . . . . . . . . . 76
4.3.1 Data-flow model extensions . . . . . . . . . . . . . . . . . . . . . . . . . 77
4.3.1.1 Handling function calls . . . . . . . . . . . . . . . . . . . . . . 77
4.3.1.2 Field sensitivity . . . . . . . . . . . . . . . . . . . . . . . . . . 78
4.3.2 Data set . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 78
4.3.2.1 Alias classes . . . . . . . . . . . . . . . . . . . . . . . . . . . . 79
4.3.2.2 Microbenchmark test cases . . . . . . . . . . . . . . . . . . . . 79
4.3.2.3 Real-world test cases . . . . . . . . . . . . . . . . . . . . . . . . 83
4.3.3 Evaluation framework . . . . . . . . . . . . . . . . . . . . . . . . . . . . 88
4.4 Implementation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 88
4.4.1 Selected DA approaches . . . . . . . . . . . . . . . . . . . . . . . . . . . 88
4.4.2 Microbenchmark test cases . . . . . . . . . . . . . . . . . . . . . . . . . . 89
4.4.3 Real-world test cases . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 90
4.4.4 Evaluation framework implementation . . . . . . . . . . . . . . . . . . . . 90
4.5 Evaluation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 93
4.5.1 Microbenchmark test cases . . . . . . . . . . . . . . . . . . . . . . . . . . 93
4.5.2 Real-world test cases . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 96
4.5.3 Improving the state of the art . . . . . . . . . . . . . . . . . . . . . . . . . 99
4.6 Future work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 102
4.7 Related work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 103
4.8 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 104
Chapter 5: Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 105
References . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 106
Appendices . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 113
A Vulnerability Diagnosis Example . . . . . . . . . . . . . . . . . . . . . . . . . . . 114
B Hash Transplant Example . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 116
C DJB Hash Variations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 117
D Multilinear Hash . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 119
List of Tables
2.1 The constant-mnemonics pair fingerprints of each known-weak hash algorithm. . . 21
2.2 The candidate, lone, isolated, confirmed and patched hash functions from the full-scale analysis. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 34
2.3 The number of times signature-nonconformity and input-output nonconformity
lead to an isolated hash function not being confirmed. The right-most column
shows when analysis time exceeded 4 hours or an error occurred. The large number
of signature-nonconforming isolated hash functions is caused by too-broad hash
function detection in the optimistic static analysis in the Vulnerability Diagnosis
phase. These false positives are correctly filtered out during symbolic execution. . 35
2.4 The results from running HARM-DOS on the binaries in the ground truth data set.
The MI column shows the number of manually identified hash functions. The Candidate column shows how many manually identified hash functions were identified
as candidate hash functions, i.e. true positive classifications. The Isolated column
shows how many of these remain after filtering out the inlined functions. . . . . . . 36
2.5 The results from manually inspecting the candidate hash functions that are not
isolated hash functions. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 38
3.1 The binary files we analyze for our evaluation with their input functions. . . . . . . 58
3.2 The number of load instructions remaining after enforcing requirements R1-R4.
The load instructions in the last row, meeting all requirements, require mitigation. . 58
3.3 The task we perform with each of our target binaries in order to measure the performance impact. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 60
3.4 Number of times a patch executed while performing the task listed in Table 3.3. . . 60
3.5 We measure the performance impact by comparing the execution times when performing the task in Table 3.3. We compute the percentage increase in execution
time over the original binary. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 61
3.6 We show the ratio of binary modification performance penalty to mitigation performance penalty by computing the execution time of BN as a percentage of BM. . . 61
3.7 The number of SLH load instructions that meet a requirement R1-R4. . . . . . . . 65
3.8 The number of SLH mitigation instructions executed while performing the task in
Table 3.3. We measure this for the default SLH binary (BSLH), as well as the binary
with SLH eliminated by DIAMONDS (BSLH−D). We show what fraction of all executed instructions are mitigation instructions, and we show the percentage decrease
from BSLH to BSLH−D. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 66
3.9 The performance benefit of eliminating superfluous SLH instructions as indicated
by DIAMONDS. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 66
4.1 The undefined registers corresponding to each pointer origin. . . . . . . . . . . . . 87
4.2 The selected target functions for each real-world binary with the number of identified data-flows per alias class. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 91
4.3 Performance of selected DA approaches on the microbenchmark fully-specified
test cases . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 94
4.4 Performance of selected DA approaches on the microbenchmark underspecified
test cases . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 95
4.5 Extraction time of the data-flow graph for the target function in each real-world
binary. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 96
4.6 The number of data-flows discovered dynamically per alias class and the number
of these discovered statically. Additionally, we show the number of data-flows
discovered statically only. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 98
4.7 Performance of selected DA approaches over all target real-world binaries. . . . . . 98
4.8 The change (in boldface) introduced by C1 and C2 in how angr reports data-flows
interrupted by a callee function. . . . . . . . . . . . . . . . . . . . . . . . . . . . 99
4.9 The change (in boldface) introduced by F with respect to how angr reports data-flows between pointers transformed by equal or distinct offsets. . . . . . . . . . . 101
4.10 The concrete improvement gained by extending angr with our model extensions
C1, C2 and F. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 102
List of Figures
2.1 A flowchart of the approach overview. . . . . . . . . . . . . . . . . . . . . . . . . 15
2.2 The symbolic starting state that we use for symbolic execution while determining
if a function implements the (a) buffer-length and (b) buffer-only signature. . . . . 24
2.3 Parsing time for both vulnerable and patched Snudown for both malicious and
random reference labels. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 39
4.1 The pointer is defined with a selected pointer origin, data type, size and length. . . 81
4.2 We either use the same pointer, or two distinct pointers, for the write operation and
read operation. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 81
4.3 The write pointer and read pointer may be transformed by adding an offset. . . . . 81
4.4 The write operation and read operation may be interrupted by a call to a callee
function. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 82
4.5 We modify the intra-procedural data-flow graph to reconnect registers across save-restore edges. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 86
4.6 We modify the intra-procedural data-flow graph to eliminate data-flows into register clearing instructions. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 87
5.1 A CFG showing the basic blocks of an implementation of the SDBM hash algorithm. Solid nodes and edges indicate the template-match. . . . . . . . . . . . . . 115
5.2 An illustration of how a hash function is replaced, by overwriting the original and
adding instructions to code caves. Every code cave (on the left) receives a number
of instructions (on the right) of the patch, shown in Listing 5.3. The grayed-out
instructions show the last instruction of the function before the padding bytes (i.e.
the code cave) starts. Every code cave ends with a jump instruction to the code
cave housing the next patch instructions. . . . . . . . . . . . . . . . . . . . . . . . 118
Abstract
With the ever-increasing reliance of the modern world on software systems, the frequency and
impact of cyber attacks have greatly increased as well. Such cyber attacks often operate by exploiting vulnerabilities in software, i.e. flaws that enable the software to act in an unsafe manner.
Software must be analyzed thoroughly to evaluate its security, as vulnerabilities in software can
have devastating consequences, such as compromised privacy of users, shutdown of infrastructure,
and significant business losses, and can even pose a threat to human life. Unfortunately, manual analysis of
the source code is insufficient to evaluate the security of software. First, the quantity and size of
modern software make manual analysis impossible. Second, low-level vulnerabilities may exist
in binary code that cannot be observed in the source code (the “what you see is not what you execute”
principle). Third, in some cases of proprietary or legacy code, the source code may not be available
to the analyst. Binary program analysis is a research field that focuses on ameliorating these issues by automatically analyzing the machine code instructions of executables to reason about their
security-related properties. In this dissertation we enhance automatic software security evaluation
by leveraging and extending binary program analysis. We develop approaches to 1) automatically
discover specific vulnerabilities in binary code and to 2) automatically and safely patch vulnerabilities in binary code. We further improve the reliability of a fundamental binary analysis technique,
known as data-flow analysis, by 3) evaluating three state-of-the-art binary analysis frameworks
with regard to the accuracy of their data-flow analysis, and 4) doubling the accuracy (in terms of
F1 score) of data-flow analysis in the angr open-source binary analysis framework by fine-tuning
its approximations to reflect real-world scenarios more accurately.
In this dissertation, we present HARM-DOS and DIAMONDS, which combine the goals of automatic vulnerability discovery and non-disruptive patching. In HARM-DOS, we define an approach
to automatically discover and mitigate hash-collision denial-of-service vulnerabilities in binary
code, by replacing the weak hash algorithm – the cause of the vulnerability – with a secure alternative. We evaluate our prototype on a large data set of 105,831 real-world binaries, identify 796
confirmed weak hash functions, and successfully replace 759 (95%) of these in a non-disruptive
manner. In DIAMONDS, we discover and selectively mitigate Spectre vulnerabilities in binary
code. We define four requirements a machine code instruction must fulfill in order to be vulnerable to Spectre. For every instruction that fulfills all four of these requirements, we incorporate
a patch that temporarily disables speculative execution, mitigating the vulnerability. We compare
DIAMONDS against the state of the art in Spectre mitigation: speculative load hardening (SLH) in
Clang. This comparison shows that up to 96% of mitigating instructions introduced by SLH are
superfluous: they are applied to instructions that are not vulnerable, thus slowing down execution
unnecessarily. By performing selective mitigation, i.e. only applying patches where necessary, we
reduce execution time by up to 20% compared to SLH.
In FLOW-METER, we focus on an understudied, fundamental technique in binary program
analysis: data-flow analysis. We introduce a data set designed carefully to evaluate data-flow analysis implementations and understand in which situations they may miss or erroneously report the presence of data flows. Using this data set, we evaluate three popular binary program analysis engines:
angr, Ghidra and Miasm, and discuss our insights. We further propose three model extensions to
static data-flow analysis that improve accuracy, and implement them in angr. These extensions
offer almost perfect recall (0.99) and increase precision from 0.13 to 0.32 on our benchmark set.
Each of our contributions – HARM-DOS, DIAMONDS and FLOW-METER – independently pushes
the boundary of research in binary program analysis. Therefore, these approaches further what
is possible with automatic software security evaluation. Ultimately, our research contributes to
establishing a more secure digital environment.
Chapter 1
Introduction
With the ever-increasing reliance of the modern world on technology, cyber attacks have become
commonplace. Cyber attacks are often executed by exploiting vulnerabilities present in computer
programs. These vulnerabilities are flaws in a program that, when exercised by the attacker, lead
the program to behave in an unsafe or unstable way. We are specifically focused on vulnerabilities
that, when exercised, violate a security policy, in terms of confidentiality, integrity or availability.
For example, they may allow the attacker unauthorized access to the host on which the vulnerable
program is running. Attackers exploit vulnerabilities for various malicious purposes, such as to
leak or corrupt sensitive data, or to disrupt service. This makes software vulnerabilities the center
of a cat-and-mouse game: defenders are struggling to find and mitigate each vulnerability before
it is found and exploited by attackers. In the modern world, with its pervasive dependence on
software, this cat-and-mouse game has reached critical levels and much is at stake, ranging from
loss of productivity to loss of lives. Software must be analyzed thoroughly to evaluate its security,
as vulnerabilities in software can have devastating consequences, such as compromised privacy
of users, shutdown of infrastructure, and significant business losses, and can even pose a threat to human
life. Unfortunately, manual analysis of the source code is insufficient to evaluate the security of
the corresponding software. This is firstly due to the quantity and size of modern software making
manual analysis impossible. Secondly, there are low-level vulnerabilities that are invisible in the
source code, a phenomenon known as the “what you see is not what you execute” (WYSINWYX) principle [1]. Thirdly, many devices today run proprietary or legacy software, for which source code is
unavailable. Binary program analysis aims to ameliorate these issues by automatically analyzing
the machine code instructions of executables to reason about their security-related properties.
Program analysis as a research field models the behavior of software in order to automatically
uncover properties of the program under analysis related to, e.g. its performance or security. Binary program analysis is a specialization of this research field, focusing on analyzing the machine
code instructions of software, rather than the source code. This approach is motivated by the
What You See is not What You Execute principle [1], which expresses the notion that there is a
distinction between the behavior expressed by the source code and the behavior when physically
executed. This is important for software security, as there may be vulnerabilities that are invisible
in the source code, or even inserted during or after compilation [2]. Given that the machine code
represents low-level instructions of the CPU, this can help build a better model of a program’s behavior. Unfortunately, the increased granularity gained by analyzing machine code comes at the
cost of an extra layer of complexity, because compilation loses high-level semantic information,
such as control-flow structures, data types and data structures. The challenge of binary program
analysis is to reason about software properties in spite of this loss of information.
It is well known that program analysis is unsolvable in the general case. Formally, determining whether programs have any non-trivial semantic property is undecidable, as proved
by Rice [3]. Binary program analysis also inherits this complexity. Fortunately, a wide variety
of approximation algorithms have been designed and implemented for fundamental binary program analysis techniques, such as disassembly [4, 5], control-flow analysis [6, 7] and data-flow
analysis [8]. These algorithms employ approximations (assumptions and heuristics) to achieve
guaranteed termination at the cost of analysis accuracy. These algorithms perform fundamental
techniques that are leveraged by many specialized models for goals such as vulnerability discovery. This means the impact on analysis accuracy incurred through, e.g., data-flow analysis’s
approximations will affect the performance of the downstream models. Similarly, any improvement to the approximation algorithms for fundamental binary program analysis techniques will
carry over to the downstream models. Therefore, it is critical to understand and quantify the employed approximations and their impact on analysis accuracy. Unfortunately, the implementations
of these algorithms – and their approximations – are often insufficiently documented, if at all. This
makes evaluating the accuracy of fundamental binary program analysis techniques an active area
of research, with studies focusing on disassembly [9, 10], control-flow analysis [11] and data-flow
analysis [12]. Measuring the accuracy of an approximation algorithm for an undecidable problem
is challenging, due to the difficulty of obtaining ground truth. Two approaches that are employed
in binary program analysis are to construct ground truth by synthesizing artificial test cases [12] and
to extract partial ground truth by employing dynamic analysis [13].
To aid in binary program analysis research, a number of frameworks have been created, such
as angr [14], Ghidra [15] and Miasm [16], each implementing approximation algorithms for many
fundamental binary program analysis techniques. Leveraging binary analysis frameworks empowers researchers to evaluate the security properties of software. In this regard, automatic
vulnerability discovery is an active area of research [17–19]. A common approach in this area
is to model the unsafe behavior associated with a particular vulnerability class. For example, we
can model a vulnerability allowing an attacker to execute unauthorized commands as a data-flow
between an untrusted input source and a system call [17]. In general, modeling vulnerabilities
is challenging, because different occurrences of the same vulnerability class may vary widely in
terms of their manifestation in machine code. The challenge is to define a discovery model that
is capable of identifying as many different vulnerabilities of the same class as possible, while simultaneously minimizing the number of false alarms. Therefore, it is crucial to research different
vulnerability classes to identify features that may be employed in a discovery model.
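To make this pattern concrete, the following hypothetical C fragment (an illustration, not code from any analyzed program) exhibits the unsafe behavior such a discovery model targets: a data-flow from an untrusted input source to a security-sensitive sink.

#include <stdio.h>
#include <stdlib.h>

/* Hypothetical vulnerable handler: attacker-controlled bytes flow from a
 * network stream (the source) into system() (the sink) unsanitized. */
void handle_request(FILE *sock) {
    char cmd[128];
    if (fgets(cmd, sizeof(cmd), sock) == NULL)   /* source: untrusted input */
        return;
    system(cmd);                                 /* sink: command execution */
}

A discovery model for this vulnerability class flags the function because an uninterrupted data-flow connects the source to the sink.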
After a vulnerability has been discovered, the next course of action is to identify and apply an
appropriate mitigation to this vulnerability, called a patch. An ideal patch should completely prevent the malicious behavior enabled by a vulnerability, while not changing the rest of the program’s
behavior. Automating this process is an active area of research [20–22]. One of the challenges in
this area is understanding the context in which the vulnerable code is executing. An automated
patch system that fails to understand the code it is changing may change the behavior of the program in a way that breaks functionality.
In binary program analysis, automated patching is often achieved by modifying the machine
code instructions of the program under analysis directly, rather than in the source code, referred
to as binary modification. This is a very useful area of research for software security, because
it allows for the mitigation of vulnerabilities in legacy software, for which the source code is no
longer be maintained, or available. Unfortunately, binary modification is complicated by the fact
that many machine code instructions are sensitive to their relative offset from other instructions and
data. Naively inserting patch instructions will break these expected offsets, introducing defects into
the program. Therefore, performing binary modification involves solving a number of additional
challenges to ensure that the patches added to the vulnerable binary do not inadvertently break the
executable, referred to as non-disruptive patching. Performing non-disruptive patching through
binary modification is an active area of research as well [23–25].
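As a concrete illustration of this offset sensitivity, consider a well-known encoding fact: an x86-64 near call or jump stores a signed 32-bit displacement measured from the end of the branch instruction. The sketch below (hypothetical helper name, not from any patching tool) shows the quantity a patcher must recompute.

#include <stdint.h>

/* Displacement an x86-64 near branch must store to reach target_addr. */
int32_t rel32(uint64_t branch_addr, uint64_t branch_len, uint64_t target_addr) {
    return (int32_t)(target_addr - (branch_addr + branch_len));
}
/* If a patch inserts n bytes between the branch and its target, the stored
 * displacement becomes wrong by n; non-disruptive binary modification must
 * find and rewrite every such displacement, or avoid moving code at all. */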
In conclusion, we have identified four key areas of binary program analysis that can be used
to improve software security: evaluating fundamental binary program analysis techniques, improving these fundamental techniques, building specialized models for vulnerability discovery,
and enabling non-disruptive patching. Next, we explain how this dissertation focuses on improving
these key areas.
1.1 Thesis statement
Automated software security evaluation can be improved through: 1) automatic, vulnerability-class-specific approaches to vulnerability discovery, 2) automatic, vulnerability-class-specific approaches to non-disruptive patching, 3) better evaluation of and improvements to fundamental
binary program analysis techniques.
1.2 Demonstrating the thesis statement
In this dissertation we enhance automatic software security evaluation by contributing to and improving the research area of binary program analysis. In HARM-DOS and DIAMONDS, we combine the goals of automatic vulnerability discovery and non-disruptive patching. We combine fast
but imprecise static analysis with slower but more accurate symbolic execution, and create models that
are capable of discovering many different occurrences of the target vulnerability class while also
minimizing false alarms. In HARM-DOS, we study a vulnerability class known as hash-collision
denial-of-service vulnerabilities. We show how commonalities between different weak hash functions – the cause of the vulnerability – can be leveraged to discover many different occurrences
of such vulnerabilities. We minimize false alarms by confirming the unique input-to-output relationship associated with each hash function. The vulnerability discovery results of HARM-DOS
are subsequently used for non-disruptive patching. We exploit the commonalities of these
vulnerabilities to automatically construct an appropriate replacement – a secure hash function. We
leverage binary modification to overwrite the weak hash function with its replacement, mitigating
the vulnerability. We show that our modifications to the binary change only the hash function, guaranteeing that the changes are non-disruptive. We evaluate our prototype on a large data set of 105,831
real-world binaries, identify 796 confirmed weak hash functions, and successfully replace 759 of
these (95%).
In DIAMONDS, we discover and selectively mitigate Spectre vulnerabilities. These vulnerabilities rely on out-of-bounds memory reads that occur exclusively in speculative execution. We
define four requirements a machine code instruction that reads from memory must fulfill in order
to be vulnerable to Spectre. For every instruction that fulfills all four of these requirements, we
incorporate a patch that temporarily disables speculative execution, mitigating the vulnerability.
As speculative execution does not affect the semantics of a program, we can guarantee the changes
are non-disruptive. We compare DIAMONDS against the state of the art in Spectre mitigation –
speculative load hardening (SLH) in Clang. This comparison shows that up to 96% of mitigating
instructions introduced by SLH are applied to instructions that are not vulnerable, introducing unnecessary performance cost. By performing selective mitigation, i.e. only applying patches where
necessary, we reduce execution time by up to 20% compared to SLH.
In FLOW-METER, we evaluate and improve the state of the art in binary program analysis, focusing on the understudied, fundamental technique of data-flow analysis. We introduce a data set
of 215,072 microbenchmark test cases, mapping to 277,072 binary executables, created specifically to test the accuracy of data-flow analysis implementations. Additionally, we augment the data set
with 6 real-world executables. Using our data set, we evaluate three state-of-the-art data-flow analysis implementations, in angr, Ghidra and Miasm, and find that they all exhibit poor performance,
with low F1 scores, and scalability issues with respect to memory usage. We further propose three
model extensions to static data-flow analysis that greatly improve accuracy. We implement these
extensions in angr and show that they improve recall from 0.39 to 0.99 and improve precision
from 0.13 to 0.32. Due to the importance of data-flow analysis in software security evaluation (e.g.
for vulnerability discovery), these contributions to improve data-flow analysis will carry over to
improve downstream models.
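For reference, with the standard definition F1 = 2PR/(P + R), these measurements correspond to an F1 score rising from 2(0.13)(0.39)/(0.13 + 0.39) ≈ 0.20 to 2(0.32)(0.99)/(0.32 + 0.99) ≈ 0.48, which is the (slightly more than) doubling of the F1 score claimed in the abstract.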
1.3 Structure of the dissertation
This dissertation is organized around our main research contributions in enhancing software security evaluation via leveraging and improving binary program analysis. In Chapter 2, we introduce
HARM-DOS and discuss our approach towards automatically discovering hash-collision denial-of-service vulnerabilities. We use the discovery results to automatically mitigate the vulnerability
with guarantees of non-disruption. In Chapter 3, we introduce DIAMONDS as our approach to automatically discover and mitigate Spectre vulnerabilities. In Chapter 4, we discuss our contribution
towards evaluating and improving static data-flow analysis. We conclude in Chapter 5.
Chapter 2
Harm-DoS: Hash Algorithm Replacement for Mitigating
Denial-of-Service Vulnerabilities in Binary Executables
Programs and services relying on weak hash algorithms as part of their hash table implementations
are vulnerable to hash-collision denial-of-service attacks. In the context of such an attack, the
attacker sends a series of program inputs leading to hash collisions. In the best case, this slows
down the execution and processing for all requests, and in the worst case it renders the program
or service unavailable. In this chapter, we propose HARM-DOS, a new binary program analysis
approach to automatically detect weak hash functions and patch vulnerable binary programs, by
replacing the weak hash function with a secure alternative. To verify that our mitigation strategy
does not break program functionality, we design and leverage multiple stages of static analysis and
symbolic execution, which demonstrate that the patched code performs equivalently to the original
code, but does not suffer from the same vulnerability. We analyze 105,831 real-world programs
and confirm the use of 796 weak hash functions in the same number of programs. We successfully
replace 759 of these in a non-disruptive manner. The entire process is automated. Among the
real-world programs analyzed, we discovered, disclosed and mitigated a zero-day hash-collision
vulnerability in Reddit.
2.1 Introduction
Denial-of-service (DoS) attacks can cause significant losses to businesses by slowing down the processing of client requests or by making a service unavailable. There are many types of DoS attacks. In this chapter, we focus on a hash-collision DoS attack, a type of algorithmic complexity
attack [26], which exploits a vulnerability in weak hash table implementations in order to disrupt
the availability of a target service or program.
Hash tables are data structures, ubiquitous for their fast, constant-time insertion and lookup
operations. Because speed is at stake, hash table implementations typically use simple hash algorithms, which unfortunately also have low collision resistance. We denote these as weak hash
algorithms. Attackers can easily generate inputs that will create collisions in hash tables that use
weak hash algorithms. During a hash-collision DoS attack, the attacker crafts a large number of
malicious inputs that are all inserted at the same table index, which drastically increases both the
lookup and the insertion time. On each insertion and retrieval at the affected index, the hash table
now has to iterate over a large list of colliding entries. This makes the operation time effectively
linear in the number of entries. Likewise, inserting a number of these colliding entries requires
polynomial time, allowing an attacker to greatly increase the computational load.
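The following minimal sketch of a separate-chaining hash table (a hypothetical illustration, not code from any analyzed program) shows where the degradation occurs: with n colliding keys, the duplicate-key scan walks one ever-growing chain, so n insertions cost O(n^2) and each lookup costs O(n) instead of O(1).

#include <stdlib.h>
#include <string.h>

struct entry { char *key; struct entry *next; };
#define NBUCKETS 1024
static struct entry *table[NBUCKETS];

void insert(char *key, unsigned int (*hash)(const char *)) {
    unsigned int idx = hash(key) % NBUCKETS;
    /* If every key hashes to the same idx, this scan touches every
     * previously inserted entry: the attacker-induced linear cost. */
    for (struct entry *e = table[idx]; e != NULL; e = e->next)
        if (strcmp(e->key, key) == 0)
            return;
    struct entry *e = malloc(sizeof(*e));
    if (e == NULL)
        return;
    e->key = key;
    e->next = table[idx];
    table[idx] = e;
}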
Hash-collision vulnerabilities have been discovered in the hash table implementation of the
programming languages PHP [27], Python [28], and Java [29], and have affected all programs
written in the vulnerable versions of these languages. This is especially critical when a remote
attacker has control over hash table entries, as was the case with a PHP web server [27]. In
addition, we have identified a remotely-exploitable zero-day hash-collision vulnerability, discussed
in Section 2.10.3, using HARM-DOS. It is evident that the real-world impact of such vulnerabilities
is serious and has therefore attracted the attention of the security community in recent years.
Because algorithmic complexity vulnerabilities are a serious threat to security, there has been
work in detecting these vulnerabilities automatically via static analysis, as presented by Kirrage [30] and Chang [31], as well as via fuzzing, presented by Petsios [32] and Blair [33]. Previous
approaches based on static analysis do not focus on vulnerabilities caused by hash collisions and
thus are likely to be less accurate in detecting them than HARM-DOS. Fuzzing, on the other hand,
requires an extensive run time to find malicious inputs, which could be very sparse for hash functions. Thus fuzzing is unsuitable for detecting hash-collision vulnerabilities at scale.
There has also been work on mitigating algorithmic complexity vulnerabilities by closing network
connections that exploit a vulnerability, as presented by Meng [34]. This approach is useful, but
ideally, we would like to patch the vulnerable code to fully remove the vulnerability.
With the prevalence of proprietary, third-party software libraries, it is essential to conduct vulnerability analysis on binary code. Unfortunately, the problem of detecting and patching hash-collision vulnerabilities in executable programs, without relying on source code, has received little
to no attention. We propose HARM-DOS, a novel approach to fill this gap by detecting and replacing weak hash functions automatically, at the binary level. In spite of the inherent complexity of
working with binary code, HARM-DOS surgically analyzes the program to diagnose hash-collision
vulnerabilities and perform a hash transplant – replacing the weak hash algorithm with a secure alternative. Similar to a medical organ transplant, the entire process must be conducted with utmost
precision. We introduce hash-collision vulnerability diagnosis, a novel static analysis inspired
by past research in detecting cryptographic hash functions by Lestringant and Gröbert, respectively [35, 36], and adapted to detect weak hash functions at scale. After diagnosis, HARM-DOS
conducts a thorough pre-patch examination, a novel use of symbolic execution, to ensure the patch
can be performed safely, without introducing critical errors (like accessing memory out of bounds).
Next, HARM-DOS performs the hash transplant by leveraging static binary rewriting to replace the
weak hash function with an appropriate secure alternative, crafted with the insights gained from the
pre-patch examination. Finally, HARM-DOS conducts a post-patch examination, a second phase of
symbolic execution to confirm that the replacement was successful and no errors were introduced.
Since the weak hash function is removed from the patched program, the program is now resilient
against hash-collision DoS attacks. HARM-DOS does not rely on source code or debug symbols,
and simply requires the binary image of an executable program as input. The entire approach is
automated.
To the best of our knowledge, our approach is the first to propose an automated solution to
patch vulnerable hash algorithms at the binary level. We make the following contributions:
• We introduce the concepts of hash-collision vulnerability diagnosis and hash transplant,
new approaches to automatically detect and replace weak hash algorithms in binary code.
• In the new concepts of pre-patch and post-patch examination, we leverage static analysis and
symbolic execution in a novel way, along with insights tailored for non-disruptive patching,
to preserve the original program semantics through verification steps.
• We implement a prototype of the proposed analysis, which is available as open source at
https://github.com/usc-isi-bass/hashdos_vulnerability_detection.
• We evaluate our approach on 105,831 binaries from the AllStar data set [37]. HARM-DOS
confirms the use of 796 weak hash functions, in the same number of programs, and successfully replaces 759 (95%) of these with a secure alternative in a non-disruptive manner.
2.2 Scope
Our work focuses on the detection and non-disruptive patching of weak hash algorithms, by replacing them with a secure alternative.
2.2.1 Fast hash algorithms
Many programs implement hash tables to store and process user input and internal data. Programs
use hash functions to calculate an index for the hash table. While there are well-known cryptographic hash algorithms, which are collision resistant (e.g. MD5 and SHA256), these typically
impose a significant performance penalty which make them unsuitable to be used in hash table
implementations. Instead, programs use simpler hash algorithms, to achieve high performance,
which we refer to as fast hash algorithms.
Due to the intricacies of developing a good hash algorithm, developers often reuse existing
source-code implementations of well-known fast hash algorithms, with good average-case performance. In order to be reusable, the algorithm is often implemented in a single function, a hash
function, that is context-agnostic, i.e. does not rely on any program specific assumptions. This
means, hash functions are usually implemented without side effects, i.e. they do not modify the
program state beyond the context of the function. This is beneficial to HARM-DOS because it
means it is possible to replace one such algorithm with another, without introducing unwanted side
effects into the program.
Moreover, to be context-agnostic, many fast hash functions share common features. HARM-DOS relies on these features to identify fast hash functions in binary code. To receive input in a
context-agnostic way, a hash function often receives a variable-length input buffer as a simple byte
array, passed as function input parameter. We have observed that, in practice, many hash functions
implement one of two signatures. We refer to these signatures as the buffer-length signature (the
function receives a pointer to the byte array and the buffer’s length),
unsigned int hash(const char* str, unsigned int length);
and the buffer-only signature (the function receives a pointer to the null-terminated byte array),
unsigned int hash(const char* str);.
The hash function iterates over the bytes of the input array to compute a small integer (fitting
within the bit-width of the architecture) – the hash value. At the start of a hash function, the hash
value is initialized, often to a constant unique to the algorithm. On each iteration, the hash value
is updated according to the definition of the algorithm. The final hash value is returned from the
function. We show an example of such a hash function in Listing 5.4, in Appendix C.
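For convenience, here is a representative sketch in the same spirit as Listing 5.4: the well-known DJB hash algorithm implemented with the buffer-length signature shown above (5381 and 33 are the algorithm's published constants).

unsigned int hash(const char* str, unsigned int length) {
    unsigned int hash_value = 5381;       /* algorithm-specific initializer */
    for (unsigned int i = 0; i < length; i++)
        hash_value = hash_value * 33 + (unsigned char)str[i]; /* per-byte update */
    return hash_value;                    /* final hash value */
}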
During compilation, the context-agnostic nature of hash functions is often disrupted by an
optimization feature called function inlining. Here the body of the hash function is placed directly
in the body of the context-specific caller function. Patching such functions is out of scope for
HARM-DOS. We discuss this decision in Sections 2.3 and 2.11.
2.2.2 Hash-collision vulnerabilities
The high performance guarantees of hash tables rely on a hash function that distributes the entries
evenly over the table. An algorithm that is too simple can make it very easy for an attacker familiar
with it to calculate colliding inputs. We refer to these as weak hash algorithms. When the attacker
serves such inputs to an online program, they will degrade performance of the hash table from
constant to linear time, and lead to denial of service.
A program can also use an internal secret within a fast hash algorithm to achieve collision
resistance against an attacker, as discussed by Aumasson [38] and Alakuijala [39], while
maintaining computational efficiency. We refer to these as secure hash algorithms, which we use
to replace the weak hash algorithm, when mitigating hash-collision vulnerabilities.
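The following sketch illustrates the seeding idea only; it is a hypothetical illustration, is not itself collision resistant, and is not the construction of [38, 39]. Production code should use a vetted keyed hash such as SipHash [38].

#include <stdint.h>

static uint32_t secret_seed;  /* drawn once from a CSPRNG at program startup */

uint32_t seeded_hash(const char *str, unsigned int length) {
    uint32_t h = secret_seed ^ 5381u;   /* attacker no longer knows the initial state */
    for (unsigned int i = 0; i < length; i++)
        h = h * 33u + (unsigned char)str[i];
    return h;
}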
2.2.3 Attacker model
We assume that a remote attacker knows which hash algorithm is used by the target program
(i.e., they can obtain an official version of the program). We also assume the remote attacker can
observe inputs and outputs of the target program (e.g., by interacting through a network socket)
but cannot observe intermediate computations or the internal state of the program (which would
require local access with debugging capabilities). This assumption is reasonable since programs
generally use hashes internally, e.g., as an index into a hash table, and do not disclose these values
to the user. Finally, we consider an attacker obtaining information about either the hash value, or
internal program state via a side-channel attack to be out of scope.
2.3 Challenges and requirements
The first step in mitigating hash-collision vulnerabilities automatically is to detect them. This is
challenging, since a program could use any arbitrary hash algorithm. As with any program analysis
approach, there are strict theoretical limits to what can be determined regarding the behavior of
the target binary. Therefore, identifying any possible weak hash algorithm is infeasible. Instead,
HARM-DOS focuses on detecting implementations of several known-weak hash algorithms, which
we call weak hash functions.
We aim to detect weak hash functions in stripped binary executables, meaning we cannot rely
on human-friendly artifacts of source code (e.g., function and variables’ names and comments) in
order to derive information regarding the purpose of different pieces of code. To overcome this
challenge, we instead rely on the computations and control flow inherent to a list of known-weak
hash algorithms to detect the vulnerability. This is explained in Section 2.5.
If a hash function is inlined, the boundaries between it and its caller function are blurred.
Identifying the function boundaries of inlined functions is a research problem orthogonal to the
focus of HARM-DOS, and is addressed by Bao [40]. HARM-DOS detects inlined hash functions, but we
leave patching them for future work (Section 2.11).
After detecting the vulnerability, the next challenge is to mitigate it automatically. Patching
hash functions at the source code level usually involves replacing the weak hash function (the
original hash function) with a secure alternative (the replacement hash function). Indeed, this
was the approach taken in mitigating the hash-collision vulnerability in Python [28, 41] as well as
Perl [42]. We reproduce this process in binary code, but encounter several challenges because we
must modify the binary code, while preserving its correctness.
In order to address the outlined challenges effectively, HARM-DOS must fulfill the following
vulnerability detection requirements (DR) as well as mitigation requirements (MR).
DR1: It is important to modify only the binary code instructions that form part of an implementation of a weak hash function. It is therefore critical that HARM-DOS identifies weak hash
functions correctly. We require that HARM-DOS has zero false positives among the successfully
patched functions. In Sections 2.5 and 2.6.2.2 we discuss our strategy for detecting and confirming
the detection of weak hash functions.
DR2: Inlined hash functions in a given executable should either all be patched or none should be
patched. Replacing only some of these would result in a hash table using different hash algorithms
in different scenarios and, in turn, would break the functionality. We address the unique challenges presented by inlined hash functions partially, by identifying the presence of inlined hash
algorithms and leaving these unpatched. Therefore, we require HARM-DOS to be able to identify
inlined functions, as discussed in Section 2.6.1.
If HARM-DOS replaces a weak hash function, we need to preserve correctness with regard to
how inputs are processed, the range of the outputs produced, and the functionality of the rest of the
program code. To achieve this HARM-DOS must fulfill the following mitigation requirements.
MR1: HARM-DOS must only replace the binary code responsible for implementing the weak hash
algorithm. Replacing any other instructions will introduce defects into the program. We discuss
our strategy for replacing the hash function in Section 2.7.2.
MR2: To ensure HARM-DOS does not affect the program behavior in unintended ways, we require the replacement hash function to restore the program to its prior state after completion, apart
from the hash value. We make an exception here for the stack memory, local to the replacement
hash function, since this memory is discarded once the function returns. We discuss this in Section 2.7.1.
MR3: HARM-DOS must be able to detect side effects in the original hash function. We identify
two types of such side effects, namely when the hash function writes to global memory, or when it
writes to a memory address passed via input parameter. Replacing the hash function while omitting
these side effects will affect the program behavior in an unknowable way. Therefore, HARM-DOS
must detect these side effects and not replace the weak hash function. The method for achieving
this is discussed in Section 2.6.2.4.
MR4: In order to avoid illegal memory accesses, the replacement hash function must not access
any memory that is not accessed by the original. We make an exception for the stack memory of
the replacement hash function. We discuss this in Section 2.8.1.
MR5: The replacement hash function must return hash values that are no greater than the maximum of the original hash function. As the hash value is often used to calculate an index in a hash
table, yielding larger hash values than expected may lead to memory access errors. Our approach
to address this problem is discussed in Sections 2.6.2.2 and 2.8.2.
Figure 2.1: A flowchart of the approach overview. A binary enters Analysis Preparation (disassembly and control-flow recovery), which yields its binary functions. Vulnerability Diagnosis (hash function template matching and constant-mnemonic pair discovery) produces candidate hash functions with candidate algorithms; Pre-patch Examination (static inlined-function filters and symbolic analysis of hash function behavior) produces conforming hash functions with replacement requirements; Hash Transplant (hash algorithm replacement) produces a modified binary with a patched hash function; and Post-patch Examination (symbolic patch verification) produces the patched binary.
MR6: The replacement hash function must consume its input in the same way as the original hash
function. This is discussed in Section 2.6.2.1.
MR7: The replacement hash function must match the original in terms of case sensitivity. A
case-insensitive hash function yields equal hash values for input buffers that differ only in case,
e.g. buffers abc and AbC. Using a case-sensitive hash function to replace a case-insensitive hash
function will cause incorrect lookups in the hash table. Conversely, using a case-insensitive hash
function to replace a case-sensitive hash function makes exploitation trivial, even for secure hash
functions. We discuss the solution to this requirement in Section 2.6.2.3; a sketch of a case-insensitive hash function follows this list of requirements.
MR8: The replacement hash function introduced by HARM-DOS must be secure, i.e., collision
resistant with respect to our definition in Section 2.2.2 and our attacker model. We discuss such
replacement hash functions in Section 2.7.1.1.
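To make the case-sensitivity property of MR7 concrete, the following hypothetical sketch shows a case-insensitive variant of the fast hash functions described in Section 2.2.1: folding each byte before the update makes abc and AbC hash identically.

#include <ctype.h>

unsigned int hash_nocase(const char *str, unsigned int length) {
    unsigned int h = 5381;
    for (unsigned int i = 0; i < length; i++)
        h = h * 33 + (unsigned char)tolower((unsigned char)str[i]); /* case folding */
    return h;  /* "abc" and "AbC" now yield the same hash value */
}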
2.4 Approach overview
HARM-DOS is divided into five phases, shown in Figure 2.1, namely Analysis Preparation, Vulnerability Diagnosis, Pre-patch Examination, Hash Transplant, and Post-patch Examination. HARM-DOS receives as input a binary executable file (the target binary), as well as a set of detection models,
where each detection model contains the features necessary to detect a specific weak hash algorithm. In preparation for analysis, HARM-DOS disassembles the target binary, recovers control
flow and identifies the function boundaries therein.
In the Vulnerability Diagnosis phase, HARM-DOS analyzes functions in the target binary. We
refer to the function under analysis as the target function. In the target function, HARM-DOS detects the presence of a weak hash algorithm towards fulfilling DR1 and DR2. This phase leverages
static analysis of both the instructions and control-flow of the binary code of the target function.
First, the target function is matched against a hash function template, designed to detect the presence of patchable hash functions, with the features described in Section 2.2.1. If a template-match
is found, we compare the function to each detection model created for a known-weak hash algorithm. This light-weight static analysis allows us to pinpoint candidate hash functions quickly, in
linear execution time in the size of the target binary. HARM-DOS will analyze these candidate
hash functions with a more accurate, but more expensive, analysis using symbolic execution.
With the set of candidate hash functions, HARM-DOS proceeds to determine those functions that can be patched, while preserving correctness, in the Pre-patch Examination phase. HARM-DOS employs static analysis to determine if the hash algorithm is isolated in a function, away from other functionality, for DR2 and MR1. HARM-DOS also employs symbolic execution to build a profile of the behavior of each candidate hash function, with regard to signature (MR6), input-output relationships (DR1, MR5), case sensitivity (MR7), and memory accesses (MR3). HARM-DOS preserves this profile during patching. The input-output relationships are of particular importance, as these enable HARM-DOS to confirm that the candidate hash function has indeed been identified correctly. We refer to such candidate hash functions as confirmed hash functions.
For the confirmed hash functions for which HARM-DOS can preserve correctness, HARM-DOS performs the Hash Transplant to mitigate the vulnerability, replacing the hash algorithm with a secure alternative by modifying the binary executable, while fulfilling MR1, MR2, and MR8. The patched executable is passed to the final phase of HARM-DOS, Post-patch Examination. In this phase, the memory accesses and output values of the replacement hash function are monitored to confirm that correctness has indeed been preserved, fulfilling MR4 and MR5. If this verification passes, we consider the patch a success; otherwise, the patch is discarded and an error is reported.
2.5 Vulnerability diagnosis
In this section, we provide more detail on how we discover candidate hash functions in target binaries, for DR1 and DR2. While there is significant past research in code similarity and in detecting cryptographic algorithms, such as done by Lestringant [35], Gröbert [36], Farhadi [43], and Bruschi [44, 45], the focus of these approaches often lies in detecting a specific implementation of an algorithm. HARM-DOS, on the other hand, strikes a balance between being implementation-agnostic and identifying those weak hash functions that can be patched. This is achieved with a novel approach that is simple, yet effective. This approach uses static analysis to determine how a target function interacts with the program memory, while making minimal assumptions about how a hash algorithm is implemented. Consequently, HARM-DOS detects candidate hash functions optimistically. This is beneficial for DR2, for which it is important to detect all implementations of a single hash algorithm in a target binary. Even though this approach leads to false positive detections, these will be pruned in Pre-patch Examination, preventing a faulty patch.
HARM-DOS performs Hash Function Template-matching (Section 2.5.1) to determine if the target function matches a template, designed from the insights in Section 2.2.1. For any matching target function, HARM-DOS determines if it implements a known-weak hash algorithm using Constant-Mnemonic Pair Discovery (Section 2.5.2). For each known-weak hash algorithm, HARM-DOS uses a supplied detection model that captures the constants used in the calculations defined in the algorithm. We create a detection model for each of the following popular known-weak hash algorithms: BKDR [46], DEK [47], DJB [48], ELF [49], FNV [50], JS [51], RS [52], and SDBM [53]. Note that this list can be easily extended to other weak hash algorithms. The outcome of this phase is a set of candidate hash functions in the target binary, each with a candidate algorithm – a label indicating which known-weak hash algorithm we suspect is implemented therein (e.g., BKDR).
2.5.1 Hash function template-matching
The first step of identifying candidate hash functions is to determine if a target function f contains the features of a weak hash function. In Section 2.2.1, we mention that weak hash algorithms typically iterate over a variable-length input buffer. Iterating over such a buffer necessarily manifests as a loop in control flow. Therefore, HARM-DOS requires a target function to have at least one nontrivial strongly connected component (SCC)¹ in its control-flow graph (CFG). Let CFG(f) = G_f denote the CFG of f and let SCC(G_f) denote the set of nontrivial SCCs {C_{f,1}, C_{f,2}, ..., C_{f,M}} of G_f. HARM-DOS, therefore, requires |SCC(G_f)| ≥ 1.
Next, HARM-DOS analyzes the instructions of each SCC to identify possible side effects of the target function. As mentioned in Section 2.2.1, weak hash functions are often implemented without side effects in order to be context-agnostic, and a function with side effects cannot be replaced while fulfilling MR3.
Let V(G) and E(G) denote the set of nodes and edges in graph G, respectively. In G_f, the nodes b_1, b_2, ..., b_N are basic blocks and the edges represent the control flow between them. Each basic block b_i is a sequence of instructions, where b_{i,j} denotes the j-th instruction. HARM-DOS identifies side effects by identifying instructions that perform a memory-write operation to memory outside of the stack memory of the target function. Memory-write operations receive the destination memory address as an expression consisting of registers, constants, or both. HARM-DOS analyzes this expression to determine whether a specific memory-write operation indicates a side effect. For an instruction b_{i,j}, let WR_r(b_{i,j}) and WR_c(b_{i,j}) denote the set of registers and constants, respectively, used in the address expression of a memory-write operation. If b_{i,j} does not write to memory, WR_r(b_{i,j}) = WR_c(b_{i,j}) = ∅. Write operations to memory inside the stack memory are usually identifiable as a write to an offset from a register holding the stack pointer or the stack base pointer. Let R_stack denote the registers used in stack operations. In AMD64, R_stack = {rsp, rbp}. Therefore, HARM-DOS explicitly allows only such memory-write operations in an SCC. Conversely, other types of memory-write operations, which use non-stack registers or use only constants in the destination address expression, are forbidden. For example, the instruction mov [rsp-8],rax is allowed, while the instructions xor [rsi+4],rbx; mov [rsp+rcx],rsi and mov [10000],rdi are forbidden.
¹A nontrivial SCC is a subgraph of mutually reachable nodes, with at least one node and one edge (we allow self loops).
We say an SCC C_{f,i} is a template-match if every memory-write operation in the SCC writes only to memory on the stack. That is,
∀b_j ∈ V(C_{f,i}), ∀b_{j,k} ∈ b_j: (|WR_c(b_{j,k})| > 0 ⟹ |WR_r(b_{j,k})| > 0) ∧ WR_r(b_{j,k}) ⊆ R_stack
We say a function f is template-matching if it contains an SCC that is a template-match.
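To make the template-match concrete, the following is a minimal sketch (not the HARM-DOS implementation itself) of this check using angr, capstone, and networkx; the path "BIN_PATH" is a hypothetical placeholder for an AMD64 ELF binary.

import angr
import capstone
import networkx as nx

STACK_REGS = {"rsp", "rbp"}  # R_stack on AMD64

def nontrivial_sccs(func):
    # Yield nontrivial SCCs of the function's CFG: more than one node,
    # or a single node with a self loop.
    g = func.graph
    for scc in nx.strongly_connected_components(g):
        if len(scc) > 1 or any(g.has_edge(n, n) for n in scc):
            yield scc

def scc_is_template_match(proj, scc):
    # True if every memory write in the SCC addresses memory only
    # through stack registers (e.g., mov [rsp-8],rax is allowed).
    for node in scc:
        block = proj.factory.block(node.addr, size=node.size)
        for insn in block.capstone.insns:
            for op in insn.operands:
                if op.type != capstone.x86.X86_OP_MEM:
                    continue
                if not (op.access & capstone.CS_AC_WRITE):
                    continue
                regs = set()
                if op.mem.base:
                    regs.add(insn.reg_name(op.mem.base))
                if op.mem.index:
                    regs.add(insn.reg_name(op.mem.index))
                # Forbid constant-only addresses and non-stack registers.
                if not regs or not regs <= STACK_REGS:
                    return False
    return True

proj = angr.Project("BIN_PATH", auto_load_libs=False)
cfg = proj.analyses.CFGFast(normalize=True)
for func in cfg.functions.values():
    if any(scc_is_template_match(proj, s) for s in nontrivial_sccs(func)):
        print(f"template-matching: {func.name} @ {hex(func.addr)}")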
2.5.2 Constant-mnemonic pair discovery
Given a template-matching function, the next step is to determine if it is a known-weak hash function. To this end, HARM-DOS leverages a code identification technique relying on unique constants, proposed by Lestringant [35]. We include the unique constants found in weak hash algorithms, paired with their assembly-language operators (mnemonics), in the detection model of each known-weak hash algorithm. For example, in order for a target function to be considered an SDBM hash function, it must either contain the constant 65599 used with the signed multiplication operator (imul), or both the constants 6 and 16, each used with the logical left shift operator (shl). This is because in some implementations of this algorithm the calculation performed by the algorithm is implemented as h * 65599, while in others as (h << 6) + (h << 16) - h². We show reference implementations of both variants in the sketch below.
²Note these implementations are mathematically equivalent.
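As an illustration of why both fingerprints identify the same algorithm, the following Python reference implementations sketch the two SDBM variants (32-bit, truncated with a mask); these are illustrative reimplementations, not code taken from any analyzed binary.

def sdbm_mul(buf: bytes) -> int:
    # Multiplication variant: h = b + h * 65599.
    h = 0
    for b in buf:
        h = (b + h * 65599) & 0xFFFFFFFF
    return h

def sdbm_shift(buf: bytes) -> int:
    # Shift variant: h = b + (h << 6) + (h << 16) - h.
    h = 0
    for b in buf:
        h = (b + (h << 6) + (h << 16) - h) & 0xFFFFFFFF
    return h

# 65599 = 2**16 + 2**6 - 1, so the two variants agree on every input.
assert sdbm_mul(b"abc") == sdbm_shift(b"abc")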
Formally, in the detection model of each known-weak hash algorithm, we include a set of constant-mnemonic pair fingerprints. For a hash algorithm A, let CMF(A) = {S_{cm_1}, S_{cm_2}, ..., S_{cm_n}} be a set of constant-mnemonic pair fingerprints. Each constant-mnemonic pair fingerprint S_{cm_i} is a set of tuples {(c_{i,1}, S_{m_{i,1}}), (c_{i,2}, S_{m_{i,2}}), ...}, where each c_{i,j} is a constant and each S_{m_{i,j}} is a set of mnemonics {m_1, m_2, ...}. These act like fingerprints in the sense that they are unique to one known-weak hash algorithm. Due to the simplistic nature of weak hash algorithms, some have very few algorithm-specific constant-mnemonic pairs to use for identification. We opt for an optimistic approach, leading to false positives that are subsequently filtered out in the Pre-patch Examination phase.
HARM-DOS searches for the presence of these constant-mnemonic pair fingerprints in the instructions of the target function f. For a function f, let CM(f) denote the set of tuples {(c_1, m_1), (c_2, m_2), ...}, where each tuple (c_i, m_i) represents a mnemonic m_i and a constant operand c_i used in the instructions of f. We say f has a constant-mnemonic pair match with respect to weak hash algorithm A if all the constants of a constant-mnemonic pair fingerprint of A appear within f, each used with one of its paired mnemonics. That is,
∃ S_{cm_{i'}} ∈ CMF(A) | ∀(c_{i',j}, S_{m_{i',j}}) ∈ S_{cm_{i'}}, ∃ m_k ∈ S_{m_{i',j}} | (c_{i',j}, m_k) ∈ CM(f)
We generate the constant-mnemonic pair fingerprints, shown in Table 2.1, by compiling source code implementations of the known-weak hash algorithms with the compilers GCC 7.5.0 and Clang 6.0.0, using optimization levels O0, O1, O2, O3, Os, Ofast and O0, O1, O2, O3, Os, Ofast, Oz, Og, respectively.
If the target function f is template-matching and has a constant-mnemonic pair match with respect to algorithm A, we say f is a candidate hash function with candidate algorithm A, denoted f_{c_A}. We show a full example of discovering a candidate hash function in Appendix A. Candidate hash functions must be analyzed to determine patchability.
2.6 Pre-patch examination
In order to patch the candidate hash function, it is important to understand how it interacts with the rest of the program, in terms of control flow and data flow. This phase starts by identifying and filtering inlined hash algorithms, in accordance with DR2, using a heuristic static analysis of two steps: Duplicate Candidate Algorithm Detection (Section 2.6.1.1) and Template-match Size Difference (Section 2.6.1.2).
Table 2.1: The constant-mnemonic pair fingerprints of each known-weak hash algorithm.
Hash  Constant-Mnemonic Pair Fingerprints
BKDR  {(131, {imul, mov})}, {(1313, {imul, mov})}
DEK   {(5, {rol})}, {(5, {shl}), (27, {shr})}
DJB   {(5381, {imul, mov})}
ELF   {(4, {shl}), (24, {shr, sar}), (4026531840, {and, mov})}
FNV   {(16777619, {imul, mov}), (2166136261, {mov})}
JS    {(2, {shr}), (5, {shl}), (1315423911, {mov})}
RS    {(63689, {mov}), (378551, {mov, imul})}
SDBM  {(65599, {imul})}, {(6, {shl}), (16, {shl})}, {(6, {shl}), (10, {shl})}
Candidate hash functions that remain after this step are referred to as isolated hash functions. Focusing on isolated hash functions allows us to use symbolic execution to build a behavior profile of the hash function.
HARM-DOS continues by applying four analysis steps: Symbolic Signature Detection (Section 2.6.2.1), Symbolic Input-Output Matching (Section 2.6.2.2), Symbolic Case Sensitivity Checking (Section 2.6.2.3), and Symbolic Memory Access Analysis (Section 2.6.2.4). These steps leverage symbolic execution on each of the isolated hash functions to build a profile of their behavior.
2.6.1 Filtering inlined hash functions
The hash function discovery analysis described in Section 2.5 makes no distinction as to whether a hash algorithm has been inlined or not. In this section, we describe the static analysis we use to remove inlined hash algorithms from the set of candidate hash functions, fulfilling DR2.
2.6.1.1 Duplicate candidate algorithm detection
If a hash algorithm A is inlined in multiple caller functions, the hash function discovery analysis will identify any number of these callers as candidate hash functions with candidate hash algorithm A. We use this as a clue to determine if a hash function has been inlined. Specifically, if we have duplicate candidate algorithms in the target binary executable (multiple candidate hash functions with the same candidate algorithm), we assume the hash functions have been inlined and remove these when building the set of isolated hash functions.
The remaining candidate hash functions, which have a unique candidate algorithm in the target binary, are denoted lone hash functions. Formally, let F_c(b, A) denote the set of candidate hash functions in binary executable b with candidate algorithm A. A candidate hash function f_{c_A} is a lone hash function if |F_c(b, A)| = 1.
2.6.1.2 Template-match size difference
The next clue we use for detecting inlined hash algorithms is the difference between the number of basic blocks in the CFG of the candidate hash function and in its template-matches. We refer to this as the template-match size difference. We define the template-match size difference of a candidate hash function f_{c_A} as
DIFF(f_{c_A}) = |V(CFG(f_{c_A}))| − |V(C_{f_{c_A},i})|
where C_{f_{c_A},i} is the template-match of CFG(f_{c_A}) with the maximum number of nodes. The intuition here is that if the template-match only makes up a small portion of the candidate hash function's CFG, the hash function is more likely to have been inlined in a larger function. A candidate hash function f_{c_A} is an isolated hash function f_{i_A} if
DIFF(f_{c_A}) < K ∧ |F_c(b, A)| = 1.
Through manual analysis and experimentation, we have discovered that using K = 6 provides a good balance between correctly identifying isolated hash functions and HARM-DOS's execution time. This decision is discussed further in Section 2.10. A minimal sketch of this test follows.
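This sketch reuses nontrivial_sccs() and scc_is_template_match() from the template-match sketch above; K = 6 is the limit chosen here.

K = 6

def is_isolated(proj, func, same_algorithm_candidates):
    # same_algorithm_candidates: all candidate hash functions in the
    # binary sharing this function's candidate algorithm (F_c(b, A)).
    matches = [s for s in nontrivial_sccs(func)
               if scc_is_template_match(proj, s)]
    if not matches:
        return False
    largest = max(matches, key=len)
    diff = len(func.graph.nodes) - len(largest)  # DIFF(f_{c_A})
    return diff < K and len(same_algorithm_candidates) == 1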
2.6.2 Symbolic hash function analyses
We perform four symbolic execution tests on the isolated hash functions to build a profile of their behavior, in order to determine if a nondisruptive patch can be made. The behavior profile is built with regard to the signature (MR6), input-output relationships (DR1, MR5), case sensitivity (MR7), and memory accesses (MR3, MR4) of each function.
Let f(s_0) denote the set of program states created when performing symbolic execution, starting at function f with symbolic program state s_0. Let FIN(f(s_0)) be the set of program states that reach a ret instruction of f, the final states. Let WR(f(s_0)) be a set of tuples {(a_1, v_1), ...}, denoting the memory-write operations that occur during symbolic execution. Each tuple consists of a symbolic expression for the address a_i and the value v_i of the memory-write operation. Let RD(f(s_0)) be defined similarly, but for memory-read operations. For a symbolic expression e, VAR(e) denotes the symbolic variables used in the expression. With MEM(s, a), we denote the content of the memory at program state s at the address represented by symbolic expression a. Similarly, REG(s, r) denotes the content of register r. Let ARG(s_0, i) denote the value of the register, or memory location, that corresponds to the i-th argument of a function. For our purposes, we assume ARG(s_0, 1) = REG(s_0, rdi) and ARG(s_0, 2) = REG(s_0, rsi).
2.6.2.1 Symbolic signature detection
In Section 2.2.1 we explained that hash functions frequently implement one of two signatures, the
buffer-length signature and buffer-only signature. We restrict our focus to patching hash functions
with these signatures, towards fulfilling MR6.
The main difference between the buffer-length and buffer-only signatures is in how the end of the input buffer is determined. For the buffer-length signature, the length of the buffer is given explicitly as a function argument. For the buffer-only signature, on the other hand, the end of the buffer is usually identified by a special byte, often a null byte.
For each signature, we set up a symbolic program state s_0 that corresponds to calling the hash function with symbolic input according to the signature.
Figure 2.2: The symbolic starting states used for symbolic execution when determining if a function implements the (a) buffer-length and (b) buffer-only signature.
(a) Buffer-length: registers rip = hash function entry, rdi (1st argument) = input buffer address, rsi (2nd argument) = n; memory at input buffer address + i holds symbolic byte b_i for 0 ≤ i ≤ n−1.
(b) Buffer-only: registers rip = hash function entry, rdi (1st argument) = input buffer address; memory at input buffer address + i holds symbolic byte b_i with constraint b_i ≠ 0 for 0 ≤ i ≤ n−1, and input buffer address + n holds 0.
For the buffer-length signature, we create s_0 such that ARG(s_0, 1) = x and ARG(s_0, 2) = n. Here, x is a symbolic variable denoting the memory address of the input buffer and n is a defined, concrete length. Similarly, for the buffer-only signature, we create s_0 such that ARG(s_0, 1) = x, MEM(s_0, x + i) = b_i with b_i ≠ 0 for 0 ≤ i < n, and MEM(s_0, x + n) = 0. Note, we assume the end of the buffer is indicated with a null byte. Figures 2.2a and 2.2b show a visual representation.
For each of the two signatures, we define a set of addresses A_e that we expect to be accessed if the isolated hash function indeed implements the signature. For the buffer-length signature, A_e = {x + i | 0 ≤ i < n}, while for the buffer-only signature A_e = {x + i | 0 ≤ i ≤ n}. Note that the number of memory addresses accessed for the buffer-only signature is one more than for the buffer-length signature, because an additional address needs to be read in order to identify the end of the buffer. We determine which (if either) signature f implements by checking whether we observe a memory-read operation accessing each of the expected addresses during symbolic execution. That is, we calculate the following for each A_e,
{a_i | (a_i, v_i) ∈ RD(f(s_0)) and x ∈ VAR(a_i)} = A_e.
Knowing which signature the isolated hash function implements provides us with insight into how the hash function receives its input. The benefit of this is twofold. It allows us to ensure that our replacement hash function consumes its input in the same way as the original, necessary for MR6, and it allows us to perform symbolic execution on the hash function with controlled input: we know which memory locations will be accessed, so we assign concrete values there, corresponding to the input. Let IN(f, w) denote the program state s_0 set up to perform symbolic execution on f with input buffer w = b_0 b_1 ... b_{n−1}. That is, we set up s_0 according to the signature of f and let MEM(s_0, x + i) = b_i for all i ∈ {0, ..., n − 1}. For a final state s_f, let OUT(s_f) denote the expression (symbolic or concrete) corresponding to the return value. For our purposes, we assume OUT(s_f) = REG(s_f, rax).
If the isolated hash function implements either the buffer-length signature or the buffer-only signature, we say the hash function is signature-conforming. The signature-conforming hash functions are passed to the following symbolic execution tests. Otherwise, the candidate hash function will not be patched.
2.6.2.2 Symbolic input-output matching
Each hash algorithm deterministically produces its output value, the defined hash value, for every given input. Since this relationship is unique to a specific algorithm, it can be used to identify the algorithm, as proposed by Gröbert [35, 36]. We include information on such input-output relationships in the detection model for each known-weak hash algorithm. In practice, we have observed small changes in the implementation of a hash algorithm that produce changes in the hash value. For this reason, we associate a small set of defined hash values with each input buffer and hash algorithm. The first change we allow for is hash functions producing 32-bit integer hash values versus 64-bit integer hash values. We also allow for two known variations in the source code implementation of the DJB hash function, which we show in Appendix C. Let W_io = {(w_1, H_1), ..., (w_N, H_N)}. Each tuple (w_i, H_i) is an input buffer w_i and the set of defined hash values H_i for the algorithm A. For example, for SDBM we have (w_{i'}, H_{i'}) = (abc, {97, 417419622498}) to associate input buffer abc with the corresponding 32-bit and 64-bit defined hash values, respectively.
To determine if a signature-conforming hash function implements the candidate algorithm, we use symbolic execution to obtain the observed hash value for a given input. A set of observed hash values that each appear in the corresponding set of defined hash values indicates that the signature-conforming hash function indeed implements the candidate algorithm, fulfilling DR1. Formally, f is input-output conforming (and therefore a confirmed hash function) with respect to A if
∀(w_i, H_i) ∈ W_io, ∀s_f ∈ FIN(f(IN(f, w_i))): |FIN(f(IN(f, w_i)))| = 1 ∧ OUT(s_f) ∈ H_i
When choosing the input buffers to use for symbolic input-output matching, we create 256 buckets in the range {0, ..., 2^32}. For each known-weak hash algorithm, we choose input buffers that yield a defined hash value in each bucket. Additionally, we choose 5 input buffers for which the algorithm outputs large defined hash values, approaching 2^32. By selecting inputs that correspond to large defined hash values, we can detect cases where the signature-conforming hash function yields output in a range smaller than 2^32. We cannot patch these while fulfilling MR5. Finally, if we observe a symbolic hash value when evaluating the observed hash value, it means the concrete input we supplied was not sufficient to constrain the observed hash value to a single number. We mark such hash functions as not input-output conforming. In all cases, functions that are not input-output conforming will not be patched.
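Continuing the angr sketch above, but with the buffer bytes stored as concrete values IN(f, w) rather than unconstrained symbolic variables, the observed hash value can be concretized from the final states and compared against the defined hash values; the concrete values shown are the SDBM example from the text.

defined_hash_values = {97, 417419622498}  # SDBM example for input "abc"

confirmed = True
for final in simgr.deadended:        # FIN(f(IN(f, w)))
    out = final.regs.rax             # OUT(s_f)
    if final.solver.symbolic(out):
        confirmed = False            # hash value not constrained: reject
    elif final.solver.eval(out) not in defined_hash_values:
        confirmed = False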
2.6.2.3 Symbolic case sensitivity checking
To determine if a signature-conforming hash function is case-sensitive, we provide it with pairs of concrete input buffers W_case = {(w_{1,1}, w_{1,2}), ..., (w_{N,1}, w_{N,2})}. Each tuple (w_{i,1}, w_{i,2}) is a pair of input buffers that differ only in case, for example (abc, ABC). We say a signature-conforming hash function f is case-sensitive if two such buffers produce unequal hash values. Formally,
∃(w_{i',1}, w_{i',2}) ∈ W_case | ∃(s_{f,1}, s_{f,2}) ∈ FIN(f(IN(f, w_{i',1}))) × FIN(f(IN(f, w_{i',2}))) | OUT(s_{f,1}) ≠ OUT(s_{f,2})
The results of this analysis are stored for use when constructing the replacement hash function, in order to fulfill MR7.
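A small sketch of this test follows, where run_hash(buf) is assumed to wrap the symbolic execution setup above and return a concrete observed hash value.

W_CASE = [(b"abc", b"ABC"), (b"Hello", b"hELLO")]

def is_case_sensitive(run_hash):
    # Case-sensitive if any case-variant pair hashes differently.
    return any(run_hash(w1) != run_hash(w2) for w1, w2 in W_CASE)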
2.6.2.4 Symbolic memory access analysis
It is important to know of any side effects of the signature-conforming hash function. Similar to the Hash Function Template-matching described in Section 2.5.1, the purpose of this analysis is to identify side effects, to satisfy MR3. This is a more accurate, albeit slower, analysis that leverages symbolic execution.
Using symbolic execution, we can identify memory modifications beyond the function's stack memory. If we observe such modifications, we decide not to patch the hash function. Otherwise, we refer to the isolated hash function as memory-conforming. This fulfills MR3. We set up s_0 according to the signature of f and create a symbolic variable x_r for each stack register r ∈ R_stack. Then, we set REG(s_0, r) = x_r for each r ∈ R_stack. Formally, we say f is memory-conforming if
∀(a_i, v_i) ∈ WR(f(s_0)): VAR(a_i) ⊆ {x_r : r ∈ R_stack}.
If a hash function is signature-conforming, input-output conforming, and memory-conforming, we say it is a conforming hash function and it is eligible for patching.
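A hedged angr sketch of this memory-access test follows, continuing with the project and HASH_ADDR from the earlier sketch; note that driving execution with fully symbolic stack registers may require custom address-concretization strategies in practice, which this sketch omits.

state = proj.factory.blank_state(addr=HASH_ADDR)
x_rsp = claripy.BVS("x_rsp", 64)
x_rbp = claripy.BVS("x_rbp", 64)
state.regs.rsp = x_rsp
state.regs.rbp = x_rbp

memory_conforming = True
def check_write(st):
    global memory_conforming
    addr = st.inspect.mem_write_address
    # Every variable in the write address must derive from a stack register.
    if not all(v.startswith(("x_rsp", "x_rbp")) for v in addr.variables):
        memory_conforming = False
state.inspect.b("mem_write", when=angr.BP_BEFORE, action=check_write)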
2.7 Hash transplant
To perform the hash transplant, we construct a secure hash function, the replacement hash function, and use it to replace the conforming hash function, the original hash function.
2.7.1 Replacement hash function construction
Using the profile built during Pre-patch Examination, we construct a replacement hash function that is similar to the original in terms of signature and case sensitivity, fulfilling MR6 and MR7. We also add instructions at the start and end of the replacement hash function to store and restore the values of all registers that are modified in the replacement hash function, except the register housing the hash value. This is done to fulfill MR2. To avoid side effects and fulfill MR4, the replacement hash function is constructed to access only its input buffer and stack memory.
Next, we discuss the two hash algorithms we use as secure alternatives.
2.7.1.1 Candidate replacement hash algorithms
The first hash algorithm we use as a replacement is SipHash, designed by Aumasson [38] specifically for the purpose of preventing hash-collision DoS attacks. SipHash achieves collision resistance by incorporating a 16-byte secret into its calculations. As long as this secret is unknown to an attacker, collisions can only be achieved through guessing [38], fulfilling MR8. The downside of SipHash is that it takes approximately 670 bytes to implement in AMD64 binary code. This makes it challenging to use SipHash as the replacement hash algorithm while fulfilling MR1, as explained in Section 2.7.2. When the implementation size of SipHash makes it an impractical replacement algorithm, we use an alternative algorithm.
For this, we turn to strongly universal hash algorithms, introduced by Wegman [54]. A set of hash algorithms is strongly universal if, for a randomly chosen hash algorithm from the set, any input is mapped to every hash value with equal probability. If the attacker does not know which hash algorithm in the set is used, they cannot predictably generate collisions, as noted by Crosby [26]. Lemire introduces a strongly universal set of hash algorithms, named Multilinear [55]. This set of hash algorithms achieves collision resistance through a secret initial state of a pseudorandom number generator (PRNG). Therefore, this hash function is secure and can be used to fulfill MR8. We provide more details about our Multilinear hash algorithm approach in Appendix D.
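For intuition, the following is a hedged Python sketch of a Multilinear-style strongly universal hash over bytes, with secret coefficients drawn once (at patch time in our setting); it illustrates the scheme rather than reproducing the exact replacement code that HARM-DOS injects.

import secrets

def make_multilinear(max_len: int):
    # One secret 64-bit coefficient per input position, plus one offset.
    m = [secrets.randbits(64) for _ in range(max_len + 1)]
    def h(buf: bytes) -> int:
        acc = m[0]
        for i, b in enumerate(buf):
            acc = (acc + m[i + 1] * b) & 0xFFFFFFFFFFFFFFFF
        return acc
    return h

hash_fn = make_multilinear(64)
print(hash_fn(b"abc"))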
The security guarantees of strongly universal hash algorithms are slightly weaker than those of SipHash. If an attacker can obtain a number of input-output pairs of the strongly universal hash algorithm, it is possible to calculate some of the random values used, as mentioned by Aumasson [38]. This, in turn, will allow the attacker to pinpoint the specific hash algorithm used and therefore generate collisions. Therefore, we first try to replace a weak hash function with SipHash and only resort to a Multilinear hash algorithm implementation if we fail to insert SipHash due to its implementation size.
Finally, an important decision is whether to randomize the secret used for the replacement hash function once, when the weak hash function is patched, or every time the program is run. The latter may be more secure, as the attacker will not learn the secret by obtaining the patched executable. However, we have observed programs that store the key-value pairs produced by the hash function to a file [56]. Randomizing the hash function between these runs would break this functionality. Therefore, we have decided to limit randomization to the time when the binary executable is patched. The user can re-randomize the secret by patching the binary executable again.
2.7.2 Replacing the hash function
To patch the vulnerable binary executable, we leverage a process called binary rewriting. At a high level, we replace the original hash function by overwriting it with the replacement hash function. If the replacement hash function is larger than the original (in the number of bytes it requires to implement), we insert the remaining instructions, the overflowing instructions, elsewhere in the binary executable and adjust control flow so that these instructions are executed in the correct order.
Binary rewriting is complicated by the fact that many binary instructions, such as jump and data-reference instructions, rely on their relative position in the binary. Simply inserting additional instructions between existing instructions may break these relations, rendering the program defective. For this reason, it is safer to add instructions to a binary executable by overwriting others. To this end, we build on a method often used to modify binary executables, for example by Bruschi [45] and Menon [22]: we search for bytes in the binary executable that are never executed, colloquially called code caves. Code caves are often introduced by the compiler to align functions [57]. If no single code cave is large enough to hold all the overflowing instructions, we join multiple caves with jump instructions. We show a full example of how to replace a hash function in Appendix B. In some cases, the binary executable does not contain enough code caves to house all the overflowing instructions. In these cases the patch fails and we report that there was no room for the replacement hash function. We discuss our decision to use this approach to patching, as opposed to other existing approaches, in Section 2.12.
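As an illustration, a code-cave scan can be sketched with pyelftools by looking for long runs of padding bytes (0x00 or 0xCC) in executable segments; the minimum size and padding values here are illustrative assumptions, not the exact heuristic HARM-DOS uses.

from elftools.elf.elffile import ELFFile

MIN_CAVE_SIZE = 16  # illustrative threshold

def find_code_caves(path):
    caves = []
    with open(path, "rb") as f:
        elf = ELFFile(f)
        for seg in elf.iter_segments():
            # Only executable PT_LOAD segments can host patch code.
            if seg["p_type"] != "PT_LOAD" or not (seg["p_flags"] & 1):
                continue
            data = seg.data()
            start = None
            for i, byte in enumerate(data):
                if byte in (0x00, 0xCC):
                    if start is None:
                        start = i
                else:
                    if start is not None and i - start >= MIN_CAVE_SIZE:
                        caves.append((seg["p_vaddr"] + start, i - start))
                    start = None
            if start is not None and len(data) - start >= MIN_CAVE_SIZE:
                caves.append((seg["p_vaddr"] + start, len(data) - start))
    return caves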
As we only overwrite the instructions of the original hash function and, possibly, some instructions that serve only as padding bytes, we replace the original hash function while fulfilling MR1.
2.8 Post-patch examination
After patching, we need to confirm that the replacement hash function does not introduce errors. Since proving the equivalence of two programs is undecidable, we perform a local verification, aided by symbolic execution. We perform two steps: Symbolic Patch Memory Access Analysis and Symbolic Preimage Calculation. If either of these two steps fails, we say that replacing the hash function introduced errors into the program and discard the patch. Otherwise, the patch is considered a success.
2.8.1 Symbolic patch memory access analysis
Similar to the original hash function f_o, the replacement hash function f_r should not have any side effects. Moreover, f_r should access exactly the same bytes of the input buffer as f_o. To this end, we use the set of expected addresses A_e, defined in Section 2.6.2.1, for the signature of f_o. We perform symbolic execution on f_r and monitor the memory accesses in order to determine if the replacement was successful. We set up s_0 according to the signature of f_o and create a symbolic variable v_r for each stack register r ∈ R_stack. Then, we set REG(s_0, r) = v_r for each r ∈ R_stack. Let the address of the input buffer be a symbolic variable x. We require that f_r only writes to stack memory:
∀(a_i, v_i) ∈ WR(f_r(s_0)): VAR(a_i) ⊆ {v_r | r ∈ R_stack}
and reads from all the offsets of x appropriate for its signature:
{a_i | (a_i, v_i) ∈ RD(f_r(s_0)) and x ∈ VAR(a_i)} = A_e
and does not read from global memory:
∀(a_i, v_i) ∈ RD(f_r(s_0)): VAR(a_i) ⊆ ({x} ∪ {v_r | r ∈ R_stack}).
This fulfills MR4.
2.8.2 Symbolic preimage calculation
The purpose of this verification step is to ensure that the replacement hash function only returns hash values that are within the range of the original hash function. We calculate preimages under the original hash function for hash values returned by the replacement hash function. We do this by evaluating the concrete hash value of the replacement hash function for a number of concrete inputs. For each of these concrete hash values, we use symbolic execution and constraint solving to obtain an input for which the original hash function returns the same hash value. Note that such an evaluation is feasible due to the simple nature of known-weak hash algorithms. Formally, let W_p = {w_1, w_2, ..., w_N} be a set of input buffers. We require
∀w_i ∈ W_p, ∃w'_i | ∀(s_{f,o}, s_{f,r}) ∈ FIN(f_o(IN(f_o, w_i))) × FIN(f_r(IN(f_r, w'_i))): OUT(s_{f,o}) = OUT(s_{f,r}).
We do this towards fulfilling MR5.
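A hedged sketch of the preimage calculation follows, modeled directly on the SDBM recurrence in claripy; the target value is an arbitrary stand-in for an observed replacement-hash output.

import claripy

def sdbm_symbolic(sym_bytes):
    # Symbolic 32-bit SDBM: h = b + h * 65599 over the input bytes.
    h = claripy.BVV(0, 32)
    for b in sym_bytes:
        h = b.zero_extend(24) + h * 65599
    return h

target = 0x1234ABCD  # stand-in for a concrete replacement-hash value
sym = [claripy.BVS(f"b{i}", 8) for i in range(8)]
solver = claripy.Solver()
solver.add(sdbm_symbolic(sym) == target)
model = solver.batch_eval(sym, 1)[0]  # one consistent assignment
preimage = bytes(model)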
2.9 Implementation
HARM-DOS is implemented in approximately 3,500 lines of Python code in an open-source repository³. It leverages the angr binary program analysis framework [14] for the fundamental requirements of binary program analysis, such as disassembly, CFG recovery, and symbolic execution. All the analysis steps discussed in this chapter are implemented by the authors, using only these fundamentals. As input, HARM-DOS takes the names of executable files to analyze and writes the analysis results for each to a file in JSON format. The patched executable is also produced, if applicable. In the current implementation, only ELF executable files for the AMD64 architecture are supported.
2.10 Experimental results
Our experimental evaluation is composed of three main parts. First, in Section 2.10.1, we analyze a large data set of real-world binaries to test the ability of HARM-DOS to detect and patch vulnerabilities at scale. Second, in Section 2.10.2, we constitute and analyze a subset of these binaries, containing manually identified hash functions. We use this to determine the accuracy of our approach against a known ground truth. Third, we discuss a case study in Section 2.10.3.
³https://github.com/usc-isi-bass/hashdos_vulnerability_detection
All experiments were run with PyPy 7.3.5, in an Ubuntu 20 Docker container with 10 CPU cores. Analyzing a single binary executable takes approximately 215 MB of memory on average.
2.10.1 Full-scale analysis
In this section we present the results obtained by running HARM-DOS on a large set of real-world binaries. Our data set consists of 105,831 unique AMD64 ELF executable files, extracted from the AllStar data set [37]. This data set contains the binaries obtained when building the Jessie distribution of the Debian packages. These packages are built with the debuild tool, which uses the compiler specified in the package configuration to create the executable. On average, these binaries are approximately 5.6 MB in size, consisting of approximately 58,000 basic blocks and 110,000 CFG edges. Among these packages are many widely used programs, such as Firefox, Apache, PHP, and Binutils. Note that even though the binaries in this data set contain debugging symbols, HARM-DOS does not rely on these in any phase of analysis.
2.10.1.1 Discovery
We run Vulnerability Diagnosis on all of the binaries in the data set. This takes about 2 days to complete, with an average analysis time of 17 s per binary. We identify 31,052 candidate hash functions in 8,930 binaries. Table 2.2 shows the number of candidate hash functions per candidate algorithm. The table shows that not a single candidate hash function with JS as candidate algorithm was discovered. To ensure this is not caused by false negative detections, we performed a cursory manual search of the data set, which also did not yield any such hash functions. The lack of JS hash functions therefore seems to originate from a lack of their use in practice.
In the next phase of analysis, we reduce the set of candidate hash functions to those that are not inlined – the isolated hash functions. The first step of this analysis is to remove all candidate hash functions with duplicate candidate algorithms in the same binary. Table 2.2 shows that 10,086 lone hash functions remain after this step.
Table 2.2: The candidate, lone, isolated, confirmed and patched hash functions from the full-scale
analysis.
Hash Alg Candidate Lone Isolated Confirmed Patched
BKDR 11,053 2,568 45 2 2
DEK 2,944 1,872 976 0 0
DJB 2,558 1,724 424 390 372
ELF 2,069 1,318 433 321 317
FNV 3,327 141 82 46 44
JS 0 0 0 0 0
RS 69 21 0 0 0
SDBM 9,032 2,442 301 37 24
Total 31,052 10,086 2,261 796 759
The next step is to remove the lone hash functions for which the template-match size difference
exceeds 6 basic blocks. We discuss this decision in Section 2.10.2. Table 2.2 shows the number of
isolated hash functions, whose behavior HARM-DOS will analyze with symbolic execution.
2.10.1.2 Patching
In this section, we discuss how effectively we can patch the confirmed hash functions. Analyzing all the isolated hash functions takes about 3.5 days, with an average analysis time of 28 s per binary, showing that HARM-DOS is scalable. In Table 2.2 we show the patching results in the Patched column. Approximately 35% of the isolated hash functions were confirmed as weak hash functions, and 95% of these confirmed hash functions were patched successfully. We use the ratio of successful patches over the confirmed hash functions, as this excludes all false positive identifications in Vulnerability Diagnosis and only includes true positive hash functions, verified with Symbolic Input-Output Matching. Note that this also excludes inlined hash functions, which we cannot verify with symbolic execution. Recall that in order to calculate the observed output, a hash function must be signature-conforming. Therefore, if a hash function is either signature non-conforming or input-output non-conforming, it cannot be confirmed as a hash function. In Table 2.3 we show how often these two reasons lead to an isolated hash function not being confirmed as a hash function. In the vast majority of cases, signature non-conformity prevented the patch.
Table 2.3: The number of times signature non-conformity and input-output non-conformity led to an isolated hash function not being confirmed. The right-most column shows cases where analysis time exceeded 4 hours or an error occurred. The large number of signature-nonconforming isolated hash functions is caused by the deliberately broad hash function detection in the optimistic static analysis of the Vulnerability Diagnosis phase. These false positives are correctly filtered out during symbolic execution.
Hash Alg Non-confirmed sig IO TOs and Errors
BKDR 43 40 3 0
DEK 976 865 37 74
DJB 34 25 5 4
ELF 112 47 52 13
FNV 36 5 31 0
SDBM 264 90 54 120
Total 1,465 1,072 182 211
Through manual analysis, we verified that most cases of non-conforming signatures were indeed not hash functions, but instead other functions that share popular constants with a weak hash function (and are thus flagged as candidates during optimistic weak hash function discovery in the Vulnerability Diagnosis phase). Thus our approach correctly filters out these misidentified candidates in the symbolic execution analysis. For the DEK hash algorithm in particular, we observe that in 99% of these cases a portion of the MD5 hash algorithm is misclassified as a DEK hash, because it includes the constant-mnemonic pair match (5, {rol}).
In Section 2.7.1.1 we explain that SipHash has stronger security properties than the Multilinear hash function and is therefore the preferred replacement hash algorithm. In our experiments, we found that of the 759 hash functions that were patched successfully, SipHash was used as the replacement 702 times, or 92.49%.
2.10.2 Ground truth analysis
In order to measure how well HARM-DOS detects all known-weak hash functions, we construct a ground truth data set. We select 202 hash functions for which we have manually verified from the source code that they implement a given known-weak hash algorithm. We draw our functions from a subset of 156 binaries used in our full-scale analysis of the AllStar data set [37].
Table 2.4: The results from running HARM-DOS on the binaries in the ground truth data set. The MI column shows the number of manually identified hash functions. The Candidate column shows how many manually identified hash functions were identified as candidate hash functions, i.e., true positive classifications. The Isolated column shows how many of these remain after filtering out the inlined functions.
Hash Alg MI Candidate Isolated
BKDR 1 1 1
DJB 50 50 34
ELF 50 50 30
FNV 50 50 9
RS 8 8 0
SDBM 43 43 33
Total 202 202 107
We attempted to balance the representation of different known-weak hash algorithms in our ground truth data set, but in reality some algorithms are much more frequent in real-world programs than others. We show the number of manually identified hash functions included in the ground truth data set in the MI column of Table 2.4.
For the Vulnerability Diagnosis phase, we show the number of true positive classifications in Table 2.4. A true positive is a hash function that is correctly detected by our analysis as a candidate hash function, with the correct candidate algorithm. Conversely, a false negative is a hash function that is either not detected by our analysis or is paired with an incorrect candidate algorithm (i.e., mistaken for another algorithm). Note that in Table 2.4, for each algorithm, the number of candidate hash functions matches the number of manually identified hash functions. Therefore, HARM-DOS detected all hash functions correctly.
Next, we look at how well HARM-DOS detects inlined true positive hash functions. In Table 2.4, we show the number of isolated hash functions. To measure the accuracy of the Filtering Inlined Hash Functions analysis (Section 2.6.1), we perform a manual inspection of the candidate hash functions that are removed. The results of this inspection are shown in Table 2.5. We place each of these candidate hash functions into one of four categories. Cat1: the target candidate hash function is indeed inlined. Cat2: the target candidate hash function is not inlined, but there are inlined, duplicate candidate hash algorithms. Cat3: the target candidate hash function is not inlined and none of the duplicate candidate hash algorithms are inlined. Cat4: the target candidate hash function is not inlined and it is a lone hash function, but the template-match size difference is greater than 6. For categories Cat1 and Cat2, the analysis correctly decided not to patch the candidate hash function. For Cat2, replacing only the target candidate hash function and not the rest would break functionality (see DR2). For Cat3, since none of the duplicate candidate algorithms are inlined, we could replace each individually. However, distinguishing between Cat2 and Cat3 automatically is difficult, and we leave this for future research. For Cat4, we observed an isolated hash function with a template-match size difference of up to 13, due to a loop-unrolled implementation of SDBM. To investigate this further, we reran the full-scale analysis with an increased template-match size difference limit of 13. We observed that in this case Cat4 is empty; however, no additional hash functions were patched successfully. For the SDBM hash function mentioned above, there was a failure in the post-patch examination. Therefore, these cannot be patched automatically while guaranteeing correctness with respect to MR5. We have also observed RS hash functions with a template-match size difference of 8. These hash functions are added to the binary executable by a Pascal compiler [58] for use in a hash table. The hash function is implemented in a specialized way to compute a hash value for Pascal string objects. These objects are passed as a pointer to the hash function; the string length and data are accessed via constant offsets from this pointer. Note that this matches neither the buffer-length nor the buffer-only signature. Consequently, it does not follow the context-agnostic nature of the hash functions described in Section 2.2.1 and cannot be patched automatically while ensuring correctness. For this reason, we decide to use a template-match size difference limit of 6, allowing us to capitalize on both analysis time and patchability.
2.10.2.1 Test case verification
As an extra layer of verification, we use the test cases included with some Debian packages to confirm that HARM-DOS does not break functionality.
Table 2.5: The results from manually inspecting the candidate hash functions that are not isolated
hash functions.
Hash Alg Removed Cat1 Cat2 Cat3 Cat4
DJB 16 5 7 1 3
ELF 20 5 9 6 0
FNV 41 0 39 2 0
RS 8 0 0 0 8
SDBM 10 0 0 4 6
Total 95 10 55 13 17
These test cases are provided via the GNU Make build utility [59]. We identify and run the test cases of 21 packages with binaries patched by HARM-DOS. When necessary, we also rebuild the hash tables stored in files used by the test cases, with the patched binaries. In every case, the patched binaries do not introduce test failures.
2.10.3 Case study
To show the effectiveness of HARM-DOS in a real-world context, we discuss a remotely exploitable zero-day hash-collision vulnerability that we discovered, and patched, in Snudown [60], a component of Reddit [61]. We disclosed this vulnerability to Reddit in accordance with the coordinated disclosure policy. The vulnerability was assigned ID CVE-2021-41168 [62]. We also implemented a mitigation that replaces the weak hash function with SipHash, which was accepted by the developers.
Snudown is a library used in Reddit to convert Markdown to HTML. This library uses a hash table to map reference labels to their links, using the SDBM hash algorithm.
We launch a proof-of-concept attack against Snudown, running locally, by parsing a large number of references with labels crafted to cause collisions in the hash table. We measure the parsing time of an increasing number of colliding reference labels. As a sanity check, we repeat the experiment with random labels. We plot the parsing time against the input size in Figure 2.3. The significant difference in parsing-time growth between the malicious and random labels confirms the vulnerability. Note, the superlinear growth in parsing time for random labels is caused by coincidental collisions due to the small table size.
Figure 2.3: Parsing time (s) versus input size (bytes) for both vulnerable and patched Snudown, for both malicious and random reference labels.
We run HARM-DOS on the executable, which successfully produces a patched Snudown. To show that the malicious growth in parsing time no longer occurs, we relaunch the same attack on the patched executable, shown in Figure 2.3. It is clear that patched Snudown does not suffer from the same vulnerability. Moreover, due to the secret used in patched Snudown, the attacker cannot adapt the attack to reliably trigger collisions. To show that we have mitigated the vulnerability without introducing errors, we run the test cases supplied with the Snudown project, which all pass. This shows that HARM-DOS can be used to identify and mitigate real-world hash-collision vulnerabilities.
Note, in Figure 2.3 the parsing time is lower for malicious references parsed with patched Snudown. This is unrelated to the hash function, since both the malicious and random labels are distributed evenly across the hash table. The difference in parsing time originates from CPU caching, leading to a slower time per iteration when searching the list of coincidental collisions.
2.11 Limitations and future work
In this section, we discuss the current limitations of HARM-DOS and future work to improve it.
Patching Inlined Hash Functions. As motivated by MR1, it is necessary to determine exactly which instructions implement the weak hash algorithm. For inlined hash functions, this requires HARM-DOS to distinguish between the instructions implementing the hash algorithm and those of the caller function. A possible approach to this end is to use data-flow analysis. This can be used to compute a program slice over the instructions that form part of the hash algorithm. The start and end of the hash algorithm in this slice can be identified by using Symbolic Input-Output Matching (Section 2.6.2.2) on various subexpressions extracted via symbolic execution. However, due to the computationally expensive nature of symbolic execution, it is unclear whether such an approach will be feasible.
Reassembleable Disassembly. In Section 2.10.1 we have shown that in 57 cases SipHash could not be used to mitigate the vulnerability because of its implementation size. Recent advancements in reassembleable disassembly, made by Flores-Montoya [24] and Bauman [63], could be useful here, as they allow for arbitrary changes to the disassembled code, including introducing new functions. This is achieved by inferring the symbols used during compilation and adding these to the disassembled code. Note, however, that such approaches are subject to limitations in terms of trade-offs between scalability and correctness.
2.12 Related work
Hash Function Discovery. Lestringant defined an approach to detect implementations of cryptographic algorithms [35]. It searches for unique constants, similar to HARM-DOS. The scope of that work, however, ends at detecting the cryptographic algorithm; the analysis thereof is left to a human expert. HARM-DOS, on the other hand, requires an analysis that is capable of determining if the identified hash function is patchable.
Another detection mechanism for cryptographic algorithms, mentioned by Lestringant [35] and Gröbert [36], is to compare the output for corresponding input to subcomputations of the algorithm. We use the same approach in Symbolic Input-Output Matching (Section 2.6.2.2). In our case, however, the hash functions are simple enough that we can calculate the output of the entire function for over 200 inputs.
Other code similarity approaches often rely on creating a message digest (MD) of the code, as done by Farhadi [43], Xu [44], and Ghidra [15], as well as testing for CFG subgraph isomorphism, as done by Bruschi [45]. To test the feasibility of detecting weak hash functions using MDs, we used the Function ID feature of Ghidra [15]. This feature creates an MD of the function using its sequence of instructions, including the mnemonics, register names, memory accesses, and constants. We created an MD for each hash function used to create the detection models used by HARM-DOS. However, this approach failed to detect any hash function in the ground truth data set used in our evaluation. This is to be expected, as MDs are used to identify syntactically equivalent code. Consequently, trivial changes made to the hash function by the compiler (e.g., instruction order) will yield a different MD. This approach is therefore less well suited to our purposes than our approach, which relies on features inherent to weak hash functions, discussed in Section 2.2.1. We expect other MD-based approaches, such as the IDA FLIRT [64] approach, to perform similarly.
We also measured the feasibility of using CFG subgraph isomorphism to detect weak hash functions. We observed that this approach is very susceptible to failure caused by slight modifications made to the CFG by the compiler. If HARM-DOS used subgraph isomorphism, it would fail to detect 13 of the 202 hash functions in our ground truth data set, while our approach correctly identifies all of these hash functions. Moreover, due to the computational complexity of subgraph isomorphism, the analysis takes approximately 3 hours to process the 156 binaries in the ground truth data set, where our approach takes about 40 minutes. Therefore, HARM-DOS is more suitable for analyzing a large number of binaries.
Meijer introduces an approach to detect previously unseen cryptographic algorithms [65] by defining signatures to use for subgraph isomorphism in the data-flow graph (DFG). This approach is not well suited to detecting weak hash functions, because these have very simple data flow, as discussed in Section 2.2.1, making it prone to many false positives.
Petsios and Blair, respectively, propose fuzzing-based approaches for detecting algorithmic complexity vulnerabilities [32, 33]. Given that fuzzing is a randomized dynamic analysis, such an approach is inherently slower than a lightweight static analysis. Moreover, in both of these approaches, automatic patching of the vulnerabilities is out of scope.
Mitigation Strategies. Hash-collision vulnerabilities can also be mitigated by limiting the number of collisions, as done in DJBDNS [66] for cached DNS queries. This solution is unsatisfactory, since additional colliding entries are discarded even if they originate from a benign user. At the network level, it is possible to terminate malicious network connections, as done by Meng [34]. The downside is that an entire additional system is required to prevent exploitation. In contrast, HARM-DOS addresses the root cause of the vulnerability: the weak hash function.
Hash Function Patching. Bruschi and Menon, respectively, achieve binary rewriting by repurposing unused bytes [22, 45]. In the former, sequences of no-operation (nop) instructions are overwritten to insert malicious code. This malicious code executes independently from the host executable, and therefore it is unnecessary to take into consideration the context in which it is placed. In the latter, unused alignment bytes are used to patch buffer-overflow vulnerabilities by exiting the program gracefully on malicious input. HARM-DOS, on the other hand, keeps the program running normally.
Other approaches, such as that used by Duck [25], rely on dynamically mapped memory to insert patch code. This memory is mapped during program execution and the patch code is loaded into it from disk. Consequently, the patch code is not available for static analysis. In this regard, HARM-DOS is better, as it inserts the patch code directly into the executable file, making it available for static analysis to ensure correctness.
2.13 Conclusion
We have presented a novel approach for automatically detecting and replacing weak hash functions in binary code in order to prevent algorithmic complexity vulnerabilities. We evaluated our prototype on a large data set of 105,831 real-world binaries, identified 796 confirmed weak hash functions, and successfully replaced 759 of these in a non-disruptive manner. We show that HARM-DOS is scalable, evidenced by our processing of a data set of over 100 K binaries in about 5.5 days. HARM-DOS is also effective in a real-world context – we used it to discover a zero-day vulnerability in Reddit.
Chapter 3
Diamonds: Automatic Discovery and Selective Mitigation of
Spectre Vulnerabilities in Binary Executables
In recent years, speculative execution vulnerabilities have received significant attention from the research community [67–69], which has produced many mitigation approaches [70–72]. In this chapter, we present DIAMONDS, a novel approach to discover and selectively mitigate Spectre vulnerabilities in binary executables. We achieve this by carefully formalizing four requirements for a Spectre vulnerability to be exploitable. We implement these four requirements in a proof of concept that we evaluate on three real-world binary executables to identify instructions that require mitigation. Using these discovery results, we automatically insert mitigating instructions and measure their performance impact. Additionally, we compare DIAMONDS against compiler-level mitigation and show that the majority of the mitigating instructions introduced by the latter are superfluous, adding unnecessary performance impact. We show that DIAMONDS eliminates up to 20% of the performance cost by only turning off speculative execution when absolutely needed to mitigate the Spectre vulnerability.
3.1 Introduction
With the digitization of society, reliance on software has drastically increased. This reliance has made software an attractive target for attackers seeking to exploit vulnerabilities for financial benefit, or to cause harm. To counteract these malicious actors, security researchers have created automated methods for discovering vulnerabilities in software, such as memory out-of-bounds errors [73, 74]. These methods model software behavior in order to determine if it can reach an unsafe state, e.g., leaking sensitive information. A common paradigm is to model software behavior at the level of its machine code instructions, instead of the source code. The idea is that the machine code instructions give a much more fine-grained insight into software behavior [75], allowing for more accurate models of program behavior. Unfortunately, even this low-level modeling may still be insufficient, as it fails to capture the microarchitectural behavior of the CPU, such as out-of-order execution, caching, and speculative execution, which may lead to a vulnerability. Three vulnerabilities that stem from such microarchitectural behavior are Spectre [67], Meltdown [68], and Downfall [69]. Unless the program model is extended to include microarchitectural behavior, it cannot detect Spectre, Meltdown, Downfall, or any other vulnerability that leverages specifics of CPU microarchitectures. In this chapter, with DIAMONDS, we focus on Spectre V1 vulnerabilities and define a vulnerability discovery model that incorporates the microarchitectural behavior necessary for their exploitation. We use the information of discovered vulnerabilities to automatically apply a patch to the binary executable itself, mitigating the vulnerability. While we only focus on one class of vulnerabilities (Spectre V1), our approach can be easily extended to other microarchitecture-dependent vulnerabilities.
Compared to existing approaches, DIAMONDS takes a more refined approach to reduce the number of false alarms reported. Fewer reported vulnerabilities lead to less patching required for mitigation and reduce the performance penalty of mitigation. In summary, we make the following contributions in this chapter:
• We formalize four requirements a machine code instruction must meet in order to be vulnerable to Spectre V1.
• We develop an open source proof of concept based on these requirements.
• Using this proof of concept, we evaluate real-world executables to identify mitigation points
and leverage binary modification to perform this mitigation.
• We measure the performance impact incurred by this mitigation and show that DIAMONDS
outperforms state-of-the-art, compiler-based mitigation, incurring a performance penalty up to 20% smaller.
3.2 Background
In this section, we give an overview of the fundamentals of Spectre V1 vulnerabilities and the state
of the art in mitigating these vulnerabilities at the compiler level.
3.2.1 Spectre vulnerabilities
Modern CPUs embed a feature known as speculative execution, driven by a branch predictor. The branch predictor improves performance by speculatively executing the instructions following a conditional branch, before the branch condition has
been finalized. Without the branch predictor, branch instructions introduce a bottleneck, delaying
execution until they are resolved. In some cases, it turns out that the branch predictor made an
incorrect prediction, causing the incorrect instructions to be speculatively executed. We refer to
this as misprediction. In such cases, the CPU will roll back the incorrectly executed instructions to
return to a valid program state. However, traces of the mispredicted instructions may persist in the
microarchitectural state of the CPU, such as its cache. Conditional branch instructions are regularly used to restrict user access to intended memory, e.g. within the bounds of an array. However,
when such a branch is mispredicted, memory accesses are not restricted by the condition and out-of-bounds accesses may occur. Even after the CPU rolls back the mispredicted branch, the data
accessed during misprediction may persist in the cache of the CPU. This creates a vulnerability. If
an attacker can manipulate the misprediction to access secret data, out of bounds of the intended
memory region, it may be possible to subsequently leak this secret data from the CPU's cache. This is
the key to Spectre vulnerabilities.
The original Spectre paper [67] provides a proof of concept showing how Spectre vulnerabilities can be used to leak
secret data. In Listing 3.1 we show a simplified version of this proof of concept as an example of
a Spectre V1 vulnerability. The conditional instruction on line 1 guards the array access on line 2,
so that no array-out-of-bounds errors may occur. However, in speculative execution, misprediction
may cause the body of the if-condition (lines 2 and 3) to execute even when the condition is
violated. Consequently, during misprediction, array-out-of-bounds errors may occur. This creates
a vulnerability, as secret data may be accessed by an attacker who carefully controls the
out-of-bounds access. The value accessed on line 2 is used as part of an array index on line 3. At
the binary level, this array index will be converted to a memory address, which will be loaded into
the cache of the CPU. If an attacker is capable of measuring access times at different indexes of
array2, she can learn which index has been cached, and thereby learn the secret value (known
as a cache timing attack). In this chapter, we define an approach that is agnostic to the specific
side-channel attack used to learn the secret by considering all speculative out-of-bounds accesses as vulnerable.
1 if (x < array1_size) {
2     y = array1[x];
3     _ = array2[y * 256];
4 }
Listing 3.1: An illustration of a Spectre vulnerability.
In this chapter, we focus on analyzing the binary code of programs in order to discover Spectre vulnerabilities. In Listing 3.2, we show the Spectre vulnerability of Listing 3.1 again, but at
the assembly-code level. The comparison instruction (line 1) and conditional branch instruction
(line 2) form the if-condition (line 1 of Listing 3.1). If this branch is mispredicted, instructions
on lines 3–7 will execute with a value in register rdi greater than array1_size. Observe that
the array accesses of Listing 3.1, lines 2 and 3 correspond to memory loads on lines 4 and 7. If an
attacker can fully control the value in the 64-bit register rdi during misprediction she can make
the expression on line 4 ([rdi+rax]) point to any address in memory. This allows an attacker to
target secrets existing in memory at this point in execution. By accessing this secret data, it may
be leaked into the microarchitectural state of the CPU and subsequently leaked via a side-channel
attack.
1 cmp QWORD PTR [rip+0x2ee1], rdi    ; x < array1_size
2 jbe 1159                           ; if ( ... )
3 lea rax, [rip+0x2ed0]              ; array1
4 movsx rax, BYTE PTR [rdi+rax]      ; array1[x]
5 shl rax, 0x8
6 lea rdx, [rip+0x2ed0]              ; array2
7 mov al, BYTE PTR [rax+rdx]         ; array2[y * 256]
Listing 3.2: An illustration of a Spectre vulnerability in assembly code.
3.2.2 State-of-the-art mitigation
To address Spectre vulnerabilities, different compilers have added features that enable Spectre
mitigation. In this section we discuss the features added to Clang version 8 and Gcc version 10.
3.2.2.1 Clang
In Clang version 8, the feature to mitigate Spectre vulnerabilities is named speculative load
hardening [71] (SLH). SLH works by setting a mask during speculative misprediction to protect
sensitive load instructions. This protection is achieved by either erasing the address of the load
instruction or the loaded value. We show this concept in Listing 3.3. On line 1 the mask to protect
load instructions is initialized to 0. If the branch on line 4 is mispredicted, the instructions on lines 5
to 11 will execute with a value in register rdi violating the condition on line 3. Note, however,
the conditional move (line 5) and branch (line 4) share the same condition below or equal (jbe vs.
cmovbe). Therefore, a mispredicted branch will trigger the conditional move and the mask will be
set to 0xff...f. On the other hand, if no misprediction occurs (register rdi satisfies the condition
on line 3), the conditional move will not trigger, and the mask will remain 0. On line 9 this mask
erases the value loaded on line 7 if misprediction occurred; otherwise, the value is left intact.
1  xor rax, rax
2  mov rcx, 0xffffffffffffffff
3  cmp QWORD PTR [rip+0x2ed3], rdi    ; x < array1_size
4  jbe 1178                           ; if ( ... )
5  cmovbe rax, rcx
6  lea rcx, [rip+0x2ebe]              ; array1
7  movsx rcx, BYTE PTR [rdi+rcx]      ; array1[x]
8  shl rcx, 0x8
9  or rcx, rax
10 lea rdx, [rip+0x2ebb]              ; array2
11 mov cl, BYTE PTR [rcx+rdx]         ; array2[y * 256]
Listing 3.3: Clang's speculative load hardening. The hardening instructions (lines 1, 2, 5 and 9) are used to protect the value loaded on line 7.
In order to determine which load instructions to protect, SLH performs a simple analysis to
determine if the address accessed by a load instruction may be affected by misprediction. This
approach assumes memory loads that correspond to reading stack variables or global variables
(in the source code) are not affected by misprediction. Conversely, it assumes all other memory
accesses (array accesses, pointer accesses, heap accesses, etc.) are affected by misprediction. This
is implemented by checking if a load instruction uses a non-stack or non-instruction pointer register
as part of its address. This approach will discover all Spectre vulnerabilities, as long as neither the stack pointer register (rsp) nor the instruction pointer register (rip) is used in a conditional branch.¹
On the other hand, using a non-stack or non-instruction pointer register does not guarantee that
the address can be affected by misprediction. Therefore, this approach prioritizes mitigating all
memory load instructions vulnerable to Spectre at the expense of also applying mitigation to load
instructions not vulnerable to Spectre. As a result, SLH introduces unnecessary performance degradation by mitigating non-vulnerabilities.
We show a handcrafted example of Clang unnecessarily applying mitigation in Listing 3.4. The
load instruction on line 4 can only access a single address in global memory. However, since register rdx is used, mitigation is applied to this instruction on line 5. This highlights the improvement
to be gained by our approach in defining a strict list of requirements for which load instructions
are to be considered vulnerable.
¹ To the best of our knowledge, Clang does not generate such instructions.
1 lea rax, [rip+0x2ecb]
2 lea rdx, [rax+rax]
3 sub rdx, rax
4 mov eax, DWORD PTR [rdx]
5 or eax, ecx
Listing 3.4: An example highlighting the limitations of Clang’s speculative load
hardening. The load instruction on line 4 is not vulnerable, but Clang applied mitigation
on line 5.
3.2.2.2 Gcc
Gcc version 10 added a compiler built-in named __builtin_speculation_safe_value. This allows developers to tag array indexes for protection. At the machine-code level, Gcc achieves this by
introducing an LFENCE instruction prior to the load instruction. This instruction serializes memory
accesses and therefore mitigates Spectre.
Our approach to mitigating Spectre vulnerabilities improves on those of Gcc and Clang. Gcc places the burden on developers to identify Spectre vulnerabilities, which, like all manual interventions, may be incomplete and error-prone. Clang automates mitigation, but at the cost of applying mitigation to non-vulnerable loads, which increases the performance penalty beyond what is strictly necessary to mitigate Spectre vulnerabilities.
3.3 Our approach
In this section, we discuss our approach to automatically discover vulnerabilities in binary executables (Section 3.3.1). These discovered vulnerabilities are automatically mitigated without requiring
access to the source code, by placing mitigating code into the binary itself (Section 3.3.2).
3.3.1 Vulnerability discovery
We discussed in Section 3.2.1 that Spectre V1 vulnerabilities may cause out-of-bounds memory
reads during misprediction, and that this may be exploited by an attacker to read and leak secret data from memory. Therefore, in order to discover Spectre vulnerabilities, we analyze every
memory-reading instruction in the target program. In this section, we define a list of requirements
(R1–R4) that such an instruction must meet in order to be considered vulnerable. We refer to the
memory-load instruction under investigation as l.
3.3.1.1 R1: Speculative execution window
CPUs can only perform speculative execution on a finite number of instructions before these instructions either must be rolled back (e.g. in the case of misprediction), or committed if no misprediction occurred. We refer to this number of instructions as the speculative execution window
NSEW. We leverage a criterion for discovering Spectre vulnerabilities in past work, Oo7 [70], that
requires l to be at most NSEW instructions away from a conditional branch c. If the distance between
c and l exceeds NSEW, the instructions comprising the condition will either have been committed
or rolled back. Therefore, misprediction of this branch can have no effect on the values loaded by
l. We assume any conditional branch may be the source of speculative misprediction. The specific value for NSEW varies between different CPU models and so we leave this as a user-specified
parameter.
In order to measure the distance between instructions c and l, we leverage a control-flow graph
(CFG). Starting at c, we perform a depth-first search (DFS) of up to NSEW instructions. We deviate
from the textbook DFS algorithm by allowing nodes to be revisited if they are re-encountered via fewer instructions than on the previous visit. Any load instruction encountered in this DFS fulfills R1. We conduct this DFS in an inter-procedural manner, by following
call and return instructions. This is necessary to account for Spectre vulnerabilities crossing function boundaries. However, our DFS is context-sensitive, meaning when we encounter a return
instruction in a function, we only follow the edge that matches the most recent call edge. In other
words, we assume misprediction cannot occur on a return instruction. We employ this trade-off for
scalability.
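To make the traversal concrete, below is a minimal Python sketch of this bounded search. The cfg object, with its successors() and is_load() helpers, is a hypothetical interface, and the call/return bookkeeping for context sensitivity is omitted for brevity; this illustrates the windowed search, not our exact implementation.

def loads_within_window(cfg, branch_addr, n_sew):
    """Collect load instructions reachable from conditional branch
    branch_addr within a speculative execution window of n_sew
    instructions. A node is revisited only when it is re-encountered
    via strictly fewer instructions than on the previous visit."""
    best = {branch_addr: 0}   # fewest instructions needed to reach each address
    stack = [(branch_addr, 0)]
    loads = set()
    while stack:
        addr, dist = stack.pop()
        for succ in cfg.successors(addr):    # hypothetical helper
            d = dist + 1
            if d > n_sew:
                continue                     # outside the speculative window
            if succ in best and best[succ] <= d:
                continue                     # already reached at least as cheaply
            best[succ] = d
            if cfg.is_load(succ):            # hypothetical helper
                loads.add(succ)              # this load fulfills R1
            stack.append((succ, d))
    return loads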
3.3.1.2 R2: Input-reachable
In order for an attacker to exploit a Spectre vulnerability, it must be possible to control the address
that is loaded out-of-bounds via user input. This can be modeled by confirming the presence of
a data flow between the address loaded by l and an input function. Note, however, the length of
this data-flow is not subject to the NSEW constraint. As a result, the distance between the source
of attacker’s control and the Spectre vulnerability may be very large, crossing multiple function
boundaries. Since inter-procedural data-flow analysis is an open problem in binary program analysis, we over-estimate attacker control by relying on control-flow analysis instead. Specifically, we
assume control-flow implies data-flow. That is, if l is reachable in the CFG from an input function, it satisfies R2. Similar to R1, we establish control-flow reachability in an inter-procedural
and context-sensitive manner. The source of potential attacker input is both context dependent and
varies between different programs. We leave the definition of input functions that represent sources
of attacker control as a user-specified parameter.
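As a simplified sketch of this reachability check, the following uses angr's CFG together with networkx, taking read and getc as the user-specified input functions (as in our Grep configuration). It is context-insensitive, unlike the full analysis described above, and the binary path is illustrative.

import angr
import networkx as nx

proj = angr.Project("./grep", auto_load_libs=False)   # illustrative target
cfg = proj.analyses.CFGFast()

# Addresses of the user-specified input functions (their PLT stubs).
plt = proj.loader.main_object.plt
input_addrs = [plt[name] for name in ("read", "getc") if name in plt]

def satisfies_r2(load_addr):
    """R2: the load is control-flow reachable from an input function."""
    target = cfg.model.get_any_node(load_addr, anyaddr=True)
    for addr in input_addrs:
        source = cfg.model.get_any_node(addr)
        if source is not None and target is not None \
                and nx.has_path(cfg.graph, source, target):
            return True
    return False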
3.3.1.3 R3: Conditional branch
Spectre vulnerabilities exploit behavior that may exclusively happen in speculative misprediction.
Since we assume any conditional branch may trigger speculative misprediction, instruction l must
be affected by a conditional branch. This means that this branch must exercise control over which
addresses may be accessed. To confirm this control, we employ symbolic execution. Symbolic
execution is used to express program behavior as symbolic expressions over symbolic variables.
Conditional branches are modeled as constraints on these symbolic variables. Therefore, to determine if a conditional branch affects the address of a load instruction, we inspect the symbolic
expression representing this address and the path constraints. If these share any symbolic variables, l satisfies R3. Consider Listing 3.2: performing symbolic execution will uncover the constraint
(rdi < array1_size), where rdi and array1_size are symbolic variables, after executing the
instructions on lines 1 and 2. When symbolic execution encounters the load on line 4 the symbolic
expression representing the loaded address is (array1 + rdi). We can see the constraint (rdi
< array1_size) and the load address expression (array1 + rdi) share the symbolic variable rdi.
This indicates that the load instruction is affected by the conditional branch and satisfies R3. In
order to ensure termination of symbolic execution, we limit the number of instructions executed to
NSEW. Note, if l is not affected by any conditional branch, out-of-bounds accesses may still occur,
but these will be unrelated to speculative misprediction, and therefore not Spectre vulnerabilities.
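This check maps naturally onto angr's breakpoint mechanism: claripy expressions expose the names of the symbolic variables they contain, so R3 reduces to a set intersection. A minimal sketch follows; branch_addr is an assumed starting point, and the surrounding exploration loop is omitted.

import angr

def check_r3(state):
    """Fires before every memory read during symbolic execution."""
    addr_ast = state.inspect.mem_read_address
    if addr_ast is None or addr_ast.concrete:
        return                      # concrete address: no symbolic dependence
    addr_vars = addr_ast.variables  # names of symbolic variables in the address
    if any(addr_vars & c.variables for c in state.solver.constraints):
        print("R3 satisfied for load at", hex(state.addr))

# state = proj.factory.blank_state(addr=branch_addr)  # branch under analysis
# state.inspect.b("mem_read", when=angr.BP_BEFORE, action=check_r3)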
3.3.1.4 R4: Mappable memory
Having established that l is affected by a conditional branch (R3), our next goal is to
analyze what memory may be accessed during misprediction. The purpose of the conditional
branch is to restrict the memory that may be read by l. During misprediction, there is a mismatch
between the instructions executing speculatively, and the intended instructions, with respect to the
condition. In other words, the instructions are executing with a violated condition, and therefore
the memory that may be read by l is out of bounds of the intended memory. The purpose of this
requirement is to ensure that this out-of-bounds memory is actually mappable.
We illustrate the purpose of this in Listing 3.5. The condition on line 1 clearly affects the load
instruction on line 2. However, if this condition is mispredicted, then the only address that can be
loaded is 0. As address 0 is not mappable, this does not create a Spectre vulnerability.
1 if (ptr != NULL) {
2     y = ptr[0];
3 }
Listing 3.5: The if-condition on line 1 guards the array access on line 2, but the only address
that can be accessed during misprediction is 0.
In order to check R4, we process the constraints c0, c1, ..., cn and the symbolic address a of l, gathered during symbolic execution when confirming R3. The memory accessible by l is any value of a for which (c0 ∧ ... ∧ cn) is satisfiable. Recall that a prerequisite for Spectre vulnerabilities is a mispredicted branch. As explained in Section 3.2.1, misprediction occurs when a conditional branch is followed while its condition is violated. We model the presence of at least one mispredicted branch by considering values of a for which at least one of the symbolic constraints is negated (¬c0 ∨ ... ∨ ¬cn). In order to determine if any mappable memory is accessible during
misprediction, we use a constraint solver to determine if one of these possible values for a falls
within a user-specified lower Al and upper Au bound address.
With respect to Listing 3.5, at line 2 the symbolic variable ptr will be constrained as ptr ≠ 0. By negating the constraint, to model misprediction, we have ptr = 0, showing that this is not a
vulnerability.
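The satisfiability check translates directly into claripy (the constraint front-end used by angr, backed by Z3). The sketch below assumes the constraints and address AST were collected as described above; it is illustrative rather than our exact implementation.

import claripy

def satisfies_r4(constraints, addr_ast, a_lo, a_hi):
    """R4: can the load address both violate at least one branch
    condition (modeling misprediction) and land in mappable memory?"""
    if not constraints:
        return True   # no guarding condition recovered: stay conservative
    solver = claripy.Solver()
    # At least one mispredicted branch: (not c0) or ... or (not cn).
    solver.add(claripy.Or(*[claripy.Not(c) for c in constraints]))
    # The address must fall in mappable memory (unsigned comparisons).
    solver.add(claripy.UGE(addr_ast, a_lo))
    solver.add(claripy.ULE(addr_ast, a_hi))
    return solver.satisfiable()

For Listing 3.5, the only constraint is ptr ≠ 0; negating it forces the address to 0 + 0x10, which lies below the lower bound Al, so the solver reports unsatisfiable and the load is classified as non-vulnerable.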
3.3.2 Vulnerability mitigation
In response to the discovery of Spectre vulnerabilities, Intel released an article explaining a mitigation for Spectre vulnerabilities [76] by placing an LFENCE instruction between the conditional
instruction and the vulnerable load instruction. The LFENCE instruction is a serialization instruction
that prohibits execution of subsequent instructions, until all prior instructions have been committed. By placing this instruction between the conditional and load instructions, it will ensure that
speculative execution does not access the load instruction until the conditional instruction is committed. Since this conditional instruction will not be committed when misprediction occurs, this
successfully mitigates the vulnerability. In the article, the mitigation is performed either by the
developer inserting the LFENCE instruction at the appropriate location in the source code, or by
setting a flag with the Intel compiler. In this chapter, we improve on this mitigation by automatically identifying vulnerable load instructions and eliminating the requirement for source code by
inserting the LFENCE instruction into the executable code.
Modifying programs at the binary level is a challenging task. Many instructions are sensitive
to their relative position in the executable file, for example, relative jump instructions expect their
target to be at a precise number of bytes away. Consequently, naively inserting patch instructions
will break this expectation and most likely render the program inoperable. Different strategies exist
for circumventing this issue, such as trampolining. Trampolining involves overwriting the instruction(s) located at the desired patching location with an instruction branching to a previously-unused
sequence of bytes. These bytes are then repurposed to implement the overwritten instruction(s),
as well as the new patching instructions. The trampoline finishes with an instruction branching
back to the instruction following the trampoline branch. While this approach successfully modifies the behavior of the executable, the additional branching induces an extra performance penalty
unrelated to the mitigation performance penalty. For this reason, performing mitigation within the
binary itself generally incurs a higher performance penalty than by mitigating the source code.
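To illustrate the mechanics of trampolining, here is a simplified Python sketch that splices an LFENCE into a code buffer. It assumes the displaced instructions total at least five bytes (the size of a jmp rel32) and are position-independent, and it maps virtual addresses one-to-one onto buffer offsets; production rewriters such as E9Patch handle the many cases where these assumptions fail.

import struct

JMP_REL32 = 0xE9                   # x86-64 near jmp with rel32 displacement
LFENCE = bytes.fromhex("0faee8")   # encoding of the lfence instruction

def insert_lfence_trampoline(code, patch_va, patch_len, cave_va):
    """Overwrite the instruction(s) at patch_va with a jump into a code
    cave that runs LFENCE, replays the displaced bytes, and jumps back."""
    displaced = bytes(code[patch_va:patch_va + patch_len])
    # Trampoline body: lfence, then the displaced original instruction(s).
    body = LFENCE + displaced
    # Jump from the end of the cave back to the instruction after the patch.
    return_va = patch_va + patch_len
    jmp_back = struct.pack("<Bi", JMP_REL32,
                           return_va - (cave_va + len(body) + 5))
    code[cave_va:cave_va + len(body) + 5] = body + jmp_back
    # Redirect the original site into the cave, NOP-padding the remainder.
    jmp_in = struct.pack("<Bi", JMP_REL32, cave_va - (patch_va + 5))
    code[patch_va:patch_va + patch_len] = jmp_in + b"\x90" * (patch_len - 5)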
3.4 Implementation
We implement our vulnerability discovery model as open source², by leveraging the angr binary
analysis platform [14] for CFG generation and symbolic execution. In our DFS on the CFG,
we preemptively terminate any paths that encounter a function call to an entry in the procedure-linkage table (PLT). Such function calls commonly correspond to library functions, which often act
as wrappers around operating-system-specific system calls. Instructions preceding and following
such a system call cannot be speculatively executed in the same speculative execution window, and
therefore cannot be part of the same Spectre vulnerability.
In its default configuration, angr’s symbolic execution engine uses the Z3 constraint solver [77]
to identify unreachable states via contradicting constraints. Any state identified as unreachable is
not followed. However, branch prediction executes instructions before the branch condition has
been committed, meaning this model of angr is too strict for identifying Spectre vulnerabilities. We
configure symbolic execution with the LAZY SOLVES flag, which does not employ the constraint
solver to preempt unreachable states. We implement our termination condition for symbolic execution, by making use of the history feature of angr’s symbolic states. This allows us to view which
basic blocks have been executed prior to any given symbolic state. We can compute the total number of instructions that have been symbolically executed by summing the number of instructions
in each of these blocks. When this number exceeds NSEW, we terminate symbolic execution. While
this does guarantee termination, we observed that it is insufficient to prevent path explosion and
keep the analysis scalable. To ameliorate this, we enforce a common, albeit artificial, restriction: we allow symbolic execution of every basic block only once. This affects the accuracy of our results
² https://github.com/usc-isi-bass/spectre_analysis
as load instructions vulnerable to Spectre may be reported as non-vulnerable for not being affected
by a conditional instruction (R3). We accept this compromise in accuracy to achieve scalability.
We leave the engineering effort to address the scalability issues incurred by symbolic execution –
without compromising accuracy – as future work.
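Condensed, the angr setup looks roughly as follows. Here branch_addr and N_SEW (216 on our evaluation CPU) stand in for user inputs, and the once-per-block restriction is omitted; this is a sketch of the configuration described above, not the full tool.

import angr

N_SEW = 216                 # speculative execution window (Section 3.5.1)
branch_addr = 0x401234      # hypothetical conditional branch under analysis

proj = angr.Project("./grep", auto_load_libs=False)

# LAZY_SOLVES: do not let the solver prune "contradictory" states, since
# speculative execution runs instructions before the condition is resolved.
state = proj.factory.blank_state(addr=branch_addr,
                                 add_options={angr.options.LAZY_SOLVES})

def instructions_executed(s):
    # Sum the instruction counts of all basic blocks in the state's history.
    return sum(proj.factory.block(a).instructions for a in s.history.bbl_addrs)

simgr = proj.factory.simgr(state)
while simgr.active:
    simgr.step()
    # Terminate any path that has exhausted the speculative window.
    simgr.move(from_stash="active", to_stash="pruned",
               filter_func=lambda s: instructions_executed(s) > N_SEW)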
For vulnerability mitigation, we use the E9Patch static binary rewriter for x86_64 [25]. E9Patch
provides a plugin frontend allowing patch definition at the assembly instruction level. It inserts
these patches, while also handling the additional modifications necessary to preserve correctness.
We leverage this feature to instruct E9Patch to insert an LFENCE instruction immediately prior to
the discovered vulnerable load instruction via a trampoline. We investigate the overhead introduced
by E9Patch through binary modification (unrelated to mitigation) in Section 3.5.2.
3.5 Evaluation
In this section, we define the different parameters we use in our evaluation in Section 3.5.1. Our
aim is to answer the following research questions (RQ). RQ1: What performance penalty, with respect to execution time increase, is incurred by applying mitigation to load instructions identified
by DIAMONDS? RQ2: Does selective mitigation (enforcing requirements R1-R4) incur a reduced
performance penalty compared to the state of the art? To answer these research questions, we evaluate DIAMONDS by analyzing real-world executables and reporting our findings in Section 3.5.2.
We also compare DIAMONDS against Clang’s SLH mitigation in Section 3.5.3.
3.5.1 Setting parameters
We configure the parameters of our evaluation for the Intel i7-6820HQ CPU. We confirm this CPU
is vulnerable to Spectre by employing the proof-of-concept Spectre vulnerability and exploit supplied with [67]. To estimate NSEW for our evaluation CPU, we take this proof-of-concept Spectre
vulnerability and separate the conditional branch from the load instruction with an increasing number of nop instructions. We illustrate this idea in Listing 3.6. After every increment in number of
nop instructions, we run the exploit and observe the reported success or failure. After a certain
number of increments, the exploit only reports failures. This indicates that misprediction of the
conditional branch no longer affects the load instruction. Since the only difference between the succeeding exploit and failing exploit is the number of instructions separating the conditional branch
and load instruction, we assume the failure is caused by the limited size of NSEW. Using this approach, we identify NSEW for our evaluation CPU as 216 instructions. This boundary appears to be
tight, since the success rate for the exploit with 216 nop instructions is 100%, but it is 0% for 217
nop instructions. Therefore, for our evaluation we use NSEW = 216.
1  cmp QWORD PTR [rip+0x2ee1], rdi    ; x < array1_size
2  jbe ...                            ; if ( ... )
3  nop
4  ...
5  nop
6  lea rax, [rip+0x2ed0]              ; array1
7  movsx rax, BYTE PTR [rdi+rax]      ; array1[x]
8  shl rax, 0x8
9  lea rdx, [rip+0x2ed0]              ; array2
10 mov al, BYTE PTR [rax+rdx]         ; array2[y * 256]
Listing 3.6: We estimate NSEW by manually increasing the number of instructions
separating the conditional branch (line 2) and load instruction (line 7) until the
vulnerability is no longer exploitable.
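A small harness automates this search. In the sketch below, build_poc is a hypothetical helper that assembles the proof of concept with the given number of nops, and the exploit is assumed to print "success" when it recovers the secret; both are stand-ins for our actual test setup.

import subprocess

def success_rate(n_nops, trials=100):
    binary = build_poc(n_nops)   # hypothetical: assemble PoC with n_nops nops
    wins = sum("success" in subprocess.run([binary], capture_output=True,
                                           text=True).stdout
               for _ in range(trials))
    return wins / trials

n = 0
while success_rate(n) > 0:
    n += 1
print("estimated N_SEW:", n - 1)   # reports 216 on the Intel i7-6820HQ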
For R2, we need to define sources of attacker input. In Table 3.1, we show the applications
we analyze in our evaluation, paired with the input functions we specify for this requirement. For
R4, we need a specified lower and upper bound address of mappable memory. We select bounds
Al = 0x555555554aaa and Au = 0x7fffffffffff in order to match the bounds specified in the
Linux kernel for loading ELF executables.
3.5.2 Spectre discovery and mitigation
For each of the applications in Table 3.1 we analyze the corresponding binary. In Table 3.2 we
show the total number of memory load instructions in the target binary and how requirements
R1-R4 cumulatively filter out these instructions as non-vulnerable. The bottom row of this table
represents the number of load instructions that meet all the requirements and therefore require
mitigation. We refer to these as the mitigation points. We show an example of such a mitigation
Table 3.1: The binary files we analyze for our evaluation with their input functions.
Binary Version File Size Specified Input Functions
Grep v3.11.12-d1c3f 692KB getc, read
Awk 20240311 376KB getc
Bc 6.7.5 740KB read
Table 3.2: The number of load instructions remaining after enforcing requirements R1-R4. The
load instructions in the last row, meeting all requirements, require mitigation.
Requirement(s) Grep Awk Bc
All load instructions 6,525 4,967 6,565
R1 5,329 4,216 5,082
R1,R2 1,268 4,168 4,761
R1,R2,R3 188 568 473
R1,R2,R3,R4 153 535 449
point in Grep in Listing 3.7. This code snippet is reachable from the input function read, meaning
we establish an attacker may have control over the address loaded on line 5, fulfilling R2. The
proximity of the conditional c (line 3) and load l (line 5) instructions indicates that l is indeed
within NSEW instructions from a conditional instruction. So, l and c may execute within the same
speculative execution window, implying R1 is fulfilled. Moreover, we can see l is affected by c
through use of the shared register r15. This means that c affects the memory that may be read by
l and so R3 is fulfilled. The bound placed on this register by c, in register rax, is defined outside
of the scope of symbolic execution. Therefore, symbolic execution cannot establish any concrete
constraints placed on the address read by l. In the absence of concrete proof that l cannot read
memory during misprediction, R4 is fulfilled conservatively.
1 mov rax, QWORD PTR [r14+0x8]
2 add r15, 0x10
3 cmp r15, rax
4 jae ...
5 mov rdi, QWORD PTR [r15]   ; LOAD
Listing 3.7: An example of a mitigation point in Grep.
We use E9Patch [25] to apply a patch to each target binary at every vulnerable load instruction shown in Table 3.2. We denote the original binary and the mitigated binary as BO and BM,
respectively. E9Patch reports 100% success in applying these patches for every binary.
With the patched binaries we can investigate RQ1, the performance penalty introduced in applying these patches. This penalty is proportional to the number of times the patch instructions
are executed. For example, if the patches are only applied to rarely executed code paths, we can
expect a minimal performance penalty. We select a task to perform with each of our target binaries
in order to count the number of times patches are executed. We show the specific task for each of
our target binaries in Table 3.3 and the number of times patch instructions execute while performing this task in Table 3.4. We see from this table that for each binary BM the patch instructions are
executed a significant number of times. Therefore, we need to measure the performance penalty
that these patch instructions introduce with respect to execution time.
To gain a better understanding of the performance penalty, we measure the difference in execution time between BO and BM when performing the task listed in Table 3.3. As explained in
Section 3.3.2, performing binary-only mitigation incurs a performance penalty separate from the
mitigation itself, due to trampolining. We perform an additional measurement: measuring the
performance penalty incurred by binary modification. For this measurement, we create an alternatively modified executable BN, by inserting a NOP instruction prior to every vulnerable load
instruction (where an LFENCE is placed in BM). We assume executing the NOP instructions will
contribute negligibly to the performance penalty and therefore any difference in execution time observed between BO and BN is incurred due to binary modification rather than mitigation. It is only
the difference in execution time between BM and BN that is incurred by mitigation. We show the
different execution times, and therefore the answer to RQ1, in Table 3.5. In this table, we see that
for each binary Grep, Awk and Bc the difference in performance penalty between the BM and BN is
smaller than between BN and BO. This indicates the majority of the performance impact is incurred
due to the penalty of performing binary modification, rather than the mitigation itself. We quantify
Table 3.3: The task we perform with each of our target binaries in order to measure the performance
impact.
Binary Task
Grep Search for two fixed strings.
Awk Match input with a regular expression.
Bc Sum numbers in a loop.
Table 3.4: Number of times a patch executed while performing the task listed in Table 3.3.
Binary Patches applied Patch executions
GrepM 153 20,557
AwkM 535 3,033,018
BcM 449 250,063
the ratio of the two factors contributing to performance penalty (binary modification and mitigation) by computing the execution time of BN as a percentage of BM. We show this in Table 3.6.
Observe that in each case the binary modification performance penalty makes up between 63.87%
and 99.47% of the total performance penalty. With a sufficiently advanced binary modification
tool, one may be able to decrease this penalty further. However, as binary modification is an open
area of research orthogonal to the contributions of DIAMONDS, we leave this as future work.
3.5.3 Comparison to speculative load hardening
To understand the benefit of performing a selective mitigation (RQ2), we compare DIAMONDS
against state of the art Spectre mitigation, implemented as SLH in the Clang compiler. In Section 3.5.3.1, we use DIAMONDS to identify cases in which SLH applies mitigation unnecessarily (referred to as superfluous mitigation). We highlight some of these mitigation points in Section 3.5.3.2. We eliminate superfluous mitigation to investigate the reduced performance impact,
discussed in Section 3.5.3.3.
Table 3.5: We measure the performance impact by comparing the execution times when performing
the task in Table 3.3. We compute the percentage increase in execution time over the original
binary.
Version Execution time Percentage increase
GrepO 44.254s -
GrepM 98.639s 122.89%
GrepN 96.421s 117.88%
AwkO 14.75s -
AwkM 102.490s 594.85%
AwkN 65.470s 343.86%
BcO 26.808s -
BcM 27.331s 1.95%
BcN 27.187s 1.41%
Table 3.6: We show the ratio of binary modification performance penalty to mitigation performance
penalty by computing the execution time of BN as a percentage of BM.
Binary BN Execution time BM Execution time Percentage
Grep 96.421s 98.639s 97.75%
Awk 65.470s 102.490s 63.87%
Bc 27.187s 27.331s 99.47%
3.5.3.1 Identifying superfluous mitigation points
As discussed in Section 3.2.2.1, Clang aims to mitigate all Spectre-vulnerable load instructions,
at the cost of also applying mitigation to non-vulnerable instructions (superfluous mitigation). We
use DIAMONDS to identify such superfluous mitigation points.
We build our evaluation executables with the SLH flag enabled. As SLH is a compiler-level
mitigation, it changes the binary that is produced through compilation. In order to ensure a correct
comparison, it is necessary to run DIAMONDS on these hardened binaries instead of re-using the
results from Section 3.5.2. In Table 3.7 we show the number of hardened load instructions that
meet the requirements of DIAMONDS. This table shows that in the target binaries most hardened
load instructions (between 75.38% and 95.66%) were mitigated unnecessarily.
3.5.3.2 Investigating superfluous mitigation
We investigate the cases of superfluous mitigation to identify common causes for Clang mitigating
non-vulnerable load instructions. For each of our requirements, R1-R4, we analyze a number
of load instructions that DIAMONDS classifies as non-vulnerable for violating the requirement.
Then, through manual analysis we confirm that the requirement is indeed violated, proving superfluous mitigation. Next, we explain the common causes for superfluous mitigation for each of our
requirements.
R1: In Listing 3.8, we show a load instruction that is classified as non-vulnerable, because it
is not within NSEW instructions from a conditional instruction and therefore violates R1. The load
instruction (line 9) is only reachable through the call to the library function free() (line 1), which we exclude from analysis. Note that the load instruction (line 9) uses the non-stack, non-instruction-pointer register r8. Therefore, Clang applies mitigation (line 10) to this load instruction.
We know this mitigation is superfluous, due to the violation of R1.
1  call 46e0 <free@plt>
2  mov r8, QWORD PTR [rsp+0x18]
3  mov r14, rsp
4  mov rax, QWORD PTR [rsp-0x8]
5  sar r14, 0x3f
6  lea rcx, [rip+0xffffffffffffffe8]
7  cmp rax, rcx
8  cmovne r14, rbp
9  mov rax, QWORD PTR [r8+0x1b0]    ; LOAD
10 or rax, r14
Listing 3.8: The load instruction (line 9) violates R1. Therefore, Clang applied
superfluous mitigation (line 10) to this load.
R2: In Listing 3.9, the load instruction (line 5) is only used in the context of printing the version information³ of Grep and exiting. When Grep is used in this mode, no input file is processed and
therefore this load instruction is unreachable with respect to control-flow via an input function.
This, in turn, means that an attacker will have no control over which address is loaded, violating
R2. However, the load instruction (line 5) uses the non-stack, non-instruction-pointer register rax and
so Clang applies mitigation (line 6). This mitigation is superfluous, due to violation of R2.
1 cmp BYTE PTR [rip+0x41070], 0x1    ; show_version global variable
2 jne 8a22
3 cmovne r10, r14
4 mov rax, QWORD PTR [rip+0x4062e]
5 mov rbp, QWORD PTR [rax]           ; LOAD
6 or rbp, r10
Listing 3.9: The load instruction (line 5) violates R2. Therefore, Clang applied superfluous
mitigation (line 6) to this load.
R3: In Listing 3.10, a pointer to stderr is retrieved from global memory (line 4). In the
immediately subsequent instruction, this pointer is dereferenced in the load instruction (line 5). It
is clear that no conditional instructions affect this load instruction between lines 4 and 5. Therefore,
misprediction cannot have an impact on which addresses may be loaded on line 5, implying R3
is violated. Clang applies mitigation (line 6) because the address of the load instruction uses the non-stack, non-instruction-pointer register r12. We know this mitigation is superfluous because R3 is
violated.
³ grep -V
1 test edi, edi
2 je 685f <usage+0x17f>
3 cmove rax, r15
4 mov r12, QWORD PTR [rip+0x428d4]   ; stderr struct
5 mov rbx, QWORD PTR [r12]           ; LOAD
6 or rbx, rax
Listing 3.10: The load instruction (line 5) violates R3. Therefore, Clang applied superfluous
mitigation (line 6) to this load.
R4: In Listing 3.11, the condition (line 1) and branch (line 2) instructions ensure that register r14 is non-zero for the subsequent instructions (lines 3–6). Mispredicting this branch
will constrain this register to 0. Consequently, the only address that can be read in the load
instruction (line 5) during misprediction is (0 + 0x10). Since this address is less than our
specified lower bound Al, this load instruction violates R4. In other words, misprediction
cannot be used to read any data and therefore this load instruction is not vulnerable. Clang
applies mitigation (line 6) because the address of the load instruction uses the non-stack, non-instruction-pointer register r14. We know this mitigation is superfluous, due to violation of R4.
1 test r14, r14
2 je 31b74
3 cmove rax, r13
4 nop DWORD PTR [rax+rax*1+0x0]
5 mov rbx, QWORD PTR [r14+0x10]   ; LOAD
6 or rbx, rax
Listing 3.11: The load instruction (line 5) violates R4. Therefore, Clang applied superfluous
mitigation (line 6) to this load.
3.5.3.3 Eliminating superfluous mitigation
In Sections 3.5.3.1 and 3.5.3.2 we have seen and confirmed, respectively, that DIAMONDS identifies cases in which SLH applies mitigation unnecessarily. This enables us to investigate RQ2,
by eliminating the mitigating instructions related to these superfluous mitigation points. This
elimination allows us to measure the performance benefit we gain by only applying mitigation
in accordance with DIAMONDS's requirements (RQ2). We implement this elimination by using NOP
instructions to overwrite the SLH instructions introduced by Clang in binary BSLH for those load
instructions that DIAMONDS has identified as superfluous, to produce binary BSLH−D. We assume
Table 3.7: The number of SLH load instructions that meet a requirement R1-R4.
Requirement(s) GrepSLH AwkSLH BcSLH
All SLH loads 3,459 2,429 3,096
R1 3,052 2,186 2,521
R1,R2 429 2,138 2,311
R1,R2,R3 182 729 722
R1,R2,R3,R4 150 598 584
SLH loads eliminated 95.66% 75.38% 81.14%
executing these NOP instructions contributes negligibly to the performance penalty. We take this
approach – comparing BSLH to BSLH−D – to investigate RQ2, rather than comparing BSLH directly to
BM, because BM includes the performance penalty introduced by performing binary modification,
as shown in Section 3.5.2.
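A minimal sketch of this elimination step follows, assuming superfluous maps each superfluous mitigation point to the file offset and byte length of the SLH instructions Clang emitted for it (information recovered from disassembly); the actual bookkeeping is more involved.

NOP = 0x90   # single-byte x86 nop

def eliminate_slh(path_in, path_out, superfluous):
    """Overwrite superfluous SLH instructions with NOPs to produce B_SLH-D."""
    with open(path_in, "rb") as f:
        data = bytearray(f.read())
    for offset, length in superfluous.items():
        data[offset:offset + length] = bytes([NOP] * length)
    with open(path_out, "wb") as f:
        f.write(data)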
Similar to Section 3.5.2, we start by measuring the number of mitigation instructions executed
while performing the tasks listed in Table 3.3. We show this in Table 3.8. We see that in each case
using DIAMONDS to eliminate superfluous mitigation reduces the total number of SLH instructions executed, both overall and as a fraction of all instructions (by between 41.19% and 99.68%). In
Table 3.9 we show the difference in execution time between binaries BSLH and BSLH−D. By eliminating superfluous mitigation, we reduce the performance penalty in each of our evaluated binaries
by 6.95% to 20.26%, answering RQ2. Comparing binaries BSLH−D in Table 3.9 to binaries BM in Table 3.5, we observe that the penalty is higher in the latter. This is because Table 3.5
includes the performance penalty incurred due to binary modification.
3.6 Related work
In Oo7 [70], Wang et al. propose an approach towards discovering and mitigating Spectre vulnerabilities. Here, taint analysis is used to confirm attacker control over conditional branches and
loaded memory addresses that occur within a speculative execution window. This is similar to
DIAMONDS in which we use the speculative execution window to limit the pairs of conditional
Table 3.8: The number of SLH mitigation instructions executed while performing the task in Table 3.3. We measure this for the default SLH binary (BSLH), as well as binary with SLH eliminated
by DIAMONDS (BSLH−D). We show what fraction of all executed instructions are mitigation instruction and we show the percentage decrease from BSLH to BSLH−D.
Binary SLH Executed Fraction Percentage Decrease
GrepSLH 4,569,473 0.1335745 -
GrepSLH-D 14,836 0.0004337 99.68%
AwkSLH 8,581,597 0.120718 -
AwkSLH-D 5,047,072 0.070997 41.19%
BcSLH 2,194,003 0.063085 -
BcSLH-D 1,036,624 0.029807 52.75%
Table 3.9: The performance benefit of eliminating superfluous SLH instructions as indicated by
DIAMONDS.
Binary Execution time Percentage Increase Reduction in Performance Penalty
GrepO 44.254s - -
GrepSLH 78.334s 77.01% -
GrepSLH-D 63.824s 44.22% 18.52%
AwkO 14.75s - -
AwkSLH 66.16s 348.54% -
AwkSLH-D 61.56s 317.36% 6.95%
BcO 26.808s - -
BcSLH 34.202s 27.58% -
BcSLH-D 27.274s 1.74% 20.26%
branches and load instructions to consider. However, Oo7 does not analyze the relationship between the conditional branch and load instruction past the fact that both are tainted. In contrast,
DIAMONDS confirms that the conditional branch affects the memory address that is loaded. While
Oo7 does cover mitigation of Spectre vulnerabilities, this is achieved by placing an LFENCE instruction into compiler-generated assembly code. This does measure the performance penalty incurred
from mitigation, but does not capture the penalty incurred by performing binary-only mitigation, unlike DIAMONDS. In Spectator [72], Guarnieri et al. use symbolic execution to prove load
instructions are invulnerable to Spectre during misprediction by asserting that the loaded address is
constrained to a single value. This approach is implemented on a subset of x86_64 assembly and, to the best of our knowledge, not evaluated on real-world executables. For symbolic-execution-based
approaches, like Spectator and DIAMONDS, it is crucial to evaluate on real-world executables to
ensure the scalability of the approach.
3.7 Threats to validity
In this section, we discuss open research problems in binary program analysis and how these affect
the accuracy of DIAMONDS.
3.7.1 Alias analysis
In binary program analysis, alias analysis aims to determine if two memory access instructions
access the same memory. In the general case, this problem is undecidable. Consequently, it is
inevitable that in some cases DIAMONDS will fail to identify the relationship between a conditional
and load instruction with symbolic execution. This will lead DIAMONDS to incorrectly report a
violation of R3.
3.7.2 Control-flow analysis
In Section 3.3 we use a control-flow graph (CFG) to establish control-flow reachability between
input functions and a load instruction under analysis. Accuracy of this process may be affected
by indirect branches, where the target of the branch is computed, rather than a constant address or
offset, which may lead to missing control-flow edges. Identifying the target of indirect branches is an
open problem in binary analysis research. If the CFG is missing control-flow edges, DIAMONDS
may incorrectly conclude that R2 is violated, leading to false negative results.
3.8 Conclusion
In this chapter we presented DIAMONDS, a novel approach towards discovering and mitigating
Spectre V1 vulnerabilities. In this approach, we formalize four requirements, R1-R4, a memory
load instruction must meet in order to be considered vulnerable. We develop a proof of concept
implementing these requirements and evaluate it on three real-world executables. This evaluation
shows that using DIAMONDS incurs a smaller performance penalty than using the state-of-the-art compiler mitigation, implemented as SLH in Clang.
Chapter 4
Data Flows in You: Benchmarking and Improving Static
Data-flow Analysis on Binary Executables
Data-flow analysis is a critical component of security research. Theoretically, accurate data-flow
analysis in binary executables is very hard, due to complexities of binary code. Practically, many
binary analysis engines offer some data-flow analysis capability, by employing, often undocumented, approximations and assumptions about possible data flows. But so far, evaluating the accuracy of data-flow analysis of different binary analysis engines, and understanding the gaps, has
received little attention. We address this problem by introducing a data set of 215,072 microbenchmark test cases, mapping to 277,072 binary executables, created specifically to test data-flow analysis implementations. Additionally, we augment the data set with 6 real-world executables. Using
this data set, we evaluate three state-of-the-art data-flow analysis implementations, in angr, Ghidra and Miasm, and discuss their limitations, which lead to very low accuracy. We further propose three
model extensions to static data-flow analysis that improve accuracy. We implement the extensions
in angr, which increases its recall from 0.39 to 0.99, and improves precision from 0.13 to 0.32 on
our benchmarks.
4.1 Introduction
With the ever-increasing reliance of the modern world on software systems, the cat-and-mouse
game of finding and mitigating software vulnerabilities before they are exploited has reached critical levels. This is exacerbated by the growing complexity of software, which increases the burden
of ensuring its security. To assuage this burden, program analysis for system security has gained
more traction. Data-flow analysis is a critical component in program analysis for security, as understanding how information flows between the instructions of a program is tightly related to evaluating security. Unfortunately, static data-flow analysis is undecidable in the general case. Moreover,
when evaluating the security of a program, analyzing the source code is often considered to be
insufficient due to the What You See is not What You Execute phenomenon [1]. This creates the
necessity of data-flow analysis designed to operate on the binary instructions of programs.
Fortunately, while data-flow analysis may be undecidable in a general case, there are many
practical scenarios in which it is possible to perform data-flow analysis on small code segments
(e.g., within a function), by implementing certain approximations [14–16]. These approximations
manifest as assumptions made in the model and heuristics incorporated into the implementation
that may at times sacrifice correctness in order to achieve guaranteed termination (and indeed
scalability). These approximations impact correctness by causing non-existing data-flows to be
reported as well as existing data-flows to be missed. This impact has yet to be quantified and understood; once it is, it can hopefully be reduced by refining the data-flow implementations.
This chapter aims to benchmark data-flow approaches in binary analysis engines, to help us
gain an in depth understanding of their strengths and limitations. With such understanding it
is possible to refine the approximations to improve analysis accuracy, as we demonstrate. This
chapter makes the following contributions.
• We introduce alias classes, a novel concept for categorizing data-flows.
• We define and implement an open-source framework for generating microbenchmark test
cases for evaluating static data-flow analysis models with respect to our alias classes.
• We define and implement a framework for extracting data-flows from real-world programs
using dynamic analysis, to generate ground-truth information about existing data flows.
• With the provided frameworks, we generate a data-flow evaluation benchmark data set, consisting of 215,072 microbenchmark test cases mapping to 277,072 unique binary executables, as well as data flows from 6 real-world executables.
• We show the twofold utility of this data set:
1. We evaluate three state of the art static data-flow analysis approach implementations
in angr [14], Ghidra [15] and Miasm [16]. Our evaluation yields insights about how
accurate these analysis engines are with respect to different categories of data flows. To
the best of our knowledge, this is the first evaluation of accuracy of data-flow analysis
in binary analysis engines.
2. We propose three novel data-flow-model extensions, implement them in angr, and evaluate them on real-world executables. These extensions achieve nearly perfect recall,
while simultaneously improving precision of data-flow analysis on our benchmarks.
4.2 Binary data-flow analysis
To perform data-flow analysis on a program is to reason about the flow of information between
its instructions. Since data-flow between two instructions requires executing these instructions in
order, data-flow analysis is often built on top of control-flow analysis. Consequently, data-flow
analysis inherits the complexities of control-flow analysis, such as context sensitivity – the notion
that the next instruction in an execution path may be determined by the preceding instructions. For
example, the instruction following a return instruction is determined by the preceding, matching
call instruction. This complexity is part of a larger complexity caused by indirect branches, in
which case the control-flow depends on the data-flow. Due to the intermixed nature of control-flow
and data-flow analysis, in the general case, both problems are undecidable. Moreover, performing
these analyses on binary code, as opposed to source code, adds an extra layer of complexity due
to the loss of high-level semantic information such as control-flow structures, data types and data
structures.
One of the root causes of the undecidability of data-flow analysis lies in the aliasing problem,
i.e. determining if two instructions access the same data. In binary program analysis, this means
determining if a memory write instruction and a subsequent memory read instruction access memory at the same address. A data flow exists between these instructions if a control flow exists, and
if they must (or may) access the same data, such that the data written by the first instruction is read
by the second instruction, and has not been changed in the meantime. Stated differently, a data
flow exists between two instructions, if they form a link in a def-use chain [78].
In spite of data-flow analysis being undecidable, a number of theoretical algorithms [8, 75] and
implementations of binary data-flow analysis [14–16] have been created to yield approximate solutions. While the theoretical data-flow analysis algorithms make well-defined assumptions in order
to guarantee termination, in the implementations it is necessary for developers to deviate from these
theoretical models in order to achieve scalability, in addition to termination. This deviation manifests as additional, often undocumented, approximations (assumptions and heuristics), leading to
extra cost in analysis accuracy. We discuss examples of such approximations in Section 4.2.3. In
this chapter we use our benchmarks to quantify the accuracy cost of various approximations, which
enables us to pinpoint areas for improvement, and to implement and demonstrate benefit of several
improvements (Section 4.5).
4.2.1 Definitions
In order to rigorously define our approach, we extend the existing concept of data flow with the
following definitions.
4.2.1.1 Degree of data-flow
We define three different kinds of data flow between any pair of instructions in a program: unconditional data flow, possible data flow and impossible data flow.
A pair of instructions have an unconditional data flow, if on every execution of the program, in
which these instructions are executed in order, there is a data flow from the first instruction to the
second. We show an example of an unconditional data flow in Listing 4.1.
1 mov [rdi], dl
2 mov al, [rdi]
Listing 4.1: The instruction pair on lines 1 and 2 have an unconditional data-flow.
A pair of instructions have a possible data flow, if on at least one execution of the program
there is a data flow from the first instruction to the second. We show an example of a possible data
flow in Listing 4.2. In Listing 4.2, the data written to memory by the instruction on line 1 will
be read from memory by the instruction on line 2, if and only if the value in register rsi is 0. If
this value is dependent on input to the program, then in some executions there will be a data flow,
while in others not.
1 mov [rdi], dl
2 mov al, [rdi+rsi]
Listing 4.2: The instruction pair on lines 1 and 2 have a possible data-flow.
A pair of instructions have an impossible data flow, if on every execution of the program, there
is not a data flow from the first to the second. We show an example of an impossible data flow in
Listing 4.3. In Listing 4.3, the 1 byte written to memory by the instruction on line 1 will never be
read from memory by the instruction on line 2. Regardless of the value in register rdi (used as the address on line 1), this value can never be equal to 8 less than itself (used as the address on line 2).
1 mov [rdi], dl
2 mov al, [rdi-8]
Listing 4.3: The instruction pair on lines 1 and 2 have an impossible data-flow.
4.2.1.2 Data-flow scope
We define the scope of a data flow as being either intra-procedural, inter-procedural or both. A
data flow is considered intra-procedural if it occurs in a single execution of a function. We show an
example of an intra-procedural data flow in Listing 4.4. Both instructions of the data flow, line 2
and 3, occur within the function f and the data flow manifests when executing the function once.
1 f:
2     mov [rbp-0x8], rdi
3     mov rax, [rbp-0x8]
4     ret
Listing 4.4: The instruction pair on lines 2 and 3 have an intra-procedural data flow.
Conversely, a data flow is inter-procedural if its instructions span multiple functions, or multiple executions of a single function. We show an example of an inter-procedural data flow in
Listings 4.5 and 4.6. In Listing 4.5, the data flow between the instructions on lines 2 and 7 is
inter-procedural, because these instructions span two functions, f_caller and f.
In Listing 4.6, the data flow between the instructions on lines 10 and 9 is inter-procedural. Even
though there is no control flow from line 10 to line 9 within function f, this control-flow exists
inter-procedurally, due to the two invocations of function f in function f_caller. Moreover, the
data written by the instruction on line 10 (on the first invocation of f) is read by the instruction on
line 9 (on the second invocation of f), hence the existence of a data-flow.
1 f_caller:
2     mov [rbp-0x8], rbx
3     lea rdi, [rbp-0x8]
4     call f
5     ret
6 f:
7     mov rax, [rdi]
8     ret
Listing 4.5: The instruction pair on lines 2 and 7 have an inter-procedural data flow.
1  f_caller:
2      mov [rbp-0x8], 0
3      lea rdi, [rbp-0x8]
4      mov rsi, 0
5      call f
6      call f
7      ret
8  f:
9      mov rax, [rdi]
10     mov [rdi], rsi
11     ret
Listing 4.6: The instruction pair on lines 10 and 9 have an inter-procedural data flow.
4.2.1.3 Data-flow channel
Given that a data flow is caused by a pair of instructions writing and reading the same data, we
define the channel of a data flow as the medium through which the data flows, i.e., the register or
memory that is accessed by both instructions. We indicate this channel by using a token matching
the name of the register or mem for memory. We use a single token for all memory accesses, because
the accessed memory address may change with every execution of the instruction.
We show an example of data-flow channels in Listings 4.7 and 4.8. In Listing 4.7, the instructions on lines 1 and 2 have the data-flow channel rbx as they both access this register. In Listing 4.8,
the instructions on lines 1 and 2 write and read memory at the same address respectively, so have
the data-flow channel mem.
1 mov rbx, 0
2 mov rax, rbx
Listing 4.7: The instruction pair on lines 1
and 2 have a data-flow channel rbx.
1 mov [rbp-0x8], rdi
2 mov rax, [rbp-0x8]
Listing 4.8: The instruction pair on lines 1
and 2 have a data-flow channel mem.
4.2.2 Our scope
In this chapter, we separate the intermixed nature of control flow and data flow in order to focus
exclusively on the latter. We posit that the main challenge of data-flow analysis, independently
of control flow, is approximating a solution to the aliasing problem, that is, whether or not there
exists a data-flow between two instructions that access the same memory. Therefore, in all our test
cases, we focus on data flows between pairs of memory write and read instructions, i.e. with mem
in the data-flow channel. We further focus on intra-procedural data-flow analysis. Inter-procedural
analysis is significantly more difficult than intra-procedural analysis, due to its increased scope.
Therefore, many data-flow analysis implementations only support intra-procedural analysis [15,
16]. We leave inter-procedural data-flows for future work.
4.2.3 Data-flow analysis implementations
Binaries of modern software often have thousands of functions, each with hundreds of thousands
of possible data-flows that must be analyzed. Due to this scale and complexity, developers of
data-flow analysis implementations are required to introduce approximations in order to make the
analysis usable on real-world executables. This includes implementing heuristics to determine whether a pair of instructions access the same memory, and whether the data written by the first instruction is overwritten before being read by the second instruction.
A heuristic to determine whether two instructions access the same memory must handle cases where at least one of the addresses is undefined (see Listing 4.2). We show in Section 4.5.3 that angr uses
a heuristic that equates all undefined memory addresses. Miasm, on the other hand, differentiates
addresses that are expressed with different syntax in the machine code.
We illustrate, using function calls, the complexity of determining whether or not a memory value is overwritten. Refer to Listings 4.9 and 4.10. In both these listings the f_target functions are identical, writing to stack memory (labeled as Write) and subsequently reading from stack memory (labeled as Read). However, these Write and Read instructions are interrupted by a function call to f_callee. In Listing 4.10, the callee function on line 8 overwrites the memory written by the Write instruction. Therefore, this listing shows an impossible data flow between the instructions on lines 2 and 5. On the other hand, in Listing 4.9 the callee function on line 8 only reads this memory. Therefore, this listing shows an unconditional data flow between the instructions on lines 2 and 5. Since the effects of f_callee are out of scope for an intra-procedural analysis of function f_target, a data-flow analysis implementation must use a single heuristic to handle both these cases. In Section 4.5.3 we show that angr disrupts data flows that cross a callee function, i.e., it treats every such data flow as impossible.
1 f_target:
2 mov [rsp+0x8], 0 ; Write
3 lea rdi, [rsp+0x8]
4 call f_callee
5 mov rax, [rsp+0x8] ; Read
6 ret
7 f_callee:
8 mov rax, [rdi]
9 ret
Listing 4.9: The instructions on line 2 and
line 5 have an unconditional data-flow.
1 f_target:
2 mov [rsp+0x8], 0 ; Write
3 lea rdi, [rsp+0x8]
4 call f_callee
5 mov rax, [rsp+0x8] ; Read
6 ret
7 f_callee:
8 mov [rdi], 0
9 ret
Listing 4.10: The instructions on line 2 and
line 5 have an impossible data-flow.
4.3 Improving static data-flow analysis
In order to improve static data-flow analysis, we define three novel data-flow model extensions,
discussed in Section 4.3.1. Since the data-flow problem is undecidable in general [3, 79], the utility of any data-flow model can only be evaluated experimentally. We propose a benchmark data set for
evaluation of data-flow models, discussed in Section 4.3.2. We further implement an automated
framework for evaluating data-flow models against our benchmarks, discussed in Section 4.3.3.
4.3.1 Data-flow model extensions
Our novel model extensions focus on addressing the complexities discussed in Section 4.2.3. Two
extensions introduce more precise handling of the state of a caller function upon return from a
callee function, discussed in Section 4.3.1.1. The third model extension introduces a simple, but effective, way to improve field sensitivity of static data-flow analysis, as described in Section 4.3.1.2.
We show how these three model extensions vastly improve the accuracy of static data-flow analysis
in Section 4.5.3.
4.3.1.1 Handling function calls
C1: Leveraging calling convention A calling convention defines how function parameters and
return values are passed between a caller function and its callee functions. We propose a model
extension that improves intra-procedural data-flow analysis by incorporating information about
the calling convention implemented by the target function. Consider Listing 4.9, in which register
rdi is used to pass a memory address from the caller function f_target (line 3) to the callee function f_callee (line 8) as a function argument. By identifying such function arguments, a policy can be used to determine whether to preserve or kill definitions at these addresses. We show in Section 4.5.3 that a policy preserving all data flows across callee functions outperforms angr's approach of killing such data flows.
C2: Stack frame preservation Conventionally, callee functions are implemented to preserve
and restore the stack frame of their caller function. As the callee function is out of scope for an
intra-procedural analysis, our model extension preserves the stack frame artificially for all callee
functions. This is challenging, as it requires suppressing the effect that the call instruction itself has on the stack frame. In architectures such as x86-64, the call instruction pushes the subsequent
instruction address to the stack automatically, modifying the stack pointer. The matching return
instruction, popping this instruction address and restoring the stack pointer, is in the callee function
and therefore out of scope of analysis. Since an intra-procedural analysis will only consider the
call instruction and not the return instruction, artificial restoration of the stack pointer is necessary.
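As a rough illustration of C1 and C2 combined, the following Python sketch shows how an intra-procedural transfer function might handle a call instruction; the state object and its methods are hypothetical, and the sketch is not the code of any evaluated engine:

ARGUMENT_REGS = ["rdi", "rsi", "rdx", "rcx", "r8", "r9"]  # System V AMD64

def handle_call(state, word_size=8):
    # C2: undo the implicit push of the return address performed by the call
    # instruction, artificially restoring the caller's stack pointer, since
    # the matching return instruction is out of scope.
    state.adjust_register("rsp", +word_size)  # hypothetical helper
    # C1: under a preserve-all policy, definitions at addresses passed via
    # argument registers are kept alive; a different policy could kill them.
    for reg in ARGUMENT_REGS:
        state.preserve_definitions_reachable_from(reg)  # hypothetical helper
    # The callee's return value clobbers rax, so mark it undefined.
    state.kill_register("rax")  # hypothetical helper
    return state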
4.3.1.2 Field sensitivity
F: Constant-based Field Disunion Programming languages often allow the programmer to
group related data together into a structure (aka structs). Separate values in such a struct are
referred to as fields. In machine code, this is usually implemented by storing a base address of
the struct, and accessing the fields by computing an offset from the base address. The sizes of
the different fields must be defined at compile time, meaning the offsets to the different fields are
constant, and can be observed in the assembly code. Our extension assumes there is no data flow between two memory-access instructions that use distinct constant offsets, since these represent different fields of a struct.
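The heuristic behind F can be sketched as follows, assuming a hypothetical symbolic representation in which each memory access is reduced to a (base expression, constant offset) pair:

def may_alias(access_a, access_b):
    # Each access is a (base, offset) pair, where `base` is a symbolic
    # expression (e.g., an undefined register) and `offset` is the constant
    # extracted from the addressing mode (hypothetical representation).
    base_a, off_a = access_a
    base_b, off_b = access_b
    if base_a == base_b and off_a != off_b:
        # Same (possibly undefined) base with distinct constant offsets:
        # assume these address different struct fields, so no data flow.
        return False
    # Otherwise, fall back to the analysis's usual over-approximate answer.
    return True

Note that, as Section 4.5.3 shows, multi-byte accesses can still overlap across small distinct offsets, so a production implementation would also need to compare access sizes.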
4.3.2 Data set
To gain a fine-grained insight into the accuracy of a data-flow analysis model, we break down data flows into a number of categories, which we refer to as alias classes, discussed in Section 4.3.2.1.
Using these alias classes as a guide, we create a framework for generating a data set of test
cases. We divide these test cases into two categories: microbenchmark test cases, discussed in
Section 4.3.2.2 and real-world test cases (Section 4.3.2.3). The microbenchmark test cases are
synthesized at the source code level, and then compiled. Conversely, the real-world test cases are
extracted from real-world programs.
After generation, this data set is a collection of binaries, each paired with information regarding
its functions and data flows. Each data flow is also labeled with its kind of data flow (the ground truth) and its alias class.
4.3.2.1 Alias classes
We categorize an intra-procedural data flow between a pair of instructions by how the pointers –
dereferenced by these instructions – are introduced in the function at the source code level. We
refer to these categories as alias classes and the introduction method of a pointer as the pointer
origin. We define four such pointer origins, stack, heap, foreign and global. For the foreign pointer
origin, the pointer is defined outside the function and introduced via a function argument. For the
stack pointer origin, the pointer is introduced as an offset of the stack pointer register. For the
heap pointer origin, the pointer is introduced via the return value of a memory allocation function.
Finally, global pointers are allocated upon program initialization, and are accessed either as a
constant address, or an address relative to the instruction pointer. We show a source-code level
illustration of each of the pointer origins in Listing 4.11. Additionally, we show examples of data
flows with alias classes (Stack, Stack) and (Global, Foreign) in Listings 4.12 and 4.13,
respectively.
1 char global_pointer;
2 void f(char *foreign_pointer) {
3   char stack_pointer;
4   char *heap_pointer = malloc(1);
5 }
Listing 4.11: Pointers with each of the pointer origins.
1 char f(char c) {
2   char stack_ptr;
3   stack_ptr = c;
4   return stack_ptr;
5 }
Listing 4.12: An unconditional data-flow
(line 3 to 4) with alias class (Stack,
Stack).
1 char global_ptr;
2 char f(char *foreign_ptr) {
3   global_ptr = 0;
4   return *foreign_ptr;
5 }
Listing 4.13: A possible data-flow (line 3
to 4) with alias class (Global, Foreign).
4.3.2.2 Microbenchmark test cases
The purpose of our microbenchmark test cases is to span a wide variety of different intra-procedural
data flows that can occur in programs. Each test case is designed to be minimalistic, testing a single
target data flow in a target function, focused solely on executing this data flow.
At the heart of each test case lies a pair of instructions writing and reading memory addresses,
forming the target data flow. The ground truth of each test case – whether there exists an unconditional, possible or impossible data flow between these instructions – depends on the parameters
of its construction. The source code instructions of the test cases are generated by exercising
each combination of these parameters. Finally, this source code is compiled by exercising each
combination of a set of selected compiler options.
Pointer creation The first phase of test-case creation is to create the pointers that will be used
by the memory access instructions. This phase involves pointer definition and pointer expansion.
In creating the data set, we enumerate a set of properties that make up the pointer definition – a
pointer origin, data type, size and length – and expand the pointer into a write and read pointer. The
data type is a native data type supported by the compiler and underlying architecture, e.g. integer
or floating point. The size attribute of a pointer indicates the number of bytes comprising the data
type. Finally, the length describes the number of adjacent data types in memory the pointer points
to. For a length greater than 1, the pointer points to an array. These properties of a pointer are used
to define it in the source code of the test case, including allocating the necessary amount of space
on the stack or heap. We illustrate these properties of a pointer definition in Figure 4.1.
In the next phase, pointer expansion, we either use a single pointer for both the write and read
instruction, or two distinct pointers for each instruction. We refer to the pointers used in the write
and read instruction as the write pointer and read pointer, respectively, regardless of whether or
not they are the same. We use a single pointer for both the write and the read pointer in order to
construct unconditional data-flows. We illustrate pointer expansion in Figure 4.2.
Pointer transformation To increase test-case complexity and variety, we introduce an optional
pointer transformation, illustrated in Figure 4.3. This transformation adds an offset to the pointer,
which can be either a constant value or a variable that is undefined within the target function.
The length attribute of the pointer is used to select an offset within the bounds of the pointer array.
The transformed pointers are ultimately used in the target data-flow.
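The combinatorial construction described above can be sketched in Python; the concrete parameter values mirror those shown in the figures below, but the code itself is illustrative and not our actual generation framework:

import itertools

ORIGINS = ["stack", "heap", "foreign", "global"]
TYPES = [("int", 1), ("int", 2), ("int", 4), ("int", 8),
         ("float", 4), ("float", 8)]  # (data type, size) pairs
LENGTHS = [1, 2]
EXPANSIONS = ["single", "distinct"]
OFFSETS = [None, "constant", "variable"]

def generate_cases():
    # Exercise every combination of construction parameters; rendering each
    # combination to C source is a separate (here omitted) step.
    for origin, (dtype, size), length, expansion, offset in itertools.product(
            ORIGINS, TYPES, LENGTHS, EXPANSIONS, OFFSETS):
        yield {"origin": origin, "type": dtype, "size": size,
               "length": length, "expansion": expansion, "offset": offset}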
[Figure 4.1 consists of four C templates, one per pointer origin: a global definition type global_ptr[length]; outside the target function, a stack definition type stack_ptr[length]; inside it, a heap definition type *heap_ptr = malloc(size * length); and a foreign pointer received as the parameter type *foreign_ptr. The type is a (data type, size) pair, e.g., (int, 1) for char, (int, 2) for short, (int, 4) for int, (int, 8) for long, (float, 4) for float and (float, 8) for double.]
Figure 4.1: The pointer is defined with a selected pointer origin, data type, size and length.
[Figure 4.2 shows the two pointer-expansion variants: with a single pointer, write_ptr and read_ptr are both assigned the same defined pointer; with distinct pointers, write_ptr and read_ptr are assigned two separately defined pointers. In both variants, the write operation is *write_ptr = ...; and the read operation is ... = *read_ptr;.]
Figure 4.2: We either use the same pointer, or two distinct pointers, for the write operation and
read operation.
[Figure 4.3 shows the pointer-transformation variants: an offset (either a constant or a variable) may be added to the write pointer only, to the read pointer only, or to both, e.g., *(write_ptr + offset) = ...; for the write operation and ... = *(read_ptr + offset); for the read operation.]
Figure 4.3: The write pointer and read pointer may be transformed by adding an offset.
[Figure 4.4 shows the two callee-interruption variants: in one, the read operation directly follows the write operation; in the other, a call to callee_function() is inserted between them.]
Figure 4.4: The write operation and read operation may be interrupted by a call to a callee function.
Callee interruption As discussed in Section 4.2.3, an intra-procedural data-flow analysis must
use approximations in order to handle function calls in the target function. In order to expose these
approximations, we create a counterpart for each test case, where the target data-flow is interrupted
by a function call, as shown in Figure 4.4.
Example We show an example of a generated microbenchmark test case in Listing 4.14. The test
case contains two pointer definitions, ptr1 (line 1) and ptr2 (line 2). The write pointer, ptr1, has a foreign pointer origin and an integer data type of size 1. The read pointer, ptr2, has a stack pointer origin, an integer data type of size 1, and a length of 2. The read pointer is also transformed by adding a constant offset of 1 (line 5). Finally, the write operation (line 3) and read operation (line 5) are interrupted by a callee function (line 4).
1 char target_func(char v, char *ptr1) {
2   char ptr2[2];
3   *ptr1 = v; /* Write operation */
4   callee_func();
5   return *(ptr2 + 1); /* Read operation */
6 }
Listing 4.14: An example of a generated microbenchmark test case.
Test case compilation We further improve test-case diversity by compiling the source code with
a variety of compiler flags. This is useful, because compiler flags, such as optimization options,
have a significant impact on how the source code is converted into machine code. Additionally,
options to include or omit the frame pointer have an effect on how stack variables are represented
in machine code.
Ground truth With respect to ground truth, we divide the microbenchmark test cases into two
subcategories, the underspecified test cases and the fully-specified test cases. Underspecified test
cases include a target data flow whose existence depends on data not defined within the target function. The purpose of the underspecified test cases is to evaluate how a data-flow analysis behaves in scenarios where perfect information is unavailable. Listing 4.2 shows an example of an underspecified test case. Underspecified test cases have possible data flows because, by definition, the information missing from the scope of analysis can be defined specifically to cause a data flow. In the example shown in Listing 4.2, a data flow exists when register rsi is equal to 0. Note, however, that in some cases a possible data flow is unlikely, for example, between an instruction writing to global memory and an instruction reading from stack memory. For such a data flow to exist, the stack pointer would need to point to global memory. We illustrate this in Listing 4.15.
1 f:
2 mov [rsp], dl
3 mov al, [rip+0x1000]
4 ret
Listing 4.15: The instruction pair on lines 2 and 3 have a possible, but unlikely, data flow. For the data flow to exist, the stack pointer rsp (which is undefined within the scope of function f) would have to point to global memory.
For fully-specified test cases, all information regarding the existence of the target data flow
is within the target function. The goal of fully-specified test cases is to uncover the concrete
strengths and weaknesses of a particular approach. In every fully-specified test case, the target
data flow either occurs on every execution of the program (unconditional data flow), or does not
occur on any execution of the program (impossible data flow). Therefore, any result reported by
a data-flow analysis that contradicts the ground truth can be confirmed as an error. Listings 4.1
and 4.3 are examples of fully-specified test cases.
4.3.2.3 Real-world test cases
Microbenchmark test cases do not allow for evaluating the real-world performance of a data-flow
analysis model, e.g., its scalability. They are, furthermore, artificially generated, and thus not representative of the complexity and scale of data flows in real-world binaries. For this reason, we also include test cases built from real-world binaries, which requires us to solve three challenges. Firstly, we
need to establish ground truth, i.e. which data flows exist within this real-world binary. Secondly,
we need to construct intra-procedural data-flow graphs composed of these ground-truth data flows.
Thirdly, we need to determine the alias class of each of these data flows.
Establishing best-effort ground truth Unlike the microbenchmark test cases, where ground
truth is constructed, in real-world test cases the ground truth is unknown. Dynamic analysis allows
us to prove that a data flow exists in some executions of the binary. Therefore, dynamic analysis is capable of uncovering some possible data flows in a binary. Conversely, dynamic analysis
cannot identify unconditional or impossible data flows, for the following reason. By definition,
dynamic analysis considers a target program one program path at a time. For any program with an
unbounded number of program paths, any dynamic analysis will inevitably be incomplete. In our
data set we accept this imperfection, and collect possible data flows in real-world binaries using dynamic analysis.
In order to establish ground truth for a target binary, we perform dynamic instrumentation and
log all information necessary to recover the data flows. This information includes the instruction
address, data access type (write or read), the data location accessed (either a register identifier, or
memory address) and the context (the function call in which the instruction executes). Whenever
a data write access is encountered, a map is used to associate the data location with the instruction
that performed the access. Whenever a data read access occurs, this map is consulted to identify
the matching write access. The result is a pair of instructions, for which the first writes to the same
data that is read by the second – a data flow.
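A minimal sketch of this write-map construction, assuming a hypothetical trace format of (instruction address, access type, location, context) records:

def recover_data_flows(trace):
    # `location` is a register identifier or a concrete memory address, and
    # `_context` identifies the function invocation; the context is carried
    # along for the scope annotation, which is omitted in this sketch.
    last_writer = {}  # location -> address of the most recent writer
    flows = set()     # recovered (writer address, reader address) pairs
    for insn_addr, access, location, _context in trace:
        if access == "W":
            # The new write kills any earlier definition at this location.
            last_writer[location] = insn_addr
        else:  # access == "R"
            writer = last_writer.get(location)
            if writer is not None:
                flows.add((writer, insn_addr))
    return flows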
Creating the data-flow graphs The collected data flows are consolidated into an inter-procedural data-flow graph, where instruction addresses are the nodes and directed edges indicate a data flow. We annotate the edges in this graph with information pertaining to the scope and channel of the data flow.
The data-flow graph produced by dynamic analysis is inter-procedural and therefore needs to be separated into intra-procedural data-flow subgraphs for comparison with the static data-flow graphs. In order to separate the inter-procedural data-flow graph into a collection of intra-procedural subgraphs, we use a static analysis for identifying functions in the target binary. This assigns each instruction in the target binary to a function, so for each function we have the set of instructions comprising the function. We use this instruction set of the function to extract a node-induced subgraph from the inter-procedural data-flow graph. This subgraph contains all (and only) the nodes corresponding to instructions in the function, and the edges between these nodes. Finally, we eliminate all edges in this subgraph with only an inter-procedural scope. The result is an intra-procedural graph.
Handling special cases We apply the following modifications to the intra-procedural data-flow graph to handle common special cases found in binary programs. We make these modifications because correctly identifying registers undefined in the target function forms the basis of how we identify alias classes in real-world binaries. A common pattern in machine code is to preserve registers across function calls, by writing the register to memory at the start of the function and then reading it back into the register at the end. Because this register is restored in the callee function, in an intra-procedural data-flow graph of the caller function this appears as undefined registers used after the function call. We identify these save-restore patterns in the inter-procedural data-flow graph, and connect the corresponding instruction addresses in the target function with an intra-procedural data-flow edge. We show an example of this in Figure 4.5. The register rbx is saved across the function call to f_callee and therefore we connect the definition and use of this register in f_target.
A second modification we make to the inter-procedural data-flow graph is to identify and remove data flows leading into an instruction used to clear a register. We show an example of this in
Figure 4.6. The xor instruction clears the value in the register rbx. Therefore, even though this
instruction writes and reads this register, no data flow exists through this instruction. If such an
instruction is used to clear a register at the start of the function, then, without this modification, any subsequent instruction in the intra-procedural data-flow graph that reads this register would incorrectly appear to use a register not defined within the scope of the function.
f_target:
1: mov rbx, rdi
2: mov [rbx], 1
3: call f_callee
4: mov rax, [rbx]
5: ret
f_callee:
6: push rbx
7: pop rbx
8: ret
[Diagram: the intra-procedural DFG of f_target before and after the modification; the rbx save-restore path through instructions 6 and 7 in f_callee is replaced by a direct intra-procedural rbx edge from instruction 1 to instruction 4.]
Figure 4.5: We modify the intra-procedural data-flow graph to reconnect registers across save-restore edges.
Alias class identification Recall that the alias class of a data flow is defined by the pointer origin
of the pointers dereferenced in its instructions. For a stack pointer origin, the pointer is introduced
via the stack pointer register (rsp), which is undefined within a function (as its value depends
on the call stack). Foreign pointers are passed to the target function as arguments. The exact
implementation of pointer passing in machine code depends on the calling convention. In this
chapter, we focus on the calling convention specified in the System V AMD64 Application Binary Interface. In this calling convention, pointer arguments are passed via the registers rdi, rsi, rdx, rcx, r8 and r9 (additional pointer arguments are passed via the stack). Therefore, a foreign pointer is
introduced via one of these undefined registers. Heap pointers are most often obtained in a function
via the return value of a memory allocation function. In the above-mentioned calling convention,
the return value of a function is passed via the rax register. Therefore, heap pointers are introduced
via an undefined rax register. In the case of global pointers, the address does not depend on any
undefined registers, because the address is constant or an offset of the instruction pointer. We
summarize these undefined registers associated with each pointer origin in Table 4.1.
f_target:
1: mov rbx, 1
2: xor rbx, rbx
3: mov rax, rbx
4: ret
[Diagram: the data-flow graph over instructions 1, 2 and 3 before and after the modification; the rbx edge from instruction 1 into the clearing instruction 2 is removed, leaving only the rbx edge from instruction 2 to instruction 3.]
Figure 4.6: We modify the intra-procedural data-flow graph to eliminate data flows into register-clearing instructions.
Table 4.1: The undefined registers corresponding to each pointer origin.
Pointer origin Undefined register dependencies
Stack rsp
Foreign rdi, rsi, rdx, rcx, r8 or r9
Heap rax
Global None
Identifying the origin of a pointer in a real-world program is challenging, because between
its introduction and use, the pointer may be assigned to a different register and also saved (restored)
to (from) memory. Therefore, a simple syntactic analysis of the pointer is insufficient to correctly
determine its origin. Indeed, it is necessary to trace the data flow of a pointer back from its use to
its introduction into the function. We identify the introduction point of an address, by following
its intra-procedural data flow backwards, until an inter-procedural data-flow is encountered. The
undefined registers upon which the pointer address depends are the union of the data-flow channels
of all these encountered inter-procedural edges.
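This backward trace can be sketched over the same annotated graph (again with networkx-style accessors, purely for illustration):

def undefined_register_deps(dfg, use_node):
    # Walk intra-procedural edges backwards from the pointer's use; every
    # inter-procedural edge encountered contributes its channel (a register
    # name) to the set of undefined registers the pointer depends on.
    deps, seen, work = set(), {use_node}, [use_node]
    while work:
        node = work.pop()
        for pred, _dst, attrs in dfg.in_edges(node, data=True):
            if attrs["scope"] == "inter":
                deps.add(attrs["channel"])  # introduction point reached
            elif pred not in seen:
                seen.add(pred)
                work.append(pred)
    return deps

The resulting register set is then matched against Table 4.1 to assign a pointer origin.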
4.3.3 Evaluation framework
We evaluate a data-flow analysis by testing all combinations of the binaries, functions and data flows in our data set. We refer to the binary and function under evaluation as the target binary and
target function, respectively. Similarly, we refer to the data-flow under test as the target data-flow.
For each target function, we generate a data-flow graph using the analysis under evaluation. This
data-flow graph is examined to determine the presence (or absence) of the edge between the two
instructions constituting the target data flow. This information is compared to the ground truth degree of data flow to establish correctness. For the microbenchmark test cases, this ground truth was
constructed and is included in the data set. For the real-world test cases, we have a partial ground
truth, established with dynamic analysis. We discuss our approach to quantify the performance of
the analysis under evaluation, in spite of this imperfect ground truth, in Section 4.5.2.
4.4 Implementation
4.4.1 Selected DA approaches
We select three state-of-the-art data-flow analysis implementations, found in the binary program analysis engines angr [14] (version 9.2.39), Ghidra [15] (version 10.2.2) and Miasm [16]
(version 0.1.3.dev447). We refer to these as our selected DA approaches. These binary analysis
engines are popular in both academia and industry, and were selected for being open-source
and for their implementation of a general-purpose, best-effort data-flow analysis. In Section 4.5,
we conduct our evaluation using these selected DA approaches to uncover the similarities and
differences between them, as well as strengths and weaknesses of each approach.
4.4.2 Microbenchmark test cases
We implement an open source framework¹ to generate microbenchmark test cases automatically,
by following the approach presented in Section 4.3.2.2. For the pointer definition properties, we
use the native data types and sizes: 8-bit int, 16-bit int, float and double. We also include
a defined struct data type comprising an integer and pointer value. For the length property of
pointers, we assign one of two constant values, 1 and 2 elements, or one of two source code
variables. These variables are passed to the target function via function parameter. We incorporate
these variable lengths to test undefined offset transformations. We show an example of a generated
microbenchmark test case in Listing 4.14.
After generating the source code, we generate the test-case binaries by compiling the source
code with the GNU Compiler Collection (GCC, version 11.3.0) using one of six optimization options (-O0 through -O3, -Os and -Ofast) and varying the inclusion of the function stack frame pointer with
-fomit-frame-pointer. All pointers are tagged as volatile in the source code, to prevent the
compiler optimizing away the target data flow. Every target binary is compiled with DWARF debug
symbols, using the flag -gdwarf-4. The debug symbols are used in ground truth establishment –
they help us identify the memory access instructions of the target data flow in the compiled binary
using the pyelftools Python library [80]. Finally, given the set of compiled binaries, a fingerprint
is calculated for each target function, by computing the MD5 hash of the bytes comprising its
machine code instructions. Due to the simplistic nature of the target functions, we observe that in
some cases different compiler options produce the same machine code from the same source code.
We use the MD5 hash to identify such duplicate functions. The final set of microbenchmark test
¹ https://github.com/usc-isi-bass/DataFlowDataSetGeneration
cases is a set of binaries, each with a unique target function. In total, we have 215,072 source-code-level
test cases, mapping to 277,072 unique target functions.
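The deduplication step can be sketched as follows; extracting a function's machine-code bytes is a hypothetical helper here, while the MD5 fingerprinting mirrors the procedure described above:

import hashlib

def deduplicate_functions(functions):
    # `functions` is an iterable of (binary_path, func_bytes) pairs, where
    # `func_bytes` are the raw machine-code bytes of the target function
    # (assumed to be located with the help of the DWARF debug symbols).
    unique = {}
    for binary_path, func_bytes in functions:
        fingerprint = hashlib.md5(func_bytes).hexdigest()
        # The first binary seen with a fingerprint represents its group;
        # later byte-identical functions are discarded as duplicates.
        unique.setdefault(fingerprint, binary_path)
    return unique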
4.4.3 Real-world test cases
For the real-world test cases, we select binaries from the following projects: chmod, cp, ls (from the Coreutils package [81], version v9.1-98-g8613d35be), as well as the Apache-Httpd server [82] (version 2.5.1-dev), the Mujs javascript interpreter [83] (version 1.3.3) and the CJson parser [84] (version v1.7.15-14-gb45f48e).
To collect execution traces for a target binary, we instrument it using Intel's PIN framework [85], while executing the test cases in the project repository. We normalize instruction addresses to compensate for address space layout randomization (ASLR).
From each of the real-world binaries, we select a subset of target functions, for which we will
compute static data-flow graphs. While we could extract a static data-flow graph for all functions
in the target binary, this is often a computationally expensive procedure and we wish to grant
sufficient time to the selected DA approaches for computing the data-flow graph, while still keeping
the overall run time manageable. To this end, we select 5 functions from each target binary and
allow a 5-hour time limit per target function.
For the target functions, we select the 5 functions with the most memory data flows, as computed by dynamic analysis. We show these functions for each target binary in Table 4.2, with the
number of target data flows per alias class. We denote the alias class as unknown in the cases where our alias class identification for real-world binaries (see Section 4.3.2.3) fails. We use these
memory data flows as ground truth for the static data-flow analyses.
4.4.4 Evaluation framework implementation
We implement the evaluation framework in an open source repository², by creating a Python wrapper script for each approach to generate a data-flow graph for a specified target function in a target binary.
² https://github.com/NicolaasWeideman/MultiToolDataFlowAnalysis
Table 4.2: The selected target functions for each real-world binary with the number of identified data flows per alias class.
Target Function (F, F) (G, G) (H, H) (S, S) Unknown
chmod
main 0 16 1 29 0
quotearg_buffer_restyled 0 0 0 37 0
fts_build 0 0 0 23 3
rpl_fts_open 0 0 1 4 0
quotearg_n_options 0 0 0 4 0
cp
copy_internal 0 2 0 354 4
sparse_copy 1 0 0 44 5
main 0 1 0 47 0
backupfile_internal 0 0 0 39 0
make_dir_parents_private 3 0 2 27 2
ls
main 0 66 0 57 0
quotearg_buffer_restyled 0 0 0 52 0
mpsort_with_tmp.part.0 1 0 0 37 8
canonicalize_filename_mode 0 0 1 34 2
strftime_internal.isra.0 0 0 0 28 0
Apache-Httpd
trie_node_link 9 0 0 53 0
ap_add_module 0 0 0 39 2
trie_node_alloc 0 0 1 36 0
register_filter 1 0 0 36 0
ap_setup_prelinked_modules 0 11 0 24 0
Mujs
jsR_run 4 0 0 380 6
cstm 0 0 0 323 0
jsC_cexp 0 0 0 256 0
js_gc 2 0 0 197 0
statement 2 0 0 177 0
CJson
cJSON_Delete 0 0 0 23 0
get_object_item 0 0 0 18 0
add_item_to_object 0 0 0 18 0
add_item_to_array 0 0 0 16 0
UnityAssertEqualString 0 0 0 15 0
Each of the data-flow analysis implementations, provided by the selected DA approaches,
provides a number of settings to fine-tune the analysis for a particular use case. We make a best-effort attempt to apply the settings that will maximize performance on our data set. This is achieved by performing an analysis with a variety of settings and selecting the most accurate data-flow results, while discarding the rest. Next, we discuss these settings, as well as the process of converting the DA-approach-specific data-flow information into a unified representation, which can be compared directly with our ground truth.
angr By default, angr builds an inter-procedural data-flow graph on top of a control-flow graph
(CFG), augmented with symbolic execution. In order to focus this data-flow analysis on the target
function, we reduce the scope of CFG generation to only this function. We achieve this by disabling
context sensitivity, and setting the call-depth parameter to 0. angr produces a data-flow graph over
the statements of the Vex IR. Each of the statements in this IR is associated with a machine code
instruction, via an IMark³ statement, which contains the instruction address of this machine-code instruction. We use these IMark statements to convert the data-flow graph produced by angr to a data-flow graph over machine-code instructions.
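As a rough sketch of this configuration (the analysis names and keyword arguments are assumptions that may differ across angr versions, and the target address is a hypothetical placeholder):

import angr

proj = angr.Project("/path/to/target_binary", auto_load_libs=False)

# Restrict CFG generation to the target function: no context sensitivity
# and a call depth of 0, so the analysis does not descend into callees.
cfg = proj.analyses.CFGEmulated(
    starts=[target_function_addr],  # hypothetical address of the function
    call_depth=0,
    context_sensitivity_level=0,
    keep_state=True,  # states are needed by the data-flow analysis
)

# Build the data-dependence graph on top of the CFG.
ddg = proj.analyses.DDG(cfg, start=target_function_addr)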
Ghidra Ghidra performs its data-flow analysis in the process of converting the machine code
instructions to its P-code IR. In this process, the inputs and output of P-code operations are linked.
Each of these P-code operations is associated with a machine-code instruction address, which
we use to create a data-flow graph. Per its default settings, Ghidra performs a number of simplification steps on the produced P-code, such as consolidating some operations. This simplification
eliminates the mapping between some machine code instructions and their corresponding P-code
operations. To prevent this, we set the simplification style to firstpass instead.
³ An IMark statement is a special statement in Vex IR that does not describe the behavior of an instruction, but instead its address and size in bytes.
Miasm Similar to angr, for Miasm we first create a CFG for the target function over its intermediate representation. Then, we generate a data-flow graph, instructing Miasm explicitly to consider memory dependencies and not to descend into function calls.
4.5 Evaluation
We evaluate the selected DA approaches on both the microbenchmark test cases and real-world
test cases of our data set. We use the insights of this evaluation to propose improvements to the
state of the art in static data-flow analysis.
4.5.1 Microbenchmark test cases
Our microbenchmarks consist of 277,072 target binaries, each paired with information regarding
its target function, target data-flow and ground truth. We use our evaluation framework to extract
a static data-flow graph for the target function, using each of the selected DA approaches. We
inspect the produced data-flow graphs to determine whether or not the target data flow is reported by each DA approach. We compare this report with the ground truth for the test case. We show the
performance of the selected DA approaches on our fully-specified microbenchmark test cases in
Table 4.3 and similarly for our underspecified test cases in Table 4.4.
In Table 4.3 we observe that angr is capable of identifying every unconditional data flow in
each alias class. However, it also reports a data flow for each of the impossible data flows in the
(Foreign, Foreign) and (Heap, Heap) alias classes, and for some (20.19%) of the impossible
data flows in the (Stack, Stack) alias class. As all these data flows are impossible, each of
these cases represents a false positive. In Section 4.5.3 we conduct an investigation of these false
positives and show how they expose room for improvement. We also investigate the few cases
in which Ghidra reports a data flow (both correctly and erroneously). From manual inspection
of Ghidra’s source code, we conclude that Ghidra does not perform alias analysis, it effectively
assumes all memory addresses are unequal. There are a few exceptions in which it equates memory
Table 4.3: Performance of selected DA approaches on the microbenchmark fully-specified test
cases
Alias Class Ground Truth Edge Edge % No Edge No Edge % Total
angr
(F, F) unconditional 158 100.00% 0 0.00% 158
(F, F) impossible 72 100.00% 0 0.00% 72
(G, G) unconditional 170 100.00% 0 0.00% 170
(G, G) impossible 0 0.00% 3,726 100.00% 3,726
(H, H) unconditional 376 100.00% 0 0.00% 376
(H, H) impossible 12,988 100.00% 0 0.00% 12,988
(S, S) unconditional 115 100.00% 0 0.00% 115
(S, S) impossible 717 20.19% 2,835 79.81% 3,552
Ghidra
(F, F) unconditional 0 0.00% 158 100.00% 158
(F, F) impossible 0 0.00% 72 100.00% 72
(G, G) unconditional 10 5.88% 160 94.12% 170
(G, G) impossible 210 5.64% 3,516 94.36% 3,726
(H, H) unconditional 0 0.00% 376 100.00% 376
(H, H) impossible 0 0.00% 12,988 100.00% 12,988
(S, S) unconditional 0 0.00% 115 100.00% 115
(S, S) impossible 0 0.00% 3,552 100.00% 3,552
Miasm
(F, F) unconditional 106 67.09% 52 32.91% 158
(F, F) impossible 24 33.33% 48 66.67% 72
(G, G) unconditional 52 30.59% 118 69.41% 170
(G, G) impossible 232 6.23% 3,494 93.77% 3,726
(H, H) unconditional 376 100.00% 0 0.00% 376
(H, H) impossible 2,462 18.96% 10,526 81.04% 12,988
(S, S) unconditional 107 93.04% 8 6.96% 115
(S, S) impossible 150 4.22% 3,402 95.78% 3,552
addresses in global memory, but we leave the deep-dive into Ghidra’s source code to establish the
reason for this as future work. Finally, we observe that Miasm sporadically reports unconditional
and impossible data flows. Miasm performs alias analysis by differentiating addresses that are
expressed with different syntax in the machine code. This is a reasonable heuristic, but it is also
sensitive to how the compiler implements the access instructions, especially with respect to register
allocation. We show an example of this in Listing 4.16, where Miasm misses the unconditional data flow (from line 4 to line 5) due to the different syntax used to express the memory address (rbx versus rcx).
1 f:
2 mov rbx, rdi
3 mov rcx, rdi
4 mov QWORD PTR [rbx], rsi
5 mov rax, QWORD PTR [rcx]
6 ret
Listing 4.16: An unconditional data flow exists between lines 4 and 5. Miasm misses this data flow due to the different syntax used to express the memory address (rbx versus rcx).
Table 4.4: Performance of selected DA approaches on the microbenchmark underspecified test
cases
Alias Class Ground Truth Edge Edge % No Edge No Edge % Total
angr
(F, F) underspecified 5,699 46.40% 6,584 53.60% 12,283
(F, G) underspecified 0 0.00% 11,806 100.00% 11,806
(F, H) underspecified 13,954 50.00% 13,954 50.00% 27,908
(F, S) underspecified 0 0.00% 9,228 100.00% 9,228
(G, F) underspecified 0 0.00% 12,446 100.00% 12,446
(G, G) underspecified 0 0.00% 4,221 100.00% 4,221
(G, H) underspecified 0 0.00% 20,390 100.00% 20,390
(G, S) underspecified 0 0.00% 9,789 100.00% 9,789
(H, F) underspecified 14,656 50.00% 14,656 50.00% 29,312
(H, G) underspecified 0 0.00% 19,098 100.00% 19,098
(H, H) underspecified 10,546 30.61% 23,910 69.39% 34,456
(H, S) underspecified 0 0.00% 22,120 100.00% 22,120
(S, F) underspecified 0 0.00% 8,713 100.00% 8,713
(S, G) underspecified 0 0.00% 9,239 100.00% 9,239
(S, H) underspecified 0 0.00% 20,830 100.00% 20,830
(S, S) underspecified 0 0.00% 4,076 100.00% 4,076
Ghidra
(F, F) underspecified 0 0.00% 12,283 100.00% 12,283
(F, G) underspecified 932 7.89% 10,874 92.11% 11,806
(F, H) underspecified 0 0.00% 27,908 100.00% 27,908
(F, S) underspecified 0 0.00% 9,228 100.00% 9,228
(G, F) underspecified 0 0.00% 12,446 100.00% 12,446
(G, G) underspecified 0 0.00% 4,221 100.00% 4,221
(G, H) underspecified 0 0.00% 20,390 100.00% 20,390
(G, S) underspecified 0 0.00% 9,789 100.00% 9,789
(H, F) underspecified 0 0.00% 29,312 100.00% 29,312
(H, G) underspecified 1,082 5.67% 18,016 94.33% 19,098
(H, H) underspecified 0 0.00% 34,456 100.00% 34,456
(H, S) underspecified 0 0.00% 22,120 100.00% 22,120
(S, F) underspecified 0 0.00% 8,713 100.00% 8,713
(S, G) underspecified 763 8.26% 8,476 91.74% 9,239
(S, H) underspecified 0 0.00% 20,830 100.00% 20,830
(S, S) underspecified 0 0.00% 4,076 100.00% 4,076
Miasm
(F, F) underspecified 2,818 22.94% 9,465 77.06% 12,283
(F, G) underspecified 24 33.33% 10,958 66.67% 11,806
(F, H) underspecified 4,402 15.77% 23,506 84.23% 27,908
(F, S) underspecified 920 9.97% 8,308 90.03% 9,228
(G, F) underspecified 1,178 9.46% 11,268 90.54% 12,446
(G, G) underspecified 284 6.73% 3,937 93.27% 4,221
(G, H) underspecified 1,700 8.34% 18,690 91.66% 20,390
(G, S) underspecified 496 5.07% 9,293 94.93% 9,789
(H, F) underspecified 7,594 25.91% 21,718 74.09% 29,312
(H, G) underspecified 2,230 11.68% 16,868 88.32% 19,098
(H, H) underspecified 8,178 23.73% 26,278 76.26% 34,456
(H, S) underspecified 2,660 12.03% 19,460 87.97% 22,120
(S, F) underspecified 785 9.01% 7,928 90.99% 8,713
(S, G) underspecified 426 4.61% 8,813 95.39% 9,239
(S, H) underspecified 1,418 6.81% 19,412 93.19% 20,830
(S, S) underspecified 274 6.72% 3,802 93.28% 4,076
Table 4.5: Extraction time of the data-flow graph for each target function in each real-world binary.
Target Function angr Ghidra Miasm
chmod
main 14.63s 12.59s OOM
quotearg_buffer_restyled 60.79s 13.08s OOM
fts_build 16.29s 12.21s OOM
rpl_fts_open 10.60s 12.11s 580.53s
quotearg_n_options 9.38s 11.90s OOM
cp
copy_internal 87.95s 16.32s OOM
sparse_copy 17.36s 14.62s OOM
main 16.67s 14.92s 54.52s
backupfile_internal 15.90s 14.60s OOM
make_dir_parents_private 16.26s 14.27s OOM
ls
main 30.81s 18.98s OOM
quotearg_buffer_restyled 66.94s 17.99s OOM
mpsort_with_tmp.part.0 17.37s 16.75s 249.54s
canonicalize_filename_mode 18.85s 17.03s OOM
strftime_internal.isra.0 77.55s 17.53s OOM
Apache-Httpd
trie_node_link 112.62s 43.09s 2.40s
ap_add_module 112.51s 44.51s 6.91s
trie_node_alloc 111.97s 43.42s 1.11s
register_filter 112.21s 43.97s 1.65s
ap_setup_prelinked_modules 112.47s 43.51s 1.66s
Mujs
jsR_run 50.05s 22.39s 0.73s
cstm 18.90s 21.94s 0.48s
jsC_cexp 24.92s 21.17s 0.44s
js_gc 17.45s 21.03s OOM
statement 21.69s 21.70s 19.81s
CJson
cJSON_Delete 5.05s 10.63s 0.78s
get_object_item 4.72s 10.81s 0.63s
add_item_to_object 4.85s 10.80s 0.55s
add_item_to_array 4.64s 11.01s 0.41s
UnityAssertEqualString 4.91s 10.56s 0.62s
4.5.2 Real-world test cases
Our real-world test cases consist of 6 real-world binaries, each paired with information regarding 5
target functions and their dynamically recovered data flows. For each target function, we compute
the static data-flow graph, using each of the selected DA approaches. We show the run time of
this analysis in Table 4.5. Miasm fails to produce a data-flow graph for 6 target functions, due to
exceeding a memory limit of 100GB. By using a profiler, we establish that the excessive memory
usage is related to Miasm’s task list of states to process. These states are differentiated by code
location and data flows present at that state. Consequently, the size of this task list may grow
disproportionately larger than the size of the target function.
We combine the data flows of each target function in a particular binary and show the total
number per alias class in Table 4.6. Additionally, this table shows how many of the dynamic data
flows are discovered by the selected DA approaches and the number of data flows reported by
static analysis only. We see that, surprisingly, Ghidra identifies data flows in more alias classes than
in our microbenchmark evaluation (Table 4.3). However, after manual analysis, we conclude that
these data flows exist between synthetic P-code operations inserted by Ghidra, and do not reflect
the target data flow.
We use the numbers from Table 4.6 to compute a score for the real-world performance. For each selected DA approach α, we consolidate all data flows discovered dynamically and statically into the sets D and Sα, respectively. We consider the intersection (D ∩ Sα) the true positive data flows discovered by α. As discussed in Section 4.3.2.3, dynamic analysis is incomplete. Every data flow reported only by static analysis (Sα \ D) may be either a false positive, or a true positive for which we do not have dynamic evidence. We conservatively treat each such data flow as a false positive. Consequently, the numbers of true positives |D ∩ Sα| and false positives |Sα \ D| we report are a lower and an upper bound approximation, respectively. To estimate the number of false negatives, we identify the dynamic data flows not discovered statically (D \ Sα). This is a lower bound approximation, because there may exist data flows unreported by both dynamic and static analysis. We show these approximations in Table 4.7, along with an approximation of the precision, recall and F1 score. Table 4.7 gives us insight into how each of the selected DA approaches performs on real-world binaries. The table shows that Ghidra has few true positives and many false negatives, due to it not reporting data flows between memory access instructions. Ghidra also has many false positives, due to it reporting data flows between synthetic instructions. Together, this causes Ghidra to have a very low precision, recall and F1 score estimation. We see that Miasm has the highest F1 score estimation, but, as we have seen from Table 4.5, it also has scalability issues with respect to memory consumption. We see that angr misses a significant number of data flows, and also reports a very large number of false positives. We use this as an opportunity to improve the state of the art.
Table 4.6: The number of data flows discovered dynamically per alias class and the number of these discovered statically. Additionally, we show the number of data flows discovered statically only.
Alias Class Dyn angr angr % Ghidra Ghidra % Miasm Miasm %
chmod
(G, G) 16 0 0.00% 0 0.00% 0 0.00%
(H, H) 2 0 0.00% 0 0.00% 1 50.00%
(S, S) 97 24 24.74% 2 2.06% 4 6.19%
unknown 3 1 33.33% 0 0.00% 0 0.00%
Static-only - 1,331 - 363 - 22 -
cp
(F, F) 4 0 0.00% 0 0.00% 0 0.00%
(G, G) 3 0 0.00% 0 0.00% 0 0.00%
(H, H) 2 0 0.00% 0 0.00% 0 0.00%
(S, S) 511 128 25.05% 9 1.76% 13 2.54%
unknown 11 2 18.18% 2 18.18% 0 0.00%
Static-only - 1,555 - 654 - 56 -
ls
(F, F) 1 0 0.00% 0 0.00% 0 0.00%
(G, G) 66 7 10.61% 4 6.06% 0 0.00%
(H, H) 1 0 0.00% 0 0.00% 0 0.00%
(S, S) 208 96 46.15% 10 4.81% 37 17.79%
unknown 10 1 10.00% 0 0.00% 1 10.00%
Static-only - 3,030 - 572 - 94 -
Apache-Httpd
(F, F) 10 9 90.00% 0 0.00% 9 90.00%
(G, G) 11 8 72.73% 0 0.00% 0 0.00%
(H, H) 1 1 100.00% 0 0.00% 1 100.00%
(S, S) 188 133 70.74% 19 10.11% 188 100.00%
unknown 2 1 50.00% 0 0.00% 0 0.00%
Static-only - 125 - 70 - 36 -
Mujs
(F, F) 8 2 25.00% 0 0.00% 2 25.00%
(S, S) 1,333 519 38.93% 25 1.88% 217 16.28%
unknown 6 3 50.00% 0 0.00% 0 0.00%
Static-only - 976 - 363 - 57 -
CJson
(S, S) 90 79 87.78% 7 7.78% 90 100.00%
Static-only - 70 - 31 - 26 -
Table 4.7: Performance of selected DA approaches over all target real-world binaries.
angr Ghidra Miasm
True positives (lower bound) 1,014 78 563
False positives (upper bound) 7,087 2,053 291
False negatives (lower bound) 1,570 2,506 2,021
Precision (lower bound) 0.1252 0.0366 0.6593
Recall (estimation) 0.3924 0.0302 0.2179
F1 score (estimation) 0.1898 0.0331 0.3275
Table 4.8: The change (in boldface) introduced by C1 and C2 in how angr reports data flows interrupted by a callee function.
angr angr^C
Alias Class Ground Truth Callee Edge Edge % Edge Edge %
(F, F) Unconditional No 158 100.00% 158 100.00%
(F, F) Under-specified Yes 0 0.00% 180 100.00%
(G, G) Unconditional No 170 100.00% 170 100.00%
(G, G) Under-specified Yes 0 0.00% 168 100.00%
(H, H) Unconditional No 376 100.00% 376 100.00%
(H, H) Under-specified Yes 0 0.00% 376 100.00%
(S, S) Unconditional No 115 100.00% 115 100.00%
(S, S) Under-specified Yes 0 0.00% 135 100.00%
4.5.3 Improving the state of the art
In our evaluation of the selected DA approaches, we see in Table 4.7 that angr has a significant
number of false negatives and assumed false positives. This matches what we see in Table 4.3,
where angr reports many impossible data flows (false positives). These false positives and negatives are a clear indication that angr can be improved by fine-tuning its internal approximations
for when to report data flows. In this section, we conduct an investigation to identify the specific
data-flow scenarios that require improvement and show how our model extensions, discussed in
Section 4.3.1, yield such an improvement.
In order to show the specific areas of improvement, we present two additional categorizations
of our microbenchmarks in Tables 4.8 and 4.9. Table 4.8 shows all unconditional data flows, paired
with their underspecified counterparts, where the target data flow is interrupted by a function call.
We show an example of such a pair of data flows in Listings 4.17 and 4.18. From Table 4.8, we
clearly see that the presence of a callee function causes angr to not report the target data flow.
Since the callee function introduces out-of-scope modifications to the program state, the ground
truth is underspecified. Therefore, one could argue that disrupting all data flows that cross the
callee function is an acceptable assumption for an intra-procedural data-flow analysis to make.
However, we show that model extensions C1 and C2 reflect real-world behavior more accurately,
and therefore yield better results.
1 mov BYTE PTR [rsp-0x1], dil
2 mov al, BYTE PTR [rsp-0x1]
Listing 4.17: A fully-specified unconditional data flow exists between lines 1 and 2.
1 sub rsp, 0x10
2 mov BYTE PTR [rsp+0xf], dil
3 call 11e9
4 mov al, BYTE PTR [rsp+0xf]
5 add rsp, 0x10
Listing 4.18: The counterpart of Listing 4.17. The data flow is interrupted by a function call (line 3), resulting in an underspecified data flow between lines 2 and 4.
In Table 4.9 we select all fully-specified data-flows, and categorize these with respect to ground
truth, and whether or not an offset transformation was applied to both the write and read pointer
with equal offsets (discussed in Section 4.3.2.2). We show an example of such a test case in
Listing 4.19. In Table 4.9, we see that angr reports fully-specified impossible data flows with distinct offsets. In short, angr exhibits this behavior due to how it treats undefined memory addresses, which we have verified by analyzing angr's source code. An undefined memory address is artificially concretized to a constant specified in angr's source code. This effectively assumes that all undefined memory addresses alias. This shows an opportunity to improve angr by extending it with F.
1 lea rax, [rsi+0x1]
2 mov QWORD PTR [rsp-0x8], rax
3 mov BYTE PTR [rsi], dil
4 mov rax, QWORD PTR [rsp-0x8]
5 mov al, BYTE PTR [rax]
Listing 4.19: A fully-specified impossible data flow exists between instructions 3 and 5, due to the distinct offsets of register rsi that are accessed: [rsi+0x0] versus [rsi+0x1].
We implement our model extensions C1 and C2, following a policy that assumes callee functions have no impact on intra-procedural data flows. Additionally, we implement F such that, instead of concretizing the entire undefined address, we only concretize the undefined registers used in the address expression. This keeps address concretization sensitive to offsets, as required by F. We refer to angr extended with C1 and C2 as angr^C, with F as angr^F, and with all three extensions as angr^CF.
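The idea behind the offset-sensitive concretization in angr^F can be sketched as follows; the address-expression interface is a hypothetical simplification, not angr's actual code:

CONCRETE_BASE = 0xc0000000  # arbitrary placeholder constant

def concretize_address(addr_expr, undefined_registers):
    # `addr_expr` is assumed to be a symbolic expression offering a
    # substitute(name, value) method. Substituting every undefined base
    # register with the same constant, while leaving constant offsets in
    # place, keeps accesses such as [rsi+0x0] and [rsi+0x1] distinct.
    for reg in undefined_registers:
        addr_expr = addr_expr.substitute(reg, CONCRETE_BASE)
    # Constant folding now yields, e.g., 0xc0000000 versus 0xc0000001 for
    # distinct offsets, so the heuristic no longer equates all undefined
    # memory addresses.
    return addr_expr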
Table 4.9: The change (in boldface) introduced by F with respect to how angr reports data flows between pointers transformed by equal or distinct offsets.
angr angr^F
Alias Class Ground Truth Equal Offset Edge Edge % Edge Edge %
(F, F) Unconditional Yes 158 100.00% 158 100.00%
(F, F) Impossible No 72 100.00% 0 0.00%
(G, G) Unconditional Yes 170 100.00% 170 100.00%
(G, G) Impossible Yes 0 0.00% 0 0.00%
(G, G) Impossible No 0 0.00% 0 0.00%
(H, H) Unconditional Yes 376 100.00% 376 100.00%
(H, H) Impossible Yes 8,480 100.00% 3,402 40.12%
(H, H) Impossible No 4,508 100.00% 475 10.54%
(S, S) Unconditional Yes 115 100.00% 115 100.00%
(S, S) Impossible Yes 717 38.88% 0 0.00%
(S, S) Impossible No 0 0.00% 0 0.00%
We show the difference between angr and angr^C using our microbenchmarks in Table 4.8, and similarly for angr^F in Table 4.9. Table 4.8 shows that we have successfully extended angr to preserve data flows that cross a callee function. Table 4.9 shows that we have reduced the cases where angr reports impossible data flows when two distinct offsets are employed. An exception here is the (Heap, Heap) alias class, in which 475 (10.54%) impossible data flows with distinct offsets are still reported. After manually investigating a number of the remaining cases, we concluded that the reason for this is multi-byte memory accesses. A multi-byte memory access instruction writes or reads memory at a small range of addresses. Because angr^F (and angr) treat heap pointers as undefined, these pointers are concretized. In angr^F this concretization is sensitive to offsets, but if an offset is small, the address range accessed by a multi-byte memory access may overlap with this offset. We also observe an unexpected improvement gained by angr^F, reducing the reported impossible data flows with equal offset transformations in the (Heap, Heap) and (Stack, Stack) alias classes. We established that the reason for this is a secondary offset introduced by the compiler, due to the different data types accessed by the write pointer and read pointer. Since angr^F does not distinguish between offsets added explicitly in the source code and those added by the compiler, it correctly does not report the data flow.
We demonstrate the real-world improvement of angr^CF by re-running our evaluation on the real-world test cases, and show the results in Table 4.10. We see that angr^CF has a higher F1 score estimation than angr, Ghidra and Miasm. We gain a significant increase in true positives and a reduction in assumed false positives.
Table 4.10: The concrete improvement gained by extending angr with our model extensions C1,
C2 and F.
angr angr^CF
True positives (lower bound) 1,014 2,569
False positives (upper bound) 7,087 5,351
False negatives (lower bound) 1,570 15
Precision (lower bound) 0.1252 0.3244
Recall (estimation) 0.3924 0.9942
F1 score (estimation) 0.1898 0.4891
Indeed, angr^CF nearly perfects recall (0.99), meaning any real data flow is likely to be reported by angr^CF with near-guaranteed certainty. This is achieved while simultaneously improving precision from 0.13 to 0.32.
4.6 Future work
In Section 4.3.2.2 we introduced the offset pointer transformation that operates by adding a value
to the write or read pointer of a target data flow. Additional pointer transformations can be implemented, such as killing (redefining) the definition of the write instruction. Any additional transformation will yield further insight into how effectively a static data-flow analysis can identify
data flows in spite of transformations. We leave the investigation of defining and implementing
additional pointer transformations as future work.
In Section 4.3.2.1 we define a pointer origin for pointers passed as function arguments, the foreign pointers, and one for pointers returned from a memory allocation function, the heap pointers. It is possible to bridge these two pointer origins with pointers returned from functions other than memory allocation functions. Such a pointer is essentially also a type of foreign pointer, as no information is available regarding its definition site. We leave the implementation of such returned pointers as future work.
4.7 Related work
To the best of our knowledge, we are the first to evaluate static data-flow analysis approaches on
an extensive data set of binary executables. There have been a number of other benchmarks with
related, but orthogonal goals. Andriesse et al. [10] and similarly Pang et al. [9, 86] evaluate disassembly implementations on a data set consisting of binaries, extracted from the SPEC CPU 2006
benchmark as well as real-world binaries. Such an evaluation has a number of overlapping goals
with ours, such as establishing ground truth information for real-world binaries, but disassembly is
a problem orthogonal to data-flow analysis. The ground truth for disassembly only covers which
machine-code instructions exist in the target binary and does not consider which data they may access. Therefore, it is unsuitable for evaluating data-flow analysis. Di Federico et al. evaluate CFG recovery by creating a data set of binaries with ground-truth function boundaries [87]. They compare their novel approach, REV.NG, with other approaches to function boundary detection. Data flow involves a number of challenges independent of control-flow analysis, as discussed in Section 4.2.2, and thus we cannot reuse this data set to benchmark data-flow analysis approaches.
Hind [88] surveyed a number of approximating alias analysis solutions on source code.
That work approaches the data-flow analysis challenge from a theoretical perspective, dividing the
approaches along a number of dimensions. This differs from our approach, as we focus on
measuring the concrete strengths and weaknesses of implementations of binary data-flow analysis,
which include various, often undocumented, approximations and assumptions. Our work helps
uncover and quantify the impact of these approximations and assumptions.
Machiry et al. introduced AutoFacts, an approach to inject synthetic facts into real-world
programs [12]. These facts provide ground-truth knowledge, regarding aliasing pointers, that is
both sound and complete. The injected facts, however, are entirely separate from the logic of the
program into which they are injected. Our approach focuses on the ends of this spectrum: testing
microbenchmarks that are disjoint from real-world program logic, and testing data flows fully
intertwined with real-world program logic. Additionally, while Machiry et al. propose the AutoFacts
data set, they do not evaluate static program analysis implementations. A possible reason for
this is that the implementation of the AutoFacts data set does not appear to report ground truth with
respect to the injected facts. Without this ground truth, we cannot use AutoFacts in our evaluation
either. In both cases [12, 88] the alias approximations are divided into a number of dimensions,
called sensitivities. Since our selected DA approaches do not allow for enabling or disabling these
sensitivities, we do not bring them explicitly into our evaluation.
4.8 Conclusion
In this chapter, we introduced a novel approach to classifying data flows, namely alias classes. Using
these alias classes as a guide, we implemented an open-source framework to create a data set
of both microbenchmarks and real-world binaries for evaluating data-flow analysis implementations.
We also implemented an open-source framework to perform this evaluation. Finally, we evaluated
angr, Ghidra and Miasm using our data set and framework, and provided insights with regard to the
performance of each. We proposed three novel model extensions and, by leveraging our evaluation
framework, showed that these extensions improve the state of the art in data-flow analysis.
Chapter 5
Conclusion
As the size and quantity of modern software grows, so too does the burden of ensuring its security.
Manual analysis of source code is no longer sufficient to ensure security; automated software
security evaluation is required instead. Binary program analysis is a cornerstone of automated software security evaluation. In this dissertation, we introduced HARM-DOS and DIAMONDS, focusing on
the areas of automatic vulnerability discovery and non-disruptive patching. We have shown how
we can model the unsafe behavior of hash-collision denial-of-service and Spectre vulnerabilities,
respectively, to discover new vulnerabilities in binary executables. Using the discovery results, we
leveraged binary modification to mitigate these vulnerabilities without breaking the functionality
of the encompassing program. Additionally, we introduced FLOW-METER to evaluate and improve
the state of the art in the fundamental technique of static data-flow analysis. We evaluated three
implementations of data-flow analysis and discussed the resulting insights. We defined three model extensions
and illustrated how they improve the state of the art in data-flow analysis.
Each of our contributions, HARM-DOS, DIAMONDS and FLOW-METER, independently pushes
the boundary of research in binary program analysis. Therefore, these approaches further what
is possible with automatic software security evaluation. Ultimately, our research contributes to
establishing a more secure digital environment.
References
1. Balakrishnan, G. & Reps, T. W. WYSINWYX: What you see is not what you eXecute. ACM
Trans. Program. Lang. Syst. 32, 23:1–23:84. doi:10.1145/1749608.1749612 (2010).
2. Thompson, K. Reflections on Trusting Trust. Commun. ACM 27, 761–763. doi:10.1145/
358198.358210 (1984).
3. Rice, H. G. Classes of recursively enumerable sets and their decision problems. Transactions
of the American Mathematical society 74, 358–366 (1953).
4. Pei, K., Guan, J., Williams-King, D., Yang, J. & Jana, S. XDA: Accurate, Robust Disassembly
with Transfer Learning in 28th Annual Network and Distributed System Security Symposium,
NDSS 2021, virtually, February 21-25, 2021 (The Internet Society, 2021).
5. Miller, K. A., Kwon, Y., Sun, Y., Zhang, Z., Zhang, X. & Lin, Z. Probabilistic disassembly
in Proceedings of the 41st International Conference on Software Engineering, ICSE 2019,
Montreal, QC, Canada, May 25-31, 2019 (eds Atlee, J. M., Bultan, T. & Whittle, J.) (IEEE /
ACM, 2019), 1187–1198. doi:10.1109/ICSE.2019.00121.
6. Kinder, J. & Kravchenko, D. Alternating Control Flow Reconstruction in Verification,
Model Checking, and Abstract Interpretation - 13th International Conference, VMCAI 2012,
Philadelphia, PA, USA, January 22-24, 2012. Proceedings (eds Kuncak, V. & Rybalchenko,
A.) 7148 (Springer, 2012), 267–282. doi:10.1007/978-3-642-27940-9_18.
7. Andriesse, D., Slowinska, A. & Bos, H. Compiler-Agnostic Function Detection in Binaries
in 2017 IEEE European Symposium on Security and Privacy, EuroS&P 2017, Paris, France,
April 26-28, 2017 (IEEE, 2017), 177–189. doi:10.1109/EUROSP.2017.11.
8. Kiss, Á., Jász, J., Lehotai, G. & Gyimóthy, T. Interprocedural Static Slicing of Binary Executables in 3rd IEEE International Workshop on Source Code Analysis and Manipulation
(SCAM 2003), 26-27 September 2003, Amsterdam, The Netherlands (IEEE Computer Society, 2003), 118. doi:10.1109/SCAM.2003.1238038.
9. Pang, C., Zhang, T., Yu, R., Mao, B. & Xu, J. Ground Truth for Binary Disassembly is Not
Easy in 31st USENIX Security Symposium, USENIX Security 2022, Boston, MA, USA, August
10-12, 2022 (eds Butler, K. R. B. & Thomas, K.) (USENIX Association, 2022), 2479–2495.
10. Andriesse, D., Chen, X., van der Veen, V., Slowinska, A. & Bos, H. An In-Depth Analysis of
Disassembly on Full-Scale x86/x64 Binaries in 25th USENIX Security Symposium, USENIX
Security 16, Austin, TX, USA, August 10-12, 2016 (eds Holz, T. & Savage, S.) (USENIX
Association, 2016), 583–600.
11. Alves-Foss, J. & Venugopal, V. The Inconvenient Truths of Ground Truth for Binary Analysis. CoRR abs/2210.15079. doi:10.48550/ARXIV.2210.15079 (2022).
12. Machiry, A., Redini, N., Gustafson, E., Aghakhani, H., Kruegel, C. & Vigna, G. Towards Automatically Generating a Sound and Complete Dataset for Evaluating Static Analysis Tools.
Workshop on Binary Analysis Research (BAR). doi:10.14722/bar.2019.23090 (2019).
13. Akbar, L., Jiang, Y., Yap, R., Liang, Z. & Zhuohao, L. Evaluating Disassembly Ground Truth
Through Dynamic Tracing in Workshop on Binary Analysis Research (BAR) 2024, San Diego,
California, USA, March 1, 2024 (The Internet Society, 2024).
14. angr. The Angr binary analysis platform http://angr.io. 2016.
15. Ghidra https://ghidra-sre.org/. 2022.
16. Miasm. Miasm https://miasm.re. 2019.
17. Vadayath, J., Eckert, M., Zeng, K., Weideman, N., Menon, G. P., Fratantonio, Y., et al. Arbiter: Bridging the Static and Dynamic Divide in Vulnerability Discovery on Binary Programs in 31st USENIX Security Symposium, USENIX Security 2022, Boston, MA, USA, August 10-12, 2022 (eds Butler, K. R. B. & Thomas, K.) (USENIX Association, 2022), 413–
430.
18. Luo, Z., Wang, P., Wang, B., Tang, Y., Xie, W., Zhou, X., et al. VulHawk: Cross-architecture
Vulnerability Detection with Entropy-based Binary Code Search in 30th Annual Network and
Distributed System Security Symposium, NDSS 2023, San Diego, California, USA, February
27 - March 3, 2023 (The Internet Society, 2023).
19. Wang, T., Wei, T., Lin, Z. & Zou, W. IntScope: Automatically Detecting Integer Overflow
Vulnerability in X86 Binary Using Symbolic Execution in Proceedings of the Network and
Distributed System Security Symposium, NDSS 2009, San Diego, California, USA, 8th February - 11th February 2009 (The Internet Society, 2009).
20. Salls, C., Shoshitaishvili, Y., Stephens, N., Kruegel, C. & Vigna, G. Piston: Uncooperative
Remote Runtime Patching in Proceedings of the 33rd Annual Computer Security Applications
Conference, Orlando, FL, USA, December 4-8, 2017 (ACM, 2017), 141–153. doi:10.1145/
3134600.3134611.
21. Perkins, J. H., Kim, S., Larsen, S., Amarasinghe, S. P., Bachrach, J., Carbin, M., et al. Automatically patching errors in deployed software in Proceedings of the 22nd ACM Symposium
on Operating Systems Principles 2009, SOSP 2009, Big Sky, Montana, USA, October 11-
14, 2009 (eds Matthews, J. N. & Anderson, T. E.) (ACM, 2009), 87–102. doi:10.1145/1629575.1629585.
22. Menon, J., Hauser, C., Shoshitaishvili, Y. & Schwab, S. A Binary Analysis Approach to
Retrofit Security in Input Parsing Routines in 2018 IEEE Security and Privacy Workshops, SP
Workshops 2018, San Francisco, CA, USA, May 24, 2018 (IEEE Computer Society, 2018),
306–322. doi:10.1109/SPW.2018.00049.
23. Bauman, E., Lin, Z. & Hamlen, K. W. Superset Disassembly: Statically Rewriting x86 Binaries Without Heuristics in 25th Annual Network and Distributed System Security Symposium,
NDSS 2018, San Diego, California, USA, February 18-21, 2018 (The Internet Society, 2018).
24. Flores-Montoya, A. & Schulte, E. M. Datalog Disassembly in 29th USENIX Security Symposium, USENIX Security 2020, August 12-14, 2020 (eds Capkun, S. & Roesner, F.) (USENIX
Association, 2020), 1075–1092.
25. Duck, G. J., Gao, X. & Roychoudhury, A. Binary rewriting without control flow recovery
in Proceedings of the 41st ACM SIGPLAN International Conference on Programming Language Design and Implementation, PLDI 2020, London, UK, June 15-20, 2020 (eds Donaldson, A. F. & Torlak, E.) (ACM, 2020), 151–163. doi:10.1145/3385412.3385972.
26. Crosby, S. A. & Wallach, D. S. Denial of Service via Algorithmic Complexity Attacks in
Proceedings of the 12th USENIX Security Symposium, Washington, D.C., USA, August 4-8,
2003 (USENIX Association, 2003).
27. CVE-2011-4885. Available from CVE Details, CVE-ID CVE-2011-4885. 2011.
28. CVE-2012-1150. Available from National Vulnerability Database, CVE-ID CVE-2012-1150.
2012.
29. CVE-2012-2739. Available from National Vulnerability Database, CVE-ID CVE-2012-2739.
2012.
30. Kirrage, J., Rathnayake, A. & Thielecke, H. Static Analysis for Regular Expression Denial-of-Service Attacks in Network and System Security - 7th International Conference, NSS 2013,
Madrid, Spain, June 3-4, 2013. Proceedings (eds López, J., Huang, X. & Sandhu, R. S.) 7873
(Springer, 2013), 135–148. doi:10.1007/978-3-642-38631-2_11.
31. Chang, R. M., Jiang, G., Ivancic, F., Sankaranarayanan, S. & Shmatikov, V. Inputs of Coma:
Static Detection of Denial-of-Service Vulnerabilities in Proceedings of the 22nd IEEE Computer Security Foundations Symposium, CSF 2009, Port Jefferson, New York, USA, July 8-10,
2009 (IEEE Computer Society, 2009), 186–199. doi:10.1109/CSF.2009.13.
32. Petsios, T., Zhao, J., Keromytis, A. D. & Jana, S. SlowFuzz: Automated Domain-Independent
Detection of Algorithmic Complexity Vulnerabilities in Proceedings of the 2017 ACM
SIGSAC Conference on Computer and Communications Security, CCS 2017, Dallas, TX,
USA, October 30 - November 03, 2017 (eds Thuraisingham, B. M., Evans, D., Malkin, T. &
Xu, D.) (ACM, 2017), 2155–2168. doi:10.1145/3133956.3134073.
33. Blair, W., Mambretti, A., Arshad, S., Weissbacher, M., Robertson, W., Kirda, E., et al. HotFuzz: Discovering Algorithmic Denial-of-Service Vulnerabilities Through Guided Micro-Fuzzing in 27th Annual Network and Distributed System Security Symposium, NDSS 2020,
San Diego, California, USA, February 23-26, 2020 (The Internet Society, 2020).
34. Meng, W., Qian, C., Hao, S., Borgolte, K., Vigna, G., Kruegel, C., et al. Rampart: Protecting
Web Applications from CPU-Exhaustion Denial-of-Service Attacks in 27th USENIX Security
Symposium, USENIX Security 2018, Baltimore, MD, USA, August 15-17, 2018 (eds Enck, W.
& Felt, A. P.) (USENIX Association, 2018), 393–410.
35. Lestringant, P., Guihéry, F. & Fouque, P.-A. Automated Identification of Cryptographic Primitives in Binary Code with Data Flow Graph Isomorphism in Proceedings of the 10th ACM
Symposium on Information, Computer and Communications Security, ASIA CCS '15, Singapore, April 14-17, 2015 (eds Bao, F., Miller, S., Zhou, J. & Ahn, G.) (ACM, 2015), 203–214.
doi:10.1145/2714576.2714639.
36. Gröbert, F., Willems, C. & Holz, T. Automated Identification of Cryptographic Primitives in
Binary Programs in Recent Advances in Intrusion Detection - 14th International Symposium,
RAID 2011, Menlo Park, CA, USA, September 20-21, 2011. Proceedings (eds Sommer, R.,
Balzarotti, D. & Maier, G.) 6961 (Springer, 2011), 41–60. doi:10.1007/978-3-642-23644-0_3.
37. Staff, J. Assembled Labeled Library for Static Analysis Research (ALLSTAR) Dataset 2019.
38. Aumasson, J. & Bernstein, D. J. SipHash: A Fast Short-Input PRF in Progress in Cryptology
- INDOCRYPT 2012, 13th International Conference on Cryptology in India, Kolkata, India,
December 9-12, 2012. Proceedings (eds Galbraith, S. D. & Nandi, M.) 7668 (Springer, 2012),
489–508. doi:10.1007/978-3-642-34931-7_28.
39. Alakuijala, J., Cox, B. & Wassenberg, J. Fast keyed hash/pseudo-random function using
SIMD multiply and permute. CoRR abs/1612.06257 (2016).
40. Bao, T., Burket, J., Woo, M., Turner, R. & Brumley, D. BYTEWEIGHT: Learning to Recognize Functions in Binary Code in Proceedings of the 23rd USENIX Security Symposium, San
Diego, CA, USA, August 20-22, 2014 (eds Fu, K. & Jung, J.) (USENIX Association, 2014),
845–860.
41. Heimes, C. PEP 456 – Secure and interchangeable hash algorithm https://www.python.org/dev/peps/pep-0456/.
42. perlsec - Perl security. Algorithmic Complexity Attacks https://perldoc.perl.org/perlsec.html. 2003.
43. Farhadi, M. R., Fung, B. C. M., Charland, P. & Debbabi, M. BinClone: Detecting Code
Clones in Malware in Eighth International Conference on Software Security and Reliability,
SERE 2014, San Francisco, California, USA, June 30 - July 2, 2014 (IEEE, 2014), 78–87.
doi:10.1109/SERE.2014.21.
44. Xu, Z., Chen, B., Chandramohan, M., Liu, Y. & Song, F. SPAIN: security patch analysis for
binaries towards understanding the pain and pills in Proceedings of the 39th International
Conference on Software Engineering, ICSE 2017, Buenos Aires, Argentina, May 20-28, 2017
(eds Uchitel, S., Orso, A. & Robillard, M. P.) (IEEE / ACM, 2017), 462–472. doi:10.1109/
ICSE.2017.49.
45. Bruschi, D., Martignoni, L. & Monga, M. Detecting Self-mutating Malware Using Control-Flow Graph Matching in Detection of Intrusions and Malware & Vulnerability Assessment,
Third International Conference, DIMVA 2006, Berlin, Germany, July 13-14, 2006, Proceedings (eds Büschkes, R. & Laskov, P.) 4064 (Springer, 2006), 129–143. doi:10.1007/11790754_8.
46. Kernighan, B. & Ritchie, D. The C Programming Language (Prentice Hall PTR, New Jersey,
USA, 1972).
47. Knuth, D. E. The Art of Computer Programming, Volume III, 2nd Edition (Addison-Wesley,
1998).
48. Bernstein, D. J. DJB Hash http://www.cse.yorku.ca/~oz/hash.html. 2003.
49. GCC, the GNU Compiler Collection http://www.gnu.org/software/gcc/index.html.
2022.
50. Fowler, G., Vo, P. & Noll, L. C. The FNV Non-Cryptographic Hash Algorithm https://
datatracker.ietf.org/doc/html/draft-eastlake-fnv-03. 2012.
51. Ramakrishna, M. V. & Zobel, J. Performance in Practice of String Hashing Functions in
Database Systems for Advanced Applications ’97, Proceedings of the Fifth International
Conference on Database Systems for Advanced Applications (DASFAA), Melbourne, Australia, April 1-4, 1997 (eds Topor, R. W. & Tanaka, K.) 6 (World Scientific, 1997), 215–
224.
52. Sedgewick, R. Algorithms in C (Addison-Wesley Professional, Boston, MA, 1990).
53. SDBM Library https://apr.apache.org/docs/apr-util/0.9/group__APR__Util__DBM__SDBM.html. 2007.
54. Wegman, M. N. & Carter, L. New Hash Functions and Their Use in Authentication and Set
Equality. J. Comput. Syst. Sci. 22, 265–279. doi:10.1016/0022-0000(81)90033-7 (1981).
55. Lemire, D. & Kaser, O. Strongly Universal String Hashing is Fast. Comput. J. 57, 1624–1638.
doi:10.1093/comjnl/bxt070 (2014).
56. Bernstein, D. J. CDB https://cr.yp.to/cdb.html. 2000.
57. Free Software Foundation, Inc. Using the GNU Compiler Collection (GCC) https://gcc.gnu.org/onlinedocs/gcc-4.7.2/gcc/Optimize-Options.html. 2022.
58. Free Pascal https://www.freepascal.org/. 2021.
59. Free Software Foundation. GNU make https://www.gnu.org/software/make/manual/make.html. 2022.
60. Snudown https://www.github.com/reddit/snudown. 2018.
61. Reddit https://www.reddit.com. 2005.
62. CVE-2021-41168. Available from National Vulnerability Database, CVE-ID CVE-2021-
41168. 2021.
63. Bauman, E., Lin, Z. & Hamlen, K. W. Superset Disassembly: Statically Rewriting x86 Binaries Without Heuristics in 25th Annual Network and Distributed System Security Symposium,
NDSS 2018, San Diego, California, USA, February 18-21, 2018 (The Internet Society, 2018).
64. Hex-Rays. IDA F.L.I.R.T. Technology: In-Depth https://hex-rays.com/products/ida/
tech/flirt/in_depth/.
65. Meijer, C., Moonsamy, V. & Wetzels, J. Where’s Crypto?: Automated Identification and Classification of Proprietary Cryptographic Primitives in Binary Code in 30th USENIX Security
Symposium, USENIX Security 2021, August 11-13, 2021 (eds Bailey, M. & Greenstadt, R.)
(USENIX Association, 2021), 555–572.
66. Bernstein, D. J. DJBDNS https://cr.yp.to/djbdns.html. 2001.
67. Kocher, P., Genkin, D., Gruss, D., Haas, W., Hamburg, M., Lipp, M., et al. Spectre Attacks:
Exploiting Speculative Execution. meltdownattack.com (2018).
68. Lipp, M., Schwarz, M., Gruss, D., Prescher, T., Haas, W., Fogh, A., et al. Meltdown: Reading
Kernel Memory from User Space in 27th USENIX Security Symposium, USENIX Security
2018, Baltimore, MD, USA, August 15-17, 2018 (eds Enck, W. & Felt, A. P.) (USENIX
Association, 2018), 973–990.
69. Moghimi, D. Downfall: Exploiting Speculative Data Gathering in 32nd USENIX Security
Symposium (USENIX Security 2023) (2023).
70. Wang, G., Chattopadhyay, S., Gotovchits, I., Mitra, T. & Roychoudhury, A. oo7: Low-overhead Defense against Spectre Attacks via Binary Analysis. CoRR abs/1807.05843
(2018).
71. Chandler Carruth. Speculative Load Hardening https://llvm.org/docs/SpeculativeLoadHardening.html. 2019.
72. Guarnieri, M., Köpf, B., Morales, J. F., Reineke, J. & Sánchez, A. Spectector: Principled Detection of Speculative Information Flows in 2020 IEEE Symposium on Security and Privacy,
SP 2020, San Francisco, CA, USA, May 18-21, 2020 (IEEE, 2020), 1–19. doi:10.1109/SP40000.2020.00011.
73. Stepanov, E. & Serebryany, K. MemorySanitizer: fast detector of uninitialized memory use
in C++ in Proceedings of the 13th Annual IEEE/ACM International Symposium on Code
Generation and Optimization, CGO 2015, San Francisco, CA, USA, February 07 - 11, 2015
(eds Olukotun, K., Smith, A., Hundt, R. & Mars, J.) (IEEE Computer Society, 2015), 46–55.
doi:10.1109/CGO.2015.7054186.
74. Serebryany, K., Bruening, D., Potapenko, A. & Vyukov, D. AddressSanitizer: A Fast Address
Sanity Checker in 2012 USENIX Annual Technical Conference, Boston, MA, USA, June 13-
15, 2012 (eds Heiser, G. & Hsieh, W. C.) (USENIX Association, 2012), 309–318.
75. Balakrishnan, G., Reps, T. W., Melski, D. & Teitelbaum, T. WYSINWYX: What You See Is Not
What You eXecute in Verified Software: Theories, Tools, Experiments, First IFIP TC 2/WG
2.3 Conference, VSTTE 2005, Zürich, Switzerland, October 10-13, 2005, Revised Selected
Papers and Discussions (eds Meyer, B. & Woodcock, J.) 4171 (Springer, 2005), 202–213.
doi:10.1007/978-3-540-69149-5_22.
76. Jennifer L Jiang. Using Intel® Compilers to Mitigate Speculative Execution Side-Channel
Issues https://www.intel.com/content/www/us/en/developer/articles/troubleshooting/using-intel-compilers-to-mitigate-speculative-execution-side-channel-issues.html. 2018.
77. Microsoft Research. Z3 https://github.com/Z3Prover/z3. 2024.
78. Aho, A. V., Sethi, R. & Ullman, J. D. Compilers: Principles, Techniques, and Tools (Addison-Wesley, 1986).
79. Ramalingam, G. The Undecidability of Aliasing. ACM Trans. Program. Lang. Syst. 16, 1467–
1471. doi:10.1145/186025.186041 (1994).
80. pyelftools. Pyelftools https://github.com/eliben/pyelftools. 2023.
81. coreutils. Coreutils - GNU core utilities https://www.gnu.org/software/coreutils/.
2023.
82. apache. Apache - HTTP Server Project https://httpd.apache.org/. 2023.
83. Artifex. MuJS https://mujs.com/. 2023.
84. cjson. cJSON - Ultralightweight JSON parser in ANSI C https://github.com/DaveGamble/cJSON. 2023.
85. Luk, C., Cohn, R. S., Muth, R., Patil, H., Klauser, A., Lowney, P. G., et al. Pin: building customized program analysis tools with dynamic instrumentation in Proceedings of the
ACM SIGPLAN 2005 Conference on Programming Language Design and Implementation,
Chicago, IL, USA, June 12-15, 2005 (eds Sarkar, V. & Hall, M. W.) (ACM, 2005), 190–200.
doi:10.1145/1065010.1065034.
86. Pang, C., Yu, R., Chen, Y., Koskinen, E., Portokalidis, G., Mao, B., et al. SoK: All You Ever
Wanted to Know About x86/x64 Binary Disassembly But Were Afraid to Ask in 42nd IEEE
Symposium on Security and Privacy, SP 2021, San Francisco, CA, USA, 24-27 May 2021
(IEEE, 2021), 833–851. doi:10.1109/SP40001.2021.00012.
87. Federico, A. D., Payer, M. & Agosta, G. rev.ng: a unified binary analysis framework to
recover CFGs and function boundaries in Proceedings of the 26th International Conference
on Compiler Construction, Austin, TX, USA, February 5-6, 2017 (eds Wu, P. & Hack, S.)
(ACM, 2017), 131–141.
88. Hind, M. Pointer analysis: haven’t we solved this problem yet? in Proceedings of the 2001
ACM SIGPLAN-SIGSOFT Workshop on Program Analysis For Software Tools and Engineering, PASTE’01, Snowbird, Utah, USA, June 18-19, 2001 (eds Field, J. & Snelting, G.) (ACM,
2001), 54–61. doi:10.1145/379605.379665.
89. Marsaglia, G. et al. Xorshift rngs. Journal of Statistical Software 8, 1–6 (2003).
Appendices
A Vulnerability Diagnosis Example
We illustrate the process of detecting weak hash functions in binary code with an example. Consider Figure 5.1, showing the assembly code, arranged in a CFG, of a hash function implementing
SDBM. After disassembling the target binary executable and discovering the functions therein, the
analysis considers each of these functions individually as the target function.
The first step of HARM-DOS when analyzing a target function is to determine if it matches
the hash function template. To this end, it identifies the nontrivial SCCs in the CFG of the function
(if any). The CFG shown in Figure 5.1 has a single nontrivial SCC, which we highlight with solid
nodes and edges.
Next, HARM-DOS iterates through the instructions in the SCC and analyzes memory-write
operations. In Figure 5.1, instructions that write to memory are identified at addresses 0x749,
0x74c and 0x751. Each of these writes to a memory address calculated as a constant offset from
the stack base pointer register, rbp. As mentioned in Section 2.5.1, these are the only type of
memory-write operations allowed. Therefore, this SCC passes and is marked as a template match,
and HARM-DOS proceeds to perform Constant-Mnemonic Pair Discovery.
Since the target function is template-matching, HARM-DOS searches for the constant-mnemonic pair fingerprints of the detection model of each known-weak hash algorithm. When
considering the constant-mnemonic pair fingerprints of SDBM, the analysis discovers the constants 6 and 16 (0x10 in hexadecimal), each used with the mnemonic shl. These can be seen at addresses 0x739 and 0x741 in Figure 5.1. As this matches the constant-mnemonic pair fingerprints
for SDBM (see Table 2.1), the analysis notes the discovery of a constant-mnemonic pair match.
Therefore, it identifies the target function as a candidate hash function with SDBM as the candidate
algorithm.
0x70a push rbp
0x70b mov rbp,rsp
0x70e mov QWORD PTR [rbp-0x18],rdi
0x712 mov DWORD PTR [rbp-0x1c],esi
0x715 mov DWORD PTR [rbp-0x8],0x0
0x71c mov DWORD PTR [rbp-0x4],0x0
0x723 mov DWORD PTR [rbp-0x4],0x0
0x72a jmp 0x755
0x72c mov rax,QWORD PTR [rbp-0x18]
0x730 movzx eax,BYTE PTR [rax]
0x733 movsx eax,al
0x736 mov edx,DWORD PTR [rbp-0x8]
0x739 shl edx,0x6
0x73c add edx,eax
0x73e mov eax,DWORD PTR [rbp-0x8]
0x741 shl eax,0x10
0x744 add eax,edx
0x746 sub eax,DWORD PTR [rbp-0x8]
0x749 mov DWORD PTR [rbp-0x8],eax
0x74c add QWORD PTR [rbp-0x18],0x1
0x751 add DWORD PTR [rbp-0x4],0x1
0x755 mov eax,DWORD PTR [rbp-0x4]
0x758 cmp eax,DWORD PTR [rbp-0x1c]
0x75b jb 0x72c
0x75d mov eax,DWORD PTR [rbp-0x8]
0x760 pop rbp
0x761 ret
Figure 5.1: A CFG showing the basic blocks of an implementation of the SDBM hash algorithm.
Solid nodes and edges indicate the template-match.
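To make the matching step concrete, the following toy C sketch checks an SCC, pre-reduced to (mnemonic, constant) pairs, against the SDBM fingerprint of Table 2.1. The reduction to pairs is assumed here for brevity; HARM-DOS itself performs this step on the disassembled binary.

#include <stdio.h>
#include <string.h>

struct insn { const char *mnemonic; long constant; };

/* SDBM fingerprint: a shift-left by 6 and a shift-left by 16 (0x10). */
static const struct insn sdbm_fingerprint[] = { { "shl", 6 }, { "shl", 16 } };

/* Returns 1 if every fingerprint pair occurs among the SCC instructions. */
static int matches_fingerprint(const struct insn *scc, size_t n) {
    size_t nfp = sizeof sdbm_fingerprint / sizeof sdbm_fingerprint[0];
    for (size_t f = 0; f < nfp; f++) {
        int found = 0;
        for (size_t i = 0; i < n && !found; i++)
            found = strcmp(scc[i].mnemonic, sdbm_fingerprint[f].mnemonic) == 0
                    && scc[i].constant == sdbm_fingerprint[f].constant;
        if (!found) return 0;
    }
    return 1;
}

int main(void) {
    /* The SCC of Figure 5.1, reduced to its constant-bearing instructions
     * (the shl at 0x739, the shl at 0x741, and the add at 0x74c). */
    struct insn scc[] = { { "shl", 6 }, { "shl", 16 }, { "add", 1 } };
    printf("SDBM candidate: %s\n", matches_fingerprint(scc, 3) ? "yes" : "no");
    return 0;
}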
B Hash Transplant Example
To illustrate the hash transplant procedure, we use a replacement hash function from the Multilinear set of universal hash functions to replace the hash function shown in Listing 5.1. In order to
generate pseudorandom values at run time, we add an implementation of the Xorshift PRNG [89],
introduced by Marsaglia, to the replacement hash function. A source code representation of the
replacement hash function is shown in Listing 5.2. The initialization values for the hash value
(line 3) and the random state (line 4) are chosen randomly for every patch and are therefore unknown to an attacker. On line 8, the hash value is updated in a loop according to the next input
character. Lines 9 to 11 implement the Xorshift PRNG.
Next, we illustrate the process of replacing a hash function, using the weak hash function shown
in Listing 5.1, a source code implementation of the DEK hash algorithm. Note that in Listing 5.2
the replacement hash function receives its input and yields its output in exactly the same way as the
original hash function (line 1). The assembly code for the replacement hash function is shown in
Listing 5.3; note that its immediates 0x42021 and 0x4078601 are the decimal constants 270369
and 67601921 of Listing 5.2. Figure 5.2 shows how the assembly code of the replacement hash
function is inserted into the binary. The first five instructions of the replacement hash are inserted
by overwriting the original hash function. The 16 overflowing instructions are inserted in 4 code
caves. The grayed-out instructions represent the last instruction of the function before the code
cave starts.
1 unsigned int hash(const char *str, unsigned int len) {
2     unsigned int h = len;
3     unsigned int i = 0;
4     for (i = 0; i < len; ++str, ++i) {
5         h = ((h << 5) ^ (h >> 27)) ^ (*str);
6     }
7     return h;
8 }
Listing 5.1: The hash function we aim to replace, a source code implementation of the DEK
hash algorithm [47].
1 unsigned int univ_hash(const char *str, unsigned int len) {
2     char c;
3     int h = 270369;
4     int r = 67601921;
5     int i;
6     for (i = 0; i < len; i++) {
7         c = str[i];
8         h += (r * c);
9         r ^= r << 13;
10        r ^= r >> 17;
11        r ^= r << 5;
12    }
13    return h;
14 }
Listing 5.2: A source code representation of the hash function we use as a replacement for the
weak hash function in Listing 5.1.
1  xor r8d, r8d
2  mov edx, 0x4078601
3  mov eax, 0x42021
4  loop:
5  cmp rsi, r8
6  je done
7  movsx ecx, [rdi + r8]
8  inc r8
9  imul ecx, edx
10 movsxd rcx, ecx
11 add rax, rcx
12 mov ecx, edx
13 shl ecx, 0xd
14 xor ecx, edx
15 mov edx, ecx
16 sar edx, 0x11
17 xor ecx, edx
18 mov edx, ecx
19 shl edx, 0x5
20 xor edx, ecx
21 jmp loop
22 done:
23 ret
Listing 5.3: Assembly code of Listing 5.2.
C DJB Hash Variations
We show two common variations of the DJB hash algorithm in Listings 5.4 and 5.5. In the source
code implementation illustrated in Listing 5.4, often referred to as DJBX33A, an addition operation is performed between the current hash value and the next input byte. On the other hand, in
Listing 5.5, often referred to as DJBX33X, an exclusive or operation is used instead.
Vulnerable
0a4f hash:
0a4f mov eax,esi
0a51 xor edx,edx
0a53 cmp esi,edx
0a55 jbe a65
0a57 movsx ecx,[rdi+rdx*1]
0a5b rol eax,0x5
0a5e inc rdx
0a61 xor eax,ecx
0a63 jmp a53
0a65 ret
0a70 func1:
· · ·
0a7e jmp ...
0a80 nop
. . .
0a8c nop
0a95 func2:
· · ·
0a9f hlt
0aa0 nop
. . .
0aac nop
0ab0 fun3:
· · ·
0ab4 ret
0ab5 nop
. . .
0abf nop
0ac9 fun4:
· · ·
0acf ret
0ad0 nop
. . .
0ad7 nop
Patched
0a4f hash:
0a4f xor r8d,r8d
0a52 mov edx,0x4078601
0a57 mov eax,0x42021
0a5c cmp rsi,r8
0a5f je 0ad6
0a61 jmp 0a80
0a70 func1:
· · ·
0a7e jmp ...
0a80 movsx ecx,[rdi+r8*1]
0a85 inc r8
0a88 imul ecx,edx
0a8b jmp 0aa0
0a95 func2:
· · ·
0a9f ret
0aa0 movsxd rcx,ecx
0aa3 add rax,rcx
0aa6 mov ecx,edx
0aa8 shl ecx,0xd
0aab jmp 0ab5
0ab0 fun3:
· · ·
0ab4 ret
0ab5 xor ecx,edx
0ab7 mov edx,ecx
0ab9 sar edx,0x11
0abc xor ecx,edx
0abe mov edx,ecx
0ac0 jmp 0ad0
0ac9 fun4:
· · ·
0acf ret
0ad0 shl edx,0x5
0ad3 xor edx,ecx
0ad5 jmp 0a5c
0ad6 ret
Figure 5.2: An illustration of how a hash function is replaced, by overwriting the original and
adding instructions to code caves. Every code cave in the vulnerable binary receives a number of
instructions of the patch, shown in Listing 5.3. The grayed-out instructions (here, the trailing
instructions of func1 through fun4) show the last instruction of each function before the padding
bytes (i.e. the code cave) start. Every code cave ends with a jump instruction to the code cave
housing the next patch instructions.
1 uint djbx33a(char *str, uint len) {
2     uint h = 5381;
3     uint i = 0;
4     for (i = 0; i < len; ++i) {
5         h = ((h << 5) + h) + (str[i]);
6     }
7     return h;
8 }
Listing 5.4: One common implementation of the DJB hash function, often referred to as
DJBX33A. Note the addition operation on line 5.
1 uint djbx33x(char *str, uint len) {
2     uint h = 5381;
3     uint i = 0;
4     for (i = 0; i < len; ++i) {
5         h = ((h << 5) + h) ^ (str[i]);
6     }
7     return h;
8 }
Listing 5.5: A second common implementation of the DJB hash function, often referred to as
DJBX33X. Note the exclusive-or operation on line 5.
D Multilinear Hash
Lemire and Kaser introduce a strongly universal set of hash algorithms, named Multilinear [55]. For a string
s = s_0 s_1 ... s_{n-1} of length n, the hash value is calculated as

h(s) = m_0 + \sum_{i=0}^{n-1} m_{i+1} s_i

where m_0, m_1, ..., m_n are random values. Every selection of random values defines a different hash
algorithm in this set. Since the replacement hash algorithm has to handle strings of any length,
using a hash algorithm from the Multilinear set requires us to generate random, or pseudorandom, values on the fly. We achieve this by incorporating an implementation of a PRNG into
the hash function.
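The following minimal C sketch evaluates this formula directly, with the multipliers m_0, ..., m_n supplied up front; the values chosen are arbitrary placeholders (in the patched binaries they are produced on the fly by the Xorshift PRNG, as in Listing 5.2):

#include <stdint.h>
#include <stdio.h>

/* Computes h(s) = m[0] + sum over i of m[i+1] * s[i] for a string of
 * length n; m must hold at least n + 1 values. */
static uint32_t multilinear(const char *s, size_t n, const uint32_t *m) {
    uint32_t h = m[0];
    for (size_t i = 0; i < n; i++)
        h += m[i + 1] * (uint32_t)(unsigned char)s[i];
    return h;
}

int main(void) {
    const uint32_t m[] = { 17u, 3u, 5u, 7u };  /* arbitrary example multipliers */
    /* h("abc") = 17 + 3*'a' + 5*'b' + 7*'c' */
    printf("%u\n", multilinear("abc", 3, m));
    return 0;
}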