Automatic Test Generation System for Software
by
Jianwei Zhang
A Dissertation Presented to the
FACULTY OF THE GRADUATE SCHOOL
UNIVERSITY OF SOUTHERN CALIFORNIA
In Partial Fulfillment of the
Requirements for the Degree
DOCTOR OF PHILOSOPHY
(Electrical Engineering)
May 2021
Copyright 2021 Jianwei Zhang
Acknowledgements
I would like to express my sincere gratitude to my advisor, Prof. Sandeep Gupta, for guiding me
on my journey of academic pursuit. He always gives me courage, energy, and inspiration. He is so
knowledgeable, kind, positive, and humorous. It’s my greatest honor to be his student!
And I would like to express my special thanks to my research advisor, Prof. William G.J.
Halfond, for giving me a lot of help with this research. His patience, immense knowledge, and
enthusiasm helped me overcome the challenges along the way!
Also, I cannot express enough thanks to my committee member, Prof. Pierluigi Nuzzo, for his
continued support, encouragement, and insightful comments!
Last but not least, I would like to thank my family, my parents and my wife, for all the
support!
Table of Contents
Acknowledgements ii
List of Tables vii
List of Figures viii
Abstract x
Chapter 1: Background and Motivation 1
1.1 Software testing . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1
1.2 Test data quality and mutation testing . . . . . . . . . . . . . . . . . . . . . . . . 2
1.3 Automatic test data generation (ATG) for mutant . . . . . . . . . . . . . . . . . . 3
1.4 Limitations of existing SW ATG for mutant . . . . . . . . . . . . . . . . . . . . . 5
1.4.1 Equivalent mutant problem . . . . . . . . . . . . . . . . . . . . . . . . . . 5
1.4.2 Scalability . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5
1.4.3 Test data quality . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5
1.4.4 Test oracle problem . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6
1.5 Motivation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6
1.6 Research objectives . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8
1.7 Research scope . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8
Chapter 2: Equivalent Mutant Identification 9
2.1 Equivalent mutant problem . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9
2.2 Key ideas . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10
2.3 Minimal region of analysis (mROA) . . . . . . . . . . . . . . . . . . . . . . . . . 11
2.3.1 Control flow graph . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12
2.3.2 Identify minimal ROA in original program . . . . . . . . . . . . . . . . . 13
2.3.3 Identify minimal ROA in mutant . . . . . . . . . . . . . . . . . . . . . . . 13
2.3.4 Code behavior modeling . . . . . . . . . . . . . . . . . . . . . . . . . . . 14
2.3.5 The code behavior within ROA . . . . . . . . . . . . . . . . . . . . . . . 16
2.3.5.1 Derive control and data dependence graph (CDDG) from CFG . 17
2.3.5.2 Identify ROA’s CRIU and OROD . . . . . . . . . . . . . . . . . 18
2.3.5.3 Symbolic execution within ROA . . . . . . . . . . . . . . . . . 19
2.4 Identify equivalent mutant through constraint solving . . . . . . . . . . . . . . . . 20
2.5 ROA expansion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 20
2.6 Implementation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 21
2.7 Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 22
2.8 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 24
Chapter 3: SW D-Algorithm 25
3.1 Challenges in constraint based test generation methods . . . . . . . . . . . . . . . 25
3.2 Key ideas . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 26
3.3 Essential properties of SW and SW D-algorithm . . . . . . . . . . . . . . . . . . . 27
3.3.1 Basic Block in SW – mROA . . . . . . . . . . . . . . . . . . . . . . . . . 27
3.3.2 Interconnections in SW . . . . . . . . . . . . . . . . . . . . . . . . . . . . 27
3.3.3 Mutation effect . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 28
3.3.4 Execution status . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 30
3.3.5 Active DU path, dead DU path, and potential DU path . . . . . . . . . . . 31
3.3.6 Unjustified element and the unjustified element list . . . . . . . . . . . . . 32
3.3.7 Subtask and the subtask stack . . . . . . . . . . . . . . . . . . . . . . . . 33
3.3.8 Algorithm state . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 33
3.3.9 Backtrack limit . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 33
3.4 Essential procedures in SW ATG D-algorithm . . . . . . . . . . . . . . . . . . . . 34
3.4.1 Implication . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 34
3.4.1.1 Value assignment implication . . . . . . . . . . . . . . . . . . . 35
3.4.1.2 EXS implication . . . . . . . . . . . . . . . . . . . . . . . . . . 36
3.4.2 Mutation effect excitation (MEE) subtask . . . . . . . . . . . . . . . . . . 39
3.4.2.1 MEE creation . . . . . . . . . . . . . . . . . . . . . . . . . . . 39
3.4.2.2 MEE subtask processing . . . . . . . . . . . . . . . . . . . . . . 39
3.4.3 Mutation effect propagation (MEP) subtask . . . . . . . . . . . . . . . . . 41
3.4.3.1 MEP creation . . . . . . . . . . . . . . . . . . . . . . . . . . . 41
3.4.3.2 MEP subtask processing . . . . . . . . . . . . . . . . . . . . . . 41
3.4.4 Justification (JUST) subtask . . . . . . . . . . . . . . . . . . . . . . . . . 43
3.4.4.1 JUST for a node . . . . . . . . . . . . . . . . . . . . . . . . . . 43
3.4.4.2 JUST for a use . . . . . . . . . . . . . . . . . . . . . . . . . . . 43
3.4.4.3 JUST for a definition . . . . . . . . . . . . . . . . . . . . . . . 44
3.4.4.4 JUST for an EC . . . . . . . . . . . . . . . . . . . . . . . . . . 44
3.4.5 Initialization . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 44
3.4.6 SW D-algorithm overview . . . . . . . . . . . . . . . . . . . . . . . . . . 44
3.5 2-pass test generation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 45
3.6 Implementation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 48
3.7 Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 49
3.8 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 50
Chapter 4: Advanced methods 51
4.1 Extending the scope of our ATG tool . . . . . . . . . . . . . . . . . . . . . . . . . 52
4.1.1 Support more data types . . . . . . . . . . . . . . . . . . . . . . . . . . . 52
4.1.2 Support non-linear expressions . . . . . . . . . . . . . . . . . . . . . . . . 53
4.2 Guaranteeing completeness of the SW D-algorithm . . . . . . . . . . . . . . . . . 53
4.2.1 Observations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 53
4.2.2 Key ideas . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 54
4.2.3 Our new approach . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 54
4.2.4 Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 55
4.3 Improving the efficiency of SW D-algorithm . . . . . . . . . . . . . . . . . . . . . 55
4.3.1 Edge condition assist . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 57
4.3.1.1 Challenges . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 57
4.3.1.2 Key ideas . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 58
4.3.1.3 EC assist for executability of ROA . . . . . . . . . . . . . . . . 59
4.3.1.4 EC assist for controllability of RIU . . . . . . . . . . . . . . . . 63
4.3.1.5 EC assist for observability of ME . . . . . . . . . . . . . . . . . 64
4.3.1.6 Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 66
4.3.2 Possible value pre-computation . . . . . . . . . . . . . . . . . . . . . . . 67
4.3.2.1 Observations . . . . . . . . . . . . . . . . . . . . . . . . . . . . 67
4.3.2.2 Key ideas . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 67
4.3.2.3 Problem statement . . . . . . . . . . . . . . . . . . . . . . . . . 68
4.3.2.4 Our new approach . . . . . . . . . . . . . . . . . . . . . . . . . 68
4.3.2.5 Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 70
4.3.3 Improving use to def justification . . . . . . . . . . . . . . . . . . . . . . 70
4.3.3.1 Challenges . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 70
4.3.3.2 Key ideas . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 71
4.3.3.3 Problem statement . . . . . . . . . . . . . . . . . . . . . . . . . 71
4.3.3.4 Our new approach . . . . . . . . . . . . . . . . . . . . . . . . . 71
4.3.3.5 Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 72
4.3.4 Data transformation history for possible value pre-computation . . . . . . . 73
4.3.4.1 Observations . . . . . . . . . . . . . . . . . . . . . . . . . . . . 73
4.3.4.2 Key ideas . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 73
4.3.4.3 Problem statement . . . . . . . . . . . . . . . . . . . . . . . . . 73
4.3.4.4 Our new approach . . . . . . . . . . . . . . . . . . . . . . . . . 74
4.3.4.5 Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 75
4.3.5 Other improvements . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 75
4.3.5.1 Test template . . . . . . . . . . . . . . . . . . . . . . . . . . . . 75
4.3.5.2 Indirect EXS Implication . . . . . . . . . . . . . . . . . . . . . 77
4.3.5.3 Multi-pass justification . . . . . . . . . . . . . . . . . . . . . . 78
4.3.5.4 Add randomness to constraint solving process . . . . . . . . . . 79
Chapter 5: SW ATG System 80
5.1 Multi-pass test generation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 80
5.1.1 Observations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 80
5.1.2 Our approach . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 81
5.2 Equivalent mutant elimination during test generation . . . . . . . . . . . . . . . . 82
5.2.1 Observations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 82
5.2.2 Our approach . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 83
5.3 Test compaction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 83
5.3.1 Observations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 83
5.3.2 Our approach . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 84
5.4 Complete SW ATG system . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 84
Chapter 6: Conclusion 86
6.1 Experiments and results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 86
6.1.1 Experiments setup . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 86
6.1.2 Experiments process . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 88
6.1.3 Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 88
6.2 Contributions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 91
6.3 Limitations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 92
6.4 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 92
References 93
List of Tables
1.1 Analogy between HW testing and SW mutation testing . . . . . . . . . . . . . . . 7
2.1 Results of equivalent mutant checking for various programs . . . . . . . . . . . . . 23
3.1 Comparison between global scale approach and our approaches . . . . . . . . . . . 49
4.1 Compare previous D-algorithm with the improved D-algorithm - part 1 . . . . . . . 55
4.2 Compare previous D-algorithm with the improved D-algorithm - part 2 . . . . . . . 67
4.3 Compare previous D-algorithm with the improved D-algorithm - part 3 . . . . . . . 70
4.4 Compare previous D-algorithm with the improved D-algorithm - part 4 . . . . . . . 72
4.5 Compare previous D-algorithm with the improved D-algorithm - part 5 . . . . . . . 75
6.1 Details of the programs under test . . . . . . . . . . . . . . . . . . . . . . . . . . 86
6.2 Comparison among different approaches . . . . . . . . . . . . . . . . . . . . . . . 89
6.3 Detailed view of all ATG system phases . . . . . . . . . . . . . . . . . . . . . . . 90
List of Figures
2.1 The example code and its CFG . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12
2.2 mROA_O and mROA_M for the mutation at line 6 . . . . . . . . . . . . . . . . . . 14
2.3 CDDG_O with mROA_O / CDDG_M with mROA_M . . . . . . . . . . . . . . . . . . . . 18
2.4 Two procedures for finding CRIU and OROD . . . . . . . . . . . . . . . . . . . . 19
2.5 The CFG of an example code and its mutant . . . . . . . . . . . . . . . . . . . . . 21
3.1 The example code and its CFG with data dependency information . . . . . . . . . 29
3.2 The original program and its mutant with the identified mutation effect . . . . . . . 30
3.3 All DU paths of the original program and its mutant . . . . . . . . . . . . . . . . . 32
3.4 Parts of the example program to show cases of value implication . . . . . . . . . . 35
3.5 Parts of the example program to show cases of EXS implication . . . . . . . . . . 37
3.6 The original program and its mutant with the identified MEEmROA . . . . . . . . 40
3.7 The original program and its mutant with the identified MEEmROA and MEPmROA 42
3.8 The decision tree of the proposed SW D-algorithm . . . . . . . . . . . . . . . . . 45
3.9 The original program and its mutant with the identified MEE and MEP mROAs . . 47
3.10 The search tree of the example . . . . . . . . . . . . . . . . . . . . . . . . . . . . 48
4.1 A general case during MEP identification . . . . . . . . . . . . . . . . . . . . . . 53
4.2 Illustration of ROA . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 56
4.3 A code snippet and its mutant . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 58
4.4 EC assist for executability of ROA - part 1 . . . . . . . . . . . . . . . . . . . . . . 60
4.5 EC assist for executability of ROA - part 2 . . . . . . . . . . . . . . . . . . . . . . 62
4.6 EC assist for executability of ROA - part 3 . . . . . . . . . . . . . . . . . . . . . . 63
4.7 EC assist for controllability of RIU - part 1 . . . . . . . . . . . . . . . . . . . . . 64
4.8 EC assist for controllability of RIU - part 2 . . . . . . . . . . . . . . . . . . . . . 65
4.9 EC assist for observability of ME . . . . . . . . . . . . . . . . . . . . . . . . . . . 66
4.10 Example for possible value computation . . . . . . . . . . . . . . . . . . . . . . . 68
4.11 Forward pre-computation from def to use . . . . . . . . . . . . . . . . . . . . . . 69
4.12 Forward pre-computation from use to def . . . . . . . . . . . . . . . . . . . . . . 69
4.13 Backward use to def justification . . . . . . . . . . . . . . . . . . . . . . . . . . . 71
4.14 Improved backward use to def justification . . . . . . . . . . . . . . . . . . . . . . 72
Abstract
After decades of research and development, test generation for digital hardware is highly
automated, scalable (in practice), and provides high test quality. In contrast, current software
automatic test data generation approaches suffer from either low test quality or high complexity.
One of the important reasons for this discrepancy is that hardware automatic test pattern
generation (ATPG) is fault oriented. Although mutation-oriented [1] (mutations are analogous to
faults in hardware) constraint-based test data generation for software was proposed to generate
high quality test data focusing on real program bugs, all existing implementations require
symbolic analysis of the whole program and hence are not scalable even for unit testing, i.e.,
testing the lowest-level software modules. Importantly, these approaches generate too many tests,
and hence require impractically high manual effort during testing (the test oracle problem [2]).
Also, the equivalent mutant problem [3] remains an open question that affects the efficiency of
test generation and the accuracy of the mutation score.
In this research, we study the similarities and differences between software (SW) testing and
hardware (HW) testing, and apply important insights from hardware testing to improve existing
software mutation-oriented testing. We combine global structural static analysis and a sequence
of small and reusable symbolic analyses of local parts of the program, instead of symbolically
executing each mutated version of the entire program, to reduce run-time complexity and improve
scalability.
In particular, we propose the first approach that uses local analysis in software testing to
identify equivalent mutants, and a new method, inspired by the hardware D-algorithm [4] and
divide and conquer, for software unit automatic test generation (ATG).
In addition, we develop multiple new algorithms and heuristics to further reduce run-time
complexity and improve the test quality provided by our SW D-algorithm. We also propose a
multi-pass SW ATG system for an optimized test generation process that reduces run-time
complexity and the number of tests generated.
We compare our tools with one of the best state-of-the-art software test generation tools
(EvoSuite [5], which won the SBST 2017 tool competition). The results show that our SW ATG
system generates perfect quality unit tests in a scalable manner. We also demonstrate that our
approach dramatically reduces the number of tests and hence drastically reduces the effort for
testing.
Finally, our research dramatically shifts the thinking and opens up a new direction of research
in SW testing by showing, for the first time, that deterministic search based techniques can
indeed be more powerful.
Chapter 1
Background and Motivation
1.1 Software testing
Software (SW) testing is the process of evaluating the system under test (SUT) to check whether
the specific design requirements are satisfied. This is an inevitable step before the software can be
released to users with confidence. Software testing is one of the important stages of the software
development life cycle (SDLC), especially for safety-critical systems (e.g., flight control system,
nuclear system) and security-critical systems (e.g., banking system).
The Y2K bug (i.e., the millennium bug) [6] is known as one of the worst programming errors in
history, and it was caused simply by the abbreviation of the calendar year. In the 1990s, most
computer programs still used an abbreviated representation of the 4-digit year (e.g., 1997 was
shortened to 97) in order to save memory, which was precious at that time. As the year 2000
approached, it was believed that certain programs might interpret year 00 not as year 2000 but
as year 1900. Wrong dates in computers could cause disasters in multiple time-sensitive
industries. For example, banks might calculate interest incorrectly due to the wrong interval
used, and airline scheduling might become chaotic due to the incorrect date. Anxiety about the
Y2K bug spread widely across the world, and it cost roughly $300 billion for programmers to
update critical programs throughout all industries. As a result, catastrophe was avoided, and
life proceeded as normal in the year 2000. It was fortunate that we realized and fixed the
potential Y2K bug before it broke out. But this is not always the case. In 1996, the maiden
flight of Europe's newest unmanned satellite-launching rocket, Ariane 5, ended in a failure.
It disintegrated and exploded after only about 40 seconds. The Ariane 5 itself cost nearly
$8 billion, and it carried a $500 million satellite payload when it exploded. The lack of
software testing deserves the blame for this accident [7]. Before the disaster, the on-board
computer tried to store a 64-bit number in a 16-bit space, which led to an overflow that crashed
both the main and the backup computers.
Software testing is a necessary procedure to ensure the correctness of software, and a
high-quality test set is required to expose any potential programming errors. According to [8],
software testing itself accounts for more than 50% of the total development cost. The testing
process is labor intensive: human effort is required to generate test inputs and to produce test
oracles. A test oracle is a mechanism that determines whether a software system's behavior is
correct or not.
Due to the high cost of the software testing process, automatic software test data generation is
an active research area in software testing. Several techniques have been proposed and evaluated
during the last few decades [8], and they fall into five categories: (1) structural testing using
symbolic execution, (2) model-based testing, (3) combinatorial testing, (4) random testing and
its variant, adaptive random testing, and (5) search-based testing. However, test oracle
generation is less automated. In most cases, testers are still required to manually analyze the
system's behavior and generate test oracles. This prevents the software testing procedure from
being fully automatic. It also makes the cost of testing proportional to the number of test cases.
1.2 Test data quality and mutation testing
Given a set of test cases, its quality is commonly measured by code coverage [9], which is a
structural metric, i.e., it captures the percentage of lines and branches that are covered
(executed) by the tests. Ghosh and Fujita [10] also proposed a HW RTL-level ATPG algorithm that
adapts the idea of line coverage from SW testing. But a recent paper [11] shows that the
effectiveness of this quality metric is doubtful, especially for identifying real programming
errors that are deeply hidden within the program.
In the late 1970s, mutation testing was proposed by DeMillo et al. [1] to provide another option
for evaluating the quality of a given test set. A software mutant is defined as a modified
program that diverges from the original program, usually by one minor change at one statement.
It mimics a real error programmers might make. A mutant is marked as killed if we have a test
case for which the program's outcome (e.g., the values of its outputs) for the mutant is
different from the outcome for the original program. The mutation score is the percentage of
mutants killed over the whole set of generated mutants, and it represents the quality of the
given test set. Recent research [12] shows a strong positive correlation between a high mutation
score and a high coverage of real programming errors. This is because the mutation score metric
is stricter than code coverage metrics: to kill a mutant, a test case must not only execute the
mutation site (code coverage) but also trigger the mutation and produce erroneous program outputs.
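To make these definitions concrete, the sketch below (in Python, using a hypothetical `is_leap` unit and hand-picked mutants that are illustrative only, not taken from this dissertation or any mutation tool) computes a mutation score for a small test set. A mutant is killed when at least one test makes its output differ from the original's.

```python
def is_leap(y):
    """Original unit under test: the Gregorian leap-year rule."""
    return y % 4 == 0 and (y % 100 != 0 or y % 400 == 0)

# Each mutant applies one minor change at one expression, mimicking a
# real programmer error.
mutants = [
    lambda y: y % 4 == 0 or (y % 100 != 0 or y % 400 == 0),   # 'and' -> 'or'
    lambda y: y % 4 == 0 and (y % 100 == 0 or y % 400 == 0),  # '!=' -> '=='
    lambda y: y % 4 == 0 and (y % 100 != 0 or y % 200 == 0),  # 400  -> 200
]

tests = [1996, 1900, 2000]

# A mutant is "killed" if some test produces an output that differs
# from the original program's output on the same input.
killed = sum(any(m(t) != is_leap(t) for t in tests) for m in mutants)
score = killed / len(mutants)   # mutation score of this test set
print(score)
```

Here the third mutant survives: none of the three tests distinguishes it (a year such as 200 would), so the score is 2/3, signaling that the test set should be strengthened.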
1.3 Automatic test data generation (ATG) for mutant
Although the initial purpose of SW mutation testing was to find a more accurate way to evaluate
the quality of a given test set, it opened the door to mutation based test generation for SW.
Mutation based test generation was first proposed to improve the mutation score [13], which has
a positive correlation [12] with test set quality. It generates tests for mutants that remain
alive after mutation testing with an initial set of tests. The generated tests are then added to
the original test set to form a stronger test set.
Mutation based test generation has gained more attention from software researchers in recent
years due to its ability to generate high quality tests. Random test data generation [8] is the
most basic approach and is easy to implement. However, its limitations are obvious: some mutants
(hard-to-kill mutants) are difficult to kill solely with randomly generated test data.
The second category is constraint based test generation. DeMillo et al. [13] proposed the first
method to generate a test for a targeted mutant by converting the program into a constraint
system. The authors derive the constraints that satisfy the "reachability", "necessity", and
"sufficiency" conditions to expose the mutant. The derived constraints are then sent to a
constraint solver, and its solution is a test for this mutant. A recent paper [14] also
implements a similar constraint based test generation tool in Java and applies it to Java
programs. Papadakis and Malevris [15] focused on using existing code coverage based test
generation tools for mutation based test generation tasks. The authors developed an enhanced
control flow graph (CFG) by adding a special vertex for each mutation to the original program's
CFG, where each special vertex is connected to its original un-mutated node and the connection
between them represents the "necessity" condition of the related mutant. This method converts a
mutation based test generation problem into a branch coverage problem, which can be solved by
any existing branch coverage based test generation method. Other researchers [8] noticed that
complex path conditions and the path explosion problem prevent constraint based test generation
from being widely used. The authors in [16] and [17] proposed an approach that utilizes dynamic
symbolic execution (DSE) to generate mutation based tests. The DSE engine executes a program
using given concrete inputs and records symbolic path conditions during execution. When a path
condition becomes too complex for the constraint solver, the DSE engine can simplify the path
constraints using concrete run-time values. In contrast to static symbolic execution, this
method trades accuracy for lower complexity. All methods in this category are constraint based,
since constraint generation and solving are always involved. We note that none of these
approaches is scalable to large programs under test.
On the other hand, search-based test data generation was first suggested by Bottaci [18] to kill
mutants. This category of methods models the test generation task as a search problem guided by
a fitness function. EvoSuite [5] is one of the state-of-the-art tools in this field. Multiple
meta-heuristics, like simulated annealing and genetic algorithms, are applied to help solve the
search problem. Although the search based methods are scalable and applicable to any code, they
suffer from the limitations of the corresponding algorithms or heuristics. In particular, the
search based approaches might become stuck at local minima, spend too much time on convergence,
or terminate with incomplete solutions due to the design of the fitness function.
1.4 Limitations of existing SW ATG for mutant
1.4.1 Equivalent mutant problem
A mutant that is semantically equivalent to the original program is called an equivalent mutant.
In other words, there exists no test data that can "kill" an equivalent mutant. Hence, the
mutation score obtained by testing against a set of mutants that includes equivalent mutants
will be inaccurate (a lower score), unless the method is able to identify all equivalent mutants.
Also, the presence of equivalent mutants severely increases the run-time complexity of mutation
based ATG, because the test generation effort spent on equivalent mutants can be high and is
always wasted. Therefore, the equivalent mutant problem prevents mutation testing from broader
and more practical use.
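For instance (a hypothetical illustration, not an example from this dissertation), replacing `>=` by `>` in the absolute-value function below yields an equivalent mutant: the only input the two conditions classify differently, `x == 0`, produces the same output on both paths.

```python
def abs_val(x):
    if x >= 0:        # original condition
        return x
    return -x

def abs_val_mut(x):
    if x > 0:         # mutant: '>=' -> '>'
        return x      # at x == 0 the branch taken changes, but -0 == 0,
    return -x         # so the output never differs: an equivalent mutant

# No test data can kill this mutant; an exhaustive search over a range of
# inputs finds no output difference.
no_killer = all(abs_val(x) == abs_val_mut(x) for x in range(-1000, 1001))
print(no_killer)
```

A test generator unaware of the equivalence would exhaust its backtrack or solver budget on such a mutant before giving up, which is exactly the wasted effort described above.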
1.4.2 Scalability
Existing constraint based test data generation requires global symbolic analysis, which is
expensive and not scalable. When the program under test is large or has complex structures
(e.g., many branches, especially branches that re-converge), global symbolic analysis may become
impractical. In addition, constraint solvers may fail, within practical run time, to solve the
symbolic expressions created during symbolic execution when programs become larger or have more
complex structures.
1.4.3 Test data quality
Some approaches try to reduce computational complexity by considering only the "reachability"
and "necessity" conditions for the mutant (i.e., by "weakly" [19] killing the mutant). Tests
generated under the "weak" mutation assumption are potentially of low quality, since the
"sufficiency" condition is not considered, i.e., the erroneous effects are not guaranteed to
propagate to the program's outputs. Hence, the quality of the test data is compromised for the
reduction in complexity.
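The gap between weak and strong killing can be seen in a two-line unit (a hypothetical example): some inputs infect the internal state, yet the error is masked before it reaches the output.

```python
def bucket(x):
    t = x + 1          # original
    return t // 10     # the output observes t only coarsely

def bucket_mut(x):
    t = x + 2          # mutant: constant 1 replaced by 2
    return t // 10

# x = 0 weakly kills the mutant: the intermediate t differs (1 vs. 2), so
# reachability and necessity hold -- but integer division masks the
# difference, the outputs are identical, and the mutant survives strongly.
assert bucket(0) == bucket_mut(0)

# x = 8 also satisfies sufficiency: t is 9 vs. 10, and 9 // 10 != 10 // 10,
# so the erroneous effect propagates to the output and the mutant is killed.
assert bucket(8) != bucket_mut(8)
```

A generator that stops at weak killing would accept `x = 0` as a test and report the mutant covered, even though the test never exposes the error at the output.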
1.4.4 Test oracle problem
In software testing, it is challenging to determine whether the behavior of a software system
for a given test case is the desired behavior or an incorrect one. This is called the test
oracle problem, and it exists for most software systems due to the lack of formal specifications
and design-for-test principles. Therefore, software testers must manually check the software's
behavior under all necessary test cases. The test oracle problem is a bottleneck that prevents
the overall automation of software testing, and it remains an open problem. However, the cost of
test application, as well as the burden on the testers, can be alleviated by generating a more
compact set of tests for the desired test quality.
1.5 Motivation
Software automatic test generation (SW ATG) is not commonly used due to its limitations. In
contrast, hardware (HW) test generation has a high level of automation and is now a
near-universal industry practice. Several test generation algorithms were developed and refined
over five decades of research and development. The most common and successful algorithms used in
HW test generation, like the D-algorithm [4] and PODEM [20], are fault model based. Research and
practice show that a test set with high coverage of faults typically also provides high coverage
of real hardware defects, even when the faults in the model do not exactly capture the likely
defects.
HW fault based test generation and SW mutation based test generation are both based on abstract
fault models. Both have been shown to be effective in identifying real defects/errors. We find
that they share many similarities, and the major analogies between the two are listed in
Table 1.1.
The effectiveness of HW ATPG has been validated extensively via decades of use. Even though
single stuck-at faults do not accurately model real defects, high coverage of single stuck-at faults
indicates high coverage of real defects. This is the case because the coverage of single stuck-at
faults at every line satisfies many necessary conditions for coverage of real defects in all
parts of the circuit.

Table 1.1: Analogy between HW testing and SW mutation testing

                          HW testing       SW mutation testing
    Testing artifact      HW circuit       SW method/unit
    Data type             Boolean          Int, Real, String, etc.
    System I/P            Primary I/P      Method's arguments, user inputs, global variables
    System O/P            Primary O/P      Method's outputs, exceptions, state changes
    Descriptive language  Netlist          Lines of code
    Basic element         Gate             Statement/ROA
    Function description  Truth table      Symbolic expression
    Interconnections      Circuit lines    CFG paths, DU paths
    Element executability Always           Depends on run-time execution path
    Defect                Fault            Mutation
    Common fault model    Single stuck-at  Single mutation
Mutation based test generation shares similar properties. It is important for test cases to
exercise all parts of the code by applying a rich set of values at every line while
simultaneously making the value at the line observable at the output. Recent research [12] has
shown that a high mutation score indicates high coverage of real programming errors.
In HW ATPG, it is crucial to find a compact set of target faults to generate a compact set of
test patterns, because these patterns will be applied to millions of fabricated chips. In SW
ATG, we likewise wish to find a minimal set of target mutants for test generation, since a
golden model of the software does not exist and each test's outcome must be manually checked
(the test oracle problem). Although the oracle problem remains open, we can reduce its impact by
compacting the test set to its minimal scale while maintaining the test quality.
Given the many similarities shared by HW testing and SW mutation testing, we propose new methods
for a software ATG system inspired by the hardware D-algorithm and divide and conquer. We combine
global structural (i.e., static) analysis with a sequence of small and reusable symbolic analyses
of parts of the program, instead of symbolically executing each mutated version of the entire
program, to reduce run-time complexity and improve scalability. The main functions of our ATG
system include new scalable methods for eliminating equivalent mutants and generating a compact
set of mutation based tests. Our goal is to apply all important insights from hardware testing
to improve existing constraint based ATG for mutants in terms of scalability and effectiveness.
1.6 Research objectives
Given the limitations of existing SW ATG for mutants, we propose to develop a SW ATG system
that addresses these limitations. Specifically, our new SW ATG system must identify (and hence
eliminate) a majority of equivalent mutants from the set of generated mutants, and generate a
high quality yet compact set of unit tests for the remaining mutants in a scalable manner.
1.7 Research scope
In this research, we focus on mutation-oriented automatic test generation for unit level testing. A
unit is the smallest testable module of a SW system. For example, in Java programs, a unit could be
an individual method or a class.

We choose SW unit level testing as our scope based on the following observations: (1) Mutation
testing is a white box testing method, since it needs the source code for applying mutations (a
mutant is a slight change to the original code). (2) White box testing is mostly used for unit level
testing, since it is more practical at the unit level, and since unit testing is carried out at early
stages of the SW development cycle, when programmers are trying to find problems in their source code.
Therefore, unit level test generation is the most suitable scope for our research. In this thesis,
unless explicitly stated otherwise, "test" means a test at the unit level.
Chapter 2
Equivalent Mutant Identification
2.1 Equivalent mutant problem
An equivalent mutant is functionally equivalent to the original program and hence is not kill-able.
The mutation score obtained without eliminating equivalent mutants is not accurate, which prevents
the mutation score from serving as a reliable metric for evaluating a test set. Also, the computational
effort spent on test generation for equivalent mutants is wasted, and such attempts cannot be
completed within a practical run-time limit. Therefore, eliminating equivalent mutants before test
data generation is crucial to building a scalable, efficient and accurate SW ATG system.

Just et al. [21] proposed using constraint solving to identify equivalent mutants. However, it
requires whole program analysis, which is not scalable beyond small programs. Apart from this
deterministic method, Offutt et al. [22] proposed eliminating equivalent mutants using compiler
optimizations. Other researchers ([23], [24] and [25]) tried to prevent the generation of equivalent
mutants by using selected mutation operators. Nevertheless, all these approaches are based on
approximation, and their effectiveness is hard to guarantee. Thus, equivalent mutants are usually
detected manually in most of the mutation testing literature (e.g., [24] and [25]), due to the
non-scalability of all existing constraint based methods and the unpredictability of all approximation
methods. Hierons et al. [26] studied how to effectively guide the manual process of equivalence
checking through program slicing.
The parallel difficulty faced in HW testing is the identification of undetectable faults. Kunz
[27] applied recursive learning to identify undetectable faults. Kuehlmann et al. [28] used BDDs to
describe circuits and applied graph theory to prove the equivalence between two functionally
equivalent BDDs. These methods cannot be easily used in SW testing, because the data types in SW are
much more complex than the (binary) data type in HW.
2.2 Key ideas
Consider the fault collapsing process in HW testing: we use local gate information to eliminate
redundant faults from the fault list, using the properties of fault dominance and equivalence, to
reduce the computational complexity of ATPG. We may find a similar local area in a SW program
and simplify the equivalent mutant checking process by only analyzing behaviors and solving
constraints within this local scope.

We apply techniques inspired by HW testing theory to identify equivalent mutants. Our approach
limits the scope of analysis to a constrained area around the location of the targeted mutant, to
make the complexity of equivalent mutant identification, in the average case, independent of the
size of the program, which makes our method more scalable and universal.
We will formally define the region of analysis (ROA) later to spatially limit the search to
determine the detectability of a given mutant. For now, informally, an ROA is a code snippet around
the mutated location which "includes" the mutated location in each version of the program.
Also, the code and the program execution outside the ROA must be identical across all program
versions being analyzed. Given an original program O and a mutant M, their corresponding ROA_O
and ROA_M are identified such that if we remove the code of ROA_O from O and the code of ROA_M
from M, we get two identical partial programs, that is, (O - ROA_O) ≡ (M - ROA_M).

This requirement on the ROA ensures that we only need to analyze the behavior of the code snippet
within the ROA to capture the equivalence between O and M, because any captured state changes
within the ROA will then have identical effects on the outside partial code (O - ROA_O) or (M - ROA_M),
as these are identical. If the behavior of ROA_O is exactly the same as the behavior of ROA_M, we can
definitively conclude that O and M are equivalent (that is, M is an equivalent mutant), because
the equality within the ROAs will not be altered by the identical code outside.

In contrast, if the behaviors of ROA_O and ROA_M differ, we cannot conclude that M is
kill-able, because the identical code outside may prevent the propagation of the differences
caused by the mutation within ROA_M, making M equivalent after all. In other words, for a
given mutation and a given ROA, either we can definitely determine equivalence or we fail to do
so. In the latter case, we have the option of expanding the ROA (described later) and checking
again. By applying our method, fewer false positives (mutants identified as equivalent but actually
kill-able) can occur. We also note that checking whether the mutation effect can propagate is
similar to the X-path check [20] for fault effect propagation in HW ATPG.
2.3 Minimal region of analysis (mROA)
Each gate or line in HW is treated as a basic block, since its behavior can be captured strictly in
terms of values at its input(s) and output(s), with or without any fault inserted within the block. In
SW, this is true for non-conditional statements. However, the behavior of a conditional statement,
with or without a mutation, cannot be captured solely in terms of the values at the conditional
statement's value inputs/outputs, since the execution of the statement not only determines the
values at its outputs but also determines which branch is taken. For example, in Fig. 2.1, if there
is a mutation at statement 12 that changes "y = x + 1" to "y = x - 1", the behavior of the mutation
can be easily captured by the definition of y at line 12. Consider another mutation at statement 6
that changes "x < z" to "x <= z". Since no variable is defined here, the behavior of the mutant
can only be captured by knowing which branch will be taken (whether the condition evaluates to true
or false) under specific values of x and z.
To tackle this complication, we developed a new notion, the minimal region of analysis (mROA),
which contains the statement under study and a minimal number of additional statements.
The purpose is that all complications associated with any changes in program flow (i.e., executions)
are confined within the mROA. Here we use mROA_s to represent a minimal region of analysis for a
statement s. The behavior of this part of the program - with or without a mutation in statement
s - must be captured only in terms of values at the inputs and outputs (defined later) of mROA_s.

Figure 2.1: The example code and its CFG
2.3.1 Control flow graph
A CFG is a directed graph where each node represents a basic block and each edge represents a path
the control flow may follow. A CFG is generated from the program's code by traversing every statement.
Fig. 2.1 shows an example Java code and its CFG.

During static analysis on the CFG, the following notions are used: (1) Dominator – a node x dominates
a node y if every path from the entry node to y goes through x; we call x y's immediate dominator
if x is the closest such node. (2) Post-dominator – a node x post-dominates a node y if every path from y
to the exit node goes through x; we call x y's immediate post-dominator if x is the closest such node.
For example, in the CFG shown in Fig. 2.1, nodes 0, 1, 2, 3, 4, 5, and 6 are all dominators of node
16, and node 6 is the immediate dominator; nodes 2, 3, 4, 5, 6, 16, 17, and 18 are all post-dominators
of node 1, and node 2 is its immediate post-dominator.
2.3.2 Identify minimal ROA in original program

mROA_O is defined using the mutation's minimal region of influence (mmROI), which is the minimal
region where we can observe the execution change caused by the mutation. Also, we identify
mROA_O in a way such that it has a unique entry node and a unique exit node on CFG_O. This allows
us to capture mROA_O's behavior by only monitoring the value changes within it, without considering
any program flow changes from/to the scope of mROA_O. mROA_O is represented using
the line numbers of its entry and exit nodes. For example, an mROA_O denoted as [x;y] contains all
paths and nodes between line x and line y on CFG_O; an mROA_O denoted as [x;y) contains all paths
and nodes between line x and line y except node y on CFG_O.

There are two cases when finding mROA_O: (1) the mutated statement is a non-decision statement;
(2) the mutated statement is a decision statement. In the first case, mROA_O is the mutated
statement itself. In the second case, mROA_O contains all the nodes and paths between the decision
statement node and its immediate post-dominator (but exclusive of the immediate post-dominator)
on CFG_O, since the mutation's presence may change the program flow across different branches from
that decision statement. Fig. 2.2 shows the second case: the mutant changes line 6 from
"if (x < z)" to "if (x > z)", and the resulting mROA_O is [6;16).
2.3.3 Identify minimal ROA in mutant

In order to analyze the program behavior within the ROA between the original program and the
mutant, we need to identify mROA_M by projecting mROA_O onto mutant M's CFG_M, such that
(O - ROA_O) ≡ (M - ROA_M).

Figure 2.2: mROA_O and mROA_M for the mutation at line 6

After the mutant generation process, a function f is obtained to describe the line correspondences
between the original program and the mutant. Therefore, given mROA_O = [x;y]_O, we can get
mROA_M = [f(x);f(y)]_M on CFG_M. Fig. 2.2 shows mROA_O and its projected mROA_M for the
mutation at line 6.
2.3.4 Code behavior modeling

A BSF (Boolean switching function) is used in HW testing to describe the relations between a circuit's
input and output logic variables. In SW testing, the code behavior can be described using symbolic
expressions generated by symbolic execution [29].

In order to model a program using symbolic expressions, we need to identify the program's inputs
and outputs. We use the notion of a program's I/O streams to describe the program's inputs from
different sources and its outputs to different destinations. These sources/destinations include disk files,
other programs, user consoles, etc. The basic I/O streams, in Java for example [30], include byte
streams, character streams, buffered streams, I/O from the command line, etc. In a program, we
identify a set of variables which define its input streams, and a set of variables which define its
output streams. The variables we mention here carry a broad definition and include all forms
of the program's I/O streams. For instance, line 13 of the example code (see Fig. 2.1) contains no
traditional variable, but it contains a "hidden" variable which carries the output stream buffer
"a/2 is odd".

In software, a variable var is said to have a definition at statement x if var is assigned a value at
x; and var is said to have a use at statement x if var determines either the value of another variable
defined at statement x or the program flow, if statement x is a decision statement. In addition, we
consider the definition of variable var at statement x to reach a statement y if there is a path in the
CFG leading from x to y that doesn't pass through any other definition of var. Here we use SSA
(static single assignment) form, which represents a variable with the information of its static single
definition location. E.g., in the code shown in Fig. 2.1, variable a is defined at the entry (statement 0)
as an argument, and variable x is defined at statement 1. So the reaching definitions of statement 2
are def_0(a) and def_1(x).

For a program, we identify a set called the program's input definitions (PID) and a set called the
program's output definitions (POD), which include all definitions representing the program's I/O streams.
In the current research, we figure out PID and POD manually from the program's specification. In the
example shown in Fig. 2.1, definition def_0(a) is an argument of the method "function", so it
can be used to represent the input data stream. On the other hand, there are two output stream
definitions which display at the system's standard output; we name them O_13 and O_17. So we have
PID_O = {def_0(a)} and POD_O = {O_13, O_17} for the example code. PID_M and POD_M can be
identified in a similar way.
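The reaching-definitions idea can be illustrated on a straight-line fragment. The helper below is a simplified sketch that ignores branches: on a single path, the most recent definition of each variable is the one that reaches a later statement. It mirrors the Fig. 2.1 fact that def_0(a) and def_1(x) reach statement 2.

```java
import java.util.*;

// Reaching definitions on a straight-line fragment (no branches), as a
// minimal illustration; defs are ordered (lineNumber, variableName) pairs.
public class ReachingDefs {
    // Returns variable -> line of its reaching definition just before `line`.
    static Map<String, Integer> reachingBefore(List<Map.Entry<Integer, String>> defs, int line) {
        Map<String, Integer> reach = new LinkedHashMap<>();
        for (var d : defs) {
            if (d.getKey() >= line) break;
            reach.put(d.getValue(), d.getKey());  // a later def of the same variable kills the earlier one
        }
        return reach;
    }

    public static void main(String[] args) {
        // Mirrors the Fig. 2.1 example: a is defined at statement 0, x at
        // statement 1; both reach statement 2, giving def_0(a) and def_1(x).
        List<Map.Entry<Integer, String>> defs = List.of(Map.entry(0, "a"), Map.entry(1, "x"));
        System.out.println(reachingBefore(defs, 2));  // prints {a=0, x=1}
    }
}
```

On a full CFG the same question is answered path-by-path, which is what the CDDG construction in the next subsection captures.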
2.3.5 The code behavior within ROA

The ROA within the whole program is like a faulty sub-circuit in the CUT. In order to model the code
behavior within the ROA, we need to identify the ROA's I/O streams and generate a symbolic expression
as if it were a complete program. We identify a set called the ROA's input uses (RIU), which contains
all uses within the ROA of definitions made outside the ROA, and a set called the ROA's output
definitions (ROD), which contains all definitions within the ROA. Also, we generate a set of the ROA's
input variables (RIV) from RIU, and a set of the ROA's output variables (ROV) from ROD. For example,
in mROA_O of Fig. 2.2, node 6 uses definition def_5(x) defined at node 5, so use_6(x) belongs to RIU_O
and x belongs to RIV_O. Also, node 10 defines def_10(y), so def_10(y) belongs to ROD_O and y belongs
to ROV_O.

However, not all the uses in RIU can be controlled from outside through the input streams of the
entire program (PID), and not all the definitions in ROD can change the output streams of the entire
program (POD). Hence, we use the static information of the whole program to identify a
subset of RIU called the controllable ROA's input uses (CRIU), such that CRIU only contains the ROA's
input uses which can be affected by the definitions in PID. And we identify the observable ROA's
output definitions (OROD) from ROD, such that OROD only contains the ROA's output definitions
which can affect the definitions in POD. We need dependency information between statements
to identify the controllability and observability properties of each definition. Here we focus on two
kinds of dependencies: (1) RAW (read after write) data dependence - a use of a
variable var at statement line y is data dependent on the definition of that variable at statement line
x, def_x(var), if there exists a path (acyclic or cyclic) in the CFG between the node of statement line x
and the node of statement line y, and there is no other definition of var on this path. In this research,
only RAW data dependence is considered, so for convenience we will use "data dependence" to mean
RAW data dependence. In the CFG shown in Fig. 2.1, the use of variable x at node 10 is data
dependent on the definition of variable x at node 5 (def_5(x)). (2) Control dependence – a node x is
control dependent on node y if node y is a decision node and the result of its evaluation determines
whether node x is executed. In the CFG shown in Fig. 2.1, node 7 is control dependent on node 6.
2.3.5.1 Derive control and data dependence graph (CDDG) from CFG

In order to identify the ROA's CRIU and OROD, we propose a directed graph called the control and data
dependence graph (CDDG), which combines the CDG (control dependence graph) and the DDG (data
dependence graph) to capture both control and data dependence relations between the statements of
the program.

We use the program's CFG and other static information to generate its CDDG. Given the original
program O, we first generate CFG_O. For each node x of CFG_O: (1) Identify the definition at
node x in SSA form, and add it to a definition set denoted DEF^O_x. (2) Identify all reaching
definitions which are also used at node x, and add them to the set named URD^O_x. (3) Identify all
control dependent nodes of node x, and add their node numbers to the set called CDN^O_x. We then
create CDDG_O by starting with all nodes of CFG_O with their identified DEF, URD, and CDN sets,
and adding edges according to their data/control dependencies: for each node x, if URD^O_x contains
a definition of variable var at node p, we add an edge from node p to node x; if CDN^O_x contains node q,
we add an edge from node q to node x. Also, we project ROA_O onto CDDG_O by marking all nodes
belonging to the original ROA_O, including its entry and exit nodes. Starting from CFG_M and ROA_M,
we use an identical procedure to identify CDDG_M with ROA_M's projection.

Fig. 2.3 shows both CDDG_O with its mROA_O and CDDG_M with its mROA_M in the same graph.
They are generated from the CFG_O, CFG_M, mROA_O, and mROA_M shown in Fig. 2.2. We note that
CDDG_O and CDDG_M are identical, because the mutation doesn't change dependence relations.

The CDDG obtained from the above steps keeps all information about the dependence relations
between nodes: if there is a path from node x to node y on the CDDG, we know node y is dependent
on node x; and if node y is dependent on node x, there must be a path from node x to node y on the
CDDG. Therefore, the CDDG is sufficient to identify CRIU and OROD given the PID and POD of the
whole program.

Figure 2.3: CDDG_O with mROA_O / CDDG_M with mROA_M
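The edge-adding step above can be sketched directly from the per-node URD and CDN sets. The sketch below is a minimal illustration, not the dissertation's tool; the node data encodes the two dependence facts quoted from Fig. 2.1 (node 10 is data dependent on node 5, node 7 is control dependent on node 6).

```java
import java.util.*;

// Sketch of CDDG edge construction from per-node URD (reaching definitions
// used at the node, given as defining-node numbers) and CDN (control
// dependence parents), following steps (1)-(3) above.
public class CddgBuilder {
    static Map<Integer, Set<Integer>> buildEdges(Map<Integer, Set<Integer>> urd,
                                                 Map<Integer, Set<Integer>> cdn) {
        Map<Integer, Set<Integer>> edges = new TreeMap<>();
        for (int x : urd.keySet()) {
            for (int p : urd.get(x))
                edges.computeIfAbsent(p, k -> new TreeSet<>()).add(x); // data-dependence edge p -> x
            for (int q : cdn.getOrDefault(x, Set.of()))
                edges.computeIfAbsent(q, k -> new TreeSet<>()).add(x); // control-dependence edge q -> x
        }
        return edges;
    }

    public static void main(String[] args) {
        // Node 10 uses the definition made at node 5; node 7 is control dependent on node 6.
        Map<Integer, Set<Integer>> urd = Map.of(10, Set.of(5), 7, Set.of());
        Map<Integer, Set<Integer>> cdn = Map.of(7, Set.of(6), 10, Set.of());
        System.out.println(buildEdges(urd, cdn));  // prints {5=[10], 6=[7]}
    }
}
```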
2.3.5.2 Identify ROA's CRIU and OROD

Given CDDG_O, ROA_O, PID_O and POD_O, we use the procedures shown in Fig. 2.4 to derive RIU_O,
CRIU_O, ROD_O and OROD_O. We use the same approach to identify RIU_M, CRIU_M, ROD_M and
OROD_M.

Figure 2.4: Two procedures for finding CRIU and OROD
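The exact procedures of Fig. 2.4 are not reproduced in this transcript; the sketch below shows one plausible realization consistent with the surrounding text: a use is controllable if it is forward reachable from a PID node along CDDG edges, and a definition is observable if a POD node is reachable from it (equivalently, backward reachable). The CDDG used here is a hypothetical chain.

```java
import java.util.*;

// One plausible realization of the CRIU/OROD filtering: controllability is
// forward reachability from PID nodes on the CDDG; observability is backward
// reachability from POD nodes (i.e., reachability on the reversed CDDG).
public class CriuOrod {
    static Set<Integer> reachableFrom(Map<Integer, Set<Integer>> edges, Set<Integer> seeds) {
        Deque<Integer> work = new ArrayDeque<>(seeds);
        Set<Integer> seen = new TreeSet<>(seeds);
        while (!work.isEmpty())
            for (int n : edges.getOrDefault(work.pop(), Set.of()))
                if (seen.add(n)) work.push(n);
        return seen;
    }

    public static void main(String[] args) {
        // Hypothetical CDDG chain 0 -> 5 -> 6 -> 10 -> 17, PID at node 0, POD at node 17
        Map<Integer, Set<Integer>> fwd = Map.of(0, Set.of(5), 5, Set.of(6),
                                                6, Set.of(10), 10, Set.of(17));
        Map<Integer, Set<Integer>> bwd = Map.of(17, Set.of(10), 10, Set.of(6),
                                                6, Set.of(5), 5, Set.of(0));
        System.out.println(reachableFrom(fwd, Set.of(0)));   // filter RIU by this set -> CRIU
        System.out.println(reachableFrom(bwd, Set.of(17)));  // filter ROD by this set -> OROD
    }
}
```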
2.3.5.3 Symbolic execution within ROA

Symbolic execution is used to generate the symbolic expression for the ROA. It explicitly explores all
paths up to a specified depth limit. This process may have high complexity when the program
is large or contains loops. The confined symbolic execution reduces complexity as well as the
probability of path explosion.

In CRIU and OROD, multiple definitions may refer to the same variable. We store the input
variable names in a set named CRIS (controllable ROA's input symbols), and the output variable
names in a set named OROS (observable ROA's output symbols). In particular, for ROA_O and
ROA_M, we identify four sets of variable names: CRIS_O, CRIS_M, OROS_O and OROS_M.

Given a program and its ROA, CRIS, and OROS, we perform the following steps to generate the
symbolic expressions describing the ROA's behavior: (1) Assign each variable in CRIS a symbolic
value and start symbolic execution. (2) The program runs symbolically and generates a path condition
(PC) for each path. Each time the execution reaches the ROA's exit, we check the symbolic
expressions of all variables in OROS and use the conjunction of their expressions to form the path
output (PO). We use the conjunction of the PC and the PO of the path to represent the path state (PS).
(3) After every path is traversed, we terminate the symbolic execution and form the state expression
(SE) as the disjunction of the PSs of all visited paths.

In Fig. 2.2, mROA_M has 3 paths: 6-7, 6-9-10, and 6-9-12-13. The PSs of these three paths are
composed as described above to give SE_mROA_M for mROA_M; similarly, we obtain SE_mROA_O
for mROA_O in Fig. 2.2. (The resulting symbolic expressions appear as figures in the original
document.)
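The PC/PO/PS/SE composition can be made concrete on a hypothetical two-path region `if (x < z) y = x + 1; else y = x - 1;` (this is an illustrative snippet, not the exact mROA of Fig. 2.2). The sketch below evaluates the resulting SE for concrete values, showing that SE holds exactly for the input/output pairs the region can produce.

```java
// Concrete evaluation of the state expression for the hypothetical region
// "if (x < z) y = x + 1; else y = x - 1;":
//   SE = (PC1 && PO1) || (PC2 && PO2), the disjunction of the two path states.
public class StateExpr {
    static boolean se(int x, int z, int y) {
        boolean ps1 = (x < z) && (y == x + 1);   // PS of the true branch: PC1 && PO1
        boolean ps2 = !(x < z) && (y == x - 1);  // PS of the false branch: PC2 && PO2
        return ps1 || ps2;
    }

    public static void main(String[] args) {
        // (x=1, z=2) takes the true branch, so only y = 2 satisfies SE
        System.out.println(se(1, 2, 2));  // true
        System.out.println(se(1, 2, 0));  // false
    }
}
```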
2.4 Identify equivalent mutants through constraint solving

After symbolic execution, we obtain SE_ROA_O and SE_ROA_M. We use a constraint solver to check
the relation between ROA_O and ROA_M by solving three constraints, E_0, E_1 and E_2 (their
formulas appear as a figure in the original document). ROA_O and ROA_M are identified as equivalent
if E_0 is true, E_1 is false and E_2 is false, i.e., the set of all possible values that satisfies
SE_ROA_M is identical to the set of all possible values that satisfies SE_ROA_O. If the solver fails
to solve any of these three constraints, the mutant cannot be proved kill-able or equivalent.
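Because E_0, E_1 and E_2 appear only as figures in the original, the sketch below checks the same underlying property by brute force over a small finite domain instead of a constraint solver: the ROA behaviors of the original condition "x < z" and the mutated condition "x <= z" from Section 2.3 are compared input by input.

```java
import java.util.function.BiPredicate;

// Brute-force stand-in for the solver-based check: search a small domain for
// an input on which the original and mutated ROA behaviors disagree.
public class RoaEquivCheck {
    static boolean behaviorsDiffer(BiPredicate<Integer, Integer> orig,
                                   BiPredicate<Integer, Integer> mut) {
        for (int x = -10; x <= 10; x++)
            for (int z = -10; z <= 10; z++)
                if (orig.test(x, z) != mut.test(x, z))
                    return true;   // witness input found: the mutation is excitable within the ROA
        return false;              // behaviors agree on the whole (finite) domain
    }

    public static void main(String[] args) {
        // x == z distinguishes "x < z" from "x <= z"
        System.out.println(behaviorsDiffer((x, z) -> x < z, (x, z) -> x <= z));  // true
    }
}
```

As Section 2.2 explains, a discovered difference within the ROA does not by itself prove the mutant kill-able (the effect may fail to propagate outside), whereas agreement over all inputs - established on an unbounded domain by a solver rather than by enumeration - does prove equivalence.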
2.5 ROA expansion

The equivalence between ROA_O and ROA_M is sufficient to conclude that M is an equivalent mutant,
but it is not necessary, because the code outside the ROA may pose additional constraints on the OROD
of the current ROA, such that a mutant that is kill-able under a smaller ROA may actually be equivalent
as the ROA becomes larger. In Fig. 2.5, we show the CFGs of an example code and its mutant. The mutant
changes line 2 from "a < b" to "a <= b". If we use the mROA for the equivalence check, we
will conclude that the mutant is not equivalent to the original program. But if we consider the expanded
ROA (shown in Fig. 2.5 as eROA), we will conclude that the mutant is an equivalent mutant. Therefore,
if ROA_O and ROA_M are evaluated to be not equivalent for the current ROA,
we may expand the current ROA and repeat the above analysis to obtain a more accurate result.

Figure 2.5: The CFGs of an example code and its mutant

The ROA can expand up and down in the CFG. The new ROA must have a single entry and a single
exit: when it expands up, towards the entry of the program, the begin line of the new ROA must
be a dominator of the current ROA; when it expands down, towards the exit of the program, the
end line of the new ROA must be a post-dominator of the current ROA. As of now, we expand the
ROA uniformly up and down; we will develop heuristics to expand the ROA more effectively.
2.6 Implementation

We first use muJava [31] to generate a set of mutants, and we generate CFG_O for the original
program. For each mutant M generated by muJava, we perform the following steps: (a) Generate
CFG_M, which is the CFG of M corresponding to CFG_O. (b) Use our newly developed static
analysis tools to identify the mROA (if it is the first iteration for M), CDDG, CRIU, OROD, CRIS
and OROS for each version of the code (O and M). (c) If both OROD_O and OROD_M are empty,
we conclude that M is an equivalent mutant. (d) If not, we use the variable symbols in CRIS and OROS
and modified JPF-SE tools [32] to perform symbolic execution within the ROA (mROA or expanded
ROA), and generate the SE of the ROA for each version of the code. (e) Use a constraint solver (e.g.,
Yices [33]) to identify the equivalent mutant by solving E_0, E_1, and E_2. (f) If equivalence is
identified, we proceed to the next mutant. If equivalence is not identified, we expand the ROA and
repeat steps (b) to (f) using the expanded ROA until the time limit is reached. Finally, we
obtain the set of equivalent mutants of O from all generated mutants.
2.7 Results

We use eight test programs from the online resources of Ammann and Offutt's text [34]. These are
Java programs whose sizes range between 20 and 200 lines.

To show the possible improvements our method can achieve, we compare our method with
the existing constraint based approaches, since all the other methods (i.e., [22], [24], and [25]) are
approximate and non-deterministic. All existing constraint based approaches [21], [35] perform the
analysis at global scale; we compare against these approaches by applying our localized method to
identify equivalent mutants in these eight programs. In Table 2.1, "ROA depth" denotes how
much the ROA expands: 0 means the minimal ROA (mmROI), and "global" means the whole
program. Note that the results shown for the "global" case correspond to the results for all existing
approaches, while those for the other ROA depths are enabled for the first time by our approach. "Avg.
ROA size" is the average size of the ROA in bytecode (lines of bytecode) among all program
versions; "Avg. # of ROA I/Ps" is the average number of variables in CRIS among all program
versions; "Avg. # of ROA O/Ps" is the average number of variables in OROS among all program
versions; "Avg. expression size" is the average size of the constraint expressions E_0, E_1
and E_2 among all program versions, where the expression size is calculated by counting the number of
constraints; "Run-time" is the time (in seconds) spent on equivalent mutant identification; "# of
equivalent M" is the number of equivalent mutants identified by our tools under the current depth;
and "# of equivalent M / # of all equivalent M" is the percentage of equivalent mutants discovered
under the current depth.

Table 2.1: Results of equivalent mutant checking for various programs
The first observation from Table 2.1 is that the minimal ROA is sufficient to identify all equivalent
mutants for the first seven programs. We note that in Program H, the average number of ROA I/Ps
and the average number of ROA O/Ps differ slightly among ROA depths, because some
intermediate variables (variables not defined in PID or POD but having dependence path(s)
from/to the definition(s) in PID/POD) are involved in the symbolic execution. We also notice
that the average minimal ROA sizes in some of the programs are around half or less than half of the
whole program sizes. In contrast, in the first two programs, the ROA size reductions are not that
significant. This difference results from the different structures of those programs, as well as the ROA
identification strategy we use. Also, in the results for program E, if we use the minimal ROA, the
average expression size is far less than its counterpart in the global analysis, and the run-time is
greatly reduced. That is because there is an "if" statement inside a "for" loop in this program,
which creates a large number of possible execution paths. The advantage of minimal ROA analysis
is obvious here: it avoids possible path explosion by focusing on only a small part of the code.

For the last program, the results show that when we use the minimal ROA, we can identify
60.87% of all equivalent mutants while the run-time complexity is greatly reduced. If we increase
the ROA depth, more equivalent mutants are identified, but the run-time also increases. We note
that the run-time is even higher than that of the global analysis when the ROA depth is larger than 3.
This is due to the larger expressions created during symbolic execution with the variables in CRIS and
OROS. Hence, we need to identify the ROA in a way that maximizes the advantage of our method.
This is a subject of our ongoing research.
2.8 Conclusion

In this chapter, we analyze the SW testing problem from the viewpoint of HW testing and identify the
similarities between SW mutation testing and HW fault oriented testing. Inspired by fault collapsing
and the D-algorithm in HW testing, we have developed the first localized approach for equivalent
mutant identification. Experimental results show that, compared with previous constraint based
approaches [21], [35], our approach effectively reduces the run-time complexity, showing
that our approach is more scalable and more universal. The related research is published in [36].
Chapter 3
SW D-Algorithm
3.1 Challenges in constraint based test generation methods

For all constraint based test generation methods (e.g., [13], [16] and [17]), constraint creation is an
inevitable process, and all existing constraint creation approaches for mutation based test generation
require global scale symbolic analysis, which is expensive. In addition, solving
the constraint expressions also has high complexity. Therefore, the test generation process is
not scalable. Although the weak mutation approach proposed by Howden [19] reduces the complexity
for mutations close to the beginning of the program, it does not reduce the complexity for
mutations close to the end of the program. Also, the weak mutation approach completely ignores
mutation effect propagation, and hence its results are typically highly approximate, especially for
the mutations for which it reduces complexity. These drawbacks prevent current constraint based test
generation approaches from being widely used in the SW testing area.

We propose to develop a method that minimizes the computational effort spent on constraint
creation and solving, and makes mutation based test generation scalable yet accurate for large
programs. Our objective is to develop fundamental concepts, methods, and tools for mutation-oriented
automatic test generation (ATG) for SW unit testing that will provide near-universal automation.
3.2 Key ideas
We develop a new divide and conquer approach to reduce the complexity of the constraint creation
phase as well as the constraint solving phase. More specifically, we combine global static analysis
and a sequence of small local (for parts of the program) symbolic analysis, instead of performing
symbolic analysis on the whole program. This also reduces the load on constraint solver by de-
riving smaller constraint expressions. Also, static analysis, which includes control flow analysis
and data dependency analysis, are low-complexity processes (compare to symbolic analysis). The
generated static information for the program under test forms the connections between symbolic
expressions of different parts of the program, thus the total (average) run-time complexity is sig-
nificantly reduced compared to traditional global analysis method, especially for larger programs
or relatively smaller programs with path explosion problem.
Moreover, inspired by the concept of HW D-algorithm, we develop SW D-algorithm to re-
strict the generation process to directed local search. HW D-algorithm was the first complete test
generation algorithm which established a paradigm for completely searching the space of all pos-
sible tests. Its test generation subtasks (TGSTs) includes fault excitation, fault effect propagation,
and justification. Similarly, our new approach (SW D-algorithm) has three test generation sub-
tasks: mutation effect excitation, mutation effect propagation, and justification. The details will be
presented in the following sections.
To further reduce the run-time complexity of our SW D-algorithm, we can reuse the constraints
created for different parts of the program wherever applicable. This saves a lot of computational
effort otherwise wasted on constraint creation for the same part of the program. This is particu-
larly important since there are a lot of repeated uses of constraints. Also, since for each mutant,
only a part of the program is different from the original program and all remaining parts remain
unchanged. In this SW D-algorithm, we pre-create constraints for all possible parts of the program
and reuse whenever needed, only the mutated part requires fresh constraint creation.
In addition, we implement a 2-pass test generation system. Specifically, we first use the SW
D-algorithm with a small backtrack limit to generate the test cases for most mutants at low com-
plexity. Then we use the traditional global method to generate tests to kill the remaining mutants.
Our approach provides high coverage at lower complexity and with fewer test cases compared to
existing constraint based methods.
3.3 Essential properties of SW and SW D-algorithm
The SW D-algorithm is developed by building on the principles used in the HW D-algorithm. However, due to the many differences between SW and HW, we must significantly extend several definitions and methods used in the HW D-algorithm to adapt them to the SW D-algorithm, and develop new concepts and methods to capture characteristics specific to SW.
We need to convert a SW program to a form that is suitable for the SW D-algorithm. Here we describe all essential components we use to represent a SW program. We note that currently we use loop unwinding to convert cyclic programs to acyclic ones.
3.3.1 Basic Block in SW – mROA
We use mROA to denote a minimal region of analysis for SW. The behavior of this part of the program, with or without a mutation in the statement s, can be captured purely in terms of the values at its RIU and ROD. The details of mROA identification and analysis are described earlier in Section 2.3.
3.3.2 Interconnections in SW
In a HW circuit, gates are connected by circuit lines, which not only carry logic values but also indicate the flow of control via logic value transitions (events). Circuit lines are, hence, the sole interconnections in HW circuits.
In a SW program, by contrast, there is no physical “line” that represents the interconnections. Instead,
we use both control dependency and data dependency information to fully capture the intercon-
nections between statements within SW, in terms of control flow and data flow. The first type of
interconnections in SW is represented as a sequence of nodes and edges in its CFG. This type of
interconnection captures the control flow of the program. The second type of interconnections is
represented as DU chains. This type of interconnection captures the data flow of the program.
Fig. 3.1 shows the example code and its augmented CFG. The solid lines represent the control
flows of the program, and the dotted lines represent the data dependencies of the program. The
lines between statement nodes capture all interconnections within the program. For example, we
can see from this augmented CFG that def_1(x) and use_3(x) are not only connected via the CFG path between node 1 and node 3 (path 1-2-3), they are also connected via the DU path between def_1(x) and use_3(x). In this chapter, we use the form “def/use_location(name)” to represent a variable’s definition or use at the specified location.
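To make the two interconnection types concrete, the following minimal Python sketch (the data structures and the reachability routine are illustrative assumptions of ours, not components of the system) encodes the Fig. 3.1 example as a set of control-flow edges plus a set of DU edges:

```python
# Illustrative sketch: the augmented CFG of Fig. 3.1 as two edge sets.
# Solid lines (control flow) and dotted lines (data dependencies) are
# kept separate, mirroring the two interconnection types in SW.

control_edges = {(0, 1), (1, 2), (2, 3)}      # CFG path 1-2-3 lives here
du_edges = {("def_1(x)", "use_3(x)")}         # DU-chain interconnection

def connected_in_cfg(src, dst, edges):
    """Simple depth-first reachability over control-flow edges."""
    frontier, seen = [src], {src}
    while frontier:
        n = frontier.pop()
        if n == dst:
            return True
        for (a, b) in edges:
            if a == n and b not in seen:
                seen.add(b)
                frontier.append(b)
    return False

# def_1(x) and use_3(x) are connected both via the CFG path 1-2-3 ...
assert connected_in_cfg(1, 3, control_edges)
# ... and via the DU edge between them.
assert ("def_1(x)", "use_3(x)") in du_edges
```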
3.3.3 Mutation effect
In the HW D-algorithm, the fault effect at a circuit line is denoted as D or D̄ [4], based on a multi-valued composite value system that describes the different values in the fault-free and faulty versions of the circuit. Similarly, in the SW D-algorithm, we define mutation effect as the presence of two different
values for the same variable in the original program and its corresponding location in the mutant.
The purpose of test data generation is to excite the mutation effect, propagate the mutation effect
to at least one output of the program under test, and ensure that all values are justified.
Data types in SW are more complex than the Boolean values used in digital circuits, so it is difficult to use a composite value system to represent the values of variables in different versions of the code. For example, if an integer variable carries a mutation effect, we cannot use a simple composite value system to enumerate all possible value combinations, since the original and faulty values can each take an exponential number of possibilities within the integer value range. Instead, we use a combined representation: “value of the variable in the original program / value of the variable in
Figure 3.1: The example code and its CFG with data dependency information
the mutant”. A mutation effect is captured if the value of the variable in the original program is different from its value in the mutant. For example, in Fig. 3.2, consider a mutant that changes the statement at line 3 from “x = x + 1” to “x = x − 1”. We can observe a mutation effect at this line if we assign x any integer value, 0 for instance, and the mutation effect can be denoted as def_3(x) = 1/−1.
To find the corresponding variables used or defined in the original program and the mutant, we define a mapping function f that records the correspondence between the statements in the original program and the mutant. Given a line p in the original program, we can find its corresponding line in the mutant through the function f.
Figure 3.2: The original program and its mutant with the identified mutation effect
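The “original / mutant” composite representation can be sketched as a small Python class (an illustrative stand-in of ours, not the dissertation’s implementation); a mutation effect is present exactly when the two components differ:

```python
# Illustrative sketch of the "orig/mut" composite value representation.
# A mutation effect exists iff the two components of the pair differ.

class CompositeValue:
    def __init__(self, orig, mut):
        self.orig, self.mut = orig, mut

    def has_mutation_effect(self):
        return self.orig != self.mut

    def __repr__(self):
        return f"{self.orig}/{self.mut}"

# Mutant "x = x + 1" -> "x = x - 1", with input x = 0 (the Fig. 3.2 case):
x_in = 0
def3_x = CompositeValue(x_in + 1, x_in - 1)   # composite value 1/-1
assert def3_x.has_mutation_effect()
assert repr(def3_x) == "1/-1"
```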
3.3.4 Execution status
In HW, every element in a digital circuit is always active because of its inherent parallelism. In a SW program, by contrast, a node (statement) or an edge in the program’s CFG is not activated until it is executed at run-time, because the program flow branches after a conditional node and only one branch is executed. The precondition for executing a branch is determined at run-time. We define the edge condition (EC) as the precondition for a specific edge to be executed. If a statement is not executed, any change induced by this statement, whether in control flow or in data dependency relations, must not be counted. This difference between HW and SW requires a completely new set of variables and algorithms.
We define execution status (EXS) of a node/edge in a program’s CFG to indicate whether
this node/edge will be executed during run-time. EXS is a new type of dynamic information
proposed here. We note several special properties for EXS: (1) the entry statement of a program
is always executed; (2) the mutated statement must be executed; (3) the mutation effect needs to
be propagated through a sequence of executed nodes and edges connecting them; (4) among all
outgoing edges of an executed conditional node, only one edge must be executed, and all the other
edges must not be executed; (5) among all incoming edges of an executed node, only one edge
must be executed, and all the other edges must not be executed.
For example, Fig. 3.2 shows CFGs of the original program and a mutant with mutation at line
3. For CFGs of both versions, we know node 0 (entry) must be executed, because it is an entry
node; and node 3 must be executed, because it is the mutated statement.
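Properties (4) and (5) above can be sketched as a consistency check over EXS values (True = executed, False = not executed, None = unknown); the function and its names are illustrative assumptions of ours:

```python
# Illustrative check of EXS properties (4)/(5): among the incoming or
# outgoing edges of an *executed* node, exactly one may be executed.
# EXS values: True (executed), False (not executed), None (unknown).

def exs_consistent(node_exs, edge_exs_list):
    """Check one node against the EXS of its incoming (or outgoing) edges."""
    if node_exs is not True:
        return True          # properties (4)/(5) constrain only executed nodes
    if None in edge_exs_list:
        return True          # still undecided: no violation yet
    return edge_exs_list.count(True) == 1

assert exs_consistent(True, [True, False])      # exactly one executed: OK
assert not exs_consistent(True, [True, True])   # two executed: conflict
assert exs_consistent(True, [None, False])      # unknown edge: no conflict yet
```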
3.3.5 Active DU path, dead DU path, and potential DU path
A DU chain consists of a definition of a variable and all of its uses that are reachable from that definition. In Fig. 3.1, def_0(a) and its uses, use_2(a) and use_8(a), form a DU chain of variable a. During static analysis, we identify all static information, including the DU chains in the program. We call them static DU chains, since they do not reflect any dynamic information from the run-time of the program.
When the program is running, some code is executed while other code is not, due to conditional statements. EXS is introduced to reflect the dynamic execution information of a node/edge at run-time, and a DU path within a DU chain is invalid when at least one of its edges/nodes has invalid EXS.
An active DU path is defined as a DU path such that all nodes/edges along the path are executed; a dead DU path is a DU path such that at least one of its nodes/edges has invalid EXS; a potential DU path is a DU path such that none of its nodes/edges has invalid EXS. A potential DU chain is obtained by eliminating all dead DU paths between definition and use pairs in the DU chain. Also, we say def(x) actively reaches use(x) if there exists an active
DU path between def(x) and use(x), and def(x) is then called an active reaching definition of use(x); we say def(x) is a dead reaching definition of use(x) if all DU paths between them are dead DU paths. In Fig. 3.3, static analysis shows that use_6(x) has two reaching definitions: def_1(x) and def_3(x). But if we identify that edge 2-3 is executed at run-time (e.g., when a = b = 0), only def_3(x) actively reaches use_6(x). In Fig. 3.3, all the dead DU paths given a = b = 0 are marked with an X.
Figure 3.3: All DU paths of the original program and its mutant
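The three DU-path categories can be sketched as a classification over the EXS values of a path’s nodes and edges (an illustrative encoding of ours, not the system’s code):

```python
# Illustrative classification of a DU path by the EXS of its elements:
# active    - every node/edge on the path is executed (True),
# dead      - at least one element has invalid EXS (False),
# potential - no element is invalid, but some are still unknown (None).

def classify_du_path(exs_values):
    """exs_values: the EXS of every node/edge on the path."""
    if all(v is True for v in exs_values):
        return "active"
    if any(v is False for v in exs_values):
        return "dead"
    return "potential"

assert classify_du_path([True, True, True]) == "active"
assert classify_du_path([True, False, None]) == "dead"
assert classify_du_path([True, None, True]) == "potential"
```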
3.3.6 Unjustified element and the unjustified element list
In general, an element is considered unjustified if the current given values of its input(s) cannot
be implied by program input(s). In our algorithm, several different cases need to be considered,
including unjustified EXS of a node, unjustified EXS of an edge, unjustified value assignment of a
definition, unjustified value assignment of a use, and unjustified EC. For example, in the original CFG in Fig. 3.3, if node 13 is executed and its two incoming edges, edge 9-13 and edge 11-13, have unknown execution status, the EXS of node 13 is marked as unjustified. All these cases will be defined and explained later in this chapter. In the SW D-algorithm, we keep an unjustified element list (UEL) to store all unjustified elements; the list is updated during the implication procedure.
3.3.7 Subtask and the subtask stack
Similar to the HW D-algorithm, we define three types of subtasks in the SW D-algorithm: the mutation effect excitation (MEE) subtask, the mutation effect propagation (MEP) subtask, and the justification (JUST) subtask. We explain them in detail later. All these subtasks are stored and processed in LIFO fashion, using a stack called the subtask stack.
During the run-time of SW D-algorithm, the subtask at the top of subtask stack is executed.
Then, the subtask stack pops the top subtask if it is completely processed without conflicts, and
pushes newly identified subtasks onto the subtask stack.
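A plain Python list already gives the LIFO behavior described above; the subtask tuples below are illustrative placeholders, not the dissertation’s data structures:

```python
# Illustrative LIFO subtask stack: push newly identified subtasks,
# always process the top, pop it when it completes without conflict.

subtask_stack = []
subtask_stack.append(("MEE", "x"))     # identified at the mutated statement
subtask_stack.append(("MEP", "x"))     # identified while propagating

top = subtask_stack[-1]                # the subtask at the top is executed
assert top == ("MEP", "x")

subtask_stack.pop()                    # completed without conflict: pop it
assert subtask_stack[-1] == ("MEE", "x")
```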
3.3.8 Algorithm state
The SW D-algorithm requires a backtrack when a conflict is identified: all temporary information of the program must be erased and the previous state must be restored.
We define algorithm state as a complete set of values that are essential for the recovery. Al-
gorithm state includes EXS of all nodes/edges, value assignments at all definitions/uses, all DU
chains of the program, the unjustified element list and the subtask stack.
3.3.9 Backtrack limit
In a HW digital circuit, the logic function of a gate can be represented using a truth table, which has a limited number of entries given the bounded value space of Boolean variables.
In a SW program, on the other hand, we use constraint expressions to describe the behavior of an ROA. In most cases, it is infeasible to list all solutions of the constraints, because individual variables in SW can have extremely large value spaces (e.g., integer or floating-point variables). Therefore, it is crucial to set a limit on how many solutions are tried for a constraint expression before the attempt is aborted; we call this limit the backtrack limit. If we set the backtrack limit to n, then before generating the (n+1)-th solution for the constraint expression of the current subtask, the algorithm returns a failure on this subtask and proceeds to the next subtask at the top of the subtask stack.
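The backtrack limit can be sketched as a cap on the number of candidate solutions tried before a subtask is abandoned; the brute-force “solver” below is purely illustrative (a real implementation would query a constraint solver instead of enumerating candidates):

```python
# Illustrative backtrack limit: try at most n candidate solutions of a
# constraint; once n candidates have been tried, abort the subtask.

def solve_with_backtrack_limit(candidates, constraint, n):
    """Return the first of at most n candidates satisfying the constraint,
    or None (subtask failure) once the backtrack limit n is exhausted."""
    for tried, value in enumerate(candidates):
        if tried >= n:
            return None      # backtrack limit reached: fail this subtask
        if constraint(value):
            return value
    return None              # candidate space exhausted: no solution

# Constraint x + 1 == 100 over candidates 0, 1, 2, ...:
assert solve_with_backtrack_limit(range(1000), lambda x: x + 1 == 100, 5) is None
assert solve_with_backtrack_limit(range(1000), lambda x: x + 1 == 100, 200) == 99
```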
3.4 Essential procedures in SW ATG D-algorithm
We propose a new method for mutation-based test case generation. We call it the SW D-algorithm, since it is derived by adapting basic concepts of the HW D-algorithm and developing new concepts and methods to tackle the special characteristics of SW.
In this section, we describe essential procedures in SW D-algorithm. Although these pro-
cedures are conceptually similar to the ones performed in HW D-algorithm, they are markedly
extended and reinvented to include special characteristics of SW.
3.4.1 Implication
In general, implication is the process of determining some values as a consequence of the changes
in some other values. Implication is performed to reduce the search space for the test, since it helps
the algorithm assign as many known values as possible and hence refines existing possibilities.
There are two types of implications in SW D-algorithm: value assignment implications and EXS
implications.
We also keep a task list called the implication task list (ITL) for all incomplete implication tasks. It can be implemented as LIFO or FIFO, since the order in which implication tasks are processed does not affect the result. During the implication procedure, all tasks in the ITL are processed. The implication process succeeds if all tasks complete without conflict, and fails if any conflict is found.
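Processing the ITL can be sketched as a simple work-list loop in which handling one task may enqueue further tasks; the toy task processor below is an assumption of ours for illustration only:

```python
# Illustrative work-list loop over the implication task list (ITL):
# pop tasks until the list is empty or a conflict is found. Processing
# a task may enqueue new implication tasks.

def run_implications(itl, process):
    """process(task) -> (new_tasks, conflict_flag)."""
    while itl:
        task = itl.pop()
        new_tasks, conflict = process(task)
        if conflict:
            return False     # implication fails on any conflict
        itl.extend(new_tasks)
    return True              # all tasks completed without conflict

# Toy processor: task "a" implies a new task "b"; task "x" is a conflict.
def toy(task):
    if task == "a":
        return (["b"], False)
    if task == "x":
        return ([], True)
    return ([], False)

assert run_implications(["a"], toy) is True
assert run_implications(["a", "x"], toy) is False
```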
3.4.1.1 Value assignment implication
We define value assignment implication as a process of determining the value defined or used at
various locations of a program because of some other new value assignments. It can be performed
backward or forward: from a definition to a use or from a use to a definition. It can also be performed for the variables used in an EC when the EC must (or must not) be satisfied. A conflict is identified if the implied value is not compatible with the current value.
Value assignment implication occurs (1) from a definition to its uses when the definition ac-
tively defines the use; (2) from a use to its definition when the definition actively defines the use;
(3) between variables used/defined within a statement if the symbolic expression of this statement
has a unique solution based on known constraints.
Here we use parts of the example program (shown in Fig. 3.4) to explain some important cases
of value assignment implication. In Fig. 3.4(a), variable x is defined at node 1 and used at node 3; the dotted directed line represents the potential DU path from def_1(x) to use_3(x). Consider a forward value assignment implication task performed for def_1(x): a value of 0 is assigned to use_3(x), and a new forward value assignment implication task for use_3(x) is added to the ITL. For this newly added implication task, we first identify an mROA and its RIV and ROV as described earlier. Then the mROA’s symbolic expression “x_out = x + 1” is retrieved (it is pre-computed for repeated use) and sent to the solver together with the constraint for the known value, “use_3(x) = 0”. Because the solver generates the unique answer “x_out = 1”, a new value of 1 is implied at def_3(x) and a new forward value implication task is added to the ITL.
Figure 3.4: Parts of the example program to show cases of value implication
In general, this example illustrates a simple case of forward value assignment implications
from def(v) to use(v). In this case, the implication proceeds as follows: (1) If def(v) is the only
definition of v that actively reaches use(v), the value at def(v) is assigned to use(v), and another
forward value assignment implication task for use(v) is added to ITL if use(v) is changed and there
is no conflict. (2) If def(v) is not the only definition of v that potentially reaches use(v), the value
at use(v) cannot be implied. We note that this type of implication is triggered by a change in value
at def(v), by a change in the EXS of def(v)’s corresponding node, or by a change in the EXS of
node/edge along the DU path between def(v) and use(v).
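Rule (1) and rule (2) of this forward implication can be sketched as follows (the function and variable names are illustrative assumptions of ours):

```python
# Illustrative forward value-assignment implication from def(v) to use(v):
# the value propagates only when def(v) is the sole definition that
# actively reaches use(v); otherwise nothing can be implied.

def forward_value_implication(active_reaching_defs, def_values, use_value):
    """Return (new_use_value, conflict) after the implication attempt."""
    if len(active_reaching_defs) != 1:
        return use_value, False          # rule (2): nothing can be implied
    implied = def_values[active_reaching_defs[0]]
    if use_value is not None and use_value != implied:
        return use_value, True           # incompatible values: conflict
    return implied, False                # rule (1): the value is assigned

vals = {"def_1(x)": 0, "def_3(x)": 1}
assert forward_value_implication(["def_1(x)"], vals, None) == (0, False)
assert forward_value_implication(["def_1(x)", "def_3(x)"], vals, None) == (None, False)
assert forward_value_implication(["def_1(x)"], vals, 5) == (5, True)
```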
In Fig. 3.4(b), def_1(x) and def_3(x) are two potential reaching definitions of use_6(x). Consider a backward value assignment implication task for def_6(x) with a value of 2. We first identify the mROA for statement 6 and retrieve its symbolic expression. Then we send the expressions “x_out = x + 1” and “x_out = 2” to the solver, which gives the unique solution “x = 1”. Thus, the value of 1 is assigned to use_6(x) and a backward value assignment implication task for use_6(x) is added to the ITL. But use_6(x) has two potential reaching definitions, which means no value can be implied further, and this unjustified use_6(x) is added to the UEL.
In Fig. 3.4(c), there is a conditional statement, and the two outgoing edges represent its true and false edges. Assume that the implication procedure determines the true edge must be taken (executed), so the edge condition of this edge must be satisfied. This creates a backward value assignment implication task from the EC to its uses: use_2(a) and use_2(b). We send the EC of the true edge, “a = b”, to the solver to find a solution. But since both use_2(a) and use_2(b) are unknown, the solver cannot provide a unique answer. No value can be assigned, and this unjustified EC is added to the UEL.
3.4.1.2 EXS implication
EXS implication is defined as the process of determining the EXS values of nodes/edges of a program’s CFG as a result of changes in the EXS values of other nodes/edges. A conflict is identified if the implied value is not compatible with the current value. EXS implication can be performed backward or forward, from node to edge or from edge to node.
Some important properties of EXS implication are: (1) If a non-entry node is executed, exactly one of its incoming edges must be executed. (2) If a non-entry node is not executed, all of its incoming edges must not be executed. (3) If a non-exit node is executed, exactly one of its outgoing edges must be executed. (4) If a non-exit node is not executed, all of its outgoing edges must not be executed. (5) If an edge is executed, the node it points to must be executed.
We use parts of the example program (shown in Fig. 3.5) to explain some important cases of
EXS implication in detail. In Fig. 3.5(a), consider a forward EXS implication from the entry node; we know the entry node is always executed. Thus, the EXS of edge 0-1 is also true, since it is the only outgoing edge of the entry node. Then, a new forward EXS implication task for edge 0-1 is added to the ITL. For this task, we simply assign true to the EXS of node 1 and add a forward EXS implication task for node 1 to the ITL. We note that node 1 contains a definition of variable x, def_1(x), so we must also create a forward value assignment implication task for it and add this task to the ITL. This is because node 1 is not only “connected” to its outgoing edge in the CFG, it is also “connected” to the DU path starting from def_1(x). Both types of interconnections must be considered during implication.
Figure 3.5: Parts of the example program to show cases of EXS implication
In general, the above example illustrates one simple case of forward EXS implications from a
node N to an edge E. In this case, the implication method proceeds as follows: (1) If E is the only outgoing edge of N, the EXS of E is directly implied by the EXS of N. (2) If N has more than one outgoing edge and N is executed, then for each outgoing edge of N, the EXS of the edge depends on whether its EC is satisfied: if an edge’s EC is satisfied, its EXS is set to true, and vice versa. (3) If N has more than one outgoing edge and N is not executed, all of its outgoing edges must not be executed, so the EXS of each edge is set to false. Then we scan every outgoing edge of N: if the EXS of any edge has changed, a new forward EXS implication task for this edge is added to the ITL. We note that this type of implication is triggered by a change of N’s EXS, or by a re-evaluation of the EC of its outgoing edges.
In Fig. 3.5(b), if node 8 is executed and a forward EXS implication is performed for it, we first need to obtain the condition expression of node 8, which is “a = c”. Then, we evaluate it using the constraint solver. Since use_8(a) and use_8(c) are both unknown, the solver cannot determine whether this condition is satisfied. Thus, we cannot imply any value for the EXS of edge 8-9 or edge 8-11. Consider another case, in which edge 8-9 is executed and a backward EXS implication from edge 8-9 is processed. We first assign true to the EXS of node 8, since the execution of node 8 is a necessary condition for edge 8-9 to be executed, and a backward EXS implication task for node 8 is added to the ITL. Also, the other edge (8-11) must not be executed, so a forward EXS implication task for edge 8-11 is added to the ITL. In addition, the EC of edge 8-9 must be satisfied, so a backward value assignment implication task for the EC of edge 8-9 is added to the ITL.
In Fig. 3.5(c), consider that node 13 is executed and a backward EXS implication task is processed for this node. One of node 13’s incoming edges must be executed, but we do not know which. Hence, no value can be assigned, and node 13 is added to the UEL. Consider another case, in which a forward EXS implication is processed for the executed edge 9-13 while the EXS of edge 11-13 and of node 13 are both unknown. The implication first assigns true to the EXS of node 13 and adds a forward EXS implication task for this node to the ITL. Then, it assigns false to the EXS of edge 11-13 and adds a backward EXS implication task for this edge to the ITL.
3.4.2 Mutation effect excitation (MEE) subtask
MEE is performed at the location of the mutation for the purpose of generating the mutation effect. This is a necessary condition for killing the mutant.
3.4.2.1 MEE creation
We first identify the mutation effect excitation mROA (MEEmROA) and its related information
at the location of the mutated statement using the methods mentioned above. Since the mutation
effect must be present at one or more variables of the MEEmROA’s ROV (i.e., its output variables), we create an MEE subtask for each variable in the MEEmROA’s ROV, and this variable is denoted as the MEE subtask’s mutation effect output variable (MEOV). Thus, each MEE subtask consists of its corresponding MEEmROA and MEOV. For example, in Fig. 3.6, MEEmROA [5, 8) is identified for the mutation at line 5. Its RIV consists of b, c, and x; its ROV has only x. Therefore, only one MEE subtask is created, with variable x as its MEOV.
3.4.2.2 MEE subtask processing
When the MEE subtask is processed, the algorithm first generates MEEmROA’s symbolic expres-
sion for the corresponding MEOV , for both original program and the mutant. Then we add the
constraint that the value defined at the MEOV in original program is different from the value de-
fined at the MEOV in the mutant. This ensures the excitation of the mutation effect. We also
include all known values into the constraint. The entire set of constraints is then sent to the solver
to generate possible solutions.
If a solution is found, mutation effect can be excited. Then, we assign values to the variables in
MEEmROA’s RIV and ROV based on the solution. In addition, MEEmROA must be executed to
excite the mutation effect, thus, we assign true to the EXS of MEEmROA’s entry node, and assign
true to the EXS of MEEmROA’s exit node.
Finally, we create the following implication tasks: (1) Backward EXS implication task for
MEEmROA’s entry node. (2) Forward EXS implication task for MEEmROA’s exit node. (3)
Figure 3.6: The original program and its mutant with the identified MEEmROA
Backward value assignment implication task for any new value assignment at MEEmROA’s RIV .
(4) Forward value assignment implication task for any new value assignment at MEEmROA’s ROV .
These implication tasks are then added to the ITL.
For example, in Fig. 3.6, the MEEmROA’s symbolic expression for the original program is “(b = c ∧ x_out_ori = x + 1) ∨ (b ≠ c ∧ x_out_ori = x)”; the MEEmROA’s symbolic expression for the mutant is “(b ≠ c ∧ x_out_mut = x + 1) ∨ (b = c ∧ x_out_mut = x)”. Here, “x_out” is the MEOV of the current subtask, and the suffixes “_ori” and “_mut” distinguish variables in the two versions. Then we add the constraint “x_out_ori ≠ x_out_mut”, and the entire set of constraints is sent to the solver. One possible solution is “x = 0, b = 0, c = 0, x_out_ori = 1, and x_out_mut = 0”. Also, we assign true to the EXS of node 5 and node 8. Finally, the following tasks are created and added to the ITL: (1) backward EXS implication for node 5, (2) forward EXS implication for node 8, (3) backward value assignment implication for use(b)@MEEmROA, use(c)@MEEmROA, and use(x)@MEEmROA, and (4) forward value assignment implication for def(x)@MEEmROA.
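For the Fig. 3.6 example, the MEE constraint can be illustrated with a brute-force search standing in for the constraint solver; the two functions below are our reading of the MEEmROA’s symbolic expressions, and the search itself is an illustrative assumption, not the dissertation’s method:

```python
# Illustrative MEE excitation for the Fig. 3.6 MEEmROA: find inputs for
# which the original and mutant versions define different values at the
# MEOV, i.e., x_out_ori != x_out_mut.

def mee_original(b, c, x):
    # (b = c ∧ x_out = x + 1) ∨ (b ≠ c ∧ x_out = x)
    return x + 1 if b == c else x

def mee_mutant(b, c, x):
    # condition mutated: (b ≠ c ∧ x_out = x + 1) ∨ (b = c ∧ x_out = x)
    return x + 1 if b != c else x

def excite(domain):
    """Brute-force stand-in for the solver: search a small input domain
    for values that excite the mutation effect."""
    for b in domain:
        for c in domain:
            for x in domain:
                if mee_original(b, c, x) != mee_mutant(b, c, x):
                    return (b, c, x)
    return None

# b = 0, c = 0, x = 0 yields x_out_ori = 1 and x_out_mut = 0, matching
# the solution discussed in the text.
b, c, x = excite(range(2))
assert mee_original(b, c, x) != mee_mutant(b, c, x)
```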
3.4.3 Mutation effect propagation (MEP) subtask
MEP is performed at the statement where a mutation effect is present at a use of a variable. MEP is required because the mutation effect must be propagated forward until it is exposed at the program’s POD.
3.4.3.1 MEP creation
We first identify the mutation effect propagation mROA (MEPmROA) and its related information at the statement where the mutation effect is present, using the method mentioned above. Since the mutation effect must be present at one or more variables of the MEPmROA’s ROV, we create an MEP subtask for each variable in the MEPmROA’s ROV, and this variable is denoted as the MEP subtask’s mutation effect output variable (MEOV). Thus, each MEP subtask consists of its corresponding MEPmROA and MEOV. For example, in Fig. 3.7, consider a mutation effect present at use_9(x), where use_9(x) = 1/0. MEPmROA1 [9, 9] is identified at the location of line 9. Its RIV has only x; its ROV has only x. Therefore, only one MEP subtask is created at this location, with variable x as its MEOV.
3.4.3.2 MEP subtask processing
When the MEP subtask is processed, the algorithm first retrieves the pre-computed MEPmROA symbolic expression for the corresponding MEOV, for both the original program and the mutant. Then we add the constraint that the value defined at the MEOV in the original program is different from the value defined at the MEOV in the mutant; this ensures the propagation of the mutation effect. We also include all known values in the constraint. The entire set of constraints is then sent to the solver to generate possible solutions.
Figure 3.7: The original program and its mutant with the identified MEEmROA and MEPmROA
If a solution is found, the mutation effect can be propagated. Then, we assign values to the variables in the MEPmROA’s RIV and ROV based on the solution. In addition, the MEPmROA must be executed to propagate the mutation effect; thus, we assign true to the EXS of the MEPmROA’s entry node and of its exit node.
Finally, we create the following implication tasks: (1) Backward EXS implication task for
MEPmROA’s entry node. (2) Forward EXS implication task for MEPmROA’s exit node. (3)
Backward value assignment implication task for any new value assignment at MEPmROA’s RIV .
(4) Forward value assignment implication task for any new value assignment at MEPmROA’s ROV .
These implication tasks are then added to the ITL. For example, in Fig. 3.7, the MEPmROA1’s symbolic expression for the original program is “x_out_ori = x_ori + 1”; the MEPmROA1’s symbolic expression for the mutant is “x_out_mut = x_mut + 1”. Here, “x_out” is the MEOV of the current subtask, and the suffixes “_ori” and “_mut” distinguish variables in the two versions. Then we add the constraint “x_out_ori ≠ x_out_mut”, and the entire set of constraints is sent to the solver. The only possible solution is “x_ori = 1”, “x_out_ori = 2”, “x_mut = 0”, and “x_out_mut = 1”, since we know “use_9(x) = 1/0”. But those values were already assigned during implication at line 9. Also, we assign true to the EXS of node 9. Finally, the following tasks are created and added to the ITL: (1) backward EXS implication for node 9, and (2) forward EXS implication for node 9. No value assignment implication task is created, since there is no new value assignment.
3.4.4 Justification (JUST) subtask
As soon as any mutation effect is present at at least one definition in the program’s POD, we need to justify all unjustified elements, i.e., the UEL must be empty.
3.4.4.1 JUST for a node
A node N is unjustified in EXS if it is executed, has two or more incoming edges, and at least two of them have unknown EXS while the remaining edges (if any) are not executed. We can justify the EXS of N by invoking one of its incoming edges with unknown EXS and not invoking the remaining incoming edges. Therefore, if N has p incoming edges with unknown EXS, there are p possible ways to justify the EXS of N. For example, in Fig. 3.5(c), if node 13 is executed and both edge 9-13 and edge 11-13 have unknown EXS, we can justify the EXS of node 13 by invoking edge 9-13 and not invoking edge 11-13. We note that new backward EXS implication tasks are then added to the ITL for all edges with changed EXS.
3.4.4.2 JUST for a use
A use of v is not justified if it has a concrete value, and it has two or more potential reaching
definitions of v. We can justify use(v) by assigning the value at use(v) to one of its potential
reaching def(v)s, and updating the reaching definition set of use(v). If there are p potential reaching
definitions of v, we have p possible ways to justify use(v).
3.4.4.3 JUST for a definition
If def(v) has a concrete value, but the expression representing def(v)’s corresponding mROA does not have a unique solution based on known values, def(v) is marked unjustified. We justify def(v) by assigning one of these solutions to the mROA’s RIV. Therefore, there are p possible ways to justify def(v) if there are p possible solutions. We note that new backward value assignment implication tasks are added to the ITL for all newly assigned variables in the RIV of this mROA.
3.4.4.4 JUST for an EC
If an EC expression has more than one solution given its expected satisfiability (i.e., to be satisfied or not), the EC is marked unjustified. We justify the EC by assigning one of these solutions to the uses within the EC. Therefore, there are p possible ways to justify the EC if there are p possible solutions. We note that new backward value assignment implication tasks are added to the ITL for all newly assigned uses within the EC.
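In all four cases above, justification amounts to enumerating the p admissible choices and trying them one by one (subject to the backtrack limit); a minimal illustrative sketch of ours:

```python
# Illustrative enumeration of justification alternatives: an unjustified
# element with p admissible options (p incoming edges, p reaching
# definitions, or p solutions) yields p branches to try, in order.

def justification_choices(kind, options):
    """Yield the p possible ways to justify one unjustified element."""
    for opt in options:
        yield (kind, opt)

# Node 13 with two incoming edges of unknown EXS (Fig. 3.5(c)):
choices = list(justification_choices("node_13_EXS", ["edge 9-13", "edge 11-13"]))
assert len(choices) == 2                        # p = 2 ways to justify
assert choices[0] == ("node_13_EXS", "edge 9-13")
```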
3.4.5 Initialization
At the beginning of the SW D-algorithm, we create an ITL, a subtask stack, and a UEL. Also, all known EXS values and value assignments are applied, including assigning true to the EXS of the program’s entry node and assigning all concrete definitions of variables. Then, we add to the ITL a forward EXS implication task for the entry node and forward value assignment implication tasks for all variables with newly assigned concrete values.
3.4.6 SW D-algorithm overview
The procedures described above are the essential components of our SW D-algorithm. Given a mutant, we first perform initialization and assign all known values; then the algorithm proceeds as shown in Fig. 3.8, which is the decision tree of our SW D-algorithm.
Figure 3.8: The decision tree of the proposed SW D-algorithm
3.5 2-pass test generation
We propose a 2-pass test generation method: the first pass uses our new SW D-algorithm, and the second pass uses the existing symbolic test generation (global symbolic analysis) method to enhance the test set. The steps are described below:
i. For each line of code in the original program, generate its mROA, perform symbolic analysis on this mROA, and store the generated symbolic expression for future reuse. Also, for each outgoing edge of a conditional node, perform symbolic analysis to obtain its EC and store it for future reuse.
ii. In the original program, initialize all known values and perform implication.
iii. Given a mutant, initialize all known values and perform implication.
iv. Identify MEE subtasks and push them onto subtask stack.
v. If the subtask stack is empty, return failure. This means that no test can be generated. If the
subtask stack is not empty, proceed to the next step.
vi. Select and process the subtask at the top of subtask stack.
• If a new solution is obtained, perform all implications in the ITL. (1) If there is no conflict, proceed to the next step. (2) If there is a conflict, recover the state by removing all temporary changes, and redo this step by following the next branch in the decision tree.
• If no solution exists or the backtrack limit is reached, pop the current subtask from sub-
task stack.
vii. Identify all locations where the mutation effect is present.
• If the mutation effect appears at any definition in the program's POD: (1) If the UEL is empty, store the generated test and proceed to the next step. (2) If the UEL is not empty, push all JUST subtasks onto the subtask stack, and go to step v.
• If no mutation effect is currently present at any definition in the program's POD, identify all MEP subtasks and push them onto the subtask stack, then go to step v.
viii. Perform mutation simulation (mutation testing) by applying the currently generated test, eliminating any mutants that are killed. We call this process mutant dropping.
ix. Repeat steps iii to viii for all remaining mutants.
x. For all the remaining mutants, use the global method to generate tests (with mutant dropping) until 100% (or the anticipated) coverage is reached.
We use the SW D-algorithm (steps i to ix) with a small backtrack limit on all mutants to kill most mutants at low complexity, and then use the global method for the remaining mutants. This 2-pass approach takes advantage of the scalability of our new SW D-algorithm and the ability of the global method to exploit a more global view (at higher complexity) for the relatively few remaining mutants.
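The 2-pass driver with mutant dropping can be sketched as follows. This is a toy model, not the tool's code: programs and mutants are modeled as Integer-to-Integer functions, and Gen stands in for a test generation pass (the SW D-algorithm in pass 1, the global method in pass 2); all names here are assumptions for the example.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Function;

// Sketch of the 2-pass driver with mutant dropping.
public class TwoPassDriver {
    public interface Gen {
        // Returns a test input that kills the mutant, or null on failure.
        Integer tryGenerate(Function<Integer, Integer> orig, Function<Integer, Integer> mut);
    }

    public static List<Integer> run(Function<Integer, Integer> orig,
                                    List<Function<Integer, Integer>> mutants,
                                    Gen dAlgPass, Gen globalPass) {
        List<Integer> tests = new ArrayList<>();
        List<Function<Integer, Integer>> remaining = new ArrayList<>(mutants);
        for (Gen pass : List.of(dAlgPass, globalPass)) {    // pass 1, then pass 2
            for (Function<Integer, Integer> m : new ArrayList<>(remaining)) {
                if (!remaining.contains(m)) continue;       // already dropped
                Integer t = pass.tryGenerate(orig, m);
                if (t == null) continue;                    // pass gave up on this mutant
                tests.add(t);
                // Mutant dropping: simulate the new test on every remaining
                // mutant and remove each one it kills.
                remaining.removeIf(x -> !x.apply(t).equals(orig.apply(t)));
            }
        }
        return tests;
    }
}
```

Each generated test is immediately simulated on all remaining mutants, so later passes only see mutants that no earlier test has killed.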
Using the original program and its mutant shown in Fig. 3.9, we now walk through the whole test generation process using the SW D-algorithm. Fig. 3.10 shows the search tree for this process (VA means "value assignment").
Figure 3.9: The original program and its mutant with the identified MEE and MEP mROAs
Consider the mutant shown in Fig. 3.9. After we initialize the whole program and perform implication, the EXSes of nodes 0, 1, and 2 are all true, def_1(x) = 0/0, use_3(x) = 0/0, and def_3(x) = 1/1. Then, the MEE mROA is identified and processed: we assign use(b) = 0/0, use(c) = 0/0, use(x) = 0/0, and def(x) = 1/0 to the variables in the MEE mROA's RIU and ROD. After the implication procedure, node 8 is executed, use_9(x) = 1/0, def_9(x) = 2/1, use_11(x) = 1/0, and def_11(x) = 3/2. Therefore, two MEP subtasks, through use_9(x) or use_11(x), are created. We first try MEP mROA_1 for use_9(x). During implication, a conflict is found at use(x) of the MEE mROA's RIU, where the existing value is 0 and the newly implied value is 1. Therefore, the algorithm backtracks to the last option, which is to propagate the mutation effect via MEP mROA_2 of use_11(x), and the algorithm state is recovered. MEP mROA_2 is then processed and all required implication tasks are performed afterwards without conflict, and the mutation effect is present at line 13 (the returned value), which belongs to the program's POD. However, two unjustified elements, the EC of edge 8-11 and the EXS of node 5, are still left in the UEL. So, we push these justification subtasks onto the subtask stack and process them. We first process justification for the unjustified EC of edge 8-11, which gives a new value assignment use_8(a) = 1/1. Implication is performed afterwards and the UEL is now empty. We generate a test: "a = 1", "b = 0", and "c = 0" at the program's PID, and the mutation effect at the program's output (the returned value) is "def_13(retVal) = 3/2".
Figure 3.10: The search tree of the example
3.6 Implementation
We implement the SW D-algorithm in Java. The symbolic analysis engine is based on the JPF-SE tool [32], with many changes made to accommodate the specific requirements of our new SW D-algorithm. The constraint solver used in the algorithm is Z3 [37], a widely used and powerful theorem prover. Also, we note that our SW D-algorithm works on Java bytecode, although we use Java source code here to illustrate the algorithm. In addition, we use muJava [31] to generate mutants, and use the method described earlier to eliminate equivalent mutants before performing our 2-pass test generation.
Table 3.1: Comparison between global scale approach and our approaches
3.7 Results
We evaluate the proposed SW D-algorithm and 2-pass test generation method on programs from Ammann and Offutt's text [34]. These programs' sizes range from 20 to 200 lines of code.
The baseline approach is based on the global symbolic analysis used in [13], [16] and [17]. For each mutant: (1) It first generates constraint expressions for the original program and the mutant. (2) It adds the constraint that at least one program output has a different value between the original program and the mutant. (3) It requires identical values at the program inputs for the original program and the mutant. A solution of these constraint expressions is a test for the target mutant.
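The baseline's formulation — same inputs, at least one differing output — can be illustrated with a brute-force stand-in for the constraint solver. This is only a sketch of the idea: the class and method names are assumptions, and a real implementation would use symbolic constraints and a solver such as Z3 rather than enumeration.

```java
import java.util.function.IntUnaryOperator;

// Brute-force stand-in for the baseline's constraint solving: search for an
// input on which the original and the mutant produce different outputs.
public class BaselineSearch {
    // Returns the first input in [lo, hi] that kills the mutant, or
    // Integer.MIN_VALUE if none exists in that range.
    public static int findKillingInput(IntUnaryOperator original,
                                       IntUnaryOperator mutant,
                                       int lo, int hi) {
        for (int t = lo; t <= hi; t++) {
            if (original.applyAsInt(t) != mutant.applyAsInt(t)) {
                return t;  // same input, differing output: the mutant is killed
            }
        }
        return Integer.MIN_VALUE;
    }
}
```

For instance, for an original predicate `x > 3` and its ROR mutant `x >= 3`, the only killing input is x = 3, which this search finds directly.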
Given a program under test, we use muJava [31] to generate all its mutants. Then we compare our methods with the baseline method in three aspects: the number of unique tests generated, the mutation score, and the run-time (in seconds). We first apply the baseline method for test generation and collect these statistics. After that, we use the method described earlier to eliminate all equivalent mutants, which gives the "# of non-equ mutants" column in Table 3.1.
For all non-equivalent mutants, we apply the proposed 2-pass test generation (the 1st pass uses the SW D-algorithm with a small backtrack limit) to generate tests and obtain the statistics of interest. Table 3.1 shows a significant speed-up when we use the SW D-algorithm or 2-pass test generation on all programs. We obtain good mutation scores by using the SW D-algorithm alone, while the mutation score reaches 100% when we use 2-pass test generation. Also, we see that the number of tests is greatly reduced when we use 2-pass test generation for all programs under test. We note that a compact test set is crucial in SW testing, since a golden model of the software does not exist and each test's outcome must be checked manually.
Overall, by applying 2-pass test generation, we can reach the expected mutation score in a shorter run-time while generating a much more compact test set. Hence, our new method is more scalable and effective.
3.8 Conclusion
We propose a new method, the SW D-algorithm, for mutation-based test generation. We define new concepts and methods that capture unique characteristics of SW, and extend the HW D-algorithm to be applicable to SW test generation. Specifically, we define new concepts including EXS, active DU path, mROA, and mutation effect, and we create a series of new approaches and procedures for the SW D-algorithm. In addition, we combine the SW D-algorithm and the existing global method to build a more efficient 2-pass test generation method. Compared with previous constraint-based test generation methods for mutants, our approach is more scalable and generates a more compact test set. The related research is published in [38].
Chapter 4
Advanced methods
In this chapter, we start with a brief description of our work on extending the scope of our previous SW D-algorithm. Then we capture further characteristics of SW and propose several improvements to our previous SW D-algorithm, including guaranteeing its completeness and improving its efficiency.
Notes on the experiments conducted in this chapter In Sections 4.2 and 4.3, we conduct experiments on several Java programs (the details of these programs are described in Section 6.1.1 and Table 6.1) to show the effectiveness of these advanced methods. In these experiments: (1) We did not perform mutant dropping by simulating the test generated at the end of each round of test generation; instead, the algorithm tries to generate one test for each mutant. We do this because we want to compare the test generation capabilities of the different versions of our D-algorithms, and hence avoid any distortion due to mutant dropping. Therefore, all experiments report the success rate of test generation, which is the ratio of the number of tests generated to the number of mutants. (2) Also, the mutant list used for test generation contains only non-equivalent mutants (equivalent mutants are eliminated in advance). This ensures an accurate calculation of the success rate (without interference from equivalent mutants) and a better evaluation of our advanced methods.
We note that the mutation scores obtained in this chapter's experiments are relatively high. One reason is that we use a mutant list without equivalent mutants. Another reason is that, without mutant dropping, our test generation algorithms try to generate a test for every mutant in the mutant list, which leads to a large number of tests; a larger deterministic test set usually kills more mutants.
The version of the test generation procedure used in this chapter's experiments is not practical for the actual test generation process. Firstly, we cannot easily eliminate all equivalent mutants in real-life testing. Secondly, test generation without mutant dropping takes a much longer run-time and generates many more tests, which would be a heavy burden for the testing process. Therefore, the version of the test generation procedure used in this chapter is only for demonstrating the improvements achieved through our proposed approaches; we use our new algorithms in a slightly different way in the final ATG tool and ATG system described in the following chapters.
Notes on terminologies For simplicity, starting from this chapter: (1) "ROA" refers to the "minimal ROA" (mROA) around the target statement, since our SW D-algorithm always identifies and uses the minimal ROA for local-level analysis. (2) "RIU" refers to the "controllable RIU" (CRIU) and "ROD" refers to the "observable ROD" (OROD), because the algorithm always uses the controllable RIU and observable ROD by default to describe an ROA's behavior.
4.1 Extending the scope of our ATG tool
4.1.1 Support more data types
Previously, our SW ATG tool supported integer as the only data type. We add support for all primitive data types as well as strings. To handle programs consisting of various data types, the captured constraints are parsed into a general form that the constraint solver can process. Also, if a program only uses a limited set of data types, specific constraint solvers can be used for efficiency.
4.1.2 Support non-linear expressions
Due to the limitations of the constraint solver we used earlier, our previous SW ATG tool could not handle non-linear expressions (e.g., x = y * z). Since some solvers (e.g., the new Z3 solver) support simple non-linear arithmetic operators, in the new version of the SW ATG we implemented the interface to use those solvers that are capable of solving non-linear constraints.
4.2 Guaranteeing completeness of the SW D-algorithm
4.2.1 Observations
Our previous SW D-algorithm is not complete: our investigation showed that it does not search all possible sites that are capable of propagating the mutation effect (ME).
More specifically, during the process of identifying MEP subtasks, we first need to find all uses that have the ME as potential sites for ME propagation. Our previous algorithm only considers a use if the def with ME is its only reaching def (RD); in other words, only justified uses with ME are considered as future MEP sites. For any unjustified use that has at least one RD with ME, the algorithm does not consider it as a potential MEP site, although the ME can possibly propagate through it. While this reduces the complexity of our previous SW D-algorithm, completeness is sacrificed.
Figure 4.1: A general case during MEP identification
Fig. 4.1 shows a general case during the MEP identification process. In the figure, ROD(x) is a def with a generated mutation effect at the subtask ROA's ROD. There are two uses that have ROD(x) as their RD: use_a(x) and use_b(x). While use_b(x) is justified to ROD(x), use_a(x) is not, since use_a(x) has another RD, def_p(x). Our previous algorithm misses the opportunity to propagate the ME through use_a(x), and thus it may miss the opportunity to generate a valid test. Therefore, we must consider both uses as propagation sites for MEP subtasks to ensure the completeness of the algorithm.
4.2.2 Key ideas
During the process of identifying MEP subtasks, we need to consider every use that has at least one RD with a mutation effect as an MEP site, regardless of whether the use is justified. For any MEP identified for an unjustified use with ME, we first try to justify the use; if no conflict is found, we process it as a normal MEP subtask.
4.2.3 Our new approach
After an MEE/MEP subtask, if an ME is generated and no conflict is found, the algorithm checks all of the ME's reachable uses. Among those uses with ME: for each justified use, a normal MEP subtask is created; for each unjustified use, an MEP subtask with potential mutation effect (PME) is created, and the corresponding RD with mutation effect is also recorded. In addition, if an unjustified use has multiple RDs with mutation effects, a separate MEP subtask is created for each such RD. In the example in Fig. 4.1, we create a normal MEP subtask for use_b(x), and an MEP subtask with PME for use_a(x).
When an MEP subtask with PME is being processed, we first perform justification for the DU pair from the corresponding RD with mutation effect to the target use with PME. In the example in Fig. 4.1, the DU pair that needs to be justified is ROD(x) to use_a(x). If there is no conflict after justification, we proceed to the normal MEP subtask handling procedure. If a conflict occurs, the current PME MEP subtask is eliminated from the subtask stack, and the remaining processing continues as normal.
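The extended MEP-site identification can be sketched as follows. This is an illustrative model, not the tool's code: uses and defs are plain strings, and the subtask labels ("MEP:", "PME_MEP:") are assumptions made for the example.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Sketch of MEP-site identification with PME support. rdsOf maps each use to
// its reaching definitions; meDefs holds the defs currently carrying a
// mutation effect; justified holds the uses already justified to a single RD.
public class MepSites {
    public static List<String> identify(Map<String, Set<String>> rdsOf,
                                        Set<String> meDefs,
                                        Set<String> justified) {
        List<String> subtasks = new ArrayList<>();
        for (var e : rdsOf.entrySet()) {
            String use = e.getKey();
            for (String rd : e.getValue()) {
                if (!meDefs.contains(rd)) continue;      // this RD carries no ME
                if (justified.contains(use)) {
                    subtasks.add("MEP:" + use);          // normal MEP subtask
                } else {
                    // One PME subtask per (use, RD-with-ME) pair; the DU pair
                    // must be justified before the subtask is processed.
                    subtasks.add("PME_MEP:" + use + "<-" + rd);
                }
            }
        }
        return subtasks;
    }
}
```

On the Fig. 4.1 configuration (use_b justified to ROD, use_a with the extra RD def_p), this yields one normal MEP subtask and one PME MEP subtask.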
Table 4.1: Comparison between the previous D-algorithm and the improved D-algorithm - part 1
(per program: success rate (%) | mutation score (%) | run-time (s))
           Program 1         Program 2         Program 3         Program 4
Pre D      47 | 100 | 39.7   76 | 100 | 20.1   30 | 60 | 73.3    56 | 95 | 21.5
Improved D 65 | 100 | 40.5   79 | 100 | 18.1   36 | 90 | 87.7    56 | 95 | 25.0
           Program 5         Program 6         Program 7         Program 8
Pre D      58 | 96 | 15.4    67 | 89 | 51.6    78 | 100 | 4.0    88 | 100 | 9.9
Improved D 62 | 96 | 16.3    75 | 95 | 53.1    81 | 100 | 4.1    88 | 100 | 8.8
4.2.4 Results
Table 4.1 shows the experimental results comparing our previous D-algorithm with the D-algorithm that considers PME subtasks. Six out of eight programs show an improvement in the test generation success rate, while the run-time of the improved method does not change much.
4.3 Improving the efficiency of SW D-algorithm
Our SW D-algorithm evolved from the HW D-algorithm by extending several existing definitions and methods used in the HW D-algorithm and creating several new concepts and methods. Although a SW program and a digital circuit share many similarities, they differ in key ways. Understanding these differences is vitally important for improving the efficiency of the SW D-algorithm.
Value space One difference between a HW digital circuit and a SW program lies in the value space. In a HW digital circuit, only Boolean values are used, since the state of a node in a digital circuit can only be 0 or 1. In contrast, a SW program can use various data types, and the value spaces are generally huge. A 32-bit integer, for example, can take any value from -2,147,483,648 to 2,147,483,647.
Program structure In addition, the structure and behavior of a SW program are much more complex than those of a HW digital circuit. In a digital circuit, gates are connected by circuit lines, which not only carry logic values but also indicate the flow of circuit control via logic value transitions. Also, every circuit element is always "ON". In a SW program, there are no physical "lines" between statements; all statements are "connected" by control flow and data flow. Also, a statement may or may not be executed, depending on whether its path conditions are satisfied.
Challenges for the SW D-algorithm The huge value space and the complex program structure make our ROA-level-analysis-based SW D-algorithm less efficient, and make test generation within a time limit challenging even for many small programs/units.
More specifically, in our SW D-algorithm, all subtasks (including MEE, MEP and justification) are processed at the ROA level (local level) to reduce overall run-time complexity and make the algorithm more scalable. As shown in Fig. 4.2, during subtask handling, new values are calculated and assigned at the RIU and ROD based solely on the ROA-level constraints. After the newly computed RIU and ROD values are assigned, global-level imply&check is performed, based on the control flow and data flow between the ROA and the code outside the ROA, to imply more values outside the ROA. It also checks whether there are any conflicts between the new assignments and the existing values.
Figure 4.2: Illustration of ROA
The huge value space means a huge search space for an ROA-level subtask, and identifying a valid value assignment without conflict with the rest of the program becomes more difficult. Experiments with our previous SW D-algorithm show that in cases of conflict, the D-algorithm searches ineffectively and aborts (gives up on the mutant) even in cases where the ROA-level and global-level constraints can be satisfied simultaneously. Moreover, the complex program structure makes this process even more challenging, since the value assignments of the ROA-level analysis and its global-level neighbouring statements are "connected" in a more complex way. Identifying useful information to fill the gap between ROA-level analysis and global-level constraints is crucial for reducing computational complexity.
In the remainder of this chapter, we propose several improvements that use low-cost static approaches to capture some global information, eliminating the inefficiencies of our previous SW D-algorithm caused by these key differences.
4.3.1 Edge condition assist
4.3.1.1 Challenges
During ROA-level subtask handling, our algorithm tries to generate a mutation effect at an MEE/MEP's ROD, or to justify a variable def, by solving ROA-level symbolic constraints. The solver then generates a solution that satisfies all these local constraints, but the solution is not necessarily compatible with the rest of the program. This is why we need global-level imply&check to detect any inconsistency in the context of the whole program after the ROA-level solution is applied. When a conflict is identified, the ROA-level subtask needs to generate a new solution. In many cases, we find the algorithm bounces back and forth between constraint solving at the ROA level and a failure of imply&check at a more global level, until the time limit or backtrack limit is reached, without finding any valid solution.
Figure 4.3: A code snippet and its mutant
For example, in the code snippet and its mutant shown in Fig. 4.3, the condition to trigger the mutation is x = y, but x > 10000 must also be satisfied so that the MEE ROA can be executed. If we only focus on satisfying the constraint for exciting the mutation at the ROA, millions of backtracks may be needed before a solution is reached, because when the constraint x = y is being solved, the solver has no hint from the global context. Given the nature of the constraint solver, it will keep trying integer values in increments/decrements of 1. Therefore, identifying a valid SW test within a time limit is sometimes challenging if we rely solely on ROA-level analysis.
On the other hand, we could use global symbolic analysis and generate a symbolic expression covering the whole program, so that x > 10000 is also considered. Although such global analysis reduces the test generation problem to a constraint solving problem, it is not scalable due to the high time complexity of symbolic analysis at the global scale.
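The cost difference in the Fig. 4.3 scenario can be made concrete with a toy model. This is a deliberately naive stand-in, not the tool's solver: it enumerates candidates x = y = 0, 1, 2, ... in order, charging one global imply&check round per candidate it submits; the constant 10000 mirrors the discussion above, and everything else is assumed for the example.

```java
// Toy model: the ROA-level constraint alone (x == y) versus the same
// constraint augmented with the edge condition x > 10000 from the global
// context. The "solver" proposes x = y = 0, 1, 2, ... in order.
public class EcAssistMotivation {
    // Counts the global-level imply&check rounds needed before a globally
    // valid assignment x = y = v is found.
    public static int implyAndCheckRounds(boolean withEec) {
        for (int v = 0, rounds = 0; ; v++) {
            // With the EEC x > 10000 added, the solver itself rejects bad
            // candidates and never submits them to imply&check.
            if (withEec && !(v > 10000)) continue;
            rounds++;                         // one imply&check per proposal
            if (v > 10000) return rounds;     // global check finally passes
        }
    }
}
```

Without the edge condition, every candidate up to 10001 triggers a full imply&check failure and a backtrack; with it, the very first proposal is globally valid.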
4.3.1.2 Key ideas
When the global-level imply&check that follows the ROA-level subtask returns a conflict, instead of blindly generating the next possible solution for the same constraints, we can actively use (some of) the program's global static information to guide the ROA-level analysis in a promising direction.
Unlike symbolic analysis, static analysis (including control flow and data flow analysis) has much lower complexity. With a small added overhead of static analysis, we try to identify necessary conditions that ensure the validity of the ROA-level subtask's solution in the global context of the whole program. The process of adding increasing levels of global static analysis is triggered by failures of the imply&check process. This extra global information reduces the search space and thus potentially improves the efficiency of the algorithm.
For ROA-level subtask handling, we observed three scenarios of conflict under which extra information from global static analysis can be useful to reduce unnecessary search: (1) when the subtask's ROA is found not to be executed; (2) when a value assignment at the subtask's RIU becomes invalid (not controllable); and (3) when the mutation effect produced by the current subtask is blocked from propagating to the program's outputs.
When a conflict is triggered during post-subtask global-level imply&check under these scenarios, we use static information to identify the conditional node on the CFG where the program flow diverges to the unwanted branch that caused the issue, and record the pre-computed edge condition (EC) for the edge that needs to be executed. In the next round of ROA-level subtask processing after backtracking, we add the identified edge condition as a necessary condition to the existing ROA-level constraints, to avoid repeating the same mistake.
Note that in order to minimize complexity, we want to identify a minimal set of global-level constraints that is necessary to ensure compatibility between the ROA-level assignments and the global context. In contrast, global symbolic analysis identifies all constraints that reflect the dependencies within the whole program, at high complexity. Therefore, during each round of subtask handling, if a conflict is triggered under the above-mentioned scenarios, our method identifies necessary conditions only for the problematic conditional node that caused the current issue, instead of including all necessary conditions, keeping the added constraints as minimal as possible.
In the following sections, we describe three types of edge condition assist for the different scenarios.
4.3.1.3 EC assist for executability of ROA
Problem statement After an ROA-level subtask is analyzed and a solution is identified, the current subtask's ROA may, as a consequence, become not executable during the post-subtask global-level imply&check phase. We design a procedure to identify the edge condition at which the implied values select the wrong branch that caused the ROA not to be executed, and to include the expected edge condition as a necessary condition in the next round of constraint analysis at the ROA level.
Our new approach Step 1: During the post-subtask global-level imply&check phase, if the working ROA of the subtask becomes not executable, check all of the ROA's control-dependent nodes (CDNodes) on the CFG and identify the one that was evaluated to the wrong branch (which prevented the ROA from being executed); extract the pre-computed edge condition of the opposite branch (the desired branch) as the extra edge condition (EEC). The EEC is a necessary condition for executing the ROA of the current subtask; it is composed of all variables used at the conditional node.
Figure 4.4: EC assist for executability of ROA - part 1
For example, in Fig. 4.4, there are two CDNodes for the target subtask ROA, CDNode_1 and CDNode_2. In this example, CDNode_2 is identified as the target conditional node that evaluates to the wrong branch (it should go right), so the pre-computed EEC (EEC_2) for the correct edge is recorded as a necessary condition for executing the ROA.
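Step 1's selection of the diverging CDNode can be sketched as follows. This is an illustrative model under simplifying assumptions: CDNodes are represented by parallel lists of their taken branch, required branch, and the pre-computed EC of the required edge; the names are not the tool's identifiers.

```java
import java.util.List;

// Sketch of Step 1: among the ROA's control-dependent nodes, find the one
// whose implied evaluation took the branch that prevents the ROA from
// executing, and return the pre-computed EC of the desired branch as the EEC.
public class EecSelect {
    // takenBranch.get(i): branch the implied values select at CDNode i;
    // requiredBranch.get(i): branch that leads toward the ROA;
    // ecOfRequired.get(i): pre-computed edge condition of that branch.
    public static String selectEec(List<String> takenBranch,
                                   List<String> requiredBranch,
                                   List<String> ecOfRequired) {
        for (int i = 0; i < takenBranch.size(); i++) {
            if (!takenBranch.get(i).equals(requiredBranch.get(i))) {
                return ecOfRequired.get(i);   // first diverging CDNode found
            }
        }
        return null;  // no divergence: the ROA is executable
    }
}
```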
Step 2: Using static backward program slicing, we identify the nearest common dependent defs (NCDef) on the DDG on which both the EEC and the RIU depend. Note that the NCDef must exist, since during the global-level imply&check phase, the evaluation result (which edge is taken) of the target CDNode was directly or indirectly determined by the RIU assigned during the subtask's ROA-level analysis.
Step 3: Within the identified program slices, generate the data transformation expressions (DTE) between the NCDef and the EEC, as well as the DTE between the NCDef and the RIU, through symbolic execution. Note that while this step requires high-complexity symbolic analysis outside the ROA, we picked the nearest common def in step 2 to limit its complexity by identifying a minimal slice of the program.
Step 4: Include the identified EEC and DTEs in the subtask's ROA-level constraints and solve the augmented constraints in the next round. In Fig. 4.4, EEC_2, DTE_21 and DTE_22 are included to make sure CDNode_2 will be evaluated to the right-side branch.
Step 5: If in the next round the post-subtask global-level imply&check fails again, we repeat the whole process of checking the possible cause and including the EEC and DTEs, as described in steps 1 to 4 in this section and in the following two sections (4.3.1.4 and 4.3.1.5).
We note that after a conflict is triggered at some location, our procedure identifies the problematic conditional node and includes the expected EEC and DTEs in the next round. The EEC is identified for one problematic control-dependent node at a time. If a new conflict is found in the next round, we repeat the process and include the new EEC and DTEs. Thus, our procedure is a "lazy" handling process. This, along with our use of static analysis to minimize the complexity of outside-the-ROA symbolic analysis at each step, keeps the complexity of our approach low.
Another option is to identify and include all necessary conditions "proactively" when a conflict is triggered. We find that this option is less efficient and that it violates the idea of local analysis. Firstly, we are only interested in EECs that depend on the variables used in the subtask's ROA-level constraint, so that it is meaningful to solve the EEC and the ROA-level subtask constraint together. EECs of interest are automatically identified during the post-subtask global-level imply&check, since any conflict is a consequence of the newly assigned values of variables used in the subtask's ROA-level constraint. Secondly, our method takes advantage of the local search philosophy, instead of identifying all global constraints and applying them in one run, to avoid overloading the symbolic execution and the constraint solver. EECs are only included when necessary, so the complexity grows slowly and only when needed.
We set an upper limit K_EC on the maximal number of EECs that can be added to the original constraints of the subtask. We repeat the whole process until a valid solution is generated or the K_EC/backtrack/time limit is reached. If a solution is generated without conflict, we proceed to the remaining steps of the D-algorithm.
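The retry loop with the K_EC limit can be sketched as follows. This is a schematic model, not the tool's code: the solver is abstracted as a function from the current constraint set to a candidate solution, and imply&check as an oracle that reports the next missing EEC; all names are assumptions for the example.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Function;

// Sketch of the lazy EEC-augmentation retry loop: each failed global
// imply&check contributes one more EEC to the subtask's constraints,
// up to the limit K_EC.
public class EecRetryLoop {
    public interface Oracle {
        // Returns the next missing EEC exposed by imply&check for the given
        // candidate solution, or null if the solution is globally consistent.
        String missingEec(int solution);
    }

    public static Integer solve(Function<List<String>, Integer> solver,
                                Oracle implyAndCheck, int kEc) {
        List<String> eecs = new ArrayList<>();
        while (eecs.size() <= kEc) {
            int candidate = solver.apply(eecs);            // ROA-level solve
            String conflictEec = implyAndCheck.missingEec(candidate);
            if (conflictEec == null) return candidate;     // globally valid
            if (eecs.size() == kEc) break;                 // K_EC limit reached
            eecs.add(conflictEec);                         // lazy augmentation
        }
        return null;  // give up on this subtask
    }
}
```

Constraints are added one at a time, so the solver's problem grows only as far as the conflicts actually encountered require.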
Figure 4.5: EC assist for executability of ROA - part 2
For example, in Fig. 4.5, after we include the correct EEC and DTEs for CDNode_2 shown in Fig. 4.4, we find a conflict again in the following round, as the wrong branch is taken at CDNode_1. Then, we incrementally include EEC_1, DTE_11 and DTE_12 in the subtask constraints and re-solve them in the next round. If all EECs are satisfied and no conflict is detected, the executability of the ROA is guaranteed, as shown in Fig. 4.6.
Figure 4.6: EC assist for executability of ROA - part 3
4.3.1.4 EC assist for controllability of RIU
Problem statement After an ROA-level subtask is analyzed and a solution is identified, during the post-subtask global-level imply&check phase the previous reaching def of the RIU may be blocked and an alternate reaching def activated, which makes the RIU value inconsistent with the value assignment made before this subtask. We design a procedure to identify the edge condition at which the implied values select the wrong branch that caused the controllability issue, and then include the expected edge condition as a necessary condition in the next round of constraint analysis at the ROA level.
Our new approach Step 1: During the post-subtask global-level imply&check phase, if the subtask's RIU becomes not controllable, we check all of its CDNodes and its def's CDNodes on the CFG and identify the one that was evaluated to the wrong branch (which made the RIU uncontrollable). There are two cases: (1) the original def that reaches the RIU is blocked (killed), or (2) the original def is not executed. We then extract the pre-computed EC of the expected branch as the EEC; the satisfaction of this EEC is necessary for the controllability of the RIU.
The two cases are illustrated in Fig. 4.7 and Fig. 4.8, where def(x) is the RD of RIU(x) prior to this subtask handling.
Figure 4.7: EC assist for controllability of RIU - part 1
Fig. 4.7 shows the case in which def(x) is found to be blocked (killed) by def'(x), which invalidates RIU(x)'s original RD, def(x). Therefore, the EEC for the right-side branch of CDNode_2 and its corresponding DTEs are identified and included in the next round of constraint solving at the ROA level. Fig. 4.8 shows the case in which def(x) is found not to be executed, since CDNode_1 is evaluated to the wrong branch, so the EEC for the right-side branch of CDNode_1 and its corresponding DTEs are identified and included in the next round of constraint solving at the ROA level.
Steps 2-5 are similar to the procedure described in Section 4.3.1.3.
Figure 4.8: EC assist for controllability of RIU - part 2
4.3.1.5 EC assist for observability of ME
Problem statement After an ROA-level subtask is analyzed and a solution is identified, during the post-subtask global-level imply&check phase the propagation path from the mutation effect generated at the ROD to the program's primary output (PO) may be blocked, making the mutation effect unobservable. We design a procedure to identify the edge condition at which the implied values select the wrong branch that blocked the mutation effect, and to include the expected edge condition as a necessary condition in the next round of constraint analysis at the ROA level.
Our new approach Step 1: During the post-subtask global-level imply&check phase, if the generated mutation effect at the ROD becomes not observable at the program's PO, check all of its conditional post-dominators (CPDoms) and identify the one that was evaluated to the branch that blocks the mutation effect (which made the mutation effect unobservable). Extract the pre-computed EC of the expected branch as the EEC; the satisfaction of this EEC is necessary for the observability of the mutation effect.
Note that the mutation effect's post-dominators are similar in concept to the future unique D-drive [39] in HW ATPG: both describe the sites through which the erroneous effect must propagate.
Figure 4.9: EC assist for observability of ME
In Fig. 4.9, ROD(x) is a mutation effect generated by the ROA-level subtask. During the post-subtask global-level imply&check phase, we find that ROD(x) is blocked (killed) by def(x), which triggers the conflict for ROD(x)'s observability, so the EEC of the expected edge (the false branch) of the CPDom and its corresponding DTEs are identified and included in the next round of constraint solving at the ROA level.
Steps 2-5 are similar to the procedure described in Section 4.3.1.3.
4.3.1.6 Results
Table 4.2 shows the experimental results of our previous D-algorithm and the D-algorithm with all three edge condition assist methods. Six out of eight programs have an improved test generation success rate. In addition, the run-time does not change much: the extra global-level static information saves search time, while the algorithm is able to generate tests for more mutants in the same time.
Table 4.2: Comparison between the previous D-algorithm and the improved D-algorithm - part 2
(per program: success rate (%) | mutation score (%) | run-time (s))
           Program 1         Program 2         Program 3         Program 4
Pre D      47 | 100 | 39.7   76 | 100 | 20.1   30 | 60 | 73.3    56 | 95 | 21.5
Improved D 47 | 100 | 36.3   90 | 100 | 18.3   47 | 100 | 100.5  73 | 100 | 24.6
           Program 5         Program 6         Program 7         Program 8
Pre D      58 | 96 | 15.4    67 | 89 | 51.6    78 | 100 | 4.0    88 | 100 | 9.9
Improved D 62 | 100 | 17.7   70 | 92 | 60.3    78 | 100 | 3.4    93 | 100 | 7.8
4.3.2 Possible value pre-computation
4.3.2.1 Observations
In the HW D-algorithm, indirect implication [39] is introduced to make the implication process
more complete. We want a more complete implication procedure as well, since the more values are
determined during implication, the less search is required during test generation.
The reduction of the search space is even more imperative for the SW D-algorithm, because the
huge value space of variables in a SW program (e.g., the value space of a 32-bit integer is 2^32)
potentially leads to an enormous amount of search to generate a valid test. Although the indirect
implication used in the HW D-algorithm is not applicable to the SW D-algorithm, due to the
impractical computational complexity caused by the huge value space and complex logic, other
methods can be applied to make the implication process more complete and thus reduce search.
4.3.2.2 Key ideas
We observed that in some programs, values are pre-defined by the code (not supplied by the
user). During pre-processing, implication is performed, and all other values that can be uniquely
determined by these pre-defined values are implied and assigned. Looking further, we find that in
some cases, although a variable may not have a single concrete value, it may have a finite set of
possible concrete values. Identifying such a finite set of values for a variable can greatly reduce
the search space, and this process can be considered a form of indirect implication in the SW
D-algorithm.
For example, in Fig. 4.10, use_3(x) can only be 1 or 2, since it has two RDs with different
concrete values (1 and 2), from def_1(x) and def_2(x). Also, def_3(y) can only be 2 or 3, because
we already know the possible concrete values for use_3(x) are 1 and 2, and y is uniquely
determined by x.
Figure 4.10: Example for possible value computation
Also, we propose to perform possible value computation only during the pre-processing phase
and to reuse the information throughout the test generation process, to minimize the computational
overhead that would otherwise offset the benefits.
4.3.2.3 Problem statement
In the SW D-algorithm, the value implication process only implies values that are uniquely
determined. We extend implication to also determine the finite set of possible concrete values
each variable can take.
4.3.2.4 Our new approach
During the pre-processing phase, for each variable used or defined in the program, we keep
track of all possible concrete values it can take. For each variable use or def, we define a set
called its possible concrete value assignment (PCA) set, which contains all possible concrete
values for the variable.

Then, for each known pre-defined value in the program, we use the implication method to
determine the PCA set for every use or def of the variable throughout the program.
The first case is the forward calculation from def to use: based on static information, when a
use has one or more reaching defs with non-empty PCAs, we take the union of those defs' PCAs
and assign it as the PCA of the use. In Fig. 4.11, PCA_p(x) is the union of the PCAs of all three
of its RDs: PCA_i(x), PCA_j(x), and PCA_k(x).
Figure 4.11: Forward pre-computation from def to use
The second case is the forward calculation from use to def: when def_p(y) is uniquely defined
by use_p(x) and PCA_p(x) is not empty, we use the pre-computed symbolic expression exp for the
statement at line p to calculate PCA_p(y), as shown in Fig. 4.12.
Figure 4.12: Forward pre-computation from use to def
In the example of Fig. 4.10, PCA_3(x) is {1,2}, which is the union of the PCAs of all its RDs,
and PCA_3(y) is {2,3}, which is calculated based on the symbolic expression of statement 3.
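The two propagation rules can be sketched as a minimal illustration of the Fig. 4.10 example, assuming statement 3 computes y = x + 1 (which is consistent with the values above); the helper names are hypothetical, not from the tool.

```python
def pca_of_use(reaching_def_pcas):
    """def-to-use rule: the PCA of a use is the union of the PCAs of its reaching defs."""
    out = set()
    for pca in reaching_def_pcas:
        out |= pca
    return out

def pca_of_def(use_pca, expr):
    """use-to-def rule: apply the statement's symbolic expression to every possible value."""
    return {expr(v) for v in use_pca}

pca_def1_x = {1}  # def_1(x): x = 1
pca_def2_x = {2}  # def_2(x): x = 2
pca_use3_x = pca_of_use([pca_def1_x, pca_def2_x])     # union of the two RDs: {1, 2}
pca_def3_y = pca_of_def(pca_use3_x, lambda x: x + 1)  # assumed statement 3: {2, 3}
```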
In addition, the size of a PCA should be kept small, i.e., less than a threshold K_PCA. If a
PCA is too large, the cost of computing it might offset the benefit gained from the reduced search
space.
Table 4.3: Compare previous D-algorithm with the improved D-algorithm - part 3
(columns per program: success rate / M score / time)

             Program 1          Program 2          Program 3          Program 4
Pre D        47 / 100 / 39.7    76 / 100 / 20.1    30 /  60 / 73.3    56 /  95 / 21.5
Improved D   56 / 100 / 34.0    76 / 100 / 19.1    33 /  91 / 72.7    56 /  95 / 22.0

             Program 5          Program 6          Program 7          Program 8
Pre D        58 /  96 / 15.4    67 /  89 / 51.6    78 / 100 /  4.0    88 / 100 /  9.9
Improved D   82 / 100 /  6.3    76 /  95 / 50.7    78 / 100 /  3.2    88 / 100 /  8.9
4.3.2.5 Results
Table 4.3 shows the experimental results comparing our previous D-algorithm with the
improved D-algorithm. Four out of eight programs show an increase in success rate, and the
run-time of the improved method is reduced due to the much smaller value space.
4.3.3 Improving use to def justification
4.3.3.1 Challenges
The justification process for an unjustified use can be tedious in some cases. Since
justification is carried out on the CFG, the problem of justifying a use is converted into the
problem of identifying a valid path on the CFG from the unjustified use back to one of its RDs.
Starting from the unjustified use and moving backwards node by node, our previous D-algorithm
uses DFS to traverse all possible paths that reach the target use, until a valid DU path is found.

The search space depends on the number of paths in the DU chains between the potential RDs
and the target use. In the case where only one specific def can be justified to the target use, the
algorithm becomes even less efficient: in the worst case, the whole DU chain must be explored
before that specific def is reached. For example, in Fig. 4.13, assume def_3(x) is the only valid
RD for use(x); our algorithm justifies the path from def_3(x) to use(x) only after it traverses the
whole CFG. We aim to improve the use-to-def justification algorithm to reduce the amount of
search.
Figure 4.13: Backward use to def justification
4.3.3.2 Key ideas
Instead of exhaustively searching backward from the target use to all its potential reaching
defs, we can choose one compatible DU pair at a time and use static analysis to make sure there
is a def-clear path between them.
4.3.3.3 Problem statement
During use-to-def justification, our previous method performs unguided backward
node-by-node search on the CFG to identify a path between the DU pair we want to justify.
Depending on the program structure and the location of the target def, this process may become
redundant and time consuming.
4.3.3.4 Our new approach
During justification phase, given the unjustified use, use
p
(x), choose one of its potential reaching
def, de f
i
(x). If de f
i
(x) and use
p
(x) are compatible, create a new justification task with the identi-
fied potential DU pair: [de f
i
(x), use
p
(x)], and push it onto the justification stack. Here we say the
two variable def/use are compatible if they have the same concrete value, or at least one of them
has no concrete value.
Then, we process the justification task for the target DU pair [def_i(x), use_p(x)] with
additional constraints obtained from static information: (1) def_i(x) must be executed: we add
EXS forward/backward implication tasks for the def_i(x) node on the CFG. (2) We assign
use_p(x)'s concrete value to def_i(x) and add value forward/backward implication tasks for
def_i(x). (3) For each of the other reaching defs, say def_j(x), if we find that the def_j(x) node
lies on any of the def_i(x)-to-use_p(x) paths (shown in Fig. 4.14), def_j(x) must not be executed,
so we add forward/backward EXS implication tasks for it. (4) Finally, imply&check is performed.
If it fails, we try the next compatible DU pair, or backtrack if no untried compatible DU pair
remains.

Table 4.4: Compare previous D-algorithm with the improved D-algorithm - part 4
(columns per program: success rate / M score / time)

             Program 1          Program 2          Program 3          Program 4
Pre D        47 / 100 / 39.7    76 / 100 / 20.1    30 /  60 / 73.3    56 /  95 / 21.5
Improved D   47 / 100 / 34.2    80 / 100 / 18.7    31 /  91 / 71.3    66 /  98 / 19.6

             Program 5          Program 6          Program 7          Program 8
Pre D        58 /  96 / 15.4    67 /  89 / 51.6    78 / 100 /  4.0    88 / 100 /  9.9
Improved D   58 / 100 / 22.6    69 /  92 / 49.3    78 / 100 /  3.3    91 / 100 /  7.9
Figure 4.14: Improved backward use to def justification
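The four-step procedure might be sketched as follows, assuming the static information for each reaching def (its concrete value, if any, and the set of other defs lying on its paths to the use) was pre-computed. The record layout and all names are illustrative, not the tool's actual data structures.

```python
def compatible(def_value, use_value):
    """A DU pair is compatible if the values match, or one side has no concrete value."""
    return def_value is None or use_value is None or def_value == use_value

def justify(use_value, reaching_defs, imply_and_check):
    """Try one compatible DU pair at a time: require the chosen def executed,
    every interfering def not executed, then run imply&check."""
    for d in reaching_defs:
        if not compatible(d["value"], use_value):
            continue
        constraints = {
            "execute": {d["name"]},            # (1) def_i(x) must execute
            "value": use_value,                # (2) def_i(x) takes the use's value
            "avoid": set(d["defs_on_paths"]),  # (3) interfering defs must not execute
        }
        if imply_and_check(constraints):       # (4) imply&check
            return d["name"]
    return None  # no compatible DU pair worked: the caller backtracks

# Toy scenario: def_3 is the only reaching def whose value matches the use.
rds = [
    {"name": "def_1", "value": 7, "defs_on_paths": set()},
    {"name": "def_3", "value": 5, "defs_on_paths": {"def_2"}},
]
chosen = justify(5, rds, imply_and_check=lambda c: True)  # chosen == "def_3"
```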
4.3.3.5 Results
Table 4.4 shows the experimental results of our previous D-algorithm and the D-algorithm
with the improved justification procedure. Five out of eight programs achieve a higher success
rate while taking less run-time.
4.3.4 Data transformation history for possible value pre-computation
4.3.4.1 Observations
In Section 4.3.2, we described a method to compute possible value assignments for all
variables in the program during the pre-processing phase, to reduce the value space searched in
the subsequent procedures.
We observed that during the computation of possible value assignments, we can also collect
information about how each specific value is defined, i.e., the data transformation history of each
possible value assignment.
4.3.4.2 Key ideas
This data transformation history can be used to further reduce search during subtasks. (1)
When we want to justify a use with a specific pre-computed possible value assignment, instead of
trying all its compatible RDs and searching for a valid DU path, we can directly look up the
value's data transformation history and identify the DU path without further search. (2) During
the MEE/MEP subtask, if an RIU's value is assigned from its possible concrete value set, we can
extract the path conditions from that value's data transformation history; the identified path
condition is a necessary condition for the subtask.

Therefore, in addition to computing possible concrete value assignments during the
pre-processing phase, we also record the transformation history of every possible concrete value
assignment. During subtask handling, when a possible concrete value assignment is present, we
can retrieve its data transformation history to reduce search and speed up the procedure.
4.3.4.3 Problem statement
When computing possible concrete value assignments, the information about how those values
are computed is also important. We design a procedure to capture this data transformation history
and use it to further reduce the search space during subtask handling.
4.3.4.4 Our new approach
There are two types of tasks for computing possible concrete value assignments: the first is
from def to use, the other from use to def.

During def-to-use possible concrete value computation, we record the specific def that defines
the target use. We also identify all paths between this DU pair; for each conditional node
common to these paths, we record which out-edge is taken if all paths make the same decision at
that node.

During use-to-def possible concrete value computation, we record the specific use that is used
to compute the target def.

Finally, we obtain the full data transformation history for every possible concrete value
assignment. Given a variable use/def, we know the path through which the variable is
transformed to its current concrete value, and which edge is taken at each conditional node along
the path.
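One way to sketch such a history record, assuming each entry stores the defining site plus the branch decisions common to all DU paths; the layout and names are illustrative, not the actual implementation.

```python
from collections import namedtuple

# One history entry per way a value can be produced at a site.
History = namedtuple("History", ["defining_site", "branch_decisions"])

# history[(site, var)][value] -> list of History entries that can produce it
history = {}

def record(site, var, value, defining_site, branch_decisions):
    history.setdefault((site, var), {}).setdefault(value, []).append(
        History(defining_site, dict(branch_decisions)))

# use_3(x) = 1 can only come from def_1(x), taking the true edge at node 2;
# use_3(x) = 2 can only come from def_2(x), taking the false edge at node 2.
record(site=3, var="x", value=1, defining_site=1, branch_decisions={2: True})
record(site=3, var="x", value=2, defining_site=2, branch_decisions={2: False})

def lookup(site, var, value):
    """During justification: if exactly one def produces this value, return the
    DU pair and its branch decisions, avoiding any backward search."""
    entries = history.get((site, var), {}).get(value, [])
    return entries[0] if len(entries) == 1 else None
```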
Use data transformation history during justification During justification, if the unjustified use
has a value that belongs to its possible concrete value set, we can facilitate the justification
process by using the value's data transformation history.

Specifically, we retrieve the def(s) that define the target use. If there is only one such def, the
target use can only be justified to that def, and all conditional node decisions along the DU path
can be retrieved. If there are multiple possible defs that define the target use, we try them one by
one, using the conditional node decision information along each path.
Use data transformation history during MEE/MEP During MEE/MEP subtask handling, if an
RIU is assigned a concrete value that belongs to its possible concrete value set, and that use has a
unique def that defines it, we can reduce search by using the target use's data transformation
history as additional restrictive information.
Table 4.5: Compare previous D-algorithm with the improved D-algorithm - part 5
(columns per program: success rate / M score / time)

             Program 1          Program 2          Program 3          Program 4
Pre D        47 / 100 / 39.7    76 / 100 / 20.1    30 /  60 / 73.3    56 /  95 / 21.5
Improved D   54 / 100 / 36.2    76 / 100 / 19.2    68 / 100 / 89.1    75 / 100 / 20.9

             Program 5          Program 6          Program 7          Program 8
Pre D        58 /  96 / 15.4    67 /  89 / 51.6    78 / 100 /  4.0    88 / 100 /  9.9
Improved D   82 / 100 /  5.5    79 /  98 / 71.0    78 / 100 /  3.5    88 / 100 /  8.2
Specifically, we know the condition nodes' decisions along the DU path; if a decision is
violated during post-subtask imply&check, we can include the correct edge condition and all
corresponding DTEs for the problematic conditional node in the next round to avoid the same
conflict. This process is similar to the approaches described in Section 4.3.1.
4.3.4.5 Results
Table 4.5 shows the experimental results of our previous D-algorithm and the D-algorithm
aided by the data transformation history. Five out of eight programs show a higher success rate,
and the run-times do not change much.
4.3.5 Other improvements
4.3.5.1 Test template
Observation Our SW D-algorithm identifies the MEE ROA such that the behavior change
caused by a mutation is captured solely by its ROD. All mutants with a code change at the same
location have the same MEE ROA range (same start/end lines). It is common for many mutants
to share the same MEE ROA range, since multiple mutants are generated by applying various
mutation operators to the same code line.

For all mutants with a mutation at the same location, the necessary condition for the MEE
ROA's executability is identical. Instead of identifying this necessary condition repeatedly, we
can identify it once and reuse it for all mutants with a mutation at that location.
Key ideas To eliminate the redundancy of identifying the necessary condition for the MEE
ROA's executability for mutants with a mutation at the same location, we record the constraints
that represent this necessary condition and reuse them for all other mutants with a mutation at
that location.

Thus, once the necessary condition for executability is identified for one mutant, the process
of trying to reach (execute) the MEE ROA reduces to a constraint solving problem for all other
mutants with a mutation at the same location.

This idea is similar to the method described in [40], where tests are generated for an RTL
element without considering the specific fault effect; only controllability and observability
conditions are considered. Those conditions can be viewed as a testing template for any fault at
that RTL element.
Our new approach In the SW D-algorithm, the effort spent re-computing the necessary
condition for the MEE ROA's executability for mutants with a mutation at the same location is
redundant. We develop a method to capture and reuse this necessary condition when applicable,
eliminating the redundant computation and improving efficiency.

We define a testing template (TT) as a set of constraints that represent the necessary condition
for executing the MEE ROA of a mutant.
There are three types of TT with regard to their relation to the MEE's subtask-specific
constraint: a Type-I TT is composed of variables that are independent of any variable used in the
MEE's constraint; a Type-II TT uses at least one variable (but not all) related to variables used in
the MEE's constraint; a Type-III TT is composed entirely of variables related to variables used in
the MEE's constraint.

We collect only Type-III TTs and the portion of Type-II TT constraints that consists of
variables related to variables used in the MEE's constraint. TT constraints whose variable uses
are independent of the MEE's constraint are not considered during the MEE processing phase;
since those constraints may relate to other parts of the program, they are better left until the
procedure requires them to be satisfied later. We do not want to over-constrain variables before
they are involved in the search.
A TT is collected while the MEE subtask is handled. Specifically, during post-subtask
imply&check, when a conflict is detected and EC assist for executability of the MEE ROA is
performed, we record the added EC and the corresponding DTEs. The ECs and DTEs are
cumulatively added to form the TT for the current ROA range of the mutant. When another
mutant with a mutation at the same location is ready for processing, we reuse the TT by adding
the constraints stored in it to the target MEE's subtask-specific constraint and solving them
together. If EC assist for executability is still needed, we add the newly identified ECs and DTEs
to the TT of the current ROA range, again cumulatively.

When working on mutants with a mutation at the same location, applying the pre-computed
TT saves computational effort.
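The TT cache could be sketched as below, modeling constraints as hashable strings and keying the cache by the shared MEE ROA range; this representation is illustrative, not the actual implementation.

```python
tt_cache = {}  # roa_range -> set of constraint tags (ECs and DTEs)

def get_tt(roa_range):
    return tt_cache.setdefault(roa_range, set())

def record_ec_assist(roa_range, ec, dtes):
    """After a post-subtask conflict, cumulatively add the new EC and its DTEs."""
    tt = get_tt(roa_range)
    tt.add(ec)
    tt.update(dtes)

def mee_constraints(roa_range, subtask_specific):
    """Solve the subtask-specific constraints together with the reusable TT."""
    return set(subtask_specific) | get_tt(roa_range)

# The first mutant at lines (10, 14) needed EC assist; later mutants mutated at
# the same location reuse the recorded template for free.
record_ec_assist((10, 14), ec="n > 0", dtes=["flag == 1"])
reused = mee_constraints((10, 14), subtask_specific={"a != b"})
```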
4.3.5.2 Indirect EXS Implication
During EXS implication, the EXS value of a node/edge is normally implied from its adjacent
edges/nodes. We find that the EXS implication process can be sped up, by reducing the number
of implication steps performed, using the program's CFG structural information.

On a program's CFG, given a node/edge, we can easily identify its dominators and
post-dominators. By the definition of dominators and post-dominators, once a node/edge's EXS
changes, the EXSes of all its dominators and post-dominators should change to the same value.
We call the process of implying a non-adjacent node/edge's EXS indirect EXS implication.

Therefore, when the EXS of a node/edge changes, we can imply not only the EXSes of its
adjacent edges/nodes, but also, indirectly, the EXSes of all its non-adjacent dominators and
post-dominators. Applying indirect EXS implication reduces the run-time spent on implication,
since implications between non-adjacent nodes/edges that previously required numerous steps are
now done in one step.
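A sketch of indirect EXS implication, under the scheme stated above that a changed EXS propagates to all dominators and post-dominators; the dominator/post-dominator sets are assumed pre-computed during pre-processing, and all names are illustrative.

```python
def imply_exs(node, value, exs, dominators, post_dominators):
    """Direct assignment, then one-step indirect implication over the node's
    pre-computed dominator and post-dominator sets."""
    exs[node] = value
    for n in dominators.get(node, set()) | post_dominators.get(node, set()):
        exs[n] = value
    return exs

# Toy CFG: entry -> cond -> {then, else} -> join -> exit
dom = {"then": {"entry", "cond"}}   # nodes dominating "then"
pdom = {"then": {"join", "exit"}}   # nodes post-dominating "then"

# Setting EXS("then") immediately implies entry, cond, join, and exit as well,
# instead of propagating node by node.
exs = imply_exs("then", True, {}, dom, pdom)
```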
4.3.5.3 Multi-pass justification
Compared to the HW D-algorithm, the justification process in the SW D-algorithm is more
complicated because of the lack of a composite value system to represent the different values in
the original program and the mutant. Hence, justification must be performed separately on both
versions and the results finally merged without conflict. This increases run-time complexity, and
the process is redundant when the target ROA does not include the mutation, in which case the
target ROA has the same constraints in the original program and the mutant. We can further
reduce complexity by avoiding the effort spent on such redundant calculations.
We draw inspiration from sequential-circuit ATPG research: [41] proposed a method that
simplifies the justification process for sequential circuits. A sequential circuit must first be
converted into a sequence of its combinational time frames before combinational ATPG can be
performed; the converted combinational circuit consists of several copies (time frames) of the
original circuit. After the fault effect is propagated to at least one primary output of the last
unrolled time frame, the algorithm performs justification. The method in [41] unrolls the
sequential circuit such that only the last time frame contains the fault; all other time frames are
fault free. The algorithm justifies the circuit with the fault for only a single time frame and treats
all other time frames as fault-free circuits. After a test is generated, it simulates the circuit with
the generated test to verify the test's correctness. The study shows that tests generated using
fault-free justification mostly cover the faulty circuit.

This approach reduces the complexity induced by the multiple-fault scenario by performing
justification in multiple passes under different time frame configurations. Results show that this
method improves overall efficiency while maintaining high accuracy.
Inspired by this improvement designed for HW ATPG, we propose multi-pass justification to
optimize the justification process in our SW D-algorithm: (1) we first perform justification only
on the original version of the program; (2) if justification succeeds but the generated test does not
kill the mutant, we further perform justification on both the original program and the mutant, and
try to generate a test that is compatible with both versions.
4.3.5.4 Add randomness to constraint solving process
During our ATPG procedure, a constraint solver is used to generate solutions. We notice that
the solver obtains solutions in a way that lacks randomness. For example, consider the integer
constraint "a!=1 or b!=1 or c!=1", a common form of constraint used to exclude the previous
solution a=1, b=1, c=1. The solver will most likely generate a=2, b=1, c=1 in the next round,
then a=3, b=1, c=1, and so on, increasing variable a step by step until other constraints are
violated. Because of this behavior, a valid solution may not be obtained within the time limit.
We design several simple heuristics to add randomness to the constraint solving process: (1)
When the solver processes constraints used to exclude an old solution, we force it to change a
randomly chosen variable each time, instead of repeatedly changing one specific variable. (2)
When solving a constraint in which a variable can take positive or negative values, we modify the
solver to alternate between generating positive and negative results, to avoid monotonically
increasing or decreasing the variable.
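Heuristic (1) can be illustrated with a toy generate-and-test loop; the real modification lives inside the constraint solver, so this is only a sketch of the randomized-exclusion idea, with hypothetical names.

```python
import random

def next_solution(prev, domain, still_valid, rng):
    """Randomly pick which variable of the excluded solution to change, then
    randomly pick its new value, retrying until the other constraints hold."""
    variables = sorted(prev)
    for _ in range(1000):
        var = rng.choice(variables)  # random variable, not always the same one
        val = rng.choice([v for v in domain if v != prev[var]])
        candidate = dict(prev, **{var: val})
        if still_valid(candidate):
            return candidate
    return None

rng = random.Random(0)  # seeded only so the sketch is reproducible
sol = next_solution(
    prev={"a": 1, "b": 1, "c": 1},
    domain=range(-5, 6),
    still_valid=lambda s: all(abs(v) <= 5 for v in s.values()),
    rng=rng,
)
# `sol` differs from the previous solution in exactly one randomly chosen variable.
```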
Chapter 5
SW ATG System
In this chapter, we propose the SW ATG system, which integrates our research on test
generation, equivalent mutant detection, and test compaction. The SW ATG system takes
advantage of all of these approaches and generates a minimal number of high-quality unit tests in
a scalable manner.
5.1 Multi-pass test generation
5.1.1 Observations
We proposed the SW D-algorithm to improve the scalability and efficiency of previous
constraint-based, mutation-oriented test generation approaches. However, there is always a
trade-off between a test generation algorithm's run-time complexity and the quality of the tests it
generates.

To generate tests guaranteed to kill all mutants, global-scale symbolic analysis must be used,
although it is not scalable. To generate many tests at low complexity, random test generation can
be used, although there is no guarantee of test quality. Further, both approaches generate large
numbers of tests and hence require much manual effort to create expected responses when the
tests are used.

Also, within the SW D-algorithm, we can always increase the backtrack limit to trade time
complexity for a higher probability of successfully generating a test.
Therefore, we can take advantage of all types of test generation algorithms and design a
multi-pass test generation process. Among the population of mutants generated for a specific
program, some mutants are easy to kill and others are much harder. We can use a low-complexity
method such as random test generation to kill the easy-to-kill mutants, use our SW D-algorithm
to kill the harder-to-kill mutants, and use a global-scale algorithm to kill any remaining
hardest-to-kill mutants.

By first applying a low-complexity method such as random testing to kill all easy-to-kill
mutants, only harder-to-kill mutants are left for the higher-complexity algorithms, which have
better test generation capability. This reduces the use of high-complexity deterministic test
generation methods.

We expect the multi-pass test generation procedure to obtain the best test quality at lower
overall computational complexity. However, such an approach may generate redundant tests, so
reducing the number of tests must also be addressed.
5.1.2 Our approach
We propose multi-pass test generation: (1) We first generate n_r random tests and perform
mutation testing; in this step, most easy-to-kill mutants should be killed. (2) For the mutants not
killed in the previous step, we apply our SW D-algorithm with all advanced methods, using an
initial backtrack limit BL_i. (3) For any remaining mutants, we increase BL_i by BL_inc and
re-run our improved SW D-algorithm; we expect more mutants to be killed in this step. (4)
Finally, for each mutant not killed by the previous steps, we perform global-scale symbolic
analysis to generate tests.

We note that in each pass, whenever a test is generated, we simulate it on all remaining
mutants; any killed mutant is dropped to save test generation effort in the following steps.
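Steps (1)-(4), together with mutant dropping, might be sketched as the driver below; the generator callbacks and parameter names (n_r, BL_i, BL_inc) are stand-ins for the components described above, not the actual implementation.

```python
def multi_pass(mutants, random_gen, d_algorithm, symbolic_gen, kills,
               n_r=100, bl_i=10, bl_inc=10):
    """Run the four generation passes, dropping killed mutants after each test."""
    tests, alive = [], set(mutants)

    def drop(new_tests):
        # Mutant dropping: keep only tests that kill at least one live mutant.
        for t in new_tests:
            killed = {m for m in alive if kills(t, m)}
            if killed:
                tests.append(t)
                alive.difference_update(killed)

    drop(random_gen(n_r))                                    # pass 1: random tests
    drop(d_algorithm(alive, backtrack_limit=bl_i))           # pass 2: SW D-algorithm
    drop(d_algorithm(alive, backtrack_limit=bl_i + bl_inc))  # pass 3: larger limit
    drop(symbolic_gen(alive))                                # pass 4: global analysis
    return tests, alive

# Toy run: mutants are ints, a test kills the mutant equal to it.
tests, alive = multi_pass(
    mutants={1, 2, 3, 4, 5},
    random_gen=lambda n: [1, 2, 9],  # 9 kills nothing, so it is discarded
    d_algorithm=lambda alive, backtrack_limit: sorted(alive)[:1],
    symbolic_gen=lambda alive: sorted(alive),
    kills=lambda t, m: t == m,
)
```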
5.2 Equivalent mutant elimination during test generation
5.2.1 Observations
The equivalent mutant problem was studied in detail in Chapter 2. In short, an equivalent
mutant is a mutant program that differs from the original program in syntax, but whose difference
does not change the semantics of the original program. Hence, no test can detect an equivalent
mutant, and any effort spent on test generation for an equivalent mutant is wasted. The existence
of equivalent mutants also affects the accuracy of the mutation score, since only non-equivalent
mutants should be counted to obtain a meaningful mutation score.

Therefore, prior to the test generation process, we should eliminate equivalent mutants to
avoid unnecessary test generation effort and to ensure the accuracy of the mutation score.

In Chapter 2, we proposed an ROA-level constraint-based equivalent mutant detection
method, which is more scalable than the global-level constraint-based method. We can use it to
detect equivalent mutants before the test generation process. However, that approach requires a
full equivalence-detection pass over all mutants.
In fact, the ROA-level equivalent mutant detection process can be combined with our
D-algorithm to reduce run-time complexity. The combination is feasible because the way we
formulate equivalent mutant detection makes it a sub-problem of the D-algorithm: during the
MEE subtask, if a mutant cannot be excited within its ROA, the mutant is equivalent to the
original program, since no input can differentiate the MEE ROA of the mutant from the MEE
ROA of the original program.

During the combined process, if a mutant is identified as equivalent, it is dropped; if not, we
proceed to generate a test for it. We note that dropped equivalent mutants are not counted in the
computation of the mutation score.
Also, as described in Chapter 2, a mutant that is non-equivalent at the ROA level may still be
equivalent to the original program at the global level. Such false negatives are caused by the
limited global information available when we perform local ROA-level equivalent mutant
detection.

In our multi-pass test generation procedure, if any such equivalent mutant (a false negative)
survives our D-algorithm passes, the last test generation pass (global-scale symbolic analysis)
naturally detects it by solving the global-level constraints.
5.2.2 Our approach
Equivalent mutants are detected and eliminated during the D-algorithm passes as well as the
global-scale pass.

During a D-algorithm pass, if the mutation effect of a given mutant cannot be excited during
the MEE subtask, the mutant is detected as equivalent. We drop it from the test generation
process and do not count it when calculating the mutation score.

During the global-scale analysis phase, if no test exists for a given mutant, the mutant is
identified as equivalent. We likewise drop it and do not count it when calculating the mutation
score.
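For illustration, the two elimination points can be condensed into one classification pass, assuming hypothetical predicates `excite_in_roa` (the MEE-subtask check) and `has_global_test` (the global-level check); note how the mutation score excludes equivalent mutants from its denominator.

```python
def classify(mutants, excite_in_roa, has_global_test):
    """Separate equivalent mutants from candidates for test generation."""
    equivalent, candidates = set(), []
    for m in mutants:
        if not excite_in_roa(m):       # D-algorithm pass: MEE cannot excite it
            equivalent.add(m)
        elif not has_global_test(m):   # global-scale pass: no test exists at all
            equivalent.add(m)
        else:
            candidates.append(m)
    return equivalent, candidates

def mutation_score(killed, mutants, equivalent):
    """Equivalent mutants are excluded from the denominator."""
    non_equiv = [m for m in mutants if m not in equivalent]
    return 100.0 * len(killed) / len(non_equiv)

mutants = ["m1", "m2", "m3", "m4"]
equivalent, candidates = classify(
    mutants,
    excite_in_roa=lambda m: m != "m2",    # m2 cannot be excited (ROA level)
    has_global_test=lambda m: m != "m3",  # m3 has no test (global level)
)
score = mutation_score(killed={"m1"}, mutants=mutants, equivalent=equivalent)
```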
5.3 Test compaction
5.3.1 Observations
In HW testing, shorter test sets are desired as they require smaller test application times and hence
considerably decrease testing costs, since every fabricated chip needs to be tested. In contrast, SW
testing is only performed once for a version of the program, independent of how many copies are
sold. Despite this, a compact test set is extremely important in SW testing for a very different
reason. Since a golden model of the software usually does not exist (test oracle problem [2]),
each test’s outcome must be manually checked for correctness. Since such manual checking is
extremely expensive, a compact set of tests that provides high quality can considerably reduce
costs.
In HW ATPG, one of the most common approaches to compacting tests is reverse order fault
simulation [39]. Given the set of all generated test vectors, and starting with the list of all faults,
this method simulates the test vectors in the reverse of the order in which they were generated.
Any test vector that detects no fault undetected by the previously simulated vectors can be
dropped: every fault it detects is also detected by those vectors, so there is no added value in
keeping it. Studies show that this method often reduces the test set size significantly.
We find that the idea of reverse order fault simulation is even better suited to our multi-pass
SW test generation, since the multi-pass method generates tests in order of increasing
mutant-killing ability (from pure random test generation to whole-program symbolic analysis):
the tests generated earlier are "weaker" than the tests generated later. Here, a "weaker" test is one
that kills fewer mutants than the "stronger" tests. When we apply reverse order simulation in our
ATG system, the "stronger" tests are simulated first, which is highly likely to eliminate the
"weaker" tests and therefore yields a more compact set of tests.
5.3.2 Our approach
After all test generation processes are completed, we start with the list of all mutants and
perform mutation testing by applying the tests in the reverse of the order in which they were
generated. Any test that kills no mutant that was not already killed by previously simulated tests
is eliminated.
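Reverse order mutant simulation can be sketched as below, assuming a `kills(test, mutant)` oracle that replays a test against a mutant; the names are illustrative. Tests generated later (the "stronger" ones) are simulated first and tend to subsume the earlier, "weaker" ones.

```python
def compact(tests, mutants, kills):
    """Keep only tests that kill some mutant not killed by later-generated tests."""
    kept, undetected = [], set(mutants)
    for t in reversed(tests):          # reverse of generation order
        killed = {m for m in undetected if kills(t, m)}
        if killed:                     # drop tests adding no new kills
            kept.append(t)
            undetected -= killed
    kept.reverse()                     # restore original order of the kept tests
    return kept

# Toy run: each test kills a set of mutant ids; the last (strongest) test
# subsumes the two earlier, weaker tests.
suites = {"t1": {1}, "t2": {2}, "t3": {1, 2, 3}}
kept = compact(["t1", "t2", "t3"], {1, 2, 3}, kills=lambda t, m: m in suites[t])
# kept == ["t3"]
```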
5.4 Complete SW ATG system
We propose the SW ATG system, which performs multi-pass test generation to obtain the
highest possible mutation score and then performs reverse order mutant simulation to compact
the test set.

During the deterministic test generation passes, including the SW D-algorithm passes and
global-scale analysis, we also perform mutant dropping: after a new test is generated, we simulate
it on all remaining mutants in the mutant list and eliminate every mutant it kills from the list.
The complete SW ATG system is described below:
i. Mutant generation: Given a program under test, generate mutants using mutation operators.
Each mutant makes small change to the original program. All the generated mutants are stored
in the mutant list, m.
ii. Random test generation: Use random test generation to generate n_r tests. Simulate these n_r tests on all mutants in m. Eliminate all mutants that were killed from m. In addition, only those tests that kill at least one mutant are kept.
iii. First pass of SW D-algorithm: If m is not empty, for each mutant in m, perform the SW D-algorithm with the initial backtrack limit, BL_i, to generate tests and eliminate equivalent mutants. Mutant dropping is then performed to eliminate any mutants that were killed by the new test.
iv. Second pass of SW D-algorithm: If m is not empty, for each remaining mutant in m, perform the SW D-algorithm with a larger backtrack limit, BL_i + BL_inc, to generate tests. Mutant dropping is then performed to eliminate any mutants that were killed by the new test.
v. Global-scale symbolic analysis: If m is not empty, for each remaining mutant in m, use global-scale symbolic analysis to generate a test, or eliminate the mutant as equivalent if no test can be generated. Mutant dropping is then performed to eliminate any mutants that were killed by the new test.
vi. Reverse order simulation: After all test generation passes are completed, starting with the complete list of mutants (the original m), simulate all generated tests in the reverse of the order in which they were generated, and eliminate any redundant tests.
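The flow of steps i–vi can be sketched as below. This is a simplified illustration with hypothetical names, not the dissertation's implementation: equivalent mutant elimination, backtrack limits, and the per-pass details are abstracted into a list of "passes" of increasing power, each of which either produces a test for a mutant or gives up on it.

```java
import java.util.*;
import java.util.function.*;

// Multi-pass test generation with mutant dropping. Each pass maps a mutant to
// an Optional test; kills(test, mutant) answers whether a test kills a mutant.
public class MultiPassAtg {
    public static List<String> run(Set<String> mutants,
                                   List<Function<String, Optional<String>>> passes,
                                   BiPredicate<String, String> kills) {
        List<String> tests = new ArrayList<>();
        for (Function<String, Optional<String>> pass : passes) {  // increasing power
            for (String m : new ArrayList<>(mutants)) {
                if (!mutants.contains(m)) continue;       // already dropped earlier
                Optional<String> t = pass.apply(m);       // try to generate a test
                if (t.isPresent()) {
                    String test = t.get();
                    tests.add(test);
                    // mutant dropping: remove every remaining mutant this test kills
                    mutants.removeIf(x -> kills.test(test, x));
                }
            }
        }
        return tests;
    }

    public static void main(String[] args) {
        Set<String> mutants = new LinkedHashSet<>(List.of("a", "b", "c"));
        // Single illustrative pass: produce test "kill-<m>" for mutant m.
        List<Function<String, Optional<String>>> passes =
            List.of(m -> Optional.of("kill-" + m));
        // "kill-a" also happens to kill mutant b, so b is dropped without its own test.
        BiPredicate<String, String> kills =
            (t, m) -> t.equals("kill-" + m) || (t.equals("kill-a") && m.equals("b"));
        System.out.println(run(mutants, passes, kills)); // [kill-a, kill-c]
    }
}
```

Mutant dropping is what keeps the later, more expensive passes small: any mutant incidentally killed by an earlier test never reaches the D-algorithm or global-scale phases.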
Chapter 6
Conclusion
6.1 Experiments and results
6.1.1 Experiments setup
We test and compare several test generation tools/methods, including our proposed methods and other state-of-the-art tools, on programs from Ammann and Offutt's text [34], DeMillo and Offutt's paper [13], and leetcode.com practice problems. These programs range from 20 to 200 lines of Java bytecode, typical sizes for the program modules targeted by unit testing.
Table 6.1 shows the details of the eight Java programs under test: the size of each program, the average size of the ROAs within the program, the number of distinct ROAs, the number of mutants generated from the original program, and the number of non-equivalent mutants.
Table 6.1: Details of the programs under test
Program 1 Program 2 Program 3 Program 4 Program 5 Program 6 Program 7 Program 8
Program size 65 53 167 70 36 98 27 71
Avg ROA size 5 6 11 9 3 8 2 9
# of ROAs 23 16 59 29 14 32 14 29
# of mutants 102 126 308 78 66 232 40 94
# of non-equ 78 109 268 50 51 161 32 86
We note that for all programs, the average ROA size is much smaller than the size of the whole program, so the run-time complexity of local-level analysis is much lower. Although a program is composed of the ROAs of all statements within it, our SW D-algorithm identifies separate ROA-level (local) subtasks to excite the mutation effect, propagate the mutation effect to the program's outputs, and justify all internal values within the program. This divide-and-conquer technique reduces the overall run-time complexity.
All test generation tools/methods we compared are listed below:
• Baseline approach: It uses global symbolic analysis [13] to generate symbolic expressions for the original program and one mutant (called the target mutant). It then constructs the constraint that at least one program output differs between the original program and the mutant. Any solution satisfying this constraint is a valid test for killing the target mutant. If no solution is found, the mutant is equivalent.
• Previous SW D-algorithm: This is the previous version of our SW D-algorithm (see Chapter 3) without any advanced methods; it uses a backtrack limit of 5 and a time limit of 3 seconds. Simultaneous equivalent mutant elimination is included. After each successful test generation, mutant dropping is performed to eliminate any killed mutants.
• Improved SW D-algorithm with advanced methods: This is the SW D-algorithm with all advanced methods (described in Chapter 4) applied; it also uses a backtrack limit of 5 and a time limit of 3 seconds. Simultaneous equivalent mutant elimination is included as well. Mutant dropping is performed after each successful test generation.
• SW ATG system: This is the proposed SW ATG system, which performs multiple test generation passes and test compaction at the end (described in Chapter 5).
• EvoSuite [5]: EvoSuite is a search-based test generation tool; it is one of the best state-of-the-art SW test generation tools and won the SBST 2017 tool competition.
• American fuzzy lop (AFL) [42]: This is a fuzzer based on genetic algorithms. It applies several algorithms to try to trigger unexpected behavior in the program under test.
We compare the performance of all these methods in terms of mutation score, number of
tests and run-time.
6.1.2 Experiments process
Given a program under test, we use muJava [31] to generate all its mutants. For all generated mutants, we use the baseline approach, the previous SW D-algorithm, the improved SW D-algorithm with advanced methods, the SW ATG system, EvoSuite, and AFL to generate tests and collect the statistics of interest.
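To make the setup concrete, the following hand-written example shows the kind of mutants these tools operate on. These are illustrative relational operator replacement (ROR) mutants, not actual muJava output: the first mutant is killed by the input x = 0, while the second is equivalent because no input can make its output differ from the original's.

```java
// Two ROR mutants of small Java methods: one killable, one equivalent.
public class MutantExample {
    public static boolean isPositive(int x)       { return x > 0; }
    public static boolean isPositiveMutant(int x) { return x >= 0; } // ROR: > becomes >=

    public static int max(int a, int b)       { return a > b  ? a : b; }
    public static int maxMutant(int a, int b) { return a >= b ? a : b; } // ROR: > becomes >=

    public static void main(String[] args) {
        // x = 0 kills the first mutant: the original returns false, the mutant true.
        System.out.println(isPositive(0) + " " + isPositiveMutant(0)); // false true
        // The second mutant is equivalent: when a == b both branches return the
        // same value, so the mutation effect can never propagate to the output.
        System.out.println(max(4, 4) == maxMutant(4, 4)); // true
    }
}
```

Equivalent mutants such as `maxMutant` are exactly what the D-algorithm and global-scale phases must identify, since no test can ever kill them.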
For equivalent mutant elimination: (1) the baseline approach uses global-level symbolic analysis, which automatically detects equivalent mutants; (2) our previous SW D-algorithm and the SW D-algorithm with advanced methods identify equivalent mutants during test generation, as described in Section 5.2; (3) our ATG system identifies equivalent mutants during the two D-algorithm phases and the global-scale phase, as described in Section 5.2; (4) EvoSuite and AFL are not able to identify equivalent mutants.
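Throughout the results, mutation score follows the standard definition, consistent with Tables 6.1 and 6.3: the percentage of non-equivalent mutants that are killed.

```java
// Mutation score = killed mutants / non-equivalent mutants, as a percentage.
public class MutationScore {
    public static double score(int killed, int totalMutants, int equivalent) {
        return 100.0 * killed / (totalMutants - equivalent);
    }

    public static void main(String[] args) {
        // Program 1 (Tables 6.1 and 6.3): 102 mutants, 24 equivalent, hence 78
        // non-equivalent; killing all 78 yields the baseline's 100% score.
        System.out.println(score(78, 102, 24)); // 100.0
    }
}
```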
6.1.3 Results
Table 6.2 shows the detailed comparison among the approaches. The baseline approach achieves a 100% mutation score for all programs, since global-level symbolic analysis captures the complete program behavior; but it has high time complexity compared to our other methods, and it generates very large numbers of test cases.
Our previous SW D-algorithm achieves a high mutation score in a shorter time than the baseline method, and the size of its test set is much smaller.
Table 6.2: Comparison among different approaches
Program 1 Program 2 Program 3 Program 4
M Score Test # Time M Score Test # Time M Score Test # Time M Score Test # Time
Baseline 100 78 24.6 100 109 23.1 100 268 80.6 100 50 31.5
Previous D 81 3 18.5 93 7 5.9 58 12 46.5 76 10 20.2
New D 94 6 8.6 96 7 4.8 98 27 14.8 95 12 15.9
ATG system 100 5 7.9 100 6 6.5 100 24 26.2 100 8 17.1
EvoSuite 70 6 60 82 8 60 69 18 60 62 9 60
AFL 43 29 60 53 28 60 35 31 60 42 19 60
Program 5 Program 6 Program 7 Program 8
M Score Test # Time M Score Test # Time M Score Test # Time M Score Test # Time
Baseline 100 51 14.6 100 161 54.1 100 32 15.5 100 86 19.7
Previous D 91 5 5.2 85 21 17.6 78 5 3 90 12 3.4
New D 96 7 2.0 92 24 15.7 92 8 5.1 98 12 2.5
ATG system 100 4 3.4 100 20 19.4 100 6 10.5 100 10 4
EvoSuite 74 5 60 58 16 60 75 5 60 85 10 60
AFL 39 16 60 23 31 60 40 15 60 31 21 60
Compared to the previous D-algorithm, our improved SW D-algorithm with advanced methods achieves a better mutation score for all programs. Moreover, the run-times are greatly reduced in almost all cases.
Our SW ATG system obtains a 100% mutation score for all programs and generates compact test sets. At the same time, its run-times are comparable to those of our two versions of the SW D-algorithm.
In the case of EvoSuite, we use 60 seconds (the default time limit) for its search process. Although EvoSuite generates tests of decent quality for some programs, its test generation quality is inconsistent, and even our standalone D-algorithm with advanced methods performs better.
For AFL, we keep the same 60-second time limit. The results show that it does not generate good quality tests, and it generates too many tests, which means enormous work for the manual gold-model check.
Overall, our SW ATG system outperforms the baseline approach and other tools in terms of
run-time, mutation score, and test compactness.
Table 6.3 shows the detailed statistics for all phases of our ATG system, with the baseline method as a reference. For each phase (random, D-algorithm with initial backtrack limit, D-algorithm with increased backtrack limit, global-scale symbolic analysis, reverse order simulation), it shows the mutation score, the number of equivalent mutants detected, the number of tests, and
Table 6.3: Detailed view of all ATG system phases
Program 1 Program 2 Program 3 Program 4
M Score Equ # Test # Time M Score Equ # Test # Time M Score Equ # Test # Time M Score Equ # Test # Time
Baseline 100 24 78 24.6 100 17 109 23.1 100 40 268 80.6 100 28 50 31.5
ATG Ran 34 0 3 0.1 83 0 3 0.1 15 0 2 0.4 60 0 4 0
ATG D1 95 19 7 6.9 96 12 6 4.5 97 35 28 19 90 28 8 13.1
ATG D2 95 19 7 7.2 96 12 6 5.4 98 35 29 23.5 100 28 12 14.5
ATG Glob 100 24 7 7.8 100 17 6 6.3 100 40 29 24.8 100 28 12 16.9
ATG Rev 100 24 5 7.9 100 17 6 6.5 100 40 24 26.2 100 28 8 17.1
Program 5 Program 6 Program 7 Program 8
M Score Equ # Test # Time M Score Equ # Test # Time M Score Equ # Test # Time M Score Equ # Test # Time
Baseline 100 15 51 14.6 100 71 161 54.1 100 8 32 15.5 100 8 86 19.7
ATG Ran 64 0 3 0 12 0 3 0.5 58 0 3 0 47 0 2 0.1
ATG D1 96 13 7 2.9 82 52 23 12.2 94 6 12 4.6 98 6 10 3.5
ATG D2 96 13 7 3.0 85 52 27 15.7 94 6 12 9.1 98 6 10 3.6
ATG Glob 100 15 7 3.3 100 71 29 18.9 100 8 12 10.4 100 8 10 3.9
ATG Rev 100 15 4 3.4 100 71 20 19.4 100 8 6 10.5 100 8 10 4
the time elapsed up to the end of that phase. From the random phase to the global-scale phase, the mutation score, the number of detected equivalent mutants, and the number of tests all increase (weakly), since the ATG system performs its test generation passes in order of increasing capability.
We note that in most cases (seven out of eight), all killable mutants are killed by the end of the 2nd D-algorithm phase, and the global-scale phase only eliminates the remaining equivalent mutants, without adding new tests. This shows that our D-algorithm with increased backtrack limit is able to kill all hard-to-kill mutants, and we already have perfect-quality tests without performing the global-scale step.
Also, the last row of Table 6.3 shows that reverse order simulation is effective for test set compaction. This matters because a compact test set is crucial in SW testing: testers must manually check the expected outputs of each test (the test oracle problem [2]).
In sum, our SW ATG system provides the highest mutation score with a shorter run-time. It is more scalable than the baseline and is effective for generating high quality yet highly compact unit tests.
6.2 Contributions
This dissertation contributes to the areas of mutation testing and mutation-oriented test generation. It adapts methods from HW testing and invents several new approaches to improve current SW mutation testing and mutation-based test generation methods, and it develops tools to identify equivalent mutants and automatically generate high quality unit tests in a scalable manner (practical run-time). Our primary contributions are:
I. Identification of the similarities between HW testing and SW mutation testing.
II. Application of important insights from HW testing to solve the SW equivalent mutant problem and improve its scalability.
III. Development of a local-based, scalable equivalent mutant detection method, and development of tools in Java.
IV. Application and extension of fundamental concepts from HW testing, and development of numerous new methods to improve the scalability of existing mutation-oriented constraint-based SW test generation.
V. Development of a local-based, scalable SW D-algorithm for test generation, and development of the SW unit test generation tools in Java.
VI. Extensive study of the similarities and differences between HW testing and SW testing, and development of various advanced methods that further reduce run-time complexity and improve the capability and efficiency of our SW D-algorithm.
VII. Development of the SW ATG system to perform multi-pass unit test generation and test compaction, to reduce run-time complexity, improve test quality, and dramatically reduce the size of the test set.
Most importantly, our research demonstrates convincingly that deterministic search-based approaches can indeed be effective for SW test generation. This is a major shift from the prevailing thinking in the field and hence opens up a new direction of research.
6.3 Limitations
Our proposed approaches have the following limitations: (1) Our methods are based on white-box testing, which requires the source code of the program; (2) Our tools do not fully support arrays with symbolic indices, an open problem in the SW research field; (3) Our tools cannot handle SW methods/procedures with inter-procedural calls; (4) Our tools cannot process non-primitive data/objects (except strings). These are topics of follow-on research.
6.4 Conclusion
In our research, we adapt and significantly expand fundamental ideas and concepts from HW
testing, and develop various approaches to improve SW mutation testing.
We develop a SW ATG system for mutation-oriented unit test generation that provides near-universal automation, including equivalent mutant elimination, multi-pass test generation, and test compaction. Compared with global-level constraint-based approaches, our method is more scalable. Compared with other state-of-the-art test generation approaches, our methods generate high quality and extremely compact unit tests. Finally, our research dramatically shifts the thinking and opens up a new direction of research in SW testing by showing, for the first time, that deterministic search-based techniques can indeed be more powerful.
References
1. DeMillo, R. A., Lipton, R. J. & Sayward, F. G. Hints on Test Data Selection: Help for the
Practicing Programmer. Computer 11, 34–41. doi:10.1109/C-M.1978.218136 (1978).
2. Barr, E. T., Harman, M., McMinn, P., Shahbaz, M. & Yoo, S. The Oracle Problem in Software
Testing: A Survey. IEEE Transactions on Software Engineering 41, 507–525. doi:10.1109/
TSE.2014.2372785 (2015).
3. Jia, Y. & Harman, M. An Analysis and Survey of the Development of Mutation Testing. IEEE Transactions on Software Engineering 37, 649–678. doi:10.1109/TSE.2010.62 (2011).
4. Roth, J. P. Diagnosis of Automata Failures: A Calculus and a Method. IBM Journal of Research and Development 10, 278–291. doi:10.1147/rd.104.0278 (1966).
5. EvoSuite - Automatic Test Suite Generation for Java. http://www.evosuite.org/ (2021).
6. Y2K bug. https://www.britannica.com/technology/Y2K-bug (2021).
7. ARIANE 5 Flight 501 Failure. Report by the inquiry board (1996).
8. Anand, S., Burke, E. K., Chen, T. Y., Clark, J. A., Cohen, M. B., Grieskamp, W., et al. An orchestrated survey of methodologies for automated software test case generation. Journal of Systems and Software 86, 1978–2001 (2013).
9. Del Frate, F., Garg, P., Mathur, A. P. & Pasquini, A. On the correlation between code cov-
erage and software reliability in Proceedings of Sixth International Symposium on Software
Reliability Engineering. ISSRE’95 (1995), 124–132.
10. Ghosh, I. & Fujita, M. Automatic test pattern generation for functional register-transfer level
circuits using assignment decision diagrams. IEEE Transactions on Computer-Aided Design
of Integrated Circuits and Systems 20, 402–415. doi:10.1109/43.913758 (2001).
11. Inozemtseva, L. & Holmes, R. Coverage is not strongly correlated with test suite effectiveness
in Proceedings of the 36th international conference on software engineering (2014), 435–
445.
12. Just, R., Jalali, D., Inozemtseva, L., Ernst, M. D., Holmes, R. & Fraser, G. Are mutants a
valid substitute for real faults in software testing? in Proceedings of the 22nd ACM SIGSOFT
International Symposium on Foundations of Software Engineering (2014), 654–665.
13. DeMillo, R. A. & Offutt, A. J. Constraint-based automatic test data generation. IEEE Transactions on Software Engineering 17, 900–910. doi:10.1109/32.92910 (1991).
14. Nica, S. & Wotawa, F. EqMutDetect — A tool for equivalent mutant detection in embedded systems in Proceedings of the 10th International Workshop on Intelligent Solutions in Embedded Systems (2012), 57–62.
15. Papadakis, M. & Malevris, N. An Effective Path Selection Strategy for Mutation Testing in
2009 16th Asia-Pacific Software Engineering Conference (2009), 422–429. doi:10.1109/
APSEC.2009.68.
93
16. Zhang, L., Xie, T., Zhang, L., Tillmann, N., de Halleux, J. & Mei, H. Test generation via
Dynamic Symbolic Execution for mutation testing in 2010 IEEE International Conference on
Software Maintenance (2010), 1–10. doi:10.1109/ICSM.2010.5609672.
17. Papadakis, M. & Malevris, N. Automatic Mutation Test Case Generation via Dynamic Symbolic Execution in 2010 IEEE 21st International Symposium on Software Reliability Engineering (2010), 121–130. doi:10.1109/ISSRE.2010.38.
18. Bottaci, L. A Genetic Algorithm Fitness Function for Mutation Testing. SEMINAL: Software engineering using metaheuristic innovative algorithms, workshop (2001).
19. Howden, W. E. Weak Mutation Testing and Completeness of Test Sets. IEEE Transactions
on Software Engineering SE-8, 371–379. doi:10.1109/TSE.1982.235571 (1982).
20. Goel, P. & Rosales, B. C. PODEM-X: An Automatic Test Generation System for VLSI Logic Structures in 18th Design Automation Conference (1981), 260–268. doi:10.1109/DAC.1981.1585361.
21. Just, R., Ernst, M. & Fraser, G. Using State Infection Conditions to Detect Equivalent Mu-
tants and Speed up Mutation Analysis (2013).
22. Offutt, A. J. & Craft, W. M. Using compiler optimization techniques to detect equivalent
mutants. Software Testing, Verification and Reliability 4, 131–154 (1994).
23. Smith, B. H. & Williams, L. An Empirical Evaluation of the MuJava Mutation Operators in Testing: Academic and Industrial Conference Practice and Research Techniques - MUTATION (TAICPART-MUTATION 2007) (2007), 193–202. doi:10.1109/TAIC.PART.2007.12.
24. Offutt, A. J., Lee, A., Rothermel, G., Untch, R. H. & Zapf, C. An Experimental Determination of Sufficient Mutant Operators. ACM Trans. Softw. Eng. Methodol. 5, 99–118. doi:10.1145/227607.227610 (1996).
25. Delamaro, M. E., Offutt, J. & Ammann, P. Designing Deletion Mutation Operators in 2014
IEEE Seventh International Conference on Software Testing, Verification and Validation (2014),
11–20. doi:10.1109/ICST.2014.12.
26. Hierons, R., Harman, M. & Danicic, S. Using Program Slicing to Assist in the Detection
of Equivalent Mutants. Softw. Test., Verif. Reliab. 9, 233–262. doi:10.1002/(SICI)1099-
1689(199912)9:43.0.CO;2-3 (1999).
27. Kunz, W. HANNIBAL: An efficient tool for logic verification based on recursive learning in
Proceedings of 1993 International Conference on Computer Aided Design (ICCAD) (1993),
538–543. doi:10.1109/ICCAD.1993.580111.
28. Kuehlmann, A. & Krohm, F. Equivalence Checking Using Cuts and Heaps in Proceedings
of the 34th Annual Design Automation Conference (Association for Computing Machinery,
Anaheim, California, USA, 1997), 263–268. doi:10.1145/266021.266090.
29. King, J. C. Symbolic Execution and Program Testing. Commun. ACM 19, 385–394. doi:10.
1145/360248.360252 (1976).
30. ORACLE JAVA Documentation, Lesson: Basic I/O. https://docs.oracle.com/javase/tutorial/essential/io/index.html (2021).
31. Ma, Y.-S., Offutt, J. & Kwon, Y. R. MuJava: An Automated Class Mutation System: Research Articles. Softw. Test. Verif. Reliab. 15, 97–133 (2005).
32. Anand, S., Păsăreanu, C. S. & Visser, W. JPF–SE: A symbolic execution extension to java pathfinder in International conference on tools and algorithms for the construction and analysis of systems (2007), 134–138.
94
33. Dutertre, B. & De Moura, L. The Yices SMT solver. Tool paper at http://yices.csl.sri.com/tool-paper.pdf 2, 1–2 (2006).
34. Ammann, P. & Offutt, J. Introduction to software testing (Cambridge University Press, 2016).
35. Kurtz, B., Ammann, P. & Offutt, J. Static analysis of mutant subsumption in 2015 IEEE
Eighth International Conference on Software Testing, Verification and Validation Workshops
(ICSTW) (2015), 1–10.
36. Zhang, J. & Gupta, S. K. Using hardware testing approaches to improve software testing:
Undetectable mutant identification in 2016 IEEE 34th VLSI Test Symposium (VTS) (2016),
1–6. doi:10.1109/VTS.2016.7477281.
37. De Moura, L. & Bjørner, N. Z3: An efficient SMT solver in International conference on Tools
and Algorithms for the Construction and Analysis of Systems (2008), 337–340.
38. Zhang, J., Gupta, S. K. & Halfond, W. G. J. A New Method for Software Test Data Generation Inspired by D-algorithm in 2019 IEEE 37th VLSI Test Symposium (VTS) (2019), 1–6. doi:10.1109/VTS.2019.8758641.
39. Jha, N. K. & Gupta, S. Testing of digital systems (Cambridge University Press, 2003).
40. Ghosh, I. & Fujita, M. Automatic test pattern generation for functional register-transfer level
circuits using assignment decision diagrams. IEEE Transactions on Computer-Aided Design
of Integrated Circuits and Systems 20, 402–415. doi:10.1109/43.913758 (2001).
41. Ghosh, A., Devadas, S. & Newton, A. R. Test generation for highly sequential circuits tech. rep. (Massachusetts Institute of Technology, Cambridge, Microsystems Research Center, 1989).
42. AFL - American Fuzzy Lop. https://github.com/google/AFL (2021).
Abstract
After decades of research and development, test generation for digital hardware is highly automated, scalable (in practice), and provides high test quality. In contrast, current software automatic test data generation approaches suffer from either low test quality or high complexity. One important reason for this discrepancy is that hardware automatic test pattern generation (ATPG) is fault oriented. Although mutation-oriented (mutations are analogous to faults in hardware) constraint-based test data generation for software was proposed to generate high quality test data focusing on real program bugs, all existing implementations require symbolic analysis of the whole program and hence are not scalable even for unit testing, i.e., testing the lowest-level software modules. Importantly, these approaches generate too many tests, and hence require impractically high manual effort during testing (the test oracle problem). Also, the equivalent mutant problem remains open, which affects the efficiency of test generation and the accuracy of the mutation score.

In this research, we study the similarities and differences between software (SW) testing and hardware (HW) testing, and apply important insights from hardware testing to improve existing software mutation-oriented testing. Instead of symbolically executing each mutated version of the entire program, we combine global structural static analysis with a sequence of small and reusable symbolic analyses of local parts of the program, to reduce run-time complexity and improve scalability.

In particular, we propose the first approach that uses local analysis in software testing to identify equivalent mutants, and a new method, inspired by the hardware D-algorithm and divide and conquer, for software unit automatic test generation (ATG). In addition, we develop multiple new algorithms and heuristics to further reduce run-time complexity and improve the test quality provided by our SW D-algorithm. We also propose a multi-pass SW ATG system for an optimized test generation process that reduces run-time complexity and the number of tests generated.

We compare our tools with one of the best state-of-the-art software test generation tools (EvoSuite, which won the SBST 2017 tool competition). The results show that our SW ATG system generates perfect quality unit tests in a scalable manner. We also demonstrate that our approach dramatically reduces the number of tests and hence drastically reduces the effort of testing.

Finally, our research dramatically shifts the thinking and opens up a new direction of research in SW testing, by showing, for the first time, that deterministic search-based techniques can indeed be more powerful.