SOFTWARE ARCHITECTURE RECOVERY USING TEXT CLASSIFICATION -- RECOVER AND RELAX

by

Daniel Gabriel Link

A Dissertation Presented to the
FACULTY OF THE USC GRADUATE SCHOOL
UNIVERSITY OF SOUTHERN CALIFORNIA
In Partial Fulfillment of the
Requirements for the Degree
DOCTOR OF PHILOSOPHY
(COMPUTER SCIENCE)

August 2022

Copyright 2022 Daniel Gabriel Link

Dedication

This dissertation is dedicated to my loving wife Melanie, whose support and encouragement allowed me to see it through.

Acknowledgements

I am deeply indebted to my advisor, Dr. Barry Boehm, for his unwavering support at all times, his helpful advice and his belief in me. I would like to sincerely thank the members of my dissertation committee, Dr. Sandeep Gupta and Dr. Aiichiro Nakano, for agreeing to serve on my committee and their insightful comments on my research that will inspire me long into the future. My gratitude is also due to my advisor, Dr. Barry Boehm, for serving as the committee chair.

I am also indebted to Dr. Russell Anderson, who encouraged me to take on challenging research tasks in my undergraduate days.

I cannot let my friends and co-authors Dr. Ramin Moazeni and Kamonphop Srisopha go unmentioned. Both have provided exceptional support and contributed valuable insights over the years. I would like to extend my thanks to Dr. Pooyan Behnamghader, Dr. Duc Le, Suhrid Karthik, Dr. Anandi Hira and Bo Wang for their practical suggestions and lending me their ears to discuss research matters.

This dissertation would not have been possible without Julie Sanchez going above and beyond the call of duty to support me in all administrative and other human matters. My thanks should also go to Winsor Brown for his encouragement of this pursuit. Special thanks go to all my friends who supported me morally or by simply allowing me to focus on other things and thereby recharging my batteries.

Table of Contents

Dedication
Acknowledgements
List of Tables
List of Figures
Abstract

Chapter 1: Introduction
  1.1 Stakeholder Concerns
  1.2 Software Architecture for Maintenance
  1.3 Architecture Recovery
    1.3.1 Desirable Properties of Architecture Recovery Methods
    1.3.2 Concern-Oriented Recovery
  1.4 Introducing RELAX
    1.4.1 RELAX Standouts
  1.5 Contributions
  1.6 Publications Related to This Dissertation
  1.7 Organization of This Dissertation

Chapter 2: Background
  2.1 Software Architecture and Architectural Views
  2.2 Architectural Change, Drift, Decay and Erosion
    2.2.1 Architectural Drift
    2.2.2 Architectural Erosion and Decay
  2.3 Natural Language Processing
    2.3.1 Topic Modeling
    2.3.2 Text Classification
  2.4 Architectural Recovery
  2.5 Smells and Anti-Patterns
    2.5.1 Code Smells
    2.5.2 Architectural Smells
  2.6 ARCADE
    2.6.1 Architecture
    2.6.2 Implementation
    2.6.3 Smell Detection
  2.7 ARC (Architectural Recovery Using Concerns)
    2.7.1 Stop Words
    2.7.2 Number of Topics
    2.7.3 Topic Quality
    2.7.4 Determinism
    2.7.5 Sensitivity
    2.7.6 Scalability
    2.7.7 Software architecture and its recovery
    2.7.8 Concerns in Software Engineering
    2.7.9 Topic Modeling and Latent Dirichlet Allocation
    2.7.10 Stop Words
    2.7.11 ACDC
    2.7.12 ARC
    2.7.13 PKG
    2.7.14 Commonalities of PKG, ACDC and ARC
    2.7.15 Measures of Architectural Similarity
      2.7.15.1 MoJoFM
      2.7.15.2 a2a
      2.7.15.3 cvg
    2.7.16 Code Smells and Architectural Smells
    2.7.17 ARCADE
    2.7.18 Software Architecture and Architecture Recovery
    2.7.19 Text Classification

Chapter 3: Approach
  3.1 Stop Words
  3.2 Number of Topics
  3.3 Topic Quality
  3.4 Determinism
  3.5 Sensitivity
  3.6 Scalability
  3.7 Main Recovery Process
    3.7.1 Selecting Concerns
    3.7.2 Collecting Training Data and Training a Classifier
    3.7.3 Classification
      3.7.3.1 Clustering
    3.7.4 Modularity and Crosstalk Prevention
    3.7.5 Textual Output
    3.7.6 Visualization
    3.7.7 Workflow
    3.7.8 Implementation
    3.7.9 Textual Output Detail

Chapter 4: Evaluation
  4.1 Hypothesis 1 (View)
    4.1.1 Ground Truth Considerations
      4.1.1.1 Comparisons of RELAX Recoveries to Expert Decompositions
  4.2 Hypothesis 2 (Maintenance)
    4.2.1 Study Approach
      4.2.1.1 Study Questions
      4.2.1.2 Participants
      4.2.1.3 Duration
      4.2.1.4 Parameters
      4.2.1.5 Tools and Instrumentation
      4.2.1.6 Tasks
      4.2.1.7 Warm-Up Tasks
      4.2.1.8 Experimental Tasks
      4.2.1.9 Surveys
      4.2.1.10 Initial Survey (Participants' Experience)
      4.2.1.11 Exit Survey (Evaluations of RELAX)
    4.2.2 Results
      4.2.2.1 Initial Survey on Preexisting Experience
      4.2.2.2 Timings of Tasks
      4.2.2.3 Exit Survey
    4.2.3 Discussion
    4.2.4 Answers to the Study Questions
      4.2.4.1 SQ1 (Does using RELAX architecture recovery results reduce the time to find the location in the code where maintenance needs to be performed?)
      4.2.4.2 SQ2 (What are the perceptions of new maintainers who work with RELAX architecture recovery results?)
  4.3 Hypothesis 3 (Scalability)
    4.3.1 Scalability by Design
    4.3.2 Measured Scalability
  4.4 Hypothesis 4 (Modularity)
    4.4.1 Runtime Variations
    4.4.2 Research Question
    4.4.3 Study Setup
    4.4.4 Study Results

Chapter 5: Related Work
  5.1 Architecture Recovery Methods
    5.1.1 Bunch
    5.1.2 LIMBO
    5.1.3 WCA
    5.1.4 ACDC
    5.1.5 PKG
    5.1.6 ARC
    5.1.7 MUDABlue
    5.1.8 Revealer
    5.1.9 MoDisco
    5.1.10 Quadratic Assignment Problem
    5.1.11 Genetic Black Hole Algorithm
  5.2 Other Related Work
  5.3 Value of Recovery Methods for Maintenance
    5.3.1 Research Questions
    5.3.2 Selection of Recovery Methods
    5.3.3 Selection of Subject Systems
    5.3.4 Computing Environment and Parameters
    5.3.5 Results
      5.3.5.1 RQ1 (Architectural Output)
      5.3.5.2 RQ2 (Feasibility)
      5.3.5.3 RQ3 (Clarity)
      5.3.5.4 RQ4 (Proportionality)
      5.3.5.5 RQ5 (Determinism)
      5.3.5.6 RQ6 (Continuity)
      5.3.5.7 RQ7 (Isolation, Modularity, Predictability)
      5.3.5.8 Topic Quality
    5.3.6 Conclusion
      5.3.6.1 ACDC
      5.3.6.2 ARC
      5.3.6.3 PKG
      5.3.6.4 Summary
    5.3.7 Future Work

Chapter 6: Threats to Validity
  6.1 Limitations of RELAX
    6.1.1 Current Implementation
  6.2 Evaluation Validity
    6.2.1 Selection of Subject Systems
      6.2.1.1 Open Source
      6.2.1.2 Selection and number of recovered open source systems
    6.2.2 Comparing RELAX to Other Methods
    6.2.3 User Studies
      6.2.3.1 Construct Validity
  6.3 External Validity
    6.3.1 Task Selection
    6.3.2 Subject System
    6.3.3 Number of Systems
    6.3.4 Java Programming Language
    6.3.5 Participants

Chapter 7: Conclusion and Ongoing Work
  7.1 Conclusion
  7.2 Future Work
    7.2.1 Diff-Based Recovery
    7.2.2 Enabling Maintenance
    7.2.3 RELAX IDE Plugin
    7.2.4 Extensions for the Recovery of More Source Code Programming Languages
    7.2.5 Automatic Compilation and Documentation of Historic or Ongoing Projects
    7.2.6 GUI Frontend

Bibliography

Appendix A: RELAX Manual
  A.1 Invocation
    A.1.1 Parameters
  A.2 Requirements
    A.2.1 Hardware
    A.2.2 Operating System
    A.2.3 Java Version
    A.2.4 Software Environment
      A.2.4.1 jdeps (Java class dependency analyzer)
      A.2.4.2 Graphviz (Visualization software)
      A.2.4.3 cloc (Code count)
    A.2.5 Configuration Files
      A.2.5.1 log4j2.xml
      A.2.5.2 relax_config.xml
  A.3 Known Issues

List of Tables

3.1 Entities with Feature Vectors
3.2 Clusters with Feature Vectors
3.3 RELAX Output Files
4.1 MoJoFM Expert Decomposition Comparison Values
4.2 Survey Results (Scales from 1-5)
4.3 Task Timing Results (in Minutes)
4.4 RELAX Recovery Speed (2015 Hardware)
4.5 RELAX Recovery Speed (2022 Hardware)
4.6 Mann-Whitney U Test for Smaller and Larger Systems
4.7 Modular Performance Results

List of Figures

1.1 The Virtuous Cycle of Architecture Recovery
2.1 ARCADE Recovery Workflow
2.2 Text Classification Workflow
3.1 Training Data Example
3.2 Classifier Training Workflow
3.3 Classifier Candidate Selection
3.4 RELAX Directory Graph Example
3.5 RELAX Directory Graph Legend
3.6 RELAX Directory Graph Detail
3.7 RELAX Directory Graph Low Level Comparison
3.8 RELAX Directory Graph of First System Version
3.9 RELAX Directory Graph of Second System Version
3.10 RELAX Recovery Workflow
4.1 UML Deployment Diagram
4.2 Shared Topic Models and Scope
4.3 Ground Truth Comparisons
4.4 Physical SLOC per Second for Projects of Different Sizes (2015 Hardware)
4.5 Logical SLOC per Second for Projects of Different Sizes (2015 Hardware)
4.6 Recovery Speed in SLOC per Second for Projects of Different Sizes (2015 Hardware)
4.7 Recovery Speed in SLOC per Second for Projects of Different Sizes (2022 Hardware)
5.1 The Apache License, Version 2.0
7.1 Plugin Status
7.2 Plugin Configuration Dialog
7.3 Plugin Layer Dependency Warning
7.4 RELAX Metrics Runner Frontend Dialog
A.1 Recovery Subject System Directory Structure
A.2 Chukwa Directory Structure

Abstract

In order to maintain a software system, it is beneficial to know its architecture. At each point in the development and maintenance processes of a software system, its stakeholders are legitimately interested in where and how its architecture reflects each of their respective concerns. Having this knowledge available at all times can permit them to continuously adjust their system's structure at each juncture and reduce the build-up of technical debt, which can be hard to pay down once it has accumulated and persisted over many iterations. Unfortunately, software systems commonly lack reliable and current documentation of their architecture. To remedy this situation, researchers have conceived a number of architecture recovery methods, each employing algorithms with varying degrees of automation to recover different architectural views of the system, some of them concern-oriented. Yet the design choices forming the bases of most existing recovery methods mean that none of them has a complete set of desirable qualities for the purpose stated above.

Tailoring existing recovery methods to a system is often either not possible or only achievable through iterative experiments with numeric parameters. This can be time-consuming and even open-ended. Furthermore, limitations in the scalability of the employed recovery algorithms can make it prohibitive to apply these existing techniques to large systems. Finally, since several current recovery methods employ non-deterministic sampling, their inconsistent results do not lend themselves well to tracking a system's architectural development over several versions, as needed by its stakeholders.

The goal of this dissertation research is to overcome these issues with a new recovery method that follows the concern-oriented paradigm and produces an architectural view that can benefit all stakeholders of a software system.
RELAX (RELiable Architecture EXtraction), a new concern-based recovery method that uses text classification, addresses these issues efficiently (1) by assembling the overall recovery result from smaller, independent parts, (2) by basing it on an algorithm with linear time complexity, and (3) by allowing itself to be tailored to the recovery of a single system or a sequence thereof through the selection of meaningfully named, semantic topics. An intuitive and informative architectural visualization rounds out RELAX's contributions. RELAX is illustrated on a number of open-source systems, and its results are surveyed with regard to accuracy, utility, scalability and modularity. Through its results in these areas, RELAX has shown itself to be a valuable aid that could form the basis of further tools supporting the software development process with a focus on maintenance.

Chapter 1: Introduction

Every software system serves one or more stakeholders. It also has an architecture, even if none of its stakeholders are aware of it. That architecture forms the basis of the communication between its stakeholders [32, 116]. Knowledge of this architecture is therefore a prerequisite for any informed decisions about its implementation and maintenance. The extent to which the architecture is known determines the degree to which its stakeholders are aware of their system and can effectively implement and maintain it. This awareness avoids architectural technical debt (having to spend considerably more work fixing deficiencies in the architecture later on than if the system had been built right from the ground up [69]) and ensures the system's continued architectural integrity.

Before any of these benefits can be reaped, however, the interested stakeholders will need to agree on what exactly, to them, constitutes the architecture of their software system. They also need to be aware of their concerns, which form the basis of the architecture discussion.
1.1 Stakeholder Concerns

Before we discuss the relevance of stakeholder concerns for an architecture any further, we need to clarify the terms.

The ISO has defined a stakeholder of a system to be an "individual, team, organization or classes thereof, having an interest in a system" [59], or, more recently, an "individual or organization having a right, share, claim, or interest in a system or in its possession of characteristics that meet their needs and expectations" [61]. Another definition is that of the persons or organizations who influence a system's requirements or who are impacted by that system [52].

All stakeholders of a software system expect it to fulfill a set of functional and non-functional requirements. They may also have expectations related to how and where in the system these features are implemented or realized.

The interests of the stakeholders are expressed as concerns, which are interests in a system relevant to one or more of its stakeholders [59]. In the context of software engineering, functional and non-functional properties are referred to as concerns. The definition of concern followed here is based on two perspectives. The first is that a concern is a "conceptual area of interest or focus for a stakeholder of a software project (e.g., a developer)" [129]. The second also refers to "the concrete manifestation of conceptual concerns (e.g., in source code, design diagrams, or other artifacts)" [129]. For our purposes, both perspectives are important, with the emphasis being on the second.

Out of the possible meanings the word concern can have in common English usage, the best matches in this context are the noun defined as "matter for consideration" and the verb "to be of interest or importance to" [103]. It is not limited to "something that causes worry" [103]. Defined in this way, a concern can be viewed as something that one or more human beings want to exist or to happen, and which can be expressed in natural language.
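The observation that concerns are expressible in natural language is what makes text-based recovery possible: if each concern can be described by characteristic vocabulary, a code entity's text can be scored against that vocabulary. The following is a purely illustrative sketch of that idea, not RELAX itself: the concern names and keyword sets are invented, and RELAX trains a classifier on collected training data rather than using fixed keyword lists.

```python
# Illustrative only: score a source file's text against hypothetical
# concern vocabularies and assign it to the best-matching concern.
CONCERN_KEYWORDS = {
    "security":   {"encrypt", "password", "auth", "certificate"},
    "networking": {"socket", "http", "connect", "port"},
    "gui":        {"window", "button", "render", "click"},
}

def classify(source_text: str) -> str:
    """Return the concern whose vocabulary overlaps most with the text."""
    tokens = source_text.lower().split()
    scores = {
        concern: sum(tok in keywords for tok in tokens)
        for concern, keywords in CONCERN_KEYWORDS.items()
    }
    return max(scores, key=scores.get)

print(classify("open a socket and connect to the http port"))  # networking
```

A trained text classifier generalizes this sketch by learning which terms are evidence for which concern instead of relying on hand-picked keywords.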
In a software system, a concern is something the system needs to do or to have, such as a functional or non-functional property.

Since under this definition all software systems exist to address concerns, it follows that at the beginning of every software project, from a small program to a major system, must stand a non-empty set of stakeholder concerns, which may be written down, communicated verbally, or implied.

1.2 Software Architecture for Maintenance

The continuing progress of hardware and software, as well as of related disciplines such as software engineering, is offset by invariants observed decades ago, only a few years after the term "software engineering" became popular: that software systems need to be adapted in order to stay useful, that their complexity will increase unless steps are taken to prevent this, and that all stakeholders of a system need to expend effort to stay familiar with it [75].

The success of a software system is primarily measured by the degree to which it meets the purposes for which it was intended [117]. For this, requirements should be collected not only before the system's initial implementation, but also regularly while it is being maintained. A well-chosen architecture then provides appropriate abstractions and design contracts that facilitate dealing effectively with the system's increasing complexity, and can be adapted as needed over the system's life-cycle.

While software architecture has been compared to that of buildings (within limits [141]), it can also be described as the "software development culture" of a system: a description of how its development should be done, as well as where and in which order. In keeping with the analogy, a well-defined architecture is like a civilized culture in that it contains detailed and workable definitions of appropriate behaviors, and being aware of it removes doubt.
Conversely, a poorly defined or missing architecture resembles an uncivilized culture in which few generally agreed-upon rules exist and everyone improvises. To identify what needs to be achieved between the stakeholders to reach an architectural consensus that can support a system’s evolution, it is helpful to compare and contrast the architecture of a software system with that of a building. While several definitions of the term software architecture exist [136], e.g., the set of design decisions about a software system [141], they all refer to the structure of a software system and the reasoning process that led to that structure.

Commonly, the process of constructing a building above a very small size is highly regulated through building codes and other laws, and requires as a prerequisite that an architect produce a planned architecture that adheres to all applicable regulations. This requirement formalizes the definition of a building’s architecture [4, 114] so that it can be described sufficiently by its physical structure, the materials it consists of, and their current states. Any later deviations from the building’s prescriptive architecture, be they through intentional modifications or decay over time, will be plainly visible or at least measurable and are thus easily identified. Modifications to an existing building’s structure are neither easily performed nor likely to go unnoticed, and major modifications in particular have to be carefully planned so as to stay within the parameters specified by all applicable regulations and to not pose a danger to the building’s integrity.

In contrast to constructing a building, creating even a very large software system is, outside of a number of special domains, not a government-regulated process. Instead, the entire process can be informal. In the general case of software development, no documentation of the prospective software architecture has to be produced and presented to any authorities for approval.
Additionally, different from buildings, software can be easily modified, since it is “infinitely malleable” [26]. Major parts of its source code (e.g., libraries or frameworks) can easily be exchanged or updated. Similarly, functionality can be added or removed. Regular maintenance updates, often at relatively short intervals that add up to several per year, are common even for large and complex software systems. On the flip side, the relative ease of such major modifications can make obtaining, understanding or using the architecture of a non-trivial software system difficult.

Regardless of whether a system grew from a single source file, with its programmers adding code entities to it piece by piece, or whether a written, prescriptive architecture (an as-designed architecture, i.e., a set of principal design decisions that reflects the architect’s intent) existed that its developers conformed to as they implemented the system, at least a descriptive architecture (as-implemented, either realizing the decisions made in the prescriptive architecture or, in its absence, more ad-hoc ones) always exists. Consequently, all software systems have an architecture.

Even if a written architecture exists, its value cannot be taken for granted, since it may not hold much informational worth if it is only a wishful statement or was never communicated to the developers. Lastly, the descriptive architecture may have followed the prescriptive one at some point, but both may have diverged over time due to architectural drift, erosion or decay. Two insights result from this: (1) the descriptive architecture of a system may not be the same as the prescriptive one, and (2) not all stakeholders may be aware of the architecture. For these reasons, knowledge of the architecture of a system can have different levels of quality, and may never have existed to begin with.
While over time some systems fall out of usage and have their space taken over by others, it is also not out of the ordinary for popular software systems to stay in use and be maintained for many years, if not decades. By now, well-known major desktop operating systems such as Windows, macOS (1984) [157] and Linux (1991) [144] have existed for decades and have their roots in even older ones. The two most important and widely used mobile operating systems, Android and iOS, are in turn based on the latter two of those desktop operating systems [57, 87]. Similar longevity is found in the realm of commercial applications such as Microsoft Office [27], Microsoft Word (first released in 1983 [77]) and Adobe Photoshop (1990 [123]). On the open source side, major examples are the Apache HTTP Server (https://httpd.apache.org), web server software that has existed for decades and powers most of the World Wide Web, the GNU C Compiler (gcc), which had its initial release in 1990 [139], and the Perl programming language, released in 1987 [152].

In addition to the changes that such a long-lived system’s code and architecture have to undergo during its lifetime, as required by changes in its technological environment, it should also be considered that its developer team is likely to experience a great deal of personnel turnover [54], with some recent projects reaching a yearly turnover of as much as 15% [65]. This means that a project like the Linux kernel, which has 15,600 contributors, needs over 2,300 new programmers per year to become productive in maintaining it [33]. Whether a transfer of system knowledge to new personnel is possible depends on a number of factors, the most basic of which is whether that transferable knowledge has ever existed at all. As we have established, this is doubtful with regard to the architecture in many situations.
A look at how each of these successful systems has evolved over such a long time span underscores that for a system to have longevity, it must address a recurring need for changes that reflect those in its environment [76]. When a software system evolves, its source code and its architecture change [63, 143].

Two everyday tasks developers have to tackle are adding functionality to an existing system and repairing it, typically based on a description of what needs to be added, but not where the necessary changes need to be performed. While making modifications to a major system always carries the risk of committing errors and introducing unintended side effects, this risk is exacerbated if the developers lack the knowledge of where and how to make them. In particular, developers new to a large project may be unfamiliar with its substantial code base and not know where to begin. Having an actionable view of a software system is useful for its stakeholders for a variety of reasons, such as usability [10], security [14], and maintenance, as I am about to show in the evaluation in Chapter 4. Documentation or access to the original authors may not be available to mitigate the situation. At that point, feature location (finding the code relevant to the desired features in the system’s implementation) and the speed at which it can be performed become important [37]. Manual search can be frustrating and time-consuming. One effect of this is that it will take a careful developer a long time to get started on a task, because much of a developer’s time is spent understanding unfamiliar code [67]. For this reason, the COCOMO II cost estimation model takes software understanding into account as a cost factor [19]. It is easy to see that maintainers would be well served by having automatic support in mitigating these risks.
The other stakeholders may also want to draw conclusions from the architecture on how the concern structure of the system is organized, e.g., regarding the dependency hierarchy between concerns that is expressed in its layers. This is where architectural recovery comes in.

1.3 Architecture Recovery

A software system can only be maintained to the extent that it is known. Fully knowing a system includes being aware of its architecture. This awareness avoids technical debt [151] and ensures the system’s continued integrity.

In practice, however, explicit knowledge of a system’s architecture may never have existed. Even if it has, it may have deteriorated over time through phenomena such as missing, incomplete or poor documentation, personnel turnover removing familiarity from the organization, as well as architectural drift or erosion [141]. The latter two are caused by careless or unintentional addition, removal, and modification of architectural design decisions [48]. In many of these cases, the only way to obtain any architectural information is to recover it from implementation-level artifacts. To address this, a wide variety of software architecture recovery methods have been created [38, 48]. These recover views of a system’s architecture under their respective, different paradigms. For this, they each apply different specific algorithms to the system’s implementation artifacts (e.g., the source code, byte-code, executable files, directory structure, configuration files). By its definition, architectural recovery is the process of retrieving the architecture of a system from its implementation-level artifacts [141]. Some conceivably useful related artifacts such as requirements definitions, a written architecture description, manuals and others may or may not exist, and can therefore not be relied upon by any recovery method that wants to make a claim of general applicability to all systems.
In addition to their availability, the veracity and relevance of these documents may also be hard to ascertain: they may describe features and requirements that have no counterpart in the system as it currently exists. Accordingly, to the degree that these documents do not reflect the system as it exists, a recovery based on them will be skewed. On the other hand, one artifact that will always be present for a system to be maintained, and will always be true and relevant, is its source code.

The product of architectural recovery is not its set of design decisions proper, but another artifact: a view of its architecture. Many different recovery methods exist that produce different kinds of views. There is no single “correct” view of an architecture that contains the whole truth about it. The same architecture can be described through different views [98]. While the incompleteness of an individual architectural view may be regarded as a disadvantage, one positive aspect is that such views make the recovered systems, or versions thereof, comparable along a dimension by using tools and metrics. The same cannot be said of written specifications, since the different approaches to these run the gamut from informal to formal [51] and thus do not have to follow a specific representational style.

A software architecture can be represented through many views that follow different paradigms, such as program comprehension and subsystem patterns [147], optimized clustering [95], dependencies, or concerns [49, 84]. However, even after a view of the architecture has been obtained, that view must be understood by the system’s stakeholders, and it must hold some utility for them. As laid out in Chapter 5, not all views may be similarly easy to understand or hold similar utility for all stakeholders to use in decision-making. In the event of such a mismatch, a different architectural view may have to be recovered using another recovery method.
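To illustrate how architectural views make versions of a system comparable along a dimension, the following Python sketch compares two entity-to-cluster assignments with a simple overlap measure. The metric, entity names and cluster labels here are invented for illustration only; they are not a similarity measure defined in this dissertation.

```python
def view_similarity(view_a: dict[str, str], view_b: dict[str, str]) -> float:
    """Fraction of code entities shared by both views that are assigned
    to the same cluster in each. 1.0 means the views agree completely."""
    shared = set(view_a) & set(view_b)
    if not shared:
        return 0.0
    same = sum(1 for entity in shared if view_a[entity] == view_b[entity])
    return same / len(shared)

# Hypothetical entity-to-cluster maps for two versions of a system:
v1 = {"Parser.java": "io", "Cache.java": "storage", "Auth.java": "security"}
v2 = {"Parser.java": "io", "Cache.java": "networking", "Auth.java": "security"}
print(view_similarity(v1, v2))  # two of the three shared entities agree
```

Any such metric reduces a view to one dimension, which is exactly why two versions of a system become comparable under it, and why different metrics over different views can disagree.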
For these reasons, the paradigm each recovery method needs to follow is determined by the purpose it intends to serve and by the stakeholders it intends to serve it for.

While the task of determining a system’s principal architectural design decisions from its implementation can have any degree of difficulty, no conceivable recovery method would be able to always retrieve this set of decisions. There are several reasons for this:

1. Some decisions may have been made, but not realized. The system may be intended to have some functionality which has not been implemented yet.

2. The system as built may violate some decisions.

3. By accident, the system may conform to decisions that have never been made.

The promise of software architecture recovery is that it yields results that are not only accurate, but also help the stakeholders of a system to evaluate the system and to estimate the impact possible changes to the system would have on its architecture. The virtuous cycle of architecture recovery [85], as shown in Figure 1.1, lets stakeholders cycle through the following steps:

• Recover the architecture and produce a view,
• Analyze the produced view,
• Take cues from the analysis, and
• Fix architectural issues.

The benefit is that the architecture of the system should be improved in each cycle until no more issues are to be found. This holds true without any regard to the specific paradigm any recovery method’s view is based on. Proving the accuracy of a recovery result is relatively easy. Verifying its utility is more complicated, but it can certainly be shown that some accurate results have limited utility.

Figure 1.1: The Virtuous Cycle of Architecture Recovery

One reason for the interest in architecture recovery is to gain information on whether the architecture fits a certain desired style or exhibits anti-patterns such as architectural smells [46].
Once such determinations have been made, the stakeholders may develop an interest in rearranging the architecture to fit the desired style or to fix its issues. This can happen from one version of the system to the next, or over a sequence of more than two versions. Architecture recovery will be helpful in this process if it can show the stakeholders whether they are on the right track. For any metric or view of a system produced by an algorithm to be a true aid in system maintenance, it will have to take into account that system maintenance is commonly performed (1) incrementally [111], (2) by teams [109], and (3) in a distributed fashion and on different parts of the system [56].

1.3.1 Desirable Properties of Architecture Recovery Methods

Out of the many possible views of a system, many can be interesting or hold different utility for different groups. Judging from the paradigms of existing recovery methods, a lot of work has already been done to group code entities that belong together by certain measures into components [95, 3, 98, 147, 50]. Many of their results can only be interpreted by experts familiar with those methods. Even after some expert interpretation, the recovery methods’ results at best serve only some technical stakeholders [85] and cannot give a direct answer to some simple-sounding questions that arise from the discussion of stakeholder concerns in Section 1.1, are pertinent to any system, and benefit its stakeholders: DOES this system address our needs? WHERE does it address them? HOW does it organize addressing them?

Recovery methods, whether or not they are concern-oriented, can have issues and constraints that reduce their utility. These are addressed in Chapter 5 and include:

1. Lack of modularity
2. Non-determinism
3. Unexpected sensitivity to changes in source code between versions
4. Limited scalability to large system sizes in SLOC
5. Adjustments needed per individual run
6.
Experts needed to configure the recovery and interpret its output

Major software systems are not written in one session, but incrementally over time. Even after their release, further increments are undertaken as the system is maintained. Recently, a range of studies have begun looking at the nature, rate, and impact of changes in a system’s architecture and the resulting architectural decay in existing systems [11, 119, 25]. For the recovery to aid in such a maintenance process, it necessarily has to fulfill several criteria, some of them very basic. The following enumeration lists them, along with reasons why I believe they have to be fulfilled. The need to facilitate such large-scale studies places special emphasis on a group of requirements on the recovery methods, such as efficiency and scalability, in addition to existing ones such as accuracy, with the result that a suitable recovery method needs to fulfill the following criteria [85]:

• Architecture Representation: To provide an architecture, the result of the recovery needs to show a structure, and some kind of reasoning about the structure must either be visible or possible to deduce (refer to Section 2.1 for the definition of Software Architecture).

• Scalability and Feasibility: Considering the very large systems that are common today and how the size of software systems continues to grow, feasibility is not only determined by how long recoveries take in absolute time, but also by how a method’s processing times scale with a subject system’s size. This scalability determines the time difference between two possible samplings of the architecture and how much it will rise when the system grows larger. Being able to analyze the code bases for individual systems and different system versions quickly is crucial for evolutionary studies that track architectural changes over a range of versions.
Therefore, it has to be possible to recover the architecture of a system from its code base within a finite and predictable amount of time. If this is not the case, architectural recovery becomes an open-ended task and unreliable as an aid. When the recovery is not open-ended, other relevant questions are how long recoveries take in absolute time as well as how a method scales to large systems.

• Clarity: The output needs to show how the system’s entities were grouped and explain why. Explaining the general paradigm under which the recovery takes place is not enough, since a stakeholder may be interested in why a given entity has been assigned to a group, so that they can infer how their possible modification choices would affect the architecture. Therefore, the result needs to either be self-explanatory or provide the user with some clear information they can refer to for further inquiries. Clarity also helps in communicating results within a team.

• Proportionality: The magnitude of modifications between two given versions should be adequately reflected in the recovery results, i.e., the difference between the obtained architectural views must be commensurate with the amount and type of system change. This means that changes to the system should be reflected in the new recovered architecture in proportion to the extent that they affect the attributes of the system that form the basis of the paradigm. The alternative is that maintenance becomes an unpredictable process. If, according to measures of architectural similarity, any change in the source code of a system results in the recovered architecture of every version of a system being entirely different from any other version, changes cannot be meaningfully compared. This diminishes the usefulness of such recoveries for evolutionary studies on the impact of changes in a system’s source code on its architecture.

• Determinism: Recovery results should be reached in a deterministic manner.
A lack of determinism makes it not only impossible to reproduce the results, but also to tell whether any differences in the architectural view between two versions stem from changes made by the maintainers or from instabilities of the recovery algorithm. Non-determinism can also put the accuracy of a recovery method in question when it leads to significantly different results on each new run.

• Continuity: The maintainers of a system may want to move the system from its current state and its current recovered architecture to a more desirable goal state. In a complex system, this process may require several steps. If, by measures of architectural similarity, two different architectural views represent the beginning and the end of this planned process of one or more modifications to the system’s architecture, and it is possible to reach an intermediate state between the two, then the architectural view of the system in the intermediate state should reflect this by being recognizable as lying between the two extreme points. The reason for this is that maintainers will want to know if their changes are setting the system on the correct path to the desired goal.

• Isolation: It needs to be possible to change only one element in isolation, and have this reflected in the recovered architecture. (Note that what this element is depends on the paradigm of each recovery method.) This is important for enabling distributed work on the system without incurring undesirable crosstalk between changes.

• Accuracy: The resulting architectural view must be a proper reflection of the architecture, either by being intrinsically true to the system itself or to some architectural ground truth view of the system that can be arrived at by means other than running the recovery method.

• Flexibility: The paradigm should serve as many different groups of stakeholders as possible by allowing itself to be tailored to their interests, instead of only a limited number of groups.
• Interpretability: The paradigm should produce results that are easily interpreted and understood.

To the extent that it can be shown that some or all of these conditions are not satisfied, a given recovery method will be of limited use for the purposes of incremental development. I have therefore, for the first time, studied how several available recovery methods that have been used for studies such as [48, 91, 72, 11] measure up against the criteria listed above (see Chapter 5) and addressed the utility of expert decompositions of architectures (see Section 4.1.1). While many existing recovery methods may give an accurate view of the system under their respective paradigms, they lack one or more of the above-listed desirable attributes, which limits their use and their utility in many situations. I posit that these shortcomings can be overcome by using a different approach based on text classification.

1.3.2 Concern-Oriented Recovery

The utility of an architectural view based on a system’s implemented concerns in comprehending an architecture has been shown [50]. Approaching a system from this point of view is useful for many different types of stakeholders: maintainers and particularly programmers will be interested in learning what a system does and how and where it does it. A concern-centric view can also be useful for the other stakeholders of the system. For instance, the architect can assess how well concerns are separated in the implementation. Project managers can optimize task allocation among programmers with varying degrees of familiarity with the system and its constituent concerns. Customers for whom the system is being built can check whether their concerns are reflected in it. To fully understand the foundation of RELAX, it is beneficial to define the concern-oriented view that it adopts as well as to examine the meaning of several established terms, such as programming, maintenance and bug fixing, under this view.
Under the concern-centric view, determining a suitable architecture is the process of finding the design or structure for a system that is most likely to allow its concerns to be addressed programmatically. This means that not only will the concerns be addressed, but they will also be organized, placed and separated from each other in a way that facilitates the success of the overall system. After such an architecture has been found, programming means that the developers follow the architecture and split each concern into smaller concerns which can be addressed by individual instructions or statements of the programming language being used. It follows that while concerns can be understood on their own, a software system and its architecture cannot be fully understood without knowing the concerns they address. This applies regardless of whether programming aims at adding new features or at performing maintenance such as resolving issues, which can manifest as bugs or otherwise. Thus, a system’s code and concerns are self-referential structures that can be broken down recursively into further, smaller instances of their own respective types until an atomic level is reached. The atom of programming is the individual instruction, while the atom of a system-related concern is the need that this single instruction addresses.

Maintenance and bug fixing constitute improvements of how concerns are addressed in an existing system. Specifically, maintenance requires adapting the system to new or changed requirements. Issues such as bugs are negative discrepancies between the concerns that a system is supposed to address and those that it addresses in reality. While maintenance requests can come in as precise technical descriptions, they can also have their origins in non-technical descriptions by stakeholders, for example end-user bug reports or new requirements handed down from upper management.
As laid out above, knowledge not only of the system’s code but also of its architecture can be invaluable in performing these tasks. In particular, knowing the architecture helps maintainers avoid inadvertently creating a local sub-architecture that runs counter to the system’s overall one. For each system, at a given point in its life-cycle, several cases are possible regarding the knowledge about its concerns:

1. Full awareness: Its addressed concerns are known, and it is known where in the system they are addressed. This should be the state of the system from its origin.

2. Deficient knowledge: Its addressed concerns are known, but it is not known where in the system they are addressed. This is the state of a functional system that requires some recovery.

3. Poor or no knowledge: Its addressed concerns are unknown (and therefore it also cannot be known where they are addressed). This is the state of a legacy system.

While the first case is desirable, there are a multitude of reasons why it cannot be achieved in many situations. These include transfers of a system or parts thereof to new teams, or even that the team may stay the same, but knowledge has either never been documented or was lost over time, or the architecture has drifted from its original state, possibly even without the knowledge of the architects. In the second and third cases, steps must be taken to make sure that work on the system adds to its utility without adding to its entropy. Given that code entities mirror concerns, it would be beneficial if, in the absence of a written architecture, concerns could be derived from the code entities.
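As a minimal illustration of how natural-language evidence could be pulled out of code entities, the following Python sketch splits identifiers written in camelCase or snake_case into lowercase words. The function name and the sample identifiers are hypothetical and chosen for illustration; they are not taken from RELAX.

```python
import re

def split_identifier(name: str) -> list[str]:
    """Split a camelCase or snake_case identifier into lowercase words."""
    parts = []
    # First break on underscores, then on case boundaries within each chunk.
    for chunk in name.split("_"):
        parts.extend(re.findall(r"[A-Z]+(?![a-z])|[A-Z]?[a-z]+|\d+", chunk))
    return [p.lower() for p in parts if p]

# Hypothetical identifiers as they might appear in source code:
print(split_identifier("openDatabaseConnection"))  # ['open', 'database', 'connection']
print(split_identifier("HTTP_request_parser"))     # ['http', 'request', 'parser']
```

Words recovered in this way from identifiers, comments and class names form the kind of natural-language material that a text-classification approach can then relate back to concerns.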
Since concerns are expressed in natural language, and programming in modern high-level languages leaves several points where a programmer can use natural language, such as in variable names, method (used synonymously with “function” throughout this dissertation) names, class names or comments, it seems that these named entities could serve to recover the concerns that led to their creation, and that recovering the concerns of a system would in turn help us to understand the system.

In all these cases, one of the first steps of the process needs to be to locate the bug to be eradicated or the feature to be enhanced in the code. There are few software architecture recovery methods that address this, and they have deficiencies. Therefore, I believe that basing an architecture recovery method on concerns can provide a much clearer answer to these questions. For these purposes, I have aimed at choosing a flexible paradigm that serves as many different types of stakeholders of a system as possible, instead of only a limited number of types.

These considerations have led me to choose a concern-oriented architectural view [50] for my proposed new recovery method, RELAX (RELiable Architecture EXtraction). In this context, a concern can be defined as a role, responsibility, concept, or purpose of a software system. Data Persistence, Networking, and GUI are examples of generic concerns that a system may commonly address. On the other hand, there are domain-specific or application-specific concerns, e.g., Interrupt Handling as part of an OS Kernel.

1.4 Introducing RELAX

To address the issues encountered with other software architecture recovery methods, I have developed RELAX (RELiable Architecture EXtraction). While many existing recovery methods may give an accurate view of the system under their respective paradigms, they lack one or more of the desirable attributes listed in Section 1.3.1, which can limit their use and their utility in many situations.
Just as every system has an architecture, even if none of its stakeholders are aware of it, each recovery method needs to follow a paradigm that is determined by the purpose it intends to serve. I have aimed at choosing a flexible paradigm that serves as many different groups of stakeholders of a system as possible, not only one. Another goal was to make using the method and interpreting its results as straightforward as possible. It is my hope that this will lead to a democratization of architecture recovery.

RELAX is a scalable, concern-oriented recovery method which makes use of text classification of source code entities and clustering based on predefined concerns to efficiently recover accurate architectural views. It is deterministic, does not require significant work per run, takes guesswork out of its input, its predefined concerns are usable as provided, small changes in input systems lead to only small changes in the recovery result, and it scales up nearly linearly with the size of input systems. I claim that the architectural view (described below) produced by RELAX is useful and correctly reflects the underlying architecture.

Given an input of a system’s source code and a set of concerns, RELAX classifies and clusters the system’s code entities into word classes that relate to user-specified concerns. Its output is a view that represents the system’s architectural structure and the location of concerns textually and visually. Both elements of the view provide actionable information to its maintainers. Additionally, the visualization allows the viewer to gather important facts about the overall architecture of the system at a glance while also allowing them to dig deeper. To recover one or more versions of a system with RELAX, users can choose to either use preexisting classifiers that have been trained with training data for common concerns, or train a classifier based on a combination of training data that comes with RELAX or their own training data.
This classifier is then used to classify code entities based on their textual contents and to group them into concern clusters that are related to either one concern or a combination of concerns. Since, for each code entity, both its classification and its subsequent assignment to a concern cluster are independent from the classification and assignment of all other code entities, this recovery method is modular, easily distributable and scalable. The resulting architectural views will enable maintainers to ascertain whether and where given concerns are addressed. Recovered views can be easily compared between different versions, which will make it easy to find which concerns have been moved or added. Relating the produced view to the structure of the system and its dependencies will yield information that can be used to detect indicators of possible deficiencies in the code or the architecture of the system, such as smells.

1.4.1 RELAX Standouts

What sets RELAX apart from other recovery methods is that it is a concern-oriented architecture recovery method that addresses the shortcomings and issues of other methods while also being at least as accurate.

RELAX is appropriately sensitive to changes in that minor changes to source code do not cause major changes in the recovered view and do not make it unrecognizable when compared to the version before the changes. I also claim that RELAX is both efficient and scalable, enabling it to recover the architectures of large systems. This is enabled by RELAX’s modularity, which allows the composition and reuse of partial results, distribution of the recovery process, and reduction of the workload on new versions of a system to just the parts that have changed. Additionally, RELAX is tailorable, allowing different stakeholders to maximize the utility of the recovery by considering their perspective.
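The classify-then-cluster idea described above can be caricatured in a few lines of Python. The keyword sets, entity names and scoring rule below are invented for illustration and stand in for an actual trained text classifier; they are not part of RELAX.

```python
# Hypothetical keyword evidence per concern (a real classifier would be trained).
CONCERN_KEYWORDS = {
    "networking": {"socket", "http", "connect", "packet"},
    "gui": {"window", "button", "render", "widget"},
    "persistence": {"database", "query", "save", "load"},
}

def classify_entity(words: set[str]) -> str:
    """Assign a code entity to the concern whose keywords it overlaps most."""
    scores = {c: len(words & kw) for c, kw in CONCERN_KEYWORDS.items()}
    return max(scores, key=scores.get)

def cluster(entities: dict[str, set[str]]) -> dict[str, list[str]]:
    """Group entities into concern clusters; each entity is handled independently."""
    clusters: dict[str, list[str]] = {}
    for name, words in entities.items():
        clusters.setdefault(classify_entity(words), []).append(name)
    return clusters

# Invented code entities, each represented by words extracted from its text:
system = {
    "HttpClient.java": {"socket", "http", "connect"},
    "MainWindow.java": {"window", "button", "render"},
    "UserStore.java": {"database", "save", "query"},
}
print(cluster(system))
```

Note that each entity is classified without reference to any other entity, which is the property that makes a pipeline of this shape modular and trivially distributable: per-entity results can be computed in parallel and recomputed for only the files that changed between versions.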
Approaching a system from this point of view is useful for many different types of stakeholders: Maintainers and particularly programmers will be interested in learning what a system does and how and where it does it. A concern-centric view can also be useful for stakeholders other than programmers. For instance, the architect can assess how well concerns are separated. Project managers can determine task allocation among programmers with varying degrees of familiarity with the system. Customers for whom the system is being built can check whether their concerns are reflected in it. Interested end users may use RELAX to find out whether a system's source code implements a functionality that may not be mentioned in its documentation. The latter two types of stakeholders do not need to be experts in software development to derive utility from RELAX.

1.5 Contributions

The main research contributions of this dissertation can be summarized as follows:

1. The novel, fully automatic and concern-oriented architecture recovery method RELAX (RELiable Architecture EXtraction), which allows a separate analysis of framework and per-project code, thereby improving on pure code analysis. RELAX does not require any specialized knowledge for input parameters and outputs a semantic view of a system textually and through an integrated visualization. Attributes include
• Accurate architectural recovery of the subject system,
• Overall architectural recovery output that can be easily interpreted and directly applied to the maintenance of the system,
• Efficient scalability to large system sizes,
• Ability to compose architecture recovery results of a system from its parts or from the system as a whole without adding effort.

2. An implementation of RELAX.

1.6 Publications Related to This Dissertation

I have coauthored the following large-scale studies of architectural change, evolution and decay in software [72, 12, 74].

• D. M. Le, P. Behnamghader, J. Garcia, D. Link, A.
Shahbazian, and N. Medvidovic. An empirical study of architectural change in open-source software systems. 2015 IEEE/ACM 12th Working Conference on Mining Software Repositories (2015)
• P. Behnamghader, D. M. Le, J. Garcia, D. Link, A. Shahbazian, and N. Medvidovic. A large-scale study of architectural evolution in open-source software systems. Empirical Software Engineering (2017)
• D. M. Le, D. Link, A. Shahbazian, and N. Medvidovic. An empirical study of architectural decay in open-source software. 2018 IEEE International Conference on Software Architecture (ICSA) (2018)

The themes of these studies and the shortcomings encountered in the architecture recovery methods employed for them (listed in Section 1.3.1) formed part of the motivation to pursue RELAX. Specifically, I felt that some shortcomings of the employed architecture recovery methods needed to be addressed using a new approach, and that, in addition, a new kind of architectural view could be created. In this context, I have published one paper that describes the ways in which a new recovery method could set itself apart from existing ones so it could be most useful in software maintenance [85]. The subsequent paper established RELAX and compared it to existing methods [84]. After this, a user study was conducted to ascertain how RELAX can help in software maintenance [86].

My first-author papers related to RELAX are:

• D. Link, P. Behnamghader, R. Moazeni, and B. Boehm. The value of software architecture recovery for maintenance. Proceedings of the 12th Innovations on Software Engineering Conference (formerly known as India Software Engineering Conference) (2019)
• D. Link, P. Behnamghader, R. Moazeni, and B. Boehm. Recover and RELAX: concern-oriented software architecture recovery for systems development and maintenance. Proceedings of the International Conference on Software and System Processes (2019)
• D. Link, K. Srisopha, and B. Boehm.
Study of the Utility of Text Classification Based Software Architecture Recovery Method RELAX for Maintenance. Proceedings of the 15th ACM/IEEE International Symposium on Empirical Software Engineering and Measurement (ESEM) (2021)

Of these three first-author papers, the first one established the basis for parts of Chapters 1, 2, and 5 as well as Section 4.1. The second one laid the groundwork of this dissertation throughout. The third paper is related to Section 4.2.

I have further co-authored papers on incremental development productivity decline [107, 106, 105, 108]. Their relation to RELAX is that the observed decline in productivity may be based on new contributors being unfamiliar with the existing architecture as well as the ongoing architectural changes made by their peers, and on the fact that architectures tend to erode over the lifetime of a project, which requires contributors to reacquaint themselves with the descriptive architecture regularly. They are:

• R. Moazeni, D. Link, and B. Boehm. Lehman's Laws and the productivity of increments: Implications for productivity. Proceedings - Asia-Pacific Software Engineering Conference, APSEC (2013)
• R. Moazeni, D. Link, and B. Boehm. Incremental development productivity decline. Proceedings of the 9th International Conference on Predictive Models in Software Engineering - PROMISE '13 (2013)
• R. Moazeni, D. Link, and B. Boehm. COCOMO II parameters and IDPD: bilateral relevances. Proceedings of the 2014 International Conference on Software and System Process - ICSSP 2014 (2014)
• R. Moazeni, D. Link, C. Chen, and B. Boehm. Software domains in incremental development productivity decline. Proceedings of the 2014 International Conference on Software and System Process - ICSSP 2014 (2014)

1.7 Organization of This Dissertation

The remainder of this dissertation is organized as follows: Chapter 2 explains the foundation of RELAX. Chapter 3 describes RELAX's approach. Chapter 4 presents evaluation results.
Chapter 5 compares the approach of RELAX to that of other recovery methods. Chapter 6 addresses threats to validity. Chapter 7 presents the conclusion.

Chapter 2
Background

This chapter introduces and explains the concepts and principles that underlie RELAX and are referenced in later chapters.

2.1 Software Architecture and Architectural Views

When referring to a "Software Architecture" or a view thereof that should be recovered, it is necessary to define the terms. Unfortunately, no single, generally recognized definition of "Software Architecture" exists. The term has been defined in so many different ways that at least one academic website saw fit to collect a large number of definitions [135]. Virtually all of them are compatible with a definition as "Requirement-driven essential ideas about a System's Structure", while some emphasize the ideas and others the structure more.

While I think that the architecture of a system can be correctly defined as the set of principal design decisions about it, framing architectural recovery in the context of this idea-driven definition can set high expectations that would be disappointed by the output achievable by an automatic recovery method. Such a method cannot be expected to produce a list of written design decisions, some of which may not even be present in the system. A definition that is more in line with the capabilities of programmatic architecture recovery is that of "fundamental concepts or properties of a system in its environment embodied in its elements, relationships, and in the principles of its design and evolution" [60]. This is the case because the definition includes what is embodied in a system's elements, such as source code that the recovery can be based on.

2.2 Architectural Change, Drift, Decay and Erosion

Architectural change in itself is neither good nor bad; it goes hand in hand with any substantial change to the software system.
One interest in software architecture recovery is to find out where, when, why and how much the architecture has changed. In order to judge whether a change that occurred is harmful to the system's architecture, a distinction between drift, decay and erosion needs to be made.

2.2.1 Architectural Drift

"Architectural drift is introduction of principal design decisions into a system's descriptive architecture that are (a) not included in, encompassed by, or implied by the prescriptive architecture, but which (b) do not violate any of the prescriptive architecture's design decisions." [141]

There is widespread consensus that a system's architecture will drift as it evolves [50, 145, 125, 159, 38]. Though drift does not cause a mismatch between the system's documented or expected functional or non-functional properties on one hand and the actually implemented ones on the other, it adds complexity to its maintenance because a maintainer will have to do research on the system's implementation in order to be able to effectively maintain it.

2.2.2 Architectural Erosion and Decay

"Architectural erosion is the introduction of architectural design decisions into a system's descriptive architecture that violate its prescriptive architecture." [141] The terms "architectural erosion" and "architectural decay" are used interchangeably for the same concept, which is that software degenerates because of changes that violate its design principles [36]. Architectural decay is caused by repeated, sometimes careless changes to a system during its lifespan [11]. A recent large-scale study [11] has shown that the versioning scheme of a system does not "protect" it from changes at any time because major architectural changes may happen between minor system versions. Other findings of the study were that a system's architecture can be comparatively unstable in the run-up to a release and that the stability of the higher-level overall implementation architecture (e.g.
package structure) is not an indicator of whether the architecture of the system changed. Architectural recovery methods are much more helpful for that task. Finally, in many cases, a concern-based architectural view revealed important changes that would have remained concealed in other views.

2.3 Natural Language Processing

Natural language processing is a rapidly developing scientific field of great interest whose many applications usually fall into one or more of four areas: syntax, semantics, discourse and speech [31, 113, 82]. For source-code-based architecture recovery, two of these can be dismissed right off the bat: speech and discourse. Speech deals with processing human speech as input or producing it as output. Discourse-related tasks expect full sentences as input. While it is certainly possible that code comments contain full sentences, this cannot generally be expected. Trying to process the syntax of code and comments would assume that they are parts of complete natural language sentences. This is definitely not the case for the code and cannot be assumed for comments, which may often consist of just one or two words.

This leaves semantics, which intuitively aligns well with the goal of finding out what the entities of the system (or even the system overall) mean. Two methods of natural language processing which deal with the detection of the semantics of bodies of text and which are promising for architectural recovery are topic modeling and text classification.

2.3.1 Topic Modeling

While not employed by RELAX, topic modeling [150, 2] is important to it in that it is the foundation of ARC (Architecture Recovery using Concerns), another recovery method intended to find concerns in a software system [50], which makes comparisons between topic modeling and text classification instructive. The aim of probabilistic topic models is to discover the thematic structure in large archives of documents [17].
One advantage is that the algorithms for probabilistic topic modeling do not need any annotations or labeling of these documents because the topics emerge from their analysis. Probabilistic topic modeling analyzes a body of text to find groups of words which are likely to occur together. Such groups of words are then presented as topics.

Latent Dirichlet Allocation (LDA) is the simplest topic model [16] and is employed by ARC [49]. It is based on the intuition that documents exhibit multiple topics in different proportions, where each word is drawn from one of the topics, and where the selected topic is chosen from the per-document distribution over topics [17]. When fed a software system as a corpus (set of documents), LDA allows ARC to compute similarity measures between concerns and identify which concerns appear in a single software entity, such as a class, function, package or other [49]. A document is represented as a bag of words, which are identifiers and comments in the source code of a software entity. A document can have different topics, which are the concerns in the approach of ARC. LDA allows ARC to represent concerns in a human-readable form since a topic's meaning can be ascertained by examining its most probable words [49]. Consequently, a document is represented as a multinomial probability distribution over topics, drawn from a Dirichlet distribution. A document in ARC is a class [49].

2.3.2 Text Classification

Text classification [1, 68] is the natural language processing method RELAX makes use of to find concerns in a software system. Typically, in classification, a set of labeled examples taken from two or more classes is supplied and then used to classify a new document as belonging to the class which has the highest similarity to it [80]. The major difference between topic modeling and text classification is that while in topic modeling the topics are produced as the result, they come first in text classification.
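This classes-come-first workflow can be illustrated with a small, hand-rolled multinomial Naïve Bayes sketch (one of the algorithms discussed in this section). The concern names and training tokens below are toy examples invented for illustration, not RELAX's actual concerns or training data:

```python
import math
from collections import Counter, defaultdict

def train_nb(labeled_docs):
    """labeled_docs: list of (tokens, concern_label) pairs. Returns a model."""
    class_counts = Counter()
    word_counts = defaultdict(Counter)
    vocab = set()
    for tokens, label in labeled_docs:
        class_counts[label] += 1
        word_counts[label].update(tokens)
        vocab.update(tokens)
    return class_counts, word_counts, vocab

def classify_nb(model, tokens):
    class_counts, word_counts, vocab = model
    total_docs = sum(class_counts.values())
    best, best_score = None, float("-inf")
    for c in class_counts:
        # log P(C_i), estimated as freq(C_i, T) / |T|
        score = math.log(class_counts[c] / total_docs)
        denom = sum(word_counts[c].values()) + len(vocab)
        for t in tokens:
            # Laplace-smoothed log P(x_k | C_i); features treated as independent
            score += math.log((word_counts[c][t] + 1) / denom)
        if score > best_score:
            best, best_score = c, score
    return best

# The classes (concerns) are named up front -- unlike topic modeling,
# where topics only emerge from the analysis.
training = [
    (["select", "insert", "query", "connection"], "sql"),
    (["table", "cursor", "query", "commit"], "sql"),
    (["button", "window", "render", "widget"], "gui"),
    (["dialog", "window", "click", "layout"], "gui"),
]
model = train_nb(training)
print(classify_nb(model, ["query", "cursor", "select"]))  # -> sql
```

This mirrors the way a trained classifier assigns each code entity to the predefined concern with the highest posterior score.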
Several classifying algorithms exist. One practical limitation is which of these classifiers are supported by the toolkit used, which in the case of RELAX is MALLET [100]. Classification algorithms supported by MALLET include Naïve Bayes [112, 80], C4.5 [126], Maximum Entropy [115] and Decision Trees [130]. These algorithms use the presence of features in the text to determine the class to which it belongs. How much individual features or combinations thereof contribute is learned in training.

What sets Naïve Bayes apart from the other algorithms is that it assumes that, given that a document belongs to a specific class, the values of its individual features are independent of each other. This seems to be the correct assumption for the way code can be organized and split up in arbitrary ways. Consider a single source file that implements (1) an SQL library call to set up data structures for a database connection, (2) running an SQL query and (3) a callback handler for once the response to the query is received. If the classifier is trained on SQL terms, this file should probably be classified as addressing an SQL concern. In the next version of the system, this source file may have been split up into three new ones that each handle one of the three functions described and are called by another. If the classifier had been trained in a way that the outcome - classification as SQL - required the presence of specific constellations of these features, it would not be able to match the new source files to SQL. An advantage of the independence of features in Naïve Bayes is that less training data is needed, since combinations of features do not matter to it.

The following enumeration is a shortened adaptation of a general description of the Naïve Bayes algorithm found in [78], reduced to the sections that are relevant to RELAX:

1. Let $T$ be a training set of samples, each with their class labels. There are $k$ classes, $C_1, C_2, \ldots, C_k$.
Each sample is represented by an $n$-dimensional vector, $X = \{x_1, x_2, \ldots, x_n\}$, depicting $n$ measured values of the $n$ attributes, $A_1, A_2, \ldots, A_n$, respectively.

2. Given a sample $X$, the classifier will predict that $X$ belongs to the class having the highest a posteriori probability, conditioned on $X$. That is, $X$ is predicted to belong to the class $C_i$ if and only if $P(C_i|X) > P(C_j|X)$ for $1 \leq j \leq k$, $j \neq i$. Thus, we find the class that maximizes $P(C_i|X)$. The class $C_i$ for which $P(C_i|X)$ is maximized is called the maximum posteriori hypothesis. By Bayes' theorem,

$P(C_i|X) = \frac{P(X|C_i)P(C_i)}{P(X)}$

3. As $P(X)$ is the same for all classes, only $P(X|C_i)P(C_i)$ need be maximized. If the class a priori probabilities, $P(C_i)$, are not known, then it is commonly assumed that the classes are equally likely, that is, $P(C_1) = P(C_2) = \ldots = P(C_k)$, and we would therefore maximize $P(X|C_i)$. Otherwise we maximize $P(X|C_i)P(C_i)$. Note that the class a priori probabilities may be estimated by $P(C_i) = freq(C_i, T)/|T|$.

4. Given data sets with many attributes, it would be computationally expensive to compute $P(X|C_i)$. In order to reduce computation in evaluating $P(X|C_i)P(C_i)$, the naive assumption of class conditional independence is made. This presumes that the values of the attributes are conditionally independent of one another, given the class label of the sample. Mathematically, this means that

$P(X|C_i) \approx \prod_{k=1}^{n} P(x_k|C_i)$

The probabilities $P(x_1|C_i), P(x_2|C_i), \ldots, P(x_n|C_i)$ can easily be estimated from the training set. Recall that here $x_k$ refers to the value of attribute $A_k$ for sample $X$. If $A_k$ is categorical (as is the case with topics in text classification), then $P(x_k|C_i)$ is the number of samples of class $C_i$ in $T$ having the value $x_k$ for attribute $A_k$, divided by $freq(C_i, T)$, the number of samples of class $C_i$ in $T$.

5. In order to predict the class label of $X$, $P(X|C_i)P(C_i)$ is evaluated for each class $C_i$.
The classifier predicts that the class label of $X$ is $C_i$ if and only if it is the class that maximizes $P(X|C_i)P(C_i)$.

Another classifier algorithm that is supported by RELAX (and which has been used for some experiments with it, but not as extensively as Naïve Bayes) is the Maximum Entropy classifier [115]. For this classifier, the probability that a document belongs to a particular class given a context must maximize the entropy of the classification system. Maximum entropy ensures that no biases are introduced into the system [101].

2.4 Architectural Recovery

Architectural recovery is defined as the process of determining a software system's architecture from its implementation artifacts [141]. As laid out in the introduction, only a view of the descriptive architecture of a system can be the immediate output resulting from architectural recovery. To form long-term plans for a system or to be able to summarize or categorize an architecture, a prescriptive architecture is desirable. While this cannot be produced automatically, a useful descriptive architectural view can be an aid in its manual production when used in conjunction with artifacts and other information that is not part of the implementation, such as manuals, use case descriptions and any other knowledge about the system that can be tapped into.

Because the need for architectural recovery methods has existed ever since software architecture has been studied, the evolution of recovery methods came with a host of different approaches and levels of automation. While this gives stakeholders a large number of methods to choose from, it makes it a challenge to come up with criteria for choosing a method. To alleviate this state of affairs, taxonomies of recovery methods have been proposed along several dimensions, such as the one specified by Ducasse et al. [38]. In order to facilitate comparisons with RELAX, it is helpful to specify where RELAX fits in this taxonomy of architecture recovery methods.
The goals of RELAX are Re-documentation and understanding, Conformance checking, Co-evolution, Analysis, as well as Evolution and maintenance. The process model that RELAX follows is Bottom-Up: The source code is used to abstract a representation of the system, which is then presented textually and in visualizations. RELAX belongs to the category of Quasi-Automatic Techniques. (Notably, the taxonomy does not have a category for fully automatic techniques. This stems from the assertion that "pure automatic techniques failed in reconstructing software architectures" and the authors' insight that engineers must still steer the recovery methods to some extent [39].) Finally, along with textual representations of the architecture, RELAX produces a visualization as part of its output.

2.5 Smells and Anti-Patterns

Since the output of RELAX can be used in the detection of concern-related code and architectural smells, some elaboration on smells is in order. Smells are related to anti-patterns, which are similar to good patterns in that they are recurring and look like solutions to problems, but are defective [128]. While the term can be adapted to any domain of problem-solving, it has become most popular in the domain of software engineering and has been applied to its subdomains such as management, design and programming.

Anti-patterns are dangerous because their effects tend to be delayed: the presence of an anti-pattern is not necessarily detrimental to the functionality of the system, but at some later time, the maintainability of the system requires that effort be expended to eradicate the pattern and replace it with one that has fewer negative implications. Not addressing the anti-pattern results in both increased cost for the immediate maintenance and higher cost further down the line to extinguish the anti-pattern.
These effects are comparable to those of technical debt, in that an organization or developers sometimes accept compromises in one dimension to meet an urgent demand in another, and at some point the "principal" needs to be repaid (i.e. the overall issue needs to be addressed) to prevent having to pay further "interest" (recurring effort) [23]. Smells are symptoms of anti-patterns in software systems that reduce their maintainability [110, 122]. Smells serve as a means for identifying parts of the system that need to be refactored or restructured. Restructuring is the modification of software to make it easier to understand and to change, or less susceptible to error when future changes are made [7], or the transformation from one form to another at the same relative abstraction level while preserving the subject system's external behavior (functionality and semantics) [30]. Refactoring is the same concept applied to object-oriented systems [102]. Refactoring facilitates future adaptations and extensions.

2.5.1 Code Smells

Code smells relate to the subset of anti-patterns that are associated with bad design or bad programming practices [42]. Many types of smells have been discovered and defined. Examples that are easy to understand just from their names include Duplicated Code, Large Class, Long Method, Long Parameter List [134] and Dependency Cycle. Fixing them usually requires a degree of refactoring. While code smells are not code issues or bugs and do not indicate their presence, it is conceivable that they will lead to increased maintenance effort. While the results of studies of the impact of code smells on the maintainability of systems differ (depending on the individual code smells selected, some found no impact or even a benefit in one case [133] while others did [158]), code smells have not been found to be generally beneficial and remain a matter of interest in software engineering research.
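Detectors for two of the smells named above can be sketched with simple thresholds. This is an illustration only; the threshold values and the dictionary representation of methods are invented for this example, not canonical:

```python
# Toy detectors for two classic code smells. The thresholds are
# illustrative choices, not established standard values.
LONG_METHOD_LINES = 50
LONG_PARAMETER_LIST = 5

def detect_smells(methods):
    """methods: list of dicts with 'name', 'lines' and 'params' keys."""
    findings = []
    for m in methods:
        if m["lines"] > LONG_METHOD_LINES:
            findings.append((m["name"], "Long Method"))
        if m["params"] > LONG_PARAMETER_LIST:
            findings.append((m["name"], "Long Parameter List"))
    return findings

sample = [
    {"name": "parseConfig", "lines": 120, "params": 2},
    {"name": "render", "lines": 30, "params": 7},
]
print(detect_smells(sample))
# -> [('parseConfig', 'Long Method'), ('render', 'Long Parameter List')]
```

As the sketch suggests, such detectors flag candidates for refactoring rather than actual defects.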
2.5.2 Architectural Smells

An architectural smell is a commonly (although not always intentionally) used architectural decision that negatively impacts system quality [46]. Code smells are related to code entities, while architectural smells are related to components. While some concepts such as circular dependencies carry over from the code level to the architectural one, no mapping between the same types on the different levels exists because entities are grouped into components on the architectural level.

Figure 2.1: The ARCADE Recovery Workflow

Consider the example of a dependency cycle on the code level vs. the architectural level. On the code level, a dependency cycle exists when a set of two or more code entities have a circular dependency, i.e. there is a chain of dependencies from one of them to the next where the last link in that chain depends on the first (and the cycle begins anew). At the architecture level, this smell follows the same principle, but replaces the code entities with components instead. For this, a component is said to depend on another component if at least one entity in the first component depends on at least one entity in the second component. While it is possible that code entities that have a circular dependency on the code level end up in different components in a way that makes the components also have a circular dependency, this is not the case if they end up in the same component. Conversely, it is possible for components to have a circular dependency that does not translate to the code level if the dependencies between the components are based on different, disjoint sets of entities within the components. Since RELAX is able to detect clearly named concerns, it can be used to not only detect concern-based smells, but also smells that depend on whether concerns are orthogonal to each other.
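The lifting of entity-level dependencies to component-level ones, and the fact that a code-level cycle need not survive it, can be sketched as follows (the entity and component names are hypothetical):

```python
def lift_to_components(entity_deps, membership):
    """Component A depends on B if some entity in A depends on some entity in B."""
    comp_deps = {}
    for src, targets in entity_deps.items():
        for dst in targets:
            a, b = membership[src], membership[dst]
            if a != b:  # dependencies inside one component are dropped
                comp_deps.setdefault(a, set()).add(b)
    return comp_deps

def has_cycle(graph):
    """DFS cycle detection on a directed graph given as dict of node -> set."""
    WHITE, GRAY, BLACK = 0, 1, 2
    color = {n: WHITE for n in graph}
    def visit(n):
        color[n] = GRAY
        for m in graph.get(n, ()):
            if color.get(m, WHITE) == GRAY:
                return True  # back edge: cycle found
            if color.get(m, WHITE) == WHITE and visit(m):
                return True
        color[n] = BLACK
        return False
    return any(color[n] == WHITE and visit(n) for n in list(graph))

# Entities A and B depend on each other (a code-level cycle) ...
entity_deps = {"A": {"B"}, "B": {"A"}, "C": {"A"}}
# ... but both land in the same component, so no component-level cycle remains.
membership = {"A": "core", "B": "core", "C": "ui"}
comp = lift_to_components(entity_deps, membership)
print(has_cycle(comp))  # -> False
```

The same lifting also shows the converse case: two components can form a cycle even when no single pair of entities does.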
2.6 ARCADE

In order to describe the original context of RELAX, it is necessary to introduce ARCADE (Architecture Recovery, Change, And Decay Evaluator) [149], into which RELAX was originally integrated. ARCADE was developed as a workbench to study architectural change and decay [47]. Its capabilities include (1) performing architecture recovery from a system's implementation, using the recovered information to compute (2) architectural change metrics and (3) decay metrics, and (4) performing different statistical analyses of the obtained data [11]. Figure 2.1 [149] shows part of the workflow of ARCADE; its foundational element is architecture recovery, depicted as the Recovery Techniques component in the pipeline represented in the figure. The architectures produced by the respective recovery techniques are directly used for studying change. ARCADE currently provides access to eleven recovery techniques, most of which use algorithms for clustering implementation-level elements into architectural components, while one technique reports the implementation view of a system's architecture (i.e., the system's directory and package structure).

ARCADE thereby allows an engineer (1) to extract multiple architectural views and (2) to ensure maximum accuracy of extracted architectures by highlighting their different aspects. ARCADE integrates implementations of eleven different architecture recovery methods as well as tools that allow studies of software systems over their lifecycle by recovering and comparing the architectural views that result from the respective recoveries.

ARCADE originated in the USC-CSSE software architecture research group (https://softarch.usc.edu). It has been maintained throughout the years of its existence. Several branches of ARCADE exist. I am maintaining my own branch, which has had occasional contributions by others on our team.
Among others, it has the following major enhancements:
• A GUI front-end,
• Automatic visualizations,
• Integration of Unified Code Count (UCC) [148] and cloc [35], two tools that allow multi-threaded counting of physical and logical SLOC,
• Threaded execution of code where feasible, such as:
– Visualization of one version while the architectures of subsequent versions are recovered,
– Counting code metrics in parallel with architecture recovery.
• Simplification and better maintainability through higher code clarity by moving the code base to Java 8,
• Configuration management, including automatically saving configurations.

Several of these elements add aspects of a framework to what was mostly a toolkit before: The GUI front-end is a unified user interface that allows the selection of the directories that contain the source code of the systems to be recovered as well as the desired locations for the recovered views. A configuration dialog can optionally be used to change parameters that are typically not modified for each individual run. Each modification is immediately saved to the configuration files. Those files are in XML format for easy manual modification where desired. The latest versions of RELAX are text-based in the interest of being scriptable and separate a tool that determines the software metrics from another that runs the architecture recovery proper.

2.6.1 Architecture

ARCADE is implemented mostly in Java, with some of its change metrics tools implemented in Python. For all recovery methods, the recovery proper can be run with the Java parts alone.

2.6.2 Implementation

ARCADE was designed with the requirement of conducting replicable, reusable, and scalable studies [11]. Therefore, a support framework named ARCADE-Controller was developed as part of ARCADE.
ARCADE-Controller allows a software architect to define a workflow for architecture recovery analysis and to distribute that analysis over a set of computing nodes. ARCADE-Controller rapidly recovers the architectures of many versions and revisions of a system using multiple cloud instances simultaneously. ARCADE-Controller then transfers the recovered architectures to an analysis server, allowing an engineer to run evolutionary analyses, such as architectural change analysis.

2.6.3 Smell Detection

ARCADE contains implementations of algorithms for the detection of code and architecture smells. Since RELAX is able to determine the locations and distribution of concerns in a software system, its output could conceivably be useful for the detection of some of these smells in a system, particularly those that are based on concerns. Several smells are based on the way concerns are grouped in the system and indicate violations of best practices or principles, such as the separation of concerns. Since the detection of some concern-based smells can depend on whether the concerns involved are similar or orthogonal to each other, the semantic clarity of concerns provided by RELAX could be valuable in their accurate detection.

One example of a concern-based smell whose detection has been implemented in ARCADE is Scattered Parasitic Functionality. This smell, which can apply to the code and architecture of a system, exists if multiple components are responsible for realizing the same high-level concern while some of those components are also responsible for additional, orthogonal concerns [92]. In this context, "orthogonal" is used to mean "unrelated". Orthogonal concerns, then, are pairs or sets of concerns that should not be addressed within the same component. One possible example of such a case could be a low-level sound driver that also addresses GUI concerns.
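A minimal sketch of how Scattered Parasitic Functionality might be flagged, assuming the component-to-concern map and the orthogonality relation are supplied externally (all names and the orthogonal pair below are illustrative):

```python
# The orthogonal pairs must come from domain experts or the stakeholder
# running the detection; this single pair is a made-up example.
ORTHOGONAL = {frozenset({"sound", "gui"})}

def scattered_parasitic(components, concern):
    """components: dict mapping component name -> set of concerns it addresses.
    Flags the smell if `concern` is realized by multiple components and at
    least one of them also addresses a concern orthogonal to it."""
    carriers = [c for c, cs in components.items() if concern in cs]
    if len(carriers) < 2:
        return False
    for c in carriers:
        for other in components[c] - {concern}:
            if frozenset({concern, other}) in ORTHOGONAL:
                return True
    return False

components = {
    "driver": {"sound", "gui"},  # low-level sound driver also doing GUI work
    "mixer": {"sound"},
}
print(scattered_parasitic(components, "sound"))  # -> True
```

The sketch makes explicit that the detection logic is trivial once the concerns are clearly named; the hard part is the externally supplied orthogonality relation.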
Whether two given concerns are orthogonal to each other cannot be decided for all systems with generality, but needs to be defined by domain experts or the stakeholder running the smell detection. One indispensable prerequisite for deciding that concerns are orthogonal to each other is being able to name the concerns. While this is not guaranteed to be possible with a recovery method that depends on topic modeling (as laid out in Section 2.7), it will always be the case with the text classification at the base of RELAX-based smell detection, since it uses a classifier that is trained with named concerns.

Another example of an ARCADE-implemented concern-based architecture smell is Concern Overload. This smell indicates that a component implements an excessive number of concerns, which constitutes a violation of the principle of separation of concerns [73]. Whether this smell can be detected with the help of RELAX depends on whether the right type of clustering is selected. If clustering is done in such a way that clusters are mapped to concerns one-to-one, then the resulting clusters cannot suffer from concern overload. If, however, clusters with combinations of several concerns are built, then it is possible that the non-empty clusters that address more than a certain number of concerns suffer from concern overload.

2.7 ARC (Architectural Recovery Using Concerns)

While ARC [50], an architectural recovery method, constitutes related work (and is indeed described as such in Section 5.1.6), it is also part of the background of RELAX. ARC is an architecture recovery method which uses topic modeling to find concerns and combines them with structural information to automatically identify components and connectors [50]. The topic model employed in ARC is a statistical language model called Latent Dirichlet Allocation (LDA) [18]. LDA can detect concerns in individual code entities and compute similarities between them.
The software system is represented as a set of documents (a corpus) and each document as a bag of words. Each document can have different topics. Documents (representing implementation entities) are clustered using structural information (dependencies in the case of ARC) and concerns (the topics from the topic model) as features.

Though ARC and the foundational ideas behind it are groundbreaking, there are several unaddressed issues that have limited the use of its implementation and results. In an attempt to salvage it, I expended some effort to mitigate them. The following description of the issues and my approaches to fixing them is not provided as a criticism of ARC, but as a reference of problems I aim to avoid. They are outlined below:

2.7.1 Stop Words

Topic modeling requires lists of stop words for its input [153]. Stop words are words in a language that are very common, but do not have any meaning by themselves, such as "the", "and" and many others. Such words are like noise in that they carry low information content, cause low retrieval rates and have no predictive value. They are commonly grouped into either the general or the domain-specific category [94]. As the term "domain-specific" implies, they differ from domain to domain. Words like "data", "compute" and "code" can be stop words in software engineering (because nearly every piece of software deals with data, computations and code in some way) and would be useless in most software-related concerns, but keywords in the automotive field, where they hold informational value.

While ARC makes use of the list of general English stop words provided by MALLET, no domain-specific list of stop words for the computer science field is provided. This leads to words that should be stop words showing up prominently in detected topics and devaluing the results accordingly.
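To make the layering of stop word lists concrete, the following sketch filters a document's tokens through a general English list, a domain-specific list and a system-specific list before any topic modeling would run. All word lists here are toy examples chosen for illustration; they are not the MALLET lists or any list shipped with ARC.

```python
# Illustrative stop word filtering with stacked lists. The lists below are
# invented examples, not actual MALLET or ARC word lists.

GENERAL = {"the", "and", "of", "to", "a"}       # general English
DOMAIN = {"data", "compute", "code"}            # software-engineering domain
SYSTEM = {"hadoop", "apache", "license"}        # system/corpus-specific

def filter_tokens(tokens, *stop_lists):
    """Drop every token that appears in any of the supplied stop lists."""
    stops = set().union(*stop_lists)
    return [t for t in tokens if t.lower() not in stops]

doc = "the Apache license requires the code to preserve this notice".split()
print(filter_tokens(doc, GENERAL, DOMAIN, SYSTEM))
# ['requires', 'preserve', 'this', 'notice']
```

Note how the system-specific list is what removes the licensing vocabulary that the general and domain lists leave untouched, which is exactly the problem described above.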
In addition to the need for stop words that apply to the domains of computer science or software engineering, lists of these words can also be needed for individual systems or groups of systems. One example: if topic modeling is run on source code entities that all originate from the same organization and therefore contain the same licensing statement, filtering out standard English stop words or even domain-specific ones does nothing to filter out the words that apply to software licensing, and a topic made up of software licensing terms will be found to be very popular in the system to be recovered and may be shown as the most important concern. Licensing information, however, is not a meaningful concern, since it is extraneous to the system: While it is common to all code entities, it is not addressed by the code itself.

Finally, stop words can also be specific to a particular corpus [21], which in the case of topic modeling in ARC translates to stop words related to a specific system. Examples are the name of the system, its version numbers, the names of the staff members who created it and so forth. This was also not addressed by ARC. To fix this, I enabled the use of additional stop word lists of the second and third type. It turned out that coming up with good system-specific stop word lists is an iterative process that requires running ARC several times on a system until no new undesirable words show up in the topics that were found. This is in line with the iterative processes usually applicable to stop words [21].

2.7.2 Number of Topics

Another basic input for topic modeling is the desired number of topics to be generated [53]. This parameter needs to be set carefully, since it will directly influence the quality of the generated topic model: If it is set too low, the topics will be too broad, whereas a setting that is too high will result in uninterpretable topics that are composed of idiosyncratic word combinations [140].
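One way a search for a supportable topic count can be sketched is shown below. The topic-modeling step is abstracted into a predicate reporting whether a run with k topics still yields topics with distinct top words; the real mechanics (invoking MALLET, measuring word overlap) are assumed and not shown, and the binary-search refinement is an illustrative elaboration of the exponential search idea rather than the exact algorithm used.

```python
# Hypothetical topic-count search. `topics_are_distinct(k)` stands in for
# "run topic modeling with k topics and check whether the topics' most
# common words differ"; here it is simulated by a simple callback.

def find_topic_count(topics_are_distinct, max_k=1024):
    """Grow k exponentially while topics stay distinct, then binary-search
    for the largest k that still yields distinct topics."""
    k = 2
    while k <= max_k and topics_are_distinct(k):
        k *= 2
    lo, hi = k // 2, min(k, max_k)   # distinct at lo; not distinct at hi
    while lo + 1 < hi:
        mid = (lo + hi) // 2
        if topics_are_distinct(mid):
            lo = mid
        else:
            hi = mid
    return lo

# Simulate a system that supports up to 12 distinct topics.
print(find_topic_count(lambda k: k <= 12))  # 12
```

The predicate is evaluated only O(log k) times, which matters because each evaluation stands for a full topic-modeling run.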
ARC used user-provided parameters for this. In order to approximate how many topics a given system version would support, I wrote code that ran topic modeling with two topics and then checked whether the resulting topics differed in their most common words. Depending on the outcome, my algorithm increased or decreased the number of topics in the next run using exponential search.

2.7.3 Topic Quality

Due to the issues described and others, the topics that ARC generated had several problems:

• Repetitive topics that have major overlap between their most popular words,
• Topics that prominently contain words that should be stop words,
• Topics to which no semantics can be assigned because there is no conceivable single concern in which their words would appear together.

As noted in the previous section, the number of topics affects the interpretability of the results: too few topics generally result in very broad topics, whereas too many result in uninterpretable topics that pick out idiosyncratic word combinations [140]. One possible way to address this is to adjust the stop word lists and the number of topics and to re-run ARC until good topics emerge. This is an open-ended process.

2.7.4 Determinism

The results of ARC were non-deterministic for two reasons: First, Latent Dirichlet Allocation takes random samples from the documents for its internal variables [18]. Second, the MALLET toolkit [100], as employed by ARC for topic modeling, uses non-deterministic Java data structures when running topic modeling. Consequently, the results of two recoveries could differ if run on different versions of the Java Virtual Machine, or even on the same one at different times. I addressed this in two ways in the MALLET toolkit: I hard-coded a fixed seed for LDA instead of one derived from the system time, and I replaced the non-deterministic data structures in MALLET with deterministic ones.
This strategy requires considerable effort, however, since it must be repeated for every new version of MALLET.

2.7.5 Sensitivity

When comparing architectures that are variations, modifications or different versions of the same system, one matter of interest is how different they are from each other. Intuitively, the differences in the recovery results should be in proportion to the differences between the architectures of the subject systems on the input side. Since ARC is based on topic modeling, this would also mean that the topic models stay similar. While there can be some difference of opinion about how much architectural change should result in how much change in the recovery result, it can probably be agreed that changing a single bit in the input should not result in an output whose topics are completely reshuffled. This, however, is the case with unmodified ARC. Any change, however minuscule, to the body of text to be modeled will result in a different topic model with completely different topics. While this does not necessarily affect the correctness of the topics, it affects their comparability, and with it, the comparability of the concerns that are based on them. This reduces the usability of ARC for evolutionary studies, since one matter of interest is how concern clusters develop over several versions of a system. Moreover, it means that ARC's results cannot be composed from results obtained for parts of a system.

This issue was addressed in the context of our evolutionary studies through the introduction of the "supermodel": For an evolutionary study, topic modeling was run on a combination of all versions to be evaluated, and the entities in individual versions were clustered accordingly [11]. One problem with this approach was a lack of generality. Part of the normal life-cycle of a system is that it grows in size (physical and logical SLOC, file size) and functionality.
Early versions are virtually certain not to have all the functionality (and with it, the addressed concerns) of the later ones. They may just be a mock-up of a GUI or contain placeholder code. A topic model which is derived from all versions being studied may therefore be a poor representation of the early versions.

Additionally, due to the sensitivity of the topic modeling process, the makeup of the supermodel depends on the arbitrary selection of the system versions to be evaluated. Because of this, studies that are run on different sets of versions of the same system will lack comparability, even if the sets differ by only one version.

One major issue in architecture recovery can be whether a library or a framework should be recovered as part of the system, and whether its existence should affect the overall recovery. A curious result of the sensitivity of ARC is that the existence of a library or a framework and the availability of its source code affect the overall topics of the system and the result of the recovery.

2.7.6 Scalability

The scalability of ARC has been found to be limited. As an example, it has presented a challenge to recover the Google Chromium system with it on commercially available commodity hardware [48]. In the overall analysis, while ARC is interesting due to being concern-oriented, software architecture recovery with it is a process that in the best cases has no guaranteed end or upper time bound, has severe limitations for evolutionary studies and cannot scale to large systems.

2.7.7 Software Architecture and Its Recovery

Before we settle on a definition of architecture, we need to consider what architecture recovery is and what it can do. Architecture recovery is the process of recovering a system's architecture from its implementation artifacts, such as its source code. In many cases, such as the ones under review here, the result is a set of clusters that contain artifacts of the system, typically its source files.
This result aims to represent a view of the architecture under a paradigm espoused by the respective architecture recovery method. While one widely accepted definition of software architecture is “the set of principal design decisions about a system” [141], there is a potential mismatch between the architecture of the system under this definition and the view that can be obtained through architecture recovery: Principal architectural decisions may not have been implemented and are therefore out of reach of architecture recovery. Conversely, the system may reflect inadvertent decisions that have never been made explicit, but are nonetheless present. Finally, some decisions may not be embodied in the artifacts or attributes thereof that a given recovery method considers, or may only emerge when the system is used in a certain way.

For this reason, I think that a definition that is more in line with the capabilities of programmatic architecture recovery is that of “fundamental concepts or properties of a system in its environment embodied in its elements, relationships, and in the principles of its design and evolution” [60]. This is the case because the definition includes what is embodied in a system’s elements, such as source code, that the recovery can be based on. However, both definitions may be of use to analyzers of the system.

2.7.8 Concerns in Software Engineering

The concept of a concern is used throughout software engineering literature, frequently without any definition. It is generally agreed upon that a separation of concerns is desirable. Out of the several meanings the word “concern” can have in commonly used English, the ones most suitable for software engineering are the noun defined as “matter for consideration”, and the verb defined as “to be of interest or importance to” [103]. Defined in this way, a concern can be viewed as something that one or more human beings want to exist or happen, and which can be expressed in natural language.
Applied to a software system, a concern is something the system needs to do or to have, such as a functional or non-functional property. This is in line with the definition of it as “a software system’s role, responsibility, concept, or purpose” used in [50]. I also refer to the definition mentioned in Section 1.1: The definition of concern followed here is based on two perspectives. The first is that a concern is a conceptual area of interest or focus for a stakeholder of a software project (e.g., a developer) [129]. The second also refers to the concrete manifestation of conceptual concerns (e.g., in source code, design diagrams, or other artifacts) [129]. For our purposes, both perspectives are important, with the emphasis being on the second.

2.7.9 Topic Modeling and Latent Dirichlet Allocation

The aim of probabilistic topic models is to discover the thematic structure in large archives of documents [17]. One advantage is that the algorithms for probabilistic topic modeling do not need any annotations or labeling of these documents because the topics emerge from their analysis. Probabilistic topic modeling analyzes a body of text to find groups of words which are likely to occur together. Such groups of words are then presented as topics. Latent Dirichlet Allocation (LDA) [18] is a statistical model based on the intuition that documents exhibit multiple topics in different proportions, where each word is drawn from one of the topics, and where the selected topic is chosen from the per-document distribution over topics.

2.7.10 Stop Words

Topic modeling requires lists of stop words for its input. Stop words are words in a language that are very common, but do not have any meaning by themselves, such as “the”, “and” and many others. Such words are like noise in that they carry low information content, cause low retrieval rates and have no predictive value. They are commonly grouped into either the general or the domain-specific category [94].
As the term “domain-specific” implies, they differ from domain to domain. Words like “data”, “compute” and “code” can be stop words in software engineering (because nearly every piece of software deals with data, computations and code in some way) and would be useless in most software-related concerns, but keywords in the automotive field, where they hold informational value.

2.7.11 ACDC

The ACDC (Algorithm for Comprehension-Driven Clustering) algorithm [147] uses structural relationships specified as patterns to create an algorithm for recovering components and configurations that bounds the size of each cluster (the number of software entities in the cluster) and provides a name for the cluster based on the names of the files in it. ACDC’s view is oriented toward components that are based on structural patterns (e.g., a component consisting of entities that together form a particular subgraph). It bears mentioning that ACDC was not introduced as an architecture recovery method, but as a clustering algorithm with the goals of facilitating program comprehension and coming up with system decompositions that are close to those of experts for given systems [147]. The official implementation of ACDC is written in the Java language [146]. It has been re-implemented for inclusion in the ARCADE workbench (see Section 2.7.17), omitting options for (1) reordering its patterns, (2) excluding some of them, (3) limiting the size of clusters in subgraphs, (4) using a hierarchical decomposition instead of a flat one, (5) including only top-level clusters in the flat decomposition, as well as (6) graphical output of the decomposition. (In order to evaluate recovery results in a way that is comparable with other research, such as the one in, I used a different version of ACDC that has been adapted for and integrated into ARCADE (see Section 2.7.17). In the further course of this dissertation, this is the version of ACDC evaluated in places such as Section 2.7.14.)
2.7.12 ARC

ARC (Architectural Recovery using Concerns) uses topic modeling to find concerns and combines them with structural information to automatically identify components and connectors [50]. For this, it leverages LDA (see Section 2.7.9) to find concerns and compute the similarity between them [50]. LDA can detect concerns in individual code entities and compute similarities between them. For this, the software system is represented as a set of documents called a corpus. Individual documents within it are “bags of words”. Each document can contain different topics, which stand for concerns. In the output, topics are represented by the words that are most likely to appear in them, in descending order. It is also determined how relevant a topic is to each document in the corpus. Documents (representing implementation entities) are clustered using structural information (dependencies in the case of ARC) and concerns (the topics from the topic model) as features. The number of topics and clusters is set via parameters. ARC’s view aims to produce components that are semantically coherent due to sharing similar system-level concerns (e.g., a component whose main concern is the handling of distributed jobs). ARC has been implemented in Java (https://bitbucket.org/joshuaga/arcade/src/master/).

2.7.13 PKG

PKG [72] is very simple in that it only recovers the package-level structure view of a system’s implementation. It produces an objective but not architecturally satisfying view in that it stays at the surface instead of trying to assist its user in determining why the system is built the way it is. PKG has been implemented in Python [118].

2.7.14 Commonalities of PKG, ACDC and ARC

In addition to the results obtained through their own different recovery algorithms, all three recovery methods also rely on dependencies among source code entities and report them in their output in order to enable further processing (e.g., smell detection).
These dependencies are detected using the Classycle library [41]. In the case of Java source code, this is done from the compiled class files.

2.7.15 Measures of Architectural Similarity

The following measures compare two architectures in terms of how similarly they cluster the entities that make up a system, such as source files. For all three measures, the result will be a value between 0% (no similarity) and 100% (equality). It is important to keep in mind that these comparisons are purely structural. Differences in the underlying reasons for these structures (e.g., concerns in the case of ARC or programming patterns for ACDC) are not measured.

2.7.15.1 MoJoFM

MoJoFM [160] is an effectiveness measure for software clustering algorithms. It assumes that both architectures contain the same entities and is therefore not suitable for studies in software evolution, but it can still serve to compare the distance between two different views produced by different recovery methods or to determine the extent of non-determinism within the same method. MoJoFM assigns values between 0 (no similarity) and 100 (identity), with higher values meaning greater similarity.

2.7.15.2 a2a

This newer measure [11] has been created for evolutionary studies. It is based on a distance measure that determines the number of transformations necessary to get from one clustering to the other. Like MoJoFM, a2a assigns values between 0 (no similarity) and 100 (identity), with higher values indicating greater similarity.
The minimum-transform-operation (mto) is the minimum number of operations needed to transform one architecture into another:

mto(A_1, A_2) = remC(A_1, A_2) + addC(A_1, A_2) + remE(A_1, A_2) + addE(A_1, A_2) + movE(A_1, A_2)    (2.1)

The five operations used to transform architecture A_1 into A_2 comprise additions (addE), removals (remE), and moves (movE) of implementation-level entities from one cluster (i.e., component) to another, as well as additions (addC) and removals (remC) of clusters themselves. Note that each addition and removal of an implementation-level entity requires two operations: an entity is first added to the architecture and only then moved to the appropriate cluster; conversely, an entity is first moved out of its current cluster and only then removed from the architecture. The mto is normalized to calculate a2a:

a2a(A_1, A_2) = (1 − mto(A_1, A_2) / (mto(A_∅, A_1) + mto(A_∅, A_2))) × 100%    (2.2)

where mto(A_∅, A_i) is the number of operations required to transform a “null” architecture A_∅ into A_i.

2.7.15.3 cvg

The cluster coverage measured by cvg [11] shows to what extent components that exist in one clustering exist in another. In other words, cvg allows engineers to determine the extent to which certain components existed in an earlier version of a system or were added in a later version:

cvg(A_1, A_2) = |simC(A_1, A_2)| / |C_{A_1}| × 100%    (2.3)

where |C_{A_1}| is the number of clusters in architecture A_1. simC(A_1, A_2) returns the subset of A_1’s clusters that have at least one “similar” cluster in A_2:

simC(A_1, A_2) = {c_i | c_i ∈ A_1, ∃ c_j ∈ A_2, c2c(c_i, c_j) > th_cvg}    (2.4)

where c2c [48] measures the degree of overlap between the implementation-level entities contained within two clusters. More specifically, simC(A_1, A_2) returns A_1’s clusters for which the c2c value is above a threshold th_cvg for one or more clusters from A_2.
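Under the simplifying assumption that clusters are matched by name across the two architectures (the published measure computes a matching between clusters rather than relying on names), Equations 2.1 and 2.2 can be sketched as follows. Architectures are represented as hypothetical {cluster name: set of entities} dictionaries; this is an illustration, not ARCADE's implementation.

```python
# Sketch of mto (Eq. 2.1) and a2a (Eq. 2.2), assuming clusters correspond
# by name across the two architectures (a simplification of the real measure).

def mto(a1, a2):
    e1 = set().union(*a1.values()) if a1 else set()
    e2 = set().union(*a2.values()) if a2 else set()
    add_e = len(e2 - e1)                 # entities added (addE)
    rem_e = len(e1 - e2)                 # entities removed (remE)
    place1 = {e: c for c, es in a1.items() for e in es}
    place2 = {e: c for c, es in a2.items() for e in es}
    # moves of surviving entities whose cluster assignment changed (movE)
    mov_e = sum(1 for e in e1 & e2 if place1[e] != place2[e])
    add_c = len(set(a2) - set(a1))       # clusters added (addC)
    rem_c = len(set(a1) - set(a2))       # clusters removed (remC)
    # each entity addition/removal counts twice: add/remove plus a move
    return rem_c + add_c + 2 * rem_e + 2 * add_e + mov_e

def a2a(a1, a2):
    null = {}                            # the "null" architecture A_∅
    denom = mto(null, a1) + mto(null, a2)
    return (1 - mto(a1, a2) / denom) * 100

a = {"core": {"f1", "f2"}, "ui": {"f3"}}
b = {"core": {"f1", "f2", "f3"}}
print(round(a2a(a, b), 1))  # 86.7
```

Here transforming a into b takes two operations (move f3 into core, remove the ui cluster), and normalizing against the null-architecture distances yields the percentage.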
2.7.16 Code Smells and Architectural Smells

Code smells are anti-patterns in programming that are associated with bad design or bad programming practices [42]. While code smells are not necessarily code issues or bugs and do not indicate their presence, it is likely that they will lead to increased maintenance effort. An architectural smell is a commonly (although not always intentionally) used architectural decision that negatively impacts system quality [46]. Code smells are related to code entities, while architectural smells are related to components. One does not automatically map to the other [74].

2.7.17 ARCADE

ARCADE is a collection of tools that offers (1) a collection of architecture recovery methods (including PKG, ACDC and ARC), (2) detection of architectural smells, (3) metrics of architectural change and decay, and (4) correlations of implementation issues and architectural ones [47]. It is implemented in the form of two Eclipse [142] projects, in the Java and Python languages respectively. Different branches emphasize automatic processing of large numbers of systems and their versions or ease of use through a GUI. PKG and ARC have been developed for inclusion with ARCADE; ACDC has been adapted to it. All recovery methods expose some parameters directly (such as input and output directories in the file system) or require the user to modify their code (such as the number of topics in ARC or the order in which different versions of a system are evaluated for evolutionary studies) [118]. The tools in ARCADE have been used for several comparative and evolutionary studies [72, 11, 132]. Since I am referring to parts of those studies in my evaluations of the recovery methods, I used the implementations of the recovery methods as found in ARCADE. It bears mentioning that in the latter case, the process has been aided by the use of existing architecture recovery techniques.

2.7.18 Software Architecture and Architecture Recovery

Many different definitions of “Software Architecture” exist [135].
Additionally, many different recovery methods exist that espouse different views of a software architecture [38]. This creates the potential for a mismatch if both are not selected in light of each other. Architecture recovery is the process of retrieving a system’s architecture from its implementation-level artifacts [141]. This means that only what is actually present in the system’s implementation can be used for recovery, and consequently, that the definition of “Software Architecture” which forms the basis of a given recovery method needs to reflect this for consistency. This makes a definition such as “the set of principal design decisions about a system” [141] unsuitable for the purposes of architecture recovery, since due to erosion and drift, there is no guarantee that a single one of these decisions is realized in a current version of the system as built. In extreme cases, they may never have been present at all, rendering any attempt at recovering an architecture under this definition moot.

A definition that fits architecture recovery in general well is: “Fundamental concepts or properties of a system in its environment embodied in its elements, relationships, and in the principles of its design and evolution” [60]. This definition covers the source code, which most recovery methods as well as mine use as their basic resource, as an element of the system. I think that another definition fits the concern-orientation of RELAX even better [43]. According to it, a software architecture comprises

• A collection of software and system components, connections, and constraints.
• A collection of system stakeholders’ need statements.
• A rationale which demonstrates that the components, connections, and constraints define a system that, if implemented, would satisfy the collection of system stakeholders’ need statements.
When considering that the output of any recovery method is a view of a system’s architecture, it needs to be kept in mind that there is no single “correct” view that contains the whole truth. Instead, the same architecture can be described through different views [98]. While it is possible to recover the architecture of very small systems manually, for large systems with millions of SLOC only computer-aided recovery is feasible.

2.7.19 Text Classification

Text classification is the natural language processing method RELAX employs to locate concerns in a software system. This method automatically assigns provided documents to specified categories [131]. To achieve this, a set of labeled examples from two or more categories of interest is supplied and then used to train a classifier algorithm which can then determine whether any given document belongs in one of the categories the classifier was trained on. Figure 2.2 shows the training and prediction phases of text classification and how they interact [15].

Figure 2.2: Text Classification Workflow

Chapter 3: Approach

This chapter describes the approach of RELAX toward software architecture recovery in the context introduced in the previous chapter and how it addresses the themes and issues touched on there. As its input, RELAX takes a software system’s source code alongside a trained classifier. It then leverages text classification techniques to incrementally tag individual source entities with attributes and group them into concern-related clusters. The combination of choices I made in the way in which RELAX uses text classification to build its architectural view results in a set of features that address several weaknesses present in other recovery methods.

1. The basic building block of RELAX’s architectural view is the result of the independent text classification of each source code entity.
2.
The modular nature of forming clusters from individual source code entities whose attributes are independent of each other directly facilitates RELAX’s scalability.
3. The Naïve Bayes classifier-based algorithm further aids the scalability and accuracy of RELAX.
4. Explicit prevention of crosstalk between changes in individual code entities limits their impact on the resulting architecture.
5. The ability of users to select the concerns on which RELAX bases its recovery enables tailoring RELAX to specific needs through easily understandable choices.
6. An intuitive and informative visualization allows stakeholders to quickly get an overview of the prevailing system-level concerns, and also to dig deeper to the level of individual source code entities.

The use of text classification helps RELAX to address the shortcomings of topic modeling driven recovery methods, such as ARC (compare Section 2.7). They are addressed in the following ways:

3.1 Stop Words

Stop words can matter when training classifiers [29], but not when running the classification. This is because the stop words can be removed from the training data, and once removed, the classifier will not look for them in the bodies of text to be classified. They will therefore be ignored.

3.2 Number of Topics

Since in text classification, topics stand at the beginning and not the end of the process, finding the correct number of topics is not an issue. Rather, that number is provided by the user and mirrors what the user is interested in.

3.3 Topic Quality

The quality of the topics is fully controlled by the user, since it is the user who can define topics and train classifiers (or use the provided ones if the user is satisfied with their quality).

3.4 Determinism

Text classification in MALLET with the Naïve Bayes classifier is deterministic. This means that RELAX as a whole is also deterministic, since this applies to all individual source entities and the clustering of these entities is also deterministic.
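The training and prediction phases from Section 2.7.19, and the deterministic Naïve Bayes classification discussed above, can be illustrated with a deliberately small, self-contained multinomial Naïve Bayes sketch. RELAX itself relies on MALLET's implementation; the concern labels and training snippets below are invented for illustration only.

```python
# Toy multinomial Naive Bayes illustrating the train/predict workflow.
# Deterministic by construction: no sampling is involved, only counting.
import math
from collections import Counter, defaultdict

def train(labeled_docs):
    """labeled_docs: list of (text, concern_label) training examples."""
    word_counts = defaultdict(Counter)
    label_counts = Counter()
    vocab = set()
    for text, label in labeled_docs:
        words = text.lower().split()
        word_counts[label].update(words)
        label_counts[label] += 1
        vocab.update(words)
    return word_counts, label_counts, vocab

def predict(model, text):
    word_counts, label_counts, vocab = model
    total = sum(label_counts.values())
    best, best_lp = None, -math.inf
    for label in label_counts:
        lp = math.log(label_counts[label] / total)   # log prior
        n = sum(word_counts[label].values())
        for w in text.lower().split():
            # Laplace smoothing over the shared vocabulary
            lp += math.log((word_counts[label][w] + 1) / (n + len(vocab)))
        if lp > best_lp:
            best, best_lp = label, lp
    return best

model = train([
    ("socket connect send packet", "Networking"),
    ("listen port receive packet", "Networking"),
    ("window button render pixel", "GUI"),
    ("draw widget window layout", "GUI"),
])
print(predict(model, "open socket and send data"))  # Networking
print(predict(model, "render the main window"))     # GUI
```

Each document is classified independently of every other, which is the property the sensitivity and scalability arguments in the following sections rest on.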
3.5 Sensitivity

The atomic operation of RELAX is the classification of an individual code entity based on a set of predefined topics. The classification of an individual file does not depend on that of any other. This has far-reaching positive consequences for RELAX:

• The change in the recovery result is confined to the code entities that have changed and (possibly, depending on whether their associated concerns have changed) their cluster association.
• The size of the change in the result scales with the size of the changes in the input.
• For the architecture of a system to be recovered, the system can be split up into smaller units which can then be distributed to be classified and associated with a cluster. This way, the ceiling of the system size that can be evaluated is nearly unlimited.
• For evolutionary studies on a system, only the entities that have changed will need to be evaluated. The information on the remaining entities gained in a previous recovery run can be reused.
• Libraries or frameworks can be evaluated separately and their results added to or subtracted from the whole as needed.

3.6 Scalability

Since each file is classified individually and independently of any other, and the time of an individual Naïve Bayes classification depends only on the size of the file to be classified [79], the time required to recover a system’s architecture should scale linearly with the overall size of the files in a system to be classified, or the lines to be processed.

This section describes the key principles underlying RELAX and its visualization, as well as the details of its implementation.

3.7 Main Recovery Process

3.7.1 Selecting Concerns

The stakeholders and their concerns stand at the beginning of the process. Those concerns can have any level of granularity, ranging from top-level concerns (e.g. Database, Graphics or Networking) to lower, application-specific levels (e.g. HDFS Upgrade Management or InterDataNode Protocol for Apache Hadoop (https://hadoop.apache.org)).
In addition, non-functional concerns (e.g. Security, Backup, Interoperability) can also be used. RELAX does not impose a hard limit on the number of input concerns to use in the training phase. The “right” number of concerns to look for in a system is not determined by any attributes of that system, such as its size or complexity. Instead, based on their knowledge about the system to be recovered and their use case, users can decide on any set of named concerns that form the basis of the system’s recovery.

For example, a project manager might be interested in a suitable distribution of maintenance tasks among programmers with specific skills. The project manager can then choose to conduct a coarse-grained recovery with a selection of topics that mirrors the fields of specialization of the programmers, such as Database, Graphics and Networking. In another situation, a researcher may be interested in how certain concerns are shared among related systems. For example, they could be interested in whether a project like Apache Chukwa (https://chukwa.apache.org), which is built on the Apache Hadoop File System (HDFS) (https://hadoop.apache.org/docs/stable/hadoop-project-dist/hadoop-hdfs/HdfsDesign.html), addresses HDFS-related concerns such as HDFS Upgrade Management or InterDataNode Protocol. The choice of concerns is the only activity required of the user that is similar to setting parameters in other recovery methods. However, RELAX aims to make this an intuitive choice because the concerns are either named for well-known topics of general interest or, optionally, named by the users themselves.

3.7.2 Collecting Training Data and Training a Classifier

The kind and amount of work necessary in this step depends on the concerns selected by the user. If the concerns are already covered by an existing classifier that is provided by RELAX or that the user otherwise has access to (such as through having trained it in an earlier recovery), no additional work is necessary here.
If this is not the case, the user will be interested in training their own classifier on their chosen concerns. Figure 3.2 shows the workflow of training a classifier for RELAX. The required labeled training data can either come from the curated labeled training data already provided by RELAX, or it can be provided by the user. In the latter case, the user needs to find sources of training data related to the desired concerns and label them with names of their choice. This can be any mixture of source code, API documents, articles on the subject or simply a list of related words. It is important to note that the user is not required to fully understand the training data. Subsequently, the user needs to label the different categories of training data with the concern names of their choosing. Figure 3.1 shows an example of a directory structure with training data files.

A classifier is then trained from the provided labeled training data. For this, the distinguishing features of the sets of documents labeled with different concerns are determined by the classifier training algorithm.

[Figure 3.1: Training Data Example - a training data directory with subfolders such as GUI, Audio and Networking]

From this, a classifier model is generated that can later be used to label different sets of data that were not part of the training process, such as the code entities of systems whose architecture is to be recovered. The training process generates a number of classifier candidates whose accuracy is then checked on a portion of the training data that has been set aside for this purpose. The classifier with the best accuracy is then chosen as the classifier to be used for architecture recoveries. Figure 3.3 shows the accuracy information obtained for two classifier candidates called "Trial 30" and "Trial 31". It shows the overall accuracy of each classifier candidate as well as a confusion matrix.
A confusion matrix is a table whose rows show the labels of the test data and whose columns show the labels determined by the classifier candidate. It is easy to determine whether a candidate's results are fully correct (i.e., all documents are labeled by the classifier candidate with the labels of the test data) by checking whether all numbers are lined up on the diagonal from the table's origin to its lower right. As a summary, the training output also shows an overall accuracy value for each candidate, between 0 and 1, which is calculated by dividing the number of correctly identified test documents by the overall number of test documents.

In the case of "Trial 30" in the figure, we can see that it has misclassified two test documents that should have been labeled with "security" as "networking" and therefore has an accuracy of 31/33, or approximately 0.94. "Trial 31" has classified all documents correctly and consequently has an accuracy of 1.0. It should therefore be chosen. Once a classifier is trained on a set of topics, it is reusable.

[Figure 3.2: RELAX Classifier Training Workflow - labeled training data is used to train classifier candidates, from which the best one is selected as the trained classifier]

[Figure 3.3: Classifier Candidate Selection]

3.7.3 Classification

In the classification step, the trained classifier extracts the features from each code entity and assigns a feature vector to it based on that entity's affinities to each concern the classifier has been trained on. I have chosen Naïve Bayes as a classifier for RELAX based on several considerations: first, it assumes that features are independent, which appears to be a good fit for code files, where each feature may be encountered individually and can individually determine which topic a code entity belongs to. Second, its linear time complexity serves the scalability of RELAX. Third, classifiers trained with it have performed well in my accuracy evaluation (compare to Section 4.1). Last but not least, the prediction model of the Naïve Bayes algorithm is deterministic [70]. Determinism is an important feature for evolutionary software studies since without it, we cannot determine with certainty whether two different recovered architectural views which were produced by the same recovery method came from two different systems or system versions.

package org.apache.hadoop.chukwa.database;

import org.apache.hadoop.chukwa.util.DatabaseWriter;
import java.sql.SQLException;
import java.sql.ResultSet;
import java.sql.ResultSetMetaData;

public class Consolidator extends Thread {
  private DatabaseConfig dbc = new DatabaseConfig();
  private String table = null;

  public Consolidator(String table, String intervalString) {
    super(table);
    this.table = table;
  }
}

String query = "select * from " + table;
log.debug("Query: " + query);
rs = db.query(query);
if (rs.next()) {
  ResultSetMetaData rmeta = rs.getMetaData();
  for (int i = 1; i <= rmeta.getColumnCount(); i++) {
    if (rmeta.getColumnName(i).equals("timestamp")) {
      start = rs.getTimestamp(i).getTime();
    }
    end = start + (interval * 60000);
}
} catch (SQLException ex2) {
  log.error("Unable to determine starting point in table: " + this.table);
  log.error("SQL Error: ");
  return;
}

Listing 3.1: Example code (excerpt) with SQL-related text in red
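As an illustration of how a classifier of this kind can label code such as the excerpt in Listing 3.1, the following is a minimal sketch of a multinomial Naïve Bayes text classifier with Laplace smoothing. It is not RELAX's actual implementation (RELAX uses MALLET); class and method names are mine, and the tokenization is deliberately simplistic. Sorted maps are used so that the prediction is deterministic for fixed training data, in keeping with the determinism requirement above.

```java
import java.util.*;

// Minimal multinomial Naive Bayes text classifier (illustrative sketch only).
public class TinyNaiveBayes {
    // Per-label word counts, per-label token totals, per-label document counts.
    private final Map<String, Map<String, Integer>> wordCounts = new TreeMap<>();
    private final Map<String, Integer> totalWords = new TreeMap<>();
    private final Map<String, Integer> docCounts = new TreeMap<>();
    private final Set<String> vocab = new TreeSet<>();
    private int totalDocs = 0;

    // Train on one labeled document (whitespace-tokenized).
    public void train(String label, String text) {
        wordCounts.computeIfAbsent(label, k -> new TreeMap<>());
        for (String w : text.toLowerCase().split("\\s+")) {
            wordCounts.get(label).merge(w, 1, Integer::sum);
            totalWords.merge(label, 1, Integer::sum);
            vocab.add(w);
        }
        docCounts.merge(label, 1, Integer::sum);
        totalDocs++;
    }

    // Return the label with the highest log-space posterior,
    // smoothing unseen words with add-one (Laplace) counts.
    public String classify(String text) {
        String best = null;
        double bestScore = Double.NEGATIVE_INFINITY;
        for (String label : wordCounts.keySet()) {
            double score = Math.log(docCounts.get(label) / (double) totalDocs);
            for (String w : text.toLowerCase().split("\\s+")) {
                int c = wordCounts.get(label).getOrDefault(w, 0);
                score += Math.log((c + 1.0) / (totalWords.get(label) + vocab.size()));
            }
            if (score > bestScore) { bestScore = score; best = label; }
        }
        return best;
    }
}
```

Trained on, say, SQL-related and GUI-related vocabulary, such a classifier would associate the highlighted words in Listing 3.1 with the database concern.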
Listing 3.1 shows a database-related code snippet with the words that indicate its relation to SQL databases highlighted. For a Naïve Bayes classifier, the feature vector consists of values between 0 and 1 for each concern. For example, the feature vectors for three code entities called "SQL.Java", "Screen.Java" and "ConnectIP.Java" could look like the rows of Table 3.1. We can see that the affinity values over all concerns do not have to add up to 1.0 and that they can have values that are not 0 or 1. This is because a code entity may not be related to any selected concern or it may be strongly related to more than one concern.

3.7.3.1 Clustering

Before clustering begins, each user-selected concern-related cluster is assigned an orthogonal feature vector that mirrors that concern and allows code entities to be grouped into it. A default "Unknown" cluster without any concern affinities is always created for the code entities that are not related to any selected concern. The rows in Table 3.2 show the feature vectors for three clusters related to databases, graphics and networking, respectively, as well as the default cluster for entities that are not related to any selected concern.

Based on the results of the classification, each code entity is then assigned to the concern-related cluster that its feature vector is most similar to. This similarity is determined using the cosine similarity between the feature vector of the code entity and that of the cluster. Cosine similarity is a measure of the distance between vectors and is commonly used in Natural Language Processing in order to determine how close a body of text is to a given topic [9, 58, 83].

3.7.4 Modularity and Crosstalk Prevention

Recall that my goals for RELAX include scalability, efficiency, appropriate sensitivity and determinism.
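The cluster-assignment step of Section 3.7.3.1 can be sketched as follows, using the example feature vectors of Tables 3.1 and 3.2. This is an illustration under my own naming, not RELAX's actual code; concern order is Database, Graphics, Networking, and an entity with no affinity to any concern falls into the "Unknown" cluster.

```java
// Sketch of assigning an entity to its most cosine-similar concern cluster.
public class CosineCluster {

    // Cosine similarity of two equal-length vectors; 0 for a zero vector.
    static double cosine(double[] a, double[] b) {
        double dot = 0, na = 0, nb = 0;
        for (int i = 0; i < a.length; i++) {
            dot += a[i] * b[i];
            na += a[i] * a[i];
            nb += b[i] * b[i];
        }
        if (na == 0 || nb == 0) return 0;  // no affinity at all
        return dot / (Math.sqrt(na) * Math.sqrt(nb));
    }

    // Index of the most similar cluster vector, or -1 for the default
    // "Unknown" cluster when the entity matches no concern.
    static int assign(double[] entity, double[][] clusterVectors) {
        int best = -1;
        double bestSim = 0;
        for (int k = 0; k < clusterVectors.length; k++) {
            double sim = cosine(entity, clusterVectors[k]);
            if (sim > bestSim) { bestSim = sim; best = k; }
        }
        return best;
    }
}
```

Because the cluster vectors are orthogonal unit vectors, the winning cluster is simply the concern with the largest affinity component; the cosine formulation generalizes this to non-orthogonal cluster vectors.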
My intuitive approach to this is to explore building up the overall recovery result from individual parts that could be individually and independently processed and reused or updated as needed. I then decided that these individual parts should be the source code entities of the system and analyzed which beneficial properties would emerge.

Table 3.1: Entities with Feature Vectors

Entity           Database  Graphics  Networking
SQL.Java         0.9       0.1       0.2
Screen.Java      0.05      0.95      0.1
ConnectIP.Java   0.02      0.01      0.92

RELAX classifies each code entity as belonging to a set of user-defined concerns. The classification task is performed on an individual source code entity and has no dependence on the classification of any other entity.

The classification of source code entities individually enables RELAX's important property of modularity. This means that the recovery results of the whole system or its subsystems can be composed from smaller parts, eventually reaching down to the ground level of individual source code entities. Modularity in turn enables scalability and efficiency by allowing the following operations:
• For the architecture of a system to be recovered, it can be split up into smaller units which can then be distributed to be classified and associated with a cluster. This way, the ceiling of the system size that can be evaluated is nearly unlimited.
• For evolutionary studies on a system, only the entities that have changed will need to be evaluated. The information on the remaining entities gained in a previous recovery run can be reused.
• Libraries or frameworks can be evaluated separately and their results added to or subtracted from the whole as needed.

Further, the individual classification also limits "crosstalk". Crosstalk is a phenomenon in which a change in a source code entity affects parts of the recovered architectural view that do not pertain to it.
Because RELAX prevents crosstalk, the change of the recovery result is confined to the code entities that have changed and (possibly, depending on whether their associated concerns have changed) their cluster association. Further, the scale of the changes in the recovered view is proportional to the size of the changes in the source code entities.

Table 3.2: Clusters with Feature Vectors

Cluster      Database  Graphics  Networking
Database     1         0         0
Graphics     0         1         0
Networking   0         0         1
Unknown      0         0         0

3.7.5 Textual Output

Conceptually, the textual output produced contains (1) the classification of each source code entity, (2) the constituents of all concern-related clusters, and (3) auxiliary output from other tools, such as the list of dependencies between code entities or the size of entities in SLOC, which can be used for further processing and analysis. (The dependencies between code entities are not determined by RELAX itself, but by the Classycle library [41]; they can be used for further analysis.)

3.7.6 Visualization

The textual output of RELAX alone may not be very helpful to users of the method who wish to interpret its results or need to get to work on maintenance quickly. For this, visualizations are helpful because they can enhance the understanding of information by reducing cognitive overload [45]. Colors in visualizations have been shown to be effective in highlighting events in a system's evolution over its release history [44]. The directory visualization of RELAX can be zoomed out to give any stakeholder a high-level overview of the system architecture and the system's addressed concerns as shown in Figure 3.4, but can also be zoomed in so as to be enlarged to the level of individual source entities (see Figures 3.6 and 3.7). The visualization is based on the directory structure of the system, which corresponds to the package structure in Java. The system is shown as a directory tree.
Nodes are either packages (inner nodes) or source entities (leaf nodes). Nodes that belong to the same package are surrounded by a rectangle. Since software systems can consist of a very large number of source entities, individual nodes can be very small in an overview (situations in which an individual node takes up less space than a pixel are conceivable), and gaining an impression of their concerns would then be impossible. Therefore, in order to guarantee that concerns can be shown, the lines from each package folder to its children are shown in the color that corresponds to the prevailing concern in that package.

[Figure 3.4: RELAX Directory Graph Example]

The prevailing concern is determined as follows: For each child node, the main concern is determined by the classifier as the topic most relevant to the corresponding code entity. The weight of this entity is then determined by its file size (physical or logical SLOC can also be selected for this). If a child node is not a leaf node (i.e. it stands for a package), then its prevailing concern is the concern that carries the most weight with its children. This relationship holds recursively throughout the tree. One important outcome of this is that there is an easy way to see what the main concern of the overall system is by checking the color of the root node (or the colors of the root nodes, if several exist) of the system. Because of the recursion, this holds for each package.

RELAX generates a legend for the directory visualization which shows the names of the concerns as they exist in the classifier, in the color automatically selected for them by RELAX, as shown in Figure 3.5. The color selection is based on guidelines for optimal distinguishability of adjacent colors [22]. Individual nodes can be examined by zooming in. The paradigm that this visualization follows is that of a navigational file manager, such as the Finder in macOS or the Explorer in Windows.
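The recursive prevailing-concern computation described above can be sketched as follows. The tree type, its names and the weighting by raw file size are my own simplifications for illustration; RELAX's implementation may differ, and logical or physical SLOC can equally serve as the weight.

```java
import java.util.*;

// Sketch: a directory-tree node whose prevailing concern is the concern
// with the largest accumulated weight (here: file size) in its subtree.
public class ConcernNode {
    String concern;                               // main concern (leaves only)
    long size;                                    // file size in bytes (leaves only)
    List<ConcernNode> children = new ArrayList<>();

    ConcernNode(String concern, long size) {      // leaf: one source entity
        this.concern = concern;
        this.size = size;
    }

    ConcernNode(ConcernNode... kids) {            // inner node: a package
        children.addAll(Arrays.asList(kids));
    }

    // Accumulate concern weights for the subtree rooted at this node.
    Map<String, Long> weights() {
        Map<String, Long> w = new TreeMap<>();
        if (children.isEmpty()) {
            w.put(concern, size);
        } else {
            for (ConcernNode c : children)
                c.weights().forEach((k, v) -> w.merge(k, v, Long::sum));
        }
        return w;
    }

    // The concern carrying the most weight among this node's descendants.
    String prevailing() {
        return weights().entrySet().stream()
                .max(Map.Entry.comparingByValue())
                .map(Map.Entry::getKey).orElse("no_match");
    }
}
```

Because the weights are summed bottom-up, the root's prevailing concern is exactly the system-wide main concern mentioned above.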
The details shown in Figure 3.6 correspond to the metadata view obtained by right-clicking and selecting "Get Info" in the macOS Finder or right-clicking and selecting "Properties" in the Windows Explorer. In the example shown in Figure 3.6, we are seeing a package (left) and two Java source entities. Since all three belong to the same package, we see three incoming arrows from the top in the same color. Each entity is shown with a group of attributes:
• A top box containing the base name of its canonical name,
• A second row showing
  – its file size in bytes with the color of its concern as the background color,
  – its logical SLOC with the same background color.
• A third row with all outgoing dependencies colored for the corresponding entities,
• A fourth row with all incoming dependencies colored for the corresponding entities.

[Figure 3.5: RELAX Directory Graph Legend - the concerns graphics, gui, io, networking, security, sound, sql, text and no_match, each in its assigned color]

[Figure 3.6: RELAX Directory Graph Detail]

[Figure 3.7: RELAX Directory Graph Low Level Comparison]

Checking individual entities can give the user an impression of how connected an entity is and which type of concerns the related entities address. Questionable dependencies could be caught here.

The format of the file that is used to lay out the directory graph is a human-readable text file that describes a directed graph. The actual layout is done by dot, a program from the Graphviz package. The dot program creates hierarchical layouts. Results are created in PDF format. It is possible to provide specific directives for the width and height of the graph. Computationally, the directory visualization layout is "free" in the sense that it runs as a separate task that may be assigned to a different CPU core from the one the recovery runs on after the recovery proper has finished, and because the recovery of the next version, which does not depend on the visualization in any way, can begin immediately.
The layout task is started by RELAX, but RELAX does not wait for it to finish.

From the hierarchical diagram of the system shown in Figure 3.4, a stakeholder can immediately get an overview of the system and gain some first impressions: First, it is apparent that the system has two top-level folders (with branching only beginning several levels below the top due to the Java packaging conventions, which use the reversed Internet domain names of organizations [120]). It is clear that five package levels have leaf nodes (which stand for code entities). The third level from the bottom has the most code entities. Regarding concerns, the system seems to be mostly addressing the one that is shown in bright red. Two concerns, bright green and dark blue, seem to be addressed mostly in one package each (second level from the bottom at the very left and near the middle of the third level from the bottom). Several concerns, such as the orange one, the light blue one and chiefly the bright red one, are shown to be distributed throughout the system. This could indicate a poor separation of concerns (or possibly the need for a narrower definition of the concerns that should be used for classification).

Conclusions can also be drawn when studying the evolution of a system. The diagrams in Figures 3.8 and 3.9 show two consecutive minor versions of the same system: The similar outlines make the two versions of the system easy to compare (though some differences in shape are due to the automatic layout in the Graphviz package). The comparison shows that the leftmost package in the hierarchy, which was dominated by the red concern in the first version, is now more evenly split three ways between red, blue and green and has changed its prevailing concern from red to green.

[Figure 3.8: RELAX Directory Graph of First System Version]

[Figure 3.9: RELAX Directory Graph of Second System Version]
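To make the human-readable layout input concrete, the following sketch emits a minimal Graphviz "dot" description with edges colored by the parent package's prevailing concern, as in the directory visualization of Section 3.7.6. The class name, the tree representation and the color names are illustrative assumptions, not RELAX's actual output format.

```java
import java.util.*;

// Sketch: emit a directory tree as Graphviz "dot" input, coloring each
// package-to-child edge with the package's prevailing-concern color.
public class DotEmitter {
    public static String emit(Map<String, List<String>> tree,
                              Map<String, String> concernColor) {
        StringBuilder sb = new StringBuilder("digraph directories {\n");
        for (Map.Entry<String, List<String>> pkg : tree.entrySet()) {
            String color = concernColor.getOrDefault(pkg.getKey(), "gray");
            for (String child : pkg.getValue()) {
                // One colored edge per package-to-child relationship.
                sb.append(String.format("  \"%s\" -> \"%s\" [color=%s];%n",
                        pkg.getKey(), child, color));
            }
        }
        return sb.append("}\n").toString();
    }
}
```

The resulting text file can then be handed to dot (e.g. `dot -Tpdf`) for hierarchical layout, independently of the recovery itself.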
Figure 3.7 shows how the evolution of a software system is mirrored in the low-level view of the RELAX directory graph. The major changes are:
• Version 0.1.2 is the original version.
• In version 0.2.0, ChukwaOutputCollector loses an outgoing dependency.
• In version 0.3.0, ChukwaOutputCollector gains an outgoing dependency.
• In version 0.5.0, the reducer package changes its prevailing concern, as indicated by the color of the second line.
• In version 0.7.0, the mapper package changes its prevailing concern.

3.7.7 Workflow

Figure 3.10 shows the workflow of a RELAX recovery from the point of view of the programmatic process, which incorporates all parts of my approach. The selection of concerns is not shown as an explicit step, but is an implicit part of the selection of a trained classifier. Training a classifier based on training data is shown with dashed lines since it is not a necessary part of each recovery.

3.7.8 Implementation

RELAX uses the MALLET [100] toolkit, which includes different classification algorithms such as Naïve Bayes [112, 80], Maximum Entropy [115] and Decision Trees [130] and allows training and applying them. RELAX has been implemented in Java as part of a workbench comprising a suite of architecture recovery techniques. The implementation is GUI-based and allows training classifiers, running RELAX, and visualizing the results without leaving the GUI. The principal output produced is a textual clustering of the system's source code entities and a directory visualization.

[Figure 3.10: RELAX Recovery Workflow]
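A textual clustering in the "contain" RSF convention can be read back for further analysis with a few lines of code. The following is an illustrative sketch under my own naming; it parses lines of the form "contain &lt;Cluster Name&gt; &lt;Canonical Name&gt;" into a map from clusters to their member entities.

```java
import java.util.*;

// Sketch: parse "contain <Cluster> <Entity>" RSF lines into a cluster map.
public class RsfReader {
    public static Map<String, Set<String>> readContain(List<String> lines) {
        Map<String, Set<String>> clusters = new TreeMap<>();
        for (String line : lines) {
            String[] parts = line.trim().split("\\s+");
            // Ignore anything that is not a well-formed "contain" fact.
            if (parts.length == 3 && parts[0].equals("contain"))
                clusters.computeIfAbsent(parts[1], k -> new TreeSet<>())
                        .add(parts[2]);
        }
        return clusters;
    }
}
```

The same pattern applies to "depends A B" facts, with the map then going from an entity to the entities it depends on.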
3.7.9 Textual Output Detail

The following relevant output is created in the individual recovery run output folder:

Table 3.3: RELAX Output Files

File Name                            | Entry Format                                  | Contents
<System Name>_relax_manifest.txt     | File path                                     | List of entities used for recovery
<System Name>_relax_entities.txt     | Classifier file path, Classifier MD5 hash,    | Entity information
                                     | Entity filename, Entity class affinities      |
<System Name>_deps.rsf               | depends <Canonical Name 1> <Canonical Name 2> | Class dependencies
<System Name>_relax_clusters.rsf     | contain <Cluster Name> <Canonical Name>       | Cluster contents
<System Name>_relax_clusters_fn.rsf  | contain <Cluster Name> <File Name>            | Cluster contents
<System Name>_legend.pdf             | Adobe PDF                                     | Legend of directory graph (concerns in their colors), see Section 3.7.6
<System Name>_directories.pdf        | Adobe PDF                                     | Directory graph, see Section 3.7.6

In the "depends" format, "depends A B" means that A depends on B. In the "contain" format, "contain A B" means that A contains B. This convention was introduced by MoJoFM [155] to facilitate comparing the clustering results of different architecture recovery methods.

Chapter 4

Evaluation

In this chapter, RELAX is evaluated using four hypotheses that are tested by employing open-source software systems and gaining empirical results from them. The hypotheses are:
1. (View): RELAX can find the locations of given concerns in a software system and generate a view that includes those concerns.
2. (Maintenance): RELAX facilitates efficient maintenance by pointing out where a system's concerns are addressed.
3. (Scalability): RELAX scales efficiently to large system sizes in SLOC.
4. (Modular Recovery): RELAX can recover a system and its modules (such as frameworks) separately or together without adding recovery effort.

The four hypotheses are discussed and evaluated in Sections 4.1, 4.2, 4.3 and 4.4.
4.1 Hypothesis 1 (View)

View Hypothesis: RELAX can find the locations of given concerns in a software system and generate a view that includes those concerns.

By its design, using text classification, RELAX will find the locations of given concerns and generate a view through its textual and visual output, which represents a structure and a reasoning behind it. However, it is still of interest to evaluate how the quality of this view compares to the views produced by other recovery methods.

4.1.1 Ground Truth Considerations

The need to evaluate the accuracy of architectural views produced by recovery algorithms has sparked a desire for reference architectures that can serve as reference views that these other views can be compared to. The accuracy of each view is then determined by its closeness to that reference view. While such reference views have been obtained without the systems' engineers [20, 99], more recently there have been efforts to involve them in the process of arriving at such a view [48]. Expert decompositions are manual clusterings of systems' entities, performed by experts on those systems [98]. In some cases they can be arrived at through an iterative process between the developers of a particular recovery method and a system's architects [48].

Garcia et al. [48] have evaluated the results that six different recovery methods achieved for eight systems on how close they were to what they referred to as "ground truth" architectures available for each respective system. For this, they compared the clusterings produced by the recovery methods to expert decompositions using MoJoFM (described in Section 2.7.15.1). However, notably, they experimented with letting each recovery method produce different numbers of clusters in order to arrive at the best possible MoJoFM values the recovery methods could produce. In the case of ARC, they additionally started with
In the case of ARC, they additionally started with 76 setting the number of concerns to 10 and then raising it until a further increase did not improve the MoJoFM values or they had reached a number of V/3 concerns, with V being the number of terms in a system. For another method that they knew to be non-deterministic, they also took the best result of three that it produced [48]. In light of how several measures were taken in order to present all recovery methods in the best possible light and to pick and choose from their results, it is my conclusion that this kind of analysis of the ideal behavior of these recovery methods cannot be representative of how accurate any of these algorithms are when there is no “ground truth” to iteratively fit their results to. In this much more typical case, the user will not have an expert decomposition of the system, which is the reason to run architectural recovery in the first place. Additionally, the way these “ground-truth” architectures are produced cause them to have four intrinsic limitations: 1. They rely on a comparison of clusterings of the same system. Conceivably, a recovery method could produce a conceptual view of a system which does not make use of clusters. (Consider a recovery method that bases its recovery exclusively on the semantics gained from parsing a system’s documentation or the deployment of a system and would then come to a view similar to the UML deployment diagram 1 shown as an example in Figure 4.1.) 2. The view they produce is necessarily affected by the paradigm of the existing recovery method the system expert uses as an aid. This can give an advantage to views generated with the same method if we consider that some recovery methods may use default values for the amount of clusters they generate (such as ARC, see Section 2.7.12). 3. The measure used to determine the distance between two architectural views limits how meaningful that comparison is. 
For example, while in concern-based recovery, the ground-truth architecture may assign one set of concerns to each cluster, none of the common similarity measures mentioned in Section 2.7.15 take the semantics or names of concerns into account, but just clusters and their structures. Therefore, they do not take into account why code entities end up in a cluster.
4. Human involvement introduces a measure of subjectivity to the process.

[Figure 4.1: UML Deployment Diagram]

Additionally, trying to conform the output of architecture recovery methods to the view of one architect dismisses the fact that different architects can produce different views of the same architecture that are equally valid [98]. Calling an expert decomposition a ground truth implies that there can only be one correct expert decomposition or at least that testing views of the architecture against this ground truth alone is sufficient. Since this is not the case with architectural views, where several views can be equally valid, I will use the term expert decomposition to better capture this fact.

Behnamghader et al. [72, 11] have used the recovery methods ACDC, ARC and PKG to study how the architecture of software systems changes between different kinds of versions (major, minor, patch). As a justification for using PKG, ACDC and ARC as implemented in ARCADE, they referred to the analysis mentioned above [48], but did not take the iterative approach to running recovery methods espoused there, using pre-set parameters instead. Both ACDC and ARC were found to produce non-deterministic results. The non-determinism of ACDC, which was caused by the use of non-deterministic data structures in its ARCADE adaptation, was addressed by pre-sorting the package names supplied to its orphan adoption code.
The non-determinism of the LDA algorithm employed by ARC as implemented in the MALLET toolkit was addressed by assigning a constant value as the seed to the random number generator in MALLET. Additionally, they observed that the impact of any changes to the system could not be predicted. They fixed this by generating a shared topic model for all versions of a system which are considered in an evolutionary study.

This shared topic model used for ARC has several deficiencies, which are illustrated in Figure 4.2: Consider five consecutive versions 1-5 of a software system. Over time, the system has addressed concerns A through E as shown. Two evolutionary studies of the system are conducted. The scope of the first one covers versions 1-3, the second one versions 3-5. The issues of the shared topic model become apparent when we try to determine the correct architecture for version 3. Which concerns should be considered for its recovery? Looking at version 3 in isolation, we would consider concerns B and C. In the first study, we would apply a shared model made up of concerns A, B and C. In the second study, it would consist of concerns B, C, D and E. The former case would attest that version 3 addresses concern A when it clearly does not, while the latter would do the same with concerns D and E. It is also easy to see that for a system with n versions, 2^n different topic models could be built depending on which versions are included.

Lutellier et al. [91] have studied six architecture recovery methods with a focus on their respective distances to "ground truths" and their processing times, but did not address these issues.
[Figure 4.2: Shared Topic Models and Scope - versions 1-5 with their concerns, the scopes of the two studies, and the two resulting shared topic models]

4.1.1.1 Comparisons of RELAX Recoveries to Expert Decompositions

Some of the principal results of a RELAX recovery are the classification of all individual code entities of a system and their grouping into a set of concern-related clusters. The accuracy of the clustering can be determined by measuring its similarity to an expert decomposition, which is another clustering manually prepared by an expert on the system, such as its architect. The expert decomposition serves as a "ground truth" [97]. A known measure of similarity for this is MoJoFM [160]. It was selected for this evaluation because it has been used in several studies such as [121] as well as [93], and data for my evaluation, which compares the respective closeness of RELAX to expert decompositions, is already available for ACDC and ARC (see Section 5), while the clustering results that formed the basis of the study [48] are not. I felt it was important to compare the performance of RELAX to that of ARC, since the latter is another recovery method whose paradigm is that of concern-oriented clustering.

MoJoFM expresses the similarity between two partitions of a set as a percentage, where 100% represents identity and 0% maximal difference. Its formula is:

MoJoFM(M) = (1 − mno(A, B) / max(mno(∀A, B))) × 100%    (4.1)

Where M stands for the clustering technique, A is the clustering produced by M and B is the expert decomposition. mno(A, B) represents the minimum number of Move (moving an object to a different cluster) and Join (joining one or more clusters to form a new cluster) operations to transform partition A to B.
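As a numeric illustration of Equation 4.1 with hypothetical operation counts: if transforming clustering A into expert decomposition B takes mno(A, B) = 5 Move and Join operations, while the worst clustering of the same entities would take 20, the result is (1 − 5/20) × 100% = 75%.

```java
// Illustration of Equation 4.1 with given operation counts (the hard part
// in practice, computing mno itself, is left to a MoJoFM implementation).
public class MoJoFmExample {
    public static double mojoFm(int mno, int worstCaseMno) {
        return (1.0 - (double) mno / worstCaseMno) * 100.0;
    }
}
```

A value of 100% thus means the clustering already matches the expert decomposition (no operations needed), and 0% means it is as far from it as the worst possible clustering.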
For the purposes of my comparisons of RELAX clusterings to expert decompositions, I am interested in answering the following question: How close would RELAX come to the expert decomposition if a classifier were trained to categorize a system's code entities into clusters related to the concerns present in the expert decomposition?

Table 4.1 compares the MoJoFM values of RELAX to those of two other recovery methods (ACDC and ARC, both of which are described in detail in Chapter 5) that have been identified previously as the two closest to the expert decompositions of eight systems out of a set of ten recovery methods [48], on the same system versions that they were originally performed on. As can be seen in Figure 4.3, RELAX exceeds their MoJoFM values in all cases. (An earlier test in which RELAX used the first candidate of each trained classifier was documented in [84] and showed RELAX better in five cases, between the two in one case and closely below them in two.) This leads to the conclusion that RELAX's overall accuracy is better than that of the two most accurate known recovery methods so far.

Summary of findings:
• The views generated by RELAX are closer to the expert decompositions than those of the other software architecture recovery methods in all individual tests as well as on average.
• RELAX can find the locations of concerns in a software system and build a high-quality view of an architecture based on them.

Table 4.1: MoJoFM Expert Decomposition Comparison Values

System      RELAX  ARC    ACDC
Bash        75.86  57.89  49.35
OODT        72.07  48.48  46.01
Hadoop      70.55  54.28  62.92
ArchStudio  88.24  76.28  87.68
Linux-D     74.67  51.47  36.31
Linux-C     93.70  75.72  63.76
Mozilla-D   53.47  43.44  41.20
Mozilla-C   90.62  62.50  60.30
Average     75.51  58.22  55.32

[Figure 4.3: RELAX Expert Decomposition Comparisons]

4.2 Hypothesis 2 (Maintenance)

Maintenance Hypothesis: RELAX facilitates efficient maintenance by pointing out where a system's concerns are addressed.
If maintainers who have RELAX recovery results for a system can start maintenance work on that system faster than maintainers who do not have them, then RELAX facilitates efficient maintenance. To find out whether this is the case, I conducted a user study.

4.2.1 Study Approach
A task that maintainers working with unfamiliar code encounter repeatedly and regularly is that of searching for relevant code to perform maintenance on [67]. It was concluded from this that if it could be shown that the availability of RELAX recovery results significantly reduces this search time, they would hold utility for maintainers. Accordingly, I simulated a situation in which individuals with at least some Java experience are assigned maintenance work on projects they are not familiar with and need to become productive as soon as possible.

4.2.1.1 Study Questions
The study sought to answer two research questions:
• SQ1. Does using RELAX architecture recovery results reduce the time to find the location in the code where maintenance needs to be performed?
• SQ2. What are the perceptions of new maintainers who work with RELAX architecture recovery results?

4.2.1.2 Participants
The sample of participants had to fulfill the following criteria:
• Sufficient Java programming experience to be able to search for and recognize code structures
• Availability for the duration of the study without suffering financial or other losses or sacrifices
• Motivation to follow through with the study
• Lack of distractions that would end the study prematurely, such as suddenly arising work obligations
• No monetary or social (friendship) interest in the outcome of the study

The software engineering group at our faculty holds several directed research sessions for graduate students of computer science, which take place over the full duration of the regular semesters and summer sessions.
A call for participation in the forthcoming study was sent out to enrolled directed research students via email, and a presentation was held. In it, the project was outlined, and the main criterion besides interest in the project was that participants should have Java programming skills commensurate with a bachelor’s degree in computer science. Prospective participants were offered extra credit. Nine participants joined my study.

4.2.1.3 Duration
Two considerations drove my selection of a duration for the study. On the one hand, enough time needed to be available to instruct participants about software architecture, its recovery and RELAX. On the other hand, since the emphasis of the project was on studying performance and not on educating the students, I could not expect to work with the participants for an entire semester. My personal experience over several directed research projects had also shown me that interest, and with it motivation, wears out over time when tasks are repetitive and regarded as work. This would have risked influencing the outcome. I felt that an overall duration of eight hours distributed over four weeks suitably reconciled the two opposing factors.

4.2.1.4 Parameters
During the four weeks my study ran, each participant contributed up to a maximum of two hours per week, depending on whether and how fast they completed the tasks that were part of the study. All nine participants from my convenience sample stayed for the entire duration. The first week was spent on introducing the participants to, or refreshing them on, the basics of software architecture and instructing them on software architecture recovery with RELAX, followed by an initial survey on their experience levels going into the project. The second week served to install, try out and validate the participants’ hardware and software setups and to perform dry runs with warm-up tasks that were not part of the study.
The third week was spent exclusively on the first experimental task, while the fourth week was spent on the second task as well as the exit survey.

4.2.1.5 Tools and Instrumentation
Survey instruments consisted of an initial survey, time-taking during live sessions and an exit survey. Data was gathered through surveys and observations. Online surveys were conducted on the Qualtrics platform (qualtrics.com). Live programming sessions were held over Zoom (zoom.us). The participants had the option to choose between two maintenance environment setups, depending on their comfort levels:

1. A downloadable virtual machine that was pre-configured by me and contained a current desktop installation of Ubuntu Linux (ubuntu.com) with current versions of three major Java IDEs (Eclipse (eclipse.org), NetBeans (netbeans.apache.org) and IntelliJ IDEA (jetbrains.com/idea/)) as well as the current source code of Apache Jackrabbit (jackrabbit.apache.org, version 2.20) and Apache Ant (ant.apache.org, version 1.9.15), which were the two systems chosen for the experiment.
2. Their own system with an IDE of their choice.

The time to download source code and to set up the chosen IDE was not counted in the second case, while in the first case this had already been done. Time was recorded by me. The participants had 60 minutes to finish each task. A timer was started when the URL of the task description was sent to the participant via Zoom chat. If a participant finished a task successfully, time-taking was stopped and the actual time taken in minutes was recorded. If they reached the 60-minute mark without having finished the task, DNF (standing for Did Not Finish) was recorded and counted as 60 minutes. In order to make timekeeping as accurate as possible, all tasks needed to be worked on live during a Zoom session in which the participants shared their screen and sound with the observer. This part of the setup aimed to prevent consultations of outside sources to the extent possible over a remote connection.
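The timing rule just described can be sketched as follows; this is a simple illustration of the cap and DNF convention, not the actual instrumentation used in the study.

```python
TIME_CAP_MINUTES = 60.0

def record_task_time(finished: bool, minutes: float):
    """Return the (label, counted minutes) recorded for one task attempt.

    A task not finished within the cap is recorded as DNF and
    counted as the full 60 minutes.
    """
    if not finished or minutes >= TIME_CAP_MINUTES:
        return "DNF", TIME_CAP_MINUTES
    return f"{minutes:.2f}", minutes

print(record_task_time(True, 6.0))    # ('6.00', 6.0)
print(record_task_time(False, 60.0))  # ('DNF', 60.0)
```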
The observers were authorized to help the participants with their Zoom or virtual machine setups. However, such a need did not arise.

The open source Java projects Apache Jackrabbit and Apache Ant were chosen for the following principal reasons:
• Open source
• Written in Java, which is the source code language that RELAX can currently recover
• Compilable with recent (not unsafe) Java versions
• No dependencies on older libraries or Java versions that may be hard to obtain
• Do not take very long (less than a minute) to compile
• Can be run and tested on vanilla Ubuntu without having to install additional prerequisites
• Thousands of source files make it unlikely that participants can find maintenance locations by guessing or browsing
• Mature systems that are still updated and not likely to exhibit artifacts of antiquity or instability

4.2.1.6 Tasks
There were two warm-up tasks that were observed, but not counted, followed by two experimental tasks that were used for the study.

4.2.1.7 Warm-Up Tasks
The warm-up tasks had the goal of establishing the participants’ programming skills and validating the infrastructure and setup of the experiment without using RELAX or its results. I estimated these tasks to be feasible in less than ten minutes for participants who knew nothing about the respective systems. The only items provided to the participants were the source code of the system and instructions on how to run the command-line version of the system in a shell. The first, easier task was to change the output of Apache Jackrabbit on startup. The second, more difficult task was to do the same with Apache Ant. While in the first case this was achievable by making changes to the Java source code, the second required the modification of an external XML file.
4.2.1.8 Experimental Tasks
Since there were 9 participants, the split between the control group (participants not given the recovery results) and the experimental group (participants given the recovery results) could not be even, but was made 4 to 5 for the first task and 5 to 4 for the second (compare Table 4.3 below). Whichever participant was in the control group for the first task was in the experimental group for the second and vice-versa.

Both tasks were to be performed on Apache Jackrabbit 2.20 and had a common setting in which the participants were to assume that they knew nothing about the architecture of Apache Jackrabbit and were working alone in a closed bunker with no internet access. As surveyed at the outset of the study (Section 4.2.2.1), none of the participants had any experience with the architecture or use of Jackrabbit, and none had contributed any code to it. The participants had 60 minutes to finish each task. Timing was capped at 60 minutes as described in Section 4.2.1.5.

Task 1:
• You are tasked with making fixes to a filesystem that interfaces with a database. In order to start your work, you think of adding a few print statements - but where?

Task 2:
• You are tasked with cleaning up a system after an unreachable former contributor has modified one source file so that it no longer handles security, but SQL instead. The source file was not renamed. You want to make the system maintainable again and change the name of that file - but where do you look?

While the first experimental task was oriented toward an individual system version, the second one focused on the evolution of a system.

4.2.1.9 Surveys
The participants were given an initial survey in the first week, directly after the instruction phase (so that there would be no confusion over terms used), and an exit survey at the end of the study period.
The surveys consisted of a variety of question types, including text fields (e.g., for email addresses), multiple-choice (e.g., whether the textual results, the diagram or both together gave participants the most complete information about the software system), and yes-no (e.g., whether participants had contributed to Apache Jackrabbit), with rating questions shown as rating scales or Likert scales.

To reduce fatigue and hold the attention of participants, the surveys varied question types where possible. For example, some questions that could have been put on a Likert scale were put on rating scales. For the same reason, some rating questions used sliders and some asked participants to enter their ratings as numbers in text fields. Questions are shown along with their results in Section 4.2.2.3.

4.2.1.10 Initial Survey (Participants’ Experience)
This asked participants about the levels of experience they had before the experiment started, regarding
• Programming
• Text classification
• Software architecture and its recovery
• Apache Jackrabbit

Some questions, such as one that asked participants whether they had contributed to Apache Jackrabbit or another that asked about preexisting experience with software architecture recovery, opened further branches of questions on the relevant details. However, this did not apply to any of the participants.

4.2.1.11 Exit Survey (Evaluations of RELAX)
This asked the participants utility-related questions regarding RELAX recovery results and working with them on individual and multiple system versions.

4.2.2 Results
The full results of my study, including the surveys, are shared publicly at https://doi.org/10.5281/zenodo.5323958.

4.2.2.1 Initial Survey on Preexisting Experience
The top of Table 4.2 shows the results of my survey on the experience the participants had before they began the study. Out of the fields surveyed, the only one in which they tended to have some experience was text classification.
They were new to Apache Jackrabbit when it came to its architecture and use, and had next to no experience with its code and documentation. They had no experience with software architecture recovery, and none had contributed to Apache Jackrabbit. Out of nine study participants, eight had 1-5 years of programming experience; only one had 5-10 years. Typically, they had some text classification experience, with the average being 2.56 on a scale of 1-5, with 5 being the highest.

Interpretation: This is in line with expectations for graduate students in computer science and was not likely to distort any results. It is particularly fortunate for my study that they did not have any significant experience with Apache Jackrabbit.

4.2.2.2 Timings of Tasks
Table 4.3 shows how much time in minutes each participant took for each individual task. As mentioned, if they did not finish a task within 60 minutes, the time is shown as DNF for “Did Not Finish” and counted as 60 minutes. The columns refer to the warm-up tasks described in Section 4.2.1.7 and the experimental tasks 1 and 2 outlined in Section 4.2.1.8. The time RELAX took to recover an architectural view of Jackrabbit 2.20 on a 3.6 GHz Intel Core i9 using macOS was 235 seconds. For both experimental tasks, the results are shown both without and with the recovery runtime counted.

Table 4.2: Survey Results (Scales from 1-5)
(columns: Question, Average, σ)

Participant familiarity with...
Text Classification 2.56 1.13
Software Architecture Recovery 1.00 0.00
Apache Jackrabbit Source Code 1.11 0.33
Apache Jackrabbit Architecture 1.00 0.00
Apache Jackrabbit Documentation 1.11 0.33
Apache Jackrabbit Use 1.00 0.00

RELAX view of Apache Jackrabbit:
Easy to understand 4.22 0.42
I can learn the structure of the system from it 4.78 0.42
I can learn from it why the system is structured the way it is 3.78 0.79
It can support new contributors 4.67 0.47
It can tell me something about the quality of the system 4.00 0.47
Real-time RELAX version valuable 4.33 0.47

RELAX views of several Apache Jackrabbit versions:
Easy to compare 4.89 0.31
Differences are proportional to changes in the system 4.67 0.47
I can track the development of the system with them 4.56 0.50
Explain differences between versions 4.89 0.31
Describe system evolution 4.67 0.47

Agreement that RELAX recovery results for Apache Jackrabbit...
Reduced time to start maintenance vs. only working with source files 4.89 0.31
Gave me an overview of how the system is organized 4.67 0.47
I can use it as a source to get started on other maintenance tasks 4.67 0.47
Knowing the architecture of a system can be valuable 4.89 0.31
A concern-based architecture view can be a valuable aid 4.67 0.47

On the question whether the textual results, the diagram, or both gave the most complete information on the architecture of a system, 3 participants chose the textual ones and 6 both.
The recovery runtime counted is the time the architectural view recovery took, rounded up to 4 minutes and added for each case in which a participant had the recovery results. Timings including the recovery runtime are shown in parentheses. The addition of this time is discussed in Section 4.2.3 below.

Table 4.3: Task Timing Results (in Minutes)

Participant  Warm-Up 1  Warm-Up 2  Task 1        Task 2
1            DNF        DNF        DNF           3.00 (7.00)
2            DNF        27.00      DNF           2.00 (6.00)
3            11.50      8.00       6.00          2.00 (6.00)
4            20.75      4.00       DNF           1.00 (5.00)
5            DNF        DNF        DNF           3.00 (7.00)
6            12.50      DNF        7.08 (11.08)  DNF
7            14.50      DNF        2.50 (6.50)   4.50
8            16.50      DNF        5.00 (9.00)   DNF
9            12.00      DNF        6.00 (10.00)  DNF
Averages     29.75      44.33
Without recovery results          49.20         46.13
With recovery results             5.15 (9.15)   2.20 (6.20)
Speedup factor                    9.56 (5.38)   20.97 (7.44)

Numbers in parentheses include the RELAX runtime of 4 minutes. DNF (Did Not Finish) counts as 60 minutes.

Before we look at the average speedups, two observations are notable. First, of the nine participants, three were unable to complete the first warm-up task and six the second. Second, while for each experimental task only one participant was able to complete it without having recovery results available, all participants were able to complete their tasks when they had them.

Having recovery results sped up the maintenance tasks by a factor of at least 5.38 for the first task and 7.44 for the second. These factors rise to 9.56 and 20.97 if we exclude the recovery time. It needs to be considered that in those cases where participants did not finish, the speedup factors are influenced by the timing cap of 60 minutes. Higher caps could have resulted in higher speedup factors.

4.2.2.3 Exit Survey
Table 4.2 shows the results of the exit survey in its lower sections, regarding working with either architectural views of one system version or views produced for several versions, respectively.
The participants formed overall favorable opinions of the view of one Jackrabbit version as well as of the views of several versions. The highest rankings are held by the ease of understanding the view of a single version as well as by the ease of comparing and explaining the differences between versions. One result that will be discussed below is that they found a hypothetical version of RELAX that would recover architectural views in the background valuable.

4.2.3 Discussion
The basis of our discussion can be summed up as follows:
• The start of maintenance is sped up by at least a factor of 5.38
• Across the board, participants strongly considered architectural views produced by RELAX helpful in maintenance tasks
• This applies to both individual architectures and evolution

To which extent the recovery runtime for the current system version needs to be counted as part of the time a participant works on a task depends on the specific setup, including the number of maintainers and other factors. As discussed below in Section 7.2.3, I do not envision this to be a major timing factor. However, the results are presented with and without counting the runtime, and the longer time is conservatively reported as the overall result.

4.2.4 Answers to the Study Questions
4.2.4.1 SQ1 (Does using RELAX architecture recovery results reduce the time to find the location in the code where maintenance needs to be performed?)
The start of maintenance is sped up by factors of 5.38 and 7.44 for the two tasks, respectively. Based on these results, SQ1 can be answered affirmatively.

4.2.4.2 SQ2 (What are the perceptions of new maintainers who work with RELAX architecture recovery results?)
As reported in Section 4.2.2.3, the participants strongly considered the architectural views produced by RELAX to be helpful in their maintenance tasks. This constitutes the answer to SQ2.
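The averages and speedup factors reported in Table 4.3 can be reproduced from the recorded timings, with DNF counted as the 60-minute cap and the parenthesized variants adding the 4-minute recovery runtime:

```python
DNF = 60.0              # a Did-Not-Finish attempt counts as the 60-minute cap
RECOVERY_RUNTIME = 4.0  # RELAX recovery runtime, rounded up to 4 minutes

task1_without = [DNF, DNF, 6.00, DNF, DNF]      # participants 1-5
task1_with    = [7.08, 2.50, 5.00, 6.00]        # participants 6-9
task2_without = [DNF, 4.50, DNF, DNF]           # participants 6-9
task2_with    = [3.00, 2.00, 2.00, 1.00, 3.00]  # participants 1-5

def avg(xs):
    return sum(xs) / len(xs)

for name, without, with_ in (("Task 1", task1_without, task1_with),
                             ("Task 2", task2_without, task2_with)):
    base = avg(without)
    fast = avg(with_)
    fast_incl = avg([t + RECOVERY_RUNTIME for t in with_])
    print(f"{name}: {base:.2f} min vs {fast:.2f} min, "
          f"speedup {base / fast:.2f} ({base / fast_incl:.2f} incl. recovery)")
```

Running this reproduces the averages of 49.20 and 46.13 minutes without recovery results and the speedup factors of about 9.56 (5.38) and 20.97 (7.44).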
Summary of findings:
• Using RELAX architecture recovery results reduces the time to find the location in the code where maintenance needs to be performed.
• Study participants strongly considered the architectural views produced by RELAX to be helpful in their maintenance tasks.

4.3 Hypothesis 3 (Scalability)
Scalability Hypothesis: RELAX scales efficiently to large system sizes in SLOC.

With code sizes of software systems continuing to grow [62, 89], efficient scalability in SLOC size is an important utility indicator for a software architecture recovery method. A recovery method would not be of much use if its performance slowed down considerably when applied to larger system sizes in SLOC.

Since classification is done for each file individually and independently, and the time of an individual classification depends on the size of the file to be classified, the time required to recover a system’s architecture should scale linearly with the overall size of the files in a system to be classified, or the lines to be processed. When trying to determine the scalability of RELAX over different input sizes, several considerations come into play. The first is which parts of a run of RELAX should be considered part of the recovery. These should be the ones without which no recovery can take place: the classification and the clustering. Two parts which are not part of the recovery proper are the code count and the visualization. This is because the code count stays the same if a different classifier is used, and it is only used by the visualization, which is optional.

4.3.1 Scalability by Design
Another aspect is which size metrics are most appropriate for determining scalability. Taking into account that RELAX only recovers source files in one language at a time, all other files should not be considered. One measure of productivity and maintainability is source lines of code (SLOC). SLOC, however, can be measured physically or logically.
Physical SLOC are simply the lines of code in a source file, no matter how they are formatted. Logical SLOC are based on the concept of a logical statement. The two do not map to each other 1:1 because, depending on formatting and programming styles, several statements can appear in a single line, but, conversely, a statement can also span several lines. Several tools exist to count physical SLOC (PSLOC) and logical SLOC (LSLOC). Two tools that have proven reliable are UCC [148] and cloc [35]. Both have different features and strengths. UCC counts PSLOC and LSLOC and reports them for individual files, but even though the latest version is able to run multi-threaded, it has not been able to count the code of some major systems within a manageable time frame. (A run on the source code of a Chromium version from April 2016 was aborted after 24 hours, even though checks on duplicate files were disabled.) The cloc tool does not report PSLOC and LSLOC directly, but shows three categories of lines: blank, comment and code. It only offers a summary for each programming language and does not list individual files. The great advantage of cloc over UCC is that it scales much better: it has been able to generate a report on the version of Chromium named above in 21 minutes.

4.3.2 Measured Scalability
In order to determine how the performance of RELAX changes with system size, its performance was measured with 15 versions of Apache Hadoop (hadoop.apache.org), 7 versions of Apache Chukwa (chukwa.apache.org), and one version each of Log4j 2 (logging.apache.org/log4j/2.x/) and Chromium (chromium.org). Altogether, these comprised more than 2.45 million SLOC and 1.78 million LSLOC. Since cloc does not use the PSLOC/LSLOC terminology, it is of interest which of its three categories corresponds most closely to either the physical or logical SLOC of UCC. For the systems studied, the code size measurement of cloc has been virtually identical to the PSLOC counted by UCC.
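The mismatch between physical and logical SLOC can be illustrated with a toy counter. This is a naive sketch, reflecting neither UCC’s nor cloc’s actual counting rules; the semicolon heuristic breaks on semicolons inside strings or for-loop headers.

```python
# Hypothetical Java fragment: two statements share one physical line,
# while one statement spans three physical lines.
JAVA_SNIPPET = """\
// a comment line counts toward neither measure
int a = 1; int b = 2;
String s =
    "x" +
    "y";
"""

def toy_psloc(src: str) -> int:
    """Physical SLOC: non-blank lines that are not pure comments."""
    return sum(1 for line in src.splitlines()
               if line.strip() and not line.strip().startswith("//"))

def toy_lsloc(src: str) -> int:
    """Logical SLOC, naively approximated by statement-ending semicolons."""
    code = "\n".join(line.split("//")[0] for line in src.splitlines())
    return code.count(";")

print(toy_psloc(JAVA_SNIPPET), toy_lsloc(JAVA_SNIPPET))  # prints: 4 3
```

Here four physical lines contain only three logical statements, which is why PSLOC/LSLOC ratios such as the 1.38 reported below arise.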
The average ratio between PSLOC (UCC) and code (cloc) is 1.000345514, with a standard deviation of 0.000515912. Additionally, the average ratio between PSLOC and LSLOC as counted by UCC is 1.38, with a standard deviation of 0.04. Based on these two observations and the fact that cloc counted 495506 Java lines of code for the version of Chromium used, it can be reasonably assumed that it would have counted about 359480 logical SLOC for that system.

The scatter plots in Figure 4.4 and Figure 4.5 show my observations of how many PSLOC or LSLOC were processed per second, respectively, for the different numbers of SLOC in each of the 24 systems listed in Table 4.4. A linear trend line was fitted to both plots. Figure 4.6 shows the same observations for all systems shown in Table 4.4, also along with a trend line.

Figure 4.4: Physical SLOC per Second for Projects of Different Sizes (2015 Hardware)
Figure 4.5: Logical SLOC per Second for Projects of Different Sizes (2015 Hardware)
Figure 4.6: Recovery Speed in SLOC per Second for Projects of Different Sizes (2015 Hardware)
Figure 4.7: Recovery Speed in SLOC per Second for Projects of Different Sizes (2022 Hardware)

Table 4.4: RELAX Recovery Speed (2015 Hardware)

Name                 Version  Size in KSLOC  KSLOC/s
Apache Hadoop        0.1.0    16             7.5
Apache Hadoop        0.1.1    16             10.6
Apache Chukwa        0.5.0    37             14.5
Apache Chukwa        0.1.2    40             7.8
Apache Chukwa        0.6.0    41             15.6
Apache Chukwa        0.7.0    42             16.2
Apache Hadoop        0.10.1   52             11.7
Apache Hadoop        0.11.2   55             16.6
Apache Chukwa        0.2.0    70             20.8
Apache Chukwa        0.3.0    83             19.1
Apache Struts Core   2.5.2    84             9.7
Apache Chukwa        0.4.0    92             19.3
Apache Hadoop        0.14.0   97             16.8
Apache Hadoop        0.14.1   97             18.6
Apache Hadoop        0.14.2   97             18.3
Apache Hadoop        0.14.3   97             15.6
Apache Hadoop        0.14.4   97             13.3
Apache Hadoop        0.15.0   111            11.4
Apache Hadoop        0.15.1   111            14.7
Apache Hadoop        0.15.2   111            15.9
Apache Hadoop        0.15.3   111            14.2
Apache Hadoop        0.17.0
116            14.8
Apache Log4j 2       2.8      123            7.1
Apache Hadoop        0.16.0   142            15.3
Apache Tomcat        8.0.33   298            15.9
Apache Tomcat        8.0.38   301            19.3
Apache Tomcat        8.0.37   300            18.7
Apache Tomcat        8.0.39   301            17.7
Apache Tomcat        8.0.42   302            19.4
Apache Tomcat        8.0.36   300            18.7
Apache Tomcat        8.0.35   299            18.3
Apache Tomcat        8.0.41   302            17.7
Apache Tomcat        8.0.43   302            18.8
Apache Jackrabbit    1.6.5    250            12.5
Apache Jackrabbit    2.0.0    277            14.5
Apache Jackrabbit    2.7.0    326            14.0
Apache Jackrabbit    2.3.5    317            13.5
Apache Jackrabbit    2.3.6    318            13.1
Apache Jackrabbit    2.10.5   339            14.0
Apache Jena          2.10.0   322            9.1
Apache Jena          2.7.2    326            12.1
Chromium             4/16     496            16.9

Table 4.5: RELAX Recovery Speed (2022 Hardware)

Name                            Version          Size in KSLOC  KSLOC/s
Apache Chukwa                   0.1.2            40             49.5
LibreOffice                     4/24/22          274            35.8
Apache OFBiz                    18.12.05         281            34.7
Apache Tomcat                   8.0.33           298            36.0
Apache Axis2/Java               1.8.0            300            35.3
Apache River                    3.0.0            309            37.4
Apache Jackrabbit               2.20.2           343            33.4
Apache ActiveMQ Artemis         2.19.1           580            29.2
Apache Drill                    1.19.0           607            26.8
Apache CloudStack               4.16.1.0         685            34.7
Vespa                           GitHub 4/30/22   736            40.1
Apache NiFi                     1.16.1           817            35.9
Apache Harmony                  6.0-src-r991881  1177           33.5
Apache Ignite                   GitHub 4/22/22   1344           32.0
Apache Geode                    GitHub 4/22/22   1402           36.9
Apache Flink                    GitHub 4/22/22   1510           35.9
Apache Camel                    GitHub 4/22/22   1741           31.8
IntelliJ IDEA Community Edition 4/24/22          4458           32.6
OpenJDK                         4/24/22          5275           44.5
Apache NetBeans                 GitHub 4/21/22   5786           35.4

After these tests, which were conducted on 2015 hardware, another test series was run on newer hardware from the year 2022; its results are therefore plotted in a separate diagram, as shown in Figure 4.7 and Table 4.5. The new series focused on larger systems, but also retained some of the previous ones in order to cover system sizes from just under 40 KSLOC (Apache Chukwa 0.1.2) to 5.8 MSLOC (Apache NetBeans).
In order to check whether the speed of recovering smaller systems is the same as that of larger systems, I divided the systems into two sets. One set comprised the smaller systems with less than 700 KSLOC, the other the larger ones with more than 700 KSLOC. I then ran a Mann-Whitney U test with the following null hypothesis:

Null hypothesis: For randomly selected RELAX software architecture recovery speeds X and Y from the group of small systems and the group of large systems, the probability of X being greater than Y is equal to the probability of Y being greater than X. The distributions of both populations are equal.

Table 4.6: Mann-Whitney U Test for Smaller and Larger Systems

Group        Smaller Systems  Larger Systems
Sample Size  10               10
Median       35613            34988
Rank Sum     110              100
U            45               55

The result of the Mann-Whitney U test is shown in Table 4.6. Based on it, the null hypothesis cannot be rejected, and the difference between the two groups is not statistically significant.

The observations confirm that, as expected, the number of PSLOC or LSLOC that can be processed over a given unit of time does not decline with the size of the system. (The trend lines show increasing performance with an increase in SLOC. This may be an artifact of the underlying operating system or Java virtual machine and is not expected to be sustained for bigger system sizes.) Since each file is classified individually and independently of any other, and the time of an individual Naïve Bayes classification depends only on the size of the file to be classified [79], the time required to recover a system’s architecture scales linearly with the overall size of the files to be classified, or the lines to be processed.
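The U statistics in Table 4.6 follow from the rank sums: with a rank sum of 110 over the ten smaller systems, U = 110 − 10·11/2 = 55, and symmetrically 100 − 55 = 45. Since the raw recovery speeds are not reproduced here, the sketch below demonstrates the computation on small hypothetical samples.

```python
def mann_whitney_u(xs, ys):
    """Return (U_x, U_y) via rank sums; tied values receive average ranks."""
    pooled = sorted([(v, "x") for v in xs] + [(v, "y") for v in ys])
    values = [v for v, _ in pooled]
    rank_sum_x = 0.0
    i, n = 0, len(values)
    while i < n:
        j = i
        while j < n and values[j] == values[i]:
            j += 1  # values[i:j] are tied and share ranks i+1 .. j
        avg_rank = (i + 1 + j) / 2
        rank_sum_x += avg_rank * sum(1 for k in range(i, j)
                                     if pooled[k][1] == "x")
        i = j
    nx, ny = len(xs), len(ys)
    u_x = rank_sum_x - nx * (nx + 1) / 2
    return u_x, nx * ny - u_x

# Hypothetical recovery speeds in KSLOC/s; U_x counts the pairs in which
# an x observation exceeds a y observation.
print(mann_whitney_u([36.0, 35.6, 34.7], [35.9, 34.9]))  # (3.0, 3.0)
```

The smaller of the two U values is then compared against the critical value for the given sample sizes to decide whether the null hypothesis can be rejected.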
Summary of findings:
• The distributions of recovery-speed timings for small systems and large systems are equal.
• The current base classification performance of RELAX is 27 KSLOC/s, with a median of just below 35 KSLOC/s. This allows us to expect to recover the architecture of a 1 MSLOC system in around 37 seconds.
• Therefore, RELAX scales efficiently to large system sizes in SLOC.

4.4 Hypothesis 4 (Modularity)
Modularity Hypothesis: RELAX can recover a system and its modules (such as frameworks) separately or together without adding recovery effort.

There exist situations in which it may be of interest to recover only a part of a system, or to split up the recovery of a system so that several parts can be recovered separately and the overall recovery result composed from them. One situation is when, between two different versions of a system, only some code entities have been added, deleted or modified. In this case, the stakeholders may be interested in running differential architecture recoveries, in which only the source code entities that have been added or changed are evaluated by RELAX, while the results for the remainder of the system are retained and reused. Another situation is when a system with a very large code base needs to be fully recovered. There, stakeholders may be interested in distributing the recovery of major systems over several machines or threads on the same machine. The latter can also be of general interest for its potential to speed up any recovery that processes more than a single source code entity.
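Because each entity is classified independently, composing the recoveries of parts should yield the same result as recovering the whole system. The following sketch illustrates this with a hypothetical stand-in classifier (RELAX’s actual classifier is a trained text classifier, not shown here):

```python
def classify(path: str) -> str:
    """Hypothetical stand-in for RELAX's per-file concern classifier."""
    return "security" if "auth" in path else "storage"

def recover(files):
    """Recover a (partial) system by classifying each file independently."""
    return {f: classify(f) for f in files}

system = ["core/auth/Login.java", "core/db/Store.java", "util/Io.java"]
partition = [system[:1], system[1:]]  # e.g., recovered on two machines

whole = recover(system)
merged = {}
for part in partition:
    merged.update(recover(part))

print(whole == merged)  # True
```

Since the per-file work is identical in both cases, the total effort of the partitioned recovery should equal that of the whole-system recovery, which is exactly what the empirical study below examines.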
To make these distributed or partial recovery approaches attractive requires that no significant additional effort be expended by such a recovery. The design of RELAX is such that it composes its architecture recovery results from the classifications of individual code entities. Therefore, the recovery effort for a given code entity should not depend on the surrounding system in which it is recovered (e.g., as part of different systems or individually). For each code entity, the effort stays the same: classify it as a body of text. However, an empirical study was still conducted in order to make sure that this holds true in practice.

4.4.1 Runtime Variations
When running the same program on a computer multiple times, its runtime is not constant [28]. Factors that cause these variations include system load, OS activities, disk and CPU cache states, system temperature (which can cause throttling), GUI state (where does the textual output go?), the JVM (garbage collection) [8, 154] and many others. Some of these factors are not controllable (caches); others are somewhat controllable by exiting other programs. On a modern operating system like macOS (which was used for the studies), it is, however, not practically possible to exit all other programs and prevent the OS from running additional ones. Cron jobs and other factors still cause fluctuations. Another aspect is that the Apple M1 architecture, on which the tests were run, has two different types of processor cores (efficiency and performance) with considerable speed differences [34, 81], and the scheduler in macOS decides whether tasks or parts thereof are fully or partially executed on one or the other type of CPU core. No command to set core affinity currently exists in macOS. A similar differentiation of processor cores has also been implemented for the more common Intel CPU architecture and is therefore likely to further complicate benchmarking.
4.4.2 Research Question
Research Question: Using RELAX, is there an average runtime difference of 5% or greater between classifying the code entities of a software system (1) together or (2) as the sum of its separately classified parts?

4.4.3 Study Setup
Five systems were partitioned into between 2 and 17 parts, as shown in Table 4.7. This selection was made in order to cover the range of code sizes in open source Java projects as published by the Apache Software Foundation (projects.apache.org/statistics.html). For each system, the following procedure was applied:

1. Run recoveries of the system and its partitions 20 times.
2. For each run, determine the difference between the recovery time for the complete system and the sum of those of the parts in the partition.
3. Determine the percentage of that difference.
4. Determine if the percentages of all differences are normally distributed.
5. If yes, calculate the boundaries of the confidence range in which 95 percent of all observations should occur.
6. If the mean ± 5% is outside of that boundary, we can reject the null hypothesis.

4.4.4 Study Results

Table 4.7: Modular Performance Results
(columns: Apache NetBeans | Apache Jackrabbit | Apache Camel | Apache NiFi | Apache Harmony)

Version: 15 (GitHub 4/22/22) | 2.20.5 | 3.16.0 (GitHub 4/22/22) | 1.16.1 | 6.0-src-r991881
SLOC: 5785878 | 342715 | 1741173 | 816597 | 1177337
Slices: 2 | 3 | 9 | 17 | 3
Mean difference full vs.
Sum of Slices in % 1.28 1.75 -0.27 -1.62 0.30 Shapiro-Wilk Normality Yes Yes Yes Yes Yes d’Agostino- Pearson Normality Yes Yes Yes Yes Yes Standard Deviation of Runtime Differences in % 0.56 0.53 0.17 0.76 0.11 System Apache NetBeans Apache Jackrabbit Apache Camel Apache Nifi Apache Harmony Continued on next page 105 Table 4.7: Modular Performance Results (Continued) Minimum Observed Difference in % 0.23 0.73 -0.59 -2.61 0.10 Maximum Observed Difference in % 2.27 2.36 0.03 -0.14 0.45 95% Confidence Range Lower Bound in % 0.18 0.71 -0.60 -3.11 0.09 95% Confidence Range Upper Bound in % 2.37 2.78 0.06 -0.12 0.51 Mean± 5% Outside of the Confidence Range? Yes Yes Yes Yes Yes System Apache NetBeans Apache Jackrabbit Apache Camel Apache Nifi Apache Harmony Table 4.7 shows the results. Summary offindings: For all five systems studied and two different partitions of each system, with a likelihood of 95%, the true mean RELAX recovery runtime is between two upper and lower bounds that are each 5% different from the average RELAX recovery runtime of the unpartitioned system. 106 Chapter5 RelatedWork This chapter covers related work that is relevant for comparisons or has aspects that make it partially similar to RELAX. 5.1 ArchitectureRecoveryMethods 5.1.1 Bunch The Bunch clustering tool [95, 104, 96] automatically creates a system decomposition by treating clustering as an optimization problem. In order to achieve an optimal partitioning of the dependency graph, it groups code entities into subgraphs using criteria such as inter-connectivity and Modularization Quality. While it comes with an optimal clustering algorithm, beyond a certain number of modules (15 according to the authors) the execution time for that algorithm becomes prohibitive. Bunch therefore uses a sub-optimal algorithm that starts with a random partition and tries to improve on it until no better neighboring par- tition can be found anymore. 
It employs search algorithms such as genetic algorithms, hill climbing and simulated annealing toward this goal.

5.1.2 LIMBO

LIMBO, short for scaLable InforMation BOttleneck, is a scalable hierarchical clustering algorithm based on the minimization of information loss [3]. It tries to create decompositions that convey as much information as possible by choosing clusters that represent their contents as accurately as possible. Since LIMBO can incorporate any type of information relevant to the software system and assign weights to different features, it can be used to assess which attributes are useful in architecture recovery.

5.1.3 WCA

The Weighted Combined Algorithm (WCA) [98, 97] is a hierarchical agglomerative clustering algorithm that recovers components. Similar to LIMBO, WCA also tries to reduce the information lost during cluster creation by combining features of entities in the cluster. WCA was evaluated against LIMBO on four systems, with the outcome that either of them can produce the more accurate results, depending on the situation.

5.1.4 ACDC

The ACDC (Algorithm for Comprehension-Driven Clustering) software clustering algorithm [147] targets program comprehension by using subsystem patterns to recover components. It bounds the size of each cluster (the number of software entities in the cluster) based on its configuration and provides a name for the cluster based on the names of files in the cluster. ACDC's view is oriented toward components that are based on structural patterns (e.g., a component consisting of entities that together form a particular subgraph).

5.1.5 PKG

PKG [72] is very simple in that it only recovers the package-level structure view of a system's implementation. It produces an objective but not architecturally satisfying view in that it stays at the surface instead of trying to help its user determine why the system is built the way it is.
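Because PKG reduces to reading off the package structure, its essence can be sketched in a few lines: group fully qualified class names by their package prefix. The class names below are hypothetical, and this is an illustration of the idea rather than the PKG implementation itself:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

public class PkgView {
    // Groups fully qualified class names by their package prefix,
    // which is essentially the package-level view PKG recovers.
    static Map<String, List<String>> clusterByPackage(List<String> classes) {
        Map<String, List<String>> clusters = new TreeMap<>();
        for (String fqn : classes) {
            int dot = fqn.lastIndexOf('.');
            String pkg = dot < 0 ? "(default)" : fqn.substring(0, dot);
            clusters.computeIfAbsent(pkg, k -> new ArrayList<>()).add(fqn);
        }
        return clusters;
    }

    public static void main(String[] args) {
        List<String> classes = List.of(
            "org.example.io.Reader", "org.example.io.Writer", "org.example.net.Client");
        clusterByPackage(classes).forEach((pkg, members) ->
            System.out.println(pkg + " -> " + members));
    }
}
```

The simplicity of this mapping is also why PKG is deterministic and cheap to run, but reveals nothing about the reasoning behind the structure.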
5.1.6 ARC

ARC (Architectural Recovery using Concerns) [49] leverages an information retrieval technique called Latent Dirichlet Allocation (LDA) in order to find concerns and compute the similarity between them [50]. In it, the software system is represented as a set of documents called a corpus. Individual documents within it are "bags of words". Each document can contain different topics, which stand for concerns. In the output, topics are represented by the words that are most likely to appear in them, in descending order. It is also determined how relevant a topic is to each document in the corpus. Code entities are clustered based on their similarity to the detected topics. The number of topics and clusters is set via parameters. ARC's view produces components that are semantically coherent due to sharing similar system-level concerns (e.g., a component whose main concern is the handling of distributed jobs). The architectures recovered by ACDC and ARC are complemented with each system's package-structure view extracted by PKG. While ARC purports to find where concerns are addressed in the system and to what extent, its usefulness and scalability are limited:

• It is non-deterministic, because the LDA algorithm [18] on which it is based is non-deterministic [90].
• A significant amount of work needs to be done before and after each individual recovery run.
• Good input parameters need to be determined through open-ended experimentation.
• The quality of the resulting concerns is questionable.
• Its output is disproportionately sensitive to even the smallest changes in its input systems.
• It does not scale up well with the size of input systems.
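Abstracting away from LDA itself, the clustering step amounts to comparing how similar the topic mixtures of two code entities are. The sketch below uses cosine similarity over topic-proportion vectors; both the measure and the vectors are illustrative assumptions on my part, not ARC's exact definitions:

```java
public class TopicSimilarity {
    // Cosine similarity between two topic-proportion vectors.
    // The choice of measure is illustrative; the ARC papers define their own.
    static double cosine(double[] a, double[] b) {
        double dot = 0, na = 0, nb = 0;
        for (int i = 0; i < a.length; i++) {
            dot += a[i] * b[i];
            na += a[i] * a[i];
            nb += b[i] * b[i];
        }
        return dot / (Math.sqrt(na) * Math.sqrt(nb));
    }

    public static void main(String[] args) {
        double[] fileA = {0.7, 0.2, 0.1};  // mostly topic 0
        double[] fileB = {0.6, 0.3, 0.1};  // similar mix, would cluster with fileA
        double[] fileC = {0.1, 0.1, 0.8};  // dominated by topic 2
        System.out.printf("A~B: %.3f, A~C: %.3f%n",
                cosine(fileA, fileB), cosine(fileA, fileC));
    }
}
```

The limitations listed above stem from the step this sketch hides: the topic vectors themselves come from a non-deterministic LDA run, so the similarities, and therefore the clusters, can change between runs on identical input.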
5.1.7 MUDABlue

The MUDABlue method [66] takes the source code of software systems and employs LSA (Latent Semantic Analysis) to sort systems into a set of categories that are not predefined but generated by the method itself. One system is allowed to belong to multiple categories. The method does not scale to large systems. MUDABlue does not recover the software architecture of a system.

5.1.8 Revealer

Revealer is a lightweight source model extraction tool that combines lexical with syntactic analysis capabilities [124]. It requires users to define patterns in a specification language, which are then used to extract code patterns. From this, a view of the system is extracted.

5.1.9 MoDisco

MoDisco [24], a model-driven reverse engineering framework, supports software reverse engineering and modernization. One of its many features is building models from a system's implementation, which can then be used in order to improve the maintenance and evolution processes. MoDisco can be installed as a plug-in [40] to the Eclipse integrated development environment [142].

5.1.10 Quadratic Assignment Problem

An unnamed method solves the recovery of layered architectures of object-oriented systems as a quadratic semi-assignment problem [71] with the designer's input [13].

5.1.11 Genetic Black Hole Algorithm

The genetic black hole algorithm [64] is a hybrid clustering technique formed from a genetic algorithm [156] and the nature-inspired black hole algorithm [55]. A study [72] showed that two of the techniques - ACDC and ARC - exhibit significantly better accuracy and scalability than the remaining clustering-based techniques, and that they produce complementary architectural views.

5.2 Other Related Work

Clustering has been suggested on the basis of file names and file naming conventions [5, 6].
5.3 Value of Recovery Methods for Maintenance

In preparation for research on RELAX, I undertook a review of existing recovery methods to determine whether they have attributes desirable for supporting stakeholders in maintenance [85].

5.3.1 Research Questions

For each selected recovery method, the following questions were asked:

• RQ1: Does the output represent an architecture?
• RQ2: How feasible is it to recover a given system?
• RQ3: Does the output explain itself clearly?
• RQ4: Are changes in the output proportional to those in the input?
• RQ5: Is it deterministic?
• RQ6: Are results continuous?
• RQ7: Can change be isolated?

5.3.2 Selection of Recovery Methods

For practical reasons, the sample of recovery methods was limited to those which

• Have working implementations available
• Have their source code available
• Can be run on current commodity hardware and operating systems
• Did not require purchasing any software licenses for the recovery method implementation itself or its dependencies
• Are well-documented

Additionally, recovery methods that had been evaluated against ground-truth architectures were favored. For this reason, out of the recovery methods mentioned in [48], I chose PKG, ACDC and ARC. Running Bunch [95] and Limbo [3] was not possible. An implementation of Limbo was not available. While there is an apparently working implementation of Bunch available, it relies on another tool called chava to generate its input from a Java system's source files. That tool could not be located.
5.3.3 Selection of Subject Systems

My criteria for selecting systems were that a system be:

• Well-known, or one that has had recovery methods run on it in published literature
• Written in Java or C/C++ (so ACDC and ARC could work on it)
• Rich in meaningful words in its source code (important for ACDC and ARC)
• Able to be compiled without modifications
• Open-source

For this reason, Google Android 1, Apache Hadoop 2 and Apache Chukwa 3 were selected.

5.3.4 Computing Environment and Parameters

In order not to impose tight limitations on the recovery methods, they were run on an 8-core Xeon system with 64 GB of main memory. All recovery methods were run without modifications of their source code and with their default settings, unless otherwise indicated. Since a recovery with ARC requires setting the number of desired concerns as a parameter, 100 was chosen for all systems.

5.3.5 Results

5.3.5.1 RQ1 (Architectural Output)

For the architectural view that the output of a recovery method constitutes to be considered an architecture, it has to reveal a structure as well as a reason for that structure (compare Section 2.7.7). When a recovery method clusters source code entities, it will always produce a structure. To what extent it represents an architecture then depends on the reasoning it provides for the structure. The richer that reasoning is, the better the case for the recovery method being architectural becomes.

PKG The output of PKG clusters the system's code entities by package, reflecting the package structure of the system. No attempt whatsoever is made to reveal the reasoning behind the creation of the packages and the placement of the code entities in them. Considering also that packages are incidental to any software system, the output of PKG cannot be considered to represent an architecture.
1 https://www.android.com
2 https://hadoop.apache.org
3 https://chukwa.apache.org

ACDC ACDC clusters code entities based on subsystem patterns and tries to assign meaningful names to the clusters. While the programming patterns constitute a reason for the structure, they themselves are incidental to the programming process. With the exception of systems serving as examples for learning system building, systems are not designed to exhibit programming patterns; rather, these patterns arise in the course of implementing the system's functionality. To the extent that the names ACDC assigns its clusters reveal meaningful information about the system's design, they reveal the reasoning behind the structure. Therefore, ACDC will produce architectural output when they do. This puts it a step above PKG.

ARC ARC clusters code entities based on topics derived from words found in them. The clusters provide a structure, and the topics provide reasoning behind the structure as long as they are meaningful.

5.3.5.2 RQ2 (Feasibility)

How feasible it is to use a recovery method depends on how much time and space it needs to be run on different existing systems and arrive at a satisfyingly accurate architectural view. This can be observed in absolute terms and predicted from its time and space complexities. Rather than analyzing each algorithm found in the recovery methods for its complexity, I will make some simpler observations: (1) If the recovery method crashes when applied to one or more software systems, then it cannot be applied to all systems. (2) If a method needs to be re-run an indeterminate number of times, then its runtime cannot be predicted and is not guaranteed to have an upper bound. (3) If the recovery method is guaranteed to produce an accurate result after the first run, I can experiment with running it on several systems of different sizes.

PKG PKG bases its view solely on the packages found in a system and their contents.
This simply reflects the unambiguous information found in a Java system's class files. There is no room for differences between two runs regarding which source code entities and packages exist and which entities the packages consist of. Therefore, there is no need to re-run it in order to arrive at better results, since no variation in its results can exist. Running it once on any system will suffice.

ACDC In order to optimize its accuracy, ACDC has to be re-run an unknown number of times (see Section 5). Since the expert decompositions necessary to assess its accuracy were not available, I could not compare the relative accuracies of different results and ran it once for each of the selected systems. It bears mentioning that ACDC crashed repeatedly when I attempted to recover the current version of Android with it. Since I was able to recover smaller systems with it, I suspect that the considerable size of the Android source code was the culprit. I have been unable to establish a relation between available hardware resources and the sizes of recoverable systems. In any event, based on this situation, ACDC is not applicable to all sizes of software systems.

ARC What I said in Section 5.3.5.2 also applies to ARC. I have observed that ARC exits attempts to recover the current version of Android with a java.lang.OutOfMemoryError exception, indicating that its virtual machine has run out of memory. Similar to ACDC, this seems to be caused by the size of Android, since I have not made any similar observations on smaller systems.

Summary While PKG is feasible to run on any system, the results for ACDC and ARC are variable and unpredictable, without a guarantee that a result will be produced at all.

5.3.5.3 RQ3 (Clarity)

For clarity, I looked at whether a user receives an explanation for why a given entity ended up in a specific cluster, unless it is immediately obvious.
PKG Since for this recovery method all code entities are grouped and clustered by packages, no further explanation is needed.

ACDC No explanation is provided of which specific programming pattern or set of patterns determined the grouping of a given entity into a specific cluster. Since this is not obvious, the user will not have this information.

ARC ARC does not output its generated topics in one or more files; rather, they are displayed on the standard output while it runs and can be captured from there. Furthermore, it does not explain which topic an entity was clustered by, only providing the number of the topic. (It would be possible for the user to relate the numbers to the topics displayed on the standard output, but this functionality would have to be provided by the user.) Even if it did so, for the reasons behind the clustering to be clear, the topics would have to be understandable, which is not always the case, as we can see in Section 5.3.5.8.

5.3.5.4 RQ4 (Proportionality)

Similarly to the considerations on feasibility (see Section 5.3.5.2), instead of committing to what exactly would be required of a result in order for it to be considered appropriately proportional to the changes in the input, I consider situations in which this would not be the case. One such situation is if the same, unchanged input can lead to significantly different results. Note that this requirement is stronger than non-determinism alone, which may be locally contained or only lead to a small difference when measured by the methods described in Section 2.7.15. Another case where a change in the result is disproportionate is when the results are deterministic but labile, i.e., they change profoundly when only one bit is changed in the input.

PKG PKG exhibits no disproportionate change in its output for changed input.

ACDC Similar to PKG, ACDC did not react disproportionately to minor changes in the input.
ARC In order to check the proportionality of ARC, I changed one letter in a comment in an arbitrary source file. The MojoFM value between the recovery of the original system and that of the changed system was 61.4%, corresponding to a change of 38.6%. The value for a2a was close to 90%, the change being 10%. This is out of proportion to changing a single character.

5.3.5.5 RQ5 (Determinism)

PKG PKG is deterministic for the reasons laid out in Section 5.3.5.2.

ACDC ACDC as implemented in ARCADE is deterministic.

ARC ARC as implemented in ARCADE should be deterministic considering the fixes made to it, but surprisingly, the empirical results have shown it to be non-deterministic. MojoFM values between recoveries of the same version of Apache Chukwa averaged less than 72%, meaning that they differ by more than 28%. The average for a2a is 89.3%, indicating a difference of 10.7%.

5.3.5.6 RQ6 (Continuity)

For continuity, it was evaluated whether it is possible to go from one state to another through a series of intermediate states that show a "direction". In other words, if we follow the paradigm of the recovery methods, is it possible to do partial work that is reflected in the resulting architectural view?

PKG Continuity is possible. Consider moving two entities from package A to package B, one at a time. In the initial state, both entities would be clustered under package A. In the final state, they would both be under package B's cluster. In the intermediate state, when only the first entity has been moved, that entity will be clustered in B while the second one remains in A.

ACDC While the lack of clarity of ACDC results makes this uncertain for all cases, continuity should be possible for small modifications.

ARC This is not possible: as Section 5.3.5.4 showed, even a minimal change leads to a profoundly different view.

5.3.5.7 RQ7 (Isolation, Modularity, Predictability)

Here we look for crosstalk, which means that changes in one part of the system can have an effect on an unrelated one.
PKG Adding, moving or deleting entities only affects the packages in which it happens; therefore there is no crosstalk.

ACDC The exact impact of a change cannot always be predicted, given that ACDC does not explain its results or which combination of patterns most influenced them.

ARC As we have already seen above (see Section 5.3.5.4), any change in the system, no matter how small, leads to a considerably different view that reorganizes clusters everywhere. It is therefore not possible to isolate change in ARC.

5.3.5.8 Topic Quality

Evaluated Systems I have evaluated versions of Apache Hadoop, Android and Apache Chukwa. Apache Hadoop 3.1.1 is the current version of Apache Hadoop as of this writing. Since space considerations make showing all top words in the 100 topics generated for it prohibitive, the following is a relevant sample:

Hadoop 3.1.1 ARC topic model concern samples:

1. qe uu vv nb eq xx unie ip ey wf dd wr wo wn ww kw mb kg sn uy oz cv cg vy yr db jg ky ov ty aa um te wy ed ib ob ae nn xr de zm sy mm io wu le ue vm
2. xe xa xb xd xf xc xbb xdb xcb xee xbf xad xff xef xba xfd xfe xdc xeb xcd xea xaf xfb xcc xec xaa xca xbc xed xfa xcf xdf xfc xdd xbd xce xac xbe xda xab xae xde ca fe bt cf fu da fs
3. version resolved https registry tgz yarnpkg dependencies bb cc df cb cd ab ff dc de da dd ad fe fc fb ae aa ed ef bf fd ea db eb ce bd ba af bc fa cf ee ac crc ec ca
4. file files set data configuration directory default user cluster number support list change note html system time
5. path file fs dir filesystem directory delete files filestatus exists create
6. path uri fs filesystem final fc ioexception throws hadoop file src filestatus filecontext
7. path json response class string mediatype exception application jsonobject
8. public return override string
9. public return override private void import long boolean null int class super protected string apache extends
10. hadoop code apache org file
11. hadoop code file apache org path fs
12. hadoop org apache jar
13. hadoop jira apache issues https org
14. hadoop dir home classpath

Immediate observations include that

• The first three topics contain combinations of two or three letters that will be meaningless, at least to outsiders and newcomers to Hadoop.
• The next three topics appear to be related to file handling.
• The presence of "uri" in the sixth and of "json", "response" and "jsonobject" in the seventh topic makes it seem likely that "path" may be used in different contexts in different topics, considering that it can mean a path to a file in a filesystem or a path to a resource on the web. (Note that all words in all topics are converted to lowercase.)
• The most prominent words in topics 8 and 9 are Java keywords.
• The remaining five topics all start with "hadoop". (This word appears in 67 of 100 topics overall.)

The following is a selection of 5 from 30 concerns that were automatically recovered by ARC when recovering Apache Chukwa 0.7.0 4:

Chukwa 0.7.0 ARC topic model concern samples:

1. apach file org softwar hadoop chukwa write http addit inform copi notic foundat basi agre contributor permiss specif complianc express
2. file path day chukwa hour apach record hadoop sourc org format cluster log conf merg dest status roll folder job
3. file chunk chukwa apach test org hadoop conf writer key archiv impl configur stream cluster sequenc print reader seq util
4. path conf log file status error org apach chukwa configur loader hadoop param util uri logger nagio level factori warn
5. file path dir log conf chukwa read local delet directori apach hdfs gold len configur write hadoop absolut exist sys

4 https://chukwa.apache.org

Figure 5.1: The Apache License, Version 2.0

Immediate observations include that

1. All five topics shown above prominently include the words "file", "apach", "chukwa" and "hadoop".
2. All five topics seem to be related to writing or logging files.
3.
17 topics contain the word "chukwa" and 13 contain "hadoop". (Space considerations prohibit showing all topics here.)
4. Many words in the first topic correspond to words or parts of words from the Apache software license that Chukwa is distributed under (see Figure 5.1). These occurrences include "softwar", "inform", "notic", "foundat", "agre", "contributor", "permiss", "complianc" and "express".

Results applying to all systems All systems exhibit common phenomena.

1. In cases where several or even all source code entities feature software license notices, one or more topics will be formed from them. (This is due to these words occurring together frequently.)
2. Some topics cannot be assigned a meaning (such as the first three topics for Apache Hadoop).
3. The same concern, such as file handling, can be split over several topics.
4. The name of the system is prominently featured in several concerns.
5. Topics are formed from words that are incidental to a system and its functionality, such as licensing or the name of the system. Such topics cannot be regarded as concerns since they are not related to the system's role, responsibility, concept, or purpose (compare the definition of a concern in Chapter 2).

The basic premise of ARC is that each detected topic maps to one concern. As has been shown above, this premise does not hold. Because the basic building blocks are flawed, it is not possible to use an automatic run of ARC as the basis for an architectural view that reflects the concerns present in a system. Further conclusions drawn from the architectural view are just as unwarranted. For this reason, my conclusions about its shortcomings remain.

5.3.6 Conclusion

5.3.6.1 ACDC

How useful ACDC is for the maintenance of a given system depends on whether circumstances the user cannot predict turn out to be favorable. Even whether or not its results are architectural in nature can vary from one case to another.
This not only forces the users to make this determination, but also makes it impossible to rely on automatically gained architectural views from it as a basis for further processing.

5.3.6.2 ARC

Several issues exist that impact the quality of ARC's recovery results:

• Low-quality topics with no discernible meaning
• Code entities are assigned extraneous concerns such as license wording (which is not addressed through programming)
• Repetitive topics
• Topics formed from words like the name of the system (which should be a stop word)

The deficiencies of the topics render clusters or architectural smells determined based on them questionable. Consider the examples of the architectural smells Scattered Parasitic Functionality [46] and Concern Overload [73]. Scattered Parasitic Functionality occurs when multiple components realize the same high-level concern while at least one of them also realizes an orthogonal concern. Detecting such a situation requires that the shared concern as well as the potentially orthogonal one are meaningful. As I have shown, this not only cannot be guaranteed, but is seldom the case. Even if this posed no problem, in order to attest this smell, the second concern would have to be orthogonal to the first one. Leaving aside the question of how this could be automatically determined for two arbitrarily generated topics, it is clear that since a topic cannot be orthogonal to itself, two topics that have the same meaning cannot be orthogonal to each other. As shown in Sections 2 and 5.3.5, there is no way to guarantee that topic modeling will automatically come up with topics that are meaningful and different from each other. The same issues also affect the detection of Concern Overload, which occurs when a component serves too many different concerns, i.e., the number of different concerns rises above a threshold.
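The detection rule just described amounts to counting distinct concerns per component against a threshold. A minimal sketch follows, with hypothetical component names, concern labels and threshold; the deduplication step is exactly where ARC's possibly identical topics become a problem, since two topics with the same meaning must not be counted as two concerns:

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class ConcernOverloadCheck {
    // Flags components whose number of *distinct* concerns exceeds a threshold.
    // Duplicate concern labels are collapsed before counting.
    static List<String> overloaded(Map<String, List<String>> concernsPerComponent, int threshold) {
        List<String> smelly = new ArrayList<>();
        for (Map.Entry<String, List<String>> entry : concernsPerComponent.entrySet()) {
            int distinct = new HashSet<>(entry.getValue()).size();
            if (distinct > threshold) smelly.add(entry.getKey());
        }
        return smelly;
    }

    public static void main(String[] args) {
        Map<String, List<String>> components = new LinkedHashMap<>();
        components.put("core", List.of("io", "networking", "gui", "security")); // 4 distinct
        components.put("util", List.of("logging", "logging", "logging"));      // collapses to 1
        System.out.println(overloaded(components, 3)); // prints "[core]"
    }
}
```

The sketch makes the objection concrete: if the labels handed to this check are repetitive or meaningless topics, the distinct-concern count, and with it the smell verdict, is unreliable.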
An overload of a component would require that the concerns are guaranteed to be actually different from each other, since otherwise they may be reduced to a number below the threshold used for this smell, or even to one. While it is possible to mitigate some of the deficiencies of ARC's topics by closely supervising its topic generation, this would add an unpredictable amount of time to its workflow.

5.3.6.3 PKG

PKG fulfills all our criteria for a valuable aid in maintenance, save for the one that requires its result to represent an architecture. This makes it unsuitable as an architecture recovery method, but leaves it usable for exploring structural differences between different system versions. Its plethora of desirable qualities makes it an interesting candidate for expansion into a full-fledged architecture recovery method.

5.3.6.4 Summary

All three recovery methods bring either potentially interesting views or desirable maintenance attributes to the table, but none of them has everything. The accuracy of the architectural views from ACDC and ARC remains hard to assess. None of the three recovery methods allows maintainers to establish the virtuous cycle shown in Figure 1.1.

5.3.7 Future Work

(Note that the following paragraph has since been realized in the shape of RELAX, but is left in to explain part of the motivation for its creation.) Since none of the recovery methods I considered fulfilled all or most of the criteria outlined in the introduction while also producing a meaningful architecture, my goal is to develop such a recovery method. A possible starting point could be an extension of PKG that produces a more meaningful recovery result than just a package structure while keeping its considerable number of desirable properties, as outlined in Section 5.3.6.3, intact. One approach under consideration is to enrich the results of PKG with elements from ACDC and ARC, such as programming patterns and concerns.
Considering the observed situations in which even comparatively large memory resources were not enough to recover the popular Android system, it is also of interest to me to develop a recovery method that can build up its recovery results from smaller parts that are computed in a distributed manner and then combine the partial results.

PKG PKG [72], the simplest approach to architecture recovery, is based on the package-level structure view of a system's implementation. This approach produces an objective but not architecturally satisfying view in that it stays at the surface instead of trying to help its user determine why the system is built the way it is. Other clustering techniques have been suggested on the basis of file names and file naming conventions [5, 6]. However, their assumptions about naming conventions are not always correct.

ACDC The ACDC (Algorithm for Comprehension-Driven Clustering) algorithm [147] uses structural relationships specified as patterns to create an algorithm for recovering components and configurations that bounds the size of each cluster (the number of software entities in the cluster) and provides a name for the cluster based on the names of files in the cluster. ACDC's view is oriented toward components that are based on structural patterns (e.g., a component consisting of entities that together form a particular subgraph).

ARC ARC [50] uses topic modeling to find concerns and combines them with structural information to automatically identify components and connectors [50]. The topic model employed in ARC is Latent Dirichlet Allocation (LDA) [18]. Using LDA, ARC can detect concerns in individual code entities and compute similarities between them. A software system's implementation entities, such as its source files, are represented as a set of documents (a corpus) and each document in turn as a "bag of words" [50]. Each document can be related to several different topics.
Based on those topics, the documents are clustered, using dependencies between them as structural information and concerns (the topics from the topic model) as features. It is very important to note that topic modeling as applied in ARC will not name the detected topics automatically, and is an iterative process that is not guaranteed to yield topics that are consistent or that a human being can name [85]. In contrast, document classification uses named topics from the outset. I have outlined issues with ARC in detail elsewhere [85] and will therefore limit myself to a short overview. They comprise

• Its handling of stop words,
• The selection of the number of topics to be detected,
• Topic quality,
• Determinism,
• Sensitivity to architectural change, and
• Scalability.

Chapter 6
Threats to Validity

This chapter covers cases in which the research undertaken may inadvertently not apply or may lead to false results.

6.1 Limitations of RELAX

RELAX employs text classification for its software architecture recovery. In order for RELAX (or any other recovery method based on natural language processing) to produce meaningful results, all of the following conditions need to hold:

1. The programming language the system is written in allows the addition of comments or the assignment of names to variables,
2. The code contains meaningful comments or variable names,
3. The comments and variable names are pertinent to the purpose of the code.

This excludes code that has misleading comments or variable names or has been obfuscated. Additionally, due to availability issues regarding closed-source systems' source code, only versions of open-source systems have been evaluated. For text classification to work on concerns that are expressed through words in human language, there have to be human-language words in the text that can be classified by a trained classifier.
This excludes
• Programs written in binary code or macro languages (e.g., APL),
• Programs that mainly include such content (e.g., inclusions of long blocks of byte code that gets saved somewhere and executed),
• Obfuscated code that contains no meaningful words.
The words in the code have to be meaningful to humans. This means that at least some variable names, comments etc. have to be related to what a source code entity actually intends to do. This excludes code that mostly lacks comments and uses meaningless variable names, such as single letters or words made from random combinations of letters. The words in the code have to represent the current concerns mostly accurately. This excludes
• Code that originally served one concern but now serves another while keeping the names of and repurposing the variables used for the previous one,
• Code that originally served one concern but now serves another without reflecting this in the comments.

6.1.1 Current Implementation

RELAX is currently implemented in Java and can recover architectures (or parts thereof) of systems written in Java. One effect of this is that even if a system has mostly meaningless variable names and lacks comments, the standard words used in Java standard packages can still be used for recovery.

6.2 Evaluation Validity

6.2.1 Selection of Subject Systems

One threat to validity common to all tests and studies is the selection of the subject systems.

6.2.1.1 Open Source

The set of systems for my evaluation consists entirely of open source systems. One reason is that the source code of such systems is more readily available than that of closed source ones. The other is that even if I were able to obtain the source code of closed source systems, stipulations by the owners could have prevented me from publishing it along with my results, making their reproduction impossible for most reviewers, who would have no access to the source code of those software systems.
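The preconditions of Section 6.1 can be made concrete with a small sketch. Python is used purely for illustration, and the splitting heuristic below is my assumption for this example, not RELAX's actual tokenizer. It shows how classifiable words can, or cannot, be extracted from identifiers, including the Java standard-library names mentioned in Section 6.1.1:

```python
import re

def identifier_words(identifier):
    """Split a camelCase/underscore identifier into lower-case words."""
    spaced = re.sub(r"([a-z0-9])([A-Z])", r"\1 \2", identifier)
    return [w.lower() for w in re.split(r"[_\s]+", spaced) if w]

# A meaningful name yields words a trained classifier can use ...
identifier_words("readSocketTimeout")   # ['read', 'socket', 'timeout']
# ... while an obfuscated or meaningless one does not:
identifier_words("x1")                  # ['x1']
# Even in poorly named code, standard-library names such as
# java.io.BufferedReader still contribute classifiable words:
identifier_words("BufferedReader")      # ['buffered', 'reader']
```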
The programming style or quality of closed source systems may be significantly different from that of open source systems in ways that would lead to different results. While, due to the nature of closed-source software, there can be no certainty about quality differences between it and open-source software, in cases where the metrics of closed-source software could be studied, no evidence of such differences was found [137]. Studies [138, 127] have likewise found no significant differences between the code quality of open- and closed-source systems.

6.2.1.2 Selection and Number of Recovered Open Source Systems

Any selection of systems is limited and may not be representative of all existing software systems. However, since my reasoning about the attributes of each recovery method was deduced from the foundations of how these recovery methods work, I believe it should apply to any system and that the obtained empirical results are only reality checks.

6.2.2 Comparing RELAX to Other Methods

Selection of recovery methods The selection of recovery methods was limited mostly by practicalities such as the availability of working implementations. Out of that group, the research was motivated by the interest in investigating how suitable recovery methods that are concern-oriented like RELAX and have performed well in expert decomposition comparisons such as [48] might be for maintenance.

Number of topics selected for ARC Depending on the respective systems and versions evaluated, setting the number of topics to be recovered to 100 may be regarded as arbitrarily high or low and therefore unfavorable. However, ARC does not contain any guidance on how many topics to start out with for a given system.
After a given run, indications may be taken that the number of topics is too large or too small, and it may be adjusted iteratively over subsequent runs, but the literature on ARC does not mandate such a process, and I have not seen any indication that it was followed by studies that based conclusions on its results.

6.2.3 User Studies

6.2.3.1 Construct Validity

Confounding Variables Confounding variables related to the speed of maintenance include factors such as the fields of the participants' engineering degrees, their levels of academic and professional experience with different programming languages, the tools used (such as IDEs), the projects they are working on and the domains of the issues to be fixed, as well as the hardware employed to fix them. The threats from these variables are addressed by surveying the participants each time one of the variables could have changed in a way that would influence their timing. For this, they are surveyed before they start working on each issue.

Internal Validity

Selection Bias The engineers participating in the user studies were self-selected from students studying for a master's degree in Computer Science. It is possible that the composition of that group differs from that of a typical group of maintainers of software systems. Using the selection criteria described in the study, I attempted to select the students who were closest to the profile of persons who might work on those projects.

6.3 External Validity

6.3.1 Task Selection

The experimental tasks focused on finding code locations and not on other parts of the maintenance process, such as following dependencies or reading documentation. Such activities would have been outside the scope of the study as described in Section 4.2.1, however.
6.3.2 Subject System

The subject system was selected after dismissing several others for not being a good fit due to lack of compilability (which is required to gather SLOC and dependency data for the visualization), trivial size, not being open source, not having any documentation (needed to come up with tasks) and other reasons.

6.3.3 Number of Systems

While a larger number of systems might have provided more insights, this was not possible due to limitations in the availability of participants. In order to select a system that is as representative as possible of a project that could conceivably take on new contributors, Apache Jackrabbit was chosen, a project of the Apache Software Foundation (apache.org), which hosts a large number of open source projects.

6.3.4 Java Programming Language

A Java-based subject system was selected because the current version of RELAX can only recover architectural views from Java source code. The experimental tasks could be easier or more difficult if posed in different programming languages. Java, however, is one of the most commonly used programming languages among developers worldwide [88] and therefore a typical environment.

6.3.5 Participants

The participants were self-selected from computer science graduate students with at least some Java programming experience. While this group does not fully match the set of all possible stakeholders in a system who can be served by RELAX recoveries, some of whom may be non-technical, it does match the group of maintainers who would want to get started on their tasks.

Chapter 7 Conclusion and Ongoing Work

This chapter summarizes how the results of my research on RELAX are meaningful now and in the foreseeable future.
7.1 Conclusion

With this dissertation, I set out to show that a novel concern-oriented software architecture recovery method could serve the stakeholders of software systems through a combination of desirable attributes: being accurate, useful in systems maintenance, efficient in scaling to large program sizes, and efficient when composing a recovery of a system from smaller parts of it. I then surveyed related work and found that no existing recovery method simultaneously possesses these desirable attributes. For this reason, I created RELAX, a novel architecture recovery method that employs text classification to recover a concern-oriented view of a system. The approach assigns source code entities to clusters based on concerns. The tool that implements the approach was then evaluated for accuracy, utility in maintenance, scalability to high system sizes in SLOC, and modular performance on a set of open source systems. For accuracy, the recovery results of RELAX were compared to those of other recovery methods on the same systems. The result was that the results produced by RELAX come closer to the expert decompositions than the previously closest ones. For the utility in maintenance, I conducted a user study with nine participants. The participants took part in a controlled experiment in which speed and other metrics with and without access to RELAX recovery results were compared to each other on Apache Jackrabbit 2.20. Through speeding up the start of maintenance by a factor of at least 5.38, along with other benefits experienced by the participants in my study, RELAX has shown itself to be a valuable aid in the software maintenance process that can make the difference that allows new maintainers to finish a maintenance task on time. Performance measurements have shown RELAX to scale up efficiently to high system sizes in SLOC.
Timing measurements demonstrated that RELAX can produce a recovery from the composition of individual parts of a system as fast as from the system as a whole. I have shown that RELAX constitutes a novel approach to software architecture recovery and, through its conceptual and design choices, recovers architectures
• accurately,
• with utility,
• timely, and
• distributed as desired without additional effort.

7.2 Future Work

The unique features of RELAX inspire several possible lines of future work. I envision at least four:
• implementation of finding diffs between project versions and updating the architecture without having to run a full recovery,
• an IDE plugin,
• extensions for the recovery of more languages, and
• automatic compilation and documentation of historic or ongoing projects.

7.2.1 Diff-Based Recovery

This would take advantage of the modularity of RELAX in that it could keep track of which files have been added, deleted or modified between versions and adapt the recovery results of different versions accordingly. This would greatly reduce the effort for the IDE plugin and for the documentation of historic or ongoing projects.

7.2.2 Enabling Maintenance

Two standout observations come from Section 4.2.2.2. The first is that without an architectural view of the system, for each of the two respective experimental tasks, only one participant was able to finish it. The second is that with an architectural view of the system, all nine participants, including the seven who had not successfully finished one or both warm-up tasks, were able to successfully finish their experimental tasks. This means that even if one does not consider any speedups, a benefit of RELAX lies in making timely maintenance of a system possible.
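The diff-based recovery envisioned in Section 7.2.1 could be sketched along the following lines. Python is used purely for illustration; the content hashing and the toy classify function are hypothetical stand-ins for RELAX's actual per-entity classification:

```python
import hashlib

def digest(text):
    """Content hash used to detect changed files between versions."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()

def incremental_recovery(old_files, old_clusters, new_files, classify):
    """Reclassify only the entities whose content changed between versions.

    old_files / new_files map path -> source text;
    old_clusters maps path -> concern from the previous recovery.
    """
    clusters = {}
    for path, text in new_files.items():
        unchanged = path in old_files and digest(old_files[path]) == digest(text)
        # Unchanged entities keep their previous cluster; added or modified
        # ones are classified anew. Deleted files simply drop out.
        clusters[path] = old_clusters[path] if unchanged else classify(text)
    return clusters

def classify(text):
    """Toy stand-in for a trained text classifier."""
    return "networking" if "socket" in text else "gui"

old = {"A.java": "socket code", "B.java": "button code"}
new = {"A.java": "socket code", "B.java": "window code", "C.java": "socket pool"}
result = incremental_recovery(old, {"A.java": "networking", "B.java": "gui"}, new, classify)
# Only B.java (modified) and C.java (added) are classified; A.java keeps its cluster.
```

Because RELAX classifies each entity independently, the cost of updating a recovery is proportional to the size of the diff rather than to the size of the system.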
7.2.3 RELAX IDE Plugin

Based on the positive results of the maintenance study in Section 4.2, in which I demonstrated the utility of RELAX for maintenance tasks, I considered new tools for software maintainers, whether they are new to a system or know it well. A possible tool would be a plugin for major IDEs. One feature that sets RELAX apart from other recovery methods is that it assembles its architectural view in a modular manner from the text classification of individual source code entities of a system. This has the important consequences that individual source code files can be classified while they are being edited and that, in order to see changes in the system, there is no need to recover the architecture of the whole system again, because classifying the changed entities is sufficient. I have considered how an IDE plugin that takes all this into account would be designed with regard to its internal and visual workflow. Four considerations would be that the plugin would (1) recover the system on request or automatically in the background after each successful compilation, (2) show the maintainer the current status of the architecture visually at a glance, (3) allow the kind of searches that helped the participants in my study, and (4) warn if the current architecture has issues, such as architectural smells [46]. Figure 7.1 shows the plugin status view that a maintainer could leave open in an IDE. The upper left shows information related to the current code entity being edited, such as its name, file size, SLOC, and color-coded outgoing and incoming dependencies as described in [84]. The right half shows a graphical or textual view of the most recently recovered architectural view, allowing zooming and panning. Finally, the lower left shows how many architectural issues have been found in the last recovery, along with a button that will open up a dialog showing issue details.
Figure 7.1: Plugin Status

Figure 7.2 shows the configuration dialog of the plugin. This would allow the maintainer to select or deselect auto-recovery after each compilation, select a trained classifier and open a dialog that allows them to train a new classifier. The latter option could become helpful if the concerns of interest to the maintainer have changed. Figure 7.3 shows an example of an architectural warning that is enabled by the concern-orientation of RELAX. Maintainers could assign layers to some or all concerns in their system, and the IDE plugin could analyze the architecture for layer breaches in which a lower-level entity calls a higher-level one. The maintainer could then check the details and take action by opening the file in the editor, or view the next warning.

7.2.4 Extensions for the Recovery of More Source Code Programming Languages

Recovery of source code written in more programming languages should be added. There could even be a way for users to add this themselves. For this purpose, language specifications, such as regular expressions for distinguishing source code files from other files, could be externalized into configuration files.

Figure 7.2: Plugin Configuration Dialog

Figure 7.3: Plugin Layer Dependency Warning

7.2.5 Automatic Compilation and Documentation of Historic or Ongoing Projects

All versions or only certain types of versions (e.g., releases) of a project could be downloaded, compiled and recovered. The automatically generated documentation could be used for research or to support the system's architect(s). In this dissertation I presented a novel architecture recovery method which employs text classification to recover a concern-oriented view of a system. The approach classifies source code entities into clusters based on user-defined concerns. The conceptual and design choices made have resulted in an accurate and scalable concern-based recovery method.
The tool that implements the approach has been evaluated for accuracy and scalability on a set of open source systems. The results confirm the claims made. Currently, RELAX assigns each source code entity to the cluster that represents its dominant concern. I aim to allow users to control the way in which code entities are assigned to clusters. For instance, users could choose clusters which represent what to them are meaningful combinations of more than one concern and instruct RELAX to assign code entities to such clusters. Another feature to be implemented is the ability to define undesirable dependencies in a system (e.g., those that break a desired layered architecture by having entities that serve low-level concerns depend on others in a higher layer). Of further interest are studies of architecture evolution with RELAX, as well as comparisons between its performance and that of other recovery methods on large or very large systems (e.g., Chromium OS).

7.2.6 GUI Frontend

Work has begun on front ends for the RELAX metrics and recovery runners (see Figure 7.4).

Figure 7.4: RELAX Metrics Runner Frontend Dialog

References

[1] C. C. Aggarwal and C. Zhai. A survey of text classification algorithms. In Mining Text Data, pages 163–222. Springer, 2012.
[2] R. Alghamdi and K. Alfalqi. A survey of topic modeling in text mining. Int. J. Adv. Comput. Sci. Appl. (IJACSA), 6(1), 2015.
[3] P. Andritsos and V. Tzerpos. Information-Theoretic Software Clustering. IEEE Transactions on Software Engineering, 31(2):150–165, 2005.
[4] City of Los Angeles. City of Los Angeles Department of Building and Safety, 2022. URL: https://www.ladbs.org/services/core-services/plan-check-permit/types-of-permit-processes/plan-submittal-for-regular-plan-check.
[5] N. Anquetil and T. Lethbridge. File Clustering Using Naming Conventions for Legacy Systems.
In Proceedings of the 1997 Conference of the Centre for Advanced Studies on Collaborative Research, 1997. URL: http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.48.7129&rep=rep1&type=pdf.
[6] N. Anquetil and T. C. Lethbridge. Recovering software architecture from the names of source files. Journal of Software Maintenance: Research and Practice, 11(3):201–221, 1999. ISSN: 1096-908X. DOI: 10.1002/(SICI)1096-908X(199905/06)11:3<201::AID-SMR192>3.0.CO;2-1.
[7] R. S. Arnold. Software Restructuring. Proceedings of the IEEE, 77(4):607–617, 1989. ISSN: 15582256. DOI: 10.1109/5.24146.
[8] J. Auerbach, D. F. Bacon, F. Bömers, and P. Cheng. Real-time music synthesis in Java using the Metronome garbage collector. In ICMC, 2007.
[9] L. D. Baker and A. McCallum. Distributional Clustering of Words for Text Classification. In Proceedings of the 21st Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 96–103. ACM, 1998.
[10] L. Bass and B. E. John. Linking usability to software architecture patterns through general scenarios. Journal of Systems and Software, 66(3):187–197, 2003.
[11] P. Behnamghader, D. M. Le, J. Garcia, D. Link, A. Shahbazian, and N. Medvidovic. A large-scale study of architectural evolution in open-source software systems. Empirical Software Engineering:1–48, 2016. ISSN: 15737616. DOI: 10.1007/s10664-016-9466-0.
[12] P. Behnamghader, D. M. Le, J. Garcia, D. Link, A. Shahbazian, and N. Medvidovic. A large-scale study of architectural evolution in open-source software systems. Empirical Software Engineering, 22(3):1146–1193, 2017.
[13] A. B. Belle, G. El Boussaidi, C. Desrosiers, S. Kpodjedo, and H. Mili. The layered architecture recovery as a quadratic assignment problem. Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics), 9278:339–354, 2015. ISSN: 16113349. DOI: 10.1007/978-3-319-23727-5_28.
[14] C. Bidan and V. Issarny. Security benefits from software architecture. In International Conference on Coordination Languages and Models, pages 64–80, Berlin. Springer, 1997.
[15] S. Bird, E. Klein, and E. Loper. Natural Language Processing with Python: Analyzing Text with the Natural Language Toolkit. O'Reilly Media, Inc., 2009.
[16] D. M. Blei, A. Y. Ng, and M. I. Jordan. Latent Dirichlet allocation. Journal of Machine Learning Research, 3(Jan):993–1022, 2003.
[17] D. M. Blei. Introduction to Probabilistic Topic Modeling. Communications of the ACM, 55:77–84, 2012. ISSN: 00010782. DOI: 10.1145/2133806.2133826.
[18] D. M. Blei, A. Y. Ng, and M. I. Jordan. Latent Dirichlet allocation. Neural Information Processing Systems, 3(4-5):993–1022, 2003. ISSN: 1532-4435. DOI: 10.1162/jmlr.2003.3.4-5.993.
[19] B. W. Boehm, C. Abts, A. W. Brown, S. Chulani, B. K. Clark, E. Horowitz, R. Madachy, D. J. Reifer, and B. Steece. Software Cost Estimation with COCOMO II. Prentice Hall Press, 2009.
[20] I. T. Bowman, R. C. Holt, and N. V. Brewster. Linux as a case study: its extracted software architecture. In Proceedings of the 1999 International Conference on Software Engineering (IEEE Cat. No. 99CB37002), pages 555–563. IEEE, 1999.
[21] J. Boyd-Graber, D. Mimno, and D. Newman. Care and feeding of topic models: Problems, diagnostics, and improvements. Handbook of Mixed Membership Models and Its Applications:225–254, 2014. ISSN: 1098-6596. DOI: 10.1017/CBO9781107415324.004.
[22] C. Brewer, M. Harrower, and The Pennsylvania State University. ColorBrewer 2.0 - Color Advice for Cartography, Apr. 24, 2017. URL: http://colorbrewer2.org/#type=qualitative&scheme=Paired&n=12.
[23] N. Brown, Y. Cai, Y. Guo, R. Kazman, M. Kim, P. Kruchten, E. Lim, A. MacCormack, R. Nord, I. Ozkaya, R. Sangwan, C. Seaman, K. Sullivan, and N. Zazworka.
Managing Technical Debt in Software-Reliant Systems. Proceedings of the FSE/SDP Workshop on Future of Software Engineering Research:47–52, 2010. DOI: 10.1145/1882362.1882373.
[24] H. Bruneliere, J. Cabot, G. Dupé, and F. Madiot. MoDisco: A model driven reverse engineering framework. Information and Software Technology, 56(8):1012–1032, 2014. ISSN: 0950-5849. URL: http://www.sciencedirect.com/science/article/pii/S0950584914000883; https://hal.inria.fr/hal-00972632/document.
[25] Y. Cai, H. Wang, S. Wong, and L. Wang. Leveraging design rules to improve software architecture recovery. Proceedings of the 9th International ACM SIGSOFT Conference on Quality of Software Architectures:133–142, 2013. DOI: 10.1145/2465478.2465480.
[26] M. Campbell-Kelly. From Airline Reservations to Sonic the Hedgehog: A History of the Software Industry. MIT Press, 2004.
[27] M. Campbell-Kelly. Not Only Microsoft: The Maturing of the Personal Computer Software Industry, 1982–1995. Computers and Communications Networks, 75(1):103–145, 2001.
[28] F. J. Cazorla, T. Vardanega, E. Quiñones, and J. Abella. Upper-bounding program execution time with extreme value theory. In 13th International Workshop on Worst-Case Execution Time Analysis. Schloss Dagstuhl-Leibniz-Zentrum fuer Informatik, 2013.
[29] P. Chaovalit and L. Zhou. Movie Review Mining: a Comparison between Supervised and Unsupervised Classification Approaches. Proceedings of the 38th Annual Hawaii International Conference on System Sciences, 00(C):1–9, 2005. ISSN: 1530-1605. DOI: 10.1109/HICSS.2005.445.
[30] E. J. Chikofsky and J. H. Cross. Reverse Engineering and Design Recovery: A Taxonomy. IEEE Software, 7(1):13–17, 1990.
[31] K. Chowdhary. Natural language processing. Fundamentals of Artificial Intelligence:603–649, 2020.
[32] P. C. Clements and L. M. Northrop. Software Architecture: An Executive Overview. Technical report, Carnegie Mellon University, Software Engineering Institute, Pittsburgh, PA, 1996.
[33] J. Corbet and G. Kroah-Hartman.
"2017 Linux Kernel Development Report". A Publication of The Linux Foundation, 2017.
[34] V. Dalakoti and D. Chakraborty. Apple M1 chip vs Intel (x86). EPRA International Journal of Research and Development (IJRD), 7(5):207–211, 2022.
[35] A. Danial. cloc - Count Lines of Code, Apr. 21, 2017. URL: https://github.com/AlDanial/cloc.
[36] L. De Silva and D. Balasubramaniam. Controlling software architecture erosion: A survey. Journal of Systems and Software, 85(1):132–151, 2012. ISSN: 01641212. DOI: 10.1016/j.jss.2011.07.036.
[37] B. Dit, M. Revelle, M. Gethers, and D. Poshyvanyk. Feature location in source code: A taxonomy and survey. Journal of Software: Evolution and Process, 25(1):53–95, 2013. ISSN: 20477481. DOI: 10.1002/smr.567.
[38] S. Ducasse and D. Pollet. Software Architecture Reconstruction: A Process-Oriented Taxonomy. IEEE Transactions on Software Engineering, 35(4):573–591, 2009.
[39] S. Ducasse, D. Pollet, and L. Poyet. A process-oriented software architecture reconstruction taxonomy. In CSMR 2007. IEEE Computer Society, 2007.
[40] G. Dupe and H. Bruneliere. MoDisco, May 2, 2017. URL: https://projects.eclipse.org/projects/modeling.mdt.modisco.
[41] F.-J. Elmer. Classycle: Analysing Tools for Java Class and Package Dependencies, 2014. URL: http://classycle.sourceforge.net.
[42] E. V. Emden. Software Quality Assurance by Detecting Code Smells. Informatica:16–17, 2002.
[43] C. Gacek, A. Abd-Allah, B. Clark, and B. Boehm. On the definition of software system architecture. In Proceedings of the First International Workshop on Architectures for Software Systems, pages 85–94. Seattle, WA, 1995.
[44] H. Gall, M. Jazayeri, and C. Riva. Visualizing Software Release Histories: The Use of Color and Third Dimension. In IEEE International Conference on Software Maintenance, ICSM, pages 99–108, 1999. ISBN: 9781479925520.
[45] K. Gallagher, A. Hatch, and M. Munro. Software architecture visualization: an evaluation framework and its application. IEEE Trans. Softw.
Eng., 34(2):260–270, 2008.
[46] J. Garcia, D. Popescu, G. Edwards, and N. Medvidovic. Identifying Architectural Bad Smells. 2009 13th European Conference on Software Maintenance and Reengineering:255–258, 2009. ISSN: 1534-5351. DOI: 10.1109/CSMR.2009.59.
[47] J. Garcia. A Unified Framework for Studying Architectural Decay of Software Systems. University of Southern California, 2014.
[48] J. Garcia, I. Ivkovic, and N. Medvidovic. A Comparative Analysis of Software Architecture Recovery Techniques. Automated Software Engineering (ASE), 2013 IEEE/ACM 28th International Conference on:486–496, 2013. DOI: 10.1109/ASE.2013.6693106.
[49] J. Garcia, D. Popescu, C. Mattmann, N. Medvidovic, and Y. Cai. Enhancing architectural recovery using concerns. In 2011 26th IEEE/ACM International Conference on Automated Software Engineering (ASE 2011), pages 552–555, Melbourne. IEEE, 2011.
[50] J. Garcia, D. Popescu, C. Mattmann, N. Medvidovic, and Y. Cai. Enhancing architectural recovery using concerns. In 2011 26th IEEE/ACM International Conference on Automated Software Engineering (ASE 2011), pages 552–555. IEEE, 2011.
[51] D. Garlan. Software architecture: A travelogue. Future of Software Engineering, FOSE 2014 - Proceedings:29–39, 2014. DOI: 10.1145/2593882.2593886.
[52] M. Glinz and R. Wieringa. Stakeholders in Requirements Engineering. IEEE Software, 24(2):18–20, 2007. ISSN: 0740-7459. DOI: 10.1109/MS.2007.42.
[53] D. Greene, D. O'Callaghan, and P. Cunningham. How many topics? Stability analysis for topic models. In Joint European Conference on Machine Learning and Knowledge Discovery in Databases, pages 498–513. Springer, 2014.
[54] P. Grubb and A. A. Takang. Software Maintenance: Concepts and Practice. World Scientific, 2003.
[55] A. Hatamlou. Black hole: A new heuristic optimization approach for data clustering. Information Sciences, 222:175–184, 2013. ISSN: 00200255. DOI: 10.1016/j.ins.2012.08.023.
[56] J. D. Herbsleb and A. Mockus.
An empirical study of speed and communication in globally distributed software development. IEEE Transactions on Software Engineering, 29(6):481–494, 2003.
[57] S. Holla and M. M. Katti. Android Based Mobile Application Development and its Security. International Journal of Computer Trends and Technology, 3(3):486–490, 2012. URL: http://ijcttjournal.org/Volume3/issue-3/IJCTT-V3I3P130.pdf.
[58] A. Huang. Similarity measures for text document clustering. Proceedings of the Sixth New Zealand, (April):49–56, 2008. URL: http://nzcsrsc08.canterbury.ac.nz/site/proceedings/Individual_Papers/pg049_Similarity_Measures_for_Text_Document_Clustering.pdf.
[59] ISO. IEEE: ISO/IEC/IEEE 42010:2011 - Systems and software engineering - Architecture description. Technical Report, 2011.
[60] ISO/IEC/IEEE. Systems and software engineering — Architecture description, Apr. 21, 2017. URL: http://www.iso-architecture.org/42010/defining-architecture.html.
[61] ISO/IEC/IEEE. Systems and software engineering — System life cycle processes. International Organization for Standardization, 2015.
[62] A. Israeli and D. G. Feitelson. The Linux kernel as a case study in software evolution. Journal of Systems and Software, 83(3):485–501, 2010.
[63] M. A. Javed and U. Zdun. A systematic literature review of traceability approaches between software architecture and source code. In Proceedings of the 18th International Conference on Evaluation and Assessment in Software Engineering, pages 1–10, 2014.
[64] K. Jeet and R. Dhir. Software Architecture Recovery using Genetic Black Hole Algorithm. ACM SIGSOFT Software Engineering Notes, 40(1):1–5, 2015. ISSN: 01635948. DOI: 10.1145/2693208.2693230.
[65] H. Karna, S. Gotovac, L. Vicković, and L. Mihanović. The effects of turnover on expert effort estimation. Journal of Information and Organizational Sciences, 44(1):51–81, 2020.
[66] S. Kawaguchi, P. K. Garg, M. Matsushita, and K. Inoue.
MUDABlue: An automatic categorization system for Open Source repositories. Journal of Systems and Software, 79(7):939–953, 2006. ISSN: 01641212. DOI: 10.1016/j.jss.2005.06.044.
[67] A. J. Ko, B. A. Myers, M. J. Coblenz, and H. H. Aung. An exploratory study of how developers seek, relate, and collect relevant information during software maintenance tasks. IEEE Transactions on Software Engineering, 32(12):971–987, 2006. ISSN: 00985589. DOI: 10.1109/TSE.2006.116.
[68] K. Kowsari, K. Jafari Meimandi, M. Heidarysafa, S. Mendu, L. Barnes, and D. Brown. Text classification algorithms: a survey. Information, 10(4):150, 2019.
[69] P. Kruchten, R. L. Nord, and I. Ozkaya. Technical debt: from metaphor to theory and practice. IEEE Software, 29(6):18–21, 2012.
[70] R. Krzysztofowicz. Bayesian forecasting via deterministic model. Risk Analysis, 19(4):739–749, 1999. ISSN: 02724332. DOI: 10.1023/A:1007050023440.
[71] E. L. Lawler. The Quadratic Assignment Problem. Management Science, 9(4):586–599, 1963.
[72] D. M. Le, P. Behnamghader, J. Garcia, D. Link, A. Shahbazian, and N. Medvidovic. An empirical study of architectural change in open-source software systems. In 2015 IEEE/ACM 12th Working Conference on Mining Software Repositories, pages 235–245. IEEE, 2015. DOI: 10.1109/MSR.2015.29.
[73] D. M. Le, C. Carrillo, R. Capilla, and N. Medvidovic. Relating architectural decay and sustainability of software systems. Proceedings - 2016 13th Working IEEE/IFIP Conference on Software Architecture, WICSA 2016, (April):178–181, 2016. DOI: 10.1109/WICSA.2016.15.
[74] D. M. Le, D. Link, A. Shahbazian, and N. Medvidovic. An empirical study of architectural decay in open-source software. In 2018 IEEE International Conference on Software Architecture (ICSA), pages 176–17609. IEEE, 2018.
[75] M. M. Lehman. On Understanding Laws, Evolution, and Conservation in the Large-Program Life Cycle. Journal of Systems and Software, 1:213–221, 1979.
[76] M. M. Lehman.
Programs, life cycles, and laws of software evolution. In Proceedings of the IEEE, volume 68, number 9, pages 1060–1076, 1980.
[77] P. Lemmons. Microsoft Windows. Byte, 8:12, 1983.
[78] K. M. Leung. Naive Bayesian classifier. Polytechnic University Department of Computer Science/Finance and Risk Engineering, 2007:123–156, 2007.
[79] D. D. Lewis and M. Ringuette. A comparison of two learning algorithms for text categorization. In Third Annual Symposium on Document Analysis and Information Retrieval, volume 33, pages 81–93, 1994.
[80] Y. H. Li and A. K. Jain. Classification of Text Documents. The Computer Journal, 41(8):537–546, 1998. ISSN: 0010-4620. DOI: 10.1093/comjnl/41.8.537.
[81] X. Liao, B. Li, and J. Li. Impacts of Apple's M1 SoC on the technology industry. In 2022 7th International Conference on Financial Innovation and Economic Development (ICFIED 2022), pages 355–360. Atlantis Press, 2022.
[82] E. D. Liddy. Natural language processing, 2001.
[83] Y.-S. Lin, J.-Y. Jiang, and S.-J. Lee. A Similarity Measure for Text Classification and Clustering. IEEE Transactions on Knowledge and Data Engineering, 26(7):1575–1590, 2014. ISSN: 1041-4347. DOI: 10.1109/TKDE.2013.19.
[84] D. Link, P. Behnamghader, R. Moazeni, and B. Boehm. Recover and RELAX: concern-oriented software architecture recovery for systems development and maintenance. In Proceedings of the International Conference on Software and System Processes, pages 64–73, Montréal. IEEE, 2019.
[85] D. Link, P. Behnamghader, R. Moazeni, and B. Boehm. The value of software architecture recovery for maintenance. In Proceedings of the 12th Innovations on Software Engineering Conference (formerly known as India Software Engineering Conference), pages 1–10, 2019.
[86] D. Link, K. Srisopha, and B. Boehm. Study of the utility of text classification based software architecture recovery method RELAX for maintenance.
In Proceedings of the 15th ACM/IEEE International Symposium on Empirical Software Engineering and Measurement (ESEM), pages 1–6, 2021. [87] C. Liu, Q. Zhu, K. A. Holroyd, and E. K. Seng. Status and trends of mobile-health applications for iOS devices: A developer’s perspective. Journal of Systems and Software, 84(11):2022–2033, 2011. issn: 01641212.doi: 10.1016/j.jss.2011.06.049. [88] S. Liu. Most used languages among software developers globally 2020, Apr. 2021.url: https://www.statista.com/statistics/793628/worldwide-developer-survey-most-used-languages/. [89] S. Lohr and J. Markoff. Windows is so slow, but why. Te New York Times, Mar..(Referenced on page.), 2006. [90] N. Lopez. Using topic models to understand the evolution of a software ecosystem. Proceedings of the 2013 9th Joint Meeting on Foundations of Software Engineering - ESEC/FSE 2013:723, 2013.doi: 10.1145/2491411.2492402. [91] T. Lutellier, D. Chollak, J. Garcia, L. Tan, D. Rayside, N. Medvidovic, and R. Kroeger. Comparing Software Architecture Recovery Techniques Using Accurate Dependencies. Proceedings - International Conference on Software Engineering, 2:69–78, 2015.issn: 02705257.doi: 10.1109/ICSE.2015.136. [92] I. Macia, J. Garcia, D. Popescu, A. Garcia, N. Medvidovic, and A. von Staa. Are automatically-detected code anomalies relevant to architectural modularity? 11th annual international conference on Aspect-oriented Software Development (AOSD ’12):167, 2012.doi: 10.1145/2162049.2162069. [93] A. Mahmoud and N. Niu. Evaluating software clustering algorithms in the context of program comprehension. In 2013 21st International Conference on Program Comprehension (ICPC), pages 162–171. IEEE, 2013. [94] M. Makrehchi and M. S. Kamel. Automatic extraction of domain-specific stopwords from labeled documents. In European Conference on Information Retrieval, pages 222–233. Springer, 2008. 142 [95] S. Mancoridis, B. Mitchell, Y. Chen, and E. Gansner. 
Bunch: a clustering tool for the recovery and maintenance of software system structures. Proceedings IEEE International Conference on Software Maintenance - 1999 (ICSM’99). ’Software Maintenance for Business Change’ (Cat. No.99CB36360):50–59, 1999.issn: 1063-6773.doi: 10.1109/ICSM.1999.792498. [96] S. Mancoridis, B. Mitchell, C. Rorres, Y. Chen, and E. Gansner. Using Automatic Clustering to Produce High-Level System Organizations of Source Code. In 6th International Workshop on Program Comprehension, pages 45–52, 1998. [97] O. Maqbool and H. Babri. The weighted combined algorithm: a linkage algorithm for software clustering. Eighth European Conference on Software Maintenance and Reengineering:15–24, 2004. issn: 1534-5351.doi: 10.1109/CSMR.2004.1281402. [98] O. Maqbool and H. A. Babri. Hierarchical clustering for software architecture recovery. Transactions on Software Engineering, 33(11):759–780, 2007. [99] C. A. Mattmann, J. Garcia, I. Krka, D. Popescu, and N. Medvidovic. Revisiting the anatomy and physiology of the grid. Journal of Grid Computing, 13(1):19–34, 2015. [100] A. McCallum. MALLET MAchine Learning for LanguagE Toolkit, Apr. 20, 2017.url: http://mallet.cs.umass.edu. [101] N. Mehra, S. Khandelwal, and P. Patel. Sentiment identification using maximum entropy analysis of movie reviews. 2002.url: http://www.stanford.edu/class/cs276a/projects/reports/nmehra-kshashi-priyank9.pdf. [102] T. Mens and T. Tourwé. A survey of software refactoring. IEEE Transactions on Software Engineering, 30:126–139, 2004.issn: 00985589.doi: 10.1109/TSE.2004.1265817. [103] Merriam-Webster. Concern | Definition of Concern by Merriam-Webster, Apr. 27, 2017. url: https://www.merriam-webster.com/dictionary/concern. [104] B. S. Mitchell and S. Mancoridis. On the automatic modularization of software systems using the bunch tool. IEEE Transactions on Software Engineering, 32(3):193–208, 2006.issn: 00985589.doi: 10.1109/TSE.2006.31. [105] R. Moazeni, D. Link, and B. Boehm. 
Cocomo ii parameters and idpd: bilateral relevances:20–24, 2014.doi: 10.1145/2600821.2600847. [106] R. Moazeni, D. Link, and B. Boehm. Incremental development productivity decline:1–9, 2013.doi: 10.1145/2499393.2499403. [107] R. Moazeni, D. Link, and B. Boehm. Lehman’s laws and the productivity of increments: implications for productivity. 1:577–582, 2013.issn: 15301362.doi: 10.1109/APSEC.2013.84. [108] R. Moazeni, D. Link, C. Chen, and B. Boehm. Software domains in incremental development productivity decline:75–83, 2014.doi: 10.1145/2600821.2600830. 143 [109] A. Mockus, R. T. Fielding, and J. D. Herbsleb. Two case studies of open source software development: apache and mozilla. ACM Transactions on Software Engineering and Methodology (TOSEM), 11(3):309–346, 2002. [110] N. Moha, Y.-G. Guéhéneuc, L. Duchien, and A.-F. L. Meur. {DECOR}: A Method for the Specification and Detection of Code and Design Smells. IEEE Transactions on Software Engineering, 36(1):20–36, 2010.doi: 10.1109/TSE.2009.50. [111] P. Mohagheghi, B. Anda, and R. Conradi. Effort estimation of use cases for incremental large-scale software development. In Proceedings. 27th International Conference on Software Engineering, 2005. ICSE 2005. Pages 303–311. IEEe, 2005. [112] K. P. Murphy. Naive Bayes classifiers. Bernoulli, 4701(October):1–8, 2006.doi: 10.1007/978-3-540-74958-5_35. [113] P. M. Nadkarni, L. Ohno-Machado, and W. W. Chapman. Natural language processing: an introduction. Journal of the American Medical Informatics Association, 18(5):544–551, 2011. [114] C. o. New York. Obtaining a permit, 2022.url: https://www1.nyc.gov/site/buildings/industry/obtaining-a-permit.page. [115] K. Nigam, J. Lafferty, and A. McCallum. Using maximum entropy for text classification. IJCAI-99 workshop on machine learning for information filtering , 1999.doi: 10.1.1.63.2111. [116] L. Northrop. The importance of software architecture. Software Engineering Institute, Carnegie Mellon University. Available: http://sunset. 
usc. edu/gsaw/gsaw2003/s13/northrop. pdf, 2003. [117] B. Nuseibeh and S. Easterbrook. Requirements Engineering : A Roadmap. In Conference on the Future of Software Engineering, pages 35–46, 2000. [118] U. of Southern California Computer Science Department. ARCADE Manual. University of Southern California, 2017.url: https: //softarch.usc.edu/~lemduc/Recovered_files/ArchitectureEvolutionAnalysiswithARCADE.pdf. [119] W. N. Oizumi, A. F. Garcia, T. E. Colanzi, M. Ferreira, and A. V. Staa. When code-anomaly agglomerations represent architectural problems? an exploratory study. Proceedings - 28th Brazilian Symposium on Software Engineering, SBES 2014:91–100, 2014.doi: 10.1109/SBES.2014.18. [120] Oracle. Naming a Package, Apr. 27, 2017.url: https://docs.oracle.com/javase/tutorial/java/package/namingpkgs.html. [121] C. Patel, A. Hamou-Lhadj, and J. Rilling. Software clustering using dynamic analysis and static dependencies. In 2009 13th European Conference on Software Maintenance and Reengineering, pages 27–36. IEEE, 2009. [122] R. Peters and A. Zaidman. Evaluating the lifespan of code smells using software repository mining. Proceedings of the European Conference on Software Maintenance and Reengineering, CSMR:411–416, 2012.issn: 15345351.doi: 10.1109/CSMR.2012.79. 144 [123] A. Photoshop. Adobe systems.[software]. San Jose, CA: Adobe Systems, 1990. [124] M. Pinzger, M. Fischer, H. Gall, and M. Jazayeri. Revealer: a lexical pattern matcher for architecture recovery. Ninth Working Conference on Reverse Engineering:170–178, 2002.issn: 1095-1350.doi: 10.1109/WCRE.2002.1173075. [125] M. Pinzger and H. Gall. Pattern-supported architecture recovery. Proceedings 10th International WorkshoponProgramComprehension:53–61, 2002.issn: 1092-8138.doi: 10.1109/WPC.2002.1021318. [126] J. R. Quinlan. Program for machine learning. C4. 5, 1993. [127] S. Raghunathan, A. Prasad, B. K. Mishra, and H. Chang. Open source versus closed source: software quality in monopoly and competitive markets. 
IEEE Transactions on Systems, Man, and Cybernetics-Part A: Systems and Humans, 35(6):903–918, 2005. [128] L. Rising. The patterns handbook. Cambridge University Press, 1998. [129] M. P. Robillard. Workshop on the modeling and analysis of concerns in software (macs 2005). ACM SIGSOFT Software Engineering Notes, 30(4):1–3, 2005. [130] S. R. Safavian and D. Landgrebe. A survey of decision tree classifier methodology. IEEE transactions on systems, man, and cybernetics, 21(3):660–674, 1990. [131] H. Schütze, C. D. Manning, and P. Raghavan. Introduction to information retrieval, volume 39. Cambridge University Press Cambridge, 2008. [132] A. Shahbazian, Y. K. Lee, D. Le, Y. Brun, and N. Medvidovic. Recovering architectural design decisions. In 2018 IEEE International Conference on Software Architecture (ICSA), pages 95–9509. IEEE, 2018. [133] D. I. Sjøberg, A. Yamashita, B. C. Anda, A. Mockus, and T. Dybå. Quantifying the effect of code smells on maintenance effort. IEEE Transactions on Software Engineering, 39(8):1144–1156, 2012. [134] S. Slinger. Code Smell Detection in Eclipse. Science, (March):1–69, 2005. [135] Software Engineering Institute. Community Software Architecture Definitions, Apr. 22, 2017. url: http://www.sei.cmu.edu/architecture/start/glossary/community.cfm. [136] F. Solms. What is software architecture? In Proceedings of the south african institute for computer scientists and information technologists conference, pages 363–373, Pretoria. ACM, 2012. [137] D. Spinellis. A tale of four kernels. 2008 ACM/IEEE 30th International Conference on Software Engineering:381–390, 2008.issn: 0270-5257.doi: 10.1145/1368088.1368140. [138] D. Spinellis. A tale of four kernels. In Proceedings of the 30th international conference on Software engineering, pages 381–390, Leipzig. ACM, 2008. [139] R. Stallman et al. The gnu project, 1998. 145 [140] I. Steyvers, Mark (University of California and T. ( o. C. B. Griffiths. Probabilistic Topic Models. 
Handbook of latent semantic analysis:427–448, 2007.issn: 1386-4564.doi: 10.1016/s0364-0213(01)00040-4. arXiv: 1111.6189v1. [141] R. N. Taylor, N. Medvidovic, and E. M. Dashofy. Software Architecture - Foundations, Theory, and Practice. Wiley, 2010. [142] The Eclipse Foundation. Eclipse, May 2, 2017.url: https://www.eclipse.org. [143] F. Tian, P. Liang, and M. A. Babar. Relationships between software architecture and source code in practice: an exploratory survey and interview. Information and Software Technology, 141:106705, 2022. [144] L. Torvalds et al. Linux. URL: http://www. linux. org, 2:263–297, 2002. [145] J. B. Tran, M. W. Godfrey, E. H. Lee, and R. C. Holt. Architecture Analysis and Repair of Open Source Software. ACM SIGSOFT Software Engineering Notes, 17(4):40–52, 1992. [146] V. Tzerpos. Acdc algorithm. May 2010.url: https://wiki.eecs.yorku.ca/project/cluster/protected:acdc. [147] V. Tzerpos and R. C. Holt. ACDC: An algorithm for comprehension-driven clustering. In Proceedings Seventh Working Conference on Reverse Engineering, pages 258–267. IEEE, 2000. [148] USC-CSSE. UCC Unified Code Count, May 4, 2022. url: https://cssedr.usc.edu:4443/csse-tools/ucc. [149] USC-CSSE Software Architecture Group. ARCADE, May 1, 2017.url: http://softarch.usc.edu/wiki/doku.php?id=arcade:start%7B%5C&%7Didx=wiki. [150] I. Vayansky and S. A. Kumar. A review of topic modeling methods. Information Systems, 94:101582, 2020. [151] R. Verdecchia, P. Kruchten, and P. Lago. Architectural technical debt: a grounded theory. In European Conference on Software Architecture, pages 202–219. Springer, 2020. [152] L. Wall et al. The perl programming language, 1994. [153] H. M. Wallach. Topic modeling: beyond bag-of-words. In Proceedings of the 23rd international conference on Machine learning, pages 977–984, 2006. [154] L. Wei, C. Zhang-long, and T. Shi-hang. Research and analysis of garbage collection mechanism for real-time embedded java. 
In 8th International Conference on Computer Supported Cooperative Work in Design, volume 1, pages 462–468. IEEE, 2004. [155] Z. Wen and V. Tzerpos. An effectiveness measure for software clustering algorithms. In Proceedings. 12th IEEE International Workshop on Program Comprehension, 2004. Pages 194–203. IEEE, 2004. 146 [156] D. Whitley. A genetic algorithm tutorial. Statistics and computing, 4(2):65–85, 1994. [157] G. Williams. Apple macintosh computer. Byte, 9(2):30–31, 1984. [158] A. Yamashita. Assessing the capability of code smells to explain maintenance problems: An empirical study combining quantitative and qualitative data. Empirical Software Engineering, 19(4):1111–1143, 2014.issn: 15737616.doi: 10.1007/s10664-013-9250-3. [159] X. Zhang, M. Persson, M. Nyberg, B. Mokhtari, A. Einarson, H. Linder, J. Westman, D. Chen, and M. Torngren. Experience on applying software architecture recovery to automotive embedded systems. 2014 Software Evolution Week - IEEE Conference on Software Maintenance, Reengineering, and Reverse Engineering, CSMR-WCRE 2014 - Proceedings:379–382, 2014.doi: 10.1109/CSMR-WCRE.2014.6747199. [160] Zhihua Wen and V. Tzerpos. An effectiveness measure for software clustering algorithms. Proceedings. 12th IEEE International Workshop on Program Comprehension, 2004.:194–203, 2004. issn: 1092-8138.doi: 10.1109/WPC.2004.1311061. 147 AppendixA RELAXManual Version: 5/3/2022 This manual describes the operations of the RELAX Java JAR. A.1 Invocation RELAX can be called from the command line and takes five parameters. A.1.1 Parameters RELAX.jar <Classifier File> <Configuration File> All five parameters are mandatory. Their meaning is as follows: InputDirectory: An absolute directory path that contains one or more folders that contain compiled versions of a system. The input directory may also contain optional additional directories generated by auxiliary tools that contain additional information, e.g. code count results. 
Example: /Users/daniellink/icloud/RecProjects/chukwa_0.1.2_compiled

The directory structure may look like the one shown in Figure A.1.

[Figure A.1: Recovery Subject System Directory Structure]

In this case, the chukwa_0.1.2.CLOC and chukwa_0.1.2.Dependencies_jdeps directories will not be regarded as system version directories, but rather as directories with additional information for the recovery of the system found in chukwa_0.1.2.

Output Directory: An absolute directory path in which each run of RELAX will place a new directory with its recovery results.

Example: /Users/daniellink/tmp

Source Directory: A relative directory path that is expected to exist in each version folder and that will be the only directory considered for recovery.

Example 1: src
Example 2: “ ” (an empty string)

Assuming the same directory structure in the chukwa_0.1.2 directory referenced above for both examples: in the first example, only the “src” folder would be considered for the recovery, while in the second, all folders would be considered.

[Figure A.2: Chukwa Directory Structure]

Classifier File: An absolute file path to a file containing a classifier generated by auxiliary tools.

Example: /Users/daniellink/git/relax-standalone/relax.classifier

Configuration File: An absolute file path to a RELAX configuration file.

Example: /Users/daniellink/git/relax-standalone/relax.config.xml

Note: While the recovery will work if the configuration file does not exist, the parameter must still be supplied.

A.2 Requirements

A.2.1 Hardware

Since RELAX is under active development for research, and testing for exact hardware requirements will only be undertaken when a version is publicly released, there are no specific requirements to list at this time. Any machine capable of running the Java 17 VM should be sufficient. In my experience, performance will mostly vary with CPU and storage speed.
A.2.2 Operating System

Name                                        | Supported | OS Version(s) Tested | Remarks
macOS                                       | Yes       | Monterey (12.3.1)    | None
Linux Ubuntu                                | Yes       | 22.04 LTS            | See below
Other Linux distributions                   | No        | --None--             | Should work (see below)
Other Unix-like OSs (e.g. FreeBSD, Solaris) | No        | --None--             | Should work (see below)
Windows                                     | No        | --None--             | Should work (see below)

Since RELAX development takes place on macOS, this will generally be the environment with the most straightforward setup. While RELAX is a Java application, it depends on external tools to run (see “Software Environment” below). The versions and features of these tools are not always the same across operating systems or distributions thereof (GraphViz is notable in this respect). Because RELAX is developed on macOS, testing it together with these tools is part of regular debugging work. Releases are generally tested on Ubuntu Linux when empirical studies are run or in similar situations. Other Linux distributions, other Unix derivatives and Windows are only checked on demand. I expect that RELAX can be made to work on all of them, but additional setup and configuration work may be required.

A.2.3 Java Version

Java VM 17.0.3 or newer. Both the Oracle JDK and OpenJDK have been tested successfully.

A.2.4 Software Environment

This is a collection of tools and utilities that are expected to be installed to enable the full operation of RELAX. While recoveries could conceivably take place without some of them, lacking these tools may lead to unexpected or undefined behavior in the current implementation. On macOS, Homebrew (https://brew.sh) is a prerequisite.

A.2.4.1 jdeps (Java class dependency analyzer)

OS     | Where to obtain                                                                                                      | How to install                                                         | Default path
macOS  | Oracle JDK 17 (https://www.oracle.com/java/technologies/downloads/#java17) or OpenJDK 17 (https://openjdk.java.net) | Included with the Oracle JDK; an additional package when using OpenJDK | /usr/bin
Ubuntu | Same as macOS                                                                                                        | Same as macOS                                                          | Same as macOS

The path can be configured in the RELAX config file (see below).
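Since RELAX relies on several external command-line tools (the java launcher, jdeps, Graphviz’s dot, and cloc), a quick check that they are all reachable on the PATH can save debugging time before a first run. The following is a convenience sketch only and not part of the RELAX distribution:

```shell
#!/bin/sh
# Report whether each external tool RELAX relies on is reachable on PATH.
# This is a convenience sketch, not part of RELAX itself.
REPORT=""
for tool in java jdeps dot cloc; do
  if command -v "$tool" >/dev/null 2>&1; then
    status="found at $(command -v "$tool")"
  else
    status="MISSING"
  fi
  echo "$tool: $status"
  REPORT="$REPORT $tool=$status;"
done
```

A missing jdeps or dot will surface here immediately, rather than as undefined behavior mid-recovery.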
A.2.4.2 Graphviz (Visualization software)

OS     | Where to obtain | How to install        | Default path
macOS  | Via Homebrew    | brew install graphviz | /usr/local/bin
Ubuntu | Via apt         | apt install graphviz  | /usr/bin

Note: In the past, some versions of GraphViz available through apt (the Ubuntu package manager) were missing features needed to generate complete directory graphs. The current status is unknown (but will be updated here).

A.2.4.3 cloc (Code count)

OS     | Where to obtain | How to install    | Default path
macOS  | Via Homebrew    | brew install cloc | /usr/local/bin
Ubuntu | See note below  | apt install cloc  | /usr/bin

Note: This needs to be version 1.80 or newer (see https://launchpad.net/ubuntu/+source/cloc). Version 1.74, which comes with Ubuntu 18.04, will NOT work with the command line issued by RELAX. Newer versions of cloc that can be installed on Ubuntu 18.04 are available; installing one may require installing libparallel-forkmanager-perl first using the apt tool.

A.2.5 Configuration Files

A.2.5.1 log4j2.xml

Optional. Used for the configuration of Log4j2. The format is described at https://logging.apache.org/log4j/2.x/manual/configuration.html.

Default path: ~/relax_userdir

Effect of not having it: An excessive amount of debug information may be printed to the console.

A.2.5.2 relax_config.xml

Optional on macOS, required on all other operating systems. Used for configuring RELAX. This is an XML file with tags that should be self-explanatory and easy to change manually.
Default path: ~

The following is an example with some irrelevant tags omitted:

<edu.usc.softarch.relax.config.Config>
    <dotLayoutCommandDir>/usr/local/bin</dotLayoutCommandDir>
    <dotLayoutCommand>twopi</dotLayoutCommand>
    <dotOutputFormat>pdf</dotOutputFormat>
    <graphParams>-x -Goverlap=scale -Tpdf</graphParams>
    <graphOpenerLoc>/usr/bin/open</graphOpenerLoc>
    <runGraphs>false</runGraphs>
    <viewGraphs>true</viewGraphs>
    <showSvgToolTips>false</showSvgToolTips>
    <detectDependencies>false</detectDependencies>
    <perUserRelaxDir>/Users/daniellink/relax_userdir</perUserRelaxDir>
    <malletExecutable>/usr/local/mallet/bin/mallet</malletExecutable>
    <relaxOutputDir>noname</relaxOutputDir>
    <relaxRandomSeed>0</relaxRandomSeed>
    <relaxVerbosity>1</relaxVerbosity>
    <relaxTrials>1</relaxTrials>
    <relaxClassifierFileMain>noClassiferSet</relaxClassifierFileMain>
    <relaxTrainingPortionPercentage>60</relaxTrainingPortionPercentage>
    <matchConfidence>0.6</matchConfidence>
    <relaxDirectoryGraphWidth>85.0</relaxDirectoryGraphWidth>
    <relaxDirectoryGraphHeight>110.0</relaxDirectoryGraphHeight>
    <baseCountOnSLOC>false</baseCountOnSLOC>
    <vDist>0</vDist>
</edu.usc.softarch.relax.config.Config>

A.3 Known Issues

1. Every run of RELAX will log “Log4j2 configuration file /Users/daniellink/log4j2.xml not found” on the console directly after invocation. This is harmless and will happen regardless of whether a Log4j2 configuration is found and loaded.

2. In some cases, a recursion in the referenced JavaParser library will lead to a termination due to a stack overflow exception. This can be prevented by raising the stack size with a VM parameter, for example: -Xss4m
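Putting the five parameters from Section A.1.1 and the stack-size workaround from Known Issue 2 together, a complete invocation might look as follows. All paths are hypothetical placeholders; the sketch echoes the assembled command rather than running it, so it can be inspected without the JAR installed:

```shell
#!/bin/sh
# Assemble a complete RELAX command line (all paths below are hypothetical).
INPUT_DIR=/Users/me/recoveries/chukwa_0.1.2_compiled  # compiled system versions
OUTPUT_DIR=/Users/me/tmp                              # recovery results land here
SOURCE_DIR=src                                        # relative source dir in each version
CLASSIFIER=/Users/me/relax/relax.classifier           # pre-built classifier file
CONFIG=/Users/me/relax/relax.config.xml               # RELAX configuration file

# -Xss4m raises the JVM stack size to avoid the JavaParser recursion
# issue described under Known Issues.
CMD="java -Xss4m -jar RELAX.jar $INPUT_DIR $OUTPUT_DIR $SOURCE_DIR $CLASSIFIER $CONFIG"
echo "$CMD"
```

Removing the echo and quoting (`"$CMD"` via eval, or passing the arguments directly to java) turns the sketch into an actual run.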
Abstract
In order to maintain a software system, it is beneficial to know its architecture. At each point in the development and maintenance processes of a software system, its stakeholders are legitimately interested in where and how its architecture reflects each of their respective concerns. Having this knowledge available at all times permits them to continuously adjust their system’s structure at each juncture and to limit the build-up of technical debt, which can be hard to pay down once it has accumulated and persisted over many iterations.
Unfortunately, software systems commonly lack reliable and current documentation of their architecture. To remedy this situation, researchers have conceived a number of architecture recovery methods, each employing algorithms with varying degrees of automation to recover different architectural views of a system, some of them concern-oriented. Yet the design choices underlying most existing recovery methods mean that none of them offers the complete set of qualities desirable for the purpose stated above.
Tailoring existing recovery methods to a system is often either not possible or only achievable through iterative experiments with numeric parameters. This can be time-consuming and even open-ended. Furthermore, limitations in the scalability of the employed recovery algorithms can make it prohibitive to apply these existing techniques to large systems. Finally, since several current recovery methods employ non-deterministic sampling, their inconsistent results do not lend themselves well to tracking a system’s architectural development over several versions, as needed by its stakeholders.
The goal of this dissertation research is to overcome these issues with a new recovery method that follows the concern-oriented paradigm and produces an architectural view that can benefit all stakeholders of a software system.
RELAX (RELiable Architecture EXtraction), a new concern-based recovery method that uses text classification, addresses these issues efficiently by (1) assembling the overall recovery result from smaller, independent parts, (2) basing it on an algorithm with linear time complexity, and (3) allowing itself to be tailored to the recovery of a single system or a sequence thereof through the selection of meaningfully named, semantic topics. An intuitive and informative architectural visualization rounds out RELAX’s contributions.
RELAX is illustrated on a number of open-source systems, and its results are surveyed with regard to accuracy, utility, scalability and modularity.
Through its results in these areas, RELAX has shown itself to be a valuable aid that could form the basis of further tools supporting the software development process with a focus on maintenance.