Architectural Evolution and Decay in Software Systems

by

Duc Minh Le

A Dissertation Presented to the
FACULTY OF THE USC GRADUATE SCHOOL
UNIVERSITY OF SOUTHERN CALIFORNIA
In Partial Fulfillment of the
Requirements for the Degree
DOCTOR OF PHILOSOPHY
(COMPUTER SCIENCE)

December 2018

Copyright 2018 Duc Minh Le

Dedication

To my deceased father, who influenced my life far more than anyone else; my mother, who unconditionally supports me through ups and downs; my brother, who always takes care of our family so that I can chase my dreams.

Acknowledgements

"Life is like riding a bicycle. To keep your balance, you must keep moving." (Albert Einstein)

I first encountered this quote of the famous physicist Albert Einstein when I started facing life's difficulties after leaving my hometown for Hanoi for college and living as a grown-up adult. Since then, that quote has been one of my favorite sentences, reminding me many times during my journey of studying and living abroad, especially during my Ph.D. chapter. My Facebook friends are usually jealous of my life in the U.S., with its tons of road trips, new friends, and new kinds of food. However, those are just the bright side of my Ph.D. life. Looking back at the last five years, I faced the extremes of both happiness and stress. By comparison, if my undergrad was like riding a bicycle on city streets, then my Ph.D. was like extreme mountain biking, with lots of ups and downs. Now, at the end of the road, I would like to express my deep gratitude to the people who have helped me balance my "bike" and overcome "Mt. Ph.D.".

First, I would like to wholeheartedly thank my Ph.D. advisor, Professor Nenad (Neno) Medvidović, for his priceless guidance and support. To me, he is not only a knowledgeable academic advisor but also an inspiring life advisor. Before joining Neno's lab, I had always been an "Asian-style" person, humble and quiet.
He has helped me become more confident about myself and pushed my capabilities further with his advice on research, collaboration, presentation skills, and so on. Neno has very high expectations for our research, which required me to work really hard. However, under his supervision, my great effort paid off with five publications in top-tier journals and conferences in the software architecture area. One of them was recognized with the best paper award at the International Conference on Software Architecture (ICSA) 2018.

I would like to thank the amazing members of my dissertation committee, Professors Barry Boehm, Sandeep Gupta, William G.J. Halfond, and Chao Wang, for all of their constructive feedback and insightful thoughts about my dissertation. I also would like to thank Professor Rafael Capilla for our successful collaboration, which resulted in a portion of my dissertation.

My passion for doing research in the software engineering area originated from my M.Sc. years at SELab, POSTECH, South Korea. I would like to thank Professor Kyo Chul Kang, who gave me a great deal of first-hand knowledge about the world of research and academia. Two years of doing research, writing papers, hiking mountains, and drinking soju with Professor Kang and the other SELab members are among my best memories.

I would like to thank the mentors of my three internships, especially Karthik Rajamony from Veritas Technologies. Working with him not only helped me find the joy of working in industry but also gave me a vision of my career path, which has currently led me to Bloomberg.

I would like to thank the other Ph.D. students in the SoftArch research group, who have helped me during my Ph.D. years. I really appreciate Joshua Garcia for his guidance in my first year; he always answered my questions in the blink of an eye, day and night. I also would like to thank Jae Yong Bang, another senior as well as my buddy, who taught me a lot about SoftArch, USC, and Los Angeles (LA).
He even trusted me with his beloved car. Other members I would like to thank include Yixue Zhao, Daniel Link, Arman Shahbazian, Pooyan Behnamghader, Gholamreza Safi, Youn Kyu Lee, Suhrid Karthik, and others. Finally, I would like to send my special gratitude to Lizsl De Leon for all her help during the past five years.

I would like to thank my Vietnamese friends at USC, who have helped me and shared with me lots of joys, fun road trips, gossip, and Frappuccino cups at Starbucks. Phong Trinh, Hien To, Khiem Ngo, Anh Tran, Loc Huynh, Hieu Nguyen, Minh Nguyen, and others: you are a part of my youth in the U.S. I also would like to thank the Vietnam Education Foundation (VEF) for granting me a scholarship to pursue my Ph.D. at USC. This was not only a great opportunity for me to meet top researchers and professors in the world, but also to get to know many amazing Vietnamese scholars who are contributing to both the U.S. and Vietnam.

Finally, I would like to thank my family. Without them, I would not be here today. I wish my dad, Lê Ngọc Quang, could be here with me to share this moment. The more I live, the more I understand how much he influenced me. Although it has been years since he passed, I still carry his lessons with me every day. Secondly, I would like to express my gratitude to my mother, Đặng Thị Liên, who supports me unconditionally. I appreciate her being with my father during his illness while my brother and I were far away from home. Finally, I would like to thank my brother, Lê Minh Ngọc, who supported our family while I was in Korea and the U.S. He is the reliable mainstay of the family, so that I could focus on my Ph.D.

Let me wrap up with heartfelt thanks to everyone whom I did not name here but who are definitely a part of my life and have helped me grow. At 10 years old, being able to solve math problems made me feel that I could overcome any problem in my life.
At 20 years old, standing up again after my first love breakup taught me how to be strong when confronting troubles in my life. And now, at the age of 30, finishing the Ph.D. has made me more confident in myself and in what I can do. No one knows exactly what will happen in the next ten years, and I am not an exception. However, I am now ready to start a new chapter of my life and happy to take on any challenge.

Table of Contents

Dedication
Acknowledgements
List of Tables
List of Figures
Abstract

Chapter 1: Introduction
  1.1 Research Problems
  1.2 Insights and Hypotheses
  1.3 Dissertation's Approach and Contributions
  1.4 Dissertation's Structure

Chapter 2: Background
  2.1 Software Architecture
  2.2 Architecture Recovery
  2.3 Architectural Change Metrics
  2.4 Architectural Smells
  2.5 ARCADE Framework
  2.6 Machine Learning Framework

Chapter 3: A Categorization of Architectural Smells
  3.1 Smell Detection Strategies
    3.1.1 Lexicon
    3.1.2 Structure
    3.1.3 Measurement
  3.2 Formalization of Architectural Smells
  3.3 Smell Detection Algorithms

Chapter 4: An Empirical Study of Software Architectural Changes
  4.1 Foundation
    4.1.1 Adapting Architectural Recovery Techniques
  4.2 Research Questions and Hypotheses
  4.3 Empirical Study Setup
  4.4 Results
    4.4.1 RQ1: Architectural Change at the System-Level
    4.4.2 RQ2: Architectural Change at the Component-Level
    4.4.3 RQ3: System-Level vs. Component-Level Change
    4.4.4 RQ4: Architectural Change in Consecutive Minor Versions

Chapter 5: An Empirical Study of Software Architectural Decay
  5.1 Foundation
    5.1.1 Architecture Recovery
    5.1.2 Architectural Smells
    5.1.3 Issue Tracking Systems
  5.2 Research Question and Hypotheses
  5.3 Subject Systems
  5.4 Results
    5.4.1 Hypothesis H1: Issue-Proneness of Files
    5.4.2 Hypothesis H2: Change-Proneness of Files
    5.4.3 Hypothesis H3: Bug-Proneness of Issues
  5.5 Further Discussion
    5.5.1 Significance and Implication of Our Results
    5.5.2 Architectural Smells as Technical Debt

Chapter 6: Speculative Analysis to Predict Impact of Architectural Decay on the Implementation of Software Systems
  6.1 Foundation
    6.1.1 Labeling Data
    6.1.2 Balancing Data
  6.2 Research Question and Hypotheses
  6.3 Subject Systems
  6.4 Results
    6.4.1 RQ1: System-Specific Prediction
    6.4.2 RQ2: Generic Prediction

Chapter 7: Visualizing Architectural Changes and Decay
  7.1 Motivation
  7.2 Implementation of Architectural Visualizations
  7.3 Visualizing Architectural Changes and Decay of Android Framework

Chapter 8: Related Work
  8.1 Architectural Changes
  8.2 Architectural Decay
  8.3 Predicting Implementation Issues and Code Change

Chapter 9: Conclusions and Future Work
  9.1 Future Work

Bibliography

List of Tables

3.1 Subject systems to illustrate the proposed smell catalog
4.1 Apache subject systems analyzed in the empirical study of architectural changes
4.2 Non-Apache subject systems analyzed in our study
4.3 Average a2a values between versions of Apache subject systems
4.4 Average a2a values between versions of non-Apache software systems
4.5 Average cvg values between versions of Apache subject systems
4.6 Average cvg values between versions of non-Apache subject systems
4.7 Minimum a2a values between minor versions
5.1 Subject systems analyzed in the empirical study of architectural decay
5.2 Average numbers of architectural smells per system version
5.3 Average number of issues per file
5.4 Average number of commits per file
5.5 Average percentage of buggy issues
5.6 R-squared values of linear regression models of long-lived smelly files
6.1 Data samples of Hadoop
6.2 Subject systems used in building prediction models
6.3 Predicting issue-proneness using Decision Table
6.4 Predicting change-proneness using Decision Table
6.5 Numbers of combinations of different experiment setups
6.6 Predicting issue-proneness: precision under ACDC view
6.7 Predicting issue-proneness: recall under ACDC view
6.8 Predicting issue-proneness: precision under ARC view
6.9 Predicting issue-proneness: recall under ARC view
6.10 Predicting issue-proneness: precision under PKG view
6.11 Predicting issue-proneness: recall under PKG view
6.12 Standard deviations of precisions: ACDC view
6.13 Standard deviations of recalls: ACDC view

List of Figures

2.1 A notional software system's architecture
2.2 ARCADE's key components and the artifacts it uses and produces
2.3 Weka UI
2.4 Weka KnowledgeFlow
4.1 Average a2a values in three subject systems
4.2 a2a values between minor versions of Ivy
4.3 cvg(s,t) values between minor versions of Ivy
4.4 cvg(t,s) values between minor versions of Ivy
5.1 Mapping architectural smells to issues
5.2 Percentage of smelly files in Camel (ACDC view)
5.3 Size distribution of smelly vs. non-smelly files in Camel (ACDC view)
5.4 Top-5 long-lived smelly files in Hadoop (top) and Struts2 (bottom)
6.1 Precision under ACDC view
6.2 Recall under ACDC view
7.1 Visualizing component changes
7.2 Visualizing dependency changes (1)
7.3 Visualizing dependency changes (2)
7.4 Visualizing interface changes (1)
7.5 Visualizing interface changes (2)
7.6 Visualizing smelly components (1)
7.7 Visualizing smelly components (2)

Abstract

Changes to a software system require understanding and, in many cases, updating its architecture. A system's architecture is the set of principal design decisions about the software system. Over time, a system's architecture is increasingly affected by a phenomenon called architectural decay, which is caused by careless or unintended addition, removal, and modification of architectural design decisions. These decisions deviate from the architects' well-considered intent, and result in systems whose implemented architectures differ significantly, sometimes fundamentally, from their originally designed architectures. In practice, software systems regularly exhibit increased architectural decay as they evolve. Architectural decay may not cause immediate system failures; however, it imposes real costs in terms of engineers' time and effort, as well as system correctness and performance. Therefore, software maintenance tends to increase and come to dominate the cost and effort across activities in a system's lifecycle.

The observation that architectural decay occurs regularly in long-lived systems has been a part of software engineering folklore from the very beginnings of the study of software architecture. At the same time, there is a relative scarcity of empirical data about the nature of architectural change and the actual extent of decay in existing systems.
Before this dissertation, there had been no large-scale empirical study exploring architectural changes and decay, their characteristics, and the possible trends they may follow. One main reason for this gap is that, while software programs' code bases can be monitored closely by many existing development-support tools, their architectures cannot. In fact, there is currently a lack of methods and/or tools to monitor a software architecture's evolution and to detect and prevent architectural decay. Consequently, our current understanding of the nature of architectural changes and decay is still very poor.

To address this lack of knowledge about architectural changes and decay, this dissertation has completed different pieces of work, ranging from extending the theoretical foundation to conducting empirical studies on actual systems. On the theoretical side, this dissertation has surveyed a set of architecture-similarity metrics for measuring architectural changes across the development history of software systems. In addition, a classification framework for software architectural smells, including formalized definitions and detection algorithms, has been developed to capture the symptoms of decay. On the empirical side, this dissertation has conducted two large-scale studies to determine and understand the nature of architectural changes and decay in real-world software systems. Those studies have empirically demonstrated findings that had previously only been suspected. First, software developers are often unaware of dramatic architectural changes. Second, the parts of a system that exhibit signs of architectural decay experience more implementation-level problems than the system's architecturally "clean" parts. Those studies have also found decay patterns along software systems' lifespans, such as an increase over time in reported issues and maintenance effort related to long-lived architectural smells.
Finally, based on the findings about architectural decay, this dissertation has developed an architecture-based speculative analysis to predict issues in software systems. The evaluation of this approach has shown that architectural decay information, e.g., architectural smells, can be used to build highly accurate models for predicting the issue-proneness and change-proneness of software systems.

Chapter 1
Introduction

In practice, software systems change regularly, and so do their architectures. A software system's architecture is the set of principal design decisions about that system [106]. During the lifetime of the system, it is expected that such decisions will be added, removed, and modified. Over time, a system's architecture is increasingly affected by a phenomenon called architectural decay, which is caused by careless or unintended architectural design decisions [94]. Decay results in systems whose implemented architectures differ significantly, sometimes fundamentally, from their designed architectures. For this reason, changes to a software system require understanding and, in many cases, updating its architecture.

Ever since the beginning of software architecture research, researchers and practitioners have experienced both architectural changes and decay, as well as their negative impacts on software systems. The downsides of architectural decay have been part of the software engineering community's folklore. As a consequence of architectural decay, software maintenance tends to increase over time and dominate the cost and effort across activities during a system's lifespan.

1.1 Research Problems

It is widely accepted that, during the lifetime of a software system, the system's architecture changes constantly, leading to instances of architectural decay [94]. For this reason, to identify and track architectural decay across the evolution history of a software system, the system's architecture and its changes must be reliably determined and understood.
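To make the notion of "measuring architectural change" concrete, the following sketch compares two recovered architectures, each a mapping from components to the code entities they contain, by counting how many entities would have to move between components. This is a simplified illustration in the spirit of change metrics such as a2a, not ARCADE's actual implementation; it also assumes component names can be matched across versions, which real metrics do not rely on, and all names in the example are invented.

```python
def change_similarity(old: dict, new: dict) -> float:
    """Percentage of code entities that stay in a matching component.

    `old` and `new` map component names to sets of code entities
    (e.g., classes). A simplified stand-in for metrics such as a2a.
    """
    all_entities = set().union(*old.values()) | set().union(*new.values())
    moved = 0
    for entity in all_entities:
        old_comp = next((c for c, es in old.items() if entity in es), None)
        new_comp = next((c for c, es in new.items() if entity in es), None)
        if old_comp != new_comp:
            moved += 1  # entity was added, removed, or relocated
    return 100.0 * (1 - moved / len(all_entities))

# Entity "B" moved from "core" to "io" between the two versions.
v1 = {"core": {"A", "B"}, "io": {"C"}}
v2 = {"core": {"A"}, "io": {"B", "C"}}
print(change_similarity(v1, v2))
```

A low similarity between consecutive versions flags a large architectural change, which is exactly the kind of signal the empirical studies in Chapter 4 aggregate over hundreds of versions.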
However, while many development-support tools exist to closely monitor software programs' code bases, there is a lack of methods and/or tools to monitor the architectural evolution of software systems. Furthermore, while there is prior research on architectural decay [51, 50], there was no systematic categorization capturing the symptoms of architectural decay, nor algorithms for automatically detecting them. Consequently, before this dissertation, there had been no large-scale empirical study exploring the nature of architectural changes and decay, their characteristics, and the possible trends they may follow. This fact, which urgently calls for in-depth empirical studies of architectural changes and decay, is the main research motivation of this dissertation.

Recent advances in related areas have provided a solid foundation for addressing the aforementioned problem. First, to address missing documentation about a system's architecture, many existing architecture recovery techniques can recover the system's actual architecture from its implementation. A recent comparative analysis of architecture recovery techniques [48] has identified the scenarios in which various recovery techniques are accurate and scalable. Second, to identify symptoms of architectural decay, researchers have started defining architectural smells [51, 50] and have shown the usefulness of certain smells in highlighting different problems in real systems [75, 82]. Third, thanks to open-source development, the source code and development information of many real-world systems are now publicly available. Specifically, those open-source systems' source code is free to download, and their artifacts are publicly organized by many advanced software development tools (e.g., GitHub to monitor source code, Jira to keep track of implementation issues).
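For instance, linking tracked implementation issues to the files that changed is commonly done by scanning commit messages for issue keys. The sketch below illustrates the idea under the assumption of Jira-style keys; the key pattern and the sample commits are invented for illustration, not taken from the dissertation's data.

```python
import re
from collections import defaultdict

ISSUE_KEY = re.compile(r"\b([A-Z][A-Z0-9]+-\d+)\b")  # e.g., HADOOP-1234

def link_issues_to_files(commits):
    """Map each Jira-style issue key to the files its commits touched.

    `commits` is an iterable of (message, files) pairs, such as the
    output of walking `git log` with the list of changed paths.
    """
    issue_files = defaultdict(set)
    for message, files in commits:
        for key in ISSUE_KEY.findall(message):
            issue_files[key].update(files)
    return dict(issue_files)

commits = [
    ("HADOOP-1234: fix NPE in scheduler", ["Scheduler.java"]),
    ("HADOOP-1234: follow-up", ["Scheduler.java", "Util.java"]),
    ("refactor imports", ["Main.java"]),  # no issue key: ignored
]
print(link_issues_to_files(commits))
```

Aggregating such links per file is the basis for the issue-proneness measures used later in the dissertation.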
Last but not least, computing power is now cheap enough to analyze large amounts of data in a short time. Those advances allow this dissertation to further explore and answer sophisticated research questions about architectural evolution and decay. The next sections of this chapter discuss the dissertation's insights and hypotheses, our proposed approach, and a summary of the main contributions.

1.2 Insights and Hypotheses

This section introduces the hypotheses that this dissertation plans to test and the insights from which these hypotheses are drawn. The insights come mainly from our knowledge of the software engineering domain and the underlying techniques relevant to architecture recovery, estimation of architectural similarity, classifying and cataloging architectural smells, and analyzing and predicting implementation issues.

Observation 1: Architectural recovery techniques are reasonably reliable in terms of accuracy and performance (verified in preexisting research).

Observation 2: Software engineers keep track of most of the development information (e.g., implementation issues, revisions used to resolve issues, release notes for each new version, etc.).

Insight 1a: Architecture recovery methods can extract as-is architectural views of software systems from the source code.

Insight 1b: In the open-source community, software engineers use the "major.minor.patch" scheme to reflect their intentions and perceptions of how a system changes.

Hypothesis 1: The versioning scheme commonly used in open-source systems is not an accurate predictor of the extent of architectural change during the system's evolution.

Insight 2a: "Architectural smells", which are indications of poorly thought-through design decisions, constitute manifestations of architectural decay. Architectural smells can be used to study some characteristics of decay.
Insight 2b: Tracked implementation issues can be used to validate the impact of architectural smells on the system's implementation over its evolution.

Hypothesis 2: Over a system's evolution, architectural smells exacerbate implementation issues.

Insight 3a: Architectural decay is one form of technical debt.

Insight 3b: A large amount of available data (e.g., numbers of system versions and reported issues) allows us to use data mining techniques to build prediction models of architectural decay's impact.

Hypothesis 3: It is possible to construct accurate models to predict the impact of architectural smells on system implementation issues.

1.3 Dissertation's Approach and Contributions

To evaluate the above three hypotheses, this dissertation's approach consists of different themes that range from extending the theoretical foundation of architectural smells to conducting empirical studies, as follows:

1. Architecture Recovery Techniques: To study architectural evolution, the architecture at a given point in time during a system's evolution must be extracted. To that end, a number of software architecture recovery techniques have been designed [65, 40, 48, 111, 78]. This dissertation surveys different recovery techniques to select appropriate ones and adapts them to meet the needs of the empirical studies on architectural changes and decay.

2. Architectural Change Metrics: To measure architectural changes across the development history of a software system, architectural change metrics are needed. Those metrics have to consider changes at different levels: inter- and intra-component. This dissertation collects appropriate metrics or invents new ones where necessary.

3.
A Classification Framework for Software Architectural Smells: Two key reasons for the absence of empirical studies investigating architectural decay in actual systems are the lack of a systematic categorization of smells (the symptoms of architectural decay) and the absence of algorithms for automatically detecting smells. A classification of smells will be the foundation for further studies of the essence of architectural smells, their inter-relationships, and their impact on software systems. Automated detection algorithms will facilitate those studies by allowing analyses of large numbers of systems and ensuring the repeatability and generality of results.

4. An Empirical Study of Software Architectural Changes: Architectural changes during the lifespan of a system may result in architectural decay. Nevertheless, there is a relative scarcity of empirical data regarding the architectural changes that may lead to decay and developers' understanding of those changes. Specifically, engineers must be able to pinpoint important architectural changes at different levels of abstraction and from multiple architectural views, which can, in turn, point to factors that cause decay.

5. An Empirical Study of Software Architectural Decay: This study aims to empirically demonstrate a finding that had previously only been suspected: the parts of a system that exhibit signs of architectural decay experience more implementation-level problems than the system's architecturally "clean" parts. It is also expected that one would observe symptoms of decay during a system's life span, such as an increase in reported issues and maintenance effort related to chronic architectural smells.

6. Speculative Analysis to Predict the Impact of Architectural Decay on the Implementation of Software Systems: Even though architectural decay may not crash a system outright, it imposes real costs in terms of engineers' time and effort, as well as system correctness and performance.
If historical data regarding architectural decay can be used to create highly accurate prediction models for a system's issue- and change-proneness, that indicates that architectural decay has consistent impacts on systems' implementations. Furthermore, the models will be useful for maintainers to foresee likely future problems in newly decay-impacted parts of the system. They could also help in creating maintenance plans to effectively reduce the system's issue- and change-proneness. Lastly, understanding decay's impact would raise developers' awareness of their systems' architecture and motivate them to produce architecturally "clean" code.

The listed work items require this dissertation to build an adaptable analysis framework capable of analyzing a large amount of data. To that end, the dissertation's approach is built on an existing framework, the Architecture Recovery, Change, And Decay Evaluator (ARCADE) [47]. ARCADE is a software workbench that employs (1) a suite of architecture recovery techniques, (2) a set of metrics for measuring different aspects of architectural changes, and (3) a set of architectural smell detection techniques. As a part of this dissertation, ARCADE has been extended to construct an expansive view showcasing the actual (as opposed to idealized) evolution of a software system's architecture and to collect information relevant to the changes and decay of the software system. In summary, the contributions of this dissertation are as follows:

1. A classification of architectural smells and smell detection algorithms: This classification aims at addressing the shortcomings of prior research on architectural decay. It provides a systematic framework (1) to classify architectural smells based on their characteristics and (2) to detect these smells based on their symptoms.
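To give a flavor of what a structure-based detection check looks like, the sketch below flags cyclic dependencies among components via depth-first search; a dependency cycle is a classic symptom of a dependency-based smell. This is an illustrative example, not one of the dissertation's actual detection algorithms, and the component graphs are invented.

```python
def has_dependency_cycle(deps: dict) -> bool:
    """Detect a cycle in a component-dependency graph.

    `deps` maps each component to the list of components it depends on.
    Uses the standard "visiting/done" coloring scheme: a dependency on a
    component that is still being visited is a back edge, i.e., a cycle.
    """
    visiting, done = set(), set()

    def dfs(node):
        if node in visiting:
            return True   # back edge: cycle found
        if node in done:
            return False  # already fully explored
        visiting.add(node)
        if any(dfs(n) for n in deps.get(node, ())):
            return True
        visiting.discard(node)
        done.add(node)
        return False

    return any(dfs(c) for c in deps)

clean = {"ui": ["core"], "core": ["io"], "io": []}
smelly = {"ui": ["core"], "core": ["io"], "io": ["core"]}  # core <-> io
print(has_dependency_cycle(clean), has_dependency_cycle(smelly))
```

Detectors for the other smell categories follow the same recipe: compute a property of the recovered architecture and flag components where it violates an accepted design principle.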
In summary, this dissertation settled on 11 architectural smells, all of which result in violations of widely accepted software engineering principles. Each smell defined in the classification falls into one of four categories: (1) concern-based smells are caused by inappropriate or inadequate separation of concerns [38]; (2) dependency-based smells arise from the system components' interconnections and interactions; (3) interface-based smells result from deficiencies in defining system components' interfaces; and (4) coupling-based smells originate from the couplings of system components. This dissertation has applied the developed smell detection algorithms to 421 versions of 8 widely used open-source systems, totaling 376 MSLOC. Average analysis times per system version ranged from several seconds to several minutes, demonstrating the algorithms' practical applicability. On average, the algorithms detected nearly 140 architectural smells per system version, demonstrating the prevalence of architectural decay in real-world systems.

2. An Empirical Study of Architectural Changes: This dissertation has employed ARCADE in an empirical study that analyzed several hundred versions of 14 open-source Apache systems. Specifically, this dissertation applied three of the ten architecture recovery techniques that ARCADE currently implements. Two of these techniques, the Algorithm for Comprehension-Driven Clustering (ACDC) [109] and Architecture Recovery using Concerns (ARC) [52], recover conceptual views of a system's architecture; the third, PKG, recovers a system's package-level organization, which represents the implementation view of the architecture [66]. ACDC and ARC were chosen because they demonstrated better accuracy and scalability than other recovery techniques in our previous empirical evaluation [48]. PKG provides an objective (if, from an architectural perspective, partial) baseline for assessing our results.
Additionally, the three techniques approach recovery from different, complementary angles: ACDC leverages a system's module dependencies; ARC relies on information retrieval to derive a more semantic view of a system's architecture; and PKG strictly reflects the system's implementation organization. The empirical study resulted in the following findings regarding architectural changes in software systems:

(a) A semantics-based architectural view (yielded by ARC) highlights notably different aspects of a system's evolution than the corresponding structure-based views (yielded by ACDC and PKG). The study found several cases in which the semantics-based view revealed important architectural changes that remained concealed in the two structure-based views. At the same time, existing architecture recovery techniques have relied heavily on structural information [40, 48, 64, 65, 78]. This suggests that more research on semantics-based recovery is needed in order to properly aid software system maintenance.

(b) Architectural changes occur within software components during a system's evolution, even when the system's overall architectural structure remains relatively stable. Intra-component architectural changes are especially important to track in cases of relatively small system evolution increments. Relying on the architecture's structural stability in those cases may conceal non-trivial issues that will become apparent much later, when subsequent architectural changes make them more difficult to address.

(c) While useful as an accurate representation of how a system's code base is organized (i.e., of the system's "implementation architecture" view [66]), the package structure is a limited indicator of the system's underlying architecture. PKG yielded especially misleading results when implementation changes that were confined to specific, already existing packages actually had far-reaching architectural implications.
Such implications were more readily uncovered by ACDC and ARC, and were independently confirmed by a group of researchers through code and architecture inspections.

(d) Finally, dramatic architectural change tends to occur both (1) between the end of one major version and the start of the next, and (2) across one or more minor versions of a software system. In other words, minor versions may result in major architectural changes. Furthermore, this study discovered that, in some cases, significant architectural changes happen between pre-releases of a minor version. In other words, major changes to a system's architecture occur very late in the run-up to a new release, when common sense suggests that the architecture should be stable. This suggests that a system's versioning scheme is not strongly related to the extent of architectural change. In turn, this may be an added factor complicating the maintenance of a system's architecture and contributing to architectural decay.

3. An Empirical Study of Architectural Decay: To pursue the problem of architectural decay, this dissertation conducted the largest empirical study to date of architectural decay in long-lived software systems. The study's scope is reflected in the total number of versions (421) of 8 different subject systems comprising 376 MSLOC, the number of examined implementation issues (41,889), the number of identified architectural smells (172,934) spanning 11 different smell types, the number of applied architecture-recovery techniques (3), resulting in distinct architectural views produced for each system, and the number of analyzed architectural models (1,263, i.e., three views per system version). In the absence of reliable data from developers, who are most often unable to pinpoint the exact source of their maintenance difficulties, this study relies on documented implementation issues to find evidence of the impact of architectural smells.
This study is believed to be the first to empirically demonstrate a finding that had previously only been suspected: the parts of a system that exhibit signs of architectural decay experience more implementation-level problems than the system's architecturally "clean" parts. For example, the study found that files that implement "smelly" parts of a system's architecture are more issue-prone and change-prone than the "clean" files. Furthermore, in four subject systems, more bugs appear in parts that are affected by architectural decay. The study also found an increase over time of reported issues and maintenance effort related to long-lived architectural smells. Those findings indicate that underlying architectural problems have increasing negative consequences on the quality of a system's implementation, and may be the root cause of many implementation-level issues.

4. Speculative Analysis to Predict the Impact of Architectural Decay on the Implementation of Software Systems: Based on the correlations between architectural decay and the reported implementation issues, this dissertation has developed an architecture-based approach to accurately predict the issue- and change-proneness of a system's implementation. The approach has been validated over ten actual software systems, considering 11 different smells under three different architectural views. The study has resulted in the following findings regarding the predictability of the architecture-based models:

(a) Architectural smells have consistent impacts on systems' implementation along the systems' life cycle. Thus, the architectural smells detected in a system can help to predict the issue-proneness and change-proneness of that system at a given point in time with high accuracy. The average precisions and recalls in the ACDC and ARC views are at least 70%. ACDC and ARC exceed PKG in predicting both issue- and change-proneness.
This observation again emphasizes the importance of architecture recovery techniques in helping developers understand the underlying architecture precisely. This is a useful approach for maintainers to foresee future problems in newly smell-impacted parts of the system. The approach also helps them create maintenance schedules in order to effectively reduce the system's issue- and change-proneness. The result also suggests future work, in which we would like to validate whether architectural smells should be detected under a specific architectural view to obtain the highest accuracy.

(b) Software systems tend to share properties with respect to issue- and change-proneness. This allows developers to use generic models, created by using data from a set of software systems, to predict the issue- and change-proneness of new software systems in the early stages of their development, before sufficiently large numbers of system versions become available. The accuracy of generic models is lower than that of specific models; however, the gap is only about 10% or less. Furthermore, our empirical study suggested that using at least 5 systems can help to create a reliable generic model to predict the issue- and change-proneness of new software systems with high accuracy. As part of our future work, we would like to extend the number of subject systems to explore the potential of other kinds of prediction.

5. Visualizations of Architectural Changes and Decay: Different visualization techniques, e.g., interactive visualizations and color-coded labels, have been used in this dissertation to create visualizations of the obtained data about architectural changes and decay. In this respect, the visualizations can be a useful medium for facilitating the adoption of the dissertation's approach.
Notably, Huawei [13], a large telecommunication equipment manufacturer, has made an attempt to adopt ARCADE and its visualizations and to integrate the tool into its software development toolset.

1.4 Dissertation's Structure

The remainder of this dissertation is structured as follows: Chapter 2 defines and formalizes fundamental concepts: software architecture, architecture recovery, architectural change metrics, and architectural smells. Chapter 3 describes this dissertation's classification of smells. Chapters 4, 5, and 6 present the results and implications of the two empirical studies as well as the architecture-based approach to predicting implementation issues' properties. Chapter 7 introduces an attempt to visualize architectural changes and decay in an intuitive and adoptable way. Chapter 8 covers the related work. Finally, the dissertation concludes in Chapter 9.

Chapter 2: Background

This chapter describes the basic concepts that will be used throughout this dissertation. Specifically, it focuses on the concepts of software architecture, architecture recovery, architectural change metrics, and architectural smells.

2.1 Software Architecture

Figure 2.1 shows a notional software architecture A that comprises two components, C_1 and C_2. Each component contains multiple implementation-level entities. Between entities, links are represented by solid arrows and couplings by dashed lines. Links and couplings represent connections among architectural components and implementation entities of a system. Links are the channels over which components transfer data and control via their interfaces. In addition to traditionally considered explicit links, implicit couplings have been shown to play a significant role in detecting architectural problems [117].

Figure 2.1: A notional software system's architecture.
Couplings are entities in a code base that are required to be updated at the same time, even though there may be no explicit links between those parts. Couplings therefore tie architectural components to each other during the evolution of the system. There are two types of couplings: Co-change (Co) and Duplicate (Du). Co-changes represent coupled components that tend to be modified together during a system's lifetime. Duplicates (a.k.a. clones) represent coupled components with identical pieces of code.

A software system's architecture is a graph A whose vertices represent the system's set of components C, and whose topology represents the connections embodied in the set of links L and the set of couplings Cp between these components:

A = (C, L, Cp)

For our purposes, a system is a tuple that consists of architecture A and a nonempty set of topics (i.e., concerns) T addressed by the system. Each topic is defined as a probability distribution Pd over the system's nonempty set of keywords W, whose elements are used to "describe" that system (e.g., via comments in source code). By examining the words that have the highest probabilities in a topic, the meaning of that topic may be discerned. In this way, a topic can serve as a representation of a concern addressed by a system. The set of topics T is then a representation of the system's concerns. While links and couplings provide structural information about a software system, concerns represent roles, responsibilities, concepts, or purposes of the system's entities [52].

S = (A, T)    W = {w_i | i ∈ N}    T = {z_i | i ∈ N}    z = Pd(W)

A component can be either simple or composite. A composite component is an architecture in its own right, allowing for multiple levels of architectural abstraction. Its definition is therefore the same as that for architecture A above. Our definition of architectural smells does not depend on architectural hierarchy, and we thus exclude composite components from our formalization.
Additionally, since automated architecture-recovery techniques overwhelmingly focus on components, we do not distinguish between components and connectors here. A component is a tuple comprising the component's internal entities E and the probability distribution θ_c over the system's topics T. Entities are implementation elements used to build a system. An entity e contains its interface I (a set of elements ie that expose that component's functionality or data), a set of links L_E, and a set of couplings Cp_E to other entities. In the object-oriented paradigm, entities are classes and interfaces are public methods.

C = {c_i | i ∈ N}    c = (E, θ_c)    E = {e_i | i ∈ N}    e_i = (I_i, L_{E_i}, Cp_{E_i})    I = {ie_i | i ∈ N}

Both a link l and a coupling cp consist of a source src and a destination dst, which are entities involved in an interconnection. Links are unidirectional, while couplings are bidirectional. The union of the links of all entities is the set of links L of the graph A. The union of the couplings of all entities is the set of couplings Cp of the graph A. When composing the two unions, we convert the entities dst and src to their parent components, so that L and Cp are links and couplings between components.

L_E = {l_i | i ∈ N_0}    l = (src, dst)    Cp_E = Co ∪ Du    Cp_E = {cp_i | i ∈ N_0}    cp = (src, dst)    L = ∪_{i=1}^{n} L_{E_i}    Cp = ∪_{i=1}^{n} Cp_{E_i}

2.2 Architecture Recovery

Garcia et al. recently conducted a comparative evaluation of software architecture recovery techniques [48]. The objective was to evaluate the existing techniques' accuracy and scalability on a set of systems for which the authors and other researchers had previously obtained "ground-truth" architectures [49]. To that end, the authors implemented a tool suite offering a large set of architecture recovery choices to an engineer. The study shows that a number of the state-of-the-art recovery techniques suffer from accuracy and/or scalability problems.
At the same time, two techniques consistently outperformed the rest across the subject systems. I select these techniques for my analysis. These two techniques, ACDC [109] and ARC [52], take different approaches to architecture recovery: ACDC leverages a system's structural characteristics to cluster implementation-level modules into architectural components, while ARC focuses on the concerns implemented by a system. The former is obtained via static dependency analysis, while the latter leverages information retrieval and machine learning. ACDC [109] groups entities into clusters based on patterns, most of which involve the dependencies among the entities. For example, ACDC's main pattern attempts to group entities so that only a single dependency exists between any two clusters. ARC [52] groups entities that handle similar system concerns into a single cluster. For instance, ARC may group together the entities that handle user interface behaviors. I complement these two clustering-based architectural views with PKG, a tool that was also implemented in ARCADE to extract a system's package structure. Package structure is considered a reliable view of a system's "implementation architecture" [66]; while not indicative of the actual architecture underlying the system [107], the package structure provides a useful baseline (a "sanity check") for my study.

2.3 Architectural Change Metrics

Architectural changes can be considered at two different levels: system-level and component-level. At the system level, architectural change refers to the addition, removal, and modification of components; at the component level, architectural change reflects the placement of a system's implementation-level entities inside the architectural components (i.e., clusters). Studying architectural change at these two levels of abstraction allows researchers to determine when a system-level architectural view evolves significantly differently than a component-level view.
Identifying such discrepancies may reveal points in a software system's evolution where architectural maintenance issues occur, as well as the scope of those issues. Using such information, engineers can further determine if an architectural maintenance issue affects the system level, particular components of an architecture, or both. There are different metrics for quantifying architectural change at different levels. This section will describe two popular metrics: MoJoFM, a system-level metric, and cvg, a component-level metric. A third metric, c2c [48], will also be introduced because it enables the computation of cvg.

MoJoFM [114] is a widely used change metric. It is a distance measure between two architectures, expressed as a percentage. MoJoFM is based on two operations used to transform one architecture into another: moves (Move) of implementation-level entities from one architectural cluster (i.e., component) to another and merges (Join) of clusters. MoJoFM is defined as

MoJoFM(A_i, A_j) = (1 - mno(A_i, A_j) / max(mno(∀A_i, A_j))) × 100%

where A_i is the architecture obtained from version i of a system S; A_j is the architecture from version j of S; and mno(A_i, A_j) is the minimum number of Move and Join operations needed to transform A_i into A_j.

Cluster-to-cluster (c2c) is a metric that Garcia et al. developed and applied in their recent work [48] to assess component-level change. This metric measures the degree of overlap between the implementation-level entities contained within two clusters:

c2c(c_i, c_j) = (|entities(c_i) ∩ entities(c_j)| / max(|entities(c_i)|, |entities(c_j)|)) × 100%

where entities(c) is the set of entities in cluster c, and c_i is a cluster from version i of system S. The denominator is used to normalize the entity overlap in the numerator by the number of entities in the larger of the two clusters. This ensures that c2c provides the most conservative value of similarity between two clusters.
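The c2c definition above translates directly into code. The following sketch models a cluster simply as a set of entity names; this representation is an assumption of the illustration, not necessarily ARCADE's internal one:

```python
def c2c(cluster_i, cluster_j):
    """Cluster-to-cluster similarity: percentage overlap between the entity
    sets of two clusters, normalized by the larger of the two clusters."""
    overlap = len(cluster_i & cluster_j)
    return 100.0 * overlap / max(len(cluster_i), len(cluster_j))
```

For example, a four-entity cluster that shares two entities with a two-entity cluster yields c2c = 50%, the conservative normalization described above.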
Cluster coverage (cvg) is another change metric, developed by Garcia [47] to indicate the extent to which two architectures' clusters overlap according to c2c:

cvg(A_1, A_2) = (|simC(A_1, A_2)| / |allC(A_1)|) × 100%

simC(A_1, A_2) = {c_i | (c_i ∈ A_1, ∃c_j ∈ A_2)(c2c(c_i, c_j) > th_cvg)}

simC(A_1, A_2) returns the subset of clusters from A_1 that have at least one "similar" cluster in A_2. More specifically, simC(A_1, A_2) returns A_1's clusters for which the c2c value is above a threshold th_cvg for one or more clusters from A_2. allC(A_1) returns the set of all clusters in A_1.

cvg allows an engineer to determine the extent to which certain components existed in an earlier version of a system or were added in a later version. For example, consider a system whose version v2 was created after v1, and for which cvg(A_1, A_2) = 70% and cvg(A_2, A_1) = 40%. This means that 70% of the components in version v1 still exist in version v2, while 100% - cvg(A_2, A_1) = 60% of the components in version v2 have been newly added.

Architecture-to-architecture (a2a) [21] is a similarity metric which was developed for assessing system-level change. a2a was inspired by the widely used MoJo [110] and MoJoFM [114] metrics. Neither MoJo nor MoJoFM is intended or designed for a study of architectural evolution of the type attempted here: MoJo is a heuristic distance metric intended to determine the similarity between two different architectures with the same set of implementation-level entities [110]; MoJoFM is an effectiveness measure for software clustering algorithms, based on MoJo, intended to compare a recovered architecture with a ground-truth architecture [114].
MoJoFM proved to be ill-suited for our study because it assumes that the entity sets in the architectures undergoing comparison will be identical (depending on the recovery method used, entities may be classes, methods, or other building blocks of a system); this is unrealistic for systems whose versions are known to have evolved, sometimes substantially. In order to address this shortcoming, the authors introduce mto, a distance metric that measures the distance between two architectures with arbitrary entity sets, and then normalize it to calculate a2a. Minimum-transform-operation (mto) is the minimum number of operations needed to transform one architecture into another:

mto(A_1, A_2) = remC(A_1, A_2) + addC(A_1, A_2) + remE(A_1, A_2) + addE(A_1, A_2) + movE(A_1, A_2)    (2.1)

The five operations used to transform architecture A_1 into A_2 comprise additions (addE), removals (remE), and moves (movE) of implementation-level entities from one cluster (i.e., component) to another, as well as additions (addC) and removals (remC) of clusters themselves [18, 80, 91]. Note that each addition and removal of an implementation-level entity requires two operations: an entity is first added to the architecture and only then moved to the appropriate cluster; conversely, an entity is first moved out of its current cluster and only then removed from the architecture. This is supported by several foundational works on architectural adaptation (e.g., [18, 80, 91]). The underlying intuition is as follows: if we think of the recovered architecture as a set of constituent building blocks (i.e., clusters and entities) and their configurations (i.e., the arrangement of entities inside clusters), then there is a difference between (a) simply changing the architectural configuration and (b) also changing the constituent building blocks.
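The operation counting behind mto, and its normalization into a2a, can be sketched as follows. This is an illustrative simplification: architectures are modeled as dicts mapping cluster names to sets of entity names, and clusters are matched by name, whereas the published metric minimizes the operation count over all possible matchings. The two-operation cost of each entity addition or removal follows the definition above.

```python
def mto_approx(arch1, arch2):
    """Approximate mto: operations to transform arch1 into arch2.
    Each entity addition/removal costs 2 (add-then-move / move-then-remove);
    surviving entities whose (name-matched) cluster changed cost 1 move."""
    ents1 = set().union(*arch1.values()) if arch1 else set()
    ents2 = set().union(*arch2.values()) if arch2 else set()
    added = ents2 - ents1                   # addE + a move into place
    removed = ents1 - ents2                 # remE + a move out first
    rem_c = len(set(arch1) - set(arch2))    # remC
    add_c = len(set(arch2) - set(arch1))    # addC
    home1 = {e: c for c, es in arch1.items() for e in es}
    home2 = {e: c for c, es in arch2.items() for e in es}
    moved = sum(1 for e in ents1 & ents2 if home1[e] != home2[e])
    return rem_c + add_c + 2 * len(removed) + 2 * len(added) + moved

def a2a(arch1, arch2):
    """Normalize mto by the cost of building both architectures from a
    "null" architecture, yielding a similarity percentage."""
    denom = mto_approx({}, arch1) + mto_approx({}, arch2)
    return (1 - mto_approx(arch1, arch2) / denom) * 100.0
```

For two small architectures that differ by one removed and one added entity, the sketch counts four operations, and a2a reports the corresponding similarity percentage.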
mto is normalized to calculate a2a, a similarity metric between two architectures with different implementation-level entities:

a2a(A_1, A_2) = (1 - mto(A_1, A_2) / (mto(A_∅, A_1) + mto(A_∅, A_2))) × 100%    (2.2)

where mto(A_∅, A_i) is the number of operations required to transform a "null" architecture A_∅ into A_i. In other words, the denominator mto(A_∅, A_1) + mto(A_∅, A_2) is the number of operations needed to construct architectures A_1 and A_2 from a "null" architecture. This approach is inspired by the foundational work on architectural adaptation cited above. Further discussion of a2a can be found in our published journal article [21].

2.4 Architectural Smells

Similar to the concept of smells at other levels of system abstraction (namely, code and design smells), architectural smells are ultimately instances of poor design decisions [81] at the architectural level. They have a negative impact on system life cycle properties, such as understandability, testability, extensibility, and re-usability [50]. While code smells [44], anti-patterns [30], and structural design smells [46] originate from implementation constructs (e.g., classes), architectural smells stem from poor use of software architecture-level abstractions: components, connectors, interfaces, patterns, styles, etc. Instances of architectural smells detected in a system's as-is architecture (as opposed to its idealized, as-designed architecture) are candidates for restructuring [28]. Removing architectural smells from a system helps to prevent architectural decay, which in turn improves the system's quality.

2.5 ARCADE Framework

To study architectural change and decay, Garcia [47] developed ARCADE, a software workbench that employs four key elements: (1) architecture recovery, (2) architectural-smell detection, (3) metrics that quantify change and decay, and (4) analysis of the correlation between the reported implementation-level issues in a given system and the discovered architecture-level issues.
ARCADE combines these elements in the manner depicted in Figure 2.2 to investigate a variety of questions regarding architectural change and decay. This section provides an overview of ARCADE's elements.

Figure 2.2: ARCADE's key components and the artifacts it uses and produces.

Recovery: ARCADE's foundational element is architecture recovery, depicted as Recovery Techniques in Figure 2.2. The architectures produced by Recovery Techniques are directly used for studying change and decay. ARCADE currently provides access to eight recovery techniques. This allows an engineer (1) to extract multiple architectural views and (2) to ensure maximum accuracy of extracted architectures by highlighting their different aspects.

Architectural-Smell Detection: ARCADE enables the study of architectural decay through the detection of architectural smells. Architectural Smell Detector in Figure 2.2 implements smell-detection algorithms based on the formalization of architectural concepts and definitions of smells. Garcia has implemented the detection of four smells: scattered parasitic functionality, concern overload, link overload, and dependency cycle.

Quantifying Change and Decay: As depicted in Figure 2.2, Change Metrics Calculator and Decay Metrics Calculator analyze the architectures yielded by Recovery Techniques. The computed metrics comprise two of the artifacts produced by ARCADE, which are then used to interpret the degree of architectural change and decay.

Relating Issues and Architectural Decay: A software system's issue repository (e.g., Jira or Bugzilla) can be a valuable source of information about architectural decay. To that end, ARCADE includes two components: Issue Extractor and Relation Analyzer. Issue Extractor obtains issues from an issue repository. Relation Analyzer determines relationships between the extracted issues and architectural smells by computing two different correlation coefficients: Pearson's and Spearman's.
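For concreteness, the two correlation coefficients computed by Relation Analyzer can be sketched in plain Python as follows. This is an illustration of the statistics themselves, not ARCADE's implementation, and the Spearman variant below assumes there are no tied values:

```python
from math import sqrt

def pearson(xs, ys):
    """Pearson's correlation coefficient between two equal-length samples."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sqrt(sum((x - mx) ** 2 for x in xs))
    sy = sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

def spearman(xs, ys):
    """Spearman's rank correlation: Pearson applied to the samples' ranks
    (simplified: tie handling is omitted)."""
    def ranks(vs):
        order = sorted(range(len(vs)), key=lambda i: vs[i])
        r = [0.0] * len(vs)
        for rank, i in enumerate(order):
            r[i] = float(rank + 1)
        return r
    return pearson(ranks(xs), ranks(ys))
```

Pearson measures linear association between, e.g., per-file smell counts and issue counts, while Spearman captures any monotonic association, which is why both are reported.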
2.6 Machine Learning Framework

Machine learning is a subfield of Artificial Intelligence (AI). Machine learning gives computers the ability to execute specific tasks by learning from data, without being explicitly programmed [88]. This dissertation uses different machine learning techniques to analyze a large amount of information extracted from source code repositories and issue repositories. To facilitate the process of data mining, data manipulation, and model training, a machine learning framework is needed. For this purpose, WEKA [89], a well-known machine learning framework in the research community, was selected. WEKA implements a large collection of machine learning algorithms for different data mining tasks, including data pre-processing, classification, clustering, regression, association rules, and visualization. The framework provides a UI called WEKA Explorer (Figure 2.3) for users to access those features. In addition, WEKA has a tool named KnowledgeFlow which allows users to easily combine WEKA's implemented algorithms to create their own data processing workflows. Figure 2.4 shows an example of how a workflow is defined. In this dissertation, the outputs of ARCADE are converted to the CSV format, which is a compatible input format for WEKA. WEKA then processes the data based on the user-defined workflow and generates evaluation results.

In practical prediction problems, datasets are frequently imbalanced [59]: the number of observations of one class is significantly lower or higher than the numbers of observations of the other classes. Because classification algorithms usually do not take into account the distribution or balance of classes, overfitting frequently occurs with imbalanced datasets. To resolve this problem, a "resampling" approach is usually used to adjust the class distribution of a dataset. "Resampling" consists of "oversampling" and "undersampling".
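As a concrete illustration of oversampling, the following sketch generates synthetic minority-class points by interpolating between a point and one of its nearest neighbors, which is the core idea of the SMOTE technique discussed below. It is a simplified, self-contained version, not WEKA's SMOTE filter; the function name and parameters are my own, and feature scaling and edge cases are ignored:

```python
import random

def smote_like(minority, k=2, n_new=2, rng=None):
    """Create n_new synthetic points for a minority class (a list of numeric
    feature tuples) by interpolating each sampled point toward one of its
    k nearest neighbors in feature space."""
    rng = rng or random.Random(0)
    def dist2(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))
    synthetic = []
    for _ in range(n_new):
        p = rng.choice(minority)
        neighbors = sorted((q for q in minority if q is not p),
                           key=lambda q: dist2(p, q))[:k]
        q = rng.choice(neighbors)
        gap = rng.random()  # random number in [0, 1)
        synthetic.append(tuple(pi + gap * (qi - pi)
                               for pi, qi in zip(p, q)))
    return synthetic
```

Each synthetic point lies on the line segment between an existing minority point and a neighbor, so the minority region is densified rather than merely duplicated.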
Oversampling is used more frequently than undersampling to balance imbalanced datasets, because oversampling not only balances the distribution of the dataset but also increases the number of data points. Among a number of oversampling methods, the most common one is known as SMOTE: Synthetic Minority Over-sampling Technique [31]. To oversample a minority class, SMOTE takes a sample of the class from the dataset and then considers its k nearest neighbors (in feature space). SMOTE creates a new synthetic data point by taking the vector between one of those k neighbors and the current data point, and multiplying this vector by a random number x which lies between 0 and 1. Finally, this new synthetic data point is added to the dataset to increase the population of the minority class. This process is repeated until the dataset is balanced.

Figure 2.3: Weka UI
Figure 2.4: Weka KnowledgeFlow

Chapter 3: A Categorization of Architectural Smells

Although architectural smells have been discussed in the literature, a large-scale empirical study of architectural smells and their impact is still missing. One reason for this is the lack of automated smell-detection algorithms: without an efficient way to detect architectural smells from a system's implementation, it is impossible to scale up an empirical study to many systems with many versions. This state of affairs drives us to define detection algorithms for our proposed smell catalog. This section describes (1) three smell detection strategies, (2) formal definitions, and (3) corresponding detection algorithms for the 11 architectural smells. A detection strategy is a general approach to detecting a group of smells. We defined four categories of smells, and smells in each category are detected based on different detection strategies or combinations of strategies.
Taking advantage of prior research on specifying and detecting smells at lower abstraction levels (i.e., code and design smells), we made use of the resulting detection strategies. Three popular smell detection approaches [83] are lexicon, structure, and measurement. We applied the detection strategies that are related to the characteristics of each smell category. Furthermore, we also selected some of the best-proven practices (e.g., how to select metric thresholds [67]) from the literature.

All architectural-smell detection algorithms have been implemented and integrated into ARCADE [68]. We have applied ARCADE to several systems to detect smells in their architectures. Each system has a large number of versions, allowing us to study the smells over the evolution of the system. For illustration purposes in this section, given the space constraints, we will only highlight one instance of each smell per system. To that end, we selected two Apache systems that contain examples of all 11 smells defined above: CXF, a widely used open-source web services framework, and Nutch, a large open-source web crawler, the parent of Hadoop. Table 3.1 shows the two subject systems, the respective numbers of versions we analyzed, and the average sizes of each version.

Table 3.1: Subject systems to illustrate the proposed smell catalog

System        | Domain            | No. Versions | Avg. SLOC
Apache CXF    | Service Framework | 120          | 915K
Apache Nutch  | Web Crawler       | 21           | 118K

3.1 Smell Detection Strategies

3.1.1 Lexicon

The lexicon-based detection strategy relies on NLP techniques. This strategy is appropriate for detecting concern-based smells. Our approach uses the LDA implementation in MALLET [79], a topic modeling tool. For each system, ARCADE builds a topic model, which contains a list of topics and their associated keywords, from the textual contents of its source files. Those topics are referred to as the concerns of the system.
We then use the extracted topic model to compute the concern distribution of each entity in the system. The resulting list of concern distributions can be used to detect violations of the separation of concerns in architectural components (i.e., concern-based smells).

3.1.2 Structure

The structure-based detection strategy relies on different types of interconnections among components. In our approach, if two entities are connected by a link or a coupling, we say their parent components are connected by that link or coupling. Dependency-based and coupling-based smells are undesirable structural patterns formed by interconnections (links and couplings) among architectural components. Therefore, this strategy is appropriate for detecting smells in those two categories.

3.1.3 Measurement

The measurement-based detection strategy relies on the values of different software metrics. One can reuse classic software metrics suites, such as CK [32] and MOOD [55], or define new metrics to this end. This approach has also been used extensively [67] to detect code smells. This strategy can be used by itself (e.g., to detect interface-based smells, such as Unused Interface) or as a supplement to other strategies (e.g., combined with the lexicon-based strategy to detect Concern Overload). One critical issue in the measurement strategy is where to set thresholds, i.e., defining the criteria that classify a given value of a metric as an indicator of a smell. In our detection algorithms, we set thresholds by using InterQuartile analysis [108], which is an efficient technique for detecting outliers (e.g., smells in our study) in a population without requiring it to have a normal probability distribution. In the InterQuartile method, the lower quartile (q_1) is the 25th percentile, and the upper quartile (q_3) is the 75th percentile of the data. The inter-quartile range (iqr) is defined as the interval between q_1 and q_3.
q_1 - (1.5 × iqr) and q_3 + (1.5 × iqr) are defined as "inner fences" that mark off the "reasonable" values from the outliers. "Inner fences" have been used widely in research on software metrics [42] as thresholds to find outliers. In this section, we will use two shorthand functions, getLowThreshold() and getHighThreshold(), which accept a list of values and return the low and high values of the "inner fences".

3.2 Formalization of Architectural Smells

The architectural smells considered in this classification fall into one of four categories:

1. concern-based smells are caused by inappropriate or inadequate separation of concerns;
2. dependency-based smells arise due to the system components' interconnections and interactions;
3. interface-based smells are yielded by deficiencies in defining the system components' interfaces;
4. coupling-based smells appear because of logical couplings across the system components.

We define the individual architectural smells within each category below.

Concern-based smells include scattered parasitic functionality and concern overload.

Scattered Parasitic Functionality (SPF) describes a system in which multiple components are responsible for realizing the same high-level concern, while some of those components are also responsible for additional, orthogonal concerns. Such an orthogonal concern "infects" a component, akin to a parasite. Orthogonal concerns are either defined by domain experts or identified via NLP techniques (e.g., cosine similarities). Components in an SPF smell violate the principle of modularity [93]. Presence of the smell reduces the understandability and maintainability of the system. For example, if multiple components implement the "networking" concern, fixing a networking problem might require developers to check all these components, even though some of the components may primarily focus on other concerns. Formally, a set of components SPF suffers from this smell iff

∃z ∈ T | (|SPF| > th_tc) ∧ ((∀c ∈ SPF)(P(z|c) > th_spf))
Formally, a set of components SPF suer from this smell i 9z2Tj (jSPFj> th tc ) ^ ((8c2 SPF)(P (zjc)> th spf )) 30 where 0th spf 1 species the acceptable degree of scattering per concern, while th tc captures that scattering of a topic is allowed to occur across a given number of components before they are considered to be aected by this smell. Concern Overload (CO) indicates that a component implements an excessive number of concerns. CO violates the principle of separation of concerns [38]. It may increase the size of a component, hurting its maintainability. Formally, a component c suers from this smell i jfz j j (j2N)^ (P (z j jc)>th z c )gj>th co where 0 th z c 1 is the threshold indicating that a topic is signicantly represented in the component, while th co 2N is a threshold indicating the maximum acceptable number of concerns per component. Dependency-based smells include dependency cycles and link overloads. Dependency Cycle (DC) indicates a set of components whose links form a circular chain, causing changes to one component to possibly aect all other components in the cycle. Furthermore, an issue occurring in one component can potentially propagate to all the other components in the cycle. This high coupling between components violates the principle of modularity. Formally, this smell occurs in a set of three or more components i 9l2Lj (8xj (1xk)j ((x<k) =) (l:src2 c x :E^ l:dst2 c x+1 :E))^ ((x =k) =) (l:src2 c x :E^ l:dst2 c 1 :E)) Link Overload (LO) is a dependency-based smell that occurs when a component has interfaces involved in an excessive number of links, aecting the system's separation of concerns and isolation of changes. For example, modications to a component with LO due to an excessive number of outgoing links can potentially aect every component to which it is linked. Conversely, any 31 component with an excessive number of incoming links becomes brittle because it is aected by many other components. 
Formally, a component c suffers from both incoming and outgoing link overload iff

|{l ∈ L | l.src ∈ c.E}| + |{l ∈ L | l.dst ∈ c.E}| > th_lo

where th_lo is a threshold indicating the maximum number of links for a component that is considered reasonable. Excessive incoming links and excessive outgoing links are defined analogously.

Interface-based smells include unused interfaces, unused components, sloppy delegation, functionality overload, and Lego syndrome.

Unused Interface (UI) is an interface of a system entity that is linked to no other entities. In this case, we say the entity itself is unused. The unused entity might have been added by developers for future use. However, adding entities without any associated use cases violates the principle of incremental development [45]. An unused entity adds unnecessary complexity to the component and the software system which, in turn, hinders software maintenance. Formally, a component c ∈ C contains a UI smell in entity e ∈ c.E iff

(|e.I| ≠ 0) ∧ (∄l ∈ e.L : l.dst = e)

Unused Component (UC) is a component whose internal entities all exhibit the UI smell. UC inherits all of the negative effects of UI. Formally, a component c ∈ C is unused iff

∀e ∈ c.E : isUnusedInterface(e) = true

where isUnusedInterface(e) is a function that returns true if entity e is unused.

Sloppy Delegation (SD) occurs when a component delegates to other components functionality it could have performed internally. This inappropriate separation of concerns complicates the system's data- and control-flow which, in turn, slows down the maintenance of that system. An example of SD is a component that manages all aspects of an aircraft's current velocity, fuel level, and altitude, but passes that data to an entity in another component that solely calculates that aircraft's burn rate.
Formally, SD occurs between components c_1, c_2 ∈ C iff

∃l ∈ L : (l.src = e_1 ∈ c_1.E) ∧ (l.dst = e_2 ∈ c_2.E) ∧ (outLink(e_2) = 0) ∧ (inLink(e_2) < th_sd) ∧ (c_1 ≠ c_2)

where outLink(e) and inLink(e) return the numbers of links from and to entity e, respectively. Threshold th_sd ensures that entity e_2 is not a library-type entity. Under a strict constraint, th_sd = 2, meaning that e_2's functionality is only used by e_1.

Functionality Overload (FO) occurs when a component performs an excessive amount of functionality. Excessive functionality is another form of inappropriate modularity in a system, and it violates the principles of separation of concerns and isolation of change. Formally, component c ∈ C with entities e_i ∈ c.E has FO iff

Σ_{i=1}^{n} |e_i.I| > th_fo

where th_fo specifies the threshold for an excessively high number of operations, i.e., amount of functionality.

Lego Syndrome (LS) occurs when a component handles an excessively small amount of functionality. This smell points to components that do not represent an appropriate level of abstraction or separation of concerns. Formally, a component c ∈ C with entities e_i ∈ c.E has LS iff

Σ_{i=1}^{n} |e_i.I| < th_ls

where th_ls specifies a threshold for an excessively small number of operations, i.e., amount of functionality.

Coupling-based smells include duplicate functionality and co-change coupling.

Duplicate Functionality (DF) affects a component if the component shares the same functionality with other components. Changing one duplicated instance of the functionality without changing the others may create errors or inconsistencies in the system's behavior. DF violates the principle of modularity and increases complexity. Changing one duplicated instance requires changing all other instances for consistency. Formally, a component c ∈ C with entities e_i ∈ c.E has duplicate functionality iff

Σ_{i=1}^{n} |e_i.Du| > th_df

where th_df specifies a threshold for an excessively high number of duplications.
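Most of the smell definitions above depend on thresholds derived from the inner fences described at the start of this section. The two shorthand functions can be sketched as follows; this is a minimal Python illustration that assumes linear-interpolation quartiles, which may differ from the actual implementation:

```python
def _quartiles(values):
    """Return (q1, q3) using linear interpolation on the sorted data."""
    s = sorted(values)
    def q(p):
        i = p * (len(s) - 1)
        lo = int(i)
        hi = min(lo + 1, len(s) - 1)
        return s[lo] + (s[hi] - s[lo]) * (i - lo)
    return q(0.25), q(0.75)

def get_low_threshold(values):
    """Lower inner fence: q1 - 1.5 * iqr."""
    q1, q3 = _quartiles(values)
    return q1 - 1.5 * (q3 - q1)

def get_high_threshold(values):
    """Upper inner fence: q3 + 1.5 * iqr."""
    q1, q3 = _quartiles(values)
    return q3 + 1.5 * (q3 - q1)
```

For example, for the values [1, 2, 3, 4, 100], q1 = 2 and q3 = 4, so the inner fences are −1 and 7, and 100 is flagged as an outlier.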
Co-change Coupling (CC) occurs when changes to an entity of a given component also require changes to an entity in another component (recall Section ??). CC has negative consequences similar to DF's. Specifically, making a single change to a system affected by CC smells might require engineers to check, and make changes to, multiple components each time. Formally, a component c ∈ C with entities e_i ∈ c.E has co-change coupling iff

Σ_{i=1}^{n} |e_i.Co| > th_cc

where th_cc specifies a threshold for an excessively high number of logical couplings.

3.3 Smell Detection Algorithms

Concern-based smells

Scattered Parasitic Functionality (SPF): Algorithm 1, detectSPF, returns a map SPFsmells in which each key is a scattered concern z, and each value is a set of more than th_spf components that exhibit the corresponding concern z above the threshold th_zc. Lines 3-8 create a map whose keys are concerns and whose values are the numbers of components that exhibit each concern above threshold th_zc. The getConcernsOfComponent function in Line 4 returns the topic distribution of the component, which is computed by MALLET. Line 5 calculates the threshold th_zc for each component c, which helps determine the representative topics of a component (Line 7). Line 9 calculates the threshold th_spf for the maximum number of components per concern. Both thresholds are determined dynamically by the interquartile method. Lines 10-14 identify SPF instances by checking whether concern z appears in more than th_spf components (Line 11) and collecting the components that exhibit z (Lines 12-14).

In most versions of CXF, under the ARC view, we found a Scattered Parasitic Functionality instance in which the org.apache.cxf.BusFactory entity and its subclasses are scattered across different components. Even their parent packages are different. Although those classes address the same broad concern, "bus", the subclasses implement different specific concerns. This is the main reason ARC assigns them to different components.
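The structure of Algorithm 1 can be sketched in Python as follows. This is an illustrative sketch only, not the ARCADE implementation; `concerns_of(c)` stands for getConcernsOfComponent and `threshold` plays the role of getHighThreshold:

```python
def detect_spf(components, concerns_of, threshold):
    """Sketch of detectSPF: map each scattered concern to the components
    that exhibit it above the per-component threshold th_zc."""
    concern_counts = {}  # Lines 3-8: concern -> number of components exhibiting it
    rep = {}             # representative concerns per component
    for c in components:
        dist = concerns_of(c)                       # {concern: P(concern|c)}
        th_zc = threshold(list(dist.values()))      # Line 5
        rep[c] = {z for z, p in dist.items() if p > th_zc}
        for z in rep[c]:
            concern_counts[z] = concern_counts.get(z, 0) + 1
    th_spf = threshold(list(concern_counts.values()))  # Line 9
    spf_smells = {}      # Lines 10-14: scattered concern -> affected components
    for z, n in concern_counts.items():
        if n > th_spf:
            spf_smells[z] = {c for c in components if z in rep[c]}
    return spf_smells
```

In a real run both thresholds would come from the inner-fence computation; the sketch keeps the threshold function as a parameter so it can be swapped out.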
Concern Overload (CO): Algorithm 2, detectCO, determines which components in the system have CO. The algorithm operates in a manner similar to detectSPF. detectCO begins by creating a map, componentConcernCounts, where keys are components and values are the numbers of relevant concerns in each component (Lines 3-8). While creating the map, threshold th_zc is dynamically computed for each component (Line 5) and used to determine the prevalent concerns in each component.

Algorithm 1: detectSPF
Input: C: a set of components, T: a set of system concerns
Output: SPFsmells: a map where keys are concerns and values are components
1  SPFsmells ← initialize map as empty
2  concernCounts ← initialize all concern counts to 0
3  for c ∈ C do
4    T_c ← getConcernsOfComponent(c)
5    th_zc ← getHighThreshold(P(T_c))
6    for z ∈ T_c do
7      if P(z|c) > th_zc then
8        concernCounts[z] ← concernCounts[z] + 1
9  th_spf ← getHighThreshold(concernCounts)
10 for z ∈ T do
11   if concernCounts[z] > th_spf then
12     for c ∈ C do
13       if P(z|c) > th_zc then
14         SPFsmells[z] ← SPFsmells[z] ∪ {c}

Later, detectCO uses that map to compute threshold th_co (Line 9), which is then used to determine which components have the CO smell (Lines 10-12).

Algorithm 2: detectCO
Input: C: a set of components, T: a set of system concerns
Output: COsmells: a set of Concern Overload instances
1  COsmells ← ∅
2  componentConcernCounts ← initialize all component concern counts to 0
3  for c ∈ C do
4    T_c ← getConcernsOfComponent(c)
5    th_zc ← getHighThreshold(P(T_c))
6    for z ∈ T_c do
7      if P(z|c) > th_zc then
8        componentConcernCounts[c] ← componentConcernCounts[c] + 1
9  th_co ← getHighThreshold(componentConcernCounts)
10 for c ∈ C do
11   if componentConcernCounts[c] > th_co then
12     COsmells ← COsmells ∪ {c}

Like Scattered Parasitic Functionality, we also found a long-lived Concern Overload instance under the ARC view. We found a component that contains most of the classes in the org.apache.cxf.phase package.
This component implements different steps of information processing in CXF, including reading a message, transforming it, processing its headers, and validating it. Although all these steps are related to message handling, putting all of them into a single component causes it to have the CO smell.

Dependency-based smells

Dependency Cycle (DC): We detect DC smells by identifying strongly connected components in a software system's architectural graph G = (C, L). A strongly connected component is a graph or subgraph in which each vertex is reachable from every other vertex. Each strongly connected component in G is a Dependency Cycle. Any algorithm that detects strongly connected components [39, 72] can be used to identify DC. Therefore, we do not include a detection algorithm for DC in this classification.

Both Nutch and CXF have had one instance of the Dependency Cycle smell since their early versions. These instances are persistent throughout both systems' life cycles. As both systems evolve, the cycles increase in size by involving more components. This observation holds across all three architectural views.

Link Overload (LO): Algorithm 3, detectLO, extracts the LO variants for a set of components C by examining their links L. The algorithm first determines the number of incoming, outgoing, and combined links per component (Lines 4-6). detectLO then sets the threshold th_lo for each variant of LO by computing the thresholds for incoming, outgoing, and combined links (Lines 7-8). The last part of detectLO identifies each affected component and the directionality that indicates the variant of LO the component suffers from (Lines 9-12).

In Nutch, we found that some components, especially those related to web user interfaces, suffered from Link Overload, such as org.apache.nutch.webui.pages.crawls or org.apache.nutch.webui.pages.instances in the PKG view. These components were created with many inner classes within the main classes.
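Since any strongly-connected-components algorithm suffices for DC detection, one possible sketch uses Tarjan's algorithm. This is an illustrative Python sketch, not the tooling used in the study; `graph` maps each component to its set of successors, and only SCCs of three or more components are reported, per the formal DC definition:

```python
def dependency_cycles(graph):
    """Return the strongly connected components of size >= 3 (Tarjan)."""
    index, low = {}, {}
    stack, on_stack = [], set()
    counter = [0]
    cycles = []

    def strongconnect(v):
        index[v] = low[v] = counter[0]
        counter[0] += 1
        stack.append(v)
        on_stack.add(v)
        for w in graph.get(v, ()):
            if w not in index:
                strongconnect(w)
                low[v] = min(low[v], low[w])
            elif w in on_stack:
                low[v] = min(low[v], index[w])
        if low[v] == index[v]:          # v is the root of an SCC
            scc = set()
            while True:
                w = stack.pop()
                on_stack.discard(w)
                scc.add(w)
                if w == v:
                    break
            if len(scc) >= 3:           # a DC involves three or more components
                cycles.append(scc)

    for v in graph:
        if v not in index:
            strongconnect(v)
    return cycles
```

For a graph with the cycle A → B → C → A and the two-component cycle D ⇄ E, only {A, B, C} is reported as a Dependency Cycle under the three-component rule.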
Interface-based smells

Unused Interface (UI) and Unused Component (UC): As mentioned in Section 3.2, UC is the extreme case of UI. Algorithm 4, detectUI_UC, allows us to detect both of these smells. detectUI_UC uses the set of links L to determine whether an interface has been used. The algorithm checks every entity in each component (Lines 2-7); if an entity has a public interface but no link, the entity and its parent component are added to the list of UI instances. Line 9 uses a boolean flag, isUC, to mark that a component does not have UC if at least one entity in the component does not have UI. Line 11 checks for and adds UC instances to the smell list.

Algorithm 3: detectLO
Input: C: a set of components, L: links between components
Output: LOsmells: a set of Link Overload instances
1  LOsmells ← ∅
2  numLinks ← initialize map as empty
3  directionality ← {"in", "out", "both"}
4  for c ∈ C do
5    for d ∈ directionality do
6      numLinks[(c, d)] ← numLinks[(c, d)] + getNumLinks(c, d, L)
7  for d ∈ directionality do
8    th_lo[d] ← getHighThreshold(numLinks, d, C)
9  for c ∈ C do
10   for d ∈ directionality do
11     if getNumLinks(c, d, L) > th_lo[d] then
12       LOsmells ← LOsmells ∪ {(c, d)}

In Nutch, we found that the SequenceFileInputFormat class was unused in some 1.x versions (in each view, the parent component of this class was affected by the Unused Interface smell). This class was removed in version 2.0; we assume that the developers noticed the unused class and decided to remove it. We could not find any instances of Unused Component in Nutch, but we did in CXF: under the PKG view, the org.apache.cxf.simple component was unused from version 2.0.6 to version 2.2.9. It was used again starting with version 2.2.10.

Sloppy Delegation (SD): Algorithm 5, detectSD, requires a threshold th_sd, which defines the minimum number of in-links needed to consider a delegation appropriate.
The algorithm checks every link in each entity (Lines 2-5); if a link has a dst entity that satisfies the SD condition (defined in Section 3.2), the dst entity and its parent component are added to the list of SD instances (Line 10).

In Nutch, we found that org.apache.nutch.crawl and org.apache.nutch.scoring.webgraph are two components heavily affected by the Sloppy Delegation smell. Of the three architectural views, ACDC provides a better clustering, reducing the number of SD smells. However, those two components still need refactoring.

Functionality Overload (FO) and Lego Syndrome (LS): These two smells come as a pair, indicating overloaded and underloaded functionality in components, respectively. Algorithm 6, detectFO_LS, allows us to detect both smells in one run. The algorithm first creates a map between components and their numbers of interfaces (Lines 3-5). This map is then used to compute the two thresholds th_fo and th_ls, i.e., the high (Line 6) and low (Line 7) values of the inner fences, respectively. Finally, the algorithm revisits each component, checking for and adding detected smell instances to the two smell lists (Lines 8-12).

Across all 1.x versions of Nutch under the PKG view, we found that one component, org.apache.nutch.crawl, has Functionality Overload, and two other components, org.apache.nutch.net.protocols and org.apache.nutch.tool.arc, have Lego Syndrome.
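The inner-fence split at the heart of Algorithm 6 can be condensed into a short Python sketch. This is a simplification for illustration: interface counts per component are assumed precomputed, and quartiles use linear interpolation, which may differ from the actual implementation:

```python
def detect_fo_ls(num_interfaces):
    """Split components into Functionality Overload (above the upper inner
    fence) and Lego Syndrome (below the lower inner fence).
    num_interfaces: {component: total number of interface operations}."""
    vals = sorted(num_interfaces.values())
    def q(p):  # linear-interpolation quartile
        i = p * (len(vals) - 1)
        lo = int(i)
        hi = min(lo + 1, len(vals) - 1)
        return vals[lo] + (vals[hi] - vals[lo]) * (i - lo)
    q1, q3 = q(0.25), q(0.75)
    th_ls = q1 - 1.5 * (q3 - q1)   # lower inner fence (getLowThreshold)
    th_fo = q3 + 1.5 * (q3 - q1)   # upper inner fence (getHighThreshold)
    fo = {c for c, n in num_interfaces.items() if n > th_fo}
    ls = {c for c, n in num_interfaces.items() if n < th_ls}
    return fo, ls
```

For example, among components with 0, 10, 11, 12, 13, and 100 operations, only the 100-operation component is flagged as FO and only the 0-operation component as LS.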
Coupling-based smells

Algorithm 4: detectUI_UC
Input: C: a set of components, L: links between components
Output: UIsmells: a set of Unused Interface instances, UCsmells: a set of Unused Component instances
1  UIsmells ← ∅, UCsmells ← ∅
2  for c ∈ C do
3    isUC ← true
4    for e ∈ c.E do
5      if getNumInterfaces(e.I) > 0 then
6        if getNumLinks(e.L) = 0 then
7          UIsmells ← UIsmells ∪ {(c, e)}
8        else
9          isUC ← false
10   if isUC then
11     UCsmells ← UCsmells ∪ {c}

Algorithm 5: detectSD
Input: C: a set of components, L: links between components, th_sd: threshold for relevant delegation
Output: smells: a set of Sloppy Delegation instances
1  smells ← ∅
2  for c_1 ∈ C do
3    for e_1 ∈ c_1.E do
4      for l ∈ e_1.L do
5        if l.src = e_1 then
6          e_2 ← l.dst
7          c_2 ← getParent(e_2)
8          if (e_1 ≠ e_2) ∧ (getOutLink(e_2) = 0)
9             ∧ (getInLink(e_2) < th_sd) then
10           smells ← smells ∪ {((c_1, e_1), (c_2, e_2))}

Algorithm 6: detectFO_LS
Input: C: a set of components, L: links between components
Output: FOsmells: a set of Functionality Overload instances, LSsmells: a set of Lego Syndrome instances
1  FOsmells ← ∅, LSsmells ← ∅
2  numInterfaces ← initialize map as empty
3  for c ∈ C do
4    for e ∈ c.E do
5      numInterfaces[c] ← numInterfaces[c] + getNumInterfaces(e.I)
6  th_fo ← getHighThreshold(numInterfaces, C)
7  th_ls ← getLowThreshold(numInterfaces, C)
8  for c ∈ C do
9    if numInterfaces[c] > th_fo then
10     FOsmells ← FOsmells ∪ {c}
11   else if numInterfaces[c] < th_ls then
12     LSsmells ← LSsmells ∪ {c}

Duplicate Functionality (DF) and Co-change Coupling (CC): Detecting coupling-based smells is similar to detecting Link Overload. The difference is that the detection algorithms for coupling-based smells use couplings instead of links as their input: detecting DF depends on duplicates, while detecting CC depends on co-changes. Algorithm 7, detectDF_CC, shows how to detect both types of smells. The algorithm first creates two maps between components and their numbers of duplicates and co-changes, respectively (Lines 4-7).
detectDF_CC uses these maps to compute two thresholds, th_df and th_cc, which are the high inner-fence values (Lines 8-9).

Algorithm 7: detectDF_CC
Input: C: a set of components, Cp: couplings between components
Output: DFsmells: a set of Duplicate Functionality instances, CCsmells: a set of Co-change Coupling instances
1  DFsmells ← ∅, CCsmells ← ∅
2  numDu ← initialize map as empty
3  numCo ← initialize map as empty
4  for c ∈ C do
5    for e ∈ c.E do
6      numDu[c] ← numDu[c] + getNumDu(c, e.Du)
7      numCo[c] ← numCo[c] + getNumCo(c, e.Co)
8  th_df ← getHighThreshold(numDu, C)
9  th_cc ← getHighThreshold(numCo, C)
10 for c ∈ C do
11   if numDu[c] > th_df then
12     DFsmells ← DFsmells ∪ {c}
13   if numCo[c] > th_cc then
14     CCsmells ← CCsmells ∪ {c}

Finally, the algorithm visits each component again, checking for and adding detected smell instances to the two smell lists (Lines 10-14).

Under the PKG view of CXF, we found that the component org.apache.cxf.interceptor has the Co-change Coupling smell, along with 6 other components, in version 2.x. Later, one CC instance was removed in version 3.x; however, new CC smell instances were also introduced. Under the PKG view, we also found org.apache.cxf.ws.security.wss4j.policyvalidators and org.apache.cxf.jibx to be strongly affected by Duplicate Functionality. Almost all entities in those two components have duplicates of entities in other components.

Chapter 4

An Empirical Study of Software Architectural Changes

4.1 Foundation

Our work discussed in this chapter was directly enabled by two research threads: (1) architecture change metrics and (2) software architecture recovery. The details of a2a and cvg have been discussed in Section 2.3. We summarize the second foundational thread in this section.

4.1.1 Adapting Architectural Recovery Techniques

In order to use two existing architecture recovery techniques, ACDC and ARC, in ARCADE, we had to overcome two related problems.
First, to build the topic models needed for ARCADE's concern-based architecture recovery (ARC), we used the MALLET machine learning toolkit [79]. The topic-model extraction algorithms implemented by MALLET are non-deterministic. On the other hand, in order to meaningfully compare two concern-based architectures, as required for our study, we needed a shared topic model for their recovery. Therefore, for each subject system, we created a topic model using all available versions of the system as the input to MALLET. The number of topics was determined based on our experience with ARC from a previous empirical evaluation [48]. We then used this topic model to recover the architectures of all of that system's versions. In addition, we also computed architectural changes between a large number of pairs of different systems' versions using topic models created from only the two versions involved. The architectural change results yielded by the two approaches (a single topic model for all system versions vs. different topic models for each pair of versions) are highly similar, with a variation of 1-2%. This supports our hypothesis that topic models created from a large number of versions do not produce significant noise when recovering the architecture of a particular version.

Second, to recover architectures using ACDC, we obtained an implementation of the technique from ACDC's authors [109] and used its default settings. Although ACDC relies on a deterministic clustering algorithm, its implementation turned out not to be deterministic, which created inaccuracies in our empirical analysis. We traced the source of ACDC's non-determinism to the Orphan Adoption (OA) algorithm used in its implementation. OA is an incremental clustering algorithm that ACDC employs to assign a system's implementation entities to architectural components.
The order of the entities provided as input affects the result of OA, and consequently the architecture recovered by ACDC. In the original implementation of ACDC, this order is not the same in every execution of the algorithm, causing the non-deterministic output. We resolved this problem by first sorting the input to OA based on the full package name of each class file.

4.2 Research Questions and Hypotheses

The first empirical study of this dissertation targets four research questions regarding the nature of architectural change. The absence of empirical data on architectural change in real systems has left the phenomenon relatively poorly understood. As a result, the extent of architectural change, the types of architectural change, and the points in a system's lifecycle at which major architectural change occurs are generally unclear. With this study, this dissertation aims to understand how architectural changes happen at two levels of architecture: the system level and the component level. The study also explores whether developers are aware of architectural changes and reflect that awareness in the versioning schemes of their software systems.

RQ1: To what extent do architectures change at the system level? This research question focuses on the structural stability of a system's architecture. During development and evolution, a system's implementation entities are usually reallocated (added, removed, moved) among its architectural components. This question will shed light on when, how, and to what extent this reallocation happens.

RQ2: To what extent do architectures change at the component level? This research question focuses on the structural stability of a system's individual components. The implementation-level entities that realize an architectural component will change over time as the system evolves. Beyond a certain change threshold, it may be difficult to argue that an evolved component is still "the same" as the original component.
This question will, therefore, study component evolution patterns and thresholds.

RQ3: Do architectural changes at the system and component levels occur concurrently? This research question aims to reveal the extent to which changes to the overall architectural structure are accompanied by changes to individual system components, and when and why the two fall out of step.

RQ4: Does significant architectural change occur between minor system versions within a single major version? As a commonly adopted rule of thumb, developers decide to introduce a new major version of their system when the new APIs become incompatible with the previous versions (e.g., as in the case of the Apache Portable Runtime (APR) project [19]). In turn, this should imply a substantial change to the system's architecture. This research question will target our hypothesis, formed after our initial observations, that a system's architecture may experience significant change even though the system remains within the same major version.

4.3 Empirical Study Setup

In order to answer these research questions, we apply ARCADE to a total of 720 versions of 18 Apache open-source systems. The largest versions of these systems range between 50 KSLOC and 800 KSLOC. All of these systems are implemented in Java and managed in the Apache Jira repository. Table 4.1 summarizes each system we analyzed, its application domain, the number of versions analyzed, the timespan between the earliest and latest analyzed version, and the cumulative size of all selected versions.

Table 4.1: Apache subject systems analyzed in the empirical study of architectural changes

System      Domain                    No. of Ver.  Time span    MSLOC
Accumulo    Data Storage System       10           05/15-09/15  1.59
ActiveMQ    Message Broker            20           08/04-01/07  3.40
Cassandra   Distributed DBMS          127          09/09-09/13  22.0
Chukwa      Data Monitor              7            05/09-02/14  2.20
Hadoop      Data Process              63           04/06-08/13  30.0
HttpClient  HTTP Toolset              88           12/07-09/15  3.31
Ivy         Dependency Manager        20           12/07-02/14  0.40
JackRabbit  Content Repository        97           08/04-02/14  34.2
Jena        Semantic Web              7            06/12-09/13  3.50
JSPWiki     Wiki Engine               54           10/07-03/14  1.20
Log4j       Logging                   41           01/01-06/14  2.40
Lucene      Search Engines            21           12/10-01/14  4.90
Mina        Network Framework         40           11/06-11/12  2.30
PDFBox      PDF Library               17           02/08-03/14  2.70
Poi         Java API                  20           05/15-09/15  1.68
Struts 2    Web Apps Framework        36           10/06-02/14  6.70
Tika        Content Analysis Toolkit  30           05/10-05/15  0.56
Xerces      XML Library               22           03/03-11/09  2.30
Total                                 720          01/01-09/15  125.33

In addition to the Apache subject systems, we have analyzed another 5 systems that are not from the Apache Software Foundation. We refer to them as non-Apache systems. In total, we have analyzed 211 versions of the non-Apache systems, as summarized in Table 4.2.

Table 4.2: Non-Apache subject systems analyzed in our study

System                Domain                  No. of Ver.  Time span    MSLOC
Druid-core            Alibaba JDBC Library    27           04/12-08/14  4.61
Guava-core            Google Java Library     20           08/12-09/15  1.36
Jackson(JS)-databind  Data Binding Library    56           02/12-10/15  4.23
PgJDBC                PostgreSQL JDBC driver  64           01/05-10/15  2.27
TestNG                Testing Framework       44           07/10-10/15  2.33
Total                                         211          01/05-10/15  14.8

We applied ARCADE's workflow, depicted in Figure 2.2, to the different versions of each system. For each version, ARCADE produced (1) three recovered architectures, by ACDC, ARC, and PKG, and (2) the values of the change metrics. All artifacts produced in our study are available at [1].

In our analysis of the subject systems, we leveraged their shared hierarchical versioning scheme: major.minor.patch-pre-release.
A Major version entails extensive changes to a system's functionality and typically results in API modifications that are not backward-compatible. A Minor version involves fewer and smaller changes than a major version and typically ensures backward-compatibility of APIs. A Patch version, also referred to as a point version, results from bug fixes or improvements that involve limited changes to the functionality. A Pre-release version, which can be classified as alpha, beta, or release candidate (RC), usually contains new features and is provided to users before the official (major or minor) version in order to get feedback. This shared versioning scheme enabled us to make certain comparisons despite the differences among the systems and their numbers of versions.

However, different systems follow different release evolution paths (recall Section 2.5). Determining the accurate evolution path for each system turned into an unexpected, non-trivial challenge. For example, in one system, version 1.2.0 may represent a direct evolution of version 1.1.7; in another system, 1.2.0 may represent a completely new development branch. In order to determine the correct version sequences in our subject systems, we relied on git-log [10] and svn-graph-branches [6]. We then manually analyzed and, where appropriate, updated the results of those tools to ensure the accuracy of the suggested evolution paths. In this process, we identified three frequently occurring patterns that affected our selection of version pairs and evolution paths. In a number of cases, a minor version directly evolved from a previous minor version, rather than from a numerically more proximate patch version. Similarly, a new major version frequently evolved from a minor version, rather than from a numerically more proximate patch version; however, changes in patch versions would be merged at a later time.
Lastly, the evolution paths for patch and pre-release versions typically followed the numeric ordering of their version numbers.

The evolution paths we selected in our study contain the four types of versions (Major, Minor, Patch, and Pre). In the case of major versions, we decided to consider two separate evolution paths, because doing so allowed us to uncover different aspects of a system's evolution:

1. The evolution path involving all changes from the start of one major version to the start of the subsequent major version (e.g., the version pair (1.0.0, 2.0.0)). This evolution path represents the totality of changes a system undergoes within a single major version (hence we refer to it as Major below).

2. The evolution path involving a single version pair that comprises the last minor (or patch) version within a major version and the next major version (e.g., the version pair (1.9.0, 2.0.0), where there are no other system versions between the two). This evolution path represents the degree of change to the system at the time the developers decide to make the "jump" to the next major version. We refer to this evolution path as MinMaj.

As an example of selected version pairs and evolution paths, consider the following set of versions obtained from the same system: 1.0.0, 1.1.0, 1.1.1, 1.2.0, 1.2.1, 1.2.2, 2.0.0-beta1, 2.0.0-beta2, and 2.0.0. For the Major evolution path, only the pair (1.0.0, 2.0.0) is in the path, as expected. On the other hand, for the MinMaj evolution path, (1.2.0, 2.0.0) is in the path for this system, rather than (1.2.2, 2.0.0). The Minor evolution path contains (1.0.0, 1.1.0), as expected, but instead of (1.1.1, 1.2.0) it contains (1.1.0, 1.2.0). The Patch evolution path consists of the pairs (1.1.0, 1.1.1), (1.2.0, 1.2.1), and (1.2.1, 1.2.2). Finally, the Pre-release path includes (2.0.0-beta1, 2.0.0-beta2) and (2.0.0-beta2, 2.0.0).
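As a rough illustration only, the labelling of version pairs can be approximated purely from the version strings. The following hypothetical helper is a simplification that ignores the branch analysis described above (on which the study actually relied), and it assumes the major.minor.patch(-prerelease) numbering convention:

```python
import re

def classify_pair(v1, v2):
    """Label a version pair with its evolution-path type (sketch only)."""
    def parse(v):
        m = re.match(r"(\d+)\.(\d+)\.(\d+)(?:-(.+))?$", v)
        major, minor, patch, pre = m.groups()
        return int(major), int(minor), int(patch), pre
    a, b = parse(v1), parse(v2)
    if a[3] is not None or b[3] is not None:
        return "Pre"                       # a pre-release is involved
    if b[0] != a[0]:
        # x.0.0 -> (x+1).0.0 spans the whole major version; otherwise the
        # pair jumps from a later minor/patch to the next major version
        return "Major" if (a[1], a[2]) == (0, 0) else "MinMaj"
    if b[1] != a[1]:
        return "Minor"
    return "Patch"
```

Applied to the example above, (1.0.0, 2.0.0) is labelled Major, (1.2.0, 2.0.0) MinMaj, (1.1.0, 1.2.0) Minor, (1.2.0, 1.2.1) Patch, and (2.0.0-beta1, 2.0.0-beta2) Pre.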
In addition to excluding minor and patch versions, as in the above example, in a limited number of cases we also excluded a major version along with all of its associated minor, patch, and pre-release versions. That occurred when a major version was actually an entirely different development branch from the system's other major versions. For instance, Struts 1 and Struts 2 [5] have been developed independently, and comparing their architectures would yield no useful information from the perspective of architectural change. In this case, we selected Struts 2 for our study since it provided a richer set of minor, patch, and pre-release versions.

The version-numbering convention adopted by developers in the non-Apache systems is similar to, although less consistent than, that in the Apache systems. In the non-Apache systems, the developers tend not to strictly follow the convention, or they tend to have a preference for a single type of system version. For example, Google almost exclusively releases major and beta versions, along with a few patch versions, of the Guava library [54]. Nevertheless, we still applied the above approach of selecting version pairs to the non-Apache systems. This helps us to understand the differences in the version-change decisions across the subject systems.

4.4 Results

To shed light on the four research questions about architectural change, we leveraged ARCADE to compute the a2a and cvg metrics (recall Section 2.3). For each version pair within each evolution path of a system, we computed these metrics using the three architectural views produced by ACDC, ARC, and PKG. For ease of comparison, the results obtained from the two sets of subject systems (Apache systems and non-Apache systems) are separated into different tables below. Tables 4.3 and 4.4 show the average a2a values for the two sets of subject systems, while Tables 4.5 and 4.6 show the average cvg values for each system in the two sets.
Empty table cells indicate comparisons of versions that are invalid or cannot be determined. For example, if a software system has only one major version, architectural change values for Major and MinMaj cannot be computed. We discuss our findings for each research question below.

4.4.1 RQ1: Architectural Change at the System-Level

To study RQ1, we leveraged a2a, which allows us to compute architectural change at the system-configuration level. Tables 4.3 and 4.4 show average a2a values for the five different types of evolution paths we selected, across the three architectural views. In Table 4.3, we observed a consistent trend in system-level architectural change among the three views of the Apache systems. The a2a similarity values for the Major and MinMaj evolution paths are lower than for the remaining three types. This means that the most significant architectural changes tend to involve major system versions. From the table, we can see a prevalent overall trend:

a2a_Pre ≈ a2a_Patch > a2a_Minor > a2a_MinMaj > a2a_Major

Table 4.3: Average a2a values between versions of Apache subject systems.
             ACDC                            ARC
System       Major MinMaj Minor Patch Pre    Major MinMaj Minor Patch Pre
Accumulo       -     -     84    99    -       -     -     84    98    -
ActiveMQ      62    69     95   100   99      61    66     93   100   98
Cassandra     42    80     77    99   99      32    75     70    98   99
Chukwa         -     -     78     -   95       -     -     71     -   92
Hadoop        17    73     86    98    -      19    74     83    95    -
HttpClient     -     -     87    98   98       -     -     85    97   97
Ivy           50    67     91    98   99      28    52     86    95   97
JackRabbit    38    76     84    91   98      26    75     84    99   95
Jena           -     -     88    99    -       -     -     89    95    -
JSPWiki       18    30     86    98   99      10    25     72    98   99
Log4j          9    13     64    97   85       5     6     68    99   86
Lucene        12     8     96    98   94      11     9     97   100   93
Mina          28    30     92    99   88      15    16     93    99   89
Poi            -     -     90    99   98       -     -     86   100   94
Struts2        -     -     90    99    -       -     -     94    99    -
Tika          60    97     94     -  100      54    96     92     -  100
Xerces        21    54     92    83    -      18    54     91    94    -
AVG           32    55     87    97   96      25    50     85    98   95
DEV           19    30      8     4    5      17    31      9     2    4

             PKG
System       Major MinMaj Minor Patch Pre
Accumulo       -     -     85    99    -
ActiveMQ      62    71     94   100   98
Cassandra     36    79     74    99   99
Chukwa         -     -     79     -   94
Hadoop        14    81     91   100    -
HttpClient     -     -     90    99   98
Ivy           35    57     89    98   99
JackRabbit    30    82     92   100   99
Jena           -     -     94    99    -
JSPWiki        8    13     87    99  100
Log4j          1     2     61    98   91
Lucene         1     1     97    99   90
Mina          13    13     98   100   86
PDFBox         -     -     97   100    -
Poi            -     -     92   100   99
Struts2        -     -     93    99    -
Tika          60    98     95     -  100
Xerces        15    63     91    90    -
AVG           21    49     89    96   96
DEV           22    36     10     2    5

Value unit is percentage; lower numbers mean more change. Empty table cells indicate versions that do not exist for a given system. The second-to-last row (AVG) is the average of averages; the last row (DEV) is the standard deviation.

Figure 4.1: Average a2a values in three subject systems

This observation is expected: as discussed earlier, patch versions usually come with bug fixes, minor versions usually come with new features, and pre-release versions are wait-for-feedback versions of a minor or major version that sometimes require more changes than patch versions. Although the averages of the Patch and Pre columns are approximately equal, we observed significant changes between pre-release versions in some cases.
For example, in Log4j, the a2a(1.3-alpha-6, 1.3-alpha-7) value is 49%, and the a2a(2.0-rc1, 2.0-rc2) value is 72%. The prevalent overall trend can be observed from a side-by-side depiction of three representative systems' evolutions, shown in Figure 4.1. Differences between the a2a_MinMaj and a2a_Major values for a given software system reflect different aspects of change that has occurred both within and across that system's major versions. For example, in the case of Hadoop, a2a_MinMaj is 73% while a2a_Major is 17% for ACDC. Hadoop had more than twenty minor versions between versions 0.1.0 and 0.20.x, before releasing version 1.0.0 [2]. We consider 0.1.0 to be Hadoop's first major release since it is, in fact, Hadoop's very first release. As a result, the architectural gap between version 0.1.0 and 1.0.0 is expected to be very large, yielding a low a2a_Major value. On the other hand, changes between the last minor version and the subsequent major version that is derived from it (i.e., for the version pair (0.20.0, 1.0.0)) are comparatively small, resulting in a relatively high a2a_MinMaj value. Incremental changes between consecutive minor versions need not always result in higher architectural similarity between the last minor version and the subsequent major version, and may be dwarfed by the changes a major version introduces. This is illustrated by the case of Lucene, whose pair (a2a_Major, a2a_MinMaj) is (12%, 8%) for ACDC, while its a2a_Minor is 96% for its six minor releases from versions 3.0.0 through 3.6.0. In fact, the new major version 4.0.0 has no significant similarity to the previous major version (3.0.0) or its most proximate minor version (3.6.0). Looking into the code history of Lucene, we found that multiple changes between minor versions are related to backward-compatibility issues. For example, Lucene 3.6.0 contains packages that are added to support backward-compatibility with versions 3.1.x and 3.3.x.
At the time version 4.0.0 was released, a substantial number of changes happened in the backward-compatibility policy. Subsequently, those packages were removed from version 4.0.0. Obtaining the consistent trends across the recovered architectural views that are shown in Table 4.3 at times required that we manually adjust the inputs into two of the three architecture recovery techniques. Namely, in several instances we observed that ARC (the semantics-based view) provided a significantly better insight into architectural change than ACDC and PKG (the structure-based views). Inspection of our subject systems' source code uncovered that, in some systems (e.g., Log4j, Lucene), developers decided to change the root package name when releasing a new major version. Since ACDC and PKG rely directly on the package structure of the system, such an architecturally inconsequential change caused them to return exceptionally low a2a values. On the other hand, ARC performs clustering of code entities (for the Java code evaluated here, these are classes and enums) based on topic models of systems, and changing package names had no effect on its accuracy.

Table 4.4: Average a2a values between versions of non-Apache software systems.

                 ACDC                              ARC
System       Major MinMaj Minor Patch Pre     Major MinMaj Minor Patch Pre
Druid         82     -      -    99    -       79     -      -    96    -
Guava         94     -      -   100  100       93     -      -   100  100
JS-databind    -     -     95    99  100        -     -     92    99   99
PgJDBC        69    89     95    99    -       64    86     91    99    -
TestNG         -     -     91    99    -        -     -     87    99    -
AVG           76    89     94    99  100       76    86     91    99  100
DEV           10     -      2     0    0       11     -      2     1    0

                 PKG
System       Major MinMaj Minor Patch Pre
Druid         86     -      -    99    -
Guava         93     -      -   100  100
JS-databind    -     -     96    99  100
PgJDBC        71    90     96    99    -
TestNG         -     -     91    99    -
AVG           83    90     94    99  100
DEV            9     -      2     0    0

Value unit is percentage. Lower numbers mean more change. Empty table cells indicate versions that do not exist for a given system. The second bottom-most row is the average-of-averages. The bottom-most row is the standard deviation.
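To make the tabulated numbers concrete, the following is a minimal sketch of an a2a-style similarity computation over two recovered architectures, each represented as a mapping from cluster names to sets of code entities. The function and data shapes are illustrative; the published a2a metric counts the minimum number of transform operations (additions, removals, and moves of entities and clusters), which this sketch approximates without optimal cluster matching.

```python
def a2a(src, tgt):
    """Approximate a2a similarity (%) between two architectures.
    src/tgt: dict mapping cluster name -> set of code entities."""
    src_entities = set().union(*src.values()) if src else set()
    tgt_entities = set().union(*tgt.values()) if tgt else set()
    # entity additions and removals between the two versions
    adds = len(tgt_entities - src_entities)
    removes = len(src_entities - tgt_entities)
    # surviving entities placed in a differently named cluster count as moves
    place = lambda arch: {e: c for c, es in arch.items() for e in es}
    ps, pt = place(src), place(tgt)
    moves = sum(1 for e in src_entities & tgt_entities if ps[e] != pt[e])
    # cluster additions/removals (symmetric difference of cluster names)
    cluster_ops = len(set(src) ^ set(tgt))
    mto = adds + removes + moves + cluster_ops
    # cost of building an architecture from scratch:
    # add each cluster once, then add and place each entity
    aco = lambda arch: len(arch) + 2 * sum(len(es) for es in arch.values())
    return (1 - mto / (aco(src) + aco(tgt))) * 100
```

Under this sketch, identical architectures score 100, while renaming every cluster (as with the root-package renames noted above) drives the value down even though no entity changed, mirroring the sensitivity of the structure-based views discussed in the text.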
Although PKG performed significantly better at the system level than at the component level (see Section 4.4.2), our analysis of the a2a metric's results provided the first indication that PKG may not always accurately reflect architectural change. Namely, the a2a values for the architectures suggested by PKG are uniformly higher than corresponding values in the ACDC and ARC views. This suggests a simple scenario under which PKG falters: if developers put all of the, arbitrarily many, new features of a system's new minor or major version into a small subset of the system's packages, PKG will still indicate only small, if any, architectural changes. For all three architectural views, the standard deviation values (the bottom-most "DEV" row) show a consistent trend. The DEV value of a2a_MinMaj is larger than the one of a2a_Major. This reflects that developers tend to increase a major version number depending on the accumulated architectural change during the entire major version, rather than on the degree of architectural change since the last minor version. This is reflective of practices expected of well-organized development teams, which usually define a long-term road map of future features and associate those with system versions. The a2a values of non-Apache systems in Table 4.4 follow the observed trend in the Apache systems. For example, the a2a_Major values are the lowest among the five evolution paths, indicating that the most significant architectural change occurs between two major versions of a non-Apache system. Compared to the Apache systems, the non-Apache systems have higher a2a values with smaller gaps between different evolution paths. For example, in the ACDC view, the average values of a2a_Major, a2a_Minor and a2a_Patch are, respectively, 32%, 87%, and 97% for the Apache systems, and 76%, 94%, and 99% for the non-Apache systems. This can be explained by the different versioning style used in non-Apache systems.
For example, as explained in Section 4.3, developers of Guava almost exclusively use major numbers.

4.4.2 RQ2: Architectural Change at the Component-Level

To understand architectural change at the level of individual components, we relied on ARCADE's cvg metric. In the results reported here, we set the threshold th_cvg (recall Section 2.3) to 67%. We experimented with several th_cvg values, and 67% gave us the most intuitive result. This setting allows ARCADE to treat two clusters as different versions of the same cluster, while allowing a reasonable fraction of the new cluster's constituent code elements to change. Tables 4.5 and 4.6 depict average cvg values for architectures recovered by ACDC, ARC, and PKG. As in the case of a2a, these values are computed for Major, MinMaj, Minor, Patch, and Pre-release version pairs. Average cvg values are computed for each version pair (s,t), which obtains the percentage of extant components, and its inverse (t,s), which allows us to determine the extent to which new components were added to a version.

Table 4.5: Average cvg values between versions of Apache subject systems.
                 ACDC
             Major      MinMaj     Minor      Patch      Pre
System      (s,t)(t,s) (s,t)(t,s) (s,t)(t,s) (s,t)(t,s) (s,t)(t,s)
Accumulo      -    -     -    -    71   64    99   99     -    -
ActiveMQ     28   19    61   48    95   92   100  100    99   99
Cassandra     5    4    59   53    52   46    98   99    98   98
Chukwa        -    -     -    -    63   54     -    -    86   86
Hadoop        0    0    54   46    83   74    95   98     -    -
HttpClient    -    -     -    -    65   64    97   97    96   95
Ivy           6    4    46   45    67   57   100   96   100   96
JackRabbit   16    7    53   57    87   81    98   97    96   96
Jena          -    -     -    -    81   74    96   96     -    -
JSPWiki       0    0     0    0    38   35    85   84    98   98
Log4j         0    0     0    0    29   21    94   93    85   82
Lucene        0    0     0    0    87   84    98   98    99   99
Mina          4    2     4    2    78   78    99   99    87   80
PDFBox        -    -     -    -    94   92    95   94     -    -
Poi           -    -     -    -    86   83   100  100    94   95
Struts2       -    -     -    -    79   83    96   96     -    -
Tika         25   17   100  100    80   77     -    -   100  100
Xerces        0    0    20   16    83   81    86   83     -    -
AVG           7    5    36   33    73   69    96   95    95   94
DEV          10    7    33   32    19   20     5    5     6    7

                 ARC
             Major      MinMaj     Minor      Patch      Pre
System      (s,t)(t,s) (s,t)(t,s) (s,t)(t,s) (s,t)(t,s) (s,t)(t,s)
Accumulo      -    -     -    -    77   66    99   98     -    -
ActiveMQ     37   27    57   54    91   85    99   99    94   92
Cassandra    39   19    71   61    65   55    97   96    95   95
Chukwa        -    -     -    -    54   44     -    -    72   72
Hadoop       20    3    71   55    82   73    96   96     -    -
HttpClient    -    -     -    -    74   70    93   94    93   92
Ivy           7    5    46   42    47   41    82   75    93   95
JackRabbit   49   15    66   65    85   78    99   99    92   92
Jena          -    -     -    -    81   77    92   92     -    -
JSPWiki       0    0    39    9    41   33    90   87    98   98
Log4j         7    2     5    2    57   42    87   85    81   79
Lucene        0    0     0    0    92   91    99   99    85   91
Mina         10    6    12    8    85   85    99   99    82   77
PDFBox        -    -     -    -    88   85    99   97     -    -
Poi           -    -     -    -    84   78   100  100    88   89
Struts2       -    -     -    -    74   78    96   95     -    -
Tika         52   26    83   85    82   78     -    -   100  100
Xerces       12    5    26   29    85   79    86   82     -    -
AVG          21   10    43   37    72   74    96   94    89   89
DEV          19   10    29   29    16   18     6    7     8    8

                 PKG
             Major      MinMaj     Minor      Patch      Pre
System      (s,t)(t,s) (s,t)(t,s) (s,t)(t,s) (s,t)(t,s) (s,t)(t,s)
Accumulo      -    -     -    -    82   72   100   99     -    -
ActiveMQ     33   27    60   67    96   93   100  100   100   97
Cassandra    29   18    77   69    69   60    98  100    99   99
Chukwa        -    -     -    -    76   67     -    -    93   93
Hadoop        0    0    54   46    95   85   100   99     -    -
HttpClient    -    -     -    -    83   81    99   99    98   98
Ivy          14   11    48   46    81   65   100   97   100   97
JackRabbit   28   12    65   73    93   86    99   98    98   98
Jena          -    -     -    -    97   93    99   99     -    -
JSPWiki       0    0    25    5    63   51    97   96   100  100
Log4j         0    0     0    0    69   54    99   97    92   88
Lucene        0    0     0    0    88   85    99   99    70   89
Mina          8    4     8    4    96   96    97   96    91   83
PDFBox        -    -     -    -    97   97    98   97     -    -
Poi           -    -     -    -    94   90   100  100    97   98
Struts2       -    -     -    -    91   95    98   98     -    -
Tika         30   20   100  100    93   90     -    -   100  100
Xerces        7    3    20   10    85   83    90   88     -    -
AVG          13    9    42   38    86   80    98   97    95   95
DEV          13   10    33   35    10   15     2    3     8    5

Value unit is percentage. Lower numbers mean more change. Empty table cells indicate versions that do not exist for a given system. The second bottom-most row is the average-of-averages. The bottom-most row is the standard deviation.

We first discuss the results for the Apache subject systems in Table 4.5. The cvg values for a version pair and its corresponding inverse pair shared the same general trend, across all three recovery techniques, that we observed with a2a values:

cvg_Pre ≈ cvg_Patch > cvg_Minor > cvg_MinMaj > cvg_Major

However, individual version pairs and their inverses were notably dissimilar in some cases. For example, across the four major versions of ActiveMQ, ACDC yielded cvg_MinMaj(s,t) = 61% and cvg_MinMaj(t,s) = 48%. This means that a newly introduced major version retained 61% of the immediately preceding minor version's components. In turn, this comprised only 48% of the new major version's components due to the system's increase in size; the remaining 52% were newly introduced components. In other words, ActiveMQ grew by an average of 27% (cvg_MinMaj(s,t)/cvg_MinMaj(t,s)) in the number of components during the introduction of a new major version. Overall, the differences between the average cvg values for version pairs and their inverses across all subject systems (the AVG row in Table 4.5) ranged between 0% (Patch versions in ACDC) and 9% (MinMaj versions in ARC). All three recovery techniques show extensive component-level change at the Major and MinMaj levels. Conversely, all three show significant stability at the Minor, Patch, and Pre-release levels. However, the results yielded by analyzing ARC's recovered architectures are notably different from both ACDC and PKG.
First, both PKG and especially ACDC tended to under-report the degree of component-level similarity of architectures between major version pairs. In several cases, the two techniques yielded no similarity (the 0% values in Table 4.5) even though a manual inspection of the corresponding versions suggested that some component-level similarity was, in fact, preserved. While ARC also yielded very low values for the same cases, in most of those cases it did, accurately, maintain some component-level similarity. The reason for this is that both ACDC and PKG rely on the system's structural dependencies and are significantly affected by changes that span most or all of the system's implementation packages. On the other hand, ARC's reliance on the information contained in the system's implementation elements, rather than on the relative organization of those elements, made it less susceptible to misinterpreting the large system changes that typically happen at the Major and MinMaj levels. An analogous argument explains why ARC yields lower component similarity values for the Minor, Patch, and Pre-release levels: ACDC and especially PKG fail to recognize large architectural changes to system components if those changes are confined within a package or a small number of packages.

Table 4.6: Average cvg values between versions for non-Apache subject systems.
                 ACDC
             Major      MinMaj     Minor      Patch      Pre
System      (s,t)(t,s) (s,t)(t,s) (s,t)(t,s) (s,t)(t,s) (s,t)(t,s)
Druid        57   50     -    -     -    -    97   96     -    -
Guava        89   85     -    -     -    -   100  100   100  100
JS-databind   -    -     -    -    89   88   100  100    99   99
PgJDBC       31   21    76   68    87   83    98   98     -    -
TestNG        -    -     -    -    83   81    99  100     -    -
AVG          59   52    76   68    86   84    99   98    99   99
DEV          24   26     -    -     2    3     1    2     1    1

                 ARC
             Major      MinMaj     Minor      Patch      Pre
System      (s,t)(t,s) (s,t)(t,s) (s,t)(t,s) (s,t)(t,s) (s,t)(t,s)
Druid        77   67     -    -     -    -    94   93     -    -
Guava        89   85     -    -     -    -   100  100   100  100
JS-databind   -    -     -    -    83   81   100   99    98   98
PgJDBC       50   34    79   74    83   78    98   97     -    -
TestNG        -    -     -    -    80   77    99   99     -    -
AVG          72   62    79   74    82   79    98   98    99   99
DEV          16   21     -    -    11    2     2    3     1    1

                 PKG
             Major      MinMaj     Minor      Patch      Pre
System      (s,t)(t,s) (s,t)(t,s) (s,t)(t,s) (s,t)(t,s) (s,t)(t,s)
Druid        82   72     -    -     -    -    99   98     -    -
Guava        94   88     -    -     -    -   100  100   100  100
JS-databind   -    -     -    -    97   96   100  100    97   97
PgJDBC       87   59    95   86    99   92    99  100     -    -
TestNG        -    -     -    -    89   87    99  100     -    -
AVG          88   73    95   86    95   92    99  100    99   99
DEV           5   12     -    -     4    4     0    1     2    2

Value unit is percentage. Lower numbers mean more change. Empty table cells indicate versions that do not exist for a given system. The second bottom-most row is the average-of-averages. The bottom-most row is the standard deviation.

The prevalent trend of cvg values remains intact across all five evolution paths of the non-Apache systems, as shown in Table 4.6. We note that the cvg values in Table 4.6 are markedly higher than the corresponding values in the Apache subject systems in Table 4.5. For example, in the ACDC view, the average values ("AVG") of cvg_Major(s,t) and cvg_Major(t,s) are, respectively, 7% and 5% for the Apache systems, and 59% and 52% for the non-Apache systems. Therefore, the non-Apache systems are more stable not only at the system level, but also at the component level. We can conclude that a2a and cvg mostly display consistent trends for both the Apache and non-Apache systems. In Section 4.4.3, we will discuss instances in which a2a and cvg show different aspects of architectural change that buck this trend.
Such divergence is apparent in some of the larger Apache systems, but was not observed in the non-Apache systems. Another noticeable difference between the Apache and non-Apache systems is that the architectural changes in non-Apache systems are much smaller when moving to a new major version (MinMaj). This again reiterates the importance of stability and backward compatibility in non-Apache systems.

4.4.3 RQ3: System-Level vs. Component-Level Change

While the discussion of RQ1 and RQ2 indicated that architectural change followed the same general trends in our subject systems at the overall-structure and individual-component levels, the extent of that change differed. We can see significant differences between the a2a (architecture-level) and cvg (component-level) change metrics. For example, all three architecture recovery techniques yielded 0% cvg values for JSPWiki's Major versions; none of them did so in the case of a2a.

Figure 4.2: a2a values between minor versions of Ivy
Figure 4.3: cvg(s,t) values between minor versions of Ivy
Figure 4.4: cvg(t,s) values between minor versions of Ivy

The reason for that is that a2a and cvg measure two different aspects of architectural evolution. a2a measures the similarity between two architectures in terms of the number of operations required to transform one architecture to the other, while cvg measures the similarity between two architectures in terms of the number of components that persist in the course of evolution. In JSPWiki, cvg yielded 0% because no two components in the major versions were "sufficiently" similar based on the chosen threshold. On the other hand, a2a found some shared entities among the major versions, resulting in a non-zero degree of similarity. Another revealing example is Lucene. Lucene may be thought of as a catalogue of multiple information retrieval systems that have historically been added to and removed from it.
For example, the Solr project was initially developed by CNET Networks, and later released as an open-source project and merged with the Lucene code base [3]. Due to this nature of Lucene, it has tended to undergo a lot of significant changes before the release of a new major version. Although some parts of the system structure would be maintained (indicated by a2a_Major and a2a_MinMaj), Lucene's components changed significantly (both cvg_Major and cvg_MinMaj are 0% across all three recovery techniques). In Apache systems, we note that the growth of divergence between the a2a and cvg values from Patch and Pre-release versions at one end of the spectrum to Major versions at the other end is more pronounced in the case of ACDC and PKG than in the case of ARC. This is another indicator that the two structure-based recovery techniques are indeed much better equipped to track system-level than component-level architectural changes. Additionally, the consistently higher a2a values that both ACDC and PKG yield as compared to ARC suggest that the semantics-based perspective of ARC yields more architectural changes at the system level as compared to the structure-based perspectives of ACDC and PKG. To illustrate the difference in architectural similarity yielded by the structural views (ACDC and PKG) as compared to ARC, Figures 4.2-4.4 depict architectural changes among minor versions of Ivy: Figure 4.2 depicts the a2a values; Figure 4.3 depicts the cvg values for each version pair (s,t); and Figure 4.4 shows its inverse, i.e., the cvg values for version pairs (t,s). Note that, for clarity, we do not depict the MinMaj evolution paths in Figures 4.2-4.4. Figure 4.2 shows that the trends for a2a values involving Ivy's minor versions are similar among the three architectural views, with the ARC values generally slightly lower than the ACDC and PKG values.
However, Figures 4.3 and 4.4 show that ARC reveals significant component-level changes for the same set of minor versions of Ivy. To verify these and other similar results, we examined the changes that occurred in the involved versions. We found two key reasons for the lower ARC values, particularly at the component level: (1) class additions and (2) renaming of classes and variables. Classes were added, e.g., in Ivy's versions 0.6.0 and 0.8.0, indicating that the semantics of the affected components changed. However, these classes were mostly added to existing packages or components, resulting in a much smaller change to the architecture's structure. This type of semantic change at the component level is precisely the kind of change that the cvg values for ARC are intended to highlight. Furthermore, many classes and variables underwent refactoring across system versions (e.g., from URLDownloader to URLHandler in Ivy). These are semantic rather than structural changes, and are more readily taken into account by ARC than in either of the structural views. On the other hand, in the non-Apache systems used in this study, we did not find any instances in which the values of the a2a and cvg metrics led to contradicting conclusions. We attribute this to the desired high stability of non-Apache systems, which is also reflected in their architectures.

4.4.4 RQ4: Architectural Change in Consecutive Minor Versions

Our finding that major architectural change tends to involve major system versions was not surprising (although several of its facets, discussed above, were unexpected). In particular, we have found that a "jump" to a new major version (MinMaj) results in significant change, sometimes comparable to the cumulative sequence of changes experienced by a system across an entire major version. This can be seen in the MinMaj results in Tables 4.3 and 4.5.
These results also indicate that, on average, a transition to a new major version involves more pronounced architectural changes than transitions between minor versions within the same major version.

Table 4.7: Minimum a2a values between minor versions

(a) Apache subject systems
System       ACDC  ARC  PKG
Accumulo       83   83   84
ActiveMQ       86   78   84
Cassandra      60   55   49
Chukwa         72   73   76
Hadoop         57   51   72
HttpClient     71   70   74
Ivy            85   65   79
JackRabbit     76   69   74
Jena           83   77   86
JSPWiki        47   55   58
Log4j          62   62   59
Lucene         96   89   90
Mina           93   92   98
PDFBox         87   87   87
Poi            72   67   70
Struts2        79   80   83
Tika           85   80   86
Xerces         41   37   48
AVG            74   70   73
DEV            15   15   14

(b) Non-Apache subject systems
System            ACDC  ARC  PKG
Druid               -    -    -
Guava               -    -    -
Jackson-databind   94   89   95
PgJDBC             91   80   91
TestNG             77   71   78
AVG                87   80   88
DEV                 7    7    7

An interesting question we set out to explore in this study was whether this is always the case. In other words, can a system's architecture experience changes between two consecutive minor versions that are comparable to the changes between a minor version and the subsequent major version? To this end, we conducted an analysis to determine the minimum similarity among all consecutive minor version pairs within a major version. Tables 4.7a and 4.7b show the a2a results of that analysis on architectures produced by ACDC, ARC, and PKG. In our dataset, Druid and Guava do not have minor versions. Therefore, the cells for those systems in Table 4.7b are empty. We first discuss the Apache systems, shown in Table 4.7a. Several values in the table indicate that considerable architectural change can indeed occur between two minor versions (e.g., 47% for ACDC in JSPWiki; 37% for ARC in Xerces). In some systems (e.g., Cassandra), the minimum a2a values between consecutive minor versions (60% for ACDC; 55% for ARC; 49% for PKG) are lower than the corresponding MinMaj values (80% for ACDC; 75% for ARC; 79% for PKG, as shown in Table 4.3).
The analogous analysis involving minimum cvg values shows similar results, but is elided due to space constraints. The main reason for this is that developers tended to add a large number of new features to a new minor version of a system, especially at the beginning of the system's life cycle. For example, Xerces more than doubled in size from version 1.0 to version 1.2, which is its next downloadable minor version. In addition to the substantial system changes that are likely at early stages of development, this may also have occurred due to the lack of clear and consistent versioning guidelines. This was one of several findings that indicated that software engineers may be missing a crisply defined, shared intuition as to how and to what extent a software architecture changes as a system evolves. Such findings reveal that software developers do not always base their versioning schemes on the architectural impact of their changes, be it because they are not aware of said impact or because they do not consider it relevant enough to be reflected in the versioning scheme. The minimum a2a values for the non-Apache subject systems are shown in Table 4.7b. Although the architectures of these systems are highly stable, in some instances big architectural changes are noticeable (e.g., 71% for ARC in TestNG). However, the minimum a2a_Minor values are relatively high compared to the Apache systems. In addition, the standard deviation values of the non-Apache systems (7% in Table 4.7b) are smaller than the standard deviation values of the Apache systems (14%-15% in Table 4.7a). This reinforces the inference that developers of those three systems, all of which are libraries, care about stability and backward compatibility, and are likely to keep the system's architecture stable across a single major version.
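The component-level comparisons reported throughout this chapter rest on the cvg metric, which can be illustrated with a small sketch. Here a recovered architecture is a mapping from component names to sets of implementation entities, and a source component "survives" into the target version if some target component contains at least th_cvg of its entities; the function and data shapes are illustrative simplifications of ARCADE's actual one-to-one component matching.

```python
def cvg(src, tgt, th_cvg=0.67):
    """Percentage of src components that persist in tgt: a component
    survives if some tgt component retains at least th_cvg of its
    entities (simplified; ARCADE matches components one-to-one)."""
    if not src:
        return 0.0
    survived = sum(
        1 for entities in src.values()
        if any(len(entities & t) / len(entities) >= th_cvg
               for t in tgt.values())
    )
    return 100.0 * survived / len(src)
```

Because cvg is directional, cvg(s,t) and cvg(t,s) can differ; that asymmetry is how system growth (e.g., ActiveMQ's 61% vs. 48% at the MinMaj level) shows up in Tables 4.5 and 4.6.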
Chapter 5
An Empirical Study of Software Architectural Decay

5.1 Foundation

Our work discussed in this chapter is directly enabled by three research threads: (1) software architecture recovery, (2) definition of architectural smells, and (3) tracking of implementation issues. We overview each of these threads.

5.1.1 Architecture Recovery

In this work, an implemented software system's architecture is represented as a graph that captures components as vertices and their interconnections (e.g., call dependencies and logical couplings) as edges. Each component, in turn, contains a set of entities (e.g., implementation classes). Our previous studies [48, 68] have shown that three architecture recovery techniques (ACDC [109], ARC [52], and PKG [68]) generally exhibit higher accuracy than their competitors. For this reason, we use these three techniques in our study of architectural decay. ACDC is oriented toward components that are based on structural patterns (e.g., components consisting of entities that together form a particular subgraph). ARC produces components that are semantically coherent due to sharing similar system-level concerns (e.g., a component whose main concern is handling distributed jobs). Finally, a system's package structure extracted by PKG forms a sort of "developers' perception" of the as-implemented architecture [68]. While the three selected recovery methods differ in how they approach architecture, they all produce (1) clusters of source-code entities, (2) dependencies among code entities within a cluster, and (3) dependencies across clusters. We use this information to detect architectural smells.

5.1.2 Architectural Smells

Similar to the concept of smells at other levels of system abstraction (namely, code and design smells), architectural smells are instances of poor design decisions [81] at the architectural level.
They negatively impact system lifecycle properties, such as understandability, testability, extensibility, and reusability [50]. While code smells [44], anti-patterns [30], or structural design smells [46] originate from implementation constructs (e.g., classes), architectural smells stem from poor use of software architecture-level abstractions: components, connectors, interfaces, patterns, styles, etc. Detected instances of architectural smells are candidates for restructuring [28], to help prevent architectural decay and improve system quality. Researchers have collected and reported a growing catalog of architectural smells. Garcia et al. [50, 51] have identified an initial set of four architectural smells related to connectors, interfaces, and concerns. Mo et al. [82] extended Garcia's list with a new concern-related smell. Ganesh et al. [46] also summarized a catalog of structural design smells, some of which are actually at the architecture level. We recently provided a short description of 12 different architectural smells [69] and proposed a framework for relating architectural smells to system sustainability. The work described in this chapter extends these prior approaches.

5.1.3 Issue Tracking Systems

All subject systems that were selected for this chapter use Jira [7] as their issue repository. Our approach relates implementation issues with smells based on the data available on Jira. A similar approach can be applied to other repositories. When reporting implementation issues, engineers categorize them into different types: bug, new feature, feature improvement, task to be performed, etc. Each issue has a status that indicates where the issue is in its lifecycle [17]. Each issue starts as "open", progresses to "resolved", and finally to "closed".
We restrict our study to closed and fixed issues because they were verified and addressed by developers, so that any effects caused by them would presumably appear in certain system versions and disappear once the issue is addressed. Additionally, a fixed issue contains information that is useful for our study: (1) affected versions in which the issue has been found, (2) the type of the issue, and (3) fixing commits, i.e., the changes applied to the system to resolve the issue. Finding fixing commits is not always easy since there is no standard method for engineers to keep track of this information on issue trackers. In Jira, we found three popular methods: (1) directly mapping to fixing commits, (2) using pull requests, and (3) using patch files. Our approach supports all three methods.

Figure 5.1: Mapping architectural smells to issues.

Based on the collected information, issues are mapped to detected smells using the model depicted in Figure 5.1. First, we find the system versions that the issue affects. Then we find the smells present in those versions. We say the issue is infected by a given smell iff (1) both the issue and the smell affect the same version of a system and (2) the resolution of the issue changes files that are involved in the smell. Note that resolving the issue may or may not remove the smell that may have caused the issue in the first place: developers may find a different workaround. Based on this relationship between issues and smells, we studied whether the characteristics of an issue (e.g., type, number of fixing commits) are correlated with whether the issue is infected by a given smell.

5.2 Research Question and Hypotheses

Research Question: How do architectural smells manifest themselves in a system's implementation?

Although smells are characterized as instances of architectural decay and candidates for restructuring, it is currently unknown how they manifest themselves in the implementation of a software system.
Our research question in this study seeks empirical evidence for the long-established claims about the negative impact of architectural decay on software systems. As mentioned in the introduction, we hypothesized that the reported issues of a system reflect the decay in its architecture. Therefore, all hypotheses that we defined in this study focus on exploring relationships between detected smells and reported implementation issues. If such a correlation exists, then the decay of a system's architecture can lead to tangible problems experienced by developers. Smells would show their usefulness in pointing out the "hot spots" [9] in a software system. Those "hot spots" are the highly active areas in the code that (1) have a lot of bugs, (2) are related to performance fine-tuning of the system, or (3) are plug-in points for new features. A "hot spot" creates issues repeatedly, as developers try to tackle its myriad problems. On the other hand, these reported issues may be the manifestations of underlying smells. Therefore, studying relationships between smells and issues may help developers find the architectural root causes of implementation-level problems. In turn, this may reveal proper resolutions for the issues and remove the "hot spots". Based on the mapping between smells and issues described in Section 5.1.3, we have defined several concepts that serve to formulate and validate our research hypotheses. We call a file "smelly" in a system version if that file is affected by at least one smell found in the recovered architectures of that version. Otherwise, we call it "clean". We call an issue "smelly" if at least one smelly file of a version affected by that issue is involved in the issue's resolution. If no such file exists, we call the issue "clean". We studied correlations between smells and issues at the level of files instead of architectural components. There are several reasons for this decision.
First, for each issue, trackers record file modifications, and commonly, files serve as proxies for classes, which are the elements of components in our architectural model. In addition, smells can be divided into two categories depending on the scope of their detection algorithms:

Partial-component smells affect only a subset of implementation entities (e.g., classes) in a component. Unused Interface, Sloppy Delegation, Duplicate Functionality, Logical Coupling, and Dependency Cycle are smells of this type.

Whole-component smells affect the entire component. Unused Brick, Brick Functionality Overload, Lego Syndrome, Link Overload, Scattered Parasitic Functionality, and Concern Overload are smells of this type.

When an instance of a partial-component smell is found, it means that some of the component's implementation entities are not affected by it. Analyzing smell-issue correlations for such smells at a higher level of abstraction (i.e., at the component level) would waste the more fine-grained information we have already gained about that smell instance at the file level, and would consequently reduce the value of that smell in pinpointing the precise location where fixes need to take place. For example, the members of a Dependency Cycle (a partial-component smell) can be two or more proper subsets of the classes from two or more different components. If an issue changes files that do not participate in the cycle, associating the issue with the component-level dependency cycle smell would be inaccurate. To answer our research question in this study, we defined the following three hypotheses about the relationships between architectural smells and implementation issues.

Hypothesis H1 (issue-proneness): Smelly files are more likely to have issues than clean files.

Architectural smells can be used to identify the "hot spots" of a software system. Therefore, we target all types of issues that are reported on the Jira issue tracker system.
We expect that the smelly parts of a system are its most likely "hot spots", which indicates that smell-infected files are more issue-prone than clean files.

Hypothesis H2 (change-proneness): Smelly files are more likely to be changed than clean files.

This hypothesis addresses the key, frequently stated assumption about architectural decay: that it causes more maintenance effort. We posit that system changes are more likely to happen around smelly than clean components. It is important to verify this assumption because changes to a given file related to an architectural smell may readily cause chain reactions in multiple additional files affected by that smell.

Hypothesis H3 (bug-proneness): Smelly issues are more likely to be of the bug type than clean issues.

By definition, smells are violations of fundamental design decisions [8, 44], and are not usually related to bugs. Although bugs are problems that manifest themselves at the implementation level, we posit that, in many cases, bugs are introduced into a system because engineers make changes without considering their architectural impact. We expect to find that smelly parts of a system are related to significant numbers of bugs, and that smell-infected issues are more likely to be bugs than clean issues.

5.3 Subject Systems

This study reports empirical results for a set of Apache open-source projects. We selected Apache because it is one of the largest open-source organizations and has produced a number of impactful systems. Furthermore, Apache has well-established code repositories and bug trackers that are open to the public. Researchers can query a system for past versions and analyze its development history (e.g., issues and release notes).

Table 5.1: Subject systems analyzed in the empirical study of architectural decay

  System      Domain                   No. of Versions   No. of Issues   Avg. SLOC
  Camel       Integration Framework          78               9665          1.13M
  Continuum   Integration Server             11               2312           463K
  CXF         Service Framework             120               6371           915K
  Hadoop      Data Process Framework         63               9381          1.96M
  Nutch       Web Crawler                    21               1928           118K
  OpenJPA     Java Persistence               20               1937           511K
  Struts2     Web App Framework              36               4207           379K
  Wicket      Web App Framework              72               6098           332K

Table 5.1 shows the list of systems selected for our study. To ensure the generality of our conclusions, we manually examined all the available Apache open-source systems that use Jira to track issues [7] and applied the following selection criteria.

1. Different software domains, to ensure broad applicability of our results.

2. Tracking of issues and their fixing commits, to help map architectural smells to implementation problems. Specifically, we analyze "resolved" and "closed" issues because they have complete sets of fixing commits.

3. Large numbers of resolved and closed issues, to give us sufficient data to ensure the accuracy of our analysis. In the selected set of subject systems, the numbers of such issues per system vary between 1,928 and 9,665.

4. Availability of versions along the system's lifetime, to allow tracking of architectural decay trends. In the selected systems, the numbers of versions vary between 11 and 120.

Among these criteria, the most critical one is the availability of reported implementation issues and their fixing commits: the motivating observation for our study was that if a given architectural smell has a concrete impact on the system's implementation, then this impact can be revealed by studying the implementation issues. Section 5.1.3 explained how we exploited the information found in issue repositories.

For each version of each subject system, we first recovered its architectures by using the three selected architecture recovery techniques in ARCADE: ACDC, ARC, and PKG. Then, the smells in those architectures were identified using the automated detection algorithms.
Table 5.2 shows the average number of smells in each category, across all analyzed versions of each subject system. (Note that concern-based smells can only be detected under ARC, which produces a concern-based architectural view.) Subsequently, information on issues was extracted from the issue-tracking systems. Finally, the smells were mapped to the issues, as described above.

Table 5.2: Average numbers of architectural smells per system version

                     ACDC                    ARC                     PKG
  System      Dep.  Con.  Int.  Cou.   Dep.  Con.  Int.  Cou.   Dep.  Con.  Int.  Cou.
  Camel          9     0    51     0     21   195   426     0     10     0   111     0
  Continuum      3     0    11     3      4     9    26     4      4     0    23     2
  CXF           34     0   131    92     15    78   156   114     56     0   245   116
  Hadoop         6     0    46    18     43    27   112    47      9     0    53    18
  Nutch          5     0    24     3      7    12    34     5      6     0    32     4
  OpenJPA        9     0    46     0     15    19    55     0     12     0    14     0
  Struts2        3     0    25     6      9    17    70     7      6     0    51     7
  Wicket        11     0    76     0     28    81   327     0     15     0   119     0

(ACDC and PKG are unable to identify concern-based smells since they do not capture system concerns.)

5.4 Results

For each research hypothesis stated above, we will discuss the method employed in attempting to answer it and the associated findings under the three architectural views.
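The smelly/clean labeling used throughout this chapter can be sketched in a few lines of Python. The smell-to-file and issue-to-file mappings below are hypothetical stand-ins for the data actually extracted by ARCADE and the Jira trackers:

```python
# Sketch of the smelly/clean labeling (hypothetical data). A file is "smelly"
# if at least one detected smell affects it; an issue is "smelly" if its
# resolution touched at least one smelly file.

# Hypothetical inputs: smell instance -> affected files, issue -> fixed files.
smell_files = {
    "DependencyCycle#1": {"A.java", "B.java"},
    "LinkOverload#1": {"C.java"},
}
issue_files = {
    "PROJ-101": {"A.java", "D.java"},   # touches a smelly file
    "PROJ-102": {"D.java", "E.java"},   # touches only clean files
}

# Union of all files affected by any smell instance.
smelly_files = set().union(*smell_files.values())

def label_file(f):
    return "smelly" if f in smelly_files else "clean"

def label_issue(fixed):
    # An issue is smelly if its fixed files intersect the smelly-file set.
    return "smelly" if fixed & smelly_files else "clean"

labels = {iid: label_issue(files) for iid, files in issue_files.items()}
```

Here PROJ-101 would be labeled smelly (it changed A.java) and PROJ-102 clean.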
5.4.1 Hypothesis H1 – Issue-Proneness of Files

Table 5.3: Average number of issues per file

                     ACDC                          ARC                           PKG
  System      Smelly Clean Factor p-value   Smelly Clean Factor p-value   Smelly Clean Factor p-value
  Camel        2.5    1.7  1.47x  0.0001     2.5    1.4  1.71x  0.0001     2.5    1.6  1.56x  0.0001
  Continuum    2.4    1.4  1.71x  0.0001     2.2    1.5  1.47x  0.0010     2.4    1.6  1.50x  0.0010
  CXF          4.9    2.9  1.69x  0.0001     5.6    2.7  2.07x  0.0001     5.6    3.0  1.86x  0.0001
  Hadoop       4.4    2.5  1.76x  0.0001     4.4    2.1  2.10x  0.0001     4.5    2.4  1.86x  0.0001
  Nutch        3.8    1.6  2.38x  0.0001     3.8    1.7  2.24x  0.0001     3.9    1.6  2.44x  0.0001
  OpenJPA      3.3    2.0  1.65x  0.0001     3.3    2.2  1.50x  0.0001     3.3    2.1  1.57x  0.0001
  Struts2      2.1    1.7  1.24x  0.0020     2.1    1.7  1.24x  0.0100     2.1    1.7  1.24x  0.0016
  Wicket       3.4    2.2  1.54x  0.0001     3.5    1.8  1.94x  0.0001     3.4    2.3  1.47x  0.0001

To verify this hypothesis, for each version of a software system, we first collected the files that were changed in order to fix the issues affecting that version. We then divided the collected files into two groups: smelly and clean files. For each file, we counted the number of issues that affected it. Finally, we used MiniTab [14] to apply a 2-sample t-test [92], which compares two population means, to find the difference in the number of issues between the two groups. We used an alternative formulation of hypothesis H1 to apply this test, namely, (m_smelly - m_clean) > 0.

Table 5.3 shows the average numbers of issues across all of the analyzed versions of each subject system under the three selected architectural views (ACDC, ARC, and PKG). The columns under each view show the average numbers of issues in smelly and clean files, the difference factor between those two averages, and the p-value of the test that verifies our hypothesis. We found that p-value < 0.05 for each of the three architectural views of each subject system. Therefore, we can accept the alternative formulation of the hypothesis, i.e., the average number of issues involving smelly files is higher than the average number of issues involving clean files.
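The study ran this test in MiniTab. As an illustration only, the same one-sided two-sample comparison can be sketched in plain Python using Welch's t-statistic with a standard-normal approximation of the one-sided p-value (adequate for the large file samples in the study; the per-file issue counts below are made up):

```python
import math
from statistics import mean, variance

def welch_one_sided(smelly, clean):
    """One-sided Welch two-sample test of mean(smelly) > mean(clean).

    Returns (t, p), where p uses the standard-normal approximation of the
    t-distribution; MiniTab's exact t-based p-value will differ slightly
    for small samples.
    """
    n1, n2 = len(smelly), len(clean)
    se = math.sqrt(variance(smelly) / n1 + variance(clean) / n2)
    t = (mean(smelly) - mean(clean)) / se
    p = 0.5 * (1 - math.erf(t / math.sqrt(2)))  # P(Z > t)
    return t, p

# Hypothetical per-file issue counts for one system version:
smelly = [3, 4, 2, 5, 3, 4, 6, 2, 3, 5]
clean  = [1, 2, 1, 3, 2, 1, 2, 1, 2, 2]
t, p = welch_one_sided(smelly, clean)
# With clearly separated samples like these, t > 0 and p < 0.05, so the
# alternative formulation (m_smelly - m_clean) > 0 would be accepted.
```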
Our conclusion is statistically significant with a confidence level of 95% in all subject systems. The multiplication factors in Table 5.3 range from 1.24 to 2.10, which means that the issue rates in the smelly files of our subject systems increase by 24% to 110%. This shows clear evidence that issue-proneness is strongly correlated with architectural smells. Among the different views, the ARC recoveries have higher multiplication factors than ACDC and PKG. In particular, the smelly files in CXF, Hadoop, Nutch, and Wicket have roughly twice the number of issues under ARC as compared to the clean files. This suggests that architectural smells detected under the ARC view may isolate issue-prone parts of a system better than smells detected under the ACDC and PKG views. We are currently further investigating this observation.

5.4.2 Hypothesis H2 – Change-Proneness of Files

Table 5.4: Average number of commits per file

                     ACDC                          ARC                           PKG
  System      Smelly Clean Factor p-value   Smelly Clean Factor p-value   Smelly Clean Factor p-value
  Camel       14.7   11.3  1.30x  0.0001    13.7    8.9  1.53x  0.0001    14.3    9.7  1.47x  0.0001
  Continuum   32.4   17.7  1.83x  0.0380    30.3   17.7  1.71x  0.0034    30.1   17.8  1.69x  0.0301
  CXF         13.3   12.1  1.10x  0.0349    12.6   11.1  1.13x  0.0063    13.3   11.7  1.14x  0.0065
  Hadoop      10.1    5.9  1.71x  0.0001     7.2    5.9  1.22x  0.1366     9.7    5.9  1.64x  0.0003
  Nutch       11.9    9.3  1.27x  0.0015    12.3    9.3  1.32x  0.0005    11.8    9.3  1.26x  0.0024
  OpenJPA     19.1   13.6  1.40x  0.0001    20.3   13.6  1.49x  0.0001    19.2   13.6  1.41x  0.0001
  Struts2     17.6    9.7  1.81x  0.0001    17.9    9.8  1.83x  0.0001    17.7    9.7  1.82x  0.0001
  Wicket       7.4    5.2  1.42x  0.0001     7.1    6.3  1.12x  0.0001     7.4    5.8  1.26x  0.0001

To verify this hypothesis, we performed an analysis similar to the one for hypothesis H1. However, instead of counting the numbers of issues, we used git-log [10] to analyze the source code repositories of the subject systems and extract the numbers of commits related to smelly and clean files.
We employed the same statistical method as in H1 to find the difference between the two groups, i.e., the smelly and clean files. We used an alternative formulation of hypothesis H2 to apply this test, namely, (m_smelly - m_clean) > 0.

Table 5.4 shows the average numbers of commits for the two groups across all of the analyzed versions of each subject system. The structure of this table is similar to that of Table 5.3. With the exception of a single case out of 24, we found that p-value < 0.05 in every subject system and across all architectural views. For all but that one case, we can accept the alternative hypothesis, i.e., the average number of changes in smelly files is higher than the analogous number in clean files. Our conclusion is statistically significant with a confidence level of 95% in those cases.

The exceptional case is Hadoop under the ARC view, in which p-value = 0.1366. Recall that ARC can be used to detect concern-based smells, while ACDC and PKG cannot. We found that many of Hadoop's issues are affected by the Concern Overload smell type (385 affected issues out of the total 9,381 issues). This is a significantly greater proportion than in other systems (e.g., 34 affected issues out of 4,207 issues in Struts2). The average number of commits of smelly files (7.2) under the ARC view is lower compared to the ACDC (10.1) and PKG (9.7) views. This hints that Concern Overload may affect files that have fewer commits in Hadoop. More generally, this observation suggests that different smell types may have different levels of usefulness in isolating problematic files (e.g., when targeting change-proneness). Definitively confirming this will require further analysis that isolates different smell types. This is an area of planned future work.

The multiplication factors in Table 5.4 range from 1.10 to 1.83 across the three views, meaning that the change rates in the smelly files of our subject systems increase by 10% to 83%.
These values show consistent trends across the three views for all systems except Hadoop and Wicket. In the cases of Hadoop and Wicket, the multiplication factors under the ARC view are somewhat lower as compared to the other two views. This observation may be accounted for by the above-discussed effect of different smell types on change-proneness. Further analysis will be required to confirm this.

Table 5.5: Average percentage of buggy issues

                     ACDC                            ARC                             PKG
  System      Smelly  Clean  Factor p-value   Smelly  Clean  Factor p-value   Smelly  Clean  Factor p-value
  Camel       58.8%   55.7%  1.05x  0.2737    60.3%   54.2%  1.11x  0.1174    59.7%   54.3%  1.10x  0.1906
  Continuum   69.7%   53.3%  1.31x  0.1014    70.9%   61.1%  1.16x  0.1664    71.3%   58.3%  1.22x  0.1079
  CXF         79.4%   62.9%  1.26x  0.0382    79.3%   52.3%  1.52x  0.0001    79.1%   63.0%  1.33x  0.0001
  Hadoop      71.7%   57.7%  1.24x  0.0018    71.6%   60.0%  1.19x  0.0076    71.5%   59.5%  1.20x  0.0054
  Nutch       70.6%   49.7%  1.42x  0.0120    68.1%   46.9%  1.45x  0.0214    69.7%   47.5%  1.46x  0.0185
  OpenJPA     77.6%   70.5%  1.10x  0.1294    76.0%   69.7%  1.09x  0.1354    77.6%   70.6%  1.10x  0.1307
  Struts2     59.8%   45.6%  1.31x  0.0179    59.9%   45.4%  1.32x  0.0152    60.1%   45.1%  1.33x  0.0130
  Wicket      66.5%   65.1%  1.02x  0.3556    67.7%   62.2%  1.08x  0.1242    66.5%   64.1%  1.03x  0.5072

5.4.3 Hypothesis H3 – Bug-Proneness of Issues

To verify this hypothesis, for each version of a software system, we first collected all issues that were reported during the system's lifetime. We then used the mapping between smells and issues from Section 5.1.3 to categorize the collected issues as "smelly" or "clean". Subsequently, we calculated the relative proportion of issues that are bugs for each group. Finally, we used MiniTab [14] to apply a 2-proportions test [92], which compares the proportions of bug-type issues in the two groups, to find the difference between them. We used an alternative formulation of hypothesis H3 to apply this test, namely, (p_smelly - p_clean) > 0.
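The study applied this test in MiniTab; the underlying one-sided two-proportion comparison (normal approximation) can be sketched as follows, with made-up bug counts for the two issue groups:

```python
import math

def two_prop_one_sided(bugs1, n1, bugs2, n2):
    """One-sided two-proportion z-test of p1 > p2 (pooled normal
    approximation), mirroring the 2-proportions test used in the study."""
    p1, p2 = bugs1 / n1, bugs2 / n2
    pooled = (bugs1 + bugs2) / (n1 + n2)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))
    z = (p1 - p2) / se
    p_value = 0.5 * (1 - math.erf(z / math.sqrt(2)))  # P(Z > z)
    return z, p_value

# Hypothetical counts: 70% of 500 smelly issues are bugs vs. 50% of 400
# clean issues; such a gap yields z > 0 and a p-value far below 0.05.
z, p = two_prop_one_sided(350, 500, 200, 400)
```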
Table 5.5 has a structure similar to the two previous tables, but it shows the average percentages of smelly and clean issues that are bugs, across all analyzed versions of each subject system, under the three selected architectural views (ACDC, ARC, and PKG). As indicated in Table 5.5, p-value < 0.05 for CXF, Hadoop, Nutch, and Struts2 under all three views. In these cases, we accept the alternative formulation of the hypothesis stated above: the percentage of smelly issues that are of the bug type is higher than the percentage of clean issues that are of the bug type in these four systems. This interesting result shows that architectural smells can significantly affect the bug rates of some software systems. Under the ACDC view of these systems, the multiplication factors range from 1.24 to 1.42, meaning that the proportion of smelly issues that are bugs is 24% to 42% higher than the analogous proportion of clean issues. The corresponding ranges under the ARC and PKG views are 19-52% and 20-46%, respectively. This indicates that the parts of a system that are affected by architectural smells have substantially higher defect rates than the clean parts.

In the case of Continuum, we cannot accept the alternatively formulated hypothesis with a high confidence level (the p-values are 0.1014, 0.1664, and 0.1079 under the three views). We believe that this is because Continuum has a comparatively small number of released versions (we analyzed 11 of them). However, we do see a big difference between the average proportions of the smelly and clean groups: the multiplication factors in the case of Continuum indicate that these differences range between 16% and 31% under the three views.

Camel, OpenJPA, and Wicket represent the case that software researchers would typically expect: smells are not directly relevant to bugs. We cannot accept the alternatively formulated hypothesis for them because p-value > 0.05.
While the difference between the two groups of issues still exists, it is smaller, ranging between 2% and 11%. There is one possible reason for this: none of these systems has coupling-based smells (recall Table 5.2). This suggests that coupling-based smells may be an influential factor in increasing a system's bug rate. This observation is consistent with prior research focusing on lower abstraction levels, which revealed a relationship between coupling and software defects [37]. It is also suggested by the trends between the respective numbers of coupling-based smells and the p-values in the other four subject systems in our study. To confirm this intuition, we are currently formulating the details of further analysis using multiple-regression and machine-learning techniques to determine bugginess along multiple dimensions (e.g., multiple smell types). This is part of our future work.

5.5 Further Discussion

In this section, we first discuss the significance and implications of the empirical study's results. We then consider architectural smells from the perspective of technical debt.

5.5.1 Significance and Implications of Our Results

The results for hypotheses H1, H2, and H3 confirm that smells can manifest themselves in different ways. In most cases, the appearance of smells increases the rate of (1) issues that developers experience and (2) changes required in the system. Furthermore, there are indicators that the appearance of smells belonging to a specific category, coupling-based smells, also tends to increase the number of bugs in the system, which means that developers have to expend more effort on unplanned maintenance tasks.

Although our empirical study yielded a number of interesting results, it was designed to conduct general analyses that do not isolate the impact of individual smells. As mentioned in Section 5.4, we observed several interesting phenomena that we were unable to explain definitively.
This will require further analysis focusing on the impact of individual smells and/or specific smell groupings.

Figure 5.2: Percentage of smelly files in Camel (ACDC view). System versions are shown along the bottom; the numbers of files are on the left; the percentages are on the right.

One aspect of our empirical results that needs to be further considered is the percentage of smelly files in a given system. If the list of identified smells spans a large number of files, showing this information to developers does not provide much benefit. Figure 5.2 shows an example of the percentage of smelly files in Apache Camel under the ACDC view. This figure is representative of the trends we observed in the other subject systems and architectural views. The columns and the left y-axis of the figure show the size of the system, in terms of the number of files, across the 78 versions of Camel. The lower, highlighted portions of the columns show the numbers of smelly files in each version, indicating that significant signs of architectural decay were present starting with the initial versions of Camel. The figure also shows that the actual numbers of smelly files increase slowly over the lifetime of the system. This was observed for all our subject systems. In general, about 18% of the files in Camel are affected by architectural decay. We consider this not to be a prohibitively large portion of a system, especially since engineers can narrow down the list of smelly files by looking for a specific smell or smell category.

Figure 5.3: Size distribution of smelly vs. non-smelly files in Camel (ACDC view). The file-size ranges are shown along the bottom; the percentages of identified smelly/non-smelly files are on the left.

We also analyzed the sizes of the smelly files we identified. Previous studies have shown that file size correlates with error rates and churn (e.g., [118]). Thus, we investigated whether the identified smells are more likely to appear in large files.
To this end, we used git-log to collect the sizes of the smelly files. As a representative example, Figure 5.3 shows the size distribution of smelly and non-smelly files in Camel. Similar distributions are observed in the other systems and under different architectural views. The x-axis shows the range of file sizes: >90% indicates the largest 10% of all files in the system, 80%-90% indicates the files in the second-largest group by size, and so forth. The columns and the y-axis indicate the percentage of smelly and non-smelly files that belong to each size range. For example, 11.2% of the files involved in identified smells are in the top 10% (>90%), 11.7% of the files are in the next 10% (80%-90%), and so forth.

Figure 5.3 shows that the architectural smells affect files in all size ranges. Furthermore, the distribution of smells among the files is relatively even and, to a large extent, independent of file size. For example, in the case of Camel, the largest 50% of files are affected by 54.4% of all identified smells. A similar observation holds for non-smelly files. Given the correlation between smells on the one hand and issues (Section 5.4.1) and commits (Section 5.4.2) on the other, this result seems to be at odds with previous publications, and indicates that symptoms of architectural decay are independent of the sizes of the implementation files in which they emerge.

5.5.2 Architectural Smells as Technical Debt

Similarly to code smells, architectural smells can be considered a form of technical debt [34]. Debt is introduced into a software system when developers apply a quick solution to achieve a near-term goal, but one that has negative long-term consequences. One significant consequence of technical debt is the increase in maintenance effort over time. To investigate whether architectural smells cause significant problems, we looked for (1) "long-lived" smelly files and (2) indications that those files are continuously involved in reported implementation issues.
We call a smelly file long-lived if it involves architectural smells across a large number of analyzed versions. From the collected data, we observed that long-lived smelly files tend to be continuously involved in issues during a system's lifetime. Furthermore, the numbers of issues related to those files are at times very large.

Figure 5.4: Top-5 long-lived smelly files in Hadoop (top) and Struts2 (bottom). The x-axis indicates system versions; the y-axis indicates the accumulated number of issues.

Figure 5.4 illustrates this in the case of two subject systems, Hadoop and Struts2; the results for the other subject systems show a similar trend. In the two plots, each line represents a long-lived smelly file in the respective system. The x-axis represents the analyzed system versions, and the y-axis represents the accumulated numbers of issues related to smelly files. We identified the long-lived smelly files by tracking individual files that were involved in architectural smells along the evolution of the system. To simplify the plots, we selected only the top 5 long-lived files in both Hadoop and Struts2.

Long-lived smelly files of Hadoop and Struts2 are involved in new implementation issues across most of the analyzed versions. The total numbers of issues in those files were not stable, but appeared to increase linearly over the respective systems' lifespans. To verify this, we computed the R-squared values [16], which estimate how well the numbers of issues in each long-lived smelly file fit a linear regression model. Table 5.6 shows the R-squared values of the top five long-lived smelly files in both Hadoop and Struts2. All values are close to 100%, confirming our observation of linear increase. New issues are usually associated with a smelly file when a major version of the system is released. For example, the four big jumps in the numbers of issues in Figure 5.4 correspond to versions 0.13.0, 0.14.0, 0.15.0, and 0.16.0 of Hadoop, the major versions in Hadoop's version numbering scheme.
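The linear-fit check behind Table 5.6 can be sketched as follows. The per-version issue counts are hypothetical, and R-squared is computed from an ordinary least-squares line over the version index:

```python
def r_squared(y):
    """R-squared of an ordinary least-squares line fit to y over x = 0..n-1,
    the goodness-of-fit measure reported in Table 5.6 (illustrative sketch)."""
    n = len(y)
    xs = range(n)
    mx, my = (n - 1) / 2, sum(y) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (yi - my) for x, yi in zip(xs, y))
    slope = sxy / sxx
    intercept = my - slope * mx
    ss_res = sum((yi - (slope * x + intercept)) ** 2 for x, yi in zip(xs, y))
    ss_tot = sum((yi - my) ** 2 for yi in y)
    return 1 - ss_res / ss_tot

# Hypothetical accumulated issue counts for one long-lived smelly file:
issues = [2, 5, 7, 11, 12, 15, 18, 20]
# A near-linear series like this yields an R-squared close to 1 (~100%),
# matching the pattern observed for the files in Table 5.6.
```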
This association between architectural smells and implementation issues provides a direct motivation for engineers to refactor the system's architecture and remove smells before each release.

Table 5.6: R-squared values of linear regression models of long-lived smelly files

  System    File #1   File #2   File #3   File #4   File #5
  Hadoop    94.61%    96.21%    98.41%    97.56%    96.07%
  Struts2   94.24%    96.90%    91.69%    95.21%    94.86%

Chapter 6

Speculative Analysis to Predict Impact of Architectural Decay on the Implementation of Software Systems

6.1 Foundation

This chapter discusses the dissertation's third large study, which aims to solve a prediction problem by using supervised machine learning (ML) techniques. In supervised learning, ML systems learn how to combine inputs to produce useful predictions on data that has not been seen before. In particular, we would like to predict the impact of architectural decay on the implementation of software systems. In this prediction problem, the input features are the architectural smells of a system, and the output features, i.e., labels, are the issue- and change-proneness of that system. This section describes the two steps used to pre-process the raw data before it is fed to supervised ML algorithms: (1) labeling the implementation's properties and (2) balancing the datasets.

6.1.1 Labeling Data

Labeling data is a crucial step in ensuring the success of building prediction models. In our prediction problem, the raw information about a system's implementation includes the numbers of issues and changes of each source file. Nonetheless, these numbers cannot be used directly as predicted labels because, intuitively, it would be impossible to build a model that can accurately predict a precise number of issues or changes. Instead, those numbers have to be converted to nominal labels which represent the levels of issue- and change-proneness.
The way to convert a set of numeric values to nominal labels depends on the distribution of the numeric values. In our problem, the numbers of issues and changes of a system follow a heavy-tailed distribution [12]. Such distributions are often segmented into ranges. Most simply and commonly, a distribution can be divided into two segments: the head and the tail. A more sophisticated approach is to divide the distribution into three parts: the head, the body, and the tail. This study uses the three-segment division, which represents three levels of proneness: low, medium, and high. This approach is chosen because of its potential to give developers a better estimate of architectural decay's impact. To segment a dataset, this study uses the Pareto principle [15], a popular segmentation method for heavy-tailed distributions. This principle is also widely used in software engineering, particularly value-based software engineering [23]. To obtain three segments, the Pareto principle is applied twice, as suggested in the literature [20].

To collect the data regarding architectural decay, this study uses the approach already applied in the second study of this dissertation on architectural decay. More specifically, for each version of a subject system, we first collect the list of issues that affect that version. As in the second study, only "resolved" and "fixed" issues are taken into account. Then the files that were changed in order to fix the issues are collected. For each file, we gather its associated architectural smells (determined by the triad relationships of issues, files, and smells in Figure 5.1), the number of issues whose fixing commits changed the file, and the total number of changes. After the raw data is collected, it has to be labeled using the Pareto technique mentioned above before being fed to supervised ML algorithms.
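Applying the Pareto principle twice to obtain the low/medium/high split can be sketched as follows; the issue counts are hypothetical, and the 80/16/4 proportions match the segmentation described above:

```python
# Sketch of the double-Pareto (80/16/4) labeling of per-file issue counts
# (hypothetical data). The data points of a system are sorted by issue
# count; the first 80% become "low", the next 16% "medium", the last 4% "high".

def pareto_labels(counts):
    order = sorted(range(len(counts)), key=lambda i: counts[i])
    n = len(counts)
    low_end = int(n * 0.80)              # first Pareto split: 80/20
    med_end = low_end + int(n * 0.16)    # second split of the tail: 16/4
    labels = [None] * n
    for rank, i in enumerate(order):
        if rank < low_end:
            labels[i] = "low"
        elif rank < med_end:
            labels[i] = "medium"
        else:
            labels[i] = "high"
    return labels

counts = [1, 0, 2, 1, 0, 3, 1, 2, 9, 25]   # ten hypothetical data points
labels = pareto_labels(counts)
# The 8 smallest counts become "low", the next one (9) "medium", and the
# largest (25) "high".
```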
Specifically, to determine the level of issue-proneness of a source file in a software version, the number of issues related to that source file is first collected. This is considered one data point. We collect data points for all files in all versions of a software system, and then sort the dataset by the numbers of issues, from low to high. The first 80% of data points are marked with "low" labels. Finally, the next 16% (80% of 20%) and 4% (20% of 20%) of data points are marked with "medium" and "high" labels, respectively. Similarly, to determine the change-proneness of a source file in a software version, we count the number of commits related to that file and repeat the labeling process above.

Table 6.1 shows a few data samples from our datasets after labeling. All eleven architectural smells from the second empirical study of this dissertation are used as the input features of this third study. The examples of input features in Table 6.1 are the CO (Concern Overload), SPF (Scattered Parasitic Functionality), LO (Link Overload), and DC (Dependency Cycle) smells. The output features, i.e., labels, are the levels of issue- and change-proneness. The two leftmost columns show the versions and the filenames of the data points. The next eleven columns are binary features that indicate whether or not the files have a specific smell: "1" means the file has the smell, and "0" means it does not. The two rightmost columns indicate the issue- and change-proneness of the files. For example, in version 0.20.0 of Hadoop, DFSClient.java has three smells: Scattered Parasitic Functionality, Link Overload, and Dependency Cycle. The file's issue-proneness is high, and its change-proneness is low.

Table 6.1: Data samples of Hadoop

  Version   Filename                        CO   SPF   LO   DC   ...   Issue-proneness   Change-proneness
  0.20.0    hadoop/dfs/DFSClient.java        0    1     1    1   ...   high              low
  0.20.0    hadoop/mapred/JobTracker.java    1    0     1    0   ...   medium            medium
  0.20.0    hadoop/tools/Logalyzer.java      0    0     0    0   ...   low               low
  ...       ...                             ...  ...   ...  ...  ...   ...               ...
6.1.2 Balancing Data

Because of the data's distribution and the labeling approach, the datasets in this third study are mostly unbalanced. Recall from Section 6.1.1 that the "low":"medium":"high" ratio of our datasets is 80:16:4 (i.e., 20:4:1). If a dataset with that ratio is used to train a prediction model, the model will tend to predict "low" every time, with an 80% chance of being correct. For this reason, it is very important to balance these datasets to ensure that weighted metrics are not biased by less (or more) frequent labels. This dissertation uses SMOTE [31] to balance our datasets. The details of SMOTE have been described in Section 2.6. For our specific problem, SMOTE is used to oversample "medium" by a factor of 5 and "high" by a factor of 20. After oversampling, the dataset is balanced and its "low":"medium":"high" ratio is 1:1:1.

6.2 Research Question and Hypotheses

The two empirical studies of software architectural change and decay in this dissertation (Section 4 and Section 5) have answered some key research questions in the software architecture community and confirmed the visible impact of architectural decay on software systems' implementation. The decay's impact reveals itself in the form of correlations with systems' issue-proneness and change-proneness. This is the cornerstone for exploring further research questions. Specifically, that finding has provided the intuition for the third study in this dissertation, which attempts to create an architecture-based approach to predicting the decay's impact on a system's implementation. The following hypothesis about the approach has been developed.

Hypothesis: It is possible to construct accurate models to predict the impact of architectural smells on systems' implementation.

To prove this hypothesis, this study particularly focuses on the predictability of the issue-proneness and change-proneness of a system based on its architectural smells.
This decision is based on the results of this dissertation's second empirical study, which showed significant correlations between architectural smells and those two properties. The two following research questions have been defined accordingly.

RQ1. To what extent can the architectural smells detected in a system help to predict the issue-proneness and change-proneness of that system at a given point in time?

The training data used to build the prediction models of a system is collected from different versions of that system during its lifetime. Therefore, if these models can yield high accuracy in predicting issue- and change-proneness, this indicates that architectural smells have consistent impacts on those two properties throughout the system's lifecycle. This is very important because it ensures that the impact of architectural smells is not related to other factors, such as the system's size, which may change during the system's evolution. In addition, a highly accurate prediction model will be useful for developers to foresee the future issue-proneness and change-proneness of newly smell-affected parts of the system. The prediction models will also be useful to the system's maintainers in deciding when and where to refactor the system. For example, the maintainers can use the models to estimate the reduction in issue- and change-proneness if they remove some smell instances from the system's implementation.

RQ2. To what extent do unrelated software systems tend to share properties with respect to issue- and change-proneness?

This research question aims to determine whether architectural smells have similar impacts on the implementations of different systems. Specifically, we would like to see if the issue- and change-proneness of a system can be accurately predicted by a "general-purpose" model trained on the datasets of other software systems.
If this hypothesis is confirmed, architectural smell-based models can be reused by developers to predict the issue- and change-proneness of new software systems in the early stages of their development, before sufficiently large numbers of system versions become available. Moreover, we would like to see how different combinations of systems affect the accuracy of the general-purpose models.

To answer these two research questions, we need to build different prediction models based on the data regarding architectural smells and then determine how accurate these models are. First, we collect the data regarding architectural decay of the subject systems as well as their issue- and change-proneness. Then we apply different machine learning techniques and use accuracy metrics to evaluate their effectiveness in building prediction models in the context of this study. The models' accuracy will be measured under different architectural views to see how different architectural recovery techniques affect the prediction models. Similar to the first two empirical studies in this dissertation, this study uses three architectural recovery methods: ACDC [109], ARC [52], and PKG [68].

To evaluate the accuracy of prediction models, we use three widely accepted metrics: precision, recall, and f-score [95]. Precision is the fraction of correctly predicted labels over all predicted labels. Recall is the fraction of correctly predicted labels over all actual labels. Finally, f-score is the harmonic mean of precision and recall to represent a test's accuracy.

We use two different approaches to obtain evaluation results. The first approach uses two independent datasets: a training set and a test set. The second approach uses a single dataset with a cross-validation setup. For the second approach, this study uses 10-fold cross-validation, where the dataset is randomly divided into ten equal-sized subsets.
Then we sequentially select one subset and test it against the prediction model built by the other nine subsets. The final result is the mean of the ten tests' results. To facilitate the whole process, this study uses ARCADE (Section 2.5), our framework for studying software architecture, to collect raw data, and WEKA [89], a well-known machine learning framework, to pre-process data, build prediction models, and evaluate the models' accuracy.

6.3 Subject Systems

In order to answer the above research questions, we use the data collected from the subject systems in the second empirical study (Section 5) of this dissertation. In addition, we extend the list by adding three more Apache subject systems. Table 6.2 shows the list of the ten subject systems. One system, Continuum, was excluded from this study because its small number of samples is not appropriate for building prediction models. The data that this study uses include architectural smells detected in recovered architectures, implementation issues collected from the Jira issue repository [4], and code commits extracted from GitHub [11].

Table 6.2: Subject systems used in building prediction models

System     Domain                 # Versions  # Issues  Avg. LOC
Camel      Integration F-work     78          9665      1.13M
CXF        Service F-work         120         6371      915K
Hadoop     Data Proc. F-work      63          9381      1.96M
Ignite     In-memory F-work       17          3410      1.40M
Nutch      Web Crawler            21          1928      118K
OpenJPA    Java Persist.          20          1937      511K
Pig        Data Analysis F-work   16          3465      358K
Struts2    Web App F-work         36          4207      379K
Wicket     Web App F-work         72          6098      332K
ZooKeeper  Config. Manag. F-work  23          1390      144K

6.4 Results

For each research question, the method employed in validating it and the associated findings will be discussed in the following sections.

6.4.1 RQ1. System-specific Prediction

RQ1: To what extent can the architectural smells detected in a system help to predict the issue-proneness and change-proneness of that system at a given point in time?
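Before turning to the results, the evaluation setup described in Section 6.2 (per-label precision, recall, and f-score under 10-fold cross-validation) can be sketched in plain Python. This is an illustrative sketch only, since the study uses WEKA; the `majority_learner` baseline and the toy dataset are hypothetical.

```python
import random
from collections import Counter

def prf(y_true, y_pred, label):
    """Per-label precision, recall and f-score, as defined in Section 6.2."""
    tp = sum(t == p == label for t, p in zip(y_true, y_pred))
    predicted = sum(p == label for p in y_pred)
    actual = sum(t == label for t in y_true)
    precision = tp / predicted if predicted else 0.0
    recall = tp / actual if actual else 0.0
    f = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f

def ten_fold_cv(samples, train_and_predict, seed=0):
    """Shuffle, split into ten folds, hold each fold out in turn, and
    average the per-fold f-scores for the "low" label."""
    rng = random.Random(seed)
    data = samples[:]
    rng.shuffle(data)
    folds = [data[i::10] for i in range(10)]
    f_scores = []
    for i, test in enumerate(folds):
        train = [s for j, fold in enumerate(folds) if j != i for s in fold]
        y_true = [label for _, label in test]
        y_pred = train_and_predict(train, [feats for feats, _ in test])
        f_scores.append(prf(y_true, y_pred, "low")[2])
    return sum(f_scores) / len(f_scores)

# Hypothetical learner: always predict the training set's majority label.
def majority_learner(train, test_feats):
    labels = [label for _, label in train]
    majority = Counter(labels).most_common(1)[0][0]
    return [majority] * len(test_feats)

# Toy dataset: one binary smell feature, 80% "low" / 20% "medium" labels.
data = [((i % 2,), "low" if i % 5 else "medium") for i in range(100)]
print(round(ten_fold_cv(data, majority_learner), 3))
```

A real learner (e.g., WEKA's decision table) simply replaces `majority_learner`; the fold-splitting and scoring machinery stays the same.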
Table 6.3: Predicting issue-proneness using Decision Table
(P = precision, R = recall, F = f-score)

System     ACDC (P / R / F)         ARC (P / R / F)          PKG (P / R / F)
Camel      69.9% / 68.4% / 68.9%    70.8% / 67.0% / 67.5%    68.2% / 62.8% / 63.4%
CXF        78.0% / 76.7% / 77.1%    68.9% / 68.3% / 68.3%    64.7% / 63.8% / 63.8%
Hadoop     81.2% / 80.1% / 80.3%    76.6% / 76.6% / 75.4%    72.8% / 73.4% / 72.1%
Ignite     78.9% / 78.1% / 78.3%    78.9% / 79.1% / 79.0%    70.4% / 71.0% / 70.5%
Nutch      80.8% / 71.6% / 70.9%    82.5% / 82.7% / 82.3%    68.3% / 52.1% / 43.7%
OpenJPA    71.4% / 68.3% / 70.8%    74.5% / 73.2% / 72.6%    69.2% / 67.9% / 67.3%
Pig        71.7% / 69.1% / 64.7%    71.3% / 71.1% / 70.0%    68.6% / 69.5% / 67.1%
Struts2    89.2% / 89.0% / 89.0%    95.0% / 94.8% / 94.8%    79.1% / 78.3% / 78.4%
Wicket     69.2% / 70.1% / 68.8%    76.7% / 77.1% / 76.5%    63.7% / 65.4% / 59.6%
ZooKeeper  72.0% / 72.6% / 71.9%    70.8% / 69.2% / 65.3%    68.7% / 69.4% / 68.7%
Average    76.2% / 74.4% / 74.1%    76.6% / 75.9% / 75.2%    69.4% / 67.4% / 65.5%

In our prediction problem, all the features are binary (recall Table 6.1). A feature indicates whether or not a file has an architectural smell. For this reason, decision-based techniques are likely to yield better results than others. Our observation of the evaluation metrics yielded by different classification techniques, such as decision table [62], decision tree [96], logistic regression [70], and naive Bayes [61], also confirms this intuition. This section will only discuss the obtained results of the decision-table-based models, which generally have the best accuracy among the aforementioned prediction techniques.

Table 6.3 shows the precisions, recalls, and f-scores of the decision table models for predicting the issue-proneness of the ten subject systems. Those values are computed using the 10-fold cross-validation setup [63]. The left column shows the systems' names and the last row shows the average values across all the systems. For each system, we built three different prediction models based on three sets of architectural smells, which were detected in three architectural views: ACDC, ARC, and PKG.
In total, 30 prediction models based on the decision table technique were created and evaluated. In general, the prediction models of PKG yield the lowest accuracy. The models of ACDC and ARC exhibit performance that is up to 15% better than PKG. Across all the systems, the prediction models of ACDC achieve a precision of at least 69.2%, a recall of at least 68.4%, and an f-score of at least 68.8%. The corresponding values of the prediction models of ARC are 68.9%, 67.0%, and 67.5%. Notably, the prediction model of Struts2 under the ARC view achieves approximately 95.0% in all three metrics. As a result, ARC has the highest average accuracy, which is 76.6% in precision, 75.9% in recall, and 75.2% in f-score. The corresponding values in ACDC are slightly lower: 76.2% in precision, 74.4% in recall, and 74.1% in f-score.

These results confirm that architectural smell-based models can accurately predict the issue-proneness of a system. In other words, architectural smells have a consistent impact on a system's implementation with respect to issue-proneness over the system's lifetime. This finding will provide software maintainers with a powerful indicator of an implementation's health. It urges the maintainers to pay more attention to architectural smells existing in their systems. The maintainers can use the architectural smell-based prediction models to foresee future problems as well as to devise refactoring plans.
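The intuition that decision-based learners suit binary smell features can be illustrated with a minimal, single-feature decision rule. This is a hypothetical sketch (the study itself uses WEKA's decision table implementation, and the `rows` data below are made up): each row pairs a vector of smell-presence bits with an issue-proneness label, and the rule picks the feature whose 0/1 split best separates the labels.

```python
from collections import Counter

def best_single_feature_rule(rows):
    """Pick the binary smell feature whose 0/1 split best separates the
    labels -- the simplest form of decision-based learning."""
    n_features = len(rows[0][0])
    best = None
    for f in range(n_features):
        correct = 0
        for bit in (0, 1):
            subset = [label for feats, label in rows if feats[f] == bit]
            if subset:
                # Predict the majority label on each side of the split.
                correct += Counter(subset).most_common(1)[0][1]
        if best is None or correct > best[1]:
            best = (f, correct)
    return best   # (feature index, training samples classified correctly)

# Hypothetical rows: (smell-presence bits, issue-proneness label).
rows = [((1, 0), "high"), ((1, 1), "high"), ((0, 0), "low"),
        ((0, 1), "low"), ((1, 0), "high"), ((0, 0), "low")]
print(best_single_feature_rule(rows))   # (0, 6): feature 0 separates perfectly
```

A decision table generalizes this idea by selecting a subset of features and memorizing the majority label for each observed combination of their values.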
Table 6.4: Predicting change-proneness using Decision Table
(P = precision, R = recall, F = f-score)

System     ACDC (P / R / F)         ARC (P / R / F)          PKG (P / R / F)
Camel      69.9% / 63.4% / 63.3%    68.0% / 67.1% / 63.2%    60.3% / 61.0% / 61.9%
CXF        73.7% / 70.8% / 78.3%    69.7% / 63.4% / 64.1%    60.8% / 63.4% / 62.8%
Hadoop     78.1% / 73.2% / 75.6%    74.9% / 74.8% / 73.7%    67.4% / 70.0% / 68.2%
Ignite     77.5% / 76.1% / 76.5%    75.8% / 76.1% / 76.0%    68.7% / 69.1% / 69.0%
Nutch      73.1% / 66.8% / 67.9%    76.3% / 78.0% / 82.0%    62.2% / 46.1% / 44.7%
OpenJPA    78.3% / 77.7% / 73.0%    74.3% / 70.0% / 66.5%    68.2% / 62.1% / 64.4%
Pig        70.1% / 67.4% / 68.9%    69.6% / 70.2% / 69.9%    65.9% / 66.5% / 66.2%
Struts2    89.3% / 85.8% / 83.1%    87.8% / 96.7% / 96.0%    71.2% / 73.7% / 80.2%
Wicket     66.6% / 65.3% / 65.9%    72.1% / 71.8% / 73.3%    62.7% / 59.0% / 56.4%
ZooKeeper  69.9% / 69.6% / 69.8%    67.8% / 67.2% / 67.3%    65.5% / 64.4% / 65.0%
Average    74.7% / 71.6% / 72.2%    73.6% / 73.5% / 73.2%    65.3% / 63.5% / 63.9%

The relatively poor performance of PKG in answering RQ1 is in line with one finding that emerged from our first empirical study on architectural changes (recall Chapter 4). That finding showed that PKG is not as useful to software architects as ACDC and ARC for understanding the actual underlying architectural changes. In this study, PKG again shows that it is not as useful as ACDC and ARC in correctly capturing the impact of the underlying architectural smells. This can be explained by the fact that PKG is a simple architecture recovery technique that only depends on the package structure of subject systems. On the other hand, ACDC and ARC perform clustering using sophisticated algorithms based on the systems' dependencies and concerns. Recall the categorization of architectural smells in Chapter 3: two categories out of the four are dependency-based and concern-based smells. This observation suggests future work examining whether each smell should be detected under a specific architectural view in order to be more accurate.
As with issue-proneness, we used the same approach to evaluate the accuracy of 30 architectural smell-based models for predicting change-proneness. Table 6.4 shows the accuracy of models built using the decision table technique. The models of PKG again have the lowest accuracy. In some systems, such as CXF, Nutch, and Struts2, the evaluation values in PKG are 10-20% lower than the corresponding values in the other two views. This is similar to the result for issue-proneness. In the ACDC view, the average precision is 74.7%, the average recall is 71.6%, and the average f-score is 72.2%. The corresponding numbers in the ARC view are 73.6%, 73.5%, and 73.2%. These results again are promising, and they prove that architectural smell-based models can accurately predict the change-proneness of a system.

In summary, we can use the historical data of a system regarding its architectural smells, issues, and changes to develop models that can predict the issue- and change-proneness of that system with high accuracy. This result indicates that architectural smells have a consistent impact on systems' implementations. Our architecture-based prediction approach is useful for maintainers to foresee likely future problems in newly smell-impacted parts of the system. The approach could also help in creating maintenance plans in order to effectively reduce the system's issue- and change-proneness. Lastly, ACDC and ARC outperform PKG in predicting both issue- and change-proneness. This again emphasizes the importance of architectural recovery techniques in helping developers understand the underlying architecture precisely.

6.4.2 RQ2. Generic Prediction

RQ2: To what extent do unrelated software systems tend to share properties with respect to issue- and change-proneness?

The results of the RQ1-related experiments show that architectural smells have consistent impacts on the issue- and change-proneness of a software system during its lifetime.
In that sense, RQ2 can be considered an extension of RQ1, in which we would like to see if architectural smells have consistent impacts across unrelated software systems. Specifically, we would like to see if the issue- and change-proneness of a system can be accurately predicted by models trained on data from other, unrelated software systems.

To answer this research question, instead of using 10-fold cross-validation, we sequentially select a system from the ten subject systems as the test system and use its dataset as the test set. The training set is created by combining the datasets of the other systems. To obtain comprehensive results, we conducted different experiments by combining four to nine systems (excluding the test system) to create training datasets. If we select the other nine systems, there is only one combination. However, if we select eight out of the nine other systems, there are nine different combinations. We tried all nine combinations and computed the means of the precisions and recalls. Similarly, for combining seven out of the nine systems, we tried all thirty-six possible combinations and computed the mean and standard deviation of the precisions and recalls. Note that the datasets of different subject systems have different sizes; hence, we have to resample those datasets to the same size before combining them. Table 6.5 shows the total numbers of combinations for each case.

Table 6.5: Numbers of combinations of different experiment setups

Prediction Setup   All 10  9 Others  8 Others  7 Others  6 Others  5 Others  4 Others
# of Combinations  1       1         9         36        84        126       126

Tables 6.6, 6.7, 6.8, 6.9, 6.10, and 6.11 summarize the precision and recall values of all RQ2-related experiments with regard to predicting issue-proneness under ACDC, ARC, and PKG, respectively. The left sides of these tables show the list of systems.
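The combination counts in Table 6.5 follow directly from the binomial coefficient C(9, k), since k of the 9 non-test systems are chosen. A quick illustrative check (the choice of Camel as the test system is arbitrary):

```python
from itertools import combinations

systems = ["Camel", "CXF", "Hadoop", "Ignite", "Nutch",
           "OpenJPA", "Pig", "Struts2", "Wicket", "ZooKeeper"]
others = [s for s in systems if s != "Camel"]   # the 9 non-test systems

# Number of distinct training sets built from k of the 9 remaining systems.
counts = {k: len(list(combinations(others, k))) for k in range(4, 10)}
print(counts)  # {4: 126, 5: 126, 6: 84, 7: 36, 8: 9, 9: 1}
```

These values match the "4 Others" through "9 Others" columns of Table 6.5; the "All 10" setup is the single combination that also includes the test system.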
The precisions and recalls in eight different cases are presented: 10-fold cross-validation on the test set (the "10-fold" column), models trained on all ten datasets including the test set (the "All 10" column), and models trained on four to nine other systems' datasets (the columns from "4 Others" to "9 Others"). Note that the columns from "4 Others" to "8 Others" contain the mean values of the precision and recall of the models trained on all possible combinations of systems. In summary, besides the 30 issue-proneness models from the RQ1 experiments, we built and evaluated 3830 more issue-proneness models to answer RQ2.

We found several consistent trends across all three architectural views. First, a prediction model built by combining the datasets of different software systems, even if the test system is included, has lower precision and recall than the model built for that specific test system. In all six tables, from 6.6 to 6.11, the "10-fold" columns have the highest precision and recall values. This result can be explained by the intuition that adding datasets of new systems creates a more general-purpose model, but also adds noise and reduces the model's capability to predict the properties of a specific system. For this reason, if the dataset of a system is available, its prediction models should be trained only on its own dataset.

Second, the prediction models trained on the datasets of six systems (excluding the test system) yield relatively high precisions and recalls. They are close to their corresponding values in the "All 10" columns, and lower than the corresponding values in the "10-fold" columns by about 10% or less. For example, for the models built from 6 systems under the ACDC view (Tables 6.6 and 6.7), the precisions range from 55.1% to 69.9% and the recalls range from 50.8% to 68.9%. Similarly, the prediction models trained on the datasets of 5 systems also achieve good results.
Notably, in the cases of Nutch and Pig, the results in "5 Others" are equal to or slightly higher than the corresponding values in "6 Others". Furthermore, the precision and recall values in the "7 Others", "8 Others", and "9 Others" columns decrease slowly in most of the systems. To clearly see the trend, Figures 6.1 and 6.2 show the precision and recall of the different combinations in chart form. In most of the cases, the accuracy of the prediction models is maximized when combining 6 systems, and it decreases when adding more systems. This can be explained by the fact that adding more systems can make a prediction model more complete. Adding too many systems from different domains, however, makes the model more generic and reduces its accuracy for a specific system.

Table 6.6: Predicting issue-proneness - precision under ACDC view

System     10-fold  All 10  9 Others  8 Others  7 Others  6 Others  5 Others  4 Others
Camel      69.9%    64.8%   53.6%     54.8%     55.5%     55.6%     54.8%     46.7%
CXF        78.0%    71.4%   66.4%     66.7%     67.0%     67.0%     58.3%     48.0%
Hadoop     81.2%    71.1%   62.8%     63.4%     63.6%     63.7%     60.3%     46.2%
Ignite     78.9%    73.9%   60.2%     63.4%     66.3%     68.8%     63.9%     46.3%
Nutch      80.8%    74.9%   59.6%     61.7%     63.9%     66.0%     68.1%     48.6%
OpenJPA    71.4%    68.8%   63.9%     62.7%     62.0%     61.8%     60.7%     45.4%
Pig        71.7%    66.8%   61.4%     62.5%     63.9%     65.4%     65.8%     51.7%
Struts2    89.2%    77.1%   69.1%     69.2%     69.5%     69.9%     59.7%     46.3%
Wicket     69.2%    66.7%   55.0%     55.2%     55.1%     55.1%     53.2%     47.8%
ZooKeeper  72.0%    65.4%   56.0%     57.6%     59.1%     60.8%     63.4%     52.4%

Table 6.7: Predicting issue-proneness - recall under ACDC view

System     10-fold  All 10  9 Others  8 Others  7 Others  6 Others  5 Others  4 Others
Camel      68.4%    57.5%   46.7%     48.9%     50.1%     50.8%     50.1%     45.0%
CXF        76.7%    71.3%   65.7%     65.9%     65.9%     65.9%     55.1%     46.2%
Hadoop     80.1%    69.2%   62.9%     63.4%     63.5%     63.5%     60.5%     45.8%
Ignite     78.1%    73.5%   59.3%     62.4%     65.3%     67.9%     63.4%     40.6%
Nutch      71.6%    68.8%   54.4%     55.8%     57.2%     58.6%     60.0%     41.9%
OpenJPA    68.3%    63.0%   57.3%     56.5%     56.1%     56.0%     54.6%     46.7%
Pig        69.1%    64.1%   58.8%     60.2%     61.5%     62.9%     59.8%     45.8%
Struts2    89.0%    76.4%   68.8%     68.6%     68.7%     68.9%     56.2%     45.4%
Wicket     70.1%    66.0%   54.9%     55.5%     55.7%     55.6%     52.4%     49.2%
ZooKeeper  72.6%    60.3%   56.9%     58.5%     60.0%     61.5%     60.2%     48.6%

The third observation is that the precisions and recalls of the prediction models are reduced significantly when the number of training systems is reduced to four. In most of the cases across all three architectural views, the precisions and recalls in the "4 Others" columns are lower than the corresponding values in the "5 Others" columns by at least 10%. This suggests that combining four systems is not enough to build a generic prediction model, if that model is intended to be used on unrelated systems.

Figure 6.1: Precision under ACDC view
Figure 6.2: Recall under ACDC view

Tables 6.12 and 6.13 show further details of the deviations of the precision and recall values, respectively, in our experiments with the combinations under the ACDC view. As mentioned at the beginning of this section, for each test system, there are many possible combinations of four to eight other systems' datasets.
Table 6.8: Predicting issue-proneness - precision under ARC view

System     10-fold  All 10  9 Others  8 Others  7 Others  6 Others  5 Others  4 Others
Camel      70.8%    64.9%   59.7%     60.0%     58.8%     55.2%     57.4%     51.1%
CXF        68.9%    55.2%   49.0%     48.1%     47.7%     49.1%     47.3%     47.3%
Hadoop     76.6%    67.6%   59.6%     59.9%     60.5%     59.5%     57.5%     42.2%
Ignite     78.9%    66.9%   62.3%     60.6%     59.3%     58.4%     57.6%     50.6%
Nutch      82.5%    64.6%   62.3%     58.2%     55.6%     51.0%     45.2%     40.6%
OpenJPA    74.5%    66.9%   63.9%     63.6%     63.6%     64.2%     48.6%     45.2%
Pig        71.3%    62.1%   61.7%     60.8%     60.2%     59.6%     54.2%     43.3%
Struts2    95.0%    76.1%   63.8%     63.3%     62.6%     53.3%     53.2%     51.1%
Wicket     76.7%    63.3%   62.0%     62.1%     61.5%     60.0%     59.8%     54.7%
ZooKeeper  70.8%    66.3%   50.4%     54.7%     56.4%     50.5%     50.4%     40.4%

Table 6.9: Predicting issue-proneness - recall under ARC view

System     10-fold  All 10  9 Others  8 Others  7 Others  6 Others  5 Others  4 Others
Camel      67.0%    59.4%   48.5%     49.7%     49.4%     48.7%     50.4%     45.2%
CXF        68.3%    62.3%   54.5%     53.9%     53.6%     54.4%     53.4%     45.3%
Hadoop     76.6%    67.4%   59.4%     59.9%     60.5%     59.9%     58.0%     44.6%
Ignite     79.1%    66.5%   61.6%     60.9%     60.6%     59.5%     59.4%     52.3%
Nutch      82.7%    58.1%   53.9%     53.1%     52.1%     51.1%     49.6%     48.5%
OpenJPA    73.2%    65.5%   62.0%     62.0%     62.2%     58.9%     48.4%     46.3%
Pig        71.1%    62.5%   61.1%     60.8%     60.6%     61.5%     50.1%     45.4%
Struts2    94.8%    75.7%   63.7%     63.3%     62.5%     53.6%     53.5%     51.3%
Wicket     77.1%    65.3%   63.6%     62.3%     61.2%     59.6%     59.0%     52.1%
ZooKeeper  69.2%    67.1%   56.4%     57.7%     58.9%     56.5%     55.9%     45.3%

Tables 6.12 and 6.13 show the standard deviations of both the precisions and recalls for each test system. The tables show that the standard deviations of both precisions and recalls decrease as the number of systems increases. The combinations of four systems have high standard deviations for those two metrics, which can sometimes exceed 10%. A low standard deviation indicates that the data values tend to be close to the mean, while a high standard deviation indicates that the values are spread out over a wider range around the mean.
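As a refresher on how the values behind Tables 6.12 and 6.13 are obtained, the mean and standard deviation over one test system's combination results can be computed as follows. The precision values below are made up purely for illustration:

```python
import statistics

# Hypothetical precisions from the 36 "7 Others" combinations for one system.
precisions = [0.62, 0.63, 0.61, 0.65, 0.64, 0.60] * 6   # 36 values

mean = statistics.mean(precisions)
sd = statistics.pstdev(precisions)   # population standard deviation
print(round(mean, 3), round(sd, 3))  # 0.625 0.017
```

The mean feeds the "7 Others" cell of the precision table, while the standard deviation feeds the corresponding cell of Table 6.12.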
Table 6.10: Predicting issue-proneness - precision under PKG view

System     10-fold  All 10  9 Others  8 Others  7 Others  6 Others  5 Others  4 Others
Camel      68.2%    59.5%   46.0%     47.0%     48.2%     45.8%     35.9%     38.6%
CXF        64.7%    62.7%   59.1%     60.3%     61.3%     62.2%     62.4%     49.7%
Hadoop     72.8%    61.8%   50.2%     51.8%     53.0%     54.0%     49.3%     38.2%
Ignite     70.4%    70.2%   62.6%     63.8%     64.2%     56.6%     50.2%     39.5%
Nutch      68.3%    66.9%   51.9%     54.4%     56.7%     58.7%     60.5%     38.1%
OpenJPA    69.2%    71.2%   53.1%     56.0%     58.7%     53.6%     56.9%     44.1%
Pig        68.6%    68.0%   53.6%     57.0%     60.0%     60.9%     57.2%     46.9%
Struts2    79.1%    92.4%   67.6%     71.7%     72.9%     64.7%     59.7%     41.7%
Wicket     63.7%    66.1%   60.2%     60.1%     60.2%     58.2%     54.2%     43.4%
ZooKeeper  68.7%    66.3%   44.0%     44.9%     45.8%     46.6%     52.5%     40.3%

Table 6.11: Predicting issue-proneness - recall under PKG view

System     10-fold  All 10  9 Others  8 Others  7 Others  6 Others  5 Others  4 Others
Camel      62.8%    50.9%   43.5%     43.7%     44.3%     43.5%     39.6%     39.8%
CXF        63.8%    60.0%   44.7%     45.9%     46.9%     47.2%     46.5%     45.3%
Hadoop     73.4%    61.5%   50.3%     51.2%     52.1%     52.4%     48.0%     40.8%
Ignite     71.0%    69.5%   62.3%     63.4%     63.9%     56.7%     49.0%     39.9%
Nutch      62.1%    54.1%   50.9%     51.9%     53.0%     55.6%     54.7%     44.8%
OpenJPA    67.9%    68.3%   39.2%     42.7%     45.7%     43.2%     46.5%     39.3%
Pig        69.5%    68.0%   44.5%     48.8%     52.5%     50.9%     47.4%     38.8%
Struts2    78.3%    92.0%   67.1%     71.4%     72.8%     65.3%     57.3%     42.9%
Wicket     65.4%    66.1%   58.9%     58.7%     58.4%     55.5%     50.2%     41.4%
ZooKeeper  69.4%    66.8%   42.7%     43.3%     44.2%     45.1%     49.1%     41.8%

These results indicate that the accuracy of the prediction models starts converging as we increase the number of systems in the training sets. The more systems we include, the lower the variation is; however, the accuracy is slightly reduced when combining too many systems. Based on our experiments, combining five to six systems generally yields the best results.
Table 6.12: Standard deviations of precisions - ACDC view

System     9 Others  8 Others  7 Others  6 Others  5 Others  4 Others
Camel      0.0%      5.5%      6.2%      6.5%      7.3%      12.5%
CXF        0.0%      0.6%      0.9%      1.1%      4.6%      8.0%
Hadoop     0.0%      1.2%      1.6%      1.6%      2.7%      8.2%
Ignite     0.0%      5.4%      6.6%      6.9%      6.3%      12.7%
Nutch      0.0%      6.5%      8.2%      9.1%      9.6%      12.1%
OpenJPA    0.0%      2.8%      3.2%      3.4%      3.0%      12.2%
Pig        0.0%      4.9%      6.6%      7.8%      7.6%      12.4%
Struts2    0.0%      1.4%      2.2%      2.7%      4.1%      12.0%
Wicket     0.0%      3.5%      5.6%      6.7%      9.9%      14.9%
ZooKeeper  0.0%      3.3%      4.8%      6.2%      9.2%      12.9%

Table 6.13: Standard deviations of recalls - ACDC view

System     9 Others  8 Others  7 Others  6 Others  5 Others  4 Others
Camel      0.0%      3.8%      4.0%      3.8%      3.8%      6.3%
CXF        0.0%      0.6%      0.7%      1.0%      3.0%      6.3%
Hadoop     0.0%      0.8%      1.3%      1.3%      2.5%      7.8%
Ignite     0.0%      5.9%      6.9%      7.1%      5.9%      8.9%
Nutch      0.0%      4.2%      5.3%      6.0%      6.3%      7.8%
OpenJPA    0.0%      2.9%      3.4%      3.7%      4.0%      6.6%
Pig        0.0%      4.9%      6.2%      6.9%      4.0%      4.9%
Struts2    0.0%      1.2%      1.6%      1.9%      2.7%      10.2%
Wicket     0.0%      2.9%      4.9%      6.2%      8.4%      10.5%
ZooKeeper  0.0%      4.5%      5.7%      6.5%      4.9%      6.2%

We also found similar trends in the experiments that attempt to predict the change-proneness of a system using unrelated systems' datasets.

In summary, the results of the RQ2-related experiments confirm that software systems tend to share properties with respect to issue- and change-proneness. The accuracy of generic models is lower than that of system-specific models; however, the gap is only about 10% or less. This allows developers to use generic models to predict the issue- and change-proneness of new software systems in the early stages of their development, before sufficiently large numbers of system versions become available. Furthermore, our empirical study's results suggest that using at least five systems can create a reliable generic model that can predict the issue- and change-proneness of new software systems with high accuracy.
Chapter 7

Visualizing Architectural Changes and Decay

7.1 Motivation

The output of ARCADE can help developers answer different research questions related to different empirical studies of architectural changes and decay; thus, this dissertation would like to promote the proposed approach and ARCADE to software architects and engineers. However, the output of ARCADE is comparatively hard to comprehend because it consists mostly of numbers and text files. If engineers want to investigate a specific change or smell, they have to search through several different output files. This reduces the understandability of the analyses' results and hampers maintenance activities such as tracking and refactoring changes and smells. Consequently, this makes the adoption of ARCADE and the dissertation's results challenging.

With the aid of an appropriate visualization, software architects and engineers (end-users) can improve their understanding of the architectural changes and decay of the systems on which they are working. Although some existing efforts are able to visualize recovered architectures, no effort exists that can effectively visualize (1) structural changes in software architectures, (2) changes of components' dependencies, (3) changes of components' interfaces, and (4) locations of decay (i.e., detected smells). Those visualizations have been created and included as a part of ARCADE.

7.2 Implementation of Architectural Visualizations

To aid engineers in understanding how architectural changes and decay happen along systems' lifespans, this dissertation developed different visualization views and included them as a part of ARCADE. Using interactive visualizations and color-coded labels will be significantly helpful in enabling engineers to adopt our research approach. To achieve this goal, the visualization views are developed based on D3.js [24], a well-known JavaScript library for manipulating documents based on data.
D3 allows binding arbitrary data to a Document Object Model (DOM), and then applying data-driven transformations to the document. It also provides many powerful visualization components.

7.3 Visualizing Architectural Changes and Decay of Android Framework

In this section, four different efforts that aim to visualize different aspects of architectural changes and decay are introduced. The Android Framework is used as an illustrative example. The two Android versions used in the visualizations are 6.0 and 7.0. The four visualizations are as follows:

1. Visualizing Component Changes: Figure 7.1 represents the architectures of the Android Framework, versions 7.0 and 6.0. In this figure, components are placed around a circle, and dependencies among components are shown as inner lines. To indicate changes, a color-coding scheme is used. On the right-hand side (version 6.0), the components that are removed in version 7.0 are visualized in red. On the left-hand side (version 7.0), the components that are added since version 6.0 are visualized in green. The yellow components are changed components. The unchanged components are visualized in gray.

2. Visualizing Dependency Changes: Figures 7.2 and 7.3 have a structure similar to Figure 7.1. However, when a user hovers over a component, the dependencies of that component are highlighted. The red lines represent incoming dependencies, the green lines represent outgoing dependencies, and the purple lines represent removed dependencies.

3. Visualizing Interface Changes: Figures 7.4 and 7.5 represent interface changes of components. Figure 7.4 is the overview of the architecture. Entities are visualized as small rectangles, and components as rectangles with labels. Entities are color-coded based on how their interfaces change. The blue entities have new interfaces. The red entities have had old interfaces removed. The yellow entities both have new interfaces and have had old ones removed. Green entities are unchanged entities.
By clicking on a component of interest, a user can see the detailed changes within that component, as shown in Figure 7.5.

4. Visualizing Smelly Components: Figures 7.6 and 7.7 represent smell-infected components. By clicking on different check boxes, components affected by the selected smells are highlighted in red. Users can select multiple smells, and components that have at least one selected smell are highlighted.

Figure 7.1: Visualizing component changes
Figure 7.2: Visualizing dependency changes (1)
Figure 7.3: Visualizing dependency changes (2)
Figure 7.4: Visualizing interface changes (1)
Figure 7.5: Visualizing interface changes (2)
Figure 7.6: Visualizing smelly components (1)
Figure 7.7: Visualizing smelly components (2)

Chapter 8

Related Work

In this chapter, the work related to this dissertation is surveyed and summarized. Three main categories of existing literature are covered: (1) architectural changes, (2) architectural decay, and (3) predicting implementation issues and code change.

8.1 Architectural Changes

Software evolution has been studied extensively at the code level, dating back several decades (e.g., Lehman's laws [71]). We will highlight a number of examples that have influenced our work. Godfrey and Tu [53] discovered that Linux's already large size did not prevent it from continuing to grow quickly. Eick et al. [41] found a reduction in modularity over the 15-year evolution of software for a telephone switching system. Murgia et al. [84] studied the evolution of Eclipse and NetBeans and found that 8%-20% of code-level entities contain about 80% of all bugs. While interesting, informative, and influential in our work, these studies do not examine the evolution of a software system's architecture.

A few studies [35, 87] have attempted to investigate architectural evolution. These studies are smaller in scope than our work in this dissertation.
Additionally, unlike ARCADE's use of structural and semantic architectural views, only one of these studies considers more than one architectural perspective; however, in that study, as well as several others, the chosen perspectives are arguably not architectural at all. Each study also differs from our work in other important ways. D'Ambros et al. [35] present an approach for studying software evolution that focuses on the storage and visualization of evolution information at the code and architectural levels. Their study utilizes a different set of architectural metrics than ours, specifically targeted at their visualizations. Nakamura et al. [87] present an architectural change metric based on structural distance, and apply it to 13 versions of four software systems. However, they define their metric on class dependency graphs, therefore measuring change at the level of a system's OO implementation rather than its architecture.

A group of studies by Bowers et al. [25, 26, 27] has treated implementation packages as architectural components in assessing the usefulness of metrics for balancing the number of components in a system and for measuring coupling between components. We considered both of these metrics for inclusion in ARCADE. We decided against including the balancing metric because our previous studies [48, 49] have indicated that ACDC and ARC obtain appropriate numbers of components in practice. We are currently studying the coupling metric and assessing its effectiveness in measuring recovered architectures. Inspired by these studies, we have implemented and included PKG in our study as well. However, our recent work has shown that software architects consider the package structure to be useful but, by itself, an inaccurate architectural proxy [49]. We thus consider the results of Bowers et al.'s studies to be more indicative of implementation change than of architectural change.
This is consistent with the widely referenced 4+1 architectural-view model [66], in which packages belong to a system's implementation view.

8.2 Architectural Decay

From the very beginning of the study of software architecture, researchers have recognized architectural decay as a commonly occurring phenomenon in long-lived systems [94]. There is a great deal of folklore and personal experience from researchers and practitioners regarding architectural smells and their negative impact on software systems. These were the motivation for many studies that have attempted to investigate architectural evolution and decay. There are two major approaches in those studies: the indirect approach uses code-level anomalies as the medium, and the direct approach uses recovered architectures.

The indirect approach emerged from the abundance of research on code-level anomalies, especially code smells. Previous studies by Fontana et al. [43], Macia et al. [74], and Oizumi et al. [90] have tried to detect architectural degradation by using code-level anomalies as a means of detecting high-level problems. Rather than focusing on code smells individually, those researchers have tried to identify groups of code smells that can point to design problems, i.e., design smells. However, the results of this approach are very limited. Even in the successful cases, design smells tend to focus on problems and abstractions that are not necessarily architectural (e.g., broken inheritance hierarchies or duplicate class implementations with different APIs). A previous study [76] confirmed this by failing to find any significant relationship between the detected code anomalies and architectural modularity in a system. In another work, Mo et al. [82] observed that certain architectural issues cannot be characterized by existing notions of code smells or anti-patterns. Fixing architectural anomalies requires a thorough understanding of the system's as-is architecture.
The direct approach is based on automated software architecture-recovery techniques, which have been around for over three decades [22, 60, 103, 104]. Two studies have examined architectural decay by using the reflexion method, a technique for comparing intended and recovered architectures. Brunet et al. [29] studied the evolution of architectural violations in 76 versions selected from four subject systems. Rosik et al. [101] conducted a case study using the reflexion method to assess whether architectural drift, i.e., unintended design decisions, occurred in their subject system and whether instances of drift remained unresolved. Wermelinger et al. [115] applied architectural-decay metrics to 53 versions of Eclipse. In their study, Eclipse exhibited generally decreasing cohesion, increasing coupling, and low instability. Sangwan et al. [102] applied architectural-decay metrics to 21 versions of Hibernate, and concluded that Hibernate tends to have low instability. Hassaine et al. [56] presented a recovery technique, which they used to study decay in six versions of three systems. van Gurp et al. [112] conducted two case studies of software systems to better understand the nature of architectural decay and how to prevent it. Although there are several studies of architectural decay using recovered architectures, they were conducted in a smaller scope than our work in this dissertation and were usually individual case studies. The accuracy of their employed recovery techniques is also unclear. Furthermore, none of these studies relied on architectural smells as concrete instances of architectural decay. Our study has explicitly targeted these issues of scope (via the number of studied systems and system versions), accuracy (by using multiple state-of-the-art recovery techniques that were evaluated independently), and the actually studied phenomenon (by relying on existing definitions of architectural smells).
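To make the direct approach concrete, a recovered architecture can be treated as a graph of components and inter-component dependencies, over which a smell detector runs mechanically. The following is a minimal, illustrative sketch (not this dissertation's actual detection algorithms) that finds a dependency cycle among recovered components via depth-first search; the component names and dependencies are hypothetical:

```python
def find_cycle(deps):
    """Return one dependency cycle (list of components), or None if acyclic.

    deps maps each component of a recovered architecture to the
    components it depends on.
    """
    visiting, visited = set(), set()
    path = []

    def dfs(node):
        visiting.add(node)
        path.append(node)
        for nxt in deps.get(node, ()):
            if nxt in visiting:  # back edge closes a cycle
                return path[path.index(nxt):] + [nxt]
            if nxt not in visited:
                found = dfs(nxt)
                if found:
                    return found
        visiting.discard(node)
        visited.add(node)
        path.pop()
        return None

    for comp in deps:
        if comp not in visited:
            found = dfs(comp)
            if found:
                return found
    return None

# A hypothetical recovered architecture with a dependency-based smell.
architecture = {
    "Parser": ["TypeChecker"],
    "TypeChecker": ["CodeGen"],
    "CodeGen": ["Parser"],  # closes the cycle
    "Utils": [],
}
print(find_cycle(architecture))
# → ['Parser', 'TypeChecker', 'CodeGen', 'Parser']
```

A real detector would operate on the output of a recovery technique such as ACDC or ARC rather than a hand-written dictionary, and would report all cycles rather than the first one found.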
The concept of architectural debt has been introduced recently by Xiao et al. [117]. The architectural smells in our study can also be considered a type of architectural debt. In contrast to Xiao's work, we define architectural smells in the flavor of code smells, i.e., "sniffable" or quick to spot [44]. The most useful smells are those that are easy to identify and that are understood even by inexperienced engineers. By looking for architectural smells in a system, it is easier for developers to understand the technical debt and propose refactoring solutions.

8.3 Predicting Implementation Issues and Code Change

Within the software maintenance community, predicting implementation issues and code change are well-known research problems because they directly benefit software maintenance. Efforts have been made to predict these properties by using different types of information and tools. Regarding implementation issues, in the early days, the main type of issue that researchers were interested in was defects (i.e., bugs). Li et al. [73] used object-oriented metrics as predictors of software maintenance effort. The work by Subramanyam et al. [105] also pointed out that the CK metrics [33] have significant implications for software defects. More recently, Nagappan et al. [86] found a good set of code-complexity measurements for determining failure-prone software entities. However, using a metrics suite does not provide a clear intuition about new defects or actionable tasks for preventing them. This approach also cannot prevent defects at higher levels of abstraction, such as architecture-level problems. Issue prediction based on bug-fixing history is also an established research area. Rahman et al. [98] devised an algorithm that ranks files by the number of times they have been changed in history. It helps developers find hot spots in the system that need their attention.
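The intuition behind such history-based ranking can be sketched very simply: count how often each file appears in the change history and sort files by that count. This is only an illustration of the underlying idea, not Rahman et al.'s actual algorithm, and the commit data below are invented:

```python
from collections import Counter

def rank_hot_spots(commits):
    """Rank files by how many commits touched them (most-changed first).

    commits: an iterable of lists, each listing the files changed in
    one commit. Returns (file, change_count) pairs, hottest first.
    """
    counts = Counter(f for commit in commits for f in commit)
    return counts.most_common()

# Hypothetical change history of a small system.
history = [
    ["core/Scheduler.java", "util/Log.java"],
    ["core/Scheduler.java"],
    ["net/Rpc.java", "core/Scheduler.java"],
    ["util/Log.java"],
]
print(rank_hot_spots(history))
# → [('core/Scheduler.java', 3), ('util/Log.java', 2), ('net/Rpc.java', 1)]
```

In practice the change history would be mined from a version-control log rather than written by hand, and more refined schemes weight recent changes more heavily.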
There are more sophisticated methods that combine historical information and software change impact analysis to increase the efficiency of the prediction [113, 58, 97]. However, these approaches do not explain high-level defects caused by architectural decay. Code change, in turn, has a tight connection with software systems' defects. Nagappan et al. [85] used code churn, a measurement of code change, to predict the defect density of software systems. Later, Hassan et al. [57] also used complexity metrics based on code change to predict systems' faults. Code change has also been used in many other research works [117, 36, 69, 68] to evaluate systems' maintainability. To predict code change, Romano et al. [99] proposed an intuitive approach using code metrics. Later, they proposed another approach using anti-patterns [100]. Xia et al. [116] proposed an approach to predict the change-proneness of a software system using co-change information of unrelated software systems. This approach is similar to RQ2 of our third empirical study; however, their approach yields relatively low F-scores. Malhotra et al. [77] attempted to use hybridized techniques to identify change-prone classes. However, the size of their empirical study is relatively small. Lastly, none of the above works consider the impact of architecture-level problems on change-proneness, as we did in this dissertation.

Chapter 9

Conclusions and Future Work

From its very inception, the study of software architecture has recognized architectural decay as a regularly occurring phenomenon in long-lived software systems. Despite decay's prevalence, there is a relative dearth of empirical data regarding the architectural changes that may lead to decay, the impact of decay on a system's implementation, and developers' understanding of those architectural changes and decay.
This dissertation takes a step toward addressing that scarcity by developing a categorization of decay's symptoms and conducting two empirical studies of software architectural evolution and decay spanning several hundred versions of a dozen open-source systems. Those studies reveal a number of important new findings regarding the frequency of architectural changes in software systems, the common points of departure in a system's architecture during maintenance and evolution, the difference between system-level and component-level architectural change, the suitability of a system's implementation-level structure as a proxy for its architecture, the strong correlations between decay and the system's implementation issues, and the predictability of the system's issue- and change-proneness. The datasets and tools developed in those studies are publicly available for research on software architecture. In summary, the contributions of this dissertation are as follows:

1. A classification of architectural smells and detection algorithms: This dissertation presented a classification framework for software architectural smells. The framework is intended to serve as a systematic reference that helps engineers relate smells to one another by considering the architectural elements they affect. The framework has classified and formally defined 11 smells across four categories, and provided algorithms that allow engineers to automatically detect these smells from a system's implementation using existing architectural recovery techniques and software analysis tools. These algorithms have been successfully applied to a large, and growing, corpus of real systems. This dissertation has also provided concrete examples of smell instances found in actual systems.

2. An Empirical Study of Architectural Changes: This dissertation presented the largest empirical study to date of architectural changes in long-lived software systems.
The study's scope is reflected in the number of subject systems (23), the total number of examined system versions (931), the total amount of analyzed code (140 MSLOC), the number of applied architecture recovery techniques (3) resulting in distinct architectural views produced for each system, the number of analyzed architectural models (2793, yielded by the three recovered views per system version), and the number of architectural change metrics (3) applied to each of the 2793 architectural models. This scope was enabled by ARCADE, a novel automated workbench for software architecture recovery and analysis. This study corroborated a number of widely held views about the timing, frequency, scope, and nature of architectural change. However, the study also resulted in several unexpected findings. The foremost is that a system's versioning scheme is not an accurate indicator of architectural change: major architectural changes may happen between minor system versions. Even more revealing was the observation that a system's architecture may be relatively unstable in the run-up to a release. We believe that enabling engineers to spot such instability would go a long way toward stemming the types of developer habits that result in unstable, buggy system releases. Finally, this study's results further corroborated the observation made in recent interactions with practicing software architects [49] that the gross organization of a system's implementation is, by itself, not an adequate representation of the system's architecture. This is especially magnified in cases where the overall implementation architecture remained very stable while, in fact, the system experienced significant growth. For this reason, analyzing a system's recovered conceptual architecture, both at the level of overall structure and at the level of individual components, is a much more appropriate way of assessing and understanding architectural change.
Another broad conclusion of this study points to the significance of the semantics-based architectural perspective. The study encountered multiple instances where a concern-based architectural view revealed important changes that remained concealed in the corresponding structure-based views. At the same time, a significant segment of the research on software architecture, and in particular the research on architecture recovery, has focused on system structure. Along with the results of a recent evaluation of recovery techniques [48], this suggests that there is both a need and an opportunity for investigating more effective approaches to architecture recovery.

3. An Empirical Study of Architectural Decay: This dissertation has described an empirical study aimed at providing initial answers to the long-standing research question of how architectural smells manifest themselves in a system's implementation. This empirical study is the largest study to date of architectural decay and its impact in long-lived software systems. The study analyzed several hundred versions of 8 well-known systems, totaling 376 MSLOC. Three different architecture-recovery techniques were applied to each system version, and the recovered architectures were analyzed for 11 different types of smells. On average, the study detected nearly 140 architectural smells per system version in each of the three architectural views. Lastly, the study examined the relationships between the collected smells and about 42,000 issues that were extracted from the issue repositories of the subject systems. This study has not only highlighted the visible manifestations of architectural decay, but also empirically confirmed an assertion that had previously been discussed prominently in the literature: architectural decay and its symptoms, architectural smells, are undesirable, and they can cause significant problems for a software system.
The empirical study's results have shown that implementation files involved in "smelly" parts of the system's architecture are statistically significantly more issue-prone and change-prone than the "clean" files. Furthermore, although not all of the architectural smells can directly lead to bugs, in many subject systems, more bugs appear in those parts of a system that are affected by architectural decay.

4. Speculative Analysis to Predict the Impact of Architectural Decay on the Implementation of Software Systems: Based on the correlations between architectural decay and the reported implementation issues, this dissertation has developed an architecture-based approach to accurately predict the issue- and change-proneness of a system's implementation. The approach has been validated on seven actual software systems, considering 11 different smells under three different architectural views. The study has resulted in several important findings regarding the predictability of architecture-based models. This study confirmed that architectural smells have consistent impacts on a system's implementation during the system's life cycle. Hence, the architectural smells detected in a system can help to predict the issue-proneness and change-proneness of that system at a given point in time with high accuracy. Architecture-based prediction is a useful approach for maintainers to foresee future problems in newly smell-impacted parts of the system. The approach also helps them create maintenance schedules in order to effectively reduce the system's issue- and change-proneness. Furthermore, software systems tend to share properties with respect to issue- and change-proneness. This allows developers to use generic models, created by using data from a set of software systems, to predict the issue- and change-proneness of new software systems in the early stages of their development, before sufficiently large numbers of system versions become available.
The accuracy of the generic models is lower than that of the specific models; however, the gap is only about 10% or less. Lastly, our empirical study suggested that using at least 5 systems can help to create reliable generic models with high accuracy to predict the issue- and change-proneness of new software systems.

5. Visualizations of architectural changes and decay: This dissertation has used visualization techniques, e.g., interactive visualizations and color-coded labels, to create 4 different visualizations of the obtained data about architectural changes and decay. These visualizations can serve as a useful medium to facilitate the adoption of the dissertation's approach.

9.1 Future Work

Although the empirical studies conducted in this dissertation do not consider the characteristics of individual architectural smell types, their results are still useful and promising. The impact of architectural decay, which reveals itself in the form of issue-proneness, change-proneness, and bug-proneness, is a foundation for exploring further research questions to draw a fine-grained distinction among individual architectural smells. Potential future work directions of this dissertation are as follows:

1. Should we analyze a smell category (or a smell) only under a specific architectural view? For example, detecting dependency-based smells under the ACDC view might yield more accurate results than it would under the ARC view because ACDC's recovery algorithm is based on call dependencies. In contrast, concern-based smells should be detected under the ARC view because ARC's algorithm is based on concerns found in classes.

2. Can unions or intersections of the architectural smells identified in architectures produced by multiple recovery techniques yield significantly better results than architectures produced by individual techniques? This is an extension of the previous question.

3.
Are there any differences between the issues affected by a single smell and those affected by multiple smells? It is possible that co-occurrence and mutual dependencies among smells have specific impacts on a system. This can only be uncovered empirically.

4. Would investing effort into creating new architectural recovery techniques necessarily improve engineers' ability to understand systems' architectures and identify other types of smells? There are other types of smells, such as connector-based or layer-based smells, which are not considered in this dissertation because they cannot be detected from the output of current architectural recovery techniques. Creating new types of recovery techniques could substantially increase the scope of future empirical studies.

5. Can architectural smells be used as a medium to determine when a system's architecture needs refactoring? Architectural smell-based models can predict the issue- and change-proneness of software systems with high accuracy. This finding suggests that architectural smells can be used to estimate the technical debt of a system. Software maintainers might use these models to foresee likely future problems in newly smell-impacted parts of the system as well as to create maintenance plans in order to effectively reduce the system's issue- and change-proneness.

Bibliography

[1] arcade:start [USC SoftArch Wiki]. http://softarch.usc.edu/wiki/doku.php?id=arcade:start, 2014.
[2] hadoop-releases. http://hadoop.apache.org/releases.html#News, 2014.
[3] lucene-wiki. http://en.wikipedia.org/wiki/Lucene, 2014.
[4] rcarz/jira-client GitHub. https://github.com/rcarz/jira-client, 2014.
[5] struts-wiki. http://en.wikipedia.org/wiki/Apache_Struts, 2014.
[6] svn-graph-branches. https://code.google.com/p/svn-graph-branches/, 2014.
[7] Apache jira. https://issues.apache.org/jira, 2017.
[8] Architectural smell detection algorithms. http://csse.usc.edu/TECHRPTS/2017/TR_decay_arch.pdf, 2017.
[9] Bug prediction at google.
http://google-engtools.blogspot.com/2011/12/bug-prediction-at-google.html, 2017.
[10] git-log. http://git-scm.com/docs/git-log, 2017.
[11] Github. https://github.com/, 2017.
[12] Heavy-tailed distribution. https://en.wikipedia.org/wiki/Heavy-tailed_distribution, 2017.
[13] Huawei. http://www.huawei.com/, 2017.
[14] Minitab. http://www.minitab.com/, 2017.
[15] Pareto principle. https://en.wikipedia.org/wiki/Pareto_principle, 2017.
[16] R-squared. http://blog.minitab.com/blog/adventures-in-statistics-2/regression-analysis-how-do-i-interpret-r-squared-and-assess-the-goodness-of-fit, 2017.
[17] What is an issue. https://confluence.atlassian.com/jira064/what-is-an-issue-720416138.html, 2017.
[18] Brent Agnew, Christine Hofmeister, and James Purtilo. Planning for change: A reconfiguration language for distributed systems. Distributed Systems Engineering, 1(5):313, 1994.
[19] Apache. Apache portable runtime versioning. http://apr.apache.org/versioning.html, 2014.
[20] Jay Arthur. Six Sigma Simplified: Quantum Improvement Made Easy. KnowWare International, 2001.
[21] Pooyan Behnamghader, Duc Minh Le, Joshua Garcia, Daniel Link, Arman Shahbazian, and Nenad Medvidovic. A large-scale study of architectural evolution in open-source software systems. Empirical Software Engineering, pages 1–48, 2016.
[22] L. A. Belady and C. J. Evangelisti. System partitioning and its measure. Journal of Systems and Software (JSS), 1981.
[23] Barry W. Boehm. Value-based software engineering: Overview and agenda. In Value-Based Software Engineering, pages 3–14. Springer, 2006.
[24] Mike Bostock. D3.js. https://d3js.org/, 2017.
[25] Eric Bouwers, José Pedro Correia, Arie van Deursen, and Joost Visser. Quantifying the analyzability of software architectures. In Software Architecture (WICSA), 2011 9th Working IEEE/IFIP Conference on, pages 83–92. IEEE, 2011.
[26] Eric Bouwers, Arie van Deursen, and Joost Visser. Evaluating usefulness of software metrics: an industrial experience report.
In ICSE, pages 921–930. IEEE Press, 2013.
[27] Eric Bouwers, Arie van Deursen, and Joost Visser. Dependency profiles for software architecture evaluations. In Software Maintenance (ICSM), 2011 27th IEEE International Conference on, pages 540–543. IEEE, 2011.
[28] I. T. Bowman, R. C. Holt, and N. V. Brewster. Linux as a case study: its extracted software architecture. In ICSE, 1999.
[29] Joao Brunet, Roberto Almeida Bittencourt, Dalton Serey, and Jorge Figueiredo. On the evolutionary nature of architectural violations. In Reverse Engineering (WCRE), 2012 19th Working Conference on. IEEE, 2012.
[30] Frank Buschmann, Kevin Henney, and Douglas C. Schmidt. Pattern-Oriented Software Architecture, On Patterns and Pattern Languages, volume 5. John Wiley & Sons, 2007.
[31] Nitesh V. Chawla, Kevin W. Bowyer, Lawrence O. Hall, and W. Philip Kegelmeyer. SMOTE: synthetic minority over-sampling technique. Journal of Artificial Intelligence Research, 16:321–357, 2002.
[32] S. R. Chidamber and C. F. Kemerer. A metrics suite for object oriented design. IEEE Transactions on Software Engineering, 20(6):476–493, June 1994.
[33] S. R. Chidamber and C. F. Kemerer. A metrics suite for object oriented design. IEEE Trans. Softw. Eng., 20(6):476–493, June 1994.
[34] Ward Cunningham. The WyCash portfolio management system. In Addendum to the Proceedings on Object-Oriented Programming Systems, Languages, and Applications (Addendum), OOPSLA '92, pages 29–30, New York, NY, USA, 1992. ACM.
[35] Marco D'Ambros, Harald Gall, Michele Lanza, and Martin Pinzger. Analysing software repositories to understand software evolution. Springer, 2008.
[36] Marco D'Ambros, Harald Gall, Michele Lanza, and Martin Pinzger. Analysing software repositories to understand software evolution. In Software Evolution, pages 37–67. Springer, 2008.
[37] Marco D'Ambros, Michele Lanza, and Romain Robbes. On the relationship between change coupling and software defects. In Reverse Engineering, 2009. WCRE'09.
16th Working Conference on, pages 135–144. IEEE, 2009.
[38] Edsger W. Dijkstra. On the role of scientific thought. In Selected Writings on Computing: A Personal Perspective, pages 60–66. Springer, 1982.
[39] Edsger Wybe Dijkstra. A Discipline of Programming, volume 1. Prentice-Hall, Englewood Cliffs, 1976.
[40] S. Ducasse and D. Pollet. Software architecture reconstruction: A process-oriented taxonomy. IEEE TSE, 2009.
[41] Stephen G. Eick, Todd L. Graves, Alan F. Karr, J. Steve Marron, and Audris Mockus. Does code decay? Assessing the evidence from change management data. IEEE TSE, 2001.
[42] Norman E. Fenton and Shari Lawrence Pfleeger. Software Metrics: A Rigorous and Practical Approach. PWS Publishing Co., Boston, MA, USA, 2nd edition, 1998.
[43] Francesca Arcelli Fontana, Vincenzo Ferme, and Marco Zanoni. Towards assessing software architecture quality by exploiting code smell relations. In Proceedings of the Second International Workshop on Software Architecture and Metrics, SAM '15, pages 1–7, Piscataway, NJ, USA, 2015. IEEE Press.
[44] M. Fowler. Refactoring: Improving the Design of Existing Code. Addison-Wesley Professional, 1999.
[45] Martin Fowler and Kendall Scott. UML Distilled: Applying the Standard Object Modeling Language. Addison-Wesley Longman Ltd., Essex, UK, 1997.
[46] S. G. Ganesh, Tushar Sharma, and Girish Suryanarayana. Towards a principle-based classification of structural design smells. Journal of Object Technology, 12(2):1–1, 2013.
[47] Joshua Garcia. A Unified Framework for Studying Architectural Decay of Software Systems. PhD thesis, University of Southern California, 2014.
[48] Joshua Garcia, Igor Ivkovic, and Nenad Medvidovic. A comparative analysis of software architecture recovery techniques. In Automated Software Engineering (ASE), 2013 IEEE/ACM 28th International Conference on, pages 486–496, 2013.
[49] Joshua Garcia, Ivo Krka, Chris Mattmann, and Nenad Medvidovic.
Obtaining ground-truth software architectures. ICSE, 2013.
[50] Joshua Garcia, Daniel Popescu, George Edwards, and Nenad Medvidovic. Toward a catalogue of architectural bad smells. In QoSA '09: Proc. 5th Int'l Conf. on Quality of Software Architectures, 2009.
[51] Joshua Garcia, Daniel Popescu, George Edwards, and Nenad Medvidovic. Identifying architectural bad smells. In 13th European Conference on Software Maintenance and Reengineering, 2009.
[52] Joshua Garcia, Daniel Popescu, Chris Mattmann, Nenad Medvidovic, and Yuanfang Cai. Enhancing architectural recovery using concerns. In ASE, 2011.
[53] Michael W. Godfrey and Qiang Tu. Evolution in open source software: A case study. In Software Maintenance, 2000. Proceedings. International Conference on. IEEE, 2000.
[54] Google. Guava. https://code.google.com/p/guava-libraries/, 2015.
[55] R. Harrison, S. J. Counsell, and R. V. Nithi. An evaluation of the MOOD set of object-oriented software metrics. IEEE Transactions on Software Engineering, 24(6):491–496, June 1998.
[56] Salima Hassaine, Y. Guéhéneuc, Sylvie Hamel, and Giuliano Antoniol. ADvISE: Architectural decay in software evolution. In Software Maintenance and Reengineering (CSMR), 2012 16th European Conference on. IEEE, 2012.
[57] Ahmed E. Hassan. Predicting faults using the complexity of code changes. In Proceedings of the 31st International Conference on Software Engineering, pages 78–88. IEEE Computer Society, 2009.
[58] Hideaki Hata, Osamu Mizuno, and Tohru Kikuno. Bug prediction based on fine-grained module histories. In Software Engineering (ICSE), 2012 34th International Conference on, pages 200–210. IEEE, 2012.
[59] Haibo He and Edwardo A. Garcia. Learning from imbalanced data. IEEE Transactions on Knowledge and Data Engineering, 21(9):1263–1284, 2009.
[60] David H. Hutchens and Victor R. Basili. System structure analysis: Clustering with data bindings. IEEE TSE, 1985.
[61] George H. John and Pat Langley.
Estimating continuous distributions in Bayesian classifiers. In Proceedings of the Eleventh Conference on Uncertainty in Artificial Intelligence, pages 338–345. Morgan Kaufmann Publishers Inc., 1995.
[62] Ron Kohavi. The power of decision tables. In European Conference on Machine Learning, pages 174–189. Springer, 1995.
[63] Ron Kohavi et al. A study of cross-validation and bootstrap for accuracy estimation and model selection. In IJCAI, volume 14, pages 1137–1145. Montreal, Canada, 1995.
[64] R. Koschke. What architects should know about reverse engineering and reengineering. In Working IEEE/IFIP Conference on Software Architecture (WICSA), 2005.
[65] R. Koschke. Architecture reconstruction. Software Engineering, 2009.
[66] Philippe B. Kruchten. The 4+1 view model of architecture. IEEE Software, 1995.
[67] Michele Lanza, Radu Marinescu, and Stéphane Ducasse. Object-Oriented Metrics in Practice. Springer-Verlag New York, Inc., Secaucus, NJ, USA, 2005.
[68] Duc Minh Le, Pooyan Behnamghader, Joshua Garcia, Daniel Link, Arman Shahbazian, and Nenad Medvidovic. An empirical study of architectural change in open-source software systems. In Proc. Mining Software Repositories, 2015.
[69] Duc Minh Le, Carlos Carrillo, Rafael Capilla, and Nenad Medvidovic. Relating architectural decay and sustainability of software systems. In 13th Working IEEE/IFIP Conference on Software Architecture (WICSA), 2016.
[70] Saskia Le Cessie and Johannes C. Van Houwelingen. Ridge estimators in logistic regression. Applied Statistics, pages 191–201, 1992.
[71] Meir M. Lehman. Programs, life cycles, and laws of software evolution. Proceedings of the IEEE, 1980.
[72] Charles E. Leiserson, Ronald L. Rivest, Clifford Stein, and Thomas H. Cormen. Introduction to Algorithms. The MIT Press, 2001.
[73] Wei Li and Sallie Henry. Object-oriented metrics that predict maintainability. Journal of Systems and Software, 23(2):111–122, 1993.
[74] I. Macia, R. Arcoverde, A. Garcia, C.
Chavez, and A. von Staa. On the relevance of code anomalies for identifying architecture degradation symptoms. In Software Maintenance and Reengineering (CSMR), 2012 16th European Conference on, pages 277–286, March 2012.
[75] Isela Macia, Joshua Garcia, Daniel Popescu, Alessandro Garcia, Nenad Medvidovic, and Arndt von Staa. Are automatically-detected code anomalies relevant to architectural modularity? An exploratory analysis of evolving systems. In Proceedings of the 11th Annual International Conference on Aspect-Oriented Software Development. ACM, 2012.
[76] Isela Macia, Joshua Garcia, Daniel Popescu, Alessandro Garcia, Nenad Medvidovic, and Arndt von Staa. Are automatically-detected code anomalies relevant to architectural modularity? An exploratory analysis of evolving systems. In Proceedings of the 11th Annual International Conference on Aspect-Oriented Software Development, AOSD '12, pages 167–178, New York, NY, USA, 2012. ACM.
[77] Ruchika Malhotra and Megha Khanna. An exploratory study for software change prediction in object-oriented systems using hybridized techniques. Automated Software Engineering, 24(3):673–717, 2017.
[78] Onaiza Maqbool and Haroon Babri. Hierarchical clustering for software architecture recovery. IEEE TSE, 2007.
[79] A. K. McCallum. MALLET: A machine learning for language toolkit. 2002.
[80] N. Medvidovic. ADLs and dynamic architecture changes. In Proceedings of the Second International Software Architecture Workshop (ISAW-2), 1996.
[81] T. Mens and T. Tourwé. A survey of software refactoring. IEEE TSE, January 2004.
[82] Ran Mo, J. Garcia, Yuanfang Cai, and N. Medvidovic. Mapping architectural decay instances to dependency models. In Managing Technical Debt (MTD), 2013 4th International Workshop on, pages 39–46, 2013.
[83] N. Moha, Y. G. Gueheneuc, L. Duchien, and A. F. Le Meur. DECOR: A method for the specification and detection of code and design smells. IEEE Transactions on Software Engineering, 36(1):20–36, January 2010.
[84] Alessandro Murgia, Giulio Concas, Sandro Pinna, Roberto Tonelli, and Ivana Turnu. Empirical study of software quality evolution in open source projects using agile practices. In Proceedings of the First International Symposium on Emerging Trends in Software Metrics 2009. Lulu.com, 2009.
[85] Nachiappan Nagappan and Thomas Ball. Use of relative code churn measures to predict system defect density. In Proceedings of the 27th International Conference on Software Engineering, pages 284–292. ACM, 2005.
[86] Nachiappan Nagappan, Thomas Ball, and Andreas Zeller. Mining metrics to predict component failures. In Proceedings of the 28th International Conference on Software Engineering, pages 452–461. ACM, 2006.
[87] Taiga Nakamura and Victor R. Basili. Metrics of software architecture changes based on structural distance. In Software Metrics, 2005. 11th IEEE International Symposium, pages 24–24. IEEE, 2005.
[88] Nasser M. Nasrabadi. Pattern recognition and machine learning. Journal of Electronic Imaging, 16(4):049901, 2007.
[89] The University of Waikato. Weka 3: Data mining software in Java. https://www.cs.waikato.ac.nz/ml/weka/, 2018.
[90] Willian Oizumi, Alessandro Garcia, Leonardo da Silva Sousa, Bruno Cafeo, and Yixue Zhao. Code anomalies flock together: Exploring code anomaly agglomerations for locating design problems. In Proceedings of the 38th International Conference on Software Engineering, ICSE '16, pages 440–451, New York, NY, USA, 2016. ACM.
[91] P. Oreizy, N. Medvidovic, and R. N. Taylor. Architecture-based runtime software evolution. In Proceedings of the 20th International Conference on Software Engineering (ICSE '98), 1998.
[92] R. Lyman Ott and Michael T. Longnecker. An Introduction to Statistical Methods and Data Analysis. Cengage Learning, 2008.
[93] D. L. Parnas. On the criteria to be used in decomposing systems into modules. Commun. ACM, 15(12):1053–1058, December 1972.
[94] Dewayne E. Perry and Alexander L. Wolf.
Foundations for the study of software architecture. ACM SIGSOFT SEN, 1992.
[95] David Martin Powers. Evaluation: from precision, recall and F-measure to ROC, informedness, markedness and correlation. 2011.
[96] J. Ross Quinlan. C4.5: Programs for Machine Learning. Elsevier, 2014.
[97] Foyzur Rahman and Premkumar Devanbu. How, and why, process metrics are better. In Software Engineering (ICSE), 2013 35th International Conference on, pages 432–441. IEEE, 2013.
[98] Foyzur Rahman, Daryl Posnett, Abram Hindle, Earl Barr, and Premkumar Devanbu. BugCache for inspections: Hit or miss? In Proceedings of the 19th ACM SIGSOFT Symposium and the 13th European Conference on Foundations of Software Engineering, ESEC/FSE '11, pages 322–331, New York, NY, USA, 2011. ACM.
[99] Daniele Romano and Martin Pinzger. Using source code metrics to predict change-prone Java interfaces. In Software Maintenance (ICSM), 2011 27th IEEE International Conference on, pages 303–312. IEEE, 2011.
[100] Daniele Romano, Paulius Raila, Martin Pinzger, and Foutse Khomh. Analyzing the impact of antipatterns on change-proneness using fine-grained source code changes. In Reverse Engineering (WCRE), 2012 19th Working Conference on, pages 437–446. IEEE, 2012.
[101] Jacek Rosik, Andrew Le Gear, Jim Buckley, Muhammad Ali Babar, and Dave Connolly. Assessing architectural drift in commercial software development: a case study. Software: Practice and Experience, 2011.
[102] Raghvinder S. Sangwan, Pamela Vercellone-Smith, and Colin J. Neill. Use of a multidimensional approach to study the evolution of software complexity. Innovations in Systems and Software Engineering, 2010.
[103] Robert W. Schwanke. An intelligent tool for re-engineering software modularity. In ICSE, 1991.
[104] Robert W. Schwanke and Stephen José Hanson. Using neural networks to modularize software. Machine Learning, 1994.
[105] R. Subramanyam and M. S. Krishnan.
Empirical analysis of ck metrics for object-oriented de- sign complexity: implications for software defects. IEEE Transactions on Software Engineering, 29(4):297{310, April 2003. [106] R. N. Taylor, N. Medvidovic, and E. M. Dashofy. Software Architecture: Foundations, Theory, and Practice. Wiley Publishing, 2009. [107] R.N. Taylor, N. Medvidovic, and E.M. Dashofy. Software Architecture: Foundations, Theory, and Practice. 2009. 128 [108] John W Tukey. Exploratory data analysis. 1977. [109] V. Tzerpos and R.C. Holt. ACDC: an algorithm for comprehension-driven clustering. In Working Conference on Reverse Engineering (WCRE), 2000. [110] Vassilios Tzerpos and Richard C Holt. Mojo: A distance metric for software clusterings. In Reverse Engineering, 1999. Proceedings. Sixth Working Conference on, pages 187{193. IEEE, 1999. [111] Arie Van Deursen, Christine Hofmeister, Rainer Koschke, Leon Moonen, and Claudio Riva. Symphony: View-driven software architecture reconstruction. In Software Architecture, 2004. WICSA 2004. Proceedings. Fourth Working IEEE/IFIP Conference on. IEEE, 2004. [112] Jilles van Gurp, Sjaak Brinkkemper, and Jan Bosch. Design preservation over subsequent releases of a software product: a case study of baan erp. Journal of Software Maintenance and Evolution: Research and Practice, 2005. [113] Shaowei Wang and David Lo. Version history, similar report, and structure: Putting them together for improved bug localization. In Proceedings of the 22nd International Conference on Program Comprehension, pages 53{63. ACM, 2014. [114] Zhihua Wen and Vassilios Tzerpos. An eectiveness measure for software clustering algorithms. In International Workshop on Program Comprehension (IWPC). IEEE, 2004. [115] Michel Wermelinger, Yijun Yu, Angela Lozano, and Andrea Capiluppi. Assessing architectural evolution: a case study. Empirical Software Engineering, 2011. [116] Xin Xia, David Lo, Shane McIntosh, Emad Shihab, and Ahmed E Hassan. Cross-project build co-change prediction. 
In Software Analysis, Evolution and Reengineering (SANER), 2015 IEEE 22nd International Conference on, pages 311{320. IEEE, 2015. [117] Lu Xiao, Yuanfang Cai, Rick Kazman, Ran Mo, and Qiong Feng. Identifying and quantifying architectural debt. In Proceedings of the 38th International Conference on Software Engineering, ICSE '16, pages 488{498, New York, NY, USA, 2016. ACM. [118] Hongyu Zhang. An investigation of the relationships between lines of code and defects. In Software Maintenance, 2009. ICSM 2009. IEEE International Conference on, pages 274{283. IEEE, 2009. 129
Abstract
Changes to a software system require understanding and, in many cases, updating its architecture. A system's architecture is the set of principal design decisions about the software system. Over time, a system's architecture is increasingly affected by a phenomenon called architectural decay, which is caused by careless or unintended addition, removal, and modification of architectural design decisions. These decisions deviate from the architects' well-considered intent, and result in systems whose implemented architectures differ significantly, sometimes fundamentally, from their originally designed architectures. In practice, software systems regularly exhibit increased architectural decay as they evolve. The architectural decay may not cause immediate system failures
Conceptually similar
A unified framework for studying architectural decay of software systems
Techniques for methodically exploring software development alternatives
A user-centric approach for improving a distributed software system's deployment architecture
Architecture and application of an autonomous robotic software engineering technology testbed (SETT)
A reference architecture for integrated self-adaptive software environments
Software quality understanding by analysis of abundant data (SQUAAD): towards better understanding of life cycle software qualities
Analysis of embedded software architecture with precedent dependent aperiodic tasks
Automated synthesis of domain-specific model interpreters
Design-time software quality modeling and analysis of distributed software-intensive systems
Software connectors for highly distributed and voluminous data-intensive systems
Assessing software maintainability in systems by leveraging fuzzy methods and linguistic analysis
Constraint-based program analysis for concurrent software
Detecting SQL antipatterns in mobile applications
Software architecture recovery using text classification -- recover and RELAX
Automated repair of presentation failures in Web applications using search-based techniques
Calculating architectural reliability via modeling and analysis
Detection, localization, and repair of internationalization presentation failures in web applications
Proactive detection of higher-order software design conflicts
Automatic detection and optimization of energy optimizable UIs in Android applications using program analysis
Reducing user-perceived latency in mobile applications via prefetching and caching
Asset Metadata
Creator
Le, Duc Minh (author)
Core Title
Architectural evolution and decay in software systems
School
Viterbi School of Engineering
Degree
Doctor of Philosophy
Degree Program
Computer Science
Publication Date
11/05/2018
Defense Date
07/20/2018
Publisher
University of Southern California (original), University of Southern California. Libraries (digital)
Tag
architectural decay, architectural smells, architectural technical debt, architecture evolution, empirical study, issue repository, OAI-PMH Harvest, open source, Software Architecture, software repository mining, Visualization
Format
application/pdf (imt)
Language
English
Contributor
Electronically uploaded by the author (provenance)
Advisor
Medvidovic, Nenad (committee chair), Boehm, Barry (committee member), Gupta, Sandeep (committee member), Halfond, William G. J. (committee member), Wang, Chao (committee member)
Creator Email
duclm.bk@gmail.com, ducmle@usc.edu
Permanent Link (DOI)
https://doi.org/10.25549/usctheses-c89-100976
Unique identifier
UC11675719
Identifier
etd-LeDucMinh-6927.pdf (filename), usctheses-c89-100976 (legacy record id)
Legacy Identifier
etd-LeDucMinh-6927.pdf
Dmrecord
100976
Document Type
Dissertation
Format
application/pdf (imt)
Rights
Le, Duc Minh
Type
texts
Source
University of Southern California (contributing entity), University of Southern California Dissertations and Theses (collection)
Access Conditions
The author retains rights to his/her dissertation, thesis or other graduate work according to U.S. copyright law. Electronic access is being provided by the USC Libraries in agreement with the a...
Repository Name
University of Southern California Digital Library
Repository Location
USC Digital Library, University of Southern California, University Park Campus MC 2810, 3434 South Grand Avenue, 2nd Floor, Los Angeles, California 90089-2810, USA