USING METRICS OF SCATTERING TO ASSESS SOFTWARE QUALITY
by
Gustavo Uanús Perez
A Dissertation Presented to the
FACULTY OF THE USC GRADUATE SCHOOL
UNIVERSITY OF SOUTHERN CALIFORNIA
In Partial Fulfillment of the
Requirements for the Degree
DOCTOR OF PHILOSOPHY
(COMPUTER SCIENCE)
May 2011
Copyright 2011 Gustavo Uanús Perez
Table of Contents
List of Tables
List of Figures
List of Equations
Abstract
Chapter 1. Introduction
1.1 Problem & Motivation
1.2 Proposed Solution
1.3 Performed Study
1.3.1 Research Context
Chapter 2. Related Work
Chapter 3. Code Scattering Model
3.1 Introduction
3.2 Code Implementation Process History
3.3 Code Implementation Model
3.3.1 Information Theory
3.3.2 Modeling Level of Scattering of Code Implementation
3.3.3 Extracting the Churn Events
3.3.4 Characterizing the Code Units
3.3.5 Defining the Periods of Evolution
3.3.6 Entropy Normalization
3.4 Evolving Level of Scattering
3.5 Analysis Framework
3.5.1 Framework Architecture
Chapter 4. Model Validation
4.1 Introduction
4.2 Empirical Study Design and Preparation
4.2.1 Data Set
4.2.2 Defects
4.2.3 Code Unit
4.2.4 Periods of Time
4.2.5 Data Extraction and Calculation
4.3 Model Validation
4.3.1 Preliminary Assessment
4.3.2 Shannon Entropy – Periods of Time
4.3.3 Shannon Entropy – Developers
4.3.4 Logistic Regression
4.3.5 Linear Regression
Chapter 5. Summary of Contribution and Future Research Directions
5.1 Introduction
5.2 Summary of Contributions
5.3 Future Research Directions
Bibliography
List of Tables
Table 1: Summary of projects used in study
Table 2: Number of periods of time per project
Table 3: Frequency of churn and defect
Table 4: Probability of defect based on the number of times code unit churned
Table 5: Number of periods of time per project
Table 6: Correlation analysis between level of scattering of churn events based on periods of time and defect measures
Table 7: Correlation analysis between entropy of churn events based on developers and defect measures
Table 8: Logistic regression model results
Table 9: R-statistic coefficient for generated models
Table 10: Model performance using testing set
Table 11: Detail of the statistical linear regression models for the studied projects
Table 12: Statistical model prediction error for each project
List of Figures
Figure 1: Necessary achievements to reach the goal of a superior process defined by Humphrey et al. from a quality point of view
Figure 2: Churn frequency of files distributed over project lifecycle
Figure 3: COCOMO II Modeling Approach used in the study. Model reproduced from (Boehm, et al., 2000)
Figure 4: Data extraction process
Figure 5: Lifecycle activities schedule distribution in a typical software factory
Figure 6: Typical revision structure and associated information of a file in a version control system
Figure 7: Patterns of change of three code units of a system distributed in ten periods of time
Figure 8: Patterns of change of four code units of a system distributed among the developers
Figure 9: Patterns of change of three code units of a system distributed in ten periods of time
Figure 10: Churn frequency of files distributed through developers
Figure 11: Scenario 1 of changes in combined dimensions
Figure 12: Scenario 2 of changes in combined dimensions
Figure 13: Cumulative and noncumulative entropy for Scenario 1
Figure 14: Cumulative and noncumulative entropy for Scenario 2
Figure 15: Analysis framework high-level architecture
Figure 16: Data extraction and processing steps
Figure 17: Distribution of code units based on the number of groups in which they fall
Figure 18: Distribution of defects based on the number of groups in which the code units where the defects happen fall
Figure 19: Type I and II errors for different cutoff values
Figure 20: Two code units with similar pattern of changes but with a significant gap between events
List of Equations
Equation 1: Shannon's entropy mathematical formula
Equation 2: Entropy formula to measure level of scattering based on periods of time with probability calculated based on frequency of churn
Equation 3: Entropy formula to measure level of scattering based on time with probability calculated based on size of churn
Equation 4: Formula to calculate probability based on frequency of churn
Equation 5: Formula to calculate probability based on size of churn
Equation 6: Entropy formula to measure level of scattering based on developers with probability calculated based on frequency of churn
Equation 7: Entropy formula to measure level of scattering based on developers with probability calculated based on size of churn
Equation 8: Formula to calculate probability based on frequency of churn
Equation 9: Formula to calculate probability based on size of churn
Equation 10: Minimum and maximum value of Shannon's entropy
Equation 11: Standard Shannon's entropy
Equation 12: Probability of defect based on a specific frequency of churn
Equation 13: Probability of defect based on a range of frequency of churn
Equation 14: Logistic regression equation
Equation 15: Logistic model to estimate probability of defect based on entropy of code churn throughout the project lifecycle
Equation 16: Linear regression equation
Equation 17: Linear regression equation (A) based on the level of scattering
Equation 18: Linear regression equation (B) based on the frequency of churn
Equation 19: Number of expected defects in the code units
Equation 20: Actual prediction error of each model
Equation 21: Total prediction error of each model
Equation 22: T-test hypotheses
Abstract
Defect prediction and removal continues to be an important subject in software engineering. Previous studies have shown that reworking defects introduced at different phases of the development lifecycle typically consumes 40% to 50% of the total cost of the software implementation. Another important factor is the time between the injection of a defect in the code and when it is identified and removed: it has been demonstrated that the longer a defect remains in the product, the larger the number of elements likely to be involved with it, increasing the cost of fixing it. This work investigates the defects introduced during the coding phase of the development lifecycle of a software project. Considering that a high percentage of defects are introduced in the source code at this phase, finding ways to better understand how they are introduced can have a significant impact on the testing and maintenance costs of a project and, consequently, on the quality of the final product.
This dissertation is the result of a series of investigations in the domain of quality
assurance through the use of attributes of code evolution and more specifically attributes
of code churn. Code churn represents a measure of the amount of code change taking
place within a code unit over time.
Software systems are developed and evolved through a sequence of churn events in their source code. These changes follow specific patterns and are characterized by a series of attributes such as size and frequency. Previous studies have successfully used some of the attributes of code churn as predictors of software quality. The problem is that these attributes of churn alone provide only a partial representation of how the source code evolved. Very few studies have focused on how the process followed by the developers to implement the code can impact quality, limiting the ability to understand how the patterns of changes performed by the developers affect the quality of the final product.
This dissertation is based on the belief that a complex code development process has a negative impact on the quality of the software. One way to measure this complexity is by modeling the level of scattering of the changes performed in the project. The concept of Information Theory is used as an innovative approach to model these patterns of changes. More specifically, Shannon's mathematical model (Shannon's Entropy), developed to measure the amount of uncertainty in a distribution, is applied to generate a group of metrics that measure how scattered the changes performed in the source code of a software project are. The study is empirically validated by a series of statistical analyses using data derived from the development process of four large real projects from an established software company.
Chapter 1. Introduction
Defect prediction and removal continues to be an important subject in software engineering. Jones (Jones, 1986) and Boehm (Boehm, 1987), in separate studies, have shown that reworking defects introduced at different phases of the development lifecycle typically consumes 40% to 50% of the total cost of the software implementation. Another important factor is the time between the injection of a defect in the code and when it is identified and removed. Humphrey (Humphrey, 1994) demonstrated that the longer a defect remains in the product, the larger the number of elements likely to be involved with it, increasing the cost of fixing it.
This dissertation focuses on the analysis of defects introduced during the construction phase of the development lifecycle of a software project. Considering that at least 40% of the defects in a software system are introduced at this phase (Gilb, 1988), finding ways to better predict situations that are more susceptible to generating defects can have a substantial impact on the testing and maintenance costs of the system and, consequently, on the reliability of the final product.
There is a significant amount of previous work in the field of software quality prediction. While most of these studies use source-code-based metrics such as number of lines of code, more attention has been given to code churn measures as a way to predict fault incidence (Graves, et al., 2000) (Nagappan, et al., 2005) (Munson, et al., 1998). Although successful in their own contexts, many of these studies take a superficial approach to the problem. Most of the findings lack an explanation of why certain patterns of churn resulted in a higher or lower incidence of defects, limiting one's ability to eliminate the source of the problem.
The process of constructing a software system is composed of successive changes (or code churn) to the elements that compose it. For new projects the code base is usually empty or composed of basic elements that define the framework on top of which the software will be built. Changes made to a unit of code of a system – here called a code unit – carry a series of attributes such as the time and size of the change or who performed them.
The information of who churned the source code and when it was churned can be
used to characterize the process used by the developers to create the source code. A code
unit, for example, may have churned throughout the entire construction phase while
another may have churned only for a short period of time.
The goal of this study is to demonstrate that characteristics of code churn can be used to identify error-prone modules in new software projects. In particular, this study is based on the belief that a more chaotic code implementation process has a direct impact on the outcome of the project. The concepts of information theory and Shannon's Entropy are used to measure the level of scattering of the churn events of the code units that compose the software. The investigation shows how the distribution of the events in different dimensions impacts the quality of the software. Two perspectives are considered: the time when the churn happened and the developers who churned the code. The main hypothesis is that code units that evolve in a more scattered (or chaotic) way are more error-prone.
Basili et al. have shown that any process depends to a large degree on a potentially large number of relevant context variables (Shull, et al., 2005), hence the need to properly contextualize any empirical study. The validation of the proposed models is done using data collected from two software factories established in Brazil. Data from four distinct software projects are used. All projects are Java based with sizes between 50k and 80k source lines of code (SLOC).
1.1 Problem & Motivation
Humphrey et al. in (Humphrey, et al., 2007) list the future directions in software development process improvement, emphasizing the increasing demand for quality. According to the study, the process of the future must meet five requirements, two of them being the reduction of development costs and schedule and the predictable development of quality products. Humphrey not only mentions the need for better quality but also stresses the goal of minimizing cost and schedule by optimizing project staffing.
Advances have been made in the field of quality assurance and defect reduction, but there are still considerable improvements to be made. Important future needs in the field of software defect reduction are listed by Boehm and Basili in (Boehm, et al., 2001). The first fact pointed out by Boehm and Basili is the high cost associated with locating and fixing a problem in the software after its release. This cost can be up to 100 times higher than the cost of removing the defect before delivery (Boehm, 1981). This is consistent with the results presented in (Humphrey, 2006), (Juran, et al., 1988), and (Deming, 1982). Other important facts listed in the same study are the findings, made by distinct researchers in different contexts, that about 80 percent of the defects come from 20 percent of the modules, that about half the modules are defect free, and that 80 percent of the avoidable rework comes from 20 percent of the defects (Boehm, et al., 2001). Yet, despite all these findings and recommendations, empirical studies have shown that, among the many activities that compose a software project, testing is still one of the most resource-intensive phases, consuming between 30% and 50% of the total development costs (Ramler, et al., 2006).
To be able to achieve the goals defined by Humphrey and his colleagues for a superior process (Humphrey, et al., 2007), software engineers will need to be more efficient at handling defects. Based on the facts listed above, Figure 1 describes the achievements that are on the critical path to accomplishing the software development process of the future described in Humphrey's study.
[Figure 1 diagram. Proved facts: 80% of defects in 20% of the modules (Boehm et al.); 80% of rework from 20% of the defects (Boehm et al.); defect removal cost rate up to 100:1 (Boehm et al.); testing consumes up to 50% of the development cost (Ramler et al.). Important achievements: efficient identification of error-prone modules for better allocation of testing resources (context dependent or not); better understanding of causes of defects (context dependent or not). Goal: a superior process that reduces rework from defect removal by preventing defects before test, produces quality products predictably, and reduces costs and schedule through efficient staffing (Humphrey et al.).]
Figure 1: Necessary achievements to reach the goal of a superior process defined by Humphrey et al. from a quality point of view
With the high costs associated with testing alone (Ramler, et al., 2006), one could reason that cutting testing time would have a significant impact in reducing the cost and schedule of a software project, except that improper levels of testing will result in less reliable software and higher post-release defect density. Depending on the severity of the defects, the project may end up with the high costs associated with late identification and removal of defects, possibly voiding any perceived economic gain. The goal is actually to cut testing time by testing more efficiently. Given that 80% of the defects are in 20% of the modules (Boehm, et al., 2001), practitioners need to be able to concentrate most of their efforts on these modules to have a more reliable system. The problem is how to identify which modules to test more thoroughly. The importance of properly identifying the characteristics of error-prone modules before the release of the project is acknowledged by Boehm (Boehm, et al., 2001), who stresses the importance of properly identifying both context-dependent and context-independent factors that contribute to error-proneness.
Better allocation of testing resources can result in a significant reduction of cost and schedule, but that alone is not enough to achieve significant process improvement. As Humphrey states in (Humphrey, et al., 2007), "removing defects before test is accepted as an essential element of all quality-management programs" and "they [the software- and systems-development community] should include explicit defect and productivity measures and analysis to ensure that they are addressing the most critical defect sources in the most effective way". It is important not only to find and remove defects early; it is also essential to identify and understand the possible causes and sources of these defects so that appropriate measures can be taken to prevent the defects from being introduced in the software.
Defects are introduced during various activities of the project lifecycle and, as such, they are classified according to their origin as Requirements Defects, Design Defects, and Coding Defects (Chulani, et al., 2003). A vast amount of research has been performed in the field of failure prediction models for these three types of defects in different contexts. These models work at different levels, projecting the number of failures and their distribution over time either for an entire project or for a specific level of software component. The techniques vary considerably, with both dynamic and static defect models. The dynamic models are based on the statistical distribution of faults given some fault data (Musa, et al., 1990). The static models use attributes of the project and of the process used to develop the project to predict the number of faults (Chulani, et al., 2003).
Most of the existing prediction models focus on the properties of the code to try to predict if, when, or how often it will fail. A large number of prediction models also use product metrics to predict the number of defects or the defect density of a system. Although useful and sound, these properties are usually context independent (Boehm, et al., 2001) and give very little insight into the context-specific factors that are contributing to error-proneness. The few researchers who have actually looked into the properties of the changes (Graves, et al., 2000), (Karunanithi, 1993), (Khoshgoftaar, et al., 1996), (Nagappan, et al., 2005) provided very limited understanding of how the code implementation process used by the developers to construct the source code of the system affects quality.
1.2 Proposed Solution
A software system is implemented by a series of changes – or churn – performed on the source code. These changes can be characterized based on when they were performed, their size, or who performed them. A source file of a system, for example, may have churned throughout the entire development phase while another may have churned only for a short period of time. In this study the term code implementation process is used to define these patterns of changes performed by a developer during the construction of a software project.
Attributes of code churn have been studied as predictors of quality before, although the relevant studies concentrated their analysis on very large systems in their maintenance or evolutionary phases (Graves, et al., 2000) (Karunanithi, 1993) (Khoshgoftaar, et al., 1996) (Nagappan, et al., 2005). Furthermore, the most commonly used metrics of code churn in these studies have been frequency and size of churn or variations of these metrics. They have been used to assess quality and, although efficient in specific contexts, they fail to provide a proper description of the code implementation process of the units of code of a system and of how it affects quality. The simple fact that a source file that churned more frequently has a higher probability of having more defects (Nagappan, et al., 2005) does not give enough information to understand the causes of the defects from a code development point of view. Software engineering practitioners may look at these metrics of churn individually or combined to assess quality, but these metrics alone will not allow improvements to their decision-making process. The simple act of reducing the number of changes performed in a unit of code of a system, for example, will most probably not impact its quality.
During the course of this study a series of preliminary studies similar to the ones described in (Graves, et al., 2000), (Karunanithi, 1993), (Khoshgoftaar, et al., 1996), and (Nagappan, et al., 2005) were also performed to assess the efficiency of code churn collected during the development of new software projects as a predictor of code unit quality. Multiple successful statistical regression models were generated; however, they were considered limited by project managers because they would only indicate the presence of defects in a code unit without giving any insight into the possible causes of the defects.
Practitioners who participated in the preliminary studies indicated a need to better understand the risk associated with the code implementation process as a way to better plan their testing activities. This is aligned with Humphrey's recommendation that empirical software engineering research in the field of defect reduction must supply software project managers with the means to more efficiently allocate testing resources in a way that costs and schedules can be reduced while quality is not affected. Perhaps more importantly, researchers need to provide project managers with a detailed understanding of the causes of defects so decisions can be made to prevent, or at least minimize, the insertion of these defects before the end of the construction phase.
This research is based on the belief that the code implementation process plays a key role in the quality of the project. More important than assessing how frequently a code unit is churning, this study concentrates on measuring how distributed (or scattered) the construction of the software project is, based on the churn events associated with it, and how this affects the final quality of the code units of a software system.
Considering that frequency and size of churn have an impact on the quality of code units, one could assume, for example, that two code units with the same frequency of churn have the same probability of presenting post-release defects. Figure 2 presents the distribution of churn events of three code units (A, B, and C) in ten distinct periods of time during the project lifecycle. All three code units have exactly the same frequency of churn (eighty churn events per file) but a completely different distribution of these events. Code unit A had all the changes clustered in the two initial periods of time while code unit B continued to churn throughout the entire project lifecycle. Even though code unit C had most of the changes concentrated in one period of time, it presented a considerable gap between these events and the last churn event. The initial assumption that all three files have the same probability of post-release defects might no longer hold after a better understanding of their code implementation process.
Figure 2: Churn frequency of files distributed over project lifecycle
In this dissertation the code implementation process is quantified based on how scattered the events of churn are. A new approach based on the concepts of information theory (Weaver, 1949) is defined to measure the level of distribution of the changes – here called level of scattering – in different dimensions. The level of scattering is measured considering the distribution of the churn events throughout the project lifecycle and among the developers. The final objective is to demonstrate the impact that the level of scattering of code churn has on the quality of code units. This will provide software engineering practitioners with not only an innovative way to assess the error-proneness of software modules but also the means to identify risky behavior patterns when implementing software projects.
1.3 Performed Study
This study is based on a series of preliminary studies correlating a number of different aspects of software characterization and development methodology against failures occurring both during the development process and in customer tests. The COCOMO II Modeling Approach described in (Boehm, et al., 2000), as presented in Figure 3, was used to evolve the initial models as the objectives of the research evolved with the findings from these preliminary studies and the needs of the organization providing the data.
Figure 3: COCOMO II Modeling Approach used in the study. Model reproduced from (Boehm, et al., 2000)
Data used in the study was extracted from real software projects. The data was
automatically extracted from the version control system and the defect tracking system.
When necessary, information was validated through interviews with project managers
and developers.
A mathematically sound validation of the model of the level of scattering was
performed using both statistical logistic and linear regressions.
Figure 4: Data extraction process
1.3.1 Research Context
Basili et al. in (Shull, et al., 2005) describe the need for proper contextualization of research. Boehm and Basili in (Boehm, et al., 2001) and Basili in (Shull, et al., 2005) highlight the need for a better contextualization of empirical research in software engineering and a better identification of context-dependent variables that impact quality. According to Basili, it is unclear what specific variables will influence the effectiveness of a process in a given context; hence, knowledge about software process must be built from families of studies executed in different contexts. Although the evidence found in this study may extend to other contexts, a good understanding of the context gives better insight into the problems currently faced by a typical software company.
The data used in this study was collected from software factory organizations. Software factories were initially described by Bremer in (Bremer, 1969) and later by Humphrey in (Humphrey, 1991). These software development companies use fundamentals of industrial manufacturing – standardized frameworks, specialized skill sets, parallel processes, and a consistency of quality – to achieve a superior level of application assembly even when assembling new or horizontal solutions. Software factories are primarily providers of software development services. They work on a contract basis, serving clients by implementing and delivering customized software systems to meet specific needs that cannot be served by available commercial applications. To be competitive a software factory must be able to demonstrate to clients a record of high productivity at lower costs and at an acceptable level of quality. This creates the need for a more structured development environment, hence the popularity of CMMI, ISO, and similar certifications among large software factories.
Quality of the delivered products is a priority in software factories. Their entire business model depends on their ability to deliver correct, defect-free software to the client the first time. Any time spent by a developer fixing a bug in a project has a direct impact on the profit generated by that project, plus an impact on other projects under development that depend on the availability of that particular developer. The developers assigned to remove defects are in general the most experienced developers, increasing the burden on the ongoing projects. Quality of the generated code also has an impact on the allocation of testing resources. With different projects competing with each other to be tested, a low-quality project will require more testing resources, putting other projects at risk of missing their deadlines due to a lack of testers. There is also a very serious indirect cost associated with the delivery of a low-quality product. Software factories depend on their ability to acquire projects from clients. A software factory known for providing low-quality products will have difficulty acquiring future projects.
Unpublished empirical evidence collected from the studied software factories indicates a distribution of the project schedule per group of activities as described in Figure 5. The activities related to testing, validation, and fixing of defects account for almost 20% of the total time of the project. Post-release adjustments account for 5%. Contracts between software factories and clients usually specify the client validation phase. In most cases it varies between 3 and 6 months (an average of 4 months) for a project of up to 100 KSLOC. During that time the clients test the system and submit any defects found to be fixed by the software factory. Requests for changes are not atypical at that stage, and most of the time the contracts account for them as long as they are considered minor changes. Major changes need to be renegotiated and most of the time result in an extension of the established contract. Once the final version of the software is released the software factory does not have any further obligation to it.
Figure 5: Lifecycle activities schedule distribution in a typical software factory
[Figure 5 diagram. Schedule shares: Planning & Control (6%); Elaboration (10%); Construction (57%); Test, Validation & Fix (11.5% + 7%); Initial Release (3.5%); Client Validation; Post-release Adjustments (5%); Final Release.]
Chapter 2. Related Work
There have been a number of studies assessing measures of code churn as predictors of software reliability. The experiments in some cases concentrated specifically on code churn or used it as part of a larger set of metrics, although very few of them used attributes of code churn alone as predictors of code quality. The prior work is divided into three groups. The first group covers previous studies in the field of defect prediction based on measures of churn; the second group covers previous investigations of the correlation between developer-related metrics and software quality; and the last group covers defect prediction through the use of attributes of code.
Karunanithi in (Karunanithi, 1993) presents the applicability of the neural network approach to the problem of developing an extended software reliability growth model in the presence of continuous code churn. He demonstrates that, when comparing two neural network models, one with the code churn information and another without it, the one that incorporated the code churn information was capable of providing a more accurate prediction.
Supporting the idea of this study is the work of Ball and Nagappan (Nagappan, et al., 2005). They present a technique for early prediction of system defect density using a set of relative code churn measures that normalize the size of churn based on other variables. The experiment was validated using the metrics collected during the development of Windows Server 2003 and the Windows Server 2003 Service Pack 1. Among other things, they were able to demonstrate that relative code churn measures can be used as efficient predictors of system defect density in evolving systems when churn is measured at the binary level. In a separate study, Nagappan (Layman, et al., 2008) was able to use 159 churn and structure metrics from six four-month snapshots of a 1 million SLOC Microsoft product to create an interactive model to identify defect-prone binaries. The interactive model was 80% accurate at predicting fault-prone and non-fault-prone binaries.
Graves et al. (Graves, et al., 2000) present a very detailed work on predicting fault incidence using software change history, with the main objective of understanding the nature and causes of faults. In the experiment, several statistical models were built to evaluate which characteristics of a module's change history were likely to indicate that it would see large numbers of defects generated as it continued to be implemented. Among the generalized linear models analyzed in the experiment, the one that presented the best results used the number of changes to the module in the past combined with a measure of the age of the module (the age was calculated as the average age of the code in that module). The weighted time damp model was the overall best model, using a sum of contributions from all the changes to the module in its history to predict fault potential. In this model, large and recent changes contribute substantially to fault potential. The experiment also showed that once code churn was considered, the model could not be improved by including the module's size in lines of code and, therefore, many of the software complexity metrics did not have an impact on the model; also, the number of developers involved in the changes made to the module did not seem to correlate with the number of faults in the module.
Ohlsson et al. (Ohlsson, et al., 1999) studied 8 releases of a large embedded mass storage system with 800,000 lines of C code and 130 components. Their objective was to identify measures which could be used to create a model that could identify fault-prone modules before they caused any major problems. Among the measures found relevant, twelve were related to size and change. The authors used these measures to successfully identify 25% of the most fault-prone components.
Elbaum et al. (Munson, et al., 1998) analyzed 18 builds of a real time system with
300,000 lines of code and more than 3,700 modules in C. The experiment was able to
demonstrate that code churn is a powerful surrogate for software faults.
Khoshgoftaar et al. (Khoshgoftaar, et al., 1996) focused their experiment only on code churn related to bug fixes (debug code churn). Their primary goal was to identify modules in the system where debug code churn exceeded a threshold so they could classify those modules as fault-prone.
In (Gayatri, 2003), Rapur defined a model capable of predicting the modules of a
system that were expected to churn and the size of the churn using historical churn data.
Code churn in the experiment is defined as the amount of update activity that has been
done to a software product in order to fix bugs. The experiment was validated using four
versions of an open-source software product containing a total of 17,000 lines of code.
The author was able to demonstrate through the case study that the defined model could
correctly predict the rate of churn of the modules in the future. It is important to consider
that, although these results seem very promising, they require further validation as stated
by the author.
Mockus et al. in (Mockus, et al., 2000) use properties of a software change such as size, duration, diffusion, and objective to measure the impact on the risk of failure. Through the use of logistic regression they are able to successfully measure the risk of each change made in the studied software. The generated model was validated using data from an evolving software system.
Hassan et al. in (Hassan, et al., 2003) use the concept of information entropy to measure the chaos of a change in evolving software systems. Different from the approach in this dissertation, they use entropy to measure the level of chaos of each change performed in the system based on the number of files that the change impacted. They were able to demonstrate that the more scattered the changes are among different files, the higher the probability of failure.
Moser et al. in (Moser, et al., 2008) compared the efficiency of product-related versus process-related software metrics in classifying Java files of the Eclipse project as defective or defect-free. Results indicated that process metrics, in the Eclipse project, are more efficient defect predictors than code metrics. In a separate study, Moser et al. (Moser, et al., 2008) use Eclipse project data to demonstrate that, out of eighteen change metrics, three remain stable as predictors of software defects: frequency of churn, number of bug fixes, and maximum size of all of a file's change sets.
Most of the experiments presented here are related to very large systems with more than 100K lines of code that kept evolving over many years. Another common characteristic is that churn is analyzed at the module or component level. The only exception is the experiment in (Nagappan, et al., 2005), which performed the analysis at the binary level.
Many researchers have investigated how developer-related factors may impact software cost and quality, such as Tom DeMarco in (DeMarco, et al., 1999) and Curtis in (Curtis, et al., 1988).
Both the COCOMO II (Boehm, et al., 2000) and the COQUALMO (Chulani, et al., 2003) models provide strong evidence of the effect of personnel capability and experience on the outcome of a software project.
Paech et al. in (Illes-Seifert, et al., 2008) combine both process and human aspects to demonstrate that the number of changes and the number of authors changing files in evolving systems impact the quality of software projects. They also demonstrate that the number of co-changed files does not correlate with the defect count.
Turner et al. (Turner, et al., 2003) highlight the importance of people factors in the success of software projects. This fact was realized by Turner during an investigation that compared Agile and Plan-Driven software development methods. This dissertation helps to corroborate Turner's findings, as will be demonstrated later.
Mockus et al. (Mockus, et al., 2000) demonstrated that the experience a developer has with the module being changed has a significant impact on the failure probability. They successfully demonstrated that, in the studied projects, the greater the programmer's experience, the lower the probability that the change will result in a defect.
LaToza et al. (LaToza, et al., 2007) performed an assessment of how experience affects changes made to the code. They found that, when removing a seeded defect from the system, experienced developers tended to address the cause of the problem while novice developers provided inferior solutions that addressed only the symptoms. In another work, LaToza et al. (LaToza, et al., 2006) analyzed the dynamics among developers in spreading tacit knowledge about code. Among the problems found by LaToza is the fact that "developers spend vast amounts of time gathering precious, demonstrably useful information, but rarely record it for future developers". This can have large implications when a large number of developers are working on large code units.
Wolf et al. (Wolf, et al., 2009) used data from IBM’s Jazz project to study the
communication structures of development teams with high coordination needs. The
results of this study indicated that developer communication plays an important role in
the quality of software integrations.
Murphy et al. (Pinzger, et al., 2008) investigated the relationship between the
fragmentation of developer contributions and the number of post-release failures. The
study used network centrality measures to measure the degree of fragmentation of
developer contributions. Fragmentation was determined by the centrality of software
modules in the contribution network. Results of the study demonstrated that central
modules were more failure-prone than modules located in surrounding areas of the
network. Results also confirmed that the number of authors and the number of commits were significant predictors of the probability of post-release failures.
Meneely et al. in (Meneely, et al., 2008) used social network analysis to assess the structure of developer collaboration and its impact on the reliability of the final product.
Similar to the approach used by COQUALMO, the influence of organizational structure on software quality is assessed by Nagappan et al. in (Nagappan, et al., 2008). A set of organizational measures that quantify the complexity of the software development organization is defined in the study. The organizational measures are then used to quantify and study the effect that an organizational structure has on software quality. The organizational metrics try to capture, from a quantifiable perspective, issues such as the organizational distance of the developers, the number of developers working on a component, the amount of multi-tasking developers are doing across organizations, and the amount of change to a component within the context of that organization. The efficacy of the measures in detecting defect-prone binaries was evaluated using data collected from the development of Windows Vista. The results indicated that, in the development of Windows Vista, the organizational metrics were better predictors of failure-proneness in binaries than the traditional metrics of code.
A few other studies in the field of defect prediction through characteristics of code had an influence on the techniques used in this dissertation and must be mentioned here.
Selby et al. in (Selby, et al., 1988) and (Selby, et al., 1990) used software metrics to generate classification trees with the objective of identifying error-prone modules. A similar approach was used in preliminary investigations when assessing the potential of the different attributes of churn in identifying error-prone modules.
Basili et al. (Basili, et al., 1984) performed an analysis of the various factors that have an impact on software development. They focused primarily on the complexity of the software, the developers' experience with the application, and the reuse of existing design and code. The study was performed on a project of approximately 90,000 lines of code developed in FORTRAN. Basili was able to demonstrate, among other things, that modified modules appeared to be more susceptible to errors due to the misunderstanding of the specifications when compared to new modules developed from scratch.
Zimmermann et al. in (Zimmermann, et al., 2009) demonstrate the challenges in transferring prediction models from one project to another. Results indicate that there are major challenges in cross-project predictions.
Different from many of the studies presented here, this dissertation focuses on the identification of error-prone modules using the pattern of changes in new software projects, a context not yet widely explored by software engineering researchers. In particular, this dissertation investigates how the way a file churns, and more specifically how the level of scattering of the churn events of a code unit, impacts its quality. The level of scattering is measured in an innovative way using the concept of information entropy. Although this approach has been used before to measure the complexity of a change (Hassan, et al., 2003) by assessing the distribution of the changes among different files, it has never been used in the way presented here.
Chapter 3. Code Scattering Model
3.1 Introduction
Measures of frequency and size of churn have been previously used to identify how the implementation process of a software system can impact the quality of a code unit. Many researchers have successfully demonstrated that the frequency of change is a good predictor of the number of defects in a source file (Graves, et al., 2000) (Khoshgoftaar, et al., 1996) (Mockus, et al., 2000). Although efficient in identifying error-prone files, frequency or size of change alone does not reveal important facts about the development process that may be contributing to the introduction of these defects. Two files with the exact same frequency of churn, for example, can have a completely different distribution of churn events, as demonstrated in Figure 2.
This dissertation presents an innovative approach centered on the mathematical model behind information theory to model the level of scattering of the implementation process of a software project. This model promotes a better understanding of the impact that a more scattered implementation process has on the final quality of the code units of the software.
3.2 Code Implementation Process History
The development of a software project is composed of a series of changes to the source code. These modifications are carried out by the developers at different points in time and are performed mostly to implement a new feature, change an existing one, or remove an existing defect. In this study the term code implementation process is used to define the pattern of these changes.
This study concentrates on defining ways to quantify the chaos of the code implementation process of a software project. More specifically, it concentrates on measuring how scattered the churn events that contribute to the implementation of each code unit are. The final goal is to use these measures to better understand the impact that the code implementation process has on the post-release quality of the units of code of a project and, consequently, on the entire project.
Version control systems (or revision control systems) are tools widely used by software companies today (Tichy, 1985). Their primary objective is to keep track of all changes performed in the source code of a project (CVS). They help organize the implementation process among the various developers of the team. They also provide the means to restore any part of the source code of the project to a previous state at any given time in the past. Additionally, version control systems provide the tools to reconcile the changes made by developers who work on the same part of the source code at the same time.
A version control system is a rich repository of information capable of describing the entire development history of each code unit of a software project, from the time it was initially added to the project until the current date. The version control system stores not only the actual source file of the project but also the date and time at which it was submitted, the lines that changed in the file (including the number of lines added and deleted), who performed the change, and any specific comments submitted by the developer. These comments usually contain a description of what has changed and why the change was performed. Each time a developer finishes a change in a file, he checks it into the version control system. Each instance of a change checked in is called a revision and is the basis of this study. A complete history of the changes in the file is maintained by the tool, enabling anyone with access to it to get any revision of a file.
[Figure 6 diagram. A file's history is a sequence of revisions (Revision 1, Revision 2, Revision 3, ..., Revision n), each carrying the date/time, the developer, the lines added, the lines deleted, and a description of the change.]
Figure 6: Typical revision structure and associated information of a file in a version control system
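Revision histories of this kind can be mined programmatically. The sketch below is a minimal illustration only, not the extraction framework used in this dissertation (that framework is presented in Section 3.5): it assumes, purely for the sake of the example, a Git repository and Python, neither of which is specified by the studied projects, and it produces one churn event per file per revision.

import subprocess

def extract_churn_events(repo_path):
    # Illustrative sketch only: list (revision, developer, date, file, added, deleted)
    # tuples from a Git log. The studied projects' actual version control tool and
    # extraction framework are not the ones shown here.
    log = subprocess.run(
        ["git", "-C", repo_path, "log", "--numstat", "--date=short",
         "--pretty=format:@%H|%an|%ad"],
        capture_output=True, text=True, check=True).stdout
    events = []
    revision = developer = date = None
    for line in log.splitlines():
        if line.startswith("@"):
            # A commit header line: hash, author, and date.
            revision, developer, date = line[1:].split("|")
        elif line.strip():
            # A per-file churn line: lines added, lines deleted, file path.
            added, deleted, path = line.split("\t")
            events.append((revision, developer, date, path, added, deleted))
    return events

Each tuple corresponds to one revision of one file, the kind of churn event used in the remainder of this chapter.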
Files in a software project churn for three main reasons as described below
(Swanson, 1976).
Add a new feature (NF): These changes are performed to add a new feature to
the software project. They are the most common modifications performed during
the development of a new software project and the foundation for the model
proposed in this study.
26
Improve or modify an existing feature (MF): This type of change is performed
mostly in evolving systems when a certain feature needs to be improved or
modified to comply with changes in the business process. The number of MF
changes in the studied context tends to be minimal since these are usually new
projects where the scope is carefully defined before the start of the project.
Remove a defect (DR): These are modifications performed to remove a defect
from the code unit. In this study, changes of this type are used as a way to count
post-release defects in the code units.
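How a revision is mapped to one of these change types is not prescribed at this point; the sketch below is only one plausible, purely illustrative heuristic (assumed, not taken from this dissertation) that inspects the developer's check-in comment for defect-related keywords in order to separate DR changes from the rest.

DEFECT_KEYWORDS = ("fix", "bug", "defect", "error")

def classify_change(comment):
    # Illustrative heuristic only: flag defect removals (DR) by keywords in the
    # check-in comment and default everything else to new-feature work (NF).
    # MF changes would require additional, project-specific rules.
    text = comment.lower()
    return "DR" if any(keyword in text for keyword in DEFECT_KEYWORDS) else "NF"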
The idea of using the information stored in version control systems is not new and has been widely used in previous studies (Graves, et al., 2000) (Nagappan, et al., 2005) (Gayatri, 2003) (Mockus, et al., 2000) (Eick, et al., 1990). Information can be retrieved automatically as the system evolves, in a less invasive way, requiring no effort from the development team.
3.3 Code Implementation Model
The development process of a code unit can be modeled based on how frequently it churned during the project implementation. Imagine a system composed of three code units. Through the version control system it is possible to gather the complete history of changes of these three code units, including the dates they churned and the developers who churned them. This information can be used to plot the evolution of the code of each file. Figure 7 presents the number of times each code unit in this hypothetical system churned in specific periods of time of the project lifecycle (e.g. a week, a month, etc.).
Each file in this example churned a total of eighty times. A superficial assessment based on frequency of churn alone could suggest that all three code units had a similar pattern of changes. The problem is that frequency alone, although efficient in predicting post-release quality (Nagappan, et al., 2005), does not model important aspects of the development of a code unit. The churn events of the code units provided in the example are distributed through the periods of time following different patterns. The frequency of churn of Code Unit C is concentrated almost entirely in the third period of time while the churn events of Code Unit B are evenly distributed through eight periods of time. It is reasonable to assume that the development process of Code Unit C is less chaotic than the development of Code Unit B because the changes performed all happened during a small period of time. The developers involved in the implementation of Code Unit C were probably more focused, maintaining a better understanding of the changes performed in the code unit. The implementation of Code Unit B, on the other hand, has a more distributed pattern, with changes happening in the file through the entire development phase. It becomes harder for the developer to keep a good grasp of what is changing in the files, increasing stress and the chance of defects.
Figure 7: Patterns of change of three code units of a system distributed in ten periods of time
This study is based on the belief that a chaotic distribution of the modifications performed in a code unit has a direct impact on the ability of the developers to maintain a good understanding of what is changing over time, resulting in a more complex file that is consequently harder to change. Brooks in (Brooks, 1995) presents a number of observations that corroborate the assumptions behind this study. In his work, Brooks indicates how the understanding of the code by those involved in implementing it is crucial to the outcome of the project. The way the files change over time can significantly affect the knowledge that each developer has of these files. This can influence the ability of the developers to understand the impact that future changes will have on the files, resulting in defects in the code unit.
This dissertation measures the level of chaos of the implementation process of a code unit based on how scattered the changes performed in it are. Time, as presented in Figure 7, is just one of the angles (also called here dimensions) that can be used to model the level of scattering of the code implementation process of a code unit. It is also possible to assess the evolution of a file based on the level of influence that each developer had on the code unit. Figure 8 presents the distribution of the churn events of
four code units of another hypothetical software project among five developers who worked on the project. Although each file churned exactly thirteen times, there is a significant difference in the level of distribution of the churn events of each file among the developers. Code Unit 2, for example, was entirely developed by developer Dev 04 while Code Unit 1 had a higher distribution of the changes among developers Dev 01, Dev 02, and Dev 03. An increase in the complexity of the code unit is expected as the number of developers with significant influence on the file increases. One of the reasons for that is described by Brooks (Brooks, 1995): the higher the number of developers involved in a project, the higher the number of communication paths among them, increasing the communication overhead. A code unit implemented by five developers, for example, requires ten communication paths. The communication becomes even more critical in cases where the level of influence of each developer in the development of the code unit increases. The more each developer influences the code, the higher the chances that they will fail to communicate important aspects of the code unit, resulting in potential issues in the future. Brooks also points out the risk of adding developers late in the project (Brooks, 1995), a fact that happens constantly in software factories due to high turnover rates and the continuous reallocation of developers to different projects.
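Brooks's communication-path count mentioned above is simply the number of distinct pairs of developers, so it grows quadratically with team size:

\text{paths}(n) = \binom{n}{2} = \frac{n(n-1)}{2}, \qquad \text{paths}(5) = \frac{5 \cdot 4}{2} = 10.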
Figure 8: Patterns of change of four code units of a system distributed among the developers
The investigation of the level of scattering of the code implementation process among developers is also justified by the contradictory results found in previous studies that attempted to correlate the number of developers with the quality of a software project. Thayer et al. (Thayer, et al., 1978) found no significant correlation between the number of developers and the quality of the code units. Other researchers found different results, such as the one presented in (Graves, et al., 2000). Many studies have demonstrated the impact of developer skill on the outcome of the project (Turner, et al., 2003) (DeMarco, et al., 1999) (Chulani, et al., 2003) (Boehm, et al., 2000) (Curtis, et al., 1988). The question then becomes why the results are contradictory when it comes to the number of developers who worked on the code unit. The answer may be less related to the number of developers who churned the file and more related to the level of influence that each developer had on that file.
There is a need for proper quantification of the complexity of the code
implementation process and the amount of predictability and chaos that is associated with
the code churn events of a code unit. This dissertation uses the concept of Information
Theory and Shannon’s Entropy as the foundation of the proposed model.
3.3.1 Information Theory
Shannon in (Weaver, 1949) defines the basis of Information Theory. The theory deals with assessing and defining the amount of information in a message. It focuses on measuring the uncertainty associated with information. Imagine, for example, that a person is collecting the output of a device that produces four symbols, A, B, C, or D, in an unknown order. As the person waits for the next output, he does not know the exact symbol that will be produced. In other words, he is uncertain about the distribution of the output. Once the symbol is output, the level of uncertainty decreases. He now has a better idea of the distribution of the output.
Shannon in (Weaver, 1949) proposes a mathematical model to measure the
amount of uncertainty (or entropy) in a distribution. Shannon’s Entropy (H) is presented
in Equation 1.
H(X) = \sum_{i=1}^{n} p(x_i) \log_2\!\left(\frac{1}{p(x_i)}\right) = -\sum_{i=1}^{n} p(x_i) \log_2 p(x_i)
Equation 1: Shannon's entropy mathematical formula
where p(x_i) \geq 0 for i \in \{1, 2, \ldots, n\}, and \sum_{i=1}^{n} p(x_i) = 1.
For a distribution X where all elements have the same probability of occurring (p(x_i) = 1/n), the entropy achieves its maximum value of \log_2 n. On the other hand, for a distribution X where one of the elements has a probability of occurrence equal to 1 (or 100%) and all the other elements have a probability of occurrence equal to zero, there is no uncertainty and consequently the entropy is zero.
By quantifying the amount of uncertainty in a distribution, H describes the minimum average number of bits required to encode a symbol drawn from that distribution. It therefore defines the best possible compression for the distribution, and it has been used to measure the quality of compression techniques against the theoretically possible minimum compressed size.
Consider tossing a coin with known, not necessarily fair, probabilities of coming up heads or tails. The entropy of the unknown result of the next toss is maximized if the coin is fair (that is, if heads and tails both have probability 1/2). This is the situation of maximum uncertainty, as it is most difficult to predict the outcome of the next toss; each toss of the coin delivers a full 1 bit of information. However, if it is known that the coin is not fair, but comes up heads or tails with probabilities p and q, then there is less uncertainty: every time it is tossed, one side is more likely to come up than the other. The reduced uncertainty is reflected in a lower entropy; on average each toss of the coin delivers less than a full 1 bit of information. The extreme case is that of a double-headed coin that never comes up tails. Then there is no uncertainty, and hence the entropy is zero.
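The following minimal sketch, in Python, illustrates these three cases. The function name and the example probabilities are illustrative only; they are not part of the model defined later in this chapter.

import math

def shannon_entropy(probabilities):
    # Shannon entropy H, in bits, of a discrete probability distribution.
    return -sum(p * math.log2(p) for p in probabilities if p > 0)

print(shannon_entropy([0.5, 0.5]))   # fair coin: maximum uncertainty, 1 bit
print(shannon_entropy([0.9, 0.1]))   # biased coin: less than 1 bit per toss
print(shannon_entropy([1.0, 0.0]))   # double-headed coin: no uncertainty, 0 bits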
3.3.2 Modeling Level of Scattering of Code Implementation
It is possible to think of the implementation process of a software development project as a system that emits data, with the data defined as the changes performed in each code unit. Either the periods of time in the implementation phase or the developers that participated in the development of the project can be considered the possible outputs of the system. Using this approach, the concepts of Information Theory and Shannon's Entropy can be applied to measure the level of scattering of the implementation process of the code units of a software project. This measure of the entropy of the code implementation process can then be used as an indicator of the level of chaos of the implementation process of each code unit, resulting in a measure of the quality or complexity of the code unit. The assumption is that a more distributed construction of the code unit, either over the project implementation phase or among the developers that participated in the project, tends to result in a more complex code unit.
The date of each modification performed in the code unit and the developer responsible for it can be extracted from the version control system. Because this study deals with new software projects prior to their initial release to the users, all types of changes performed during the implementation phase are considered and no major distinction between the types is made. During this stage, the changes performed to add new features (NF) are the ones that contribute the most to the final size of the code unit. Changes to modify an existing feature (MF) or to remove defects (DR) have less influence on the development of the code unit and tend to be more restricted, having a small impact on the calculation of the level of scattering of the code implementation. Nevertheless, different types of changes tend to have different impacts on the quality of the files (Mockus, et al., 2000); hence, whenever extracting churn information from the version control system, it is important to take measures to assess code units with a high incidence of MF and DR changes.
Figure 9: Patterns of change of three code units of a system distributed in ten periods of time
The process of modeling the level of scattering of the code implementation process of a code unit starts by dividing the implementation stage into periods of time (e.g. a week or a month). Each period can be considered a possible outcome of the system that emits the signal. The frequency of churn of a code unit in each period of time can represent the outcome of the system. Figure 9 presents the implementation stage of a software project divided into ten periods of time. The frequency of churn of three code units is calculated for each period of time T_x. The churn probability distribution P can be calculated for each specific code unit. P represents the probability that the code unit will churn in a specific period of time T. For each period of time and for each code unit, count the number of times the file churned and divide it by the total number of times that code unit churned during the entire project. Using the example presented in Figure 9, Code Unit A churned a total of 80 times – 40 in T_1 and 40 in T_2 – resulting in the probabilities P(Code Unit_A) = 0.5 for T_1 and P(Code Unit_A) = 0.5 for T_2.
If the probability that the code unit will churn at period of time T_x is equal to 1, and 0 for all the other periods of time, then the churn events of the code unit are clustered together, resulting in the minimal entropy or the lowest level of scattering. Conversely, if the probability of the file churning is exactly the same for all ten periods of time, then the entropy for that code unit reaches its maximum. In that case it is reasonable to assume that the code unit is undergoing high rates of churn throughout the entire project lifecycle, affecting the ability of the developers to keep a good understanding of the modifications performed.
The calculation of the probability P may be performed using different attributes of the churn events. Two approaches are considered in this dissertation. The first approach calculates the probability of churn based on the number of times the code unit was modified in a specific period of time. The second approach calculates the probability based on the size of the changes performed. Each measure carries important information, hence the decision to present the model using both metrics.
· Frequency of Churn: the frequency of churn indicates how often a code unit undergoes churn during the project lifecycle. The metric associated with this attribute is straightforward: it represents the number of times the code unit was submitted to the version control system by a developer.
· Size of Churn: every churn event represents a change in the size of the code unit in source lines of code. Measures of the size of code churn were successfully used as predictors of defects in binaries in (Nagappan, et al., 2005). Different measures of size of churn may be used; in this study the numbers of lines added and deleted were added up to calculate the size of the churn.
The approach to calculate the entropy of the code implementation process of a code unit based on the level of participation of each developer of the project uses the same concept but, instead of using periods of time, the probabilities are calculated using the distribution of the frequency or size of churn among these developers. Figure 10 gives an example of the distribution of four code units among the developers based on the frequency of churn. The probability that Code Unit_1 will be churned by developer Dev01 is equal to P(Code Unit_1) = 6/13 or 0.46, and the probability that Code Unit_2 will be churned by developer Dev04 is equal to P(Code Unit_2) = 13/13 or 1, and zero for all other developers. A high number of developers with a probability of churning a code unit indicates the need for more communication paths among them, increasing the risk of miscommunication and affecting their ability to keep a good grasp of how the code unit is evolving under the influence of the other developers.
Figure 10: Churn frequency of files distributed through developers
Existing measures of the distribution of churn events among developers or periods of time have failed to properly model the strength of these distributions. One of the main advantages of using the concept of information theory to model the code implementation process is that it better quantifies the level of influence that each developer or each period of time had in the development of the code unit. The higher the entropy, the more scattered was the implementation of the code unit. The entropy along either the developer or the time dimension carries significant information on its own. The raw entropy can be used by project managers, for example, to better understand the implementation process of each code unit.
The following terms are used in this dissertation to identify the entropy measures:
· ScattTime: the entropy calculated to measure the level of scattering of the churn events using the perspective of time to calculate the probability.
· ScattDev: the entropy calculated to measure the level of scattering of the churn events among the developers that worked on coding the project.
The mathematical formulas used to calculate the entropy based on the frequency
and size of churn of the code units are presented in Equation 2 and Equation 3.
ScattTime(Freq)_z = -\sum_{i=1}^{n} p(FreqChurn_{z,i}) \log_2 p(FreqChurn_{z,i})
Equation 2: Entropy formula to measure level of scattering based on periods of time with probability calculated based on frequency of churn
ScattTime(Size)_z = -\sum_{i=1}^{n} p(SizeChurn_{z,i}) \log_2 p(SizeChurn_{z,i})
Equation 3: Entropy formula to measure level of scattering based on time with probability calculated based on size of churn
Where:
· z corresponds to a specific code unit in the project
· n represents the total number of periods of time into which the project lifecycle was divided and in which the code churned
· i represents a specific period of time in the project lifecycle
· FreqChurn_{z,i} represents the number of times that code unit z churned in period of time i
· SizeChurn_{z,i} represents the total size of churn of code unit z in period of time i
· p(FreqChurn_{z,i}) and p(SizeChurn_{z,i}) represent the probability that the file will churn in that particular period of time and are calculated based on one of the two formulas depending on the approach used to measure the entropy:
p(FreqChurn_{z,i}) = \frac{FreqChurn_{z,i}}{\sum_{j=1}^{n} FreqChurn_{z,j}}
Equation 4: Formula to calculate probability based on frequency of churn
p(SizeChurn_{z,i}) = \frac{SizeChurn_{z,i}}{\sum_{j=1}^{n} SizeChurn_{z,j}}
Equation 5: Formula to calculate probability based on size of churn
The following formulas are used to calculate the entropy of the churn events
based on the developers that worked on the code units.
ScattDev(Freq)_z = -\sum_{i=1}^{n} p(FreqChurn_{z,i}) \log_2 p(FreqChurn_{z,i})
Equation 6: Entropy formula to measure level of scattering based on developers with probability calculated based on frequency of churn
ScattDev(Size)_z = -\sum_{i=1}^{n} p(SizeChurn_{z,i}) \log_2 p(SizeChurn_{z,i})
Equation 7: Entropy formula to measure level of scattering based on developers with probability calculated based on size of churn
Where:
· z corresponds to a specific code unit in the project
· n is equal to the total number of developers in the project that churned the file
· i represents a specific developer of the project
· FreqChurn_{z,i} and SizeChurn_{z,i} represent, respectively, the number of times and the total size of churn performed by developer i in code unit z
· p(FreqChurn_{z,i}) and p(SizeChurn_{z,i}) represent the probability that the file will be churned by developer i and are calculated based on one of the two formulas depending on the approach used to measure the entropy:
p(FreqChurn_{z,i}) = \frac{FreqChurn_{z,i}}{\sum_{j=1}^{n} FreqChurn_{z,j}}
Equation 8: Formula to calculate probability based on frequency of churn
p(SizeChurn_{z,i}) = \frac{SizeChurn_{z,i}}{\sum_{j=1}^{n} SizeChurn_{z,j}}
Equation 9: Formula to calculate probability based on size of churn
The result of the model is a value that indicates how scattered the implementation process of each code unit is, based either on time or on the developers that worked on the project. This approach to measuring the scattering of code churn is novel in the field of software engineering; the only similar approach was presented in (Hassan, et al., 2003) with an entirely different objective. The approach not only gives a measure of the level of scattering of the churn events but also allows the value of the entropy alone to indicate important factors such as the number of developers participating in a code unit and the size and frequency of churn.
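As a concrete illustration of Equations 2 through 9, the following Python sketch computes the entropy of a code unit's churn distribution over any grouping (periods of time for ScattTime, developers for ScattDev). The function name and the per-developer counts are hypothetical and are not taken from the studied projects.

import math

def scattering(churn_by_group):
    # Entropy, in bits, of a code unit's churn distributed across groups,
    # where a group is either a period of time or a developer and the values
    # are either frequencies or sizes of churn (Equations 2-9).
    total = sum(churn_by_group.values())
    probabilities = [c / total for c in churn_by_group.values() if c > 0]
    return -sum(p * math.log2(p) for p in probabilities)

# Hypothetical per-developer churn frequencies for one code unit.
freq_by_dev = {"Dev01": 6, "Dev02": 4, "Dev03": 3}
print(scattering(freq_by_dev))   # ScattDev(Freq), roughly 1.53 bits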
3.3.3 Extracting the Churn Events
Version control systems can be a rich environment for software engineering
researchers to collect empirical data regarding the software development process.
Nevertheless, researchers must have a very good understanding of how the tool is used by the development team when extracting and analyzing data from a version control system. This becomes even more critical when the data collected will be used to draw conclusions about the implementation process of a software development project. A developer who is in the habit of submitting a revision of a file at the end of each day to avoid losing his work can falsely generate a history with a high number of changes in the file, skewing the calculation of the entropy. Different organizations apply different processes regarding how their developers should check in modified code units. Developers may be instructed to submit any changed files at the end of each day to avoid loss of work, or to submit a change only when the entire request is complete. The level of scattering of a code unit is calculated in a way that allows it to be compared against the entropy of other code units. A lack of standards on how revisions are submitted by the developers can make the project unfit for the model presented in this study or can, at least, significantly limit the potential to generalize the results obtained from one project when the goal is to use it across projects or organizations. It is necessary to assess the process used by the developers to submit new revisions of modified code units with the objective of identifying any developer that might have failed to follow the standard process. If a developer uses a significantly different approach and a high level of influence is identified for this developer, it might be necessary to eliminate all code units that this particular developer worked on.
The churn events can be characterized into three types as described in section 3.2. This is consistent with previous studies such as (Mockus, et al., 2000) and (Swanson, 1976). Certain considerations must be made when counting the churn events. This dissertation concentrates on assessing how the pattern of changes influences the quality of the code units of a new software project. It is expected that a certain number of defect removal and improvement modifications will be performed during the development of any new software project. The issue is that certain types of changes can impact the code unit in different ways. Previous studies have shown that the objective of a change performed in a file may have a direct impact on its quality (Mockus, et al., 2000). The assumption is that defect removal activities are less error prone than adding new functionality to the code unit. According to developers interviewed in a previous study (Fritz, et al., 2007), the act of debugging a problem requires more attention from the developer and a better understanding of the implementation of the code unit. Proper categorization of the churn events becomes important in the context of this dissertation because two changes of similar sizes with different objectives might have a different influence on the quality of the code unit.
The impact of different types of changes is usually reduced by the way the changes are considered in the presented model. Previous studies have indicated that the purpose of a change influences its size and interval (Mockus, et al., 2000) (Kemerer, et al., 1997). New code development tends to add the most lines, followed by improvement changes, and then defect removal changes. In addition, it has been shown that corrective changes usually have the shortest interval, followed by improvement changes, and then by new feature implementation. The development of a new software project consists mostly of modifications to add features and remove defects; very few perfective changes are performed.
In this dissertation the churn events of a code unit are counted not only based on their frequency but also based on their size. It is expected that, because defect removal changes tend to be smaller than the other types of changes, they will not have a significant influence when measuring the level of scattering of the churn events of a code unit. Nevertheless, measures must be taken to identify any code unit with a code implementation history highly influenced by a large number of corrective changes, because these changes can impact the calculation of the level of scattering of the development process of that code unit. The code units in this situation were considered outliers. A series of steps can be taken to help identify these potential outliers. If the project uses a system to track defects, it may be possible to map the date of each revision of the code units to the date the defects were marked as fixed in the defect tracking system. It is also possible to look for references in the description section of these systems that may identify the revision associated with the defect. Another approach is to check for keywords such as "defect" or "bug" in the comments section of the revisions of each code unit, or to look for certain patterns of changes based on size or interval as described by Mockus et al. in (Mockus, et al., 2000).
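The keyword heuristic mentioned above can be sketched in a few lines of Python. The list of keywords and the structure of the revision records are assumptions made for illustration; the actual extraction depends on the version control and defect tracking tools in use.

def flag_defect_removal(revisions, keywords=("defect", "bug")):
    # Flag revisions whose comment suggests a defect removal (DR) change.
    # Each revision is assumed to be a dict with at least a 'comment' field.
    flagged = []
    for revision in revisions:
        comment = revision.get("comment", "").lower()
        if any(word in comment for word in keywords):
            flagged.append(revision)
    return flagged

revisions = [
    {"file": "Order.java", "comment": "Fixed bug in total calculation"},
    {"file": "Order.java", "comment": "Added discount feature"},
]
print(len(flag_defect_removal(revisions)))   # 1 potential DR change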
3.3.4 Characterizing the Code Units
The term code unit is used in this study as a generic name for the elements that, combined, constitute a software project. They are the elements of the system that undergo churn and can be a source file, a compiled binary composed of one or more source files, a component, etc. Although the model presented here is generic enough to support different levels of granularity, selecting the proper code unit depends on the final objective of those using the model. If the primary goal is the identification of error-prone modules, selecting a level that is too high might make the identification of these modules inefficient; if too low, the results might not be conclusive. Related research showed that the binary level is more appropriate when correlating churn attributes with defect density (Nagappan, et al., 2005). Because this study deals with data collected from projects developed in a technology in which the mapping between binary and source file is one to one, the source file level was selected. In technologies where a binary is the result of a number of source files, the binary level might represent a better option than the source file, since related changes that affect one binary may be performed together in many different source files.
3.3.5 Defining the Periods of Evolution
An important aspect when measuring the level of scattering along the dimension of time is the definition of the periods of time that will be used to calculate the probabilities. Different studies used distinct methods to divide the development phase of a software project into periods of time. The model was designed to support different approaches. For the validation in this dissertation, the development phase of the studied projects was divided based both on fixed lengths (e.g. 3 weeks) and on the gap between changes that fall in different periods, an approach similar to the one used in (Hassan, et al., 2003). The fixed-length approach consists of defining a specific length for each period of time. In this dissertation, the project implementation phase was divided into periods of three weeks each. This length of three weeks is backed by previous investigations that studied various aspects of the human side of software engineering and in particular the cognitive process of designing and implementing a software system (Detienne, 2001) (Sharp, 1991) (Visser, et al., 1990). In a study performed to assess the knowledge that each developer had of the code he worked on, several subjects stated that they would forget the implementation details if they did not have any contact with the code unit for more than three weeks (Fritz, et al., 2007). Churns performed in a code unit by the same developer at distinct points in time might be as risky as churns performed by multiple developers.
This dissertation grounds many of its assumptions on the cognitive aspects of software development. Fixed lengths alone could lead to the wrong conclusion in cases where changes performed by the same developer occur both at the end of one period and at the beginning of the following one. Although distributed, the distance between these changes is not significant enough to impact the memory of the developer. A set of rules was defined to combine such churn events into the same period of time and avoid situations like these. After splitting the implementation phase of a software project, a code churn was considered part of the previous period of time if all of the following rules applied.
· The gap between the last churn event A in period of time T_x and the first churn event B in T_{x+1} was less than one week and;
· Both churn events A and B were implemented by the same developer and;
· The amount of time of inactivity between churn event A and churn event C is not higher than four weeks.
The objective of the last rule is to set a limit that prevents a period of time from concentrating all the churn events of a code unit that had a high frequency of churn.
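The rules above can be sketched in Python as follows. This is one possible reading of the combination rules, not the exact implementation used in the study; the event representation and the interpretation of churn event C as the earliest event already assigned to the combined period are assumptions.

from datetime import date, timedelta

PERIOD = timedelta(weeks=3)

def assign_periods(events, start):
    # Assign churn events to fixed three-week periods and merge an event at
    # the beginning of a period into the previous one when the three rules
    # above hold. `events` is a date-ordered list of (date, developer) pairs.
    assigned = []  # (period_index, date, developer)
    for when, dev in events:
        period = int((when - start) / PERIOD)
        if assigned:
            prev_period, prev_when, prev_dev = assigned[-1]
            first_when = next(d for p, d, _ in assigned if p == prev_period)
            if (period == prev_period + 1
                    and when - prev_when < timedelta(weeks=1)        # rule 1
                    and dev == prev_dev                              # rule 2
                    and when - first_when <= timedelta(weeks=4)):    # rule 3
                period = prev_period
        assigned.append((period, when, dev))
    return assigned

# The second event falls in the next fixed period but is pulled back into
# period 0 because all three rules hold for this hypothetical case.
events = [(date(2005, 8, 19), "Dev01"), (date(2005, 8, 23), "Dev01")]
print(assign_periods(events, start=date(2005, 8, 1)))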
3.3.6 Entropy Normalization
The maximum value that Shannon's entropy can assume is determined by the number of possible groups into which the events can fall, or the number of symbols the device can generate. If, for example, the frequency of churn of a code unit is distributed among four developers, the maximum value of the entropy will be equal to log_2 4, or 2. If it is distributed among only two developers, the maximum value is reduced to log_2 2, or 1. The value of the entropy alone can therefore be used to account for the number of developers that churned the code unit, for example.
0 ≤ H(X) ≤ \log_2 n
Equation 10: Minimum and maximum value of Shannon's entropy
The measure has significant value when the main objective is to compare the level of scattering of the implementation process among code units from the same software project. Moreover, because the maximum entropy value increases with the number of possible outcomes, a high entropy value alone can give some indication of how many developers participated in the implementation of the code unit. The disadvantage is that the measure depends on specific characteristics of the project (e.g. the number of periods of time or the number of developers), making it harder to use across different projects. A normalized entropy measure is proposed to allow the entropy values to be compared across projects with different numbers of developers and longer or shorter periods of time. The normalized entropy is here called the standardized level of scattering and is calculated based on the following equation:
Std(H(X)) = \frac{H(X)}{\log_2 n} = -\frac{\sum_{i=1}^{n} p(x_i) \log_2 p(x_i)}{\log_2 n}
Equation 11: Standard Shannon's entropy
Where n is either the total number of periods of time in which a file can churn or the number of developers that churned the file. This method normalizes the level of scattering to a value that ranges from 0 to 1.
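A minimal Python sketch of the normalization in Equation 11, assuming the entropy value has already been computed; the function name is illustrative only.

import math

def standardized_scattering(entropy, n):
    # Normalize an entropy value by its maximum, log2(n), where n is the
    # number of periods of time or of developers (Equation 11).
    return 0.0 if n <= 1 else entropy / math.log2(n)

# A code unit churned evenly by four developers: H = 2 and log2(4) = 2.
print(standardized_scattering(2.0, 4))   # 1.0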
The normalized level of scattering is a necessary part of the process to enable the use of the entropy values across projects, but it must be used with caution. There is one important piece of information that the entropy loses when normalized. Consider, for example, a project with sixteen developers where all the developers had the same amount of influence in implementing a code unit. The maximum value for the entropy in this case will be four (log_2 16 = 4). The level of scattering of this file before the normalization demonstrated how scattered its code implementation was. The fact that the entropy is such a high number indicates not only that the code implementation was highly distributed but also that a high number of developers participated in the development of that code unit. This information tends to disappear when the value is normalized. Consider another example where a code unit is churned evenly by two developers. The entropy in this case will be one (log_2 2 = 1). The standardized level of scattering for both cases will be one, when in fact the development of the code unit in the first example was much more scattered than in the second example. An assessment performed on all projects used in the study indicated a cap on the variation of the number of developers that churned a code unit; nevertheless, the loss of information that happens with the normalization must be considered when using the model.
3.4 Evolving Level of Scattering
The level of scattering based on periods of time and on the developers that worked on the project can be combined to allow an evaluation of how the entropy of the code unit changes over time. If the development phase of a software project is divided into periods of time, it is possible to measure the level of scattering of the code development among the developers for each period of time. The entropy can be calculated using either a cumulative or a noncumulative approach. In the noncumulative approach the level of scattering among the developers is calculated for a period of time considering only the changes that happened in that particular period, while in the cumulative approach all changes performed in the code unit up to that point are considered. The importance of these approaches is demonstrated by Figure 11 and Figure 12, which present two scenarios of the code development process for a code unit. The development lifecycle is divided into five periods of time and the number of times that each developer churned the code unit is shown. In both cases, the entropy of the code development for the entire period of time along the developer dimension is exactly the same, since in both scenarios developer 1 (Dev_1) churned the code unit thirteen times, developer 2 (Dev_2) seven times, developer 3 (Dev_3) thirteen times, and developer 4 (Dev_4) twenty-five times. The difference is when these changes occurred. In Scenario 1, each developer churned the code unit in a single period of time, presenting a less chaotic code development history. The entropy calculated using the noncumulative approach considers only the changes performed in a specific period of time. In this case, although both scenarios have the exact same entropy for the entire lifecycle, the noncumulative entropy for Scenario 1 will be zero for each period of time, while it is different from zero for periods of time T1, T3, and T4 in Scenario 2.
Figure 11: Scenario 1 of changes in combined dimensions
Figure 12: Scenario 2 of changes in combined dimensions
Figure 13 and Figure 14 present the noncumulative and cumulative entropy values for each scenario. In Scenario 1 the noncumulative entropy remains zero for all periods of time, but as the project evolves the cumulative entropy increases as more developers influence the code unit. Since the maximum value of the entropy is log_2 n, as presented in Equation 10, if the entropy has reached a value close to two it is possible to assume that at least four developers churned that code unit. If the number of developers is confirmed to be four then, because the entropy is close to its maximum, all the developers have a significant influence on it.
Figure 13: Cumulative and noncumulative entropy for Scenario 1
In Scenario 2 the noncumulative entropy is relatively high for at least three periods of time, as presented in Figure 14 (T1, T3, and T4). This indicates that more than one developer worked on the file in the same period of time, possibly increasing the complexity of that code unit. The cumulative entropy remained stable after the third period of time, indicating that the number of developers stabilized. Period of time T3 is when developer Dev 4 churned the code unit, performing a significant amount of change and consequently increasing his participation in the file considerably. This resulted in a stable value for the entropy from that point on.
Figure 14: Cumulative and noncumulative entropy for Scenario 2
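Both curves can be computed from the same per-period, per-developer churn counts, as in the following Python sketch. The churn numbers used here are hypothetical and are not the values behind Figures 11 through 14.

import math

def entropy(counts):
    # Shannon entropy, in bits, of a list of churn counts.
    total = sum(counts)
    return -sum((c / total) * math.log2(c / total) for c in counts if c > 0)

def evolving_scattering(churn_by_period):
    # Noncumulative and cumulative ScattDev for each period of time, given a
    # mapping of period -> {developer: frequency of churn}.
    running, result = {}, []
    for period in sorted(churn_by_period):
        in_period = churn_by_period[period]
        for dev, freq in in_period.items():
            running[dev] = running.get(dev, 0) + freq
        result.append((period,
                       entropy(list(in_period.values())),   # noncumulative
                       entropy(list(running.values()))))    # cumulative
    return result

churn = {1: {"Dev1": 13}, 2: {"Dev2": 7}, 3: {"Dev3": 13, "Dev4": 10}, 4: {"Dev4": 15}}
for period, noncumulative, cumulative in evolving_scattering(churn):
    print(period, round(noncumulative, 2), round(cumulative, 2))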
3.5 Analysis Framework
The model presented in this dissertation can be generalized into a framework derived from a system built in the studied organization. The objective of this framework is to provide software engineering researchers and practitioners with a method that can be used to predict the risk associated with the development process of a code unit. The method tracks the complexity of the development process of the code units and raises flags when risky behaviors are identified. The concept of information theory is used as described in sections 3.3 and 3.4 to measure the complexity of the implementation process.
The analysis process is described in Figure 15. The process starts by entering the necessary information into the analysis engine, such as the level of granularity of the code unit (e.g. source file, binary, etc.) and the length of each period of time that should be used to calculate the entropy. Once the implementation of the project starts, the system begins to track the churn rates of all code units as they start to get checked in to the version control system. Levels of scattering are calculated continuously, and linear and logistic statistical regression models can be used to evaluate the potential risk of the pattern of changes performed in the code unit. Examples of risky behaviors that could raise alarms are:
· Code units with high levels of scattering
· Code units with constantly high noncumulative levels of scattering throughout the entire project lifecycle
The central part of the process is an analysis engine which uses historical entropy measures combined with failure information to interpret the risk of the current pattern of churn of the code unit in the different dimensions.
3.5.1 Framework Architecture
The analysis engine is the core of the analysis framework. It gathers all the
necessary information, flags potential outliers for manager evaluation, calculates the level
of scattering, and performs a risk assessment to highlight error-prone modules.
[Figure content: the version control system (date of churn, size of churn, developer who churned) and the defect tracking tool (defect rates associated to code units) feed the analysis engine which, given the definition of the code unit and of the period of time and method used to combine churn data, calculates the entropy metric for each period of time, flags potential outliers, and identifies code evolution patterns, producing a risk assessment for each code unit.]
Figure 15: Analysis framework high-level architecture
The analysis engine, as defined for this study, is composed of four components – the data extraction component, the metric calculation component, the outlier identification component, and the analysis and report component.
The data extraction component has two main modules – the configuration module
and the data access module. The configuration module is responsible for collecting all the
information required for the extraction of the data necessary for the analysis process. For
the system used in this study the following information was gathered.
· Granularity of code unit: the granularity of the code unit (e.g. source file,
binary, component, etc.) must be defined so the information extracted from
the version control system is properly processed before the level of scattering
is calculated. The version control system keeps only the information related to
each source file. Any code unit at a level above the source file is usually
formed by one or more files. A binary, for example, may be composed of
multiple source files. It is important to provide the tool with the means to
combine the metrics extracted for the multiple source files.
· Length of period of time for evaluation: the period of time is used to combine
churn information. This is necessary for the calculation of the probability in
the entropy based on periods of time. The framework must know how long
each period of time will last and if any specific measures will be used to
combine the extracted data as described in section 3.3.5.
The data access module performs the core operation of the data extraction
component. It uses the information in the configuration module to extract the proper data
from the version control system and the defect tracking tool. The implemented version of
the data access module extracts the following information.
· Version Control System: used primarily to extract the date the code unit
churned, who churned it and the size of each churn. Other secondary
information may be extracted, such as the description associated with each revision, in case a method similar to the one in (Mockus, et al., 2000) is used to identify outliers as described in section 3.3.3.
· Defect Tracking Tool: defect tracking tools are used to track defects identified
in the software project. This information can be used to better identify
possible outliers or to validate certain patterns of risk.
The metric calculation component uses the data extracted by the data extraction component and calculates the entropies. The process needs to be completely automated and must be executed whenever a new revision of a file is submitted to the version control system or when the end of a specified period of time is reached.
The outlier identification component works as an expert system, applying rules to both the information collected by the data extraction component and the metrics generated by the metric calculation component to flag potential outliers. The rules used in this study to identify outliers are described in sections 3.3.3 and 4.2. Examples would be code units churned by a developer that made incorrect use of the version control system or a code unit that had an unusually high incidence of defect removal changes.
The analysis and report component uses a set of rules and statistical regression techniques to identify error-prone modules. The following methods can be applied in this component as a way to measure the risk associated with the code development of each code unit.
· Statistical linear regression can be used to estimate the number of post-release defects in the code unit based on the level of scattering. The model can be generated using data collected from previous projects as defined in section 4.3.5.
· Logistic regression can be used to calculate the probability that a file will have post-release defects based on the level of scattering. Section 4.3.4 gives a detailed description of how this framework was extended in a way that enables users to define the cut-off value used to classify the code units.
The analysis and report component can report the result using a traffic light to indicate the risk of the pattern of changes. The results can also be reported using just the probability that the code unit will have post-release defects or the number of defects expected in the code unit.
Chapter 4. Model Validation
4.1 Introduction
The goal of this chapter is to validate the conjecture that a more chaotic code
implementation process negatively affects the quality of a code unit. The level of chaos
of the implementation process is measured based on how scattered are the events of churn
associated to the code unit. To quantify the level of chaos of the implementation process
the concept of information theory and Shannon’s Entropy is used.
For the validation, data from a series of projects developed in a large software factory were used. A total of 1,388 source files from four different projects were analyzed. Statistical regression was used to assess the quality of the measures as predictors of the risk and of the number of post-release defects in the code units.
4.2 Empirical Study Design and Preparation
The process of collecting data in an empirical study is crucial for its validity. Studies like the one proposed here are usually invalidated either by noise in the data caused by outliers (Tsay, et al., 2000) or by a data set that is not large enough, resulting in non-significant outcomes. A detailed analysis of the data was done to remove potential outliers from the dataset. This investigation included a thorough examination of the development process to make sure the metrics accurately represented what they were intended to measure.
This section describes the process applied to define and extract the data used in the validation.
4.2.1 Data Set
The data set used in the validation of the model was extracted from four large software development projects. All the projects involved the development of a web-based software system using the same platform – Java 2 Enterprise Edition (J2EE).
Project   Start Date   Size (F.P.)   Size (SLOC)   Num. of Developers   Num. of Source Files
Proj. 1   08/01/2005   1,356         66,623        21                   348
Proj. 2   12/01/2004   3,185         145,846       40                   589
Proj. 3   12/15/2004   1,683         49,206        29                   168
Proj. 4   03/12/2005   1,247         57,873        23                   283
Table 1: Summary of projects used in study
A total of 1,388 source files were used in the study after the outliers were eliminated. Only Java files were considered, and defects found and corrected in non-Java files were ignored.
4.2.2 Defects
The quality of a code unit was measured based on the number and the presence of
post-release defects.
1. Presence of defect: each code unit was flagged as having a post-release defect or not.
2. Number of Defects (Num. Defects): the total number of post-release defects in the code unit.
In this study, any change performed in a code unit after the release of the project is considered a defect. Code churn that occurs after the release of the software in the studied domain predominantly contains defect fixes and performance improvements. The fixes are mostly motivated by problems identified when the project is initially deployed for final client validation. A similar approach was successfully used in (Nagappan, et al., 2005) when analyzing data from the Windows 2003 project.
This approach to counting defects in the code unit can lead to a significant number of false positives; therefore, proper filters need to be applied to make sure that the defects are properly counted. Simply counting all changes after the release of the project could falsely identify related changes as multiple defects when in fact they were all performed to remove one defect. The organization from which the data was extracted used a defect tracking tool (Mantis) as part of its development process. This tool works as a repository where all information on defects found in the software is stored, including a unique identifier. The organization had a process in place requiring that, if a change was performed to remove a defect, the developers include in the description of each revision submitted to the version control system the unique identifier associated with the defect. When counting the number of post-release changes, the text of each revision was considered and multiple revisions associated with the same defect were counted as one defect.
Counting the number of defects based on the number of changes can also lead to counting fewer defects in the code unit than the actual number. This can be caused by situations in which a developer combines a set of fixes in the same revision (Mockus, et al., 2000). Following the same approach described before, as part of the process of extracting data from the version control system, the comments section of each revision was checked for changes associated with multiple entries in the defect tracking system. Cases that referenced more than one ticket were counted as multiple defects.
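The counting procedure can be sketched as follows in Python. The ticket reference format ("#" followed by the Mantis identifier) is an assumption made for illustration; the organization's actual convention may differ.

import re

TICKET = re.compile(r"#(\d+)")   # assumed format of a defect ticket reference

def count_post_release_defects(comments):
    # Count distinct defects from the comments of post-release revisions:
    # several revisions citing the same ticket count as one defect, while a
    # single revision citing several tickets counts as several defects.
    tickets = set()
    for comment in comments:
        tickets.update(TICKET.findall(comment))
    return len(tickets)

comments = ["Fix for #101", "Additional fix for #101", "Fixes #102 and #103"]
print(count_post_release_defects(comments))   # 3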
A validation of the process used to count defects was performed to confirm its accuracy. The number of post-release changes in each code unit was compared against the number of defects documented as fixed in the defect tracking system. The results indicated that the numbers were consistent except for a few cases. Any files with an inconsistent count could potentially skew the results and were eliminated as outliers.
4.2.3 Code Unit
To validate the model it is important to define the level of granularity of the code
unit. In this dissertation the source file was defined as the unit of code to build the change
probability distribution for each period of time and for each developer. Other units of
code can be used as described in section 3.3.4. The source file was chosen in this
particular case because it usually represents a conceptual unit of development where
developers tend to combine related methods and data types. It was also considered that
the study dealt with data collected from Java projects and the compiled version of these
files represents the equivalent to a binary file. Another option would have been to
61
consider the packages or binaries which organize a set of related classes and interfaces
and are usually generated from multiple source files.
4.2.4 Periods of Time
The code implementation model uses periods of time to calculate the change
probability distribution. There are different approaches that can be used to break up the
evolution of a software project into periods. The development phase of all projects used
in the validation of the proposed model was divided in fixed length periods of three
weeks each.
Project   Number of Periods of Time
Proj. 1   6
Proj. 2   12
Proj. 3   7
Proj. 4   7
Table 2: Number of periods of time per project
A series of rules were applied when extracting the churn events associated to each
period of time as described in section 3.3.5. The application of these rules will be
elaborated in the next section.
4.2.5 Data Extraction and Calculation
All projects used in the validation were selected together with project managers as cases that best represent a typical project in the studied context. Because of the approach used to map defects to the individual code units, any project in which the client required major post-release changes was eliminated as an outlier.
Data for the projects were automatically extracted from the version control system and the defect tracking tool used by the organization. The organization that provided data for the study used CVS (CVS) as its version control system and Mantis (Mantis) as its defect tracking tool.
For this study, a series of tools was used to extract and store the attributes of churn for each revision of each file in the project from the version control system (James, et al.) (ViewVC). For each file, each revision was extracted from CVS together with the name of the developer responsible for it, the date and time the revision was submitted, and the number of lines added and deleted. The process to extract and analyze the data was composed of the steps defined in Figure 16. These steps compose the initial part of the process applied by the Analysis Engine as described in Figure 15.
[Figure content: four steps – extract churn information from the version control system; process the extracted information and identify outliers for assessment; calculate the level of scattering; normalize the measure.]
Figure 16: Data extraction and processing steps
The first step of the data extraction consisted of pulling out the code churn information from the version control system. The extracted information included the
date and time when the file had churned, the developer responsible for churning it, the
comments submitted by the developer, and the size of the churn.
Next, the information necessary to calculate the entropy of the development process of each file was derived. The number of developers that churned each file was determined and, for each developer identified, the number of times that the developer churned the file and the size of the churn events associated with that developer were calculated. The version control system used by the organization stored the changes performed in the file and the number of lines added and deleted in each revision; the size of the churn was calculated by adding up both values. The information associated with the developers was calculated for the entire project and for each period of time so the evolution of the entropy could be measured as defined in section 3.4. The periods of time in which the file
churned were also collected in order to calculate the entropy based on time. Related modifications that happened in sequential periods of time were combined based on the specific rules defined in section 3.3.5. Churn events that appeared in sequential periods of time were combined into the same period if (1) the gap between the last churn event A in period of time T_x and the first churn event B in T_{x+1} was less than one week, (2) both churn events A and B were implemented by the same developer, and (3) the amount of time of inactivity between churn event A and churn event C was not higher than four weeks. Similar to the information collected for the developers, for each period of time the number of times the file churned and the size of churn (counted as lines added + lines deleted) were calculated.
Outliers were identified and eliminated. In empirical research, outliers can have a significant impact on the generation of a statistical model, hence the need to remove them (Tsay, et al., 2000). Outliers were assessed and removed at different levels, including the project level. Out of all the projects considered for the research, any project with the following characteristics was eliminated from the study:
1. Non-Java based projects: Java and Microsoft .Net technologies were the most common technologies supported by the studied organization. Most of the tools available to collect the metrics supported Java only. Although only Java-based projects were considered, there is no reason to suspect that the technology has an impact on the study.
2. High number of post-release changes requested by users: although not common, clients may require a large number of post-release changes in a project. It was decided to remove these projects from the study because of the approach chosen to count post-release defects.
Outliers were also considered at the code unit level. Files were removed from the study if they were not Java files, if they were deleted during the development, or if they were created after the release of the project. Outliers were also defined based on the characteristics of their changes. Files were considered outliers if they had been significantly churned by developers who did not properly follow the default revision process. As previously explained, a developer that performed daily commits just to back up his work could considerably skew the results of the entropy when the probability was calculated based on the frequency of churn. The objective of each change was identified by parsing the comments collected from each revision of a file and by comparing them with the information from the defect tracking system. Files with a high incidence of defect removal (DR) modifications during the implementation phase were also marked as potential outliers. An approach similar to the one presented in (Mockus, et al., 2000) was used: any revision with the words defect or bug in the comments field was marked as a potential DR change. The defect tracking tool was also used as a reference to identify possible DR changes. The classification of the type of change was performed by comparing the comments submitted by the developers with each revision against the information on the defects in the defect tracking system.
The level of scattering was calculated based on the periods of time defined and on
the developers that worked in the project. Both the cumulative and the noncumulative
measures were calculated. The cumulative measures consisted of calculating one measure
of the level of scattering considering the entire implementation phase of the project. The
following values were calculated:
1. ScattTime(Freq): the entropy of the code development of the code unit based on the periods of time. This approach used the frequency of the changes that occurred in each period of time as the basis to calculate the probability distribution of the modifications.
2. ScattTime(Size): the entropy based on the periods of time but with the probability distribution calculated based on the size of the modifications.
3. ScattDev(Freq): the entropy of the code implementation process of the code unit based on the developers that worked on the file. Similar to ScattTime(Freq), this approach used the frequency of churn performed by each developer as the basis to calculate the probability distribution of the changes.
4. ScattDev(Size): similar to ScattDev(Freq), this entropy uses the size of the changes performed by each developer to calculate the probability distribution of the changes.
To measure the evolution of the entropy over the development phase the
noncumulative entropy was also calculated. In this case, the level of scattering was
measured only based on the developers. For each period of time the level of scattering of
the code implementation among the developers was measured.
The final step in the process of calculating the level of scattering was to normalize the value. To perform the normalization, each measure of the level of scattering was divided by the maximum value that the entropy could achieve, as defined in section 3.3.6.
4.3 Model Validation
This dissertation demonstrates that code units that have a complex or chaotic code
development process have higher probability of containing post-release faults. The level
of chaos of the development process is determined by the model of the level of scattering
of the changes performed in a code unit as presented in the previous chapter.
Statistical regression models were generated to evaluate the efficiency of the measures of level of scattering in identifying the quality of the code units. Both linear and logistic regression models were used. The linear regression model was applied to estimate the number of post-release defects based on the entropy measure. The logistic regression model was applied to estimate the probability that a file would have a post-release defect.
A statistical procedure called cross-validation (Shao, 1993) was applied to build both regression models. In this procedure two different data sets are used:
· Training set: the set of data used to build the statistical regression model;
· Testing set: the data used to validate the accuracy of the built regression model.
This approach gives a more realistic view of the accuracy of the statistical regression model (Walkerden, et al., 1999). A similar approach has been successfully used in many other studies (Khoshgoftaar, et al., 2001) (Graves, et al., 2000). In both cases, approximately 70% of the code units of each project were used in the training set while the remaining 30% of the files were used as part of the testing set.
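A minimal sketch of this holdout procedure, using scikit-learn, is shown below. The feature matrix and labels are placeholders for the standardized scattering measures and the HasDefect flag of each code unit; it illustrates the 70/30 split rather than reproducing the exact models reported in sections 4.3.4 and 4.3.5.

from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

def fit_defect_classifier(X, y):
    # Hold out roughly 30% of the code units as a testing set and fit a
    # logistic regression of HasDefect on the scattering measures.
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.3, random_state=0, stratify=y)
    model = LogisticRegression().fit(X_train, y_train)
    return model, model.score(X_test, y_test)   # accuracy on the testing set

# Hypothetical standardized scattering values and 0/1 HasDefect labels.
X = [[0.10], [0.20], [0.15], [0.30], [0.05], [0.80], [0.90], [0.85], [0.70], [0.95]]
y = [0, 0, 0, 0, 0, 1, 1, 1, 1, 1]
model, accuracy = fit_defect_classifier(X, y)
print(accuracy)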
As in any empirical research that uses data collected in the field, multicollinearity can have a significant impact when assessing statistical correlations and causality (Boehm, et al., 2000). Scientists prevent this by controlling variables that might be influencing the results. Because of the nature of this study and the source of the data, such an approach cannot be applied here. Nevertheless, this investigation follows the same assumptions defined by Boehm et al. in (Boehm, et al., 2000) during the construction of the COCOMO II model. This dissertation is corroborated by previous related studies and by reasonable assumptions derived from interviews with project managers and developers in the organizations. The proposed research is strongly supported by past related experiments and by the results found in the preliminary investigation.
4.3.1 Preliminary Assessment
The validation process starts by presenting a series of preliminary assessments that led to the current model that measures the level of scattering of the code implementation process. Spearman rank correlation was used to verify the relationship between certain measures of churn and the different measures of defect. Spearman correlation is a well-known and robust correlation technique used by previous researchers (Nagappan, et al., 2005). It can be applied even when the variables violate parametric assumptions, which was the case for some of the studied variables that assumed, for example, categorical values.
To measure the size of an effect, the following standard was adopted to assess the
correlation coefficient:
1. ±0.1 represents a small effect
2. ±0.3 represents a medium effect
3. ±0.5 or larger represents a large effect
All results presented here are statistically significant at the 95% level unless
otherwise specified.
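For reference, a Spearman rank correlation of this kind can be computed with SciPy as in the sketch below. The scattering values and defect counts are hypothetical and serve only to illustrate the call; note that spearmanr reports a two-sided p-value by default, whereas the tables that follow report one-tailed significance.

from scipy.stats import spearmanr

# Hypothetical standardized scattering values and post-release defect counts
# for a handful of code units.
scattering = [0.10, 0.35, 0.50, 0.72, 0.80, 0.95]
defects = [0, 0, 1, 2, 2, 4]

rho, p_value = spearmanr(scattering, defects)
print(round(rho, 3), p_value)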
The study started with a sample of the files collected from the available data set. An initial assessment was performed to correlate the frequency of churn with the defect measures. Files were grouped based on the number of times they churned. The number of files that contained post-release defects was then added up and combined with the initial data. The result is presented in Table 3.
Freq. Churn   # of Files   Files w/ Defects   # of Defects   % w/ Defects   % of Defects Out of Total   % of Files Out of Total
2             67           10                 20             15%            8%                          39.2%
5             40           18                 30             45%            11%                         23.4%
10            37           35                 80             95%            30%                         21.6%
20            22           21                 87             95%            33%                         12.9%
30            3            3                  32             100%           12%                         1.8%
40            1            1                  5              100%           2%                          0.6%
More          1            1                  10             100%           4%                          0.6%
Table 3: Frequency of churn and defect
A total of 264 post-release defects were identified in the sample data. Defects were identified in 89 (or 52%) of the 171 files analyzed. Files that churned five or fewer times represented approximately 39% of the total. Defects were identified in approximately 15% of them, while the number of defects in this group accounted for only 11% of the total number of defects identified in the files. Almost all files that churned six or more times contained a post-release defect. These cases accounted for less than 37% of the total number of files and for 69% of the defects in the project.
A probability analysis was performed to assess the probability of a code unit presenting a defect based on its frequency of churn. Two probabilities were calculated – one that measured the probability of a defect given that the code unit churned a specific number of times, and a second one that measured the probability that a code unit had a defect given that it churned at least a certain number of times.
P(Defect | FreqChurn = n) = \frac{P(Defect ∩ FreqChurn = n)}{P(FreqChurn = n)}
Equation 12: Probability of defect based on a specific frequency of churn
P(Defect | FreqChurn ≥ n) = \frac{P(Defect ∩ FreqChurn ≥ n)}{P(FreqChurn ≥ n)}
Equation 13: Probability of defect based on a range of frequency of churn
Table 4 presents the probabilities for the studied project. There is a probability of more than 90% that files that churned more than four times will have a post-release defect.
All the files that churned 7 or more times in the sample data ended up churning after the release of the project to which they belong.
This analysis of a sample of the data indicates that, although churn alone could be used to identify the risk of post-release defects for files that churned more than 6 times, more than 28% of the post-release defects were located in files that churned less than 6 times. These files represented more than 62% of the files in the project.
Num. Churn Events   Prob. Defect (>=)   Prob. Defect (=)
1                   52%                 5%
2                   77%                 39%
3                   84%                 31%
4                   93%                 47%
5                   95%                 78%
6                   98%                 82%
7                   98%                 100%
8                   97%                 100%
9                   96%                 100%
10                  96%                 100%
11                  96%                 100%
12                  100%                83%
Table 4: Probability of defect based on the number of times code unit churned
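The two conditional probabilities of Equations 12 and 13 can be computed from per-file data as in the following Python sketch. The sample list of (frequency of churn, has defect) pairs is hypothetical and is not the project data behind Table 4.

def prob_defect(files, n, at_least=False):
    # Empirical P(defect | FreqChurn = n), or P(defect | FreqChurn >= n)
    # when at_least is True, following Equations 12 and 13.
    matches = [(freq, has_defect) for freq, has_defect in files
               if (freq >= n if at_least else freq == n)]
    if not matches:
        return None
    return sum(has_defect for _, has_defect in matches) / len(matches)

# Hypothetical (frequency of churn, has post-release defect) pairs.
files = [(1, 0), (2, 0), (3, 1), (5, 1), (8, 1), (12, 1)]
print(prob_defect(files, 3, at_least=True))   # 1.0 for this toy sample
print(prob_defect(files, 2))                  # 0.0 for this toy sample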
4.3.2 Shannon Entropy – Periods of Time
The entropy of the churn events of a code unit based on periods of time measures how chaotic the overall development of that file is throughout the implementation phase. This helps validate the first hypothesis of this study, which states that the level of scattering of the churn events of a code unit may be used to assess the post-release quality of the code unit.
The implementation phase of each project was divided into groups of three weeks, resulting in a different number of groups depending on the duration of each project, as described in Table 5. For the calculation of the entropy it was assumed that each group is a possible signal that can be generated by any file in the project. The probability of each signal was calculated based on the frequency or the size of churn of each file in each one of the groups. The entropy based on frequency focused on the level of scattering of the churn events throughout the project development phase, while the entropy based on size measured how distributed the growth of the file was. A file could have churned a significant number of times in three distinct periods of time while 99% of its code was written in a single period of time.
Project   Periods of Time
Proj. 1   6
Proj. 2   12
Proj. 3   7
Proj. 4   7
Table 5: Number of periods of time per project
A preliminary assessment of how distributed the churn events of the files are
among the groups indicated that, on average, more than 67% of the files in each project
churned in at least two distinct periods of time as indicated in Figure 17.
Figure 17: Distribution of code units based on the number of groups in which they fall
Figure 18 presents the distribution of the defects for all projects based on the number of periods of time in which the files associated with the defects fall. Almost 5% of all defects in the projects fell in files that churned in a single period of time, while more than 50% were identified in code units that churned in two or three periods of three weeks.
Figure 18: Distribution of defects based on the number of groups in which the code units where the defects happened fall
[Chart for Figure 17: y-axis – Perc. of Files that Churned (0% to 40%); x-axis – Num. of Periods of Time (3 weeks), 1 to 11; series – Proj. 1, Proj. 2, Proj. 3, Proj. 4]
The correlation coefficients between the normalized level of scattering based on
the periods of time and the number of defects and the presence of defect (HasDefect) are
presented in Table 6 for both the probability based on frequency and the probability
based on size of churn. Both coefficients are high and significant, with the entropy based
on the distribution of size presenting the best results. This is an indication that the level of
scattering of the churn events based on the size of churn has a higher impact on the
quality of the code units.
The measure of scattering based on entropy presented a highly significant
correlation with all defect measures, with the measures calculated using the size of the
churn presenting better results than the measure based on the frequency of churn for all
projects.
Spearman's rho Num. Defects Has Defects
Proj. 1 Std(ScattTime(Freq)) .435** .442**
Sig. (1-tailed) .000 .000
N 345 345
Std(ScattTime(Size)) .583** .572**
Sig. (1-tailed) .000 .000
N 345 345
Proj. 2 Std(ScattTime(Freq)) .587** .567**
Sig. (1-tailed) .000 .000
N 584 584
Std(ScattTime(Size)) .679** .660**
Sig. (1-tailed) .000 .000
N 584 584
Proj. 3 Std(ScattTime(Freq)) .686** .710**
Sig. (1-tailed) .000 .000
N 168 168
Std(ScattTime(Size)) .722** .719**
Sig. (1-tailed) .000 .000
N 168 168
Proj. 4 Std(ScattTime(Freq)) .580** .562**
Sig. (1-tailed) .000 .000
N 283 283
Std(ScattTime(Size)) .746 ** .723**
Sig. (1-tailed) .000 .000
N 283 283
**. Correlation is significant at the 0.01 level (1-tailed)
Table 6: Correlation analysis between level of scattering of churn events based on periods of time and defect
measures
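As an illustration, rank correlations of this kind can be computed as in the following minimal sketch, which assumes a hypothetical pandas DataFrame with one row per file; it is only illustrative and is not the statistical package used for the reported results.

    import pandas as pd
    from scipy.stats import spearmanr

    # Hypothetical per-file data: normalized scattering measures and defect counts.
    df = pd.DataFrame({
        "scatt_time_size": [0.10, 0.55, 0.80, 0.05, 0.90, 0.35],
        "scatt_time_freq": [0.20, 0.40, 0.95, 0.00, 0.70, 0.50],
        "num_defects":     [0,    1,    3,    0,    2,    1],
    })
    df["has_defect"] = (df["num_defects"] > 0).astype(int)

    for predictor in ["scatt_time_freq", "scatt_time_size"]:
        for target in ["num_defects", "has_defect"]:
            rho, p_two_sided = spearmanr(df[predictor], df[target])
            # Halving the two-sided p-value gives the 1-tailed value when the
            # correlation has the expected (positive) sign.
            print(predictor, target, round(rho, 3), p_two_sided / 2)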
4.3.3 Shannon Entropy – Developers
The concept of entropy is used to measure the scattering of the construction of the
code units among the developers that worked on them. This part of the dissertation
focuses on the hypothesis that a highly distributed implementation of a code unit among
the developers has an impact on its post-release quality.
Two measures were used to calculate the probabilities: the number of times each
developer churned the file and the number of lines added to the file by each developer. The
assumption is that a higher distribution of the implementation of the code unit among the
developers impacts its post-release quality. Table 7 presents the correlation coefficients.
Spearman's rho Num. Defects Has Defects
Proj. 1 Std(ScattDev(Freq)) .708** .638**
Sig. (1-tailed) .000 .000
N 345 345
Std(ScattDev(Size)) .670** .638**
Sig. (1-tailed) .000 .000
N 345 345
Proj. 2 Std(ScattDev(Freq)) .704** .688**
Sig. (1-tailed) .000 .000
N 584 584
Std(ScattDev(Size)) .685** .666**
Sig. (1-tailed) .000 .000
N 584 584
Proj. 3 Std(ScattDev(Freq)) .784** .731**
Sig. (1-tailed) .000 .000
N 168 168
Std(ScattDev(Size)) .689** .649**
Sig. (1-tailed) .000 .000
N 168 168
Proj. 4 Std(ScattDev(Freq)) .738** .709**
Sig. (1-tailed) .000 .000
N 283 283
Std(ScattDev(Size)) .744** .713**
Sig. (1-tailed) .000 .000
N 283 283
**. Correlation is significant at the 0.01 level (1-tailed).
Table 7: Correlation analysis between entropy of churn events based on developers and defect measures
Table 7 shows a highly significant correlation between the number of defects and
the standardized level of scattering based on the developer dimension. Different from
what was found in the study with the entropy based on periods of time, the
correlations between the entropy based on the developers and the defect measures are
higher when the probabilities are calculated using the frequency of churn. This may be an
indication that the mere fact that many developers worked on the code unit increases the
chances of a defect in that code unit.
4.3.4 Logistic Regression
Logistic regression (McCullagh, et al., 1989) is a standard way to model
probability largely used in the medical and social sciences. It is a multiple regression with
an outcome variable that is a categorical dichotomy and predictor variables that are
continuous or categorical. It is used to predict the likelihood that an element will belong
to a specific group. It assumes a nonlinear relationship between the probability and the
explanatory variables, hence no transformation needs to be applied to the variables. The
resulting value from the equation is a probability value that varies between 0 and 1.
\[ P(Y = 1) = \frac{1}{1 + e^{-(\beta_0 + \beta_1 X_1 + \dots + \beta_n X_n)}} \]
Equation 14: Logistic regression equation
Logistic regression is used in this study to validate the use of the measures of the
level of scattering to identify defect-prone code units. Linear regression is not suitable in
this part of the study because the expected response must have values between zero and
one. In this case, the response variable is one if the code unit had a post-release defect,
and zero otherwise.
A logistic regression model was created for the measures of scattering. The
regression model was created using a training set formed by 70% of the data and
validated using the remaining 30%.
Equation 15 describes the logistic model formula. Both the level of scattering
based on the periods of time and the level based on the developers were used as predictors in the
model.
\[ P(\text{defect}) = \frac{1}{1 + e^{-(\beta_0 + \beta_1 \cdot ScattDev(Freq) + \beta_2 \cdot ScattTime(Size))}} \]
Equation 15: Logistic model to estimate probability of defect based on entropy of code churn throughout the project lifecycle
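A minimal sketch of fitting this model is shown below. It assumes a hypothetical pandas DataFrame with one row per file and the columns scatt_dev_freq, scatt_time_size, and has_defect; statsmodels is used here purely for illustration and is not the tool used in the study.

    import numpy as np
    import pandas as pd
    import statsmodels.api as sm

    # Hypothetical training data: one row per file (70% of the files).
    train = pd.DataFrame({
        "scatt_dev_freq":  [0.05, 0.10, 0.50, 0.50, 0.70, 0.90, 0.20, 0.80, 0.30, 0.60],
        "scatt_time_size": [0.00, 0.20, 0.40, 0.40, 0.60, 0.95, 0.10, 0.30, 0.70, 0.55],
        "has_defect":      [0,    0,    0,    1,    1,    1,    0,    1,    0,    1],
    })

    X = sm.add_constant(train[["scatt_dev_freq", "scatt_time_size"]])
    model = sm.Logit(train["has_defect"], X).fit(disp=0)

    print(model.params)          # beta_0, beta_1, beta_2
    print(model.bse)             # standard errors, as reported in Table 8
    print(np.exp(model.params))  # Exp(B): change in odds per unit change
    print(model.prsquared)       # McFadden's pseudo R-squared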
The decision of using both predictors is based on the assumption that each
variable measures the complexity of the code implementation of the code unit from a
different perspective and they tend to complement each other. The code implementation
process may not be highly scattered throughout the project implementation phase but can
be significantly complex when assessed based on the number of developers that
participated in the construction of the code unit. Nevertheless, because both predictors
were strongly correlated, model selection techniques based on the stepwise regressions
(McCullagh, et al., 1992) were applied to try to confirm the need for both predictors in
the model. In stepwise regressions, decisions about the order in which the predictors are
entered into the model are based on a mathematical criterion. The forward method starts
with a model that contains only the constant β0. The method then searches for the
predictor that best predicts the outcome variable by selecting the one that has the highest
correlation value with the outcome. The predictor remains in the model if it improves
significantly the ability of the model to predict the outcome. The following predictors are
selected based on the largest semi-partial correlation with the outcome. The semi-partial
correlation gives a measure of how much new variance in the outcome can be
explained by each remaining predictor. The backward method is the opposite of the
forward method because it starts with all the predictors in the model and then calculates
the contribution of each one by looking at the significance value of the t-test for each
predictor. This significance value is compared against a removal criterion and, if the
predictor meets the removal criterion, it is removed from the model and the model is re-
estimated using the remaining predictors. In this study the backward method is used since
it is less susceptible to the suppressor effects that occur when a predictor has a significant
effect but only when another variable is held constant (Miller, 1990).
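As an illustration, one common variant of backward elimination, based on Wald p-values and a conventional 0.05 removal criterion, is sketched below; the exact criterion used for the reported models may differ.

    import statsmodels.api as sm

    def backward_eliminate(y, X, removal_alpha=0.05):
        # Start with all predictors; repeatedly refit and drop the least
        # significant predictor until every remaining one passes the criterion.
        predictors = list(X.columns)
        while predictors:
            model = sm.Logit(y, sm.add_constant(X[predictors])).fit(disp=0)
            pvalues = model.pvalues.drop("const")
            worst = pvalues.idxmax()
            if pvalues[worst] <= removal_alpha:
                return model, predictors
            predictors.remove(worst)
        return None, []

    # Usage with the hypothetical training frame from the previous sketch:
    # fitted, kept = backward_eliminate(train["has_defect"],
    #                                   train[["scatt_dev_freq", "scatt_time_size"]])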
The log-likelihood statistic was used to evaluate the model. It uses the observed
and predicted values to assess the fit of the model (Tabachnick, et al., 2001). The log-
likelihood is comparable to the residual sum of squares in multiple regression since it is
an indicator of how much unexplained information there is after the model has been
fitted. Therefore large values of the log-likelihood indicate more unexplained
observations. The log-likelihood may be used to calculate a more literal version of the
multiple correlation in logistic regression, known as the R-statistic. This statistic can vary
between -1 and 1 and represents the partial correlation between the outcome and each
predictor used in the model. A positive value indicates that an increase in the predictor
variable results in an increase in the likelihood that the event will occur. A negative value
indicates that the chances of the event occurring decreases as the predictor value
increases. A small R value indicates that the predictor makes a small contribution to the
model. Different approaches are suggested to calculate the equivalent to the R statistic in
logistic regression. The basic R calculation is based on the Wald statistic which tells
whether the b-coefficient associated to the predictor is significantly different from zero. If
the coefficient is different from zero then it can be assumed that the predictor is making a
significant contribution to the prediction of the outcome. The basic R calculation based
on the Wald statistic must be used with caution (Menard, 1995), hence other methods
such as Hosmer and Lemeshow's R², Cox and Snell's R² (Cox, et al., 1989), and
Nagelkerke's R² (Nagelkerk, 1991) are used to validate the performance of the model.
Although these measures are calculated differently, collectively they
can be seen as similar to the R² in linear regression since they provide an estimate of the
significance of the model.
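For reference, these statistics are commonly defined in terms of the log-likelihood of the fitted model, LL(model), the log-likelihood of the baseline model containing only the constant, LL(baseline), and the sample size n. The definitions below are the standard textbook forms, not formulas reproduced from the cited sources.

\[ R_L^2 = \frac{-2\,LL(\text{baseline}) - \left(-2\,LL(\text{model})\right)}{-2\,LL(\text{baseline})} \]
\[ R_{CS}^2 = 1 - e^{\frac{2}{n}\left(LL(\text{baseline}) - LL(\text{model})\right)} \]
\[ R_N^2 = \frac{R_{CS}^2}{1 - e^{\frac{2}{n}\,LL(\text{baseline})}} \]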
A total of four logistic regression models, as described in Equation 15, were
generated (one per project) using the training set. The results are presented below for each
model.
Table 8 demonstrates that the coefficients for both measures of the level of scattering
are significantly different from zero, supporting the hypothesis that change properties do
affect the probability of failure, at least in the studied projects. The value of Exp(B) works
as an indicator of the change in odds resulting from a unit change in the predictor. If the
value of Exp(B) is greater than 1, then as the predictor increases, the odds
of the outcome occurring increase. All projects had Exp(B) values higher than 1,
indicating that an increase in the measures of the level of scattering increases the
probability of defect in the files.
95% CI for Exp(B)
Project Predictors B(SE) Lower Exp(B) Upper
Proj. 1
Constant -3.88* (.56)
ScattDev(Freq) 9.18* (1.50) 508.97 9660.43 1.83E+05
ScattTime(Size) 3.09** (.97) 3.30 21.897 145.41
Proj. 2
Constant -5.71* (.66)
ScattDev(Freq) 11.92* (2.00) 3.00E+03 1.51E+05 7.58E+06
ScattTime(Size) 7.73* (1.39) 148.08 2.26E+03 3.46E+04
Proj. 3
Constant -4.86* (1.10)
ScattDev(Freq) 10.41* (2.79) 139.54 3.31E+04 7.83E+06
ScattTime(Size) 9.84* (2.54) 128.52 1.88E+04 2.74E+06
Proj. 4
Constant -6.53* (1.04)
ScattDev(Freq) 12.748* (2.69) 1762.49 3.44E+05 6.71E+07
ScattTime(Size) 9.024* (1.91) 196.58 8.03E+03 3.50E+05
* p < .001. ** p < .01. *** p < .05.
Table 8: Logistic regression model results
Table 9 presents the R-statistic coefficients for the regression models. All models
resulted in high R coefficients indicating a good fit of the model.
Project    R² (Hosmer & Lemeshow)    R² (Cox & Snell)    R² (Nagelkerke)
Proj. 1    .393                      .418                .559
Proj. 2    .583                      .517                .725
Proj. 3    .641                      .587                .784
Proj. 4    .620                      .528                .752
Table 9: R-statistic coefficient for generated models
Residual analysis was performed in each project to isolate points for which the
model fitted poorly, and to isolate points that exert an undue influence on the model. The
information collected from this type of analysis can usually be used to either flag
potential outliers or to assess the overall performance of the model. Standardized residuals
were used to isolate points for which the model fitted poorly. This statistic has the
property that 95% of the cases in an average, normally distributed sample should have
values that lie within ±1.96, and 99% of cases should have values that lie within ±2.58.
Values outside ±3 should raise concern, and any outside ±2.5 should be examined
more closely. Cook's distance and DFBeta are the recommended measures to isolate points
that exert high influence on the model: Cook's distance should be less than 1, and the
absolute value of DFBeta should also be less than 1. Leverage was also used to identify whether
certain cases were wielding undue influence over the model. Leverage values fall between 0 (no
influence) and 1 (complete influence). The expected leverage (average leverage) is
(k+1)/N, where k is the number of predictors and N is the sample size. The
recommendation is to check for cases with a leverage value greater than three times the
average value. All data points which the residual analysis indicated as heavily influencing
the results were analyzed. An investigation was performed with the developers and
project managers to identify the reason for the level of influence, and the files whose
measures were dominated by either the number of defect removal (DR) changes or by the number of
changes due to improper use of the version control system were removed as outliers.
Once the identified outliers were removed, the analysis was performed again, generating the final
results demonstrated in Table 8.
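A minimal numpy sketch of the two simplest checks described above (standardized residuals beyond ±2.5 and leverage above three times the average) is shown below for a linear design matrix; it is only an approximation of the full set of diagnostics, which for the logistic models would use the corresponding generalized versions of these statistics.

    import numpy as np

    def flag_influential(X, y, y_hat):
        # X: n x (k+1) design matrix including the constant column.
        n, p = X.shape
        hat = X @ np.linalg.inv(X.T @ X) @ X.T
        leverage = np.diag(hat)
        resid = y - y_hat
        sigma = np.sqrt(np.sum(resid ** 2) / (n - p))
        std_resid = resid / (sigma * np.sqrt(1 - leverage))
        high_resid = np.abs(std_resid) > 2.5          # examine more closely
        high_leverage = leverage > 3 * p / n          # 3 * (k + 1) / N
        return np.where(high_resid | high_leverage)[0]

    # Usage: flag_influential(design_matrix, observed, fitted_values)
    # returns the indices of the cases to review with the project team.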
Once the logistic regression models were built they were validated using the
testing set data. The testing set data was generated by randomly selecting 30% of the files
in each project.
With the β0, β1, and β2 values for each project, the built models were used to
predict the probability of post-release defects in the files from the testing set. The levels
of scattering of a file were considered risky if the predicted probability of post-release
failure was above a cutoff value. When choosing a cutoff value, two factors need to be
balanced:
· The percentage of files with certain levels of scattering that do not have
post-release defects but are identified as risky. This group is categorized as
type I error and generates wasted effort if the project managers try to reduce
their risk.
· The percentage of files that contain post-release defects but are not identified
as risky. This group is categorized as type II error and can result in defects
being passed to the clients.
A series of factors needs to be considered when choosing an appropriate cutoff
value. High-dependability projects may have a very low tolerance for defects, requiring a
low cutoff value. The cutoff may also be defined based on the availability of
resources; the cutoff value may be raised or lowered according to the availability of
the testing resources.
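A minimal sketch of how the two error rates can be tabulated for a range of cutoff values is shown below; the predicted probabilities and labels are hypothetical, and the error rates are expressed here as a share of all files, which may differ slightly from the exact convention of Table 10.

    import numpy as np

    def error_rates(probabilities, has_defect, cutoffs):
        # For each cutoff: type I = clean files flagged as risky,
        #                  type II = defective files not flagged.
        probabilities = np.asarray(probabilities, dtype=float)
        has_defect = np.asarray(has_defect)
        n = len(has_defect)
        rows = []
        for c in cutoffs:
            flagged = probabilities >= c
            type1 = np.sum(flagged & (has_defect == 0)) / n
            type2 = np.sum(~flagged & (has_defect == 1)) / n
            rows.append((c, type1, type2))
        return rows

    probs  = [0.10, 0.30, 0.55, 0.70, 0.90, 0.20, 0.80]  # hypothetical model output
    actual = [0,    0,    1,    1,    1,    1,    0]
    for c, t1, t2 in error_rates(probs, actual, np.arange(0.1, 1.0, 0.1)):
        print(round(c, 1), round(t1, 2), round(t2, 2))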
Figure 19 presents the relationship between the error types I and II and the cutoff
values. A graph is generated for each project analyzed. The errors have a very similar
trend in all projects.
Figure 19: Type I and II errors for different cutoff values
Further analysis of the error plots indicated that one cutoff value would be enough
to identify files with a high or low risk of having post-release defects. A cutoff value of
0.5 was used to validate the models generated. Table 10 presents the performance of the
models when running the testing data through them using the cutoff value of 0.5. All
models have a satisfactory performance in identifying the error-prone files.
Project    No Defect Identified    Defect Identified    Overall Success %    Type I Error    Type II Error
Proj. 1    78%                     81%                  80%                  10%             10%
Proj. 2    95%                     74%                  89%                  4%              7%
Proj. 3    72%                     77%                  75%                  11%             14%
Proj. 4    94%                     87%                  91%                  4%              5%
Table 10: Model performance using testing set
Following the same approach defined in (Mockus, et al., 2000), a system was
created based on the logistic regression so that not only would a flag be raised, but an
explanation would also be provided to the project manager of why the source file was flagged.
For example, a flag would be raised if a specific measure of the level of scattering of a
file is above 95% of the level of scattering of all the other files in the project.
The methodology was embedded into a system as defined in section 3.5 to be
used by the project manager to more easily identify areas of risk in the project and
improve allocation of testing resources. At certain periods of time the tool updates the
information to flag the files based on the measures of the level of scattering. Each time
the update is executed the following steps are performed:
1. Extract the necessary information to calculate the level of scattering of the
files in the developer and time dimensions;
2. Fit the logistic regression model as defined in Equation 15 based on the
history of changes of the file;
3. Use the fitted model to estimate the risk of the pattern of changes of all the
files in the project;
4. Calculate the probability of post-release defects and flag the files according to
the predefined cutoff value.
The project manager can then be notified of all the files flagged by the system and
the reason why they were flagged (e.g. high level of influence of a large number of
developers or a file with the code development highly distributed during the entire project
implementation).
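A minimal sketch of such a periodic update routine is shown below. Every name in it is a hypothetical placeholder for the steps described above, and the reported "reason" is deliberately simplistic; it is not the actual framework implementation.

    import math

    def flag_risky_files(measures, beta, cutoff=0.5):
        # measures: file name -> (scatt_dev_freq, scatt_time_size), both normalized.
        # beta: (b0, b1, b2) coefficients of the fitted logistic model.
        b0, b1, b2 = beta
        flagged = []
        for file_name, (dev_freq, time_size) in measures.items():
            p = 1.0 / (1.0 + math.exp(-(b0 + b1 * dev_freq + b2 * time_size)))
            if p >= cutoff:
                reason = ("churn spread over many developers"
                          if b1 * dev_freq >= b2 * time_size
                          else "churn spread over many periods of time")
                flagged.append((file_name, round(p, 2), reason))
        return flagged

    # Hypothetical coefficients and per-file measures:
    print(flag_risky_files({"Foo.java": (0.9, 0.2), "Bar.java": (0.1, 0.1)},
                           beta=(-4.0, 9.0, 3.0)))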
4.3.5 Linear Regression
In statistics, linear regression refers to any approach to model the relationship
between one or more variables denoted Y and one or more variables denoted X, such that
the model depends linearly on the unknown parameters to be estimated from the data.
This model is called a linear model. Most commonly, linear regression refers to a model
in which the conditional mean of Y given the value of X is a function of X.
Linear regression has many practical uses. Most applications of linear regression
fall into one of the following two categories:
· If the goal is prediction, or forecasting, linear regression can be used to fit a
predictive model to an observed data set of Y and X values. After developing
such a model, if an additional value of X is then given without its
accompanying value of Y, the fitted model can be used to make a prediction
of the value of Y.
· Given a variable Y and a number of variables X1, ..., Xp that may be related to
Y, then linear regression analysis can be applied to quantify the strength of the
relationship between Y and the Xj, to assess which Xj may have no
relationship with Y at all, and to identify which subsets of the Xj contain
redundant information about Y, thus once one of them is known, the others
are no longer informative.
The linear regression models have the following form, where Y is the dependent
variable and the Xi are the predictors or independent variables.
\[ Y = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \dots + \beta_p X_p + \varepsilon \]
Equation 16: Linear regression equation
Two statistical linear regression models for each project were generated to
validate that the measures of the level of scattering are better indicators of post-release
defects than the frequency of churn alone. The first linear model used the two measures
of the level of scattering as the predictor variables, as demonstrated in Equation 17. The
first predictor variable in model (A) is the normalized level of scattering based on periods
of time and the second variable is the normalized level of scattering based on the
developers that churned the files. Each chosen variable used a different approach to
calculate the probability. The choices were made based on the correlation analysis
presented in sections 4.3.2 and 4.3.3. The chosen level of scattering based on the periods of time
used the size of the changes to calculate the probability, while the level of scattering based
on the developers used the probability based on how many times each
developer churned the code.
\[ NumDefects = \beta_0 + \beta_1 \cdot Std(ScattTime(Size)) + \beta_2 \cdot Std(ScattDev(Freq)) \]
Equation 17: Linear regression equation (A) based on the level of scattering
The second linear model presented in Equation 18 used only the frequency of
churn for the entire implementation phase as the predictor variable. Frequency of churn
has been the most commonly used attribute of churn to assess the quality of software.
\[ NumDefects = \beta_0 + \beta_1 \cdot Freq(Churn) \]
Equation 18: Linear regression equation (B) based on the frequency of churn
The dependent variable in the linear models is the number of defects in each
code unit. As previously described, the number of defects was determined based on the
number of post-release changes performed in each code unit.
Each parameter of the statistical linear regression model was estimated using the
training set that consisted of 70% of the code units for each project randomly selected.
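A minimal sketch of fitting both models on such a training set is shown below, using statsmodels OLS on a hypothetical DataFrame; it is for illustration only and is not the statistical package used in the study.

    import pandas as pd
    import statsmodels.api as sm

    # Hypothetical training data: one row per code unit.
    train = pd.DataFrame({
        "scatt_time_size": [0.0, 0.2, 0.5, 0.9,  0.1, 0.7, 0.4, 0.8],
        "scatt_dev_freq":  [0.1, 0.3, 0.4, 0.95, 0.0, 0.6, 0.5, 0.7],
        "churn_freq":      [1,   2,   4,   9,    1,   6,   3,   7],
        "num_defects":     [0,   0,   1,   4,    0,   2,   1,   3],
    })

    # Model (A): the two measures of the level of scattering as predictors.
    X_a = sm.add_constant(train[["scatt_time_size", "scatt_dev_freq"]])
    model_a = sm.OLS(train["num_defects"], X_a).fit()

    # Model (B): the frequency of churn as the only predictor.
    X_b = sm.add_constant(train[["churn_freq"]])
    model_b = sm.OLS(train["num_defects"], X_b).fit()

    print(model_a.params, model_a.rsquared)
    print(model_b.params, model_b.rsquared)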
Table 11 shows the information for the statistical linear regression model for
each project, and the estimated β0, β1, and β2 parameters. The parameter β2 is missing for
all (B) models since this model uses only the frequency of churn as a predictor. The R²
statistic indicates the quality of the fit: the higher the value of R², the better the fit. A
zero R² indicates that no relationship exists between the dependent variable (the number of
defects in the code unit) and the predictor variables (the measures of the level of scattering
for model A, or the frequency of churn for model B). The models that use the level of scattering as
the predictor variables show a better fit in all cases when compared to the model that uses
the frequency of churn as the predictor, except for Proj. 3.
SLR Model Parameters
Project    Model    β0               β1              β2               R²      Sig.
Proj. 1    (A)      -.47 (.152)**    4.53 (.514)*    1.50 (.489)**    .441    *
           (B)      .33 (.123)**     0.16 (.015)*    -                .311    *
Proj. 2    (A)      -.24 (.043)*     1.68 (.218)*    2.173 (.220)*    .619    *
           (B)      .10 (.039)*      .07 (.004)**    -                .482    *
Proj. 3    (A)      -.41 (.190)***   3.25 (.710)*    3.608 (.700)*    .576    *
           (B)      -.30 (.146)***   .34 (.023)*     -                .671    *
Proj. 4    (A)      -.34 (.071)*     2.06 (.298)*    3.049 (.327)*    .662    *
           (B)      .21 (.060)*      .54 (.004)*     -                .472    *
* p < .001. ** p < .01. *** p < .05
Table 11: Detail of the statistical linear regression models for the studied projects
The testing data set was used to validate the performance of each statistical
linear regression model. The testing data set is composed of 30% of randomly selected
code units from each project.
With each parameter β0, β1, and β2 estimated for each case, the regression
models were used to predict the number of post-release defects that occurred in each code
unit that was part of the testing data set. For each model with the calculated parameters
β0, β1, and β2, or just β0 and β1, the result is a ŷi for each combination of predictor
variables, where ŷi is the number of expected defects in the code units used in the testing
data set.
\[ \hat{y}_i = \beta_0 + \beta_1 x_{1,i} + \beta_2 x_{2,i} \]
Equation 19: Number of expected defects in the code units
The absolute prediction error is defined in Equation 20, where yi is the actual
number of defects that occurred in each code unit in the testing data set.
\[ e_i = \lvert \hat{y}_i - y_i \rvert \]
Equation 20: Absolute prediction error of each model
The total prediction error for all code units in each project can be calculated
based on Equation 21.
\[ E = \sum_i e_i = \sum_i \lvert \hat{y}_i - y_i \rvert \]
Equation 21: Total prediction error of each model
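A minimal sketch of computing the predicted values, the absolute errors, and the total error E on a testing set is shown below; the arrays are hypothetical.

    import numpy as np

    def total_prediction_error(y_actual, y_predicted):
        # E = sum over the testing set of |predicted - actual| defects.
        errors = np.abs(np.asarray(y_predicted, dtype=float) - np.asarray(y_actual, dtype=float))
        return errors, errors.sum()

    y_actual = np.array([0, 1, 3, 0, 2])             # observed defects per code unit
    y_hat_a  = np.array([0.2, 1.4, 2.1, 0.1, 1.8])   # predictions from model (A)
    y_hat_b  = np.array([0.5, 0.6, 1.2, 0.9, 1.0])   # predictions from model (B)

    errors_a, E_a = total_prediction_error(y_actual, y_hat_a)
    errors_b, E_b = total_prediction_error(y_actual, y_hat_b)
    print(E_a, E_b)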
Table 12 shows the details of the total prediction error E for each of the models
generated. The table also shows the percentage of improvement in prediction when the
model based on the measures of the level of scattering (model A) was used instead of the
model based on the frequency of churn alone (model B).
Each statistical regression model (A and B) was generated to predict the number
of defects, hence their accuracy can be compared using this information. The accuracy of
each statistical model is defined by the amount of error in its predictions: the lower the
prediction error, the higher the quality of the model. In this case, if EA is the total
prediction error of statistical model A and EB is the total prediction error of
statistical model B, then EA > EB implies that model B has better accuracy than model A.
Based on Table 12, model A, which is based on the two measures of the level of
scattering of the churn events, has a better accuracy than statistical model B that was
generated based on the frequency of churn. For Project 1, for example, model A has
shown an improvement of 21% when compared to model B.
Project    SLR Model    Error (E)    Improvement in E    T-Test P(H0 holds)
Proj. 1    (A)          89.21        21%                 1.38E-02
           (B)          112.32
Proj. 2    (A)          82.13        11%                 1.64E-02
           (B)          92.19
Proj. 3    (A)          38.57        33%                 4.00E-02
           (B)          57.4
Proj. 4    (A)          47.16        21%                 1.91E-02
           (B)          59.48
Table 12: Statistical model prediction error for each project
It is important to validate whether the difference in prediction error is statistically
significant. The statistical significance test is performed through a paired t-test. The
following test hypotheses are formulated.
\[ H_0: \mu(e_{B,i} - e_{A,i}) = 0 \]
\[ H_a: \mu(e_{B,i} - e_{A,i}) > 0 \]
Equation 22: T-test Hypotheses
The term µ(eB,i – eA,i) is the population mean of the difference between the absolute
errors of each observation pair. The t-test can be safely used since the data set is large enough
and the t-test is robust to non-normally distributed data. An alternative for smaller
data sets would be to use a non-parametric test such as the U-test.
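A minimal sketch of this paired test using scipy is shown below; the per-unit absolute errors are hypothetical, and the two-sided p-value is halved to obtain the one-sided value when the mean difference has the expected sign.

    import numpy as np
    from scipy.stats import ttest_rel

    # Hypothetical per-code-unit absolute prediction errors of models A and B.
    errors_a = np.array([0.2, 0.4, 0.9, 0.1, 0.2])
    errors_b = np.array([0.5, 0.4, 1.8, 0.9, 1.0])

    t_stat, p_two_sided = ttest_rel(errors_b, errors_a)
    p_one_sided = p_two_sided / 2 if t_stat > 0 else 1 - p_two_sided / 2
    print(round(float(t_stat), 3), round(float(p_one_sided), 4))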
If the null hypothesis H0 holds, then the difference in prediction error is not significant;
hence H0 needs to be rejected with a high probability. Table 12 demonstrates that a t-test
on paired observations of absolute error was significant at better than 0.04. This indicates
that the improvement in prediction error between models A and B is statistically
significant for all studied projects with a confidence of over 95%. Therefore, it is possible
to reject the null hypothesis H0 and conclude that model A, which is based on the
measures of scattering, has a significantly better accuracy than model B, which is based
on the frequency of the changes. This confirms the conjecture that the level of scattering
based on entropy is a good indicator of the level of chaos and consequently the
complexity of the code unit.
Chapter 5. Summary of Contribution and Future Research
Directions
5.1 Introduction
This chapter summarizes the contributions made to the software engineering
community by this study. It also discusses a few potential shortfalls of the model in its
current state and gives suggestions for future research directions.
5.2 Summary of Contributions
1. Software engineering practitioners need to have a good understanding of how
the code implementation process impacts quality. Frequency and size of churn
have always been the preferred attributes of researchers who investigated the
impact that changes in the code can have on a software system (Mockus, et al., 2000)
(Nagappan, et al., 2005) (Karunanithi, 1993) (Graves, et al., 2000) (Munson,
et al., 1998). Other attributes of churn such as the objective of churn or the
diffusion of the churn through different code units were also considered in
certain studies in the field of defect prediction (Mockus, et al., 2000) (Hassan,
et al., 2003). The problem is that all these attributes of churn provide a very
superficial understanding of how the software was developed, resulting in a
poor interpretation of how the way the code is implemented impacts the
quality of the software. This dissertation focused on defining a method to
model the code implementation process of code units using the concepts of
Information Theory and Shannon’s Entropy. As a result, a set of metrics that
measure how scattered the implementation of a code unit is were created.
These new metrics are among the most important contributions of this
dissertation. They can be used to efficiently and accurately model the
code implementation process in multiple dimensions, promoting a better
understanding of the risks associated with certain patterns of implementation of
software projects.
2. Chapter 4 validated the efficiency of the new measures of scattering in
predicting the post-release quality of the code units. Approximately 1,388
source files from four large projects, on which more than 100 developers
combined worked, were used in the validation process. Both logistic and linear
statistical regression models were generated using the level of scattering
metrics to identify the probability that a file would have post-release defects and
the number of defects, respectively. Two statistical regression models were
generated when the validation was performed using the linear statistical
regression model; the first model used the measures of scattering and the
second model used the pre-release frequency of churn of the files. The linear
regression model using the measures of scattering had significantly better
accuracy than the model using the pure frequency of churn.
3. An analysis framework was created as described in section 3.5. The
framework was a result of the experience creating a system that could
automate the process of extracting the churn information from the version
control system, calculating the measures of scattering, applying the measures to the
proper statistical regression models, and providing feedback to the user on the
risk of certain patterns of churn in the code units. The framework describes a
generic methodology that assists in the implementation of a system capable of
calculating and using the metrics of scattering to assess areas of concern in a
software project.
4. This dissertation promoted a better understanding of the software factory
environment in section 1.3.1. It gives detailed information on how software factories
work, their business model, and some of the key challenges they
face.
5.3 Future Research Directions
1. Shannon’s entropy is an innovative approach to measuring how the churn
events of a code unit are scattered, but it still does not capture all aspects that
need to be considered when assessing how scattered the events of churn are.
This is particularly true when measuring the distribution of the events using
the time dimension. Consider the situation presented in Figure 20: files A
and B have exactly the same entropy, ignoring the fact that file B had a much
larger gap between its changes.
Figure 20: Two code units with similar pattern of changes but with a significant gap between events
2. To provide a more detailed assessment of the level of scattering of the churn
events throughout the project lifecycle, the entropy may need to be combined
with other churn attributes. Two main attributes that can be considered in
future studies are the gap between churn events and churn duration (or
variation of these metrics).
3. Another area with room for more detailed investigation is the level of
scattering based on the developer dimension. Several studies have
demonstrated an impact of the skill of the developers in the quality, cost, and
schedule of a software project (LaToza, et al., 2007) (DeMarco, et al., 1999)
(Detienne, 2001) (Fritz, et al., 2007) (Mockus, et al., 2000), (Visser, et al.,
1990) (Turner, et al., 2003) (Boehm, et al., 2000). There is a general belief
among project managers that they have a good idea of the skill of each
developer in their team and they tend to use this knowledge to assign the best
developers to the most difficult tasks. The fact that project managers tend to
assign the most difficult tasks to the best developers, combined with the fact
that the skill of developers has an impact on the quality of the code, may obscure
the existing correlation between the measures of scattering and code quality. The
assumption is that many highly skilled developers churning a code unit is less
risky than low-skilled developers churning it at the same level of
scattering. Future studies may consider combining the skill of the developers
with the level of scattering of the code churn events among the developers to
provide a more complete view of how the level of participation of these
developers impacts the quality of code units.
4. The study is based on the assumption that the pattern of changes performed in
a code unit can have an impact on that code unit. It ignores the potential
impact that the changes may have on other code units in the same system.
Future investigations may assess how the level of scattering of the churn
events of a code unit may impact other code units that are somehow
associated with the code unit that had a chaotic implementation process.
5. The analysis framework presented in section 3.5 can be extended so more information
is combined and used to improve the quality of the results. Defect rates
can be important, for example, in cases where there are low or no failure rates
associated with code units with a high level of entropy. If a code unit with a
complex development process is churning frequently, then a certain rate of failures
would be expected, and zero failures may indicate a testing problem.
Bibliography
Basili V. R. and Perricone B. T. Software errors and complexity: An empirical
investigation [Journal] // Communications of the ACM. - January 1984. - 1 : Vol. 27. -
pp. 42-52.
Boehm B. and Basili V. R. Software Defect Reduction Top 10 List [Journal] //
Computer. - [s.l.] : IEEE Computer Society Press, January 2001. - 1 : Vol. 34. - pp. 135-
137.
Boehm Barry W. [et al.] Software Cost Estimation with Cocomo II [Book]. -
Upper Saddle River : Prentice Hall PTR, 2000.
Boehm Barry W. Improving Software Productivity [Journal] // Computer. - Los
Alamitos : IEEE Computer Society Press, 1987. - 9 : Vol. 20. - 0018-9162.
Boehm Barry W. Software Engineering Economics [Book]. - Upper Saddle
River, NJ : Prentice Hall PTR, 1981.
Bremer R.W. Positions paper for panel discussion: the economics of program
productions [Conference] // Information Processing. - North Holland, Amsterdam : [s.n.],
1969. - pp. 1676-1677.
Breyfogle F. W. Implementing Six Sigma: Smarter Solutions Using Statistical
Methods [Book]. - New York : John Wiley & Sons, Inc., 1999.
Brooks Frederick. P. The Mythical Man-Month: Essays on Software
Engineering [Book]. - [s.l.] : Addison Wesley Professional, 1995.
Chulani Sunita, Steece Bert M. and Boehm Barry Determining Software
Quality Using COQUALMO [Book Section] // Case Studies in Reliability and
Maintenance. - Sherman Oaks : John Wiley & Sons, 2003.
Cox D. R. and Snell D. J. The analysis of binary data (2nd edition) [Book]. -
London : Chapman & Hall, 1989.
Curtis Bill, Krasner Neil and Iscoe Neil A field study of the software design
process for large systems [Journal]. - New York : ACM, 1988. - 11 : Vol. 31.
CVS - Open Source Version Control [Online] // CVS - Concurrent Versions
System. - http://www.nongnu.org/cvs/.
DeMarco Tom and Lister Timothy Peopleware: Productive Projects and Teams
[Book]. - New York : Dorset House, 1999.
Deming Edwards W. Out of the crisis [Book]. - Cambridge : MIT Center for
Advanced Engineering Study, 1982.
Detienne Francoise Software Design - Cognitive Aspects [Book]. - London :
Springer, 2001.
Eick S.G. [et al.] Does code decay? Assessing the evidence from change
management data [Article] // IEEE Trans Software Engineering. - 1990. - 27. - Vol. 1. -
pp. 1-12.
Enterprise Business Intelligence & Analytics Applications - Spotfire [Online] //
TIBCO. - http://spotfire.tibco.com/.
Foundation The Apache Software Struts [Online] // Struts. - The Apache
Software Foundation. - http://struts.apache.org/.
Fritz T., Murphy G. C. and Hill E. Does a programmer's activity indicate
knowledge of code? [Conference] // Proceedings of the the 6th Joint Meeting of the
European Software Engineering Conference and the ACM SIGSOFT Symposium on the
Foundations of Software Engineering. - Dubrovnik : ACM, 2007. - pp. 341-350.
Gayatri Rapur Assessment of Open-Source Software for High-Performance
Computing // master's thesis. - [s.l.] : Department of Computer Science and Engineering,
Mississippi State, Mississippi, August 2003.
Gilb Tom Principles of software engineering management [Book]. - Boston :
Addison-Wesley Longman Publishing Co., Inc., 1988.
Graves T.L. [et al.] Predicting fault incidence using software change history
[Article] // Transactions on Software Engineering. - July 2000. - 7 : Vol. 26. - pp. 653-
661.
Hassan A.E. and Holt R.C. Studying the chaos of code development
[Conference] // Proceedings. 10th Working Conference on Reverse Engineering. - 2003. -
pp. 123-133.
Humphrey W. S. [et al.] Future directions in process improvement [Journal] //
Crosstalk. - February 2007.
Humphrey W. Software and the factory paradigm [Journal] // Software
Engineering Journal. - Herts : Michael Faraday House, 09 1991, 1991. - 5 : Vol. 6. - pp.
370-376.
Humphrey W.S. TSP: Coaching Development Teams [Book]. - [s.l.] : Addison-
Wesley, 2006.
Humphrey Watts A Discipline for Software Engineering [Book]. - [s.l.] :
Addison-Wesley Professional, 1994.
Illes-Seifert Timea and Paech Barbara Exploring the relationship of history
characteristics and defect count: an empirical study [Conference] // Proceedings of the
2008 workshop on Defects in large software systems. - Seattle : ACM, 2008. - pp. 11-
15. - 978-1-60558-051-7.
James David [et al.] JCVSReport: Easy Progress Reports for CVS/Java Projects
[Online] // JCVSReport: Easy Progress Reports for CVS/Java Projects. -
http://www.cs.toronto.edu/~james/JCVSReport/.
Jones Capers Programming productivity [Book]. - New York : McGraw-Hill,
Inc., 1986. - 0-07-032811-0.
Juran J.M. and Frank M. Gryna Juran's Quality Control Handbook [Book]. -
New York : McGraw-Hill Book Company, 1988. - 4th.
Kachigan Sam Statistical Analysis An Interdisciplinary Introduction to
Univariate & Multivariate Methods [Book]. - New York : Radius Press, 198. - 0-942154-
99-1.
Karunanithi N. A Neural Network approach for Software Reliability Growth
Modeling in the Presence of Code Churn [Article] // Proceedings of International
Symposium on Software Reliability Engineering. - 1993. - pp. 310-317.
Karunanithi N. A Neural Network Approach for Software Reliability Growth
Modeling in the Presence of Code Churn [Conference] // Fourth International
Symposium on Software Reliability Engineering. - Denver : [s.n.], 1993. - pp. 310-317.
Kemerer C. F. and Slaughter S.A. Determinants of software maintenance
profiles: An empirical investigation [Article] // Software Maintenance: Research and
Practice. - 1997. - 9. - Vol. 4. - pp. 235-251.
Khoshgoftaar T. M. [et al.] Detection of software modules with high debug code
churn in a very large legacy system [Conference] // Proceedings of the The Seventh
International Symposium on Software Reliability Engineering (ISSRE '96). -
Washington : IEEE Computer Society, 1996.
Khoshgoftaar T. M. and Allen E. B. Empirical assessment of a software metric:
The information content of operators. [Article] // Software Quality Journal. - 2001. - 9. -
pp. 99-112.
Khoshgoftaar Taghi M. [et al.] Uncertain Classification of Fault-Prone Software
Modules [Journal] // Empirical Softw. Engg.. - Massachusetts : Kluwer Academic
Publishers, 2002. - 4 : Vol. 7. - pp. 297-318.
LaToza T. D. [et al.] Program Comprehension as Fact Finding [Conference] //
Proceedings of the the 6th Joint Meeting of the European Software Engineering
Conference and the ACM SIGSOFT Symposium on the Foundations of Software
Engineering. - Dubrovnik : ACM, 2007. - 361-370.
LaToza Thomas D., Venolia Gina and DeLine Robert Maintaining mental
models: a study of developer work habits [Conference] // Proceeding of the 28th
international conference on Software engineering. - Shanghai : ACM, 2006. - pp. 492-
501.
Layman Lucas, Kudrjavets Gunnar and Nagappan Nachiappan Iterative
Identification of Fault-Prone Binaries Using In-Process Metrics [Conference] //
Proceedings of the Second ACM-IEEE international symposium on Empirical software
engineering and measurement. - Kaiserslautern : ACM, 2008. - pp. 206-212. - 978-1-
59593-971-5 .
Lim T.-S., Loh W.-Y. and Shih Y.-S. A comparison of prediction accuracy,
complexity, and training time of thirty-three old and new classification algorithms
[Article] // Machine Learning. - Boston : Kluwer Academic Publishers, 2000. - Vol. 40. -
pp. 203-229.
Loh W.-Y. and Shih Y.-S. Split Selection Methods for Classification Trees
[Journal] // Statistica Sinica. - 1997.
Manning C. and Schütze H. Foundations of Statistical Natural Language
Processing [Book]. - Cambridge : MIT Press, 1999.
Mantis Mantis Bug Tracker [Online]. - www.mantisbt.org.
McCabe Thomas A Complexity Measure [Journal] // IEEE Transactions on
Software Engineering. - December 1976. - 4 : Vols. SE-2.
McCullagh P. and Nelder J. A. Generalized Linear Model, 2nd ed. [Book]. -
New York : Chapman and Hall, 1989.
McCullagh P. and Nelder J. A. Statistical Models in S [Book]. - Pacific Grove :
Wadsworth & Brooks, 1992.
Menard S. Applied logistic regression analysis [Book Section] // Sage university
paper series on quantitative applications in the social sciences. - Thousand Oaks (CA) :
Sage, 1995.
Meneely Andrew [et al.] Predicting failures with developer networks and social
network analysis [Conference] // Proceedings of the 16th ACM SIGSOFT International
Symposium on Foundations of software engineering. - Atlanta : ACM, 2008. - pp. 13-
23. - 978-1-59593-995-1.
Microsystems Sun Java EE at a Glance [Online] // Java EE at a Glance. - Sun
Microsystems. - http://java.sun.com/javaee/.
Microsystems Sun Sun ONE Architecture Guide [Online] // Sun ONE
Architecture Guide. - Sun Microsystems. -
http://www.sun.com/software/sunone/docs/arch/.
Miller A. J. Subset Selection in Regression [Book]. - London : Chapman and
Hall, 1990.
Mockus Audris and Votta Lawrence G. Identifying Reasons for Software
Changes Using Historic Databases [Article] // 16th IEEE International Conference on
Software Maintenance (ICSM'00). - 2000. - p. 120.
Mockus Audris and Weiss David M. Predicting Risk of Software Changes
[Journal] // Bell Labs Technical Journal. - [s.l.] : Lucent Technologies Inc., April–June
2000.
Moser Raimund, Pedrycz Witold and Succi Giancarlo A comparative analysis
of the efficiency of change metrics and static code attributes for defect prediction
[Conference] // Proceedings of the 30th international conference on Software
engineering. - Leipzig : ACM, 2008. - pp. 181-190. - 978-1-60558-079-1.
Moser Raimund, Pedrycz Witold and Succi Giancarlo Analysis of the
reliability of a subset of change metrics for defect prediction [Conference] // Proceedings
of the Second ACM-IEEE international symposium on Empirical software engineering
and measurement. - Kaiserslautern : ACM, 2008. - pp. 309-311. - 978-1-59593-971-5.
Munson J.C. and Elbaum S.G. Code churn: a measure for estimating the impact
of code change [Conference] // Proceedings. International Conference on Software
Maintenance. - 1998. - pp. 24-31.
Musa John D., Iannino Anthony and Okumoto Kazuhira Software reliability:
measurement, prediction, application [Book]. - New York : McGraw-Hill, Inc., 1990. - 0-
07-044119-7.
Nagappan Nachiappan and Ball Thomas Use of Relative Code Churn
Measures to Predict System Defect Density [Conference] // ICSE ’05. - St. Louis : [s.n.],
2005.
Nagappan Nachiappan, Murphy Brendan and Basili Victor The influence of
organizational structure on software quality: an empirical case study [Conference] //
Proceedings of the 30th international conference on Software engineering. - Leipzig :
ACM, 2008. - pp. 521-530. - 978-1-60558-079-1 .
Nagelkerk N. J. D. A note on a general definition of the coefficient of
determination [Article] // Biometrika. - 1991. - 78. - pp. 691-692.
Ohlsson M., Mayrhauser A. von, McGuire B. and Wohlin C. Code Decay
Analysis of Legacy Software through Successive Releases [Conference] // Proceedings of
the IEEE Aerospace Conference. - [s.l.] : IEEE, 1999.
Park R. Software Size Measurement: A Framework for Counting Source
Statements [Report]. - Pittsburgh : Software Engineering Institute, 1992. - CMU/SEI-92-
TR-20.
Pinzger Martin, Nagappan Nachiappan and Murphy Brendan Can developer-
module networks predict failures? [Conference] // Proceedings of the 16th ACM
SIGSOFT International Symposium on Foundations of software engineering. - Atlanta :
ACM, 2008. - pp. 2-12. - 978-1-59593-995-1.
Ramler Rudolf, Biffl Stefan and Grünbacher Paul Value-Based management
of Software Testing [Book Section] // Value-Based Software Engineering / book auth.
Biffl Stefan [et al.]. - New York : Springer Berlin Heidelberg, 2006.
Selby R. W. and Porter A. A. Learning from Examples: Generation and
Evaluation of Decision Trees for Software Resource Analysis [Journal] // IEEE Trans.
Softw. Eng.. - New Jersey : IEEE Press, 1988. - 12 : Vol. 14. - pp. 1743--1757.
Selby Richard W. and Porter Adam A. Empirically Guided Software
Development Using Metric-Based Classification Trees [Journal] // IEEE Software. -
[s.l.] : IEEE, 1990. - 2 : Vol. 7. - pp. 46-54.
Shao J. Linear model selection by cross-validation [Journal]. - 1993. - 88. - pp.
486-494.
Sharp H C The Role of Domain Knowledge in Software Design [Article] //
Behaviour and Information Technology. - 1991. - 10. - Vol. 5. - pp. 383-401.
Shull F. [et al.] Simulating Families of Studies to Build Confidence in Defect
Hypotheses [Journal] // Journal of Information and Software Technology. - December
2005. - 15 : Vol. 47. - pp. 1019-1032.
SPSS, Data Mining, Statistical Analysis Software, Predictive Analysis, Predictive
Analytics, Decision Support System: [Online]. - SPSS. - http://www.spss.com/.
StatCVS - Generate statistical HTML reports from your CVS repository logs
[Online] // StatCVS - Generate statistical HTML reports from your CVS repository logs. -
http://statcvs.sourceforge.net/.
Swanson E. B. The dimensions of maintenance [Article] // Proc. 2nd Conf. on
Software Engineering. - San Francisco : [s.n.], 1976. - pp. 492-497.
Tabachnick B. G. and Fidell L. S. Using multivariate statistics (4th edition)
[Book]. - Boston : Allyn & Bacon, 2001.
Thayer Thomas A ., Lipow Myron and Nelson Eldred C . Software reliability:
a study of large project reality [Book]. - Amsterdam & Oxford : North-Holland
Publishing Co., 1978.
Tichy W. F. RCS - A system for version control [Article] // Software - Practice
and Experience. - 1985. - 7 : Vol. 15. - pp. 637-654.
Tsay Ruey S., Pena Daniel and Pankratz Alan E. Outliers in Multivariate Time
Series [Journal] // Biometrika, . - Great Britain : [s.n.], 2000. - 4 : Vol. 87. - pp. 789-804.
Turner Richard and Boehm Barry People Factors in Software Management:
Lessons from Comparing Agile and Plan-Driven Methods [Journal] // CrossTalk. -
December 2003. - pp. 4-8.
ViewVC: Repository Browsing [Online] // ViewVC: Repository Browsing:. -
ViewCVS Group. - http://www.viewvc.org/.
Visser W. and Hoc J-M. Expert software design strategies [Article] //
Psychology of programming. - London : Academic Press, 1990. - pp. 235-249.
Walkerden F. and Jeffery D. R. An empirical study of analogy-based software
effort estimation [Article] // Empirical Software Engineering. - June 1999. - 4. - Vol. 2. -
pp. 135-158.
Weaver Shannon The mathematical theory of communication [Book]. - Urbana :
University of Illinois Press, 1949.
Weka 3 - Data Mining with Open Source Machine Learning Software in Java
[Online]. - The University of Waikato. - http://www.cs.waikato.ac.nz/ml/weka/.
Wolf Tim [et al.] Predicting build failures using social network analysis on
developer communication [Conference] // Proceedings of the 2009 IEEE 31st
International Conference on Software Engineering. - [s.l.] : IEEE Computer Society,
2009. - pp. 1-11. - 978-1-4244-3453-4.
Zimmermann Thomas [et al.] Cross-project defect prediction: a large scale
experiment on data vs. domain vs. process [Conference] // Proceedings of the the 7th
joint meeting of the European software engineering conference and the ACM SIGSOFT
symposium on The foundations of software engineering. - Amsterdam : ACM, 2009. -
pp. 91-100. - 978-1-60558-001-2 .
Abstract
Defect prediction and removal continues to be an important subject in software engineering. Previous studies have shown that reworking defects introduced at different phases of the development lifecycle typically consumes on average 40% to 50% of the total cost of the software implementation. Another important factor is the time between the injection of the defect in the code and when it is identified and removed. It has been demonstrated that the longer the defect is in the product the larger the number of elements that will likely be involved with it increasing the cost of fixing the defect. This work investigates the defects introduced during the coding phase of the development lifecycle of a software project. Considering that a high percentage of the defects are introduced in the source code at this phase, finding ways to better understand how they are introduced can have a significant impact in the testing and maintenance costs of a project and consequently in the quality of the final product.