CALCULATING ARCHITECTURAL RELIABILITY VIA
MODELING AND ANALYSIS
by
Roshanak Roshandel
A Dissertation Presented to the
FACULTY OF THE GRADUATE SCHOOL
UNIVERSITY OF SOUTHERN CALIFORNIA
In Partial Fulfillment of the
Requirements for the Degree
DOCTOR OF PHILOSOPHY
(COMPUTER SCIENCE)
December 2006
Copyright 2006 Roshanak Roshandel
Dedication
To Ava.
Acknowledgment
I wish to express my gratitude and appreciation to my advisor, Professor Nenad Medvidovic. Under his supervision, I have developed and evolved professionally. I am
thankful for his guidance, support, and direction in the past few years. I will forever
be grateful for all he has taught me. My special thanks to other dissertation committee
members – Professors Leana Golubchik, Barry Boehm, Michal Young, Andre van der
Hoek, and Najmedin Meshkati – who have provided me with excellent guidance and
support.
I would also like to especially thank Professor Andre van der Hoek from University
of California, Irvine for his support and mentorship. He has been a wonderful friend
and colleague, and I am grateful for everything. Dr. Jafar Adibi at USC’s Information
Sciences Institute has been an immense source of support, guidance, enthusiasm, and
friendship throughout this journey, and I will be forever grateful.
My friends and office mates in the software architecture group at USC – Architecture
Mafia – Sam Malek, Marija Mikic-Rakic, Chris Mattmann, Vladimir Jakobac, and
Ebru Dincel, I thank you for your friendship and help during all these years. My special thanks to Somo Banerjee and Leslie Cheung for reading early versions of this dissertation and providing helpful feedback.
To my husband and best friend Payman Arabshahi, I am forever thankful for your
endless love, support and encouragements both before and during this process, and all
the sacrifices you have made for me to be able to complete this dissertation. And my
dearest Ava, you are the joy of my life every day. Thank you for hugs and kisses, and
smiles and giggles. And thank you for letting maman do her work!
Last but not least, to my family for their support and encouragements – my parents
Parvin Samei and Jalil Roshandel, my brother Rooein, and my in-laws Azra Sadri and
Samad Arabshahi – Thank you! If it were not for you, I could not have completed this endeavor!
Table of Contents
Dedication .......................................................................................................... ii
Acknowledgment ..............................................................................................iii
List of Tables .................................................................................................... vii
List of Figures ................................................................................................... viii
Abstract .............................................................................................................xii
Chapter 1: Introduction ..................................................................................... 1
1.1 Reliability of Software Architectures .......................................... 7
1.2 Research Hypotheses and Validation .......................................... 11
Chapter 2: Architectural Modeling for Reliability ............................................ 15
2.1 Example ....................................................................................... 18
2.2 Component Modeling .................................................................. 20
2.3 Our Approach .............................................................................. 23
2.4 Relating Component Models ....................................................... 33
2.5 Implications of the Quartet on Reliability ................................... 42
2.6 Defect Classification and Cost Framework ................................. 43
Chapter 3: Component Reliability .................................................................... 58
3.1 Classification of the Component Reliability Modeling Problem 62
3.2 Profile Modeling .......................................................................... 66
3.3 Reliability Prediction ................................................................... 81
Chapter 4: System Reliability ........................................................................... 94
4.1 Global Behavioral Model ............................................................ 96
4.2 Global Reliability Modeling ........................................................ 101
4.3 A Bayesian Network for System Reliability Modeling ............... 107
4.4 System Reliability Analysis ........................................................ 124
Chapter 5: Tool Support .................................................................................... 136
5.1 Mae ..............................................................................................137
5.2 Component Reliability Modeling ................................................ 140
5.3 System Reliability Modeling ....................................................... 141
Chapter 6: Evaluation ........................................................................................ 142
6.1 Architectural Analysis and Defect Classification ........................ 144
6.2 Component Reliability Prediction ............................................... 148
6.3 System Reliability Prediction ...................................................... 173
Chapter 7: Related Work ................................................................................... 207
7.1 Architectural Modeling ...............................................................207
7.2 Reliability Modeling ................................................................... 211
7.3 Taxonomy of Architectural Reliability Models .......................... 216
Chapter 8: Conclusion and Future Work .......................................................... 222
8.1 Contributions ...............................................................................223
8.2 Future Work ................................................................................. 224
References .........................................................................................................228
Appendix A: Mae Schemas for Quartet Models ............................................... 240
Appendix B: Sample Matlab Code for Component Reliability Estimation ...... 248
Appendix C: SCRover Bayesian Network Generated by Netica ...................... 251
List of Tables
Table 2-1. Sample Instantiation of the Cost Framework .............................. 53
Table 3-1. Classification of the Forms of the Reliability
Modeling Problem Space ............................................................. 65
List of Figures
Figure 1-1. Our Approach to Local and Global Reliability Modeling ..........11
Figure 2-1. Software Model Relationships within a Component. .................17
Figure 2-2. SCRover’s Software Architecture. .............................................20
Figure 2-3. Controller component’s Interface and Static Behavior View. ..25
Figure 2-4. Model of Controller Component’s Dynamic Behavior View ...29
Figure 2-5. Model of Controller Component’s Interaction Protocol View ..32
Figure 2-6. Taxonomy of Architectural Defects ...........................................45
Figure 2-7. The Radar Chart View for the Cost Framework ........................55
Figure 2-8. Graphical View of the Cost Framework Instantiation for Different
Defect Types ..............................................................................57
Figure 3-1. Component Reliability Prediction Framework ..........................60
Figure 3-2. Controller’s Dynamic Behavior Model (Guards omitted
for Brevity).....................................................................................68
Figure 3-3. Formal Definition of AHMM .....................................................78
Figure 3-4. Graphical View of the Controller’s Reliability Model ...............82
Figure 3-5. Reliability Analysis Results for the Controller Component .......92
Figure 4-1. Our Approach to System Reliability Prediction .........................95
Figure 4-2. View of SCRover System’s Collective Behavior ......................97
Figure 4-3. SCRover’s Global Behavioral View in terms of Interacting
Components ...............................................................................99
Figure 4-4. Nodes of the SCRover’s Bayesian Network (Top), and
Initialization and Failure Links Extension (Bottom) .................111
Figure 4-5. Interaction Links in SCRover’s Bayesian Network ...................112
Figure 4-6. SCRover’s Final Bayesian Network Model ...............................114
Figure 4-7. Expanded View of the SCRover’s Dynamic Bayesian Network 116
Figure 4-8. Summary of the BN’s Qualitative Construction Steps ...............117
Figure 4-8. SCRover’s Bayesian Network (top) and the Expanded Bayesian
Network (bottom) .......................................................................127
Figure 4-9. Cumulative Effect of Different Failures in SCRover .................131
Figure 4-10. Recovery Probability for Different Failures (Inverse of Cost) ..134
Figure 4-11. Weighted Cumulative Effect of Failures for SCRover ..............134
Figure 5-1. Overall View of the Required Tools for Architectural
Reliability Modeling and Analysis ............................................138
Figure 5-2. Mae’s Architecture .....................................................................139
Figure 6-1. Mae Defect Detection Yield by Type ........................................146
Figure 6-2. Defects Detected by UML, Acme, and Mae (by Type
and Number) ..............................................................................147
Figure 6-3. Controller Component Reliability Analysis Based on Various
Probabilities of Failures to the Two Failure States ....................152
Figure 6-4. Cost-framework Instantiation for Different Defect Types based
on data in Chapter 2 ...................................................................154
Figure 6-5. Changes to a Random Component’s Reliability based on
Different Failure Probabilities ...................................................154
Figure 6-6. Predicted Reliability for an Arbitrary Component Given a
Full Failure Probability Matrix (Left), and a Sparse Failure
Probability Matrix (Right) .........................................................156
Figure 6-7. Percentage of Changes in the Reliability Value of the
Controller Component (5%, 10%, and 20% Noise) ...................158
Figure 6-8. Percentage Change in Reliability Value of Three Arbitrary
Components with 5, 10, and 20 States (5% noise,
10% noise, and 20% noise, respectively) ...................................160
Figure 6-9. Controller Component Reliability Based on Random and Expert
Instantiation ................................................................................162
Figure 6-10. Arbitrary Component’s Full (Left) and Sparse (Right) Random
Instantiation for Training Data Generation ................................163
Figure 6-11. Sensitivity Analysis for the Controller Component with
Different Recovery Probabilities ...............................................166
Figure 6-12. Changes to the Probability of Recovery from Various Failure
Types for an Arbitrary Component ............................................167
Figure 6-13. Controller Component Reliability w.r.t. Different Failure Recovery
Probabilities ...............................................................................168
Figure 6-14. Controller Component Reliability w.r.t. A Full Range of Recovery
Probability Values ......................................................................170
Figure 6-15. Sensitivity Analysis for an Arbitrary 10-state Component ........171
Figure 6-16. Effect of Total Elimination of Failures ......................................172
Figure 6-17. SCRover’s Bayesian Network ....................................................176
Figure 6-18. Changes to the Reliability of the SCRover System at
Times t=0 and t=1 ......................................................................177
Figure 6-19. SCRover’s Reliability over Time based on its Dynamic Bayesian
Network ......................................................................................179
Figure 6-20. Updated Prediction of Reliability over Time based on New
Evidence .....................................................................................180
Figure 6-21. Effect of Changes to Components’ Reliabilities on System’s
Reliability ...................................................................................181
Figure 6-22. Changes to System Reliability Based on Different Component
Reliability Values at Time Step t=1 ...........................................183
Figure 6-23. The Effect of Changes to System’s Reliability as the Reliability
of the Startup Process Changes ..................................................185
Figure 6-24. Effect of Elimination of Particular Failures on SCRover System’s
Reliability ...................................................................................186
Figure 6-25. OODT’s High Level Architecture ..............................................188
Figure 6-26. OODT’s Global Behavioral Model (top) and Corresponding
BN (bottom) ...............................................................................189
Figure 6-27. OODT Model’s Sensitivity to Different Initial Component
Reliabilities ................................................................................192
Figure 6-28. Changes in the OODT System’s Reliability as Components’
Reliabilities Change ...................................................................193
Figure 6-29. Effect of Changes to Components Reliabilities on System
Reliability ...................................................................................194
Figure 6-30. Reliability Prediction of the OODT System Over Three Time
Periods ........................................................................................195
Figure 6-31. Changes to the OODT’s Reliability based on Different Startup
Process Reliability..........................................................................196
Figure 6-32. Eliminating the Probability of different Failures and Their
Impact on System Reliability.........................................................197
Figure 6-33. Modeling Redundancy in OODT ...............................................199
Figure 6-34. Impact of Different Configurations on OODT System
Reliability ...................................................................................200
Figure 7-1. Taxonomy of Architecture-based Reliability Models ................217
Abstract
Modeling and estimating software reliability during testing is useful in quantifying the quality of software systems. However, such measurements come late in the development process, leaving little room to improve the quality and dependability of the software system in a cost-effective way. Reliability, an important
dependability attribute, is defined as the probability that the system performs its
intended functionality under specified design limits. We argue that reliability models
must be built to predict the system reliability throughout the development process,
and specifically when exact context and execution profile of the system is unknown,
or when the implementation artifacts are unavailable. In the context of software archi-
tectures, various techniques for modeling software systems and specifying their func-
tionality have been developed. These techniques enable extensive analysis of the
specification, but typically lack quantification. Additionally, their relation to depend-
ability attributes of the modeled software system is unknown.
In this dissertation, we present a software architecture-based approach to predicting
reliability. The approach is applicable to early stages of development, when the implementation artifacts are not yet available and the exact execution profile is unknown. The approach is twofold: first, the reliability of individual components is predicted via a
stochastic reliability model built using software architectural artifacts. The uncer-
tainty associated with the execution profile is modeled using Hidden Markov Models,
which enable probabilistic modeling with unknown parameters. The overall system
reliability is obtained compositionally as a function of the reliability of its constituent
components, and their complex interactions. The interactions form a causal network
that models how reliability at a specific time in a system's execution is affected by the
reliability at previous time steps.
We evaluate our software architecture-based reliability modeling approach to demon-
strate that reliability prediction of software systems’ architectures early in the development life-cycle is both possible and meaningful. The coverage of our architectural analyses, as well as our defect classification, is evaluated empirically. The component-level and system-level reliability prediction methodology is evaluated using sensitivity, uncertainty, complexity, and scalability analyses.
Chapter 1: Introduction
The field of software architecture provides high-level abstractions for representing
the structure, behavior, and key properties of a software system. Architectural arti-
facts are critical in bridging the gap between requirement specification and imple-
mentation of the system. In general, a particular software system is defined in terms
of a collection of components (loci of computation) and connectors (loci of communi-
cation) as organized in an architectural configuration. Architecture description lan-
guages (ADLs) have been developed to aid architecture-based development [77].
ADLs provide formal notations for describing and analyzing software systems. Vari-
ous tools for analysis, simulation, and code generation of the modeled systems usu-
ally accompany these ADLs. Examples of ADLs include C2SADEL [76], Darwin
[71], Rapide [69], UniCon [118], xADL [25], and Wright [2]. A number of these
ADLs also provide extensive support for modeling behaviors and constraints on the
properties of components and connectors [77]. These behaviors and constraints can be
leveraged to ensure the consistency of an architectural configuration throughout a
system’s lifespan (e.g., by establishing conformance between the services of interact-
ing components). In essence, architecture is the first step in which important decisions
concerning the quality of the design are made. These decisions, in turn, directly influ-
ence dependability properties of the system under development.
Software reliability is defined as the probability that the system will perform its
intended functionality under specified design limits. Software reliability techniques
are aimed at reducing or eliminating failures of software systems. Existing software
reliability techniques are often rooted in the field of reliability engineering in general,
and hardware reliability in particular. Such approaches provide significant experience
in building reliability models, and advanced mathematical formalisms for analytical
reasoning. However, they are not properly gauged toward today’s complex software
systems and their specific challenges. In particular, existing software reliability tech-
niques mainly address reliability modeling during a system’s testing. Similar to hard-
ware engineering, they build models of the system’s failure behavior by observing its
runtime operation. The reliability of the system is then measured by building formal-
isms that explain the failure behavior of the system. Such treatment of reliability mea-
surement, prediction, or estimation reveals defects late in the development process.
Defects detected earlier in the development life cycle are less costly to mitigate [13].
Consequently, following traditional reliability measurement as outlined above results
in an increase in the overall development costs, and prevents understanding the influ-
ence of early architectural decisions on the system’s dependability. Reliability and
other quality attributes must thus be built into a software system throughout the devel-
opment process, and as an innate aspect of system design. This requires developing
and/or adapting reliability models to predict and measure the reliability of a software
system early on. After all, “you can’t control what you can’t measure” [24].
To clarify upcoming discussions, let us define some basic concepts: An error is a
mental mistake made by the designer or programmer. A fault or a defect is the mani-
festation of that error in the system. It is an abnormal condition that may cause a
reduction in, or loss of, the capability of a functional unit to perform a required func-
tion; it is a requirements, design, or implementation flaw or deviation from a desired
or intended state [61]. Finally, a software failure is defined as the occurrence of an incorrect output, with respect to the specification, in response to a received input value [101].
There are fundamental differences between the nature of failures in software and
hardware systems. Consequently, the reliability methods in the two fields vary accordingly to accommodate these differences [70,101]. While the failure rate in hardware
systems has a bathtub curve, the failure rate in a software system is statistically non-
increasing (not considering software evolution). In other words, a software system is
not expected to become less reliable as time passes. Moreover, in a software system,
failures never occur if the software is not used. This is not true of hardware systems
where material deterioration can cause failures even though the system is not being
used. Software reliability models are often analytical models derived from assump-
tions about the system, and the interpretation of those assumptions and model param-
eters. On the other hand, hardware reliability methods are usually derived from fitting
specific distributions to failure data. This is done by extensive analysis as well as the
domain experience. Finally, once defects in a software system are repaired, a new
piece of software is obtained. This is not true of hardware repairs, which typically
restore the original system.
While the cause of hardware failures may be material deterioration, design flaws, and
random failures, software failures may be caused by incorrect specification or design,
human errors, and incorrect data. It is estimated that 85% of software defects are
introduced during analysis and design alone, of which only one third are detected in the same
phase [101]. Software architecture modeling techniques are used as an abstraction for
representing software systems. Analytical reasoning about these models can be used
to reveal a variety of design faults. Assuming the implementation artifacts are built in
a manner that preserves the architectural design properties, early detection of these
faults can help prevent their propagation into the final product, reduce the probability
of failures, thus improving the reliability of the system as a whole, and in turn reduc-
ing the development costs.
In order to quantify and predict the reliability of software architectural models, a reli-
ability model is needed that combines the result of architectural analyses (as failure
behavior) with the context in which the software will be used. Since the exact context
and operation profile of the system may not be known in advance, the reliability
model should account for this uncertainty. Stochastic approaches to reliability model-
ing seem to be especially appropriate for these circumstances. Probabilistic reliability
models are widely used in all engineering disciplines, including software reliability
during testing. However, for handling architectural reliability, they need to be specifi-
cally gauged to account for uncertainties associated with unknown operation profiles.
Furthermore, given all the uncertainties, a single meaningful estimation of the reli-
ability is not feasible. Instead a reliability prediction framework offering a range of
analyses and predictions is more appropriate. Such a framework can be used in con-
junction with standard design tools to quantify the effects of various design decisions
on the reliability of the system throughout the development process.
Complex mathematical models have been developed for modeling uncertainty in
other disciplines [19,49,103,104]. Such models leverage known data about the sys-
tem, and solve the model to obtain unknown information. Examples of such models
include Hidden Markov Models (HMM) [63] and Bayesian Networks (BNs) [49] that
combine concepts from the fields of Graph Theory and Probability Theory. Our
research leverages these two methodologies, and applies them to address reliability
prediction of software architectures.
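To make the role of these formalisms concrete, the following sketch (illustrative Python, not part of this dissertation's tool chain; the two states, the observation symbols, and all probabilities are hypothetical) shows the forward algorithm, the basic HMM computation that evaluates how likely an observation sequence is when the underlying state sequence is hidden:

# Illustrative sketch only: the forward algorithm for a two-state HMM.
# All names and numbers below are hypothetical, not taken from this work.
def forward(observations, pi, A, B):
    """Return P(observations) given initial distribution pi,
    transition matrix A, and emission matrix B (integer-indexed)."""
    n = len(pi)
    alpha = [pi[s] * B[s][observations[0]] for s in range(n)]
    for obs in observations[1:]:
        alpha = [sum(alpha[r] * A[r][s] for r in range(n)) * B[s][obs]
                 for s in range(n)]
    return sum(alpha)

# Hypothetical example: state 0 = "operational", state 1 = "failed";
# observation 0 = "correct output", observation 1 = "erroneous output".
pi = [0.9, 0.1]
A = [[0.95, 0.05],   # operational -> operational / failed
     [0.60, 0.40]]   # failed      -> operational / failed (recovery)
B = [[0.8, 0.2],     # P(correct / erroneous output | operational)
     [0.3, 0.7]]     # P(correct / erroneous output | failed)
print(forward([0, 0, 1], pi, A, B))

The same style of calculation underlies the component-level reliability model described in Chapter 3, where the model parameters themselves are uncertain.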
Our work focuses on both structural and behavioral aspects of a software system’s
architecture, with the goal of addressing the following research question: Can we use
the architectural model of a software system to predict meaningfully the reliability of
an individual component, and consequently the reliability of the entire system?
Our approach for Calculating Architectural Reliability via Modeling and Analysis
(CARMA) attempts to answer this research question. We hypothesize that models of
architectural structure and behavior may be used as a basis for a stochastic reliability
model. We further hypothesize that this reliability model can be used to predict indi-
vidual component reliability, which then can be used to predict compositionally the
overall system reliability. The accuracy of the estimated reliability depends on the
richness of the architectural models: the more comprehensive and extensive the archi-
tectural model, the more accurate the reliability values obtained.
Furthermore, we hypothesize that this stochastic model can be parameterized based
on the type of architectural defects (e.g., their frequency, severity). It then can be
used to identify defects that are more critical and cost-effective to fix. To validate
these hypotheses and evaluate our overall research, we have developed and used an
architectural modeling and analysis environment in the context of several case stud-
ies. We have also developed a reliability prediction framework for both component-
level and system-level reliability prediction and analysis, and applied them to exam-
ples and case studies to demonstrate that reliability prediction of a software architec-
ture is both meaningful and useful.
This dissertation research takes a step in closing the gap between qualitative represen-
tation of a system’s architecture on the one hand, and quantitative prediction of the
system’s reliability on the other. In particular, the contributions of this thesis include:
1. Mechanisms to ensure intra- and inter-consistency among multiple views of a system’s architectural models,
2. A formal reliability model to predict both component-level and system-level reli-
ability of a given software system based on its architectural specification, and
3. A parameterized and pluggable defect classification and cost framework to identify critical defects whose mitigation is most cost-effective in improving a system’s overall reliability.
In the rest of this chapter, we describe our approach at a high-level, and discuss the
hypotheses upon which this research is based.
1.1 Reliability of Software Architectures
A survey of related literature in the area of architectural modeling and its connection
to system reliability reveals that, despite the development of sophisticated architec-
tural modeling techniques and their related analyses, proper quantification to measure
system dependability at the level of software architecture is lacking. Formal modeling
of software architecture is a complex and time-consuming task. If such models cannot
reveal and quantify potential defects, and systematically outline the effects of these
defects on the overall system dependability, then their use may be considered to be
cost-ineffective. As an answer, this thesis proposes an effective framework for archi-
tectural reliability prediction. The developed approach leverages architectural model-
ing and analysis, and enables sensitivity analyses that can be used to prescribe cost-
effective strategies to mitigate architectural defects.
1.1.1 Problem Description
The goal of this research is to find a solution to the following problem:
Given architectural models of a system’s structure and behavior,
1. Analyze the reliability of individual components, in terms of the probability that
each component performs its intended functionality successfully.
2. Analyze the reliability of the overall system’s architecture in terms of the proba-
bility that the system as a whole performs its intended functionality successfully.
The system reliability is estimated in terms of the composition of and interactions
among its constituent components (and their reliabilities).
3. Perform analysis to rank the components according to their effect on overall sys-
tem reliability.
1.1.2 Approach
In this thesis, we propose and evaluate a three-part solution to the problem of model-
ing and quantifying architecture-level reliability of software systems:
I. Multi-View Models. We advocate using a multi-view modeling approach called
Quartet to comprehensively model the properties of components in a software sys-
tem. The interface, static behavior, dynamic behavior and the interaction protocol
views each represent and help to ensure different characteristics of a component.
Moreover, the four views have complementary strengths and weaknesses with respect
to their ability to characterize systems.
These views can be analyzed to detect possible inconsistencies both within a compo-
nent’s models and across models of communicating components. The inconsistencies
signify architectural faults or defects, which may cause a failure during the system’s
operations and thus adversely affect the system’s reliability. The models also can be
used as a basis for generating implementation-level artifacts. In Chapter 2, we intro-
duce the details of each modeling view, and discuss an approach by which the consis-
tency among these views can be preserved.
II. Component Reliability. We offer a framework to predict and analyze the reliabil-
ity of individual components (referred to as Local Reliability) using a stochastic
model based on the Quartet. The reliability model leverages the Hidden Markov
Model formalism [104], and is built using the Quartet’s dynamic behavior view. The
model estimates the component’s reliability in terms of the probability of successfully
recovering from a failure occurring during the component’s operation. Local reliabil-
ity is estimated as a function of the inter-consistency of the component’s models, and
the internal behavior of the component described as a state machine. This technique is
discussed in detail in Chapter 3.
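As a rough illustration of the local reliability idea (a hedged sketch only; the transition probabilities below are invented, and the actual model in Chapter 3 is stochastic in the HMM sense rather than this plain Markov chain), one can view the component's dynamic behavior as a state machine with an explicit failure state, and take the long-run probability of being outside that state as a reliability estimate:

# Hedged sketch: a plain Markov-chain reading of component reliability.
# State names follow the SCRover controller; the probabilities are invented.
states = ["init", "normal", "emergency", "changed", "failure"]
P = [
    [0.0, 0.7, 0.2, 0.0, 0.1],
    [0.0, 0.5, 0.1, 0.3, 0.1],
    [0.0, 0.4, 0.4, 0.1, 0.1],
    [0.0, 0.6, 0.1, 0.2, 0.1],
    [0.0, 0.8, 0.0, 0.0, 0.2],  # recovery: a failure usually returns to "normal"
]

def steady_state(P, iterations=1000):
    v = [1.0 / len(P)] * len(P)
    for _ in range(iterations):        # power iteration: v <- v * P
        v = [sum(v[i] * P[i][j] for i in range(len(P))) for j in range(len(P))]
    return v

distribution = steady_state(P)
reliability = 1.0 - distribution[states.index("failure")]
print("estimated component reliability:", round(reliability, 3))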
III. System Reliability. The technique for predicting and analyzing a system’s over-
all reliability (referred to as Global Reliability) is compositional in nature: the sys-
tem’s reliability is estimated in terms of the reliabilities of its constituent components
and their interactions. We leverage Bayesian Networks (BNs) [19,49], and model the
interactions among components in terms of the causal relationships among their reli-
abilities; when a change of state in a component causes a change of state in another
component, then the reliability of the second component depends on the reliability of
the first one. This Bayesian model is then augmented with the notion of a failure
state: any state in a component may result in a failure, so unreliability at each state
can affect the probability of the system’s failure (i.e., its unreliability value). The
model also leverages the estimated reliability of individual components (obtained
from the Local Reliability estimation step), to estimate compositionally the reliability
of the system.
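The causal intuition behind this composition can be illustrated with a small Monte Carlo sketch (hypothetical Python, not the Bayesian Network construction of Chapter 4; the dependency structure, the propagation probability, and the reliability values are all assumptions):

# Hedged sketch: Monte Carlo reading of the causal-network intuition.
# A component is down if it fails on its own, or if an upstream failure
# propagates to it; the system is up only when every component is up.
import random

local_reliability = {"sensor": 0.99, "estimator": 0.98, "controller": 0.97}
depends_on = {"estimator": ["sensor"], "controller": ["estimator"]}
PROPAGATE = 0.5   # assumed chance that an upstream failure propagates

def system_up_once():
    down = {c: random.random() > r for c, r in local_reliability.items()}
    for c in ["sensor", "estimator", "controller"]:   # topological order
        for parent in depends_on.get(c, []):
            if down[parent] and random.random() < PROPAGATE:
                down[c] = True
    return not any(down.values())

trials = 100_000
estimate = sum(system_up_once() for _ in range(trials)) / trials
print("system reliability estimate:", round(estimate, 3))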
A high-level conceptual view of our approach is depicted in Figure 1-1. We describe
each of these steps and evaluate the underlying methodology in the following chap-
ters. The rest of this section outlines the hypotheses upon which the research is based.
1.2 Research Hypotheses and Validation
Our research is based on the following hypotheses.
Hypothesis 1. In order to predict software reliability at the architectural level, we
need rich architectural models, as well as reliability models that do not rely on a run-
ning system’s operation profile. We hypothesize that models of architectural structure
and behavior alone can be used to obtain a meaningful estimate of system reliability.
Hypothesis 2. A component’s internal behavior is traditionally modeled using
dynamic behavioral models. Such models offer a continuous view of the component
Figure 1-1. Our Approach to Local and Global Reliability Modeling
and how it arrives at certain states during its execution. Additionally, models of com-
ponents’ interaction protocols abstract away the details of internal component behav-
iors and focus on a component’s external interactions. We hypothesize that a
stochastic model constructed based on the component’s dynamic behavioral model
and its interaction protocols can be used to predict both component-level and then
system-level reliability.
Hypothesis 3. Different cost values are associated with different classes of defects
introduced during architectural design. Consequently, different classes of defects can
affect the system and reduce its overall reliability in different ways. We hypothesize
that our stochastic reliability estimation model can be parameterized for a set of cost
factors associated with different defect types, which then can be used to identify
defects that comparatively are more critical and cost-effective to fix.
Validation. The approach is evaluated by applying our architectural modeling, analy-
sis, and reliability prediction framework to several case studies. Using these case
studies, we demonstrate that our approach to reliability prediction and analysis of
software architecture is both meaningful and useful. We show that it is meaningful by
demonstrating its sensitivity to various model parameters. We further demonstrate
that it is useful via a set of sensitivity analyses that demonstrate our results can aid in
mitigating architectural defects and enhancing the quality of design in a cost-effective
manner. Our approach leverages architectural models of a system to construct compo-
nent-level and system-level reliability models. This validates Hypothesis 1.
In the context of our stochastic methods, we validate the model by varying the data
provided as input to the models (e.g., type and number of defects in the case of com-
ponent reliability estimation, and component reliability value in the case of system
reliability estimation) to evaluate Hypothesis 2. In particular, we justify the defect mitigation strategies enabled by our methodology by leveraging principles of stochastic reliability modeling and our parameterized cost framework.
To further evaluate hypotheses 2 and 3, we use simulations that, given (1) an architec-
tural configuration of components and connectors, (2) an arbitrary set of defects for
each component, and (3) a particular interaction protocol for the system, estimate each component’s and the entire system’s reliability, and rank the components based
on their impact on the system reliability. The latter is done by leveraging a defect
classification and the cost framework.
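A much simplified sketch of the kind of ranking these simulations produce (assumed reliability numbers and a deliberately naive series-system composition, not CARMA's actual model) ranks components by how much the system reliability would gain if each component's defects were fully mitigated:

# Hedged sketch: rank components by their impact on system reliability.
# Reliability values and the series composition are illustrative assumptions.
component_reliability = {"sensor": 0.99, "estimator": 0.95, "controller": 0.90}

def system_reliability(reliabilities):
    product = 1.0
    for value in reliabilities.values():   # series system: all must work
        product *= value
    return product

baseline = system_reliability(component_reliability)
impact = {}
for name in component_reliability:
    perturbed = dict(component_reliability, **{name: 1.0})
    impact[name] = system_reliability(perturbed) - baseline

for name, gain in sorted(impact.items(), key=lambda item: -item[1]):
    print(name, "-> potential system reliability gain:", round(gain, 3))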
The rest of this dissertation is organized as follows. Chapter 2 presents our work in
modeling and analysis of software architectures. Chapters 3 and 4 describe the details
of our technique to estimating component-level and system-level reliability, respec-
tively. Chapter 5 presents various tools used for modeling, analysis, and reliability
estimations via our approach. Chapter 6 details our evaluation strategy, as well as the
results obtained in evaluating our work. Chapter 7 details the related work both in
architectural modeling and analysis and in reliability estimation, and presents a classi-
fication of architecture-based reliability models. We conclude by summarizing the
contributions of this thesis, and discussing the future research directions.
Chapter 2: Architectural Modeling for Reliability
Component-based software engineering has emerged as an important discipline for
developing large and complex software systems. Software components have become
the primary abstraction level at which software development and evolution are carried
out. We consider a software component to be any self-contained unit of functionality
in a software system that exports its services via an interface, encapsulates the realiza-
tion of those services, and possibly maintains internal state. In the context of this
research, we further focus on components for which information on their interfaces
and behaviors may be obtained. In order to ensure the desired properties of compo-
nent-based systems (e.g., dependability attributes such as correctness, compatibility,
interchangeability, and functional reliability), both individual components and the
resulting systems’ architectural configurations must be modeled and analyzed.
The role of components as software systems’ building blocks has been studied exten-
sively in the area of software architectures [77,100,117]. While there are many
aspects of a software component worthy of careful study (e.g., modeling notations
[16,77], implementation platforms [1,2], evolution mechanisms [65,76]), we restrict
our study to an aspect of dependability only partially considered in existing literature,
namely, consistency among different models of a component. We consider this aspect
from an architectural modeling perspective, as opposed to an implementation or runt-
ime perspective.
The direct motivation for this work is our observation that there are four primary
functional aspects of a software component: (1) interface, (2) static behavior, (3)
dynamic behavior, and (4) interaction protocol. Each of the four modeling views rep-
resents and helps to ensure different characteristics of a component. Moreover, the
four views have complementary strengths and weaknesses with respect to their ability
to characterize systems. As detailed in Section 2.2, existing approaches to compo-
nent-based development typically select different subsets of these four views (e.g.,
interface and static behavior [65], or interface and interaction protocol [133]). At the
same time, different approaches treat each individual view in very similar ways (e.g.,
modeling static behaviors via pre- and post-conditions, or modeling interaction proto-
cols via finite state machines).
The four views’ complementary strengths and weaknesses in system modeling, as
well as their consistent treatment in the literature suggest the possibility of using them
in concert. However, what is missing from this picture is an understanding of the relationships among these different models within a single component.
Figure 2-1 depicts the space of possible intra-component model relationship clusters.
Each cluster represents a range of possible relationships, including not only “exact”
matches, but also “relaxed” matches [136] between the models in question. Of these
six clusters, only the pair-wise relationships between a component’s interface and its
other modeling aspects have been studied extensively (relationships 1, 2, and 3 in
Figure 2-1).
It is our intent to focus on completing the modeling space depicted in Figure 2-1. We
present and discuss extensions to commonly used modeling approaches for each
aspect, relate them to each other, and ensure their compatibility. We also discuss the
advantages and drawbacks inherent in modeling all four aspects (the Quartet) and six
relationships shown in Figure 2-1. As part of this dissertation, we focus on providing
a framework for predicting architectural reliability, by addressing all the relationships
shown in Figure 2-1. In this manner, several important long-term goals may be
accomplished:
• Enrich, and in some respects complete, the existing body of knowledge in
component modeling and analysis,
• Suggest constraints on and provide guidelines to practical modeling techniques,
which typically select only a subset of the Quartet,
Figure 2-1. Software Model Relationships within a Component.
• Provide a basis for additional operations on components, such as retrieval, reuse,
and interchange [136],
• Suggest ways of creating one (possibly partial) model from another automatically,
and
• Provide better implementation generation capabilities from such enriched system
models.
In the rest of this chapter, we introduce a simple example that will be used throughout
the dissertation to clarify concepts. We will also provide an overview of existing
approaches to component modeling, introduce the Quartet, and discuss the relation-
ships among the four modeling perspectives. These relationships along with more tra-
ditional architectural analyses such as type checking and consistency checking will be
used as the core of our reliability modeling approach. The result of these analyses is a set of architecture-level defects which, in turn, may translate into failures during components’ operations. In order to distinguish and quantify the effect of each defect, we
have developed a defect classification and a cost framework presented later in this
chapter. The result of this quantification is used in our reliability models presented in
Chapter 3 and Chapter 4.
2.1 Example
Throughout this dissertation, we use a simple example of a robotic rover to illustrate
the introduced concepts. The robotic rover, called SCRover, is designed and developed in collaboration with NASA’s Jet Propulsion Laboratory, and in accordance with
their Mission Data System (MDS) methodology. To avoid unnecessary complexity,
we discuss a simplified version of the application. Our focus is particularly on
SCRover’s “wall following” behavior. In this mode, the rover uses a laser rangefinder
to determine the distance to the wall, drives forward while maintaining a fixed dis-
tance from that wall, and turns both inside and outside corners when it encounters
them. This scenario also involves sensing and controlled locomotion, including
reducing speed when approaching obstacles.
The system contains five main components: controller, estimator, sensor, actuator,
and a database. The sensor component gathers physical data (e.g., distance from the
wall) from the environment. The estimator component accesses the data and passes
them to the controller for control decisions. The controller component issues com-
mands to the actuator to change the direction or speed of the rover. The database com-
ponent stores the “state” of the rover at certain intervals, as well as when a change in
the values happens. Figure 2-2 shows a high-level architecture of the system in terms
of the constituent components, connectors, and their associated interfaces (ports): the
rectangular boxes represent components in the system; the ovals are connectors; the
dark circles on a component correspond to interfaces of services provided by the com-
ponent/connector, while the light circles represent interfaces of services required by
the component/connector. To illustrate our approach, we will specifically define the
static and dynamic behavioral models and protocols of interactions for the controller
component in Section 2.3.
2.2 Component Modeling
As previously discussed, our goal is to leverage the architectural models of software
components to predict their architectural reliability. We will then use this estimated
reliability along with models of components’ interaction to predict the overall reli-
ability of software systems. Architectural models form the core of our reliability mod-
eling approach. Analyzing these models reveals faults or defects. These faults
negatively affect the reliability of individual components and, in turn, adversely influence the overall system reliability.
An overview of related approaches to architectural modeling and analysis is provided
in Chapter 7. In this chapter we focus on multiple functional modeling aspects of a
single software component. We advocate a four-view modeling approach, called the
Figure 2-2. SCRover’s Software Architecture
(The figure depicts the Sensor, Estimator, Controller, Actuator, and Database components, their uni- and bi-directional connectors, and their provided and required ports. Interface-type key: measQuery (mq), measUpdate (mu), Execution (e), Query (q), Notify (n), UpdateDB (u).)
Quartet. Using the Quartet, a component’s structure, behavior, and its interaction with
other components in the system can be described. Moreover, analysis of these models
could reveal potential problems with the design and future implementation of the sys-
tem. In the rest of this section, we will first discuss the four component aspects. We
will use this discussion as the basis for studying the dependencies among these mod-
els and implications of maintaining their interconsistency in Sections 2.3 and 2.4.
2.2.1. Introducing the Quartet
Traditionally, functional characteristics of software components have been modeled
predominantly from the following four perspectives:
Interface modeling. Component modeling has been most frequently performed at the
level of interfaces. Interface models specify the points by which a component inter-
acts with other components in the system. Interface modeling has included matching
interface names and associated input/output parameter types. However, software
modeling solely at this level does not guarantee many important properties, such as
interoperability or substitutability of components: two components may associate
vastly different meanings with identical interfaces.
Static Behavior Modeling. Approaches to static behavior modeling describe the
behavioral properties of a system discretely, i.e., at specific snapshots in the system’s
execution. This is done primarily using invariants on the component states and pre-
and post-conditions associated with the components’ operations. Static behavioral
specification techniques are successful at describing what the state of a component
should be at specific points of time. However, they are not expressive enough to rep-
resent how the component arrives at a given state.
Dynamic Behavior Modeling. The deficiencies associated with static behavior mod-
eling have led to a third group of component modeling techniques and notations.
Modeling dynamic component behavior results in a more detailed view of the compo-
nent and how it arrives at certain states during its execution. It provides a continuous
view of the component’s internal execution details.
Interaction Protocol Modeling. The last category of component modeling
approaches focuses on legal protocols of interaction among components. This view of
modeling provides a continuous external view of a component’s execution by speci-
fying the allowed execution traces of its operations (accessed via interfaces).
Typically, the static and dynamic component behaviors and interaction protocols are
expressed in terms of a component’s interface model. For instance, at the level of
static behavior modeling, the pre- and post-conditions of an operation are tied to the
specific interface through which the operation is accessed. Similarly, the protocol
modeling approach [133] uses finite state machines (FSMs) in which component
interfaces serve as labels on the transitions. The same is also true of UML’s use of
interfaces specified in class diagrams for modeling event/action pairs in the corre-
sponding statechart models. This is why we chose to place Interface at the center of
the diagram shown in Figure 2-1.
2.3 Our Approach
We argue that a complete functional model of a software component can be achieved
only by focusing on all four aspects of the Quartet. At the same time, focusing on all
four aspects has the potential to introduce certain problems that must be carefully
addressed (e.g., large number of modeling notations that developers have to master,
model inconsistencies). While we use a particular notation in the discussion below,
the approach is generic such that it can be easily adapted to other modeling notations.
It is noteworthy that the concise formulations used throughout this chapter to clarify our definitions are not meant to serve as a formal specification of our model. As in regular expressions, in this notation x+ denotes one or more repetitions of x, x* denotes zero or more repetitions of x, and x? denotes an optional (zero or one) instance of x, where x is a model element.
A component model is defined as follows:
Component_Model:
(Interface,
Static_Behavior,
Dynamic_Behavior,
Interaction_Protocol);
2.3.1. Interface
Interface modeling serves as the core of our component modeling approach and is
extensively leveraged by the other three modeling views. A component’s interface has
a type and is specified in terms of one or more interface elements. Each interface ele-
ment has a direction, a name (method signature), a set of input parameters, and possi-
bly a return type (output parameter). The direction indicates whether the component
requires (+) the service (i.e., operation) associated with the interface element or pro-
vides (-) it to the rest of the system. In other words:
Interface:
(Type,
Interface_Element+);
Interface_Element:
(Direction,
Method_signature,
Input_parameter*,
Output_parameter?);
In the context of the SCRover example discussed in Section 2.1, the controller com-
ponent exposes four interfaces through its four ports e, u, q, and n, corresponding to the Execution, UpdateDB, Query, and Notify interface types, respectively (recall
Figure 2-2). Each of these interfaces may have several interface elements associated
with them. These interface elements are enumerated in Figure 2-3. Examples include
the getWallDist and executeSpeedChange interface elements defined below:
Figure 2-3. Controller component’s Interface and Static Behavior View.
Interface View
Interface types
e: Execution;
q: Query;
n: Notify;
u: UpdateDB;
Ports:
prov: {q:Query};
req: {n:Notify, u:UpdateDB, e:Execution};
Interfaces:
u: + setDefaults();
e: + executeSpeedChange (speed:speedType);
e: + executeDirChange (dir:DirType);
n: + notifyDistChange():DistType;
n: + notifySpeedChange():SpeedType;
n: + notifyDirChange():DirType;
q: - getWallDist():DistanceType;
Static Behavior View
StateVariable:
mode:Integer;
dist:DistanceType;
speed:SpeedType;
dir:DirType;
Invariant:
//off=0, on=1, halt=2, failure=3
{0 ≤ mode ≤ 3 AND 0 ≤ dist};
Operations:
op_getWallDist{
preCond: {dist ≥ 0};
postCond: {result=dist};
mapped_interfaces: {getWallDist};
}
op_notifyDistChange{
postCond: {result=~dist};
mapped_interfaces: {notifyDistChange};
}
op_notifyDirChange{
postCond: {result=~dir};
mapped_interfaces: {notifyDirChange};
}
op_notifySpeedChange{
postCond: {result=~speed};
mapped_interfaces:{notifySpeedChange};
}
op_setDefaults{
preCond: {dist > 100};
postCond: {~speed > 100 AND
~dir = 0};
mapped_interfaces: {setDefaults};
}
op_executeSpeedChange{
preCond: {val <> speed};
mapped_interfaces: {executeSpeedChange};
}
op_executeDirChange{
preCond: {val <> 0};
mapped_interfaces: {executeDirChange};
}
u: +executeSpeedChange(speed: SpeedType);
q: -getWallDist():DistanceType;
where executeSpeedChange() is an interface element of type UpdateDB, required by
the controller component. Its input parameter speed is of user-defined type SpeedType
and it has no return value. Similarly, getWallDist() is provided by the controller com-
ponent, is an interface element of type Query, takes no input parameters, and returns a
value of type DistanceType.
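For illustration, the interface view maps naturally onto plain data structures. The Python sketch below (an illustrative encoding, not a notation defined in this work) instantiates the two controller interface elements quoted above:

# Illustrative encoding of the interface view as plain data structures.
from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class InterfaceElement:
    direction: str                                   # "+" required, "-" provided
    signature: str                                   # method signature (name)
    inputs: List[str] = field(default_factory=list)  # input parameter types
    output: Optional[str] = None                     # return type, if any

@dataclass
class Interface:
    type: str
    elements: List[InterfaceElement]

update_db = Interface("UpdateDB", [
    InterfaceElement("+", "executeSpeedChange", inputs=["SpeedType"]),
])
query = Interface("Query", [
    InterfaceElement("-", "getWallDist", output="DistanceType"),
])
print(query.elements[0])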
2.3.2. Static Behavior
We adopt a widely used approach for static behavior modeling [65], which relies on
first-order predicate logic to specify functional properties of a component in terms of
the component’s state variables, invariants (constraints associated with the state vari-
ables), and operations (accessed via interfaces). Each operation is mapped to one or
more interface elements (as modeled in the interface model), and specifies correspond-
ing pre- and post-conditions.
Static_Behaviors:
(State_Variable*,
Invariant*,
Operation+);
State_Variable:
(Name,
Type);
Invariant:
(Logical_Expression);
Operation:
(Interface_Element+,
Pre_Cond*,
Post_Cond*);
Pre/Post_Cond:
(Logical_Expression);
Interface and static behavior views of the SCRover’s controller component are
depicted in Figure 2-3. The specification details the interface types, instances, and
associated operations for performing the component’s various functions: querying the distance from obstacles, enacting changes in the rover’s speed or direction, and notifying other components about these changes. The pre- and post-conditions are used to
specify conditions that must be true immediately prior to, or right after an operation is
invoked. For instance, in the case of op_setDefaults, the new values of the variables dist, speed, and dir (denoted by ~ followed by the variable name) are specified to be within a certain range.
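As an illustration of what these conditions amount to at run time, the following Python sketch checks the op_setDefaults pre- and post-conditions and the component invariant from Figure 2-3 (the checking code and the concrete effect of the operation are assumptions; only the conditions themselves come from the figure):

# Hedged sketch: op_setDefaults pre-/post-conditions and the component
# invariant from Figure 2-3, written as runtime checks. The concrete effect
# of the operation (speed=150, dir=0) is an assumed placeholder.
def check(condition, label, state):
    if not condition:
        raise AssertionError(f"{label} violated in state {state}")

def op_set_defaults(state):
    check(state["dist"] > 100, "preCond(op_setDefaults)", state)
    new_state = dict(state, speed=150, dir=0)
    check(new_state["speed"] > 100 and new_state["dir"] == 0,
          "postCond(op_setDefaults)", new_state)
    check(0 <= new_state["mode"] <= 3 and new_state["dist"] >= 0,
          "invariant", new_state)
    return new_state

print(op_set_defaults({"mode": 1, "dist": 250, "speed": 90, "dir": 45}))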
2.3.3. Dynamic Behavior
A dynamic behavior model provides a continuous view of the component’s internal
execution details. Variations of state-based modeling techniques have often been used
to model a component’s internal behavior (e.g., in UML). Such approaches describe
the component’s dynamic behavior using a set of sequencing constraints that define
legal ordering of the operations performed by the component. These operations may
belong to one of two categories: (1) they may be directly related to the interfaces of
the component as described in both interface and static behavioral models; or (2) they
may be internal operations of the component (i.e., invisible to the rest of the system
such as private methods in a UML class). To simplify our discussion, we only focus
on the first case: publicly accessible operations. The second case may be reduced to
the first one using the concept of hierarchy in statecharts: internal operations may be
abstracted away by building a higher-level state-machine that describes the dynamic
behavior only in terms of the component’s interfaces.
A dynamic behavior model serves as a conceptual bridge between the component’s
model of interaction protocols and its static behavioral model. On the one hand, a
dynamic behavior model serves as a refinement of the static behavior model as it fur-
ther details the internal behavior of the component. On the other hand, by leveraging a
state-based notation, a dynamic behavior model may be used to specify the sequence
by which a component’s operations get executed. A rich description of a component’s
dynamic behavior is essential to achieving two key objectives. First, it provides a rich
model that can be used to perform sophisticated analysis and simulation of the com-
ponent’s behavior. Second, it can serve as an important intermediate level model to
generate implementation level artifacts from the architectural specification.
Existing approaches to dynamic behavior modeling employ an abstract notion of
component state. These approaches treat states as entities of secondary importance,
with the transitions between states playing the primary role in behavioral modeling.
Component states are often only specified by their name and a set of incoming and
outgoing transitions. We offer an extended notion of dynamic behavioral modeling
that defines a state in terms of a set of variables maintained by the component and
their associated invariants. These invariants constrain the values of, and dependencies
among, the variables. As examples of this extension to the definition of states con-
sider the invariants associated with normal and emergency states in Figure 2-4 (the
details of state invariants are omitted from the diagram itself for clarity):
Figure 2-4. Model of Controller Component's Dynamic Behavior View (statechart over the states init, normal, emergency, and changed, with guarded event/action transitions such as getWallDist[dist > 0]/notifyDistChange, executeSpeedChange/notifySpeedChange, executeDirChange/notifyDirChange, setDefaults, and logState)
normal_inv: (0 ≤ dir < 360) AND (0 < speed < 500) AND (dist > 100)
emergency_inv: (100 < speed < 200) AND (dir = 0) AND (dist ≤ 100)

This specification indicates that the state normal is defined in terms of three variables dir, dist, and speed: the acceptable range for dir at this state is a value between 0 and 360, the acceptable value for dist is a value greater than 100, and the acceptable range for the variable speed is between 0 and 500. In the case of the state emergency, the same three variables speed, dir, and dist are used to define the state, and the particular constraints on these variables are specified above.
In summary, our dynamic behavior model consists of an initial state and a sequence of
guarded transitions from an origin to a destination state. Furthermore, a state is speci-
fied in terms of constraints it imposes over a subset of the component’s state vari-
ables. In other words:
Dynamic_Behavior:
(InitState,
(State:(Direction)Transition->State)+);
State:
(Name,
Variables*,
Invariant*);
Transition:
(Label,
Parameter*,
Guard*);
Guard:
(Logical_Expression);
A model of the controller component’s dynamic behavior is depicted in Figure 2-4.
The component has four states (init, normal, emergency, and changed). Upon initial-
ization, depending on the distance of the robot from an obstacle (parameter dist), the
component arrives at either the normal or the emergency state. This change of state is
either a result of the execution of the event getWallDist, or happens without any stim-
ulus (transitions without an event label – aka True transitions). Once in the normal
state, any change to the component's direction or speed results in a transfer of state to the changed state. A command to change the speed or direction while in the emergency state results in either returning to the normal state or remaining in the emergency state, depending on the specific values of the component's state variables.
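As an illustration only, the statechart of Figure 2-4 can be encoded as a plain transition relation and queried programmatically. The Python sketch below is an abridged, assumed reading of the figure (guards are kept as uninterpreted strings, and several transitions, including setDefaults and the True transitions, are omitted); it is not the complete model.

controller_transitions = [
    # (origin,     event,                 guard,              destination,  action)
    ("init",      "getWallDist",          "dist > 0",         "normal",     "notifyDistChange"),
    ("init",      "getWallDist",          "dist <= 0",        "emergency",  "notifyDistChange"),
    ("normal",    "getWallDist",          "dist > 100",       "normal",     "notifyDistChange"),
    ("normal",    "getWallDist",          "0 < dist <= 100",  "emergency",  "notifyDistChange"),
    ("normal",    "executeSpeedChange",   None,               "changed",    "notifySpeedChange"),
    ("normal",    "executeDirChange",     None,               "changed",    "notifyDirChange"),
    ("emergency", "executeSpeedChange",   "val > 0",          "normal",     "notifySpeedChange"),
    ("emergency", "executeSpeedChange",   "val < 0",          "emergency",  "notifySpeedChange"),
    ("emergency", "logState",             None,               "emergency",  None),
]

def enabled(state, event, transitions=controller_transitions):
    """Return the (guard, destination, action) triples enabled for an event in a state."""
    return [(g, dst, act) for (src, ev, g, dst, act) in transitions
            if src == state and ev == event]

# e.g., enabled("normal", "getWallDist") lists the two guarded alternatives discussed in Section 2.4.2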
Note that the transitions are decorated with statecharts' event/action pairs. Event/action pairs and their associated semantics are a powerful mechanism for describing interactions among components. Further discussion of event/action interaction is provided in Chapter 4, where we describe our global reliability model.
2.3.4. Interaction Protocols
Finally, we adopt the widely used notation for specifying component interaction pro-
tocols, originally proposed in [100]. Finite state semantics are used to define valid
sequences of invocations of component operations. Since interaction protocols are
concerned with an “external” view of a component, valid sequences of invocations
are specified irrespective of the component’s internal state or the pre-conditions
required for an operation’s invocation. More specifically:
Interaction_Protocol:
(InitState,
(State:(Direction)Transition->State)+);
State:
(Name);
Transition:
(Label);
An interaction protocol for the controller component is shown in Figure 2-5. Starting from S1, the model specifies the different sequences of events that are acceptable by the component. For example, one or more getWallDist invocations may be followed by an executeDirChange event. Note that the transition from S3 to S1 has no label. In other words, there is no event that corresponds to this transition. Similarly, the transitions from S1 to S2 and one of the transitions from S2 to S2 do not have an event associated with them. These transitions all correspond to a special event called the True event, for which no stimuli are needed in order to be triggered.

Figure 2-5. Model of Controller Component's Interaction Protocol View (finite state machine over the states S1, S2, and S3, with transitions labeled by event/action pairs over getWallDist, executeSpeedChange, executeDirChange, setDefaults, and the notify* actions)
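A protocol model in this form lends itself to a simple conformance check: replay a sequence of invocations against the finite state machine and verify that every event is enabled. The Python sketch below is illustrative; its transition table is a partial, assumed excerpt of Figure 2-5 (only the getWallDist/executeDirChange fragment described above), not the full protocol.

protocol = {
    "init_state": "S1",
    "transitions": {                  # (state, event) -> next state
        ("S1", "getWallDist"):      "S2",
        ("S2", "getWallDist"):      "S2",
        ("S2", "executeDirChange"): "S3",
    },
}

def accepts(events, fsm=protocol):
    """Return True if the event sequence can be replayed from the initial state."""
    state = fsm["init_state"]
    for e in events:
        state = fsm["transitions"].get((state, e))
        if state is None:
            return False
    return True

print(accepts(["getWallDist", "getWallDist", "executeDirChange"]))   # True
print(accepts(["executeDirChange"]))                                 # False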
2.4 Relating Component Models
Modeling components of complex software systems from multiple perspectives is
essential in capturing a multitude of structural, behavioral, and interaction properties
of the system under development. The emergence of dependable systems as a result of following a rigorous modeling and design phase requires ensuring consistency among the different modeling perspectives [8,32,37,38,51]. Our approach addresses the
issue of model consistency in the context of component and system modeling using
the Quartet.
In order to ensure the consistency among the models, their inter-relationships must be
understood. Figure 2-1 depicts the conceptual relationships among these models. We
categorize these relationships into two groups: syntactic and semantic. A syntactic
relationship is one in which a model (re)uses the elements of another model directly
and without the need for interpretation. For instance, interfaces and their input/output
parameters (as specified in the interface model) are directly reused in the static behav-
ior model of a component (relationship 1 in Figure 2-1). The same is true for relation-
ships 2 and 3, where the dynamic behavior and protocol models (re)use the names of
the interface elements as transition labels in their respective finite state machines.
Alternatively, a semantic relationship is one in which modeling elements are designed
using the “meaning” and interpretation of other elements. That is, specification of ele-
ments in one model indirectly affects the specification of elements in a different
model. For instance, an operation’s pre-condition in the static behavior model speci-
fies the condition that must be satisfied in order for the operation to be executed. Sim-
ilarly, in the dynamic behavior model, a transition’s guard ensures that the transition
will only be taken when the guard condition is satisfied. The relationship between a
transition’s guard in the dynamic behavior model and the corresponding operation’s
pre-condition in the static behavior model is semantic in nature: one must be inter-
preted in terms of the other (e.g., by establishing logical equivalence or implication)
before their (in)consistency can be established. Examples of this type of relationship
are relationships 4 and 5 in Figure 2-1.
In the remainder of this section we focus in more detail on the six relationships among
the component model Quartet depicted in Figure 2-1.
2.4.1. Relationships 1, 2, and 3 — Interface vs. Other Models
The interface model plays a central role in the design of other component models.
Regardless of whether the goal of modeling is to design a component’s interaction
with the rest of the system or to model details of the component’s internal behavior,
interface models will be extensively leveraged.
When modeling a component’s behaviors from a static perspective, the component’s
operations are specified in terms of interfaces through which they are accessed. As
discussed in Section 2.3, an interface element specified in the interface model is
mapped to an operation, which is further specified in terms of its pre- and post-condi-
tions that must be satisfied, respectively, prior to and after the operation’s invocation.
In the dynamic behavior and interaction protocol models, activations of transitions
result in changes to the component’s state. Activation of these transitions is caused by
internal or external stimuli. Since invocation of component operations results in
changes to the component’s state, there is a relationship between these operations’
invocations (accessed via interfaces) and the transitions’ activations. The labels on
these transitions (as defined in Section 2.3) directly relate to the interfaces captured in
the interface model.
The relationship between the interface model and other models is syntactic in nature.
The relationship is also unidirectional: all interface elements in an interface model
may be leveraged in the dynamic and protocol models as transition labels; however,
not all transition labels will necessarily relate to an interface element. For example, in
the controller’s dynamic behavior view, transition logState corresponds to an internal
event used to log various parameters in the system when the controller is in the emer-
gency state. Our (informal) discussion provides a conceptual view of this relationship
and can be used as a framework to build automated analysis support to ensure consis-
tency among the interface and remaining three models within a component.
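A first, purely syntactic check implied by relationships 2 and 3 can be automated directly: every transition label in the dynamic behavior or protocol model should either match an interface element or be declared as an internal event. The sketch below is a minimal illustration; the set-based encoding is ours, and the names are those of the controller component.

def undeclared_labels(transition_labels, interface_elements, internal_events=()):
    """Return transition labels that are neither interfaces nor declared internal events."""
    return set(transition_labels) - set(interface_elements) - set(internal_events)

labels = {"getWallDist", "executeSpeedChange", "executeDirChange", "setDefaults", "logState"}
interfaces = {"getWallDist", "executeSpeedChange", "executeDirChange", "setDefaults"}
print(undeclared_labels(labels, interfaces, internal_events={"logState"}))   # set(): consistent
print(undeclared_labels(labels, interfaces))                                 # {'logState'}: flagged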
2.4.2. Relationship 4 — Static Behavior vs. Dynamic Behavior
An important concept in relating static and dynamic behavior models is the notion of
state in the dynamic model and its connection to the static specification of compo-
nent’s state variables and their associated invariant. Additionally, operation pre- and
post-conditions in the static behavior model and transition guards in the dynamic
behavior model are semantically related. We have identified the ranges of all such
possible relationships. The corresponding concepts in the two models may be equiva-
lent, or they may be related by logical implication. Although their equivalence would
ensure their inter-consistency, in some cases equivalence may be too restrictive. A
discussion of such cases is given below.
Transition Guard vs. Operation Pre-Condition. At any given state in a compo-
nent’s dynamic behavior model, multiple outgoing transitions may share the same
label, but with different guards on the label. In order to relate an operation’s pre-con-
dition in the static model to the guards on the corresponding transitions in the
dynamic model, we first define the union guard (UG) of a transition label at a given
state. UG is the disjunction of all guards G associated with outgoing transitions that
37
carry the same label:
where n is the number of outgoing transitions with the same label at a given state, and
G
i
is the guard associated with the i
th
transition.
As an example in Figure 2-4, the dynamic model is designed such that different states
(normal and emergency) are going to be reachable as destinations of the getWallDist()
transition depending on the distance of the encountered obstacle (dist variable in the
transition guards). In this case, at state normal we have:

UG_getWallDist = (dist > 100) OR (0 < dist ≤ 100)
Clearly, if the UG is equivalent to its corresponding operation’s pre-condition, the
consistency at this level is achieved. However, if we consider the static behavior
model to be an abstract specification of the component’s functionality, the dynamic
behavioral model becomes a concrete realization of that functionality. In that case, if
the UG is stronger than the corresponding operation’s pre-condition, the operation
may still be invoked safely. The reason for this is that the UG places bounds on the
operation’s (i.e., transition’s) invocation, ensuring that the operation will never be
invoked under circumstances that violate its pre-condition; in other words, the UG
should imply the corresponding operation’s pre-condition. This is the case for the get-
WallDist() operation in the rover’s controller component.
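When guards and pre-conditions are executable predicates, this implication can be checked mechanically, if only approximately, by sampling the variables' ranges. In the sketch below the two guards are those of Figure 2-4, while the pre-condition assumed for getWallDist (dist > 0) and the sampled range are ours, chosen purely for illustration.

def implies(p, q, samples):
    """Check p(x) => q(x) over a finite sample of variable assignments."""
    return all((not p(x)) or q(x) for x in samples)

guards = [lambda d: d > 100, lambda d: 0 < d <= 100]
union_guard = lambda d: any(g(d) for g in guards)          # UG_getWallDist
pre_condition = lambda d: d > 0                            # assumed pre-condition of getWallDist

print(implies(union_guard, pre_condition, range(-50, 500)))   # True: the UG is at least as strong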
State Invariant vs. Component Invariant. The state of a component in the static
behavior specification is modeled using a set of state variables. The possible values of
these variables are constrained by the component’s invariant. Furthermore, a compo-
nent’s operations may modify the state variables’ values, thus modifying the state of
the component as a whole. The dynamic behavior model, in turn, specifies internal
details of the component’s states when the component’s services are invoked. As
described in Section 2.2, these states are defined using a name, a set of variables, and
an invariant associated with these variables (called state’s invariant). It is crucial to
define the states in the dynamic behavior state machine in a manner consistent with
the static specification of component state and invariant.
Once again, an equivalence relation among these two elements may be too restrictive.
In particular, if a state’s invariant in the dynamic model is stronger than the compo-
nent’s invariant in the static model (i.e., state’s invariant implies component’s invari-
ant), then the state is simply bounding the component’s invariant, and does not permit
for circumstances under which the component’s invariant is violated. This relation-
ship preserves the properties of the abstract specification (i.e., static model) in its con-
crete realization (i.e., dynamic model) and thus may be considered less restrictive
than equivalence. A simple case is that of the state normal and its invariant in the con-
troller component. Relating the invariant of the state normal, and the controller com-
ponent invariant we have:
normal_inv: (0 ≤ dir < 360) AND (0 < speed < 500) AND (dist > 100)
Controller_inv: (0 ≤ dir < 360) AND (0 < speed < 1000) AND (dist ≥ 0)
normal_inv ⇒ Controller_inv
State Invariants vs. Operation Post-Condition. The final important relationship
between a component’s static and dynamic behavior models is that of an operation’s
post-condition and the invariant associated with the corresponding transition’s desti-
nation state. For example, in Figure 2-3, the post-condition of the op_setDefaults
operation is specified as:
op_setDefaults_Post: (~speed > 100) AND (~dir = 0)

while state normal is a destination state for setDefaults() and we have:

normal_inv: (0 ≤ dir < 360) AND (0 < speed < 500) AND (dist > 100)
In the static behavior model, each operation’s post-condition must hold true following
the operation’s invocation. In the dynamic behavior model, once a transition is taken,
the state of the component changes from the transition’s origin state to its destination
state. Consequently, the state invariant constraining the destination state and the oper-
ation’s post-condition are related. Again, the equivalence relationship may be unnec-
essarily restrictive. Analogous to the previous cases, if the invariant associated with a
transition’s destination state is stronger than the corresponding operation’s post-con-
dition (i.e., destination state’s invariant implies the corresponding operation’s post-
condition), then the operation may still be invoked safely. As an example consider the
specification of state normal and operation op_setDefaults shown above. Clearly, the
appropriate implication relationship does not exist. The op_setDefaults operation may
assign the value of the variable speed to be greater than 500. Such assignment could
result in a fault in the component, which in turn, could negatively affect the compo-
nent’s dependability.
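The missing implication can likewise be exposed by a mechanical search for a counterexample: an assignment that satisfies the operation's post-condition but not the destination state's invariant. In the sketch below the two conditions are those shown above (the dist clause of normal's invariant is omitted, since op_setDefaults does not constrain dist), and the searched ranges are arbitrary.

post_cond = lambda speed, direction: speed > 100 and direction == 0
normal_inv = lambda speed, direction: 0 <= direction < 360 and 0 < speed < 500

counterexamples = [(s, d) for s in range(0, 1000, 50) for d in (0,)
                   if post_cond(s, d) and not normal_inv(s, d)]
print(counterexamples[:3])   # [(500, 0), (550, 0), (600, 0)]: speed may exceed 500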
2.4.3. Relationship 5 — Dynamic Behavior vs. Interaction Protocols
The relationship between the dynamic behavior and interaction protocol models of a
component is semantic in nature: the concepts of the two models relate to each other
in an indirect way.
As discussed in Section 2.2, we model a component’s dynamic behavior by enhanc-
ing traditional FSMs with state invariants. Our approach to modeling interaction pro-
tocols also leverages FSMs to specify acceptable traces of execution of component
services. The relationship between the dynamic behavior model and the interaction
protocol model thus may be characterized in terms of the relationship between the
two state machines. These two state machines are at different granularity levels how-
ever: the dynamic behavior model details the internal behavior of the component
based on both internally- and externally-visible transitions, guards, and state invari-
ants; on the other hand, the protocol model simply specifies the externally-visible
behavior of the component. In the case of the SCRover models for instance, the
dynamic behavior model contains a transition logState used to log the status of the
component while in the emergency state. This is an internal operation of the compo-
nent and thus is not visible to the other components through interfaces, and as such is
not modeled in the interaction protocol model.
Our goal here is not to define a formal technique to ensure the equivalence of two arbitrary state machines. This task cannot be done directly for models of different granularity like ours, and would first require some calibration of the models to make them comparable. Moreover, several approaches have studied the equivalence of statecharts
[6,72,133]. Instead, we provide a more pragmatic approach to ensure the consistency
of the two models. We consider the dynamic behavior model to be the concrete real-
ization of the system under development, while the protocol of interaction provides a
guideline for the correct execution sequence of the component’s interfaces. For exam-
ple, recall models of the controller component specified in Figure 2-4, and Figure 2-
5. Assuming that the interaction protocol model demonstrates all the valid sequences
of operation invocations of the component, it can be deduced that multiple consecu-
tive invocations of setDefaults() are permitted. However, based on the dynamic
model, only one such operation is possible. Consequently, the dynamic and protocol
models are not equivalent. Since the controller component’s dynamic behavior FSM
is less general than its protocol FSM, some legal sequences of invocations of the com-
ponent are not permitted by the component's dynamic behavior FSM. Such inconsistencies in the models of the components may contribute to a fault in the implementation, which in turn may impact the component's dependability.
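One pragmatic way to carry out this comparison is to replay sequences accepted by the protocol FSM against the dynamic behavior FSM (with guards ignored) and flag those that are not executable. The sketch below is a deliberately tiny, assumed encoding restricted to the setDefaults fragment discussed above; it only shows the shape of the check.

dynamic_fsm  = {"init": "emergency", "trans": {("emergency", "setDefaults"): "normal"}}
protocol_fsm = {"init": "S1",        "trans": {("S1", "setDefaults"): "S1"}}

def executable(seq, fsm):
    state = fsm["init"]
    for event in seq:
        state = fsm["trans"].get((state, event))
        if state is None:
            return False
    return True

seq = ["setDefaults", "setDefaults"]
print(executable(seq, protocol_fsm))   # True: the protocol permits consecutive setDefaults
print(executable(seq, dynamic_fsm))    # False: the dynamic model allows only one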
2.4.4. Relationship 6 — Static Behavior vs. Interaction Protocol
The interaction protocol model specifies the valid sequence by which the compo-
nent’s interfaces may be accessed. In doing so, it fails to take into account the compo-
nent’s internal behavior (e.g., the pre-conditions that must be satisfied prior to an
operation’s invocation). Consequently, we believe that there is no direct conceptual
relationship between the static behavior and interaction protocol models. Note, how-
ever, that the two models are related indirectly via a component’s interface and
dynamic behavior models.
2.5 Implications of the Quartet on Reliability
The goal of our work is to support modeling architectural aspects of complex soft-
ware systems from multiple perspectives and to ensure the inter- and intra- consis-
tency among these models. Such consistency is critical in building dependable
software systems, where complex components interact to achieve a desired function-
ality. Dependability attributes must therefore be “built into” the software system
throughout the development process, including during the architecture phase. The
Quartet serves as the central piece of our reliability models. The analyses enabled by
the Quartet to ensure intra- and inter-consistency among the models reveal defects that may cause failures during the components' operation. These failures, in turn, reduce the reliability of the components and consequently the reliability of the system.
The next two chapters of this dissertation describe our approach to estimating the reli-
ability of individual components and the overall reliability of the system. In order to
incorporate the results of architectural analyses, we need to quantify the influence of each defect on the component's and the system's operations. To do this, we have developed an architectural defect classification along with a pluggable cost framework that takes domain-specific information into consideration when quantifying the defects. Our
reliability models then leverage the Quartet views as well as the quantification results,
to estimate both components’ and system’s reliability.
2.6 Defect Classification and Cost Framework
Architectural models represent properties of the system, from its high-level structure
in terms of its constituent components and their configuration, to the low-level behav-
ior of its constituent components. Analyses of these models reveal defects that may
result in component-level and system-level failures. These failures adversely affect
the reliability of the system.
The nature of these defects ranges from structural issues to behavioral problems.
Some may result in a catastrophic failure, while others may cause simple discrepan-
cies in the operation of components and their interactions. For example, a unit mis-
match in NASA's JPL Mars Climate Orbiter mission in 1999 resulted in the loss of the spacecraft, at a total cost of over $300 million. This mismatch is considered to
be a behavioral problem, where two communicating components exchanged data
using two different units of measurement. While in a safety-critical system no failure
may be tolerable, in a different domain the same type of failure may only cause minor
problems with the system’s operation. All these factors affect how consequences of
defects must be measured and incorporated when modeling the reliability of a system.
We have developed a taxonomy of architectural defects that helps us classify various
defects discovered during the architectural modeling and analysis phase. We use this
taxonomy in conjunction with a pluggable cost framework to quantify the effect of
specific defects on the reliability of a component. Both of these approaches to defect
classification and quantification are pluggable in the context of our reliability models:
other relevant techniques may be substituted instead. In this section, first we describe
the defect classification in detail, and then introduce our cost framework.
2.6.1 Taxonomy of Architectural Defects
Our experience with several architectural modeling and analysis techniques [14,110]
enabled us to identify the existence of a pattern to the types of defects various model-
ing approaches attempt to reveal. This led us to develop a taxonomy of architectural
defects that is applicable to a wide range of design and architectural problems, and is
independent of the specific modeling approach adopted. The result is depicted in
Figure 2-6.
At its top level, the taxonomy classifies architectural defects as Topological errors or
Behavioral inconsistencies. Topological errors tend to be global to the architecture
and are concerned with aspects related to the configuration of components and con-
nectors in the system. They are often a result of the violation of constraints imposed
by architectural styles.
Some topological errors are directional in nature: the specific direction of communi-
cation required by the style is violated. An example is when in a Client-Server sys-
tem, the server component requests services from the client. In our experience with
modeling the SCRover system [110], an instance of this error was detected as follows.
Figure 2-6. Taxonomy of Architectural Defects (Architectural Defects are divided into Topological Errors, comprising Directional and Structural (Usage, Incomplete), and Behavioral Inconsistencies, comprising Interface Signatures, Static Behavior Pre/Post Conditions, and Interaction Protocols)
Recall the SCRover example introduced in Chapter 2. The controller component
issues commands to the actuator to change the direction or speed of the rover. In
other words, the controller requires certain functionality that the actuator provides. A
directional mismatch between the two components was revealed that reversed this
relationship: the actuator relied on the controller to provide needed functionality.
This directly violates the Mission Data System (MDS) architectural style according to which SCRover is designed and developed.
Other topological errors are structural in nature and are further divided into usage
violations and incompleteness of the specification. An example of a usage violation is
when a communication link between two components is missing, or alternately, when
a communication link between components exists where it should not be present (i.e.,
an incorrect use of the resources in the system). An SCRover related example is
when, due to a design error, the actuator component directly modifies the values in
the database.
The last type of topological error relates to the incompleteness of the specification,
and manifests itself when there is insufficient information for specifying the proper-
ties of the architecture’s components and connectors.
Behavioral inconsistencies are the second category of architectural defects that are
local to a component. An interface defect occurs when the signatures of the corre-
sponding provided and required services of two components are mismatched. For
example, in the case of the SCRover’s controller component discussed in Chapter 2,
the controller component provides an interface of type query that queries the estima-
tor component for the distance to the wall or other obstacles. The corresponding inter-
face element is defined as follows:
q: -getWallDist():DistanceType;
The returned value in this case is of type DistanceType. If the estimator component that requires this service expects the change in the distance to an obstacle via a different user-defined type such as LengthType, as shown below, then there is a signature mismatch between the provided and required services. As a result of this mismatch, the query cannot be processed by the estimator component:
q: +getWallDist():LengthType;
A static behavioral inconsistency is concerned with mismatches between the pre- and
post-conditions of corresponding provided and required services in two components.
For example, in the case of the SCRover's controller component, the setDefaults service required by the component requires the value of the dist state variable to be greater than 100 as its pre-condition:
op_setDefaults{
preCond: {dist > 100};
postCond: {~speed > 100 AND dir = 0};
mapped_interfaces: {setDefaults};
}
If the corresponding service in the database component that provides this service to the controller component can only assume dist values greater than 120, then communication between the two components may exhibit some problems.

For instance, the controller component may send a setDefaults request when the rover is 105 (units) away from an obstacle. The database component will not be able to process this request because, according to its specification, the dist value has to be at least 120. This would be an example of a pre- and post-condition mismatch. Principles upon which this type of analysis is based can be found in [76, 136].
Finally, a protocol inconsistency reveals mismatched interaction protocols among
communicating components. For example according to the controller component’s
interaction protocol model depicted in Figure 2-5, upon instantiation the component
can either react to a getWallDist event generated by the component itself, or could
react to a notifyDistChange action (generated by the estimator component). Accord-
ing to this model, the controller will not be able to react a setDefaults event upon
instantiation. This would indicate that if the database component model would only
react to a setDefaults request upon instantiation, the two components may be unable
to communicate as intended. This is an example of a protocol mismatch between the
two components.
Our classification framework here is one that was developed experimentally based on
our experience in a collaborative project between NASA’s JPL, University of South-
ern California, and Carnegie Mellon University [110]. Our reliability prediction
approach leverages this specific classification but is, in essence, independent of it. Classifying the defects using a taxonomy can help the architect
distinguish among different classes of defects (and the possible subsequent failures),
and provide a basis for quantifying the influence of each defect on the component’s
reliability. We have also developed a simple cost framework (described next) to quan-
tify this influence according to cost factors applicable to the specific domain.
2.6.2 Cost Framework
An architectural defect classification such as the one presented here helps the
designer to distinguish among different types of defects in the system. These defects,
at the level of architectural specification, may translate to failures during the compo-
nent’s operation. We assume that these failures are all recoverable failures: it is possi-
ble for a component to recover from them during its operation. The recovery may be
automatic (e.g., self-healing systems), or may require human intervention. This
assumption does not pose any limitation on our approach, since both recoverable and
non-recoverable failures may be represented in our models: a non-recoverable failure
is a failure for which the probability of recovery to a non-failure state is zero. Since
our approach is only concerned with the probability of occurrence or recovery from
failures, the specific recovery techniques and associated processes are outside the
scope of this research.
The probability that a component recovers from a certain type of failure (e.g., a fail-
ure resulting from a protocol mismatch between two components) depends on many
parameters. Examples include the impact of a component failure on other compo-
nents’ operations and system operations, the automatic correction and adaptation
mechanisms built into the system, manual error handling procedures, as well as the
effort, time, and the cost associated with the recovery process. These parameters are
highly domain and application dependent, and as such must be designed and adjusted
for each domain, specifically by a domain expert. For example, the types of relevant parameters and their associated values in the computer games domain would be very different from those in a safety-critical system where human lives are at stake. Perhaps the former primarily takes economic aspects into consideration (such as time and resources), whereas the latter may take a wider view and incorporate various parameters that measure the impact and risks to human lives associated with failures.
We call these various parameters cost factors, and leave designation of different cost
factors to the domain expert. Much research has focused on developing a comprehen-
sive set of cost factors [12,46,116]. While we select a few for our analysis, designa-
tion of various factors, and justification of their instantiation is not an integral part of
our reliability modeling, and as such is beyond the scope of this research.
In order to estimate the probability of recovery from each type of failures, we first
estimate the cost of such recovery. Using a mathematical cost function we incorporate
the values of all cost factors, and derive a single cost value associated with the recov-
ery from each failure type (cost). The recovery probability is then calculated as the
complementary probability of the cost value (1-cost): the higher the cost of recovery,
the lower the probability of recovery, and vice-versa. We now present the details of
our cost framework, as well as a specific adaptation of it applied to the SCRover sys-
tem.
Let us assume:

Θ = {θ_1, ..., θ_n}: the set of all cost factors defined by the expert.

In our adaptation for the SCRover project, we define four cost factors that influence the probability of recovery from failures: severity of a defect, the effort required for its mitigation, the impact of a particular defect on the environment, and the development team's expertise. In other words:

Θ = {θ_1, θ_2, θ_3, θ_4}, where
θ_1: severity of the defect
θ_2: effort needed for mitigation of the defect
θ_3: impact
θ_4: team expertise
Once these cost factors are defined, for each defect type, a numerical value in the
range of [0,1] must be assigned to each factor by the domain expert. The domain
expert may be able to obtain evidence from past experiences with earlier versions of
the application, or may use her professional judgement to assign these values.
A sample instantiation of this framework in conjunction with our defect classification
is shown in Table 2-1. According to this instantiation, the severities of the usage,
incomplete, and signature type of defects are all considered to be the same and are
assigned a 0.9 value, while the pre-/post-condition and interaction protocol mis-
matches are considered to be less severe (0.7 and 0.6 respectively). This could be jus-
tified by considering that each of the usage, incomplete, and signature types of
defects results in complete inability of the related components to communicate,
whereas the other two defect types would only indicate that under some circum-
stances there may be a problem with the components’ interaction. Furthermore, while
the interaction protocol defect is assumed to be the least severe type of defect, it is
designated as the one that requires the most effort for mitigation. This goes back to
the nature of protocol mismatches. Identifying and mitigating this type of defect can
be potentially very difficult, and typically requires a lot of effort. Furthermore, this
instantiation specifies the impact level for each particular defect type on the environ-
ment (other components and the system as whole). In this particular case, the impact
of the pre-/post-condition and interaction protocol defects is assumed to be lower
than those of usage, incomplete, and signature defects. The justification is that in the
former case, a mitigation solution may not involve other components that communi-
cate with this component. However, in the case of usage, incomplete, and signature
defects, it is more likely that the fix involves more than the defective components.
Finally, we assume the expertise of the development team to have a fixed value across
all defect types. This factor could vary depending on the qualification of the develop-
ment team (with 1 denoting highly qualified), and could possibly be different across
different defect types, if they are assigned to different development teams.
The last step in the recovery cost estimation is to define the appropriate cost function
that incorporates various cost factors to calculate the recovery probability for a given
defect type. Intuitively, this cost function has to be specifically designed for the
domain by taking various domain-specific concerns into consideration. For instance,
a cost function for a video game software is probably quite different from one used in
a safety-critical system. Other socio-economical and cultural factors, such as the
number of developers responsible for defect mitigation and quality assurance, the
Table 2-1. Sample Instantiation of the Cost Framework

            Usage   Incomplete   Signature   Pre/Post Condition   Interaction Protocol
Severity    0.9     0.9          0.9         0.7                  0.6
Effort      0.4     0.7          0.2         0.4                  0.8
Impact      0.9     0.9          0.9         0.65                 0.5
Expertise   0.75    0.75         0.75        0.75                 0.75
time to delivery, the distribution of the development team (e.g., offshoring) would all
pose special circumstances, which may prevent a single cost function to be applicable
to a variety of scenarios. Research into the selection of cost functions is beyond the
scope of this dissertation. Instead we offer a simple technique to incorporate all cost
factors into a single number. Alternatively, other cost functions may be used at this
stage. Our reliability prediction models are oblivious to the specific cost functions
used for quantification.
The role of individual cost factors may vary in the overall cost estimation. For
instance in our case, as the severity, effort, or the impact of a defect increases, it is
expected that the overall cost of recovery would increase, resulting in a lower recov-
ery probability. In some cases however, this relationship may be reversed. For exam-
ple, an increase in the development team's expertise could indicate a lower cost for recovery. In these circumstances, to avoid confusion, we suggest taking the complementary value of the cost factor into consideration. In our case, we use the value 1-expertise in our cost estimation.
We use a Radar chart (aka Polar chart) to plot values of various cost factors. Each
cost factor is plotted along an axis. The number of axes is equal to the number of designated cost factors, and the angles between adjacent axes are equal. Figure 2-7 depicts our
instantiation of the Radar chart. Four axes represent severity, effort, impact, and
expertise. Each axis has a maximum length of 1, which is consistent with cost factors
taking a value between 0 and 1. A point closer to the center on any axis depicts a low
value, while a point near the edge of the circle depicts a high value for the corre-
sponding cost factor.
Radar charts are useful when incorporating several indicators (cost factors) related to
one item (a defect type). The cumulative effect of the cost factors can be calculated by
finding out the surface area formed by the axes. The expertise factor is marked with a
* which indicates that it has an inverse effect on the cost estimation (i.e., the value of
1-expertise is used in the area calculation). This decision is made to make the influ-
ence of changes to the cost value consistent and intuitive.
We use a triangulation method to calculate this surface area. The overall area is
divided into triangles that are formed between two consecutive axes and the line con-
necting two points on the axes. Figure 2-7 depicts the four triangles formed by the
Figure 2-7. The Radar Chart View for the Cost Framework (four axes: severity, effort, impact, and expertise*, each ranging from 0 at the center to 1 at the edge)
four values on the axes. Assuming that the values for the four cost factors θ_1, θ_2, θ_3, and θ_4 are denoted by τ_1, τ_2, τ_3, and τ_4 respectively, the overall area is estimated using the following formula:

area = (1/4) × sin(2π/num) × [ Σ_{i=1}^{3} (τ_i × τ_{i+1}) + (τ_4 × τ_1) ]

where num is the number of cost factors (number of axes) in the Radar chart, and the angle between each two axes is 2π/num.
Charts corresponding to the instantiation presented in Table 2-1 are shown in
Figure 2-8 (top). Each chart corresponds to a defect type, and the area under the sur-
face for each chart corresponds to the calculated cost of recovery for each defect type.
The calculated recovery probability based on the cost of recovery is shown in the bot-
tom diagram of Figure 2-8.
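As a worked check of this instantiation, the recovery probabilities reported in Figure 2-8 can be recomputed from the Table 2-1 values. The Python sketch below assumes the area formula given above with num = 4 axes (so sin(2π/num) = 1), the axis order severity, effort, impact, expertise*, and the 1-expertise convention; recovery probability is taken as 1 minus the computed area.

import math

factors = {  # severity, effort, impact, expertise
    "Usage":                (0.9, 0.4, 0.9,  0.75),
    "Incomplete":           (0.9, 0.7, 0.9,  0.75),
    "Signature":            (0.9, 0.2, 0.9,  0.75),
    "Pre/Post Condition":   (0.7, 0.4, 0.65, 0.75),
    "Interaction Protocol": (0.6, 0.8, 0.5,  0.75),
}

def recovery_probability(severity, effort, impact, expertise, num=4):
    taus = [severity, effort, impact, 1 - expertise]        # expertise* axis uses 1 - expertise
    pairs = sum(taus[i] * taus[(i + 1) % num] for i in range(num))
    cost = 0.25 * math.sin(2 * math.pi / num) * pairs        # radar-chart area = recovery cost
    return 1 - cost

for defect, values in factors.items():
    print(f"{defect:22s} {recovery_probability(*values):.6f}")
# Usage 0.707500, Incomplete 0.572500, Signature 0.797500,
# Pre/Post Condition 0.780625, Interaction Protocol 0.711250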
One particular characteristic of the Radar chart is that equal weights are assigned to
all cost factors. However, under other circumstances, the importance of each cost fac-
tor may vary. It is thus debatable if it is adequate to treat all cost factors as having the
same importance. In those cases, our area calculation formula may be adapted to assign different weight values to the individual cost factors.
Figure 2-8. Graphical View of the Cost Framework Instantiation for Different Defect Types (Radar charts for the Usage, Incomplete, Signature, Pre/Post Condition, and Interaction Protocol defect types over the severity, effort, impact, and expertise* axes, together with the resulting recovery probability for each defect type: Usage 0.7075, Incomplete 0.5725, Signature 0.7975, Pre/Post Condition 0.780625, Interaction Protocol 0.71125)
Chapter 3: Component Reliability
At the architectural level, the intended functionality of a software component is cap-
tured in structural and behavioral models [77]. Analysis of these models may reveal
potential design problems that can affect the component’s reliability. Our goal is to
provide a framework to predict the reliability of software components based on their
structural and behavioral models. Our framework can be used to provide analysis of
the components’ reliability before they are fully implemented and deployed, taking
into account the uncertainties associated with early reliability prediction. It can also
be used later on during the implementation, when more information on the compo-
nent’s operation and deployment is available, as an ongoing analysis tool aiding the
process of improving the reliability of the components and consequently the reliabil-
ity of the entire system.
A component’s reliability is estimated as the probability that it performs its intended
functionality without failure. A failure is defined as the occurrence of an incorrect
output as a result of an input value that is received, with respect to the specification
[101]. Moreover, an error is a mental mistake made by the designer or programmer. A
fault or a defect is the manifestation of that error in the system. It is an abnormal con-
dition that may cause a reduction in, or loss of, the capability of a functional unit to
perform a required function; it is a requirements, design, or implementation flaw or
deviation from a desired or intended state [61]. In other words, faults are causes of fail-
ures.
A highly reliable component, therefore, is a component for which the probability of
the occurrence of failures is close to zero. We build our reliability model upon the
notion of failure states: a component’s dynamic behavioral model is augmented with
one or more failure states representing occurrence of a fault in the component’s oper-
ation. We assume that failures are recoverable [101]: the component may recover
from them with or without external interventions. A component’s reliability is then
predicted as the probability that it is operating normally at time t_n in the future, as n approaches infinity. The assumption of recoverability from failures implies that the model does not have any absorbing state, i.e., a state that, once entered, is never left (the probability of leaving it is zero).¹

1. An alternative approach to modeling reliability assumes failure states are absorbing states. The reliability is then calculated as the mean time required for the component to arrive at an absorbing (failure) state.
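The following sketch illustrates this notion of reliability on a deliberately small, hypothetical Markov model: a two-state behavioral model augmented with one recoverable failure state, whose limiting distribution is obtained by repeatedly applying the transition matrix. The transition probabilities are invented for illustration and do not come from the SCRover models.

import numpy as np

states = ["normal", "changed", "FAILURE"]
P = np.array([            # row = current state, column = next state; rows sum to 1
    [0.80, 0.15, 0.05],
    [0.70, 0.20, 0.10],
    [0.60, 0.00, 0.40],   # recovery back to normal with probability 0.6, so FAILURE is not absorbing
])

dist = np.array([1.0, 0.0, 0.0])   # start in the initial state
for _ in range(1000):              # approximate t_n for large n
    dist = dist @ P

reliability = dist[:2].sum()       # probability of being in a non-failure state
print(round(reliability, 4))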
Our approach to reliability modeling involves three phases of activities. Figure 3-1
depicts the high-level methodology. The first phase is Architectural Modeling, Analy-
sis, and Quantification. In Chapter 2, we explained in detail how standard analysis
techniques are applied to the Quartet models of a component to reveal inconsisten-
cies. These inconsistencies represent defects that could result in failures during the
component's operation. The failures in turn contribute to the component's unreliability. However, not all defects (and subsequent failures) are “created equal”. The types
of defects could determine the severity of the subsequent failures, which in turn could
determine the cost required for recovering from them. In Section 2.6 of Chapter 2, we
presented our defect classification and cost framework, which together enable the
architect to quantify the effect of various types of defects on components’ reliability.
As discussed, our reliability model is independent from the specific classification and
quantification technique, and alternative approaches may be used instead.
The next phase of our component reliability prediction framework is the Operational
Profile Modeling. An operational profile is a quantitative characterization of how the
Figure 3-1. Component Reliability Prediction Framework (phases: Architectural Modeling, Analysis, and Quantification; Profile Modeling; and Reliability Prediction Analysis, with functional blocks Defect Quantification, HMM Builder, HMM Solver, and Reliability Computation, and artifacts including Domain Knowledge, Architectural Models, Defects, Training Data, the Markov Model, and the resulting Component Reliability value)
component will be used. It is an ordered set of operations that the software component
performs along with their associated probabilities. Since during the architectural
phase of software development, data on a given component’s operational profile may
not be available, an architecture-level reliability modeling approach must take this
uncertainty into consideration. In Section 3.2, we discuss a technique that can handle
this type of uncertainty under certain conditions and aid the reliability modeling pro-
cess.
Finally, the last phase of our approach is the Reliability Prediction step. Given (1) the
architectural models, associated defects revealed by analysis, and quantification of
these defects by the cost framework (obtained from the modeling, analysis, and quan-
tification phase), and (2) an operational profile (obtained from the profile modeling
phase), our reliability prediction framework offers a range of analyses on component
reliability values. A range of analyses is necessary when taking uncertainties associated with early reliability prediction into consideration. Details of this phase are presented in Section 3.3.
An important observation in reliability modeling of software systems during the
architecture phase is that depending on various development scenarios, the artifacts
available vary significantly. The types of these artifacts influence the steps required
by the reliability model. Consequently, we construct a simple classification of various
forms of the reliability modeling problem (presented in Section 3.1), and use it to
explain which particular steps in the reliability model are applicable to a particular
form of the problem.
The rest of this chapter is organized as follows. We first describe a classification of
various forms of the component reliability estimation problem in Section 3.1. We then
discuss our approach to component profile modeling in Section 3.2. Section 3.3 pre-
sents our reliability prediction framework.
3.1 Classification of the Component Reliability Modeling Problem
When measuring the reliability after the implementation phase is complete (e.g., during testing), components are typically deployed in a host system, and various analyses
and measurements are performed in conditions that mimic the intended operational
profile of the component. At the architectural level, however, relying on the availabil-
ity of such information may not be a reasonable assumption, particularly because no
implementation artifact may be available. Depending on the process adopted for soft-
ware development (waterfall, spiral, agile, etc.), the type and the amount of informa-
tion relevant to a component’s operational profile varies significantly. For example, in
a spiral development process, at any time after the completion of the initial iterations,
some data representative of the operational profile may be obtained from past itera-
tions. Furthermore, in cases of product-line software development [128], when a com-
ponent under development may be an upgrade to an existing version of the same
component, data from previous versions may be available. In other cases however, for
example when architecting a brand new component given a set of requirements (e.g.,
UML’s use-cases as scenarios), no such data may be available.
In cases where operational profile-related data is available, an important factor is
whether this data is obtained from run-time monitoring of a version of the component,
or from architecture-time simulation of its architectural models. The primary differ-
ence in the two cases is the type of available data.
When gathering data from simulation of architectural models, the data could include
both the sequences and the frequency of component’s interface invocations, as well as
the associated sequence of states. The simulation of dynamic architectural models
(e.g., dynamic behavioral model) may be based on the user’s interaction with a simu-
lator: the user would manually control external stimuli and conditions that would
determine how the component would behave under different circumstances (e.g.,
[36]). The order and type of stimuli in this case may be based on the user’s percep-
tion. In other cases where a run-time version of the same component is available, the
simulation could also be performed by leveraging run-time observable stimuli during
a software component’s execution and feeding that information as inputs to the simu-
lator (e.g., [31]). In this case the order and type of stimuli is directly obtained from
run-time monitoring of the system and does not depend on the user’s perception.
When monitoring (instrumenting) the run-time operation of a component, however,
depending on the circumstances it is possible that only the sequences, and thus the
frequency of invocation of component’s interfaces, are logged. This for instance
could be due to limitations on availability of the source code (e.g., COTS-based sys-
tems), or to the use of distributed middleware technologies such as COM, CORBA,
and J2EE, where traditionally Interface Definition Languages (IDLs) are used as the
basis for data gathering [62]. The type of gathered data moreover depends on the goal
of the runtime monitoring process. If runtime monitoring is leveraged as a tool to aid
testing activities, the results are likely to contain additional debugging data. On the
other hand, when instrumenting the code to identify interactions and relationships
among components and subsystems (e.g., [62]), the results are unlikely to contain all
the necessary data to reconstruct the states as modeled in the architectural models.
This is due to the abstract nature of the notion of states: they do not exist at the imple-
mentation level with the same granularity or in the same form, and it is often very dif-
ficult, if not impossible, to keep track of all the parameter changes to reconstruct the
states as specified in the architectural models.
Finally, there are cases when no data on the component’s operational profile may be
obtained. Such a case could, for instance, occur when a new component is being
designed and developed. This case is the main focus of this dissertation, and while our
reliability model is applicable to other cases described here, our discussion will be
primarily focused on this scenario. Particularly, in cases where the ability to simulate
dynamic behavioral models does not exist, a synthesis process is used to produce the
data. Our synthesis process described later in this chapter relies upon input from a
domain expert to synthesize the necessary data. In the worst case, the operational pro-
file obtained via this approach would be random, and thus not reflective of the com-
ponent’s actual eventual usage. However, as shown in our evaluation in Chapter 6, the
synthesis process could greatly benefit from the domain knowledge of an expert.
Table 3-1 depicts a classification of various forms of the component reliability esti-
mation problem. We will use this classification throughout this chapter to help us
identify specific steps required in reliability prediction. In general, since we assume
that architectural models of the system are available (hence the architectural reliability approach), case 4 falls outside the scope of this research. It is noteworthy that this
case is addressed by existing reliability modeling approaches applicable to the testing
phase (e.g., [23,28,41,44,45,53,66,83]). The other three cases (cases 1, 2, and 3)
Table 3-1. Classification of the Forms of the Reliability Modeling Problem Space

          Architectural                Source of Data
Cases     Models           Runtime    Simulation    Synthesis
Case 1    +                +
Case 2    +                           +
Case 3    +                                         +
Case 4    -
assume availability of the component's architectural models, and leverage the data obtained from runtime monitoring of the component, simulation of the models, or a synthesis process, respectively.
3.2 Profile Modeling
Addressing the problem of a component’s reliability modeling requires knowledge of
its operational profile. Estimating a representative operational profile for a component
before the completion of the development and deployment process is a challenging
problem that must be handled when predicting the component’s architectural reliabil-
ity.
A representative and reasonably complete operational profile of a component can
only be obtained by observing its actual operation after deployment. This profile
would include data on the order and the frequency of invocation of the component’s
operations. As discussed in Chapter 2, a component’s operations are accessed via
interfaces. These interfaces correspond to transitions in the component’s dynamic
behavioral model. Invocations of the component's interfaces serve as stimuli that trig-
ger corresponding transitions in the behavioral model. Consequently, data on the fre-
quency of the component’s interface invocations may be translated into probabilities
of activation of transitions in the component’s behavioral models. In turn, these prob-
abilities, together with the behavioral model itself, may be used to predict the reliabil-
ity of the given component.
During the architectural phase, however, it is not always reasonable to assume that
such data is available. A reliability model applicable to the architectural stage thus
needs to account for and handle the uncertainties associated with unknown opera-
tional profiles. The discussion on the classification of various forms of the component
architectural reliability problem (Table3-1) identified three primary cases when
obtaining data associated with the operational profile: Case1 – runtime monitoring of
an existing component; Case 2 – simulating component’s architectural models; or
Case 3 – synthesizing data using domain information. As mentioned before, the pri-
mary focus of our research so far has been on case 3. Below we first describe how our
approach addresses the problem as related to the data synthesis case, and then briefly
discuss how variations of our approach relate to the other two cases.
3.2.1. Data Synthesis Approach
When building a brand new component, objective data on its operational profile may
not be available during the architectural phase. Moreover, it may not be always rea-
sonable to assume that dynamic models of the component’s behavior can be simulated
(e.g., waterfall development process, and scenario-based requirement development
using UML’s use-cases). In these cases, we use domain knowledge to generate data
representing the component’s operational profile.
In particular, we ask the architect to explicitly specify several valid sequences of the
component’s interfaces that may be invoked to achieve various functionality. The fre-
quency of these invocations is then statistically obtained given a set of sequences.
These sequences could be inferred from the dynamic behavioral model (e.g., in our
approach), by considering various paths through the corresponding statechart model.
For example, recall the controller component discussed in Chapter 2. Its dynamic
behavioral model is depicted in Figure 3-2. A domain expert may identify {getWallD-
ist, getWallDist, getWallDist, executeDirChange} as a desired sequence of interfaces
to be invoked starting at state init. This sequence represents when the rover is driving
in a particular direction under normal conditions, and then changes direction to avoid
Figure 3-2. Controller's Dynamic Behavior Model (Guards Omitted for Brevity): the statechart of Figure 2-4, over the states init, normal, emergency, and changed, repeated here without transition guards.
an obstacle. Typically, the expert would identify several such sequences. The number
of these sequences, along with their lengths, are critical factors for building a repre-
sentative operational profile.
To build the corresponding operational profile, we need to translate the data embed-
ded in these sequences into frequency of activation of corresponding transitions in the
model. This, in turn, relates to the probability of invocation of various components’
interfaces (aka a component’s operational profile). For instance, consider a hypotheti-
cal state s_i in the model where two possible outgoing transitions may be activated. If the expert has identified a set of 10 sequences of interface invocations starting at s_i, we can obtain the frequency of activation of each transition by statistically analyzing
the sequences. Let us assume that the frequencies of the two transitions are 7 and 3
respectively. That is, 7 out of the 10 sequences identify the first transition as their
invoked interface. Consequently, the transition probabilities of the two transitions can
be inferred to be 0.7 and 0.3 respectively. It is clear that a larger set of interface
sequences results in generation of a more representative operational profile.
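As a small illustration of this counting step (the helper and sequences below are hypothetical, using interface names from the controller example, and are not part of our tool chain), the first interface invoked in each expert-provided sequence starting at a given state can be tallied and normalized into transition probabilities:

    from collections import Counter

    def transition_probabilities(sequences):
        """Estimate activation probabilities from the first interface of each
        expert-provided sequence that starts at the same state."""
        counts = Counter(seq[0] for seq in sequences if seq)
        total = sum(counts.values())
        return {interface: n / total for interface, n in counts.items()}

    # 10 hypothetical sequences starting at the same state: 7 begin with getWallDist,
    # 3 begin with executeDirChange, mirroring the 0.7 / 0.3 split discussed above.
    sequences = [["getWallDist", "executeDirChange"]] * 7 + [["executeDirChange"]] * 3
    print(transition_probabilities(sequences))   # {'getWallDist': 0.7, 'executeDirChange': 0.3}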
While this is a very simple process, in practice applying it even to a simple model such as the Controller's state machine exhibits a problem. Given a sequence of interface
invocations in this case, there exists more than one sequence of states corresponding
to the interface sequence. For example, if a sequence of interface invocations given
by the domain expert contains {getWallDist, getWallDist, getWallDist, execute-
DirChange}, we cannot deterministically identify the corresponding sequence of
states associated with this sequence of transitions, by looking at the component’s
dynamic behavior model (shown in Figure 3-2). One sequence of states could be
{init, normal, normal, changed}, while another may be {init, normal, emergency,
changed}. This lack of a one-to-one correspondence between the interfaces and the
states prevents us from directly mapping this information into an operational profile.
A formalism that can be used in this case is Hidden Markov Models (HMMs). HMMs
are essentially Markov models with unknown parameters. In our case, the unknown
parameters are the unknown transition probabilities between different states which
correspond to the unknown operational profile of the component. Using HMMs and
existing standard algorithms, we use the data on the sequences of a component’s
interfaces specified by the architect (aka training data), and obtain transition probabil-
ities corresponding to our dynamic behavior model. In Section 3.2.4 we present some
background information on Markov Models as well as the Hidden Markov Model
variation, and then describe how this formalism can help us address the problem of a
component’s operational profile modeling.
Our current approach to training data generation relies on the domain expert's knowledge and produces data representative of the expert's understanding of the component's operation. This can be done by asking the expert to manually provide a set of sequences of the component's interfaces. Alternatively, we use an automated technique that requires the expert to predict the probability of invocation of various interfaces at each state in the model. We then use this prediction to automatically generate valid sequences of interfaces. This approach addresses the need to synthesize the HMM's training data based on domain knowledge. As part of our future work, we are also investigating other ways of generating training data from the dynamic behavior model of a component, e.g., statechart simulation methods and trace assertion methods for module specification. The goal is that, using these new techniques, we decrease the impact of the expert's "judgment" and offer a more objective approach to training data generation.
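The automated alternative can be sketched as a simple random walk over the dynamic behavior model: the expert supplies, for each state, the estimated probability of invoking each interface, and valid sequences are sampled from those distributions. The probabilities and the transition table below are illustrative placeholders rather than actual SCRover figures:

    import random

    # expert-estimated probability of invoking each interface at each state (assumed values)
    interface_probs = {
        "init":   {"getWallDist": 1.0},
        "normal": {"getWallDist": 0.6, "executeDirChange": 0.2, "executeSpeedChange": 0.2},
    }
    # destination state for each (state, interface) pair (illustrative subset of the model)
    next_state = {
        ("init", "getWallDist"): "normal",
        ("normal", "getWallDist"): "normal",
        ("normal", "executeDirChange"): "changed",
        ("normal", "executeSpeedChange"): "changed",
    }

    def synthesize_sequence(start="init", length=4, rng=random):
        state, seq = start, []
        for _ in range(length):
            choices = interface_probs.get(state)
            if not choices:
                break                      # no expert estimate for this state: stop the walk
            interfaces, weights = zip(*choices.items())
            interface = rng.choices(interfaces, weights=weights)[0]
            seq.append(interface)
            state = next_state.get((state, interface), state)
        return seq

    training_data = [synthesize_sequence() for _ in range(100)]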
3.2.2. Model Simulation Approach
Case 2 of the classification of different forms of the component reliability problem
relies on the results from the simulation of architectural models as the source to build
an operational profile. In this case, when data on both the sequences of states and the
frequencies of transitions is available, it is possible to obtain an operational profile
directly from the data. This could be done by analyzing several runs or executions of
the component’s operations and identifying the frequency of activation of various
transitions at each state. These frequencies are then directly translated to transition
probabilities on the model using an approach similar to the one described in the previ-
ous section. Consequently, in case 2, no additional profile modeling and estimation
activity would be necessary to perform the component’s reliability prediction.
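When simulation traces record both the visited states and the activated transitions, the translation into an operational profile is a direct frequency count. The sketch below assumes a simple trace format (a run is a list of (state, transition) pairs); it is an illustration, not the actual simulation tooling:

    from collections import Counter, defaultdict

    def profile_from_traces(traces):
        """traces: iterable of runs, each a list of (state, transition) pairs.
        Returns {state: {transition: probability}}."""
        counts = defaultdict(Counter)
        for run in traces:
            for state, transition in run:
                counts[state][transition] += 1
        return {s: {t: n / sum(c.values()) for t, n in c.items()} for s, c in counts.items()}

    runs = [[("init", "getWallDist"), ("normal", "getWallDist"), ("normal", "executeDirChange")],
            [("init", "getWallDist"), ("normal", "executeSpeedChange")]]
    print(profile_from_traces(runs))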
3.2.3. Runtime Monitoring Approach
In cases where the data is obtained from the component’s execution at runtime (Case
1) (or when no state information is collected from simulation), the data cannot be
directly used to build the component’s operational profile. Similar to the data synthe-
sis approach, a one-to-one mapping between the sequence of transitions and states in
the model may not exist. Once again, a Hidden Markov Model methodology may be
used here to build an operational profile based on the available set of (training) data
obtained from runtime monitoring.
3.2.4. Background on Markov and Hidden Markov Models
Informally, a Markov chain is a Finite State Machine (FSM) that is extended with
transition-probability distributions. Formally, a Markov Model consists of a set of
states S = {S_1, S_2, …, S_n}, a transition probability matrix A = {a_ij} representing the probability of transition from state S_i to state S_j, and an initial state distribution vector π. The initial state distribution is defined as the probability that the state S_i is an initial state: π_i = Pr[q_1 = S_i], where q_1 denotes the state of the model at time t_1. At regular fixed intervals of time the system transfers from its state at time t, (q_t), to its state at time t+1, (q_{t+1}). The Markov property assumes that the transfer of control between states is memoryless: the probability of transition to the next state at time t+1 depends only on the state of the system at time t and is independent of its past history. In other words:

\[
\Pr[q_{t+1} = S_j \mid q_t = S_i] = a_{ij}
\]

This assumption affords analytical tractability, and does not pose any significant limitation.²

Markov-based reliability models (e.g., [18,106,134]) leverage the Markov property, and can be used to estimate the probability of being at a given state when the model reaches a steady state. They rely on the availability of matrix A (the probability of transitions among states) to estimate the probability of being at a given state in the future, by calculating the model's steady state. The steady state is also known as the equilibrium state, and is reached after the process passes through an arbitrarily large number of steps. A Markov Model's steady state is characterized by a steady state probability distribution vector v defined as:

\[
v = \lim_{n \to \infty} \pi^{(n)}, \qquad \text{where } \pi^{(n)} = \pi^{(n-1)} A = \pi^{(0)} A^{n}
\]

where π^{(0)} = π is the initial state distribution vector, and A is the transition probability matrix.

2. This is particularly the case since our architectural models are sufficiently rich to embody memory information using the notion of state variables and invariants at each state (recall Chapter 2) without violating the Markov property in the corresponding reliability model.
Markov models have entirely observable states. A Hidden Markov Model (HMM),
however, is a variation of Markov Models that assumes that some of the parameters
(e.g., transition probabilities) may be unknown. In particular, HMMs assume that while
the number of states in the state-based model is known, the exact sequence of states to
obtain a sequence of transitions may not be known. In addition, HMMs assume that
the value of the transition probability distribution may be unknown or inaccurate. The
challenge is to determine the hidden parameters, from the observable parameters,
based on these assumptions.
An HMM is defined by a set of states S = {S_1, S_2, …, S_n}, a transition probability matrix A = {a_ij} representing the probability of transition from state S_i to state S_j, an initial state distribution vector π, a set of observations O = {O_1, O_2, …, O_m}, and an observation probability matrix B = {b_ik}, representing the probability of observing observation O_k, given that the system is in state S_i.
The following three canonical problems are associated with Hidden Markov Models
[103,104]. Given an output sequence:
1. What is the probability that a given HMM produced this output sequence?
2. What is the most likely sequence of state transitions that yield this output
sequence?
3. What are estimates for the transition probabilities related to this output sequence?
Later in this chapter we will discuss how the third problem is related to the compo-
nent reliability prediction problem. This problem is addressed by the Baum-Welch
algorithm [11]. Baum-Welch is an Expectation-Maximization algorithm that, given the number of states, the number of observations, and a set of training data, approximates the best model, in terms of transition and observation probability matrices A and B, that represents the training data set.³
The Baum-Welch algorithm is an iterative optimization technique, which starts from a
possibly random model, and leverages the training data to find the local maximum of
the likelihood function. Specifically, the algorithm applies a dynamic programming
technique to efficiently estimate the HMM parameters (including transition probabili-
ties), while maximizing the likelihood that the training data is generated from the esti-
mated model. It operates by defining a forward variable α_t(i) and a backward variable β_t(i) as follows:

\[
\alpha_t(i) = \sum_{j} \alpha_{t-1}(j)\,\Pr(q_t = i \mid q_{t-1} = j)\,\Pr(x_t \mid q_t = i)
\]
\[
\beta_{t-1}(i) = \sum_{j} \Pr(q_t = j \mid q_{t-1} = i)\,\Pr(x_t \mid q_t = j)\,\beta_t(j)
\]

3. Baum-Welch training is only guaranteed to converge to a local optimum. The local optimum may not always be the global optimum. While this is the only known algorithm to address this problem, its output is an approximation of the actual model. One way to mitigate this shortcoming is to execute the algorithm iteratively and obtain a statistically significant or "typical" result.
The forward variable determines the probability of reaching state S_j from state S_i, given a sequence of transitions (t_0, ..., t_t). Conversely, the backward variable determines the probability of the occurrence of a (future) sequence of transitions, given the current state S_i.
Using the two complementary forward-backward probabilities the Baum-Welch algo-
rithm evaluates various probabilities. These include the probability of a given obser-
vation sequence, the probability that the HMM was at a given state S_i at time t, as well as the probability that the HMM was at a given state S_i at time t and transitioned to state S_j at time t+1. By applying the Baum-Welch algorithm, the unknown matrices A
and B are obtained. This is equivalent to obtaining the operational profile of a compo-
nent given the set of data obtained by simulation, runtime monitoring, or synthesis of
the component’s model.
3.2.5. Application of HMMs to the Component Reliability Problem
A component’s dynamic behavior model is the heart of our component reliability
modeling approach. In our approach, the graphical representation of the component’s
internal behavior using states and transitions among those states is leveraged to build
a Markovian reliability model. The Markov property assumption is not too restrictive
in our case: it is possible to keep track of memory in a dynamic behavioral model and
still preserve the Markov property, by using more complex states with additional state
variables (recall Chapter 2).
However, building a Markov model from a component’s architectural models not
only relies on knowledge about the component’s states and transitions among those
states, but also requires availability of data that helps us obtain the probability of var-
ious transitions. As previously explained, such data may be obtained from runtime
monitoring or simulation of the component's models (cases 1 and 2 in Section 3.1), or via an expert-driven synthesis process (case 3 in Section 3.1). Depending on the circumstances, a
one-to-one correspondence between various observations and the transitions in the
model may not exist. In such cases, a regular Markov Model is not capable of prop-
erly representing the behavior of the component. A Hidden Markov Model, however,
can be formed to estimate an operational profile for the component given the avail-
able data.
The event/action interaction semantics in the dynamic behavioral model discussed in
Chapter 2 require an augmentation to the basic HMM (without changing its key
traits). Each transition in the component’s dynamic behavior model may have an
event/action pair associated with it: invocation of an event may result in triggering of
an action, which in turn may trigger another event in another component. This inter-
action semantics is leveraged in designing our system-level reliability model dis-
cussed in Chapter 4. We now formally define an Augmented HMM (AHMM) used to
model the operational profile of components. Once the operational profile is obtained,
we use our reliability model to predict the component reliability. The formal defini-
tion of our AHMM is given in Figure 3-3. Below we describe some of its properties.
Assume:
    S: the set of all possible states, {S_1, ..., S_N}; N: the number of states
    q_t: the state at time t
    E: the set of all events, {E_1, ..., E_M}; M: the number of events
    F: the set of all actions, {F_1, ..., F_K}; K: the number of actions
We now define λ = (A, B, π) as a Hidden Markov Model such that:
    A: the state transition probability distribution, A = {a_ij}, a_ij = Pr[q_{t+1} = S_j | q_t = S_i], 1 ≤ i, j ≤ N
    B: the interface probability distribution in state S_j, B = {b_j(m)}, b_j(m) = Pr[E_m/F_k at time t | q_t = S_j], 1 ≤ j ≤ N, 1 ≤ m ≤ M, 1 ≤ k ≤ K
    π: the initial probability distribution, π = {π_i}, π_i = Pr[q_1 = S_i], 1 ≤ i ≤ N
Figure 3-3. Formal Definition of AHMM

In the dynamic behavioral model serving as the basis of our AHMM, for every two states S_i and S_j, there may be several transitions with different event/action pairs (E_m/F_k), for 1 ≤ k ≤ K and 1 ≤ m ≤ M, where, as shown in Figure 3-3, M is the number of events and K is the number of actions. Then, the transition probability from S_i to S_j by means of a given event E_m via any of the possible actions F_k on E_m is:

\[
\sum_{k=1}^{K} P_{ij,E_m F_k}
\]

We define the probability T_ij of reaching state S_j from state S_i via any of the event/action pairs E/F as:

\[
T_{ij} = \sum_{m=1}^{M} \sum_{k=1}^{K} P_{ij,E_m F_k}
\]

Finally, at each state S_i the following condition among all outgoing transitions exists:

\[
\sum_{m=1}^{M} \sum_{k=1}^{K} \sum_{j=1}^{N} P_{ij,E_m F_k} = 1
\]

For example, in the case of the controller component's dynamic behavioral model (depicted in Figure 3-2), between the two states init (S_1) and normal (S_2), there are two transitions designated: getWallDist/notifyDistChange (E_1/F_1), and –/notifyDistChange (E_2/F_1). The latter is an example of a transition with a true event, where no external stimuli besides a time step are necessary for the transition to be activated.
Using the above equations, the transition probability from S_1 to S_2 by means of event E_1 via any of the possible actions is:

\[
\sum_{k=1}^{K} P_{12,E_1 F_k} = P_{12,E_1 F_1}
\]

The transition probability from state init to state normal can be formulated as the sum of the probabilities of the two transitions:

\[
T_{12} = \sum_{m=1}^{M} \sum_{k=1}^{K} P_{12,E_m F_k} = P_{12,E_1 F_1} + P_{12,E_2 F_1}
\]

Finally, at state S_1 we have:

\[
\sum_{m=1}^{M} \sum_{k=1}^{K} \sum_{j=1}^{N} P_{1j,E_m F_k} = P_{12,E_1 F_1} + P_{12,E_2 F_1} = 1
\]

As mentioned before, the important question at this point is how to obtain these individual probability values (P_{12,E_1 F_1} and P_{12,E_2 F_1}). Section 3.2 described how these probabilities could be obtained directly from a simulation process (case 2). In cases 1 and 3, these probabilities may be obtained by applying the Baum-Welch algorithm [11]. The Baum-Welch algorithm leverages information about the number of states and event/action pairs in our model, as well as the training data (obtained from runtime monitoring or the synthesis process for cases 1 and 3, respectively), and estimates the parameters of the AHMM in terms of matrices A and B. In the next section, we show how matrix B is used in the reliability prediction process.
3.3 Reliability Prediction
The last phase of our component reliability modeling approach involves actual pre-
diction and analysis of a component’s reliability. Given the operational profile of the
component (Section 3.2), the aim is to build a reliability model that predicts and ana-
lyzes the probability that a component performs its operation without failure.
Our component reliability model extends the Quartet’s dynamic behavioral model
with the notion of failure states. One of our goals has been to provide targeted sensi-
tivity analyses as part of our reliability modeling, aiming at offering cost-effective
strategies to defect mitigation. As a result, we model a failure state for each defect
type revealed during the architectural modeling and analysis phase. Each failure state
represents possible manifestation of the corresponding defect type during the compo-
nent’s runtime operation. We note that other possibilities, ranging from a single fail-
ure state to multiple failure states for each type, are also enabled by the model. Once
the model is augmented with failure states, two additional types of transitions must be
added to the model: failure transitions and recovery transitions.
Failure transitions are arcs from a component’s states to the failure states, and repre-
sent the possibility of a failure happening while the component is in a normal operat-
ing state. Recovery transitions model the notion of recovery from failures, and are
arcs from failure states to one or more “normal” component states. The designation of
one or more of the component’s “normal” states as recovery states for a given failure
state is performed by the architect.
As shown in Figure 3-4, the controller component’s dynamic behavioral model is
augmented with two failure states, F_1 and F_2. F_1 denotes occurrence of failures corresponding to the signature mismatch defect type, and F_2 represents occurrence of failures corresponding to the protocol mismatch defect type.

Figure 3-4. Graphical View of the Controller's Reliability Model

As discussed in Chapter 2, these two types of defects were revealed by running various analyses on the controller
component’s architectural models. In particular, the signature mismatch between esti-
mator and controller components was associated with the getWallDist interface, due
to an inconsistency of the return types. The setDefaults operation was determined to
be the source of a protocol mismatch between the controller and database compo-
nents. Dotted transitions in the diagram represent failure transitions connecting a sub-
set of the component’s states to failure states. Bolded transitions from failure states to
the init state in Figure 3-4 represent recovery transitions.
Only component states in which a particular defect type is relevant are linked to a fail-
ure state. This is decided by examining all outgoing transitions at a given state: if an
outgoing transition relates to a defect detected by architectural analysis, then an arc
connecting that state to the corresponding failure state is required. In the controller
example, since no outgoing transition corresponding to the setDefaults operation at
states init and normal exists, there is no need to model a failure transition from those
states to state F_2. Moreover, designation of recovery states for each failure state is an
application specific task, and as such must be done by the architect. In the case of our
example, the init state is assigned to be the sole recovery state once any failure hap-
pens.
The next step after augmenting the model with failure states and adding failure and
recovery transitions, is to assign probabilities to all transitions in the model. As
described, the output of the operational profile modeling phase depends on the specific form of the problem at hand. Our primary focus is on case 3, where a set of training data is synthesized. This data is then used by the Baum-Welch algorithm to estimate the probability of transitions among component states. Case 1 does not rely on synthesized data but obtains training data from runtime monitoring, while the result of case 2, which simulates the component's models, is a direct estimation of the frequency (i.e., probability) of activation of various transitions in the dynamic behavioral model. Either way, by this point in the
reliability modeling we have an estimate of all transition probabilities in the form of a
matrix (matrix A in Figure 3-3):

\[
A = \begin{array}{c|cccc}
 & S_1 & S_2 & \cdots & S_N \\ \hline
S_1 & a_{11} & a_{12} & \cdots & a_{1N} \\
S_2 & a_{21} & a_{22} & \cdots & a_{2N} \\
\vdots & \vdots & \vdots & \ddots & \vdots \\
S_N & a_{N1} & a_{N2} & \cdots & a_{NN}
\end{array}
\]

where a_ij represents the probability of transition from state S_i to state S_j. The next step is to incorporate these values in an extended model that includes the failure states and their associated transitions. The new model is a Markov model which is used for reliability prediction. Below, we first present the general form of the transition probability matrix of this new model, and then discuss it in detail.
\[
A' = \begin{array}{c|cccccc}
 & S_1 & \cdots & S_N & F_1 & \cdots & F_M \\ \hline
S_1 & a_{11}\bigl(1-\sum_{i=1}^{M} f_{1i}\bigr) & \cdots & a_{1N}\bigl(1-\sum_{i=1}^{M} f_{1i}\bigr) & f_{11} & \cdots & f_{1M} \\
\vdots & \vdots & \ddots & \vdots & \vdots & & \vdots \\
S_N & a_{N1}\bigl(1-\sum_{i=1}^{M} f_{Ni}\bigr) & \cdots & a_{NN}\bigl(1-\sum_{i=1}^{M} f_{Ni}\bigr) & f_{N1} & \cdots & f_{NM} \\
F_1 & r_{11} & \cdots & r_{1N} & 1-\sum_{i=1}^{N} r_{1i} & \cdots & 0 \\
\vdots & \vdots & & \vdots & \vdots & \ddots & \vdots \\
F_M & r_{M1} & \cdots & r_{MN} & 0 & \cdots & 1-\sum_{i=1}^{N} r_{Mi}
\end{array}
\]

where A' is the new transition matrix, M is the number of failure states, N is the number of normal states, F_i is the i-th failure state, and S_i is the i-th normal state. Moreover, r_ij is the probability of recovery from failure state F_i to the normal state S_j, and f_ij denotes the probability associated with the failure transition from normal state S_i to the failure state F_j. Finally, a_ij is the probability value previously obtained from matrix A (see above), corresponding to the probability of transitioning to state S_j from state S_i based on the results of operational profile modeling. This value is adjusted in the new matrix (A') to incorporate failure and recovery probabilities, while ensuring that the new matrix preserves the properties of a Markov model.
Our approach to initializing recovery probabilities (r_ij) and failure probabilities (f_ij) is as follows. As described in Chapter 2, we use our defect quantification approach to estimate the cost of recovery from each failure given a set of domain-specific cost factors. The probability of recovery from each failure type is calculated as a function of this recovery cost. Consequently, we instantiate the value of r_ij directly from this estimation. For instance, as depicted in Figure 2-8, the recovery probabilities for signature and protocol defect types were calculated as 0.7975 and 0.71125, respectively. These two values were obtained from the instantiation of our cost framework (Section 2.6) as the cumulative influence of a set of domain-specific cost factors. These values are assigned to the recovery probabilities from F_1 and F_2 to the init state, respectively. Recall that the state init was designated as the sole recovery state for the component. The probabilities of remaining at the failure states F_1 and F_2 are thus 1 – 0.7975 = 0.2025 and 1 – 0.71125 = 0.28875, respectively.
While our cost function can be used to quantify the cost and consequently the proba-
bility of recovery from different types of failures, estimating the probability of failure
occurrence can only be done given the historical failure data from the component, or
by leveraging the domain expert’s knowledge. In the case of architecture-level reli-
ability modeling when no failure data may be available, a domain expert must esti-
mate the probability of failure occurrence at each state. At early stages of
development, estimating these probabilities with any degree of certainty is very diffi-
cult. These uncertainties warrant the need for offering flexibility in the reliability
model and allowing for reliability analysis based on different failure probability val-
ues. Using a range of failure probability values will result in a range of predicted
component reliability values. We consider this flexibility to be a useful analysis tool
that helps the architect in making important design decisions. To present the model,
however, we first focus on a single failure probability value, and then extend the reli-
ability analysis to a range of possible failure probability values.
Let us assume that the probability matrix obtained from profile modeling of the controller component is as follows:

\[
A = \begin{array}{c|cccc}
 & i & n & e & c \\ \hline
i & 0.1503 & 0.2858 & 0.2830 & 0.2809 \\
n & 0.4708 & 0.2653 & 0.0099 & 0.2540 \\
e & 0.2263 & 0.0984 & 0.3308 & 0.3445 \\
c & 0.2204 & 0.3539 & 0.1998 & 0.2258
\end{array}
\]

where i is init, n is normal, e is emergency, and c denotes the changed state. In our case, this matrix was obtained by applying the Baum-Welch algorithm to a set of synthesized training data based on the domain expert's knowledge (an extensive study of the impact of the training data on the predicted reliability value, in terms of the sensitivity of the model to random data versus data based on domain knowledge, may be found in Chapter 6). As discussed, in conjunction with the results obtained
from the analysis of architectural models, this matrix must be augmented with two failure states F_1 and F_2.

Let us assume that the architect has designated a 5% probability of failure for the signature mismatch and a 2% probability of failure for the protocol mismatch at each related state. The new transition probability matrix will have two additional rows and columns corresponding to the failure states F_1 and F_2:

\[
A' = \begin{array}{c|cccccc}
 & i & n & e & c & F_1 & F_2 \\ \hline
i & 0.1428 & 0.2715 & 0.2688 & 0.2669 & 0.05 & 0 \\
n & 0.4473 & 0.2520 & 0.0094 & 0.2413 & 0.05 & 0 \\
e & 0.2105 & 0.0915 & 0.3076 & 0.3204 & 0.05 & 0.02 \\
c & 0.2050 & 0.3291 & 0.1858 & 0.2100 & 0.05 & 0.02 \\
F_1 & 0.7975 & 0 & 0 & 0 & 0.2025 & 0 \\
F_2 & 0.7113 & 0 & 0 & 0 & 0 & 0.2887
\end{array}
\]

Examining the new matrix demonstrates that while the new failure transition probabilities are incorporated into the model, the ratio between the various transition probabilities has remained unchanged. For example, in the original transition probability matrix, once the component was in the init state, the ratio between the probability of remaining in the init state and the probability of transitioning to the normal state was:

\[
\frac{0.1503}{0.2858} = 0.52
\]
In the new transition probability matrix, the same ratio between those probabilities holds:

\[
\frac{0.1428}{0.2715} = 0.52
\]

Once the new transition probability matrix for the model (including the failure states, and the failure and recovery transition probabilities) is constructed, we can predict the reliability of the component by estimating the probability that it is operating normally at a time in the future when the component is in its steady state. Recall that a component is considered reliable if by time t_n, where n approaches infinity, it is operating normally. That is, it has either not failed, or has recovered from the failures it may have encountered. The reliability is then predicted by estimating the probability that the component at time t_n is in a non-failure state. This can be estimated by calculating the steady state vector that represents the steady state behavior (recall Section 3.2.4) of the component. The steady state distribution vector corresponding to the component's model can be calculated by solving the following system of equations:

\[
\begin{aligned}
x + y + z + \ldots &= 1 \\
[\,x \;\; y \;\; z \;\; \ldots\,]\,A' &= [\,x \;\; y \;\; z \;\; \ldots\,]
\end{aligned}
\]
where the number of unknowns is equal to the number of states in the Markov model (x, y, z, and so on). A steady state probability vector is then given by:

\[
V = [\,x \;\; y \;\; z \;\; \ldots\,]
\]

Numerically, calculation of the steady state vector can be performed by raising the matrix A' to increasingly higher powers until all rows in the matrix converge to the same values. Upon convergence, we obtain the steady-state vector V, whose elements represent the long-term probability of being in the corresponding state.

In the case of the controller component, the steady state vector is calculated as:

\[
V = [\,0.2832 \;\; 0.2288 \;\; 0.1767 \;\; 0.2372 \;\; 0.0581 \;\; 0.0116\,]
\]

where the entries correspond to the states i, n, e, c, F_1, and F_2, respectively. Reliability is then the probability that the component is not in state F_1 or F_2. In other words:

\[
\text{Reliability} = 1 - \sum_{i=1}^{M} V(F_i)
\]

where M is the number of failure states, and V(F_i) are the elements of vector V corresponding to failure states F_1, F_2, ..., F_M.
In the case of the controller component, the reliability is estimated as:

\[
\text{Reliability} = 1 - (0.0581 + 0.0116) = 0.9303
\]
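The prediction step for the controller can be reproduced numerically. The sketch below (the helper names are ours; the matrix A, the 0.05/0.02 failure probabilities, and the 0.7975/0.71125 recovery probabilities are the values quoted in this chapter) builds A', approximates its steady state by raising the matrix to a high power, and should yield a reliability close to the 0.9303 reported above:

    import numpy as np

    A = np.array([[0.1503, 0.2858, 0.2830, 0.2809],     # init
                  [0.4708, 0.2653, 0.0099, 0.2540],     # normal
                  [0.2263, 0.0984, 0.3308, 0.3445],     # emergency
                  [0.2204, 0.3539, 0.1998, 0.2258]])    # changed

    # f[s][k]: probability of failure type k (F1 = signature, F2 = protocol) at state s
    f = np.array([[0.05, 0.00], [0.05, 0.00], [0.05, 0.02], [0.05, 0.02]])
    r = np.array([0.7975, 0.71125])                     # recovery probabilities to init

    def build_A_prime(A, f, r):
        n, m = A.shape[0], f.shape[1]
        Ap = np.zeros((n + m, n + m))
        Ap[:n, :n] = A * (1 - f.sum(axis=1))[:, None]   # scale the normal transitions
        Ap[:n, n:] = f                                   # failure transitions
        Ap[n:, 0] = r                                    # recovery to the init state
        Ap[n:, n:] = np.diag(1 - r)                      # remain in the failure state
        return Ap

    def reliability(Ap, n_failure):
        V = np.linalg.matrix_power(Ap, 1000)[0]          # any row after convergence
        return 1 - V[-n_failure:].sum()

    print(round(reliability(build_A_prime(A, f, r), 2), 4))   # approximately 0.93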
For clarity, the calculation above shows a single approximation of the component’s
reliability based on a single assignment of failure transitions’ probabilities by the
domain expert. Given the difficulty of accurately estimating transition failure proba-
bilities when no actual failure data is available, we opt to provide a range of analyses
based on a threshold given for the transition failure probability estimation. In the
example above, the expert had assigned 0.05 and 0.02 for the probability of failure of type F_1 and F_2, respectively, occurring at each state. Given this instantiation, the component reliability was estimated at 93.03%. Assuming that these transition failure probabilities are an estimate at best and may have uncertainties associated with them, we can predict the reliability of the component for a given threshold. Figure 3-5 (top) demonstrates the range of predicted reliability values when the probability of transition to failure state F_1 varies between 0.03 and 0.07. As depicted, the original predicted reliability of 93.03% now falls at the middle of the estimated values, from 95.22% to 90.94%. A similar type of analysis on the value of the probability of a failure transition to state F_2 is shown in Figure 3-5 (bottom).
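Reusing the helpers from the previous sketch, the sensitivity analysis of Figure 3-5 amounts to sweeping the failure probability and recomputing the prediction, as in this hypothetical loop for the P(F1) sweep with P(F2) held at 0.02:

    for p_f1 in [0.03, 0.04, 0.05, 0.06, 0.07]:
        f_swept = np.array([[p_f1, 0.00], [p_f1, 0.00], [p_f1, 0.02], [p_f1, 0.02]])
        print(p_f1, round(reliability(build_A_prime(A, f_swept, r), 2), 4))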
Figure 3-5. Reliability Analysis Results for the Controller Component (top: component reliability versus P(F1), with P(F2) = 0.02; bottom: component reliability versus P(F2), with P(F1) = 0.05)

An assumption made when estimating a single component's reliability is that the effect of a defect associated with a service is captured in the providing component's model. That is, when a component requires services of another component, we assume that any defect associated with that service is captured in the reliability model of the providing component. Consequently, the failure states of a given component signify the defects associated with its own functionality, while any problems with the
functionality provided to it would be captured in the provider component’s reliability
model.
It is noteworthy that architectural analysis of software components may be applied to
components and their behavior in isolation (e.g., the case of an Off-the-Shelf compo-
nent that is not part of any system). It may also be applied to components that along
with other (interacting) components comprise a software system. While both types of
analysis can reveal defects, the defects are more useful and the analysis is more mean-
ingful when the component is considered in the context of a system. While the former
approach results in reliability values that may be reused in different software systems,
the latter approach determines the fitness of the component in a particular system.
Our approach in this dissertation has focused on components in the context of a soft-
ware system.
Chapter 4: System Reliability
In Chapter 3, we described our approach to predicting the reliability of a single com-
ponent (aka Local Reliability). In this chapter, we leverage the components’ reliabil-
ity values and offer a compositional approach to predict the architecture-level
reliability of a software system. The system reliability is predicted in terms of the reli-
abilities of its constituent components, and their complex interactions. Our approach
involves two major steps: first we build a model representing the overall behavior of
the system in terms of interactions among its components. Next, this model is used as
a basis for stochastic analysis of the system’s architectural reliability.
Since our approach is intended to be applied at early stages of the software develop-
ment life-cycle, lack of knowledge about the system’s operational profile poses a
major challenge in building a reliability model. The operational profile is used to
determine the failure behavior of the system, and is commonly obtained from runtime
monitoring of the deployed system. In the absence of data representing the system’s
operational profile, we use an analytical approach that relies on domain knowledge as
well as the system’s architecture to predict the reliability of the system. Similar to the
component-level reliability model, in cases where operational profile data is avail-
able, our reliability model can be adapted to leverage existing data and to provide a
more accurate analysis of the system’s reliability.
Figure 4-1 shows a high-level view of the system reliability prediction process. As
with our component-level reliability estimation approach, architectural models of the
system serve as the core of the reliability model. In our case, the Quartet offers mod-
els of components’ interaction protocols in the form of a set of Statecharts [47]. We
compose a concurrent model of components’ interaction protocol models to provide a
global view of the system’s behavior. Our reliability model leverages this global view,
and given components’ reliability values (obtained via our HMM methodology), pro-
vides a prediction of the system’s architecture-level reliability, based on the Bayesian
Network methodology [49].
Figure 4-1. Our Approach to System Reliability Prediction

In the rest of this chapter, we first describe our approach to modeling the global behavior of a software system in terms of the Quartet models of its constituent com-
ponents. We then describe our Bayesian system-level reliability model. As part of this
discussion, we provide a brief overview of Bayesian Networks. We then describe how
our Bayesian reliability model is constructed, and demonstrate the analyses it enables.
4.1 Global Behavioral Model
In order to estimate the overall reliability of a software system, we need to understand
the nature of the complex interactions among its components. This understanding
involves answering two types of questions: which interactions are allowed in this par-
ticular system, and how often do they occur. To answer the first question, we build the
Global Behavioral Model (GBM) of the system. This model is then used by the archi-
tect to analytically determine the expected frequency of various interactions.
The behavior of a software system is the collective behavior of its constituent compo-
nents. These components interact to achieve system-level goals. These interactions
are often very complex, and capturing them requires sophisticated modeling tech-
niques that are capable of representing request-response relations, as well as related
timing issues. These interactions are often described in terms of components’ pro-
vided and required functionality, exhibited through their interfaces [1,2,102,133].
As previously described in Chapter 2, one of the views of the Quartet approach to
software modeling is the model of components’ interaction protocols. Recall that a
component’s interaction protocol model provides a continuous external view of the
component’s execution by specifying the ordering in which component’s interfaces
must be invoked. Our specific approach leverages the Statecharts methodology
[47,133] with semantic extensions to model the event/action interactions between
communicating components [47].
We model the collective behavior of components using a set of concurrent state
machines. Each state machine within this concurrent model represents the interaction
protocol of a single component. Figure 4-2 depicts the conceptual view of the interac-
tions among components in the SCRover system. The left hand side diagram is the
view of the system’s configuration in terms of its communicating components. The
right hand side shows a concurrent state machine containing interaction protocol
models of individual components. In the interests of clarity, labels on the transitions,
events, actions, parameters, and conditions have been omitted, but are described later in this chapter.

Figure 4-2. View of SCRover System's Collective Behavior
In a concurrent state machine representing the system-level behavior of n communi-
cating components, at any point in time, the active state of the system is represented
using a set of component states {S_1, S_2, ..., S_n}, where n is the number of components in the system, and S_k corresponds to the active state in the state machine corresponding to the k-th component. The interactions among components are represented via event/
action pairs. Each event/action pair acts as a synchronizer among the state machines.
The event/action interaction describes how invocation of a component’s services
affects another component. Figure 4-3 depicts the interaction protocols of the control-
ler, estimator, and actuator components in the SCRover system. To avoid unnecessary
complexity in the discussions, we discuss the SCRover’s system model in terms of
these three components. However, the approach and techniques presented here can be
applied to a greater number of components without modification.
The system’s three state machines are concurrently executed. The Statechart seman-
tics [47] permit two types of interactions among concurrent state machines. These
interactions leverage event/action semantics, and model how operations in one com-
ponent affect another component’s operations. The first type of interaction concerns
concurrent events. Given the appropriate active state of components, all of the transi-
tions with the same event are activated at the same time. For instance, in the case of
the SCRover model (Figure 4-3), if the active state of the controller component is
controller.S_2 and the active state of the actuator component is actuator.S_1, then invocation of the executeSpeedChange interface results in generation of the corresponding event, which in turn causes a change of state in both components to controller.S_3 and actuator.S_2, respectively. Note that generation of this event has no effect on the state of the estimator component regardless of its active state.
Figure 4-3. SCRover’s Global Behavioral View in terms of Interacting Components
The second type of interaction concerns the event/action pair semantics. Given the
appropriate state of components, generation of an event in one of the components may
result in the invocation of an action, which in turn may result in generation of another
event in another (concurrent) state machine. In the SCRover system, assuming that
{controller.S_2, estimator.S_1, actuator.S_1} is the system's active state, invocation of the executeSpeedChange interface in the actuator component results in generation of the executeSpeedChange event. This in turn results in the triggering of the corresponding transition in the controller component, causing the notifyDistChange action. The concurrent nature of the three state machines results in triggering of the notifyDistChange transition in the estimator component (event caused by the action in the controller component), as well as the executeSpeedChange transition in the actuator component (original event). The new active state of the system will then be {controller.S_3, estimator.S_3, actuator.S_2}.
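A minimal sketch of these synchronization semantics (with a partial, assumed transition table inferred from the SCRover walkthrough above) is shown below: firing an event updates every machine with an enabled transition on that event, and any generated actions are broadcast as further events within the same step.

    from collections import deque

    # (component, state, event) -> (next_state, action or None); illustrative subset only
    TRANSITIONS = {
        ("controller", "S2", "executeSpeedChange"): ("S3", "notifyDistChange"),
        ("actuator",   "S1", "executeSpeedChange"): ("S2", None),
        ("estimator",  "S1", "notifyDistChange"):   ("S3", None),
    }

    def step(active, external_event):
        """Apply one external event and propagate generated actions until quiescence."""
        active = dict(active)
        queue = deque([external_event])
        while queue:
            event = queue.popleft()
            for comp, state in list(active.items()):
                hit = TRANSITIONS.get((comp, state, event))
                if hit:
                    next_state, action = hit
                    active[comp] = next_state
                    if action:
                        queue.append(action)   # the action becomes an event for the other machines
        return active

    print(step({"controller": "S2", "estimator": "S1", "actuator": "S1"},
               "executeSpeedChange"))
    # per the walkthrough above: {'controller': 'S3', 'estimator': 'S3', 'actuator': 'S2'}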
The semantics described above [48] form the basic principles upon which the Global
Behavioral Model of a system in terms of the behavior of its constituent components
is built. In the rest of this chapter, we describe how this model is used to predict the
architectural reliability of a software system.
4.2 Global Reliability Modeling
The Global Behavioral Model describes the behavior of the system as intended by the
architect. When the system strays from the intended behavior, it is said to demonstrate
a failure. Recall that failure is defined as the occurrence of an incorrect output as a
result of an input value that is received, with respect to the specification [101]. Fail-
ures are a result of faults or defects that are potentially attributed to design flaws or
deviation from a desired or intended behavior as specified by the architect. System
failures may be caused by components’ internal failures, or as a result of interactions
among communicating components. While all failures adversely affect system reli-
ability, different failures may contribute to system unreliability differently. The over-
all unreliability of a system can be formulated as the aggregate of the probabilities of
occurrences of various types of failures.
While the Augmented Hidden Markov Model (AHMM) presented in Chapter 3 was
effective for component reliability estimation, there are serious concerns with its abil-
ity to model reliability of a complex system. These concerns are mainly due to the
lack of theoretical foundations to build Hierarchical Hidden Markov Models. While
recent research has started to address the issue of concurrency and hierarchy in
Markov Modeling [9], generalization of the Expectation-Maximization algorithm for
Hierarchical Hidden Markov Models is still very much a topic for ongoing research.
Moreover, Hierarchical Hidden Markov Models have serious shortcomings with
respect to scalability when modeling concurrency.
For modeling the reliability at the system level, we use a related graphical model
capable of performing probabilistic inference. A Bayesian Network or Belief Net-
work (BN) [49], is a probabilistic graphical model in the form of a directed acyclic
graph. The nodes in a BN represent some variables, and the arcs (or links) connecting
these nodes represent the dependency relations among those variables. A Bayesian
Network represents a stochastic relationship among the nodes in the graph, in terms
of the conditional probabilities of some nodes with respect to the others. Given the
topology of a Bayesian Network and the probability distribution values at some of the
nodes, the probability distribution value of some other nodes may be deduced. This is
known as inference in Bayesian Networks. In the next subsection, we offer a basic
overview of Bayesian Networks, and discuss their applicability to the reliability esti-
mation problem.
It is worth mentioning that while theoretically Bayesian Networks may be used to
perform component-level reliability analysis, based on our experience, HMMs are
more appropriate to model architectural reliability of individual components. Theoret-
ically, Bayesian Networks and Hidden Markov Models belong to a class of models
known as Graphical Models. Graphical models merge concepts of Graph Theory and
Probability Theory [81]. Furthermore, the two formalisms are demonstrated to be isomorphic under certain conditions [55]. In particular, HMMs are considered to be a special case of Dynamic Bayesian Networks. While the natural causal relations among entities in our system-level models lend themselves to the use of BNs at the system level, HMMs are more intuitive when used in component-level reliability modeling. Consequently, we opted to use HMMs for component-level and BNs for system-level reliability modeling.
4.2.1. Background on Bayesian Networks
Bayesian Networks or Belief Networks have been extensively used in Artificial Intelligence and Machine Learning, Decision Making, Medical Diagnosis, and Bioinformatics [49,80,39,92,89].
systems during the testing phase, based on the operational profile obtained from sys-
tem monitoring [4,58,68,96]. However, little work has been done on predicting sys-
tem reliability early in the development process when such information is not widely
available [97].
A Bayesian Network consists of two parts: qualitative and quantitative. The qualita-
tive part is a directed acyclic graph consisting of set of nodes, and directed arcs (links)
that connect the nodes. The arcs represent the dependency between probability distri-
bution values represented at each node. Similar to standard graph theory concepts, if
there is an arc from node A to another node B, then A is a parent of B. If a node has no
parents, then it is a root node. If a node has no children then it is a leaf node. The
quantitative part consists of specification of the conditional probabilities among the
nodes and their parents in the network.
In probability, two events are independent when knowing whether one of them occurs
makes it neither more probable nor less probable that the other occurs. In a Bayesian
Network, a node is independent of its ancestors given its parents. This property is
known as conditional independence. Formally, two events X and Y are conditionally
independent given a third event Z, if the occurrence (or non-occurrence) of X and Y
are independent events in their conditional probability distribution given Z. In other
words:

\[
\Pr(X \cap Y \mid Z) = \Pr(X \mid Z) \times \Pr(Y \mid Z)
\]

where Pr(X | Z) represents the conditional probability of event X given the occurrence of event Z, and Pr(X ∩ Y | Z) represents the joint probability of events X and Y given the occurrence of event Z.

In a Bayesian Network, a node can represent any type of variable (e.g., a measurement, a parameter, etc.). We use Bayesian Networks to model the dependency between various system states. Specifically, our nodes represent the reliability values at corresponding states in the system.¹ The arcs model how reliability at one state is affected by the reliability value at another state in the system.

1. It is noteworthy that to remain consistent with the BN terminology and avoid confusion, we use the term state in the context of the behavioral models and state machines, and use the term node in the context of Bayesian models; conceptually the two terms are interchangeable. Moreover, the terms arcs and links are used interchangeably in this discussion.
The probabilistic inference considers available information about the network and
infers conclusions about the other parts of the model. In other words, given the indi-
vidual component reliability values, and the graph representing the relationship
among the reliabilities of various states in the system, we use inference to estimate the
posterior probability of the occurrence of different types of failures. The posterior
probability calculation considers the known information about the network (aka evi-
dence), and updates the conditional probability at other nodes (aka belief). The basis
for inference in Bayesian Networks is Bayes’s Theorem. Bayes’s Theorem is essen-
tially an expression of conditional probabilities that represent the probability of an
event occurring given evidence. In probability, Pr(A|B) (the conditional probability of
event A given B) and Pr(B|A) are two different terms. However, there is a relationship
between the two, and Bayes’s Theorem describes this relationship. Bayes’s Theorem
can be derived from the definition of conditional probability of events A and B as follows. By definition:

\[
\Pr(A \mid B) = \frac{\Pr(A \cap B)}{\Pr(B)}, \qquad
\Pr(B \mid A) = \frac{\Pr(A \cap B)}{\Pr(A)}
\]

So:

\[
\Pr(A \mid B) \times \Pr(B) = \Pr(B \mid A) \times \Pr(A) = \Pr(A \cap B)
\]
\[
\Pr(A \mid B) = \frac{\Pr(B \mid A) \times \Pr(A)}{\Pr(B)}
\]

where Pr(A|B) is the posterior probability, Pr(B|A) is known as the likelihood, Pr(A) is the prior probability, and Pr(B) is the probability of the evidence. In other words, Bayes's Rule can be phrased as follows:

\[
\text{posterior probability} = \frac{\text{likelihood} \times \text{prior probability}}{\text{evidence}}
\]

Bayesian Networks offer a robust probabilistic formalism for reasoning under uncertainty. It is relatively easy to understand and interpret a Bayesian Network, as it reflects our understanding of the world within the model. Furthermore, the conditional independence between the nodes and their ancestors in the model provides a more compact probabilistic relation between the nodes in the model based on their structured representation. This property particularly helps ensure that the complexity of components' interactions is simplified when considering the effect of these interactions on the system's reliability.

In the next section, we describe our Bayesian reliability model in terms of its quantitative and qualitative parts.
4.3 A Bayesian Network for System Reliability Modeling
The global behavioral model presents the interactions among components in a sys-
tem. Failures may occur during system’s operation, and their cause may be rooted in
defects that originate from the architecture and design phases. We build a Bayesian
Network using the GBM as its core that determines the dependency between reliabil-
ity values at various system states. This model is further extended to include failure
nodes. Similar to our component-level reliability approach, we acknowledge that dif-
ferent types of failures may occur in the system. Furthermore, the contribution of dif-
ferent types of failures to the overall reliability of the system depends on many factors
such as the types of failures and specific components that exhibit the failure behavior.
We leverage our classification of architectural defects (presented in Chapter 2) to dif-
ferentiate among different classes of failures. Our system-level reliability model
explicitly represents different types of failure for each component in the system. The
overall reliability of the system is then estimated in terms of the cumulative effect of
different types of failure in various components. Below we describe our approach to
construction of the qualitative and quantitative parts of our Bayesian Network.
4.3.1. Qualitative Representation of the Bayesian Network
The dependency relationship among reliabilities at various system states is directly
tied to the interactions among the states in the system’s global behavioral model. In a
system’s GBM a transition (associated with an event) may result in a change in the
state of the component; that is, under the correct conditions (i.e., generation of certain
events and given the active state of the component), a change of state to a new state
may be caused. This concept serves as the core principle in converting a global
behavioral model to BN’s directed graph.
As previously mentioned, the Global Behavioral Model consists of a set of concurrent
state machines SM = {sm_1, ..., sm_n}, where n is the number of components in the system. Each state machine (sm_i) consists of a set of states S = {s_1, ..., s_m} and a set of transitions T = {t_1, ..., t_p}, where m represents the number of states in the state machine sm_i, and p is the number of transitions in the component corresponding to sm_i. Each transi-
tion has its origin and destination in S. There is either a single event or an event/action
pair associated with each transition. Each event and action corresponds to a compo-
nent’s interface (recall Chapter 2).
Below we describe the steps required to leverage this behavioral model and construct
a Bayesian reliability model, in terms of nodes and the links representing the reliabil-
ity dependency among these nodes.
Nodes. The nodes in our Bayesian Network are directly related to the states in the
behavioral model. All the states in the global behavioral model become the nodes in
the BN. Moreover, a “super” node (init) is added to represent the instantiation of the
system. This node will be used to model the reliability of the system’s startup process.
In addition, for every component, a set of failure nodes are added to the Bayesian
Network. These failure nodes correspond to the different defect types revealed during
architectural analysis for each component. Each failure node represents the probabil-
ity of the occurrence of a specific type of failure in a component. A failure may be
due to an internal fault in the component, or a result of its interaction with the rest of
the system. The top part of Figure 4-4 shows the initial step of the BN construction
for the SCRover system. Initialization and failure nodes are added to the basic set of
nodes (corresponding to the states in the GBM).
Links. The next step involves designing the arcs (links) in the model to capture the
dependencies between reliabilities at various nodes. The links in our model can be
grouped in three distinct groups: instantiation links, failure links, and dependency
links.
Instantiation links are added from the init node to all nodes corresponding to initial
states of components. They model the system’s instantiation process, and signify that
the reliability of the system depends on the failure-free instantiation of all of its com-
ponents in the system. Note the links from the init node to controller.S
1
, estimator.S
1,
and actuator.S
1
in Figure 4-4.
110
Failure links are used to determine the possibility of occurrence of various types of
failures at different states of the system. We rely on the results obtained from our
architectural analysis phase to determine relevance of a particular defect type (and its
subsequent failure type) for each component. Specifically, for each node in the BN,
we consider the corresponding state in the global behavioral model. If in the GBM a
particular defect type is associated with an interface corresponding to an outgoing
transition at a given state, then a failure link from the corresponding node in the BN to
the failure node associated with that type of defect is drawn. In the case of the
SCRover system, recall that getWallDist() and setDefaults() were defective interfaces
identified during the architectural analysis phase. The first one was shown to demon-
strate a signature type mismatch with the estimator component, and the latter had
both a pre/post condition mismatch and a protocol mismatch with the database com-
ponent. States controller.S
1
and controller.S
2
both have outgoing transitions that are
activated once the getWallDist event is generated, as do estimator.S
2
and estimator.S
3
.
Consequently, a link from these nodes to the controller and estimator components’
signature failure node (F
4
) is added. Similarly, links from controller.S
2
to control-
ler.F
5
and controller.F
6
represent the possibility of Pre/Post condition and Protocol
failures respectively. The actuator component does not react to any of the defective
interfaces and thus no failure nodes need to be modeled for it. Figure 4-4 (bottom)
depicts the initialization and failure links in the SCRover’s Bayesian Network.
Finally, dependency links depict the reliability relationship among various nodes in the system. There are two types of dependency links: inter-component and intra-component links.
Figure 4-4. Nodes of the SCRover’s Bayesian Network (Top), and Initialization and Failure Links Extension (Bottom). Legend: F1: Direction, F2: Usage, F3: Incomplete, F4: Signature, F5: Pre/Post Condition, F6: Protocol.
The intra-component dependency links are directly obtained from each
component’s protocol model. For every transition in the interaction protocol model of
the component, there is a directed arc from the node corresponding to its origin state,
to the node corresponding to its destination in the Bayesian Network. These links sig-
nify that reliability (probability of success) at each node depends on the reliability of
its parent node. For example, in the SCRover’s GBM, a transition from controller.S1 to controller.S2 signifies that the system’s reliability value at controller.S2 depends on the reliability of the system at controller.S1, justifying a link in the Bayesian Network from controller.S1 to controller.S2, as depicted in Figure 4-5.
Figure 4-5. Interaction Links in SCRover’s Bayesian Network
The inter-component dependency links are designed to demonstrate the relationship between reliabilities of the states among interacting components. The notion of event/action interactions described earlier in this chapter serves as the logical core of these links. Recall that generation of an event in one component may cause a change of state in a different component. More specifically, for each e_l/a_o pair in a component’s state machine sm_i, we seek all transitions in all other components’ state machines where an event matches the action a_o. A link is then added from the origin node of e_l/a_o in sm_i to the destination nodes of all events a_o in the other state machines. The inter-component links indicate that the reliability (probability of success) in the nodes of interacting components is influenced by the reliability at the node of the component initiating the interaction.
Consider Figure 4-5 and Figure 4-6 depicting the SCRover’s BN in two stages. The
first diagram shows the inter-component dependencies while the second one depicts
the final Bayesian Network including all dependency, failure, and instantiation links.
Upon instantiation, {controller.S1, estimator.S1, and actuator.S1} becomes the active state of the system (described in terms of the active states of each component). At this point, several scenarios may happen. As an example, a getWallDist event may be generated, which results in a transfer of state in the controller component from controller.S1 to controller.S2, resulting in activation of the notifyDistChange action, which in turn causes a change of state in the estimator component from estimator.S1 to estimator.S2.
This particular scenario has no immediate influence on the actuator component. A change of state within a component (e.g., controller.S1 to controller.S2 in the above example) can also be interpreted in terms of the dependency among the reliability values at those states. In this case, an unreliable controller.S1 state may affect the probability of correct operations at the controller.S2 state. Moreover, because triggering of this transition results in a change of state in the estimator component (from estimator.S1 to estimator.S2), the unreliability at controller.S1 could also affect the probability of successful operation at the estimator.S2 state.
A final issue concerns cycles in a Bayesian Network. By definition, a Bayesian Net-
work is a directed acyclic graph (DAG). Following the above approach may result in
Figure 4-6. SCRover’s Final Bayesian Network Model
creation of cycles in the graph. However, it is important to note that these cycles are time sensitive: for example, while there are links in both directions between estimator.S1 and estimator.S3 in Figure 4-6, the two links do not represent the dependency between the two nodes in the same time-step. That is, since the estimator component cannot be in both the estimator.S1 and estimator.S3 states at the same time, the cycle introduced in the BN graph represents reliability dependencies at different points in time. To remedy this issue of cycles, before adding a link to our Bayesian model, we check for cycles that may be created once the link is added. If adding the link generates a cycle, then that link is marked as a Delay Link (dashed lines in our graphs). Delay links are a standard concept in time-dependent Bayesian Networks and convert a simple Bayesian Network to a Dynamic Bayesian Network (DBN) [81]. (Bayesian Network editors such as Netica [90] have delay links built in as standard functionality.) A Dynamic Bayesian Network is a time-sensitive Bayesian Network. Inference performed on a Bayesian Network can also be performed on a DBN. To do this, a DBN is first expanded for a period of time (e.g., 100 time-steps). The result of this expansion is a “regular” Bayesian Network that depicts the reliability dependencies over time.
An example of this expansion process over a period of three time steps for the
SCRover system is depicted in Figure 4-7. As shown, expanding a BN increases the
complexity of the Bayesian Network by creating more nodes. This process does affect the scalability of the model and its ability to perform inference in a reasonable time.
The inference problem in a BN in general is NP-hard, and the complexity of the algo-
rithm is exponential in the number of nodes and the number of variables represented
at each node (in our case it is just one variable for reliability). Approximation algo-
rithms such as variational methods, sampling (Monte Carlo) methods, or parametric
approximation methods [81] may be applied to provide a solution to the problem.
In our approach, we employ several techniques to help reduce the complexity of the problem. We estimate the reliability of the system in terms of the reliability at particular snapshots during the system’s operation, or in a short time span. In doing so we avoid creating a large network. The ramification is that we are not able to provide reliability analysis over a long period of time, but we can do so given smaller time steps. The overall complexity of the approach can be further improved by using principles of hierarchy in architectural models. This will help us reduce the complexity of the components by reducing their number of states.
Figure 4-7. Expanded View of the SCRover’s Dynamic Bayesian Network
This, in turn, results in a decrease in the
number of nodes in the Bayesian Network, which directly improves the efficiency of
the inference algorithms.
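As a rough illustration of the expansion step described above, the following sketch (Python; illustrative only, and not how Netica performs the expansion internally) unrolls a small two-slice model over a fixed number of time-steps, keeping regular links within a slice and routing delay links into the next slice, mirroring the node-naming convention of Figure 4-7. The edge lists are hypothetical.

def expand_dbn(nodes, links, delay_links, steps):
    # Unroll a DBN into a plain BN over `steps` time slices. Regular links stay
    # within a slice; delay links cross into the next slice. Node copies are
    # named node, node1, node2, ... as in Figure 4-7.
    def name(n, t):
        return n if t == 0 else f"{n}{t}"

    all_nodes, all_links = [], []
    for t in range(steps):
        all_nodes += [name(n, t) for n in nodes]
        all_links += [(name(u, t), name(v, t)) for u, v in links]
        if t + 1 < steps:
            all_links += [(name(u, t), name(v, t + 1)) for u, v in delay_links]
    return all_nodes, all_links

# Hypothetical two-node fragment with one regular link and one delay link
# (the back edge that would otherwise close a cycle).
nodes = ["S1_Controller", "S2_Controller"]
links = [("S1_Controller", "S2_Controller")]
delay_links = [("S2_Controller", "S1_Controller")]
print(expand_dbn(nodes, links, delay_links, steps=3))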
Figure 4-8 provides a quick summary of all the steps required to build the qualitative part of the Bayesian reliability model for assessing the architectural reliability of software systems.
1. Create a node for every state in the GBM.
2. Add an init node to represent the reliability of the system’s instantiation process. Connect this node to all the nodes corresponding to the initial states of components.
3. For each component, add a component reliability node to represent the (previously predicted) component reliability value. Connect this node to the initial state of the corresponding component.
4. For each transition in a component’s state machine, find the nodes corresponding to its origin and destination states, and draw a link from the origin node to the destination node. If a link produces a cycle in the BN, designate it as a delay link.
5. For each e_l/a_o pair in a component’s state machine sm_i, seek all transitions in other components’ state machines where an event matches the action a_o. Add a link from the origin node of e_l/a_o in sm_i to the destination node of event a_o in the other state machines. If a link produces a cycle in the BN, designate it as a delay link.
6. For each component, add a set of relevant failure nodes.
7. For each node, if there is a “defective” outgoing transition in the corresponding state in the GBM, add a failure link to the appropriate failure node.
Figure 4-8. Summary of the BN’s Qualitative Construction Steps
Now that the graphical (qualitative) part of the Bayesian Network is
constructed, we are ready to assign conditional probability values to each node in the
network. These conditional probabilities are used for inference, enabling probability
estimation, which in turn results in reliability prediction for the system.
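A compact sketch of these construction steps is shown below (Python; a simplified illustration rather than the actual implementation). The defect-to-state associations follow the SCRover discussion above, while the specific transitions and the single inter-component interaction are hypothetical placeholders.

def build_qualitative_bn(state_machines, interactions, defective_states):
    # state_machines: {component: {"states": [...], "initial": s, "transitions": [(origin, dest), ...]}}
    # interactions:   [(component, origin_state, other_component, dest_state), ...],
    #                 obtained by matching e/a pairs against other components' events
    # defective_states: {(component, state): [relevant failure types]}
    nodes, links, delay_links = {"init"}, [], []

    def reaches(src, dst):                    # is there already a path src -> dst?
        frontier, seen = [src], set()
        while frontier:
            n = frontier.pop()
            if n == dst:
                return True
            seen.add(n)
            frontier += [v for (u, v) in links if u == n and v not in seen]
        return False

    def add_link(u, v):                       # cycle-closing links become delay links
        (delay_links if reaches(v, u) else links).append((u, v))

    for comp, sm in state_machines.items():
        nodes |= {f"{comp}.{s}" for s in sm["states"]}
        nodes.add(f"R_{comp}")                                    # step 3
        links.append(("init", f"{comp}.{sm['initial']}"))         # steps 1-2
        links.append((f"R_{comp}", f"{comp}.{sm['initial']}"))
        for origin, dest in sm["transitions"]:                    # step 4: intra-component links
            add_link(f"{comp}.{origin}", f"{comp}.{dest}")
    for comp, origin, other, dest in interactions:                # step 5: inter-component links
        add_link(f"{comp}.{origin}", f"{other}.{dest}")
    for (comp, state), failures in defective_states.items():      # steps 6-7
        for f in failures:
            nodes.add(f"F_{f}_{comp}")
            links.append((f"{comp}.{state}", f"F_{f}_{comp}"))
    return nodes, links, delay_links

# Illustrative SCRover fragment; the transitions are placeholders, while the
# defect associations follow the discussion of Figure 4-4.
sms = {"controller": {"states": ["S1", "S2", "S3"], "initial": "S1",
                      "transitions": [("S1", "S2"), ("S2", "S3"), ("S3", "S1")]},
       "estimator":  {"states": ["S1", "S2", "S3"], "initial": "S1",
                      "transitions": [("S1", "S2"), ("S2", "S3"), ("S3", "S1")]}}
inter = [("controller", "S1", "estimator", "S2")]   # getWallDist / notifyDistChange
defects = {("controller", "S1"): ["Signature"],
           ("controller", "S2"): ["Signature", "PrePostCond", "Protocol"],
           ("estimator", "S2"): ["Signature"],
           ("estimator", "S3"): ["Signature"]}
nodes, links, delays = build_qualitative_bn(sms, inter, defects)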
4.3.2. Quantitative Representation of Our Bayesian Network
A major challenge of reliability prediction before the system’s implementation phase
is its unknown operational profile. If the operational profile of the system were avail-
able, the conditional probability values at various nodes could be deduced using sta-
tistical techniques similar to what was discussed in Chapter 3. The problem of
estimating reliability would be then transformed to performing standard inference
methods on the available data. However, given the uncertainties associated with early
reliability estimation (e.g., during the architecture phase), the best that can be done is
to offer an analytical method that given known information about the system (its
topology, components’ interactions, and individual components’ reliabilities) derives
the conditional probability values at each node using the knowledge of the system’s
architect. These conditional probability values describe the reliability (probability of
successful operation) at each node, given the reliability of its parents.
Conceptually at each node we need to define a formula that specifies the dependency
between the node’s reliability and the reliabilities of its parents. In other words, we
need to calculate the conditional probability of successful operation at that node,
given the probability of successful operation of its parents. For a node n, with parents p_1, ..., p_n, we need to calculate Pr(n | p_1, ..., p_n). The reliability at node n depends on the way each parent affects n. This relationship is one that can be logically formulated by the architect. For instance, consider node controller.S3 in the SCRover system. Its reliability depends on the reliability of its parents, controller.S2 and actuator.S1. We ask the architect to specify this relationship for each node in the Bayesian Network.
Below we describe a few possibilities for generic formulae to calculate these proba-
bilities and discuss the ramification of each option.
In the field of Reliability Engineering, Reliability Block Diagrams (RBDs) [101] are
used to graphically represent how the components of a system are connected reliabil-
ity-wise. The configuration of a system is typically represented using a serial, paral-
lel, or serial-parallel combination configuration. Other complex configurations (that
cannot be simply classified or broken down to serial and parallel relations) can be also
defined. Moreover, configurations such as k-out-of-n nodes that allow the analyst to
specify a form of redundancy known as k-out-of-n redundancy can be formulated. In
this form of redundancy, at least k out of n elements must function correctly in order
for the system to function correctly. We have used the concepts of RBDs and have
applied them to our problem of formulating the relationship between the reliability of a node and the reliabilities of its parents. Below we describe some of these configurations.
Serial Reliability Configuration. A node and its parents are known to be in a serial type of relationship with respect to their reliability dependency when the node’s reliability directly depends on the reliability of all of its parents. In other words, a low reliability of any of its parents directly (negatively) affects the reliability of the node, regardless of the reliability of the other parents. The reliability of a node depends on the reliability of all of its parents and is given by:

R_{node} = \prod_{i=1}^{n} R_i

where:
R_{node} = Reliability of a given node
R_i = Reliability of the parent node i

For example, in SCRover’s Bayesian Network, the architect may observe that the reliability of the system at estimator.S2 has a serial type of relationship with the reliability values at estimator.S1 and controller.S2. This would indicate that any change in the reliability at the two parent nodes directly affects the reliability at estimator.S2. Assuming R_12 and R_21 represent the reliability values at controller.S2 and estimator.S1 respectively, the reliability at the estimator.S2 node can be formulated as:

R_{estimator.S_2} = R_{12} \times R_{21}
In this type of relationship, the reliability value of a node is always less than or equal
to the reliability of its least reliable parent. In other words, as time progresses, the reli-
ability of the system always decreases and at best the system remains as reliable as it
has been in its previous time step. Clearly, if there are design decisions built into the
system’s architecture to enhance its reliability (e.g., redundancy), or to build fault tol-
erance into the system, a serial relationship cannot describe it sufficiently. A discus-
sion on the ramifications of this relationship and insights into situations where the
relationship is useful is given at the end of this section.
Parallel Reliability Configuration. In general, a parallel configuration relationship
between a node and its parents can be used to represent the concept of redundancy.
The node’s reliability is at least equal to or greater than the reliability of its most reli-
able parent. In this case, the unreliability of a node with n statistically independent
parallel parent nodes is the product of the unreliability value of all of the parents. In
other words, in a parallel setting, all n parents must have very low reliability for the
node to be very unreliable, i.e., if any of the n parents is highly reliable, then the node
will still be very reliable. The reliability of a node in a parallel configuration is then
given by:
R_{node} = 1 - \prod_{i=1}^{n} (1 - R_i)

where:
R_{node} = Reliability of the node
R_i = Reliability of the parent node i
In the real world, examples of this type of configuration include RAID-1 computer
hard drive systems, standard automobile brake systems (where the front and back
brakes typically act as a redundant mechanism), as well as cables supporting a float-
ing bridge.
In the context of the SCRover system discussed throughout this dissertation, redun-
dancy is not designed into the system. Consequently, an example of a node with this
property cannot be provided. However, some analysis of this configuration and its
ramifications on the reliability of a node and thus the reliability of the system can be
found in Chapter 6.
Other Complex Configurations. The serial and parallel configurations discussed
above are just two very basic forms of configuration that could determine the probabi-
listic relationship between a node and its parents. Other customized configurations
may describe the relationship between parent and child nodes in a complex system.
Examples include a partial parallel configuration known as the k-out-of-n parallel configuration. In this type of configuration, at least k of a node’s n parents must function correctly in order for the node to function correctly. An example of a system where this configuration is relevant is a four-engine airplane, where a minimum of two engines is required for it to be able to fly and still satisfy minimal reliability requirements. From a reliability perspective, this is a case of partial-parallel configuration: k=2 out of n=4 engines must be reliable in order to ensure the system reliability.
To generalize this case, one can observe that as the number of parent nodes that are
required to be reliable approaches the total number of parents, the behavior of the
configuration approaches the serial reliability case, where all the parent nodes are
required to be highly reliable in order for the child to be highly reliable. Detailed anal-
ysis of this case may be found in Chapter 6.
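The following sketch (Python; illustrative, with hypothetical reliability values) computes a node’s reliability from its parents’ reliabilities under the three configurations discussed above. The k-out-of-n case assumes statistically independent parents; setting k = n recovers the serial case, while k = 1 recovers the parallel case, consistent with the generalization noted above.

from itertools import combinations
from math import prod

def serial(parent_rs):
    # The node functions only if every parent functions.
    return prod(parent_rs)

def parallel(parent_rs):
    # The node fails only if every parent fails.
    return 1 - prod(1 - r for r in parent_rs)

def k_out_of_n(parent_rs, k):
    # At least k of the n independent parents must function.
    n = len(parent_rs)
    total = 0.0
    for working in range(k, n + 1):
        for subset in combinations(range(n), working):
            total += prod(parent_rs[i] if i in subset else 1 - parent_rs[i] for i in range(n))
    return total

engines = [0.95, 0.95, 0.95, 0.95]      # hypothetical per-engine reliabilities
print(serial(engines))                  # all four engines must work
print(parallel(engines))                # any single engine suffices
print(k_out_of_n(engines, k=2))         # at least two of four, as in the airplane example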
Using the above configuration scenarios, as well as customized relationships that may
be relevant to specific nodes in a system, we are able to formulate the conditional
probabilities at each node in the network. In the next section, we discuss how we use
this and similar techniques to perform system-level reliability analysis based on the
Bayesian methodology.
4.3.3. Discussion and Insights
Deciding on the particular reliability configuration at each node is tasked to the archi-
tect who has sufficient knowledge about interactions in the system and the dependen-
cies between various parts of the system. In particular, the relationships may be
defined by answering some of the following questions: How do states of components
affect the reliability of one another? Would a failure of a single component result in
system failure, or does it have to be combined with other components’ failures to do
so? Which components need to fail in order for the system to fail? Are there any
redundant components, such that failure of one of them does not affect the system’s
reliability? Answering these questions enables the architects to understand when fail-
ures happen, and helps them formulate stochastic formulas that describe the failure
behavior of components, and consequently predict the probability of system failure.
In the case of a system of components configured in a serial fashion, we noted that the
least reliable component has the biggest influence on the reliability of the system.
Conversely, in the case of components with parallel configuration, the component
with the highest reliability has the biggest effect on the system's reliability, since the
most reliable component is the one that will most likely fail last. This is an important
property of the parallel configuration, specifically when making design decisions
aimed at improving the system’s dependability in general. Evaluation of this relation-
ship may be found in Chapter 6.
4.4 System Reliability Analysis
In Section 4.3 a detailed discussion on construction of a Bayesian reliability model
using the system’s global behavioral models was provided. In this section, we take
that discussion one step further, and demonstrate how those principles can be lever-
aged when analyzing the reliability of systems. This is done in terms of insights and
guidelines in applying those principles to systems with different characteristics, illus-
trated in the context of the SCRover example.
The component reliability values obtained via the AHMM modeling discussed in
Chapter 3 serve as estimates of the individual component reliabilities in isolation.
That is, the quantification of an individual component’s failure behavior is performed
when considering the component’s normal and failure behaviors in isolation.
(While some of the analyses we performed on architectural models consider the components within a system, the estimated reliability value does not include quantification of failures that are related to interactions among components.) We use this stand-alone measurement as the basis of the “goodness” of each component in performing its operations and name it the raw reliability value. We use a component’s raw reliability value as a coefficient to the component’s initial state’s reliability. Recall that instantiation links from the init node in the BN to all components’ initial states signify the reliability of the system’s startup process. The conditional probability of the nodes corresponding to the initial state in each component must then be formulated as a function of the reliability of the startup process and the component’s raw reliability value. For example, in the case of the controller component in the SCRover, the controller.S1 node corresponds to the component’s initial state. The probabilistic formula describing the reliability at this node is thus formulated as:

R_{Controller.S_1} = \prod_{i=1}^{n} R_i = R_{init} \times R_{controller}^{raw}

where R_{init} is the reliability value of the startup process (assigned by the architect), and R_{controller}^{raw} is the raw reliability of the controller component obtained via the AHMM approach. Assuming that the reliability of the startup process is 0.999, the reliability of the controller component’s initial state (S1) in the SCRover system is calculated as follows:

R_{Controller.S_1} = 0.999 \times 0.93 = 0.929

The reliability of the startup process is a measure to be assigned by the architect. This parameter is used to represent the uncertainties associated with the system’s startup process. Experience with the particular system (or functionally similar systems) is the main source for initializing the values. If desired, a value of 1 could be used, which essentially represents no uncertainties predicted for the system’s startup process. We believe using a numerical value (less than 1) strengthens the model by accounting for unknown circumstances at startup.
An important observation is that in the SCRover’s Bayesian Network (depicted in the top portion of Figure 4-8), there are delay links entering the controller.S1 node. Once the network is expanded, the delay links are translated to normal links in the subsequent time steps, while the links in the first time step only include the links from the init node and the component reliability node. Thus the calculation above applies to the first time step in the expanded network. For subsequent time steps, there are additional parent nodes to the controller.S1 node whose reliabilities must be incorporated in the probabilistic formula.
Figure 4-8. SCRover’s Bayesian Network (top) and the Expanded Bayesian Network (bottom)
This approach is used to instantiate the reliability of the nodes corresponding to the
initial state of all components in the system. The next step involves assigning proba-
bilistic formulas to determine conditional probabilities at all other nodes, given their
parents’ reliabilities. For each node the architect will need to determine how the par-
ents’ reliability can affect the node’s reliability. For example, in cases where two or
more parents serve the same purpose in the model and can be treated as redundant, a
parallel configuration may be assigned. In other cases a serial configuration may be
suitable when each parent has equal and direct influence on the reliability of the node,
and when the node can never be more reliable than any of its parents. Still in other
cases, customized relationships may be defined depending on the domain and appli-
cation-specific knowledge, or on data from past experiences. For example, different
weights may be given to different parent nodes to amplify the importance of the reli-
ability of different parent nodes.
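As a sketch of how such per-node choices could be encoded (Python; illustrative only), each node can be paired with an architect-supplied formula over its parents’ reliabilities and evaluated in a topological pass over the expanded (acyclic) network. The serial rule for controller.S1 reproduces the 0.999 × 0.93 ≈ 0.929 calculation shown above; the estimator.S2 weights and the 0.95/0.92 parent values are hypothetical.

# Each node gets an architect-supplied formula over its parents' reliabilities
# (serial, parallel, weighted, or fully custom), evaluated over the acyclic network.
def node_reliabilities(parents, formulas, known):
    values = dict(known)                       # init, raw component reliabilities, ...
    remaining = [n for n in parents if n not in values]
    while remaining:
        for n in list(remaining):
            if all(p in values for p in parents[n]):
                values[n] = formulas[n]([values[p] for p in parents[n]])
                remaining.remove(n)
    return values

parents = {
    "controller.S1": ["init", "R_controller"],
    "estimator.S2": ["estimator.S1", "controller.S2"],
}
formulas = {
    "controller.S1": lambda rs: rs[0] * rs[1],              # serial rule from the text
    "estimator.S2": lambda rs: 0.7 * rs[0] + 0.3 * rs[1],    # hypothetical weighted rule
}
known = {"init": 0.999, "R_controller": 0.93,
         "estimator.S1": 0.95, "controller.S2": 0.92}        # last two values are illustrative
print(node_reliabilities(parents, formulas, known))          # controller.S1 comes out near 0.929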
The final step involves assigning probabilistic formulas to each failure node. These
formulas determine how different nodes in different components contribute to spe-
cific types of failure. There is an important difference between the failure nodes and
the other nodes in the network. Throughout the network, for each non-failure node,
we model the probability of success or reliability. However, at each failure node, there
129
is a conceptual change and the probability of failure is modeled instead. Conse-
quently, the formulas at the failure nodes must reflect this distinction by assigning the
complementary probability value (1- R) at the node.
As an example, consider the failure node F_signature_controller in the SCRover model. This specific failure may be caused by problems at either the controller.S1 or controller.S2 nodes. Let us assume that the architect has determined that each of the parent nodes contributes equally to this signature failure, and that the unreliability at this node is equal to or greater than the unreliability of the most unreliable parent. This justifies a serial configuration for this node. The conditional probability value at this failure node is thus calculated as shown below:

F_{controller.F_4} = 1 - \prod_{i=1}^{n} R_i = 1 - (R_{controller.S_1} \times R_{controller.S_2})

where F_{controller.F_4} is the probability of failure at the controller.F4 node.

A similar approach to failure probability calculation must be followed for all failure nodes in the network. Once all probabilistic formulas for all the nodes in the model are assigned, Bayesian inference is used to leverage the known data (in this case, components’ raw reliability values and the nodes’ probabilistic relationships) to infer unknown information about the system. In our case the primary relevant unknown
information about the network is the probability of occurrence of each failure type.
The system’s failure probability (or its unreliability) may be formulated by aggregat-
ing the probabilities of occurrence of all types of failure in the system.
Devising a generalized technique to aggregate individual failure probabilities into an
overall system failure probability is unreasonable given the different development
scenarios and domain-specific issues related to different applications. Recall that the
number and types of failure nodes are tied to our defect classification discussed in
Chapter 2. Using the above approach, we obtain the failure probabilities in various
components in the system. We propose two alternative approaches for aggregating
these probabilities and obtaining the system’s probability of failure or its unreliability.
The first approach is a simple approach and results in a “basic” probability prediction
which directly combines all components’ failure probabilities using a Radar Chart.
The second approach incorporates the cost values assigned by the domain expert in
the aggregation process and results in a “cost-based” reliability prediction. Below we
discuss each technique in detail.
Basic Reliability Prediction. In this approach, we simply calculate the system’s fail-
ure probability (or its unreliability) in terms of the cumulative effect of each compo-
nent’s failures on the system. We plot values of various failure probabilities using a
simple Radar Chart. Each failure probability is plotted along an axis. The number of
axes is equal to the number of failure nodes in the BN, and the angles between all
axes are equal. Figure 4-9 depicts our instantiation of the Radar chart for the SCRover
system. Four axes represent controller’s PrePostCondition, Protocol, and Signature
failures as well as estimator’s Signature failure. Each axis has a maximum length of 1.
A point closer to the center on any axis depicts a low value, while a point near the
edge of the polygon depicts a high value for the corresponding failure probability. The
cumulative effect of the failures can be obtained by calculating the surface area
formed by all the axes. The surface area is calculated using a triangulation method discussed in Chapter 2. Using this technique, the overall system reliability is estimated as 0.965.

\text{area} = \frac{1}{2} \times \frac{1}{2} \times \sin\left(\frac{2\pi}{4}\right) \times [0.136829 \times 0.07093 + 0.07093 \times 0.07093 + 0.07093 \times 0.262404 + 0.262404 \times 0.136829]

\text{Reliability} = 1 - \text{area} = 0.982686738

Figure 4-9. Cumulative Effect of Different Failures in SCRover

The reliability value estimated using this technique does not directly incorporate the cost associated with defect recovery as specified in our cost-framework. The cost values are, however, indirectly considered as part of the initial component reliability values obtained via the HMM methodology. Calculating reliability using this approach would be beneficial when considering system reliability from a purely technical point of view: the interactions among components. However, in some cases, we may be interested in performing reliability-based risk analysis, and incorporating economical aspects of software development into the reliability estimation process. In those cases, one approach would be to directly incorporate the cost associated with each failure into the aggregation process.

Cost-based Reliability Prediction. In this approach we leverage the cost-framework discussed in Chapter 2, and incorporate the cost of recovery for each type of failure (as assigned by the domain expert) into the cumulative failure probability calculation. Recall that a domain expert instantiates our cost-framework by assigning failure costs in the system. The cost of a failure has an inverse relationship with the probability of
recovering from that type of failure: as the cost associated with the recovery
increases, the probability of recovery decreases, and vice versa.
To do so, we build a Radar Chart that takes the recovery probabilities into consider-
ation. Specifically, the value of each axis on the chart is adjusted to incorporate the
cost of recovery from the related failure type. The recovery probabilities as assigned
by the domain expert for our cost framework in Chapter 2 is presented again in
Figure 4-10. Leveraging this data, we adjust the numerical value for each axis in the
Radar Chart, by multiplying failure probabilities by the cost of recovery for that fail-
ure. This is justified by, for example, considering that the system is more likely to
recover from failure with a high probability of occurrence, for which a low cost of
revery is assumed (i.e., high probability of recovery). Alternatively, a failure with
very low probability of occurrence and low probability of recovery may be consid-
ered more critical than a failure with low probability of occurrence and high probabil-
ity of recovery. The new radar chart depicting these adjusted values given the
previous calculation and the data depicted in Figure 4-10 is shown in Figure 4-11.
The new system reliability value calculated based on the weighted cumulative
approach is as follows:
\text{area} = \frac{1}{2} \times \frac{1}{2} \times \sin\left(\frac{2\pi}{4}\right) \times [0.02770 \times 0.0155 + 0.0155 \times 0.02048 + 0.02048 \times 0.05314 + 0.05314 \times 0.02770]

\text{Reliability} = 1 - \text{area} = 0.99917
Figure 4-10. Recovery Probability for Different Failures (Inverse of Cost): Usage = 0.7075, Incomplete = 0.5725, Signature = 0.7975, Pre/post cond = 0.780625, Interaction protocols = 0.71125
Figure 4-11. Weighted Cumulative Effect of Failures for SCRover
Incorporating the cost values in this case resulted in an increase in the predicted reliability value. This is a result of the specific recovery probability values (inverse of the cost) for various failures. Particularly, it turns out that since the architect considers
some of the failures to have a high recovery probability, the effect is that given the
available resources, it is possible to recover from the corresponding failures. Clearly,
this approach would be most beneficial to use if it is directly tied to an economical
model of the software development process, which explicitly identifies resources and
justifications for specific cost values.
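The numbers used in the two Radar Chart calculations above can be reproduced with the short sketch below (Python; illustrative). The failure probabilities are listed in the order they appear in the area formulas; which value maps to which axis is determined by the Bayesian inference step and is not restated here. The normalization factor simply follows the area expressions reproduced above.

import math

def radar_reliability(failure_probs):
    # Complement of the (normalized) polygon area spanned by the failure-probability
    # axes; adjacent axes are paired cyclically, as in the area expressions above.
    n = len(failure_probs)
    pairs = zip(failure_probs, failure_probs[1:] + failure_probs[:1])
    area = 0.5 * 0.5 * math.sin(2 * math.pi / n) * sum(a * b for a, b in pairs)
    return 1.0 - area

basic = [0.136829, 0.07093, 0.07093, 0.262404]   # failure probabilities from the basic calculation
weighted = [0.02770, 0.0155, 0.02048, 0.05314]   # values adjusted by the cost of recovery
print(radar_reliability(basic))                  # roughly 0.9827
print(radar_reliability(weighted))               # roughly 0.99917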
The complete model of the SCRover’s Bayesian Network generated in the Netica
Environment can be found in Appendix C. Analysis of this model to demonstrate the
properties of the reliability model is provided in Chapter 6.
Chapter 5: Tool Support
In order to support various computational tasks related to this dissertation research,
we use three loosely integrated environments: Our in-house architectural modeling,
analysis, and evolution environment called Mae [111], enables us to model a system
and its components using different quartet views, and ensure consistency among the
views. It provides us with a set of defects revealed as part of the analysis enabled by
the tool. We then use an extension to an openly available toolbox for MathWorks’
Matlab that supports Hidden Markov Modeling [82], to perform component reliabil-
ity calculations. Finally we use Norsys’s Netica environment [90] capable of perform-
ing Bayesian inference to perform system-level reliability analysis. Figure 5-1 depicts
the process view of the various tools used for this research. Numbers 1, 2, or 3 on
each arc signify various phases of the process: architectural modeling and analysis,
component reliability modeling, and system reliability modeling respectively.
In this chapter, we first describe the Mae environment in terms of its architecture and
its functionality. We then discuss the Matlab extension developed to perform compo-
nent reliability modeling using Hidden Markov Models. Finally we introduce the Net-
ica environment for Bayesian Modeling of system-level architectural reliability.
5.1 Mae
The Mae environment is an architectural modeling, analysis, and evolution environ-
ment that combines principles of architectural modeling with those of configuration
management [111]. It leverages a rich system model to provide a novel approach for
managing architecture-centered software evolution. It anchors the evolution process
to the architectural concepts of components, connectors subtypes, and interfaces,
enhancing it with the power and flexibility of the Configuration Management con-
cepts of revisions, variants, options, and configurations. The result is a novel architec-
tural system model with an associated architecture evolution environment. The
environment is extensible: it allows the users to customize the definition of compo-
nents, connectors, and interfaces, and enables them to define additional properties of
interest using a set of XML schemas. The tool seamlessly integrates the customized
definitions and regenerates its graphical user interface to allow for modeling of the
new concepts.
The architecture of this environment is shown in Figure 5-2. It consists of four major
subsystems: The first subsystem, the xADL 2.0 data binding library [25], forms the
core of the environment. The data binding library is a standard part of the xADL 2.0
infrastructure that, given a set of XML schemas, provides a programmatic interface to
access XML documents adhering to those schemas. In our case, the data binding
library provides access to XML documents described by a set of customized XML
schemas that offer a rich definition of architectural concepts in accordance with the
Quartet approach. Therefore, the xADL 2.0 data binding library, in essence, encapsu-
lates the Quartet models by providing a programmatic interface to access, manipulate,
and store evolving architecture specifications. The details of the XML schemas pro-
viding definition of the Quartet approach may be found in Appendix A.
Figure 5-1. Overall View of the Required Tools for Architectural Reliability Modeling and Analysis
The three remaining subsystems of Mae each perform separate but complementary tasks as part of the overall process of managing the evolution of a software architecture:
• The design subsystem combines functionality for graphically designing and
editing an architecture in terms of its structure and its behavior. This subsystem
supports architects in performing their day-to-day job of defining and maintaining
architectural descriptions, while also providing them with the familiar check out/
check in mechanism to create a historical archive of all changes they make.
• The selector subsystem enables a user to select one or more architectural
configurations out of the available version space.
• Finally, the analysis subsystem provides sophisticated analyses for detecting
inconsistencies and defects in the architectural models.
Figure 5-2. Mae’s Architecture
The analysis subsystem provides vital support for our approach to reliability estimation. It offers the ability to ensure the consistency among various Quartet views of a
software system. The basis for these analyses to ensure inter- and intra-consistencies
among various Quartet views was described in Chapter 2.
Once the architectural modeling and analysis phase is complete, a set of defects are
obtained. We classify these defects according to our architectural defect taxonomy
discussed in Chapter 2, and use the results directly in the component-level and sys-
tem-level reliability analysis.
5.2 Component Reliability Modeling
Once defects from architectural modeling and analysis phase are obtained, the data is
used to quantify the effect of each defect and obtain a quantification of the reliability
of each component. Component reliability estimation is done by building an exten-
sion to the Matlab Hidden Markov Model toolbox [82]. This extension leverages the
results obtained from the Expectation-Maximization algorithm to calculate the steady
state vector of the Markov Model associated with each component. In doing so, it also
incorporates results from the cost-framework in terms of cost and probability of
recovery from various defects.
The Matlab code used for estimating component reliability values is presented in
Appendix B.
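As a rough sketch of the underlying computation (Python with NumPy rather than Matlab; illustrative only), the steady-state vector of a component’s augmented Markov model can be obtained by solving πP = π with the probabilities summing to one, after which the probability mass residing in the failure states gives one simple reading of the component’s unreliability. The transition matrix below is hypothetical, and the exact mapping from the steady-state vector and the cost-framework to the reported reliability value follows the treatment in Chapter 3.

import numpy as np

def stationary_distribution(P):
    # Solve pi P = pi with sum(pi) = 1 for an ergodic chain (least-squares form).
    n = P.shape[0]
    A = np.vstack([P.T - np.eye(n), np.ones(n)])
    b = np.zeros(n + 1)
    b[-1] = 1.0
    pi, *_ = np.linalg.lstsq(A, b, rcond=None)
    return pi

# Hypothetical 4-state component model: states 0-1 model normal behavior,
# states 2-3 are failure states; recovery transitions return to state 0.
P = np.array([
    [0.80, 0.13, 0.05, 0.02],
    [0.15, 0.78, 0.05, 0.02],
    [0.70, 0.00, 0.30, 0.00],
    [0.78, 0.00, 0.00, 0.22],
])
pi = stationary_distribution(P)
reliability = 1.0 - pi[2:].sum()     # one simple reading: mass outside the failure states
print(pi, reliability)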
5.3 System Reliability Modeling
Once the reliabilities of individual components are obtained via our HMM-based reli-
ability model, we use a Bayesian Network editor called Netica [90] to build a system-
level Bayesian reliability model. Netica offers a graphical editor to create Bayesian
Networks, and in addition to Bayesian inference, offers sensitivity analysis function-
ality. It also enables the user to specify probabilistic equations at each node in the
model that we use to specify the reliability of the nodes based on the reliability of
their parents. The network is then compiled and upon completion of the inference
process, the probability of arriving at each failure node is calculated. We then use
those values to obtain a measure of the system’s overall reliability using the tech-
niques described in Chapter 4.
Netica stores the models in a textual file format. The Bayesian model of the SCRover
system is presented in Appendix C.
Chapter 6: Evaluation
In this chapter, we evaluate our software architecture-based reliability modeling
approach to demonstrate that reliability prediction of software systems’ architectures
early during the development life-cycle is both possible and meaningful. The main
challenge associated with the early reliability prediction problem is the lack of imple-
mentation artifacts, and thus lack of knowledge about the systems’ operational pro-
files. Suitable reliability models must be able to address this challenge and the
associated uncertainties.
Recall that our approach uses a set of architectural modeling views called the Quartet
to compositionally model software systems in terms of their constituent components.
Analysis of these models reveals potential architectural defects that could result in a
reduction in the reliability of components, and thus the reliability of the entire system.
We quantify the impact of these defects using an architectural defect classification
and a cost-framework. An Augmented Hidden Markov Model (AHMM) leverages the
quantification results, as well as the Quartet models, to predict individual compo-
nents’ reliabilities. Finally, a Bayesian reliability model is constructed to composi-
tionally predict the overall reliability of the system, given the reliabilities of
individual components and their interactions.
The goal of our evaluation is to ensure that our methodology used for reliability pre-
diction is sound, and that the results are both meaningful and useful. We evaluate the
approach using the following criteria:
1. The coverage of our architectural analyses, as well as our defect classification is
evaluated empirically.
2. The component reliability prediction methodology is evaluated using sensitivity
and uncertainty analyses. The goal is to show the sensitivity of the model to
changes in various model parameters. Moreover, we demonstrate that our model
is capable of handling uncertainties associated with the components’ unknown
operational profiles.
3. The complexity and scalability of our adaptation of the Expectation-Maximiza-
tion algorithm is evaluated theoretically.
4. The system-level reliability model is evaluated in terms of sensitivity analyses
with respect to various model parameters. These analyses also demonstrate the
usefulness of the approach in terms of its ability to offer helpful insight that can
aid the architect as a decision tool during the development process.
The rest of this chapter is organized as follows. In Section 6.1 we discuss the empiri-
cal evaluation of our architectural modeling, analysis, and defect classification meth-
odology. Sections 6.2 and 6.3 present the evaluation of our component-level and
system-level reliability model respectively.
6.1 Architectural Analysis and Defect Classification
Our architectural modeling, analysis, and evolution environment Mae [111] provides
utilities for modeling the structural and behavioral aspects of software systems. It also
offers a suite of analysis tools for Quartet models based on the view consistency and
conformance principles discussed in Chapter 2. Since specific analysis techniques are
not a contribution of this dissertation research, the evaluation of those techniques is
beyond the scope of this work. However, the architectural defects revealed by the
analyses are directly leveraged by our reliability models, and thus we present an
empirical evaluation of our defect classification framework.
Our defect classification was developed after extensive study of the results of archi-
tectural analyses obtained from three different modeling and design methodologies.
This study was done in the context of the SCRover project [52], a robot testbed based
on NASA JPL's Mission Data System (MDS) [27], and in the context of NASA’s
High Dependability Computing Program (HDCP) [86]. The goal of the study was to
understand the tradeoffs among different architectural modeling approaches.
We extensively studied and documented our experience in using a UML-based meth-
odology called Model-Based Architecting and Software Engineering MBASE [125],
as well as two representative Architecture Description Languages (ADLs), Acme [40]
and Mae [111] to model SCRover. Both Acme and Mae models were derived from the
initial MBASE design and SCRover documentation, but were developed indepen-
dently of each other, and focused on different aspects of the architecture. We studied
the differences that resulted from focusing on varying aspects of the original docu-
mentation. We will show how these differences led to the automatic detection of dis-
tinct, but complementary, classes of errors, and how automatic analysis afforded by
either Mae or Acme yields better results than peer-reviews of the SCRover documen-
tation for architectural defects [110].
The results of these studies reinforced the hypothesis of the benefits of multi-view
modeling. The independence of the research groups performing each modeling and
analysis activity enabled us to empirically validate our defect classification. In the
rest of this section we first offer some statistics on the type and numbers of defects
that our architectural modeling environment Mae was able to detect in comparison
with the other approaches. These results were initially classified using a standard
Software Development Life Cycle classification scheme [125]. Since this classifica-
tion did not focus on architectural issues, we collaboratively developed a new taxon-
omy (presented in Chapter 2). The types and numbers of defects detected by all three
approaches based on our newly developed taxonomy of architectural defects are pre-
sented later in this section.
Figure 6-1 depicts the total number of defects found by all approaches (left column)
against the subset that can be captured in Mae-MDS models (middle column), and
those detected by Mae (right column). The categories in this graph correspond to a
standard SDLC classification scheme [125] and include Interface, Class/Object,
Logic/Algorithm, Ambiguity, Data Values, Inconsistency, and Others. We found this
classification too broad and inefficient in the context of defects rooted at the architec-
ture and design stages. For instance, the category Class/Object included both defects
rooted at the architecture, as well as implementation-level defects. Similarly, the
Logic/Algorithm category contained architectural defects, some of which dealt with
mismatched expectations among communicating components (i.e., pre/post-condition
mismatch), while others were defects caused by violation of the MDS architectural
style. Consequently, in collaboration with researchers from Carnegie Mellon Univer-
sity, we developed a new classification scheme (Chapter 2) that specifically focuses
on defects that are architectural in nature.
Figure 6-1. Mae Defect Detection Yield by Type (defect categories: Interface, Class/Obj, Logic/Alg, Ambiguity, Data Values, Other, Inconsistency; series: total defects found, defects representable in Mae, defects detected by Mae)
Figure 6-2 depicts the results of the re-classification of defects, in terms of the total
number and respective types of all architectural defects detected by the three model-
ing approaches using our defect classification.
The three independent modeling approaches not only confirmed each other’s analysis
results, but also demonstrated the value of viewing SCRover (and MDS) from differ-
ent perspectives. Mae and Acme in tandem detected all architectural defects identified
by the peer-review of UML models, and additionally identified previously undiscov-
ered defects. UML peer-reviews, on the other hand, identified additional classes of
defects that were not architectural in nature [14].
Figure 6-2. Defects Detected by UML, Acme, and Mae (by Type and Number)
The results presented here show that Quartet-based modeling is capable of capturing
useful information about a software architecture, and that the related analysis can
reveal critical architectural defects. Classification of these defects based on our taxon-
omy of architectural defects is used by a cost-framework (presented in Chapter 2) to
quantify the impact of defects. The quantification results are leveraged by our reli-
ability models to predict the components’ and the system’s reliability. The next two
sections evaluate the component- and system-level reliability models, respectively.
6.2 Component Reliability Prediction
Numerical results presented in this section are obtained via extensions to our Java-
based architectural modeling and analysis environment, Mae [111], and Matlab simu-
lations. Our results demonstrate that our AHMM methodology is effective in model-
ing the architectural reliability of software components in the presence of
uncertainties associated with early reliability prediction. Furthermore, they demon-
strate that our model is useful in enhancing the development process by offering cost-
effective strategies for mitigating architectural defects. The latter is done by providing
sensitivity analyses aimed at identifying the most critical defects in the component’s
architecture. In providing these analysis, we also demonstrate that our reliability pre-
diction approach is meaningful: changing model parameters exhibits predictable
trends in the estimated component reliability values. Finally, we provide an analytical
evaluation of the scalability and complexity of our approach. The results indicate that
the complexity and scalability of the model are bounded by the complexity and scal-
ability of the underlying formalisms (state machines and the Expectation-Maximiza-
tion algorithm). The details of the evaluations in all three categories – uncertainty,
sensitivity, and complexity – are provided in the rest of this section.
6.2.1. Uncertainty Analysis
There are multiple sources of uncertainties associated with reliability modeling dur-
ing the architecture and design phases. In the context of component reliability model-
ing, we have identified two primary sources of uncertainty:
1. Uncertainties associated with incorrect component behavior.
2. Uncertainties associated with unknown operational profile.
The first type of uncertainty at early stages of software development is a side-effect of
the nature of architectural models. Typically these models only specify the intended
(or desired) behavior of components. The undesired behaviors which may reveal the
cause of defects (and subsequent failures), are not typically modeled explicitly. Our
component reliability model innately addresses this type of uncertainty. We extend
the model of the desired behavior of components (dynamic behavioral model), and
enhance it with states that represent failure conditions, and transitions that represent
the nature of these failures. Recall the details of our component-level reliability
model in terms of failure states, and failure and recovery transitions as described in
Chapter 3. While a domain expert is expected to instantiate the failure transition prob-
abilities, recovery transitions’ probabilities are instantiated using our cost-framework.
We evaluate the approach in terms of uncertainties associated with probabilistic
instantiation of failure transitions. Sensitivity of the model to recovery probability
values is evaluated later in this section.
The second type of uncertainty relates to the unknown operational profile of the
component. A component’s exact operational profile may only be obtained via moni-
toring its operation while deployed in the field. During early stages of development,
however, conditions that are representative of the component in the field may not
exist. We offer three different experiments to evaluate the uncertainties associated
with the unknown operational profile. As discussed in Chapter 3, in the absence of
operational profile data, we use a data synthesis approach to fabricate the training
data needed for reliability modeling. Under ideal conditions, the data is generated
using domain expertise. For non-ideal cases we generate the data randomly. We com-
pare the results for a number of components under both conditions, and discuss how
our approach addresses the uncertainties associated with unknown operational profile.
6.2.1.1 Behavioral Uncertainties
Recall that our component reliability framework relies on the knowledge of a domain
expert in order to specify the probabilities of various failures at each component state.
We appreciate that determining this probability may be a challenging task. Especially
in cases when no prior data from the component in operation exist, it may be difficult
for the expert to estimate these probabilities with a great degree of confidence. In
Chapter 3, we discussed that in order to accommodate this uncertainty, we opt to pro-
vide a range of analyses for component reliability. To do this, we calculate the reli-
ability of a component, given a threshold for each failure probability value. We
demonstrated this technique for the SCRover’s controller component in Chapter 3,
and presented the predicted reliability values based on a threshold for failure proba-
bilities. In this section, we demonstrate the changes to the predicted reliability values
for the full range of possible failure probabilities. The goal is to demonstrate that the
reliability model reacts predictably to different failure and recovery probabilities, and
that it produces meaningful results.
Figure 6-3 demonstrates the changes to the controller component’s predicted reliability values (y-axis) for a range of failure probabilities for the two failure states F1 and F2 (refer to Chapter 3 for the value of other parameters in the reliability model). As expected, overall the reliability model reacts correctly and meaningfully to the changes in input parameters: an increase in the failure probabilities results in a decrease in the component reliability. An interesting conclusion based on the results depicted in Figure 6-3 is that changes to the value of probability of failure F1 (denoted by PF1) cause a sharper decrease in the component reliability than changes to the value of probability of failure F2 (PF2). This may be explained given
the specific reliability model of the controller component. Recall that the expert has instantiated the two probabilities of failures as PF1 = 0.05 and PF2 = 0.02. Furthermore, (as shown in Figure 3-4), all four (non-failure) states in the model may result in a failure of type F1 (each with probability PF1). However, only two states may result in a failure of type F2 (each with probability PF2). Consequently, the probability that a failure of type F1 may occur from any of the four states at all is 4PF1. Similarly, the probability of failure F2 occurring from any of the two states is 2PF2. In our analysis, we kept one of the parameters constant and analyzed the sensitivity of the model to changes in the second parameter. In this model, there is always a greater likelihood of failure of type F1 as a result of changes to PF1.
Figure 6-3. Controller Component Reliability Analysis Based on Various Probabilities of Failures to the Two Failure States
To confirm the conclusions, we repeated the above experiment with an arbitrary component with 10 states and 14 interfaces. Let us assume that 4 types of defects are identified during architectural analysis. Our model thus is extended with 4 failure states F1, F2, F3, and F4, representing usage, interaction protocol, signature, and pre/post condition failure types respectively. The probability of recovery from each type of failure is calculated using the values obtained from our cost-framework as depicted in Figure 6-4: RP(F1) = 0.7075, RP(F2) = 0.71125, RP(F3) = 0.7975, and RP(F4) = 0.7806, where RP(F_i) represents the recovery probability from state F_i. Let us assume that other transition probabilities are instantiated randomly. Figure 6-5 shows the effect of changes to various failure probability values (PF(F_i)) on the component’s reliability. As expected, as the probability of various types of failures increases, the overall component’s reliability decreases, with the changes to PF(F4) (probability of pre/post condition failures) having a slightly greater impact than other failure types. Given the random and synthesized nature of this model, however, it is not possible to draw insights and intuitive conclusions regarding the impact of different types of failures on the component’s reliability.
In working with synthesized components and components with larger number of
states, we soon realized the challenges involved in instantiating probability matrices
for these models. Obviously, the easiest approach is to generate the necessary matri-
ces randomly. The important question is, how do the results obtained from random
and expert instantiation compare with each other? In the next subsection we provide
insights with respect to random and expert instantiation of probability matrices that
correspond to a component’s operational profile. Here we discuss the effect of ran-
dom and expert instantiation for the failure probability matrices.
Figure 6-4. Cost-framework Instantiation for Different Defect Types based on data in Chapter 2

Figure 6-5. Changes to a Random Component's Reliability based on Different Failure Probabilities
Given the above model of an arbitrary component with 10 states, and a set of training data, we evaluated the sensitivity of the model to different failure probability values using two experiments. In both experiments, we randomly generated the matrix representing the failure probabilities. However, in the first experiment, we generated a full matrix where essentially all the elements were non-zero. This indicates that there is a chance of all four types of failures happening from every state in the component's model. The probabilities of failures ranged from 0.0019 to 0.0995 (0.19% to 9.95%, respectively). The mean of the predicted reliabilities after 100 iterations of the EM algorithm was 77.92%, with the corresponding histogram depicted in Figure 6-6 (left). We then re-calculated the reliability of the component, with a new instantiation of the failure probability matrix. This time, the random generation of the matrix created a sparse matrix with entries within the same range as the last experiment (0.19% to 9.95%). The sparse matrix indicates that only some types of failures are likely to occur at certain component states. The mean of the predicted reliability values after 100 EM iterations was estimated at 95.27%. The corresponding histogram is depicted in the right hand side of Figure 6-6. The results confirm our intuition and the insights obtained from the controller component: a full matrix offers a greater opportunity for the occurrence of various failures, while the sparse matrix limits this possibility to only a few states: the more opportunity for failures, the lower the component reliability. The primary conclusion is that expert instantiation of failure transition probabilities is critical in obtaining an accurate prediction of component reliabilities, given that the expert's knowledge is more representative of the expected behavior of the component.
Figure 6-6. Predicted Reliability for an Arbitrary Component Given a Full Failure Probability Matrix (Left), and a Sparse Failure Probability Matrix (Right)
In the next subsection, we offer our analysis of the reliability model with regard to its
ability to handle other uncertainties, particularly those associated with the compo-
nent’s operational profile.
6.2.1.2 Operational Profile Uncertainty
The biggest challenge in architecture-level prediction of component reliability is the
unknown nature of a component’s operational profile at this stage of development. As
discussed earlier, a useful reliability model must be able to handle this type of uncer-
tainty and produce meaningful results. Our model addresses this challenge, and in this
section we evaluate it using a set of analyses.
As mentioned in Chapter 3, in cases where the operational profile of the component is not available, we essentially synthesize this data. This is done by synthesizing a set of training data for the HMM-based reliability model, and by leveraging domain knowledge. The Expectation-Maximization algorithm then uses this data to estimate the best operational profile for the component. The reliability model leverages the obtained operational profile, and provides a prediction of the component's reliability.
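As a sketch of the data synthesis step, the fragment below samples synthetic observation sequences from an assumed transition and observation instantiation; an EM (Baum-Welch style) procedure is then fit to such traces. The matrices are purely illustrative and are not the controller component's.

```python
# Minimal sketch of synthesizing training traces from a domain-expert-style
# instantiation (hypothetical matrices; each row is a probability distribution).
import numpy as np

rng = np.random.default_rng(7)
A = np.array([[0.0, 0.8, 0.2],      # state-to-state transition probabilities
              [0.1, 0.0, 0.9],
              [0.7, 0.3, 0.0]])
B = np.array([[0.9, 0.1, 0.0],      # per-state probabilities of each observable interface event
              [0.0, 0.6, 0.4],
              [0.2, 0.0, 0.8]])

def sample_sequence(length=25, start=0):
    state, obs = start, []
    for _ in range(length):
        obs.append(int(rng.choice(B.shape[1], p=B[state])))
        state = int(rng.choice(A.shape[0], p=A[state]))
    return obs

training_data = [sample_sequence() for _ in range(100)]   # e.g., 100 synthetic traces
print(training_data[0])
```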
Since the data synthesis process relies on the domain expert’s knowledge, it is critical
to analyze the ability of the model to handle uncertainties associated with this instan-
tiation. We do so using two types of analyses. First, in cases where a domain expert
instantiates the model, we want to analyze the importance of exact instantiation on the
estimated reliability. In other words, we want to determine the effect of fluctuations
within Initial Transition Probabilities (ITPs) on the predicted reliability. The second
set of analyses is aimed at determining the importance of expert instantiation, and the
impact of random instantiation of the transition probabilities in cases where domain
expertise is not available. This is particularly critical for cases when the model is too
complex (too many states or interfaces) and thus the instantiation process is too
tedious, or when sufficient expert knowledge is not available.
To determine the effect of fluctuations in the model's initialization parameters, we performed sensitivity analysis both on the controller component and on synthesized (arbitrary) components with various complexities (5, 10, and 20 states). We analyzed each component using an Initial Transition Probability matrix instantiated by an expert, and various levels of noise (fluctuation) in the matrix values (5%, 10%, and 20% noise). These noise levels would represent a range of errors from minor to rather significant. Before presenting the results, let us explain the methodology for incorporating noise into the matrices. For each row of a matrix, one element is selected at random. Then a specific percentage of its value (e.g., 5%) is subtracted from the element's value. Finally, a total of 5% of its value is added to the rest of the elements in the same row to ensure that the sum of each row still adds up to 1 (refer to Chapter 3 for the specific properties of these matrices). This methodology can represent an architect's mistake in a single probability value (and its domino effect on other related probability values).
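A direct sketch of this noise-injection step is shown below (Python). Equal redistribution of the removed mass over the remaining entries is one possible reading of "added to the rest of the elements"; the example matrix is hypothetical.

```python
# Sketch of the noise methodology described above: perturb one entry per row and
# redistribute the removed mass so every row still sums to 1.
import numpy as np

def add_noise(matrix, noise=0.05, seed=0):
    rng = np.random.default_rng(seed)
    noisy = matrix.astype(float).copy()
    for row in noisy:
        i = rng.choice(np.flatnonzero(row))      # pick one (non-zero) entry to perturb
        delta = noise * row[i]
        row[i] -= delta
        others = [j for j in range(len(row)) if j != i]
        row[others] += delta / len(others)       # split the removed mass equally (an assumption)
    return noisy

itp = np.array([[0.7, 0.2, 0.1],
                [0.3, 0.3, 0.4],
                [0.5, 0.0, 0.5]])
noisy_itp = add_noise(itp, noise=0.20)
print(noisy_itp)
print(noisy_itp.sum(axis=1))                     # each row still sums to 1
```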
In the case of the controller component, the result of noise introduction to the ITPs can be seen in Figure 6-7. The three experiments (depicted along the x-axis) correspond to the 5%, 10%, and 20% noise levels, respectively. The changes in the predicted component reliability (in terms of percentages) are depicted along the y-axis. As can be seen, fluctuations of up to 20% have resulted in at most a 0.1% change in the predicted reliability. This shows that the model in this case is resilient to uncertainties associated with the unknown operational profile, and that even significant fluctuations in the ITP (in this case up to 20%) have only a minor influence on the results.

Figure 6-7. Percentage of changes in the reliability value of the controller component (5%, 10%, and 20% Noise)
Additional experiments resulted in similar conclusions about the ability of the model
to handle this type of uncertainty. We performed similar experiments on a set of arbi-
trary components. We varied the number of states in each component in order to study
the effect across components with different complexities. Particularly, we studied the
results in the context of components with 5, 10, and 20 states, with 10 interface ele-
ments and 4 failure states. The results depicted in Figure 6-8 demonstrate that noise of up to 5% induced in the components resulted in changes between –0.055% and +0.014% in the component reliability. Similarly, a 10% noise resulted in a –0.111% to +0.042% change in the estimated reliability. Finally, a 20% noise resulted in a –0.236% to +0.099% change in the component reliability value. In other words, noise of up to 20% in the ITPs resulted in a fluctuation of up to 0.23% in the component reliability value. Once again, these results confirm that the model is resilient to fluctuations in the component's transition probabilities. That is, if the domain expert is unable to specify the "exact" operational profile for the component, the impact on the estimated reliability is not too pronounced. An interesting question at this point is, why is this the case?
Figure 6-8. Percentage Change in Reliability Value of Three Arbitrary Components with 5, 10, and 20 States (5% noise, 10% noise, and 20% noise, respectively)

The answer lies at the heart of our approach. From Chapter 3, recall the last step of the reliability prediction process, where the reliability value is calculated as a function of the probability of not being in a failure state at time t_n. In other words:

Reliability = 1 - \sum_{i=1}^{M} V(F_i)

where V(F_i) is the steady-state probability vector. It can be seen that the calculated reliability value depends on the probability of being in a failure state at time t_n. Recall
the Markov property which assumes that the probability of transition to the next state
at time t+1 depends on the system at time t and is independent of its past history.
Using this assumption, the final calculated reliability primarily depends on the proba-
bility of being in a failure state and is independent of the specific path(s) taken to
reach the particular failure state. This is consistent with our results presented earlier:
the reliability model is very sensitive to changes in the values of the failure probabil-
ity matrix, while changes in the values of the transition probability matrix (ITP) do not
greatly impact the predicted reliability value.
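The last step can be sketched directly. In the fragment below (Python), the transition matrix is illustrative (it is not the controller component's model); the last two states play the role of F1 and F2, the steady-state vector V is obtained by solving VP = V with the entries summing to 1, and the reliability is 1 minus the steady-state mass in the failure states.

```python
# Sketch of the reliability formula above on a hypothetical 5-state chain
# (three operational states, two failure states with a recovery transition).
import numpy as np

P = np.array([
    # s0    s1    s2    F1    F2
    [0.00, 0.90, 0.03, 0.05, 0.02],
    [0.10, 0.00, 0.83, 0.05, 0.02],
    [0.93, 0.02, 0.00, 0.05, 0.00],
    [0.71, 0.00, 0.00, 0.29, 0.00],   # recovery from F1 with probability 0.71
    [0.00, 0.71, 0.00, 0.00, 0.29],   # recovery from F2 with probability 0.71
])

def steady_state(P):
    """Solve V P = V subject to sum(V) = 1 (least-squares formulation)."""
    n = P.shape[0]
    A = np.vstack([P.T - np.eye(n), np.ones(n)])
    b = np.concatenate([np.zeros(n), [1.0]])
    return np.linalg.lstsq(A, b, rcond=None)[0]

V = steady_state(P)
failure_states = [3, 4]
reliability = 1.0 - V[failure_states].sum()
print(V.round(4), round(reliability, 4))
```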
The second set of our uncertainty analyses takes this conclusion one step further. The
goal is to determine the impact of random generation of the component’s operational
profile. In other words, in cases when instantiating the transition probabilities by the
domain expert is challenging or impossible, how is the predicted reliability affected
by the random instantiation of the model?
Let us start with the SCRover’s controller component. Assuming no operational pro-
file data is available for the component, we generate a random set of training data to
predict the reliability. Given the failure and recovery probability values discussed in
Chapter 3, the component reliability using random data is predicted to be 0.9295.
Alternatively, we asked a domain expert to assign probability values for transition and
observation matrices for the controller component. The matrices instantiated by the
domain expert tend to be more sparse than the randomly generated matrices. This is
because the domain expert models the intended behavior of the component by expect-
ing certain behavior(s) at a given state. The matrix instantiation typically reflects this
expectation. The predicted reliability of the controller component based on the
expert’s knowledge is calculated as 0.9304. In the context of the analysis performed
on this model, the 0.09% difference is negligible. Such a determination, however, needs to be performed in the context of the specific system being analyzed. Repeated experiments confirmed the same results. As discussed previously, our reliability model is much more sensitive to the recovery probabilities and failure probabilities when estimating a component's reliability. In other words, instantiating the transition probabilities based on domain expertise or at random has comparatively little influence on the estimated reliability. In summary, our model can handle uncertainties associated with the operational profile remarkably well. The results of this experiment are shown in Figure 6-9.

Figure 6-9. Controller Component Reliability Based on Random and Expert Instantiation
We again performed similar experiments with three arbitrary components with 5, 10,
and 20 states, 10 interfaces, and 4 failure states. Since the components are arbitrary, a
good way to “simulate” the domain knowledge in terms of the initial probabilities is
to create a random sparse matrix. The results of the full matrix and sparse matrix
instantiations are shown in Figure 6-10. As depicted, the source of instantiation of
data had little influence on the estimated reliability. The results confirmed our hypoth-
esis originally discussed in the case of the controller component, and demonstrated
that the reliability model can innately accommodate an unknown operational profile.
Figure 6-10. Arbitrary Component’s Full (Left) and Sparse (Right) Random
Instantiation for Training Data Generation
6.2.2. Sensitivity Analysis
The traditional steady-state sensitivity analysis offered by Markov-based modeling
provides insights into critical elements of the model. This is done by understanding
and characterizing the relationship between parameters in the model that, together,
quantify the global query on the network (in our case the reliability). In the context of
component reliability estimation, such analyses offer insights into critical states as
well as critical paths within a single component. The critical states determine which
states have the most influence on the component reliability value. The critical paths
indicate specific paths of execution (i.e., specific series of invocations of the component's interfaces) which result in the highest reliability value for the component.
While this information may be relevant, its impact on the development process may
be too limited: states are abstract concepts that are typically not treated as first class
entities during implementation. Moreover, knowledge about paths leading to highest/
lowest reliability value, while theoretically interesting, offers little help in enhancing
the software development process. However, integrating such information with a
cost-framework can offer crucial help in improving the quality of the product under
development. Recall the tight integration between our defect classification and cost-
framework on the one hand, and our reliability model on the other hand. Leveraging
the cost-framework together with standard sensitivity analysis enables us to provide a
cost-effective approach in mitigating architectural defects.
To demonstrate the usefulness of the analysis enabled by our approach, we first need
to demonstrate that our model is sensitive to various types of architectural defects.
Recall that our component reliability model uses a cost-framework instantiated by the
domain expert (Section 2.6) to calculate the probability of recovery from failures. We
demonstrate the sensitivity of the model by studying the effect of changes to the esti-
mated reliability when cost values in the cost-framework change. Changes to the cost
values affect the recovery probabilities. Intuitively, as the cost of recovery increases,
the probability of recovery decreases.
Recall the cost-framework presented in Section 2.6. The recovery probabilities were
obtained by calculating the surface area under a Radar Chart constructed from differ-
ent cost values. The results are shown in Figure 6-4. In the case of the controller com-
ponent, based on the results of architectural analysis, only two types of failures
(signature failure and interaction protocol failure) were considered relevant. Figure 6-4 sets the probability of recovery from a signature and an interaction protocol failure
at 0.7975 and 0.7112 respectively. The component reliability was then estimated at
0.9303. Let us analyze the sensitivity of the model by providing a range of recovery
probabilities for each of these failure types. The results are shown in Figure 6-11. As
the probability of recovery for each type of failure increases from 0 to 1 (x-axis), the
component reliability increases from 0.0005 to 0.9414 (y-axis). The two curves
denote that the increase in component reliability in the two cases follows a similar
pattern. However, changes to the probability of recovery from a protocol type failure cause a greater increase in the component's reliability value. The results here seem to
indicate that the reliability model reacts to changes in its parameters in a predictable
and meaningful way: an increase in the probability of recovery from failures results in
an increase in the component’s reliability. Moreover, in the context of this specific
example, it can be concluded that in circumstances where recovery probability is
really low, it is more rewarding to ensure that the probability of recovery from a pro-
tocol type failure is improved. However, once the recovery probabilities are at about
50%, the difference in how much recovery from each failure type affects the component's reliability becomes relatively small.
Figure 6-11. Sensitivity Analysis for the Controller Component with Different Recovery Probabilities

The results of applying the same principle to an arbitrary component with 10 states, 10 interfaces, and 5 failure states are shown in Figure 6-12. The results show that changes to the probability of recovery from the usage failure type caused the least impact on
component reliability, while changes to the recovery probability for the protocol, signature, and incomplete types of failures followed a very similar pattern in affecting the component's reliability. Since this is an arbitrary component with random parameters,
it is difficult to justify the specific behavior of the model. However, the conclusion at
this point is that the reliability model reacts predictably to changes in the model’s
parameters.
We leverage this conclusion and apply it in the context of our final set of sensitivity
analysis experiments. By tightly integrating a cost-framework to our reliability
model, we are able to provide an analysis of the most cost-effective approach to
defect mitigation.

Figure 6-12. Changes to the Probability of Recovery from Various Failure Types for an Arbitrary Component

The results in the case of the controller component are shown in Figure 6-13. The first set of numbers (labeled 1) depicts the original reliability estima-
tion. In the second and third experiments (labeled 2 and 3, respectively), we improved
the failure recovery probability for the two types of failures to 1. The results suggest
that an increase to the probability of recovery from the signature type failure has the
greatest effect on the overall reliability. In other words, in a decision making situation
during development, where the resources must be used by prioritizing tasks, we can
use our analysis to determine which tasks are more critical. In this case, mitigating the root cause of the signature type failures, and eliminating the associated defects, has the most immediate influence on the controller component's reliability. An important
observation in this example is that changes to the component reliability values are
quite small.

Figure 6-13. Controller Component Reliability w.r.t. Different Failure Recovery Probabilities

The question thus arises as to whether or not the architect can base his/her design decisions on these changes. The reliability values calculated here are
tied to the model of component’s behavior, as well as the recovery probabilities
assigned by the cost framework. In order to determine if a change is significant, we
need to first understand how the full range of recovery probabilities affect the compo-
nent’s reliability. Given the range of possible values, the architect must decide
whether a particular change is significant in the context of the specific component
under analysis. For example, in the case of the Controller component, Figure 6-14
depicts changes to the component reliability (y-axis) as the recovery probability of the
two failure types changes from 0.1 to 1 (x-axis). As a parameter’s value changes, the
second parameter is kept constant at 0.1. It is clear that changes to the Signature
Recovery Probability in general cause a greater range of change to the component
reliability (from about 62% to 94%), while changes to the Protocol Recovery Proba-
bility cause a much smaller change (62% to 66%) in the component reliability. As
depicted here, changes to the component reliability value are quite considerable,
because of the low recovery probability values initially assigned to the two parame-
ters (0.1). Given these results, the architect must then decide whether a small change
in the component reliability value based on (small) changes to the recovery probabil-
ity is significant given the specific software component.
To generalize this type of analysis, we performed similar experiments and obtained
results from an arbitrary component with 10 states, 5 failure states, and 10 interfaces.
We performed a five-part experiment in which we varied the probability of recovery
for each failure type. In the initial configuration, we assumed recovery probability
values for the arbitrary configuration. In each part of the experiment, we increased the
probability of recovery for a failure type to 1, while keeping the other recovery prob-
abilities at their initial values. The results are depicted in Figure 6-15 and suggest that
in this component, ensuring that we can recover from the pre/post condition type of
failure has the biggest impact on component’s reliability. Once again, a full range of
analysis based on the recovery probability values can put these results in perspective,
and help the architect to determine the significance of the results.
Figure 6-14. Controller Component Reliability w.r.t. A Full Range of Recovery
Probability Values
The process of elimination of a particular failure type in a component is two-fold: (1)
ensuring that a failure does not occur (probability of failure of zero), and (2) making
certain that in the case of failure the component is able to recover from it (probability
of recovery of one). In other words, both failure and recovery probabilities are critical
in component’s reliability estimation. Consequently, to complete our experiment, we
represent total elimination of a defect and subsequent failure by assigning the proba-
bility of failure to 0 and probability of recovery to 1 for each type of failure. Doing so
for the two types of defects in the controller component resulted in Figure 6-16. The diagram shows that total elimination of failure of type signature mismatch has the greatest effect on the component's reliability.

This type of analysis can be used as a decision tool during the architecture and design phase in allocating resources for defect mitigation.

Figure 6-15. Sensitivity Analysis for an Arbitrary 10-state Component

Figure 6-16. Effect of Total Elimination of Failures
6.2.3. Complexity and Scalability Analysis
The complexity of the Baum-Welch algorithm when transitions are labeled is determined to be O(N^2 × M × T) [11], where N is the number of states, M is the number of events, and T is the length of the training data generated by the model. Our adaptation of this algorithm for the AHMM changes the algorithm complexity to O(N^2 × M × K × T), where K represents the number of actions associated
with the events. In other words, the algorithm is proportional to the complexity of the
component, which matches our intuition. Even though the numbers of events and
actions (M and K) in our model are pre-determined by the number of component’s
interfaces, the number of states N may be reduced by applying the concept of hierar-
chy to the model: a complex model may be abstracted to provide a higher-level view
of the component, by applying the principle of hierarchical modeling. The result
would be a reduction in the complexity of the algorithm.
6.3 System Reliability Prediction
We evaluate our system-level reliability model in terms of sensitivity analyses aimed
at demonstrating that architecture-level reliability modeling of software systems is
both possible and meaningful. Using a set of case studies, we demonstrate the above
along the following three dimensions. First, we demonstrate the results of a series of
sensitivity analyses that show the impact of changes to model parameters on the pre-
dicted reliability values. We then demonstrate how the model can be used to identify
the critical components in a system, as well as their critical defects whose mitigation
provides a cost-effective approach to enhancing the reliability of the system under
development. Finally, where appropriate, we demonstrate the effect of specific system
configurations on its reliability. The results can be helpful to the architect in making
architectural changes in the system that may help improve its reliability. In the rest of
this chapter, we describe our evaluation in the context of two case studies. Our experi-
ence with several other case studies and synthesized models confirms the conclu-
sions presented here.
Before discussing the details of our evaluation, a brief discussion on the probabilistic
instantiation of the model is necessary. The Serial, Parallel, or other customized con-
figurations specified for each node of the Bayesian Network (recall Chapter 4) directly affect the predicted reliability values. Since the specific probabilistic relation is highly application dependent, to avoid unnecessary complexity we use a simple
serial configuration for all nodes. This implies that no redundancy is exercised (unless
explicitly stated otherwise), and that the reliability of each node is equally influenced
by the reliability of all of its parents. While this assumption at the level of internal
nodes in the Bayesian model may be reasonable, at the final stage when the overall
system reliability is calculated, special care is needed. Specifically, the aggregation
formula that is used to calculate the cumulative impact of individual failures and
obtain system’s reliability can greatly impact the results. For example, treating all the
failures similarly (by using a serial configuration assumption) to predict the system
reliability in a Client-Server system, can lead to conclusions that may not be
explained intuitively: a client may have a similar or even greater impact on the system
reliability. Using a more sophisticated formula in this case, by either assigning
weights to failures from different components, or considering a parallel relationship
between the failures may be more reasonable. Although we will come back to this
issue in the context of a specific example later in this chapter, addressing issues asso-
ciated with architectural styles and patterns of interactions and their impact on sys-
tem-level reliability is beyond the scope of this thesis. Furthermore, a detailed
discussion on the ramifications of various reliability relationships is beyond the scope
of our work, and can be found in [124].
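For reference, the two standard aggregation assumptions mentioned above can be written down directly; the parent reliabilities used in the sketch are placeholders.

```python
# Serial: a node requires all of its parents.  Parallel (redundant): any single parent suffices.
def serial(reliabilities):
    out = 1.0
    for r in reliabilities:
        out *= r                      # all parents must succeed
    return out

def parallel(reliabilities):
    out = 1.0
    for r in reliabilities:
        out *= (1.0 - r)              # the node fails only if every parent fails
    return 1.0 - out

parents = [0.93, 0.96, 0.99]
print(serial(parents), parallel(parents))    # 0.8838... versus 0.99997...
```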
6.3.1. Case Study 1: The SCRover System
In Chapter 4, we demonstrated the steps involved in reliability modeling of the SCRover system. In this section we provide sensitivity analysis on the predicted reliability value.

For the discussion here, recall the SCRover's Bayesian Network depicted in Figure 6-17 (originally presented in Chapter 4). The system reliability was predicted at 0.9826,
assuming an initial reliability of 0.93, 0.96, and 0.99 for the controller, estimator, and
actuator components, respectively. The components’ initial reliability values were
obtained from our component-level reliability model discussed and evaluated earlier in this dissertation.

Figure 6-17. SCRover's Bayesian Network
Sensitivity to component-level reliability predictions. In order to analyze the sensi-
tivity of the model to different initial component reliability values, we repeated the
prediction process for a range of initial components’ reliabilities.
Recall that a complete reliability model for the SCRover is in the form of a Dynamic
Bayesian Network (DBN). Using the DBN methodology, the reliability of the system
is predicted as a function of time. To show the effect of time on the predicted system
reliability, we initially performed the prediction process for two consecutive time
steps (t=0, and t=1). Figure 6-18 demonstrates the impact of each component’s reli-
ability on the system’s reliability. As shown, changes to the reliability of the control-
ler component greatly influence the reliability of the system, while changes in the
reliabilities of estimator and actuator components have less impact on the overall sys-
tem reliability. This phenomenon is especially prominent at t=0. In fact, the impact of the reliabilities of the estimator and actuator components at t=0 on the
system reliability is negligible. However, as time goes by, the results change. More
specifically, the role of the reliabilities of the estimator and actuator components
becomes more significant at t=1. This is due to the Dynamic Bayesian Networks, and
the associated delay links that model the behavior of the system in subsequent time
intervals. In the context of reliability prediction, the delay links act as a feedback
mechanism in the system. They incorporate reliability of various nodes at a given time step t_i into the estimation of the reliability value at the following time t_{i+1}. Given the Serial relationship assumption among nodes and their parents in the system, as the number of parents to a node increases, its reliability (product of parents' reliabilities in the case of Serial relationship) decreases.

Figure 6-18. Changes to the Reliability of the SCRover System at Times t=0 and t=1
Before providing additional discussion and insights about this concept, let us expand
the SCRover’s Dynamic Bayesian Network for a few more time steps, in order to pro-
vide a better view of the system during operation. Figure 6-19 shows the predicted
system reliability in the first 5 time steps. A first glance reveals that as time passes,
system reliability decreases significantly. This is only partially correct. The reliability
of each node (and thus the entire system) at time t_i depends on the reliability of the system (in terms of the reliability of its nodes) at previous time steps t_1, t_2, ..., t_{i-1}. If
the reliability prediction process is performed without considering the knowledge
about system’s operation as time passes, the above calculation is accurate. However,
oftentimes we could infer that if no failure has occurred at time t_n, the probabilities of all types of failures at this time step could be reset to zero. The fact that no failure at a particular time step has occurred is known as evidence. One of the properties of Bayesian Networks is the ability to make new inferences based on newly obtained evidence as time passes. In the case of our Dynamic Bayesian Network, this new evidence is the reliability of the system (in terms of the reliabilities of its various nodes) at previous time steps. In other words, as time goes by, if we know that a particular type of failure did not occur at a given time step t_i, the model can be updated to include this new evidence when the reliability at time t_{i+1} is being estimated. The inference results will then be updated based on the newly available evidence, and thus the system reliability prediction is updated accordingly.
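The effect of such evidence can be sketched on a plain Markov rollout (the full model is a Dynamic Bayesian Network, so this is only an approximation of the idea): without evidence, failure mass accumulates over time, while conditioning on "no failure so far" removes that mass and renormalizes before predicting the next step. The transition matrix is illustrative only.

```python
# Sketch of evidence-based updating: compare the unconditioned failure-state mass with the
# one-step-ahead prediction after conditioning on "no failure observed so far".
import numpy as np

P = np.array([[0.0, 0.9, 0.1],     # two operational states and one absorbing failure state
              [0.8, 0.1, 0.1],
              [0.0, 0.0, 1.0]])
failure = [2]

dist = np.array([1.0, 0.0, 0.0])
for t in range(1, 6):
    dist = dist @ P
    plain = 1.0 - dist[failure].sum()          # reliability with no evidence

    conditioned = dist.copy()                  # evidence: no failure observed up to time t
    conditioned[failure] = 0.0
    conditioned /= conditioned.sum()
    next_step = conditioned @ P
    updated = 1.0 - next_step[failure].sum()   # next-step reliability, given the evidence

    print(f"t={t}: unconditioned={plain:.4f}, next step given no failure so far={updated:.4f}")
```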
Figure 6-19. SCRover’s Reliability over Time based on its Dynamic Bayesian Network
0
0.2
0.4
0.6
0.8
1
1.2
t0 t1 t2 t3 t4 t5
Time
System Reliability
Reliability prediction over time
180
Figure 6-20 demonstrates the result of the system’s reliability prediction assuming the
new evidence obtained at each time step. Each curve demonstrates the system reli-
ability at times t=0 through t=5, given the known evidence about lack of failures in
the previous time steps.
The results demonstrate that the model reacts meaningfully to changes in its parame-
ters. An increase in component reliabilities results in an increase in system reliability,
and vice versa. Moreover, as time passes, the system reliability decreases, unless pre-
vious evidence is incorporated into the estimation.
Figure 6-20. Updated Prediction of Reliability over Time based on New Evidence
The next experiment leverages the sensitivity of the model to components' reliabili-
ties, and shows how the architect can use it to identify critical components in the sys-
tem.
In Figure 6-21, we depict the effect of changes to individual component reliabilities
(depicted via the bars) on the system’s reliability (depicted via the line). The x-axis
depicts the four experiments performed, and the y-axis shows the components’ and
system’s reliabilities associated with each experiment. The first experiment (labeled
1) puts the system’s reliability at about 10% given individual components’ reliabili-
ties of 0.5. In the subsequent experiments, we increase the reliability of each compo-
nent to 0.9 to study its effect on system’s reliability. As shown in this case, improving
182
the reliability of the controller component (experiment 2) has the biggest impact on
system’s reliability. The reason for this phenomenon lies at the heart of the DBNs
methodology.
Using Dynamic Bayesian Network, reliability is predicted in terms of the probability
of not getting into failure states during the operation: the sooner a system fails, the
lower the system reliability. If a component initiating the interaction with another
component fails immediately after initiating the interaction, the system as a whole
may reach a failure state faster than if multiple steps have passed before a failure
occurs. Subsequently, failures in components that initiate interactions may have a
greater impact on the overall system reliability. While this is relatively intuitive to observe and understand in the case of the SCRover system, as the complexity of interactions in larger systems increases and the DBN consequently gets expanded over longer time periods, it becomes harder to analyze the impact of components without using an automated analysis process. As an example, consider Figure 6-22, where the
reliability prediction of the SCRover based on changing components’ reliabilities is
depicted for the second time step. While changes to the reliability of the controller
component result in the biggest overall increase in the system's reliability, the
impacts of the reliabilities of the other two components follow a similar trend until a
certain point. Particularly, once the components’ reliabilities are at 0.9, improving the
reliability of estimator or actuator components to 100% results in reversing the domi-
nant trend before this point: a change in the estimator component reliability results in
a greater change to system reliability than does a change in the actuator component’s
reliability.
Sensitivity to the reliability of system’s startup process. To continue our evalua-
tion of the reliability model’s sensitivity to changes in its parameters, we performed
some experiments to analyze the effect of the system’s startup process and its reliabil-
ity (represented via the init node) on the system. Recall that the value of the node is to
be supplied by the architect. This node is specifically designated to model the uncer-
tainties associated with the system’s startup process. In all of the calculations so far,
the reliability value at this node was set at 0.999, which essentially represents a highly
reliable startup process.
Figure 6-22. Changes to System Reliability Based on Different Component Reliability Values at Time Step t=1
As discussed in Chapter 4, there is no one-size-fits-all technique for determining this
value for a system. While it is possible to eliminate the use of this parameter in the
reliability prediction altogether (by setting it to one), we believe it is a useful means
for including a variety of factors that contribute to the system’s reliability. In general,
the circumstances that affect this parameter may relate to the software development
process adopted, and thus are beyond the scope of the architectural models. For exam-
ple, specific development processes or component integration strategies have an
impact on this value. If component integration has been performed iteratively
throughout the development, there is a greater confidence in a successful final inte-
gration, and the system’s startup. On the other hand, if COTS components are used, if
the development process has followed more of a waterfall approach, or if components
are developed independently by different development teams, then it is reasonable to
anticipate more problems during the final integration and the system startup. In any
case, the value for this parameter is ultimately subjective, and techniques to obtain the
value more objectively are beyond the scope of our work. Use of risk management
frameworks [75] could be a reasonable approach in determining this value.
The diagram in Figure 6-23 depicts the sensitivity of the SCRover model to changes
in the reliability of the startup process. Since the init node serves as the super parent
to all components’ initial nodes, its value has a very strong impact on system’s reli-
ability. This is consistent with our intuition that as the reliability of the system’s star-
tup process decreases, the system's ability to perform its operations successfully decreases (regardless of the reliabilities of individual components).

Figure 6-23. The Effect of Changes to System's Reliability as the Reliability of the Startup Process Changes
Sensitivity to component-level failures. The purpose of this set of experiments is to
determine the sensitivity of the model to the failure probabilities. The failure proba-
bilities are estimated using Bayesian Inference given the individual components’ reli-
abilities, and their interactions. However, as discussed earlier, the inference can be
updated using evidence that may be available. We can use this principle to speculate
on the impact of each failure on the overall reliability. To do this, we can represent
elimination of a particular failure by assigning its probability of occurrence to zero,
and observing the effect on the system’s reliability. This type of analysis for instance
could help us decide whether elimination of the Signature failure in the estimator
component is more critical than elimination of Protocol failure in the controller com-
ponent.
The results of performing this type of analysis on SCRover are depicted in Figure 6-24. The first column shows the original reliability estimation given a set of parame-
ters. Without changing those parameters, the prediction is repeated, with the probabil-
ities of various instances of failures in different components changed to zero. This
effectively is equivalent to repeating the inference process assuming no such failure
has occurred. The next four columns correspond to lack of Pre/Post condition failures, Protocol failure, and Signature failure in the controller component, and the Signature failure in the estimator component, respectively. It can be seen that ensuring that the estimator's signature failure does not occur has the largest impact on the system's reliability, improving it from the 90% original prediction to 97.7%. This type of analysis can help the architect devise cost-effective defect mitigation strategies, by prioritizing failures by their impact on the system's reliability.

Figure 6-24. Effect of Elimination of Particular Failures on SCRover System's Reliability
In summary, in this subsection, we analyzed our system-level reliability prediction
approach in the context of the SCRover system via a set of sensitivity analyses. We
demonstrated that the reliability prediction process is meaningful and that useful informa-
tion may be obtained from our analysis. In the next subsection, we continue the eval-
uation using a different case study.
6.3.2. Case Study 2: The OODT System
Our second case study is based on NASA’s Object Oriented Data Technology
(OODT) [87]. OODT is a methodology, a middleware, and a software architecture for
development of distributed data-intensive systems. The middleware offers access to
geographically distributed and heterogeneous data sources, by concealing the details
of mediation at each data source, and offering an extensible and flexible data sharing
and transporting methodology. An OODT-based system consists of a set of Clients,
one or more ProfileHandlers, and a set of ProfileServers. A high-level architecture of
such a system is depicted in Figure 6-25. In the OODT methodology, a Client compo-
nent requests a set of services that may be provided by different ProfileServers. The
Client is oblivious to the number, type, and location of these servers. A
ProfileHandler component acts as a mediator, and routes requests and responses between Clients and Servers.

Figure 6-25. OODT's High Level Architecture
The Global Behavioral Model of one possible (very high-level) instantiation of this
methodology is depicted in Figure 6-26 (top). To build a corresponding Bayesian Net-
work, we used our methodology described in Chapter 4 to construct the qualitative part of the network. Probabilistic formulas were then assigned at each node of the BN
to represent the relationship between the reliability at various nodes. In the rest of this
section, we evaluate our system-level reliability approach in the context of the OODT system.

Figure 6-26. OODT's Global Behavioral Model (top) and Corresponding BN (bottom)
Let us assume that in an adaptation of the OODT system, an application is designed to
provide a single point of access (via a web page) to multiple databases maintained by
Figure 6-25. OODT’s High Level Architecture
Profile Handler
Client 1 Client 2 Client m
Profile Server
1
Profile Server
2
Profile Server
n
189
Profile Handler
PH 1.S 1 PH 2.S 2
PH 1.S 3
PH
1.
S
4
PH 1.S 5
Send
SendServer1
SendServer2
Results/
Return
Results/
Return
Profile Server 2
PS 2.S 1 PS 2.S 2
Return
SendServer2/Results
Profile Server 1
PS
1.
S
1
PS
1.
S
2
Return
SendServer1/Results
Client
C
1.
S
1
C
1.
S
2
Return
Query/Send
Figure 6-26. OODT’s Global Behavioral Model (top) and Corresponding BN (bottom)
PS2_S1
PS2_S2
F_PS2_Protocol F_PS2_PrePost F_PS2_Interface
F_C1_PrePost
R_PH1
PH1_S4
PH1_S5
PH1_S3
PH1_S2
F_PH1_Interface F_PH1_Protocol
init
R_PS1
R_PS2
PS1_S2
PS1_S1
PH1_S1
C2_S2
C2_S1
R_C1
C1_S1
C1_S2
F_C1_PrePost1
F_PS1_PrePost F_PS1_Protocol F_PS1_Interface
F_PH1_PrePost
190
different NASA centers. Each database contains mission information specific to the
NASA center. Specifically, let us assume that two independent ProfileServers serving
two independent datasets are designed. ProfileServer1 is used to access spacecraft
identification numbers as assigned by NASA, for those missions under the supervi-
sion of the Jet Propulsion Laboratory (JPL). ProfileServer2 is used to access space-
craft identification numbers assigned by NASA’s Goddard Space Flight Center
(GSFC) for the mission under their authority. The former is physically deployed on a
set of servers in California, while the latter is physically located in Maryland.
Once a query is issued by a Client (from anywhere in the world), the ProfileHandler
component relays the query to the appropriate server, and the server sends a response.
The Client in this case is unaware of the specific server that has served the request. In
this scenario, since the two servers access independent and non-identical data sources,
it is crucial for both servers to operate reliably, in order for the system to operate reli-
ably. In other words, the reliability of the system depends on the reliable operation of
all of its components, including the two servers. This is in contrast to cases where one
server may be a backup of the other server, in which case reliable operation of at least
one of the servers is sufficient for the reliable operation of the system. The failure
aggregation formula incorporates the failure probability values obtained from the
Bayesian inference, and calculates the corresponding system reliability value, given
the specific scenarios discussed above. In our case study, we first focus on the case
where the two servers correspond to two independent and different datasets. Later in
this section, we demonstrate the result of modeling a different scenario, where the two
servers are considered to act as back-ups for identical datasets.
For our analysis, let us assume that there are two faulty services in this system: the
Return interface between the Client component and the ProfileServer and ProfileHan-
dler components has a Pre/Post-condition defect, and the Results interface in the Pro-
fileHandler component has both a Protocol and a Signature defect with the
corresponding Results interface in the ProfileServer components.
Sensitivity to component-level reliability predictions. In the first set of experi-
ments we studied the effect of changes to components’ initial reliabilities on the sys-
tem reliability. Since the two ProfileServer components are effectively identical in
their functionality (although serving different datasets), one would expect that their
reliability should have a similar impact on the overall system reliability. We varied the
initial components’ reliabilities to study their impact on the overall system reliability
and the result is depicted in Figure 6-27. As expected, changes to the reliability of the
two ProfileServers show very similar trends in the changes to the system's reliability. Calculations shown in Figure 6-27 are performed on the initial time step in the system's operation (t=0). Extending this experiment to the subsequent time interval (t=0 and t=1) demonstrates the same trend, as shown in Figure 6-28. Moreover, it can
be seen that similar to the SCRover experiment, as time goes by, the overall system
reliability decreases (unless new evidence is incorporated into the reliability calcula-
tion and new inference is made). It can also be seen that changes in the ProfileHandler
and Client components result in a greater range of predicted reliability values for the
system. For example, changes to the reliability of the Client component result in esti-
mated system reliability values ranging from 18% to 91%, while changes to the reli-
ability of the ProfileServer components result in system reliability variations between
52% and 91%. The reasoning behind this observation is as follows.
Figure 6-27. OODT Model's Sensitivity to Different Initial Component Reliabilities

Figure 6-28. Changes in the OODT System's Reliability as Components' Reliabilities Change

Recall that a reliable system is one in which a long series of component interactions occurs without a failure interrupting the chain. If a series of interface invocations represents interactions among components, a failure occurring earlier during invocation
has a greater impact on system reliability than a failure that occurs later in the
sequence of invocations. This explains why a component such as the Client has a
greater impact on the predicted reliability.
Figure 6-29. Effect of Changes to Components Reliabilities on System Reliability

The result of expanding the Dynamic Bayesian Network for the OODT system over a period of three time steps is shown in Figure 6-30. Similar to the results obtained from the SCRover experiment, as time passes, the system's reliability decreases. By assuming that a failure has not materialized as time progresses, we are able to update the prediction over time and offer a more accurate analysis of the system reliability over time. This is done by incorporating new evidence in the Bayesian Network's inference process. The middle and top curves in Figure 6-30 demonstrate the updated knowledge at each time step t_i, after the evidence (in this case, lack of failure) at t_{i-1} is updated. As demonstrated, the model reacts predictably to changes in its parameters.

Figure 6-30. Reliability Prediction of the OODT System Over Three Time Periods
Sensitivity to the reliability of system’s startup process. As mentioned before in
the context of the SCRover system, we model the reliability of the system startup pro-
cess in order to incorporate the uncertainties associated with this process. Figure 6-31
demonstrates the results of sensitivity analysis of the OODT system by varying the
reliability of the startup process. Intuitively, the system reliability has a direct rela-
tionship with the reliability of the startup process, and the results confirm this intu-
ition.
Figure 6-31. Changes to the OODT's Reliability based on Different Startup Process Reliability
Sensitivity to the probability of failure. The purpose of this set of experiments is to determine the sensitivity of the model to the estimated failure probabilities. The failure probabilities are obtained using Bayesian Inference, given the individual components' reliabilities and their interactions. One interesting and useful analysis on a system under development is determining the impact of components' failures on
the overall system reliability. We represent elimination of a particular failure by
assigning its probability of occurrence to zero, and observing its effect on the sys-
tem’s reliability, as the inference is updated using the new evidence.
The results of performing this type of analysis on OODT are depicted in Figure 6-32.
The x-axis shows the various instances of failures in the four components, while the
Figure 6-31. Changes to the OODT’s Reliability based on Different Startup Process Reliability
0
0.2
0.4
0.6
0.8
1
1.2
1 0.9 0.8 0.7 0.6 0.5 0.4 0.3 0.2 0.1 0
Reliability of the Startup Process
System Reliability
Reliability at t=0
197
y-axis represents the system reliability. In each category of data (each component),
the first bar represents the original reliability prediction given the parameters. All
parameters remained unchanged, but in each experiment, the probability of a specific failure in a component was manually set to zero. For example, eliminating the Protocol and Pre/Post Condition defects in the ProfileServers has the most influence on the system's reliability. It is important to note that the results presented here show very little change in the predicted system reliability; this is a side effect of the model and the interactions among its components. In other examples (including that of the SCRover), the change in the reliability value was more significant. It is very difficult to define a generic threshold above which changes in the reliability values are considered statistically significant. However, this is something that could be decided on an application-specific basis, given the results obtained from various reliability analyses. Consequently, this decision is left to the domain expert.
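For illustration, the sketch below performs this kind of what-if analysis on a deliberately simplified model: the failure probabilities are hypothetical, and the components are composed serially and independently rather than through the conditional probability tables of the actual Bayesian Network.

```python
# Illustrative what-if analysis: zero out one failure class in one component and
# recompute a simple serial system reliability. All probabilities are hypothetical.

failure_probs = {
    "ProfileHandler": {"protocol": 0.010, "signature": 0.004, "pre_post": 0.006},
    "Client":         {"protocol": 0.008, "signature": 0.003, "pre_post": 0.005},
    "ProfileServer1": {"protocol": 0.012, "signature": 0.002, "pre_post": 0.009},
    "ProfileServer2": {"protocol": 0.012, "signature": 0.002, "pre_post": 0.009},
}

def system_reliability(probs):
    """Serial composition: the system survives only if every component avoids
    every modeled failure class (independence assumed for illustration)."""
    rel = 1.0
    for comp in probs.values():
        for p in comp.values():
            rel *= (1.0 - p)
    return rel

baseline = system_reliability(failure_probs)

# Eliminate the Protocol failure in ProfileServer1 and observe the effect.
modified = {c: dict(f) for c, f in failure_probs.items()}
modified["ProfileServer1"]["protocol"] = 0.0

print(f"baseline: {baseline:.4f}, after eliminating the defect: "
      f"{system_reliability(modified):.4f}")
```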
Figure 6-32. Eliminating the Probability of Different Failures and Their Impact on System Reliability (for each of the ProfileHandler, Client, ProfileServer 1, and ProfileServer 2 components, the original predicted system reliability is compared with the predictions after eliminating Protocol, Signature, and Pre/Post Condition failures)
Analysis of components' roles on system reliability. Recall that the results of the analysis of the OODT system so far assume a serial relationship among all the nodes in the system. Furthermore, the system reliability is calculated such that all components' failures are incorporated in the reliability prediction formula in the same way. In this experiment, we modify these assumptions and model redundancy in the system. Specifically, the reliability prediction formula is modified to consider the two ProfileServer components as back-ups for one another. In this setting, the reliability of the system depends on the successful operation of at least one server component. In other words, the reliability is estimated by incorporating the reliability value of the more reliable of the two ProfileServers. The top diagram in Figure 6-33 demonstrates the system reliability (depicted via a line) as the reliability of ProfileServer 1 increases. The diagram at the bottom shows that, while increasing the reliability of ProfileServer 1, the system reliability remains unchanged if the reliability of ProfileServer 2 decreases. This is because the system reliability formula only incorporates the most reliable server in its calculation. Other scenarios for this system could formulate the system reliability such that the impact of the server components' unreliability is significantly greater than that of the client components.
This experiment demonstrates that our model is flexible, and allows changes to the
roles of individual components and the impact of their reliabilities on the system reli-
ability.
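For illustration, a minimal sketch of the modified prediction formula is shown below. It assumes, for simplicity, that the remaining components compose serially and that only the more reliable of the two servers contributes to the system value; all reliability numbers are hypothetical.

```python
# Illustrative sketch of the redundancy scenario: the system needs only the more
# reliable of the two ProfileServers, while the other components compose serially.
# All reliability values are hypothetical.

def system_reliability(r_handler, r_client, r_server1, r_server2):
    # Only the best-performing server is incorporated, mirroring the
    # "back-up for one another" formulation described in the text.
    r_servers = max(r_server1, r_server2)
    return r_handler * r_client * r_servers

print(system_reliability(0.98, 0.99, 0.90, 0.95))  # system tracks the better server
print(system_reliability(0.98, 0.99, 0.97, 0.60))  # degrading server 2 has no effect
```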
Analysis of the impact of system’s configuration on its reliability. As previously
discussed, we envision that our reliability modeling approach may be used to analyze
the effect of changes to the system's structure on the overall system's reliability. While a structural change is simply the addition or removal of components, its impact on the interactions among the components in the system, on the global behavior of the system, and thus on the system's reliability, goes beyond the structural change itself.

Figure 6-33. Modeling Redundancy in OODT (reliability of Profile Server 1, Profile Server 2, and the overall system across the experiments; top: system reliability rises as the reliability of ProfileServer 1 increases; bottom: system reliability remains unchanged when the reliability of ProfileServer 2 decreases)

For example, in the case of the OODT system, the ProfileHandler com-
ponent seems to act as a bottleneck for the system. As the number of clients increases,
the load on the ProfileHandler component increases, as it is required to interact with
more components than previously. Intuitively, this would have an adverse effect on
the system’s reliability. One possible solution is to instantiate additional ProfileHan-
dlers in the system to balance the load, and eliminate the single point of failure in the
system.
We demonstrate the results of such structural changes on the OODT system and its
impact on the system's reliability in Figure 6-34. The x-axis represents the various configurations, while the y- and z-axes represent time and system reliability, respectively.

Figure 6-34. Impact of Different Configurations on OODT System Reliability (system reliability at t=0, t=1, and t=2 for four configurations: 1 Client with 1 ProfileHandler, 3 Clients with 1 ProfileHandler, 5 Clients with 1 ProfileHandler, and 5 Clients with 2 ProfileHandlers)

In
every time step, by increasing the number of clients in the system, the reliability
decreases gradually. However, once a new instance of the ProfileHandler component
is added to the system, the reliability is improved. The second ProfileHandler compo-
nent is set up such that it is responsible for handling communication to and from the
fourth and fifth client components. The return of the system reliability value to approximately that of the system with 3 Clients and a single ProfileHandler component can thus be explained by the load balancing described above.
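For illustration, the sketch below explores such configuration changes with a deliberately simplified load model: a handler's per-interaction reliability is assumed to degrade with the number of clients it serves, and clients are split evenly across handlers. The degradation rule and all numbers are hypothetical and are not the DBN used in our approach.

```python
# Illustrative sketch of exploring configuration changes. Each client-handler
# interaction contributes a reliability factor, and a handler's per-interaction
# reliability is assumed to degrade with load. All numbers are hypothetical.

def handler_reliability(num_clients_served, base=0.995, penalty=0.004):
    """Per-interaction reliability of one ProfileHandler under load."""
    return max(0.0, base - penalty * (num_clients_served - 1))

def config_reliability(num_clients, num_handlers, r_client=0.99):
    # Split the clients as evenly as possible across the handlers.
    per_handler = [num_clients // num_handlers] * num_handlers
    for i in range(num_clients % num_handlers):
        per_handler[i] += 1
    rel = 1.0
    for n in per_handler:
        rel *= handler_reliability(n) ** n   # one interaction per served client
    return rel * (r_client ** num_clients)

for clients, handlers in [(1, 1), (3, 1), (5, 1), (5, 2)]:
    print(f"{clients} clients, {handlers} handler(s): "
          f"{config_reliability(clients, handlers):.3f}")
```

Even in this toy model, adding clients lowers the predicted value, while introducing a second handler recovers part of the loss, qualitatively matching the behavior shown in Figure 6-34.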
Similar to the results obtained from SCRover, the OODT results confirm that our sys-
tem-level reliability modeling approach produces meaningful and useful results. Our
experience with a set of synthesized models, as well as with other case studies, confirms the
conclusions presented here. In the next two subsections we first offer an overview of
the complexity and scalability of our approach. The uncertainties associated with the
system reliability prediction during the architecture phase are discussed last.
6.3.3. Complexity and Scalability
In general, the problem of Exact Inference in Bayesian Networks and Dynamic Baye-
sian Networks is NP-hard [21]. Efficient average case and approximation algorithms
have thus been developed to tackle the complexity problem [21,91]. In our approach,
Bayesian Networks are only used in a predictive context. That is, given the probabilis-
tic relations among the nodes (assigned by the domain expert), we predict the proba-
bility of certain events (failures) at a future time. The complexity of our DBN is thus
a function of the number of nodes in the system, as well as the time interval over
which the reliability analysis is performed. We now discuss each of these factors and
their influence on our reliability modeling.
In this chapter, we discussed how evidence about a system's reliability at time t_i can be used to provide a better prediction of its reliability at times t_{i+1}, t_{i+2}, .... Following
this approach, the complexity of the DBN can be reduced in the subsequent time steps
by incorporating the results from previous time steps. The number of nodes thus can
be controlled systematically to disallow the model to grow arbitrarily complex. For
example, in the context of the OODT example, and its DBN expansion over 3 time
steps, the total number of nodes in the model is 69. However, if it is already known
that no failures at t=0 and t=1 have occurred, the number of nodes in the model are
effectively reduced to 21. When modeling large and complex systems over an
extended period of time, it may be more effective to do so by reducing the complexity
of the model and performing reliability analysis in certain time intervals.
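For illustration, the following back-of-the-envelope sketch shows how observed evidence curbs the growth of the unrolled model. The per-slice node count and the pruning rule are hypothetical simplifications; in the actual OODT model the reduction from 69 to 21 nodes follows from the structure of the network itself.

```python
# Illustrative sketch of why evidence keeps the unrolled DBN manageable. The
# per-slice node count and the pruning rule below are hypothetical; they do not
# reproduce the exact OODT figures, which depend on the network structure.

def unrolled_size(nodes_per_slice, time_steps):
    """Nodes in a DBN naively unrolled over the analysis horizon."""
    return nodes_per_slice * time_steps

def pruned_size(nodes_per_slice, time_steps, slices_with_evidence):
    """Slices whose failure outcome is already observed are summarized rather
    than kept in full (an illustrative assumption)."""
    remaining = time_steps - slices_with_evidence
    return nodes_per_slice * remaining + slices_with_evidence

print(unrolled_size(23, 3))        # grows linearly with the analysis horizon
print(pruned_size(23, 3, 2))       # evidence at t=0 and t=1 collapses two slices
```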
On the other hand, the number of nodes in the DBN depends in turn on the number of states in the Interaction Protocol Models of the components that comprise the system. Unlike the Dynamic Behavioral Models used for component-level reliability modeling, the complexity of the Interaction Protocol Models is bounded by the number of externally visible interfaces of each component. The principles of component-based software engineering, and of encapsulation in object-oriented design, typically prevent a component from having an arbitrarily large number of interfaces. Consequently, following the best practices of software design should directly help in the creation of models with reasonable numbers of externally visible interfaces. In turn, this will curb the complexity of the models.
6.3.4. System Reliability Modeling and Handling Uncertainties
Modeling the reliability of a software system in a compositional manner early in the software development process, when the implementation is not available and the operational profile is unknown, requires dealing with various sources of uncertainty. Accommodating these uncertainties results in a more realistic prediction of the reliability of the architecture. Below we enumerate some of these sources of uncertainty, and describe the ways our approach handles them.
Uncertainty of Component Reliability Values. Existing approaches to component reliability estimation typically estimate reliability in isolation. A component, whether a third-party component, an OTS component, or an in-house component, is typically designed, built, and tested either in isolation or in an environment that may not be typical of its intended use. Consequently, the reliability values associated with a component may not be accurate if the component is used in a different setting. While such a calculated component reliability value is useful as an "estimate" of how the component may perform in a system, the value cannot be treated as an absolute number: it depends on the specific system and on the other software and hardware components interacting with the component.
Our decision, discussed in Chapter 4, to treat the estimated component reliability
values as a node in the Bayesian Network, enables us to associate a degree of uncer-
tainty with this value, consistent with the stochastic nature of our approach. This
helps us tackle this problem natively in our reliability modeling approach.
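For illustration, the sketch below contrasts a point estimate of a component's reliability with an uncertain estimate represented as a prior distribution that is marginalized during prediction. The discretization, probabilities, and serial composition are hypothetical and stand in for the Chapter 4 formulation.

```python
# Illustrative sketch of treating a component's estimated reliability as an
# uncertain quantity (a node with a prior) rather than a point value. The
# discretization and probabilities are hypothetical.

# Prior belief over the component's "true" reliability in this system.
reliability_prior = {0.99: 0.2, 0.95: 0.5, 0.90: 0.3}

def expected_system_reliability(rest_of_system=0.97):
    """Marginalize the system reliability over the uncertain component value,
    assuming (for illustration) a simple serial composition with the rest."""
    return sum(p * (r * rest_of_system) for r, p in reliability_prior.items())

point_estimate = 0.95 * 0.97   # what a single point value would give
print(f"{expected_system_reliability():.4f} vs point estimate {point_estimate:.4f}")
```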
Uncertainty of System Startup Process. Building a system out of fully reliable
components may not result in perfect reliability of the final system. This may be due
to various sources of uncertainty introduced in the integration process. Starting up a system is among the critical steps of integration that could adversely affect the sys-
tem reliability. By introducing an init node (as described in Chapter 4), we have
addressed the problem of uncertainties associated with the startup process.
Uncertainties of Human Interaction. This is an important and potentially serious
source of uncertainty when dealing with software systems. As an example, a fully functionally reliable system may result in catastrophic conditions because of improper usage of the system by its operators or users. One way to eliminate such uncertainties is to design checks and balances into all parts of the system design to disallow such mistakes. Using our architectural modeling approach, such checks and balances could be implemented as pre/post conditions, guards, and other types of assertions in the functional specification of the system. While such an approach can help reduce the possibility of harmful interactions, other uncertainties may need to be blended into the reliability model to address this form of uncertainty. The specific issue of modeling human-computer interaction and its associated uncertainties is beyond the scope of this dissertation.
Chapter 7: Related Work
The topic of this dissertation research spans the fields of software architecture and
reliability modeling. We have studied a variety of approaches in each domain, and
identified a few approaches that span both domains. In this chapter, we first present a
summary of related approaches to architectural modeling. We then provide an over-
view on existing reliability models. While extensive surveys of software reliability
modeling have been provided elsewhere [34,45,130], we present an original taxon-
omy of reliability models with a special emphasis on architectural relevance. Finally,
a discussion of Markov-based and Bayesian Network-based reliability models as
related to our research is presented.
7.1 Architectural Modeling
Building good models of complex software systems in terms of their constituent com-
ponents is an important step in realizing the goals of architecture-based software
development [77]. Effective architectural modeling should provide a good view of the
structural and compositional aspects of a system; it should also detail the system’s
behavior. Modeling from multiple perspectives has been identified as an effective
way to capture a variety of important properties of component-based software sys-
tems [16,30,50,57,93]. A well-known example is UML, which employs nine diagrams
(also called views) to model requirements, structural and behavioral design, deploy-
ment, and other aspects of a system. When several system aspects are modeled using
different modeling views, inconsistencies may arise.
Ensuring consistency among heterogeneous models of a software system is a major
software engineering challenge that has been studied in multiple approaches, with dif-
ferent foci. A small number of representative approaches are discussed here. [37]
offers a model reconciliation technique particularly suited to requirements engineer-
ing. The assumption made by the technique is that the requirements specifications are
captured formally. [8,38] also provide a formal solution to maintaining inter-model
consistency, though more directly applicable at the software architectural level. One
criticism that could be levied at these approaches is that their formality lessens the
likelihood of their adoption. On the other hand, [32,51] provide more specialized
approaches for maintaining consistency among UML diagrams. While their potential
for wide adoption is aided by their focus on UML, these approaches may be ulti-
mately harmed by UML’s lack of formal semantics.
We now discuss representative approaches to modeling each of the four views on
architectural models.
Interface modeling. Component modeling has been most frequently performed at
the level of interfaces. This has included matching interface names and associated
input/output parameter types. Component interface modeling has become routine,
spanning modern programming languages, interface definition languages (IDLs)
[78,94], architecture description languages (ADLs) [77], and general-purpose model-
ing notations such as UML [122]. However, software modeling solely at this level
does not guarantee many important properties, such as interoperability or substitut-
ability of components: two components may associate vastly different meanings with
syntactically identical interfaces.
Static Behavior Modeling. Several approaches have extended interface modeling
with static behavioral semantics [1,65,95,136]. Such approaches describe the behav-
ioral properties of a system at specific snapshots in the system’s execution. This is
done primarily using invariants on the component states and pre- and post-conditions
associated with the components’ operations. Static behavioral specification tech-
niques are successful at describing what the state of a component should be at specific
points in time. However, they are not expressive enough to represent how the compo-
nent arrives at a given state.
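For illustration, the following sketch renders a static behavioral specification as executable checks: an invariant on the component state and pre- and post-conditions on one of its operations. The component and its conditions are hypothetical.

```python
# Illustrative sketch of static behavior modeling: an invariant plus pre/post
# conditions attached to a component operation, written as executable assertions.
# The component and its conditions are hypothetical.

class ProfileStore:
    def __init__(self):
        self.profiles = {}            # invariant: keys are non-empty strings

    def _invariant(self):
        assert all(isinstance(k, str) and k for k in self.profiles)

    def add_profile(self, key, data):
        # Precondition: the key is non-empty and not yet stored.
        assert key and key not in self.profiles
        self._invariant()
        self.profiles[key] = data
        # Postcondition: the profile is now retrievable.
        assert self.profiles[key] is data
        self._invariant()

store = ProfileStore()
store.add_profile("rover-1", {"mode": "nominal"})
```

Such snapshots constrain what must hold before and after each operation, but, as noted above, they say nothing about how the component reaches those states.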
Dynamic Behavior Modeling. The deficiencies associated with static behavior
modeling have led to a third group of component modeling techniques and notations.
Modeling dynamic component behavior results in a more detailed view of the compo-
nent and how it arrives at certain states during its execution. It provides a continuous
view of the component’s internal execution details. While this level of component
modeling has not been practiced as widely as interface or static behavior modeling,
there are several notable examples of it. For instance, UML has adopted a StateChart-
based technique to model the dynamic behaviors of its conceptual components (i.e.,
Classes). Other variations of state-based techniques (e.g., FSM) have been used for
similar purposes (e.g., [30]). Finally, Wright [2] uses CSP to model dynamic behav-
iors of its components and connectors.
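For illustration, the sketch below captures a dynamic behavioral model as a small, flat state machine that maps (state, event) pairs to successor states; hierarchical and concurrent StateChart features are omitted, and the component and its states are hypothetical.

```python
# Illustrative sketch of dynamic behavior modeling: a small state machine
# describing how a hypothetical component moves between internal states in
# response to events (a flat, StateChart-like view without hierarchy).

TRANSITIONS = {
    ("idle",       "request"): "processing",
    ("processing", "done"):    "idle",
    ("processing", "error"):   "recovering",
    ("recovering", "reset"):   "idle",
}

def run(events, state="idle"):
    """Replay a sequence of events, failing on any undefined transition."""
    for e in events:
        state = TRANSITIONS[(state, e)]   # KeyError signals an illegal event
    return state

print(run(["request", "error", "reset", "request", "done"]))   # -> "idle"
```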
Interaction Protocol Modeling. The last category of component modeling
approaches focuses on legal protocols of interaction among components. This view of
modeling provides a continuous external view of a component’s execution by specify-
ing the allowed execution traces of its operations (accessed via interfaces). Several
techniques for specifying interaction protocols have been developed. These tech-
niques are based on CSP [2], FSM [133], temporal logic [1], and regular languages
[102]. They often focus on detailed formal models of the interaction protocols and
enable proofs of protocol properties. However, some may not scale very well, while
others may be too formal and complex for routine use by practitioners.
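For illustration, the following sketch expresses an interaction protocol as a regular language over a component's interface invocations and checks whether a trace conforms to it. The protocol and interface names are hypothetical.

```python
# Illustrative sketch of interaction protocol modeling: the allowed traces of a
# hypothetical component's interface calls expressed as a regular language, in
# the spirit of regular-language-based protocol specifications.

import re

# Protocol: open first, then any number of reads or writes, then close.
PROTOCOL = re.compile(r"open(;(read|write))*;close")

def conforms(trace):
    """Check whether a sequence of interface invocations obeys the protocol."""
    return PROTOCOL.fullmatch(";".join(trace)) is not None

print(conforms(["open", "read", "write", "close"]))   # True
print(conforms(["read", "close"]))                    # False: read before open
```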
Typically, the static and dynamic component behaviors and interaction protocols are
expressed in terms of a component’s interface model. For instance, at the level of
static behavior modeling, the pre- and post-conditions of an operation are tied to the
specific interface through which the operation is accessed. Similarly, the widely
adopted protocol modeling approach [133] uses finite-state machines in which com-
ponent interfaces serve as labels on the transitions. The same is also true of UML’s
use of interfaces specified in class diagrams for modeling event/action pairs in the
corresponding StateCharts model.
7.2 Reliability Modeling
Modeling, estimating, and analyzing software reliability during testing is a disci-
pline with over 30 years of history. Many reliability models have been proposed: Soft-
ware Reliability Growth Models (SRGMs) are used to predict and estimate software
reliability using statistical approaches [41,53,68,85]. Extensive overviews of these approaches have been provided previously [34,42].
The major shortcoming of SRGM approaches is that they treat the software system as
a monolithic entity. They ignore the internal structure of the system, and thus are
called black-box approaches. Consequently, these approaches cannot be used when
relating the reliability of the overall system to the reliability of its constituent compo-
nents. This is a major shortcoming in the case of large and complex software systems,
where decomposition, separation of concerns, and reuse play important roles in archi-
tecting and designing them. Finally, these black-box techniques directly leverage fail-
ure data, and thus cannot be applied to stages before testing. Estimating the reliability
of the system during testing does little in the way of a cost-effective software devel-
opment process. The defects detected during testing will be significantly more costly
to fix than if detected in earlier stages of the development. Additionally, knowing the
estimated reliability value at such a late stage leaves few options in meeting the reli-
ability requirements of a software system.
Another category of software reliability modeling techniques is white-box: they con-
sider a system’s internal structure in reliability estimation. These approaches directly
leverage the reliability of individual components and their configuration in order to
calculate the system’s overall reliability [43,56]. They usually assume that the indi-
vidual component reliability is known or can be obtained via SRGM approaches.
Goseva-Popstojanova et al. further classify white-box techniques into path-based,
state-based, and additive [45]: path-based models compute software reliability based
on the system’s possible execution paths; state-based models use the control flow
graph to represent the system’s internal structure and estimate its reliability analyti-
cally; finally, additive models simply add the failure rates of each individual unit to
determine the overall failure rate of the application and do not consider software
structure. In summary, white-box approaches leverage two independent models in
calculating reliability: a structural model describing software’s internal structure, and
a failure model, describing software’s failure behavior.
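For illustration, the sketch below computes a path-based, white-box estimate: the system reliability is the sum, over execution paths, of the path probability times the product of the reliabilities of the components visited on that path. The paths, their probabilities, and the component reliabilities are hypothetical.

```python
# Illustrative sketch of a path-based, white-box estimate. Paths, path
# probabilities, and component reliabilities are hypothetical.

component_rel = {"A": 0.99, "B": 0.97, "C": 0.95}

# Each path: (probability of the path being taken, components executed on it).
paths = [
    (0.6, ["A", "B"]),
    (0.3, ["A", "C"]),
    (0.1, ["A", "B", "C"]),
]

def path_based_reliability(paths, rel):
    total = 0.0
    for prob, comps in paths:
        r = 1.0
        for c in comps:
            r *= rel[c]
        total += prob * r
    return total

print(f"{path_based_reliability(paths, component_rel):.4f}")
```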
The common theme across all of these approaches, however, is their applicability to
implementation-level artifacts, and reliability estimation during testing. Even those
approaches assumed to be applicable in other development phases rely on estimates
of the code size [23]. Those existing approaches that are architectural consider only the structure of the system. The only exceptions are [45,106,130,134]. Reussner et al. [106]
build architectural reliability models based on both structural and behavioral specifi-
cations of a system. Their parametrized reliability estimation technique assumes the
reliability of individual component services to be known. Wang et al. [134] leverage
architectural configuration while focusing on architectural styles for building a pre-
diction model that is mostly concerned with sequential control flow across compo-
nents in a system. Goseva-Popstojanova et al. [45] focus on uncertainties associated
with unknown operational profiles, and provide extensive sensitivity analysis to dem-
onstrate the effectiveness of their approach. Their architectural model represents the
control flow among the components, but cannot model concurrency and hierarchy
often represented in architectural models [77]. Finally, Yacoub et al. [130] leverage a
scenario-based model of the system's behavior and build component dependency graphs
to perform reliability analysis.
However, none of these approaches consider the effect of a component’s internal
behavior on its reliability. They simply assume that the component reliability, or that of some of its elements (such as the reliability of a component's services), is known. They then use
these values to obtain system reliability. Additionally, with the exception of [45], they
rely on the availability of a running system to obtain the frequency of component ser-
vice invocations (operational profile).
When predicting software reliability at early stages of development, such as during architectural design, the system's operational profile cannot be properly known. This contributes to some level of uncertainty in the parameters used for reli-
ability estimation. In general, if a considerable uncertainty in the estimates of the sys-
tem’s operational profile exists, then the uncertainty may be propagated to the
estimated reliability. Consequently, traditional approaches to software reliability esti-
mation may not be appropriate since they cannot take such uncertainties into consid-
eration. A few approaches assess the uncertainties in reliability estimation heuristically, with a variable operational profile, via techniques such as the method of moments and simulation-based techniques such as Monte Carlo simulation [45]. Other techniques, however, assume a fixed (a priori known) operational profile and varying component
reliability and apply traditional Markov-based sensitivity analysis [18,115].
Hidden Markov Models (HMMs) [103] are a formalism that leverages Markov models
while assuming some parameters may be unknown (hidden). Particularly, HMMs
assume that, while the number of states in the state-based model is known, the exact
sequence of states to obtain a sequence of transitions may be unknown. In addition,
HMMs assume that the value of the transition probability distribution may be inaccu-
rate. The challenge is to determine the hidden parameters, from the observable param-
eters, based on this assumption.
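For illustration, the following sketch shows the basic HMM computation alluded to above: even though the hidden state sequence is unobserved, the likelihood of an observation sequence can still be computed with the forward algorithm. The initial, transition, and observation matrices are hypothetical.

```python
# Illustrative sketch of the HMM setting: hidden states are unobserved, but the
# likelihood of an observation sequence can be computed with the forward
# algorithm. All matrices below are hypothetical.

import numpy as np

pi = np.array([0.8, 0.2])                 # initial state distribution
A  = np.array([[0.9, 0.1],                # state transition probabilities
               [0.3, 0.7]])
B  = np.array([[0.95, 0.05],              # P(observation | state); columns: ok, fail
               [0.40, 0.60]])

def forward_likelihood(obs):
    """P(observation sequence), marginalized over all hidden state paths."""
    alpha = pi * B[:, obs[0]]
    for o in obs[1:]:
        alpha = (alpha @ A) * B[:, o]
    return alpha.sum()

print(forward_likelihood([0, 0, 1]))      # e.g., observations: ok, ok, fail
```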
With the exception of [29], previous approaches to Markovian software reliability
modeling have not leveraged HMMs (the focus of [29] is on imperfect debugging
during testing and does not relate components' interactions to reliability estimation, which is one of the primary goals of this dissertation research).
While HMMs have not been previously used in the context of architecture-level reli-
ability estimation, they have yielded results in areas such as recognition of handwritten
characters [12], image recognition [19], segmentation of DNA sequences and gene
recognition [14], economic data modeling [29], etc.
Bayesian Networks have long been used in various areas of science and engineering
where a flexible method for reasoning under uncertainty is needed. They have been
used for data mining and knowledge discovery [5], security, spam filtering, and intrusion detection [60,64], forecasting, and data analysis in the medical domain [135].
They have also been applied to modeling reliability of engineering systems
[4,68,96,98,123].
Two important factors distinguish our work from all these Bayesian-based reliability approaches: (1) our approach directly leverages architectural models of the system; and (2) our approach does not rely on the existence of the system's operational profile.
The approach in [4] is a BN-based model that predicts the quality of a software product by focusing on the structure of the software development process. The quantita-
tive part of the BN is constructed based on the various activities in the development
process and the probabilities are assigned according to the metrics obtained from
these activities. The approach clearly is not applicable to early reliability assessment.
Both [68,96] rely on testing data to construct the Bayesian network and perform
reliability estimation: once again such data does not exist at the architectural level.
The approach in [98] calculates the quality of the development process by specifically
focusing on various process activities. Our approach clearly differs from the above by
relying on architectural models and domain knowledge.
7.3 Taxonomy of Architectural Reliability Models
Extensive studies of software reliability techniques are provided elsewhere [23,34,45].
For the purpose of our research, we were interested in approaches relevant to software
architecture and its artifacts. In order to relate existing reliability models to software
architectural artifacts, we have developed a taxonomy of architecturally relevant
aspects of reliability models. The taxonomy is depicted in Figure 7-1. We will now
discuss different dimensions of this taxonomy.
Basis. At their cores, reliability models can be grouped as those that are applicable to
implementation-level artifacts (i.e. code), process-based, or architecture-level arti-
facts (i.e., specification-based). The approaches applicable to code are further classified as black-box, where the system structure is not taken into consideration, or white-box, where the system structure is considered in the reliability model.

Figure 7-1. Taxonomy of Architecture-based Reliability Models (a table relating the criteria Basis, Architectural Relevance, Model Richness, and Overall Reliability Assessment, together with their values and sub-values, to existing approaches, including Wang99, Hamlet01, Reussner03, Bondavalli01, Dolbec95, Gokhale98, Whittaker93, Li97, Krishnamurthy97, Everette99, Singh01, Zequeira00, Pai02, Mockus03, Neil96, Smidts96, and CARMA)
Traditional reliability models are implementation-based and may be black-box (e.g., SRGMs) or white-box [45]. Process-based approaches such as [79,98] consider the software development process and its various activities (such as the architecture and design stages), and measure the reliability of the process. The focus of our research is
on specification-based models, where analytical reliability models may be applied to
a model of the software system’s architecture.
Architectural Relevance. Architectural models provide an abstraction of software
system properties. In general, a particular system is defined in terms of a collection of
components (loci of computation) and connectors (loci of communication) as orga-
nized in an architectural configuration. Architecture Description Languages (ADLs)
[77] specify software properties in terms of a set of components that communicate via
connectors through interfaces. Finally, an architectural style defines a vocabulary of
component and connector types and a set of constraints on how instances of these
types can be combined in a system [117]. We postulate that a useful model to quantify
system reliability at the level of software architecture should consider the above mod-
eling elements.
Model Richness. As discussed in Section 7.1, functional properties of software sys-
tems are described using one or more of the following four views: interfaces, static
behaviors, dynamic behaviors, and interaction protocols. Even though other aspects
of a system may be modeled using other modeling views, we believe that the above
four models provide a comprehensive basis for specifying functional properties of
systems. Explicit emphasis on these views has not been much of a focus in existing
reliability models. With the exception of Reussner et al. [106], which leverages inter-
faces, static behaviors (pre/post conditions), and interaction protocols, other
approaches focus only on components' interaction protocols to estimate system reli-
ability. Focusing on a subset of these modeling views results in the need to make sim-
plifying assumptions in estimating overall reliability. Such an approach inherently
assumes that values of individual component reliabilities are known a priori.
As discussed earlier, state-based approaches to reliability modeling use a control flow
graph to represent the application structure and the application reliability is estimated
analytically. Path-based approaches on the other hand, estimate the reliability by con-
sidering the possible execution paths of the application. As a result, the path-based
approaches provide only approximate estimates for applications which have infinite
paths due to the presence of loops. In our taxonomy, we adopt the same principle and
classify models of interaction protocols as those that specify the control flow vs. those that specify interactions among components. The latter enables modeling the overall system's behavior in terms of the behavior of individual components that are
being executed concurrently.
Overall Reliability Assessment. In general, there are two classes of approaches to estimating a system's overall reliability. The flat techniques (often seen in black-box models) take a non-compositional approach to estimating the overall reliability. Such approaches are inconsistent with software architecture and its goals of decomposition, reuse, and separation of concerns. The compositional approaches take the system's structure and the components' interactions into account when estimating the overall reliability. We further classify them into those that assume that a component's reliability, or the reliability of a component's constituent elements (e.g., its services), is known. Such assumptions undermine the usability of these models. Alternatively, component-level reliability may be estimated given proper functional models of the component itself and based on the results of advanced analyses. Our proposed technique takes one such approach.
Chapter 8: Conclusion and Future Work
Despite the maturity of software reliability techniques, predicting the reliability of
software systems before implementation has not received adequate attention in the
past. Reliability estimation techniques are often geared toward the testing phase of the
development life cycle, when the system’s operational profile is known. In these
approaches, defects are primarily identified during testing. However, about 50% of
these defects are rooted in pre-implementation phases of development, such as archi-
tecture and design [101]. Studies have shown that early discovery of defects in the
software development life cycle results in a more cost effective mitigation process
[13]. Reliability prediction early during the software development life cycle is critical
in building reliability into the software system. Given the uncertainties associated
with software systems early in the development process, appropriate reliability mod-
els must be able to accommodate the uncertainties and produce meaningful results.
The approach described in this dissertation aims at closing the gap between architectural modeling on the one hand, and its impact on software reliability on the other. We focused on the reliability of individual software components as the first
step. Our approach leverages standard architectural models of the system, and uses an
Augmented Hidden Markov Model to predict the reliability of software components.
The system’s overall reliability is then predicted as a function of individual compo-
nents’ reliabilities, and their complex interactions. This is done by building a
Dynamic Bayesian Network based on components’ interaction protocol models, and
leveraging analytical techniques to quantify the reliability of software systems. In the
rest of this chapter, we first enumerate the contributions of this dissertation research.
We then conclude by offering several interesting directions this research can take in
the future.
8.1 Contributions
The contribution of this work can be summarized as follows:
• Mechanisms to ensure intra- and inter-consistency among multiple views of a system's architectural models,
• A formal reliability model to predict both the component-level and system-level reliability of a given software system based on its architectural specification, and
• A parameterized and pluggable defect classification and cost-framework to identify critical defects, whose mitigation is most cost-effective in improving a system's overall reliability.
The combination of the architectural modeling and analysis technique, together with
the defect classification, cost-framework, and the reliability models of individual
components and the overall system, comprise a comprehensive methodology which
has been evaluated on a series of case study applications.
8.2 Future Work
In this section we describe a set of open research questions which form the various
aspects of our future work.
8.2.1. Architectural Styles and Patterns and Reliability
This work has not considered the impact of specific architectural styles or patterns of
interaction on reliability. Architectural styles impose constraints on the interactions
among components in a system, as well as on the system’s structure. Moreover, lever-
aging known interaction patterns can help eliminate some of the possible architectural defects. Relating properties of architectural styles and patterns to our reliability model may be beneficial in two ways. First, it could enable architects to use patterns and styles as templates, where the impact of the specific constraints on the system's reliability is already quantified. Moreover, leveraging various design constraints may
enable us to eliminate some of the parameters in the reliability model, and thus reduce
the complexity of the underlying algorithms. This in turn, may help improve the scal-
ability of the approach.
The only related approach in incorporating architectural styles into a reliability model
[134] simplifies the problem by only modeling the transfer of control among compo-
nents based on the style characteristics. The main challenge in providing a more fine-
grained incorporation of the two concepts is addressing concurrency issues in compo-
nents’ interaction. Another interesting problem is to formalize the notion of reliable
patterns. A good place to start is to draw parallels with the research and development
in the software security community [113]. We believe that our Bayesian reliability
modeling approach offers a starting point in incorporating these concepts into the reli-
ability prediction approach.
8.2.2. Reliability Modeling for Software Connectors
Software connectors are the loci of communication in a software system and act as a
glue that enables the interactions among components. Our reliability model only focuses on software components and their operations, and treats connectors as special components. While software connectors have been shown to provide a suitable vehicle to model other dependability attributes (such as security) [105], there has not been
any research in modeling reliability of systems using software connectors. We plan to
study this topic and extend our reliability model to encompass both components and
connectors. The first challenge here is building appropriate abstractions for modeling
relevant connector properties. Unlike components, not much focus has been placed on devel-
oping effective techniques for modeling and analyzing software connectors. Further-
more, special attention must be given to the interaction of components and
connectors. Our system-level reliability modeling approach thus must be adapted to incorporate connector models into the Global Behavioral Model, and to formalize the component-connector and connector-connector interactions.
8.2.3. Early Prediction of Other Dependability Attributes
Modeling other dependability attributes (such as availability, safety, and security) exhibits similar properties to reliability modeling. Building dependable software systems requires addressing these other dependability properties. We plan to extend our work to model other dependability aspects of a software system's architecture in the early stages of the development process.
Availability. Similar to software reliability, availability may be modeled stochasti-
cally. An interesting question is whether prediction of system availability may be per-
formed at early stages of software development when no implementation-level
artifacts exist. Architectural models (e.g., ADLs) must thus be extended to explicitly
model, analyze, and simulate the deployment conditions under which the system will
be operational.
Security and safety. Recent advances in software security community have brought
architectural risk analysis and threat modeling to the forefront of software develop-
ment dependability process. The main shortcoming of these approaches, however, is their emphasis on low-level implementation issues at an early stage of development, when possibly no code has yet been developed. Higher-level abstractions are
needed to describe, model, and analyze security and safety characteristics of the sys-
tems at the architectural level.
8.2.4. Extensions to Support Product Families
In the past we have done extensive work in architectural modeling, analysis, and evolution of software systems, a natural springboard for supporting the architectural design of product families. This research could benefit from results relating architectural patterns and styles to their impact on software reliability. Such
abstractions reveal themselves more naturally in the contexts in which reuse is lever-
aged. We intend to expand our previous work in modeling architectural evolution, and
build reliability models applicable to product families and their associated challenges.
References
1. N. Aguirre, T.S.E. Maibaum. A Temporal Logic Approach to Component
Based System Specification and Reasoning. In Proceedings of the 5th ICSE
Workshop on Component-Based Software Engineering, Orlando, FL, 2002.
2. R. Allen, and D. Garlan. A Formal Basis for Architecture Connection. ACM
Transactions on Software Engineering and Methodology, 6(3): p.213-249,
1997.
3. R. Almond. An extended example for testing Graphical Belief. Technical
Report 6, Statistical Sciences Inc. (1992).
4. S. Amasaki, Y., et al., Bayesian Belief Network for Assessing the Likelihood
of Fault Content, in Proceedings of the 14th International Symposium on Soft-
ware Reliability Engineering ISSRE, Denver, Colorado 2003.
5. S. Arnborg. A Survey of Bayesian Data Mining, in John Wang’s Data Mining:
Opportunities and Challenges, Montclair State University, USA, 2003.
6. P. Ashar, A. Gupta, S.Malik. Using complete-1-distinguishability for FSM
equivalence checking. ACM Transactions on Design Automation of Electronic
Systems, Vol. 6, No. 4, pp. 569-590, October 2001.
7. A. Azem. Software Reliability Determination for Conventional and Logic
Programming, Walter de Gruyter, 1995.
8. R. Balzer. Tolerating Inconsistency, in Proceedings of 13th International Con-
ference on Software Engineering (ICSE-13), Austin, Texas, 1991.
9. A. Benveniste, E. Fabre, S. Haar. Markov Nets: Probabilistic Models for Dis-
tributed and Concurrent Systems, IEEE Transactions on Automatic Control
AC-48, 11, pages 1936-1950, November 2003.
10. A. Bondavalli, et. al., Dependability Analysis in the Early Phases of UML
Based System Design, Journal of Computer Systems Science and Engineer-
ing, Vol. 16, pp. 265-275, 2001.
11. L. E. Baum, An inequality and associated maximization technique in statisti-
cal estimation for probabilistic functions of Markov processes. Inequalities,
3:1-8, 1972.
12. B. Boehm. Software Engineering Economics, Prentice-Hall, Englewood
Cliffs, NJ, 1981.
13. B. Boehm. Software Risk Management: Principles and Practices, IEEE Soft-
ware, January 1991.
14. B. Boehm, J. Bhuta, D. Garlan, E. Gradman, L. Huang, A. Lam, R. Madachy,
N. Medvidovic, K. Meyer, S. Meyers, G. Perez, K. Reinholtz, R. Roshandel,
N. Rouquette, Using Testbeds to Accelerate Technology Maturity and Transi-
tion: The SCRover Experience, USC Technical Report USC-CSE-2003-507,
(Submitted to ICSE 2004), September 2003.
15. B. Boehm, P. Grünbacher, R. Briggs, Developing Groupware for Require-
ments Negotiation: Lessons Learned, IEEE Software, May/June 2001.
16. G. Booch, I. Jacobson, J. Rumbaugh, The Unified Modeling Language User
Guide, Addison-Wesley, Reading, MA.
17. J. Chang and D.J. Richardson, Structural Specification-based Testing: Auto
mated Support and Experimental Evaluation, ESEC/FSE’99: Proceedings of
the 7th European Software Engineering Conference, Toulouse, France, Sep-
tember 1999.
18. R.C. Cheung, A user-oriented software reliability model, IEEE Transactions
on Software Engineering, SE6 (2):118-125, March 1980.
19. E. Charniak, Bayesian networks without tears, AI Magazine, vol. 12, no. 4, pp.
50-63, 1991.
20. E. Cinlar, Introduction to Stochastic Processes, Englewood Cliffs, NJ, Pren-
tice-Hall, 1975.
21. G. F. Cooper, The Computational Complexity of Probabilistic Inference Using
Bayesian Belief Networks. Artificial Intelligence, 42(2–3):393–405, March
1990.
22. C. Courcoubetis, and M. Yannakakis, The complexity of probabilistic verifi-
cation. Journal of the ACM, 42(4):857–907, 1995.
23. S.R. Dalal, Software Reliability Models: A Selective Survey and New Direc-
tions, Handbook of Reliability Engineering, edited by H. Pham, Springer,
2003.
24. T. DeMarco, Controlling Software Projects: Management, Measurement, and
Estimation. Englewood Cliffs, NJ: Yourdon Press, 1998.
25. E. Dashofy, van der Hoek A., Taylor R.N., An Infra-structure for the Rapid
Development of XML-based Architecture Description Languages, In Pro-
ceedings of the 24th International Conference on Software Engineering
(ICSE2002), Orlando, Florida.
26. D. Dvorak, Challenging Encapsulation in the Design of High-Risk Control
Systems. In Proceedings of the 2002 Conference on Object Oriented Pro-
gramming Systems, Languages, and Applications (OOPSLA'02), Seattle, WA,
November 2002
27. D. Dvorak, R. Rasmussen, G. Reeves, and A. Sacks, Software Architecture
Themes In JPL's Mission Data System, In Proceedings of the AIAA Space
Technology Conference and Exposition, Albuquerque, NM, September, 1999.
28. J. Dolbec, T. Shepard, A Component Based Software Reliability Model, in
Proceedings of the 1995 conference of the Centre for Advanced Studies on
Collaborative research, Toronto, Ontario, Canada, November 1995.
29. J.B. Durand, O. Gaudoin, Software reliability modelling and prediction with
Hidden Markov chains, Technical Report Number: INRIA n°4747, February
2003.
30. M. Dias, M. Vieira, Software Architecture Analysis based on Statechart
Semantics, in Proceedings of the 10th International Workshop on Software
Specification and Design, FSE-8, San Diego, USA, November 2000.
31. A. Egyed, Architecture Differencing for Self Management, in Proceedings of
the 1st ACM SIGSOFT workshop on Self-managed systems, Newport Beach,
California, 2004.
32. A. Egyed, Scalable Consistency Checking between Diagrams - The ViewInte-
gra Approach, in Proceedings of the 16th IEEE International Conference on
Automated Software Engineering, San Diego, CA, 2001
33. W. Everett, Software Component Reliability Analysis, in IEEE Symposium on
Application - Specific Systems and Software Engineering and Technology,
Richardson, Texas, 1999.
34. W. Farr, Software Reliability Modeling Survey, Handbook of Software Reli-
ability Engineering, M. R. Lyu, Editor. McGraw-Hill, New York, NY, 1996.
35. A. Farías, M. Südholt, On Components with Explicit Protocols Satisfying a
Notion of Correctness by Construction. in Proceedings of Confederated Inter-
national Conferences CoopIS/DOA/ODBASE, 2002.
36. T.H. Feng, A Virtual Machine Supporting Multiple Statechart Extensions, In
Proceedings of Summer Computer Simulation Conference (SCSC 2003), Stu-
dent Workshop. The Society for Computer Modeling and Simulation. Jul.
2003, Montreal, Canada.
37. A. Finkelstein, D. Gabbay, A. Hunter, J. Kramer, and B. Nuseibeh, Inconsis-
tency Handling in Multi-Perspective Specifications, IEEE Transactions on
Software Engineering, 20(8): 569-578, August 1994.
38. P. Fradet, D. Le Métayer, M. Périn, Consistency Checking for Multiple View
Software Architectures”, in Proceedings of the Seventh European Software
Engineering Conference (ESEC) and the Seventh ACM SIGSOFT Symposium
on the Foundations of Software Engineering, 1999.
39. Y. Gal, A. Pfeffer, A Language for Modeling Agents' Decision Making Pro-
cesses in Games, in Proceedings of the second international joint conference
on Autonomous agents and multiagent systems, Melbourne, Australia, 2003.
40. D. Garlan, R.T. Monroe, and D. Wile, Acme: Architectural Description of
Component-Based Systems. Foundations of Component-Based Systems.
Leavens, G.T., and Sitaraman, M. (eds). Cambridge University Press, 2000, pp.
47-68.
41. A.L. Goel, K. Okumoto, Time-Dependent Error-Detection Rate Models for
Software Reliability and Other Performance Measures, IEEE Transactions on
Reliability, 28(3):206–211, August 1979.
42. S. Gokhale, P.N. Marinos, and K.S. Trivedi, Important milestones in software
reliability modeling, In Proceedings of the 8th International Conference on
Software Engineering and Knowledge Engineering (SEKE 96), Lake Tahoe,
June 1996.
43. S. Gokhale, T. Philip, P. Marinos, K. Trivedi, Unification of finite-failure non-
homogenous Poisson process models through test coverage, in Proceedings of
the 7th IEEE International Symposium on Software Reliability Engineering
(ISSRE-96), November. 1996.
44. S. Gokhale, W. E. Wong, K. S. Trivedi, and J. R. Horgan, An Analytical
Approach to Architecture-Based Software Reliability Prediction, IEEE Inter-
national. Computer Performance and Dependability Symposium, Durham,
NC, Sept. 1998.
45. K. Goseva-Popstojanova, A.P. Mathur, K.S. Trivedi, Comparison of Architec-
ture-Based Software Reliability Models, in Proceedings of the 12th IEEE
International Symposium on Software Reliability Engineering (ISSRE-2001),
Hong Kong, November 2001.
46. W.J. Gutjahr, Optimal Test Distributions for Software Failure Cost Estimation,
IEEE Transaction on Software Engineering, V . 21, No. 3, pp. 219-228, March
1995.
47. D. Harel, Statecharts: A visual formalism for complex systems, Science of
Computer Programming, V olume 8, Issue 3, June 1987.
48. D. Harel, A. Naamad, The STATEMATE Semantics of Statecharts. ACM
Transactions on Software Engineering and Methodology, 5(4): 293-333, 1996.
49. D. Heckerman, A Tutorial on Learning with Bayesian Networks. In Learning
in Graphical Models, M. Jordan, ed. MIT Press, Cambridge, MA, 1999.
50. C. Hofmeister, R.L. Nord, and D. Soni, Describing Software Architecture
with UML, In Proceedings of the TC2 First Working IFIP Conference on Soft-
ware Architecture (WICSA1), San Antonio, TX, February 22-24, 1999.
51. B. Hnatkowska, Z. Huzar, J. Magott, Consistency Checking in UML Models,
in Proceedings of Fourth International Conference on Information System
Modeling (ISM01), Czech Republic, 2001.
52. Inspector SCRover Project, http://cse.usc.edu/iscr/pages/ProjectDescription/
home.htm
53. Z. Jelinski and P. B. Moranda, Software Reliability Research, Statistical Com-
puter Performance Evaluation, edited by W. Freigerger, Academic Press,
1972.
54. F. Jensen, Bayesian Networks and Decision Graphs. Springer, 2001.
55. M. I. Jordan, (ed), Learning in Graphical Models, MIT Press. 1998.
56. S. Krishnamurthy, A.P. Mathur, On the Estimation of Reliability of a Software
System Using Reliability of its Components, in Proceedings of the 8th IEEE
International Symposium on Software Reliability Engineering (ISSRE-97),
pp.146-155, November 1997.
57. P.B. Kruchten, The 4+1 View Model of Architecture. IEEE Software, 2(6):42-
50, 1995.
58. H. Langseth, Bayesian Networks with Application in Reliability Analysis.
Technical Report PhD Thesis, Dept. of Mathematical Sciences, Norwegian
University of Science and Technology, 2002.
59. J.C. Laprie and K. Kanoun, Handbook of Software Reliability Engineering, M.
R. Lyu, Editor, chapter “Software Reliability and System Reliability”, pages
27–69. McGraw-Hill, New York, NY, 1996.
60. W. Lee, Applying Data Mining to Intrusion Detection: the Quest for Automa-
tion, Efficiency, and Credibility, ACM SIGKDD Explorations Newsletter, Vol-
ume 4, Issue 2, Pages: 35 - 42, December 2002.
61. N. Leveson, Safeware: System Safety and Computers, Addison Wesley
(1995).
62. J. Li, Monitoring and Characterization of Component-Based Systems with
Global Causality Capture, In Proceedings of the 23rd IEEE International
Conference on Distributed Computing Systems (ICDCS), 2003.
63. J. Li, J. Micallef, and J. Horgan, Automatic Simulation to Predict Software
Architecture Reliability, in Proceedings of Eighth International Symposium
on Software Reliability Engineering (ISSRE '97), Albuquerque, NM, 1997.
64. P. Liu, W. Zang, M. Yu, Incentive-based Modeling and Inference of Attacker
Intent, Objectives, and Strategies, ACM Transactions on Information and Sys-
tem Security, Volume 8, Issue 1, pp. 78-118, 2005.
65. B.H. Liskov, J. M. Wing, A Behavioral Notion of Subtyping, ACM Transac-
tions on Programming Languages and Systems, November 1994.
66. B. Littlewood, A Reliability Model for Markov Structured Software, In Pro-
ceedings of the 1975 International Conference on Reliable Software, pages
204–207, Los Angeles, CA, April 1975.
67. B. Littlewood, A Semi-Markov Model for Software Reliability with Failure
Costs, In Proceedings of Symposium on Computational Software Engineering,
pp 281–300, Polytechnic Institute of New York, April 1976.
68. B.A. Littlewood, and J.L. Verrall, A Bayesian Reliability Growth Model for
Computer Software, Applied Statistics, Volume 22, pp. 332-346, 1973.
69. D.C. Luckham, and J. Vera, An Event-Based Architecture Definition Lan-
guage. IEEE Transactions on Software Engineering, vol. 21, no. 9, pp. 717-
734, September 1995.
70. M. R. Lyu, Handbook of Software Reliability Engineering, McGraw-Hill,
New York, NY, 1996.
71. J. Magee, and J. Kramer, Dynamic Structure in Software Architectures, in
Proceedings of the Fourth ACM SIGSOFT Symposium on the Foundations of
Software Engineering, pp.3-13, 1996.
72. A. Maggiolo-Schettini, A. Peron, and S. Tini, Equivalence of Statecharts, In
Proceedings of CONCUR '96, Springer, Berlin, 1996
73. D. Mason, Probabilistic Analysis for Component Reliability Composition. In
5th ICSE Workshop on Component-Based Software Engineering
(CBSE’2002), Orlando, Florida, USA, May 2002.
74. The MathWorks Matlab: http://www.mathworks.com
75. J. McManus, Risk Management in Software Development Projects, Butter-
worth-Heinemann, 2003.
76. N. Medvidovic, D.S. Rosenblum, and R.N. Taylor, A Language and Environ-
ment for Architecture-Based Software Development and Evolution, In Pro-
ceedings of the 21st International Conference on Software Engineering
(ICSE'99), Los Angeles, CA, May 1999.
77. N. Medvidovic, and R.N. Taylor, A Classification and Comparison Frame-
work for Software Architecture Description Languages. IEEE Transactions on
Software Engineering 26(1), pp. 70-93, 2000.
78. Microsoft Developer Network Library, Common Object Model Specification,
Microsoft Corporation, 1996.
79. A. Mockus, D.M. Weiss, P. Zhang, Understanding and Predicting Effort in
Software Projects, in Proceedings of the 25th International Conference on
Software Engineering, Portland, Oregon, 2003.
80. J.F. Murray, G .F. Hughes, K. Kreutz-Delgado, Machine Learning Methods for
Predicting Failures in Hard Drives: A Multiple-Instance Application, The
Journal of Machine Learning Research, Vol. 6, pp. 783-816, 2005.
81. K. Murphy, A Brief Introduction to Graphical Models and Bayesian Net-
works, http://www.cs.ubc.ca/~murphyk/bayes/.html, 1998.
82. K. Murphy, Hidden Markov Model (HMM) Toolbox for Matlab, http://
www.cs.ubc.ca/~murphyk/Software/HMM/hmm.html
83. J.D. Musa, A Theory of Software Reliability and Its Application, IEEE Trans-
actions on Software Engineering, 1(1975)3, pp. 312-327, 1975.
84. J.D. Musa, A. Iannino, K. Okumoto, Software Reliability: Measurement, Pre-
diction, Application, McGraw-Hill International Editions, 1987.
85. J.D. Musa, and K. Okumoto, Logarithmic Poisson Execution Time Model for
Software Reliability Measurement, in Proceedings of Compsac 1984, pp. 230-
238, 1984.
86. NASA High Dependability Computing Project (HDCP), http://www.hdcp.org.
87. NASA Object Oriented Data Technology (OODT), http://oodt.jpl.nasa.gov.
88. R. Neapolitan, Probabilistic Reasoning in Expert Systems. J. Wiley, 1990.
89. C. Needham, J.R. Bradford, A.J. Bulpitt, D.R. Westhead, Application of
Bayesian Networks to Two Classification Problems in Bioinformatics. Quan-
titative Biology, Shape Analysis and Wavelets, 87-90, 2005.
90. Netica. http://www.norsys.com
91. A. Nicholson, S. Russell, Techniques for Handling Inference Complexity in
Dynamic Belief Networks, Technical Report: CS-93-31, Brown University,
1993.
92. D. Nikovski, Constructing Bayesian Networks for Medical Diagnosis from
Incomplete and Partially Correct Statistics, IEEE Transactions on Knowledge
and Data Engineering, Volume 12, Issue 4, pp. 509-516, 2000.
93. B. Nuseibeh, J. Kramer, and A. Finkelstein, Expressing the Relationships
Between Multiple Views in Requirements Specification, in Proceedings of the
15th International Conference on Software Engineering (ICSE-15), Balti-
more, Maryland, USA, 1993.
94. Object Management Group, The Common Object Request Broker: Architec-
ture and Specification, Document Number 91.12.1, OMG, December 1991.
95. The Object Constraint Language (OCL), http://www-3.ibm.com/software/ad/
library/standards/ocl.html.
96. H. Okamura, H. Furumura, and T. Dohi, Bayesian Approach to Estimate Soft-
ware Reliability in Fault-removal Environment, in Proceedings of the 15th
IEEE International Symposium on Software Reliability Engineering (ISSRE
2004) (Fast Abstract), Saint-Malo, France, December 2-5, 2004.
97. G.J. Pai, and J.B. Dugan, Enhancing Software Reliability Estimation Using
Bayesian Networks and Fault Trees, in Proceedings of the 12th IEEE Interna-
tional Symposium on Software Reliability Engineering (ISSRE Fast Abstract
Track), 2001.
98. G.J. Pai, S.K. Donohue, and J.B. Dugan, Estimating Software Reliability from
Process and Product Evidence, in Proceedings of the 6th International Con-
ference on Probabilistic Safety Assessment and Management, Feb. 2002.
99. J. Pearl, Probabilistic Reasoning in Intelligent Systems, Morgan Kaufmann,
1989.
100. D.E. Perry, and A.L. Wolf, Foundations for the Study of Software Architec-
tures, ACM SIGSOFT Software Engineering Notes, 17(4): 40-52, 1992.
101. H. Pham, Software Reliability, Springer 2002.
102. F. Plasil, S. Visnovsky, Behavior Protocols for Software Components, IEEE
Transactions on Software Engineering 28(11), pp. 1056–1076, November
2002.
103. L.R. Rabiner, A Tutorial on Hidden Markov Models and Selected Applica-
tions in Speech Recognition. In Proceedings of the IEEE, Volume 77, 1989.
104. L.R. Rabiner, B.H. Juang, and C.H. Lee, An Overview of Automatic Speech
Recognition. In C. H. Lee, F. K. Soong, and K. K. Paliwal, editors, Automatic
Speech and Speaker Recognition, Advanced Topics, pages 1-30. Kluwer Aca-
demic Publishers, 1996.
105. J. Ren, R. Taylor, P. Dourish, D. Redmiles. Towards An Architectural Treat-
ment of Software Security: A Connector-Centric Approach. In Proceedings of
the Workshop on Software Engineering for Secure Systems, International
Conference on Software Engineering, St. Louis, Missouri, USA, 2005.
106. R. Reussner, H. Schmidt, I. Poernomo, Reliability Prediction for Component-
based Software Architectures, In Journal of Systems and Software, 66(3), pp.
241-252, Elsevier Science Inc, 2003.
107. R. Roshandel, N. Medvidovic, Coupling Static and Dynamic Semantics in an
Architecture Description Language, in Proceeding of Working Conference on
Complex and Dynamic Systems Architectures, Brisbane, Australia, 2001.
108. R. Roshandel, N. Medvidovic, Modeling Multiple Aspects of Software Com-
ponents, in Proceeding of Workshop on Specification and Verification of Com-
ponent-Based Systems, ESEC-FSE03, Helsinki, Finland, September 2003.
109. R. Roshandel, N. Medvidovic, Multi-View Software Component Modeling
for Dependability, In R. de Lemos, C. Gacek, and A. Romanowski, eds.,
Architecting Dependable Systems II, Lecture Notes in Computer Science
3069, Springer Verlag, pages 286-306, June 2004.
110. R. Roshandel, B. Schmerl, N. Medvidovic, D. Garlan, D. Zhang, Understand-
ing Tradeoffs among Different Architectural Modeling Approaches, in Proc.
of the 4th Working IEEE/IFIP Conference on Software Architecture, WICSA
2004, Oslo, Norway, June 2004.
111. R. Roshandel, A. van der Hoek, M. Mikic-Rakic, N. Medvidovic, Mae - A
System Model and Environment for Managing Architectural Evolution, ACM
Transactions on Software Engineering and Methodology, vol. 11, no. 2, pages
240-276, April 2004.
112. G.J. Schick, and R.W. Wolverton, An Analysis of Computing Software Reli-
ability Models, in IEEE Transactions on Software Engineering, vol. SE-4, pp.
104120, July 1978.
113. M. Schumacher, Security Engineering with Patterns: Origins, Theoretical
Models, and New Applications, Springer; 1 edition, 2003.
238
114. SCRover Project: http://cse.usc.edu/hdcp/iscr.
115. K. Seigrist, Reliability of systems with Markov transfer of control, in IEEE
Transactions on Software Engineering, 14(7):1049–1053, July 1988.
116. M. Shaw, Cost and Effort Estimation. CPSC451 Lecture Notes. The Univer-
sity of Calgary, 1995.
117. M. Shaw, D. Garlan, Software Architecture: Perspectives on an Emerging
Discipline. Prentice-Hall, 1996.
118. M. Shaw, R. DeLine, D.V . Klein, T.L. Ross, D.M. Young, G. Zelesnik,
Abstractions for Software Architecture and Tools to Support Them. IEEE
Transactions on Software Engineering, 21(4), 1995.
119. M. Shooman, Software Engineering, Design, Reliability, and Management,
Mc-Graw-Hill, New York, 1983.
120. N.D. Singpurwalla, and S.P. Wilson, Statistical Methods in Software Engi-
neering: Reliability and Risk. Springer Verlag, New York, NY , 1999.
121. J. Solano-Soto and L. Sucar, A Methodology for Reliable System Design. In
Lecture Notes in Computer Science, V olume 2070, pp. 734–745. Springer,
2001.
122. The Unified Modeling Language (UML), http://www.uml.org.
123. J. Torres-Toledano and L. Sucar, Bayesian Networks for Reliability Analysis
of Complex Systems. In Lecture Notes in Artificial Intelligence 1484.
Springer Verlag, 1998.
124. K. Trivedi, Probability and Statistics with Reliability, Queueing, and Com-
puter Science Applications, 2nd Edition, Wiley-Interscience, 2001.
125. USC Center for Software Engineering, Guidelines for Model-Based (System)
Architecting and Software Engineering, http://sunset.usc.edu/research/
MBASE, 2003.
126. M. Vardi, Automatic verification of probabilistic concurrent finite-state pro-
grams. In Proceedings of FOCS’85, pages 327–338. IEEE Press, 1987.
239
127. A. van der Hoek, M. Rakic, R. Roshandel, N. Medvidovic, Taming Architec-
ture Evolution, in Proceedings of the Sixth European Software Engineering
Conference (ESEC) and the Ninth ACM SIGSOFT Symposium on the Founda-
tions of Software Engineering (FSE-9), Vienna, Austria, 2001.
128. R. van Ommering, Building Product Populations with Software Components,
in Proceedings of the 24th International Conference on Software Engineering
(ICSE2002), Orlando, Florida.
129. A.J Viterbi, Error Bounds for Convolutional Codes and An Asymptotically
Optimal Decoding Algorithm, IEEE Transactions on Information Theory,
13:260–269, 1967.
130. S. Yacoub, B. Cukic, and H. Ammar, Scenario-Based Analysis of Component-
based Software. In Proceedings of the Tenth International Symposium on Soft-
ware Reliability Engineering, Boca Raton, FL, November 1999.
131. S. Yamada, Software Reliability Models and Their Applications: A Survey,
International Seminar on Software Reliability of Man-Machine Systems,
Kyoto, Japan 2000.
132. S. Yamada, M. Ohba, and S. Osaki, S-Shaped Reliability Growth Modeling
for Software Error Detection. in IEEE Transactions on Reliability, R-
32(5):475-485, December 1983.
133. D.M. Yellin, R.E. Strom, Protocol Specifications and Component Adaptors,
ACM Transactions on Programming Languages and Systems, V ol. 19, No. 2,
1997.
134. W. Wang, Y . Wu, M. Chen, An Architecture-based Software Reliability
Model, in Proceedings of Pacific Rim International Symposium on Depend-
able Computing, 1999, pp. 143-150.
135. M. West and P.J. Harrison, Bayesian Forecasting and Dynamic Models, 2nd
edn. Springer-Verlag, New York, 1997.
136. A.M. Zaremski, J.M. Wing, Specification Matching of Software Components,
ACM Transactions on Software Engineering and Methodology, 6(4):333–369,
1997.
240
Appendix A: Mae Schemas for Quartet Models
This appendix contains the three xADL schemas that describe the static behaviors, dynamic
behaviors, and interaction protocol views of the Quartet model.
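For orientation, the fragment below is a minimal sketch of how a single xADL document could bind the three Quartet namespaces side by side before using their elements. It is a hypothetical illustration, not an excerpt from the Mae implementation: the namespace URIs and the static/dynamic prefixes are taken from the schemas that follow, while the instance:xArch root element, the protocols prefix, and the elided content are assumptions made only for this sketch.
<!-- Hypothetical xADL document header (illustration only); element content elided. -->
<instance:xArch
    xmlns:instance="http://www.ics.uci.edu/pub/arch/xarch/instance.xsd"
    xmlns:static="http://sunset.usc.edu/~softarch/schemas/staticbeh.xsd"
    xmlns:dynamic="http://sunset.usc.edu/~softarch/schemas/dynamicbeh.xsd"
    xmlns:protocols="http://sunset.usc.edu/~softarch/schemas/protocols.xsd">
  <!-- component type definitions carrying static behavior, dynamic behavior,
       and interaction protocol extensions would appear here -->
  ...
</instance:xArch>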
Static Behaviors Schema
<!--
 * Copyright (c) 2003-2004 University of Southern California.
* All rights reserved.
*
* This software was developed at the University of Southern California.
*
* Redistribution and use in source and binary forms are permitted
* provided that the above copyright notice and this paragraph are
* duplicated in all such forms and that any documentation,
* advertising materials, and other materials related to such
* distribution and use acknowledge that the software was developed
* by the University of Southern California. The name of the
* University may not be used to endorse or promote products derived
* from this software without specific prior written permission.
* THIS SOFTWARE IS PROVIDED ``AS IS'' AND WITHOUT ANY EXPRESS OR
* IMPLIED WARRANTIES, INCLUDING, WITHOUT LIMITATION, THE IMPLIED
* WARRANTIES OF MERCHANTIBILITY AND FITNESS FOR A PARTICULAR PURPOSE.
*
-->
<xsd:schema
    xmlns:archinstance="http://www.ics.uci.edu/pub/arch/xarch/instance.xsd"
    xmlns:archversions="http://www.ics.uci.edu/pub/arch/xArch/versions.xsd"
    xmlns:archtypes="http://www.ics.uci.edu/pub/arch/xArch/types.xsd"
    xmlns:archbool="http://www.ics.uci.edu/pub/arch/xArch/boolguard.xsd"
    xmlns:menage="http://www.ics.uci.edu/pub/arch/xArch/menage.xsd"
    xmlns:xlink="http://www.w3.org/1999/xlink"
    xmlns:xsd="http://www.w3.org/2001/XMLSchema"
    xmlns="http://sunset.usc.edu/~softarch/schemas/staticbeh.xsd"
    elementFormDefault="qualified"
    attributeFormDefault="qualified">
  <xsd:import schemaLocation="http://www.isr.uci.edu/projects/xarchuci/ext/types.xsd"/>
  <xsd:annotation>
    <xsd:documentation>
      xArch Type XML Schema 1.0
      Change Log:
        2003-3-10: Roshanak Roshandel [roshande@usc.edu]
        Transiting from the C2 schema to the static behavioral schema
    </xsd:documentation>
  </xsd:annotation>
  <!-- element and complexType declarations (referencing types such as
       InterfaceElement and archinstance:Description) omitted -->
</xsd:schema>
Dynamic Behaviors Schema
<xsd:schema
    xmlns:archinstance="http://www.ics.uci.edu/pub/arch/xarch/instance.xsd"
    xmlns:archversions="http://www.ics.uci.edu/pub/arch/xArch/versions.xsd"
    xmlns:archtypes="http://www.ics.uci.edu/pub/arch/xArch/types.xsd"
    xmlns:archbool="http://www.ics.uci.edu/pub/arch/xArch/boolguard.xsd"
    xmlns:menage="http://www.ics.uci.edu/pub/arch/xArch/menage.xsd"
    xmlns:xlink="http://www.w3.org/1999/xlink"
    xmlns:xsd="http://www.w3.org/2001/XMLSchema"
    xmlns:static="http://sunset.usc.edu/~softarch/schemas/staticbeh.xsd"
    xmlns="http://sunset.usc.edu/~softarch/schemas/dynamicbeh.xsd"
    elementFormDefault="qualified"
    attributeFormDefault="qualified">
  <xsd:import schemaLocation="http://www.isr.uci.edu/projects/xarchuci/ext/types.xsd"/>
  <xsd:import schemaLocation="sunset.usc.edu/~softarch/schemas/staticbeh.xsd"/>
  <xsd:annotation>
    <xsd:documentation>
      Dynamic Behaviors XML Schema 1.0
      Change Log:
        2003-10-1: Roshanak Roshandel [roshande@usc.edu]
        Initial development
    </xsd:documentation>
  </xsd:annotation>
  <!-- element and complexType declarations omitted -->
</xsd:schema>
Interaction Protocol Schema
<xsd:schema
    xmlns:archinstance="http://www.ics.uci.edu/pub/arch/xarch/instance.xsd"
    xmlns:archversions="http://www.ics.uci.edu/pub/arch/xArch/versions.xsd"
    xmlns:archtypes="http://www.ics.uci.edu/pub/arch/xArch/types.xsd"
    xmlns:archbool="http://www.ics.uci.edu/pub/arch/xArch/boolguard.xsd"
    xmlns:menage="http://www.ics.uci.edu/pub/arch/xArch/menage.xsd"
    xmlns:xlink="http://www.w3.org/1999/xlink"
    xmlns:xsd="http://www.w3.org/2001/XMLSchema"
    xmlns:static="http://sunset.usc.edu/~softarch/schemas/staticbeh.xsd"
    xmlns:dynamic="http://sunset.usc.edu/~softarch/schemas/dynamicbeh.xsd"
    xmlns="http://sunset.usc.edu/~softarch/schemas/protocols.xsd"
    elementFormDefault="qualified"
    attributeFormDefault="qualified">
  <xsd:import schemaLocation="http://www.isr.uci.edu/projects/xarchuci/ext/types.xsd"/>
  <xsd:import schemaLocation="sunset.usc.edu/~softarch/schemas/staticbeh.xsd"/>
  <xsd:import schemaLocation="sunset.usc.edu/~softarch/schemas/dynamic.xsd"/>
  <xsd:annotation>
    <xsd:documentation>
      Interaction Protocols XML Schema 1.0
      Change Log:
        2003-10-1: Roshanak Roshandel [roshande@usc.edu]
        Initial development
    </xsd:documentation>
  </xsd:annotation>
  <!-- element and complexType declarations (referencing type="ProtocolModel") omitted -->
</xsd:schema>
Appendix B: Sample Matlab Code for Component Reliability Estimation
In this appendix, we present the sample Matlab code used to estimate the reliability
of the SCRover’s Controller component:
%Copyright University of Southern California
% Roshanak Roshandel (roshande@usc.edu)
%--------------------------------------------
% This code is used to calculate the component-level reliability using HMM
% This is an adaptation for the Controller component
flag = 0;   % use HMM
%flag = 1;  % use MM (operation profile known)
numInterf = 5;    % number of observations (i.e., component interfaces)
numNStates = 4;   % number of states in the dynamic behavioral model
numIter = 100;    % number of iterations for the HMM algorithm
% Training data generation
num = 200;        % number of sequences
length = 100;     % length of each sequence
if (flag == 1)  % operation profile known
    numIter = 1;
    A(:,:,1) = [0.1503, 0.2858, 0.2830, 0.2809;
                0.4708, 0.2653, 0.0099, 0.2540;
                0.2263, 0.0984, 0.3308, 0.3445;
                0.2204, 0.3539, 0.1998, 0.2258];
end
%--------------------------------------------
%training data generation
%--------------------------------------------
if (flag == 0)
    length = 50;
    numSeq = 100;
    % If using stochastic initialization, the training data is generated randomly;
    % if non-random, initialize these three matrices based on domain expertise.
    prior0 = normalise(rand(numNStates,1));
    transmat0 = mk_stochastic(rand(numNStates,numNStates));
    obsmat0 = mk_stochastic(rand(numNStates,numInterf));
    data = dhmm_sample(prior0, transmat0, obsmat0, length, numSeq);
end
for n=1:numIter
    if (flag == 0)
        %--------------------------------------------
        % Expectation-Maximization
        %--------------------------------------------
        % Start from a random model
        prior1 = normalise(rand(numNStates,1));
        transmat1 = mk_stochastic(rand(numNStates,numNStates));
        obsmat1 = mk_stochastic(rand(numNStates,numInterf));
        % Improve the guess of the parameters using EM
        [LL, prior2, transmat2, obsmat2] = dhmm_em(data, prior1, transmat1, obsmat1, 'max_iter', 100);
        % Use the trained model to compute the log likelihood of the data
        loglik = dhmm_logprob(data, prior2, transmat2, obsmat2);
        loglikArr(n) = loglik;
        A(:,:,n) = transmat2;
    end
    %--------------------------------------------
    % Reliability Estimation
    %--------------------------------------------
    % Number of failure states (obtained from architectural analysis)
    numFStates = 2;
    % Probability of failure occurrence (assigned by the domain expert);
    % one entry is needed from every state in the model to every failure state (Si x Fi)
    PF = [0.05 0
          0.05 0
          0.05 0.02
          0.05 0.02];
    % Probability of recovery from defects (obtained via the cost framework); PR is Fi x Si
    PR = [0.7975  0 0 0
          0.71125 0 0 0];
    % A is the transition matrix: transmat2 obtained from EM when flag == 0,
    % or the known operation profile specified above when flag == 1
    init = zeros(numNStates+numFStates);
    AA(:,:,n) = init;
    sumFi = sum(PF');   % total probability of failure out of each state
    sumRi = sum(PR');   % total probability of recovery out of each failure state
    % Build the augmented transition matrix AA: scale the original transitions by the
    % probability of not failing, add failure transitions (PF), recovery transitions (PR),
    % and self-loops in the failure states with the residual probability
    for kk=1:numNStates
        for ff=1:numNStates
            AA(kk,ff,n) = (1-sumFi(kk)).*A(kk,ff,n);
        end
        for ff=1:numFStates
            AA(kk,numNStates+ff,n) = PF(kk,ff);
        end
    end
    for kk=1:numFStates
        for ff=1:numNStates
            AA(numNStates+kk,ff,n) = PR(kk,ff);
        end
        for ff=1:numFStates
            AA(numNStates+ff,numNStates+ff,n) = 1-sumRi(ff);
        end
    end
    % Find the steady-state behavior by raising the augmented matrix to a large power
    ProbMat(:,:,n) = AA(:,:,n)^200;
    % Find the probability of not being in one of the failure states
    sumF = 0;
    for ff=1:numFStates
        sumF = sumF + ProbMat(1,numNStates+ff,n);
    end
    Rel(n) = 1-sumF;
end
Reliability = mean(Rel)
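The reliability computation performed inside the loop above can be restated compactly in matrix form. The symbols below ($n_S$, $n_F$, $\phi$, $\rho$, $\tilde{A}$, $\Pi$) are introduced here purely as shorthand for numNStates, numFStates, sumFi, sumRi, AA, and ProbMat in the code:

$$
\tilde{A} \;=\;
\begin{pmatrix}
\operatorname{diag}(1-\phi)\,A & P_F \\
P_R & \operatorname{diag}(1-\rho)
\end{pmatrix},
\qquad
\phi_i = \sum_{f=1}^{n_F} P_F(i,f),
\qquad
\rho_f = \sum_{j=1}^{n_S} P_R(f,j)
$$

$$
\Pi \;\approx\; \tilde{A}^{\,200},
\qquad
R \;\approx\; 1 \;-\; \sum_{f=1}^{n_F} \Pi\bigl(1,\; n_S + f\bigr)
$$

That is, the HMM transition matrix $A$ is augmented with failure and recovery transitions, the augmented chain is run to an approximate steady state, and the component reliability is the probability of not occupying a failure state.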
Appendix C: SCRover Bayesian Network Generated by Netica
// ~->[DNET-1]->~
// File created by RoshandelR at USouthCalifornia using Netica 3.17 on Jul 22, 2006 at 10:07:42.
bnet SCRover_step3_2 {
numdimensions = 1;
autoupdate = TRUE;
whenchanged = 1153585020;
visual V1 {
defdispform = LABELBOX;
nodelabeling = NAMETITLE;
NodeMaxNumEntries = 50;
nodefont = font {shape= "Arial"; size= 10;};
linkfont = font {shape= "Arial"; size= 9;};
windowposn = (22, 22, 772, 475);
resolution = 72;
drawingbounds = (1380, 934);
showpagebreaks = FALSE;
usegrid = TRUE;
gridspace = (6, 6);
NodeSet Node {BuiltIn = 1; Color = 0xc0c0c0;};
NodeSet Nature {BuiltIn = 1; Color = 0xf8eed2;};
NodeSet Deterministic {BuiltIn = 1; Color = 0xd3caa6;};
NodeSet Finding {BuiltIn = 1; Color = 0xc8c8c8;};
NodeSet Constant {BuiltIn = 1; Color = 0xffffff;};
NodeSet ConstantValue {BuiltIn = 1; Color = 0xffffb4;};
NodeSet Utility {BuiltIn = 1; Color = 0xffbdbd;};
NodeSet Decision {BuiltIn = 1; Color = 0xdee8ff;};
NodeSet Documentation {BuiltIn = 1; Color = 0xff8000;};
NodeSet Title {BuiltIn = 1; Color = 0xffffff;};
PrinterSetting A {
margins = (1270, 1270, 1270, 1270);
landscape = FALSE;
magnify = 1;
};
};
node init {
kind = NATURE;
discrete = FALSE;
chance = DETERMIN;
levels = (0, 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1, 1);
parents = ();
probs =
// 0 to 0.1  0.1 to 0.2  0.2 to 0.3  0.3 to 0.4  0.4 to 0.5  0.5 to 0.6  0.6 to 0.7  0.7 to 0.8  0.8 to 0.9  0.9 to 1  1
(0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0);
whenchanged = 1153418644;
visual V1 {
center = (384, 30);
height = 15;
};
};
node R_Controller {
kind = UTILITY;
discrete = FALSE;
measure = RATIO;
chance = DETERMIN;
parents = ();
whenchanged = 1153584980;
visual V1 {
center = (90, 126);
height = 7;
};
};
node S1_Controller {
kind = NATURE;
discrete = FALSE;
chance = DETERMIN;
levels = (0, 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1, 1);
parents = (S3_Controller, S2_Estimator, S3_Estimator, init,
R_Controller);
equation = "S1_Controller (R_Controller, S3_Controller, S2_Estimator,
S3_Estimator) = \n\
R_Controller*S3_Controller*S2_Estimator*S3_Estimator";
delays = (
(1),
(1),
(1),
(0),
(0));
whenchanged = 1153584970;
visual V1 {
center = (120, 174);
height = 4;
link 2 {
path = ((398, 488), (198, 288), (126, 185));
};
link 3 {
path = ((622, 506), (522, 384), (139, 185));
};
};
};
node R_Estimator {
kind = UTILITY;
discrete = FALSE;
measure = RATIO;
chance = DETERMIN;
parents = ();
whenchanged = 1153585012;
visual V1 {
center = (312, 120);
height = 16;
};
};
node S1_Estimator {
kind = NATURE;
discrete = FALSE;
chance = DETERMIN;
levels = (0, 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1, 1);
parents = (S2_Estimator, S3_Estimator, init, R_Estimator);
equation = "S1_Estimator (R_Estimator, S2_Estimator, S3_Estimator) =
R_Estimator*S2_Estimator*S3_Estimator\n\
";
delays = (
(1),
(1),
(0),
(0));
whenchanged = 1153585017;
visual V1 {
center = (378, 150);
height = 8;
link 1 {
path = ((407, 488), (402, 444), (378, 161));
};
link 2 {
path = ((624, 506), (516, 348), (384, 161));
};
};
};
node R_Actuator {
kind = UTILITY;
discrete = FALSE;
measure = RATIO;
chance = DETERMIN;
parents = ();
whenchanged = 1153584993;
visual V1 {
center = (654, 66);
height = 10;
};
};
node S1_Actuator {
kind = NATURE;
discrete = FALSE;
chance = DETERMIN;
levels = (0, 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1, 1);
parents = (S2_Actuator, init, R_Actuator);
equation = "S1_Actuator (R_Actuator, S2_Actuator) =
R_Actuator*S2_Actuator\n";
delays = (
(1),
(0),
(0));
whenchanged = 1153585020;
visual V1 {
center = (594, 120);
height = 9;
link 1 {
path = ((704, 266), (660, 246), (599, 131));
};
};
};
node S2_Controller {
kind = NATURE;
discrete = FALSE;
chance = DETERMIN;
levels = (0, 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1, 1);
parents = (S1_Controller, S2_Estimator, S3_Estimator);
probs =
// 0 to 0.1  0.1 to 0.2  0.2 to 0.3  0.3 to 0.4  0.4 to 0.5  0.5 to 0.6  0.6 to 0.7  0.7 to 0.8  0.8 to 0.9  0.9 to 1  1
// S1_Controller S2_Estimator S3_Estimator
();
numcases = 10;
equation = "S2_Controller (S1_Controller, S2_Estimator, S3_Estimator) =
\n\
S1_Controller*S2_Estimator*S3_Estimator\n\
";
delays = (
(0),
(1),
(1));
whenchanged = 1153423839;
visual V1 {
center = (54, 240);
height = 17;
link 3 {
path = ((611, 506), (354, 372), (76, 251));
};
};
};
node S3_Controller {
kind = NATURE;
discrete = FALSE;
chance = DETERMIN;
levels = (0, 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1, 1);
parents = (S2_Controller, S1_Actuator);
probs =
// 0 to 0.1  0.1 to 0.2  0.2 to 0.3  0.3 to 0.4  0.4 to 0.5  0.5 to 0.6  0.6 to 0.7  0.7 to 0.8  0.8 to 0.9  0.9 to 1  1
// S2_Controller S1_Actuator
(); // 1 1 ;
numcases = 10;
equation = "S3_Controller (S2_Controller, S1_Actuator) =
S2_Controller*S1_Actuator\n";
EqnDirty = TRUE;
whenchanged = 1153423948;
visual V1 {
center = (240, 246);
height = 11;
};
};
node S2_Estimator {
kind = NATURE;
discrete = FALSE;
chance = DETERMIN;
levels = (0, 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1, 1);
parents = (S1_Estimator, S2_Controller);
probs =
// 0 to 0.1  0.1 to 0.2  0.2 to 0.3  0.3 to 0.4  0.4 to 0.5  0.5 to 0.6  0.6 to 0.7  0.7 to 0.8  0.8 to 0.9  0.9 to 1  1
// S1_Estimator S2_Controller
(); // 1 1 ;
numcases = 10;
equation = "S2_Estimator (S1_Estimator, S2_Controller) =
S1_Estimator*S2_Controller\n";
EqnDirty = TRUE;
whenchanged = 1153423890;
visual V1 {
center = (408, 498);
height = 14;
link 1 {
path = ((379, 161), (420, 450), (410, 488));
};
link 2 {
path = ((63, 251), (228, 420), (385, 488));
};
};
};
node S3_Estimator {
kind = NATURE;
discrete = FALSE;
chance = DETERMIN;
levels = (0, 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1, 1);
parents = (S1_Estimator, S1_Controller, S2_Controller);
probs =
// 0 to 0.1  0.1 to 0.2  0.2 to 0.3  0.3 to 0.4  0.4 to 0.5  0.5 to 0.6  0.6 to 0.7  0.7 to 0.8  0.8 to 0.9  0.9 to 1  1
// S1_Estimator S1_Controller S2_Controller
(); // 1 1 1 ;
numcases = 10;
equation = "S3_Estimator (S1_Estimator, S1_Controller, S2_Controller) = \
S1_Estimator*S1_Controller*S2_Controller\n\
";
EqnDirty = TRUE;
whenchanged = 1153423905;
visual V1 {
center = (630, 516);
height = 12;
link 1 {
path = ((382, 161), (504, 420), (617, 506));
};
link 2 {
path = ((131, 185), (474, 480), (587, 506));
};
link 3 {
path = ((71, 251), (348, 408), (604, 506));
};
};
};
node F_Signature_Controller {
kind = NATURE;
discrete = FALSE;
chance = DETERMIN;
levels = (0, 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1, 1);
parents = (S1_Controller, S2_Controller);
probs =
// 0 to 0.1  0.1 to 0.2  0.2 to 0.3  0.3 to 0.4  0.4 to 0.5  0.5 to 0.6  0.6 to 0.7  0.7 to 0.8  0.8 to 0.9  0.9 to 1  1
// S1_Controller S2_Controller
(); // 1 1 ;
numcases = 10;
equation = "F_Signature_Controller (S1_Controller, S2_Controller) =
S1_Controller*S2_Controller\n";
EqnDirty = TRUE;
whenchanged = 1153423869;
visual V1 {
center = (210, 504);
height = 6;
};
};
node F_Protocol_Estimator {
kind = NATURE;
discrete = FALSE;
chance = DETERMIN;
levels = (0, 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1, 1);
parents = (S1_Estimator);
probs =
// 0 to 0.1  0.1 to 0.2  0.2 to 0.3  0.3 to 0.4  0.4 to 0.5  0.5 to 0.6  0.6 to 0.7  0.7 to 0.8  0.8 to 0.9  0.9 to 1  1
// S1_Estimator
(); // 1 ;
numcases = 10;
equation = "F_Protocol_Estimator (S1_Estimator) = (S1_Estimator) \n";
EqnDirty = TRUE;
whenchanged = 1153423878;
visual V1 {
center = (522, 552);
height = 13;
};
};
node S2_Actuator {
kind = NATURE;
discrete = FALSE;
chance = DETERMIN;
levels = (0, 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1, 1);
parents = (S1_Actuator);
probs =
// 0 to 0.1  0.1 to 0.2  0.2 to 0.3  0.3 to 0.4  0.4 to 0.5  0.5 to 0.6  0.6 to 0.7  0.7 to 0.8  0.8 to 0.9  0.9 to 1  1
// S1_Actuator
(); // 1 ;
numcases = 10;
equation = "S2_Actuator (S1_Actuator) = S1_Actuator\n";
EqnDirty = TRUE;
whenchanged = 1153423989;
visual V1 {
center = (726, 276);
height = 1;
link 1 {
path = ((602, 131), (684, 222), (719, 266));
};
};
};
node F_PrePostCond_Actuator {
kind = NATURE;
discrete = FALSE;
chance = DETERMIN;
levels = (0, 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1, 1);
parents = (S1_Actuator);
probs =
// 0 to 0.1  0.1 to 0.2  0.2 to 0.3  0.3 to 0.4  0.4 to 0.5  0.5 to 0.6  0.6 to 0.7  0.7 to 0.8  0.8 to 0.9  0.9 to 1  1
// S1_Actuator
(); // 1 ;
numcases = 10;
equation = "F_PrePostCond_Actuator (S1_Actuator) = (S1_Actuator) \n";
EqnDirty = TRUE;
whenchanged = 1153423998;
visual V1 {
center = (666, 378);
height = 2;
};
};
node F_PrePostCond_Controller {
kind = NATURE;
discrete = FALSE;
chance = DETERMIN;
levels = (0, 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1, 1);
parents = (S2_Controller);
probs =
// 0 to 0.1  0.1 to 0.2  0.2 to 0.3  0.3 to 0.4  0.4 to 0.5  0.5 to 0.6  0.6 to 0.7  0.7 to 0.8  0.8 to 0.9  0.9 to 1  1
// S2_Controller
(); // 1 ;
numcases = 10;
equation = "F_PrePostCond_Controller (S2_Controller) = (S2_Controller) ";
EqnDirty = TRUE;
whenchanged = 1153423855;
visual V1 {
center = (102, 420);
height = 5;
};
};
node S3_Actuator {
kind = NATURE;
discrete = FALSE;
chance = DETERMIN;
levels = (0, 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1, 1);
parents = (S2_Actuator);
probs =
// 0 to 0.1  0.1 to 0.2  0.2 to 0.3  0.3 to 0.4  0.4 to 0.5  0.5 to 0.6  0.6 to 0.7  0.7 to 0.8  0.8 to 0.9  0.9 to 1  1
// S2_Actuator
(); // 1 ;
numcases = 10;
equation = "S3_Actuator (S2_Actuator) = (S2_Actuator) \n";
EqnDirty = TRUE;
whenchanged = 1153423976;
visual V1 {
center = (828, 204);
height = 3;
};
};
};
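The equations attached to the state and failure nodes above are all simple products over (a subset of) each node's parents; for the first state of each component, the component's reliability node (R_Controller, R_Estimator, or R_Actuator) also enters the product. Restated in conventional notation, for example, the equations attached to S1_Controller and S1_Actuator read:

$$
S1_{\mathrm{Controller}} = R_{\mathrm{Controller}} \cdot S3_{\mathrm{Controller}} \cdot S2_{\mathrm{Estimator}} \cdot S3_{\mathrm{Estimator}},
\qquad
S1_{\mathrm{Actuator}} = R_{\mathrm{Actuator}} \cdot S2_{\mathrm{Actuator}}
$$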
Abstract
Modeling and estimating software reliability during testing is useful in quantifying the quality of software systems. However, such measurements applied late in the development process leave little opportunity to improve the quality and dependability of the software system in a cost-effective way. Reliability, an important dependability attribute, is defined as the probability that the system performs its intended functionality under specified design limits. We argue that reliability models must be built to predict system reliability throughout the development process, specifically when the exact context and execution profile of the system are unknown, or when implementation artifacts are unavailable. In the context of software architectures, various techniques for modeling software systems and specifying their functionality have been developed. These techniques enable extensive analysis of the specification but typically lack quantification. Additionally, their relation to the dependability attributes of the modeled software system is unknown.
Asset Metadata
Creator: Roshandel, Roshanak (author)
Core Title: Calculating architectural reliability via modeling and analysis
School: Viterbi School of Engineering
Degree: Doctor of Philosophy
Degree Program: Computer Science
Publication Date: 11/06/2006
Defense Date: 07/18/2005
Publisher: University of Southern California (original), University of Southern California. Libraries (digital)
Tag: architectural modeling, dependability, Mae, OAI-PMH Harvest, reliability, Software Architecture
Language: English
Advisor: Medvidovic, Nenad (committee chair), Boehm, Barry W. (committee member), Golubchik, Leana (committee member), Meshkati, Najmedin (committee member)
Creator Email: roshanak@seattleu.edu
Permanent Link (DOI): https://doi.org/10.25549/usctheses-m126
Unique identifier: UC1167455
Identifier: etd-Roshandel-20061106 (filename), usctheses-m40 (legacy collection record id), usctheses-c127-23021 (legacy record id), usctheses-m126 (legacy record id)
Legacy Identifier: etd-Roshandel-20061106.pdf
Dmrecord: 23021
Document Type: Dissertation
Rights: Roshandel, Roshanak
Type: texts
Source: University of Southern California (contributing entity), University of Southern California Dissertations and Theses (collection)
Repository Name: Libraries, University of Southern California
Repository Location: Los Angeles, California
Repository Email: cisadmin@lib.usc.edu