RELIABILITY ANALYSIS AND OPTIMIZATION IN THE DESIGN OF DISTRIBUTED SYSTEMS

by
Salim Hariri

A Dissertation Presented to the
FACULTY OF THE GRADUATE SCHOOL
UNIVERSITY OF SOUTHERN CALIFORNIA
In Partial Fulfillment of the Requirements for the Degree
DOCTOR OF PHILOSOPHY
(Computer Engineering)

March 1986

UMI Number: DP22758. All rights reserved.

INFORMATION TO ALL USERS
The quality of this reproduction is dependent upon the quality of the copy submitted. In the unlikely event that the author did not send a complete manuscript and there are missing pages, these will be noted. Also, if material had to be removed, a note will indicate the deletion. Published by ProQuest LLC (2014). Copyright in the Dissertation held by the Author. UMI Dissertation Publishing, UMI DP22758. Microform Edition (c) ProQuest LLC. All rights reserved. This work is protected against unauthorized copying under Title 17, United States Code. ProQuest LLC, 789 East Eisenhower Parkway, P.O. Box 1346, Ann Arbor, MI 48106-1346.

UNIVERSITY OF SOUTHERN CALIFORNIA
THE GRADUATE SCHOOL
UNIVERSITY PARK
LOS ANGELES, CALIFORNIA 90089

This dissertation, written by Salim Hariri under the direction of his Dissertation Committee, and approved by all its members, has been presented to and accepted by The Graduate School, in partial fulfillment of requirements for the degree of DOCTOR OF PHILOSOPHY.

Dean of Graduate Studies
Date: March 20, 1986

DISSERTATION COMMITTEE
C. S. Raghavendra, Chairperson

Dedication

To my parents

Acknowledgement

I wish to thank my committee members, Professors C. S. Raghavendra, Melvin Breuer, and Francesco Parisi-Presicce, for serving on my dissertation committee. I owe a special debt to my advisor, Dr. Raghavendra, who has been a constant source of encouragement and a patient guide during my study at USC. I would like to express my gratitude for his invaluable encouragement, motivation, and friendship. I would also like to thank Professor V. K. Prasanna Kumar for many useful discussions on certain parts of my dissertation. I was very fortunate in my graduate career to meet some good friends who were selfless in their effort to help and encourage me, especially at the difficult times. I take this opportunity to express my appreciation to Anujan Varma and Pat Naulty for their support and friendship. Also, I thank Pat for her kind offer to edit this dissertation. I also express my appreciation to Damascus University and the Agency for International Development for their joint fellowship that supported my graduate program.

Contents

1 Introduction
  1.1 Benefits of Distributed Systems
  1.2 Objectives of this Research
  1.3 Overview of the Dissertation

2 Terminal Reliability Analysis
  2.1 Background
  2.2 Path Enumeration
  2.3 Derivation of SYREL
  2.4 An Illustrative Example
  2.5 Concluding Remarks and Comparison
3 Reliability Measures in Distributed Systems
  3.1 Introduction
  3.2 Reliability Measures
  3.3 Distributed Program Reliability Algorithm
    3.3.1 MFST Algorithm
  3.4 Distributed System Reliability Algorithm
  3.5 Sensitivity Analysis
    3.5.1 Introducing Redundancy in the Computers Running a Task
    3.5.2 Redundancy in Programs and Files
  3.6 An Illustrative Example
    3.6.1 Evaluating DPR for PRG1
    3.6.2 Evaluating DSR for all programs

4 Terminal Reliability Optimization
  4.1 Introduction
  4.2 Optimization Problem Formulation
  4.3 Reliability Optimization Algorithms
    4.3.1 Derivation of Algorithm 1
    4.3.2 Derivation of Algorithm 2
    4.3.3 Derivation of Algorithm 3
  4.4 Illustrative Examples

5 Distributed Functions Allocation for Reliability and Delay Optimization
  5.1 Introduction
  5.2 Reliability and Delay Analysis
    5.2.1 Approximating Task's Reliability
    5.2.2 Delay Analysis
  5.3 Decomposing the Optimization Problem
  5.4 Constructing a Compound Objective Function
  5.5 An Illustrative Example

6 Conclusions and Directions for Future Research
  6.1 Summary and Conclusions
  6.2 Directions for Future Research

List of Figures

2.1 A 4-node network.
2.2 The connection matrix of Figure 2.1.
2.3 The matrix corresponding to X + X^T.
2.4 A 6-node computer network with 13 paths.
2.5 An 8-node computer network with 24 paths.
3.1 A distributed system with an allocation of its resources.
3.2 A tree that runs all the three programs.
3.3 Reliability improvements for triple redundancy.
3.4 Reliability improvements due to programs and files.
3.5 A distributed system with an allocation of its four programs and their required files.
3.6 MFSTs for PRG1.
3.7 MFSFs for all programs.
4.1 Derivations of the contributing functions.
4.2 A network with 3 paths.
4.3 A bridge network with 4 paths.
4.4 A modified ARPANET with 13 paths.
5.1 An MFST with the allocation of a task's functions.
5.2 A six-node distributed system.
5.3 Distribution of the functions according to a2.
5.4 Distribution of the functions according to a5.

List of Tables

2.1 Comparing the execution time of SYREL with other algorithms.
2.2.a The disjoint procedure is called 3 times (out of 13) for Figure 2.4.
2.2.b The disjoint procedure is called 10 times (out of 24) for Figure 2.5.
4.1 The initial reliabilities for Type 3.
4.2 The constant values for Type 4.
4.3 The optimal distribution of the budget when Algorithm 2 is applied to Figure 4.2.
4.4 The optimal distribution of the budget when Algorithm 3 is applied to Figure 4.2.
4.5 The optimal solution obtained from applying Algorithm 2 to Figure 4.3.
4.6 The optimal solution obtained from applying Algorithm 3 to Figure 4.3.
4.7 The optimal solution obtained from applying Algorithm 2 to Figure 4.4.
4.8 The optimal solution obtained from applying Algorithm 3 to Figure 4.4.
5.1 TPDs of the allocations resulting from applying FAP.

List of Symbols and Notations

cf_k : the contributing function of an element e_k.
CF_k : the approximate contributing function of e_k.
E_i (Ē_i) : the event in which all the elements of S_i are up (down).
e_k : an element of a graph, which can be either a node or a link.
E_{j|i} : the event in which all the elements of S_{j|i} are operational.
EV_t (EV_f) : the set of edges incident on V_t (V_f).
fm(t) : the set of missing files that are not available in t and are needed to execute the task under consideration.
fm(f; x_j) : the set of missing files needed to execute the programs that run at x_j.
FA_i : the set of files available at x_i.
FN_i : the set of files (F's) needed to execute PRG_i.
FSF : a forest that connects the root nodes where np programs run to some other nodes such that each program can access all the files needed for executing the programs on that node.
FST : a tree that connects the root node where a program runs to some other nodes such that its vertices hold all the files needed for that program's execution.
MFST (MFSF) : a minimal tree (forest) such that there exists no other FST (FSF) which is a subset of it.
nf : the number of functions in a distributed task.
np : the number of distributed programs that must run in order to consider a system operational.
nc(a_i) : the number of trees covering a task for an allocation a_i.
N_t : the number of MFSTs associated with an allocation of a task's functions.
p_k (q_k) : the probability of element e_k being up (down).
P_i : a simple path between a pair of terminal nodes. A simple path is a set of edges and nodes which results in a path between a pair of terminal nodes such that no proper subset of it results in a path.
PA_i : the set of nodes at which program i (PRG_i) could run.
S_i : the index set corresponding to the elements of path P_i.
S_{i|j} : a conditional set that corresponds to all the elements in set S_i but not in S_j, i.e., S_{i|j} = { k : k ∈ S_i and k ∉ S_j }.
t (f) : a tree (forest) of the graph representing a distributed system.
V_t (V_f) : the set of vertices included in t (f).
U_k : the set of all simple paths that contain e_k, i.e., U_k = { P_j : e_k ∈ P_j }.

Chapter 1
Introduction

Distributed computing is made feasible by the price-performance revolution in microelectronics on the one hand and the development of efficient and cost-effective communication structures on the other. A computing system that interconnects a set of inexpensive processing elements has become more attractive than a large and complex multiprogrammed uniprocessor system. The processors can cooperate, exchange messages through a communications network, and behave as a single computer, which is called a distributed system. Distributed systems are studied in [CHU 69, ABRA 73, MARI 79, THUR 79, LAMP 81, STAN 84], and there are many different definitions of these systems. Enslow [ENSL 78] capitalizes on the distribution concept and contends that a distributed system must have distributed processors, distributed data base, and distributed system control. Others describe them in terms of their topology and the types of communication concepts [THUR 79]. In this thesis, we define a distributed system as a collection of processor-memory pairs connected by a communications network and logically integrated by a distributed operating system [KLEI 85, STAN 85].

1.1 Benefits of Distributed Systems

The major benefits of distributed systems include increased performance, extensibility, resource sharing, and reliability [LAMP 81]. The performance is enhanced because of the cooperation of several processing elements on a single activity via a decentralized technique that avoids the contention and bottlenecks that exist in uniprocessor and multiprocessor systems.
Short response time and high throughput can also be achieved through partitioning the computing functions into tasks, such that each can run on different processing elements. Extensibility is an important aspect of distributed systems because of their ability to adapt to any changing environment without disturbing their functioning. Also, the set of services provided by distributed systems can be expanded with an incremental cost; one may add more processing elements to the existing system or replace existing processors with functionally identical but better performing elements. One of the main advantages of using a communications network is the ability to share resources, such as physical devices, files, programs, data, and peripheral facilities.

Technological advances have helped build components which are more and more reliable. However, the probability that a fault or an error occurs is non-zero for any physical device. Generally speaking, the reliability of a system that includes a single critical resource is low, since it depends directly on whether or not that single resource is operational. Most reliability mechanisms developed for conventional systems rely on redundant hardware (such as triple modular redundancy), redundant software (for example, running different versions of a procedure in parallel on different computers), and redundant data. Potential reliability improvements in distributed systems are possible due to the ease of introducing redundancy in their resources, and the ability of mutual inspection of hosts and communication processors.

1.2 Objectives of this Research

The applications of distributed systems include on-line transaction processing systems, such as fund transfers and airline reservations, airborne systems, real-time industrial controls, life support functions, and nuclear plant supervision. This explosive growth of applications makes the reliability of distributed systems an inevitable design issue. While all aspects of distributed systems are significant, this thesis concentrates on analyzing and optimizing the reliability in these systems. The reliability of a system can be defined as the degree of tolerance against errors and faults [STAN 85]. Increased reliability can be achieved by either fault avoidance or fault tolerance [AVIZ 78]. The fault avoidance discipline leads to adopting conservative designs and using high-reliability components, while fault tolerance employs error detection and redundancy to handle faults. Since components do fail, the reliability analysis of a distributed system can be used to determine the probability that a communication path exists between two computers, or the probability that a distributed task (for example, a two-phase commit protocol) or a service provided by a distributed operating system behaves according to its specifications.

The objectives of this thesis are to analyze the reliability in distributed systems and provide efficient algorithms for the following research problems:

1. What is the probability of having two computers communicate with each other, and how can it be evaluated efficiently?

2. If we were given a limited budget and a fixed topology and asked to improve the reliability between two computers, how can we determine the optimal distribution of the budget to improve components' reliabilities?
3. In distributed systems, redundancy is used to improve reliability. A distributed task involves the execution of a set of functions on different remote computers that cooperate in a decentralized manner. How can we measure and evaluate the reliability of that task? Since a user can run a remote task from any site, how can we measure and evaluate the reliability of executing it from a remote site?

4. It is shown that the distribution of resources among the computers influences a task's reliability and performance. How can we distribute the resources of a given distributed system so that the reliability and some performance measure, for example, delay, are optimized?

1.3 Overview of the Dissertation

The organization of the thesis is as follows. In Chapter 2, we first survey the methods for evaluating terminal reliability, which is the probability that a communication path between a given pair of computers is operating. For evaluating this, a simple and efficient algorithm, SYREL, is presented to obtain a compact terminal reliability expression between a pair of computers of a complex network. This algorithm incorporates conditional probability, set theory, and Boolean algebra in a distinct approach in which most of the computations performed are directly executable Boolean operations. The conditional probability is used to avoid applying at each iteration the most time-consuming step in reliability algorithms, which is making a set of events mutually exclusive. The algorithm has been implemented on a VAX-11/750 and is efficient compared to existing reliability algorithms.

In Chapter 3, we study reliability issues in distributed systems where a task is cooperatively executed by a number of computers interconnected via a communications network. The reliability of a distributed system can be described in terms of the reliability of processing elements and communication links and also of the redundancy of programs and data files. The terminal-pair reliability does not capture the redundancy of programs and files in a distributed system. We introduce two reliability measures: distributed program reliability, which describes the probability of the successful execution of a program requiring the cooperation of several computers, and distributed system reliability, which is the probability that all the specified distributed programs for the system are operational. These two reliability measures can be extended to incorporate the effects of user sites on reliability. We develop an efficient unified approach based on breadth-first traversal of a graph to evaluate the proposed reliability measures. We also perform a sensitivity analysis to determine the number of redundant copies of programs and data files needed for a high reliability of operation.

In Chapter 4, we study an interesting optimization problem which maximizes the terminal reliability between a pair of computing elements under a given budget constraint. Analytical techniques to solve this problem are applicable only to special forms of reliability expressions. We present three iterative algorithms for terminal reliability maximization. The first two algorithms require the computation of terminal reliability expressions, and are therefore efficient only for small networks. The third algorithm, which is developed for large distributed systems, does not require the computation of terminal reliability expressions; this algorithm maximizes approximate objective functions and gives accurate results.
Several examples are presented to illustrate the approximate optimization algorithm, and an estimation of the error involved is also given.

In Chapter 5, we present two optimization algorithms for allocating the functions of a given distributed task so that the reliability is maximized and the communication delay is minimized. The allocation of resources, such as computing functions and files, influences the reliability of a task's executions and the delays. Two different approaches are developed to allocate the resources of a given task. In the first method, the problem is reduced to a 0-1 integer linear programming problem and can be solved as two sub-problems. First, we find the set of allocations that maximize the reliability using an approach based on the branch-and-bound technique. Then, standard delay analysis is adopted to assess the average task packet delay associated with each of these allocations in order to choose the optimal solution. In the second approach, we construct a compound objective function that measures both the reliability and the delay, and then we use an optimization algorithm to obtain an optimal or near-optimal distribution of a task's functions. A detailed example is presented to illustrate the optimization methodology.

In Chapter 6, we summarize the main contributions of this dissertation and then discuss directions for further research.

Chapter 2
Terminal Reliability Analysis

One of the fundamental considerations in the design of a distributed system is the reliability of its communications network. This characteristic depends on the topology of the network in addition to the reliability of the individual computer systems and communication channels. The reliability of a communications network has been defined in a number of different ways. A network could be defined operational in the presence of failures if communication paths exist between certain pairs of operative computers. It could also be defined operational if every computer could communicate with a certain percentage of other computers. Based on the definition of an acceptable operation of a communications network, one can consider deterministic and probabilistic reliability criteria.

The deterministic criteria were originally introduced as vulnerability measures for communications networks [WILK 72]. In this study, the destructive force was assumed to be caused by an enemy who had knowledge of the structure of the network. The connectivity of a network is defined as the minimum number of communication links and computer centers that must fail for the network to be disconnected. In the presence of randomly distributed natural forces, a probabilistic approach is used to estimate the probability of service disruption between any pair of operative computers. In this thesis, we use the probabilistic approach to measure the reliability of distributed systems and their communications networks.

2.1 Background

In reliability analysis, it is customary to represent a computer network by a probabilistic graph G(V, E), where V is the set of nodes that represent the computers and E is the set of edges that denote the communication channels. The failures of the links and the nodes are statistically independent. The terminal reliability describes the probability that at least one communication path exists between a source node and a destination node.
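This definition can be checked numerically on very small networks by enumerating all element states, which is essentially the state enumeration approach discussed below. The following sketch is illustrative only and is not part of the dissertation's PASCAL implementation: it computes the terminal reliability of the 4-node network of Figure 2.1 by exhaustive enumeration, using hypothetical link reliability values and assuming perfectly reliable nodes for brevity.

    # Illustrative sketch (not from the dissertation): exhaustive state
    # enumeration of terminal reliability for the 4-node network of
    # Figure 2.1.  Link reliabilities are hypothetical; nodes are assumed
    # perfectly reliable.
    from itertools import product

    links = {  # edge: probability that the edge is up
        (1, 2): 0.9, (1, 3): 0.9, (2, 3): 0.9, (2, 4): 0.9, (3, 4): 0.9,
    }

    def connected(up_edges, source, target):
        # Simple reachability search over the edges that are up.
        frontier, seen = [source], {source}
        while frontier:
            node = frontier.pop()
            for (a, b) in up_edges:
                for nxt in ((b,) if a == node else (a,) if b == node else ()):
                    if nxt not in seen:
                        seen.add(nxt)
                        frontier.append(nxt)
        return target in seen

    def terminal_reliability(source, target):
        edges = list(links)
        total = 0.0
        for states in product([0, 1], repeat=len(edges)):  # all elementary events
            prob = 1.0
            for e, s in zip(edges, states):
                prob *= links[e] if s else 1.0 - links[e]
            if connected([e for e, s in zip(edges, states) if s], source, target):
                total += prob                              # favorable event
        return total

    print(terminal_reliability(1, 4))

Such an enumeration examines 2^m elementary events (32 in this example) and is only practical for very small networks; this limitation is precisely what motivates the path-based algorithm developed later in this chapter.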
When the graph is series-parallel with respect to the terminal nodes, and all nodes are perfectly reliable, the terminal-pair reliability expression can easily be obtained by repeated applications of two simple rules [SHOO 68]. The gen­ eral case of non series-parallel networks with unequal element reliability intro­ duces a much more difficult problem for which several methods exist. A sur­ vey along with a bibliography of the techniques for evaluating the reliability of different types of networks can be found in [HWAN 81]. These methods can be classified as shown below. 1. State enumeration [ARNB 78, BROW 71, FRAT 73, NAKA 77b]. 2. Decomposition technique [MISR 70, ROSE 77, SATY 83, TORR 83]. 3. Graph-theoretic approach [ROSE 77, SATY 78, SATY 82, JOHN 84]. 9 4. P a th enum eration [ABRA 79, AGGA 75, FRA T 73, GRNA 80]. a) Direct expansion of the probability of a union of events. b) Reduction to m utually exclusive events. 5. C utset enum eration [LIN 76, RAI 78]. a) Direct expansion. b) R eduction to m utually exclusive events. In state enum eration methods, all elem entary events are examined to determ ine w hether or not an elem entary event is favorable. An event is favor­ able if at least one path exists in the subgraph corresponding to it. The term i­ nal reliability is the sum m ation of the probability of the favorable events since they are all m utually exclusive. These m ethods are im practical for large net­ works because of the large num ber of events th a t m ust be enum erated, for ex­ ample, in a netw ork w ith 10 components, there are more th an one thousand elem entary events to be considered. In type 2, m ethods are proposed to decompose the complex network into sm aller ones for which the reliability can be easily evaluated. In [KIM 72], the reliability is evaluated by reducing the network into series-parallel sub­ networks. However, this approach is not applicable when nodes as well as comm unication links are unreliable; furtherm ore, not all networks are reduci­ ble to series-parallel sub-networks. In [ROSE 77, SATY 83], the state of the graph is partitioned into two events with respect to the two states of a parti­ tioning element called the keystone element. This decomposition process is applied recursively until the reliability of the partitioned sub-netw orks can be evaluated in a straightforw ard m anner. A pruned tree approach to the relia­ 10 bility com putation was presented in [TORR 83], which is an improvement of the tree algorithm discussed in [NAKA 78]. It also decomposes the system state into two different disjoint states by using one unreliable element as a keystone for the decomposition. The decomposition algorithm s suffer from the following problems: they do not produce compact reliability expressions, the num ber of iterations required grows exponentially w ith the num ber of parti­ tioning elements, and their execution times depend heavily on the m ethod used for selecting the keystone elements. The m ethods of type 3 use graph-theoretic techniques to determine ter­ minal reliability as well as other reliability measures, such as network reliabili­ ty which measures the probability th a t any node is connected to all other nodes. An algorithm th a t evaluates the term inal reliability expression with non-canceling term s is introduced in [SATY 78]. The term s correspond on a one-to-one basis to the p-acyclic subgraphs of the given probabilistic graph. 
A p-acyclie subgraph is an acyclic graph in which every link is in at least one path from the source node to the term inal node. The term inal reliability ex­ pression can be derived directly from the p-acyclic subgraphs. These methods produce non-com pact reliability expressions and their im plem entations tend to be less efficient th an those algorithms based on path and cutset enum eration. The m ethods of types 4.a and and 5.a are im practical for solving medi­ um or large size networks because of the large num ber of term s to be m anipu­ lated; it is of order (2m -1), where m is the num ber of paths or cutsets. The m ethods of types 4.b, and 5.b are the only known practical m ethods for deter­ mining symbolic term inal reliability expressions. 11 In path (cutset) enum eration methods, the term inal reliability (unrelia- bility) expression is obtained by finding the set of possible paths (cutsets) between a pair of term inal nodes and then applying Boolean algebra and pro­ bability theory to modify the set of paths (cutsets) to an equivalent set of mu­ tually exclusive paths. The term inal reliability (unreliability) expression can then be obtained in a straightforw ard m anner from the disjoint set of paths (cutsets). Rai [RAI 78] proposed a m ethod for evaluating term inal reliability which can be applied to both path sets and cutsets. This m ethod uses the ex­ clusive operator to obtain the m utually exclusive term s and applies standard Boolean operators to remove the redundant term s generated during the dis­ joint process. The disjoint process performed on each path is tim e consuming because the redundant term s are removed by hand com putations which make it im practical, especially when the num ber of literals (variables) in a path is large. An efficient algorithm for symbolic reliability analysis was presented in [GRNA 80]. The binary cubes are adopted to represent the set of paths. A $-operation, which is a modified sharp operation, when it is applied to two cubes, such as IP 1 and IP 2, produces all disjoint sub-cubes of IP l not includ­ ed in IP 2. By repeatedly applying this operation on the cubes representing the paths, a set of m utually exclusive paths is obtained. The term inal reliabil­ ity expression can be obtained directly from the resulting cubes. However, the algorithm still applies the disjoint operation in each iteration; furtherm ore, this disjoint process is done using complex operations th a t are not directly exe- 12 cutable Boolean operations. ~ 1 In this chapter, an algorithm called SYREL (a SYmbolic RELiability al­ gorithm based on path and cutset m ethods) is presented which avoids applying the time consuming disjoint process at each iteration and uses conditional pro­ bability to reduce dram atically the num ber of term s involved in the disjoint process, as shown later. SYREL is a simple algorithm sim ilar to the hand com putation of term inal reliability th a t is usually followed by reliability en­ gineers to derive the reliability expression of a simple network. F urther, all the operations perform ed on the set of simple paths are directly executable Boolean operations th at significantly improve SYREL’s execution tim e. It is usually very difficult to compare the execution tim e of the algorithm s on re­ ported examples because of the dependency on the type of language used to implement the algorithm , the com puter used, and the skill of programmers. 
However, applying SYREL, which is w ritten in PASCAL and runs on VAX- 11/750, to the same examples reported in [GRNA 80] has resulted in better ex­ ecution times. 2.2 P a th E n u m era tio n The first step in any term inal reliability algorithm based on path enum eration is obtaining the set of paths between a given pair of term inal nodes. It is a classical graph problem, and algorithm s based on the depth-first r or breadth-first search technique can be used to traverse the graph and enum erate the required set of paths [AHO 74]. A nother interesting approach is based on using the connection m atrix of a graph [BIEG 77, RAI 78, DEVA 13 79]. T he steps o rth e algorithm presented in jBlEG 77] are simple and will be explained through an example. * 2,4 * 2,3 * 3 Figure 2.1: A 4-node network. In this method, the nodes of the graph are numbered such th at if a link exists between two nodes, say from a- , to Xj , then the index i is chosen to be less than j ; as in Figure 2.1. The connection matrix of the network is shown in Figure 2.2. ' *1 * 2 *3 x 4 *1 0 x 1,2 x 1,3 0 *2 0 0 * 2,3 * 2,4 1 3 0 0 0 * 3,4 *4 0 0 0 0 Figure 2.2: The connection matrix X of Figure 2.1. 14 fn X , The row Index Is the node From which the arcs leave, and the column index is the node at which the arcs term inate. Every entry in column k imm ediately precedes every entry in row k . For example, arc i p in column 2 directly precedes arcs x 23 and x 24 in row 2. When X is added to its transpose, X T , the resulting matrix is shown in Figure 2.3. X i x 2 x 3 x 4 * 1 0 x 1,2 x 1,3 0 x 2 x 1,2 0 x 2,3 x 2,4 x 3 ^ 1,3 x 2,3 0 x 3,4 x 4 0 x 2,4 x 3,4 0 Figure 2.3: The matrix corresponding to X + X T . . From the matrix of Figure 2.3, the set of paths is determined as follows: every element in a column above the main diagonal is immediately followed by every element in the same column below the main diagonal. For instance, x 12 is immediately followed by x 23 and x 24, while x 13 or x 23 is followed by x 34 as shown below: X i 2 < % 2 3 ’ - 1 - i * 2 2 4 X 1,3 I 3,4' x 2,3 < ' x 3,4 where x 12 < ^ 2,3 means x 12 is immediately precedes x 23. 15 By separating the precedence statements into a single statement, the following paths are obtained: X 1,2X 2,4’ X 1, 2* 2, 3*3,45 ^ 1, 3* 3,4 The cutsets can also be obtained from the connection matrix or from the set of paths as shown in the next section. In Chapter 3, we develop an al­ gorithm to derive all the trees connecting a subset of nodes, and it can also be used to derive all paths between the designated terminal nodes. 2.3 D eriv a tio n o f SY R E L In this algorithm, a Computer Network (CN) is represented by a proba­ bilistic graph G {V1 E ). We assume that the failures of elements are statisti­ cally independent. In a computer network, several simple paths may exist between a pair of nodes, which are denoted as Pi ’s. The terminal reliability between a pair of nodes is given by: m R = Pr( U E{) (2.1) where E denotes the event in which path P{ is up and m represents the number of paths. The terminal reliability expression can be derived using a simple hand computable method that is based on decomposing the set of paths into another set of mutually exclusive paths. 
For example, the network shown in Figure 2.1 has three paths, namely, P i ~ x i,2-x 2,4’ P 2 = x i,Z'x an<^ 16 P ^ = x x2.x 2,3 -£ 3,4- The term inal reliability expression consists of three term s which correspond to three m utually exclusive events: the first event occurs when P x is up, the second event occurs when P 2 is up and P x is down, and the third event occurs when P 3 is up and both P l and P 2 are down. The term corresponding to the first event is p x 2.p 24. In the second event, P 2 is up, which results in x j 3 andx34 being operational, and P 1 is down, which occurs when either of links x l2 or x 24 has failed. This event does not occur when both x 12 and x 24 are operational. Hence, the term corresponding to this second event is p P 3 4 - [1 - P 1,2 ^ 2,4] • th ird event occurs only when links x 24 and x x 3 are down in order to fail paths 1 and 2, respectively. As a result, the term corresponding to this event is: P 1,2 P 2,3'P 3,4 [ 9 2 ,4 -9 1,3 ] The term inal reliability expression of this simple netw ork is the sum of the term s corresponding to these three events, i.e., R = P 1,2 P 2,4 + P 1 ,3-P 3 ,4 -[1 ~P 1,2 P 2,4] + P 12 P 2,3'P 3,A I 9 2 ,4 -9 1,3 ] The derivation of the SYREL algorithm , which is a system atic and efficient im plem entation of the simple m ethod discussed above proceeds as fol­ lows. E quation 2.1 can be decomposed into m utually exclusive events as: R = Pr(E l ) + P r { E 2 / \ E ^ ) + . . . + P r ( E m A - E 7 /\- E 7 A . . . , (2.2) 17 This equation can be decomposed using conditional probability as: R = P r ( E l ) + P r ( E 2 ) . P r ( E ^ \ E 2 ) + . . . + Pr ( E m ) . Pr ( £ 7 A E l A • ■ • A E ^ Z I Em ) (2.3) Hence, the term inal reliability expression can be evaluated in term s of two distinct events. The first event indicates th a t a path, say P, , is in the operational state while the second event indicates th a t all the previous paths, which are P 1 ? . . P ,_ i, are in the failure state, given th a t Pi is operational. The term inal reliability algorithm (SYREL) presented here will solve the prob­ lem of computing the probabilities of these events efficiently. The probability of the first event, Pr ( Ei ) can be determined in a straightforw ard m anner. It is equal to the product of the reliabilities of all the elements present in path Pi th a t are denoted by Si , i.e., P r { E i ) = n Pk for all k £ S- i The probability of the second event, Pr (E l /\ • • * A P,-_i I P* )> can be evaluated using conditional probability and standard Boolean operations when it is rew ritten as: F r ( £ 7 / \ • ■ • A £ ^ | £ ; ) = P r ( s 7 j 7 A • ■ • A ) where Ej j , • denotes the event th a t path Py is down given P,- is operating. Let Ej | , • denote the event in which path Py is working, given th a t Pt- is also up. If P,- is known to be operational, then the probability th a t Py is also up depends only on the elements of Py which are not in P ,-. Hence, the 18 event Ej 11 is only a function of those elements In Sj but not in S; . The conditional set Sj | is introduced to identify those elements th a t can fail path Pj given th a t P,- is operational, i.e., S i I * = S i ~ S * = : ek € s j a n d ek £ s i } where ek represents either a node or a link. 
For example, in the network of Figure 2.4, consider the paths P l = x i 3.x 35.x 56 and P 2 = x 1 2.x 2 5.x 5 The conditional set S 1 1 2 is then obtained by applying the exclusive operator (-) as shown below: S 1 12 = & 1 ~ S 2 = { ^ 1,3 » x 3 ,5 ’ ^ 5,6} ~ 1,2 > x 2,5 » 2 7 5,6} == i x 1 ,3 > ^3,5} Let us assume th a t path Pi is up and denotes the set of all condi­ tional sets Sj | , • ’s corresponding to the previous paths of P ,-, i.e., Ui = { Sj | , ■ , for j = 1 ,2 - 1 } If a set 5^ is a subset of 5^ , then Sk absorbs S ( , i. e., sk n S t = sk Also, let Mj denote the set of all minimal (nonabsorbable) conditional sets when P,- is known to be operational. L e m m a 2.1: The intersection of the conditional sets is equivalent to the inter­ section of the minimal conditional sets obtained after removing all absorbable sets, i.e., 19 n s .- 1 , ■ = n s.-1 f°r all S j | . £ U- J 1 torall J 1 (2 .4 J P roof: If the conditional sets are not minimal, then there exists two sets, say Sj | and Sjc | j , such th a t Sj | C ^ | i . This yields: Sj 1 1 n Sfc | j = Sj | , ■ Since the intersection of sets is com m utative, we can group all absorb able conditional sets together in order to replace their intersection by the ab­ sorbing conditional set. If this is done for all the groups, an equivalent inter­ section will result from the intersection of the nonabsorbable conditional sets. C The result of this lemma is very useful because it reduces dram atically the num ber of sets to be examined at each iteration of the algorithm . For ex­ ample in Figure 2.4, consider the case in which path P 4 is up. The condition­ al sets are obtained as follows: £ > 1 1 4 — S i - S 4 = {a; 5 6 } £ > 2 | 4 = = £>2~*£,4 = { x 1,2 ’ x 2,5 1 x 5 ,6 } S 3 \ 4 ~ $ 3 ~ $ 4 ~ 1,2 » ^ 2,4} The conditional set S j | 4 is a subset of 5 2 | 4 which yields: 11 4 n ^21 4 1 4 = ^ 1 1 4 f i L em m a 2.2: The probability of the intersection of the events corresponding 20 to the minimal conditional sets is equal to the product of the probabilities of each minimal conditional event if th e in tersection o f all p ossible p airs o f th e m inim al con d ition al sets is em p ty; there are no common elements among the minimal conditional sets. This can be expressed as: Pr( A Ej | , • ) = I I Pr{Ei U ) for all S . ( ■ £ Af. for dl S _ | _ 6 M _ (2 .5 ) if for all Sj | , • , Sk | , • G M{ Sj | D Sk \ , • = 0 P roof: Since there are no common elements among the conditional sets, the probability th at one set is in the operational state does not affect the probabil­ ity of the others being in any of their two valid states. This means th at the events corresponding to these sets act as a set of independent events, and as a result, the probability of their intersection is equal to the product of their pro­ babilities. □ C orollary 2.1: Pr( /\ £ , | i ) = n (1 - -Pr ( J E T y ,, ) ) for dl Sj ! ( 6 for S y ,, 6 M - (2 .6 ) P roof: From the probability axioms, it is known th at the probability of Ej | , ■ is: Pr{EjU ) = ( 1 --Pr(£/|,)) It is also known th a t the complements of independent events are also independent. Hence, 21 Pr L a„ A E ,,i ) = n ft(%)= n ( l ^ r ( £ y|,.)) for all S j U eM{ fo r a H eM. fo r o W s eM_ □ For example, the minimal conditional sets (•S'i 14 , 5’3 14) which are evaluated in the previous example satisfy the condition of Lemma 2 . 
2 since their intersection is empty, i.e., *^i 14 H £ 3 1 4 = { * 5,6} H { * 1,2 1 * 2,4) == 0 Applying the result of the previous Corollary yields: Pr{ E x | 4 /\ £ 3| 4 ) = ( 1 - Pr{ £ l)4) ) ( 1 - Pr { E s | 4) ) = ( 1 - P 5,6 ) ( 1 ~ P l,2 • P 2,4 ) In Lemma 2.2, we studied the case in which there are no common ele­ ments among the minimal conditional sets, and we obtained the probability of the event corresponding to their intersection using only standard set opera­ tions. In what follows, we consider the case in which there are some common elements among the minimal conditional sets. The probability of the intersec­ tion event is computed by finding the probability of the union of the cutsets associated with the minimal conditional sets as shown in the next Lemma. Lem m a 2.3: k — nc Pr( A Ej U ) = Pr( U Ck ) for all Sj ^ £ M - 3 k = l ( 2 .7) where Ck denotes the failure state of cutset Ck , and nc represents the 22 num ber of all cutsets associated w ith the conditional sets. P roof: A cutset, say C k , is by definition a set whose intersection w ith one of its corresponding minimal conditional sets is not em pty, i.e., for all Sj | G M{ , Sj \ , • n Ck 0 Hence, if the cutset is in the failure state, which occurs w hen all its ele­ m ents are down, the corresponding conditional sets will also be in the failure state. Therefore, the failure of any one of the corresponding cutsets is sufficient to fail all the paths in M, , i.e., the probability of the union of all Ck ’s will be equal to the probability of the event in which all the paths associ­ ated w ith the conditional sets are not working. C Lem m a 2.3 enables us to compute the probability of the intersection of conditional sets as the probability of the union of the cutsets associated w ith them . One way of evaluating the cutsets associated w ith a set of minim al con­ ditional sets is described as follows. F irst, we identify all one-elem ent sets such th a t the element in each of these sets appears in all the conditional sets. Next, we identify all two-element sets such th a t either one of these two ele­ m ents appears in all the conditional sets. This process is repeated to identify sets of 3 elements and so on. For example, consider the case in which p ath P 5 of Figure 2.4 is up. The minimal conditional sets are: 15 = {^3,5 > x 5,e}; &3 15 = i x 1,2)1 £ 4 15 — { # 3,5 5 x 4,5 } 23 Since S 1|5 n S ' 4|57^0, a procedure sim ilar to the one discussed previ- ously can be used to determ ine the set of cutsets. It first searches for all one- element cutsets. Since there is no one-element cutset, the search continues to find two-elem ent cutsets. The only two-element cutset is C 1 = { x i 2 , £ 3 51- In a sim ilar way, the three-elem ent cutset C 2 — { x 1 2 , x 4§ , x 56} is ob­ tained. The probability of the union of the cutsets can be com puted in a straightforw ard m anner if these cutsets are m ade m utually exclusive. The problem of m aking a set of paths or cutsets m utually exclusive has been stu­ died in most term inal reliability algorithms available in the literature. A bra­ ham [ABRA 79] presented an efficient m ethod to disjoint the set of paths between the term inal nodes of a given com puter netw ork. A sim ilar approach will be used in SYREL to make the set of cutsets m utually exclusive. A cutset Ck can be m ade disjoint w ith another cutset Cj in two steps. 
F irst, the set of variables in Cj but not in Ck is found by, C i I* = C J ~ C k Second, if Cj | k ^ 0 , the cutset Ck is replaced by the following sub-cutsets th a t are disjoint w ith Cj : {e i> Ck }, {« i, e 2, Ck }, , . {e j, e 2, . . ., en _j, en , Ck } where ey is an element which could be either a node or a link, and rijk denotes the cardinality of the set Cj | k . 24 However, if tlie- CJ j * ~ = 0, Ck is dropped from consideration because it is a superset of Cj and thus absorbed by it. Let DC be the list of all m utually exclusive cutsets associated w ith the minimal conditional sets. The formal description of the disjoint m ethod is shown in the C utset Disjoint Procedure (CDP). C U T S E T D IS JO IN T P R O C E D U R E (C D P ) DCi = C l ; initialization. for j:= 2 to nc d o ,nc is the num ber of cutsets, begin DCj = Cj for i— 1 to j-1 do begin for for all dck E DCj do if dck is disjoint w ith C, th en select next dck + l E DCj. else begin ; m ake dck disjoint with C{ c i | * — Ci - dck if C ; | k = 0 th en drop dck from list DCj else replace dck with {£ li dck }, i, e2> dck }, . . ., {e j, e2> • • •) ^njk~l’ eny t’ ^ add them to the list DCj end end end. add DCj to the list DC 25 For example, consider the cutsets {(ij j , ar3 5) , (x12 > ^ 4,5 > ^ 5 5 )}) the disjoint procedure proceeds as follows: 1. In the initialization step, the first disjoint cutset and the num ber of cutsets are as follows: DC 1 — C 1 = {x 1 2 , ^ 3^}; nc — 2 2. A t the beginning of the j t h and the i th loops, D C 2 = { x l2 , ^ 4,5 , ^ 5,5 } and dc D C 2 equals { x l 2 , x 4 5 , :r56}. The set of variables in but not in dc 1 is com puted as: ('ey I dc , ^ C x - dc J = {x l 2 , £ 3 5 } - {x it2 3 X 4,5 ’ x 5,6} =: { ^ 3,5} Since dc x is not disjoint w ith C j and CCi | dCi is not em pty, dc 2 is re­ placed w ith { i j 2 5 ^ 4,5 > ^ 5,6 » x 3,5 }• Hence, the resulting disjoint cutsets are: DC = {(ar 12 , 2^3 ,5 ) J (x 1 ,2 > ^4,5 > x 5,6 > x 3,5 )} The im plem entation of the disjoint procedure can be improved when the num ber of the minimal conditional sets (M i ) to be considered is reduced. In com puter com m unication networks which are loosely connected, most of the paths do not have common elements among them . Hence, the set M, can be partitioned into two groups; the first group (IG (Mg - )) w ith independent sets, i.e., IG (M{) = { Sj 1 i : for all Sj j ^ Sk | , • £ M f - , Sj | , • D Sk \ , • = 0 } 26 while the second group (DG (M,-)) contains the remaining sets of M,-, i.e., DG (M{ ) = Mi - IG (M{) Since there are no common elements among the sets of IG (Mi) and any other set of M,-, the probability of having these sets in the failure or operation­ al state is not influenced by the state of the remaining sets DG (Mi ). Hence, the result of Lemma 2.2 can be applied to determine the probability in which all the sets of M{ are in the failure state as: Pr( /\ £y|,-) = n (1 - P r ( ) ) . P r ( /\ Ek U ) for ail S j [ e IG ( A / .) for al1 S\ , i £ DG (Mi ) (2 .9 ) Having partitioned M ,- into two groups, the cutsets corresponding to the sets of DG (Mi ) need to be derived. However, if the cutset generation and disjoint procedures are combined into one method, further improvement in the execution time can be obtained. This can be achieved by applying Bayes theorem to decompose two sets of DG (Mi) about its common elements into two states; in the first state, any one of the common elements is assumed to be in the failure state while it is assumed in the operational state in the second state. 
By repeatedly applying this decomposition procedure, a set of mutually exclusive sets can be obtained directly from m anipulating the sets of DG (M{). For example, M g which corresponds to iteration 9 of SYREL’s execution of the network shown in Figure 2.4 is as follows: I i 27 M g — { ( x 2,5) » ( x 4,6) > (*1,3 > ^3,5) ) (•*'2,3 > ^ 3,5)} Partitioning this sets into two groups yields: IG ( M g) = {(^ 2,5) > (x 4,e)} D G ( M 9) = { (x i f3 , 2:3 5) , ( z 2,3 > ^ 3,5)} Decomposing the sets of the second group about its common elements produces the following two disjoint sets: {( ) 7 (^3,5 > x l,&x 2,3,)} The term s corresponding to the event in which all the previous paths to P q fail can be obtained from Equation 2.9 as: ^ r ( * I. A E j |g ) = {q2,b • 94,5) • (73,5 + P3,5 • 91,3 • 92,3) for all ! g G M g In the previous analysis, all the steps necessary to implement SYREL algorithm are derived. Now, we present the complete algorithm. 28 S Y R E L A L G O R IT H M 1 . E num erate all paths between term inal nodes and sort them according to their lengths. 2 . Initialization step. T , = £ . • ( £ , ) ; R = r l5 < = 1 3. U pdating step. i = i + l 4. D eterm ining minimal conditional sets. fo r j = l to i-1 d o Sj |,- = S j - S { add Sj | g - to list Lt - fo r k = 1 to i - 2 do b e g in * elim inate all absorbable conditional sets. fo r j = k + l t o i - l do temp = Sk | ,■ D Sj j ,• if temp = Sk | , delete Sj | , • from list L , • if temp — Sj | , • , delete Sk | , • from list Lf e n d 5 . I f for all Sk | , S( | , £ i , Sk | , D Sj | , = 0 th e n T , = P r ( £ , - ) . n ^ ( £ , 1, ) for all S ■, ■ £ M- go to step 7 * 6 . I f for all Sk \ ,■ , St | f G L{ 5* | ,• D S{ \ ,■ 7 ^ 0 th e n E valuate cutsets and then call CDP Tf = P r ( E t ) . [ £ Pr ( DCk ) ] for all DC). 7. E valuating term inal reliability. R = R Tf 8 . If i < m go to step 3 9. Stop. 29 The critical p a rt in this algorithm , and any other reliability algorithm , is the step used to m ake a set of paths or cutsets m utually exclusive. In the SYREL algorithm , this critical loop (CDP) is not executed at each iteration of the algorithm , and step 6 is bypassed whenever there are no common elements among the minimal conditional sets. This feature by itself gives a significant im provem ent in the execution tim e of SYREL when compared w ith the algo­ rithm s th a t execute the critical loop at each iteration [AGGA 75, LIN 76, ABRA 79, GRNA 80]. Furtherm ore, the execution of the critical loop is im­ proved dram atically because of implementing conditional probability to reduce the num ber of variables th a t should be considered during the disjoint process (because of the elim ination of the variables of the path assum ed in operational state at a given iteration) and of using the cutset technique to reduce the num ber of cutsets to be made m utually exclusive as shown later. The SYREL’s execution tim e can be further improved when the path length at one iteration is n - 1 , where n denotes the num ber of nodes in the considered graph. The use of conditional probability to evaluate the probabili­ ty when p a th Pk is operational and all its previous paths, P lf P 2, ..., P k _ l5 are down allows us to exploit the case in which the p a th ’s length is n -1. This case th a t will be discussed in the next lemma was first observed in reference [RAI 78]. L e m m a 2.4: Let us assume th a t the paths are sorted in an ascending order of their lengths and path Pk is of length ( n - 1 ), and is in the operational state. 
Then, the probability th a t all the previous paths to Pk are down is equal to 30 the product of the unreliability of all the elements th a t do not appear in Pk . Proof: Let Pk be a path of length n - 1 . Since n is the num ber of nodes in the netw ork, Pk has exercised all the nodes in the network. Any path, say Pj , w ith a length less th a n .n - 1 will not exercise all the nodes. Hence, there exists some link xi y in Pj but not in Pk which is used to bypass some nodes in Pk because Pj is shorter th an Pk . If link x, • - fails, all paths of length less th an n - 1 which use link x,- , • will be down. If all the links x4 , • ’s th a t are not * > J 1 tJ present in P k fail, then all the shorter paths will also not be working. If the length of a path, say P i , is equal to n - 1 , then there exists at least one link in Pi but not in Pk Hence, the failure of the links which are not in Pk will also result in the failures of all previous paths of length n - 1 . Therefore, the pro­ bability th a t all previous paths are down when Pk is up is the product of the unreliability of all the links th a t are not present in Pk . □ Therefore, the SYREL algorithm can be modified to exploit the case in which the p ath length is (w - 1 ) by first checking the length of the p ath as­ sum ed operational at each iteration of the algorithm . If it is equal to (n - 1 ), Lem m a 2.4 can be used to derive directly the term corresponding to this itera­ tion as shown in SYREL1 . 31 SY R E L 1 A L G O R IT H M 1. E num erate all paths between term inal nodes and sort them according to their lengths. 2. Initialization step. T x = Pr ( E i ); R = T i =1 3. U pdating step. i = i + l 4. T est p ath length (pi) if pi; = n -1 th en Ti = E [ ( Pk ,i ) I I (9* )J g ° to steP 8 for all i £ S- for all j (jz 5? - 5. Determ ining minimal conditional sets. for j = l to i-1 do Sj | j - = Sj - S ; ; add Sj | to list Li for k = 1 to i - 2 do b egin * elim inate all absorbable conditional sets. for j = k + 1 to i - 1 do temp = Sk | , ■ n Sj | if temp = Sk | ,•, delete Sj | , • from list L,- if temp — Sj | , delete Sk | , • from list L ,- end 6. I f for all Sk | , S, | , - S L, St | , ■ n S | i = 0 th e n r,- = P r ( £,■ ) . n P r ( E t \ i for all | ■ € M- ) ; go to step 8 7. I f for ail Sk | f ,5 , | E L{ Sk ( n St , ■ 0 th en E valuate cutsets and then call CDP T{ = Pr ( £ , - ) . [ £ Pr ( DCk ) ] for all DCh 8. E valuating term inal reliability. R = R 4“ Tj 9. I f i < m go to step 3 10. S to p . 32 2.4 A n Illu stra tiv e E xam ple 5,6 WXS Figure 2.4: A 6-node computer network with 13 paths. The set of paths between the term inal nodes ( i r i 6) of Figure 2.4 and their corresponding sets are as follows: p 1 = X ,3-*3,s-*5,6 S i = i x 1,3 *3,5 *5,6} P 2 = X ,2'* 2,5-* 5,6 s 2 = ( * 1 ,2 *2,5 *5,6} P 3 = X ,2’* 2,4'x 4,6 S 3 = { * 1,2 *2,4 *4,6} P a = X ,3-* 3,5'* 4,5-* 4,6 s 4 = {*1,3 * 3,5 *4,5 > *4,6} P 5 = X ,3 * 2,3'*2,4 * 4,6 S 5 = ( * 1 ,3 *2,3 *2,4 » *4 ,6 } P 6 —r X ,3-*2,3'*2,5-*5,6 S 6 = ( * 1 , 3 * 2,3 *2,5 * *5,6} P 7 = X 2.X 2j5. X 4 5.X 4 6 = { * 1,2 *2,5 *4,5 > * 4 , e } P 8 = X 2-X 2 3 . X 3 5 . 
X 5 g *^8 = { * 1,2 *2,3 *3,5 » *5 ,6 } P 9 = X ,2 * 2,4 * 4 , 5 * 5,6 = { * 1,2 * 2,4 *4,5 > *5,6} P 10 = X ,3 '* 2 ,3 ‘* 2 , 4 - * 4 , 5 * 5 ,6 10 = { * 1,3 *2,3 *2,4 > *4.5 » ^11 = X , 3 * 2 ,3 -* 2 ,5 -* 4 ,5 * 4,6 ^ 1 1 == {*1,3 *2,3 *2,5 > *4,5 > P 12 = X , 3 * 3,5-*2,5-*2,4'* 4,6 S 12 { * 1,3 *3,5 *2,5 » *2,4 » P 13 = X ,2-x 2 , 3 * 3 , 5 * 4 , 5 * 4 , 6 S 13 = { * 1 , 2 * 2,3 *3,5 > *4,5 > 5,6} 1,6} 1,6} In the following discussion, we will explain the derivation of the terms corresponding to three different cases that could be encountered by the algo­ rithm . In the first case, there are no common elements among the minimal 33 conditional sets. In the second case, there are some common elements among these sets. In the th ird case, the path length is (n - l) . F irst case: This case is exercised during the derivation of the term corresponding to p ath P 8. The conditional sets com puted according to step 5 of SYREL1 are: S i 8 — S i - S g — x 1,3 } s 2 8 = S 2 - s 8 = ^ 2,5 } C O 8 — S 3 - S 8 = x 2,4 x 4,6> S 4 8 = S 4 - S 8 = x 1,3 >x 4,5 >x 4,6} S 5 8 == S 5 " S 8 = x 1,3 »x 2,4 ’ x 4,6> 8 = S 6 - S 8 = x 1,3 >x 2,& } S 7 8 == S 7 - S 8 = x 2,5 >x 4,5 >^ 4,5} Once the conditional sets are derived as indicated above, the next sub­ step in step 5 is to elim inate all the absorbable conditional sets and produce minimal conditional sets. The conditional set S x 18 is a subset of the condi­ tional sets (S 4 18 , S 5 18 , S 6 18), and therefore their intersection is equal to S 1 1 8 > 1 | 8 = = = S1 |8 n * 5* 4 I 8 S5I8 n - S ' 6 I 8 Similarly, we found th a t the conditional set S 2 1 8 is a subset of the con­ ditional set S 7 | 8, and therefore their intersection yields: 34 ^2|8 = *^2 I 8 H ^7|8 The minimal conditional sets after removing all the absorbable sets are: ^1 | 8 — &1 | 8 = { ^ 1,3} S 2\8 = $2 | 8 — {^2,5} *^3 | 8 = *^3 | 8 = { x 2,4 > * 4 ,6 ) Since there are no common elements among the minimal conditional sets produced in step 5 , step 6 of SYRELl is perform ed to obtain the proba­ bility of the intersection of the events corresponding to the minimal condition­ al sets as: P r ( - ^ l | 8 A £ '2 |8 A £ 3(8 ) — 9l,3-?2,5 ( 1~P 2,4.’P 4,6 ) Hence, the term corresponding to this iteration is: ^ 8 = P 1,2'P 2,3‘P 3,5-P 5,6 ( Q 1,3-92,5 ( 1~P 2,4-P 4,8 )] Second case: In this case, there exists at least one pair of the minimal conditional sets whose intersection is not empty. It is encountered in iteration 9 of the al­ gorithm. Applying step 5 yields the following conditional and minimal sets: Conditional sets: Si\g = S t - S q = {x x 3 , x 3j5} ‘ 5> 2 | 9 — S 2~ S q = { ^ 2,5} 35 S 3\9 = S i ~ S 9 ~ = (^4,6) ^ 4 | 9 = £ 4 “ ^ 9 == i ’* ' 1,3 > x 3,5 > x 4,6) ^ 5 I 9 = *^5 “ ^ 9 == i x 1,3 > z 2,3 > ^ 4,5) S q\ q = S&~ S Q = {ar1 3 , ar2,3 > ^ 2,5} ^ 7 | 9 = ^ 7 ~~ $ 9 — 2,5 J x 4,5} S 8\9 = ^ 8 ~ Sg = { z 2,3 > x 3,5} Minimal conditional sets: 1 1 9 — ^ 1 19 = { • * ■ 1,3 > x 3,5 } ^ 2 | 9 = *^2 I 9 = { x 2,5} ^ 3 | 9 “ ^ 3 | 9 = ( ^ 4 ,6 ) ^ 8 | 9 ^ ^ 8 | 9 == {•*'2,3 > x 3,5} Since the intersection of ( S j |9 , -Sg| g) is not em pty, step 7 is per­ formed. The cutsets corresponding to the conditional sets are: G i — { x 3 ) 5 , a r 2 , 5 > * 4 , 6 ) ^ 2 = ( x l , 3 » • * ' 2 , 3 > x 2 , 5 I x 4 , 6 } Once the cutsets are obtained, the disjoint cutset procedure is then called to m ake the cutsets m utually exclusive. 
These disjoint cutsets are: D C x — { • * " 3 , 5 I x 2 , 5 > * 4 , 6 ) D C 2 — { • * ' 1 , 3 > * 2 , 3 > x 2 , 5 J x 4 , 6 t * 3 , 5 } the probability of the union of these disjoint cutsets can be found in a straightforw ard m anner as shown below: 36 Pr ( D C i ) + Pr ( D C 2 ) = q 3 5.q 2 5.q 46 + i >3-9 2,3-9 2,5-9 4,6-P 3,5 Finally, the term corresponding to this iteration is: T g — P 1 2 P 2,4 P 4,5 P 5,6 [ 9 3,5-9 2,5-9 4,6 + 9 1,3-9 2,3-9 2,5-9 4,6-P 3,5 ] T h ird case: In this case, the p ath length equals (n -1 ), where n is the num ber of nodes in a given netw ork. This case is first encountered in iteration 10. The result of Lem m a 2.4 introduced in the previous section gives a straightforw ard m anner to obtain the term corresponding to iteration 10. This term is: TlO — P 1,3-P 2,3-P 2,4-P 4,5-P 5,6 [ 9 1,2-9 3,5-9 2,5-9 4,6 ] A detailed description of the minimal conditional sets, the disjoint cutsets whenever there are some common elements among the minim al condi­ tional sets, and the term s corresponding to each iteration of SYREL1 for the netw ork shown in Figure 2.4 is as follows: i = l T i — P ifi-P z,s-P 5,6 i= 2 $ 1 | 2 = 1,3 5 ^ 3 ,5 } T 2 — p 1 )2-P 2,5-P 5,6 -[ 1 - P 1,3-P 3,5 ] i= 3 S 1 i 3 = { x 1,3 » x 3,5 > x 5,6} 5 2 | 3 — (#2,5 > ^5,6} 37 DC i = {# 5 > 6 } DC 2 = {2:^3 , # 2j5 , x 5 6} D C 3 = {2^3 5 , a : 2 5 , x lj3 , a;5j6} D 3 = P 1,2 -P 2,4'P 4,6-[ 9 5,6 + 9 1,3 -? 2,5-P 5,6 + 9 3,5-9 2,5-P 1,3-P 5,6 ] i—4 ^ 1 14 ^ { ^ s . s } *^3 | 4 ~ 1,2 > x 2,4} T 4 = p 13 .p 3 5.p 4 5 .p 4 6 -[ 9 5,6-( 1 ~ P 1,2 P 2,4 ) ] 1 = 5 & 1 | 5 — { ^ 3 ,5 J x 5,6} 3 | 5 “ I 37 1,2 } ^ 4 j 5 — 3,5 > x 4,5} DC ! = { x l 2 , ®3i5} D C 2 == {2; 1 2 , ar4)5 , 2 5 6 , 2:3 5} D s = z P 1,3-P 2,3-P 2,4-P 4,6-[ 9 1,2-9 3,5 + 9 1,2-9 4,5-9 5 ,6 -P 3,5 ] 1 = 6 S 1 I 6 = { ^ 3,5} ^ 2 | 6 == 1,2 } 5 | 6 == {-*" 2,4 > 27 4,6} D 6 = P 1,3-P 2,3-P 2,5-P 5,6- [ 9 l , 2 - 9 3,5-( 1 “ P 2 ,4 -P 4 ,6 ) ] i= 7 ^ 2 17 — 5,6) ^ 3 I 7 — { ^ 2,4 } *^4 j 7 — I 3" 1,3 > ^ 3,5 } ^ 7 = P 1,2 -P 2,5-P 4,5-P 4,6- [ 9 5,6-9 2,4-( 1 ~ P 1,3-P 3,5 ) ] i= 8 38 *^ 1 |8 = I'3 7 .1,3) ^ 2 |8 = I 3 7 2,5} *^ 3 |8 — i x 2,4 ’ 3 7 4,6} T & = P 1,2-P 2,3 P 3,5 P 5,6'[ 9 1,3-9 2,5-( 1 ~ P 2,4'P 4,6 ) ] 1=9 ^1 |9 = ( 371,3 > 3 7 3,5} S 2 jg = {0:2,5} ^ 3 | 9 == i x 4,6) ^ 8 | 9 = i x 2,3 > x 3,5} D C 1 = { ^ 3 5 , X 2j5 , 2 T 4 6 } D C 2 = { x 1,3 , x 2,3 > x 2,5 > 3 7 4,6 3 7 3,5} T 9 = P 1,2-P2,4 P 4,S-P5,6-[ 9 3,5- 9 2,5- 9 4,6 + 9 1,3-9 2,3-9 2,5-9 4,6-P 3,5 1=10 pl 10 — 5 ;pi denotes p a th ’s length. T 10 == P 1,3-P 2,3-P 2,4'P 4,5-P 5,6- [ 9 1,2 - 9 3,5- 9 2,5- 9 4,6 ] 1=11 pi 11 = 5 ^11 = P 1,3-P 2,3'P 2,3'P 4,5'P 4 ,6 * [ 9 1,2-9 3,5-9 2,4-9 5,6 ] 1=12 pl 12 == ^ T 12 — P 1,3 P 3,5-P 2,5-P 2,4'P 4,6-[ 9 1,2 -9 2,3-9 4,5-9 5,6 ] 1=13 pi 1 3 = 5 ^ 13 == P 1,2 'P 2,3'P 3,5-P 4,5'P 4,6-[ 9 1,3-9 2,5-9 2,4-9 5,6 ] 1 3 R = E 2=1 39 2.5 C on clu d in g R em a rk s and C om p arison In this chapter, we presented an efficient algorithm , SYREL, to obtain compact reliability expressions in complex com puter netw orks. The novelty of this approach is the distinct methodology used to incorporate conditional pro­ bability, Boolean algebra, and cutset technique to achieve an efficient algo­ rithm . A comparison of its execution tim es for some standard examples with other m ethods reported in the literature is shown in Table 2.1. The efficiency is due to the following: 1) The disjoint procedure, which is the critical loop in all reliability algo­ rithm s, is executed at each iteration in most algorithm s. 
In SYREL, this has been avoided whenever there are no common elements among the minimal conditional sets. As a result, the disjoint procedure is called 3 tim es for the netw ork shown in Figure 2.4 out of 13 iterations and is called 10 tim es out of 24 iterations for the netw ork shown in Figure 2.5 as shown in Tables (2.2.a) and (2.2.b), respectively. 2) Furtherm ore, the disjoint procedure is not perform ed on the set of paths directly as it is done in m ost reliability algorithm s. Instead, SYREL applies the disjoint procedure on cutsets associated w ith the m inim al conditional sets. The disjoint procedure executes more efficiently on the cutsets than if applied directly to the set of paths because of the following reasons: a) at the i th iteration, the path P i should be made disjoint w ith all the previous paths. This means th a t the disjoint procedure will be called at least («-1) tim es if it is applied directly on paths. However, as shown in Tables (2.2.a and 2.2.b), the num ber of cutsets to be made disjoint is m uch less than 40 {i -i),in general, especially for the larger values of the index % . b) The com- plexity of the disjoint procedure grows exponentially w ith the num ber of variables in the sets to be m ade disjoint. The num ber of variables in the cutsets is reduced because of removing the variables associated w ith the path assum ed to be operating. 3) SYREL also exploits the case in which the p a th ’s length equals (n -1). Hence, the term associated w ith a path satisfying this case can be derived directly by virtue of the lem m a presented in Section 2.3. The set representations for paths is useful in applying SYREL to evalu­ ate reliability m easures in distributed system s, such as m ultiterm inal reliabili­ ty, distributed program reliability, and distributed system reliability, as shown in C hapter 3. 41 A lgorithm C om puter system /Language Fig. 2.4 Fig. 2.5 R FR A T 73 IBM 360/67 FO R TR A N - 112 s LIN 76 CDC 6500 FO R TR A N 9 s ABRA 79 DEC-10 SAIL 6 s SATY 78 PD P 11/45 FO R TR A N 1.2 s GRN A 80 D EC-10 FO R TR A N 0.8 s 1.5 s SYREL 1 86 VAX-11/750 PASCAL 0.1 s 0.8 s Table 2.1: Com paring execution tim es of SYREL w ith other algorithm s. 2.4 4.6 '4,5 '6,8 '5,6 2,3 3.4 5,7 3.7 Figure 2.5: An 8-node computer network with 24 paths. Using SYREL 1 Cutsets generated Iteration 3,5’ X 2,5’ X 4,6} 1,3’ x '2,3' X 2,5’ X 4,6) Table 2.2.a: The disjoint procedure is called 3 times (out of 13) for Figure 2.4. 43 Using SYREL 1 Iteration C utsets generated 5 {#4 6, x 3 7} {x46, X 7 g} {a; 2 3, X 4 6 , X 1 3) 6 {x 6,8’ ^ 3,7 } {x 2 3 , X 6 8 , X ! 3 } {a: 4 6 , X 5)6, x 3)7} (a: 2,3 , x 4 6, x 5 6, ^ 1 3 } 7 {•^ 2,3’ ^ 4,65 x 4,5’ x 1,3} {x 2fl, 2: 4,5, X 6,8 , 2: j 3 } ( a:2,3) ^ 6,85 2? 5,7? 37 1,3 } {a: 2 3 , x 4 6, 2:6 8 a: 5 7, a: x 3} I 3 ' 2,3’ ^4,6’ x 5 ,6 > ^ 5,7’ ^ 1,3 } 11 {3:2,4, x 6 8, 2:3 7 } 1,2j x 6,8 ’ x 3j} {a: 2,4, 2:4 ,6 , 2:5j6, 2 :3 7} {a: 12, a: 4 6, x 5 6 , 2:3 7 } 12 {x 2 4 , a: 3 4 , a: 7;8 } {a: 4,6, 2 :4 5 , x 78} {3:2,3, 2: 1,2, 2:3 4 , x 78} {x 1 > 2 , x 46, 2:3,4, X 7,3} 13 {.x 4,5’ x 6,8 > x 3,7 } { x 2 ,3 , a: 4 ,5, 2:3 ,4 , x 68, x 1 3 } 15 {•^ 2,4’ x 6,8’ x 3,7’ ^ 1,3 } {2: 2,4’ x 4,6 ’ x 5,6’ ® 3,7’ ^1,3 } 16 {•^ 2,4’ x 3,4’ ^ 1,3 , x 7,8} { x 4 6, 2: 4j5, X 3 ,3 , X7 g} 17 {x 1,2’ 2:4 6 , 2:3,4 , X3 7 } {x 1,2’ 2:4 ,6, 2:3,4 , X5,7, X7,g} 18 {x 1,2 , 2:3 4, X6,g, X 3,7 } {X 1,2 , X46, X3 4 , X56, X3 7 } Table 2.2.b: The disjoint procedure is called 10 tim es (out of 24) for Figure 2.5. 
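For small networks such as those of Figures 2.4 and 2.5, the compact expressions produced by SYREL can also be sanity-checked by exhaustive state enumeration. The sketch below is such a check, written here purely for illustration; it is not part of SYREL, it assumes perfectly reliable nodes and independent link failures, and its cost grows exponentially with the number of links, so it is usable only on small examples.

```python
# Illustrative cross-check (not part of SYREL): exhaustive enumeration of all
# link states to compute two-terminal reliability for a small network with
# perfectly reliable nodes and independent link failures.
from itertools import product

def terminal_reliability(links, p, s, t):
    """links: dict name -> (u, v); p: dict name -> link reliability."""
    names = list(links)
    total = 0.0
    for state in product((0, 1), repeat=len(names)):
        prob = 1.0
        up = []
        for name, bit in zip(names, state):
            prob *= p[name] if bit else 1.0 - p[name]
            if bit:
                up.append(links[name])
        reached, stack = {s}, [s]           # search over the working links
        while stack:
            u = stack.pop()
            for a, b in up:
                for x, y in ((a, b), (b, a)):
                    if x == u and y not in reached:
                        reached.add(y)
                        stack.append(y)
        if t in reached:
            total += prob
    return total

# Hypothetical 4-node network with two node-disjoint paths 1-2-4 and 1-3-4.
links = {"x12": (1, 2), "x24": (2, 4), "x13": (1, 3), "x34": (3, 4)}
p = {name: 0.9 for name in links}
print(terminal_reliability(links, p, 1, 4))
# expected: p12*p24 + p13*p34 - p12*p24*p13*p34 = 0.9639
```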
C h ap ter 3 R eliability M easures in D istrib u ted System s 3.1 In tro d u ctio n D istributed system s can be modeled as a collection of different objects (resources) th a t are interconnected via a com m unication netw ork. A simplified model of a distributed system considers only two objects, namely, program s and their associated d ata files. In this model, several processing elements cooperate in the execution of a program . A program running at a site m ay re­ quire files or processed results on d ata existing at some other sites. For the successful execution of th a t program , the local host, the processing elements having the required files, and the interconnecting links m ust all be operational. O ur analysis and algorithm s are applicable to system s w ith m any types of ob­ jects as well. W ith processing elements and com m unication links each having a certain probability of being operational, there is a certain reliability associat­ ed w ith the event in which the program can run successfully. 45 Several reliability m easures have been studied by researchers in th e con- text of distributed system s, namely, source-to-m ultiple-term inal reliability (SMT reliability) [SATY 81], com puter netw ork reliability [BALL 79, AGGA 81], survivability index [MERW 80], and m ultiterm inal reliability [GRNA 81]. The SMT reliability is defined as the probability th a t a specified processing node can reach every other processing element in the netw ork [SATY 81]. The com puter netw ork reliability is defined as the probability th a t each node in the netw ork can com m unicate w ith all other nodes [BALL 79, AG G A 81]. The survivability index is a quantitative measure of the survivability of a system and is defined as the expected num ber of program s th a t rem ain operational after some com binations of nodes and links have failed [MERW 80]. The mul­ titerm inal reliability is the probability th a t at least one path exists between each node of a set called the source nodes and other set called the destination nodes [GRNA 81]. The SMT and com puter netw ork reliability are good m easures for com­ puter com m unication networks; however, for distributed systems the reliability m easures should capture the effects of the redundant allocation of their ob­ jects. The survivability index is not applicable to large distributed system s be­ cause it enum erates all the elem entary states. In the m ultiterm inal reliability algorithm , the m ultiterm inal connections between the source and destination sets are decomposed into pairs between the PE th a t runs the program and all other P E ’s holding some required files ( F ’s). The paths betw een the enum erated pairs are combined to obtain the required m ultiterm inal connec­ tions. If the num ber of paths is large, the path enum eration between all these 46 pairs and their com binations will be com putationally expensive [GRNA 81]. In this chapter, several reliability measures for distributed system s with replicated resources, such as program s and files are introduced, and efficient m ethods to evaluate them are developed. These m easures are the D istributed Program R eliability (DPR) and D istributed System R eliability (DSR). An elegant approach based on a graph model has been developed to generate all the required subgraphs for successful execution of the program (s) under con­ sideration. 
It avoids applying the path enum eration among pairs of com put­ ers, as was done in the m ultiterm inal reliability algorithm , and performs a graph traversal to obtain all the required connections represented by trees or forests. These subgraphs are then used to evaluate the reliability. To deter­ mine the num ber of redundant copies of program s and d a ta files required for the reliable operation of the system , a sensitivity analysis is perform ed. The reliability expressions can be modified to obtain other reliability measures, such as availability, m ean tim e to first failure, and m ean tim e betw een failure [RAGH 83]. 3.2 R elia b ility M easu res Even though m any issues, such as distributed algorithm s and operating system s, concurrency control, and load balancing have received considerable attention by researchers [DION 80, W IT T 80, CHOU 83, BERM 85], there is very little research regarding the reliability of distributed system s. The relia­ bility modeling of distributed system s is im portant, especially in the early stages of designing these system s. New reliability m easures and efficient algo­ 47 rithm s for evaluating them are needed to assess the probability th a t a distri- buted system w ith redundant resources meets its specifications. Some of the redundancy m ay be inherently present in the system , for example, m ultiple com m unication paths between a pair of com puters, and others could be deli­ berately introduced, such as replicated copies of d ata files for increased relia­ bility. Consider a simple distributed system shown in Figure 3.1. Let PA( and FN{ denote the set of P E ’s th a t can run program PRGj and the set of files needed for its execution, respectively. Also, let FA; be the set of files avail­ able at node a;,-. Figure 3.1 shows four P E ’s th a t can run three different pro­ grams distributed redundantly across the system . E ach program can run on one or more com puters and needs files residing at other sites. Program PRG l can run successfully when either x x or x 4 is working, and it can access data files, F i, F 2, F 3. The sets of files needed for the execution of the program s are as follows: FN t = {F X ,F 2,F 3} F N ^ = {F VF 2,F 4} F N 3 = { F ltF 2,F 3F 4} 48 Figure 3.1: A distributed system with an allocation of its resources. In general, the set of nodes and links involved in running a given pro­ gram and accessing its required files forms a tree. We define a File Spanning Tree (FST ) as a tree th at connects the root node, which is the PE that runs the program under consideration, to some other nodes such th at its vertices hold all the required files for executing th at program. The tree x 1x 2x 3x 12x 23 is an FST where PRG x runs at A Minimal File Spanning Tree, called MFST , is an FST such th at there exists no other FST which is a subset of it. For example, FST x tx 2x 4x { 2x 2 > 4 is not an MFST because it is superset of x lx 2X i 2. For reliability analysis, we are interested in finding all the M FST ’s th at provide the appropriate accessibility for executing a given distri­ buted program. For PRG , to run on either x x or x 4, the M FST ’s are: *^1*^ 2 2' 1 2 ’ ^ 3^ 4^ 3 4' oX qX j qX 2 3 ! X oX 3 X 2 3X 04 49 If all the elements of any one of these four M FST js are operational, PRG l can be successfully executed. 
The probability th a t a distributed pro­ gram runs w ithout any failure can be defined as the probability of having at least one M FST operating, i.e., D PR = P r ( at least one M FST of a given program is up ) This can be w ritten as: nt DPR = P r( U M FST, ) > = 1 (3.1) where nt is the num ber of M F S T ’s th a t run a given program . The D PR m easures the reliability of a particular distributed program . For the entire system to be operational, several such program s are required to run. We introduce D istributed System R eliability, called DSR, as a system level reliability measure and is defined as the probability of executing m pro­ gram s successfully, i.e., 771 DSR = Pr ( n P RG ( ) ‘ = 1 (3.2) As shown in Figure 3.1, node x l can run program PRG i and node x 2 can run PRG 2 and PRG 3 when the needed files are accessible. The subgraph x 1x 2x 3x 12x 23 shown in Figure 3.2, which will be referred to as a forest (in this example, it is a tree), provides all the required connections for executing all three program s when all its components are operational. 50 The Minimal File Spanning Forests (M FSF ’s) are defined in a way sim ilar to the definition of M F S T ’s introduced above. Figure 3.2: A tree th at runs all the three programs. Equation 3.2 can now be w ritten in terms of the set of all M F SF ’s that provide the appropriate accessibility for processing all programs as: n/ DSR = Pr ( U MFSF, ) , = 1 (3.3) where nj is the num ber of M F SF ’s th at run all programs. The distribution of programs and data files in a distributed system is transparent to users. Hence, it is possible to have an event in which a distri­ buted program is up but it is not accessible to one user because of some faults in processors and/or communication links. PRG x can run successfully when M FST x 1x 2x l2 is up, but the user from either x 3 or x 4 can not run this pro­ gram when either x 2 3 or x 2 4 1 S down. As a result, it F im portant to modify the D PR and DSR measures to model the effects of user sites on these reliabili- 51 ties. By doing so, we will have two users’ site related reliability measures : D istributed Program Reliability and D istributed System Reliability with respect to a Site, namely, DPRS and DSRS, respectively. The DPRS describes the probability th at a user located at a site, say s , can run a distributed pro­ gram successfully, i.e., DPRS{s ) = Pr( U ] MFST As ) ) / = i (3-4) where nt (s ) is the num ber of M FST ’s th at run a given program from site s and MFSTj (s ) is an MFST connected to that site. Similarly, the DSRS describes the probability th at a user located at a site, say s , can execute all the programs of a distributed system, i.e., nf (s ) DSRS (s ) = Pr ( U MFSF 1 (s ) ) (3.5) where Uf (s ) denotes the num ber of MFSF’s th at provide the required con­ nections for executing all the programs from site s and MFSFj (s ) is an MFSF connected to th at site. 52 3.3 D istr ib u te d P ro g r a m R elia b ility A lg o rith m In this section, we develop an algorithm to evaluate the reliability of a distributed program based on the graph-theoretic approach. Search techniques can be used to system atically generate all the required trees. O ur approach of enum erating all the M F S T ’s is based on traversing th e graph in breadth-first search m anner. Once the M F S T ’s are found, then a term inal reliability evaluation algorithm based on path identifiers, such as SYREL, can be used to determ ine the D PR . 
Thus, our algorithm for evaluating the D PR has the fol­ lowing steps: 1. Apply the M FST algorithm to find all the M F ST ’s for a given program . 2. Apply a term inal reliability algorithm to evaluate the D PR. 3.3 .1 M FST A lg o rith m In this algorithm , the M F S T ’s are generated in a nondecreasing order of their sizes, where the size is defined as the num ber of links present in an M FST and is done in the following m anner. F irst, all the M FST ’s of size 0 are enum erated; this occurs when some of the root nodes th a t run PRGj have all the needed files (F N j ) locally accessible. Next, all the M F S T ’s of size 1 are determ ined; these trees have only one edge which connects the root node to some other node, such th a t the root and the adjacent nodes have all the files of FTVj . This procedure is repeated for identifying M F S T ’s w ith size 2, and so on up to trees of size n - 1 , where n is the num ber of nodes in the system or no more M F S T ’s can be obtained. 53 The procedure used to construct the M F S T ' s consists of checking and expanding steps. In the cheeking step, trees th a t have been generated so far, which are stored in a list called T R Y , will be tested to determ ine w hether or not the vertices of each tree t have all the files needed (FNp ) for executing program PRGp . A tree t is an F ST if its vertices have all the required files, i.e., U FA{ D F N for all x- E Let f m ( t ) be the set of all the missing files th a t are in the set FNp but not available at the vertices of a tree t . By definition, a tree is an F ST if the set of files needed to execute PRG p is a subset of the set of files available at the vertices of th a t tree and therefore the set of missing files is em pty, i.e., /m(O = 0 In addition to checking w hether or not a tree is an F S T , it is also necessary to check if it is an M F S T . Once the checking process is complete, the list T R Y will have all the trees th a t are not M FST's, and thus their f m (t )’s are not em pty. The expanding step is necessary to increase the size of each tree in T R Y by connecting each vertex of a tree to a new adjacent vertex hoping th a t the new vertex will have the needed files; if T R Y is em pty, of course, the algorithm stops. For example, in Figure 3.1, tree a; 3 : ^ 3 is not an F S T ; it is th u s expanded by connecting it to one of the adjacent vertices, nam ely, ar2 and 54 x 4. The added adjacent vertex might have all or some of the needed files. If th a t node has all the missing files, a new F S T w ould be generated from ad­ ding the adjacent vertex. If it has some or none of the needed files, th a t node can be used as an interm ediate node to access other set of adjacent nodes in the subsequent expanding steps. The set of edges th a t can be added to any tree is denoted as A E ( t ) , called the set of A djacent Edges of t . This set is obtained by first finding the set of edges incident on the vertices of Vt and then deleting from it the edges of t , i. e., A E (£) = E y (j, ) — t where E y (t ) denotes the set of edges incident on the vertices of Vt . For example, the set of adjacent edges of x xx 3x lz of Figure 3.1 is com­ puted as: A E (x iX 3^ 1 3 ) = {x 1 3 , # 1 2 > x 2,3 ’x 3,4} ~ { ^ 1,3 } = i x 1,2 ix 2,3 >^3,4 } The expanded trees are stored in a tem porary list called N E W which m ust also be checked to remove any replication among the expanded trees. This removal of the replicated trees is done in the CLEANUP function. 
The formal description of the M FST ALGORITHM is given next. 55 M FST A L G O R IT H M Step 1 : Initialization T R Y = PA p * PAp denotes the set of nodes FOUND = 0 th a t could run program PRGp . T R Y = P A p * list TR Y initially has the root nodes th a t fo r all Zy G T !/2 T do will run program p . f m ix j ) = FNp - FA j od Step 2: G enerating all M F S T ’s rep ea t 2 . 1 Checking step * each tree in T R Y is checked fo r all t E T R Y do to determ ine w hether or not it has all if CHECK { t) th en the required files for executing PRG p . b egin add t to FOUND remove t from T R Y end od 2.2 Expanding step * each tree in T R Y is expanded by N E W = 0 connecting one adjacent edge to its vertices. for all t E TR Y do add EXPAND ( t ) to N E W od T R Y — CLEANUP {NEW ) u n til {T R Y = 0 ) function CHECK(t) * this function returns true when t is an b egin F ST {fffi {t) == 0) and is minimal; CHECK = false otherwise it returns false. if f m {t ) = 0 th en * t is an F S T and will be checked to see b egin w hether or not it is an M FST . if FOUND = 0 or ( for all 1 E FOUND , I D t j L l ) th en CHECK = true else * t is rem oved because it is not an M F S T . T R Y = T R Y - t end end (* CHECK*) 56 function EXPAND(t)' * this function constructs from t all trees begin that can be formed by adding to t one tmp = empty edge from its adjacent edge set A E ( t ). A E (t) = E Vf -t * determine the set of adjacent edges, for all (a ;,- y £ A E ( t ) , z,-£ Vt /\ Xj £ V t ) do begin * construct new tree newt from adding newt = t U { , y } adjacent vertex . f mi newt) = f m(t)~ FAj add newt to tmp end EXPAND = tmp end. (* EXPAND*) function CLEANUP(NEW) * this function removes from N E W all the begin replicated trees generated by EXPAND, for all £,, fy £ N E W do if £ ,• - t}- — 0 th en remove t from list N E W od end. (*CLEANUP*) Once all the M F S T 's have been generated, the next step is to find the probability th a t at least one of them is up, which means th a t all the com put­ ers and links included in at least one M FST are operational. Any term inal re­ liability evaluation algorithm based on path enum eration [ABRA 79, GRNA 80, HARI 8 6 ] can then be used to obtain the distributed program reliability. If the set of M F ST's is considered as a set of paths, the SYREL algorithm can be used to efficiently evaluate the D PR expression. 57 3.4 D istr ib u te d S y stem R elia b ility A lg o rith m The distributed system reliability (DSR) is defined to provide a global measure when m ultiple types of resources are shared among m any users. It essentially quantifies the availability of a certain m inim um configuration for the entire system to be considered operational. W ith our view of a distributed system , this reliability will be the probability th a t a given set of distributed program s are operational. Let us assume th a t a distributed system is defined to be operational when a set of np program s are executable in spite of link and node Failures. One m ethod of evaluating the DSR is by intersecting the M F S T ’s associated w ith each distributed program of the given set. The D PR will be called np tim es to obtain the M F S T ’s for each program and then intersect those trees to determ ine the subgraphs th a t assure the appropriate accessibility for run­ ning all the required program s. This approach is simple, but com putationally expensive, especially when the num ber of program s involved is large. 
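The pseudocode above can be rendered almost directly in a modern language. The sketch below is an illustrative Python transcription under simplifying assumptions (one program, an undirected graph, trees identified by their edge sets); it is not the author's Pascal implementation, and all identifiers and the small example system are invented for illustration.

```python
# Illustrative Python transcription of the MFST algorithm (a sketch under
# simplifying assumptions, not the author's Pascal implementation).  A tree
# is identified by its set of undirected edges, each edge a frozenset({u,v});
# PA is the set of nodes that can run the program, FN the set of files it
# needs, FA a dict mapping each node to the files it holds, and 'edges' the
# set of links in the network.

def vertices(root, tree):
    vs = {root}
    for e in tree:
        vs |= set(e)
    return vs

def minimal_file_spanning_trees(PA, FN, FA, edges):
    found = []                                    # MFSTs discovered so far
    TRY = [(r, frozenset()) for r in PA]          # size-0 candidates: roots
    while TRY:
        expanded = []
        for root, tree in TRY:
            vs = vertices(root, tree)
            have = set().union(*(FA[v] for v in vs))
            if FN <= have:                        # checking step: an FST
                # minimal if no previously found MFST is a subgraph of it
                if not any(t <= tree and r2 in vs for r2, t in found):
                    found.append((root, tree))
                continue                          # FSTs are never expanded
            for e in edges:                       # expanding step
                if e not in tree and len(e & vs) == 1:
                    expanded.append((root, tree | {e}))
        TRY = list(set(expanded))                 # cleanup: drop duplicates
    return found

# Hypothetical 3-node system (not Figure 3.1): the program runs at node 1
# and needs files F1, F2, F3 spread over the three nodes.
FA = {1: {"F1"}, 2: {"F2"}, 3: {"F3"}}
edges = {frozenset({1, 2}), frozenset({2, 3}), frozenset({1, 3})}
for root, tree in minimal_file_spanning_trees({1}, {"F1", "F2", "F3"}, FA, edges):
    print(root, sorted(sorted(e) for e in tree))
```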
A more elegant and efficient approach would be to enum erate directly the subgraphs th a t will provide the required paths to run all the program s of a given set. This approach is sim ilar to finding the M F S T ’s, b u t now it enum erates all M inimal File Spanning Forests (M F S F ’s). Sim ilar to the M FST m ethod, the MFSF algorithm traverses the graph in breadth-first m anner. The MFSF ’s are generated in a nondecreasing order of th eir sizes, where the size is defined as the num ber of links present in an MFSF , and ac­ cording to the following order. F irst, all the M F S F ’s of size 0 are enum erated; 58 this occurs when all program s run locally and do not need any rem ote d ata files. Next, all th e / M F S F ’s of size 1 are determ ined; these forests have only one edge such th a t all program s can access their rem ote d a ta files through th a t link. This procedure is repeated for identifying forests of size 2, and so on un­ til the M F S F ’s become spanning trees or no more M F S F ’s can be obtained. The MFSF algorithm consists of three steps: initialization, checking, and expanding steps. In the initialization step, all subsets of the com puters th a t jointly run all the program s are obtained and stored in a list called T R Y . This set can be determ ined by applying the following steps: a) determ ine the C artesian product T R Y = P A X X P A 2. . . X P A m where P A t denotes the set of com puters th a t can run PR G ;. b) remove all supersets from T R Y In the checking step, the forests th at have been generated so far, which are stored in T R Y , will be tested to determ ine w hether or not all the np pro­ gram s are executable. Let us assume th a t / be a forest whose active vertices a ( / ) can jointly run all np programs. A forest / is an FSF if each program can access all the needed files, i.e., for all x ■ E a ( / ), U FAk 2 FNPrg, for all Xk Ecorij where I U prg: I = n„ , prg.■ represents the set of program s th a t for all Xj £ a, ( / ) run at Xj , and con ( / ) denotes the set of nodes connected to the set of active 59 vertices a ( / ). For a given forest / , let f m ( f ; Xj ) be the set of missing files th a t can not be accessed from node Xj . By definition, a forest becomes an FSF when all program s can access all their needed files from their hosts; the missing file sets are all em pty, i.e., for all xj € a ( / ), f m ( / ; Xj ) = 0 In addition to checking w hether or not a forest is an FSF , it is also necessary to check if it is an MFSF ; it is not a superset of another MFSF constructed previously, i.e., for all MFSFi E FOUND , MFSFt n / ^ MFSF{ where FOUND denotes the list of determ ined MFSF ’s. Once the checking process is complete, the list T R Y will have all the forests th a t are not F SF ’s, and thus their missing file sets are not em pty. In the expanding step, the size of each forest / in T R Y is increased by connect­ ing each vertex to an adjacent node hoping th a t it will provide the accessibili­ ty needed to obtain all the rem ote files to the program s’ hosts. However, if the resulting missing file sets are not em pty w ith this new node, then it can be used as an interm ediate vertex. The set of edges th a t can be added to any forest / is called the set of A djacent Edges A E ( / ). This set is obtained by first finding the set of the edges incident on the vertices of / and then delet­ ing from it th e edges of / i. 
e., 60 ---------------------- X E T I ) = Ey( f ) - / ------------------------------------------------------------- — where E v {f ) denotes the set of edges incident on the vertices of / which are represented by the set Vj . The rest of the steps of this algorithm is sim ilar to those of the M FST algorithm described in Section 3.3 and is shown in the M FSF algorithm . We briefly discuss its applicability to evaluate other reliability m easures including DPR, DPRS, and DSRS. The initialization step of the MFSF algorithm is modified slightly during the evaluation of the desirable reliability measure. The M F ST 's needed for evaluating D PR can be enum erated by setting T R Y to P A p . To evaluate DPRS ($ ), all the subgraphs th a t guarantee the execu­ tion of PRGp from site s m ust be enum erated. These subgraphs are obtained by modifying the needed file set (FNp ) to include a dum m y elem ent th a t is only available at site s . Hence, the site m ust be included in every M FST generated. Likewise, the subgraphs required for the evaluation of DSRS are produced by adding the site location to each FNp . 61 MFSF ALGORITHM Step 1 : Initialization * determ ine all the sets of vertices th a t can T R Y = PA l X PA 2- ■ • X PAm run all the given set of programs, for (for all /,• , f j £ T R Y ) do if /,• fl f j — f i th en remove / y from T R Y i f / , fl Si = Si th en remove /,• from T R Y od FOUND = 0 fo r all / e T R Y do for all X j £ a { f ) do f m{ f ^ i ) - F N prgi- F A j od od Step 2 : G enerating all MFSF s rep ea t 2.1 Checking step fo r all / e T R Y do if CHECK ( f ) th en begin add / to FOUND remove / from T R Y end od 2.2 Expanding step N E W = 0 fo r all / e T R Y do add EXPAND ( / ) to N E W od T R Y = CLEANUP {NEW ) u n til ( T R Y = 0 ) * determine for each / , the sets of missing file sets f m ( f ; a?y); one for each active vertex in / . * ENprg . denotes the set of files needed t o . execute the programs at x •. 62 function CHECK(f) * this function returns true when / is an begin FSF (for all xj in a ( / ) f m ( / ; xj ) = 0) CHECK = false and is minimal; otherwise, it returns false, if (for all Xj Ea ( f ) f m{f ; X j ) = 0 ) th en begin if FOUND = 0 or ( for all IE FOUND , I fl / ^ / ) th en CHECK = true else remove / from T R Y end end (* CHECK*) function EXPAND(f) * this function constructs from / all forests b egin that can be formed by adding one edge tmp = empty from its adjacent edge set A E ( f ). A E { f ) = E V/ - f for all (xi j E A E ( f ) , x i E V f l \ x j $ :Vf ) do b egin * construct new forest n e w f by adding n e w v = / U {*,-,/} adjacent vertex x f . a ( n e w f ) = a ( / ) for all X jE a (n e w f ) do f m ( n e w f ; x - ) = F N - U F A k J for all xk € co n ) - od add newf to tmp end EXPAND = tmp end. (* EXPAND*) * con (j ) denotes the set of nodes th a t are reachable from vertex Xj . function CLEANUP(NEW ) begin for all / , , f j E N E W do if f { - f j = 0 th en remove / from list N E W od end. (*CLEANUP*) * this function removes from N E W all the replicated forests generated by EXPAND. 63 3 .5 S e n s itiv ity A n a ly sis The redundancy in resources, such as program s, d a ta files, and devices is introduced to improve the reliability of distributed system s. However, it in­ creases the cost of the system and the overhead of m anaging these redundant resources. Massive redundancy is therefore not desirable for m ost applications. 
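Once the MFST's or MFSF's have been enumerated, the reliability measures of Section 3.2 reduce to the probability that at least one of the enumerated subgraphs is fully operational. For small numbers of subgraphs this union can be evaluated directly by inclusion-exclusion, as the sketch below illustrates; in the dissertation SYREL is used instead to obtain compact expressions, so this brute-force evaluation should be read only as a cross-check. The element names and reliabilities used here are hypothetical.

```python
# Illustrative evaluation of DPR/DSR as the probability that at least one of
# the enumerated subgraphs (MFSTs or MFSFs) is fully operational, using
# inclusion-exclusion.  A brute-force cross-check for small cases only; each
# subgraph is a frozenset of element names (nodes and links) and p maps each
# element to its reliability.
from itertools import combinations

def union_reliability(subgraphs, p):
    total = 0.0
    for k in range(1, len(subgraphs) + 1):
        sign = 1.0 if k % 2 else -1.0
        for combo in combinations(subgraphs, k):
            elems = frozenset().union(*combo)    # all of these must be up
            term = 1.0
            for e in elems:
                term *= p[e]
            total += sign * term
    return total

# Two hypothetical MFSTs rooted at node x1.
mfsts = [frozenset({"x1", "x2", "x12"}),
         frozenset({"x1", "x3", "x13"})]
p = {e: 0.9 for e in ("x1", "x2", "x3", "x12", "x13")}
print(union_reliability(mfsts, p))   # p1*p2*p12 + p1*p3*p13 - p1*p2*p3*p12*p13
```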
The objective of this analysis is to determ ine an upper bound on the num ber of redundant resources th a t achieve significant im provem ents in the reliability. This bound can be used as a cost effective measure to decide on th e level of the redundancy of each resource needed to improve the reliability of the distri­ buted system under consideration. In the simplified model discussed here, two types of resources are considered: program s and d ata files. The reliability of a distributed program can be im proved by introducing redundancy in the num ber of com puters th a t could participate in its execution or th a t hold some of its required files. To simplify the analysis, we assume th a t the com m unications network is reliable; there is at least one path always available for accessing a required resource. Essentially we are concerned w ith the failures of com puters and thereby inaccessibility to resources controlled by them . W e assume th a t all the com puters have the same reliability (p ) and their failures are statistically independent. The result of our approxim ate analysis suggests for a large class of distributed program s, which need different num ber of resources, a triple redundancy is sufficient to provide at least 90% of the m axim um possible relia­ bility im provem ent. 64 3 .5 .1 In tro d u cin g R ed u n d a n cy in th e C o m p u ters ru n n in g a T a sk . Let us assume th a t k com puters are needed to execute a program . To improve its reliability, we add another com puter to the set of k com puters. Now this program can successfully run when at least k out of k +1 com puters are operational. Hence, the reliability of th a t program can be evaluated as: DPR (1 ) = p k +1 + (k + 1 )p k (1 -p ) In general, if we add r com puters, the probability th a t at least k com­ puters are operational out of k +r is given as: DPR(r) = £ ( * + r ) „ * +- ( l- „ f - o V ’ ’ (3.8) The num ber of computers involved in the execution of a distributed program depends on its type, perform ance, environm ent, and other constraints while the value of the param eter r determ ines the level of redundancy. The effect of r on the reliability is m easured by the Reliability Im provem ent Fac­ to r (RIF) which is defined as the ratio between the reliability im provem ent ob­ tained from adding r com puters to the maxim al reliability im provem ent, which occurs when DPR (0) is increased up to 1 , i.e., DPR (r ) - DPR (0) RIF 1 - DPR (0) ( 3 9 ) For a given value of r , this equation gives the fraction of the maxim um possible reliability im provem ent th a t can be obtained. W e study the reliabili- 65 ty oT a distributed program Tor different values of k , small"to relativelylarge, and for different values of r . The results are shown in Figure 3 .3 , which shows th at reliability improvements for triple redundancy are 9 9 %, 98%, and 94% for tasks of sizes (k = 2 , k = 5 , k = 10), respectively. 9 0% - 9 0 % - u rn i a * * Radoadaacy L**d (r) Figure 3.3: Reliability improvements for triple redundancy. 3.5.2 R ed u n d an cy in P rogram s and F iles In the previous case, we did not distinguish between the different types of the resources required for executing a program. However, for practical con­ siderations, such as the availability of each type of these resources, their cost, and the performance constraints, it is useful to treat them differently. For ex­ ample, it might be more cost effective to replicate data files rather than pro­ grams. 
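Before splitting the redundancy between programs and files, the single-pool analysis of Section 3.5.1 (Equations 3.8 and 3.9) can be reproduced numerically. The short sketch below does so; it is an illustration only, the curves of Figure 3.3 were produced by the author, and the component reliability used there is not stated in the text, so p = 0.9 is an assumption made here (with that value the triple-redundancy figures come out close to the quoted 99%, 98%, and 94%).

```python
# Numerical sketch of Equations 3.8 and 3.9 (illustration only; the component
# reliability used for Figure 3.3 is not stated, so p = 0.9 is assumed).
from math import comb

def dpr(k, r, p):
    """Probability that at least k of k+r identical computers are up."""
    return sum(comb(k + r, i) * p**i * (1 - p)**(k + r - i)
               for i in range(k, k + r + 1))

def rif(k, r, p):
    """Reliability Improvement Factor of Equation 3.9."""
    base = dpr(k, 0, p)
    return (dpr(k, r, p) - base) / (1.0 - base)

for k in (2, 5, 10):
    print(k, round(rif(k, 3, 0.9), 3))   # triple redundancy (r = 3)
```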
Let us assume that for a distributed program, kp computers are need­ ed to execute it and kj remote computers have the required data files for its execution. These remote computers are assumed to be disjoint with the com- 66 puters running th a t program . This assum ption is realistic because if the same com puter th a t participates in the execution of a task also holds some required d ata files for other com puters, then it implies th a t w hen th a t com puter is up its d ata is also accessible. F urther, to improve reliability and simplify con­ sistency and recovery problems, d ata files and program s are kept at different com puters. The reliability of executing a program can then be evaluated as: DPR (0,0) = p k” . p kf where the (0 ,0 ) param eter denotes th a t there is no redundancy in the program s and files. If we introduce redundancy only in the com puters th a t run a program , its reliability after adding rp com puters will be as follows: k , h - DPR (rp ,0) = p 1 £ * = = o kp +rp p k P + rP ‘ ( i _ p ) f (3.10) For the same level of redundancy, E quation 3.10 differs from the relia­ bility expression of the previous case, which is shown in E quation 3.8, by a constant factor p kf . Hence, the graphical representation of E quation 3.10 will be sim ilar to the one shown in Figure 3.3 and therefore triple redundancy would be sufficient to achieve reliability im provem ent of at least 90% of the m axim um reliability im provem ent. This analysis can be generalized to include redundancy in both pro­ gram s and files. F or rp and ry levels of redundancy for the com puters run- 67 nmg and holding the required files Tor tfie program under consideration, the re­ liability expression is: Figure 3.4 shows the effect of allocation of redundant copies of pro­ gram s and files on the reliability of -a program of different sizes ((kp =2;kj- = 2 ) , (kp =2;kf = 5 ) , (kp —2;kl0)). Likewise in the two previous cases, triple redundancy in the resources is sufficient to achieve more than 9 5 % of possible improvement in the reliability. Figure 3.4: Reliability improvements due to redundant programs and files. (i-p )■) ] - (3.11) Rir < * Optimal (r, + r , ) 68 3.6 A n Illu stra tiv e E x a m p le In this section, we illustrate the use of the M F ST and MFSF algo­ rithm s to obtain the D PR for a given program and the DSR of the distributed system shown in Figure 3.5. It consists of six processing elem ents th a t could run four program s where each one of them can be executed on two different com puters. The allocations of these program s are as follows: P A 1 = {x 1 ,x 6}; P A 2 = { x 3 ,x 4} P A 3 = {x3 ,x 4}; P A 4 = { x 2 ,x 5} The set of files needed for executing each program is: F N 1 = { F l t F 2 ,F3}; F N 2 = {F a ,F 4 ,F J F N 3 = { F 1 ,F 3 , F 5}; F N 4 = { /•, ,F t ,F t ,F «} The files accessible at each processing elem ent and the allocation of the program s across the system are shown in Figure 3.5. In th following section, we first shows the application of the M FST and SYREL algorithm s to evalu­ ate the D PR expression of program 1 and then apply the MFSF and SYREL to evaluate the DSR expression for all four program s. 69 ‘ 1.3 Figure 3.5: A distributed system with allocation of its four programs and the required d ata files. 3.6.1 E v a lu a tin g D P R for PRG ! This program can run on either x x or x 6 and its M FST's shown here are represented only by its edges since they imply the vertices involved in these trees. 
In the initialization step, the list T R Y equals ( x 1 ? x 6), and the sets of missing files are: / m ( * l ) = 1 X 2 X 3> - { F , X 5} = { f 2 ,F3} f m (* 6) = { F ! ,F, , f ,} - IF, ,F<} = { F 2 ,F 3> The application of the checking and expanding steps is repeated until T R Y becomes empty. The resulting 10 M FST's are stored in the FOUND list and are as follows: FOUND = 1 ,2"^ 1,3 ’ 1 1,2X 2,3 ’ 1 1,3^2,3 > x 4,6'1 ' 5,6 » a'4 ,6 ;r 4,5 * ^ 5,6*^ 5,4> 7 0 • ^ 1,2^ 2,4^ 4,5 t x 1,3X 3,5^ 4,5 > ^ 4,6^ 2,4^ 2,3 » x 5,6X 3,5X 2,3 } Figure 3.6 shows these M F S T ’s and also the locations of the files need­ ed for executing PRG 1 using each tree. The double circles indicate the site at which the program runs. Once the M FST {PRG j) has been found, the SYREL algorithm can be used to evaluate the following term s of the D PR expression: T 1 = P i P 2P 3Pi, 2P 1,3 T 2 = P iP 2P 3P 1,2P 2 ,3 [? 1,3] P 3 = P 1P 2P 3P 1,3P 2,3 [9 1,2] T 4 = p 4p 5P 6P 4)6p 5 6 [(1 - p X P 2P 3) + P 1P 2P 3 ( 9 1 ,2 9 1 , 3 + 9 1 ,2 9 2 ,3P 1,3 + 9 1,3 9 2 ,3 ? 1,2)] T 5 = P 4P 5P 6P 4 ,5 ? 4 ,6 9 5,6 P ~ P \P 2 ? 3) + P 1P 2P 3 ( 9 1 ,2 9 1 , 3 + 9 1 ,2 9 2 ,3P 1 , 3 + 9 l , 3 9 2 ,3 ? 1,2)] ^ 6 = ? 4 ? 5 ? 6 ? 4 ,5 ? 5 ,6 9 4,6 P ~ ? 1? 2? 3 ) + ? 1? 2 ? 3 ( 9 1 ,2 9 1 , 3 + 9 1 ,2 9 2 ,3 ? 1 , 3 + 9 1 ,3 9 2 ,3 ? 1,2)] T 7 = P 1? 2? 4 ? 5 ? 1,2? 2 ,4 ? 4,5 [9 3 9 6 + 9 3 9 4 ,6 9 5,6 ? 6 + 9 6 9 1 ,3 9 2 ,3 ? 3 + 9 1 ,3 9 2 ,3 9 4 ,6 9 5 ,6 ? 3 ? 6] T 8 = ? 1 ? 3 ? 4 ? 5 ? 1 ,3 ? 3 ,5 ? 4,5 [9 2 9 6 + 9 6 9 1,2 9 2 ,3 ? 2 + 9 2 9 4 ,6 9 5,6? 6 + 9 l , 2 9 2 ,3 9 4 ,6 9 5,6? 2? el T g = ? 2 ? 3 P 4 P 6P 2 ,3 ? 2 ,4 ? 4,6 [9 l 9 5 + 9 l9 4 ,5 9 5 ,6 ? 5 + 9 5 9 1 ,2 9 1,3? 1 + 9 1 ,2 9 1,39 4 ,5 9 5,6? 1? 5] T 10 = ? 2? 3 ? 5 ? 6 ? 2 ,3 ? 3 ,5 ? 5,6 [9 19 4+ 9 1 9 4 ,5 9 4,6? 4 + 9 4 9 1 ,2 9 1,3 ? 1 + 9 1 ,2 9 1 ,3 9 4 ,5 9 4 ,e P 1? 4] 10 DPR = £ T, * = 1 3.6 .2 E v a lu a tin g D S R fo r all p ro g ra m s Applying the MFSF algorithm to obtain the set of all subgraphs th a t provide the proper accessibility for executing all four program s proceeds as fol­ lows: Step i: Initialization step. T R Y = [ x xx 2x z), ( x xx 2x 4), ( x xx zx s), (x xx 4x b), ( x 2x 3x 8), {x2x 4x 6), («3x 5a r8), (ar4x 5ar6) The set of missing files sets associated with each subgraph in T R Y is: /m (* 1* 2* 3; * 1) = F N prgi- F A x = F N X - FA x = {F 2, F z) f m (* 1* 2* 3; * 2) = F N prgs- F A 2 — F N 4 - F A 2 = {F 4, F J f mi* 1* 2* 3; *3) = F N prg3 - F A 3 = F N 2Z - F A 3 = { F x, F 2, F 5, F J f m (* 1* 2* 4; * 1) = F N prgi - F A x = F N X - FA x = { F 2, F 3> f m (* 1* 2* 4; * 2) = F N prgt - F A 2 = F N 4 - F A 2 = { F 4, F J f m (* 1* 2* 4! *4) = FNprg<- F A 4 = F N 2i 3 - FA 4 = { / V F 3, F 4, F J /rn (* 1* 3* 5; * 1) = F N prgi - FA t = F N X - F A , = { F 2, F 3} f m ( x xx 3x b, x 3) = F N prg3 - F A 3 = FAT2i3 - FA3 = { F x, F 2, F s, F J f m (* 1* 3* 5; * 5) = F N prgs - F A b = F N 4 - F A b = { F v F 2, F 4} f m (* 1* 4* 5; * 1) = F N prg t - F A x = F/V, - FA j = {F 2, F 3} f m (* 1* 4* 5; *4) = i ^ 4- i r A 4 = FiV2i3 - F A 4 = { F lf F 3, F 4, F J /m (*1*4*5; *5) = - F A s = F N 4 - F A 5 = {Fj, F 2, F 4} / m (* 2* 3* 6! * 2) = F N prg2- FA 2 = F7V4 - F A 2 = ( F 4, F 6} / m (*2*3*6! *3) = - FA 3 = FAT 2 3 - FA 3 = { F 1 ? F 2, F s, Fg} / m (*2*3*65 *6) = F N prgt - F A 6 = F N x - F A 6 = { F 2, F 3} /rn (*2*4*6! * 2) == F N prgi — F A 2 = F N 4 — F A 2 = {F 4, F J /m (*2*4*6! *4) = ^ prj4 - F A 4 = F N 2 Z - F A 4 = {Fj, F 3, F 4, F J / m (* 2* 4* 5! 
*6) = F N prga - F A 6 = FiV, - FA e = ( F 2) F 3} / m (* 3* 5* 5! * 3) = F N prga — F A 3 = F N 23 - FA 3 = {Fj, F 2, F 5, F J /m (*3*5*6! *5) = F N prga - FA 5 = F7V4 - FA 5 = {Fj, F 2, F J /rn (*3*5*6! *6) = F N prgt- F A 5 = F N X - FA 6 = {F 2, F 3} 72 / m ( * 4 * 5 * 6 ; * 4 ) = F N p r g t - FA 4 = FJV2 3 - F A 4 = {F v F s, F 4, F J fm (*4*5* 6j *5) = F N p r 9 5 - FA 5 = F N 4 - FA 5 = {F 1 ? F 2, F J / m (*4*5*6> *6) = F N p r g * - F A 6 = FiV, - FA 5 = { F 2, F 3} W here FNprQ j denotes the set of program s th a t run on the com puter x}- and FNk< i represents the set of files needed to execute programs PRGk and PRGt . FOUND = 0 Step 2 : G enerating all M F SF 's Applying the checking step to forests of sizes 0 and 1 yields false re­ turns. T he first minim al file spanning forests are constructed when the forests become of size 2. Applying the checking step in this case yields the following M F S F ’s: FOUND = {x4ar5 z 6x 4 5x 5 6, x 4x 5x G x 45x 4 6, x 4x 5x 6x 4 6ar5 5} The set of missing files sets X j ) ) associated with each MFSF m ust all be empty; there is no missing files because its vertices have all the re­ quired files. For example, the set of missing files sets associated with * 4 * 5 * 6 * 4 ,5 * 5,6 ^ / m (*4*5*6* 4,5*5,6 J * 4) = F N 2,3 ~ (F A 4 U FA 5 U F A 6) = 0 f m (* 4* 5* 6* 4,5* 5,6 J * 5) = F N i ~ (F A 4 U FA 5 U FA 6) = 0 f m (* 4* 5* 6* 4,5* 5,6 J * 6) = F N 1 “ (F A 4 U FA 5 U FA 6) = 0 In a sim ilar m anner, the DSR algorithm proceeds to construct all the M F SF 's which are shown in Figure 3.7. Once the set of all M F S F 's are derived, applying SYREL yields the following term s of the DSR expression: F 1 — P 4P $P &P 4,bP 4,6 T 2 = P 4P 5P sP 4,5? 5,6 (1 - P 4,e] 73 ^ 3 — P AP 5P 6P 4,6P 5,6 [? 4,5] T 4 = P 1P 2P 3P 5P 1,2P 1,3P 3,5 K1 - P 4P 6)+ P AP e(? 4,5? 4,6+9 4,5? 5,6P 4,6+9 4,69 5,6 P 4,5)] ^ 5 = P 1P 2P 3P 5P 1,2P 2,3^3,59 1,3 [U ~ P aP e)+ P aP 6 (9 4,59 4,6 + 9 4,59 5,6P 4,6+9 4,e9 5,6P 4,s)l ^ 6 ~ P 1P 2P 3P 5P 1,3P 2,3P 3,59 1,2 [(1 ~ P AP 6)+ P AP 6(9 4,59 4,6+9 4,59 5,6P 4,6+9 4,69 5,6P 4,5)] ^ 7 = P 1P 3P AP 5P 1,3P 3,5P 4,5 [9 29 6+9 29 4,69 5,6? 6+ 9 69 1,29 2,3^ 2+9 1,29 2,39 4,69 5,6P 2P 6l ^ 8 = P 1P 2P 3P AP 5P 1,2P 1,3P 2,4P 4,59 3,5 [9 6+9 4,69 5,6^ 6 ] ^ 9 ~ P 1P 2P 3P AP 5P 1,2P 2,4P 3,5P 4,59 1,39 2,3 [9 6+9 4,69 5,6P 6 ] T 1 0 = P IP 2P 3P AP 5P 1,2P 2,3P 2,4P 4,59 1 ,39 3,5 [9 6+9 4,69 5,6P e] ^11 — P 1P 2P 3P AP 5P 1,3P 2,3P 2,aP 4,59 1,29 3,5 [9 6+9 4,69 5,6P 6 ] T 12 = P 2P 3P AP 5P & P 2,3P 2,AP 3,bP 4,69 4,59 5,e[9 1 + 9 1,29 1,3P l] T 13 = P 2P 3P AP 5P 6P 2,3P 2,4P 3,5P 5,69 4,59 4,6 [9 1 + 9 1,29 l,3P l] = § r f * = 1 For example, if we assume th a t all the elem ents of the system shown in Figure 3.5 have the same reliability of 0.9, then the reliability of executing PRG 1 is 0.9374 and the reliability of the distributed system for the given set of program s and files equals 0.8419. j C * )r * ® r » m f s t z Ft *rsre Figure 3.6: M F S T ’s for PRG i 75 Figure 3.7: MFSF 7 s for all programs 76 Chapter 4 Terminal Reliability Optimization 4.1 In tro d u ctio n An im portant design param eter in com puter com m unications networks as well as in distributed system s is the reliability betw een two computers. Maximizing this reliability under a cost constraint is the optim ization problem addressed in this chapter. The techniques th a t can be used for reliability op­ tim ization vary from analytical to heuristic approxim ation m ethods. A good review of these techniques can be found in [TILL 77]. 
Im proving the reliability of a system can be achieved by one or a com bination of the following methods: • Im plem enting large safety factors. • Reducing the complexity of the system. • Im proving the reliability of components. • Using redundant components. • Exercising a planned m aintenance and repair schedule. 77 M ost of the m ethods proposed in the literature for maximizing the system ’s reliability are based on using redundancy to improve the reliability of series-parallel systems [TILL 70, MISR 71, SHAR 71, NAKA 77a, G O PA 78, HW AN 79]. This approach is characterized as a nonlinear integer program ­ ming problem which is more difficult to solve th an a nonlinear program m ing problem . Integer program m ing techniques give integer solutions but they do not guarantee th a t optim al solutions can be obtained in a reasonable com put­ ing tim e. Dynamic program m ing solves the optim al redundancy allocation problem , but the required com putation time and the am ount of storage needed for the generated tables m ake it infeasible when the num ber of state variables is large [MESS 70]. In this chapter, we adopt an iterative approach based on Hooke and Jeeves m ethod [AVRI 76] which has been shown to be more suc­ cessful th an integer program m ing m ethods in dealing w ith large system s [TILL 77]. 4.2 O p tim iza tio n P ro b le m F o rm u la tio n The optim ization problem th a t will be discussed here is more general th an those of solving series-parallel systems. D istributed system s have com­ plex structures and can not be modeled as series-parallel networks; therefore, different techniques need to be used to obtain optim al or near optim al solu­ tions in a reasonable com putation tim e. Furtherm ore, the cost-reliability func­ tions associated w ith the com ponents depend on their characteristics, the tech­ nology used in m anufacturing them , the environm ent surrounding their physi­ cal location, and the m ethod used for improving their reliability. The algo­ rithm s introduced here are independent of the techniques used to improve the 78 com ponents’ reliability and assume th a t the cost-reliability functions are known. These functions approxim ate the am ount of dollars th a t m ust be spent in order to increase the reliability of their corresponding elements. Term inal reliability is a measure of the probability of successful com­ m unication betw een a pair of computers. If m ultiple paths exist between a given pair of nodes, then it is given by the probability th a t at least one path is operational. A general form ula for the term inal reliability can be obtained by directly expanding the Equation 2.1 as follows: m R = Y, Pr(P,)~ EPr (Pi A Pj) + . . . t = 1 I < j + (-l)m~l P r ( P l / \ P 2 . . . / \ P m ) (4.1) where P, is used to denote the ith path and also the event in which it is operating for simplicity. E quation 4.1 will be used to derive the approxim ate contributing func­ tions th a t are used as objective functions in A lgorithm 3 as shown later. One of the basic advantages of distributed system s is their adaptability to changing environm ents, for example, replacing a set of com puters with identical but more reliable ones. An interesting problem th a t could face designers is improving the reliability between two com puting centers or among a set of com puters of an existing system w ith a fixed budget. This optim iza­ tion problem can be form ulated as: 79 GIVEN : Network Topology. Cost-Reliability Functions. MAXIMIZE : Term inal Reliability. 
OVER DESIGN VARIABLES : Investm ent variables (dt -’s). SUBJECT TO : Fixed Cost (D max). An analytical methodology which was studied in [RAGH 82] to solve this optim ization problem using Lagrangian m ultipliers showed th a t the op­ tim al solutions can be derived for some simple reliability expressions. For ex­ ample, the following problem was studied analytically: GIVEN : pi = r, + c ,- d ,- MAXIMIZE : R — V iPzPz - Pn OVER DESIGN VARIABLES : d ,- ’s * df - denotes the am ount spent on element et -. ft SUBJECT TO : £ = 25m„ m ax 0 < d, < --------— ; do not spend more than w hat is required to get p% from r,- to 1. The optim al solution resulting from applying the Lagrangian m ultiplier m ethod is: d, = n 1 { ^ m ax + X) 80 However, this analytical approach can only be applied to simple forms of reliability expressions. Therefore, we adopt an iterative approach th a t can handle large distributed system s, and we develop three algorithm s to maximize the reliability within a given cost constraint. The first two algorithm s will be used for system s whose term inal reliability expressions are com putable, while the th ird one is an approxim ation of the first algorithm so th a t it can deal w ith large system s. The approxim ate approach is necessary because the tim e complexity of reliability algorithm s has been shown not to be NP-com plete [BALL 80]. 4.3 R elia b ility O p tim iza tio n A lg o rith m s In this section, three iterative algorithm s are presented to solve the op­ tim ization problem form ulated in Section 4.2. They are based on the pattern search m ethod developed by Hooke and Jeeves [AVRI 76] which had shown success in finding the m aximum (minimum) of a large class of real valued func­ tions and ease in its im plem entation. These iterative algorithm s consist m ain­ ly of tw o steps, an exploring step and an updating step. In the exploring step, the reliability is improved by investing a 6-dollar am ount on each ele­ m ent, and then the objective function is evaluated. In the updating step, the reliability of the element th a t maximizes th a t function is im proved. These two steps are repeated until all the available budget has been used up. However, these algorithm s use different objective functions to identify at each iteration the element(s) th a t maximize them . The first algorithm uses the term inal reli­ ability expression itself as an objective function, while the second one uses the derivatives of the reliability function w ith respect to the investm ent variables 81 (di ’s). The third algorithm uses the approxim ate contributing functions asso- ciated w ith each unreliable element. 4.3.1 D e riv a tio n o f A lg o r ith m 1 In this algorithm , the term inal reliability expression is used as an objec­ tive function to be maximized in each iteration. In the exploring step, the reli­ ability of each element is improved sequentially by spending 8 dollars, and the corresponding term inal reliability value is evaluated. In the updating step, the values of the resulting term inal reliability are com pared to identify the ele­ m ent th a t maximizes it and then selecting th a t element to improve its reliabil­ ity. These two steps are repeated until all the budget has been spent. 
The num ber of iterations,maxiter , is a function of 8 and can be chosen to be the least am ount of dollars th a t m ust be spent on a component to produce an in­ crease in its reliability equal to the desired level of accuracy; for example, if the accuracy in m easuring the reliability is 10"^, 8 is then the am ount th a t should be spent on any element to increase its reliability by at least 10 Once 8 is determ ined, the num ber of iterations is given by — com_ 8 plete description of this algorithm is shown next. 82 A L G O R IT H M 1 maxiter = — * Initialization. o for i = l to n d o * where n is the num ber of elements. di = 0 od for iter= l to maxiter do b egin * Exploring step, for i = l to n do b egin dj = T 5 evaluate i?8 - d{ — df - 6 en d * U pdating step, find . Rifndx max (R{ • i 1,2, . . ., n ) improve imax : dimax = dimax +<5 end 4.3.2 D e riv a tio n o f A lg o r ith m 2 Algorithm 1 can be improved by reducing the num ber of term s th a t m ust be com puted at each iteration. This is done by evaluating the deriva­ tives of the term inal reliability w ith respect to the investm ent variables. The derivative of a function / (d) w ith respect to a variable d , by definition, is: J ( i ) = lim / ( d + A d ) - / { d ) A d — > o A d (4 2) The derivative function ( / ) can be considered as a m easure of the function’s sensitivity to variable d . In this context, d denotes the am ount of dollars th a t will be spent on each element. 83 L e m m a 4.1: Im proving the reliability of an elem ent, say e max, for which R max is m aximal will also maximize the reliability value, i.e., I f R max == m ax{ R t : i — 1, 2, . . . n } then R max = m ax( R i - * = 1* 2 , . . . n } P ro o f: Let R be the initial term inal reliability value at the beginning of an iteration and e max be the element for which R max is the maxim al value. For any other element, say ey , we will have by assum ption: R max R j Since the am ount of investm ent ( A d ) is the same for each element, substitution of these two values in E quation 4.2 yields: R m ax ~~ R R j ~ R A ~d A d and this leads to: R m ax ■ ^ > R j □ Therefore, improving the reliability of e max will result in a maximal value of the term inal reliability. Algorithm 1 can therefore be im proved by evaluating the derivatives of the reliability expression w ith respect to the in­ vestm ent variables as shown in Algorithm 2. 84 A L G O R IT H M 2 maxiter — — * Initialization, fo r i = l to n do dt = 0 od for iter= l to maxiter do begin * Exploring step, for i—1 to n do begin d % == d; +_ 8 evaluate R i di = dj - 8 en d *_LJpdating step. _ find . R(max max ( Rj . i = 1, 2, . . ., n ) improve imax : dimax = dimax + < § end Clearly, the num ber of term s com puted at each iteration of Algorithm 2 is fewer th an those perform ed in Algorithm 1 for tw o reasons. F irst, the ex­ pression denoting the partial derivative w ith respect to an investm ent variable di will have fewer term s th an the term inal reliability expression. Second, the derivative of pi can be factored out; therefore, the num ber of m ultiplications needed to evaluate R , • is reduced. 4.3 .3 D e riv a tio n o f A lg o r ith m 3 The tw o algorithm s discussed previously need a priori an exact reliabili­ ty expression which is com putationally expensive to evaluate for large net­ works [BALL 80]. 
For example, a medium size system w ith 20 different paths betw een the term inal nodes will have 220- l term s; more th a n one million 85 term s, if Equation 4.2 is used (smaller term s if more efficient algorithms are used [ABRA 79, GRNA 80, HARI 86]). An approxim ation m ethod which does not require using the reliability expression would be very useful, especially in the early stages of designing a fairly large distributed system, where decisions have to be made about where to place the links and the nodes. The general form ula of the term inal reliability shown in Equation 4.1 can be decomposed into n different subfunctions, where n is the num ber of the unreliable elements in the system. These subfunctions, called the Contri­ buting Functions (c f ’s), capture the effects of each element on the reliability. A contributing function c f k , which corresponds to element ek , can be direct­ ly derived from Equation 4.1 by canceling all the term s which do not contain ek . Hence, c f k is described by the following equation: cfk S Pr (Pi ) - £ Pr ( P , APy ) + • • • for all P, g 11* ■ for (P ; U P j ) € lit + (-l)m-1Pr ( P t/\P2 -- ■ APm ) (4.3) where denotes the set of all simple paths th a t contain ek . > 4 Figure 4.1: Derivations of the contributing functions. 86 For example, the networkflshown in Figure 4.1 has two disjointed paths between nodes and ar4, x 12 x 2i and ar13 x 3 4 . Sets IIj 2 and n24, which denote the set of all paths containing links x 12 and x 2 4, respectively, consist of only path P v Similarly, sets IIj 3 and n3 4 consist of only path P 2. Once the set of all possible paths and the sets IIj 2, n 2 4, IIj 3, and II3 4 have been found, the term inal reliability and the contributing functions of the links of this netw ork can be obtained from Equations 4.1 and 4.3, respectively, as shown below: P i = x 12x 2 4; P 2 = x i $ ar3 4 iii 2 — n2 4 = { p !}; n13 = n3 4 = { p 2} ^ = P 1,2 P 2,4 + P 1,3 P 3,4 “ P 1,2 P 2,4 P 1 ,3 P 3,4 cf 1,2 = cf 2,4 == P 1,2 P 2,4 ~P 1,2 P 2,4 P 1,3 P 3,4 Cf ! 3 = cf 3 4 = p i 3 p 3 4 - p ! 2 P 2,4 P 1,3 P 3,4 T h eo rem 4.1: The am ount of increase in the term inal reliability as a result of im proving the reliability of element ek ( A ) is equal to the increase in its contributing function A cf k , i.e., A Rk = A cf k F roof: T he term s of the general formula of the term inal reliability can occur in either of the following cases: 1: They contain element ek , or 87 2: They do not contain ek . Im proving the reliability of ek will only affect those term s satisfying case 1 . However, the values of the other term s which satisfy case 2 do not change when the reliability of ek is improved. This means th a t the increase in term inal reliability as a result of improving ek is equal to the sum m ation of all the changes in the values of the term s satisfying case 1 . Therefore, the in­ crease in term inal reliability, A Rk , can be obtained from adding the changes in the values of the term s satisfying case 1 . By doing so, we will have the fol­ lowing equation: A P t = £ A P r ( P , ) - £ A P r ( P i /YPi ) + •• for all P, for all (P , UP j )< S 11* + (~ l) m _ 1 A P r ( P j A P 2 ' ' ■ A P m ) (4.4) However, the contributing function cf k will contain only those term s of the term inal reliability expression th a t satisfy case 1 . As a result, the in­ crease in this function will be equal to th a t of the term inal reliability. 
Let Δcf_i denote the increase in cf_i resulting from investing δ dollars in improving the reliability of e_i, and let Δcf_max denote the maximal increase among the contributing functions at one iteration, i.e.,

    Δcf_max = max ( Δcf_i : i = 1, 2, ..., n )

If the increases in the contributing functions are used, instead of the terminal reliability expression, in Algorithm 1 to identify the elements that must be improved at each iteration, then we have the following corollary:

Corollary 4.1: If the reliability of the element that produces a maximal increase in its contributing function is improved at each iteration, then the resulting optimal solution is the same as, or equivalent to, the one obtained using Algorithm 1.

Proof: We have to show that the element selected for improvement at each iteration of Algorithm 1 is the same element, or one equivalent to it, when the increases in the contributing functions (the Δcf_i's) are compared to identify e_max. In the exploring step, Algorithm 1 searches for the element that maximizes the terminal reliability, while in the updating step the reliability of that element alone is improved. Without loss of generality, assume that improving the reliability of e_k produces the maximal terminal reliability in a particular iteration of Algorithm 1; in the updating step, the reliability of e_k is then upgraded. If the improvement in e_k produces a maximal terminal reliability, it also produces a maximal increase ΔR_k. By the previous theorem, the increase ΔR_k is equal to Δcf_k; therefore, the element that maximizes the terminal reliability also produces the maximal increase in its contributing function, which leads to:

    ΔR_k = Δcf_k = max ( Δcf_i : i = 1, 2, ..., n )

Hence, if the contributing functions' increases Δcf_k are used as objective functions in Algorithm 1, the resulting optimal solution is equivalent to the one obtained when the terminal reliability is used as the objective function. If there is more than one element that produces an equivalent increase in the reliability, choosing any one of them produces the same or equivalent optimal solutions.   □

It is clear from the previous discussion that the changes in the contributing functions can be used as objective functions in Algorithm 1 to find the optimal solution for maximizing the terminal reliability. The contributing functions' changes are described by the following set of equations:

    Δcf_k = Σ_{P_i ∈ Π_k} ΔPr(P_i) - Σ_{(P_i ∪ P_j) ∈ Π_k} ΔPr(P_i ∧ P_j) + ... + (-1)^{m-1} ΔPr(P_1 ∧ P_2 ∧ ... ∧ P_m),   k = 1, 2, ..., n        (4.5)

The first set of positive terms and the last term in any function Δcf_k can be derived directly from the set of all paths that include e_k. However, the terms lying between the first positive terms and the last term cannot be evaluated in a straightforward manner. These terms have alternating signs, so the negative terms reduce the function Δcf_k while the positive terms partially cancel the previous reduction, and so on. In large networks, the number of terms that cancel each other is larger because of the alternation in the terms' signs, especially when the reliability values of the elements are identical or close to one another.
Those terms can, with a good approximation, be replaced with only the last term multiplied by a constant γ, which is added to compensate for the effect of the canceled terms and to guarantee positive values for the contributing functions' increases when their corresponding elements are enhanced. By doing so, the approximate contributing functions become:

    ΔCF_k = Σ_{P_i ∈ Π_k} ΔPr(P_i) - γ ΔPr(P_1 ∧ P_2 ∧ ... ∧ P_m),   k = 1, 2, ..., n        (4.6)

The constant γ is chosen so that the changes in the approximate contributing functions ΔCF_k are positive when their corresponding components are improved. Since the maximal number of positive terms in any ΔCF_k is m, an upper bound on γ that might guarantee positive increases for the ΔCF_k's is m. Hence, the constant γ is heuristically chosen as the largest integer that guarantees positive increases in the ΔCF_k's when their elements are improved; i.e., the following holds for all unreliable elements:

    γ = max { l : 0 < l <= m and ΔCF_k > 0 for all e_k }        (4.7)

where l is an integer variable.

Algorithm 3 can now be derived from Algorithm 1 by replacing the terminal reliability expression with the approximate contributing functions' increases ΔCF_k shown in Equation 4.6. The approximate contributing functions can be derived directly from the set of paths between the terminal nodes; therefore, the evaluation of reliability is not required in determining near-optimal solutions. Hence, Algorithm 3 can be used to solve this optimization problem for complex distributed systems.

Let m be the number of simple paths between the terminal nodes. The number of terms in the terminal reliability expression evaluated according to Equation 4.1 is 2^m - 1 (or still considerably larger than m when efficient algorithms are used), while the number of terms in the ΔCF_k's is of order m. The complexity of the computations needed at each iteration depends on the efficiency of the reliability algorithms in obtaining compact expressions; it is, in general, of an order much larger than m when Algorithm 1 is used, while it is of order m when Algorithm 3 is used. Therefore, the computations are drastically reduced, at the expense of finding a near-optimal rather than exact solution for relatively large distributed systems. However, as shown later, the errors involved in this approximation are extremely small and do not exceed 0.2% in the many examples tested with different cost-reliability functions. Therefore, we believe that this algorithm provides a powerful tool for the reliability maximization of large networks. The complete description of Algorithm 3 is shown next.

ALGORITHM 3
find all paths P_i
* Initialization.
maxiter = D_max / δ
evaluate the ΔCF_k's
d_i = 0 for all i = 1, 2, ..., n
for iter = 1 to maxiter do
begin
    * Exploring step.
    for i = 1 to n do
    begin
        d_i = d_i + δ
        evaluate ΔCF_i
        d_i = d_i - δ
    end
    * Updating step.
    find imax : ΔCF_imax = max ( ΔCF_i : i = 1, 2, ..., n )
    improve imax : d_imax = d_imax + δ
end
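A compact Python sketch of Algorithm 3 follows. It assumes the set of simple paths between the terminal nodes has already been enumerated, and that a cost-reliability function cost_fn(e, d) giving the reliability of element e after an investment d is supplied by the designer; the helper names (approx_cf_increase, algorithm3) and the data layout are illustrative assumptions rather than part of the algorithm's original statement.

# Sketch of Algorithm 3: greedy allocation driven by the approximate
# contributing-function increases of Equation 4.6.

def approx_cf_increase(k, paths, rel, new_rel_k, gamma):
    """Delta CF_k (Eq. 4.6) when element k's reliability moves to new_rel_k."""
    def prob(elements, improved=False):
        prod = 1.0
        for e in set(elements):
            prod *= new_rel_k if (improved and e == k) else rel[e]
        return prod
    containing = [P for P in paths if k in P]          # the path set Pi_k
    all_elems = [e for P in paths for e in P]          # elements of P1 ^ ... ^ Pm
    first = sum(prob(P, True) - prob(P) for P in containing)
    last = prob(all_elems, True) - prob(all_elems)
    return first - gamma * last

def algorithm3(paths, cost_fn, D_max, delta, gamma):
    elements = {e for P in paths for e in P}
    d = {e: 0.0 for e in elements}                     # investment variables
    rel = {e: cost_fn(e, 0.0) for e in elements}       # initial reliabilities
    for _ in range(int(round(D_max / delta))):
        best_e, best_inc = None, float("-inf")
        for e in elements:                             # exploring step
            inc = approx_cf_increase(e, paths, rel,
                                     cost_fn(e, d[e] + delta), gamma)
            if inc > best_inc:
                best_e, best_inc = e, inc
        d[best_e] += delta                             # updating step
        rel[best_e] = cost_fn(best_e, d[best_e])
    return d

As a usage sketch, the two-path network of Figure 4.1 would be described by paths = [("12", "24"), ("13", "34")] together with, for example, the Type 1 linear cost-reliability function of the next section.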
4.4 Illustrative Examples

In this section, Algorithms 2 and 3 are used to find the optimal solutions for the three networks shown in Figures 4.2, 4.3, and 4.4. Two types of cost-reliability functions are considered, linear and exponential. We also studied this optimization problem for homogeneous and nonhomogeneous networks to show the significance of Algorithm 3 and the quality of the approximations involved in finding the optimal solutions. In this chapter, the term homogeneous refers to networks in which all the unreliable elements have the same cost-reliability function, and the term nonhomogeneous refers to networks in which the elements have different cost-reliability functions.

The links of the considered networks are assumed unreliable, while the nodes are assumed perfectly reliable for the sake of simplicity. The Relative Error (RE) is used as a measure of the error involved in approximating the optimal solutions and is defined as follows:

    RE = ( R_opt - R_apx ) / R_opt

In all the examples studied, the following assumptions are made:
1. The number of iterations is 10.
2. The constants γ are chosen so that the inequality given in Equation 4.7 is satisfied; γ equals 2 for the networks shown in Figures 4.2 and 4.3 and equals 7 for the one shown in Figure 4.4.
3. The total budget is normalized.

The cost-reliability functions are usually difficult to derive because they depend on the methods used to improve the components' reliabilities, the components' characteristics, and the environment surrounding them. Therefore, several types of these functions are considered to cover, with a good approximation, a large class of cost-reliability functions. Communication links have, in general, high reliabilities, especially if coaxial cables are used; this has led us to assume high initial reliabilities, around 0.9. The cost-reliability functions considered are as follows:

• Type 1
All the links are assumed to have the same linear cost-reliability function, given as
    p_{i,j} = p_0 + (1 - p_0) d_{i,j}
where p_0, equal to 0.9, is the initial reliability of every link and d_{i,j} is the investment variable on link x_{i,j}.

• Type 2
All the links are assumed to have the same exponential cost-reliability function, given as
    p_{i,j} = 1 - (1 - p_0) e^{-λ d_{i,j}}
where the constant λ is chosen for illustration purposes to be 3.6, which brings p_{i,j} to within 10^-2 of 1 when d_{i,j} = D_max.

• Type 3
The links have different linear cost-reliability functions of the form of Type 1, given as
    p_{i,j} = p0_{i,j} + (1 - p0_{i,j}) d_{i,j}
where the initial reliabilities of the links are randomly chosen around 0.9, as shown in Table 4.1.

• Type 4
The links have different exponential cost-reliability functions of the form of Type 2. The constants λ_{i,j} of the links' functions are given in Table 4.2.

In the following, we discuss the results obtained from applying Algorithm 2 and Algorithm 3 to the network shown in Figure 4.2. The optimal solutions derived by the two algorithms are compared using the relative-error criterion, which indicates how far the approximate solutions obtained using Algorithm 3 are from the exact optimal solutions obtained using Algorithm 2. As shown in Table 4.3, for the network of Figure 4.2, Algorithm 3 gave the same optimal solutions for cost-reliability functions of Types 1, 2, and 3. The value 1.0 corresponding to Types 1 and 3 in Table 4.3 means that the terminal reliability is maximized when all the available budget is spent on improving x_{1,2}'s reliability.
T he value of 0.5, which is given for ar1 2 and a ; 5 6 in Type 2 , means th at the optim al distribution is obtained when 50% of the budget is spent on improving rr x,2 s reliability and the other 50% on x 5 6. In Type 4, the optim al solution obtained using Algorithm 2 suggests spending 50%, 10%, and 40% of the budget to improve the reliability of elements 2r12, X i 3, and x 5 6, respectively. The solution obtained using Algorithm 3 suggests spending 60% and 40% on improving x 1 2 and ar5 6, respectively. However, the relative error involved in this approxim ation is only 0.055%. Similarly, these two algorithms are applied to the bridge and the modified A R PA N ET networks shown in Figures 4.3 and 4.4, respectively. The 96 results obtained and the relative errors involved in deriving the optim al solu­ tions of the two algorithm s are summarized in Tables 4.4 through 4.8. These num erical results indicate the significance of Algorithm 3 in finding the ap­ proxim ate optim al distribution of the budget to maximize the term inal reliabil­ ity between a given pair of nodes. For example, a com pact term inal reliability expression containing 2 2 term s can be obtained using SYREL for the network shown in Figure 4.4. If this expression is used in Algorithm I, the num ber of term s to be calculated and summed is of order 22, while it is of order 7 when A lgorithm 3 is used. This reduction of com putations becomes very significant when the num ber of iterations is large. Furtherm ore, we avoid deriving the term inal reliability expression for solving the optim ization problem because of the complexity involved in its evaluation. Instead, we used the approxim ate contributing function’s changes shown in Equation 4.6 which can be derived directly from the set of all paths. '2,6 6,6 Figure 4.2: A network with 3 paths. 97 * 3 Figure 4.3: A bridge network w ith 4 paths. 5,6 Figure 4.4: A modified A RPAN ET with 13 paths. 98 p ° i . i Fig. 4.2 Fig. 4.3 Fig. 4.4 P ° 1,2 0.90 0.90 0.95 P ° 1 ,3 0.92 0.98 0.90 P ^ 2 ,3 0.90 0.98 P ° 2 ,4 0.95 0.95 0.93 P ^2 ,5 0.94 0.90 P ° 3 , 4 0.92 P ° 3 , 5 0.90 0.92 P ®4,5 0.94 P ° 4 ,6 0.98 0.96 P ° 5 , 6 0.93 0.93 Table 4.1: The initial reliabilities for Type 3. J Fig. 4.2 Fig. 4.3 Fig. 4.4 ^1,2 2.625 2.625 3.670 ^1,3 5.756 2.526 2.625 ^2,3 2.526 2.526 ^2,4 3.670 3.670 4.720 ^2,5 3.850 2.526 ^3,4 5.756 ^3,5 2.526 5.756 ^4,5 3.850 ^4,6 2.526 4.320 ^5,6 4.720 5.600 Table 4.2: The constant values for Type 4. U sing A lg o rith m 2 In itia l R eliab ility = 0 .9 528 cost Type Optimal budget distribution Terminal Reliability ^ 1,2 ^ 1,3 d 2,4 ^ 2,5 d 3,5 d 4,6 d 5,6 Type 1 1.0 0.0 0.0 0.0 0.0 0.0 0.0 0.977751 Type 2 0.5 0.0 0.0 0.0 0.0 0.0 0.5 0.990057 Type 3 1.0 0.0 0.0 0.0 0.0 0.0 0.0 0.994508 Type 4 0.5 0.1 0.0 0.0 0.0 0.0 0.4 0.990059 Table 4.3: The optim al distribution of the budget when A lgorithm 2 is applied to Figure 4.2. U sing A lg o rith m 3 In itial R eliab ility = 0 .9 5 2 8 cost Optimal budget distribution Terminal Relative Type ^ 1,2 ^ 1,3 d 2 4 d 2,5 d 3,5 d 4& d 5,6 Reliability Error Type 1 1.0 0.0 0.0 0.0 0.0 0.0 0.0 0.977751 0.00% Type 2 0.5 0.0 0.0 0.0 0.0 0.0 0.5 0.990057 0.00% Type 3 1.0 0.0 0.0 0.0 0.0 0.0 0.0 0.994508 0.00% Type 4 0.6 0.0 0.0 0.0 0.0 0.0 0.4 0.989515 0.05% Table 4.4: The optim al distribution of the budget when A lgorithm 3 is applied to Figure 4.2. 100 U sin g A lg o r ith m 2 In itial R elia b ility = 0 .9 6 3 9 cost Type O ptim al budget distribution Term inal Reliability ^ 1,2 ^ 1,3 ^ 2,3 d 2 4 ^ 3,4 Type 1 0.5 0 . 0 0 . 
0 0.5 0 . 0 0.989170 Type 2 0.5 0 . 0 0 . 0 0.5 0 . 0 0.996404 Type 3 0 . 0 0 . 0 0 . 0 0 . 0 1 . 0 0.997910 Type 4 0 . 0 0 . 6 0 . 0 0 . 0 0.4 0.996522 Table 4.5: The optim al solution obtained from applying Algorithm 2 to Figure 4.3. U sin g A lg o r ith m 3 In itia l R eliab ility = 0 .9 6 3 9 cost Type O ptim al budget distribution Term inal R eliability Relative E rror ^ 1 , 2 ^ 1,3 ^ 2 , 3 ^ 2,4 ^ 3,4 Type 1 1 . 0 0 . 0 0 . 0 0 . 0 0 . 0 0.989100 0.007% Type 2 0 . 2 0 . 2 0 . 2 0 . 2 0 . 2 0.995058 0.140% Type 3 1 . 0 0 . 0 0 . 0 0 . 0 0 . 0 0.995908 0 .2 0 0 % Type 4 0 . 2 0 . 2 0 . 1 0 . 2 0.3 0.995299 0 .1 2 0 % Table 4.6: The optim al solution obtained from applying A lgorithm 3 to Figure 4.3. U sing A lg o rith m 2 In itia l R eliability = 0 .9 7 7 2 cost Type Optimal budget distribution Terminal Reliability ^ 1,2 ^ 1,3 ^ 2,3 ^ 2,4 ^ 2,5 ^ 3 ,5 ^ 4,5 d 4 6 ^ 5,6 Type 1 0.0 1 .0 0 .0 0 .0 0 .0 0 .0 0.0 0 .0 0.0 0.986199 Type 2 0 .2 0.3 0 .0 0 .0 0 .0 0 .0 0 .0 0.3 0.2 0.994074 Type 3 0 .0 1 .0 0 .0 0 .0 0 .0 0 .0 0 .0 0 .0 0.0 0.995986 Type 4 0.4 0 .1 0 .0 0 .1 0 .0 0 .0 0 .0 0 .2 0 .2 0.994762 Table 4.7: The optim al solution obtained from applying A lgorithm 2 to Figure 4.4. U sing A lg o rith m 3 In itia l R eliability = 0 .9 7 7 2 cost Type Optimal budget distribution Terminal Reliability Relative Error ^ 1,2 ^ 1,3 d 2,3 ^ 2,4 ^ 2,5 ^ 3,5 d 4,5 d 4 6 d 5,6 Type 1 0 .0 1 .0 0 .0 0.0 0 .0 0 .0 0.0 0 .0 0 .0 0.986199 0 .00% Type 2 0 .1 0.3 0 .1 0.0 0 .0 0 .0 0 .1 0.3 0 .1 0.992644 0.14% Type 3 0 .0 1 .0 0 .0 0.0 0 .0 0 .0 0 .0 0 .0 0 .0 0.995986 0 .00% Type 4 0 .2 0 .2 0 .0 0 .0 0 .0 0 .0 0 .1 0.3 0 .2 0.993733 0 .1 0 Table 4.8: The optim al solution obtained from applying A lgorithm 3 to Figure 4.4. 102 Chapter 5 Distributed Functions Allocation for Reliability and Delay Optimization 5.1 In tro d u ctio n The reliability of a distributed task was studied in C hapter 3 in term s of minimal file spanning trees M F S T ’s. The distribution of the resources asso­ ciated w ith a task affects the num ber of these trees and thus the reliability. Similarly, the allocation of the resources influences the average comm unication delay incurred during a ta sk ’s execution; the delay becomes significantly long when m any com puters m ust be visited before required resources can be ac­ cessed. This chapter studies the allocation problem of a given ta sk ’s functions (resources) and develops algorithm s to distribute them across the system so th a t reliability and delay are optimized. Several optim ization problems have been addressed in the context of com puter com m unication networks, for example, the capacity assignm ent, and the topology optim ization [FRAN 72, FR A T 76, KLEI 76, GERL 77]. They minimize the total cost of the system in addition to satisfying certain perfor­ mance constraints such as average delay. The topology optim ization w ith reli­ 103 ability constraint is a hard problem, and a heuristic solution is usually used to provide two or more node-disjoint paths between every pair of P E ’s. The op­ tim al allocation of the resources of a distributed database for minimizing the storage cost and response tim es has been widely investigated [CHU 69, COFF 81, MARC 81, RAM A 83]. Researchers have also investigated the problem of combined file allocation and netw ork design to minimize storage and communi­ cation costs [CHEN 80, IGNI 82, IRAN 82, LANI 83]. 
The topology of a distributed system is assumed here to be fixed, and it is required to distribute the resources of a task so th a t the reliability is maxim­ ized and the delay is minimized. Algorithm s th a t assist designers in determ in­ ing the optim al allocations of the resources are im portant especially in the ear­ ly stages of designing reliable distributed systems; they provide the initial esti­ m ation of the best preform ance measures th a t can be obtained w ith a given set of resources. If they do not satisfy the system ’s requirem ents, alternative designs or more redundant resources can be added. To achieve this objective, we study the effect of resource allocations on the ta sk ’s reliability and com­ m unication delays. We develop optim ization algorithm s to allocate the resources of a task or a set of tasks. The form ulation of the optim ization problem is as follows: 104 GIVEN : Fixed topology. n processing elements. MAXIMIZE : D istributed task reliability. MINIMIZE : Average packet delay. OVER DESIGN VARIABLES : A llocation variables. SUBJECT TO : R edundancy used is less th an a given upper bound. Two different approaches are used to solve this optim ization problem. In the first m ethod, it is decomposed into two sub-problem s: 1 ) find the set of allocations th a t maximize the reliability, and 2 ) find the task packet delay as­ sociated w ith each of these allocations and choose the solution th a t results in m inim um delay. In the second approach, we construct a com pound objective function th a t m easures the effects of resource allocations on both reliability and delay. Then, it is maximized using an algorithm sim ilar to the one developed in the first approach. 5.2 R elia b ility an d D elay A n a ly sis In this section, we discuss a task ’s reliability and provide a function for approxim ating it. R eliability of a distributed task can be evaluated using the algorithm s presented in C hapter 3. However, for solving the optim ization problem of resource allocation in large systems, we approxim ate the task ’s reli­ ability. 105 5.2.1 A p p r o x im a tin g T a s k ’s R elia b ility As we have studied, in our model of distributed system s, a distributed task is a set of functions (actions) to be executed on different processors. F or example, these functions could be calling a rem ote procedure. R edundancy in the functions ( / ,• ’s) is introduced to increase reliability and fault tolerance [RENN 80, GAJRC 82]. An extreme level of redundancy, which is the num ber of redundant copies of functions or files, is the replication of each function in all the elem ents of the system . However, a large num ber of redundant copies of resources, such as replicated files, introduces additional traffic to m aintain consistency. It was shown in C hapter 3 th a t triple redundancy is generally sufficient to increase the reliability by 90% of the maximal achievable im­ provem ent. For large systems, it is not feasible to evaluate task ’s reliability at each iteration of an optim ization algorithm . Furtherm ore, the purpose of the relia­ bility evaluation is to direct the search of an optim ization algorithm tow ard the allocations th a t maximize the reliability. Hence, any approxim ation of the task reliability can be used effectively in this type of optim ization. Let us as­ sume th a t all the com puters and com m unication links have the same reliabili­ ties and are equal to pc and p i , respectively. 
This is not unreasonable, as most of the components' reliabilities are within a close range. Let n_t denote the number of nodes in an MFST of a task T. The reliability and unreliability of that tree are, respectively, computed as:

    r_t = p_c^{n_t} * p_l^{n_t - 1} ;  Q_t = 1 - r_t

Let us assume that there are N_t MFST's. If we assume that all these trees are disjoint and that they have equal reliability r_t, the task reliability can be evaluated as:

    R = Σ_{k=1}^{N_t} ( N_t choose k ) r_t^k Q_t^{N_t - k} = 1 - Q_t^{N_t}        (5.1)

However, the sets of trees are not usually disjoint, and therefore there will be some common components among them; the failure of any one of these components contributes to the failure of more than one tree. As a result, the reliability given in Equation 5.1 can be viewed as an upper bound. This equation can be used to approximate the reliability in the optimization algorithms to be presented in Section 5.4.
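A small Python sketch of this approximation is given below; the MFST count N_t, the tree size n_t, and the component reliabilities are assumed inputs, and the numerical values in the usage line are illustrative only.

# Approximate task reliability (Equation 5.1), assuming N_t disjoint MFSTs,
# each with n_t nodes of reliability p_c and (n_t - 1) links of reliability p_l.

def approx_task_reliability(N_t, n_t, p_c, p_l):
    r_t = (p_c ** n_t) * (p_l ** (n_t - 1))   # reliability of one MFST
    Q_t = 1.0 - r_t                           # unreliability of one MFST
    return 1.0 - Q_t ** N_t                   # at least one tree operational

# Illustrative use: three trees of three nodes each, p_c = p_l = 0.9.
print(approx_task_reliability(3, 3, 0.9, 0.9))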
Since there are many trees for each allocation of the functions, the TPDp of an Allocation ap can be computed in term of the Packet Delay (PDi ) associated with each M FSTt and can be expressed as: TPD, = - i - £ PD, '1 (5.3) where N t denotes the number of M F S T 's in allocation ap . Let tr (* ,j ) denote the number of packets transferred between func­ tions /,• and f j , and let ek t (i .j ) be a binary variable that takes on value 1 when xk t of an MFSTi is included in the route between /,- and f j . The to­ 109 tal am ount of traffic (7 ) flowing on an MFS'1\ is the sum m ation of all the traffic betw een all pairs of functions, i.e., ■Uj Uj 1 = E E t r { i , j ) i= 1 j = 1 (5.4) where rij denotes the num ber of functions in a task. The packet delay PDj of an MFST,- can be found by applying the fol­ lowing steps [TANE 81]: 1. Find the traffic flow on each channel of M F ST ^ . \ = X X «*,!(■'.3 j) Vxk 1 eMFSTi ; — 1 1 (5.5) 2. Find the total traffic on all channels. X = X) \ / ( 1 \ i x i t l e M F S T , ’ ( 5 .6 ) 3. F ind the m ean num ber of hops per packet. - = A ” "" 7 (5-7) 4. F ind the M ean Channel Delay (M CD^). X1. 1 T 1 M CD{ = X] VxuEMFST; * (5.8) 5. Find the Packet Delay of M F STi (F T ,-). 110 P D ' = n M C D i (5.9) The objective of the optim ization algorithm s presented here, is to find the set of allocations th a t optim ize both reliability and delay. The comparison among allocations is not straightforw ard because of their different effects on re­ liability and delay. Hence, in order to simplify the comparison, the TPD is norm alized w ith respect to the W orst Task Delay ( WTD ). This delay is com­ puted in sim ilar steps to the one used in com puting the TPD after considering the following changes: 1 . The m ean channel delay (MOD; ) associated w ith an M F ST * is considered the w orst possible delay. This occurs when the m aximal traffic flows through the m inim al capacity channel, i.e., 1 ^ CAP min - 7 (5.1.0) where C!APmin denotes the minimal capacity channel of M F ST i . 2 . The m ean num ber of hops per packet is equal to the num ber of links in M F S T i, i. e., n — nt - 1 where nt denotes the num ber of nodes in M FST ,-. TPD The ratio (--------- ) is unitless and therefore can be used w ith the relia- v W T D J bility function to form a compound objective function th a t m easures both reli­ ability and delay. I ll 5.3 D eco m p o sin g th e O p tim iza tio n P ro b lem This approach is suitable for large distributed system s because it does not require the evaluation of the reliability expression to determ ine the alloca­ tions th a t maximize it. A property of the allocation th a t maximizes the relia­ bility is identified to guide the search for this type of allocation. Let T be a distributed task of ny functions and also require the cooperation of ny computers. The function /,■ th a t is assigned to run on a com puter does not necessarily be only one function; it could be a set of func­ tions grouped together to form one entity in order to satisfy some other perfor­ mance constraints. F or a given distributed system , let n max be the m axim um num ber of trees of ny P E ’s th a t run T and n be the total num ber of P E ’s. D efin itio n 5.1: A tree of ny P E 's is said to cover a distributed task T if they can run all its functions such th at each one runs on a distinct P E . T h eo rem 5.1: The reliability of a distributed task T is maximized if each tree of size ny P E ’s out of n max covers T . 
5.3 Decomposing the Optimization Problem

This approach is suitable for large distributed systems because it does not require the evaluation of the reliability expression to determine the allocations that maximize it. Instead, a property of the allocations that maximize the reliability is identified and used to guide the search for this type of allocation.

Let T be a distributed task of n_f functions, which therefore requires the cooperation of n_f computers. The function f_i that is assigned to run on a computer need not be a single function; it could be a set of functions grouped together to form one entity in order to satisfy other performance constraints. For a given distributed system, let n_max be the maximum number of trees of n_f PE's that can run T, and let n be the total number of PE's.

Definition 5.1: A tree of n_f PE's is said to cover a distributed task T if its PE's can run all of T's functions such that each function runs on a distinct PE.

Theorem 5.1: The reliability of a distributed task T is maximized if each tree of size n_f PE's out of the n_max trees covers T.

Proof: Assume that there exists an allocation of the functions, say a_max, such that each tree of n_f nodes covers T. If the number of trees covering T in any other allocation is equal to n_max, then the reliability is the same, because both allocations have the same set of trees covering T. Let a_y be an allocation in which the number of trees covering T, denoted by n_c(a_y), is less than that of a_max. This set of trees is a subset of the set corresponding to a_max. Hence, the task reliability corresponding to a_y, which is the probability that at least one covering tree is in the operational state, is less than that associated with a_max.   □

Most of the optimization algorithms for computer networks and distributed systems that address reliability as a performance requirement do not use the reliability expression in their formulations. Instead, they use simpler criteria, such as the connectivity of the network or a lower bound on the number of node-disjoint paths between node pairs. We likewise do not use the reliability expression to guide the search toward maximal-reliability allocations; the result of Theorem 5.1 is used instead to identify those solutions. That is, the number of trees executing the given task is maximized. This allows us to formulate the problem as a 0-1 integer linear programming problem.

Let y_{i,j} denote a binary variable that takes on value 1 when function f_i runs on node x_j and 0 otherwise. Also, let n_c(a_p) represent the number of trees of n_f PE's covering a distributed task T for an allocation a_p. The allocation of a task's functions can be formulated as follows:

Formulation 5.1
GIVEN : Fixed topology; n processing elements.
MINIMIZE : Σ_{i=1}^{n_f} Σ_{j=1}^{n} y_{i,j} ; minimize the total number of allocated functions.
SUBJECT TO : all trees of n_f nodes cover T,
and Σ_{j=1}^{n} y_{i,j} <= r_i ,  i = 1, 2, ..., n_f ; r_i is an upper bound on the redundancy in f_i.

In large distributed systems, it is not practical to allocate the functions associated with a distributed task so that all the trees of n_f PE's cover the task under consideration; to cover all trees, the upper bound on redundancy could have to be raised to nearly the number of PE's in the system, which is not feasible because of the difficulty in maintaining consistency and the resulting overhead traffic. As a result, the previous formulation is modified so that the number of trees that cover a given task is maximized. We introduce a control variable c_k to indicate whether or not tree t_k covers T; it is a binary variable that takes on value 1 only when each PE in that tree processes a distinct function of T and 0 otherwise. The formulation of the problem then becomes:

Formulation 5.2
GIVEN : Fixed topology; n processing elements.
MAXIMIZE : n_c(a_p) = Σ_k c_k over all trees t_k ; n_c(a_p) denotes the number of trees covering T for allocation a_p.
SUBJECT TO : Σ_{j=1}^{n} y_{i,j} <= r_i ,  i = 1, 2, ..., n_f ; r_i is an upper bound on the redundancy in f_i.

Formulation 5.2 is a standard 0-1 integer linear programming problem, and therefore any existing algorithm can be used to solve this allocation problem. An algorithm to solve Formulation 5.2 can be derived based on the branch-and-bound technique [HILL 80]. The algorithm consists of four steps: branch, bound, fathoming, and stopping. In the branch step, one set is selected according to a criterion, which could be the one that gives the better bound or the most recently created set.
In the bound step, the num ber of trees cov­ ering T is determ ined for ap . In the fathom ing step, the selected set is checked and then dropped if it violates the set of constraints. Once it is re­ moved, the algorithm proceeds from the branch step. Otherw ise, the following substeps are perform ed before it proceeds to the stopping step: 1) the current lower bound Zt is reset to nc (ap ) whenever it is larger th an Z [ ; 2) the selected allocation ap is partitioned into two subsets about an allocation variable y; y . In the first subset, yi j is set to 1 while in the second subset, it is set to 0. In the stopping step, the algorithm is stopped when all the allocation variables of the rem aining subsets have been considered; otherwise, it proceeds from the branch step and so on. The optim al allocations are those whose nc (ap )’s are equal to Z\ . If all the com puters and the comm unication links have the same relia­ bility, then any allocation a th a t maximizes the num ber of trees covering a task T will result in a maximal reliability. This is true because in this case all the trees will have the same reliability and therefore the highest value is achieved when the num ber of trees is maximized. However, if the elements have different reliability values, it is possible for an allocation a 1 to exist, for which nc (a ' ) < nc (a ), and its reliability is larger th an th a t of a . In gen­ eral, the reliability values of com puters as well as com m unication channels are in a close range, and therefore algorithm s for solving Form ulation 5.2 will identify optim al or near optim al solutions. 115 F u n ctio n A llo ca tio n P ro ced u re (F A P ) 1. Initialization. 1.1 Z[ = 0 ; the current lower bound is zero. 1.2 A = a 0 ; list A has allocation a 0 in which no function has been as­ signed to any node. 2. B ranch step. 2.1 select the newest partitioned subset, say ay , 3. Bound step. 3.1 find nc (ay ); the num ber of trees covering T . 3.2 find nc (ay ) ' ; the num ber of trees th a t can not cover T . 4. Fathom ing step. 4.1 ay is fathom ed, and is therefore removed from list A , when one or more of the following constraints are violated: n S Vi,j < ri > * = 1 > 2, . . . nf 3= i n max - n c (ay ) ' > Zi ; the num ber of trees th a t could cover T is greater th an Z i . 4.2 if aj violates the constraints, go to the branch step; otherwise, do 4.2.1 if n c (ay ) > Zt , then Zt — nc (ay ) 4.2.2 partition ay into tw o subsets ( ay; , ay2 ) about the unconsidered vari­ able yp k such th a t yp ^ = 1 in ayt and yp k = 0 in ayg . 5. Stopping step. If all the allocation variables y, y have been considered, the procedure stops; otherwise, go to branch step. 116 A lg o r ith m 5.1 1. E num erate all the sets of nj nodes th at could form trees w ith (nf -1) links. 2. A pply FA P to obtain the set of allocations th a t maximize the num ber of trees covering T . 3. For each allocation obtained in the second step, evaluate the task reliabili­ ty using the approach discussed in C hapter 3. Then, discard from the set of allocations the ones th a t have lesser reliability th an other allocations w ith m aximal reliability. 4. Find the TPD associated w ith each allocation obtained in the third step. 5. 
Choose the optim al allocation such th a t TPDp is m inimized, i.e., TPDopt = M IN (TPDp) V V 5.4 C o n stru ctin g a C om p ou n d O b jective F u n ctio n Instead of solving the optim ization problem sequentially as done in Al­ gorithm 5.1, one could form a function th at m easures both reliability and de­ lay and then find the allocation th a t maximizes th a t function. TPD F = a R - (I - , W TD (5.11) where a and /? are weight factors assigned to reliability and delay functions, respectively. This function (Equation 5.11) can further be simplified if we normalize their sum m ation, i.e., a + j3 = 1. 117 Substituting ft w ith 1 - a in Equation 5.11 yields to: IP _ / O , TPD , TPD F — Oi (R -f- — — — ) — ----------- v WTD ’ WTD (5.12) The value of the param eter a can be used to reflect the relative impor­ tance of the reliability w ith respect to the delay. The optim ization problem can be solved as shown in Form ulation 5.3. F o rm u la tio n 5.3 GIVEN MAXIMIZE SU BJECT TO : Fixed topology. n processing elements, param eter a. TPD ^ _ TPD : F = a (R + n : £ Vi,j< ri / = i . ; r,- is an upper bound on the redundancy in / . W T D ' WTD -I nf In some applications, it is required to reduce the cost of running the functions of a given task on the com puters and also satisfy some reliability and delay constraints. Let us assume th a t c,- y denotes the cost of running function / i on com puter Xj . If the cost is a m ajor objective in the functions allocation problem , one could solve it as shown in Form ulation 5.4. 118 Formulation 5.4 GIVEN : Fixed topology. n processing elements, parameter a. #/ n MINIMIZE : COST = £ £ eitj yitj X — 1 / = 1 n SUBJECT T O : J = 1 ; r ,- is an upper bound on the redundancy in /,-. R > i? m in; reliability must be larger than a lower bound. TPD < TPD m ax; delay must be smaller than an upper bound. The Formulations 5.3 and 5.4 can be solved using a procedure similar to the one described in FAP. Algorithm 5.2 describes the steps of a procedure to solve Formulation 5.3. A lgorith m 5.2 1. Initialization. 2. Branch step. 2.1 select the newest created allocation ay. 3. Bound step. 3.1 evaluate nc (ay) and nc (ay)' ; the number of trees covering T and those that do not cover it, respectively. 3.2 evaluate TPD. 3.3 evaluate the task reliability using either the approach discussed in Chapter 3 or Equation 5.1. 1.1 F[ — 0 , A — a0 119 3.4 evaluate the objective function F . 4. Fathom ing step. 4.1 dj is fathom ed when one or more of the following constraints are violat­ ed: n 53 Vi,j — L ’ ’ ® ===1 ) 2 , . . Tlj / = 1 n max ~ n c i aj ) 4.2 if the constraints are not violated then 4.2.1 if F > F[ then Fi = F . 4.2.2 partition ay into two allocations about yp k such th a t a) in the first allocation, node xk has function / p . b) in the second allocation, node xk does not have / p . 5. Stopping step. If all the allocation variables have been considered, the algorithm stops; oth­ erwise, go to the branch step. 5.5 A n Illu stra tiv e E xam p le Let us assume th a t we have a distributed task T consisting of three functions / 2, and / 3 which are to be allocated to the P E ’s of the distri­ buted system shown in Figure 5.2 so th a t reliability and delay are optim ized. 
The capacity of the com m unication channels, the traffic rate betw een the func­ tions, and the redundancy level in this example are assumed as follows: CAP i 2 = CAP i % — CAP 2 3 = CAP 4 5 = 20 Kbits/sec C A P 2j4 = C A P35 = 30 Kbits/sec — = 1000 bits/packet C A P 46 = CAP 5 6 = CAP 25 = 15 Kbits/sec 120 F r (1,2) — I packet/sec F r (I,T) = 3 packets/sec tr (2,3) = 4 packets/sec r j == r . - > = r 3 = 2 5,6 Figure 5.2: A six node distributed system. In the following discussion, we use the two algorithms developed in the previous section to obtain the set of optimal allocations. For simplicity, we as­ sume th at only one function could be allocated to a computer. The first step of Algorithm 5.1 is to find all the sets of 3 computers that could form trees with two links. By doing so the following 11 sets are obtained: ( x j , X 2 , -£3) , (•£ 1 * 2 * •^4) ’ i.x I * x 2 ’ ” ^5) ’ 1 » x 3 » x 5) (x 2 , % 3 , ^5) , (^2 ’ ^4 * ^5) * ( • * • 2 ’ ^4 ’ 6) > (^2 f x 5 » ^ 6) (2:3 , X4 , X 5) , (x3 , X 5 , X g) , (x 4 , X $ , X &) Applying FAP leads to 18 allocations where each one of them covers only six sets of the 11 listed above. Although these allocations cover the same num ber of trees, the reliability corresponding to each allocation is different be­ cause the num ber of MFST's spanning the nodes of each tree depends on the degree of the nodes. For example, in Figure 5.2, there are three trees of size 3 th at span nodes (a: j , x 2 , x 3), while there is only one tree spanning nodes 121 [x 1 , x 2 , x 4). Hence, the distributed task reliability (R ) m ust be computed for each allocation to identify the one th a t maximizes it. By doing so, the num ber of allocations decreases to six. The distribution of the functions for each allocation of these six identified allocations is shown in Table 5.1. Once the set of allocations is obtained, the next step is to evaluate for each one the task packet delay ( TPD ) according to the steps described in Sec­ tion 5.3. The resulting delays and reliability of each allocation are also shown in Table 5.1. The optim al allocation is a 2 since it has the m inim al TPD 2 (107.6 msec), while allocation a 5 has the worst delay (154.6 msec) as shown in Figures 5.3 and 5.4, respectively. This can be explained as follows: in a 2, the functions th a t interact w ith each other heavily ( / 2, f 3) are allocated such th a t they use highest capacity links (x 24 , x 3S ), while the opposite is done in a 5, in which light interacting functions ( / v f 2) are assigned to use those links of high capacity. If we assume all the nodes and links have the same reliabili­ ty, the reliability of executing th a t task w ithout any redundancy is: R (0) = p 5 (5.12) and the reliability im provem ent factor for any one of the six allocations is: R IF = R W - R (°) = 0-9597-0.5905 = gQ % 1-R (0) 1-0.5905 Applying Algorithm 2 identifies the same optim al distribution of the functions {a2). 122 M a x im a l R e lia b ility A llo c a tio n s M a x im a l T a s k R e lia b ility = 0 .9 5 9 7 Allocation x l x 2 x 3 £ 4 x 5 x 6 TPD,- (in ms) a 1 / i f 2 / 3 / 3 / 2 / 1 126.6 a 2 / l / 3 / 2 / 2 / 3 / 1 107.6 a 3 / 2 / l / 3 / 3 / 1 / 2 139.3 a 4 / 2 / 3 / 1 I 1 / 3 / 2 112.9 a 5 / 3 / l / 2 / 2 / 1 / 3 154.6 a 6 / 3 / 2 / 1 / 1 / 2 / 3 143.6 Table 5.1: TPD ’s of the allocations resulted from applying FA P. 5,6 ft ft Figure 5.3: D istribution of functions according to a 2- x2 5,6 ft h Figure 5.4: Distribution of functions according to o 5. 
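If the compound objective of Formulation 5.3 is preferred over the two-step procedure of Algorithm 5.1, scoring a candidate allocation reduces to a one-line computation once R, TPD, and WTD are available. The Python sketch below evaluates this objective, reconstructing Equation 5.11 from the text as F = α R - (1 - α) TPD / WTD with α + β = 1; the numerical values in the usage lines are hypothetical and are not results from the example above.

# Compound objective of Section 5.4: F = alpha * R - (1 - alpha) * TPD / WTD.
# R, tpd, and wtd would come from the reliability and delay analyses of Section 5.2.

def compound_objective(R, tpd, wtd, alpha=0.5):
    return alpha * R - (1.0 - alpha) * (tpd / wtd)

# Comparing two hypothetical allocations, weighting reliability at alpha = 0.7:
print(compound_objective(0.96, 0.108, 0.167, alpha=0.7))
print(compound_objective(0.95, 0.120, 0.167, alpha=0.7))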
124 Chapter 6 Conclusions and Directions for Future Research 6.1 S u m m ary an d C on clu sion s Reliability m easures and algorithm s to evaluate them efficiently in the context of com puter netw orks and distributed system s are addressed in this dissertation. C ertain reliability optim ization problems are form ulated and solved using iterative approach. A m ajor contribution of this research is a simple and efficient term inal reliability evaluation algorithm which can be used for larger networks, for example in networks w ith more th an 40 paths reliabili­ ty can be evaluated w ithin a few seconds and w ithout requiring excessive mem ory space. SYREL algorithm generates a symbolic reliability expression w ith num ber of term s on the order of the num ber of paths. This algorithm can be easily extended to evaluate other known reliability m easures for com­ plex networks. This algorithm is highly suitable for im plem entation on a parallel com puter. 125 For distributed systems, two new reliability m easures are defined and algorithm s to evaluate them efficiently are developed using graph-theoretic ap­ proach. The M FST and M FSF algorithm s presented in C hapter 3 use the breadth first search technique to enum erate subgraphs and then SYREL algo­ rithm is used to obtain reliability expressions. The efficiency of these algo­ rithm s can be improved if better search techniques can be found. If the given distributed system has a simple com m unication structure, such as a ring or bus topology, then the subgraph enum eration part becomes sim pler. Iterative optim ization algorithm s are developed in C hapter 4 to maxim­ ize the term inal reliability of complex com puter netw orks under a cost con­ straint. The numerical results w ith Algorithm 3 indicates th a t approxim ate reliability functions can be used to determine optim al or near optim al solu­ tions. These algorithm s can also be used in other optim ization problems, such as optim izing D PR, DSR, or any other reliability function in a given distribut­ ed system . In chapter 5, algorithm s are presented for allocating the resources of a given distributed task so th a t both reliability and delay are optim ized. These algorithm s are im plem ented based on the branch-and-bound technique. The tasks considered in this analysis are assumed to be vital to the operation of the distributed system . This work can be easily extended to allocate the resources associated w ith a set of tasks when their functions are combined together to form one set. 126 6.2 D irection s for F u tu re R esearch The reliability aspects of a distributed system with a fixed topology are studied in this dissertation. Further research involves in studying the topology optimization problem in which the reliability issue is addressed quantitatively. One could study the problem of topology optimization with a suitable reliabili­ ty function that can be easily evaluated. The allocation of functions of a given task that is considered in this dissertation does not take into account the current load of each computer in the system. An extension to the work developed in Chapter 5 is to allocate the resources of a set of tasks with several measures, such as reliability, throughput, average delay, and processing cost are optimized in a dynamic environment. Since designing of distributed systems is more difficult than uniprocessor systems, a challenging research problem is to design an expert system that helps designers study tradeoffs in alternative systems. 
Another area where the graph-theoretic techniques developed in this thesis can be applied is in evaluating modular software reliability. Design diversity approach has been suggested for improving software reliability [AVIZ 84]. An interesting problem is analytical modeling of such redundant software unit rather than experimental evaluation [STAN 85]. Further research is sug­ gested for studying the feasibility of representing the software modules struc­ ture by a probabilistic graph. The nodes in that graph denote the modules of the system software under consideration and the links indicate the interaction among them. The weight to be given to a node could depend on the size, i.e., the number of lines in the source code of the corresponding module and the 127 frequency at which it is used. The reliability of a link should capture the pro­ bability of executing the two modules corresponding to the nodes incident on that link. Once that graph is constructed, the reliability of the software module can be measured using the techniques described in Chapters 2 and 3. 128 REFERENCES [ABRA 73] N, Abramson, F. F. Kuo, Computer-Communication Networks, Englewood Cliffs: Prentice-Hall, 1973. [ABRA 79] J. A. Abraham , ”An Improved Algorithm for Network Reliabili­ ty ,” IEEE Trans. Reliability, Vol. R-28, No. 1, April 1979, pp. 58-61. [AGGA 75] K. K. Aggarwal, K. B. Misra, J. S. G upta, ”A F ast Algorithm for Reliability Evaluation,” IEEE Trans. Reliability, Vol. R-24, No. 1, April 1975, pp. 83-85. [AGGA 81] K. K. Aggarwal, S. Rai, ’ ’Reliability Evaluation in Computer- CommunJcation Networks,” IEEE Trans. Reliability, Vol. R-30, No. 1, April 1981, pp. 32-35. [AHO 74] A. V. Aho, J. E. Hopcraft, J. D. Ullman, The Design and Analysis of Computer Algorithms, Reading: Addison-Wesley, 1974. [ARNB 78] S. Arnborg, ”A reduced State Enum eration- Another Algorithm for Reliability Evaluation,” IEEE Trans. Reliability, Vol. R-27, No. 2, June 1978, pp. 101-105. [AVIZ 78] A. Avizienis, ’ ’Fault Tolerance: The Survival A ttribute of Digi­ tal Systems,” Proceedings of IEEE, October 1978, pp. 1109- 1125. [AVIZ 84] A. Avizienis, J. P. J. Kelly, ’ ’Fault Tolerance by Design Diversi­ ty: Concepts and Experim ents,” Computer, Vol. 17, No. 8, Au­ gust 1984, pp. 67-80. [AVRI 76] M. Avriel, Nonlinear Programming Analysis and Methods, En­ glewood Cliffs: Prentic-Hall, 1976. [BALL 79] M. O. Ball, ’ ’Computing Network Reliability,” Operations Research, Vol. 27, July-August 1979, No. 4, pp. 823-838. [BALL 80] M. O. Ball, ’ ’The Complexity of Network Reliability Com puta­ tions,” Networks, Vol. 10, No. 2, Summer 1980, pp. 153-165. 129 ]BTE'G~77]— [BENN 82] [BERM 85] [BROW 71] [CHEN 80] [CHOU 83] [CHU 69] [COFF 81] [DEVA 79] [DION 80] [ENSL 78] J. E. Biegel, ’’D eterm ination of Tie Sets and C ut Sets for a Sys- tem w ithout Feedback,” IEEE Trans. Reliability, Vol. R-26, No. 1, April 1977, pp. 39-42. R. G. B ennetts, ’ ’Analysis of Reliability Block Diagrams by Boolean techniques,” IEEE Trans. Reliability, Vol. R-31, No. 2, June 1982, pp. 159-166. K. P. Berm an, T. A. Joseph, T. Raeuchle, A. E . A bbadi, ’ ’Ob­ ject M anagem ent in D istributed System s,” IEEE Trans. Software Engineering, Vol. SE-11, No. 6, June 1985, pp. 502- 508. D. Brown, ”A Com puterized Algorithm for determ ining the Re­ liability of R edundant Configurations,” IEEE Trans. Reliability, Vol. R-20, No. 3, August 1971, pp. 121-124. P . Chen, J. Akoka, ’’Optim al Design of D istributed Inform ation System s,” IEEE Trans. Computers, Vol. C-29, No. 
12, De­ cember 1980, pp. 1068-1080. T . C. K. Chou, J. A. A braham , ’ ’Load R edistribution Under Failure in D istributed Systems,” IEEE Trans. Computers, Vol. C-32, No. 9, Septem ber 1983, pp. 799-808. W . W . Chu, ’ ’M ultiple File Allocation in M ultiple Com puter System ,” IEEE Trans. Computers, Vol. C-18, No. 10, October 1969, pp. 885-889. E. G. Coffman, Jr. et al., ’ ’O ptim ization of the N um ber of Copies in a D istributed D ata Base,” IEEE Trans. Software En­ gineering, Vol. SE-7, No. 1, January 1981, pp. 78-84. S. D evam anoharan, ”A Note on D eterm ination of Tiesets and C utsets for a System w ithout Feedback,” IEEE Trans. Reliabil­ ity, Vol. R-28, No. 1, April 1979, pp. 67-68. J. Dion, ’ ’The Cambridge File Server,” A C M Operating System Review, Vol. 14, No. 4, October 1980, pp. 26-35. P . Enslow, ’’W hat is a D istributed D ata Processing System ,” Computer, Vol. 11, No. 7, January 1978, pp. 10-21. 130 [FRAN 72] [FRAT 73] [FRAT 76] [GARC 82] [GERL 77] [GOPA 78] [GRNA 80] [GRNA 81]f [HARI 86] [HILL 80] [HWAN 79] H. Frank, W. Chon, ’ ’Topological O ptim ization of Com puter Netw orks,” Proceedings of the IEEE, Vol. 60, No. 11, 1972, pp. 1385-1397. L. F ratta,. U. G. M ontanari, ”A Boolean A lgebra M ethod for Com puting the Term inal R eliability in a C om m unication N et­ w ork,” IEEE Trans. Circuit Theory, Vol. CT-20, No. 3, M ay 1973, pp. 203-211. L. F ra tta , U. G. M ontanari, ” Synthesis of Available Netw orks,” IEEE Trans. Reliability, Vol. R-25, No. 2, June 1976, pp. 81-86. H. Garcia-M olina, ’ ’R eliability Issues for Fully Replicated Dis­ trib u ted D atabases,” Computer, Vol. 16, No. 9, Septem ber 1982, pp. 34-42. M. Gerla, L. Kleinrock, ”On the Topological Design of Distri­ buted Com puter Networks,” IEEE Trans. Communications, Vol. COM-25, No. 1, January 1977, pp. 48-60. K. Gopal, K. K. Aggarwal, J. S. G upta, ”A n Im proved Algo­ rithm for Reliability O ptim ization,” IEEE Trans. Reliability, Vol. R-27, No. 5, December 1978, pp. 325-328. A. G rnarov, L. Kleinrock, M. Gerla, ”A New Algorithm for Symbolic Reliability Analysis of C om puter Com m unication N etw orks,” Proceedings of the Pacific telecommunication conference, June 1980, pp. 1A-11 to 1A-19. A. Grnarov and M. Gerla, ’ ’M ultiterm inal Reliability Analysis of D istributed Processing System s,” Proceedings of the 1981 International conference on parallel processing, A ugust 1981, pp. 79-88. S. Hariri, C. S. Raghavendra, ’ ’SYREL: A Symbolic Reliability Algorithm based on P a th and C utset M ethods,” Proceedings of the IEEE Infocom 86, April 1986, pp. 293-301. F. S. Hillier, G. J. Lieberman, Introductions to Operations Research, San Francisco: Holden-Day, 1980. C. L. Hwang, F. A. Tillman, W . Kuo, ’ ’R eliability O ptim ization 131 [HWAN 81] [IGNI 82] [IRAN 82] [JOHN 84] [KIM 72] [KLEI 76] [KLEI 85] [LAMP 81] [LANI 83] [LIN 76] by Generalized Lagrangian-Function and Reduced-G radient M ethods,” IEEE Trans. Reliability, Vol. R-28, No. 4, October 1 9 7 9 , p p . 3 1 6 - 3 1 9 . C. L. Hwang, F. A. Tillman, M. H. Lee, ”System Reliability E valuation Techniques for Complex Large Systems-A Review,” IEEE Trans. Reliability, Vol. R-30, No. 5, December 1981, pp. 411-423. J. P . Ignizio, D. F. Palm er, C. M. M urphy, ”A M ulticriteria Ap­ proach to Supersystem A rchitecture Definition,” IEEE Trans. Computers, Vol. C-31, No. 5, May 1982, pp. 410-418. K. B. Irani, N. G. 
K habbaz, ”A M ethodology for the Design of Com m unication Networks and the D istribution of D ata in Dis­ tributed Supercom puter Systems,” IEEE Trans. Computers, Vol. C-31, No. 5, M ay 1982, pp.420-434. R. Johnson, ’ ’Network Reliability and Acyclic O rientation,” Networks, Vol. 14, No. 4, W inter 1984, pp. 489-505. Y. H. Kim, K. E. Case, P . Chare, ”A m ethod for Com puting Complex System R eliability,” IEEE Trans. Reliability, Vol. R- 21, No. 4, November 1972, pp. 215-219. L. Kleinrock, Queuing Systems, Volume II: Computer Applica­ tions, New York: Wiley, 1976. L. Kleinrock, ’ ’D istributed System s,” Computer, Vol. 18, No. 11, Novemeber 1985, pp. 90-103. B. W. Lam pson, M. Paul, H. J. Siegert, Distributed Systems- Architecture and Implementation, Lecture Notes in Com puter Science, Springer-Verlag, 1981. L. J. Laning, M. S. Leonard, ’ ’File Allocation in a D istributed Com puter Com m unication N etw ork,” IEEE Trans. Computers, Vol. C-32, No. 3, M arch 1983, pp. 232-244. P. M. Lin, B. J. Leon, T. C. Huang, ”A New Algorithm for Symbolic System Reliability Analysis,” IEEE Trans. Reliability, Vol. R-25, NO. 1, April 1976, pp. 2- 15. 132 [M A R C 8 I f [MARI 79] [MERW 80] [MESS 70] [MISR 71] [MISR 70] [NAKA 77a] [NAKA 77 b] [NAKA 78] [RAGH 82] R. Marcogliese, R. Movarese, ’ ’Module and D ata Allocation M ethods in D istributed System s,” Proceedings of the 2nd. Inter­ national Conference on Distributed Computing Systems, 1981, pp. 50-59. M. P . M ariani, D. F. Palm er, Tutorial: Distributed System Design, IEEE Com puter Society Press, IEEE Catalog No. EHO 151-1, 1979. R . E. Merwin, M. M irhakak, ’ ’D erivation and Use of a Surviva­ bility Criterion for DDP system s,” Proceedings of the 1980 N a­ tional Computer Conference, M ay 1980, pp. 139-146. M. Messinger, M. L. Shooman, ’’Techniques for O ptim um Spares Allocation: A Tutorial Review,” IEEE Trans. Reliability, Vol. R-19, No. 4, November 1970, pp. 155-166. K. B. Misra, ”A M ethod of Solving R edundancy O ptim ization Problem s,” IEEE Trans. Reliability, Vol. R-20, No. 3, A ugust 1971, pp. 117-120. K . B. Misra, ”An Algorithm for the Reliability E valuation of R edundant Netw orks,” IEEE Trans. Reliability, Vol. R-19, No. 4, November 1970, p p .146-151. Y. Nakagawa, K. N akashim a, ”A Heuristic M ethod for D eter­ mining Optim al Reliability A llocation,” IEEE Trans. Reliabili­ ty, Vol. R-26, No. 3, A ugust 1977, pp. 156-161. H. Nakazawa, ”A Decomposition M ethod for Com puting Sys­ tem Reliability by a Boolean Expression,” IEEE Trans. Relia­ bility, Vol. R-26, No. 4, O ctober 1977, pp 250-252. H. Nakazawa, ”A Decomposition M ethod for Com puting Sys­ tem Reliability by a M atrix Expression,” IEEE Trans. Reliabili­ ty, Vol. R-27, No. 5, December 1978, pp. 342-344. C. S. R aghavendra, M. Gerla, A. Avizienis, ’ ’Reliability O ptim i­ zation in the Design of D istributed System s,” Proceedings of the 3rd international Conference on Computing Systems, Oc­ tober 1982, pp. 388-393. 1 3 3 [RAGH 83] [RAI 78] [RAMA 83] [RENN 80] [ROSE 77] [SATY 78] [SATY 81] [SATY 82] [SATY 83] [SHOO 68] [STAN 84] C. S. R aghavendra, S. V. M akam , "Dynam ic R eliability Model­ ing and Analysis of Com puter N etw orks,” Proceedings of the International Conference on Parallel Processing, A ugust 1983. S. Rai, K . K. Aggarwal, ”An Efficient M ethod for Reliability E valuation,” IEEE Trans. Reliability, Vol. R-27, No. 3, August 1978, pp. 206-211. C. V. R am am oorthy, B. W . W all, ’ ’The Isom orphism of Simple File A llocation,” IEEE Trans. Computers, Vol. 
C-32, No. 3, M arch 1983, pp. 221-231. D. A. Rennels, ’ ’D istributed F ault-T olerant C om puter Sys­ tem s,” Computer, Vol. 13, No. 3, M arch 1980, pp. 55-65. A. Rosenthal, ’ ’Com puting the R eliability of Complex Net­ w orks,” SIA M Applied Mathematics, Vol. 32, No. 2, M arch 1977, pp. 384-393. A. Satyanarayana, A. P rabhakar, ’ ’New Topological Form ula and R apid Algorithm for Reliability Analysis of Complex Net­ w orks,” IEEE Trans. Reliability, Vol. R-27, No. 2, June 1978, pp 82-100. A. Satyanarayana, J. N. Plagstrom, ”A New A lgorithm for Reli­ ability Analysis of M ulti-Term inal N etw orks,” IEEE Trans. R e­ liability, Vol. R-30, No. 4, October 1981, pp. 325-333. A. S atyanarayana, ”A Unified Form ula for Analysis of Some N etw ork Reliability Problem s,” IEEE Trans, on Reliability, Vol. R-31, No. 1, April 1982, pp. 23-32. A. Satyanarayana, M. K. Chang, ’ ’N etw ork Reliability and Fac­ toring Theorem ,” Networks, Vol. 13, No. 1, Spring 1983, pp. 107-120. H. Shooman, Probabilistic Reliability: A n Engineering A p­ proach, New York: MacGraw-Hill, 1968. J. A. Stankovic, ”A perspective on D istributed C om puter Sys­ tem s,” IEEE Trans, on Computers, Vol. C-33, No. 12, De­ 134 cember 1984, ppll02-1115. [STAN 85] [SHAR 71] [TANE 81] [THUR 79] [TILL 70] [TILL 77] [TORR 83] [WILK 72] [WITT 80] J. A. Stankovic, Reliable Distributed System Software, IEEE Com puter Society Press, IEEE Catalog No. EHO 230-3, 1985. J. Sharm a, K. V. Venkateswaran, ”A Direct M ethod for Maxim­ izing System Reliability,” IEEE Trans. Reliability, Vol. R-20, No. 4, November 1971, pp. 256-259. A. S. Tanenbaum , Computer Networks, Englewood Cliffs: Prentice-Hill, 1981. K. J. Thurber, G. M. Masson, Distributed Processor Communi­ cation Architecture, Lexington: Lexington Books, 1979. F. Tillman, C. Hwang, L. Fan, K. Lai, ” Optimal Reliability of a Complex System ,” IEEE Trans. Reliability, Vol. R-19, No. 3, August 1970, pp. 95-100. F. Tillman, C. Hwang, W. Kuo, ’ ’Optim ization Techniques for System Reliability w ith Redundancy-A Review,” IEEE Trans. Reliability, Vol. R-26, No. 3, August 1977, pp. 162-16. J. Torrey, ”A Pruned Tree Approach to Reliability Com puta­ tion,” IEEE Trans. Reliability, Vol. R-32, No. 2, June 1983, pp 170-174. R. S. Wilkov, ’ ’Analysis and Design of Reliable Com puter Net­ works,” IEEE Trans. Communications, Vol. COM-20, No. 3, June 1972, pp. 660-678. L. W ittie, A. M. Van Tilborg, ’ ’MICROS, A D istributed O perating System for M icronet, A Reconfigurable Network C om puter,” IEEE Trans. Computers, Vol. C-29, No. 12, De­ cember 1980, pp. 1133-1144. 