PROTOCOL EVALUATION IN THE CONTEXT OF DYNAMIC TOPOLOGIES

by
Kannan Varadhan

A Dissertation Presented to the
FACULTY OF THE GRADUATE SCHOOL
UNIVERSITY OF SOUTHERN CALIFORNIA
In Partial Fulfillment of the Requirements for the Degree
DOCTOR OF PHILOSOPHY
(Computer Science)

August 1998

© 1998 Kannan Varadhan

Reproduced with permission of the copyright owner. Further reproduction prohibited without permission.

UNIVERSITY OF SOUTHERN CALIFORNIA
THE GRADUATE SCHOOL
UNIVERSITY PARK
LOS ANGELES, CALIFORNIA 90007

This dissertation, written by Kannan Varadhan under the direction of his Dissertation Committee, and approved by all its members, has been presented to and accepted by The Graduate School, in partial fulfillment of requirements for the degree of DOCTOR OF PHILOSOPHY

Date: August 6, 1998

DISSERTATION COMMITTEE

Contents

List Of Figures
Dedication
Acknowledgements
Abstract

1 Introduction
  1.1 Problem Definition
    1.1.1 Thesis Statement
  1.2 Contributions from this Thesis
    1.2.1 Protocol Evaluation
    1.2.2 Route Oscillations in Inter-domain Routing
  1.3 Organization of the Thesis

2 Background and Related Work
  2.1 Overview of Network Layer Routing Protocols
  2.2 Inter-Domain Routing and Policy-Based Routing
  2.3 Proof Techniques for Protocols
  2.4 Internet Protocol Simulators
  2.5 Simulation Models
    2.5.1 Traffic Models
    2.5.2 Topology Models
    2.5.3 Routing Models
  2.6 End-to-End Protocol Overview

3 Persistent Route Oscillations in Inter-domain Routing
  3.1 Description of the Problem: Oscillations in BGP/IDRP
  3.2 Characterization of the Oscillation
    3.2.1 Assumptions and Problem Statement
    3.2.2 Return Graphs
    3.2.3 Properties of Return Graphs
    3.2.4 Persistent Route Oscillations in D
    3.2.5 Discussion
  3.3 Evaluation of Alternative Solutions
  3.4 Summary

4 Effect of Topology Dynamics on Unicast End-To-End Protocols
  4.1 Simulation Methods and Mechanisms
  4.2 Case Study: Characterising TCP
  4.3 Characterising TCP: Some Results
  4.4 Chapter Summary

5 Multicast Transport Protocol Evaluations
  5.1 Protocol Overview
  5.2 Outline of the Methodology
  5.3 SRM Timer Mechanisms in the Context of Topology Dynamics
    5.3.1 Deterministic Adaptation of the SRM timers
      5.3.1.1 Base Case: SRM Timers over a Stable Network
      5.3.1.2 Dynamic Behaviour of the SRM Adaptive Timers
    5.3.2 Probabilistic and Mixed Adaptations
  5.4 Network Partitions
  5.5 The Role of Session Messages in Loss Recovery
  5.6 Chapter Summary

6 Clustering Mechanisms
  6.1 Protocol Overview
  6.2 Clustering Mechanisms: Scalable Session Messages
    6.2.1 Proximity
      6.2.1.1 Base Case: SSM Cluster Formation over Stable Topologies
      6.2.1.2 SSM Collective Migration over Dynamic Topologies
    6.2.2 Combination Measures: Proximity + Size
  6.3 Partition Analysis of SSM
  6.4 Notes on the Evolution of SSM
  6.5 Applicability to Other Protocols
  6.6 Chapter Summary

7 Conclusions and Future Work
  7.1 Unicast Transport: TCP
  7.2 Multicast Transport: SRM
  7.3 Clustering Mechanisms: SSM
  7.4 Other Issues

Bibliography

Appendix A The ns Network Simulator

Appendix B Unicast Routing in ns
  B.1 The Interface to the Simulation Operator (The API)
  B.2 Other Configuration Mechanisms for Specialised Routing
  B.3 Protocol Specific Configuration Parameters
  B.4 Internals and Architecture of Routing
    B.4.1 The classes
    B.4.2 Interface to Network Dynamics and Multicast
  B.5 Protocol Internals

Appendix C Network Dynamics in ns
  C.1 The user level API
  C.2 The Internal Architecture
    C.2.1 The class rtModel
    C.2.2 class rtQueue
  C.3 Interaction with Unicast Routing
    C.3.1 Extensions to Other Classes
  C.4 Deficiencies in the Current Network Dynamics API

Appendix D The SRM Agent in ns
  D.1 Configuration
    D.1.1 Trivial Configuration
    D.1.2 Other Configuration Parameters
    D.1.3 Statistics
    D.1.4 Tracing
  D.2 Architecture and Internals
  D.3 Packet Handling: Processing received messages
  D.4 Loss Detection: The Class SRMinfo
  D.5 Loss Recovery Objects
  D.6 Session Objects
  D.7 Extending the Base Class Agent
    D.7.1 Fixed Timers
    D.7.2 Adaptive Timers
List Of Figures

2.1 A portion of the Internet
2.2 Counting-to-infinity in distance vector algorithms
2.3 Example topology to demonstrate the need for policy based routing
2.4 Tahoe and Reno TCP
3.1 Example of a cyclic domain policy that leads to non-convergence in BGP/IDRP
3.2 BGP/IDRP behavior with four nodes
3.3 A persistent route oscillation among n domains
3.4 Return graphs for the topology of Figure 3.2(a)
3.5 Multiple one-node cycles can cause an oscillation
3.6 Different oscillation periods for different initial conditions
4.1 Comparison of different routing protocol options in a simulation
4.2 Topology used for first set of TCP experiments
4.3 Tahoe TCP throughput in stable topologies
4.4 Tahoe TCP throughput in Dynamic Topologies
4.5 TCP throughput traces for Reno and SACK TCPs in Dynamic Topologies
4.6 Effect of Interleaving Packets on TCP SACK
5.1 Cyclic topologies used in the study of reliable multicast
5.2 Average recovery delay per loss
5.3 Normalized recovery delay, expressed in units of rtt per loss
5.4 Request and repair counts per loss
5.5 Messages sent by each node for each loss
5.6 Adaptive Timers: Parameter adaptation by each node
5.7 Recovery delay per loss: Adaptive timers over dynamic topologies
5.8 Request and repair counts per loss: Adaptive timers over dynamic topologies
5.9 Trace of messages for a single loss following a link failure
5.10 Maximum request and repair rounds per loss: Adaptive timers
5.11 Topologies for Probabilistic and Mixed Adaptation of the SRM Timers
5.12 Topology used for evaluation of SRM under partitions
5.13 Average recovery delay for losses independent of the partition
5.14 Average recovery delays seen in the presence of partitions
5.15 Traffic and drops on Link (0,7)
5.16 Average Recovery Delays as a Function of the Frequency of Session Messages
5.17 Normalised Recovery Delays as a Function of the Frequency of Session Messages
5.18 Request Message Counts as a Function of the Frequency of Session Messages
5.19 Repair Message Counts as a Function of the Frequency of Session Messages
5.20 Request Message Rounds as a Function of the Frequency of Session Messages
5.21 Repair Message Rounds as a Function of the Frequency of Session Messages
6.1 Topology to Study Collective Migration
6.2 Recovery Delays: Base Case Evaluation of SSM over Stable Topologies
6.3 Message Counts: Base Case Evaluation of SSM over Stable Topologies
6.4 Average Cluster Distances: Base Case Evaluation of SSM over Stable Topologies
6.5 Recovery Delays: SSM Evaluation over Dynamic Topologies
6.6 Message Counts: SSM Evaluation over Dynamic Topologies
6.7 Average Cluster Distances: SSM over Dynamic Topologies
7.1 Topology for Studying Clustering Properties
A.1 Architectural components of an end-to-end protocol simulator ns

Dedication

To my wife, Shanthi, who came in at the beginning of this journey...
... And my son, Aditya, who comes in at the start of the next.

Acknowledgements

I am indebted to a number of people who helped make this thesis a reality.

Emigrating is traumatic, to say nothing of looking forward to the prospect of graduate studentship living. Our paths intertwined at the start of my quest for the PhD, and Shanthi has endured much these past five years. To her, I owe the deepest debt of gratitude; to her, I am most thankful.
If the thesis is the product of partnership between a student and their advisor, Prof. Deborah Estrin was the best partner this student could ever have. She was always available, to discuss, critique and comment. Deborah: Thou wert my guide, philosopher, and friend. I still wonder, however, how the thickness of the post-its of reminders and TODOs on your PDA never grows exponentially; that is one skill I wish I had learnt from you in the years that I was your student.

Prof. Sally Floyd, thank you. From you, I learnt a lot about protocol design. The depth of your knowledge and insights into specific protocols was a boon to my work. You were always accessible, and my discussions with you were enriching and fulfilling; many of the ideas in this thesis were developed through these interactions with you. Thank you, Sally.

My qualifying exam and thesis defense committee members provided input on my studies and research over the span of the last five years. Their help and advice have been invaluable. Thank you, Professors Cengiz Alaettinoglu, Ramesh Govindan, Dennis McLeod, and Sandeep Gupta.

I had the fortune to work in two separate projects at USC/ISI: the routing arbiter project and the VINT project. The ideas and experiments embodied in this thesis owe much to the discussions and infrastructure support from the members of these projects. Of special mention are Padma Haldar, Ahmed Helmy, Polly Huang, and Satish Kumar, who provided the base code for my simulations; Mark Handley, Ya Xu, and Haobo Yu, the wizards of visualisation; and Cengiz Alaettinoglu, Lee Breslau, Kevin Fall, Ramesh Govindan, John Heidemann, Steve McCanne, and Yakov Rekhter, who provided invaluable input at various stages of the thesis.

Befriend thy librarian: they can make magic happen. Linda Mizushima was a magician par extraordinaire, finding the books and references at critical junctures in the most timely fashion. Thank you, Linda.

I will always cherish the support and encouragement of Joseph Bannister, Ted Faber, Greg Finn, Steve Hotz, Bill Manning, Katia Obraczka, Jon Postel, Gene Tsudik, and numerous others at USC/ISI. I also wish to thank my colleagues at USC: Shai Herzog, Charley Liu, Reza Rejaie, Brenda Timmerman, Liming Wei, and Daniel Zappala.

Tony Li, when I was waffling between finishing my PhD and taking up a full time job in Silicon Valley, gave me a proverbial kick in my rear, suggesting I should go to USC and finish my thesis with Prof. Estrin. Chuck Kalmanek helped reason out the future potentials in my thesis while I spent the day with them in New Jersey. My discussions with Chuck helped shape the concluding chapter in this thesis.

Almost the last, certainly not the least, I am grateful to my parents: to my mother and father, Smt. and Sri Varadhan, for pestering me with the one question, “Son, when are you getting a job?”; to Shanthi’s mother and father, Smt. and Sri Ranganathan, who visited us during the time Aditya was born. They helped us tide over the difficulties of caring for a newborn infant. Their visiting us made it easier to concentrate on my task at hand, finishing.

And finally, thanks to the various funding agencies: the National Science Foundation for its support for the routing arbiter project (cooperative agreement NCR-9321043 and contract number NCR-92-06418); DARPA for its support for the VINT project (contract number ABT63-96-C-0054); and the NSF infrastructure grant (award number CDA-9216321), which supports systems research at USC.

Abstract

End-to-end protocols operating in hosts at the edges of the network are affected by topology changes and routing protocol transients.
Such protocols are evaluated in transport level simulators during the design process. Most of these simulators use static underlying topologies which do not exhibit network dynamics.

The evaluation of transport protocols in the context of network dynamics can inform protocol design, and can also aid comprehensibility and therefore manageability when the protocols are deployed in an operational network. There is limited experience with systematic analysis and evaluation of transport protocols and multi-protocol interaction, particularly in the context of topology dynamics and routing protocol interactions. This thesis explores methodologies to evaluate end-to-end protocols in the context of network dynamics. These methodologies allow protocol designers to evaluate and understand the worst-case behaviours of their protocols.

We initially developed these methodologies using TCP, and then applied them to the systematic evaluation of reliable multicast protocols. Our results show ways in which topology changes can cause an interleaving of acks (and packets) in TCP, such that the sender sends a burst of packets, which in turn results in transient congestion within the network; these transients occur at unexpected moments. Similarly, our evaluation of the timer mechanisms in SRM reveals the importance of good distance estimation algorithms. Finally, we evaluate the clustering algorithms used in Scalable Session Messages (SSM), and show that topology changes can result in the formation of sub-optimal clusters, and that the impact often extends beyond the locality of the topology changes.

A secondary focus of our research is the theoretical evaluation of one type of network dynamics, non-convergent behaviour, exhibited by a class of hop-by-hop inter-domain routing protocols such as BGP/IDRP.

Chapter 1

Introduction

The Internet interconnects a variety of networks with diverse underlying characteristics [23]. The current Internet Protocol (IP) [106] architecture supports a best-effort datagram service. A number of end-to-end protocols operate over this best-effort service network. These protocols are designed to be adaptive to cope with the heterogeneity of the Internet. However, designing robust protocols is a challenge given the complexities of interactions that can occur during operation, as well as the continuing flux and change in the operational characteristics of the network.

A typical protocol exhibits complex interactions at a variety of granularities. Each instance of the protocol incorporates many components and mechanisms that interact with each other; for instance, round trip time estimation algorithms determine the congestion control and retransmit timer settings [107, 42], and reliable multicast loss recovery timers interact with other control messages and data reception [42, 41, 61]. By design, each protocol instance interacts with other protocol instances; however, the number of messages and protocol states introduces combinatorial complexities. In addition, a number of protocols operate simultaneously in any network, which in turn introduces the complexities of multi-protocol interaction; examples include a transport protocol’s dependence on unicast routing, or the complexity a multicast transport protocol faces due to the type of transmission media, such as a broadcast LAN. Factoring in all of these interactions can be non-trivial, and makes the design of robust protocols, and of robust end-to-end protocols in particular, a challenging problem.

Change is the other factor that complicates the design of robust protocols.
Change is an intrinsic characteristic of the Internet and occurs in a number of different ways. The topology changes as links and nodes within the network periodically fail and recover; the group membership of a multicast application changes; and local congestion points form and dissolve as the traffic load on the network changes.

This thesis concentrates on protocol evaluation in the context of dynamic topologies, as a means to ameliorate the difficulties in protocol design. Protocol evaluation is the characterisation of protocol behaviour in one or more operating regions. It gives the designer an understanding, and possibly some intuitions, about the operation of a particular protocol. This then allows them to systematically investigate different design tradeoffs, and perform more informed design. Moreover, characterising protocol behaviour and understanding its operating characteristics facilitates debugging and analysis of problems when the protocol is deployed in the operational infrastructure; that is, it enhances the operational manageability of the protocol post-deployment.

Protocol evaluations occur at all stages of design. Analytical and simulation methods are useful in the early stages of design; prototypes are built and deployed in the later stages.

Analytical methods use formal or informal techniques to explore protocol robustness in specific contexts. Typical examples of applicability include verifying basic protocol correctness, evaluating protocol behaviour in specific topologies, or evaluating tolerance to failure and recovery of protocol instances. The analysis is often restricted to small scale, and may ignore other parameters unrelated to the focus of the analysis. This helps reduce the complexity of analysis, which oftentimes can become intractable. Even so, analysis can provide considerable insight when studying a specific situation or problem.
An example is our analysis of route oscillations in inter-domain routing, using specific topologies and constraints (Chapter 3).

A simulator executes on one or more machines, and models a virtual topology; protocol instances execute at different nodes in that virtual topology. A simulation run is defined by the topology and the protocols that are specified as parameters to that run. Other parameters include additional workloads, topology dynamics, group membership dynamics, other protocols, etc. Simulations provide a controlled environment in which to study a protocol; simulation results are reproducible from one simulation to the next.¹ However, it is difficult to study a protocol in detail in large scale topologies, or to study multiple protocols simultaneously. One approach to studying large scale behaviours is, depending on the focus of the study, to use abstract models of different components of a simulation: source models, simplified workload models, models of topology change, abstract models for different types of links, topologies, etc.

Prototypes running on test or operational networks can permit a richer evaluation of the protocol because they capture more detail than either analytic or simulation methods. This type of evaluation accounts for many of the variables that affect protocol operation, and makes it possible to study different types of multi-protocol interactions. In addition, it is sometimes possible to study the scaling properties of the protocol through deployment in the operational infrastructure [101, 102]. However, these evaluations have their limitations as well. First, it can be very expensive and risky to deploy an untested protocol in an operational network. Moreover, it is difficult to precisely control the environment.
It is also difficult, if not impossible, to systematically realise the events that stress the particular protocol, for example, to cause topology changes to occur, or to create or eliminate different levels of congestion on demand. This is especially so in the operational network. Lastly, it is difficult, if not impossible, to replay or recreate events precisely. These limitations hinder fault isolation, and make debugging failures during protocol operation expensive and difficult.

1.1 Problem Definition

This thesis explores methods to evaluate end-to-end protocols in the context of topology change. Conventional approaches to this problem use analytic methods or deployment to conduct these evaluations. Our complementary approach, which has not been significantly addressed in prior research, is to explore simulation methods to conduct the protocol evaluation.

¹ To an extent, this depends on the specific simulator, the type of random number generators it uses, and its starting seed values. This is not an issue for most simulators, but is something that the protocol designer using a particular simulator must be aware of.

We have already described how simulation techniques offer a number of advantages to the protocol designer. As a brief recap, simulation methods allow us to study protocol behaviour at more varied levels of detail and abstraction than is possible either through analytical methods or through prototypes. They also permit protocol analysis through dependably reproducible scenarios. Another advantage is the ability to study protocol behaviours at varying degrees of scale of topology or interactions.

Furthermore, characterising protocol behaviour in the context of topology change through simulation methods has a number of other advantages as well.
(1) the most obvious benefit is the ability for the protocol designer to conduct more informed design at early stages in the design process;
(2) the ability to analyse protocol behaviour in the context of topology changes provides a controlled environment in which one can explore pathological regions of operation that may be discovered in deployment; and finally,
(3) by identifying a variety of corner cases, and through more extensive simulation studies, we arrive at better characterisations of protocol behaviour that can then enhance the protocol's manageability when it is deployed in operational networks.

Hence, using simulations, we would like to evaluate protocol behaviours in a variety of “regions of operation.” A region of operation is a combination of operating parameters, the most important of which is the topology, and the associated models of topology change. In addition, the characteristics of the lower layers, including those of the link layers and the network layer routing protocols, can influence end-to-end protocol behaviour in the context of topology changes. We briefly discuss each of these topics in the following paragraphs.

The topology is defined by its nodes and links. The topology can either be regular or random. Regular topologies are useful for stressing specific components of a protocol. Such topologies are often chosen with specific knowledge of the protocol. In contrast, random topologies are created using a topology generator [137, 35]. Topologies are generated using some distributions or topology generation models: Waxman distribution, flat random topologies, hierarchical topologies, transit-stub topologies, etc.

Models for topology change are particular to each topology.
The models can range from simple failures applied at chosen links, to complex faults occurring at multiple places in the topology, either as independent or correlated failure/recovery events. Simple periodic distributions applied at carefully chosen links are useful in regular topologies for evaluating specific characteristics of the protocol. Other more complex models may be required when analysing the protocol over larger randomly generated topologies.

Another important factor in the context of topology change is the characteristics of the nodes and the links at various locations in the topology. These characteristics as observed by an end-to-end session will change as the topology changes. Examples of such characteristics include the queueing disciplines at the nodes (drop tail, random early drop, fair-queueing, class-based queueing, etc.), the types of links that the session traverses (point-to-point, broadcast LANs, etc.), and the loss and transmission characteristics of each of the links.

Finally, the characteristics of unicast and multicast routing also influence protocol behaviour. Routing protocol convergence time following a failure can introduce pathologies and transients in the routing protocol itself. Other characteristics of unicast routing that affect end-to-end protocols include asymmetric routing and equal cost multi-path routing mechanisms. Similarly, the two basic paradigms of multicast routing are flood-and-prune based protocols for dense mode multicast, and shared tree based protocols for sparse mode multicast; each has very significant impact on end-to-end protocols.

At one end of the spectrum, regular topologies, with simple models of topology change, and minimal unicast and multicast routing are useful to study specific protocol behaviours. By minimal routing, we mean that the routing protocols are tuned for rapid convergence following a topology change, without exhibiting many transient behaviours or pathologies.
This region of operation allows the protocol designer to evaluate their protocol and its transients in the context of topology change. It allows us to identify corner case behaviours, as well as isolate and understand pathological situations in their protocol.

By contrast, a randomly generated topology, with appropriate models of topology change, and unicast routing, could be a reasonable model of, say, the Internet. This then provides a richer characterisation of the protocol at the region of operation in which we expect the protocol will be deployed.

1.1.1 Thesis Statement

This thesis explores methods to evaluate end-to-end protocols in the context of topology change through simulation methods. The simulations focus on the evaluation of protocols in three problem domains (unicast transport, multicast transport, and clustering algorithms) using simple representative topologies, simple fault models, minimal distance vector unicast routing, and simple dense mode multicast routing protocols. The aim of this thesis is to develop tools and techniques to enhance protocol evaluation through simulation methods.

We discuss the specific contributions and results in each of the problem domains that we investigated in the following section (Section 1.2). In the rest of this thesis, we will use the terms “protocol evaluation” and “evaluation” to refer to “simulation methods for protocol evaluation in the context of topology change.”

1.2 Contributions from this Thesis

The contributions from this thesis are in two parts: protocol evaluation in the context of dynamic topologies, and the identification and analysis of the possibility of route oscillations in inter-domain routing. We discuss our contributions in each of these parts separately.
1.2.1 Protocol Evaluation

In general, protocols exhibit a large number of interactions, which makes it hard to identify all of the possible interactions. While this appears to be a truism, it is important to realise that to capture and characterise protocol behaviour, we need to identify and isolate the protocol interactions carefully. This requires deep intrinsic knowledge about a given protocol; it is, for example, one of the reasons why identifying the right set of experiments for Scalable Session Messages (Chapter 6) was difficult. This thesis also did not explore generic methods to evaluate an arbitrary protocol in the context of topology changes. This, we believe, is still premature given that there is little experience with protocol evaluation in the context of topology changes.

This thesis demonstrates the utility of simulation methods for protocol evaluation in the context of dynamic topologies. The thesis explores methods to evaluate end-to-end protocols in three separate problem domains. The three problem domains are unicast transport protocols, multicast transport protocols, and clustering algorithms. In each domain, the thesis evaluates a protocol representative of that domain.

Unicast Transport: TCP

Unicast transport protocols are an essential component of any network. Transmission Control Protocol (TCP) [107] is a unicast transport protocol in use in the Internet. The protocol has been extensively studied and enhanced over the number of years that it has been in use. Robustness to failures within the network was a primary design goal for TCP [23]. TCP is a unicast transport protocol built on top of the Internet’s point-to-point best-effort delivery paradigm. TCP does not rely upon session-specific state within the network. Rather, all state about a particular session is maintained by the end points of that session.
Thus, a particular session is only interrupted if one of the end points of the session fails, but is resilient to an arbitrary number of failures within the network. The congestion control algorithms in TCP have been refined and improved through operational experience over a number of years. This thesis evaluates the congestion control algorithms in TCP. In the process, we note that TCP is essentially robust; however, we identify some counter-intuitive ways in which a side effect of topology change, packet interleaving, can cause some transient congestion effects in TCP.

Multicast Transport: SRM

Multicast transport protocol design is guided by the same goals of robustness, performance, and scalability. However, designing robust multicast protocols is complicated by the multiplicity of interactions that are possible. The members of a group have different operating characteristics with respect to each other: propagation delays, congestion, bandwidth, loss characteristics, etc. are different between every pair of members. This interaction with multiple, heterogeneous members makes congestion control and error recovery mechanisms of multicast protocols inherently complex. Each multicast application has its own requirements, which dictate the reliability requirements, constrain congestion control schemes, and define how error recovery should be performed.

Unlike TCP however, multicast transport protocol design is an emerging area. A number of different protocols have been proposed and are being studied, with emphasis on different application requirements. Scalable Reliable Multicast (SRM) [42] guarantees eventual delivery of all data, but makes no assertions about the order of delivery of packets, even for data from a single source. The versions of the protocol that we looked at do not have any defined congestion control mechanisms.
The protocol is designed for white-board-like applications. We will describe the protocol in detail in Section 5.1. Our contributions in the study of SRM are to evaluate the robustness of the timer mechanisms of the protocol; through our study, we identify a number of research questions that could enhance the protocol behaviour.

Clustering Algorithms: SSM

A cluster is a collection of contiguous nodes in the topology that share a common characteristic, and hence can be grouped together as a single entity having that characteristic. The advantage of clustering in network protocols is a saving in protocol state, often with a consequent saving in bandwidth as well, as the amount of information that needs to be exchanged is reduced due to aggregation techniques. Many clustering schemes will use a representative within that cluster to maintain state about that cluster. Some examples of such clustering mechanisms follow. Unicast address aggregation reduces forwarding state at the routers in the network [45]. Nodes cluster around a representative to reduce the amount of session information that must be maintained by other group members [121]. A subgroup of members that shares losses due to a single lossy link can form a separate local recovery group to reduce the loss recovery overheads for the members of the larger multicast session [83, 54].

One essential property of a cluster is that all of the members in the cluster are contiguous to each other. Therefore, cluster formation is directly dependent on the topology, and changes to the topology can affect the shape and formation of clusters. Some types of changes that can occur in a topology are: a single cluster becoming non-contiguous, i.e., cluster breakup; two clusters coalescing into one; and network partitions and healing.
In all of these, topology changes add another dimension of complexity when the protocol uses representatives for cluster formation. In such cases, the members of a cluster could get separated from their representative, or two clusters, each with a representative, could coalesce. Finally, in a large network, the manual configuration of clusters and clustering boundaries does not scale. Therefore, it is critical that protocols that use clustering mechanisms to scale consider self-configuring or auto-configuring techniques in order to form these clusters. In this thesis, we evaluate the robustness of self-configuring clustering algorithms using SSM as a case study. In Chapter 7, we also explore more general purpose techniques to evaluate cluster algorithms in the abstract.

1.2.2 Route Oscillations in Inter-domain Routing

A secondary contribution from this thesis is in identifying the possibility of route oscillations in hop-by-hop policy based inter-domain routing protocols such as BGP [113] and IDRP [68]. In these protocols, each domain can select its routes independently based on its local policies; there is no global metric that defines the route selection at each domain. If a group of domains sets mutually inter-dependent policies for routes to a particular destination, then there may be no satisfying route assignment for each of the domains. In such a situation, each of the domains will oscillate between the routes possible at that domain.

1.3 Organization of the Thesis

This thesis is organised as follows: Chapter 3 describes a problem in inter-domain routing that can result in non-convergent unicast routing state within the network; the chapter contains a description of the problem, and its formal analysis.
The following chapter, Chapter 4, describes the methodology for the evaluation of the behaviour of end-to-end protocols in the context of network dynamics; this chapter also contains a brief evaluation of TCP’s response to network dynamics. Chapter 5 applies our methodology to two aspects of multicast transport protocols: the evaluation of the timer mechanisms in SRM, and the evaluation of clustering mechanisms in reliable multicast protocols using scalable session messages mechanisms as a case study. Chapter 7 presents our conclusions and future work. Finally, in the appendices, we describe the details of our implementation of unicast routing, network dynamics, and SRM in ns.

Chapter 2
Background and Related Work

This thesis has two parts: the analysis of persistent route oscillations in policy based hop-by-hop inter-domain routing protocols, and the evaluation of end-to-end protocols in the context of dynamic routing. The unifying theme of this thesis is unicast routing and topology change.

We begin this chapter with an overview of unicast routing (Section 2.1). Work on various aspects of policy based routing protocols (Section 2.2) forms the core of our related work for the first part of this thesis. In the analysis of route oscillations, we conducted a brief survey of proof techniques for protocol testing (Section 2.3) to determine why these techniques could not effectively identify these route oscillations. This area is also relevant to the main focus of this thesis, exploring simulation methodologies for protocol evaluation in the context of dynamic routing. Other areas related specifically to our main topic are a survey of the different simulators in existence (Section 2.4), and the different topology, traffic, and routing models (Section 2.5) that are applicable to simulation methodologies.
We conclude with a survey of unicast and multicast transport protocols (Section 2.6), to describe the domain of applicability and impact.

2.1 Overview of Network Layer Routing Protocols

The Internet is a collection of nodes interconnected by links. Users access the network through end systems (or hosts) at the edges of the network. A contiguous collection of nodes under one administrative authority is called an “Administrative Domain” (or Autonomous System). An administrative domain interacts with other domains using some inter-domain routing protocol, such as EGP [94], BGP [113], or IDRP [68]. A domain may also run one or more intra-domain routing protocols within it. RIP [55], OSPF [96], or IGRP [56] are examples of intra-domain routing protocols. The essential difference between inter-domain and intra-domain routing protocols is that the former is designed to realize inter-domain policies, while the latter is designed for optimal routing. A contiguous set of nodes within a domain that run a common routing protocol is a “Routing Domain” running that protocol.

Figure 2.1: A portion of the Internet. This figure, while not exact, shows the approximate hierarchical nature of the Internet, with multiple levels of national and international transit and stub domains. The figure also shows some non-hierarchical connectivity.

The route to a destination on the network is a 3-tuple: (prefix, next-hop, attributes). The prefix identifies the nodes and end systems that can be reached by using that route. In this thesis, we assume that all references to routes pertain to prefix x, unless otherwise stated. The next-hop identifies an adjacent node that can be used to reach x. The attributes indicate some characteristics specific to that route to x. In the protocols that we consider in this section, this attribute is the metric (or cost) to reach x using that route.
Each node stores the route to x that has the smallest metric value in its local route information base (or Loc_RIB), and uses that route to reach x.

Every node, D_i, in the network exchanges information with its adjacent neighbours. Each node then computes its routes to a destination, x, using the information learned from its neighbours. Routing protocols are classified by the information that is exchanged among the nodes. In a “Distance Vector” (DV) protocol, a node D_i running that protocol exchanges its Loc_RIB with its neighbours; the route computation at D_i selects the lowest cost route to x advertised by all its neighbours. On the other hand, in a “Link State” (LS) protocol, D_i floods information about the state of all links incident on it. D_i can compute its Loc_RIB from the global link states using a route computation algorithm like Dijkstra’s Shortest Path First (SPF) algorithm [34].

Distance Vector Routing Protocols

These are protocols based on distributed Bellman-Ford routing algorithms [43]. The original ARPANET routing protocol [92], NETCHANGE [129], and RIP [55] are examples of distance vector protocols. Distance vector algorithms are known to have a problem, namely forming loops during periods of transience within the network. These kinds of route loops are called count-to-infinity loops. Figure 2.2 shows an example topology where such loops occur. The loops form when a route becomes invalid at a node, D_i. D_i then selects an alternate route with higher metric through a neighbour, D_j; the new route is actually the route it had earlier advertised to D_j. Each node then advertises the route back to the other with increasing metrics. This process of counting-to-infinity terminates when D_i or D_j recognizes that the metric is larger than the largest possible diameter of the network, and does not select that route.
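The metric exchange described above can be made concrete with a minimal sketch. It assumes only two nodes (called D0 and D1 here), unit link costs, no split horizon, and an "infinity" of 16 borrowed from RIP's convention; the function name and the strictly alternating update order are illustrative assumptions, not the thesis's simulator.

```python
INFINITY = 16  # assumed bound on the network diameter (RIP's convention)


def count_to_infinity(infinity=INFINITY):
    """Return the (D0 metric, D1 metric) pairs seen after D1's link to x fails."""
    # Before the failure: D1 reaches x directly (metric 1), D0 via D1 (metric 2).
    d0, d1 = 2, 1
    history = [(d0, d1)]
    # The link D1-x fails, so D1 no longer has a direct route.
    d1 = infinity
    while True:
        # D1 adopts the route that D0 still advertises (D0's metric + link cost).
        new_d1 = min(d0 + 1, infinity)
        # D0's only route is through D1, so it follows D1's increased metric.
        new_d0 = min(new_d1 + 1, infinity)
        d0, d1 = new_d0, new_d1
        history.append((d0, d1))
        # Both nodes now consider x unreachable: the exchange becomes quiescent.
        if d0 >= infinity and d1 >= infinity:
            return history


if __name__ == "__main__":
    for step, (m0, m1) in enumerate(count_to_infinity()):
        print(f"step {step}: D0 metric={m0}, D1 metric={m1}")
```

Each round raises both metrics by two, so the number of rounds before quiescence grows linearly with the chosen bound; this is why a small "infinity" is attractive in such protocols.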
Once the metric exceeds that bound, both nodes stop advertising the route and the route exchange cycle terminates.

There are two possible solutions to this problem. The first is to recognize that loops form whenever the metric values to x increase. When an increase occurs, a node must change its next-hop only after coordinating the increase in the metric of the original route with all up-stream nodes that also use that route [71]. Once all affected nodes have updated their metrics, they can recompute new routes to x. Various other optimizations to this technique of diffusing computations have been proposed [8, 47]. A variant of this diffusion scheme was first proposed in [93], in which a node adjacent to the point of failure generates a request message; a node that has a valid route responds to the request. IGRP [56] is an implementation of a diffusing update algorithm. The second is to observe that the loops form because the route does not contain sufficient information about the path implied. The problem can be avoided if the route attributes were augmented with the path that it traversed [124]. This class of algorithms distributes path information rather than a simple distance metric, and hence they are called “Path Vector” (PV) algorithms [110].

Figure 2.2: Counting-to-infinity in distance vector algorithms. In this topology of four nodes, assume that the cost of each link is 1. Then, D_0 uses D_1 to reach x. There are two possible situations that can occur when D_1's link to x fails.
1. Assume that D_0 advertises its route to x, metric 2, to D_1. For simplicity, let us ignore D_2 for the moment. When the link fails, D_1 selects the route with metric 2 that D_0 advertises. Recall however that D_0's route was originally advertised by D_1. The two nodes will advertise the same route back to each other, each time with a progressively higher metric, until they recognize that their route to x has a higher metric than the maximum possible diameter of the network. At this point, they will mark the destination as unreachable, and the route exchange process becomes quiescent.
2. D_0 could recognize that it has selected D_1's route to x, and therefore not advertise that route back to D_1. However, this mechanism (called split horizons) is not sufficient to suppress all forms of count-to-infinity loops. Assume that D_1 advertises its route to x to D_0 and D_2 with a metric of one. While D_0 and D_2 will not advertise that route back to D_1, they will advertise it to each other with a metric of at least two. When D_1 withdraws its route to x due to physical failures, D_0 and D_2 will use each other's advertisements, and go through a similar process of count-to-infinity.

Link State Routing Protocols

These are protocols that disseminate a complete map of the topology within the routing domain. Each node, D_i, floods its link states to all other nodes in the routing domain. Eventually, every node in the domain, including D_i, has the link states for the complete topology. D_i can then independently compute its routes using some commonly agreed upon routing algorithm. OSPF [96] and IS-IS [66] use Dijkstra’s SPF algorithm. When a link incident on D_i, or a node adjacent to D_i, fails (or recovers), D_i will generate a new link state update, and flood it through the routing domain. Each node can thus compute new routes for destinations unreachable (or reachable) through that node or link. The new ARPANET algorithm [91] and P-NNI [4] are examples of other link state routing algorithms.

Unlike distance vector algorithms, link state protocols do not exhibit counting-to-infinity loops because each node computes routes from a complete map of the topology. However, since each node has to carry a complete map of the topology, these algorithms do not scale well.
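As an illustration of the route computation that each link state node performs locally, the following is a minimal sketch of Dijkstra's SPF over a flooded link state map. The four-node map, its link costs, and the names `LINK_STATES` and `spf` are hypothetical and are not taken from Figure 2.2; the sketch only shows how a (metric, next-hop) pair per destination falls out of the complete map.

```python
import heapq

# Hypothetical flooded map: node -> {neighbour: link cost}. Not from the thesis.
LINK_STATES = {
    "D0": {"D1": 1, "D2": 1},
    "D1": {"D0": 1, "D2": 1, "x": 1},
    "D2": {"D0": 1, "D1": 1, "x": 3},
    "x": {"D1": 1, "D2": 3},
}


def spf(link_states, source):
    """Return {destination: (metric, next_hop)} as computed at `source`."""
    dist = {source: 0}
    next_hop = {}
    done = set()
    heap = [(0, source, None)]  # (metric, node, first hop taken from source)
    while heap:
        metric, node, hop = heapq.heappop(heap)
        if node in done:
            continue  # stale heap entry; a shorter path was already settled
        done.add(node)
        if hop is not None:
            next_hop[node] = hop
        for nbr, cost in link_states.get(node, {}).items():
            new_metric = metric + cost
            if nbr not in done and new_metric < dist.get(nbr, float("inf")):
                dist[nbr] = new_metric
                # Leaving the source, the first hop is the neighbour itself.
                first = nbr if node == source else hop
                heapq.heappush(heap, (new_metric, nbr, first))
    return {n: (dist[n], next_hop[n]) for n in next_hop}


if __name__ == "__main__":
    print(spf(LINK_STATES, "D0"))
```

Because every node recomputes from the same map after a link state update, a failed link simply disappears from the input and the alternate route is found in a single computation, without the incremental metric exchange of a distance vector protocol.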
The scalability of the algorithm is improved by introducing area abstraction hierarchies. A contiguous portion of the network is configured as an area. Nodes internal to the area carry a detailed map of the area. Nodes external to the area only carry an abstracted representation of the area; they may consider the area as a single node.

2.2 Inter-Domain Routing and Policy-Based Routing

We reviewed the fundamental principles of dynamic hop-by-hop routing in the previous section. In that section, we also observed that the goal of inter-domain routing is to realize administrative domain policies. We now articulate the goals of policy routing, and the related inter-domain routing protocols in greater detail.

The Internet is primarily hierarchical with intermediate top level service providers (transit domains) offering packet forwarding to local campus or corporate networks (stub domains). However, there are a substantial number of non-hierarchical cross-connections at all levels of the hierarchy. In this environment, transit domains wish to impose access restrictions and preferences, and account for their resource usage, while stub domains express their preferences. These policies and preferences can be based on a variety of factors such as contractual, legal, or economic considerations, or other factors such as the available quality of service, resource reservation capabilities, etc. These policies have been realized through policy expressions in a routing protocol [22, 16, 37].
The policies can be classified as one of the following four types of policy expressions [37]: (1) source domain policies, in which a domain chooses the transit domain it will use for a given destination, as for example, domain S in Figure 2.3 will use domain C to reach x in domain D; (2) transit domain policies, in which a domain provides transit to a particular destination, as for example, C will provide transit to D; (3) path sensitive transit policies, in which a domain will provide transit to a particular destination only along certain paths, as for example, a source can use C to reach D only via A, i.e., using the path (A, C, D); and (4) attribute based policies, such as those based on QoS or temporal attributes.

Figure 2.3: Example topology to demonstrate the need for policy based routing. This is a classic example of a path sensitive policy that cannot be realized by a routing protocol that only supports a hop-by-hop forwarding mode. S is constrained to use the routes that A selects in order to reach D. If A selects B to reach D, and S's policies were otherwise, i.e., that it must use C and not B, then S's policies cannot be realized.

A domain has additional considerations in its interaction with other domains [81] that should be realized in a routing protocol. The protocol should provide containment of route instabilities: by this we mean that intra-domain routing at a domain and its inter-domain routing operations should not be affected by each other. To put it another way, there must be an explicit decoupling between intra-domain routing protocols at a domain and that domain’s inter-domain routing protocol. The inter-domain routing protocol should also permit information hiding; a domain should not be constrained to reveal data that it considers sensitive.
In this subsection, we will look at a variety of distance vector and link state inter-domain routing protocols, and how these considerations are realized in these protocols.

Distance Vector Protocols

Most source and transit policies can be realized through selective advertisement and selection of routing information. This is trivial in a distance vector routing protocol. In addition, distance vector routing permits the dissemination of multiple forwarding tables indexed by quality of service values for the commonly used qualities of service. However, since a domain is constrained to only select from the subset of routes advertised by its neighbours, it may not be able to realize some of its path sensitive policies. We look at some distance vector based inter-domain routing protocols:

The ANSI standard IS62749 [65] proposes a protocol that can be used in both connection-oriented and connection-less infrastructures. The network is assumed to be a well defined hierarchy of domains. Inter-domain links are labeled to indicate whether they lead “up” or “down” the hierarchy, or are “jump” links across the hierarchy. Route information and data flow are constrained such that once a route or a data packet traverses a down-link or a jump-link, it cannot then traverse an up-link or another jump-link. However, this protocol is not rich enough to realize all possible path sensitive transit policies.

Exterior Gateway Protocol (EGP) [94] observed that domain level metrics across disparate domains were not always comparable. Therefore, the protocol only conveys reachability status without any information about distance metrics. Hence, there is no mechanism for preventing loops, and external oversight is required to guarantee that the topology is loop-free.
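The link-labelling constraint described above for the hierarchical protocol of [65] can be illustrated with a small check. This is a sketch under stated assumptions: a path is represented simply as the sequence of labels of the inter-domain links it traverses, and the function name `path_allowed` is hypothetical rather than part of any standard.

```python
VALID_LABELS = ("up", "down", "jump")


def path_allowed(labels):
    """Check a path, given as the labels of its inter-domain links in order.

    Once a down-link or a jump-link has been traversed, the path may not
    traverse an up-link or another jump-link.
    """
    descended = False  # True after the first down-link or jump-link
    for label in labels:
        if label not in VALID_LABELS:
            raise ValueError(f"unknown link label: {label!r}")
        if descended and label in ("up", "jump"):
            return False
        if label in ("down", "jump"):
            descended = True
    return True


if __name__ == "__main__":
    print(path_allowed(["up", "up", "jump", "down"]))  # climbs, crosses, descends
    print(path_allowed(["up", "down", "up"]))          # climbs again after descending
```

The rule admits at most one jump-link per path and forces every path into an "up, then across, then down" shape, which is what keeps routes loop-free in a strict hierarchy and also what prevents the protocol from expressing arbitrary path sensitive policies.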
The Border Gateway Protocol (BGP) [113] was designed as a path vector protocol in order to relax the topology constraints introduced in EGP. Inter-Domain Routing Protocol (IDRP) [68] was derived from BGP, and is a superset of BGP. In this proposal, we are not concerned with the differences between the two, and hence will refer to them as BGP/IDRP. A path vector protocol, by definition, carries path information about a route; BGP/IDRP observes that this attribute could be used to realize a richer set of path sensitive policies. However, the protocol will not scale if it has to realize all possible combinations of domain policies.

It should be noted in conclusion that all of these hop-by-hop policy routing protocols can exhibit the kind of oscillations that we describe in the next chapter (Chapter 3). Our work in that chapter was done in the context of BGP/IDRP, but can be universally adapted to the analysis of the policies realised by any of the protocols described above.

Link State Protocols

In a link state model for inter-domain policy routing protocols, each domain maintains its link state policy information. Such link state information encodes the domain's preferences, and its transit and path sensitive policies, as well as the qualities of service that it supports. Route computation is the process by which a source obtains the other domains’ policies and constructs an explicit route. Data packets from the source can then either carry the source route, or the source can install state within the network [24, 38, 127, 18]. This paradigm has the advantage that the sources that require policy based routes bear the burden of computing the routes [24]. However, the source must explicitly compute a policy specific route and possibly set up state before it can communicate with the destination.
The protocol state that is installed within the network cannot be aggregated across all sources. In addition, it is not clear that the protocol would scale if this were the only inter-domain communication paradigm for the Internet. We now describe some specific link state based inter-domain routing protocols.

Inter-Domain Policy Routing (IDPR) [127] and Nimrod [18] use a virtual policy infrastructure to realize policies. Policy gateways distribute route information, and act as forwarders; route servers maintain a domain’s policy information, and generate policy routes for the domain. Both protocols also provide mechanisms for hierarchical aggregation. In order to deploy the protocol in a datagram infrastructure, IDPR provides proxy agents for end systems that can set up and manage policy routes. However, Nimrod assumes that explicitly routed flows will be a fundamental data forwarding mode; datagrams are viewed as a special mode of operation in the network. Nimrod is also distinguished from IDPR in that it explicitly permits domains to selectively advertise route information based on criteria local to the domain.

Viewserver [2] considers a hierarchical organization of route servers whose addresses reflect the hierarchy. Servers maintain domain level information about domains in their view. Route computation is the process of querying successive servers with overlapping views to compose a domain level explicit route.

Finally, the Unified routing architecture [39] observes that neither link state based nor distance vector based paradigms will scale as the fundamental component of a network. Rather, to achieve a more scalable inter-domain routing architecture, a combination of (a) an explicit route component to generate special purpose routes on-demand, and (b) a distance vector based policy routing protocol to realize most commonly used policies is required.
2.3 Proof Techniques for Protocols

Traditionally, routing protocol developers use assertional proof techniques to prove their protocols correct. For examples, see Tajibnapis [129] for distance vector protocols, Shin and Chen [124] for path vector protocols, and Jaffe and Moss [71, 47] for diffusion algorithms. Merlin and Segall specified and proved their diffusion algorithms using communicating finite state machines [93].

There is a large body of work related to the use of formal proof techniques to study a protocol [11, 118]. These techniques work on abstract specifications of the protocol. In this context, they prove well defined properties, such as safety properties, e.g., freedom from deadlock, and progress properties, e.g., freedom from starvation. However, these proof systems generally do not capture the kinds of problems that we discuss in the next chapter (Chapter 3). We believe this is primarily because they tend to abstract out the critical mechanisms of the protocol that cause the problems we are addressing. Furthermore, the routing mechanisms addressed in this chapter can create problems in so many different ways that existing formal proof techniques become intractable. Nevertheless, in this section, we cover this area briefly. While this section cannot do justice to this extensive body of work, we attempt to give the reader a flavour of the nature of this work.

Axiomatic reasoning has been one way to prove specifications and abstract algorithms, and has been used in the systematic development of an algorithm. An example point-to-point mobile application is proved using assertional reasoning [116] using UNITY [19]. In proving a simple transmission protocol, Hailpern [53] observes that axiomatic reasoning is more instructive than proofs using a specification of the protocol as communicating finite state machines. However, Gouda [48] suggests that augmenting the finite state machine with protocol state variables helps prove the desired properties.
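To give a flavour of what checking a safety property such as freedom from deadlock involves, the following is a minimal sketch that exhaustively explores the joint states of two communicating finite state machines. It assumes synchronous (rendezvous) communication, where a send `!m` in one machine must pair with a receive `?m` in the other; the encoding, the toy sender/receiver machines, and the function name `reachable_deadlocks` are illustrative assumptions and do not reproduce any of the cited tools.

```python
from collections import deque


def reachable_deadlocks(p, q, start, final=frozenset()):
    """Explore the joint state space of two synchronously communicating FSMs.

    p and q map a state to a list of (action, next_state) pairs, where an
    action is '!m' (send m) or '?m' (receive m). Returns the set of reachable
    joint states and the subset that is deadlocked (no joint step, not final).
    """
    seen = {start}
    deadlocks = set()
    queue = deque([start])
    while queue:
        state = queue.popleft()
        sp, sq = state
        successors = set()
        for act_p, next_p in p.get(sp, []):
            for act_q, next_q in q.get(sq, []):
                same_message = act_p[1:] == act_q[1:]
                complementary = {act_p[0], act_q[0]} == {"!", "?"}
                if same_message and complementary:
                    successors.add((next_p, next_q))
        if not successors and state not in final:
            deadlocks.add(state)
        for nxt in successors - seen:
            seen.add(nxt)
            queue.append(nxt)
    return seen, deadlocks


# Toy stop-and-wait exchange (hypothetical): send data, then wait for an ack.
SENDER = {"s0": [("!data", "s1")], "s1": [("?ack", "s0")]}
RECEIVER = {"r0": [("?data", "r1")], "r1": [("!ack", "r0")]}
# A faulty receiver that never acknowledges and waits for more data instead.
BAD_RECEIVER = {"r0": [("?data", "r1")], "r1": [("?data", "r0")]}

if __name__ == "__main__":
    print(reachable_deadlocks(SENDER, RECEIVER, ("s0", "r0")))
    print(reachable_deadlocks(SENDER, BAD_RECEIVER, ("s0", "r0")))
```

Even in this toy form the joint state space is the product of the individual machines' states, which hints at why such exhaustive techniques become intractable for complete routing protocols with many nodes and rich per-route state.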
Estelle [63, 17] and SDL [9, 10] are formal languages for protocol specification that are based on extended finite state machines. An alternate technique for studying a protocol is through algebraic systems such as the calculus of communicating systems (CCS) [95]. “Workbench” is a verification tool based on CCS [26]; it has been used to prove protocols like the alternating bit protocol and CSMA/CD [100]. LOTOS [64, 12] is a protocol specification language based on CCS.

Reachability analysis is the strategy of systematically walking a state space based specification of the protocol to prove properties of the protocol. Validation of protocols such as X.21 using reachability analysis, and a random walk strategy are shown in [135]. Similar reachability analysis is applied to an alternating bit protocol specified using extended Petri nets [13]. In this paper, the extensions relate to adding timing information. The authors show that in simple ring topologies, it is possible to reduce the problem of checking timing consistency to that of a network flow problem, which then helps them derive properties of the protocol.

Formal specifications have been used extensively for test case generation [6], conformance testing [62], etc. These are useful concepts for testing protocol implementations and their adherence to the standard, their interoperability with other implementations, etc.

In conclusion, formal verification techniques are an important component of protocol design. They are methodological approaches to the design of protocols. However, none of these techniques have been applied to complete and expressive routing protocols. The work in these areas is complementary to the focus of this proposal.
An exception is the work on Systematic Techniques for Robustness Evaluation of Selected Scenarios (STRESS) techniques [57] to evaluate protocol robustness along multiple dimensions. This is in contrast to our work that explores methods in the context of dynamic topologies alone. It is important to note that our work laid the groundwork and is the motivation for the STRESS approach.

2.4 Internet Protocol Simulators

Many network simulators have been constructed to support protocol research and development. We briefly discuss a few of them in this section.

MaRS [3] was designed to study network routing protocols. It has been used to study distance vector [93], path vector [20], and link state [91] algorithms. Shankar et al. [120], in analyzing different routing protocols, describe the impact of network dynamics on the throughput and delay of a simple window based transport protocol. MaRS allows new routing protocol modules to be added, and has been extended to study session level operation of PIM [134]. However, because of its focus on routing, its support for end-to-end protocols is very limited.

NEST [36] is a multi-threaded simulator designed for network layer protocols. Protocol implementations designed to run on an operational network node can be added to the simulator with minor modifications, and run as lightweight processes within the simulator. However, because the protocol runs as a lightweight process, the node function can be preempted during execution at a node; therefore, the node functions must be re-entrant. NEST has been used in the study of RIP [55] and the design of routing protocols for mobile packet radio networks.

In contrast to MaRS and NEST, netsim [58] focuses on transport layer issues. It (and its derivatives) have been used in many studies of TCP performance.
However, it contains no support for dynamic routing, relying instead on static routing tables; these routing tables have to be hand-configured by the user.

REAL [75] is another end-to-end protocol simulator for the study of flow control and congestion algorithms. REAL implements three kinds of traffic models: exponential arrival bulk traffic (called the FTP workload), exponential arrival interactive traffic (called the TELNET workload), and a high data rate bulk traffic model (called the “ill-behaved” workload). It also has no support for dynamic routing. However, it incorporates a static all-pairs SPF algorithm to compute the routes for the topology before initiating a simulation run.

ns [89] is an end-to-end protocol simulator derived from REAL. ns differs from REAL in two significant ways: (i) it is an object-oriented simulator written in C++, with a Tcl interpreter front end for simulation configuration; (ii) the traffic models in ns are derived from tcplib [28]. We used ns to conduct our protocol evaluations, and made modifications to it to support dynamic routing, the ability to cause topology changes, and some multicast transport algorithms. We will describe ns briefly (but in greater detail than this single paragraph), as well as document our modifications in the appendix.

Network emulation is another technique for protocol research and development. The Bell Labs Network Emulator [78] emulates a virtual network comprised of point-to-point links. The emulator itself runs on a transputer hardware platform. It has been used to run an emulated version of DARTnet, as well as an x-kernel.

x-Sim [15] runs over the x-kernel; it supports the “direct execution” of transport layer protocol implementations within the x-kernel.
However, process emulation in the form of direct execution is restricted to the execution of a single instance of the protocol running on the operating system.

The development of simulation tools has not been limited to research environments. Among a number of commercial simulation packages are SimuNet [126], Prophesy [108] and CPSim [27]. SimuNet is a centralized network simulator that has been tested for a 50 node network running OSPF. Prophesy is a generic centralized simulation tool for queuing analysis. CPSim is a parallel simulation tool that can be extended for computer networks; the freely available version is limited to 256 simulated objects.

Our work on developing routing models for simulators is complementary to the general focus of end-to-end protocol simulators; hence our work will be directly applicable to the simulators that we have described in this section.

2.5 Simulation Models

Simulations are useful to study end-to-end protocol behaviour without actually deploying the protocol in an operational infrastructure. However, these studies will only be effective if the properties of the network are realistically modeled in the simulator. In this section, we review the models of traffic (Section 2.5.1), and topology (Section 2.5.2) that are used in simulations by protocol designers. We also review the models of routing within the network (Section 2.5.3) that are being developed.

2.5.1 Traffic Models

Network traffic has often been modeled as Poisson processes. This assumes that individual events (i.e., packet events) are independent of other events. However, packet arrival events do not often satisfy this property of independence. They are extremely bursty, with a large peak-to-average ratio, and are often correlated to other packet flows. A number of different models have been proposed in order to derive more accurate models.
In this section, we describe a few of these models.

By analyzing LAN traffic traces over a one-week period, Jain et al. [72] observe that data traffic cannot be modeled as either Poisson or compound-Poisson (i.e., one in which the batch arrival process is Poisson, and batch sizes are uniformly distributed). However, the traces show a high degree of source locality; i.e., packets between a particular source and destination will arrive back to back; further, such traffic will trigger protocol responses at the destination, which can be seen as traffic in the reverse direction. Such a back to back packet sequence, and the associated response packets, is called a packet train. This kind of traffic can be modeled using three parameters: the inter-car gap (i.e., the interval between two consecutive packets in a train), the inter-train gap (i.e., the interval between two trains), and the trailing time (i.e., the time between when the first packet is delivered at the host, and a response packet is seen on the wire).

Leland et al. [79], in an alternate study of corporate LAN traffic gathered over a period of 18 months, suggest that traffic actually exhibits more complex behaviours than Poisson and compound-Poisson models. They suggest that aggregate LAN traffic is self-similar; i.e., aggregate traffic on the LAN is just as bursty as traffic from individual sources, and further that traffic is fractal across all time scales of observation. This is different from Poisson traffic models, which tend to become indistinguishable from white noise when aggregated over time. Their paper proposes that fractional Gaussian noise or fractional autoregressive integrated moving-average (ARIMA) process models might be more appropriate traffic models.

As in the models described earlier, tcplib [28] is derived from the study and analysis of traces over a three-day period from multiple locations over the wide area Internet.
However, tcplib differs from the other two in two significant ways: First, it classifies traffic based on protocols, and then generates traffic models for each different protocol; second, it is a trace driven model, and has been implemented as such. It was originally added to REAL [75], and is currently used in ns [89].

tcplib only models the distributions of bytes in a TCP session. Paxson [101] extends this work by deriving complete analytical models for various TCP protocols. For example, whereas the bytes in a “TELNET-originator” (i.e., the user side of a telnet session) fit a log-extreme distribution, the packet distribution fits a log-normal distribution. Bulk data from protocols such as SMTP, NNTP, or FTP-data best fit log-normal distributions; however, the fit was just as poor or just as good as empirical models such as tcplib. In addition, the traffic in the reverse direction has little variation relative to the variation in the size of the transfer.

Paxson et al. [103] go on to more fully develop the TELNET-originator and FTP-data models. They observe that session arrivals for telnet and ftp sessions are the only events that fit Poisson processes. Three components model the TELNET-originator: packet inter-arrival is best modeled by tcplib, connection size in bytes is a log-extreme distribution, and packet-size distribution is a log-normal distribution. The paper also observes that FTP-data sessions are dominated by small, significant bursts of traffic. If a transfer is successful, then a handful of the largest bursts carry the bulk of the ftp data bytes. This paper also investigates the possible roots for self similarity, noted by Leland et al. [79], in individual protocol behaviours.

Per-protocol modeling [28, 101, 103] is important because it helps us understand the impact of protocols on the network.
This is useful in optimizing network resources; it is also useful in understanding and improving the end-to-end algorithms, such as TCP’s congestion control behaviour.

2.5.2 Topology Models

Protocol evaluation should be conducted on two types of topologies: those that stress the protocol, and those that most closely approximate operational networks. The former is dependent on individual protocols. However, the latter should be more generally modeled so that protocol designers can better understand topological characteristics.

Thomas et al. [131] classify topologies based on a variety of characteristics, namely node degree, hop diameter, metric diameter, and biconnectedness. For example, operational networks such as the Internet are generally hierarchical, with a greater number of biconnected components, and lower average node degree. The paper compares different models, and the topologies they generate, to operational networks. The observation is that random models are not good at generating moderate sized topologies that are simultaneously connected, and approximate the average node degree of operational networks.

Zegura et al. [137] go on to propose a transit-stub model for generating topologies, whose strategy matches the hierarchical nature of operational networks, and therefore generates good approximations of real network topologies.

2.5.3 Routing Models

A study of inter-domain routing protocol traces over a 12-hour period is reported in [21]. The Internet was organized around the NSFnet backbone with a well defined hierarchy. EGP and BGP were the inter-domain routing protocols in use. In this environment, the authors report that (i) nearly 90% of the route updates contained no new information; (ii) the size of an update could be approximately correlated to the location of a topology change.
A maximum of 7% of the total networks changed state during the observation interval. The most dominant change was single networks at the periphery, indicating that the network itself was fairly stable; (iii) a small number of networks exhibit a large fraction of instability (about 1% of the networks showed more than 30 transitions). Finally, they also report on the duration of unreachability for a given network, and the update latency through their network relative to the size of an update.

An end-to-end perspective on routing is presented in [102]. The methodology is to execute traceroutes from an identified set of sites to every other site in that set. The probes are executed on command from a control site, at periodic intervals exponentially distributed around a mean. They observe a number of routing pathologies; they document these pathologies, and elide them from their data set. The pathologies that they observed were routing loops, obviously erroneous routing, route changes mid-probe, route splitting, infrastructure failures, excessive hop-counts, or temporary outages. They define two measures of route stability: prevalence is the probability that, having observed a route r at present, it will be observed again in the future; persistence is the time period, having observed a route r at time t, before that route is likely to have changed. (i) They note that the prevalence of a single primary route is fairly high (around 60%) with some variations from site-to-site (variability of 50%-90%). (ii) A back of the envelope calculation suggests that route changes occur about once every 1.5 hours. However, after removing data corresponding to outliers, they observe that at hourly granularities, route changes occur every 36 hours (with a maximum changing every 12 hours).
Likewise, over a larger scale granularity, they observe route changes every 2-4 days; further that there is over a 90% chance that a particular route has a persistence of at least a week. Finally, they also study their data for route symmetry: They report that close to 50% of their routes are asymmetric; a third of these routes differed in at least two or more “hops.”

Another study of inter-domain routing protocol traces is reported in [49]. This data has been gathered at exchange points in the Internet over an 18 month period. The network at this time is no longer a well-defined hierarchy; there are multiple commercial backbones interconnecting at well-known exchange points. The data in this paper has been collected at a few such exchange points. This paper characterizes growth trends in the network in terms of the number of new prefixes, links and domains observed over this period. The observation is that the growth is “approximately linear.” They also observe that the average domain degree has remained fairly constant, and that the domain level diameter has not increased significantly either. However, the substantial growth of the Internet has resulted in more connectivity between domains within a class (for example, between stub domains, or between service providers at a particular level); this growth has also resulted in greater redundancy in the domain level paths. Finally, they characterize the stability of the Internet using two measures: prefix availability of the primary path to a domain, which they show is greater than 95% for about 80% of the prefixes; and prefix steadiness (i.e., the mean duration during which the prefix is continuously reachable), which is over twenty four hours for 80% of the prefixes.
While prefix availability and prefix steadiness are not the same measures as the prevalence and persistence measures reported in [102], they are somewhat similar in character; they are also similar in that they both report fairly close measures of route stability in the Internet.

More recently, Labovitz et al. [76], through measurements of inter-domain route updates at various network access points, suggest that there are a far greater number of route updates than is to be expected in a network the size of the Internet.

2.6 End-to-End Protocol Overview

We conclude our overview of related work with a survey of several end-to-end protocols that we have used in our case studies, as well as a review of other transport protocols that would benefit from evaluation in the context of dynamic topologies. Individual protocol designers sometimes suggest how their protocol should perform in the presence of topology changes [42]; however, we know of no systematic study of the performance and evaluation of end-to-end protocols over dynamic topologies.

Unicast Transport

Transmission Control Protocol (TCP) [107] is a connection oriented end-to-end transport protocol. It is designed to operate over an unreliable best-effort datagram service. TCP provides reliable ordered delivery of byte stream data. The protocol is widely deployed in the Internet. It has been studied extensively, and optimized for performance in a wide range of operating conditions. All our preliminary work is in the context of TCP; hence, we present key results in the congestion control and error recovery mechanisms in TCP as background.

The delay bandwidth product of a TCP session between a sender and a receiver is a measure of the maximum throughput of the protocol for that session. The end-to-end round trip time (rtt) is a measure of the delay; the TCP window size is an estimate of the bandwidth.
A TCP sender attempts to determine these values dynamically in order to maximize its throughput.

We have already observed that a TCP sender needs to estimate the rtt in order to determine the end-to-end delay. More practically, an accurate estimate of the rtt permits the sender to set its retransmission timeout (rto) as a multiple of the rtt. This is important if the sender is to avoid excessive retransmissions without simultaneously compromising its responsiveness. The protocol samples the rtt values for each packet transmitted, and uses a low pass estimation formula to converge to a mean value [70, 73]. During packet loss, the sender uses exponential backoff of its retransmission timeout. It is not possible to obtain an accurate measure of the rtt during packet loss. The Karn retransmit strategy [73] uses progressively exponential backoff of the rto; rtt estimation is suspended until a packet is delivered without a retransmission.

We now describe the congestion avoidance and error recovery strategies in TCP. We will use graphs in Figure 2.4 to illustrate these aspects of TCP. Each of the graphs in these figures (and the rest of this chapter) shows TCP throughput and behaviour as a function of the cumulative data delivered over time. As a compact representation, the y values show TCP byte sequence numbers modulo 90. A dropped packet is shown as a ‘+’. In later graphs, we will show link failure as a dashed vertical line at a particular instant in time; link recovery is shown as a continuous vertical line at a particular instant in time. The observations are made at a node incident to a bottleneck link in the topology. Two events are recorded for each packet: when the packet is queued on the bottleneck link, and when the packet is actually transmitted.
Each of these events is shown separately as a “dot.” Therefore, the separation between the two points represents that packet’s queueing delay at that node. Conversely, a single point corresponding to a given packet implies that that packet did not have any queueing delay at that node.

Congestion Avoidance in TCP

In TCP, a window of data is the maximum amount of unacknowledged data that a sender can have outstanding. The sender uses incoming acknowledgments to dynamically adjust its window size. It does this in two phases [70]. Initially, TCP enters the slow start phase, in which the sender opens the window exponentially for every incoming acknowledgment. The trace in Figure 2.4(a) between t = 1s and t = 2s is an example of slow start. The slow start phase continues until the window size reaches a threshold, which is normally set to half the current window size. The window size of a new session is that advertised by the receiver. TCP enters the additive increase phase once the slow start threshold is reached. In this phase, the sender increases its window size linearly in an attempt to probe the available bandwidth without congesting the network. Figure 2.4(a) between t = 3s and t = 3.5s is an example of the linear increase phase in TCP. Congestion is signaled by packet loss. The sender responds by a multiplicative decrease of the window size, and the threshold, and then reinitiates slow start.

Of note is one variant of TCP, “Vegas-TCP” [14], in which the sender adjusts its window in response to queueing within the network. The sender estimates the queueing within the network by making detailed measurements of the round trip time from the source to the destination and correlating those measurements to the throughput it is observing.
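The two window adjustment phases can be sketched at the granularity of one update per round trip. The following Python fragment is a simplified illustration only: the function name `evolve_window`, the choice of packets as the unit, the initial threshold, and the reset of the window to one packet on loss are assumptions made for this example, not details taken from any particular TCP implementation.

```python
def evolve_window(loss_rounds, num_rounds, ssthresh=16.0, max_window=64.0):
    """Return the window (in packets) at the start of each round trip.

    loss_rounds: set of round indices in which a loss is detected.
    ssthresh:    assumed initial slow start threshold.
    max_window:  assumed receiver-advertised limit.
    """
    cwnd = 1.0
    history = []
    for rnd in range(num_rounds):
        history.append(cwnd)
        if rnd in loss_rounds:
            # Multiplicative decrease of the threshold, then slow start again.
            ssthresh = max(cwnd / 2.0, 2.0)
            cwnd = 1.0
        elif cwnd < ssthresh:
            # Slow start: the window doubles every round trip.
            cwnd = min(cwnd * 2.0, ssthresh)
        else:
            # Additive increase: one more packet per round trip.
            cwnd = cwnd + 1.0
        cwnd = min(cwnd, max_window)
    return history

if __name__ == "__main__":
    # One loss detected in round 6, with an assumed initial threshold of 8.
    print(evolve_window({6}, 10, ssthresh=8.0))
```

In this run the window grows 1, 2, 4, 8 during slow start, then 9, 10, 11 during additive increase, and restarts from 1 after the loss with the threshold halved to 5.5.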
Error Recovery Mechanisms in TCP

When a packet is lost due to congestion, TCP must wait for its retransmit timer to fire before resending unacknowledged data. However, in most cases, the receiver sends multiple duplicate acknowledgments for the same missing data [40]. The “Tahoe-TCP” version of TCP uses a fast retransmit strategy based on this observation. The sender uses multiple duplicate acknowledgments to immediately retransmit the next packet, assuming that that packet is lost, without actually waiting for its retransmit timers to fire. This occurs at t = 2.3s in Figure 2.4(a). However, the sender will still initiate slow start.

The “Reno-TCP” version of TCP uses a fast recovery optimization, on the observation that every acknowledgment indicates a packet received. This corresponds to a packet that has left the network, and therefore a new packet that the network could sustain. Therefore, while the sender still adjusts its parameters to respond to the congestion signal, it will continue to send new packets corresponding to every new acknowledgment that it receives. We can see this at t = 2.3s in Figure 2.4(b).

Neither Tahoe-TCP nor Reno-TCP copes well in the presence of multiple packet loss. A receiver can use the TCP selective acknowledgment (SACK) option [85] to more specifically indicate the data received and queued on its end. This then permits the sender to more selectively retransmit the missing data. In Chapter 4, we will show how Tahoe-TCP, Reno-TCP, and sack-TCP perform in the presence of dynamic routing.

Multicast Transport

The related work in multicast transport protocols concentrates on four significant areas: reliable transport, congestion control, router mechanisms, and application development. We describe each of these parts in turn.

Reliable Transport

Scalable Reliable Multicast (SRM) [42] is designed in the context of multi-party white-board and shared text applications.
SRM requires the application to conform to the Application Layer Framing model [25]. In this model, a data source for that application packages data into application data units; a receiver can process complete data units out of order, i.e., it can process a data unit without necessarily having processed data units sent earlier. Because applications can process data units out of order, loss of data does not impact the receiver processing additional data from the same or other sources in the group.

[Figure 2.4: Tahoe and Reno TCP. Panels, each plotted against time: (a) Fast retransmit in Tahoe-TCP; (b) Fast recovery in Reno-TCP; (c) Multiple packet loss in TCP.]

Figure 2.4(a) shows an example of Tahoe-TCP’s fast retransmit strategy in operation. A single packet loss (shown as a ‘+’) occurs at t = 1.9s. TCP, instead of waiting for its retransmit timeout timer to fire after 3s to retransmit the packet, immediately resends the packet after multiple duplicate acknowledgements from the receiver. Subsequently, it initiates slow start after all outstanding packets are acknowledged. In Figure 2.4(b), however, Reno-TCP starts a fast recovery, instead of initiating slow start, as soon as the receiver acknowledges outstanding packets. Unlike the other two graphs, Figure 2.4(c) shows multiple packet losses in a single window. Therefore, both the fast retransmit and fast recovery mechanisms of Reno-TCP fail. The sender waits for its retransmit timer to fire; it then initiates slow start to re-estimate a new operating point.

SRM ensures reliable delivery of all data to all participants, without any guarantees about the order of delivery. The protocol is nack-based: members only send nacks when they detect a loss.
To avoid implosion, i.e., the simultaneous protocol activity to recover from a loss by all members of the group, the protocol uses randomised timers. The protocol is receiver-reliable, in that any member can respond to a nack if that member has the missing data. In Chapter 5, we study the robustness of the timer mechanisms to dynamic topologies. We describe the protocol in greater detail in Section 5.1.

A variety of different algorithms use similar mechanisms for reliable transport. Liu et al. [84, 82] propose a different adaptive timer mechanism for SRM based on estimating the neighbourhood size that experiences a loss. Similarly, RPM [50] is an SRM-like protocol for the reliable delivery of large scale route objects among multiple distributed route registries. XTP [128] and TRM [117] are other multicast transport protocols similar to SRM in operation.

Other mechanisms for reliable transport require constructing a structured hierarchy. The hierarchy improves the scalability by reducing implosions, and otherwise constraining the loss recovery to portions of the hierarchy. Holbrook et al. [61] define a multicast transport protocol that uses a hierarchy of log-servers to aid in loss recovery. RMTP [80] is an ack-based transport protocol. Designated receivers at each level of the hierarchy are responsible for generating the acknowledgements. The structured hierarchy helps avoid the implosion of acks that would otherwise be generated. TMTP [136] is a nack-based scheme. A hierarchy of domain managers provides loss recovery support. New members use an expanding ring search to find the optimal domain manager to attach to.

Congestion Control

Receiver-Driven Layered Multicast (RLM) [90] is a multicast congestion control protocol for layered video transmission [119, 130] using multicast. The signal encoding can be characterized as a hierarchy of layers. Each layer is sent to a separate multicast group. A receiver will join the appropriate number of groups that its network can support in terms of the available bandwidth to the source. During congestion, the receiver observes uniform packet loss across all the groups it receives. The receiver’s congestion control strategy in RLM is to (a) drop higher layers of encoding during congestion, and (b) add layers when spare bandwidth is available. Periodically, a receiver will conduct “join-experiments” to probe the network to detect if spare bandwidth is available. The protocol requires coordination mechanisms for two reasons: first, to ensure that multiple receivers conducting experiments do not interfere with each other; second, to ensure that one receiver’s experiments are not mistaken as congestion signals by other, otherwise correctly subscribed, receivers.

DeLucia [32] proposes a representative based congestion control algorithm for multicast bulk data transfer applications. The scheme is based on the premise that the links that result in congestion will be relatively few in comparison to the size of a multicast group. Therefore, carefully identifying the representatives, and using a positive acknowledgment scheme involving the representatives, will be a good mechanism to detect and avoid congestion.

Handley [54] describes a mechanism by which nodes downstream of a congestion form a separate group that then receives its data at a slower data rate.

Router Mechanisms

Papadopoulos [99] proposes forwarding services support to make the loss recovery more efficient. In this scheme, every router on the multicast tree elects one member adjacent to it as the “replier.” This replier will receive all requests sent by members downstream. If the replier has the requested packet, it will send a retransmission that will only be forwarded to the downstream headers.
Alternately, the replier itself has to send a request, which the router will then forward upstream to the other members of the group.

McCanne [86] explores the minimal set of mechanisms that a router could support to simplify the design and improve the performance of multicast transport protocols. As an example, he cites the sub-tree multicast (or “subcast”) in which the retransmission of the repair packet is constrained to stay within the subtree that experienced the loss.

Pretty Good Multicast (PGM) [41] explores mechanisms to optimise the recovery delay of reliable transport protocols. On detecting a loss, an end system unicasts a NACK to its PGM-capable multicast router. Each such router in turn unicasts its NACK to its upstream router. This creates an ephemeral tree that is then used to send the retransmissions. Periodically, the source sends source path messages (SPMs) that indicate when the current window of data is about to be moved forward. Once the window has shifted, it will not be possible to recover the old data in the context of PGM.

Application Development

Real-time Transport Protocol (RTP) [51] is an end-to-end transport protocol, designed for unreliable, but timely delivery of datagrams for real-time audio or video, or other multimedia or real-time applications. Real-Time Control Protocol (RTCP) [51] is the control protocol component of RTP. RTCP provides feedback on the quality of data distribution, which can then be used for diagnosis by a source, or other management applications.
Reddy [109] proposes a multicast-based application of a network of dynamically adaptive measurement servers to gather localized information about the network. ns is used in the design of the protocol to determine the scope of each measurement server so that no node or link is monitored by more than one server, and to make the algorithm more robust in the face of servers that join and leave the group.

Clustering Algorithms SRM requires each member of the group to send periodic session messages. The bandwidth required for everyone to send session messages does not scale to large groups of thousands of members. Scalable Session Messages (SSM) [123] proposes constructing a two-level hierarchy of members. The members at the top level send global session messages to the entire group. The other members cluster around their nearest global member, and only send session messages with enough scope to reach the members in that local cluster. The algorithm is self-configuring, and forms representatives and clusters to stay within pre-defined thresholds. We looked at the stability of clustering algorithms in the context of topology changes in Chapter 5. We describe this protocol in greater detail in Section 6.1.

Unicast routing aggregates the addresses of a group of destinations into a single address prefix in order to scale. Early versions of routing and addressing used hard boundaries for the address prefix, which did not scale to the large Internet. Classless Inter-Domain Routing (CIDR) [45] proposed a hierarchical address allocation strategy and variable length prefix aggregation that dramatically improves the scaling properties of the routing system. CIDR requires the system to disseminate extra routing information in certain conditions, to cope with topology change and the availability of alternate paths.
Analysis of the optimal clustering and hierarchy that is also stable to topology change and alternate paths is a complex task. Moreover, with the new IP version 6 [31, 59], many different proposals have been floated for scalable addressing and routing, with relaxed constraints on addressing architecture. This relaxation can then impact the stability of the routing system. We need to carefully evaluate such addressing architectures in the context of topology changes.

Another scaling concern in SRM and other multicast transport protocols is that loss and loss recovery in some portion of the multicast distribution tree can involve and impact the entire group. Local group concepts [83, 60, 74] attempt to improve the performance of the protocol when the losses are concentrated on a specific portion of the tree by having the members downstream of the loss form a separate sub-group to recover from the losses in that sub-group.

Finally, clustering is a common mechanism to improve the scaling properties of many protocols and algorithms. The clusters are organised around the topology. Therefore, in all of these clustering schemes, it is important to consider the response of the clustering mechanism to topology change.

Chapter 3 Persistent Route Oscillations in Inter-domain Routing

During the design of a routing protocol, configuration mechanisms are added to satisfy deployment criteria. These mechanisms can affect the correctness properties of the protocol. This can happen in spite of the fact that the original (and possibly more abstract) specifications and algorithms of the protocol are proved correct. In effect, the configuration mechanisms weaken the assumptions that were required for the correctness of the original specifications. This could lead to errors in the protocol that is eventually deployed.
This chapter describes a formal analysis of one such problem that arises from policy configuration mechanisms in inter-domain routing.

Configuration mechanisms are introduced in a protocol for deployment in operational networks. Such mechanisms may be added to permit policy expressibility, to make the protocol scale better, for administration and management, etc. Errors that are caused by the addition of these criteria can be quite subtle. Therefore, the problems may be hard to identify. They may also make it hard to evaluate alternative solutions. We know from experience that some of these errors are found long after the protocol has been deployed and has been in use for some time.

Consider simple path vector routing algorithms. Path vector algorithms have nice convergence properties [124, 125]. Because routes in these algorithms carry complete path information, they have been adopted for hop-by-hop policy routing, e.g. BGP [113] and IDRP [68]. However, the application of policy-based route selection and advertisement weakens the assumptions used to prove convergence. This convergence-critical assumption is that the routes to a destination chosen by a node at any point in time are the lowest cost routes to that destination [77, 129]. Therefore, it should come as no surprise to find that some policy configurations in BGP/IDRP (i.e., those that do not always select the shortest path) can result in those protocols not converging on routes to some destinations.

The design of routing protocols is considered a mature field. Protocols are studied in a number of different ways during design. Yet, problems are identified long after deployment in operational infrastructure. One could therefore ask why these problems were not identified earlier in the design process.
Configuration mechanisms add another dimension of correctness to any protocol that must be factored into the analysis of that protocol. However, each of the different techniques for verifying the correctness of a protocol has advantages and disadvantages that can aid or impede this process.

Configuration mechanisms, such as policy-based routing or scaling mechanisms for routing algorithms, are a required component of deploying a protocol. Formal methods tend to abstract out configuration-related issues in order to make studying the problem more tractable. Such methods already tell us that BGP/IDRP will not converge. However, since it is important for us to use the policy mechanisms in BGP/IDRP, it is important for us to determine which policy mechanisms are safe. Obviously, adding policy mechanisms and topology representations in order to analyze all possible policies in the various topology representations is not feasible. However, it is possible to analyze the protocol using other techniques. In this chapter, we will derive new formal methods to determine on a per-case basis whether or not a given set of policies can oscillate. We do this by transforming a collection of domain policies in a specific topology into an abstract graphical representation. We can then determine, based on properties of that representation, whether or not that topology will oscillate.

Simulators probably do not capture these problems because simulations, like formal methods, tend to abstract out configuration-related information. Conceivably we could apply techniques such as reachability analysis [135, 13] to simulations of protocols with the configuration mechanisms in place. However, in this situation, we believe that our formal methods are more generally applicable, and hence will be more effective. Isolating the problem is particular to the focus of study.
As we have noted earlier, the problem may be quite subtle; recognizing the problem may be hard, even using a fully functional implementation. Finally, dedicated test networks are often not large enough to stress the problem. Operational networks rarely present the complex scenarios that stress the more subtle aspects of a protocol.

Earlier, in Chapter 2, we reviewed inter-domain routing protocols and policy-based routing; we also reviewed the various proof techniques that have been applied to protocol design. In this chapter, we look at route oscillations introduced by adding policy configuration mechanisms to hop-by-hop routing protocols. The following section (Section 3.1) describes inter-domain topologies and preference functions for which BGP/IDRP exhibits persistent route oscillations. We show that these oscillations may be attributed to route “feedback,” caused by inter-dependent domain preference functions. By appropriately configuring a public-domain BGP implementation with such preference functions, we have re-created these oscillations in our laboratory testbed.

However, despite the widespread deployment of BGP in the Internet, there is no anecdotal evidence of observed route oscillations of the form discussed in this chapter. Existing provider policies are safe probably because the commercial Internet infrastructure is still in its infancy; therefore, the range of policies currently expressed is still limited. We think conditions for route oscillations are more likely to occur as the commercial Internet matures, and as the Internet transitions to the more expressive IDRP. It is important to understand the pathological situations in any protocol, however rare they may be, so as to be able to avoid these situations, and otherwise recognize and recover from them even if they should occur [105].

In Section 3.2, we study these oscillations in a restricted class of inter-domain topologies.
For these topologies, we describe a representation of domain preference functions that we call return graphs. Using this representation, we derive necessary and sufficient conditions for the existence of route oscillations in these topologies. Our derivation shows that these oscillations can happen in relatively complex ways even in simple topologies.

The existence of route oscillations in inter-domain routing points to a routing protocol design failure. We evaluate a number of different solutions in Section 3.3. We show that constraining BGP/IDRP to consider only a small “safe” subset of path attributes can reduce significantly the number of policies realizable in those protocols. Not surprisingly, perhaps, realizing richer policies through independent route selection can affect route convergence adversely in BGP/IDRP. However, in the existing commercial Internet infrastructure, mechanisms to realize policy through independent route selection are already widely deployed. In this situation, a combination of the following two approaches can be adopted. The first approach analyzes domain policies a priori to detect the likelihood of route oscillations; one or more domain policies can then be modified to avoid oscillations. A routing policy registry (such as the Internet Routing Registry [7, 1]) is useful for this. The second approach introduces additional protocol mechanisms that detect the existence of an oscillation, and modify one or more domains’ policies to suppress the oscillation. Unsafe policies can then be realized using explicit routes [38, 44].

3.1 Description of the Problem: Oscillations in BGP/IDRP

In this section, we describe inter-domain topologies and preference functions for which BGP/IDRP exhibits persistent route oscillations.
In the description of the policy configurations in BGP/IDRP that follows (and for the rest of this chapter), we require a formal notion of BGP/IDRP route computation. Hence, we now describe a simple model of route computation at each participating domain D_i. In this chapter, we assume that all references to routes pertain to address prefix x, unless otherwise stated. Domain D_i maintains the last route advertisement to x heard from each of its neighbors. D_i also maintains the last route r to x that it advertised. Suppose D_i hears a new route advertisement for x from its neighbor. It assigns a preference to this route and recomputes the most preferred route to x. If this route is different from r, D_i propagates this route. BGP/IDRP route computation is said to converge if no further route advertisements are heard at participating domains.

Even in a relatively small inter-domain topology, BGP/IDRP can exhibit persistent route oscillations (i.e., non-convergence) for routes to x. Consider three domains D_0, D_1, and D_2 connected together as shown in Figure 3.1. Suppose that domain D_0 (respectively D_1 and D_2) has a “direct” route r_0 (respectively r_1 and r_2) to the destination. With the preference functions shown in Figure 3.1, the algorithm exhibits persistent route oscillations (Figure 3.1). Intuitively, this is because the policies of the three domains are not simultaneously satisfiable.

            r_0   r_1   r_2
    D_0     r_0   r_0   r_2
    D_1     r_0   r_1   r_1
    D_2     r_2   r_1   r_2

In this figure, each domain is represented by a circle. The table is a compact representation of the preference functions at all of the domains (D_0, D_1, and D_2), and for all possible routes to destination x that can occur in this topology. The entries in each column are the routes that a domain will select when it receives a route corresponding to that column. Notice that the preference functions are inter-dependent: D_0’s most preferred route is r_2, D_1’s most preferred route is r_0, and D_2’s most preferred route is r_1. Also, D_0 will never select r_1, D_1 will never select r_2, and D_2 will never select r_0. Intuitively, we can see that there is no unique route assignment such that each node is assigned a route that satisfies its local policies. Therefore, if each of the domains has a route to destination x, then the selection and advertisement of its route to x by any domain will lead to a conflict in another domain’s route; that other domain will then change its route, and advertise a new route. Hence this set of nodes and routes will never converge on their routes to a destination x. Later in this chapter, we will see that this topology can oscillate in a number of complex patterns.

Figure 3.1: Example of a cyclic domain policy that leads to non-convergence in BGP/IDRP

The preference functions of Figure 3.1 are not simply based on the identity of the domain that advertises a route; for instance, even though D_0 hears r_1 and r_2 from D_2, it selects one and not the other. D_0’s preference functions express its policies regarding D_1 and D_2. It is not unusual for a provider to specify such a policy in the existing Internet. We think that inter-dependent policies similar to that in Figure 3.1 are not unlikely in the future Internet.

In Figure 3.1, each domain alternates between two routes: its own “direct” route and that of its anti-clockwise neighbor. At some instant t, a domain can have selected exactly one of these routes. The selected route defines that domain’s route state (or r-state) at t. If a domain is in an r-state r at t, it must have last advertised r. When a domain receives a route advertisement, its r-state may change. At different times, a domain can be in different r-states. The r-states at a domain are determined by its neighbors’ r-states, and its own preference functions. For example, D_0 has exactly two states: r_0 and r_2. D_0 never selects r_1, since the direct route r_0 is always available.

From Figure 3.1, we can make the following observations: (1) These persistent route oscillations occur in the absence of topology changes; (2) This topology oscillates regardless of route processing times and route propagation delays at the three domains; (3) The lack of a global metric in BGP/IDRP causes each domain to oscillate between loop-free paths; (4) If packet forwarding is synchronized with route exchange, packets could loop indefinitely. For example, D_1 could forward a packet to D_0. D_0 could switch its route to D_2 before forwarding that packet to D_2. D_2, in turn, could switch its route to D_1, and then forward that packet. Hence, the loop; (5) Independent of the initial r-states of the domains, BGP/IDRP always exhibits persistent oscillations in this topology; and (6) The original hop-by-hop distance vector algorithms of [124, 129, 77] are provably loop-free; local policies that are configured at each domain introduce the oscillations.

There exist preference functions that cause BGP/IDRP to behave differently for different initial r-states. Figure 3.2 shows preference functions for a cycle of four domains. BGP/IDRP converges in this topology for a particular assignment of initial states. At convergence, each domain has a route to x that satisfies its policy requirements and there are no new route advertisements that must be processed. Such a route assignment to each domain is a stable route assignment. With other initial r-states, the topology exhibits two different kinds of oscillations. In one of them, D_0 repeatedly selects r_0 and r_2. In the other, D_0 repeatedly oscillates among r_0, r_2, and r_3.
In this example, BGP/IDRP can exhibit a persistent route oscillation despite the existence of a stable route assignment.

Notice that, in the above results, we make no assertion about how the cycle of four domains arrives at any particular initial r-state. We consider protocol operation from an initial configuration in which each domain in the cycle has selected a direct route, either its own, or that of another domain in the cycle that is acceptable to it given its policy configuration. We then assert that, if the cycle is in that particular initial configuration, then oscillation, if it occurs, occurs independent of message propagation delays and route computation speeds. However, whether a cycle of domains reaches that initial configuration may depend on initial route selections at each domain, as well as route computation and propagation delays. How a cycle of domains could get to a particular initial r-state is beyond the focus of our present work.

            r_0   r_1   r_2   r_3
    D_0     r_0   r_0   r_2   r_3
    D_1     r_0   r_1   r_1   r_3
    D_2     r_2   r_1   r_2   r_3
    D_3     r_0   r_1   r_2   r_3

(a) Preference functions
(b) Initial r-states for D_0 to oscillate as (r_0, r_2, r_0)
(c) A possible stable assignment
(d) Initial r-states for D_0 to oscillate as (r_0, r_2, r_2, r_2, r_0)

3.2(a) This figure shows preference functions at four domains. Each domain has a direct route. 3.2(b) These initial r-states can be realized, for example, if D_2’s advertisement reaches D_3, D_0, and D_1 before those domains have processed their direct route. 3.2(c) If D_3’s advertisement for r_3 reaches all other domains first, these r-states result. 3.2(d) If D_2 and D_3 select their direct route, and the advertisement for r_2 reaches D_0 and D_1, these initial r-states are achieved.

Figure 3.2: BGP/IDRP behavior with four nodes

What causes the oscillations described in Figures 3.1 and 3.2?
In Figure 3.1, observe that a domain’s r-state can “feedback” into another, possibly different, r-state. Informally, when D_0 advertises r_0, D_1 transitions to r-state r_0, as its preference function dictates. Then, D_1’s advertisement of r_0 causes D_2 to select and advertise r_2. This advertisement causes D_0 to select r_2. We say that r-state r_0 returns to r_2 at D_0. Intuitively, route oscillations happen in Figure 3.1 because there exists a cycle of returns at D_2: r_1 returns to r_2, and r_2 returns to r_1. In Figure 3.2, D_0 has two such cycles, in one of which r_3 returns to itself.

    D                 A cyclic inter-domain topology in which no domain selects routes from its clockwise neighbor.
    D_i               A domain in D. Domains in D are numbered with integer subscripts. D_{i⊕1} is D_i’s clockwise neighbor.
    r_i               D_i’s direct route. We assume that this route is always present as a fall-back route.
    pref_i            A function that takes two routes, and returns the route that has a higher preference at D_i.
    r-state           At any instant t, the route selected by a domain. At different instants, a domain can select different routes.
    R_i               The collection of possible r-states of D_i.
    return relation   We say r_a returns to r_b at D_i if, when D_i advertises r_a, route feedback causes D_i to select r_b.
    return graph      For each D_i, the directed graph whose nodes are the r-states in R_i, and whose arcs express the return relationships between those states.
    G_0               The return graph at D_0. G_i is the return graph for any D_i in D.
    return cycle      A cycle in a component of the return graph. Every component of a return graph has exactly one cycle.
    C                 A single cycle in G_0. C_i denotes the return cycle isomorphic to C in G_i.

Table 3.1: Glossary of terminology and notation

3.2 Characterization of the Oscillation

We now attempt to analyze persistent route oscillations in a particular class of inter-domain topologies.
In these topologies, domain preference functions can be represented as return graphs, based on the notion of return states. We derive necessary and sufficient conditions on return graphs for the existence of route oscillations in these topologies. Table 3.1 summarizes the various terms introduced in this section.

3.2.1 Assumptions and Problem Statement

Suppose that the three domains of Figure 3.1 were part of a larger inter-domain topology. Depending on the policies of adjacent domains, the oscillations at these three domains could affect a number of other domains, perhaps triggering other “sympathetic” oscillations. Visualizing, and reasoning about, these complex oscillation patterns in general topologies is difficult.

                r_0       r_1   r_2   ...   r_{n-2}   r_{n-1}
    D_0         r_0       r_0   r_2   ...   r_{n-2}   r_{n-1}
    D_1         r_0       r_1   r_1   ...   r_{n-2}   r_{n-1}
    D_2         r_0       r_1   r_2   ...   r_{n-2}   r_{n-1}
    ...
    D_{n-2}     r_0       r_1   r_2   ...   r_{n-2}   r_{n-2}
    D_{n-1}     r_{n-1}   r_1   r_2   ...   r_{n-2}   r_{n-1}

This figure shows n domains, each with a direct route, and the table of domain preference functions. Each domain prefers its anti-clockwise neighbor’s direct route more than its own. D_1 oscillates as (r_0, r_{n-1}, r_{n-2}, ..., r_1, r_0), if some domain initially advertises its direct route.

Figure 3.3: A persistent route oscillation among n domains

For this reason, we consider a more restricted class of topologies which exhibit route oscillations. Informally, we do not believe that the kind of route feedback described in the previous section can happen in acyclic topologies.¹ So, we consider a class of simple cyclic topologies, which we call D. D contains n domains D_0, D_1, ..., D_{n-1}. Each domain D_i peers with D_{(i-1) mod n} and D_{(i+1) mod n} (respectively notated D_{i⊖1} and D_{i⊕1}). Without loss of generality, assume D_{i⊕1} is D_i’s clockwise neighbor. Assume further that each domain D_i has a direct route r_i that is always available as a fall-back.
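The ring topology D and its table-style preference functions lend themselves to a small simulation. The Python sketch below is illustrative only and rests on several assumptions that are not part of the thesis: the function names `figure_3_3_prefs` and `circulate` are hypothetical, the table is built from one reading of Figure 3.3 (each domain adopts whatever route it hears from its anti-clockwise neighbor, except its clockwise neighbor's direct route, for which it falls back to its own), and only a single advertisement is followed clockwise around the ring.

```python
def figure_3_3_prefs(n):
    """Build prefs[i][c]: the route D_i selects on hearing r_c.

    Assumed reading of Figure 3.3: D_i adopts any route it hears,
    except r_{(i+1) mod n}, for which it falls back to its direct route r_i.
    """
    prefs = []
    for i in range(n):
        row = list(range(n))      # default: adopt the route that was heard
        row[(i + 1) % n] = i      # never adopt the clockwise neighbor's direct route
        prefs.append(row)
    return prefs


def circulate(prefs, start, hops):
    """Follow one advertisement clockwise, starting with D_start advertising r_start.

    Returns, for every domain, the list of r-states it selected in order.
    """
    n = len(prefs)
    history = {i: [] for i in range(n)}
    domain, route = start, start
    history[domain].append(route)
    for _ in range(hops):
        domain = (domain + 1) % n         # next clockwise domain hears the route
        route = prefs[domain][route]      # and applies its preference function
        history[domain].append(route)
    return history


if __name__ == "__main__":
    n = 5
    history = circulate(figure_3_3_prefs(n), start=0, hops=4 * n + 1)
    for i in range(n):
        print(f"D_{i}:", " -> ".join(f"r_{r}" for r in history[i]))
```

With n = 3 the generated table coincides with the preference functions read from Figure 3.1. With n = 5 and D_0 advertising first, D_1 selects r_0, r_4, r_3, r_1 and then r_0 again, so each domain cycles through n - 1 r-states under this simplified single-advertisement model.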
Figure 3.3 describes preference functions for a persistent route oscillation in D. In this oscillation, each domain repeatedly selects n - 1 r-states.

In this work, we study those route oscillations in D that occur in the absence of topology changes and are independent of route computation times and route processing speeds. We say a D_i oscillates if it repeatedly selects a sequence of r-states r_a, r_b, ..., r_x. One of these r-states can be r_i. The other r-states must correspond to routes heard from D_{i⊖1}, D_{i⊕1}, or both. Here, we consider those oscillations in which D_i’s r-states correspond either to r_i or to routes heard from D_{i⊖1}. That is, we restrict the class of preference functions in D to those in which a D_i never selects a route from D_{i⊕1}. Our analysis also applies to oscillations in which D_i selects either r_i or routes from D_{i⊕1}. Section 3.2.5 discusses the likelihood of oscillations in which D_i’s r-states include routes from both D_{i⊖1} and D_{i⊕1}.

¹ The one exception, that we are aware of, is the acyclic topology of two nodes connected directly to each other. We can model this as the two-node topology in D.

With these assumptions, we attempt to answer the following two questions:

• Among the class of preference functions we consider, which ones can cause route oscillations in D?
• For given preference functions in D, what are the different ways (if any) in which D can oscillate?

One possible answer to these questions is suggested by the following approach. If we represent the current state of D by a vector of r-states, we can represent the next state of D as a product of a state transformation matrix and the current state of D. This transformation matrix is determined by the given preference functions. Conditions on the eigenvalues of this matrix determine whether D can oscillate or not [98].
In the next section, we describe an alternative representation of D’s preference functions, which we call a return graph. The choice of this term was motivated by the roughly analogous control-theoretic notion of a first return map (sometimes also called a Poincaré map) [52].

3.2.2 Return Graphs

In Section 3.1, we informally introduced the notion of a domain’s r-state. At any instant t, the r-state of a domain is the route it has selected. In D, the r-states of D_i can include r_i and some other domains’ direct routes heard from D_{i⊖1}. A direct route r_j is an r-state of D_i if and only if, when D_j advertises r_j, all domains between D_j and D_i (going clockwise in D, D_i inclusive) select that route. Denote the preference function at D_i by pref_i; pref_i(r_a, r_b) is the more preferred of r_a and r_b at D_i. Formally, r_j is an r-state of D_i if and only if pref_k(r_j, r_k) is r_j, for all k in j⊕1, ..., i⊖1, i. Thus, the set of possible r-states of D_i (denoted by R_i) can be determined entirely from domain preference functions.

In Section 3.1, we also introduced the returns relation between two states. We said that, at D_i, r_a returns to r_b if, when D_i advertises r_a, route feedback causes D_i to transition to r_b. Equivalently, the returns relation can be defined in terms of domain preference functions in D. Suppose that r_a and r_b are two r-states at D_i. Then, r_a returns to r_b at D_i if and only if:

$$ r_b = \mathrm{pref}_i(\mathrm{pref}_{i\ominus 1}(\mathrm{pref}_{i\ominus 2}(\cdots(\mathrm{pref}_{i\oplus 1}(r_a, r_{i\oplus 1}), \cdots), r_{i\ominus 2}), r_{i\ominus 1}), r_i) \qquad (3.1) $$

That is, when D_i selects r_a, D_{i⊕1} selects pref_{i⊕1}(r_a, r_{i⊕1}), D_{i⊕2} selects pref_{i⊕2}(pref_{i⊕1}(r_a, r_{i⊕1}), r_{i⊕2}), and so on, exactly once around D.

Given the preference functions in D, we can define at D_i a directed return graph G_i, whose nodes are the r-states in R_i. G_i has a directed arc from r_a to r_b if and only if r_a returns to r_b. Figure 3.4 shows the return graph for the example in Figure 3.2. This collection of return graphs is an alternative representation of the preference functions shown in Figure 3.2.

In this topology, the return graph corresponding to each domain is shown adjacent to that domain. Each return graph has two components. One component is a cycle consisting of two nodes. The second component is a cycle consisting of one node.

Figure 3.4: Return graphs for the topology of Figure 3.2(a)

3.2.3 Properties of Return Graphs

We can make several general observations about return graphs. Equation 3.1 implies that each node in a return graph has exactly one outgoing arc. Such a directed graph has a well-defined structure; it may be disconnected, and each connected component generally contains one or more chains “leading into” exactly one cycle. (For, if a connected component contained two cycles, some node in that component must have two outgoing arcs.) A cycle in a return graph may have one or more nodes (Figure 3.4). A one-node cycle corresponds to an r-state that returns to itself. Cycles in a return graph G_i have several interesting properties.

1. Since every node in a return graph has exactly one outgoing arc, a node in G_i can be in at most one cycle. Moreover, the directed path leading out from any node in a return graph eventually leads into a cycle.

2. A one-node cycle corresponds to a stable route assignment. That is, if r_a returns to r_a at D_i, then the following is a stable route assignment in D: r_a at D_i, pref_{i⊕1}(r_a, r_{i⊕1}) at D_{i⊕1}, and so on.

3. From the previous property, it is trivially true that for a one-node cycle in G_i, there exists a “corresponding” one-node cycle in G_{i⊕1}.
If there exists a two-node cycle in G_i, there exists a “corresponding” two-node cycle in G_{i⊕1}. To see this, suppose that r_a and r_b constitute the two-node cycle in G_i. Clearly the states pref_{i⊕1}(r_a, r_{i⊕1}) and pref_{i⊕1}(r_b, r_{i⊕1}) (say r_c and r_d respectively) must be in R_{i⊕1}. If r_a returns to r_b in G_i, then r_c returns to r_d in G_{i⊕1}. Conversely, if r_b returns to r_a in G_i, then r_d returns to r_c in G_{i⊕1}. Finally, r_c and r_d cannot be identical; if they were, r_a and r_b must also be identical. Extending this argument, if there exists a k-node cycle in G_i, there exists a k-node cycle in every other domain’s return graph. Since a node in G_i can be in at most one cycle, the cycles in other domains’ return graphs are isomorphic to the cycles in G_i.

From Property 3, cycles in G_0 are representative of cycles in all G_i. In D, one of the r-states of every domain D_i is its direct route r_i. This r-state must lead into some cycle in G_i (from Property 1). That cycle corresponds to a cycle (call it C) in G_0. We say that r_i can activate C. Thus, in Figure 3.2(c), r_3 can activate the one-node cycle. Intuitively, if only D_i were to advertise r_i initially, cycle C would eventually be realized. More than one direct route can activate the same cycle. In Figure 3.2(b), any one of r_0, r_1, or r_2 can activate the two-node cycle. The collection of initially activated cycles defines the initial r-states of domains in D.

3.2.4 Persistent Route Oscillations in D

In this section, we describe necessary and sufficient conditions on cycles in G_0 for the existence of persistent route oscillations in D. Earlier, we said D_i oscillates if it repeatedly visits the same sequence of k r-states, for some k. We say an oscillation exists in D if at least one domain in D oscillates.
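The return relation of Equation 3.1 and the cycle properties of Section 3.2.3 can be made concrete with a short Python sketch. It is a hedged illustration rather than the thesis's own procedure: all function names are hypothetical, preference functions are encoded as a table `prefs[i][c]` (the route D_i selects on hearing r_c from its anti-clockwise neighbor), and the `FIG_3_2` table is one transcription of Figure 3.2(a) and should be treated as an assumption.

```python
# Assumed transcription of the Figure 3.2(a) preference table.
FIG_3_2 = [
    [0, 0, 2, 3],  # D_0
    [0, 1, 1, 3],  # D_1
    [2, 1, 2, 3],  # D_2
    [0, 1, 2, 3],  # D_3
]


def push(prefs, src, dst, route):
    """Carry `route`, advertised by D_src, clockwise until D_dst has selected."""
    n = len(prefs)
    k = src
    while True:
        k = (k + 1) % n
        route = prefs[k][route]
        if k == dst:
            return route


def r_states(prefs, i):
    """R_i: r_i itself, plus every r_j that survives unchanged from D_j to D_i."""
    return [j for j in range(len(prefs)) if j == i or push(prefs, j, i, j) == j]


def return_graph(prefs, i):
    """G_i as a dict: r_a -> the r-state it returns to after one trip around D."""
    return {a: push(prefs, i, i, a) for a in r_states(prefs, i)}


def cycles(graph):
    """Cycles of a graph in which every node has exactly one outgoing arc."""
    found, done = [], set()
    for start in graph:
        path, node = [], start
        while node not in done and node not in path:
            path.append(node)
            node = graph[node]
        if node in path:                      # the walk closed a new cycle
            found.append(path[path.index(node):])
        done.update(path)
    return found


def activated_cycle(prefs, i):
    """Cycle of G_0 that D_i's direct route r_i leads into."""
    g0 = return_graph(prefs, 0)
    state = i if i == 0 else push(prefs, i, 0, i)
    seen = []
    while state not in seen:
        seen.append(state)
        state = g0[state]
    return seen[seen.index(state):]


if __name__ == "__main__":
    for i in range(4):
        print(f"G_{i}:", return_graph(FIG_3_2, i), "cycles:", cycles(return_graph(FIG_3_2, i)))
    for i in range(4):
        print(f"r_{i} activates", activated_cycle(FIG_3_2, i))
```

Under this encoding every G_i has one two-node cycle and one one-node cycle, in line with the caption of Figure 3.4, and r_0, r_1 and r_2 lead into the two-node cycle of G_0 while r_3 leads into the one-node cycle.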
In D, if D_i oscillates among k r-states, then D_{i⊖1} must oscillate among at least k r-states (any state transition in D_i can only be triggered by a state transition in D_{i⊖1}, since r_i is always available and, by our assumption, D_i never selects a route advertised by D_{i⊕1}). Actually, D_{i⊖1} must oscillate among exactly k routes. Otherwise, D_{i⊖2} and all other domains in D, including D_{i⊕1}, must oscillate among more than k routes. This is a contradiction, since the state transitions of D_{i⊕1} are in turn triggered only by D_i, which oscillates among k r-states. We call the smallest repeated sequence of r-states at D_i its period.

Intuitively, if G_0 has a multi-node cycle C, D can exhibit persistent route oscillations. This happens if r_i can activate C, and D_i initially advertises r_i. Thus, there exists an oscillation in the topology of Figure 3.1 if D_0 initially advertises r_0. Less obviously, two (or more) one-node cycles can also cause oscillations in D. For example, Figure 3.5 shows a topology in D. In this topology, G_0 has three one-node cycles. There exists an oscillation in this topology if D_0 and D_1 initially advertise r_0 and r_1 respectively. The following theorem formalizes these two observations.

Theorem 3.2.4.1 D can exhibit persistent route oscillations if and only if either G_0 has at least one k-node cycle (k ≥ 2), or G_0 has more than one one-node cycle.

Proof We now sketch a proof for Theorem 3.2.4.1. The proof sketch has two parts. In the first part, we show that if G_0 has exactly one one-node cycle, D cannot oscillate. In the second part, we show initial conditions for an oscillation in D if G_0 has one k-node cycle (k ≥ 2), or G_0 has more than one one-node cycle.

1. Suppose G_0 has exactly one one-node cycle. Then, since the cycles in G_0 are representative of cycles in other domains’ return graphs, all other G_i must have exactly one one-node cycle. Assume to the contrary that D exhibits a persistent route oscillation.
Suppose that D_i’s period has two routes r_a and r_b (the proof for the case when D_i’s period has k routes is similar). Now, in G_i, the directed path out of r_a must lead into a cycle (from Property 1). The same is true for r_b. Suppose r_c constitutes the one-node cycle in G_i. If r_c is different from both r_a and r_b, then when D_i advertises either r_a or r_b, it will eventually attain r-state r_c. But that is a contradiction, since we assume that D_i repeatedly selects only r_a and r_b. If r_c is r_b, then advertisement of both r_a and r_b by D_i results in an eventual transition into r_b (i.e., r_a cannot recur in the sequence), a contradiction. A similar contradiction occurs if r_c is r_a.

[Figure 3.5: Multiple one-node cycles can cause an oscillation. Panels include (b) State 1, (c) State 2, (d) State 3. This figure shows the preference functions at D_0. The figure also shows G_0, which contains three one-node cycles. There exists an oscillation in this topology if at least two of these three cycles are initially activated. The figure also shows the three states of the oscillation, when r_0 and r_1 are initially activated.]

2. Suppose G_0 has one k-node cycle. Without loss of generality, assume that r_i can activate this cycle. Then, a start state in which only r_i is initially advertised will result in persistent route oscillations. A period of the oscillation at each D_i contains the r-states of the k-node cycle. This follows from the definition of the returns relation. Suppose G_0 has two one-node cycles. Without loss of generality, assume that r_i and r_j can activate these two cycles. An
initial state in which only r_i and r_j’s advertisements initially traverse D will result in an oscillation whose period contains the r-states in their two one-node cycles. ■

3.2.5 Discussion

In this section, we discuss the implications of the conditions for the existence of an oscillation in D. We also consider the effect of relaxing some of our assumptions about the topology and the preference functions.

Given a set of preference functions in D, we can use Theorem 3.2.4.1 to determine the different ways in which domains in D can oscillate. The theorem describes two ways: when either a single multi-node cycle, or two one-node cycles are initially activated. Oscillations with more complex periods are possible. Suppose r_a and r_b activate two different cycles C_a and C_b. The period of the resulting oscillation contains the r-states of C_a and C_b. The oscillation of Figure 3.2(d) is an example of this. However, if r_a and r_b both activate the same cycle C, it is possible for the period of the oscillation to contain two instances of each r-state in C. When two or more cycles are initially activated, the order of the r-states in a period of the oscillation depends on the routes used to activate the cycles. Figure 3.6 demonstrates this.

If G_0 has one multi-node cycle, only one cycle need be activated to cause route oscillations. If G_0 has only multi-node cycles, it follows from Property 1 above that any initial state leads to route oscillations. This is the case with Figure 3.1; there exists no stable route assignment in D. We call such return graphs unsatisfiable. BGP/IDRP admits preference functions which can result in unsatisfiable return graphs.

We considered a particular kind of oscillation, one in which route advertisements “flow” clockwise around D. In D, can D_i’s r-states include routes advertised both by D_{i⊖1} and D_{i⊕1}? We have not been able to construct examples of such oscillations without assuming some temporal ordering on each domain’s route selection policies.
Intuitively, if a period of the oscillation at D_i includes routes from both D_{i⊖1} and D_{i⊕1}, then varying route propagation delays can perturb the order of r-states within a period of the oscillation. For this reason, we believe that if D oscillates independent of route processing times and route propagation delays, the oscillations must either be clockwise or anti-clockwise. We also believe that if a more general topology oscillates independent of topology changes, there must exist at least one cycle of domains that oscillates in a clockwise or anti-clockwise manner. As we have said before, in a more general topology, other domains may exhibit sympathetic oscillations.

[Figure 3.6: Different oscillation periods for different initial conditions. This figure shows a six domain topology in D. There are two two-node cycles in G_0. The period of the resulting oscillation at D_0 when r_0 and r_1 are initially advertised differs in the order of its r-states from the period when r_0 and r_5 are initially advertised.]

To analytically examine route oscillations, we considered a constrained class of topologies. Are return graphs applicable in more general topologies? Obviously, our analysis applies to those sub-graphs of the more general topologies that satisfy the requirements for D. However, in D an r-state r’s return state was uniquely defined. In a general topology, more than one return state is possible for a given r at D_j. Whether it is possible to derive conditions for the existence of an oscillation in these more general return graphs is left for future study.
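The conditions of Theorem 3.2.4.1 can be checked mechanically once a return graph is written down. The following is a minimal sketch, not the method used in this thesis: it assumes the return graph G_0 is given as a Python dictionary mapping each r-state to its unique return state (every return state must itself appear as a key), and the names `find_cycles` and `can_oscillate`, as well as the three example graphs, are hypothetical.

```python
from typing import Dict, Hashable, List


def find_cycles(returns: Dict[Hashable, Hashable]) -> List[List[Hashable]]:
    """Return every cycle of a return graph given as r-state -> return state.

    Each r-state has exactly one outgoing edge, so every walk eventually
    enters a cycle, and a node can lie on at most one cycle.
    """
    visited = set()
    cycles = []
    for start in returns:
        path = []    # r-states seen on the walk that begins at `start`
        index = {}   # r-state -> position in `path`
        node = start
        while node not in visited:
            visited.add(node)
            index[node] = len(path)
            path.append(node)
            node = returns[node]
        # The walk stopped at an already visited r-state. It closes a new
        # cycle only if that r-state was first seen during this same walk.
        if node in index:
            cycles.append(path[index[node]:])
    return cycles


def can_oscillate(returns: Dict[Hashable, Hashable]) -> bool:
    """Apply the two conditions of Theorem 3.2.4.1 to a return graph."""
    cycles = find_cycles(returns)
    has_multi_node_cycle = any(len(c) >= 2 for c in cycles)
    one_node_cycles = sum(1 for c in cycles if len(c) == 1)
    return has_multi_node_cycle or one_node_cycles > 1


# Hypothetical return graphs, chosen only to exercise each case.
ONE_TWO_NODE_CYCLE = {"r0": "r1", "r1": "r0", "r2": "r0"}
THREE_ONE_NODE_CYCLES = {"r0": "r0", "r1": "r1", "r2": "r2"}
SINGLE_ONE_NODE_CYCLE = {"r0": "r0", "r1": "r0", "r2": "r1"}
```

Under these assumptions, `can_oscillate` is true for the first two example graphs and false for the third, which has exactly one one-node cycle. The check only says that some initial condition can lead to an oscillation; it does not identify which direct routes activate which cycles.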
3.3 Evaluation of Alternative Solutions

In the previous section, we showed preference functions for which BGP/IDRP can exhibit persistent route oscillations. We now consider constraining the protocol to allow preference functions expressed in terms of a path attribute X. Such preference functions are safe if they do not cause oscillations in a general topology. We consider the question: Do there exist safe preference functions on X? If there exists an X such that preference functions on X allow “interesting” policies, constraining BGP/IDRP to these preference functions is an acceptable solution to the oscillation problem.

We believe that if preference functions on X are safe in D, then they are safe in more general topologies. Put differently, if preference functions based on X can result in a cycle of oscillating domains in a general topology, we can construct an oscillation in D caused by preference functions based on X. Intuitively, this construction simply “extracts” the cycle of domains and their preference functions from the more general topology.

To show that preference functions based on X are safe in D, it suffices to show that the equivalent return graphs can contain exactly one one-node cycle. As we have discussed earlier, routes in BGP and IDRP carry a PATH attribute: this is a sequence of domains that the route has traversed. We now consider two possible preference functions based on the PATH attribute. We show that if all domains are constrained to selecting the shortest PATH route, oscillations cannot happen in D. We also show that if domains were allowed to independently select routes based on the first element of the PATH only (next-hop), multi-node cycles cannot form in a domain’s return graph.

1. Shortest Path Routing: In Figure 3.1, at least one domain’s r-states contains a route with a PATH longer than its direct route.
Denote by l(r_0) the PATH length of r_0 at D_0. If each domain always selected its shortest path route, then l(r_i) + 1 < l(r_{i⊕1}), i.e., l(r_i) < l(r_{i⊕1}). Putting these inequalities together, we arrive at a contradiction. This observation motivates considering shortest path route selection to realize safe policies.

We can show that this preference function is always safe in D. If r_i is D_i’s shortest path route, then any route that D_i selects and advertises will return to r_i. Therefore there is only one one-node cycle at G_i, and by extension, there is only one one-node cycle in G_0. Therefore, D will not oscillate if every domain uses shortest path route selection. We believe that shortest path route selection will not cause oscillations in other more general topologies.

This is not a new result. We know from [129, 77] that a distance vector hop-by-hop algorithm augmented with a loop suppression mechanism always converges, i.e., never oscillates. This algorithm is similar to a BGP/IDRP algorithm constrained to only select shortest PATH routes [124, 20].

2. Next-hop Policies: In Figure 3.1, a neighbor of D_0 advertises two routes to D_0. By looking at the entire PATH of those routes, D_0 selects only one of them and not the other. If D_0’s preference functions are based only on the first element of the PATH, i.e., the next-hop, then D_0 cannot assign different preferences to these two routes. This observation motivates considering next-hop based preference functions to realize safe policies. However, domain preference functions in Figure 3.5 are based on next-hop; we have shown that this topology is susceptible to a route oscillation for certain initial r-states.

Next-hop based functions cannot result in multi-node cycles in D. Consider route preference functions expressed only on the next-hop.
Each D_i has two possible choices: it prefers no route advertised by D_{i⊖1}, or it prefers every route advertised by D_{i⊖1}. If any one D_i always chooses r_i regardless of any route advertised by its neighbor, i.e., r_i returns to r_i in G_0, G_0 contains exactly one one-node cycle. If all domains D_i prefer routes advertised by their neighbors D_{i⊖1}, G_0 has n one-node cycles, one corresponding to each r_i.

Next-hop based preference functions have a possible stable assignment in D. In the next section, we show how next-hop based policies may be relatively safely realized when used in conjunction with other mechanisms. Existing Internet provider policies are largely next-hop based [7]. However, for next-hop based preference functions to cause oscillations, there must exist a cycle of domains in which every D_i prefers D_{i⊖1} over its fall-back route. The relatively small likelihood of this configuration probably explains why route oscillations have not been observed in the current Internet.

Preference functions based on shortest PATH and next-hop restrict the kinds of policies that can be realized. A domain that has multiple routes to a given destination can choose any of those routes in BGP/IDRP; but with shortest PATH route selection the domain can only choose among those routes that have the shortest path length. With next-hop based preference functions, a domain cannot express policies about providers that are not directly adjacent; domains may desire such expressivity in a commercial Internet.

Other preference functions on the PATH are likely unsafe. All of our examples of topologies in earlier sections can be created using arbitrary preference functions on the PATH attribute. We have also found that preference functions on most other BGP and IDRP path attributes are unsafe (e.g., DIST_LIST_INCL).
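The contradiction behind the shortest path argument in item 1 can be written out as a chain of inequalities. The sketch below assumes a cycle of n domains D_0, ..., D_{n-1}, with ⊕ denoting addition modulo n (both n and this reading of ⊕ are assumptions made for illustration), and that every domain in the cycle selects the route advertised by its neighbor over its own direct route, as in the oscillation of Figure 3.1.

```latex
\begin{align*}
  l(r_i) + 1 &< l(r_{i \oplus 1})
    \;\Longrightarrow\; l(r_i) < l(r_{i \oplus 1}),
    \qquad i = 0, 1, \ldots, n-1, \\
  l(r_0) &< l(r_1) < \cdots < l(r_{n-1}) < l(r_{(n-1) \oplus 1}) = l(r_0).
\end{align*}
```

Since l(r_0) < l(r_0) is impossible, not every domain in the cycle can prefer its neighbor's route when all domains select shortest PATH routes.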
We conclude that in hop-by-hop inter-domain routing protocols, such as BGP/IDRP, constraining the protocol to preference functions based on safe attributes allows only relatively “uninteresting” policies.

To realize richer policy through independent route selection, yet avoid or minimize the impact of route oscillations, two other approaches are possible: 1) Require domains to coordinate among themselves for specifying policy. This coordination can allow interesting yet safe preference functions to be realized. 2) Allow domains to independently specify their policies, and deploy mechanisms to detect and suppress oscillations.

1. Pre-analysis: Given global knowledge of the policies for all domains, it may be possible to analyze those policies for the likelihood of route oscillations. One or more domains could then modify their policies based on the results of this analysis.

One way of doing such an analysis may be to extend the return graph representation to more general topologies. We are considering this for future study. An alternative approach might be to simulate the effect of these policies off-line. Such a simulation would capture those oscillations that occur independent of initial conditions, e.g., Figure 3.1. More extensive simulations might be necessary to capture those oscillations that depend on initial conditions, for example, the oscillations in Figure 3.2.

For analysis to be possible, each domain’s policy must be available to all other domains at all times. One mechanism for making policies available is a route registry. Such a route registry currently exists in the Internet for inter-provider route co-ordination [1, 7]. This seems to be a reasonable approach for safely realizing richer policies through hop-by-hop inter-domain routing in the Internet.

2. Suppression Strategies: Global analysis detects the likelihood of oscillations a priori.
It may be acceptable to allow domains to realize their policies independently and suppress oscillations when they occur.

In order to suppress an oscillation in a cycle of domains, at least one of the domains in that cycle must modify its policies. In Figure 3.1, if D_0 modifies its preference function to assign a higher preference value to r_0 than to the route it currently prefers over r_0, the oscillation will cease. This suggests the following general rule: when a domain detects that it is oscillating, it should assign the highest preference value to its fall-back route.

It is possible to conceive of a variety of detection schemes that indicate the likelihood of oscillations. We describe two schemes that maintain some history of route transitions at a domain. The first scheme maintains all the r-states seen at a domain over some time period T. If this history contains a repeated sequence of routes and the domain is in a cycle of oscillating domains, then the rule described above will suppress the oscillation. To reliably detect oscillations, this scheme will need to keep a significant amount of history. Alternatively, a domain can maintain a time-decayed count of the route advertisements seen from each neighbor domain. If this “instability” count exceeds an empirically derived threshold, the domain may assume the likelihood of an oscillation. This scheme is currently deployed in the Internet [132] to suppress route advertisements caused by frequent topology changes.

The above detection schemes can generate false positives. In either scheme, the domain that maintains the history may not contribute to the oscillation, but may be sympathetically oscillating with some other domain that does. Instability counts cannot distinguish between oscillations and route advertisements caused by frequent topology changes. Therefore, it is desirable to apply our general rule to modify policies only temporarily. Instability based suppression [132] modifies policies temporarily.
The original policies are restored after the instability count decays below another empirically derived threshold. Depending on the decay rate and the thresholds, this scheme may suppress oscillations in cases where a stable route assignment exists (for example, with next-hop based preference functions, Figure 3.2(c)). For other kinds of oscillations, such as in Figure 3.1, domains do not oscillate when modified policies are in effect; but when the original policies are restored, they oscillate briefly until the instability based suppression is re-established. This approach only reduces the impact of persistent route oscillations on the routing infrastructure.

Finally, other detection schemes are also possible. For example, in Figure 3.1, D_0 sees an oscillation between r_0 and a route learned from its neighbor. The transition from the learned route to r_0 is a “negative transition” [111] because r_0 has a lower preference than the learned route at D_0. D_0’s advertisement of r_0 causes D_1 to make a positive transition. A negative transition followed by a positive transition could be used to indicate the likelihood of an oscillation.

3.4 Summary

Recall that the original motivation for path vector proposals [124] was to avoid the count-to-infinity loops that distance vector protocols exhibit (Chapter 2). It was possible to formally show by using assertional logic [129, 77] that a distance vector protocol that selects routes based on minimum metric will always converge; further, that a distance vector protocol that distributes path information will avoid the transient count-to-infinity loops [124]. There are similar examples in other protocols as well. Link state protocols often use Dijkstra’s Shortest Path First (SPF) algorithm to compute routes. The algorithm assumes that all nodes in the routing domain compute their routes from the same link state database.
Hence, all the nodes in the routing domain arrive at consistent routing table decisions. In order to improve the scaling properties of the protocol, a network administrator can configure abstraction hierarchies. These hierarchies hide relevant link state information. This violates the assumption that all nodes are using the same information to compute their routes. Therefore, this may result in stable packet forwarding loops. A systematic analysis of the types of abstractions that lead to problems in link state routing protocols is discussed in [112].

In this chapter, we have shown that independent route selection to realize policies in a path vector protocol can result in persistent route oscillations in hop-by-hop inter-domain routing. We believe that only shortest path route selection is provably safe. This significantly reduces the policies that can be realized using inter-domain routing.

Given the existence of a widely deployed commercial Internet infrastructure, a combination of policy analysis and instability based route suppression can be used to deal with route oscillations. The former can detect most route oscillations caused by inter-dependent policies. The latter mitigates the impact, on the infrastructure, of route oscillations not detected by analysis. Explicit routing can then be used to realize desired policies (i.e., make routes available) that hop-by-hop routing cannot safely advertise. The explicit routing component can then complement BGP/IDRP.

Evidence suggests that the addition of configuration mechanisms affects protocol correctness in subtle ways. The mechanisms alter some of the original assumptions that were used to prove the protocol correct. It then becomes harder to detect the weaknesses in the protocol, or once the weaknesses are identified, to evaluate possible solutions.
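The instability based suppression discussed in this chapter can be illustrated with a small sketch. It is not the deployed mechanism of [132]; it only assumes what the text describes: a time-decayed count of advertisements from a neighbor, one threshold above which the domain falls back to its fall-back route, and a second, lower threshold below which the original policies are restored. The class name `InstabilityCounter`, the exponential half-life decay, and all parameter values are hypothetical choices made for illustration.

```python
class InstabilityCounter:
    """Time-decayed count of route advertisements seen from one neighbor."""

    def __init__(self, half_life=10.0, suppress_at=3.0, restore_at=1.0):
        # All three parameters are illustrative, not empirically derived.
        self.half_life = half_life      # seconds for the count to halve
        self.suppress_at = suppress_at  # start preferring the fall-back route
        self.restore_at = restore_at    # restore the original policies
        self.value = 0.0
        self.last_update = 0.0
        self.suppressed = False

    def _decay(self, now):
        """Decay the count exponentially for the time elapsed since last use."""
        elapsed = now - self.last_update
        self.value *= 0.5 ** (elapsed / self.half_life)
        self.last_update = now

    def record_advertisement(self, now, penalty=1.0):
        """Account for one advertisement; return True if policies are modified."""
        self._decay(now)
        self.value += penalty
        if self.value > self.suppress_at:
            self.suppressed = True
        return self.suppressed

    def is_suppressed(self, now):
        """Report whether the fall-back route should still be preferred."""
        self._decay(now)
        if self.suppressed and self.value < self.restore_at:
            self.suppressed = False
        return self.suppressed
```

A domain using such a counter would assign the highest preference to its fall-back route while `is_suppressed` returns true. Because the original policies come back once the count decays, an oscillation such as the one in Figure 3.1 would resume briefly and then be suppressed again, which is the behaviour described in Section 3.3.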
Chapter 4

Effect of Topology Dynamics on Unicast End-To-End Protocols

A number of transport and application level protocols measure and react to network characteristics, such as the end-to-end delay, round trip time (rtt), available bandwidth, congestion, etc. For example, a TCP [107] source adjusts its data rate based on its estimate of the available bandwidth and congestion. Other examples include Receiver-Driven Layered Multicast (RLM) [90], which requires a group member to estimate the level of congestion and decide the amount of data that it can receive at any given time; in Scalable Reliable Multicast (SRM) [42], every group member estimates the rtt to all other group members, for use in its error recovery mechanisms. Common to all of these protocols is that changes to the topology can significantly change the actual network characteristics, and therefore impact the performance of the protocol, and possibly adversely affect the network.

The underlying network over which these protocols operate is a multi-hop network that offers a best-effort datagram service. In addition, the operational network is dynamic; i.e., elements within the network fail and recover on occasion. Dynamic unicast routing is required to provide end-to-end connectivity, and is the single most critical function in the Internet. Dynamic routing protocols are designed to be robust and scalable; they are also designed to be convergent, i.e., after any given change in the network, all of the nodes must compute their routes, and become quiescent within a finite time. However, in practice, they may not converge on the time scales required to be transparent to the transport protocol.
Therefore, transport protocols are designed to be adaptive to varying network conditions; in practice though, their evaluation is often conducted after their deployment on a test or operational network. Such evaluations have their own advantages and disadvantages. It is easy to conduct small, controlled experiments with multiple interacting protocols on a test network; however, a test network is constrained by its limited scale. On the other hand, an operational network allows us to evaluate the protocol on the scale at which it would be deployed; however, it is hard to impose the stress conditions required to systematically characterise protocol behaviour. It is also not desirable to subject an operational network to possible additional instabilities that could be introduced by the deployment of a yet-to-be-fully-evaluated protocol.

We propose to enhance the capabilities of the protocol designer to evaluate protocols in the presence of dynamic topologies in a simulator. This and the following chapter explore methods to evaluate and characterise end-to-end protocols in a simulator in the context of dynamic routing and topology changes.

The approach of protocol evaluation in the context of dynamic routing permits the designer to understand protocol behaviour. The protocol designer can verify that the protocol meets the desired design characteristics prior to protocol deployment. This evaluation can also aid in making the appropriate design choices. In addition, tradeoffs due to possibly conflicting goals can be evaluated during design. A common example of this is the tension between the goals of robustness and efficiency in adaptive protocols.

In this chapter, we explore methods to evaluate a protocol in a simulator in the context of dynamic topologies and unicast routing.
We call the combination of dynamic topologies and unicast routing behaviours “network dynamics.” We begin by developing the methods to evaluate a protocol, describing the components of network dynamics that must be present in a simulator (Section 4.1). We illustrate these methods using TCP as the example (Section 4.2). We present the results from our evaluation of TCP in a separate section (Section 4.3). We conclude this chapter by describing the steps that are necessary to systematically evaluate a unicast end-to-end protocol.

4.1 Simulation Methods and Mechanisms

In this section, we develop the methods to evaluate end-to-end protocols in the context of dynamic routing in a simulator. For this, we first identify the characteristics of the network that the protocol measures, and assert the properties of the protocol based on those measurements. We then determine the means and metrics to measure changes in the protocol behaviour. We can now evaluate protocol behaviour for each property in isolation in a simulator. The next step is to identify the components in the simulator to stress the chosen property, P, of the protocol. These components are the topology, the models of topology dynamics and the locations in the topology at which they should be applied, as well as the unicast routing models. We will discuss each of these components in the following paragraphs. In the following sections, we will apply these methods to the evaluation of TCP’s congestion response and the behaviour of the congestion control algorithms, and present the results of our evaluation of TCP.

Selecting the Topology. The topology required for any simulation is chosen based on the property P of the protocol that we wish to study.
For example, to study congestion control issues in unicast protocols, it is sufficient to consider a small topology with a limited number of congested links; on the other hand, we require large randomly generated topologies to study the efficiency of caching strategies in web related protocols. Even when a problem requires study using large topologies, the protocol designer starts off by using small carefully chosen topologies; this allows the designer to analyse the results of individual simulations more carefully and to identify anomalies much more easily than with large randomly generated topologies. At some later time when the protocol design has matured, larger automatically generated topologies are also useful.

Our focus at this point is to choose from the topologies that, when changed, will impact property P in our protocol. For example, to study the control aspects of a unicast protocol, we choose topologies with at least one bottleneck link and one alternate (or fallback) path for the bottleneck link. The bottleneck link can then be subjected to varying levels of congestion to study the protocol’s response under stable conditions; the alternate path is used when the bottleneck link fails (Figure 4.2). We will discuss and apply other more complex topologies when evaluating multicast transport protocols in the next chapter (Chapter 5). Other related work is attempting to formalize this approach [57].

Models of Dynamic Topologies. In any topology, a network is dynamic if one or more nodes or links in the topology can periodically fail and (possibly) recover. A number of different models of network dynamics can be used to specify the durations when a particular node or link will be down or up. In ns, we have implemented models based on Exponential and Deterministic random variables.
Two others are a Trace-based model and a Manual specification model. A protocol designer could use the existing models in ns, or specify and use others. Appendix C describes our implementation of these models in ns. The exact model that is chosen is a function of the protocol, P and the topology.

Each model is applied to one individual node or link in the topology. It is possible to obtain complex patterns of observable end-to-end connectivity by applying multiple models to different nodes and links in the topology. In this manner, we can create many different types of failures in the network, including simple failure and recovery (where a single node or link periodically fails and recovers), multiple failures and recoveries, topology partition and healing scenarios, etc. A routing protocol must then be used to compute the end-to-end connectivity after any topology change.

Unicast Routing. As with the models of dynamics, different types of routing protocols can be used. In a typical simulation that has no network dynamics, the user can compute the routes in the topology off-line, and then start their simulation. We call this one-time off-line route computation strategy “Static” routing. A similar off-line route computation strategy can be used when the network is dynamic. In this strategy, routes are recomputed over the topology after any topology change. We call such an off-line route computation strategy “Session” routing. A more realistic, but complex, option is to simulate “Dynamic” routing.

Figure 4.1 shows a TCP session running over a dynamic topology, with three different kinds of unicast routing strategies. The experiments in this figure use the topology of Figure 4.2. The figures show Link (2, 3) fail at t ≈ 0.8s, and recover at t ≈ 3.5s. The plot shows the failure and recovery events as vertical lines at the appropriate times; the arrows at the top of the graph show
the type of event, and the label adjacent to each arrow indicates the relevant link. An alternate path is available to route around the failed link. We notice that the throughput and response of TCP is different depending on the type of routing strategy used in the simulation.

[Figure 4.1: Comparison of different routing protocol options in a simulation. Panels: (a) Static routing, (b) Session routing, (c) Dynamic routing.]

Each type of routing strategy has specific consequences for the transport protocol that is being simulated. Static routing, if used over a dynamic topology, can lead to temporary partitions while a node or link is down. This explains why there is no packet throughput seen in Figure 4.1(a) when Link (2, 3) is down. Session routing, on the other hand, will always ensure end-to-end connectivity, as long as the topology is connected. However, instantaneously recomputing the routes in the topology violates the causality of events occurring in the simulation during a short transient. Hence, results from a simulation with Session routing at the network layer can be incorrect. For example, throughput measured by a protocol simulation that uses Session routing to repair network dynamics will often be higher than is normally possible. Figure 4.1(b) shows that a greater number of packets are successfully transmitted than occurs in the adjacent traces. This is a particular issue when we wish to study the transient protocol behaviour induced by topology changes.

While Dynamic routing is more realistic, it entails overhead, and can introduce its own artifacts. For example, a protocol designer must take into consideration the fact that, at startup time (t = 0s), no node in the topology has routes to other nodes, except possibly to its directly connected neighbours.
Dynamic routing takes some time to become quiescent after some number of routing protocol messages are exchanged by all of the nodes in the topology. From Figure 4.1(c), we can infer that the TCP session loses all of the initial packets. This is because the session starts at t = 0s in our simulation and at this time the route computation has not yet become quiescent. Since the retransmit timeout is very high when a session’s data packets are lost very early in the connection history (and this timeout value is set at 3s), we see the first successful packet transmission at t = 3s via the alternate path.

For the rest of the experiments in this paper, we use a Dynamic routing protocol. We have implemented a simple Distributed Bellman-Ford routing protocol in ns [89]. Our choice of parameters for the routing protocol is motivated by our desire to study the effect of topology changes on the end-to-end protocol. One well-understood effect of topology change on these protocols is a transient loss of connectivity. The duration of this loss of connectivity is a function of the routing protocol and the topology. While this area of research is quite interesting and useful when trying to understand protocol behaviour, our focus is on other effects that arise from topology changes, such as intra-network behaviours that affect protocol operation (for example, packet interleaving effects), or variations in the path characteristics (for example, sudden variations in the rtt). Therefore, we choose our parameters to ensure rapid routing protocol convergence without interfering with the simulation of the end-to-end protocols. In our simulations, the route update interval is 2s. In addition, updates are triggered whenever the incident topology at a node changes state, or when that node receives an update from a neighbour that causes it to recompute new routes.
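To make the routing discussion concrete, the following sketch runs a synchronous distance-vector (Bellman-Ford) computation over the links named in the notes to Figure 4.2, first with Link (2, 3) up and then with it down. It is not the ns implementation: the function name `distance_vector`, the hop-count metric, the link list, and the lock-step exchange rounds are assumptions made for illustration. Recomputing from scratch after the failure resembles Session routing more than true Dynamic routing; message delays, the 2s update interval, and triggered updates are not modelled.

```python
INF = float("inf")


def distance_vector(nodes, links):
    """Synchronous Bellman-Ford with hop-count costs.

    Returns (table, rounds), where table[n][dest] = (cost, next_hop) and
    rounds is the number of exchange rounds until no table changed.
    """
    neighbours = {n: set() for n in nodes}
    for a, b in links:
        neighbours[a].add(b)
        neighbours[b].add(a)

    # At startup each node only knows a route to itself.
    table = {n: {n: (0, n)} for n in nodes}
    rounds = 0
    changed = True
    while changed:
        changed = False
        # Every node advertises the table it held at the start of the round.
        snapshot = {n: dict(routes) for n, routes in table.items()}
        for n in nodes:
            for nb in neighbours[n]:
                for dest, (cost, _) in snapshot[nb].items():
                    if cost + 1 < table[n].get(dest, (INF, None))[0]:
                        table[n][dest] = (cost + 1, nb)
                        changed = True
        rounds += 1
    return table, rounds


NODES = [0, 1, 2, 3, 4]
# Links assumed from the notes on Figure 4.2 and the fallback path via Node 4.
LINKS_UP = [(0, 2), (1, 2), (2, 3), (2, 4), (4, 3)]
LINKS_DOWN = [link for link in LINKS_UP if link != (2, 3)]

table_up, rounds_up = distance_vector(NODES, LINKS_UP)
table_down, rounds_down = distance_vector(NODES, LINKS_DOWN)
```

With the bottleneck link up, Node 2 reaches Node 3 directly; with it down, `table_down[2][3]` points at Node 4 and the route from Node 0 becomes one hop longer. The `rounds` value gives a rough sense of why routes are not available at t = 0s: several exchanges are needed before the tables stop changing.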
4.2 Case Study: Characterising TCP

TCP [107] is a window based unicast transport protocol that guarantees reliable and ordered delivery of data. The size of the window of a TCP source reflects the amount of data that the source can transmit; it is set based on observations about the round trip time, the arrival of acknowledgments from the receiver, and estimates of the available bandwidth and perceived congestion. Congestion is deduced from packet drop, which in turn is deduced from the non-receipt of acknowledgments for some data. We can deduce a TCP sender’s window sizes by observing the trace of packet sequence numbers over time, as they transit the network from source to destination.

Brief notes on the topology:
- Links (0,2) and (1,2) have bandwidth 8Mb, propagation delay 5ms
- All other links have bandwidth 0.8Mb, propagation delay 100ms
- Link (2,3) has a queue limit of 6 packets
- Source of interest is at Node 0
- Link (2,3) is dynamic in selected experiments
Figure 4.2: Topology used for first set of TCP experiments

In this section, we extend the work of Fall et al. [40] to evaluate TCP behaviour in the context of dynamic topologies. In their study, the authors systematically evaluate the performance of different types of TCP in response to various levels of congestion and packet loss within the network. Our work is a continuation of their study, to evaluate protocol behaviour in the context of network dynamics. To conduct this evaluation, we use identical configurations in terms of the varieties of TCP, source models etc., with extensions to the topology and unicast routing. In this chapter, we describe the models used, and present the results of our evaluation.

We use the topology shown in Figure 4.2 for the first set of our TCP simulations. Link (2, 3) is the bottleneck link, and is dynamic, i.e., it periodically fails and recovers.
The link will be down (and up) for a period of time drawn from an exponential distribution with a mean of 1s (and 10s). The path using the Links (2, 4) and (4, 3) is the fallback path to the bottleneck link. Obviously, we cannot use Static routing, since we need to compute new routes whenever the bottleneck link fails; likewise, since Session routing violates causality during topology change transients, and our interest is in the transient response of TCP’s congestion control mechanisms, we cannot use Session routing. We therefore use Dynamic routing in our experiments.

We simulate two FTP sessions, one from Node 1 to Node 3, starting at time t = 0.5s, and the other from Node 0 to Node 3, starting at time t = 1.0s. The window size used by the latter session is 100 packets; Figure 4.5 presents the time vs. sequence number plots of the session from Node 0 to Node 3, when it uses three different flavours of TCP: Tahoe TCP, Reno TCP, and TCP SACK.

Figure 4.3: Tahoe TCP throughput in stable topologies

The simulation configuration above (topology + unicast routing) differs from [40] in two significant ways: (1) their topology has no alternate paths between the sources and sink through Node 4, and (2) they use Static routing to precompute the routes in the topology prior to simulation execution. We can verify that neither of these changes affects protocol behaviour when the topology is stable. As an example, Figure 4.3 shows the throughput of the two FTP sessions over Tahoe TCP. In this figure, and all other figures in this chapter, we plot the sequence numbers modulo 90. We trace the sequence numbers of packets traversing Link (2, 3). The sequence numbers of the acks for each packet are also shown as tiny dots. These acks are seen approximately one rtt after the packet was first sent by the source.
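The failure model described above for Link (2, 3), with exponentially distributed down and up periods, can be sketched as a simple event generator. This is an illustrative sketch only, not the simulator's own code: the function name `link_schedule`, the assumption that the link starts in the up state, and the use of a fixed seed are all choices made here for the example.

```python
import random

def link_schedule(duration, mean_up=10.0, mean_down=1.0, seed=0):
    """Return a list of (time, new_state) events for one dynamic link.

    Up periods and down periods are drawn from exponential
    distributions with the given means (in seconds). The link is
    assumed to be up at t = 0.
    """
    rng = random.Random(seed)
    t = 0.0
    up = True
    events = []
    while True:
        mean = mean_up if up else mean_down
        # expovariate takes the rate, i.e. 1 / mean.
        t += rng.expovariate(1.0 / mean)
        if t >= duration:
            break
        up = not up
        events.append((t, "up" if up else "down"))
    return events

events = link_schedule(duration=200.0)
for when, state in events[:4]:
    print(f"t = {when:7.3f}s  link goes {state}")
```

Running the generator with different seeds gives different failure and recovery instants, which is one way to produce the multiple scenarios of network behaviour that such an evaluation needs.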
Figure 4.3 shows the throughput from each of the FTP sessions when the topology is stable. The upper plot shows the throughput of Session 1, the FTP session from Node 1 to Node 3, and the lower plot shows the throughput of Session 0, the other FTP session from Node 0 to Node 3. Each session experiences congestion and consequent packet drops at t ≈ 1.5s. Following the packet drop, each session adjusts its window, and resumes transmission. This behaviour is identical to that described by Fall et al. [40].

Figure 4.4: Tahoe TCP throughput in Dynamic Topologies

Our focus of research is on protocol behaviour in the context of topology changes. We will therefore consider the periodic failure and recovery of Link (2, 3); specifically, the link will fail at t ≈ 0.8s and recover at t ≈ 3.5s. These events are shown as vertical lines at appropriate points in time. The alternate path through Node 4 is available when Link (2, 3) fails. Our plots will show the trace of packets through Link (2, 4) when Link (2, 3) is down. In the following section, we present the results of our characterisation of TCP in the context of network dynamics.

4.3 Characterising TCP: Some Results

Unlike in the stable case, we see from Figure 4.4 that each session experiences different types of network behaviours. Session 1 from Node 1 to Node 3 experiences packet loss due to link failure early in its connection phase (Figure 4.4). In fact, Session 1 in each of the variants of TCP experienced this same type of loss, and exhibited the same response to this loss. Hence, we do not plot the throughput of this session in our subsequent figures.
Session 0 from Node 0 to Node 3, on the other hand, experiences the arrival of interleaved acks and significant packet drops following link recovery. We now discuss these two effects exhibited by this session from Node 0 to Node 3.

(a) Reno TCP (b) SACK TCP
The sessions all create transient congestion after Link (2,3) recovers; we can see the difference in the recovery mechanisms of the three sessions after the topology change event:
- Reno TCP (Figure (a)), after a fast retransmit, waits for its RTO timer to fire, and then initiates slow start.
- SACK TCP (Figure (b)) appears to behave identically to Reno, but has slightly better throughput than Reno TCP.
- Tahoe TCP (Session 0 in Figure 4.4) initiates slow start immediately after a fast retransmit, and hence has the highest performance of the three.
Figure 4.5: TCP throughput traces for Reno and SACK TCPs in Dynamic Topologies

We first make a couple of observations about the packet drops occurring at t ≈ 3.5s in Session 0 from Node 0 in Figure 4.4, and in each of the sessions in Figure 4.5. In this session, all of the packet drops occur in a single window of packets. These drops occur because of routing and topology changes. Hence, routing changes can lead to multiple packet drops. Moreover, these packet drops occur just after the recovery of Link (2,3). This was contrary to our expectation that packet drops occur due to link failure, but never due to link recovery. Therefore, packet drops may occur due to a link recovery.

Figure 4.6: Effect of Interleaving Packets on TCP SACK

One of the reasons for this packet drop is the arrival of interleaved acks.
These acks are in transit across both the original longer path, and the newer shorter path, at the instant of link recovery. In general, there are several possible consequences of such packet or ack interleaving. One consequence of interleaved acks/packets is that the sender can get three dup-acks, and assume that a packet has been lost, even though the packet is not lost. This forces all variants of TCP to do a fast retransmit of the “lost” packet. TCP Reno, in particular, is susceptible to going into a retransmit timeout needlessly; this behaviour in response to multiple packet drops is characteristic of TCP Reno. Another possible consequence of interleaved acks is that the sender suddenly sees a big jump in the sequence number acked, and therefore sends a large burst of packets back to back. This can then result in actual packet drops as opposed to simply perceived packet drops. We saw this earlier in Figure 4.5, when packets from three types of TCP were dropped.

In a different simulation of TCP SACK using the same topology of Figure 4.2, we see both of these consequences (1. an unnecessary retransmit of out-of-order packets, and 2. a real packet drop) exhibited by a TCP SACK source (Figure 4.6). In the figure, the receiver, Node 3, receives a few packets out of order. These are six packets transmitted across Link (2,4) just before Link (2,3) recovers at t ≈ 3.5s; they are in transit across the alternate longer delay path. Subsequent packets in the same window arrive at the destination earlier using the shorter Link (2, 3). (1) The destination continually acks for the interleaved packets still in transit across Links (2,4,3); after receiving three dup-acks, the sender retransmits the six packets at t ≈ 3.8s even though they are not actually lost. (2) Subsequently, based on the jump in the sequence number acked by the receiver, the source sends a burst of packets (at time t ≈ 4.0s) that then results in the actual packet drops.
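The two consequences of interleaving can be reproduced with a toy model of cumulative acknowledgments. The sketch below is a simplification under stated assumptions: it models a plain cumulative-ack receiver and a dup-ack counter with a threshold of three, it ignores SACK blocks, windows and timing, and the function names and the arrival order in the example are hypothetical.

```python
def cumulative_acks(arrivals):
    """Return the cumulative ack (next expected sequence number)
    that the receiver sends for each arriving packet."""
    received = set()
    next_expected = 0
    acks = []
    for seq in arrivals:
        received.add(seq)
        while next_expected in received:
            next_expected += 1
        acks.append(next_expected)
    return acks

def fast_retransmits(acks, threshold=3):
    """Return the sequence numbers a sender would fast-retransmit,
    i.e. those for which `threshold` duplicate acks are seen."""
    retransmitted = []
    last_ack = None
    dup_count = 0
    for ack in acks:
        if ack == last_ack:
            dup_count += 1
            if dup_count == threshold:
                retransmitted.append(ack)
        else:
            last_ack = ack
            dup_count = 0
    return retransmitted

# Hypothetical arrival order: packets 2..7 are still on the longer
# path when packets 8..11 arrive over the recovered shorter link.
arrivals = [0, 1, 8, 9, 10, 11, 2, 3, 4, 5, 6, 7]
acks = cumulative_acks(arrivals)
print("acks:", acks)
print("fast retransmit of:", fast_retransmits(acks))
```

In this trace, packet 2 is retransmitted although it was only delayed, and the last ack jumps from 7 to 12 once the delayed packets arrive, which is the kind of jump that lets the sender release a burst.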
This simulation points to a slightly less conservative behaviour in TCP SACK regarding which packets it should retransmit. The source waits for three dup-acks before retransmitting the first packet, as do TCP Tahoe and TCP Reno. However, based on the selective acks from the receiver, the source infers loss of packets not explicitly acked and retransmits them, leading to possibly unnecessary retransmits of packets that are not actually lost. We saw this earlier in Figure 4.6, when it retransmitted all six packets that were interleaved at t ≈ 3.5s. In contrast, TCP Tahoe, which also retransmits packets that are not likely lost, is controlled by slow start; TCP Reno goes into fast recovery assuming that only one packet has been dropped (footnote 1).

4.4 Chapter Summary

In summary, we have shown that studying TCP over dynamic topologies can reveal interesting aspects of TCP. An important consequence of dynamic topologies that we have highlighted is the extensive reordering of packets that can arise due to a small topology change. It is important to recognize that the results in this section are not necessarily a function of the size of the topology. Packet interleaving occurs in the immediate vicinity of a topology change, and this localized reordering is maintained all the way to the end-to-end protocol operating at the edges of the network. The ability to study such behaviours is very useful in protocol design.

As an aside, one interesting observation is that simple changes in the topology were not always sufficient to cause this transient congestion. The delay-bandwidth product between the source and the destination is a measure of the amount of data that can be in transit through the network. A TCP source measures this quantity implicitly, and adjusts its throughput in response to these measurements.
Changes to the topology can result in changes to the end-to-end delay and bandwidth. We would expect that all such changes in the end-to-end delay or bandwidth will result in changing the behaviour of the TCP source. However, we found that such changes in the network do not always lead to the types of congestion seen earlier (Section 4.3).

Footnote 1: It is generally the case that, if multiple packets in a window have been dropped, the TCP Reno sender ultimately waits for a retransmit timeout, and then initiates slow start [40].

Earlier, in Section 4.1, we had specified a number of experiments, including running the experiment on different topologies, as well as multiple scenarios of network behaviour, such as routing transients, multipath routing, or network partition. Not all of these experiments are interesting in the context of TCP. This is because TCP has been deployed for a substantial number of years, and is a well-understood and mature protocol. We have used TCP to successfully explore the viability of our methods of characterising protocol behaviour in the presence of network transients. In the following chapter, we will apply these methods to the analysis of multicast transport protocols, which are still in their infancy of design and deployment.

Chapter 5

Multicast Transport Protocol Evaluations

In the previous chapter, we described a methodology to characterise unicast transport protocols in the context of network dynamics. In this chapter, we extend this methodology to multicast transport algorithms.

The members of a multicast group can differ in their processing power, connectivity and path characteristics to the other members, etc. This heterogeneity imposes constraints unique to multicast protocols.
Multicast based protocols are often adaptive to cope with the heterogeneity. The diversity of interaction also makes reliable delivery and the design of multicast transport protocols more complex than the design of unicast transport protocols.

In general, multicast transport protocols differ from their unicast counterparts in other more significant ways as well. Unicast transport protocols, such as TCP, are only affected by changes in the topology if the changes occur on the path between the two end-points. By contrast, a multicast session traverses a greater number of links at any given instant. Therefore, the likelihood is greater that changes in the topology occur between some of the members in the session, and hence impact protocol behaviour for those members downstream of the change.

Partitions and partition healing are more complex situations for multicast protocols. Many multicast protocols are designed to continue operation in spite of partition. Contrast this with unicast protocols where, because of fate-sharing, nodes affected by a partition will make some number of attempts at recovery and then fail. In fact, many multicast applications do not even assume knowledge of individual group membership and there is no apparent difference between partition and membership dynamics [87].

An important distinction to bear in mind is that network dynamics, specifically partition and healing, introduce complexities different from those captured by group membership dynamics. These latter dynamics model group size increase or decrease based on pre-specified distributions. During a partition, the members in each component become unreachable to each other. In spite of this, the members in each connected component continue protocol operation independent of the members in the other components. This can influence protocol behaviour when the partition heals.
The effect of a member joining a group is often less drastic than the impact of a partition healing. In the latter case, the protocol may have to cope with a large group of members suddenly becoming visible with no prior notification. In addition, each group of members has advanced its state, and neither party is in a starting configuration. Hence, characterising protocol behaviour in the presence of topology changes and network dynamics is orthogonal to group membership dynamics.

Finally, we note that the development of multicast transport protocols is still in relative infancy, compared to unicast transport protocols such as TCP. This, as well as the complexity of these protocols, the impact of topology changes, and the increased number of failure modes such as partitions, all make it imperative that we characterise multicast transport protocol behaviour in the context of network dynamics.

In this chapter, we will extend the methodology developed in Chapter 4 to multicast algorithms and apply it to the evaluation of the timer mechanisms of Scalable Reliable Multicast (SRM). We begin with a brief overview of SRM (Section 5.1). Section 5.2 outlines the methodology that we have developed. In the following sections, we apply our methodology and present the results of our evaluation of the SRM protocol timers (Sections 5.3, 5.4, and 5.5) in the context of network dynamics.

5.1 Protocol Overview

Scalable Reliable Multicast (SRM) [42] is a transport level protocol for the reliable, possibly unordered delivery of packets from one source to all recipients of a multicast group.

In SRM, “active sources” in a multicast group periodically send data to the group, as specified by the application. Each unit of data is identified by the tuple (source id, message sequence number); this corresponds to the notion of “naming in Application Data Units model” used in [42].
The source tags each unit of data with its unique identifier. In addition, every member of the group periodically sends session messages specifying the amount of data it has received from each source. Loss can be detected either through receipt of data from the source, or through receipt of a session message from some member of the group. A member that detects a loss can send out a request for that lost data. The error recovery mechanism in SRM is receiver-oriented; therefore, any member that has the requested data can respond to that retransmission request. It is not necessary that the original source of a particular data unit be the one to respond to requests for retransmission of that data. In the rest of this paper, we use packets as a synonym for data units.

In order to avoid multiple simultaneous requests by all nodes that detect the loss of a particular packet, each node, $n_i$, that detects the loss will set a random timer in the interval $[C_1 d_s, (C_1 + C_2) d_s]$, where $C_1$ and $C_2$ are the protocol request parameters, and $d_s$ is $n_i$’s distance to the source of that packet. $n_i$ will send out its request when the timer expires. If $n_i$ hears a request for that packet before its timer expires, it will cancel its timer. In either case, it will reschedule its timer using exponential backoff. Since duplicate requests from other nodes can mislead a node into backing off an already-backed-off timer, each node also delineates an ignore-backoff interval, during which it will ignore requests from other nodes. When it finally receives a repair, the node will set a hold-down timer, with a value of $3 d_s$, so that it does not attempt to aggressively schedule and send a repair in response to a duplicate (or ill-timed) request.
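The request timer described above can be written down directly. The following is a minimal sketch, not SRM's reference code: the function name `request_timer` is hypothetical, and the backoff factor of 2 per round is an assumption made for illustration, since the text here says only that the backoff is exponential.

```python
import random

def request_timer(c1, c2, d_s, backoff=0, rng=random):
    """Draw a request timer uniformly from
    [2^k * C1 * d_s, 2^k * (C1 + C2) * d_s], where k = backoff.

    c1, c2 : protocol request parameters
    d_s    : this node's estimated distance to the source (seconds)
    backoff: number of times the timer has been backed off
             (doubling per backoff is an assumption of this sketch)
    """
    scale = 2 ** backoff
    low = scale * c1 * d_s
    high = scale * (c1 + c2) * d_s
    return rng.uniform(low, high)

rng = random.Random(42)
near = request_timer(c1=2.0, c2=2.0, d_s=0.02, rng=rng)   # node close to the source
far = request_timer(c1=2.0, c2=2.0, d_s=0.08, rng=rng)    # node far from the source
print(f"near node fires after {near:.3f}s, far node after {far:.3f}s")
```

With these example values the two intervals, [0.04, 0.08] and [0.16, 0.32], do not overlap, so the nearer node always fires first; when nodes are equidistant, only the random draw separates them.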
Likewise, each node $n_j$ that can send a repair in response to a request will set a random timer in the interval $[D_1 d_r, (D_1 + D_2) d_r]$. $D_1$ and $D_2$ are the protocol repair parameters, and $d_r$ is $n_j$’s estimate of its distance to the $n_i$ that sent the request. $n_j$ sends out a repair message when its timer expires. If it receives a repair before its timer expires, then it cancels the timer. In either case, $n_j$ sets a hold-down timer, with a value of $3 d_r$, during which time it will ignore all requests for that packet, so that it does not attempt to send out a second repair in response to a duplicate (or ill-timed) request.

We consider two mechanisms by which values for the timer parameters, $C_1$, $C_2$, $D_1$, and $D_2$, are chosen. In the fixed timer mechanism, all of the nodes use identical preset values. Fixed timers are useful to illustrate the behaviour of the protocol. In practice, we need a mechanism to adapt to the observed delay and loss characteristics exhibited by the network. With adaptive timers, each node, $n_i$, that detects a loss adjusts its request parameters based on the number of duplicate requests that it receives, as well as the average delay between detecting the loss and hearing the request, expressed in units of its rtt to the source, $d_s$. Similarly, each node, $n_j$, that can repair a loss adjusts its repair parameters based on the number of duplicate repairs that it receives, as well as the average delay between receiving a request and sending a repair, normalized in units of its rtt to the requester, $d_r$. The adaptive timer mechanism will tolerate a duplicate request (repair), as long as the delay is within one rtt to the source (requester).

The terms $C_1$ and $D_1$ are called the deterministic components of the timer mechanisms in SRM. Each node has to set its timer value to at least $C_1 d_s$ (or $D_1 d_r$) from the time it detects a loss (receives a request).
If $C_2$ and $D_2$ are set to 0, then the $C_1$ and $D_1$ components ensure that nodes closer to the loss send out requests and repairs. Hence, this component helps distinguish, and favour, those nodes that are nearer to the loss. By contrast, when there are multiple nodes that are equidistant to the loss, then the nodes should use probabilistic choice in order to decide which of the nodes will send the first request or repair. The terms $C_2$ and $D_2$ specify the maximum range of the interval in which the timers can be set, and hence are called the probabilistic components of the timer mechanisms in SRM.

Earlier, we had said that each node periodically sends session messages that contain information about the latest message from each source that it has received. In addition, nodes use these session messages to estimate their distances to each other. Each node, $m_i$, time-stamps every session message it sends. In addition, for every other group member, $m_j$, that $m_i$ is aware of, $m_i$ advertises the sequence number of the last session message it received from $m_j$, as well as the time that $m_j$ sent that session message, and the time that $m_i$ received it. $m_j$ can use this information to estimate its distance to $m_i$.

In this chapter, we evaluate the timer mechanisms in a multicast transport protocol, SRM, in the context of network dynamics. Timers are relevant to a broader context of applicability than the specific protocols. They are an intrinsic component of every protocol, and getting the timer mechanisms to function correctly in an operational network is a complex task. A precise characterisation of a protocol’s timers is extremely useful in the deployment and management of a particular protocol [104].
This work lays the groundwork for a more systematic exploration of methods for evaluating the timer mechanisms in other protocols as well, in a sufficiently general context. Of particular note is the work of Helmy [57] to systematically derive the test cases for a protocol based on an analysis of its description.

We now go on to present our evaluations of the timer mechanisms in SRM in the context of network dynamics. Before we describe the results, we begin with a brief outline of the methodology required for the evaluations.

5.2 Outline of the Methodology

Topology changes that impact end-to-end protocol behaviour fall into two classes: those that leave the topology connected, so that the end-points participating in the protocol continue communicating with each other after the topology change; and those that leave the topology partitioned, resulting in the group being split into more than one separate component. Each type of topology change constitutes one part of our protocol evaluation.

Connected Topologies

When the topology is connected at all times, we identify the protocol behaviours that are a function of the topology. We can then pair-wise enumerate the changes in those protocol behaviours.

Notice that identifying the protocol behaviours and enumerating the changes requires knowledge of the protocol. As an example, the protocol timers in SRM consist of two components, the deterministic and probabilistic components, each of which is a function of the topology. In Section 5.3, we show the three possible enumerations for our evaluation. Similarly, the behaviours to characterise the clustering properties of SSM are cluster formation and breakup, and migration of nodes within one cluster to another; we will consider the possible enumerations for SSM clusters in Section 6.2.

The next step is to evaluate the protocol for each of the cases enumerated.
This consists of first deriving the specific topology on which to conduct the evaluation. One set of experiments on that topology is conducted with the topology stable and unchanging; this set serves as the base case behaviour of the protocol on that topology. The remaining sets of experiments are over dynamic topologies.

Partition and Healing

Unlike the simple failures, partitions will cause some subset of members to lose connectivity to each other. However, there is a time lag before the members discover the loss of other members. While the topology is partitioned, the members in each partitioned component may continue to function independent of the members in the other components. However, when the partition heals, each member will instantly (re)discover the presence of the other members.

The distribution of group members across the components of a partition can have an impact on protocol performance. This leads to two types of partition scenarios. In the first case, one of the components of the partition suspends adaptation and activity until the partition heals. For example, this can happen in SRM using adaptive timers, if the members of a component do not experience the loss of data packets from the sources in the component, or there are no active sources in that component. Note that it might not be a significant issue that one component has suspended adaptation during the partition, except in those cases where a member in that component was active in loss recovery and had adapted to that recovery before the partition. Following the partition and healing, another less optimal member might have better adaptation to respond to subsequent losses, leading to slightly worse loss recovery response. This leads to one type of interaction after the partition heals.
It is not immediately obvious whether or not this type of partition scenario can be adequately captured by other types of experiments, such as membership dynamics changes, or the simple failure experiments outlined earlier.

Alternately, it is possible that multiple sub-groups formed by a partition continue to adapt independently to the external characteristics. This requires that each component of the network contain the stimulus that causes members in that component to continue their adaptation. For SRM with adaptive timers, an active source in conjunction with a lossy link is an example of a stimulus to which members respond. The behaviour of the protocol when this type of partition heals can be different from the earlier type of partition. This type of partition also requires careful analysis, to isolate the effects of adaptation during partition and healing. In this thesis, we concentrate on protocol evaluation in the context of this type of partition, because it models richer protocol interactions, and models behaviours likely to occur in an operational network.

In the following sections, we will apply this methodology to the characterisation of the timer mechanisms in SRM. We will apply this methodology to the evaluation of clustering algorithms in the following chapter (Chapter 6).

5.3 SRM Timer Mechanisms in the Context of Topology Dynamics

In this section, we characterise the behaviour of the adaptive timer mechanisms in the Scalable Reliable Multicast protocol. As a point of comparison, we will also characterise the fixed timer mechanism described by Floyd et al. [42]. We will use the methodology outlined in the previous section for this characterisation.

Recall that the SRM protocol timers are a composite of the deterministic and the probabilistic components, each of which is independently affected by the topology.
We restrict ourselves to conditions where the topology alternates between two configurations, depending on the state of the individual nodes and links in the topology. Therefore, the relevant enumerations for our characterisation are three: (1) “Deterministic Adaptation,” i.e., only the deterministic component is exercised in both topology configurations, (2) “Probabilistic Adaptation,” i.e., only the probabilistic component is exercised in both topology configurations, and (3) “Mixed Adaptation,” i.e., each component is exercised alternately in the different topology configurations.

In the following subsections, we present our results for each of the above enumerations; we then present the results of our partition analysis (Section 5.4). Finally, we observe that the frequency of session messages plays an important part in loss recovery. We investigate this aspect of protocol behaviour, and present the results of our evaluation of the role of the frequency of session messages.

5.3.1 Deterministic Adaptation of the SRM timers

For our experiments, we use the ring topology shown in Figure 5.1. This topology is a very simple model of a network with alternate paths for fallback (or fallback paths). Such fallback paths are used in the event of failure of some of the other paths in the network. When the topology is stable, our topology resembles a string topology [42]. These topologies stress the deterministic component of the SRM timers. The bandwidth of each of the links in the topology is 1.5Mbps. The delay for all but one of the links is 10ms; the delay for Link (4, 5) is 80ms. Link (4, 5) is called the fallback link, and will only be used if one of the other links in the topology has failed.

As in the unicast case (Chapter 4), we use the simple Distributed Bellman-Ford routing protocol implemented in ns [89] (Chapter B).
In our simulations, the route update interval is 2s. In addition, updates are triggered whenever the incident topology at a node changes state, or when that node receives an update from a neighbour that causes it to recompute new routes.

The multicast routing protocol is a dense mode variant of DVMRP [133]. For all sources in every multicast group, each node $n_i$ computes a parent-child relationship graph; the graph specifies whether $n_i$ is upstream or downstream of each of its neighbours in the shortest path spanning tree of any given source in a multicast group. $n_i$ uses the receipt of a unicast route update from a neighbour as a signal to recompute its parent-child relationships. The multicast algorithm is a flood and prune algorithm. Therefore, $n_i$ periodically floods multicast packets to all of its neighbours that are downstream of the packet source. Neighbours that do not have any members in the group will send a prune back to their upstream neighbour. Prune state associated with each of its neighbours times out every 0.5s.

In order to keep our current analysis simple, we have assumed a group density of one, i.e., there is an SRM agent at every node in the topology. Different aspects of the protocol are dependent on group density; in this paper, we explore the behaviour of the timer mechanisms in the protocol, which depend more on the topology than on the group density. Hence, our results are not invalidated by this simplifying assumption. As a notational convenience, we do not distinguish between a node in the topology and an SRM agent running on that node in the topology.

In the topology,
- the source is located at Node 0,
- all links are 1.5 Mbps,
- all links except Link (4,5) have 10ms delay,
- the delay on Link (4,5) is 80ms,
- Link (1,2) is dynamic,
- Links (1,2) and (5,4) periodically drop data packets from Node 0.
Figure 5.1: Cyclic topologies used in the study of reliable multicast

A constant bit rate source generates two packets per second, and is attached to Node 0. We generate periodic loss of the data stream to observe the behaviour of the adaptive timers. Links (1,2) and (5,4) drop every other packet from the source. This approach allows us to precisely quantify the loss. In this configuration, only Nodes 2, 3, and 4 will experience all of the data losses and therefore attempt error recovery; the other nodes in the topology will attempt to repair the loss. The losses on the two links are independent of each other.^1 The data rate is set so that each loss and recovery will be complete before the source sends the next packet.

Each of our simulations runs for 200s. In order to let the initial routing become quiescent, as well as to let the distance estimation algorithms in SRM converge, losses begin at t = 10s; we start our evaluations after t = 20s. For clarity, we only show the results till t = 120s in our plots. We ran each experiment 31 times, with a different seed for the random number generator in our simulator each time. Unless otherwise specified, our results are a plot of the average of the 31 runs. Each plot also includes the results from each of the individual runs as tiny dots to illustrate the distribution.

^1 In particular, note that we are using dense-mode multicast, based on periodic flood-and-prune. In our topology, the fallback link, Link (4,5), is not usually used. Node 5 will periodically flood data to its neighbour, Node 4, across this fallback link. Node 4 sends a prune whenever Node 5 sends data packets across the fallback link (the exception, of course, is when the link must be used because some other link in the topology has failed). Link (4,5) will drop alternate packets seen in this periodic flood as well, and hence, over a period of time, there may not be a one-to-one correlation between packets dropped on Link (1,2) and Link (4,5).
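The periodic loss model described above (a link that drops every other data packet from the source) can be sketched as follows. This is only an illustrative model, not the simulator code used in the experiments: the class name `AlternatingDropLink` is hypothetical, and the choice to forward the first packet and drop the second is an assumption, since the text does not say which packet of each pair is lost.

```python
from dataclasses import dataclass


@dataclass
class AlternatingDropLink:
    """Hypothetical lossy link that drops every other packet from one source."""

    source: int    # node whose data packets are subject to loss
    seen: int = 0  # number of packets from `source` offered to this link so far

    def forwards(self, packet_source: int) -> bool:
        """Return True if the packet is forwarded, False if it is dropped."""
        if packet_source != self.source:
            return True  # traffic from other nodes is never dropped
        self.seen += 1
        # Assumption: forward the 1st, 3rd, ... packet; drop the 2nd, 4th, ...
        return self.seen % 2 == 1


# Each lossy link keeps its own counter, so the links drop independently.
link_1_2 = AlternatingDropLink(source=0)
link_5_4 = AlternatingDropLink(source=0)
```

Because each link counts only the packets that actually cross it, two such links can fall out of step when one of them also carries the periodic flood traffic, which matches the footnote's remark that drops on Link (1,2) and Link (4,5) need not correspond one-to-one.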
[Figure 5.2: Average recovery delay per loss. Panels: (a) Fixed Timers, (b) Adaptive Timers. The x-axis is time.]

In the remainder of this section, we will present our simulation results as part of our protocol evaluation. We evaluate the protocol in two different ways: the first, evaluation using a stable topology, is a base case evaluation (Section 5.3.1.1). We use this to illustrate the protocol behaviour, as well as to define our metrics. We then study the same simulation configuration, when the topology is made dynamic (Section 5.3.1.2). This evaluation helps us understand the characteristics of the deterministic component of the timer functions.

5.3.1.1 Base Case: SRM Timers over a Stable Network

In this section, we will evaluate SRM in a stable topology; i.e., the topology is not dynamic. This serves to briefly review the protocol behaviour, define the metrics to evaluate the protocol, and to illustrate the different data presentation methods that we will use in the following subsections.

“Recovery delay” is the time difference between when a node detects a lost data unit and when it actually receives the repair from another node. Figure 5.2 shows the average recovery delay for the nodes in the topology to recover from each loss. The x-axis corresponds to the approximate time that the loss was detected by any of the nodes. This allows us to assign a unique time stamp to that event; we use this idea in later evaluations to correlate other events in the simulation, such as topology change events. The y-axis indicates the recovery delay. In Figure 5.2(a) all of the nodes use fixed timers; in Figure 5.2(b) all of the nodes use adaptive timers. From the figures, we see that fixed timers have a constant recovery delay; adaptive timers gradually improve the recovery delay at all the nodes experiencing the loss.

[Figure 5.3: Normalized recovery delay, expressed in units of rtt per loss. Panels: (a) Fixed Timers, (b) Adaptive Timers. The x-axis is time.]

Since the recovery delay (and by extension, the average recovery delay) is a function of which node initiates the request, and which other node performs the repair, this metric can have high variability. An alternate measure, used in [42], is to normalize the recovery delay at a node, and express it in units of that node’s rtt. In particular, for any given loss, we determine the last node to receive the repair, and normalize the delay experienced by that node by its estimate of the rtt to the source. This is a measure of the worst case recovery delay experienced by these nodes. Figure 5.3 shows the plots of this metric for fixed and adaptive timer mechanisms. From Figure 5.3(a), we see that, when using fixed timer mechanisms, the worst case recovery delay for any node is over 1.2 rtt to the source for that node. On the other hand, we can see from Figure 5.3(b) that the delay is within 1 rtt for that node, when the nodes use adaptive timers.

These figures illustrate some of the original motivation in [42] for developing adaptive timer mechanisms, i.e., to reduce the recovery delay, at the expense of a marginal increase in the number of duplicate requests. We therefore look at the number of requests and repairs sent by each of the timer mechanisms.

SRM correlates the number of messages sent for each loss with the recovery delay for that loss. The protocol tries to optimise the number of messages sent for each loss. We can therefore characterise the protocol’s performance as a count of the number of request and repair messages that are sent for each loss. Figure 5.4 shows these plots of request and repair messages for fixed and adaptive timers. In order to show all of the different points in the distribution, we plot the distribution as jittered dots. From Figures 5.4(a) and 5.4(c), we can see that fixed timer mechanisms send few duplicate requests or repairs. On the other hand, adaptive timers send a slightly higher number of requests (Figure 5.4(b)) in order to reduce the recovery delays following a loss. Figure 5.4(d) tells us that adaptive timer mechanisms are more consistent in not sending duplicate repairs than fixed timers in our topology.

[Figure 5.4: Request and repair counts per loss. Panels: (a) # of requests sent in fixed timer mechanisms, (b) # of requests sent in adaptive timer mechanisms, (c) # of repairs sent in fixed timer mechanisms, (d) # of repairs sent in adaptive timer mechanisms. The x-axis is time.]

[Figure 5.5: Messages sent by each node for each loss. Panels: (a) Fixed timers, (b) Adaptive Timers. Note that this plot is the result from a single simulation run, and does not indicate any distributions.]

From the figures for normalized recovery delay and request counts (Figures 5.3(b) and 5.4(b) respectively), we see that adaptive timer mechanisms tend to admit some number of duplicate requests in order to ensure that the recovery delay is within expected bounds. For the rest of this paper, we will evaluate the behaviour of the adaptive timer mechanisms in response to topology change events.

As an aside,
Figure 5.5 shows the traces from two specific simulations; each is a plot of the nodes that send the individual request and repair messages for each loss. Each simulation is run using the same seed. Figure 5.5(a) shows the results using fixed timers, and Figure 5.5(b), using adaptive timers. With fixed timers, we see that Node 2 sends out most of the requests and Node 1 does most of the repairs, which is as we expect in a string topology. With adaptive timers, these nodes send out all of the requests and repairs, while Node 3 also sends out the occasional duplicate request.

A final measure of protocol behaviour that could help improve our comprehension of the protocol is a plot of the request and repair parameters at each of the nodes. Figure 5.6(a) shows this plot for one run of the simulation. The y-axis shows the parameters for each node offset by an appropriate amount. Note that each node is only involved in request or repair for each loss. From the figure, we see that Node 1 reduces its D_1 and D_2, so that it is always the one to send out repairs. All other nodes that can repair a loss increase their D_1, so that they would rarely send out a repair message. Node 2 reduces its C_1 and C_2, so that it always sends out a request message. By reducing its parameters significantly, and rather quickly, Node 2 ensures that the recovery delay is already fairly small. Over time, Node 3 adapts its parameters to enable it to send an occasional duplicate request. This explains the occasional duplicate request from Node 3 that we see in Figure 5.5.

[Figure 5.6: Adaptive Timers: Parameter adaptation by each node. Panels: (a) Stable Topologies, (b) Dynamic Topologies. The x-axis is time.
The above are plots of the protocol parameters for each node in the topology of Figure 5.1, offset by each node's identifier. The plot on the left occurs when the topology is stable. On the right, we see topology changes start at t = 40s and occur every 20s thereafter. Nodes 2, 3, and 4 adapt their request parameters; the request parameter adaptation is minimal, since adaptive timer mechanisms can tolerate one extra duplicate in exchange for lower normalised delay and only three nodes experience data loss consistently. The other nodes adapt their repair parameters, and Node 1 adapts to doing most of the repairs. This is characteristic of the adaptation when conducted over a stable topology. Contrast this with the plot on the right, which shows parameter adaptation occurring over dynamic topologies. We notice the parameters change every time Link (1, 2) fails or recovers. These adaptations correspond to the requests and repairs sent by each of the nodes and shown in Figure 5.5(b). Note that this plot is the result from a single simulation run, and does not indicate any distributions.]

We also notice from Figure 5.6(a) that in stable topologies, the nodes adapt to the topology very quickly, and subsequently exhibit very little adaptivity. In contrast, Figure 5.6(b) shows the same parameter adaptation for the same simulation, when the topology is dynamic. We notice immediately that some of the nodes are continually adapting to every change in the topology. In the following section, we will evaluate protocol behaviour over dynamic topologies.

5.3.1.2 Dynamic Behaviour of the SRM Adaptive Timers

In this section, we repeat the experiments of the earlier section, when running SRM over dynamic topologies. For the sake of brevity, we concentrate on adaptive timer mechanisms in the remainder of this chapter.
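Since the same two delay metrics from the base case are reused for the dynamic experiments, it may help to see how they could be computed from per-node observations of a single loss. The following is a minimal sketch under stated assumptions: the record layout (detection time, repair time, and the node's rtt estimate to the source) and all function names are hypothetical, and the sample numbers are made up for illustration rather than taken from the simulations.

```python
from typing import Dict, Tuple

# Hypothetical record for one loss:
#   node id -> (time loss detected, time repair received, node's rtt estimate to source)
LossRecord = Dict[int, Tuple[float, float, float]]


def average_recovery_delay(records: LossRecord) -> float:
    """Mean of (repair time - detection time) over all nodes that saw the loss."""
    delays = [repaired - detected for detected, repaired, _ in records.values()]
    return sum(delays) / len(delays)


def normalized_recovery_delay(records: LossRecord) -> float:
    """Delay of the last node to receive the repair, in units of that node's rtt."""
    last_node = max(records, key=lambda node: records[node][1])
    detected, repaired, rtt_estimate = records[last_node]
    return (repaired - detected) / rtt_estimate


# Illustrative values only (seconds); not measured data.
example: LossRecord = {
    2: (10.00, 10.06, 0.04),
    3: (10.01, 10.08, 0.06),
    4: (10.02, 10.10, 0.08),
}
print(average_recovery_delay(example), normalized_recovery_delay(example))
```

Note that the normalized metric divides by the node's own estimate of the rtt, not the true rtt; this is why a stale estimate after a topology change can distort the metric.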
The choice of the network dynamics model for these experiments is guided by the following criteria: (a) the topology changes must occur in such a manner that, regardless of the topology, the same set of nodes experiences the losses and the same set of nodes can repair the loss; and (b) the interval between two successive topology changes should be sufficient to let the nodes adapt to the new topology. This permits us to study the adaptation of the timer mechanisms. In addition to the above two criteria, it is also useful to have topology changes occur at predictable intervals. While this last criterion does not model any operational characteristic, it helps us isolate the effect of each topology change from the next.

Therefore, for the rest of the experiments in this chapter, we use a deterministic network dynamics model, one in which Link (1,2) periodically fails and recovers every 20s. The first event is the link failure at t = 40s. Link (1,2) will fail at t = 40s and t = 80s in our plots, and recover at t = 60s and t = 100s. During the intervals [40s, 60s] and [80s, 100s], the alternate higher delay path through Link (4, 5) is used. In our plots, we show the time at which topology changes occur as a vertical line at the appropriate instant in time. Markers on the line (seen towards the top of the plot) indicate whether the link fails (goes down) or recovers (comes up); a label indicates the link that has changed state.

Figure 5.7 shows the plots of average and normalized recovery delays for the nodes experiencing loss in the topology. In particular, Figure 5.7(a) illustrates the expected increase in the average time to recover from a loss every time the alternate path through Link (4, 5) is used. In these intervals, we see the average delay increase from ≈ 0.1s to ≈ 0.8s. Correspondingly, Figure 5.7(b) indicates that the normalized delay increases from ≈ 1 rtt to ≈ 2 rtt.
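The deterministic network dynamics model above amounts to a simple square-wave schedule for the state of Link (1,2). A minimal sketch, assuming the link changes state exactly at the stated instants and stays in that state until the next one (the function name and parameter names are illustrative, not part of the simulator):

```python
def link_1_2_is_up(t: float, first_failure: float = 40.0,
                   toggle_interval: float = 20.0) -> bool:
    """Return True if Link (1,2) is up at time t (seconds).

    The link is up until `first_failure`, then alternates between down and up
    every `toggle_interval` seconds: down in [40, 60), up in [60, 80), and so on.
    """
    if t < first_failure:
        return True
    # Number of state changes that have happened by time t (the failure at
    # t = first_failure counts as the first change).
    toggles = int((t - first_failure) // toggle_interval) + 1
    return toggles % 2 == 0


# The fallback path through Link (4,5) is in use whenever Link (1,2) is down.
for t in (30, 40, 59.9, 60, 80, 100):
    print(t, "up" if link_1_2_is_up(t) else "down")
```

Keeping the schedule deterministic is what allows each topology change to be lined up with the vertical markers in the plots.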
By design, the increase in normalized delay will cause the nodes to send more duplicate requests in an attempt to reduce their recovery delay. Figure 5.8(a) confirms that the nodes send more than 2 requests in the intervals when the normalized delay increases over 2 rtt, and decrease to sending about 1.5 requests on average at other intervals. In contrast, the number of repairs sent is about 1.5, and never changes significantly (Figure 5.8(b)).

[Figure 5.7: Recovery delay per loss: Adaptive timers over dynamic topologies. Panels: (a) Average recovery delay, (b) Normalized recovery delay. Vertical markers show the state changes of Link (1,2).]

[Figure 5.8: Request and repair counts per loss: Adaptive timers over dynamic topologies. Panels: (a) # of requests sent in adaptive timer mechanisms, (b) # of repairs sent in adaptive timer mechanisms. Vertical markers show the state changes of Link (1,2).]

While the results in terms of the duplicate requests and repairs, and recovery delays are as expected, we can observe two anomalies: (1) exaggerated spikes at t = 40s and t = 80s corresponding to the instants when Link (1, 2) fails, and (2) increases in the number of repairs sent at the same instants in time. The first anomaly is an artificial consequence of protocol behaviour operating over dynamic topologies. The second is a real consequence of protocol behaviour. We discuss each of these in turn.

[Figure 5.9: Trace of messages for a single loss following a link failure.
This figure shows a possible sequence of events following a link failure, in which nodes set and send multiple rounds of requests and repairs. The x-axis shows the time; along the y-axis, we plot the timeline for each node, and the sequence of events at that node. The arrows indicate the duration of the hold-down timer following the sending or receipt of a repair message. The square brackets show the interval from which a node sets its nack or repair timer. In this topology, Nodes 2, 3, and 4 send requests, and the other nodes send repair messages.]

The anomalous spikes in the normalized delay that occur when Link (1,2) fails are an artifact of operating over dynamic topologies. The reason for these spikes is that, just after the link failure, the time to recover increases for all the nodes. However, the distance estimates at each of the nodes are still the original, much smaller estimates. In particular, in our ring topology, the node that will get the repair last is the node that had the shortest distance estimate of all the three nodes before the failure. All three factors above contribute to the exaggerated spikes in the normalized recovery delay at the instant of the link failure.

Since the adaptive timer mechanisms attempt to optimize normalized delay by sending duplicate requests, we might expect that the spikes in the number of requests at t = 40s and t = 80s in Figure 5.8(a) are reasonable. However, this does not explain the corresponding peaks in the number of repairs at those instants (Figure 5.8(b)). The reason for these becomes apparent when looking at a detailed trace of a single request and repair cycle following a link failure (Figure 5.9).

Recall that, just after Link (1,2) fails, the nodes have very short distance estimates to each other; these estimates are based on information learned before the topology change. Therefore, each of the nodes will set its timers too short. Nodes experiencing a loss will set multiple “rounds” of requests, sending a request after each round, and setting their timers again, and again, until they receive the repair. In a similar manner, nodes that can send the repair use correspondingly short distance estimates. They will set short hold-down times following the repair, and send multiple repairs, corresponding to each of the requests sent. Detailed traces of the individual losses indicate this behaviour.

We can see this clearly in the trace of Figure 5.9. The figure shows the time lines at all of the nodes following a failure of Link (1,2) at t = 40s. Nodes 2, 3, and 4 go through at least two rounds of request, and they send three requests between them. Nodes 0, 1, and 7 set their hold-down timers too short, and set and send multiple repairs. Node 0 in particular sends three repairs corresponding to each of the requests it receives.

[Figure 5.10: Maximum request and repair rounds per loss: Adaptive timers. Panels: (a) # rounds of request, (b) # rounds of repair. The x-axis is time.]

Since the nodes execute multiple request and repair rounds following a topology change, Figure 5.10 plots the maximum rounds of request and repair executed by any node. From Figure 5.10(a), in dynamic topologies the nodes execute multiple rounds of requests: ≈ 1.8 request rounds when Link (1, 2) is down, and ≈ 1.5 request rounds at other times. As expected, we also see that the transient rise in the number of rounds of repair that we discussed in the earlier paragraphs occurs at t = 40s and t = 80s (Figure 5.10(b)). However, we also observe a secondary phenomenon in this figure; the nodes execute more repair rounds when all the links in the topology are up, and the distances between the nodes are optimal, i.e., in the time intervals [60s, 80s] and [100s, 120s]. It is not immediately obvious to us why this occurs, and it requires further investigation; we have deferred this investigation for future work.
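The effect of a stale distance estimate on the number of request rounds can be illustrated with a small model. This is a deliberately simplified sketch, not the simulator's logic: it assumes the request timer of [42] is drawn uniformly from 2^i [C_1 d, (C_1 + C_2) d] for backoff round i and distance estimate d, it ignores suppression by other nodes' requests, and the function names, parameter values, and scenario numbers are all hypothetical.

```python
import random
from typing import Optional


def request_timer(d_est: float, c1: float, c2: float,
                  backoff_round: int, rng: random.Random) -> float:
    """Draw a request timer from 2^i * [C1*d, (C1+C2)*d] (assumed SRM form)."""
    scale = 2 ** backoff_round
    return scale * rng.uniform(c1 * d_est, (c1 + c2) * d_est)


def requests_before_repair(d_est: float, repair_arrival: float,
                           c1: float = 2.0, c2: float = 2.0,
                           rng: Optional[random.Random] = None) -> int:
    """Count requests a node sends before the repair arrives.

    `repair_arrival` is measured from the instant the loss is detected.
    Each expired timer sends one request and starts the next backoff round.
    """
    rng = rng or random.Random(0)
    elapsed, sent = 0.0, 0
    while True:
        elapsed += request_timer(d_est, c1, c2, sent, rng)
        if elapsed >= repair_arrival:
            return sent  # repair arrived before this timer expired
        sent += 1


# Stale (short) estimate vs. accurate estimate, with the repair arriving 0.2s
# after detection; c2 = 0 makes the timers deterministic for this example.
stale = requests_before_repair(d_est=0.02, repair_arrival=0.2, c1=2.0, c2=0.0)
accurate = requests_before_repair(d_est=0.16, repair_arrival=0.2, c1=2.0, c2=0.0)
print(stale, accurate)
```

In this toy scenario the node with the stale estimate fires its timer in successive rounds before the repair can possibly arrive over the longer path, while the node with the accurate estimate is still waiting on its first timer. The same reasoning applies to hold-down and repair timers set from short estimates.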
[Figure 5.11: Topologies for Probabilistic and Mixed Adaptation of the SRM Timers. Panels: (a) Probabilistic Adaptation: Star-like Topologies, (b) Mixed Adaptation: Star-Ring Topologies.]

5.3.2 Probabilistic and Mixed Adaptations

In the previous section, we described the results of our evaluation of the deterministic adaptation of the protocol timers. We also have to evaluate the probabilistic and mixed adaptation of the timers. Here, we give examples of topologies that could be used for such evaluations, and the anticipated results of the evaluations.

Evaluation of Probabilistic adaptation

The topology shown in Figure 5.11(a) is a star topology with n_0 at the center, when n_0 is up; it becomes a different star topology, with n_1 as center, when n_0 is down. The delays and costs on all the links incident on n_0 are the same, and lower than the delays and costs on all the links incident on n_1. n_0 and n_1 themselves do not participate in the multicast session. We know that the star topology stresses the probabilistic component of the protocol timers [42]. Our choice of topology permits us to evaluate the adaptation of the probabilistic component of the SRM timers in the context of topology changes.

The occurrence of loss should be symmetric in this topology. This condition of symmetry implies that any member that experiences a loss of packets from a source when n_0 is up should continue to experience similar loss of packets from that source when n_0 is down. In our topology, we can place lossy links either so that all members other than the source experience loss of packets from the source, or so that only one member experiences the loss of packets. Each pattern of loss will impact protocol behaviour differently. In the former case, all the members (but one) will adapt their request parameters, because of the losses they experience.
In the latter, the members will adapt their repair parameters, based on the requests they receive.

As with the ring topology, the distance estimates measured by each of the nodes change dramatically whenever the topology changes. This can result in the nodes sending and receiving more duplicate requests (or repairs) than normal, until the nodes obtain a more accurate estimate of the distances in the new topology. Therefore, for a short duration after a topology change, each node’s measure of the normalised recovery delay will be incorrect. The incorrectness can either be an under-estimate, when the nodes in the topology switch to using a longer delay path (i.e., n_0 is down), or it can be an over-estimate, when the nodes in the topology switch back to the shorter delay path.

Each case results in different behaviours from the protocol. When the topology changes to use longer delay paths (n_0 is down), we expect a transient increase in the number of duplicates; the average recovery delays will increase in the alternate topology, because of the longer delay paths. On the other hand, when the topology switches back to the shorter delay paths, the nodes have a higher estimate of the delay. The nodes will now, on average, take longer to respond to a loss. This results in fewer duplicates than expected for adaptive timer mechanisms. Each of the behaviours is a transient phenomenon that will be rectified once the nodes have accurate estimates of the distances to each other.

There is one other interesting artifact of over-estimation that is not detectable in our ring topology. In the star-like topology, we will observe a temporary decrease in the normalised recovery delay. In the ring topology of the earlier section, we expect, but do not observe, such a decrease when switching to a path with lesser delay.
In that topology (Figure 5.1), the distance estimate of the last node to receive a repair following a topology change is comparable to the distance used in the alternate path. Hence, any decrease in the normalised delay we might observe is negligible. Contrast this with our current star-like topology, where the distance estimates of the last node to receive a repair are vastly different from the older estimates based on the alternate topology. Hence, the decrease will be much more appreciable, and noticeable.

One final observation concerns the behaviour of adaptive timers in star-like topologies. Recall that in star topologies, some subset of the nodes will adapt their timers to continually respond to loss actively, while others will adopt a more passive stance, by differently adapting their parameters. This is, in general, applicable to our star-like topology. However, after a topology change, the nodes that were responding to loss actively in the previous topology configuration will continue to do so after the topology change. If the delays in the alternate topology, through n_1, are not uniform on all links in that topology, then the performance of SRM in the alternate topology could be sub-optimal, with increased duplicates and delays until the nodes find an optimal operating region for that topology. The delays through n_0 are uniform, and hence the nodes that are active in this alternate topology will remain active when n_0 recovers; thereafter this will be a stable operating region for this topology.

Evaluation of Mixed adaptation

The topology shown in Figure 5.11(b) is a star topology with n_0 at the center, when n_0 is up, and a string topology when n_0 is down. The delays and costs on all links incident on n_0 are the same. The costs on these links incident on n_0 must be lower than those on the rest of the links that form the string topology.
Thus, when n_0 is up, the session uses a star topology; otherwise, the session uses a string topology; hence, this is the ideal topology for the evaluation of mixed adaptation of the SRM timers.

As before, the location of the loss can impact how the members respond to that loss. The loss can occur on links incident on the source, so that all other group members suffer the loss; alternately, the loss can occur on links incident on one member, so that only that member suffers the loss, and all other members perform loss repair.

While we might reason that this evaluation corresponds to the worst case adaptation for SRM, it is not clear that this is necessarily so. Notice that, in our earlier characterisations, the deterministic and probabilistic components adapt independently of each other, based on the actual topology. In each topology configuration, the nodes will adapt the component appropriate for that topology: the deterministic component adapts to the string topology, the probabilistic to the star topology. Since the adaptation in each topology configuration is independent of adaptation in the alternate topology configuration, the only critical factor affecting SRM performance will be the accuracy of distance estimation. Therefore, when the topology is a string, the transient response of SRM will be similar to our characterisation using ring topologies earlier (Section 5.3.1); likewise, the transient response of SRM when the topology is a star will be similar to our characterisation earlier in this section. Each node will adapt its timer configuration parameters to an optimum suitable to both topology configurations.

In all of our evaluations of SRM in the context of topology changes and connected topologies, we notice the critical necessity for each node to obtain accurate distance estimates to all other nodes as quickly as possible.
Each node estimates its distance to other nodes by exchanging a round of session messages with the other nodes. Later, in Section 5.5, we will confirm the impact of session message frequency on the performance characteristics of SRM.

5.4 Network Partitions

In the previous sections, we considered protocol behaviour when the types of topology changes never resulted in a partition. In this section, we study protocol behaviour when the topology is dynamic, and results in occasional network partitions. In particular, we concentrate on those types of partitions in which the nodes in each component of the partition continue to adapt independently of the nodes in the other component.

Before we describe our protocol evaluation of the adaptive timer mechanisms in SRM, we must outline the simulation model that we will use in this section, i.e., the topology, sources, and loss patterns. Our protocol evaluation is based on our classification of the losses that occur due to network partitions. We also address the effect of the protocol behaviour on the network itself.

For this experiment, we need a topology that is susceptible to partitions. We use the tree topology of 14 nodes, shown in Figure 5.12. In this topology, Link (0, 7) is dynamic. The periodic failure and recovery of Link (0, 7) generates the partition events that are of interest to us. We use the same model of network dynamics as in the earlier section (Section 5.3.1). Link (0, 7) fails at times t = 40s and t = 80s, resulting in partitions. The partition heals when Link (0, 7) recovers at times t = 60s and t = 100s. Therefore, the topology is partitioned into two connected components in the intervals [40s, 60s] and [80s, 100s].

In the topology,
- there are two sources, one at Node 4, the other at Node 10,
- all links have bandwidth 1.5Mbps, delay 10ms,
- Link (0, 7) is dynamic,
- Links (4, 0) and (8, 7) periodically drop data packets from Nodes 4 and 10 respectively.

Figure 5.12: Topology used for evaluation of SRM under partitions

The aim of the experiment is to study the impact of network partition and healing on the adaptive timer mechanisms in SRM. Since the timer mechanisms adapt to observed losses, we want a topology in which all the nodes continually receive some data, and there is intermittent loss of some fraction of that data, regardless of whether the network is fully connected or partitioned. This requires a source on either side of the partition, and some number of lossy links that periodically, but continuously, drop data from each of the sources. To satisfy these constraints, we place two constant bit rate sources, each generating two packets per second, one on Node 4, the other on Node 10. Link (4, 0) is configured to drop every other data packet originating from Node 4; likewise, Link (8, 7) is configured to drop every other data packet originating from Node 10.

This experiment differs from the earlier experiments in this section, in which the nodes were only adapting to data loss from one source. In this experiment, the nodes alternate between adapting to losses from multiple sources when the network is connected, and adapting to losses from a single source when the topology is partitioned.

We can group the receivers in the topology by the pattern of losses they observe for data from each of the sources. This pattern of loss depends on the location of the receiver relative to the source with respect to the dynamic link. Figure 5.12 shows the receivers, classified by the type of loss characteristics they experience. Nodes 4, 5, and 6 experience no loss of data from the source at Node 4; likewise, Nodes 8, 9, 10, and 11 experience no loss of data from the source at Node 10.
Nodes 0, 1, 2, and 3 belong to the same component as the source at Node 4, and therefore continually see data and periodic losses from this source at all times, regardless of whether the topology is fully connected or partitioned. We call the loss characteristics observed by this group of nodes Type A. The average recovery delay for a loss of Type A is computed over the recovery delays seen by Nodes 0, 1, 2, and 3 for that loss.

Similarly, Nodes 7, 12, 13, and 14 belong to the same component as the source at Node 10; they only see periodic loss of data from Node 10. These are Type B losses. The average recovery delay for a loss of Type B is computed over the recovery delays seen by Nodes 7, 12, 13, and 14 for that loss.

During a partition, Nodes 7 through 14 will be separated from the source at Node 4. Therefore, they will experience both losses due to partition, and periodic losses due to Link (4, 0) dropping packets. These are Type C losses. The average recovery delay for a loss of Type C is computed over the recovery delays seen by Nodes 7 through 14 for that loss.

Likewise, Nodes 0 through 7 will be separated from the source at Node 10 during a partition. They will experience loss of data from the source due to partition, as well as periodic loss due to Link (8, 7) dropping packets when the topology is connected. These are Type D losses. The average recovery delay for a loss of Type D is computed over the recovery delays seen by Nodes 0 through 7 for that loss.

We now characterise the performance of the protocol through plots of the average recovery delays for each type of loss. As in the previous sections, our plots show the average of 31 runs of the simulation. The plots also include the results from each of the individual runs (as tiny dots) to illustrate the distribution.
Since Type A and B losses are not affected by network partition, the recovery delay observed by the nodes experiencing these losses will form our base case. Ideally, the partition should have no significant impact on the recovery delay, since the data losses are unrelated to any partition. However, in practice, a flurry of requests and repairs following a partition healing can result in transient congestion (and packet drop) within the network, which would then increase the average recovery delay for these types of losses. Figures 5.13(a) and 5.13(b) show the recovery delay for Type A and B losses. The figures illustrate the increase in the average recovery delay from ≈ 0.05s to ≈ 0.15s immediately following a partition healing. Recall that in our model of network dynamics, the partition heals at times t = 60s and t = 100s.

Figure 5.13: Average recovery delay for losses independent of the partition. (a) Type A losses: Source Node 4. (b) Type B losses: Source Node 10.

Losses due to network partition are classified as Type C and D losses. However, these types also include periodic loss of data from the source when the network is not partitioned. Figures 5.14(a) and 5.14(b) plot the recovery delay seen by nodes experiencing Type C losses for data from the source at Node 4. Both plots present the same results at different scales. Figure 5.14(a) shows the detail of the recovery delays occurring in less than 0.4s, and in particular, the average of all of the simulation runs. Figure 5.14(b) shows the complete graph, including the distribution obtained from multiple simulation runs; this reveals some of the outliers in some of the simulations.
We can study the protocol behaviour in three separate regions: the initial period; the period when the network is partitioned (i.e., in the interval [40s, 60s]); and the period following the partition healing. We then analyse the causes for the outliers in the different simulations. Note that the recovery delays seen by nodes experiencing Type D losses for data from the source at Node 10 are identical, and are shown in Figures 5.14(c) and 5.14(d). We do not discuss these losses separately in this chapter.

Figure 5.14: Average recovery delays seen in the presence of partitions. (a) Type C: Source Node 4: Detail. (b) Type C: Source Node 4: Complete Graph. (c) Type D: Source Node 10: Detail. (d) Type D: Source Node 10: Complete Graph. Figures (a) and (b) show the same plots at different scales. Likewise, Figures (c) and (d) show the same plots at different scales.

In the initial period, we see that the protocol functions normally, until the first link failure event at t = 40s. At this point, the network is partitioned and remains partitioned until the link recovers at t = 60s. In this interval, none of the nodes can execute any loss recovery, since they have no indication of any loss. All data packets that are to be sent on that link are dropped.

After the partition heals, the nodes schedule the recovery of packets lost due to the partition as soon as they detect the loss. This occurs when a node receives either a new packet from a source, or a session message from a node that was in the other component during the partition. Each node schedules the recovery for all lost packets at the same time.
However, the observed recovery delay for each of these messages has high variability (≈ 0.1s to ≈ 0.4s). The same sequence of events repeats during subsequent partition and healing events.

Each of the complete plots (Figures 5.14(b), 5.14(d)) reveals a few outliers that raise some interesting design issues. These outliers are shown as V in the two plots and occur immediately before the partitions at t ≈ 40s and t ≈ 80s. In these losses, the nodes detect the loss of a packet corresponding to the time just prior to the network partitions at t ≈ 40s and t ≈ 80s. The error recovery is stalled due to the partition, and only completes after the partition heals. Hence the recovery time for these losses is of the order of magnitude of the duration of the partition. This is an extreme case of recovery delay that can occur in an operational network.

While the error recovery is stalled, the nodes attempting the recovery periodically send out requests, and continually back off their timers. Therefore, following the partition healing, the nodes are using a significantly larger timer value than is essential for quick recovery. This raises an interesting design question: Could such a node infer the partition healing event based on the arrival of new data or session messages, and use this information to reset its timers to more reasonable values and improve its adaptivity?

We end this section by attempting to quantify the effect of the protocol on the network following a partition or healing event. Figure 5.15 shows the requests and repairs that are sent in each direction of Link (0, 7). For each run of the simulation, we obtain the number of request and repair packets sent in each direction, and plot the average across the 31 runs. We can see the increase in the number of requests and repairs following the partition healing at t = 60s and t = 100s.
We also see a corresponding increase in the number of drops on the link in either direction. These figures indicate the increase in congestion in the network following a partition healing.

An artifact of the implementation in our simulator is that the congestion control mechanisms are still incomplete. Our plan is to add a simple rate limiter, such that there will be a limit on the aggregate bandwidth used by control traffic. However, this method of congestion control has a number of open issues in the face of partition healing. For instance, at the instant of partition healing, each node on either side of the partition detects a burst of losses, and independently schedules and sends request and repair messages. We leave for future work the question of whether a simple local rate limiting mechanism is sufficient to avoid transient congestion, or whether the nodes should be using a more sophisticated congestion control algorithm.

Figure 5.15: Traffic and drops on Link (0, 7). (a) Traffic on Link (0, 7). (b) Traffic on Link (7, 0). (c) Drops on Link (0, 7). (d) Drops on Link (7, 0). Plots indicate the number of requests and repairs sent or dropped in each direction of Link (0, 7).

5.5 The Role of Session Messages in Loss Recovery

We have already observed that it is critical for the error recovery mechanisms to obtain good estimates of their distances to all other members. Distance estimation in SRM is achieved through the exchange of session messages by group members. It takes at least three iterations of session messages for two nodes to find their distances to each other.
Our session message frequency in these experiments was one message every second; hence it comes as no surprise to find that the spikes in all of our graphs only last for a couple of losses. We conducted a series of experiments to characterise the role of the frequency of session messages on error recovery. In this section, we describe the results of this evaluation.

Figure 5.16: Average Recovery Delays as a Function of the Frequency of Session Messages

We use the same ring topology used in Section 5.3.1; we also use the parameters used for that set of experiments, viz., the same unicast and multicast routing protocols, the same source models, loss models, and group distribution. We conducted experiments with one session message sent every 2s, 3s, 5s, and 10s. Our evaluation of SRM behaviour is based on the recovery delays, the number of duplicate request and repair messages, and the number of rounds of request and repair messages as a function of the frequency of sending session messages.

Figures 5.16 and 5.17 show the recovery delays as a function of the frequency of sending session messages. We can make two observations from these plots. First, there is the exaggerated spike in the normalised recovery delays immediately after Link (1, 2) fails (at t = 40s and t = 80s), until the nodes estimate the distances in the alternate topology. The duration of the spike is clearly proportional to the interval between session messages.
Figure 5.17: Normalised Recovery Delays as a Function of the Frequency of Session Messages

The second point to note is the decrease in the average recovery time immediately following the failure of Link (1, 2) (at t = 40s and t = 80s), and the increase immediately following the recovery of Link (1, 2) (at t = 60s and t = 100s). The delay is a “decrease” in the sense that it is not as high as we would expect it to be in the alternate topology (with longer delay paths). This occurs because the nodes are using smaller estimates of distances to each other, and therefore attempt to recover from a loss earlier than they should. The delay following link recovery is an “increase” in the sense that it does not immediately decrease to the expected levels until the nodes have an accurate estimate of the distance. Again, this is because the nodes use the older (and larger) estimate of distances to each other, and hence take longer to attempt to recover from the loss. We can clearly see from the figures that the duration of these transients is proportional to the interval between session messages.

As with the recovery delays earlier, we see in Figure 5.18 that the number of request messages increases to over 2.5-3 immediately following the failure of Link (1, 2) at t = 40s and t = 80s, which is what we expect. We see the same type of increase in the number of repair messages (Figure 5.19), to over two messages, in the same duration.

Figure 5.18: Request Message Counts as a Function of the Frequency of Session Messages
Once again, we can see a clear correlation between the duration of increased duplicate messages, and the frequency of sending session messages.

Finally, we can clearly observe the relationship between the number of rounds of requests and repairs that nodes send out (Figures 5.20, 5.21), and the role of distance estimation and the frequency of sending session messages. We can observe this strong correlation in the duration for which nodes execute more than two rounds of request/repair messages following the failure of Link (1, 2).

Figure 5.19: Repair Message Counts as a Function of the Frequency of Session Messages

In practice, nodes cannot send session messages this frequently, if only because session messages consume expensive bandwidth; they also incur processing overheads at both the nodes that send them, and the nodes that receive and process them. Therefore, we need to determine the optimal frequency of sending session messages, and balance the requirements of bandwidth for distance estimation against that required for other functions, such as error recovery. In this context, the idea of using scaled session messages bears promise [121]: nodes consume a limited amount of bandwidth when sending session messages, attempt to limit the scope of their session messages, and at the same time designate a representative node to send more frequent session messages to the entire multicast group. We will describe our results in the evaluation of SSM as an instance of a clustering algorithm in Chapter 6.
5.6 Chapter Summary

To summarise this chapter briefly, we began by presenting a methodology for the evaluation of a multicast transport protocol in the context of topology changes and unicast routing dynamics. We outlined a methodology for the analysis of multicast transport protocols. In this methodology, we separately evaluate the protocol for failures that leave the topology connected, and those that leave the topology partitioned. For the former, we enumerate the types of protocol responses to topology change. These enumerations are specific to a protocol, and require detailed knowledge of the protocol. Our characterisation of protocol behaviour then is the composite of protocol behaviour in the context of each of the enumerations and the response to partitions and healing.

We have applied our methodology to characterise the behaviour of the protocol timers in SRM, and presented our results in this problem domain in this chapter. In the following chapter, we will apply this methodology to the evaluation of clustering algorithms, using SSM as a case study.

Figure 5.20: Request Message Rounds as a Function of the Frequency of Session Messages

Figure 5.21: Repair Message Rounds as a Function of the Frequency of Session Messages
Chapter 6

Clustering Mechanisms

In earlier chapters, we described a methodology to characterise protocol behaviour in the context of network dynamics (Chapters 4 and 5). We then applied this methodology to the evaluation of unicast transport protocols (Chapter 4) and the timer mechanisms in multicast transport protocols (Chapter 5). In this chapter, we apply this methodology to clustering algorithms, using Scalable Session Messages (SSM) [123] as a case study.

Aggregation and clustering are important techniques to improve the scaling properties of any protocol [46, 83, 123, 29, 30]. A group of nodes that are topologically contiguous and share a common property, V, can be clustered together to form a single cluster that has property V. Clustering helps aggregate the protocol state, and thereby reduces the amount of state that must be maintained or advertised by each member participating in the protocol. However, the formation of a particular cluster is dependent on the topology; hence, topology change can impact cluster formation. Topology change can result in a cluster becoming discontiguous, i.e., the cluster breaks up; or two clusters can become adjacent, so that the two clusters can coalesce into one cluster. The following two paragraphs describe some examples of clustering in unicast and multicast protocols.

CIDR [46] describes a mechanism for aggregating a contiguous group of networks, and addressing that cluster using a single prefix. Unicast routing protocols, such as BGP [114], OSPF [97], IDRP [69], IS-IS [67], etc., maintain and advertise the aggregated prefix, thereby reducing the amount of state maintained by each protocol instance.
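As a small illustration of the prefix aggregation just described, the following Python sketch collapses contiguous networks into a single covering prefix using the standard library's `ipaddress.collapse_addresses`. The helper name `aggregate` and the example prefixes are assumptions chosen for illustration; they do not come from the experiments in this thesis.

```python
import ipaddress


def aggregate(prefixes):
    """Collapse a list of prefix strings into the smallest equivalent set of prefixes."""
    networks = [ipaddress.ip_network(p) for p in prefixes]
    return [str(n) for n in ipaddress.collapse_addresses(networks)]


# Four contiguous /24 networks can be advertised as one /22.
contiguous = ["10.1.0.0/24", "10.1.1.0/24", "10.1.2.0/24", "10.1.3.0/24"]
print(aggregate(contiguous))        # ['10.1.0.0/22']

# Networks that are not contiguous cannot be merged into one prefix.
discontiguous = ["10.1.0.0/24", "10.1.2.0/24"]
print(aggregate(discontiguous))     # ['10.1.0.0/24', '10.1.2.0/24']
```

The second call shows the topological side of the argument: once the group of networks stops being contiguous, the single aggregate is no longer available and each piece must be advertised separately.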
Area hierarchies are one form of aggregation used by link state protocols [97, 67, 5]; a routing protocol instance at the boundary of an area, A_i, advertises summary information to protocol instances external to A_i. Inter-domain routing protocols cluster networks that have a homogeneous policy into autonomous systems (or routing domains) and confederations [114, 69] to aggregate the policy information that must be maintained and advertised.

Members of a multicast group, G_i, that share losses due to a particular lossy link may form a separate multicast sub-group to mitigate the impact of loss recovery on the members in G_i [83, 82, 60]. In another situation, a member can indicate, to any source, the congestion observed by a group of members in a multicast sub-tree. This member then acts as the representative for the cluster of members in the sub-tree [32, 33, 54], and avoids the implosion of congestion signals to the sources. In a different application, SSM [123], any contiguous sub-group of members in the multicast group might cluster around a representative that then sends global session messages about that cluster to the entire group. The other members only send local session messages within the group, thereby reducing both the bandwidth used, and the state that each member needs to advertise and maintain.

Finally, note that, in most cases, manual configuration of these aggregation mechanisms and cluster boundaries does not scale to large networks. Therefore, protocols use additional mechanisms to self-configure their cluster boundaries. In this chapter, we apply the methodology developed in the previous chapter (Section 5.2) to evaluate the behaviours of cluster formation in Scalable Session Messages (SSM) algorithms. We begin with a brief overview of SSM (Section 6.1), followed by the results of our evaluation of SSM (Section 6.2) in the context of network dynamics.
6.1 Protocol Overview

We described SRM briefly in the earlier chapter (Section 5.1). In SRM, loss detection and distance estimation (for loss recovery) are facilitated through periodic exchange of session messages. Each node periodically sends session messages that contain information about the latest message from each source that it has received. In addition, nodes use these session messages to estimate their distances to each other. Each node, m_i, time-stamps every session message it sends. In addition, for every other group member, m_j, that m_i is aware of, m_i advertises the sequence number of the last session message it received from m_j, as well as the time that m_j sent that session message, and the time that m_i received it. m_j can use this information to estimate its distance to m_i.

The periodic transmission of session messages by every group member can result in quadratic growth in the use of network bandwidth for sending session messages, and a corresponding increase in state space at each member. This does not scale as the size of the group increases. One option is for members to adjust the period of sending session messages based on the size of the group. However, this approach does not address the state space issue. It can also increase the time to recover from a loss, if the loss follows immediately after some topology change. In addition, the algorithm can have an adverse impact due to sudden changes in the membership size, such as might occur during a partition healing: a node could reduce its rate of sending session messages and not send one, at precisely the instant when it should be increasing the rate to speed convergence.

Scalable Session Messages (SSM) [121] is an auxiliary algorithm to SRM and proposes an alternative approach to scaling.
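The time-stamp exchange used by SRM for distance estimation, described at the start of this section, can be sketched as follows. This is a minimal illustration, not the simulator's code: the class and field names are hypothetical, and halving the round-trip time assumes that the paths between the two members are symmetric. Note that the two clocks need not be synchronised, because each subtraction uses time-stamps taken on a single node's clock.

```python
from dataclasses import dataclass, field


@dataclass
class Echo:
    """What m_i advertises about the last session message it received from m_j."""
    sent_by_j: float       # time m_j sent that message (m_j's clock)
    received_by_i: float   # time m_i received it (m_i's clock)


@dataclass
class SessionMessage:
    sender: str
    sent_at: float                               # sender's own time-stamp
    echoes: dict = field(default_factory=dict)   # member name -> Echo


def estimate_distance(me, msg, arrival_time):
    """Estimate the one-way distance from `me` (m_j) to msg.sender (m_i).

    arrival_time is read from m_j's clock when msg arrives.
    Returns None if m_i has not yet heard from m_j.
    """
    echo = msg.echoes.get(me)
    if echo is None:
        return None
    hold_time = msg.sent_at - echo.received_by_i    # time spent at m_i (m_i's clock)
    elapsed = arrival_time - echo.sent_by_j         # total elapsed time (m_j's clock)
    round_trip = elapsed - hold_time
    return round_trip / 2.0                         # assumption: symmetric paths


if __name__ == "__main__":
    # m_j sent at 10.0; m_i (clock offset +100s) received at 110.05 and replied at 110.55;
    # the reply reaches m_j at 10.6 on m_j's clock.
    reply = SessionMessage("m_i", sent_at=110.55,
                           echoes={"m_j": Echo(sent_by_j=10.0, received_by_i=110.05)})
    print(estimate_distance("m_j", reply, arrival_time=10.6))
```

The sketch also makes the convergence cost visible: m_j cannot produce an estimate until m_i has both heard from m_j and sent a message of its own, which is why the estimate lags behind any topology change by a few session-message intervals.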
SSM is representative-based, which improves the scaling properties associated with SRM group members sending session messages. The approach may have applicability to a range of distributed applications.

In this scheme, a select number of group members promote themselves to become global representatives and send session messages to the entire group. These global representatives send information about their distances to other global members and active sources. High and low thresholds determine the number of global representatives in each group. The group members strive to ensure that the number of representatives is within these thresholds. Members that are not global representatives cluster around the representative closest to them.

Initially, each group member starts as a global representative. If the members detect that the number of representatives exceeds the specified thresholds, then each global representative schedules a change to become a local member. These members send session messages with just enough TTL scope to reach their representative. Likewise, local members will schedule and promote themselves to become a global representative if the number of representatives is below the thresholds. The time for either event is scheduled using a randomised timer, to avoid synchronised changes and oscillatory phenomena. When the timer fires, and if the number of representatives still exceeds the thresholds, the member will execute the scheduled action.

Each member uses local measures of appropriateness to promote itself to a global representative or to switch back to being a local representative. For instance, a global representative that has some local representatives attached to it as a cluster is less likely to switch to becoming a local representative itself, compared to another global representative that has no local representatives attached to it.
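The threshold-driven role changes described above can be sketched as a pair of small functions. This is a hedged illustration only: the function names, the `base_delay` parameter, and in particular the way the random delay is stretched by the number of attached local members are assumptions made for the example, not the SSM specification. The text states only that the timer is randomised, that the condition is re-checked when the timer fires, and that a representative with attached members is less likely to step down.

```python
import random


def schedule_role_change(is_global, num_global_reps, low, high,
                         local_members=0, base_delay=1.0, rng=random):
    """Decide whether a member should schedule a role change.

    Returns (action, delay_in_seconds), or None if no change is needed.
    """
    if is_global and num_global_reps > high:
        # Hypothetical bias: representatives with more attached local members
        # wait longer, so they are less likely to be the first to step down.
        delay = rng.uniform(0.0, base_delay) * (1 + local_members)
        return ("demote", delay)
    if not is_global and num_global_reps < low:
        return ("promote", rng.uniform(0.0, base_delay))
    return None


def should_execute(action, num_global_reps, low, high):
    """Re-check the threshold condition when the randomised timer fires."""
    if action == "demote":
        return num_global_reps > high
    if action == "promote":
        return num_global_reps < low
    return False


if __name__ == "__main__":
    # Five global representatives against thresholds (2, 3): a demotion is scheduled.
    print(schedule_role_change(True, 5, low=2, high=3))
    # By the time the timer fires, others have stepped down: the action is cancelled.
    print(should_execute("demote", 3, low=2, high=3))
```

The re-check in `should_execute` is what prevents every representative from stepping down at once: members whose timers fire late observe that the count is already within the thresholds and do nothing.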
Some measures of appropriateness are implicit in the protocol. For example, consider g_i, a global representative, that has an event scheduled at some time t to evaluate whether to switch to become a local member. If g_i identifies a new local member, l_j, using it as a representative, g_i will reschedule its event to some later time t' > t. This mechanism implicitly sets the appropriateness of a global representative to be proportional to the number of local members clustered around that representative.

Notice that, in this scheme, only the global representatives can compute accurate distances to other global representatives and active sources. Each of the local members can compute the distance to its representative; it can then indirectly compute its distance to each active source based on the distance to its representative, and the representative’s advertised distance to that active source. In a like manner, it can compute the distance to other group members as well. As an optimisation, any local member, m_i, that receives a local session message from some other member m_j due to sufficient scope may directly compute its distance to m_j, rather than rely on the indirectly computed distance through its representative.

There are two measures of appropriateness in SSM: (1) the size of the cluster, as a count of the number of local members attached to a global representative, and (2) proximity to existing representatives. When the size of a cluster exceeds optimal bounds, one of the members may promote itself to become a global representative and the cluster can split into two. As the size of the group increases, the bandwidth consumed by the advertisements from the representative about the members in its cluster increases. Proximity to other representatives in the group is the other measure of appropriateness.
The desire is to have representatives uniformly distributed with respect to the topology. Multiple representatives close to each other will increase the bandwidth used by the advertisements from these representatives. In addition, if the representatives are sparse at other locations in the tree, then the members at those locations will be at greater distances from their representatives, and hence have less accurate estimates of distances to members in other clusters. A group of nodes will attach themselves to a cluster with a sufficiently close representative if they determine that an optimal number of representatives exists in the group.

The SSM algorithm is self-configuring, and clusters form and change shape depending on the topology and topology changes. In this chapter, we evaluate the clustering properties of SSM in the context of network dynamics. Our focus in this evaluation is to characterise the transient behaviour and response of the clustering to topology changes, and to identify potential sub-optimal clustering behaviours. The following sections discuss our evaluation of SSM.

6.2 Clustering Mechanisms: Scalable Session Messages

We proceed exactly as before, when we evaluated the timer mechanisms in SRM (Sections 5.2 and 5.3). The first step is to identify the protocol’s dependencies on the topology and the associated responses to topology changes. For SSM, it is the appropriateness measures that are sensitive to topology changes. We informally argue that cluster size as an appropriateness measure is not sensitive to the topology. We also show that proximity is extremely sensitive to the topology.

1. Assume that, after a topology change, the size of a cluster exceeds the optimal cluster size, and that cluster breaks into two clusters.
Following the breakup, there is no reason for the two clusters to become one again, unless the number of representatives is already close to the upper threshold. However, this behaviour is not necessarily influenced by topology changes, and will occur even otherwise, because the protocol operation is designed to ensure that the number of representatives is at an optimum dictated by thresholds. Similarly, assume that a topology change causes two adjacent clusters to coalesce into one cluster. This cluster will not revert to its original configuration unless the number of representatives is close to the lower threshold; as before, we know that protocol operation ensures that the number of representatives is not close to the lower threshold, independent of topology change events.

2. In contrast, proximity is extremely sensitive to topology. A group of nodes that finds a representative closer to them than their current representative will migrate to that other representative. Changes to the topology can affect the distances seen by each member to its representative. We will use this result to study migration behaviours in Section 6.2.1.

3. Finally, while cluster size by itself is not topology sensitive, it is possible that cluster size, in conjunction with proximity, might be sensitive to the topology.

In the next section (Section 6.2.1), we will evaluate the behaviour of SSM under variations in the distances to the representatives; the following section (Section 6.2.2) will explore the behaviour of SSM due to variations in a combined measure, in which the protocol applies both proximity and size measures in forming clusters in response to topology changes.

One final caveat to note is that the protocol is still evolving.
Hence, the experiments in this section are based on early code that had certain constraints; chief among them are that this version of SSM operates in conjunction with SRM over fixed timers, and only some of the appropriateness measures and protocol state transitions were implemented. In Section 6.4 we will summarise the newer features, and discuss how the results in this section might differ with the more mature design.

6.2.1 Proximity

Following the methodology outlined in Section 5.2, we describe the specific topology and other simulation parameters used to study this collective migration. We then present our results of the base case evaluation, in which the topology is stable, and the results when the topology is dynamic.

Note that SSM is an auxiliary algorithm in SRM. Hence, each case will consist of an evaluation of SRM behaviour, and an evaluation of the clustering behaviours of SSM. For the former, we need a reference evaluation of SRM behaviour independent of SSM. Hence, for each evaluation (over stable and dynamic topologies) we conduct separate experiments of SRM execution in isolation, and SRM execution in conjunction with SSM.

Because SSM is an auxiliary algorithm, we must also be careful to ensure that, absent topology changes, SSM executes identically whether the topology is stable or dynamic. To put it another way, we require that for a particular seed and simulation execution, SSM behaviour must be identical until the first topology change events occur. This requirement then permits us to make valid comparisons of protocol behaviour over stable and dynamic topologies.

We use the topology with three “collectives” of nodes (labelled A, B, and C) as shown in Figure 6.1.
A “collective” of nodes is a collection of topologically contiguous nodes that has the potential to become a cluster, either by itself, or by coalescing with other collectives (or other clusters) contiguous to it. In any self-configuring clustering algorithm, a collective of nodes in a cluster can break away and form a separate cluster, the collective can migrate to, and join, a different cluster, or a cluster can dissolve, and coalesce with another cluster. Recall that the size of the cluster is an implicit appropriateness measure for the representative of that cluster. Hence, the size of each collective of nodes is chosen, so as to favour the selection of representatives in collectives A, and C. As with the placement of the nodes, we select the TTLs on each of inter-cluster links to predictably contain the boundaries of the cluster to the collective. Hence the TTL on the link between collectives A and B is small (2 units), the TTL between the collectives S and A, and between B and C is larger (10 units); the TTL between the collectives C and S is much larger to prevent the nodes in collective C coalescing with the collective S. 110 Reproduced with permission of the copyright owner. Further reproduction prohibited without permission. A @ \ C © © 0 ® / \ 1 / / 0 / 0 In the topology, — the source is at Node 0 — there are three collectives, labelled A. B, and C — all links have bandwidth 1.5Mbps, delay 10ms — Link (4,6) between the source and collective A is 10 hops. i.e.. packets that transit this link have their TTL decremented by 10 — likewise. Link (8, 13) between collectives A and B is 2 hops. Link (13. 16) between collectives B and C is 10 hops, and. Link ( I. 20) between collective C and the source is 15 hops — Link (4,6) between the source and collective A is 10 hops — Link (8,13) between collectives A and B is dynamic — Links (4, 6) and (16. 13), between the source and collective A. 
and between collectives C and B respectively, periodically drop data packets from Node 0.

Figure 6.1: Topology to Study Collective Migration

The thresholds for the number of global representatives in our simulations are two (for the lower limit) and three (for the upper limit).¹ This permits one representative each in collectives A and C, and possibly one representative in collective S. As an aside, note that cluster formation in SSM is based on randomised timers, biased by each node’s measure of its appropriateness to either promote itself to become a global representative or a local member. Hence, representatives can form in any of the collectives in our topology, as long as they do not violate the threshold bounds. For our study, it is sufficient if a global representative promotes itself either in collective A or B, but it is necessary that two representatives do not promote themselves, one each in collectives A and B. This will allow us to observe the migration of collective A or B as the link between these collectives periodically fails or recovers, and the collective migrates to an alternate (and farther) representative.

¹ It is the case that such low thresholds are unrealistic in actual deployment. The reader should bear in mind that we are attempting to realise a controlled experiment through which we can understand and characterise the possible range of behaviours of the protocol, and hence choose protocol parameters that permit us to realise such configurations.

We chose the same parameters that we used for timer evaluation in the previous section (Section 5.3) for the source models, loss models, topology failure models, and unicast and multicast routing. The source is in collective S, and generates two packets every second. The links between collectives A and B, and between C and B are lossy: every other data packet from the source that traverses either link is dropped.
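The source, loss, and link failure models used in these simulations are simple enough to write down directly. The following is a minimal Python sketch, not the simulator configuration used for the experiments. The function names are hypothetical, and the choice to drop odd-numbered packets is an assumption (the text only says that every other packet is dropped). The failure schedule follows the one given for the link between collectives A and B: a first failure at t = 40s, then alternating recovery and failure at 20s intervals, over a 200s run.

```python
PACKET_RATE = 2          # data packets per second generated by the source
FIRST_FAILURE = 40.0     # time (s) at which the dynamic link first fails
TOGGLE_INTERVAL = 20.0   # the link recovers or fails every 20 s thereafter
SIM_DURATION = 200.0     # length of each simulation run (s)


def is_dropped(seq):
    """Lossy link model: every other data packet is dropped.

    Dropping the odd sequence numbers is an assumption for illustration.
    """
    return seq % 2 == 1


def link_is_up(t):
    """State of the dynamic link at time t (seconds)."""
    if t < FIRST_FAILURE:
        return True
    toggles = int((t - FIRST_FAILURE) // TOGGLE_INTERVAL)
    # An even number of completed intervals since the first failure
    # means the link is currently down; an odd number means it is up.
    return toggles % 2 == 1


def losses_per_lossy_link(duration=SIM_DURATION):
    """Number of source packets dropped on one lossy link during a run."""
    total_packets = int(duration * PACKET_RATE)
    return sum(1 for seq in range(total_packets) if is_dropped(seq))
```

Under these assumptions, each lossy link drops one source packet per second, and the dynamic link is down during the intervals [40, 60), [80, 100), and so on.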
The link between collectives A and B is prone to failures. In our model, the link will first fail at t = 40s, and recover and fail at 20s intervals thereafter. Each simulation runs for t = 200s. However, to increase clarity, we only show the first 120s in our plots. As with the evaluation of the timer mechanisms, these simulations use distance vector (unicast) routing, and dense mode multicast routing.

In the following subsections, we will evaluate collective migration of SSM clusters due to topology changes. Using our methodology (Section 5.2), we will begin by evaluating protocol operation over a stable topology, followed by evaluation over a dynamic, but connected, topology.

6.2.1.1 Base Case: SSM Cluster Formation over Stable Topologies

In this section, we describe the results of our evaluation of SSM when the topology is stable. This evaluation is in two parts: the impact on SRM, through measurements of SRM behaviour, and the separate evaluation of SSM clustering behaviour.

Impact on SRM. To evaluate the impact of SSM on SRM, we must first quantify the effect of SRM in isolation under identical conditions. That quantification then serves as the basis for our evaluation of SSM. As before, our metrics for the evaluation of SRM are the recovery delays (both average recovery delay and normalised recovery delay), and the number of request and repair messages. For each metric, we show two plots, one for SRM in isolation (indicated by the tag “SRM”), and one for SRM in conjunction with SSM (indicated by the tag “SRM + SSM”).

Figure 6.2: Recovery Delays: Base Case Evaluation of SSM over Stable Topologies. Panels, each plotted against time: (a) Average Recovery Delay: SRM; (b) Average Recovery Delay: SRM + SSM; (c) Normalised Recovery Delay: SRM; (d) Normalised Recovery Delay: SRM + SSM.
Recovery delays. Figure 6.2 shows the recovery delays for each loss in our topology. These figures show a marginal improvement in the SRM performance when SSM is active (Figures 6.2(b), 6.2(d)). We conjecture that the following reason explains this performance gain in our topology. When SSM is active, i.e., the nodes execute SSM in conjunction with SRM, the global representatives have the most accurate distances to the active sources and other representatives. During a loss, therefore, the representatives are more likely to send a request or repair before other members in the respective clusters. Since we only have a very small topology, and each cluster is not significantly large, the SSM algorithm, in general, finds well-placed, centrally located representatives that can quickly respond to a loss or recovery. Alternately, if SSM were not active, any member experiencing the loss could respond to loss recovery; this will suppress other well-placed nodes and lead to an increase in the recovery delay.

A consequence of our using fixed timers to evaluate SSM behaviour is that, with fixed timers, any node experiencing a loss is equally likely to attempt loss recovery. With adaptive timers, however, some select set of nodes will adapt to the loss, and always attempt loss recovery. SSM, by biasing global representatives to respond sooner to loss recovery, appears to “adapt” the nodes to performance that is close to that of adaptive timers. However, this assumes that the global representative is the appropriate candidate for loss recovery for persistent losses from a nearby link. This may not always be the case in general topologies. In those situations, it is not clear that we will see such a performance gain with SSM and SRM.
However, the fact that this slight performance gain arises from the well-chosen placement of global representatives indicates one important point: good placement of the global representatives can help improve the performance of SRM.

Message counts. The next measure of evaluation is to analyse the number of messages sent by the protocol. We have seen from our evaluation of the SRM timers (Section 5.3) that these measures are a function of the distance estimates at each node. Since the nodes in our topology have fairly accurate estimates of the distances to each other, there is no significant difference between our evaluation of SRM and SRM + SSM. This is reflected in the figures of request/repair messages sent (Figure 6.3).

Figure 6.3: Message Counts: Base Case Evaluation of SSM over Stable Topologies. Panels, each plotted against time: (a) Request Messages: SRM; (b) Request Messages: SRM + SSM; (c) Repair Messages: SRM; (d) Repair Messages: SRM + SSM.

Evaluation of SSM. The second part of this base case evaluation is to quantify the behaviour of SSM and its clustering behaviour. A recurring theme in our evaluation has been the importance of a node having good estimates of distance to the other nodes. However, in SSM, only global representatives have accurate estimates of the distance to a select subset of the other nodes. The other nodes only indirectly estimate their distances to other nodes in the group. The farther away a node is from its representative, the poorer its estimates of distance. Therefore, the TTL-distance of each node to its representative is an approximate measure of the performance of SSM. To put it another way, we can use the sum (or mean) of the TTL-distance of all the nodes in the topology to their representatives as a possible metric.
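The TTL-distance metric can be made concrete with a short sketch. The Python below is only an illustration under stated assumptions: the function names, node names, and toy topology are hypothetical (it is not the topology of Figure 6.1), and the TTL-distance is taken to be the least total TTL cost along any path from a node to its representative, computed here with Dijkstra's algorithm.

```python
import heapq


def ttl_distances(links, origin):
    """Least TTL cost from origin to every reachable node.

    links maps an undirected link (u, v) to its TTL cost in hops.
    """
    adjacency = {}
    for (u, v), cost in links.items():
        adjacency.setdefault(u, []).append((v, cost))
        adjacency.setdefault(v, []).append((u, cost))
    dist = {origin: 0}
    heap = [(0, origin)]
    while heap:
        d, u = heapq.heappop(heap)
        if d > dist.get(u, float("inf")):
            continue  # stale heap entry
        for v, cost in adjacency.get(u, []):
            nd = d + cost
            if nd < dist.get(v, float("inf")):
                dist[v] = nd
                heapq.heappush(heap, (nd, v))
    return dist


def mean_ttl_distance(links, representative_of, nodes=None):
    """Mean TTL-distance of the given nodes to their representatives.

    representative_of maps each node to its (assumed) representative.
    nodes restricts the mean to one collective; default is all nodes.
    """
    nodes = list(representative_of) if nodes is None else list(nodes)
    cache = {}
    total = 0
    for node in nodes:
        rep = representative_of[node]
        if rep not in cache:
            cache[rep] = ttl_distances(links, rep)
        total += cache[rep][node]
    return total / len(nodes)


# Hypothetical toy topology: collective A (a1..a3), collective B (b1, b2),
# and a single node c1. Link a3-b1 costs 2 hops, link b2-c1 costs 10 hops.
links_up = {("a1", "a2"): 1, ("a2", "a3"): 1, ("a3", "b1"): 2,
            ("b1", "b2"): 1, ("b2", "c1"): 10}
reps_up = {"a1": "a2", "a2": "a2", "a3": "a2",
           "b1": "a2", "b2": "a2", "c1": "c1"}

# Same topology after link a3-b1 fails and collective B migrates to c1.
links_down = {k: v for k, v in links_up.items() if k != ("a3", "b1")}
reps_down = dict(reps_up, b1="c1", b2="c1")

mean_up = mean_ttl_distance(links_up, reps_up)
mean_down = mean_ttl_distance(links_down, reps_down)
b_up = mean_ttl_distance(links_up, reps_up, nodes=["b1", "b2"])
b_down = mean_ttl_distance(links_down, reps_down, nodes=["b1", "b2"])
```

In this toy case the mean over all nodes rises from 1.5 to about 3.8 when the link fails, while the mean for collective B alone rises from 3.5 to 10.5, which is why a per-collective view can be more informative than the overall average.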
The smaller the value of this metric, the closer a node is to its representative, and hence, the more accurate its distance estimates; this should then imply a better performance for SRM. We can evaluate the overall performance of SSM in the entire topology, as well as the expected performance of each of the collectives, through our TTL-distance metric.

Figure 6.4: Average Cluster Distances: Base Case Evaluation of SSM over Stable Topologies. Panels, each plotted against time: (a) All Nodes; (b) Nodes in Collective A; (c) Nodes in Collective B; (d) Nodes in Collective C.

Figure 6.4(a) is a plot of the average of the TTL-distance from a number of simulations. As before, the distribution obtained from each of those simulations is shown as jittered dots as well. From this plot, we can verify that, in general, SSM performed as we expected in this topology. Recall that we desire three representatives in collectives S, A, and C, which gives us a TTL-distance of about two. However, as with any randomised algorithm, it is possible to find representatives placed sub-optimally, with larger than required TTL-distance values. In our simulations, we found a few reasons for such sub-optimality.
In some cases, there was more than one representative in a single collective, so that some other collectives had to use high-ttl-cost links to coalesce into a cluster. In a different situation, the number of representatives was within the desired thresholds, but was fewer than desirable; this again causes some collectives to use high-ttl-cost links to coalesce into a cluster.² We can observe this phenomenon more clearly when we plot the average of the TTL-distance for each collective (Figures 6.4(b), 6.4(c), and 6.4(d)). For our characterisation, we desired that representatives form in collectives A, or B; our goal in this characterisation is to study the migration of collectives within a cluster. We chose only that subset of simulations that satisfied our criteria. Therefore, we find that the TTL-distance measures of collectives A and B are reasonable (Figures 6.4(b), 6.4(c)). However, the same is not true for collectives S and C in our subset of chosen simulations. On occasion, collectives S and C attach to representatives in other collectives to form sub-optimal clusters with large TTL-distances, around 14 or 16 (Figures 6.4(b), 6.4(d)).

There are a number of ways in which a sub-optimal selection of representatives can impact protocol behaviour. Within SSM itself, each node in a cluster has to send session messages with sufficient TTL to reach its representative, so that it can compute the distance to its representative. Likewise, the representative has to send session messages with sufficient TTL to reach all of the nodes in its cluster. If a cluster is formed out of two collectives separated by a high-ttl-cost link, then SSM can cause increased bandwidth usage due to sending session messages over this high-ttl-cost link. We have seen occasional examples of sub-optimal configuration over stable topologies using our early version of the protocol.
SSM is, by design, resilient to topology changes. Therefore, if the members of a group configure themselves sub-optimally, it will take some time before they reconfigure into a more optimal state. It is therefore critical to verify whether topology changes can lead to further sub-optimal configurations.

6.2.1.2 SSM Collective Migration over Dynamic Topologies

Having evaluated SSM behaviour over stable topologies, we now proceed with a similar evaluation of the protocol in the context of network dynamics. As before, we have two parts to this evaluation: to quantify the impact of SSM on SRM, and the independent evaluation of the clustering behaviour of SSM.

² Once again, we must caution that these aberrations are a consequence of using code based on an early, and still evolving, version of the protocol.

Figure 6.5: Recovery Delays: SSM Evaluation over Dynamic Topologies. Panels, each plotted against time: (a) Average Recovery Delay: SRM; (b) Average Recovery Delay: SRM + SSM; (c) Normalised Recovery Delay: SRM; (d) Normalised Recovery Delay: SRM + SSM.

Impact on SRM. We begin with the evaluation of SRM behaviour. As before, we characterise the recovery delays, and the message counts to recover from each loss. Once again, we find that, in our topology, the recovery delay for losses improves marginally for SRM + SSM over simple SRM (Figure 6.5). There is no significant impact on the number of messages in either case (Figure 6.6). This is not very different from our earlier characterisation of protocol behaviour over stable topologies.

Evaluation of SSM. The next step is to evaluate the performance of the SSM clustering algorithm itself. We plot the TTL-distance of the nodes to their representatives for the nodes in the
entire topology, and for the nodes in each collective. Figure 6.7(a) indicates changes in the TTL-distances as the topology changes. However, the average TTL-distance of all the nodes is misleading, because we have constructed our simulation so as to cause only one collective to migrate. Therefore, we need to evaluate the variations in the TTL-distance of each individual collective to characterise this aspect of SSM behaviour.

Figure 6.6: Message Counts: SSM Evaluation over Dynamic Topologies. Panels, each plotted against time: (a) Request Messages: SRM; (b) Request Messages: SRM + SSM; (c) Repair Messages: SRM; (d) Repair Messages: SRM + SSM.

Quite naturally, and as we expect, collective A does not show any effect of collective migration (Figure 6.7(b)). This indicates that, as the protocol designers had suggested, at least one node in collective A promotes itself to become a global representative. The nodes in collective B, on the other hand, migrate to find a closer representative every time their link to collective A fails, and their distance to their representative in collective A increases significantly (Figure 6.7(c)). When the nodes migrate to a different cluster closer to them, it is not always possible to find a sufficiently close representative. In our topology, and on some occasions, no node in collective C becomes a global representative, and the closest alternative is in collective S, much farther away than expected.

Figure 6.7: Average Cluster Distances: SSM over Dynamic Topologies. Panels, each plotted against time: (a) All Nodes; (b) Nodes in Collective A; (c) Nodes in Collective B; (d) Nodes in Collective C.
This sub-optimal placement of representatives is also borne out by the fact that the nodes in collective C show signs of migration when the topology changes (Figure 6.7(d)).

We conclude the evaluation of collective migration in SSM with the following two observations. First, we have already remarked that sub-optimal placement can occur due to topology changes, which can then cause higher usage of bandwidth for session messages than is normally expected. Second, we notice that the impact of changes to the topology is not always constrained to the nodes and clusters in the immediate vicinity of the change. We did not expect the nodes in collective C to suffer the consequences of changes to the link between collectives A and B. However, we do see some effect in that collective, which, in retrospect, is within the normal operating regions of the protocol. The newer version of the protocol is designed to be less rigid in its adaptivity to the topology, and hence is expected to perform better. Later, in Section 6.4, we will hypothesize on the applicability of our results in this section to the newer version of the protocol.

6.2.2 Combination Measures: Proximity + Size

It is possible to conjecture scenarios in which a group of nodes might factor in both appropriateness measures in forming clusters. Consider a topology similar to that shown in Figure 6.1, in which the link between collectives A and B alternately fails and recovers. When the link fails, the nodes in collective B might migrate to C, as in the previous section, but then reform a separate cluster, because the combined size of the nodes in collectives B and C exceeds the limits for the size of a cluster.
Alternately, when the link recovers, the nodes will re-evaluate their measures, and dissolve, since they are quite small, and there is another cluster nearby; the nodes in collective B will coalesce with cluster A. However, SSM is designed to be slow (often stable) in response to topology changes. Hence, while the hypothetical scenario outlined above could occur in an actual larger topology, it will be difficult to realise this in practice in a simulation using a smaller topology. This, in fact, makes the evaluation of SSM in the context of changing, but connected, topologies particularly hard.

6.3 Partition Analysis of SSM

To complete the characterisation of SSM, we must evaluate its behaviour in the context of network partition and healing. The design of SSM is still in progress; our evaluations in the previous section were based on those parts of the SSM protocol that are relatively mature. The design is composed of two parts: the transition of a member from being a global representative to being a local member, and the promotion of a member to a global representative. In particular, and of relevance to us, the latter aspect of SSM design is still in progress. Therefore, in this section, we conjecture on the issues relevant to SSM behaviour under partition and those that must be given careful consideration.

Recall that each member executing the protocol attempts to keep the total number of global representatives in a session within pre-specified thresholds. When the group is partitioned, some of the members in each component will promote themselves to become global representatives to keep the number of representatives within the thresholds. Likewise, when the partition heals, the number of global representatives will decrease. Network partition events are unplanned events, and members do not recognise the decrease in the number of representatives immediately.
Therefore, if a component of a partition does not have any global representatives, it will be some time before the members in that component recognise this and some of them promote themselves.

In SSM, a local member computes distances to the active sources and other group members indirectly using its representative. Thus, in general, it is the representatives who have the most accurate, and often the least, distances to the sources and other nodes. Therefore, we can expect that it is the representatives who will be most active in any loss recovery. This implies that, if the members of a cluster are partitioned from their representative, the members in that cluster will see increased loss recovery times until they attach themselves to another global representative.

After the partition heals, the number of representatives will be greater than the designed threshold. Therefore, we will notice increased overhead of session messages until some of the representatives switch back to being local members. When a representative switches to being a local member and coalesces with a different cluster, then all of the members that were previously attached to that representative must also change to this new (or different) cluster. This introduces a separate overhead of partition and healing.

Finally, we close by first stressing the relevance of this work to operational networks such as the Internet, and by raising some potential issues for SSM that must be considered in the context of partition and healing. It is important to characterise the effect of cluster reformation due to partition healing. Notice, too, that the representatives in a group may change as the topology partitions and heals. Can a partition and healing result in a sub-optimal configuration of representatives, causing higher recovery delays and possibly increased overhead of session messages?
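The threshold behaviour conjectured in this section can be walked through with a small sketch. The Python below is an illustration only, not SSM itself: the function name and its return labels are hypothetical, and it omits the randomised, appropriateness-biased timers that decide which member actually promotes or demotes itself. It uses the thresholds from our simulations (two and three) and assumes that a member can observe only the representatives in its own component.

```python
def threshold_pressure(visible_reps, lower, upper):
    """Direction in which the threshold rule pushes a group of members.

    visible_reps is the number of global representatives a member can
    currently observe (an assumed input; SSM learns this over time).
    """
    if visible_reps < lower:
        return "promote"   # too few representatives: some member promotes
    if visible_reps > upper:
        return "demote"    # too many: some representative steps down
    return "hold"


LOWER, UPPER = 2, 3

# Before the partition: three representatives in the whole session.
before = threshold_pressure(3, LOWER, UPPER)

# The session partitions into two components that see 1 and 2
# representatives respectively.
component_1 = threshold_pressure(1, LOWER, UPPER)
component_2 = threshold_pressure(2, LOWER, UPPER)

# Suppose one member of the first component promotes itself (1 -> 2).
# When the partition heals, 2 + 2 = 4 representatives are visible.
after_heal = threshold_pressure(2 + 2, LOWER, UPPER)
```

In this walk-through the first component is pushed to promote a member during the partition, and the healed session then has more representatives than the upper threshold, so at least one must revert to being a local member.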
6.4 Notes on the Evolution of SSM

In this section, we briefly review the constraints in SSM that impacted the results in the previous subsections; we also describe the evolution of SSM, and the newer mechanisms in this later version of the protocol that could affect the results described earlier.

The SSM code used in this chapter is implemented to work with SRM over fixed timers. The implementation favours the formation of the maximum number of global representatives permissible for a particular threshold. The rules for the transition of a global representative to becoming a local member of a cluster were less clearly defined. This makes the evaluation of SSM under partition conditions hard, if not impossible. Startup conditions for each member of the group are another area that has evolved.

The current implementation has evolved to work with adaptive timers [123]. The rate of sending session messages is not fixed; instead, a bandwidth-based rate limiting scheme is used. This difference impacts the time it will take for the nodes to compute distances to every other member, especially in those simulations that do not use SSM. The protocol attempts to operate with the number of global representatives at the midpoint of the threshold range, rather than at one of the extremes. In addition, rules for the promotion and demotion of global representatives are more clearly elucidated.

SSM is designed to be stable in responding to topology changes. Therefore, if the members in a group get into some less than optimal configuration, they may take some time to adapt. Our implementation, because it was an early version of the protocol, had this problem in some of our simulations. The current specification and implementation have mechanisms to achieve a more balanced clustering structure that does not get into such sub-optimal behaviour often.
6.5 Applicability to Other Protocols

Pragmatic General Multicast (PGM) [41] describes mechanisms to provide intra-network support to enhance reliable multicast capabilities end-to-end. In the scheme, hosts unicast NAKs to the first-hop router, which then aggregates them up to the following hop back towards the source. The router also multicasts a NAK-confirmation to suppress other hosts from sending NAKs as well. On receipt of NAK-confirmations from a router, a nearby local node might volunteer to provide recovery. Routers keep state for the ephemeral retransmit trees to provide efficient distribution of the recovery packet. Variations on this mechanism to optimise the recovery include FEC based schemes, in which the source generates FEC packets based on the count of the lost packets; another variation is “PGM/breadcrumbs”, in which NAKs are sent all the way to the source, but the ephemeral trees help control redistribution of the repair packet.

In all of these schemes, routers maintain state information to provide reliability mechanisms. In the event of a topology change, routers along the new multicast tree do not immediately have the required information. Furthermore, the hosts may not be aware of the topology changes, and so may not take timely and appropriate recovery action. Hence, it would be interesting to evaluate the robustness of these schemes in the context of topology change.

A number of protocols apply clustering techniques in order to improve their scaling characteristics. Classless Inter-Domain Routing (CIDR) [46] aggregates topology information to improve the scalability of inter-domain unicast routing. CIDR achieves robustness to topology change through redundant advertisement of routing information. However, this detracts from the scalability of unicast routing. Furthermore, it acts as an impediment to the mobility and migration of a cluster from one portion of the topology to another.
This is a frequent occurrence in the Internet when a domain might choose to switch from one provider to another for internal policy reasons. A number of different mechanisms have been proposed in the past to facilitate such migration in a scalable manner in the context of the new Internet Protocol, IPv6 [31].

Clustering mechanisms have been proposed to improve the scalability of error recovery in reliable multicast protocols. Such local group recovery mechanisms are predicated on the assumption that frequent packet loss typically occurs on a few select links. Only members downstream of the lossy link relative to the source suffer the loss. Therefore, local group recovery mechanisms coalesce this sub-group of members into a separate cluster, and contain the loss recovery mechanisms to this cluster. This reduces the impact on the other members in the session, and improves the scaling properties of the protocol. Such mechanisms have been proposed for localised congestion control as well [54].

Unlike clustering in SSM, clusters in a local group mechanism are formed on the basis of loss patterns. Such loss patterns are predictable and easily generated in a simulator; this makes it easier to generate a number of interesting cluster configurations for the evaluation of local group clustering mechanisms. We leave for future study the evaluation of local group clustering mechanisms in the context of network dynamics.

6.6 Chapter Summary

In this chapter, we applied our methodology to characterise the behaviour of the clustering phenomena of SSM. We also showed how the methodology applied to SSM could be applied to other clustering phenomena, such as local group clusters for error recovery. Finally, we consider the possibility of realising a more abstract enumeration of different cluster configurations to evaluate the clustering properties of a protocol.
Our enumerations for SSM in Section 6.2 were based on the protocol characteristics. It is interesting to conjecture a complete enumeration of the different cluster configurations possible using the topology used in that section, and mapping them to different protocol states. This approach did not map too well in the case of SSM, because the protocol is by design resilient to topology changes. However, there is reason to believe that this approach will be very useful in evaluating mechanisms more directly related to the topology, such as topology aggregation or local group recovery mechanisms. We leave this abstract enumeration to future study, and explore this idea in some detail in the following chapter, in Section 7.3.

Chapter 7

Conclusions and Future Work

This thesis has demonstrated the importance of evaluating protocol robustness in the context of topology change. It has also explored the evaluation of protocols in three problem domains. In this chapter, we evaluate the significance of our contributions in this thesis, and indicate both the pieces that remain to be accomplished and future directions for this work.

Recall that, in Chapter 1, we defined a region of operation as a combination of operating parameters, consisting of the topology, the models of topology change, the characteristics of the link layer, and the network layer unicast and multicast routing protocols. This thesis has focused on protocol evaluation using regular topologies, with simple models of topology change. These models of the topology and associated topology change were chosen through careful inspection of the particular protocol. This is the first step in protocol evaluation. This thesis lays the groundwork on methods to approach this problem in the context of three problem domains.
In the following sections (Sections 7.1-7.3) we analyse our results in each of the three problem domains.

The particular region of operation that this thesis explored is the first one to consider in any protocol evaluation in the context of dynamic topologies. Subsequently, we need to consider other regions of operation as well. In these other regions of operation, the designer should consider larger topologies, multiple faults (independent or correlated), variations in link layer characteristics, as well as alternate unicast and multicast routing protocols. It remains to future work to explore methods for protocol evaluation in these other regions of operation.

One important region of operation that this thesis has not considered is a model of the actual deployment infrastructure. Another possible future direction is the exploration of systematic methods to conduct comprehensive protocol evaluations along multiple dimensions, including topology change. We conclude the chapter by discussing these two issues (Section 7.4).

7.1 Unicast Transport: TCP

TCP is a well-understood protocol that has been studied extensively for a number of years. It is designed to be robust and resilient to topology changes. While our evaluation did not reveal any fundamental weaknesses in TCP, it did indicate problems due to packet and ack interleaving. These problems with packet and ack interleaving have been recognized by other researchers earlier [102]; this thesis points out new and counter-intuitive ways by which such interleaving can occur within the network.

Additional work in the form of experiments is needed to establish that these problems do occur in operational networks. There are at least two ways in which this could be done. One method is to conduct actual experiments in an operational network.
This, however, requires careful experimentation, since the problems exhibited by TCP due to packet interleaving are extremely timing sensitive. Alternately, we can gather and analyze TCP traffic traces. This method, too, is non-trivial, because we require additional measurements about the state of the network in order to identify topology changes; further, we must correlate these topology changes to appropriate traces.

TCP’s response to packet and ack interleaving has been realized and recognized through empirical observations of traffic traces in the Internet. For instance, consider when one host, h_a, has multiple equal cost paths to another host, h_b, each of which has different end-to-end latencies. The interleaving occurs if one of the hosts then uses all of the paths in order to load share its network utilization across all of the links. Our work in this thesis shows that there are additional, new, and as yet unrecognized reasons for the occurrence of such packet and ack interleaving: interleaving due to topology changes.

Finally, numerous researchers have explored many different congestion control mechanisms and other improvements to the basic algorithms used by TCP. Many of these mechanisms relax the conservative constraints in TCP. This can sometimes be detrimental to the robustness of TCP. As an example, Vegas TCP [14] proposes measuring the rtt at a very fine granularity, and correlating it to the observed gain in throughput. It can then use this as an estimator of the queueing observed within the network. At issue is how the protocol will adapt if a topology change occurs while a session is in progress, and affects that session.
Such a topology change can alter both the rtt and the observed throughput. It is important to assess the robustness of such mechanisms in the context of topology change, especially prior to the protocol's deployment.

7.2 Multicast Transport: SRM

Our study of the timer mechanisms in SRM is focused on answering a number of protocol specific questions; it leaves to further study the exploration of systematic methods for evaluating the timer mechanisms in any generic protocol. In particular, the work identifies a set of carefully chosen topologies and experiments that reveal a number of pathologies in the protocol, and also a number of research questions that bear further investigation. We begin by discussing each of these research questions in the following paragraphs. Finally, as in the case of TCP, an important consideration when evaluating protocol robustness exclusively through simulation is to conduct empirical experiments to realize the issues in an operational network. We conclude this section by reviewing those issues that have been identified in operational networks, and those that need further investigation.

In our evaluation of SRM, we identified the critical importance for each member of a multicast group to have accurate distance estimates to all other members in that group. The distance estimates are essential to loss recovery, and our experiments indicate the types of pathological behaviours that can result from topology changes.¹ Recall that each member measures the delays and duplicates it observes for each loss, and adapts its protocol parameters; note also that the pathologies that we observe are reflected as abnormal delays and duplicates in loss recovery. We therefore ask whether it is possible to design protocol mechanisms that respond to such pathologies. As an example, one possible mechanism might be to trigger session messages so that each member can quickly recalibrate its distances following any observed pathological behaviour. Such a mechanism would have to be careful to avoid implosive behaviours, in which every member triggers session messages following a single topology change; additional mechanisms might be needed to correlate various trigger messages with the actual topology change; furthermore, to avoid oscillatory behaviours, the time constants of the response should be greater than the time constants for the network to stabilize its unicast and multicast routing following the topology change.

¹ In particular, it is important for the distance estimates among the group members to be congruent to the topology; that is, a member closer to a source should have a smaller estimate of its distance to the source than another member farther away.

Two final points to note in connection with our work with SRM over connected topologies are: (1) the normalized delay measures used by the protocol are ideally suited to stable topologies, but exhibit strange pathologies when topologies are in a state of flux; this is an additional reason for each member participating in the protocol to recalibrate its distance estimates as quickly as possible; and (2) the worst case behaviour of the protocol is not one in which the topology alternates between string-like and star-like topologies, because the mode of protocol adaptation in a particular topology is independent of the mode of adaptation in the alternate topology; hence, the topology changes that appear to be worst case are not necessarily as critical to protocol operation as might otherwise be construed.

The previous set of issues was identified in the context of our experiments over connected topologies. We recognized three problems through our experiments in partition and healing, and suggest possible avenues that can be explored.
All of these suggestions are based on the premise that it is possible for members to infer a partition healing. Recall that each member, g_i, carries state information about all other members, based on the session messages that g_i periodically hears from these other members. Therefore, g_i can infer that a partition has possibly healed if it does not receive any messages from some subset of these other members for some duration of time, following which it receives messages from them again. Given that we can infer the likelihood of a partition healing, the three problems are: (1) the stalled loss recovery due to partitions, resulting in very large loss recovery delays; (2) the transient congestion and increased delays following a healing, for losses occurring during partition; and (3) a concern that some types of mechanisms that might be used for session message handling could work against efficient protocol operation. We discuss each of these issues in the following paragraphs.

• The first problem that we observed is that if a packet loss occurs before the network partition occurs, and information about that packet is received by all the affected members, then the loss recovery at some of the members could be stalled because of the network partition. Subsequently, each of the members whose loss recovery is stalled continually backs off its timers. Therefore, the recovery following a healing is also affected by protocol overhead. One obvious mechanism that could address some of the delay is for members to reset their loss recovery parameters following a healing. They would then re-initiate the loss recovery for these particular losses as if they were just detected following the partition.
• The second problem that we noticed is that there is transient congestion following a healing, which induces many packet drops, and also impacts the recovery delays of losses occurring due to partition. We do not expect that congestion control and rate limiting mechanisms will address this situation satisfactorily. This is because each member that experienced these losses treats them as independent losses, and schedules its loss recovery separately for each loss. Since the loss recovery mechanisms in SRM are probabilistic, different members will send requests and repairs for the different losses separately. For similar probabilistic reasons, most of these messages will be sent close to each other as well. The number of data and control packets sent by each node during loss recovery is small. In such situations, a congestion control scheme will have limited leeway in responding to congestion signals from the network. Hence, we must consider other mechanisms to cope with partition healing. One possible mechanism might be that, since the partition losses are correlated with each other rather than being independent, each member could schedule one loss recovery request for all of the partition losses, and send one "batched" request for them. In a like manner, the nodes that send the repair messages for such a batched request might consider a similar mechanism, where one representative volunteers to send all the repair messages; it might use a mechanism similar to that used for rate based pacing for TCP to reduce the burstiness. The volunteer member could be the first one to send a repair message out; this would be identical to the protocol mechanisms for sending out repair messages.
Some issues that this mechanism should consider are: How to cope with overlapping losses, i.e., where some of the losses are due to network partition, and others are independent losses? How should suppression be accomplished in the presence of such overlapping losses? How to cope with the failure of the volunteer member? How should repairs be done if no member has all of the packets that are requested, possibly because the losses are from different sources, or because the source that had all the packets has left the group?

• The last issue is more a concern than an exploration of a problem or possible avenue of solution. Recall that session messages are essential for each member to calibrate its distances to every other member in the group; however, there are issues in scaling to very large groups when every member sends session messages at periodic intervals. One proposal to address this is for members to send session messages at a rate inversely proportional to the size of the group. However, consider the effect of this mechanism following a partition healing. At this juncture, the size of the group increases very suddenly, and each member may lower its session message sending rate, thereby increasing the time it would take to recalibrate distances; however, this is precisely the time when each member ought to increase its rate of sending session messages to recalibrate its distances to other members. Recall that, in contrast, Scalable Session Messages [122] uses alternate mechanisms to conserve bandwidth and recalibrate distances efficiently.

We conclude this section by identifying those issues that have been observed empirically, and point out those issues that require further experimentation. The MASH toolkit [88] uses SRM for reliable transport.
Using ns in its emulation mode, it has been possible to realize some other pathologies in a different context. Similar methods could be used to investigate whether the pathologies realised in this thesis could occur in an operational context. The stalled loss recovery due to network partitions has been identified in the context of RPM. However, the congestion problems that we observed in simulation are more apparent in groups that have a high data rate as well as significant losses due to partition; hence these require further observation in order to identify such effects.

7.3 Clustering Mechanisms: SSM

In the previous section on SRM, we indicated that we had good protocol specific results, from which we identified avenues for future exploration; however, our overall methodology in that problem domain was itself left for further exploration. By contrast, in the domain of clustering mechanisms in general, and SSM in particular, our protocol specific results are weak; however, there is greater promise of a more general methodology that can be applied to the study of clustering algorithms. As we indicated in the previous chapter, our results in SSM were weak because SSM is still in its evolutionary stage, and we had to use preliminary code for our experiments. In addition, the protocol was designed to react minimally, if ever, to topology changes, which constrained the scope of experiments we could conduct. However, it should be possible to generalize the methods by which we constructed our experiments for SSM into a general technique that can be applied to evaluating most arbitrary clustering protocols and techniques. In this section, we describe this generalized architecture and map it to different cluster algorithms, SSM included.
The effect of topology change on a clustering algorithm is to possibly reshape the clusters in a predictably finite number of ways: a sub-group of nodes in a cluster can break away and form a separate cluster, the sub-group can migrate to, and join, a different cluster, or a cluster can dissolve and coalesce with another cluster. Variations of all of these phenomena can occur in large topologies with many clusters.

The changes exhibited by a clustering algorithm can be captured in a simulation experiment using a specific topology of n collectives. Recall that our definition of a collective of nodes is a collection of topologically contiguous nodes that has the potential to become a cluster, either by itself, or by coalescing with other collectives (or other clusters) contiguous to it. One such topology of three collectives is shown in Figure 7.1; this topology has one additional collective for the source, called collective S; one member in collective S is an active source. We can then list all cluster configurations that can be realised with the n collectives. In the generalised methodology, we would then eliminate those cluster configurations that can never occur in the particular protocol that is to be evaluated. We can then enumerate the pairs of cluster configurations, such that each enumeration defines an experiment in our methodology. Each experiment is then mapped to the specific protocol, in order to evaluate that protocol in the context of topology change.

[Figure 7.1: Topology for Studying Clustering Properties]
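The enumeration procedure described above can be sketched in a few lines of Python. This is an illustrative sketch, not the tooling used in this work: the function names are hypothetical, and it assumes that the collectives lie on a chain (A next to B, B next to C), so that only adjacent collectives can coalesce into a cluster. Under that assumption, three collectives yield 13 configurations. The protocol-specific filter shown here (every collective must belong to some cluster) is one example of an elimination rule; each unordered pair of surviving configurations then defines one candidate experiment.

```python
from itertools import combinations


def cluster_configurations(collectives):
    """Enumerate every cluster configuration over a chain of collectives.

    A cluster is a run of adjacent collectives; a collective may also
    stay outside any cluster. Adjacency along a chain is an assumption
    of this sketch.
    """
    n = len(collectives)
    configs = []

    def extend(start, chosen):
        if start == n:
            configs.append(tuple(chosen))
            return
        # The collective at `start` joins no cluster.
        extend(start + 1, chosen)
        # A cluster starts at `start` and spans through `end`.
        for end in range(start, n):
            cluster = tuple(collectives[start:end + 1])
            extend(end + 1, chosen + [cluster])

    extend(0, [])
    return configs


def covers_all(config, collectives):
    """Hypothetical protocol filter: every collective is in some cluster."""
    return sum(len(cluster) for cluster in config) == len(collectives)


def experiments(configs):
    """Each unordered pair of configurations defines one experiment."""
    return list(combinations(configs, 2))


def fmt(config):
    """Render a configuration in the {(A), (BC)} notation."""
    return "{" + ", ".join("(" + "".join(c) + ")" for c in config) + "}"


if __name__ == "__main__":
    names = ["A", "B", "C"]
    all_configs = cluster_configurations(names)
    realizable = [c for c in all_configs if covers_all(c, names)]
    print(len(all_configs), "configurations in total")
    for before, after in experiments(realizable):
        print(fmt(before), "<->", fmt(after))
```

Keeping the filter separate from the enumeration means that a different clustering protocol only needs a different predicate; the enumeration and pairing steps stay the same.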
In our particular scenario of three collectives, the list of cluster configurations that can be realized is:

{}
{(A)} {(B)} {(C)}
{(AB)} {(BC)} {(ABC)}
{(A), (B)} {(B), (C)} {(A), (C)}
{(A), (BC)} {(AB), (C)}
{(A), (B), (C)}

The cluster configuration {} is the situation where no clusters form in the topology; similarly, {(A), (BC)} is the configuration in which collective A forms one cluster, (A), and the collectives B and C form one cluster, (BC).

It is not always the case that every one of the above cluster configurations can be realized in a protocol. For example, every member in SSM must belong to some cluster. Therefore, only the subset of cluster configurations in which all three collectives are members of some cluster is realizable in SSM. Thus the possible configurations for SSM are:

{(ABC)} {(A), (BC)} {(AB), (C)} {(A), (B), (C)}

Continuing with our example, the enumerations relevant to SSM are:

(1) {(A), (BC)} ↔ {(AB), (C)},
(2) {(ABC)} ↔ {(A), (BC)}, or, {(ABC)} ↔ {(AB), (C)},
(3) {(AB), (C)} ↔ {(A), (B), (C)}, or, {(A), (BC)} ↔ {(A), (B), (C)},
(4) {(A), (B), (C)} ↔ {(ABC)}.

Experiment 1 defines collective migration due to proximity variations (Section 6.2.1); Experiments 2, 3, and 4 evaluate the effect of combination measures of SSM to topology change (Section 6.2.2).

In general, this mechanism can be applied to other clustering algorithms as well. For example, local group recovery mechanisms form a cluster of nodes that share losses occurring because of one (or more) particular lossy link. Unlike in SSM, all of the cluster configurations are realizable by a local group recovery algorithm. Our methodology identifies the set of experiments required to evaluate a local group recovery mechanism in the context of topology change.

Another application of our methodology is unicast address allocation.
Here, the aim is to perform efficient address aggregation, to save unicast routing state and bandwidth on route advertisements. Each aggregate defines a particular cluster. Difficulties arise when trying to aggregate the addresses for multi-homed sites, or sites that are mobile, or changing providers. In the case of IPv4 [106] and CIDR [46], the problem is dealt with through multiple duplicate advertisements, which reduce the efficiency of address aggregation but provide increased robustness. A number of proposals have been made in the context of the new IPv6 [31, 59] to explore alternate aggregation strategies that are more efficient [115]. The robustness of such proposals can be systematically explored using the methodology outlined in this section.

7.4 Other Issues

In the previous sections, we identified future directions in each of the individual problem domains that we have studied. We conclude by addressing two concerns about the analysis in each of the problem domains.

The first issue is that our methodology is only the first step in evaluating the robustness of a protocol in the context of topology change. This step required us to analyze each protocol in small simple topologies that are carefully chosen based on the characteristics of that protocol. Subsequently, we should evaluate the protocol in larger topologies whose network dynamics more accurately reflect the networks over which the protocol will be deployed. To do this, we need good models of the topology, as well as good models of network behaviour. Various models of topology exist that reflect different characteristics of the Internet [35, 131, 137]. We expect that the empirical observations of [102, 49, 76] can be used to derive the models of network behaviour.
The second concern is that our work has only explored methodologies for systematic protocol evaluation in the context of dynamic topologies. In general, designers conduct protocol evaluations along multiple dimensions, such as by varying topologies, workloads, cross traffic, group membership dynamics, etc. We need systematic methodologies to evaluate a protocol in a more general way, along these other dimensions as well. One promising approach is Systematic Testing of Robustness by Examination of Selected Scenarios (STRESS) [57].

In summary, this thesis hopes to have accomplished the following: motivated protocol designers to consider the impact of topology change on their protocols; provided them with the tools and techniques by which they could investigate the effect of topology change on their protocol design; and laid the groundwork for other researchers to consider systematic methods for protocol evaluation during design.

Bibliography

[1] C. Alaettinoglu, T. Bates, E. Gerich, D. Karrenberg, D. Meyer, M. Terpstra, and C. Villamizar. Routing Policy Specification Language (RPSL), RFC 2280 edition, 1998. (Status: PROPOSED STANDARD).
[2] C. Alaettinoglu and A.U. Shankar. The viewserver hierarchy for inter-domain routing: Protocols and evaluation. IEEE Journal on Selected Areas in Communications, 13(8):1396-1410, October 1995.
[3] C. Alaettinoglu, A.U. Shankar, K. Dussa-Zeiger, and I. Matta. Design and implementation of MaRS: A routing testbed. Internetworking: Research and Experience, 5:17-41, 1994.
[4] The ATM Forum. Private Network-Network Interface Specification Version 1.0 (PNNI 1.0), af-pnni-0055.000 edition, March 1996.
[5] The ATM Forum. Private Network-Network Interface Specification Version 1.0 (PNNI 1.0), af-pnni-0055.000 edition, March 1996.
[6] A. Avritzer and E.J. Weyuker. Generating test suites for software load testing.
In Proceedings of the International Symposium on Software Testing and Analysis, pages 44-57, 1994.
[7] T. Bates, E. Gerich, L. Joncheray, J-M. Jouanigot, D. Karrenberg, M. Terpstra, and J. Yu. Representation of IP Routing Policies in a Routing Registry, RIPE-181 edition, October 3, 1994. Available at ftp://ftp.ripe.net/ripe/docs.
[8] J. Behrens and J.J. Garcia-Luna-Aceves. Distributed, scalable routing based on link-state vectors. In Proceedings of the ACM SIGCOMM, pages 136-147, London, England, UK, August 1994.
[9] F. Belina and D. Hogrefe. The CCITT-specification and description language SDL. In Computer Networks and ISDN Systems [10].
[10] F. Belina and D. Hogrefe. The CCITT-specification and description language SDL. Computer Networks and ISDN Systems, 16, 1989.
[11] G. v. Bochmann and A. Petrenko. Protocol testing: Review of methods and relevance for software testing. In Proceedings of the International Symposium on Software Testing and Analysis, pages 109-124, 1994.
[12] T. Bolognesi and E. Brinksma. Introduction to the ISO specification language LOTOS. Computer Networks and ISDN Systems, 14(1), 1987.
[13] T. Bolognesi and H. Rudin. On the analysis of time-dependent protocols by network flow algorithms. In Protocol Specification, Testing, and Verification IV, pages 491-503. IFIP, Elsevier Science Publishing B.V. (North-Holland), 1984.
[14] L.S. Brakmo, S. O'Malley, and L.L. Peterson. TCP Vegas: New techniques for congestion detection and avoidance. In Proceedings of the ACM SIGCOMM, 1994.
[15] L.S. Brakmo and L.L. Peterson. Experiences with network simulation. In Proceedings of the ACM SIGMETRICS, May 1996.
[16] H.W. Braun. Models of policy based routing, RFC 1104 edition, 1989. (Status: UNKNOWN).
[17] S. Budkowski and P. Dembinski. An introduction to Estelle: a specification language for distributed systems.
Computer Networks and ISDN Systems, 14(1), 1987.
[18] I. Castineyra, N. Chiappa, and M. Steenstrup. The Nimrod Routing Architecture, RFC 1992 edition, 1996. (Status: INFORMATIONAL).
[19] K.M. Chandy and J. Misra. Parallel Program Design. Addison-Wesley Publishing Company, Inc., 1988.
[20] C. Cheng, R. Riley, S.P.R. Kumar, and J.J. Garcia-Luna-Aceves. A loop-free extended Bellman-Ford routing protocol without bouncing effect. In Proceedings of the ACM SIGCOMM, pages 224-236, 1989.
[21] B. Chinoy. Dynamics of Internet routing information. In Proceedings of the ACM SIGCOMM, pages 45-52, 1993.
[22] D. Clark. Policy routing in internetworks. Internetworking: Research and Experience, 1:35-52, 1990.
[23] D.D. Clark. The design philosophy of the DARPA Internet protocols. In Proceedings of the ACM SIGCOMM, pages 106-114, 1988.
[24] D.D. Clark. Policy routing in Internet protocols, RFC 1102 edition, 1989. (Status: UNKNOWN).
[25] D.D. Clark and D.L. Tennenhouse. Architectural considerations for a new generation of protocols. In Proceedings of the ACM SIGCOMM, pages 200-208, 1990.
[26] R. Cleaveland, J. Parrow, and B. Steffen. A semantics-based verification tool for finite-state systems. In Protocol Specification, Testing, and Verification IX, pages 287-302. IFIP, Elsevier Science Publishing B.V. (North-Holland), 1989.
[27] CPSim: A high-performance discrete-event simulation tool, http://www.wdn.com/bti-sim/, 1994.
[28] P. Danzig and S. Jamin. tcplib: A library of TCP/IP traffic characteristics. Technical Report CS-SYS-91-01, University of Southern California, Networks and Distributed Systems Laboratory, Department of Computer Science, October 1991. Available at ftp://catarina.usc.edu/pub/jamin/tcplib.
[29] S. Deering, D. Estrin, D. Farinacci, V. Jacobson, C-G. Liu, and L. Wei. An architecture for wide-area multicast routing.
In Proceedings of the ACM SIGCOMM, pages 126-135, September 1994.
[30] S. Deering, D. Estrin, D. Farinacci, V. Jacobson, Ching-Gung Liu, and L. Wei. An architecture for wide-area multicast routing. Technical Report USC-SC-94-565, Computer Science Department, University of Southern California, Los Angeles, CA 90089, 1994.
[31] S. Deering and R. Hinden. Internet Protocol, Version 6 (IPv6) Specification, RFC 1883 edition, 1995. (Status: PROPOSED STANDARD).
[32] D. DeLucia and K. Obraczka. A multicast congestion control mechanism using representatives. Technical Report USC-CS TR 97-651, Department of Computer Science, University of Southern California, May 1997.
[33] D. DeLucia and K. Obraczka. Multicast feedback suppression using representatives. In IEEE Proceedings of the INFOCOM, May 1997.
[34] E. W. Dijkstra. A note on two problems in connection with graphs. Numerical Mathematics, 1:269-271, 1959.
[35] M. Doar. A better model for generating test networks. In Proceedings of the Globecom, November 1996.
[36] A. Dupuy, J. Schwartz, Y. Yemini, and D. Bacon. NEST: A network simulation and prototyping method. Communications of the ACM, 33(10):64-74, October 1990.
[37] D. Estrin. Policy requirements for inter Administrative Domain routing, RFC 1125 edition, 1989. (Status: UNKNOWN).
[38] D. Estrin, T. Li, Y. Rekhter, K. Varadhan, and D. Zappala. Source Demand Routing: Packet Format and Forwarding Specification (Version 1), RFC 1940 edition, 1996. (Status: INFORMATIONAL).
[39] D. Estrin, Y. Rekhter, and S. Hotz. Scalable inter-domain routing architecture. In Proceedings of the ACM SIGCOMM, pages 40-52, Baltimore, MD, U.S.A., August 1992.
[40] K. Fall and S. Floyd. Simulation-based comparisons of Tahoe, Reno, and SACK TCP. ACM Computer Communications Review, 26(3):5-21, July 1996.
[41] D. Farinacci, A. Lin, Tony Speakman, and A. Tweedly. Pretty Good Multicast (PGM) Transport Protocol Specification. Internet Draft: NONE working group, January 13, 1998.
Work in progress.
[42] S. Floyd, V. Jacobson, S. McCanne, C-G. Liu, and L. Zhang. A reliable multicast framework for light-weight sessions and application level framing. IEEE/ACM Transactions on Networking, 1997.
[43] L.R. Ford and D.R. Fulkerson. Flows in Networks. Princeton University Press, Princeton, N.J., U.S.A., 1962.
[44] P. Ford and Y. Rekhter. Explicit Routing Protocol (ERP) for IPv6. Available from the Authors, January 1995. Internet Draft: Work in Progress.
[45] V. Fuller, T. Li, J. Yu, and K. Varadhan. Classless Inter-Domain Routing (CIDR): an Address Assignment and Aggregation Strategy, RFC 1519 edition, 1993. (Obsoletes RFC 1338) (Status: PROPOSED STANDARD).
[46] V. Fuller, T. Li, J. Yu, and K. Varadhan. Classless Inter-Domain Routing (CIDR): an Address Assignment and Aggregation Strategy, RFC 1519 edition, 1993. (Obsoletes RFC 1338) (Status: PROPOSED STANDARD).
[47] J.J. Garcia-Luna-Aceves. Loop-free routing using diffusing computations. IEEE/ACM Transactions on Networking, 1(1):130-141, February 1993.
[48] M.G. Gouda. Correspondence on "A simple protocol whose proof isn't": The state machine approach. IEEE Transactions on Communications, COM-33(4):380-382, April 1985. See also [53].
[49] R. Govindan and A. Reddy. An analysis of internet inter-domain topology and route stability. In IEEE Proceedings of the INFOCOM, April 1997.
[50] R. Govindan, H. Yu, and D. Estrin. Scalable non-transactional replication in the internet. Submitted for publication.
[51] Audio-Video Transport Working Group, H. Schulzrinne, S. Casner, R. Frederick, and V. Jacobson. RTP: A Transport Protocol for Real-Time Applications, RFC 1889 edition, 1996. (Status: PROPOSED STANDARD).
[52] J. Guckenheimer and P. Holmes. Nonlinear Oscillations, Dynamic Systems, and Bifurcations of Vector Fields, chapter 1, pages 22-27. Springer-Verlag New York Inc., 1983.
[53] B.
Hailpern. A simple protocol whose proof isn't. IEEE Transactions on Communications, COM-33(4):330-337, April 1985.
[54] M. Handley. A congestion control architecture for bulk data transfer. Slides of the Presentation at the Reliable Multicast Research Group, September 1997. http://www.east.isi.edu/rm/cannes-meeting.html.
[55] C.L. Hedrick. Routing Information Protocol, RFC 1058 edition, 1988. (Updated by RFC 1388, RFC 1723) (Status: HISTORIC).
[56] C.L. Hedrick. An introduction to IGRP. Technical report, Rutgers University, August 1991.
[57] A. Helmy and D. Estrin. Simulation-based STRESS testing case study: A multicast routing protocol. In Symposium on Modeling, Analysis and Simulation of Computer and Telecommunication Systems, July 1998.
[58] A. Heybey. Netsim Manual. MIT, 1989.
[59] R. Hinden, S. Deering, and Editors. IP Version 6 Addressing Architecture, RFC 1884 edition, 1995. (Status: PROPOSED STANDARD).
[60] M. Hoffman. Adding scalability to transport level multicast. In Proceedings of the Third COST237 Workshop - Multimedia Telecommunications and Applications, November 1996.
[61] H.W. Holbrook, S.K. Singhal, and D.R. Cheriton. Log-based receiver-reliable multicast for distributed interactive simulation. In Proceedings of the ACM SIGCOMM, 1995.
[62] ISO. Conformance Testing Methodology and Framework, ISO/TC97/SC21 N 2nd DP9646 edition, December 1987.
[63] ISO. Estelle, a FDT Based on an Extended State Transition Model, ISO TC97/SC221 DIS 9074 edition, September 1987.
[64] ISO. LOTOS, a FDT Based on Temporal Ordering of Observational Behaviour, ISO TC97/SC21 DIS 8807 edition, July 1987.
[65] ISO. Inter-Domain Routeing, ISO/IEC JTC1/SC 6 WG 2 N 324 edition, August 1989.
[66] ISO.
Information Processing Systems - Telecommunications and Information Exchange between Systems - Intermediate System to Intermediate System Intra-Domain Routing Exchange Protocol for use in Conjunction with the protocol for providing the Connectionless-mode Network Service (ISO 8473), ISO/IEC 10589 edition, 1992.
[67] ISO. Information Processing Systems - Telecommunications and Information Exchange between Systems - Intermediate System to Intermediate System Intra-Domain Routing Exchange Protocol for use in Conjunction with the protocol for providing the Connectionless-mode Network Service (ISO 8473), ISO/IEC 10589 edition, 1992.
[68] ISO. Protocol for Exchange of Inter-domain Routeing Information among Intermediate Systems to Support Forwarding of ISO 8473 PDUs, ISO/IEC/JTC1/SC6 IS 10747 edition, 1993.
[69] ISO. Protocol for Exchange of Inter-domain Routeing Information among Intermediate Systems to Support Forwarding of ISO 8473 PDUs, ISO/IEC/JTC1/SC6 IS 10747 edition, 1993.
[70] V. Jacobson. Congestion avoidance and control. In Proceedings of the ACM SIGCOMM, pages 314-329, Stanford, CA, U.S.A., August 1988.
[71] J. Jaffee and F. Moss. A responsive distributed routing algorithm for computer networks. IEEE Transactions on Communications, July 1982.
[72] R. Jain and S.A. Routhier. Packet trains - measurements and a new model for computer network traffic. IEEE Journal on Selected Areas in Communications, SAC-4(6):986-995, September 1986.
[73] P. Karn and C. Partridge. Estimating round-trip times in reliable transport protocols. In Proceedings of the ACM SIGCOMM, August 1987.
[74] S.K. Kasera, J. Kurose, and D. Towsley. Scalable reliable multicast using multiple multicast groups. Technical Report CMPSCI TR 96-73, Department of Computer Science, University of Massachusetts, October 1996.
[75] S. Keshav. REAL: A network simulator.
Technical Report UCB/CSD 88/472, University of California, Berkeley, 1988.
[76] C. Labovitz, G.R. Malan, and F. Jahanian. Internet routing instability. In Proceedings of the ACM SIGCOMM. ACM, September 1997.
[77] L. Lamport. An assertional correctness proof of a distributed algorithm. Science of Computer Programming, 2:175-206, 1982.
[78] A. Lapone, N. Maxemchuck, and H. Schulzrinne. The Bell Laboratories network emulator, August 1993.
[79] W.E. Leland, M.S. Taqqu, W. Willinger, and D.V. Wilson. On the self-similar nature of Ethernet traffic (extended version). IEEE/ACM Transactions on Networking, 2(1):1-15, February 1994.
[80] J.C. Lin and S. Paul. RMTP: A reliable multicast transport protocol. In IEEE Proceedings of the INFOCOM, pages 1414-1424, April 1996.
[81] M. Little. Goals and functional requirements for inter-autonomous system routing, RFC 1126 edition, 1989. (Status: UNKNOWN).
[82] C-G. Liu, D. Estrin, S. Shenker, and L. Zhang. Recovery timer adaptation in SRM. Submitted to the IEEE Transactions on Networking.
[83] C-G. Liu, D. Estrin, S. Shenker, and L. Zhang. Local error recovery in SRM: Comparison of two approaches. Technical Report USC-CS-TR-97-648, Department of Computer Science, University of Southern California, 1997.
[84] C-G. Liu, D. Estrin, S. Shenker, and L. Zhang. Timer adjustment in SRM. Technical Report USC-CS-TR 97-656, University of Southern California, 1997.
[85] M. Mathis, J. Mahdavi, S. Floyd, and A. Romanow. TCP Selective Acknowledgement Options, RFC 2018 edition, 1996. (Status: PROPOSED STANDARD).
[86] S. McCanne. Router forwarding services for reliable multicast. Note (199704141535.IAA10590@mlk.cs.berkeley.edu) to the Reliable Multicast list (rm@mash.cs.berkeley.edu), April 1997.
[87] S. McCanne. Scalable multimedia communication with internet multicast, lightweight sessions, and the mbone. Technical Report CSD-98-1002, University of California, Berkeley, March 1998.
[88] S. McCanne, E. Brewer, and R. Katz, et al.
Towards a common infrastructure for multimedia-networking middleware. In 7th International Workshop on Network and Operating Systems Support for Digital Audio and Video (NOSSDAV), St. Louis, Missouri, May 1997. Available at http://www-mash.cs.berkeley.edu/dist/mash/papers/mash-nossdav97.ps.gz.
[89] S. McCanne and S. Floyd. ns - Network Simulator. http://www-mash.cs.berkeley.edu/ns/.
[90] S. McCanne, V. Jacobson, and M. Vetterli. Receiver-driven layered multicast. In Proceedings of the ACM SIGCOMM, pages 117-130, Stanford, CA, U.S.A., August 1996.
[91] J. McQuillan, I. Richer, and E.C. Rosen. The new routing algorithm for the ARPANET. IEEE Transactions on Communications, COM-28(5):711-719, May 1980.
[92] J.M. McQuillan, G. Falk, and I. Richer. A review of the development and performance of the ARPANET routing algorithm. IEEE Transactions on Communications, COM-26(12):1802-1811, December 1978.
[93] P.M. Merlin and A. Segall. A failsafe distributed routing protocol. IEEE Transactions on Communications, COM-27(9):1280-1287, 1979.
[94] D.L. Mills. Exterior Gateway Protocol formal specification, RFC 904 edition, 1984. (Updates RFC0827, RFC0888) (Status: HISTORIC).
[95] R. Milner. A Calculus of Communicating Systems. Number 92 in Lecture Notes in Computer Science. Springer Verlag, 1980.
[96] J. Moy. OSPF Version 2, RFC 1583 edition, 1994. (Obsoletes RFC 1247) (Obsoleted by RFC2178) (Status: DRAFT STANDARD).
[97] J. Moy. OSPF Version 2, RFC 1583 edition, 1994. (Obsoletes RFC 1247) (Obsoleted by RFC2178) (Status: DRAFT STANDARD).
[98] K. Ogata. Discrete-Time Control Systems, chapter 4, pages 351-353. Prentice-Hall Inc., Englewood Cliffs, NJ 07632, USA, 1987.
[99] C. Papadopoulos, G. Parulkar, and G. Varghese. An error control scheme for large-scale multicast applications. http://dworkin.wustl.edu/ christos/PostscriptDocs/current.ps.Z.
[100] J. Parrow.
Verifying a CSMA/CD-protocol with CCS. In Protocol Specification, Testing, and Verification VIII, pages 373-384. IFIP, Elsevier Science Publishing B.V. (North-Holland), 1984.
[101] V. Paxson. Empirically-derived analytic models of wide-area TCP connections. IEEE/ACM Transactions on Networking, 2(4), August 1994.
[102] V. Paxson. End-to-end routing behavior in the internet. In Proceedings of the ACM SIGCOMM, August 1996.
[103] V. Paxson and S. Floyd. Wide area traffic: The failure of poisson modeling. IEEE/ACM Transactions on Networking, 3(3):226-244, June 1995.
[104] R. Perlman. Fault-tolerant broadcast of routing information. Computer Networks, 7, December 1983.
[105] R. Perlman. Interconnections - Bridges and Routers. Addison-Wesley, Reading, Massachusetts, 1992.
[106] J. Postel. Internet Protocol, RFC 791 edition, 1981. (Obsoletes RFC0760) (Status: STANDARD).
[107] J. Postel. Transmission Control Protocol, RFC 793 edition, 1981. (Status: STANDARD).
[108] Prophesy - detailed description. http://www.csn.net/abstraction, 1994. Commercial package.
[109] A. Reddy. A self organizing monitoring architecture. Thesis Proposal, Available from the authors, May 1997.
[110] Y. Rekhter. Inter-Domain Routing Protocol (IDRP). Internetworking: Research and Experience, 4:61-80, 1993.
[111] Y. Rekhter. Private communication, January 1996.
[112] Y. Rekhter, S. Hotz, and D. Estrin. Constraints on forming clusters with link-state hop-by-hop routing. Technical Report RC 19203 (83635) 10/6/93, IBM, IBM Research Division, T.J. Watson Research Center, Yorktown Heights, NY 10598, 1993.
[113] Y. Rekhter and T. Li. A Border Gateway Protocol 4 (BGP-4), RFC 1771 edition, 1995. (Obsoletes RFC 1654) (Status: DRAFT STANDARD).
[114] Y. Rekhter and T. Li. A Border Gateway Protocol 4 (BGP-4), RFC 1771 edition, 1995. (Obsoletes RFC 1654) (Status: DRAFT STANDARD).
[115] Y. Rekhter, P.
Lothberg, R. Hinden, S. Deering, and J. Postel. An IPv6 Provider-Based Unicast Address Format, RFC 2073 edition, 1997. (Status: PROPOSED STANDARD).
[116] G-C. Roman, P.J. McCann, and J.Y. Plun. Assertional reasoning about pairwise transient interactions in mobile computing. In IEEE, editor, Proceedings of the 18th International Conference on Software Engineering, pages 155-164, March 1996.
[117] B. Sabata, M.J. Brown, and B.A. Denny. Transport protocol for reliable multicast: TRM. In International Conference on Networks, Orlando, Florida, U.S.A., January 1996.
[118] M. Sajkowski. Protocol verification techniques: Status quo and perspectives. In Protocol Specification, Testing, and Verification IV. IFIP, Elsevier Science Publishing B.V. (North-Holland), 1984.
[119] N. Shacham. Multipoint communication by hierarchically encoded data. In IEEE Proceedings of the INFOCOM, pages 2107-2114, 1992.
[120] A.U. Shankar, C. Alaettinoglu, K. Dussa-Zieger, and I. Matta. Transient and steady-state performance of routing protocols: Distance-vector versus link-state. Internetworking: Research and Experience, 7(1), March 1996.
[121] P. Sharma, D. Estrin, S. Floyd, and V. Jacobson. Scalable timers for soft state protocols. In IEEE Proceedings of the INFOCOM, April 1997.
[122] P. Sharma, D. Estrin, S. Floyd, and V. Jacobson. Scalable timers for soft state protocols. In IEEE Proceedings of the INFOCOM, 1997.
[123] P. Sharma, D. Estrin, S. Floyd, and L. Zhang. Scalable session messages in SRM using self-configuration. In Proceedings of the ACM SIGCOMM, 1998.
[124] K. Shin and M. Chen. Performance analysis of distributed routing strategies free of ping-pong-type looping. IEEE Transactions on Computers, C-36(2):129-137, February 1987.
[125] K. Shin and M. Chen. Minimal order loop-free routing strategy. IEEE Transactions on Computers, 39(7):870-881, July 1990.
[126] D.
Sidhu, T. Fu, R. Agarwal, A. Crain, and S. Carton. SimuNet: A tool for simulating high speed networks. Proceedings of the Networld + Interop, May 1994. Commercial software specification pamphlet, email: celenix@access.digex.net.
[127] M. Steenstrup. An Architecture for Inter-Domain Policy Routing, RFC 1478 edition, 1993. (Status: PROPOSED STANDARD).
[128] W. T. Strayer, B. J. Dempsey, and A. C. Weaver. XTP: The Xpress Transfer Protocol. Addison-Wesley, Reading, Mass., U.S.A., 1992. http://heg-school.aw.com/cseng/authors/dempsey/xtp/xtp.nclk.
[129] W.D. Tajibnapis. A correctness proof of a topology information maintenance protocol for a distributed computer network. Communications of the ACM, 20(7), 1977.
[130] D. Taubman and A. Zakhor. Multi-rate 3-D subband coding of video. IEEE Transactions on Image Processing, 3(5):572-588, September 1994.
[131] M. Thomas and E. Zegura. Generation and analysis of random graphs to model internetworks. Technical Report GIT-CC-94-46, Georgia Institute of Technology, 1994.
[132] C. Villamizar and R. Govindan. Controlling BGP route processing overhead. Internet Draft: IDR Working Group, October 1993. Work in progress.
[133] D. Waitzman, C. Partridge, and S.E. Deering. Distance Vector Multicast Routing Protocol, RFC 1075 edition, 1988. (Status: EXPERIMENTAL).
[134] L. Wei and D. Estrin. Multicast routing in dense and sparse modes: Simulation study of tradeoffs and dynamics. Technical Report TR 95-613, Department of Computer Science, University of Southern California, 1995.
[135] C.H. West. Protocol validation - principles and applications. Computer Networks and ISDN Systems, 24:219-242, 1992.
[136] R. Yavatkar, J. Griffioen, and M. Sudan. A reliable dissemination protocol for interactive collaborative applications. In Proceedings of the ACM Multimedia conference, pages 333-344, 1995.
[137] E.W. Zegura, K.L. Calvert, and S. Bhattacharjee. How to model an internetwork.
In IEEE Proceedings of the INFOCOM, volume 3, pages 594-602, San Francisco, CA, U.S.A., March 1996.

Appendix A
The ns Network Simulator

ns is an object-oriented simulator that was originally developed at LBNL. Each of the elements of the topology, the protocol implementations under evaluation, and the traffic models are modular objects in ns. Because of this modular design, it is easy to add new functionality and extend ns. ns uses the Tcl interpreter as a configuration interface. The core simulator exports primitive operations to the interpreter. A protocol designer running a simulation uses these primitives to configure an experiment. Therefore, given the right design, it is easy to prototype many different simulation experiments.

Figure A.1(a) shows an abstract description of the current architecture of ns. ns implements different traffic models, such as ftp and telnet based on tcplib [28], CBR models, etc. ns also offers different queue management strategies for the links and scheduling algorithms for the nodes. A simulation experiment in ns requires specifying the topology in terms of the nodes, links and path characteristics, the protocol being evaluated, as well as the traffic models. Traffic models can be used in one of two ways: (1) as a source of data that drives the protocol under test, or (2) as a source of data that causes congestion within the network. The simulation is then initialized by computing the routes to be used in the topology and then passing control to the event scheduler. The topology cannot change during the course of a particular simulation run.

We have added two components to ns in order to implement rtglib: a mechanism to toggle the state of individual links, and a simple distance vector routing protocol module. The models for network dynamics are implemented as Tcl functions.
The elementary model currently implemented toggles the state of a specified link periodically to emulate failure and recovery of that link. Since the models are realized through the Tcl interpreter, protocol designers can implement their own models of network dynamics as well. This allows them to test their protocol under specific models over and above the standard ones that we propose to provide. More details on network dynamics are in Appendix C. The routing protocol implementation uses hop counts as the metric, and split horizon (with poisoned reverse metrics) for route updates. These changes are shown in Figure A.1(b). Additional details on the routing protocol mechanisms in ns are in Appendix B.

[Figure A.1, panel (a): Current architecture of ns, with Scheduler, Protocol, Traffic Models, and Topology (nodes + links + path characteristics). Panel (b): Modified architecture of ns with rtglib, which adds a routing protocol and Network Dynamics Models to the components of panel (a).]

Figure A.1(a) shows the components of a typical transport level simulator (ns). A simulation experiment requires specifying the protocol to be evaluated. The traffic models and the topology that is to be used in the experiment must be specified as well. The topology is specified by defining the nodes and the links, as well as the path characteristics of the nodes and the links. The topology in this architecture is considered stable and unchanging. Figure A.1(b) shows the modifications required to support network dynamics in this simulator. In addition to the parameters of Figure A.1(a), an experiment can now be run under a model of network dynamics. The simulator operates a dynamic routing protocol to provide end-to-end connectivity.
Figure A.1: Architectural components of an end-to-end protocol simulator, ns

We also completed a detailed implementation of SRM [42] for evaluating that protocol. Appendix D describes our implementation of SRM in ns.

Appendix B
Unicast Routing in ns

This section describes the structure of unicast routing in ns. We begin by describing the interface to the user (Section B.1), through methods in the class Simulator and the class RouteLogic. We then describe configuration mechanisms for specialised routing (Section B.2), such as asymmetric routing or equal cost multipath routing. The next section describes the configuration mechanisms for individual routing strategies and protocols (Section B.3). We conclude with a comprehensive look at the internal architecture (Section B.4) of routing in ns.

The procedures and functions described in this chapter can be found in ~ns/tcl/lib/ns-route.tcl, ~ns/tcl/rtglib/route-proto.tcl, ~ns/tcl/mcast/McastProto.tcl, and ~ns/rtProtoDV.{cc,h}.

B.1 The Interface to the Simulation Operator (The API)

The user level simulation script requires one command: to specify the unicast routing strategy or protocols for the simulation. A routing strategy is a general mechanism by which ns will compute routes for the simulation. There are three routing strategies in ns: Static, Session, and Dynamic. Conversely, a routing protocol is a realisation of a specific algorithm. Currently, Static and Session routing use Dijkstra's all-pairs SPF algorithm [34]; one type of dynamic routing strategy is currently implemented: the Distributed Bellman-Ford algorithm [43]. In ns, we blur the distinction between strategy and protocol for static and session routing, considering them simply as protocols (footnote 1).

rtproto{} is the instance procedure in the class Simulator that specifies the unicast routing protocol to be used in the simulation.
It takes multiple arguments, the first of which is mandatory; this first argument identifies the routing protocol to be used. Subsequent arguments specify the nodes that will run the instance of this protocol. The default is to run the same routing protocol on all the nodes in the topology. As an example, the following commands illustrate the use of the rtproto{} command.

    $ns rtproto Static            ;# Enable static route strategy for the simulation
    $ns rtproto Session           ;# Enable session routing for this simulation
    $ns rtproto DV $n1 $n2 $n3    ;# Run DV agents on nodes $n1, $n2, and $n3

Footnote 1: The consideration is that static and session routing strategies/protocols are implemented as agents derived from the class Agent/rtProto, similar to how the different dynamic routing protocols are implemented; hence the blurred distinctions.

If a simulation script does not specify any rtproto{} command, then ns will run Static routing on all the nodes in the topology. Multiple rtproto{} lines for the same or different routing protocols can occur in a simulation script. However, a simulation cannot use both centralised routing mechanisms such as static or session routing and detailed dynamic routing protocols such as DV.

In dynamic routing, each node can be running more than one routing protocol. In such situations, more than one routing protocol can have a route to the same destination. Therefore, each protocol affixes a preference value to each of its routes. These values are non-negative integers in the range 0...255. The lower the value, the more preferred the route. When multiple routing protocol agents have a route to the same destination, the most preferred route is chosen and installed in the node's forwarding tables. If more than one agent has the most preferred route, the one with the lowest metric is chosen.
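The selection rule just described (lowest preference value first, then lowest metric) can be sketched as follows. This is a minimal Python model for illustration only, not the OTcl code in ns; the names `Route` and `select_route` are hypothetical, and ties that remain after comparing both fields are resolved here by taking the first offer, which is an assumption of the sketch.

```python
from typing import NamedTuple, Optional, Sequence


class Route(NamedTuple):
    """One route offered by a protocol agent (hypothetical model, not an ns class)."""
    proto: str       # name of the protocol agent offering the route
    next_hop: str    # neighbour used to reach the destination
    preference: int  # 0..255; lower values are more preferred
    metric: int      # cost to reach the destination; lower is better


def select_route(offers: Sequence[Route]) -> Optional[Route]:
    """Return the most preferred offer, breaking preference ties by lowest metric."""
    if not offers:
        return None
    for offer in offers:
        if not 0 <= offer.preference <= 255:
            raise ValueError(f"preference out of range: {offer.preference}")
    # min() keeps the first of several equal offers (assumption of this sketch).
    return min(offers, key=lambda offer: (offer.preference, offer.metric))


if __name__ == "__main__":
    offers = [
        Route("DV", "n2", 120, 2),
        Route("Direct", "n3", 100, 5),
    ]
    print(select_route(offers))
```

In this model a Direct route with preference 100 wins over a DV route with preference 120 even when its metric is larger, because the metric is consulted only among equally preferred routes.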
We call the least cost route from the most preferred protocol the "candidate" route. If there are multiple candidate routes from the same or different protocols, then, currently, one of the agents' routes is randomly chosen (footnote 2).

Footnote 2: This really is undesirable, and may be fixed at some point. The fix will probably be to favour the agents in class preference order. A user level simulation relying on this behaviour, or getting into this situation in specific topologies, is not recommended.

Preference Assignment and Control

Each protocol agent stores an array of route preferences, rtpref_. There is one element per destination, indexed by the node handle. The default preference values used by each protocol are derived from a class variable, preference_, for that protocol. The current defaults are:

    Agent/rtProto set preference_ 200          ;# global default preference
    Agent/rtProto/Direct set preference_ 100   ;# see footnote 3
    Agent/rtProto/DV set preference_ 120

Footnote 3: Direct is a special routing strategy that is used in conjunction with Dynamic routing. We will describe this in greater detail as part of the route architecture description.

A simulation script can control routing by altering the preference for routes in one of three ways: alter the preference for a specific route learned via a particular protocol agent, alter the preference for all routes learned by the agent, or alter the class variables for the agent before the agent is created.

Link Cost Assignment and Control

In the currently implemented route protocols, the metric of a route to a destination, at a node, is the cost to reach the destination from that node. It is possible to change the link costs at each of the links. The instance procedure cost{} is invoked as $ns cost <node1> <node2> <cost>, and sets the cost of the link from <node1> to <node2> to <cost>.
    $ns cost $n1 $n2 10        ;# set cost of link from $n1 to $n2 to 10
    $ns cost $n2 $n1 5         ;# set cost of link in reverse direction to 5
    [$ns link $n1 $n2] cost?   ;# query cost of link from $n1 to $n2
    [$ns link $n2 $n1] cost?   ;# query cost of link in reverse direction

Notice that the procedure sets the cost along one direction only. Similarly, the procedure cost?{} returns the cost of traversing the specified unidirectional link. The default cost of a link is 1.

B.2 Other Configuration Mechanisms for Specialised Routing

It is possible to adjust preference and cost mechanisms to get two special types of route configurations: asymmetric routing, and multipath routing.

Asymmetric Routing

Asymmetric routing occurs when the path from node n1 to node n2 is different from the path from n2 to n1. The following shows a simple topology and cost configuration that can achieve such a result:

[Figure: example topology with link costs]

Nodes n1 and n2 use different paths to reach each other. All other pairs of nodes use symmetric paths to reach each other. Any routing protocol that uses link costs as the metric can observe such asymmetric routing if the link costs are appropriately configured (footnote 4).

Footnote 4: Link costs can also be used to favour or disregard specific links in order to achieve particular topology configurations.

MultiPath Routing

Each node can be individually configured to use multiple separate paths to a particular destination. The instance variable multiPath_ determines whether or not that node will use multiple paths to any destination. Each node initialises its instance variable from a class variable of the same name. If multiple candidate routes to a destination are available, all of which are learned through the same protocol, then that node can use all of the different routes to the destination simultaneously. A typical configuration is as shown below:

    Node set multiPath_ 1      ;# All new nodes in the simulation use multiPaths where applicable

or alternately

    set n1 [$ns node]          ;# only enable $n1 to use multiPaths where applicable
    $n1 set multiPath_ 1

Currently, only DV routing can generate multipath routes.

B.3 Protocol Specific Configuration Parameters

Static Routing

The static route computation strategy is the default route computation mechanism in ns. This strategy uses Dijkstra's all-pairs SPF algorithm [34]. The route computation algorithm is run exactly once prior to the start of the simulation. The routes are computed using an adjacency matrix and link costs of all the links in the topology.

Session Routing

The static routing strategy described earlier only computes routes for the topology once in the course of a simulation. If the topology changes while the simulation is in progress, then some sources and destinations may become unreachable from each other. The session routing strategy is almost identical to static routing, in that it runs the Dijkstra all-pairs SPF algorithm prior to the start of the simulation, using the adjacency matrix and link costs of the links in the topology. However, it will also run the same algorithm to recompute routes in the event that the topology changes during the course of a simulation.

Note that session routing leads to a complete and instantaneous change in the routes of the topology whenever that topology changes. If the topology is always connected, then there is end-to-end connectivity at all times during the course of the simulation. However, the user should note that the instantaneous recomputation of the routes in the topology can lead to temporary violations of causality around the instant that the topology changes.

DV Routing

DV routing is the implementation of Distributed Bellman-Ford (or Distance Vector) routing in ns. The implementation sends periodic route updates every advertInterval. This variable is a class variable in the class Agent/rtProto/DV.
Its default value is 2 seconds. In addition to periodic updates, each agent also sends triggered updates; it does this whenever the forwarding tables in the node change. This occurs either due to changes in the topology, or because an agent at the node received a route update, and recomputed and installed new routes.

Each agent employs the split horizon with poisoned reverse mechanism to advertise its routes to adjacent peers. "Split horizon" is the mechanism by which an agent will not advertise the route to a destination out of the interface that it is using to reach that destination. In a "split horizon with poisoned reverse" mechanism, the agent will advertise that route out of that interface with a metric of infinity.

Each DV agent uses a default preference_ of 120. The value is determined by the class variable of the same name. Each agent uses the class variable infinity (set at 32) to determine the validity of a route.

B.4 Internals and Architecture of Routing

We start with a discussion of the classes associated with unicast routing, and the code path used to configure and execute each of the different routing protocols. We conclude with a description of the interface between unicast routing and network dynamics, and that between unicast and multicast routing.

B.4.1 The classes

There are four main classes: the class RouteLogic, the class rtObject, the class rtPeer, and the base class Agent/rtProto for all protocols. In addition, the routing architecture extends the classes Simulator, Link, Node and Classifier.

class RouteLogic

This class defines two methods to configure unicast routing, and one method to query it for route information. It also defines an instance procedure that is applicable when the topology is dynamic. We discuss this last procedure in conjunction with the interface to network dynamics.
• The instance procedure register{} is invoked by Simulator::rtproto{}. It takes the protocol and a list of nodes as arguments, and constructs an instance variable, rtprotos_, as an array; the array index is the name of the protocol, and the value is the list of nodes that will run this protocol.
• The instance procedure configure{} reads the rtprotos_ instance variable, and for each element in the array, invokes route protocol methods to perform the appropriate initialisations. It is invoked by the simulator run procedure. For each protocol <rt-proto> indexed in the rtprotos_ array, this routine invokes Agent/rtProto/<rt-proto> init-all rtprotos_(<rt-proto>). If there are no elements in rtprotos_, the routine invokes Static routing, as Agent/rtProto/Static init-all.
• The instance procedure lookup{} takes two node numbers, nodeId1 and nodeId2, as arguments; it returns the id of the neighbour node that nodeId1 uses to reach nodeId2. The procedure is used by the static route computation procedure to query the computed routes and populate the routes at each of the nodes. It is also used by the multicast routing protocols to perform the appropriate RPF check. Note that this procedure overloads an instproc-like of the same name. The procedure queries the appropriate rtObject entities if they exist (which they will if dynamic routing strategies are used in the simulation); otherwise, the procedure invokes the instproc-like to obtain the relevant information.

class rtObject

This class is used in simulations that use dynamic routing. Each node has a rtObject associated with it, which acts as a coordinator for the different routing protocols that operate at a node. At any node, the rtObject at that node tracks each of the protocols operating at that node; it computes and installs the best route to each destination available via each of the protocols.
In the event that the routing tables change, or the topology changes, the rtObject will alert the protocols to take the appropriate action.

The class defines the procedure init-all{}; this procedure takes a list of nodes as arguments, and creates a rtObject at each of the nodes in its argument list. It subsequently invokes its compute-routes. The assumption is that the constructor for each of the new objects will instantiate the "Direct" route protocol at each of these nodes. This route protocol is responsible for computing the routes to immediately adjacent neighbours. When compute-routes{} is run by the init-all{} procedure, these direct routes are installed in the node by the appropriate route object.

The other instance procedures in this class are:

• init{} The constructor sets up pointers from itself to the node, in its instance variable node_, and from the node to itself, through the Node instance procedure init-routing{} and the Node instance variable rtObject_. It then initialises the arrays nextHop_, rtpref_, metric_, and rtVia_. The index of each of these arrays is the handle of the destination node. The nextHop_ contains the link that will be used to reach the particular destination; rtpref_ and metric_ are the preference and metric for the route installed in the node; rtVia_ is the name of the agent whose route is installed in the node. The constructor also creates the instance of the Direct route protocol, and invokes compute-routes{} for that protocol.
• add-proto{} creates an instance of the protocol, and stores a reference to it in its array of protocols, rtProtos_. The index of the array is the name of the protocol. It also attaches the protocol object to the node, and returns the handle of the protocol object.
• lookup{} takes a destination node handle, and returns the id of the neighbour node that is used to reach the destination.
If multiple paths are in use, then it returns a list of the neighbour nodes that will be used. If the node does not have a route to the destination, the procedure will return -1.
• compute-routes{} is the core procedure in this class. It first checks to see if any of the routing protocols at the node have computed any new routes. If they have, it will determine the best route to each destination from among all the protocols. If any routes have changed, the procedure will notify each of the protocols of the number of such changes, in case any of these protocols wants to send a fresh update. Finally, it will also notify any multicast protocol that new unicast route tables have been computed.
The routine checks the protocol agent's instance variable, rtsChanged_, to see if any of the routes in that protocol have changed since the protocol was last examined. It then uses the protocol's instance variable arrays, nextHop_, rtpref_, and metric_, to compute its own arrays. The rtObject will install or modify any of the routes as the changes are found.
If any of the routes at the node have changed, the rtObject will invoke the protocol agent's instance procedure, send-updates{}, with the number of changes as argument. It will then invoke the multicast route object, if it exists.

The next set of routines are used to query the rtObject for various state information.

• dump-routes{} takes an output file descriptor as argument, and writes out the routing table at that node on that file descriptor. As an example, a typical dump output for a Node, id 1 (_o12), at time t = 1s is:

    dbg 6.5> [_o12 rtObject?] dump-routes stdout
    Node: _o12(1) at t = 1.00
    Dest      nextHop   Pref  Metric  Proto
    _o7 (0)   _o44 (0)  100   1       Direct
    _o12 (1)  _o12 (1)  000   0       Local
    _o17 (2)  _o77 (2)  100   1       Direct
    _o22 (3)  _o77 (2)  120   2       DV
    _o27 (4)  _o77 (2)  120   3       DV
    _o32 (5)  _o77 (2)  120   3       DV

From the output, we can see that Node 1 is directly connected to Nodes 0 and 2, and learns of routes to Nodes 3, 4, and 5 via Node 2 through DV routing.
• rtProto?{} takes a route protocol as argument, and returns a handle to the instance of the protocol running at the node.
• nextHop?{} takes a destination node handle, and returns the link that is used to reach that destination.
• Similarly, rtpref?{} and metric?{} take a destination node handle as argument, and return the preference and metric of the route to the destination installed at the node.

class rtPeer

The class rtPeer is a container class used by the protocol agents. Each object stores the address of the peer agent, and the metric and preference for each route advertised by that peer. A protocol agent will store one object per peer. The class maintains the instance variable addr_, and the instance variable arrays, metric_ and rtpref_; the array indices are the destination node handles.

The class instance procedures, metric{} and preference{}, take one destination and value, and set the respective array variable. The procedures, metric?{} and preference?{}, take a destination and return the current value for that destination. The instance procedure addr?{} returns the address of the peer agent.

class Agent/rtProto

This class is the base class from which all routing protocol agents are derived. Each protocol agent must define the procedure init-all{} to initialise the complete protocol, and possibly instance procedures init{}, compute-routes{}, and send-updates{}.
In addition, if the topology is dynamic, and the protocol supports route computation to react to changes in the topology, then the protocol should define the procedure compute-all{}, and possibly the instance procedure intf-changed{}. In this section, we will briefly describe the interface for the basic procedures. We will defer the description of compute-all{} and intf-changed{} to the section on network dynamics. We also defer the description of the details of each of the protocols to their separate section at the end of the chapter.

- The procedure init-all{} is a global initialisation procedure for the class. It may be given a list of the nodes as an argument. This is the list of nodes that should run this routing protocol. However, centralised routing protocols such as static and session routing will ignore this argument; detailed dynamic routing protocols such as DV will use this argument list to instantiate protocol agents at each of the nodes specified.
Note that derived classes in OTcl do not inherit the procedures defined in the base class. Therefore, every derived routing protocol class must define its own procedures explicitly.
- The instance procedure init{} is the constructor for protocol agents that are created. The base class constructor initialises the default preference for objects in this class, and identifies the interfaces incident on the node and their current status. The interfaces are indexed by the neighbour handle and stored in the instance variable array, ifs_; the corresponding status instance variable array is ifstat_.
Centralised routing protocols such as static and session routing do not create separate agents per node, and therefore do not access any of these instance procedures.
- The instance procedure compute-routes{} computes the actual routes for the protocol.
The computation is based on the routes learned by the protocol, and varies from protocol to protocol. This routine is invoked by the rtObject whenever the topology changes. It is also invoked when the node receives an update for the protocol. If the routine computes new routes, rtObject::compute-routes{} needs to be invoked to recompute and possibly install new routes at the node. The actual invoking of the rtObject is done by the procedure that invoked this routine in the first place.

- The instance procedure send-updates{} is invoked by the rtObject whenever the node routing tables have changed, and fresh updates have to be sent to all peers. The rtObject passes as argument the number of changes that were done. This procedure may also be invoked when there are no changes to the routes, but the topology incident on the node changes state. The number of changes is used to determine the list of peers to which a route update must be sent.

Other procedures relate to responding to topology changes and are described later (Section B.4.2).

Other Extensions to the Simulator, Node, Link, and Classifier

- We have discussed the methods rtproto{} and cost{} in the class Simulator earlier (Section B.1). The one other method used internally is get-routelogic{}; this procedure returns the instance of routelogic in the simulation. The method is used by the class Simulator, and unicast and multicast routing.

- The class Node contains these additional instance procedures to support dynamic unicast routing: init-routing{}, add-routes{}, delete-routes{}, and rtObject?{}. The instance procedure init-routing{} is invoked by the rtObject at the node. It stores a pointer to the rtObject, in its instance variable rtObject_, for later manipulation or retrieval. It also checks its class variable to see if it should use multipath routing, and sets up an instance variable to that effect.
If multipath routing could be used, the instance variable array routes_ stores a count of the number of paths installed for each destination. This is the only array in unicast routing that is indexed by the node id, rather than the node handle. The instance procedure rtObject?{} returns the rtObject handle for that node.

The instance procedure add-routes{} takes a node id, and a list of links. It will add the list of links as the routes to reach the destination identified by the node id. The realisation of multipath routing is done by using a separate Classifier/multiPath. For any given destination id d, if this node has multiple paths to d, then the main classifier points to this multipath classifier instead of the link to reach the destination. Each of the multiple paths identified by the interfaces being used is installed in the multipath classifier. The multipath classifier will use each of the links installed in it for succeeding packets forwarded to it.

The instance procedure delete-routes{} takes a node id, a list of interfaces, and a nullAgent. It removes each of the interfaces in the list from the installed list of interfaces. If the entry did not previously use a multipath classifier, then it must have had only one route, and the route entry is set to point to the nullAgent specified.

- The main extension to the class Link for unicast routing is to support the notion of link costs. The instance variable cost_ contains the cost of the unidirectional link. The instance procedures cost{} and cost?{} set and get the cost on the link. Note that cost{} takes the cost as argument. It is preferable to use the simulator method to set the cost variable, similar to the simulator instance procedures to set the queue or delay on a link.
- The class Classifier contains three new procedures, two of which overload an existing instproc-like, and the third provides new functionality. The instance procedure install{} overloads the existing instproc-like of the same name. The procedure stores the entry being installed in the instance variable array, elements_, and then invokes the instproc-like. The instance procedure installNext{} also overloads the existing instproc-like of the same name. This instproc-like simply installs the entry into the next available slot. The instance procedure adjacents{} returns a list of (key, value) pairs of all elements installed in the classifier.

B.4.2 Interface to Network Dynamics and Multicast

This section describes the methods applied in unicast routing to respond to changes in the topology. The complete sequence of actions that cause the changes in the topology, and fire the appropriate actions is described in a different section. The response to topology changes falls into two categories: actions taken by individual agents at each of the nodes, and actions to be taken globally for the entire protocol.

Detailed routing protocols such as the DV implementation require actions to be performed by individual protocol agents at the affected nodes. Centralized routing protocols such as static and session routing fall into the latter category exclusively. Detailed routing protocols could use such techniques to gather statistics related to the operation of the routing protocol; however, no such code is currently implemented in ns.

Actions at the individual nodes

Following any change in the topology, the network dynamics models will first invoke rtObject::intf-changed{} at each of the affected nodes.
For each of the unicast routing protocols operating at that node, rtObject::intf-changed{} will invoke each individual protocol's instance procedure, intf-changed{}, followed by that protocol's compute-routes{}. After each protocol has computed its individual routes, rtObject::intf-changed{} invokes compute-routes{} to possibly install new routes. If new routes were installed in the node, rtObject::compute-routes{} will invoke send-updates{} for each of the protocols operating at the node. The procedure will also flag the multicast route object of the route changes at the node, indicating the number of changes that have been executed. rtObject::flag-multicast{} will, in turn, notify the multicast route object to take appropriate action.

The one exception to the interface between unicast and multicast routing is the interaction between dynamic dense mode multicast and detailed unicast routing. This dynamicDM implementation in ns assumes neighbour nodes will send an implicit update whenever their routes change, without actually sending the update. It then uses this implicit information to compute appropriate parent-child relationships for the multicast spanning trees. Therefore, detailed unicast routing will invoke rtObject::flag-multicast 1 whenever it receives a route update as well, even if that update does not result in any change in its own routing tables.

Global Actions

Once the detailed actions at each of the affected nodes are completed, the network dynamics models will notify the RouteLogic instance (RouteLogic::notify{}) of changes to topology. This procedure invokes the procedure compute-all{} for each of the protocols that were ever installed at any of the nodes. Centralized routing protocols such as session routing use this signal to recompute the routes to the topology.
Finally, the RouteLogic::notify{} procedure notifies any instances of centralised multicast that are operating at the node.

B.5 Protocol Internals

In this section, we describe any leftover details of each of the routing protocol agents. Note that this is the only place where we describe the internal route protocol agent, "Direct" routing.

Direct Routing

This protocol tracks the state of the incident links, and maintains routes to immediately adjacent neighbours only. As with the other protocols, it maintains instance variable arrays of nextHop_, rtpref_, and metric_, indexed by the handle of each of the possible destinations in the topology. The instance procedure compute-routes{} computes routes based on the current state of the link, and the previously known state of the incident links. No other procedures or instance procedures are defined for this protocol.

Static Routing

The procedure compute-routes{} in the class RouteLogic first creates the adjacency matrix, and then invokes the C++ method, compute_routes() of the shadow object. Finally, the procedure retrieves the result of the route computation, and inserts the appropriate routes at each of the nodes in the topology. The class only defines the procedure init-all{} that invokes compute-routes{}.

Session Routing

The class defines the procedure init-all{} to compute the routes at the start of the simulation. It also defines the procedure compute-all{} to compute the routes when the topology changes. Each of these procedures directly invokes compute-routes{}.

DV Routing

In a dynamic routing strategy, nodes send and receive messages, and compute the routes in the topology based on the messages exchanged. The procedure init-all{} takes a list of nodes as the argument; the default is the list of nodes in the topology.
At each of the nodes in the argument, the procedure starts an rtObject and an Agent/rtProto/DV agent. It then determines the DV peers for each of the newly created DV agents, and creates the relevant rtPeer objects.

The constructor for the DV agent initialises a number of instance variables; each agent stores an array, indexed by the destination node handle, of the preference and metric, the interface (or link) to the next hop, and the remote peer incident on the interface, for the best route to each destination computed by the agent. The agent creates these instance variables, and then schedules sending its first update within the first 0.5 seconds of simulation start.

Each agent stores the list of its peers indexed by the handle of the peer node. Each peer is a separate peer structure that holds the address of the peer agent, and the metric and preference of the route to each destination advertised by that peer. We discuss the rtPeer structure later when we discuss the route architecture. The peer structures are initialised by the procedure add-peer{} invoked by init-all{}.

The routine send-periodic-update{} invokes send-updates{} to send the actual updates. It then reschedules sending the next periodic update after advertInterval, jittered slightly to avoid possible synchronisation effects.

send-updates{} will send updates to a select set of peers. If any of the routes at that node have changed, or for periodic updates, the procedure will send updates to all peers. Otherwise, if some incident links have just recovered, the procedure will send updates to the adjacent peers on those incident links only.

send-updates{} uses the procedure send-to-peer{} to send the actual updates. This procedure packages the update, taking the split-horizon and poison reverse mechanisms into account.
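The packaging step can be illustrated with a small, language-neutral sketch. The following Python fragment is not the ns code; it only shows the general split-horizon and poison reverse technique for distance vector updates. The function name, the shape of the route table, and the INFINITY value are assumptions made for illustration.

```python
# Hypothetical "unreachable" metric; the value is an assumption, not taken from ns.
INFINITY = 32


def build_update(routes, peer, poison_reverse=True):
    """Build the update that would be advertised to one peer.

    routes maps destination -> (metric, next_hop_peer).
    Routes learned via `peer` are either omitted (plain split horizon)
    or advertised back as unreachable (poison reverse).
    """
    update = {}
    for dest, (metric, via) in routes.items():
        if via == peer:
            if poison_reverse:
                update[dest] = INFINITY  # tell the peer not to route back through us
            continue  # plain split horizon: say nothing about this route
        update[dest] = metric
    return update


# Example table: node 3 is reached through peer "n2", node 0 is adjacent.
example_routes = {"n3": (2, "n2"), "n0": (1, "n0")}
to_n2 = build_update(example_routes, "n2")
```

With poison reverse enabled, the route that was learned from the peer is still listed, but with the unreachable metric; with it disabled, that route is simply left out of the update.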
send-to-peer{} invokes the instproc-like, send-update{} (note the singular case), to send the actual update. The actual route update is stored in the class variable msg_, indexed by a non-decreasing integer. The instproc-like only sends the index to msg_ to the remote peer. This eliminates the need to convert from OTcl strings to alternate formats and back.

When a peer receives a route update, it first checks to determine if the update differs from the previous ones. The agent will compute new routes if the update contains new information.

Appendix C

Network Dynamics in ns

This chapter describes the capabilities in ns to make the simulation topologies dynamic. We start with the instance procedures to the class Simulator that are useful to a simulation script (Section C.1). The next section describes the internal architecture (Section C.2), including the different classes and instance variables and procedures; the following section describes the interaction with unicast routing (Section C.3). This aspect of network dynamics is still somewhat experimental in ns. The last section of this chapter outlines some of the deficiencies in the current realisation (Section C.4) of network dynamics, some of which may be fixed in the future.

The procedures and functions described in this chapter can be found in ~ns/tcl/rtglib/dynamics.tcl and ~ns/tcl/lib/route-proto.tcl.

C.1 The user level API

The user level interface to network dynamics is a collection of instance procedures in the class Simulator, and one procedure to trace and log the dynamics activity. Reflecting a rather poor choice of names, these procedures are rtmodel, rtmodel-delete, and rtmodel-at. There is one other procedure, rtmodel-configure, that is used internally by the class Simulator to configure the rtmodels just prior to simulation start.
We describe this method later (Section C.2).

- The instance procedure rtmodel{} defines a model to be applied to the nodes and links in the topology. Some examples of this command as it would be used in a simulation script are:

    $ns rtmodel Exponential 0.8 1.0 1.0 $n1
    $ns rtmodel Trace dynamics.trc $n2 $n3
    $ns rtmodel Deterministic 20.0 20.0 $node(1) $node(5)

The procedure requires at least three arguments:

• The first two arguments define the model that will be used, and the parameters to configure the model. The currently implemented models in ns are Exponential (On/Off), Deterministic (On/Off), Trace (driven), or Manual (one-shot) models.

• The number, format, and interpretation of the configuration parameters is specific to the particular model.

1. The exponential on/off model takes four parameters: <[start time], up interval, down interval, [finish time]>. <start time> defaults to 0.5s from the start of the simulation, and <finish time> defaults to the end of the simulation. <up interval> and <down interval> specify the mean of the exponential distribution defining the time that the node or link will be up and down respectively. The default up and down interval values are 10s and 1s respectively. Any of these values can be specified as "-" to default to the original value. The following are example specifications of parameters to this model:

    0.8 1.0 1.0 ;# start at 0.8s, up/down = 1.0s, finish is default
    5.0 0.5     ;# start is default, up/down = 5.0s, 0.5s, finish is default
    - 0.7       ;# start, up interval are default, down = 0.7s, finish is default
    - - - 10    ;# start, up, down are default, finish at 10s

2.
The deterministic on/off model is similar to the exponential model above, and takes four parameters: <[start time], up interval, down interval, [finish time]>. Only the interpretation of the up and down interval is different; <up interval> and <down interval> specify the exact duration that the node or link will be up and down respectively. The default values for these parameters are: <start time> is 0.5s from start of simulation, <up interval> is 2.0s, <down interval> is 1.0s, and <finish time> is the duration of the simulation.

3. The trace driven model takes one parameter: the name of the trace file. The format of the input trace file is identical to that output by the dynamics trace modules, viz., v <time> link-<operation> <node1> <node2>. Lines that do not correspond to the node or link specified are ignored.

    v 0.8123 link-up 3 5
    v 3.5124 link-down 3 5

4. The manual one-shot model takes two parameters: the operation to be performed, and the time that it is to be performed.

• The rest of the arguments to the rtmodel{} procedure define the node or link that the model will be applied to. If only one node is specified, it is assumed that the node will fail. This is modeled by making the links incident on the node fail. If two nodes are specified, then the command assumes that the two are adjacent to each other, and the model is applied to the link incident on the two nodes. If more than two nodes are specified, only the first is considered; the subsequent arguments are ignored.

• instance variable, traceAllFile_ is set.

The command returns the handle to the model that was created in this call. Internally, rtmodel{} stores the list of route models created in the class Simulator instance variable, rtModel_.
- The instance procedure rtmodel-delete{} takes the handle of a route model as argument, removes it from the rtModel_ list, and deletes the route model.

- The instance procedure rtmodel-at{} is a special interface to the Manual model of network dynamics. The command takes the time, operation, and node or link as arguments, and applies the operation to the node or link at the specified time. Example uses of this command are:

    $ns rtmodel-at 3.5 up $n0
    $ns rtmodel-at 3.9 up $n(3) $n(5)
    $ns rtmodel-at 40 down $n4

Finally, the instance procedure trace-dynamics{} of the class rtModel enables tracing of the dynamics effected by this model. It is used as:

    set fh [open "dyn.tr" w]
    $rtmodel1 trace-dynamics $fh
    $rtmodel2 trace-dynamics $fh
    $rtmodel1 trace-dynamics stdout

In this example, $rtmodel1 writes out trace entries to both dyn.tr and stdout; $rtmodel2 only writes out trace entries to dyn.tr. A typical sequence of trace entries written out by either model might be:

    v 0.8123 link-up 3 5
    v 0.8123 link-up 5 3
    v 3.5124 link-down 3 5
    v 3.5124 link-down 5 3

These lines indicate that Link (3, 5) came up at 0.8123s, and went down at 3.5124s.

C.2 The Internal Architecture

Each model of network dynamics is implemented as a separate class, derived from the base class rtModel. We begin by describing the base class rtModel and the derived classes (Section C.2.1). The network dynamics models use an internal queueing structure, the class rtQueue, to ensure that simultaneous events are correctly handled. The next subsection (Section C.2.2) describes the internals of this structure. Finally, we describe the extensions to the existing classes (Section C.3.1): the Node, Link, and others.
C.2.1 The class rtModel

To use a new route model, the routine rtmodel{} creates an instance of the appropriate type, defines the node or link that the model will operate upon, configures the model, and possibly enables tracing. The individual instance procedures that accomplish this in pieces are:

The constructor for the base class stores a reference to the Simulator in its instance variable, ns_. It also initialises the startTime_ and finishTime_ from the class variables of the same name.

The instance procedure set-elements identifies the node or link that the model will operate upon. The command stores two arrays: links_, of the links that the model will act upon; nodes_, of the incident nodes that will be affected by the link failure or recovery caused by the model.

The default procedure in the base class to set the model configuration parameters is set-parms. It assumes a well defined start time, up interval, down interval, and a finish time, and sets up configuration parameters for some class of models. It stores these values in the instance variables: startTime_, upInterval_, downInterval_, finishTime_. The exponential and deterministic models use this default routine; the trace based and manual models define their own procedures.

The instance procedure trace{} enables trace-dynamics{} on each of the links that it affects. Additional details on trace-dynamics{} are discussed in the section on extensions to the class Link (Section C.3.1).

The next sequence of configuration steps is taken just prior to the start of the simulator. ns invokes rtmodel-configure{} just before starting the simulation. This instance procedure first acquires an instance of the class rtQueue, and then invokes configure{} for each route model in its list, rtModel_.

The instance procedure configure{} makes each link that it is applied to dynamic; this is the set of links stored in its instance variable array, links_. Then the procedure schedules its first event.
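The default parameter handling of set-parms, combined with the example specifications given for the exponential model in Section C.1, might be modeled as follows. This is a minimal Python sketch, not the OTcl procedure itself. The positional rule (two values mean up and down, three add the start time, four add the finish time) is inferred from the four examples only, and representing "end of the simulation" as infinity is an assumption.

```python
import math

# Defaults stated for the exponential model; finish = "end of simulation"
# is represented here as infinity, which is an assumption of this sketch.
DEFAULTS = {"start": 0.5, "up": 10.0, "down": 1.0, "finish": math.inf}


def parse_on_off_params(args, defaults=DEFAULTS):
    """Map a parameter list such as ["-", "0.7"] onto the four model values.

    A "-" keeps the default for that position.
    """
    args = list(args)
    if len(args) == 2:
        names = ("up", "down")
    elif len(args) == 3:
        names = ("start", "up", "down")
    elif len(args) == 4:
        names = ("start", "up", "down", "finish")
    else:
        raise ValueError("expected 2, 3, or 4 parameters")
    params = dict(defaults)
    for name, value in zip(names, args):
        if value != "-":
            params[name] = float(value)
    return params


params = parse_on_off_params(["-", "0.7"])  # start and up stay at their defaults
```

The deterministic model would use the same parsing with its own default intervals passed in as `defaults`.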
The default instance procedure set-first-event{} schedules the first event to take all the links "down" at $startTime_ + $upInterval_. Individual types of route models derived from this base class should redefine this function.

Two instance procedures in the base class, set-event{} and set-event-exact{}, can be used to schedule events in the route queue. set-event{interval, operation} schedules operation after interval seconds from the current time; it uses the procedure set-event-exact{} below. set-event-exact{fireTime, operation} schedules operation to execute at fireTime. If the time for execution is greater than the finishTime_, then the only possible action is to take a failed link "up."

Finally, the base class provides the methods to take the links up{} or down{}. Each method invokes the appropriate procedure on each of the links in the instance variable, links_.

Exponential

The model schedules its first event to take the links down at startTime_ + E(upInterval_). It also defines the procedures, up{} and down{}; each procedure invokes the base class procedure to perform the actual operation. This routine then reschedules the next event at E(upInterval_) or E(downInterval_) respectively.

Deterministic

The model defines the procedures, up{} and down{}; each procedure invokes the base class procedure to perform the actual operation. This routine then reschedules the next event at upInterval_ or downInterval_ respectively.

Trace

The model redefines the instance procedure set-parms{} to open a trace file, and set events based on that input. The instance procedure get-next-event{} returns the next valid event from the trace file. A valid event is an event that is applicable to one of the links in this object's links_ variable.
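The filtering of valid events can be sketched in a few lines. The Python fragment below is an illustration rather than the OTcl implementation. It assumes the trace line layout seen in the example entries of Section C.1 (v, time, link-operation, two node ids), and it assumes that links are identified by ordered (from, to) pairs of integer node ids; both function names are hypothetical.

```python
def parse_trace_line(line):
    """Return (time, operation, node_a, node_b), or None for other lines."""
    fields = line.split()
    if len(fields) != 5 or fields[0] != "v" or not fields[2].startswith("link-"):
        return None
    operation = fields[2][len("link-"):]
    return float(fields[1]), operation, int(fields[3]), int(fields[4])


def valid_events(lines, links):
    """Yield only the events that apply to one of the given links.

    links is a set of (from, to) node id pairs, standing in for links_.
    """
    for line in lines:
        event = parse_trace_line(line)
        if event is not None and (event[2], event[3]) in links:
            yield event


trace = [
    "v 0.8123 link-up 3 5",
    "v 0.8123 link-up 5 3",
    "v 1.0 link-down 1 2",
]
events_for_3_5 = list(valid_events(trace, {(3, 5)}))
```

Lines for other links, and lines that do not match the layout, are skipped silently, mirroring the statement that non-matching trace lines are ignored.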
The instance procedure set-trace-events{} uses get-next-event{} to schedule the next valid event. The model redefines set-first-event{}, up{}, and down{} to use set-trace-events{}.

Manual

The model is designed to fire exactly once. The instance procedure set-parms{} takes an operation and the time to execute that operation as arguments. set-first-event{} will schedule the event at the appropriate moment. This routine also redefines notify{} to delete the object instance when the operation is completed. This notion of the object deleting itself is fragile code. Since the object only fires once and does not have to be rescheduled, it does not overload the procedures up{} or down{}.

C.2.2 class rtQueue

The simulator needs to co-ordinate multiple simultaneous network dynamics events, especially to ensure the right coherent behaviour. Hence, the network dynamics models use their own internal route queue to schedule dynamics events. There is one instance of this object in the simulator, in the class Simulator instance variable rtq_.

The queue object stores an array of queued operations in its instance variable, rtq_. The index is the time at which the event will execute. Each element is the list of operations that will execute at that time.

The instance procedures insq{} and insq-i{} can insert an element into the queue. The first argument is the time at which this operation will execute. insq{} takes the exact time as argument; insq-i{} takes the interval as argument, and schedules the operation interval seconds after the current time. The following arguments specify the object, $obj, the instance procedure of that object, $iproc, and the arguments to that procedure, $args. These arguments are placed into the route queue for execution at the appropriate time.
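The queue structure just described, an array indexed by fire time whose elements are lists of operations, can be modeled roughly as below. This Python sketch is an assumption-laden illustration: the `now` attribute stands in for the simulator clock, `run_next` is a hypothetical helper that only shows how all operations sharing a fire time are executed together, and `FakeLink` is a stand-in object, not an ns class.

```python
from collections import defaultdict


class RouteQueue:
    """Toy model of a time-indexed queue of (object, procedure, args) entries."""

    def __init__(self):
        self.now = 0.0                 # stand-in for the simulator clock
        self.rtq = defaultdict(list)   # fire time -> list of queued operations

    def insq(self, at, obj, iproc, *args):
        """Queue obj.iproc(*args) at the exact time `at`."""
        self.rtq[at].append((obj, iproc, args))

    def insq_i(self, interval, obj, iproc, *args):
        """Queue obj.iproc(*args) `interval` seconds after the current time."""
        self.insq(self.now + interval, obj, iproc, *args)

    def run_next(self):
        """Advance to the earliest fire time and run everything queued for it."""
        at = min(self.rtq)  # raises ValueError if the queue is empty
        self.now = at
        for obj, iproc, args in self.rtq.pop(at):
            getattr(obj, iproc)(*args)
        return at


class FakeLink:
    """Stand-in for a link; it only records the operations applied to it."""

    def __init__(self):
        self.log = []

    def up(self):
        self.log.append("up")

    def down(self):
        self.log.append("down")
```

Because operations are grouped by fire time, two models that act at the same instant are handled in one pass, which is the property the route queue exists to provide.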
The instance procedure runq{} executes eval $obj $iproc $args at the appropriate instant. After all the events for that instant are executed, runq{} will notify{} each object about the execution.

Finally, the instance procedure delq{} can remove a queued action with the time and the name of the object.

C.3 Interaction with Unicast Routing

In an earlier section, we had described how unicast routing reacts (Section B.4.2) to changes to the topology. This section details the steps by which the network dynamics code will notify the nodes and routing about the changes to the topology.

1. rtQueue::runq{} will invoke the procedures specified by each of the route model instances. After all of the actions are completed, runq{} will notify each of the models.

2. notify{} will then invoke instance procedures at all of the nodes that were incident to the affected links. Each route model stores the list of nodes in its instance variable array, nodes_. It will then notify the RouteLogic instance of topology changes.

3. The rtModel object invokes the class Node instance procedure intf-changed{} for each of the affected nodes.

4. Node::intf-changed{} will notify any rtObject at the node of the possible changes to the topology. Recall that these route objects are created when the simulation uses detailed dynamic unicast routing.

C.3.1 Extensions to Other Classes

The existing classes assume that the topology is static by default. In this section, we document the necessary changes to these classes to support dynamic topologies.

We have already described the instance procedures in the class Simulator to create or manipulate route models, i.e., rtmodel{}, rtmodel-at{}, rtmodel-delete{}, and rtmodel-configure{} in earlier sections (Section C.2.1).
Similarly, the class Node contains the instance procedure intf-changed{} that we described in the previous section (Section C.3).

The network dynamics code operates on individual links. Each model currently translates its specification into operations on the appropriate links. The following paragraphs describe the class Link and related classes.

class DynamicLink

This class is the only TclObject in the network dynamics code. The shadow class is called class DynaLink. The class supports one bound variable, status_. status_ is 1 when the link is up, and 0 when the link is down. The shadow object's recv() method checks the status_ variable, to decide whether or not a packet should be forwarded.

class Link

This class supports the primitives: up, down, and up? to set and query status_. These primitives are instance procedures of the class. The instance procedures up{} and down{} set status_ to 1 and 0 respectively. In addition, when the link fails, down{} will reset all connectors that make up the link. Each connector, including all queues and the delay object, will flush and drop any packets that it currently stores. This emulates the packet drop due to link failure. Both procedures then write trace entries to each file handle in the list, dynT_.

The instance procedure up?{} returns the current value of status_.

In addition, the class contains the instance procedure all-connectors{}. This procedure takes an operation as argument, and applies the operation uniformly to all of the class instance variables that are handles for TclObjects.

class SimpleLink

The class supports two instance procedures dynamic{} and trace-dynamics{}. We have already described the latter procedure when describing the trace{} procedure in the class rtModel. The instance procedure dynamic{} inserts a DynamicLink object at the head of the queue.
It points the down-target of the object to the drop target of the link, drpT_, if the object is defined, or to the nullAgent_ in the simulator. It also signals each connector in the link that the link is now dynamic. Most connectors ignore this signal to become dynamic; the exception is the DelayLink object. This object will normally schedule each packet it receives for reception by the destination node at the appropriate time. When the link is dynamic, the object will queue each packet internally; it schedules only one event for the next packet that will be delivered, instead of one event per packet normally. If the link fails, the route model will signal a reset, at which point the shadow object will execute its reset instproc-like, and flush all packets in its internal queue.

C.4 Deficiencies in the Current Network Dynamics API

There are a number of deficiencies in the current API that should be changed in the next iteration:

1. There is no way to specify a cluster of nodes or links that behave in lock-step dynamic synchrony.

2. Node failure should be dealt with as its own mechanism, rather than a second grade citizen of link failure. This shows up in a number of situations, such as:
   (a) The method of emulating node failure as the failure of the incident links is broken. Ideally, node failure should cause all agents incident on the node to be reset.
   (b) There is no tracing associated with node failure.

3. If two distinct route models are applied to two separate links incident on a common node, and the two links experience a topology change at the same instant, then the node will be notified more than once.

Appendix D

The SRM Agent in ns

This chapter describes the internals of the SRM implementation in ns.
The chapter is in three parts: the first part is an overview of a minimal SRM configuration, and a "complete" description of the configuration parameters of the base SRM agent. The second part describes the architecture, internals, and the code path of the base SRM agent. The last part of the chapter is a description of the extensions for other types of SRM agents that have been attempted to date.

The procedures and functions described in this chapter can be found in ~ns/tcl/mcast/srm.tcl, ~ns/tcl/mcast/srm-adaptive.tcl, ~ns/tcl/mcast/srm-nam.tcl, ~ns/tcl/mcast/srm-debug.tcl, and ~ns/srm.{cc, h}.

D.1 Configuration

Running an SRM simulation requires creating and configuring the agent, attaching an application-level data source (a traffic generator), and starting the agent and the traffic generator.

D.1.1 Trivial Configuration

Creating the Agent

    set ns [new Simulator]            ;# preamble initialization
    $ns enableMcast
    set node [$ns node]               ;# agent to reside on this node
    set group [$ns allocaddr]         ;# multicast group for this agent

    set srm [new Agent/SRM]
    $srm set dst_ $group              ;# configure the SRM agent
    $ns attach-agent $node $srm

    $srm set fid_ 1                   ;# optional configuration
    $srm log [open srmStats.tr w]     ;# log statistics in this file
    $srm trace [open srmEvents.tr w]  ;# trace events for this agent

The key steps in configuring a virgin SRM agent are to assign its multicast group, and attach it to a node. Other useful configuration parameters are to assign a separate flow id to traffic originating from this agent, to open a log file for statistics, and a trace file for trace data.[1]

The file tcl/mcast/srm-nam.tcl contains definitions that overload the agent's send methods; this separates control traffic originating from the agent by type. Each type is allocated a separate flowID.
The traffic is separated into session messages (flowid = 40), requests (flowid = 41), and repair messages (flowid = 42). The base flowid can be changed by setting the global variable ctrlFid to one less than the desired flowid before sourcing srm-nam.tcl. To do this, the simulation script must source srm-nam.tcl before creating any SRM agents. This is useful for analysis of traffic traces, or for visualization in nam.

Application Data Handling

The agent does not generate any application data on its own; instead, the simulation user can connect any traffic generation module to any SRM agent to generate data. The following code demonstrates how a traffic generation agent can be attached to an SRM agent:

    set packetSize 210
    set exp0 [new Traffic/Expoo]            ;# configure traffic generator
    $exp0 set packet-size $packetSize
    $exp0 set burst-time 500ms
    $exp0 set idle-time 500ms
    $exp0 set rate 100k

    set s0 [new Agent/CBR/UDP]              ;# attach traffic generator to application
    $s0 set fid_ 0
    $s0 attach-traffic $exp0

    $srm(0) traffic-source $s0              ;# attach application to SRM agent
    $srm(0) set packetSize_ $packetSize     ;# to generate repair packets of appropriate size

The instproc traffic-source specifies the application agent that will produce data for the SRM agent. The user can attach any agent; the only distinguishing criterion is that the destination address must be zero. The SRM agent will add the SRM headers, set the destination address to the multicast group, and deliver the packet to its target. The SRM header contains the type of the message, the identity of the sender, the sequence number of the message, and (for control messages) the round for which this message is being sent. Each data unit in SRM is identified as (sender’s id, message sequence number). The SRM agent does not generate its own data; nor does it keep track of the data sent, except to record the sequence numbers of messages received in the event that it has to do error recovery.
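The handling of locally originated data described above can be sketched in C++ as follows. This is a minimal illustration under stated assumptions, not the ns implementation: the types SrmType, SrmHeader, and SketchPacket, the function stampIfLocal, and the caller-supplied sequence counter are all hypothetical names introduced here. Only the header fields and the "destination address zero" rule come from the text.

```cpp
#include <cstdint>

// Hypothetical message types; the text names data, request, repair, and session.
enum class SrmType { Data, Request, Repair, Session };

// The header fields listed in the text (field names are illustrative).
struct SrmHeader {
    SrmType type;
    int sender;  // identity of the original sender
    int seqno;   // message sequence number
    int round;   // only meaningful for control messages
};

// Stand-in for a simulator packet: a destination plus an optional SRM header.
struct SketchPacket {
    std::int32_t dst = 0;
    bool hasSrm = false;
    SrmHeader srm{SrmType::Data, 0, 0, 0};
};

// Returns true if the packet was treated as locally originated data.
// A destination of zero marks data from the attached application, so the
// agent stamps an SRM header and readdresses the packet to the group.
// Assumption: the caller owns the counter that supplies sequence numbers.
inline bool stampIfLocal(SketchPacket& p, int self, std::int32_t group, int& nextSeq) {
    if (p.dst != 0) {
        return false;  // arrived from the multicast group; processed elsewhere
    }
    p.srm = SrmHeader{SrmType::Data, self, nextSeq++, 0};
    p.hasSrm = true;
    p.dst = group;
    return true;
}
```

A data unit is then identified by the pair (srm.sender, srm.seqno), matching the (sender's id, message sequence number) identification given above.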
¹ Note that the trace data can also be used to gather certain kinds of trace data. We will illustrate this later.

Since the agent has no actual record of past data, it needs to know what packet size to use for each repair message. Hence, the instance variable packetSize_ specifies the size of repair messages generated by the agent.

Starting the Agent and Traffic Generator

The agent and the traffic generator must be started separately.

    $srm start
    $srm start-source

At start, the agent joins the multicast group, and starts generating session messages. The start-source triggers the traffic generator to start sending data.

D.1.2 Other Configuration Parameters

In addition to the above parameters, the SRM agent supports additional configuration variables. Each of the variables described in this section is both an OTcl class variable and an OTcl object’s instance variable. Changing the class variable changes the default value for all agents that are created subsequently. Changing the instance variable of a particular agent only affects the values used by that agent.
For example,

    Agent/SRM set D1_ 2.0    ;# Changes the class variable
    $srm set D1_ 2.0         ;# Changes D1_ for the particular $srm object only

The default request and repair timer parameters [42] for each SRM agent are:

    Agent/SRM set C1_ 2.0    ;# request parameters
    Agent/SRM set C2_ 2.0
    Agent/SRM set D1_ 1.0    ;# repair parameters
    Agent/SRM set D2_ 1.0

It is thus possible to trivially obtain two flavours of SRM agents based on whether the agents use probabilistic or deterministic suppression by using the following definitions:

    Class Agent/SRM/Deterministic -superclass Agent/SRM
    Agent/SRM/Deterministic set C2_ 0.0
    Agent/SRM/Deterministic set D2_ 0.0

    Class Agent/SRM/Probabilistic -superclass Agent/SRM
    Agent/SRM/Probabilistic set C1_ 0.0
    Agent/SRM/Probabilistic set D1_ 0.0

In a later section (Section D.7), we will discuss other ways of extending the SRM agent.

Timer related functions are handled by separate objects belonging to the class SRM. Timers are required for loss recovery and sending periodic session messages. There are loss recovery objects to send request and repair messages. The agent creates a separate request or repair object to handle each loss. In contrast, the agent only creates one session object to send periodic session messages. The default classes that express each of these functions are:

    Agent/SRM set requestFunction_ "SRM/request"
    Agent/SRM set repairFunction_ "SRM/repair"
    Agent/SRM set sessionFunction_ "SRM/session"

    Agent/SRM set requestBackoffLimit_ 5    ;# parameter to requestFunction_
    Agent/SRM set sessionDelay_ 1.0         ;# parameter to sessionFunction_

The instance procedures requestFunction{}, repairFunction{}, and sessionFunction{} can be used to change the default function for individual agents. The last two lines are specific parameters used by the request and session objects.
The following section (Section D.2) describes the implementation of these objects in greater detail.

D.1.3 Statistics

Each agent tracks two sets of statistics: statistics to measure the response to data loss, and overall statistics for each request/repair. In addition, there are methods to access other information from the agent.

Data Loss

The statistics to measure the response to data losses track the duplicate requests (and repairs), and the average request (and repair) delay. The algorithm used is documented in Floyd et al. [42]. In this algorithm, each new request (or repair) starts a new request (or repair) period. During the request (or repair) period, the agent measures the number of first round duplicate requests (or repairs) until the round terminates either due to receiving a request (or repair), or due to the agent sending one. The following code illustrates how the user can simply retrieve the current values in an agent:

    set statsList [$srm array get statistics_]
    array set statsArray [$srm array get statistics_]

The first form returns a list of key-value pairs. The second form loads the list into the statsArray for further manipulation. The keys of the array are dup-req, ave-dup-req, req-delay, ave-req-delay, dup-rep, ave-dup-rep, rep-delay, and ave-rep-delay.

Overall Statistics

In addition, each loss recovery and session object keeps track of times and statistics. In particular, each object records its startTime, serviceTime, and distance, as are relevant to that object; startTime is the time that this object was created, serviceTime is the time for this object to complete its task, and the distance is the one-way time to reach the remote peer.

For request objects, startTime is the time a packet loss is detected, serviceTime is the time to finally receive that packet, and distance is the distance to the original sender of the packet.
For repair objects, startTime is the time that a request for retransmission is received, serviceTime is the time to send a repair, and the distance is the distance to the original requester. For both types of objects, the serviceTime is normalised by the distance. For the session object, startTime is the time that the agent joins the multicast group; serviceTime and distance are not relevant.

Each object also maintains statistics particular to that type of object. Request objects track the number of duplicate requests and repairs received, the number of requests sent, and the number of times this object had to back off before finally receiving the data. Repair objects track the number of duplicate requests and repairs, as well as whether or not this object for this agent sent the repair. Session objects simply record the number of session messages sent.

The values of the timers and the statistics for each object are written to the log file every time an object completes the error recovery function it was tasked to do.
The format of this trace file is:

    (prefix) (id) (times) (stats)

where
    (prefix) is  (time) n (node id) m (msg id) r (round);
                 (msg id) is expressed as (source id:sequence number)
    (id) is      type (of object)
    (times) is   list of key-value pairs of startTime, serviceTime, distance
    (stats) is   list of key-value pairs of per object statistics:
                 dupRQST, dupREPR, #sent, backoff   for request objects
                 dupRQST, dupREPR, #sent            for repair objects
                 #sent                              for session objects

The following sample output illustrates the output file format (the lines have been folded to fit on the page):

    3.6274 n 0 m <1:1> r 1 type repair serviceTime 0.500222 \
        startTime 3.5853553333333332 distance 0.0105 \
        #sent 1 dupREPR 0 dupRQST 0
    3.6417 n 1 m <1:1> r 2 type request serviceTime 2.66406 \
        startTime 3.5542666666666665 distance 0.0105 backoff 1 \
        #sent 1 dupREPR 0 dupRQST 0
    3.6876 n 2 m <1:1> r 2 type request serviceTime 1.33406 \
        startTime 3.5685333333333333 distance 0.021 backoff 1 \
        #sent 0 dupREPR 0 dupRQST 0
    3.7349 n 3 m <1:1> r 2 type request serviceTime 0.876812 \
        startTime 3.5828000000000002 distance 0.032 backoff 1 \
        #sent 0 dupREPR 0 dupRQST 0
    3.7793 n 5 m <1:1> r 2 type request serviceTime 0.669063 \
        startTime 3.5970666666666671 distance 0.042 backoff 1 \
        #sent 0 dupREPR 0 dupRQST 0
    3.7808 n 4 m <1:1> r 2 type request serviceTime 0.661192 \
        startTime 3.5970666666666671 distance 0.0425 backoff 1 \
        #sent 0 dupREPR 0 dupRQST 0

Miscellaneous Information

Finally, the user can use the following methods to gather additional information about the agent:

• groupSize?{} returns the agent’s current estimate of the multicast group size.
• distances?{} returns a list of key-value pairs of distances; the key is the address of the agent, the value is the estimate of the distance to that agent. The first element is the address of this agent, and the distance of 0.
• distance?{} returns the distance to the particular agent specified as argument. The default distance at the start of any simulation is 1.

    $srm(i) groupSize?      ;# returns $srm(i)'s estimate of the group size
    $srm(i) distances?      ;# returns list of (address, distance) tuples
    $srm(i) distance? 257   ;# returns the distance to agent at address 257

D.1.4 Tracing

Each object writes out trace information that can be used to track the progress of the object in its error recovery. Each trace entry is of the form:

    (prefix) (tag) (type of entry) (values)

The prefix is as described in the previous section for statistics. The tag is Q for request objects, P for repair objects, and S for session objects. The following types of trace entries and parameters are written by each object:

    Tag  Type of entry  Other values                                     Comments
    Q    DETECT
    Q    INTERVALS      C1 (C1_) C2 (C2_) dist (distance) i (backoff_)
    Q    NTIMER         at (time)                                        Time the request timer will fire
    Q    SENDNACK
    Q    NACK           IGNORE-BACKOFF (time)                            Receive NACK, ignore other NACKs until (time)
    Q    REPAIR         IGNORES (time)                                   Receive REPAIR, ignore NACKs until (time)
    Q    DATA                                                            Agent receives data instead of repair. Possibly indicates out of order arrival of data.
    P    NACK           from (requester)                                 Receive NACK, initiate repair
    P    INTERVALS      D1 (D1_) D2 (D2_) dist (distance)
    P    RTIMER         at (time)                                        Time the repair timer will fire
    P    SENDREP
    P    REPAIR         IGNORES (time)                                   Receive REPAIR, ignore NACKs until (time)
    P    DATA                                                            Agent receives data instead of repair. Indicates premature request by an agent.
    S    SESSION                                                         Logs session message sent

The following illustrates a typical trace for a single loss and recovery.
    3.5543 n 1 m <1:1> r 0 Q DETECT
    3.5543 n 1 m <1:1> r 1 Q INTERVALS C1 2.0 C2 0.0 d 0.0105 i 1
    3.5543 n 1 m <1:1> r 1 Q NTIMER at 3.57527
    3.5685 n 2 m <1:1> r 0 Q DETECT
    3.5685 n 2 m <1:1> r 1 Q INTERVALS C1 2.0 C2 0.0 d 0.021 i 1
    3.5685 n 2 m <1:1> r 1 Q NTIMER at 3.61053
    3.5753 n 1 m <1:1> r 1 Q SENDNACK
    3.5753 n 1 m <1:1> r 2 Q INTERVALS C1 2.0 C2 0.0 d 0.0105 i 2
    3.5753 n 1 m <1:1> r 2 Q NTIMER at 3.61727
    3.5753 n 1 m <1:1> r 2 Q NACK IGNORE-BACKOFF 3.59627
    3.5828 n 3 m <1:1> r 0 Q DETECT
    3.5828 n 3 m <1:1> r 1 Q INTERVALS C1 2.0 C2 0.0 d 0.032 i 1
    3.5828 n 3 m <1:1> r 1 Q NTIMER at 3.6468
    3.5854 n 0 m <1:1> r 0 P NACK from 257
    3.5854 n 0 m <1:1> r 1 P INTERVALS D1 1.0 D2 0.0 d 0.0105
    3.5854 n 0 m <1:1> r 1 P RTIMER at 3.59586
    3.5886 n 2 m <1:1> r 2 Q INTERVALS C1 2.0 C2 0.0 d 0.021 i 2
    3.5886 n 2 m <1:1> r 2 Q NTIMER at 3.67262
    3.5886 n 2 m <1:1> r 2 Q NACK IGNORE-BACKOFF 3.63062
    3.5959 n 0 m <1:1> r 1 P SENDREP
    3.5959 n 0 m <1:1> r 1 P REPAIR IGNORES 3.62736
    3.5971 n 4 m <1:1> r 0 Q DETECT
    3.5971 n 4 m <1:1> r 1 Q INTERVALS C1 2.0 C2 0.0 d 0.0425 i 1
    3.5971 n 4 m <1:1> r 1 Q NTIMER at 3.58207
    3.5971 n 5 m <1:1> r 0 Q DETECT
    3.5971 n 5 m <1:1> r 1 Q INTERVALS C1 2.0 C2 0.0 d 0.042 i 1
    3.5971 n 5 m <1:1> r 1 Q NTIMER at 3.68107
    3.6029 n 3 m <1:1> r 2 Q INTERVALS C1 2.0 C2 0.0 d 0.032 i 2
    3.6029 n 3 m <1:1> r 2 Q NTIMER at 3.73089
    3.6029 n 3 m <1:1> r 2 Q NACK IGNORE-BACKOFF 3.66689
    3.6102 n 1 m <1:1> r 2 Q REPAIR IGNORES 3.64171
    3.6172 n 4 m <1:1> r 2 Q INTERVALS C1 2.0 C2 0.0 d 0.0425 i 2
    3.6172 n 4 m <1:1> r 2 Q NTIMER at 3.78715
    3.6172 n 4 m <1:1> r 2 Q NACK IGNORE-BACKOFF 3.70215
    3.6172 n 5 m <1:1> r 2 Q INTERVALS C1 2.0 C2 0.0 d 0.042 i 2
    3.6172 n 5 m <1:1> r 2 Q NTIMER at 3.78515
    3.6172 n 5 m <1:1> r 2 Q NACK IGNORE-BACKOFF 3.70115
    3.6246 n 2 m <1:1> r 2 Q REPAIR IGNORES 3.68756
    3.6389 n 3 m <1:1> r 2 Q REPAIR IGNORES 3.73492
    3.6533 n 4 m <1:1> r 2 Q REPAIR IGNORES 3.78077
    3.6533 n 5 m <1:1> r 2 Q REPAIR IGNORES 3.77927

The logging of request and repair traces is done by SRM::evTrace{}. However, the routine SRM/session::evTrace{} overrides the base class definition of SRM::evTrace{}, and writes out nothing. Individual simulation scripts can override these methods for greater flexibility in logging options. One possible reason to override these methods might be to reduce the amount of data generated; the new procedure could then generate compressed and processed output.

Notice that the trace file contains sufficient information and details to derive most of the statistics written out in the log file, or stored in the statistics arrays.

D.2 Architecture and Internals

The SRM agent implementation splits the protocol functions into packet handling, loss recovery, and session message activity. Packet handling consists of forwarding application data messages, and sending and receiving control messages. These activities are executed by C++ methods. Error detection is done in C++ on receipt of messages. However, the loss recovery is entirely done through instance procedures in OTcl.

The sending and processing of messages is accomplished in C++; the policy about when these messages should be sent is decided by instance procedures in OTcl. We first describe the C++ processing due to receipt of messages (Section D.3). Loss recovery and the sending of session messages involve timer based processing. The agent uses a separate class SRM to perform the timer based functions. For each loss, an agent may do either request or repair processing. Each agent will instantiate a separate loss recovery object for every loss, as is appropriate for the processing that it has to do. In the following section we describe the basic timer based functions and the loss recovery mechanisms (Section D.5).
Finally, each agent uses one timer based function for sending periodic session messages (Section D.6).

D.3 Packet Handling: Processing Received Messages

The recv() method can receive four types of messages: data, request, repair, and session messages.

Data Packets

The agent does not generate any data messages. The user has to specify an external agent to generate traffic. The recv() method must distinguish between locally originated data that must be sent to the multicast group, and data received from the multicast group that must be processed. Therefore, the application agent must set the packet’s destination address to zero.

For locally originated data, the agent adds the appropriate SRM headers, sets the destination address to the multicast group, and forwards the packet to its target.

On receiving a data message from the group, recv_data(sender, msgid) will update its state marking message (sender, msgid) received, and possibly trigger requests if it detects losses. In addition, if the message was an older message received out of order, then there must be a pending request or repair that must be cleared. In that case, the compiled object invokes the OTcl instance procedure, recv-data{sender, msgid}.²

Currently, there is no provision for the receivers to actually receive any application data. Nor does the agent store any of the user data. It only generates repair messages of the appropriate size, defined by the instance variable packetSize_. However, the agent assumes that any application data is placed in the data portion of the packet, pointed to by packet->accessdata().

Request Packets

On receiving a request, recv_rqst(sender, msgid) will check whether it needs to schedule requests for other missing data.
If it has received this request before it was aware that the source had generated this data message (i.e., the sequence number of the request is higher than the last known sequence number of data from this source), then the agent can infer that it is missing this, as well as data from the last known sequence number onwards; it schedules requests for all of the missing data and returns. On the other hand, if the sequence number of the request is less than the last known sequence number from the source, then the agent can be in one of three states: (1) it does not have this data, and has a request pending for it, (2) it has the data, and has seen an earlier request, upon which it has a repair pending for it, or (3) it has the data, and it should instantiate a repair. All of these error recovery mechanisms are done in OTcl; recv_rqst() invokes the instance procedure recv-rqst{sender, msgid, requester} for further processing.

² Technically, recv_data() invokes the instance procedure recv data (sender) (msgid), which then invokes recv-data{}. The indirection allows individual simulation scripts to override the recv{} as needed.

Repair Packets

On receiving a repair, recv_repr(sender, msgid) will check whether it needs to schedule requests for other missing data. If it has received this repair before it was aware that the source had generated this data message (i.e., the sequence number of the repair is higher than the last known sequence number of data from this source), then the agent can infer that it is missing all data between the last known sequence number and that on the repair; it schedules requests for all of this data, marks this message as received, and returns.
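The gap inference described for requests and repairs can be sketched as a small C++ helper. This is an illustrative sketch, not the code in srm.cc: the function name inferMissing and its parameters are hypothetical, and it assumes that the message carrying the last known sequence number has itself been accounted for, so the gap starts one past it.

```cpp
#include <vector>

// Sequence numbers to schedule requests for when a control message reveals a gap.
//   lastKnown  : highest data sequence number known from this source
//   seq        : sequence number carried by the received request or repair
//   includeSeq : true for a request (the named message is missing as well),
//                false for a repair (the repair itself delivers message seq)
// Returns an empty list when seq does not exceed lastKnown.
inline std::vector<int> inferMissing(int lastKnown, int seq, bool includeSeq) {
    std::vector<int> missing;
    const int last = includeSeq ? seq : seq - 1;
    for (int id = lastKnown + 1; id <= last; ++id) {
        missing.push_back(id);
    }
    return missing;
}
```

For example, with a last known sequence number of 4, a request for message 7 yields requests for 5, 6, and 7, while a repair for message 7 yields requests for 5 and 6 only.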
On the other hand, if the sequence number of the repair is less than the last known sequence number from the source, then the agent can be in one of three states: (1) it does not have this data, and has a request pending for it, (2) it has the data, and has seen an earlier request, upon which it has a repair pending for it, or (3) it has the data, and probably scheduled a repair for it at some time; after error recovery, its holddown timer (equal to three times its distance to some requestor) expired, at which time the pending object was cleared. In this last situation, the agent will simply ignore the repair, for lack of being able to do anything meaningful. All of these error recovery mechanisms are done in OTcl; recv_repr() invokes the instance procedure recv-repr{sender, msgid} to complete the loss recovery phase for the particular message.

Session Packets

On receiving a session message, the agent updates its sequence numbers for all active sources, and computes its instantaneous distance to the sending agent if possible. The agent will ignore earlier session messages from a group member, if it has received a later one out of order.

Session message processing is done in recv_sess(). The format of the session message is: (count of tuples in this message, list of tuples), where each tuple indicates the (sender id, last sequence number from the source, time the last session message was received from this sender, time that that message was sent). The first tuple is the information about the local agent.³

³ Note that this implementation of session message handling is subtly different from that used in wb or described by Floyd et al. in [42]. In principle, an agent disseminates a list of the data it has actually received. Our implementation, on the other hand, only disseminates a count of the last message sequence number per source that the agent knows that the source has sent.
This is a constraint when studying aspects of loss recovery during partition and healing. It is reasonable to expect that the maintainer of this code will fix this problem during one of his numerous intervals of copious spare time.

D.4 Loss Detection: The Class SRMinfo

A very small encapsulating class, entirely in C++, tracks assorted state information. Each member of the group, $n_i$, uses one SRMinfo block for every other member of the group.

An SRMinfo object about group member $n_j$ at $n_i$ contains information about the session messages received by $n_i$ from $n_j$. $n_i$ can use this information to compute its distance to $n_j$.

If $n_j$ is active in sending data traffic, then the SRMinfo object will also contain information about the received data, including a bit vector indicating all packets received from $n_j$.

The agent keeps a list of SRMinfo objects, one per group member, in its member variable, sip_. Its method, get_state(int sender), will return the object corresponding to that sender, possibly creating that object if it did not already exist. The class SRMinfo has two methods to access and set the bit vector: ifReceived(int id) indicates whether the particular message from the appropriate sender, with id id, was received at $n_i$; setReceived(int id) sets the bit to indicate that the particular message from the appropriate sender, with id id, was received at $n_i$.

The session message variables to access timing information are public; no encapsulating methods are provided. These are:

    int lsess_;        /* # of last session msg received */
    int sendTime_;     /* Time sess. msg. # sent */
    int recvTime_;     /* Time sess. msg. # received */
    double distance_;

    /* Data messages */
    int ldata_;        /* # of last data msg sent */

D.5 Loss Recovery Objects

In the last section, we described the agent behaviour when it receives a message. Timers are used to control when any particular control message is to be sent. The SRM agent uses a separate class SRM to do the timer based processing. In this section, we describe the basics of the class SRM, and the loss recovery objects. The following section will describe how the class SRM is used for sending periodic session messages. An SRM agent will instantiate one object to recover from one lost data packet. Agents that detect the loss will instantiate an object in the class SRM/request; agents that receive a request and have the required data will instantiate an object in the class SRM/repair.

Request Mechanisms

SRM agents detect loss when they receive a message, and infer the loss based on the sequence number on the message received. Since packet reception is handled entirely by the compiled object, loss detection occurs in the C++ methods. Loss recovery, however, is handled entirely by instance procedures of the corresponding interpreted object in OTcl.

When any of the methods detects new losses, it invokes Agent/SRM::request{} with a list of the message sequence numbers that are missing. request{} will create a new requestFunction_ object for each message that is missing. The agent stores the object handle in its array of pending_ objects. The key to the array is the message identifier (sender):(msgid).

The default requestFunction_ is class SRM/request. The constructor for the class SRM/request calls the base class constructor to initialise the simulator instance (ns_), the SRM agent (agent_), trace file (trace_), and the times_ array. It then initialises its statistics_ array with the pertinent elements.

A separate call to set-params{} sets the sender_, msgid_, round_ instance variables for the request object. The object determines C1_ and C2_ by querying its agent_.
It sets its distance to the sender (times_(distance)) and fixes other scheduling parameters: the backoff constant (backoff_), the current number of backoffs (backoffCtr_), and the limit (backoffLimit_) fixed by the agent. set-params{} writes the trace entry “Q DETECT.”

The final step in request{} is to schedule the timer to send the actual request at the appropriate moment. The instance procedure SRM/request::schedule{} uses compute-delay{} and its current backoff constant to determine the delay. The object schedules send-request{} to be executed after delay_ seconds. The instance variable eventID_ stores a handle to the scheduled event. The default compute-delay{} function returns a value uniformly distributed in the interval $[C_1 d_s, (C_1 + C_2) d_s]$, where $d_s$ is twice the value of times_(distance). After scheduling the event, schedule{} writes a trace entry “Q NTIMER at (time).”

When the scheduled timer fires, the routine send-request{} sends the appropriate message. It invokes “$agent_ send request (args)” to send the request. Note that send{} is an instproc-like, executed by the command() method of the compiled object. However, it is possible to overload the instproc-like with a specific instance procedure send{} for specific configurations. As an example, recall that the file tcl/mcast/srm-nam.tcl overloads the send{} command to set the flowid based on the type of message that is sent. send-request{} updates the statistics, and writes the trace entry “Q SENDNACK.”

When the agent receives a control message for a packet for which a pending object exists, the agent will hand the message off to the object for processing.
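The request timer computation described above can be sketched in C++ as follows. This is a sketch under stated assumptions rather than a transcription of the OTcl code: the struct RequestTimerParams and the function computeRequestDelay are hypothetical names, and the way the backoff constant enters (as a multiplier on the drawn delay) is an assumption, since the text only says that schedule{} combines compute-delay{} with the current backoff constant.

```cpp
#include <random>

// Inputs to the request timer (field names are illustrative).
struct RequestTimerParams {
    double c1;        // C1_
    double c2;        // C2_
    double distance;  // one-way distance estimate to the sender, in seconds
    double backoff;   // current backoff constant
};

// Draws a delay from [C1*ds, (C1+C2)*ds) with ds = 2*distance, then scales
// it by the backoff constant. The scaling rule is an assumption of this sketch.
template <typename Rng>
double computeRequestDelay(const RequestTimerParams& p, Rng& rng) {
    const double ds = 2.0 * p.distance;
    std::uniform_real_distribution<double> unit(0.0, 1.0);
    const double base = (p.c1 + p.c2 * unit(rng)) * ds;
    return base * p.backoff;
}
```

With C2_ set to 0.0, as in Agent/SRM/Deterministic, the interval collapses to the single value C1_ times ds and the delay is deterministic; with C1_ set to 0.0, as in Agent/SRM/Probabilistic, the delay is spread over the interval from 0 to C2_ times ds.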
When a request for a particular packet is received, the request object can be in one of two states: it is ignoring requests, considering them to be duplicates, or it will cancel its send event and re-schedule another one, after having backed off its timer. If it is ignoring requests, it will update its statistics, and write the trace entry “Q NACK dup.” Otherwise, it sets a time, based on its current estimate of the delay_, until which it will ignore further requests. This interval is marked by the instance variable ignore_. If the object reschedules its timer, it will write the trace entry “Q NACK IGNORE-BACKOFF (ignore).” Note that this re-scheduling relies on the fact that the agent has joined the multicast group, and will therefore receive a copy of every message it sends out.

When the request object receives a repair for the particular packet, it can be in one of two states: either it is still waiting for the repair, or it has already received an earlier repair. If it is the former, there will be an event pending to send a request, and eventID_ will point to that event. The object will compute its serviceTime, cancel that event, and set a holddown period during which it will ignore other requests. At the end of the holddown period, the object will ask its agent to clear it. It will write the trace entry “Q REPAIR IGNORES (ignore).” On the other hand, if this is a duplicate repair, the object will update its statistics, and write the trace entry “Q REPAIR dup.”

When the loss recovery phase is completed by the object, Agent/SRM::clear{} will remove the object from its array of pending_ objects, and place it in its list of done_ objects. Periodically, the agent will clean up and delete the done_ objects.

Repair Mechanisms

The agent will initiate a repair if it receives a request for a packet, and it does not have a request object pending_ for that packet. The default repair object belongs to the class SRM/repair.
Barring minor differences, the sequence of events and the instance procedures in this class are identical to those for SRM/request. Rather than outline every single procedure, we only outline the differences from those described earlier for a request object.

The repair object uses the repair parameters, D1_ and D2_. A repair object does not repeatedly reschedule its timers; therefore, it does not use any of the backoff variables such as those used by a request object. The repair object ignores all requests for the same packet. The repair object does not use the ignore_ variable that request objects use. The trace entries written by repair objects are marginally different; they are “P NACK from (requester),” “P RTIMER at (fireTime),” “P SENDREP,” and “P REPAIR IGNORES (holddown).”

Apart from these differences, the calling sequence for events in a repair object is similar to that of a request object.

Mechanisms for Statistics

The agent, in concert with the request and repair objects, collects statistics about their response to data loss [42]. Each call to the agent request{} procedure marks a new period. At the start of a new period, mark-period{} computes the moving average of the number of duplicates in the last period. Whenever the agent receives a first round request from another agent, and it had sent a request in that round, then it considers the request as a duplicate request, and increments the appropriate counters. A request object does not consider duplicate requests if it did not itself send a request in the first round. If the agent has a repair object pending, then it does not consider the arrival of duplicate requests for that packet. The object methods SRM/request::dup-request?{} and SRM/repair::dup-request?{} encode these policies, and return 0 or 1 as required.
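The duplicate counting policy and the per-period moving average can be sketched in C++ as follows. Both functions are hypothetical illustrations of the policies stated above, not the OTcl procedures themselves. In particular, the text does not give the form of the moving average; the exponentially weighted form and the default weight of 0.25 used here are assumptions of this sketch, and the first round is assumed to be numbered 1.

```cpp
// Whether an incoming request should be counted as a duplicate request.
//   round                : round of the incoming request (first round assumed to be 1)
//   sentRequestThisRound : this agent itself sent a request in that round
//   repairPending        : a repair object is pending for the packet
inline bool countsAsDuplicateRequest(int round, bool sentRequestThisRound,
                                     bool repairPending) {
    if (repairPending) {
        return false;  // duplicates are not considered while a repair is pending
    }
    return round == 1 && sentRequestThisRound;
}

// Moving average updated once per period. The exponentially weighted form
// and the default weight are assumptions, not values taken from the source.
inline double updateAverage(double average, double sample, double weight = 0.25) {
    return (1.0 - weight) * average + weight * sample;
}
```

At the start of each new period, the count of duplicates from the period just ended would be passed as the sample, and the counter reset.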
A request object also computes the elapsed time between when the loss is detected and when it receives the first request. The agent computes a moving average of this elapsed time. The object computes the elapsed time (or delay) when it cancels its scheduled event for the first round. The object invokes Agent/SRM::update-ave to compute the moving average of the delay.

The agent keeps similar statistics of the duplicate repairs, and the repair delay.

The agent stores the number of rounds taken for one loss recovery, to ensure that subsequent loss recovery phases for that packet that are definitely not due to data loss do not account for these statistics. The agent stores the number of rounds taken for a phase in the array old_. When a new loss recovery object is instantiated, the object will use the agent’s instance procedure round?{} to determine the number of rounds in a previous loss recovery phase for that packet.

D.6 Session Objects

Session objects, like the loss recovery objects (Section D.5), are derived from the base class SRM. Unlike the loss recovery objects though, the agent only creates one session object for the lifetime of the agent. The constructor invokes the base class constructor as before; it then sets its instance variable sessionDelay_.

The agent creates the session object when it start{}s. At that time, it also invokes SRM/session::schedule, to send a session message after sessionDelay_ seconds.

When the object sends a session message, it will schedule to send the next one after some interval. It will also update its statistics. send-session{} writes out the trace entry “S SESSION.”

The class overrides the evTrace{} routine that writes out the trace entries. SRM/session::evTrace disables writing out the trace entry for session messages.
Two types of session message scheduling strategies are currently available. The function in the base class schedules session messages at fixed intervals of sessionDelay_, jittered by a small amount to avoid synchronization among all the agents at all the nodes. The class SRM/session/logScaled schedules messages at intervals of sessionDelay_ times log2(groupSize_), so that the frequency of session messages is inversely proportional to the logarithm of the group size. The base class that sends messages at fixed intervals is the default sessionFunction_ for the agent.

D.7 Extending the Base Class Agent

In the earlier section on configuration parameters (Section D.1.2), we showed how to trivially extend the agent to get deterministic and probabilistic protocol behaviour. In this section, we describe how to derive more complex extensions to the protocol for fixed and adaptive timer mechanisms.

D.7.1 Fixed Timers

The fixed timer mechanisms are implemented in the derived class Agent/SRM/Fixed. The main difference with fixed timers is that the repair parameters are set to log(groupSize_). Therefore, the repair procedure of a fixed timer agent sets D1 and D2 to be proportional to the logarithm of the group size before scheduling the repair object.

D.7.2 Adaptive Timers

Agents using adaptive timer mechanisms modify their request and repair parameters under three conditions: (1) every time a new loss object is created; (2) when sending a message; and (3) when they receive a duplicate, if their relative distance to the loss is less than that of the agent that sent the duplicate. All three changes require extensions to the agent and the loss objects. The class Agent/SRM/Adaptive uses class SRM/request/Adaptive and class SRM/repair/Adaptive as the request and repair functions respectively.
In addition, the last item requires extending the packet headers, so that agents can advertise their distances to the loss. The corresponding compiled class for the agent is the class ASRMAgent.

Recompute for Each New Loss Object

Each time a new request object is created, SRM/request/Adaptive::set-params invokes $agent_ recompute-request-params. The agent method recompute-request-params{} uses the statistics about duplicates and delay to modify C1 and C2 for the current and future requests. Similarly, SRM/repair/Adaptive::set-params for a new repair object invokes $agent_ recompute-repair-params. The agent method recompute-repair-params{} uses the statistics objects to modify D1 and D2 for the current and future repairs.

Sending a Message

If a loss object sends a request in its first round_, then the agent, in the instance procedure sending-request{}, will lower C1, and set its instance variable closest_(requestor) to 1. Similarly, a loss object that sends a repair in its first round will invoke the agent’s instance procedure, sending-repair{}, to lower D1 and set closest_(repairor) to 1.

Advertising the Distance

Each agent must add additional information to each request/repair that it sends out. The base class SRMAgent invokes the virtual method addExtendedHeaders() for each SRM packet that it sends out. The method is invoked after adding the SRM packet headers, and before the packet is transmitted. The adaptive SRM agent overrides addExtendedHeaders() to specify its distances in the additional headers. When sending a request, that agent unequivocally knows the identity of the sender. As an example, the definition of addExtendedHeaders() for the adaptive SRM agent is:

    void addExtendedHeaders(Packet* p) {
        SRMinfo* sp;
        hdr_srm*  sh  = (hdr_srm*)  p->access(off_srm_);   // base SRM header
        hdr_asrm* seh = (hdr_asrm*) p->access(off_asrm_);  // adaptive SRM extension
        switch (sh->type()) {
        case SRM_RQST:
            // advertise this agent's distance to the sender of the lost data
            sp = get_state(sh->sender());
            seh->distance() = sp->distance_;
            break;
        }
    }

Similarly, the method parseExtendedHeaders() is invoked every time an SRM packet is received. It sets the agent member variable pdistance_ to the distance advertised by the peer that sent the message. The member variable is bound to an instance variable of the same name, so that the peer distance can be accessed by the appropriate instance procedures. The corresponding parseExtendedHeaders() method for the adaptive SRM agent is simply:

    void parseExtendedHeaders(Packet* p) {
        hdr_asrm* seh = (hdr_asrm*) p->access(off_asrm_);
        pdistance_ = seh->distance();   // distance advertised by the peer
    }

Finally, the adaptive SRM agent’s extended headers are defined as struct hdr_asrm. The header declaration is identical to declaring other packet headers in ns. Unlike most other packet headers, these are not automatically available in the packet. The interpreted constructor for the first adaptive agent will add the header to the packet format. For example, the start of the constructor for the Agent/SRM/Adaptive agent is:

    Agent/SRM/Adaptive set done_ 0
    Agent/SRM/Adaptive instproc init args {
        if ![$class set done_] {
            set pm [[Simulator instance] set packetManager_]
            TclObject set off_asrm_ [$pm allochdr aSRM]
            $class set done_ 1
        }
        eval $self next $args
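To make the header exchange under “Advertising the Distance” concrete, the following self-contained C++ sketch mimics the round trip between a sending and a receiving agent. It does not use the ns classes: Packet, HdrSrm, HdrAsrm, AdaptiveAgentSketch, and the message types other than SRM_RQST are simplified stand-ins invented for this example, and the per-sender distance table replaces the SRMinfo lookup done by get_state().

```cpp
#include <map>

// Stand-in message types; only SRM_RQST appears in the text.
enum SrmType { SRM_DATA, SRM_RQST, SRM_REPR };

// Simplified stand-ins for the base and extended packet headers.
struct HdrSrm  { SrmType type; int sender; };
struct HdrAsrm { double distance; };
struct Packet  { HdrSrm srm; HdrAsrm asrm; };

// Hypothetical adaptive agent holding only the state needed for the exchange.
class AdaptiveAgentSketch {
public:
    std::map<int, double> distanceTo;  // estimated distance to each known sender
    double pdistance = 0.0;            // distance advertised by the last peer heard

    // Called after the base headers are filled in, before transmission.
    void addExtendedHeaders(Packet& p) const {
        if (p.srm.type == SRM_RQST) {
            // Advertise our distance to the sender of the lost data.
            std::map<int, double>::const_iterator it = distanceTo.find(p.srm.sender);
            p.asrm.distance = (it != distanceTo.end()) ? it->second : 0.0;
        }
    }

    // Called for every SRM packet received.
    void parseExtendedHeaders(const Packet& p) {
        pdistance = p.asrm.distance;
    }
};
```

In use, the requesting agent fills the extension header on an outgoing request, and each receiver records the advertised value so that it can compare it with its own distance to the loss.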
UMI Number: 9919116
UMI Microform 9919116. Copyright 1999, by UMI Company. All rights reserved. This microform edition is protected against unauthorized copying under Title 17, United States Code. UMI, 300 North Zeeb Road, Ann Arbor, MI 48103
Asset Metadata
Creator
Varadhan, Kannan
(author)
Core Title
Protocol evaluation in the context of dynamic topologies
School
Graduate School
Degree
Doctor of Philosophy
Degree Program
Computer Science
Degree Conferral Date
1998-08
Publisher
University of Southern California
(original),
University of Southern California. Libraries
(digital)
Tag
computer science,OAI-PMH Harvest
Language
English
Contributor
Digitized by ProQuest
(provenance)
Permanent Link (DOI)
https://doi.org/10.25549/usctheses-c17-404761
Unique identifier
UC11350312
Identifier
9919116.pdf (filename),usctheses-c17-404761 (legacy record id)
Legacy Identifier
9919116.pdf
Dmrecord
404761
Document Type
Dissertation
Rights
Varadhan, Kannan
Type
texts
Source
University of Southern California
(contributing entity),
University of Southern California Dissertations and Theses
(collection)
Access Conditions
The author retains rights to his/her dissertation, thesis or other graduate work according to U.S. copyright law. Electronic access is being provided by the USC Libraries in agreement with the au...
Repository Name
University of Southern California Digital Library
Repository Location
USC Digital Library, University of Southern California, University Park Campus, Los Angeles, California 90089, USA
Tags
computer science