INFORMATION TO USERS
This manuscript has been reproduced from the microfilm master. UMI
films the text directly from the original or copy submitted. Thus, some
thesis and dissertation copies are in typewriter face, while others may be
from any type of computer printer.
The quality of this reproduction is dependent upon the quality of the
copy submitted. Broken or indistinct print, colored or poor quality
illustrations and photographs, print bleedthrough, substandard margins,
and improper alignment can adversely affect reproduction.
In the unlikely event that the author did not send UMI a complete
manuscript and there are missing pages, these will be noted. Also, if
unauthorized copyrighted material had to be removed, a note will indicate
the deletion.
Oversize materials (e.g., maps, drawings, charts) are reproduced by
sectioning the original, beginning at the upper left-hand corner and
continuing from left to right in equal sections with small overlaps. Each
original is also photographed in one exposure and is included in reduced
form at the back of the book.
Photographs included in the original manuscript have been reproduced
xerographically in this copy. Higher quality 6” x 9” black and white
photographic prints are available for any photographs or illustrations
appearing in this copy for an additional charge. Contact UMI directly to
order.
UMI
A Bell & Howell Information Company
300 North Zeeb Road, Ann Arbor MI 48106-1346 USA
313/761-4700 800/321-0600
SCALABLE MULTICAST ROUTING: TREE TYPES AND
PROTOCOL DYNAMICS

A Dissertation Presented to the
FACULTY OF THE GRADUATE SCHOOL
UNIVERSITY OF SOUTHERN CALIFORNIA
In Partial Fulfillment of the
Requirements for the Degree
DOCTOR OF PHILOSOPHY
(Computer Science)

December 1995

Copyright 1995 Liming Wei
UMI Number: 9617150
UMI Microform 9617150
Copyright 1996, by UMI Company. All rights reserved.
This microform edition is protected against unauthorized
copying under Title 17, United States Code.
UMI
300 North Zeeb Road
Ann Arbor, MI 48103
UNIVERSITY OF SOUTHERN CALIFORNIA
THE GRADUATE SCHOOL
UNIVERSITY PARK
LOS ANGELES, CALIFORNIA 90007

This dissertation, written by
under the direction of his Dissertation
Committee, and approved by all its members,
has been presented to and accepted by The
Graduate School, in partial fulfillment of
requirements for the degree of
DOCTOR OF PHILOSOPHY

Dean of Graduate Studies
Date
DISSERTATION COMMITTEE
Chairperson
To
my wife
Peixian Jiang
and
my father and mother
Tingde Wei and Hongyun Ouyang
Acknowledgements
First of all, I am indebted to my advisor Deborah Estrin, who generated many
challenging ideas and opportunities, and provided the most crucial guidance for me
to accomplish what I initially considered a formidable task.
I am grateful to Bob Braden who showed me the art of transport protocol analysis
and engineering during the 2 years I worked for him at ISI.
I benefited greatly from the insightful advice of the other members of my
dissertation committee and qualifying examination committee: Bob Felderman,
Peter Danzig, John Silvester and Rafael Saavedra. I owe special thanks to the
other members of the networking lab: Lee Breslau, Kannan Varadhan, Jong Suk
Ahn, Vivek Goyal, Puneet Sharma, Ching-Gung Liu, Sugih Jamin, Shai Herzog,
Ari Medvinsky and Anant Kumar; and to the members of division 7 at ISI: Steve
Casner, Greg Finn, Jon Postel, Joe Touch and Walter Prue, for their helpful
discussions and friendship. Thanks to Steven Deering, Lixia Zhang, Van
Jacobson, Dino Farinacci, Allyn Romanow, Fong-Ching Liaw, Tom Lyon and Fred
Sammartino, who directly or indirectly helped me in my research. Among them,
I am especially thankful to Van Jacobson for the numerous concepts and ideas
he generously shared with the PIM design group, from which I benefited greatly.
I would also like to gratefully acknowledge the financial support of the
National Science Foundation and Sun Microsystems Computer Corporation.
I especially want to thank my lovely wife Peixian Jiang for encouraging me to
complete my studies and for putting up with my crazy working habits for the past
five years.
Contents

Acknowledgements iii

List Of Tables vii

List Of Figures viii

Abstract x

1 Introduction 1
   1.1 Problem Statement: Scaling Issues in Multicast Routing 4
      1.1.1 Measuring Tree Qualities 4
         1.1.1.1 Shortest Path Trees 5
         1.1.1.2 Steiner Minimal Trees 5
         1.1.1.3 Traffic Concentration 6
      1.1.2 Protocol Overhead 7
   1.2 Thesis organization 9

2 Related Work 10
   2.1 Algorithms for calculating multicast trees 10
      2.1.1 Polynomial Time Algorithms for Group Shared Trees 10
   2.2 Protocols 13
      2.2.1 Traditional Flooding-Group-State Protocols 13
         2.2.1.1 Is link-state multicast a viable option for scalable multicast? 15
      2.2.2 Protocols based on Directed-Membership-Advertisement 16
         2.2.2.1 Core Based Tree (CBT) 16
         2.2.2.2 Protocol Independent Multicast (PIM) 17

3 Topologies, Metrics and Tools 18
   3.1 Network Topologies 18
   3.2 Performance Metrics 19
   3.3 Topology generator and Simulators 21

4 Evaluations of different multicast tree types 23
   4.1 Simulating Different Trees 23
   4.2 Results on Multicast Trees and algorithms 24
      4.2.1 Delay and Cost: SPT, KMB, CBT and MSPT 24
      4.2.2 Traffic Concentration: SPT, CBT and MSPT 29
   4.3 Conclusions from the Tree Simulation Results 31

5 Loop-free Multicast Routing Supporting Multiple Tree Types 33
   5.1 Motivations and problem definition 33
   5.2 Looping along branches of a "single" tree in unicast loop-free networks 34
   5.3 Multicast forwarding loop along segments of mixed tree types 36
      5.3.1 Looping under unicast asymmetry 37
      5.3.2 RPF interface check as a loop prevention method 38
      5.3.3 RPT and SPT branches intersecting at a multiaccess LAN 39
   5.4 Analysis of a few loop prevention protocol mechanisms 49
      5.4.1 Assert mechanism 49
      5.4.2 Control packet tag 51
      5.4.3 Local Forwarder Tag 52
   5.5 Summary of looping conditions and loop-prevention mechanisms 53

6 Multicast Routing in Dense and Sparse Modes: A Simulation Study of Tradeoffs and Dynamics 54
   6.1 Introduction 54
      6.1.1 Overhead tradeoffs of sparse mode and dense mode 55
      6.1.2 Sparse Mode Dynamics: the transition from RPT to SPT 57
         6.1.2.1 Black out periods during RPT to SPT transitions 57
         6.1.2.2 Are there duplicate packets during the transitions? 58
   6.2 Trade-off of overheads in sparse mode and dense mode 59
      6.2.1 Metrics and Formulae 59
         6.2.1.1 Storage Overhead 59
         6.2.1.2 Bandwidth (control) Overhead 60
         6.2.1.3 Overhead as Functions of Group Density 61
      6.2.2 SM/DM tradeoff experiments 65
   6.3 Packet Losses During the Black out period when switching from RPT to SPT 67
   6.4 Chapter conclusions 70

7 Routing Information Reduction: Aggregation and Data Driven Techniques 72
   7.1 Aggregation Techniques 75
      7.1.1 Aggregation across sources 75
      7.1.2 Aggregation across groups 79
   7.2 Data Driven Techniques 81
      7.2.1 Data driven setup and maintenance of shared tree state 82
      7.2.2 Dense mode emulation for sparse mode groups 83
   7.3 Limiting the group lifetimes 88
   7.4 Summary of Multicast Routing Information Reduction Techniques 89

8 Analysis of a Resequencer Model for Multicast over ATM Networks 91
   8.1 Introduction 91
   8.2 Strategies for Implementing Multicast in ATM 92
   8.3 The Resequencer Solution to Multicast 94
      8.3.1 Resequencers 94
      8.3.2 Multicast Group Trees 95
      8.3.3 Discussion 97
   8.4 Three Approaches to Resequencing 98
      8.4.1 Software Resequencing 98
      8.4.2 Hardware Resequencing 100
      8.4.3 Optimized Hardware Cell Forwarding 102
   8.5 Simulation Studies 102
   8.6 Chapter Conclusions 105

9 Summary of Contributions and Future Research 106
   9.1 Summary of main contributions 106
   9.2 Future Research 108
      9.2.1 Open Issues in Multicast Routing Information Reduction Techniques 109
      9.2.2 Scalable Timers 110
      9.2.3 Multicast group address/session management 111

Reference List 112
List Of Tables

4.1 Summary of experiments with different tree types 31
7.1 Using wild-card address entries: the design space 73
7.2 The design space of data-driven techniques 74
7.3 State storage requirement for a group with a persistent-state shared tree 85
7.4 State storage requirement for a group with a persistent-state shared tree 86
7.5 Without dense mode emulation: number of control updates for (*,G) maintenance 87
7.6 With dense mode emulation: number of control packets with 2 sources 87
List Of Figures

1.1 Traffic concentration example 7
3.1 Random graphs of different a's 19
4.1 Comparisons of delay and cost in 50-node graphs: (a) KMB tree vs SPT, (b) center tree vs SPT, (c) MSPT vs SPT 25
4.2 Effect of group size for KMB trees, in 50-node graphs: (a) Ratio; (b) Histograms for group size of 5; (c) Histograms for group size of 25 26
4.3 Effect of group size for center trees, in 50-node graphs: (a) Ratio; (b) Histograms for group size of 5; (c) Histograms for group size of 25 27
4.4 Effect of group size for MSPT, in 50-node graphs: (a) Ratio; (b) Histograms for group size of 5; (c) Histograms for group size of 25 27
4.5 Traffic Concentration, in 50-node graphs with 300 40-member groups, 32 senders per group 29
4.6 Distribution of link loads in the same graph 30
5.1 Loop along a branch of a single tree in a unicast loop-free network 35
5.2 RPF check prevents multicast looping 39
5.3 A loop composed of an SPT segment and an RPT segment 40
5.4 LAN with a single forwarder on the SPT 42
5.5 LAN with a single forwarder on the RPT 42
5.6 Single SPT recipient: RPT segment directly reaches the LAN 44
5.7 Single SPT recipient: RPT segment hits upstream SPT 44
5.8 LAN with a single recipient on the RPT. SPT segment reaches the LAN again 45
5.9 LAN with a single recipient on the RPT. SPT segment hits the upstream RPT 46
5.10 Sufficient conditions for loop existence 48
5.11 X sends an assert after it receives a potentially duplicate packet 49
6.1 The receiver leaves RPT and joins SPT 58
6.2 Arpanet topology used in dense mode and sparse mode simulations 62
6.3 Tradeoffs of Sparse mode (SPT) and Dense mode in arpanet (47-node), for 5-sender multicast groups 63
6.4 Example of two multicast groups having the same density 64
6.5 Maximum and Minimum group size vs density 64
6.6 SM/DM control bandwidth overhead supporting a sparse group in arpanet 66
6.7 Packet loss as a function of packet size (source rate fixed) 69
6.8 Packet loss as a function of source rate and packet size 70
7.1 A cloud to apply the source-proxy method 77
7.2 State savings of group-based aggregation under uniform random membership distribution. Different curves correspond to results measured for trees rooted at different sources 80
7.3 Dense Mode emulation: Transforming the part of the shared tree inside a cloud into source-rooted shortest path trees. BR1 is the entry router to the cloud for the shared tree; BR2 and BR3 are two SPT entry routers for source S. BR4 and BR5 are two exit routers leading to downstream receivers on the RP tree. BR1 multicasts data packets received from the RP tree to the all-border-routers group 84
8.1 One-vc-per-source multicast model 93
8.2 The resequencer and group multicast tree model 96
8.3 Software PDU forwarding 99
8.4 PDU reassembly delay 101
8.5 Distribution of cell forwarding delays 104
Abstract

This thesis investigates issues of multicast routing in large-scale wide-area
internetworks. The work is pursued along two dimensions: (1) evaluation of
multicast distribution trees and algorithms; and (2) analysis of multicast
protocol mechanisms.

Multicast trees can be shared across sources or may be source-specific.
Differences in tree types may lead to different data packet delivery qualities
and different costs in terms of bandwidth. Inspired by interest in using shared
trees for scalable multicast routing, we investigate the trade-offs among
different algorithms and tree types. The trade-off in performance is shown with
regard to both individual groups and the aggregated effects of large numbers of
groups. The conclusion is that shortest path trees and shared trees have
complementary properties. Ideally, both tree types should be supported in the
same multicast routing protocol.

When both shared and shortest path trees are supported by a protocol, a
brute-force combination of mechanisms for the two tree types may create packet
loops. The looping conditions are explored, and proofs are presented for
protocol systems that are loop-free. Three candidate multicast loop-prevention
mechanisms are described and analyzed.

Different multicast routing protocols support not only two different tree
types, but also two different operating modes: dense mode and sparse mode. This
thesis examines the tradeoffs of the different modes and the dynamic behaviors
when switching from shared to source-rooted shortest path trees (SPTs). The
overall storage and bandwidth trade-offs between dense and sparse modes are
investigated across different membership distributions. Protocol Independent
Multicast (PIM) is used as the reference protocol that provides a framework for
this study. The mode changes and dynamics are simulated in the USC PIM
simulator (PIMSIM).

Ultimately, a truly scalable multicast protocol should be able to reduce the
number of routing entries, either by compressing the number of routing entries
or by discarding idle entries. Four alternative mechanisms are investigated.
First, we examine aggregation of multicast routing entries based on the source
address prefix. This method combines all SPT entries for a group rooted at
sources from the same region into one entry. We illustrate how the lack of an
efficient and simple protocol mechanism makes this option unpromising. A second
method, aggregation across groups, combines entries for different groups rooted
at the same source into one entry. The third approach, data-driven maintenance
of shared tree state, attempts to maintain shared tree state in a data-driven
fashion, to eliminate the shortcoming of sparse mode multicast in which state
is kept in the absence of data traffic. The lack of a simple reliability
mechanism makes it less usable. Finally, we consider dense mode emulation to
completely eliminate shared tree state within a contiguous network region
(i.e., a cloud). Shared tree state is replaced by SPT entries within a
particular cloud, where the group operates in dense mode. This approach can
result in substantial savings in state for long-lived groups with intermittent
sources, at the expense of less efficient bandwidth usage.

Finally, we consider what would be needed to apply the IP multicast model to
other underlying technologies. A resequencer model is presented for multicast
over ATM networks. The model was designed to solve the cell demultiplexing and
VC (virtual circuit) depletion problems. Three methods are presented to
implement a multicast resequencer, and their performance is evaluated through
simulation.
Chapter 1
Introduction
Multicast service enables point-to-multipoint communications. Led by the
wide-area video/audio conferencing tools available on the Internet, the
population of multicast-capable applications is increasing steadily. Areas of
multicast application include resource discovery, news dissemination, virtual
reality systems involving multiple geographically dispersed parties, stock
exchange administration, public interactive video/audio services, and massive
replication-based services. Despite their vast differences in appearance and
communication patterns, these applications share the most fundamental
requirement on packet delivery: only one copy of each data packet is sent by
the source, and all recipients of the group should receive a copy of each data
packet.

The gradually evolving multicast network architecture in the Internet reflects
the minimal set of common services needed to support the existing applications.
Efficient, high-quality multicast support is indispensable for a future
internet supporting integrated services [6].
The focus of this thesis is the evaluation of different multicast trees, and of
the multicast protocol mechanisms that maintain different types of trees. The
issues addressed here attracted little attention in the early days because of
the limited scale of multicast applications. Historically, practical multicast
work started as a service for local area networks. Broadcasting and hardware
filtering were among the early methods used to achieve multicast delivery
[30] [10]. In those days, protocol simplicity took precedence over other
factors: almost all multicast applications were confined within a single
multi-access subnetwork, and the number of simultaneous groups was so small
that storage and bandwidth overhead remained insignificant issues until the
need for multicast support over wide area internetworks emerged.
Two representative pieces of early work on multicast over wide-area
internetworks were David Wall's Center Based Forwarding (CBF) technique [30]
and Steve Deering's IP multicast models [14]. In his PhD thesis (1979), David
Wall proposed the center based forwarding (CBF) technique to construct
multicast trees. A tree constructed by CBF is shared by all sources and
receivers within a multicast group. Due to the need to compute a center, and
the unsatisfactory analytical bounds on delay and cost for these trees, CBF was
not put to use until recently, when scaling became a critical issue.
Deering's IP multicast models have gained wide distribution due to their
simplicity and robustness. The two Deering algorithms that have been adopted
most widely are the distance vector multicast routing algorithm and the link
state multicast routing algorithm. DVMRP (the Distance Vector Multicast Routing
Protocol) [27], based on the distance vector version of his algorithm, has been
the routing protocol for the 'experimental' multicast backbone MBONE1 for the
past three and a half years. After a few years of experiments, researchers have
recognized a number of desirable properties in Deering's mechanisms (the DVMRP
protocol in particular), and have also found their weaknesses. Among the most
important properties deemed beneficial are: (1) shortest path trees are
established that provide good delay in most cases; (2) state setup is
data-driven; (3) robustness is maintained in the face of control packet losses;
(4) the roles of senders and receivers are kept separate. The shortcomings of
Deering's early IP multicast schemes, when applied in a flat wide-area
internetwork2, are: (1) broadcast of membership information, and (2) the need
for off-tree routers to cache/compute tree state even though they are not on
the packet forwarding path. The deployed implementation of DVMRP runs a
separate unicast distance vector routing protocol that by itself is becoming a
bottleneck restricting its scalability3.

1MBONE is a virtual multicast backbone network, constructed mostly of tunnels
that go across networks without native multicast routing support.
2Steve Deering indicated that it was not an original design goal to run DVMRP
on such a scale. But when that was all people had, there appeared a need to
comment on how it could perform in such an unintended role.
Recognizing the restrictions of the classical Deering scheme, Tony Ballardie et
al. proposed a scheme called Core Based Tree (CBT) [4] in 1992, which adopted
Wall's center-based forwarding methodology. The rationale for the design
decisions behind CBT is that by using a shared tree, and by restricting
membership advertisement to on-tree routers, it can achieve significant savings
in the required router storage and in the packet overhead of broadcast data.
CBT uses one center-based tree for each group, where the center is called the
core. Once the location of the core is known to all sources and receivers of a
group, multicast route setup can be carried out based on the underlying unicast
routing information. By imposing a shared tree for each group, CBT lacks the
flexibility to offer shortest paths when core-based tree delays are
significantly4 worse than the shortest path delays, and it concentrates traffic
on a small portion of the links.
In 1993, PIM (Protocol Independent Multicast) [12] was proposed to solve the
scaling problems of the classical IP multicast schemes and the delay and
traffic concentration problems of CBT. PIM supports both source-rooted shortest
path trees and center-based shared trees, and can operate in dense mode or
sparse mode according to the characteristics of group membership distributions.

Section 1.1 describes the basic scaling issues in multicast routing. Subsection
1.1.1 discusses three important measures for evaluating multicast tree types,
and subsection 1.1.2 discusses the problem of protocol overhead.
3The benefit of carrying such an independent unicast routing protocol, separate
from the unicast protocol used for unicast forwarding, is that it allows the
configuration of tunnels that form a topology different from the topology used
by unicast. This benefited the initial deployment of multicast routing across
networks that are not multicast capable. But the scalability of the duplicated
unicast routing module itself can become a limiting factor in a large,
heterogeneous and dynamic environment.
4As will be shown in chapter 3, the mean end-to-end delay is not bad, but there
are extreme cases where the delays along an optimal center-based tree can be
high.
1.1 Problem Statement: Scaling Issues in Multicast
Routing

There are three primary measures for a scalable multicast routing protocol:
1) quality of the data delivery path; 2) protocol messaging overhead; and
3) bandwidth utilization. We examine these problems from two perspectives:
data delivery qualities and protocol overhead.
1.1.1 Measuring Tree Qualities

A distribution tree's quality is determined by the following factors:

1. Delay. The delay of a multicast tree is evaluated in terms of the
end-to-end delay between a source and receivers, relative to the shortest
unicast-path delay between the same source and receivers.

2. Cost. There are two different resource costs associated with a multicast
tree:

(a) Cost of total bandwidth consumption.

(b) Cost of tree state information.

3. Traffic concentration. When a multicast group establishes its delivery
trees across the network, traffic from different sources may share links that
are not shared when the same sources use shortest path trees. We will compare
the maximum number of flows5 on a unidirectional link as a simple measure of
the degree of traffic concentration.
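To make the traffic-concentration measure concrete, the sketch below counts flows (one per source, per footnote 5) on each directed link, first when every source uses its own shortest path tree and then when all sources share one center-rooted tree. The topology, the center "B", and the source and receiver sets are hypothetical illustrations, not taken from the simulations in this thesis:

```python
import heapq
from collections import defaultdict, deque

# Hypothetical 5-node topology; edge weights are link costs.
GRAPH = {
    "A": {"B": 1, "D": 2},
    "B": {"A": 1, "C": 1},
    "C": {"B": 1, "E": 2},
    "D": {"A": 2, "E": 4},
    "E": {"C": 2, "D": 4},
}
SOURCES, RECEIVERS = ["A", "E"], ["C", "D"]

def dijkstra(graph, src):
    """Distance and predecessor maps of the SPT rooted at src."""
    dist, prev = {src: 0}, {}
    pq = [(0, src)]
    while pq:
        d, u = heapq.heappop(pq)
        if d > dist[u]:
            continue
        for v, w in graph[u].items():
            if d + w < dist.get(v, float("inf")):
                dist[v], prev[v] = d + w, u
                heapq.heappush(pq, (d + w, v))
    return dist, prev

def tree_route(adj, src, dst):
    """Directed links of the unique src->dst path in a tree (via BFS)."""
    parent, q = {src: None}, deque([src])
    while q:
        u = q.popleft()
        for v in adj[u]:
            if v not in parent:
                parent[v] = u
                q.append(v)
    links, node = [], dst
    while parent[node] is not None:
        links.append((parent[node], node))
        node = parent[node]
    return links

def tree_adjacency(prev):
    """Undirected adjacency sets from a predecessor map."""
    adj = defaultdict(set)
    for v, u in prev.items():
        adj[u].add(v)
        adj[v].add(u)
    return adj

# Per-source SPTs: each source's packets follow its own tree.
spt_flows = defaultdict(int)
for s in SOURCES:
    _, prev = dijkstra(GRAPH, s)
    adj = tree_adjacency(prev)
    for r in RECEIVERS:
        for link in tree_route(adj, s, r):
            spt_flows[link] += 1

# One shared tree rooted at center "B": all sources use the same tree.
_, cprev = dijkstra(GRAPH, "B")
shared_adj = tree_adjacency(cprev)
shared_flows = defaultdict(int)
for s in SOURCES:
    for r in RECEIVERS:
        for link in tree_route(shared_adj, s, r):
            shared_flows[link] += 1

# Traffic concentration: max flows on any single unidirectional link.
print(max(spt_flows.values()), max(shared_flows.values()))  # -> 1 2
```

On this toy graph the shared tree carries two flows on its busiest links while the per-source SPTs never exceed one, illustrating why shared trees tend to concentrate traffic.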
In general, minimal cost and minimal delay cannot both be achieved with any
single type of tree. With respect to delay and cost, shortest path trees (SPT), which
are source rooted, provide minimal delay at the expense of cost; whereas Steiner
minimal trees (SMT), shared per group, minimize cost at the expense of delay.
Between them lies a spectrum of different types of trees, offering different
trade-offs in delay and cost.
5We call a stream of packets on a link, originating from a particular source,
a flow.
1.1.1.1 Shortest Path Trees

A Shortest Path Tree (SPT) rooted at the source is composed of the shortest
paths between the source and each of the receivers in the multicast group. An
SPT does not attempt to minimize the total cost of distribution.

Source-rooted shortest path trees are easy to compute, and can be implemented
efficiently in a distributed fashion [10] [14]. In networks consisting of
symmetric links or paths, reverse path forwarding (RPF) algorithms can be used
to derive shortest path trees from the unicast routing mechanisms [14, 11].
When asymmetric paths exist, RPF will provide reverse shortest paths6, or
distributed link-state protocols such as MOSPF can be used to compute shortest
path trees [24].
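The RPF rule itself can be stated in a few lines. This is an illustrative sketch only: the routing table, prefixes, and interface names are invented, and a real router would use longest-prefix matching rather than an exact dictionary lookup:

```python
# Hypothetical unicast routing table: source prefix -> the outgoing
# interface this router would use to reach that prefix.
UNICAST_TABLE = {
    "10.1.0.0/16": "eth0",
    "10.2.0.0/16": "eth1",
}

def rpf_accept(source_prefix, arrival_iface):
    """Reverse path forwarding check: accept (and forward) a multicast
    packet only if it arrived on the interface leading back toward its
    source; drop it otherwise.  With symmetric links, the accepted
    packets trace out a shortest path tree rooted at the source."""
    return UNICAST_TABLE[source_prefix] == arrival_iface

print(rpf_accept("10.1.0.0/16", "eth0"))  # True: on the reverse path
print(rpf_accept("10.1.0.0/16", "eth1"))  # False: off-path duplicate, drop
```

Because the check consults only the existing unicast table, it needs no extra per-group topology state, which is what makes RPF-style tree construction so cheap.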
Although not offering minimal cost paths, protocols based on shortest path
trees have been adopted most widely [17] [7]. This is because, when compared
with multiple unicast transmissions, SPT-based multicast already provides
substantial savings in link cost and in packet generation at the sources [14].
For a (virtual) network such as the MBONE [7], with relatively few
globally-active multicast groups, SPTs are satisfactory. The bandwidth is used
for real-time traffic and is frequently scarce relative to demand; therefore
delay and traffic concentration are of concern. Moreover, the network is not
rich in connectivity, so different types of trees would be mapped to the same
routes anyway. However, to support increasing usage and large scale
applications there is a need to explore the properties of different tree types.
1.1.1.2 Steiner Minimal Trees

The Steiner minimal tree (SMT) has been studied intensively in the algorithms
literature for the past half century. It is defined to be the minimal-cost
subgraph (tree) spanning a given subset of nodes in a graph [18] [34]. Since
the SMT for all sources within a multicast group is the same, irrespective of
the role of sender or receiver, the number of state entries needed to maintain
the tree is only one per group. Thus it scales well for big multicast groups
with large numbers of senders.

6Reverse shortest paths will have higher delay than forward shortest paths
because the data follows the shortest path from the receiver instead of the
shortest path to the receiver.
It is well known that computation of an SMT is NP-complete, and it is not expected
to have polynomial time solutions [20]. This computational complexity prohibits
on-demand computation over a reasonably-sized graph. Karp has suggested several
techniques to reduce the problem size for SMT computations [20]. The three simplest
reductions are: (1) for a non-member vertex k whose node degree is 1, k and its
incident edge can be removed from the graph; (2) for a non-member vertex k whose
node degree is 2, the two incident edges to k can be merged into one edge, and k can
be removed; (3) if an edge e = (i, j)'s cost is higher than the shortest path between
vertices i and j, e can be removed from the graph. The limitation of graph reduction
techniques is that they are highly graph dependent. For computer networks as
complex as the Internet, where the average node degree (of routers) is higher than 3,
graph reduction will not be effective enough in reducing the computational demand
to a practical range.
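The three reductions can be sketched as follows. This is an illustrative Python rendering under an assumed adjacency-dict graph representation, not Karp's original formulation.

```python
import heapq

# Sketch of the three SMT graph reductions described above. The graph is
# represented as {node: {neighbor: cost}}; this representation is ours.

def dijkstra(graph, src):
    """Standard Dijkstra, used by rule (3) to test for cheaper alternatives."""
    dist, pq = {src: 0}, [(0, src)]
    while pq:
        d, u = heapq.heappop(pq)
        if d > dist.get(u, float("inf")):
            continue
        for v, c in graph[u].items():
            if d + c < dist.get(v, float("inf")):
                dist[v] = d + c
                heapq.heappush(pq, (d + c, v))
    return dist

def reduce_graph(graph, members):
    """Apply the three reductions repeatedly until none fires."""
    changed = True
    while changed:
        changed = False
        # (1) Remove degree-1 non-member vertices with their incident edge.
        for k in [n for n in graph if n not in members and len(graph[n]) == 1]:
            if k in graph and len(graph[k]) == 1:
                (nbr,) = graph[k]
                del graph[nbr][k], graph[k]
                changed = True
        # (2) Splice out degree-2 non-member vertices, merging the two edges.
        for k in [n for n in graph if n not in members and len(graph[n]) == 2]:
            if k in graph and len(graph[k]) == 2:
                (a, ca), (b, cb) = graph[k].items()
                if ca + cb < graph[a].get(b, float("inf")):
                    graph[a][b] = graph[b][a] = ca + cb
                del graph[a][k], graph[b][k], graph[k]
                changed = True
        # (3) Remove any edge costlier than the shortest path between its ends.
        for u in list(graph):
            dist = dijkstra(graph, u)
            for v in list(graph[u]):
                if graph[u][v] > dist[v]:
                    del graph[u][v], graph[v][u]
                    changed = True
    return graph

# Example: members a and c; dangling node d is dropped by rule (1), then the
# now degree-2 relay b is spliced out by rule (2).
g = {"a": {"b": 1}, "b": {"a": 1, "c": 1, "d": 5},
     "c": {"b": 1}, "d": {"b": 5}}
assert reduce_graph(g, {"a", "c"}) == {"a": {"c": 2}, "c": {"a": 2}}
```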
Because of the difficulties in obtaining an SMT in larger graphs, it is often
deemed acceptable to use near optimal trees to replace SMTs. Various near optimal
algorithms exist that produce good approximations to SMT [34].
In addition to computational problems, the worst-case maximum end-to-end path
length of an SMT is not bounded; it can be as long as the longest acyclic path within
the graph. Fortunately, this worst case scenario is unlikely in networks with the rich
connectivity typical of today's networks. Nevertheless, it is important to know, in the
average case, how good or bad a path length along an SMT can be. In chapter 4.2,
we will present simulation results of a near optimal SMT algorithm over random
graphs.
1.1.1.3 Traffic Concentration
As discussed above, SMTs minimize tree cost at the expense of increased path length.
In addition, SMTs typically exhibit higher traffic concentration because all sources
share a common tree. Traffic concentration determines the effective network capacity
for multicast applications. However, the traffic concentration problem has received
less attention than the delay-cost tradeoff in the area of multicast routing. Fig 1.1
illustrates the different traffic concentrations for two subgraphs of the same 3-node
complete graph, where all node pairs are connected by symmetrical links in opposite
directions. There is a multicast group consisting of 3 receivers on nodes X, Y and
Z, and two sources X and Z sending traffic at unit rate 1. Fig 1.1(b) shows a shared
tree used by all senders of the group. Fig 1.1(c) shows source specific shortest path
trees where each source has a distinct shortest path tree. In fig 1.1(b), link Z→Y
has a load of 2 flows, while in fig 1.1(c) all links have a maximum load of 1 flow.

Figure 1.1: Traffic concentration example: (a) sample network; (b) shared tree; (c) source specific trees
In chapter 4.2 we present simulation studies of traffic concentration in large graphs
with many groups.
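The flow counts in this example can be reproduced mechanically. The sketch below is illustrative; the particular tree link sets are stand-ins for the trees drawn in Fig 1.1.

```python
from collections import Counter

# Sketch reproducing the Fig 1.1 flow counts: two sources (X, Z) each send
# one unit of traffic to receivers {X, Y, Z}. The tree shapes below are
# illustrative stand-ins for the figure's shared and source-specific trees.

def link_loads(trees_per_source):
    """Count flows per directed link when each source sends one unit of
    traffic down its tree (given as a list of directed links)."""
    loads = Counter()
    for source, links in trees_per_source.items():
        for link in links:
            loads[link] += 1
    return loads

# Shared tree: both sources forward over the same undirected tree {X-Z, Z-Y}.
shared = {"X": [("X", "Z"), ("Z", "Y")],
          "Z": [("Z", "X"), ("Z", "Y")]}
# Source-specific shortest path trees: each source uses its own tree.
per_source = {"X": [("X", "Y"), ("X", "Z")],
              "Z": [("Z", "X"), ("Z", "Y")]}

assert max(link_loads(shared).values()) == 2      # link Z->Y carries 2 flows
assert max(link_loads(per_source).values()) == 1  # every link carries 1 flow
```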
1.1.2 Protocol Overhead
Routers running a multicast protocol construct and maintain multicast trees by
means of exchanging protocol messages and building routing tables for certain
multicast groups. The quality of multicast protocol mechanisms can be determined
by:
1. The capability to create loop-free distribution trees under dynamic network conditions
and group membership configurations, with a guarantee of loop-free operation;
2. Tree state storage overhead;
3. Control message overhead.
The operating mode of a multicast routing protocol is classified according to the
way a multicast group's state is maintained in the network. If the state information
about a group is maintained in all routers that a source's broadcast packets can reach,
the protocol is operating in dense mode. Otherwise, if state is only maintained in
routers on the distribution tree, the protocol is operating in sparse mode.
Since PIM supports both dense mode operation and sparse mode operation, and
both source-rooted shortest path trees and shared center-based trees, it provides a
framework for the analysis of relevant protocol mechanisms in the later chapters.
PIM supports both source-rooted shortest path trees and group shared trees.
Since these two types of trees can be active at the same time and their paths may
overlap at certain places in the network, there is a possibility that looping may occur
under certain circumstances. In chapter 5, the conditions for loop existence will be
analyzed, and proofs of loop existence lemmas will be presented. These lemmas will
be used to examine a few loop prevention mechanisms proposed or used in PIM.
Faced with the choice of dense mode or sparse mode operations, and the ability
to switch from a shared tree to source-rooted shortest path trees, we are interested in
the overhead tradeoffs and dynamic behaviors during the tree state transitions. We
already know that when the group is extremely dense, e.g. when all leaf networks
have members, then dense mode operation will be most efficient; otherwise, when
only a few leaf subnetworks at far distant ends of the network have members, sparse
mode operation will be most efficient. But since there will be large numbers of
multicast groups operating in non-extreme scenarios, the relative overhead in between
these extreme scenarios is of most interest. Chapter 6 will evaluate the protocol
overhead of dense and sparse mode operations across a wide range of conditions,
and will examine the dynamic switching from shared tree to source-rooted shortest
path trees.
Reducing the number of routers carrying state information for each group's multicast
trees is a first step to achieve better scalability. The next step is to aggregate
the routing entries so that the number of entries a router needs to store can be less
than the number of trees in which the router participates. This is important
considering that the potential number of multicast trees is far bigger than the potential
number of unicast routes. Aggregation techniques include aggregating discrete
multicast routing entries, with certain elements of the lookup key replaced by a
wildcard element, and data-driven state setup and maintenance techniques. We will
examine the design space for such techniques and investigate a few in more detail in
chapter 7.
So far we have concentrated mostly on how to reduce the resources needed inside
the network for supporting multicast routing. Addressing space, or available port
number space, have not been a critical bottleneck as far as routing is concerned.
However, in circuit-switched networks, such as ATM networks, resources available
at terminal devices as well as in the network switches or routers can also be a
limiting factor in the ability to support large numbers of groups. One such relatively
scarce resource in an inexpensive terminal device is the number of virtual channels
supported. The number of virtual channels required for a multicast group can
potentially reflect the cost for supporting such a group. Chapter 8 will investigate
these issues, review a number of proposals, and propose a resequencer model for
supporting multicast in ATM networks.
1.2 Thesis organization
The rest of this thesis is organized as follows:
Chapter 2 reviews related work on different multicast tree calculation algorithms,
and different existing protocols.
Chapter 3 discusses the topology model used in a number of simulation experiments,
and defines a few performance metrics.
Chapter 4 evaluates the different multicast tree types and presents the simulation
results.
Chapter 5 studies the looping conditions in state based multicast routing protocols
supporting different tree types.
Chapter 6 uses simulation to investigate the overhead tradeoffs of dense and
sparse mode operations, and the dynamic switching from shared tree to source-
rooted shortest path trees.
Chapter 7 discusses a number of routing information reduction techniques, from
route aggregations to data-driven state setup and maintenance.
Chapter 8 proposes a resequencer model for multicast over ATM networks.
Chapter 9 summarizes the important contributions of this thesis, and discusses
future research work.
Chapter 2
Related Work
The computation of different types of multicast trees has been studied intensively
in the past. Various protocols have been designed to support these tree types and
operate either in dense mode or sparse mode, or both. This chapter reviews the
previous work. Section 2.1 gives an overview of the shortest path tree and Steiner
tree algorithms, focusing primarily on polynomial time pseudo-optimal Steiner tree
algorithms. Section 2.2 reviews a number of existing multicast routing protocols.
2.1 Algorithms for calculating multicast trees
Historically, the study of computation of multicast trees has been pursued as two
separate threads: one on how to calculate a shortest path tree, such as Dijkstra's
algorithm [9] or the Bellman-Ford algorithm [9]; the other on how to calculate a
minimum distance tree spanning a given subset of nodes in a graph, i.e. the Steiner
tree algorithms. These two threads have been mostly independent of each other, and
there has been no analysis of the relative pros and cons between these two types of
trees. Since the calculation of shortest path trees is rather straightforward,
we now look at related algorithms for computing the (pseudo) Steiner minimal tree.
2.1.1 Polynomial Time Algorithms for Group Shared Trees
The most often studied algorithm is perhaps the Kou, Markowsky & Berman algorithm [21],
approximating a Steiner Minimal Tree (SMT) to within 5% extra link cost on
average [15]. However, the KMB algorithm in its original form needs the complete
network topology and group membership, and therefore is not practical for large wide
area internets. David Wall transformed the KMB algorithm into a distributed
version [30]. However, both algorithms assume that the multicast groups are statically
configured. Doar and Leslie found that a naive algorithm would do just as well as the
smarter algorithms [15] and would allow frequent computation of the distribution
tree in order to track membership dynamics inherent in larger (and many smaller)
groups. To cope with the unbounded delay problems of the (near) optimal Steiner
trees, Wall proposed several center-based tree algorithms. The simplicity of this
class of algorithms is desirable for the design of practical protocols and was used as
the basis for the core based tree (CBT) interdomain multicast routing protocol [4],
as well as for the shared tree mode of another inter-domain multicast proposal, PIM
[12]. The following paragraphs review these algorithms in a bit more detail.
The Kou, Markowsky & Berman (KMB) Algorithm is based on all-pair
shortest paths among the Steiner points. It first constructs a complete graph G1
composed of the nodes of the multicast group members, and labels each link with the
shortest path cost between the nodes in the original graph, G. A minimum spanning
tree M1 is computed from G1, which is used to construct a subgraph G2 of G
by replacing tree links in M1 with the corresponding shortest paths in G. Then a
minimal spanning tree S of G2 is computed. The solution is reached after pruning
unrelated branches off tree S.
The time complexity of this algorithm is O(pn²). The cost of the tree is tightly
bounded by 2 − 2/p times the optimal Steiner tree's cost, where p is the multicast
group size. In certain cases, KMB trees are the same as SMTs. According to Wall's
KMB tree theorem [30]:
When there exists a Steiner tree T that only contains vertices that are
group members, then the KMB tree is a Steiner Minimal Tree.
As a reference example, we choose the trees calculated by the KMB algorithm as
approximations to the SMTs in our experiments measuring the performance of SMTs.
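The KMB steps above can be sketched in code. The following is a compact, centralized rendering for illustration, using plain-Python Dijkstra and Prim helpers; the adjacency-dict graph representation is our assumption, and this is not Wall's distributed version.

```python
import heapq

# Illustrative, centralized KMB sketch: graph = {node: {neighbor: cost}}.

def dijkstra(graph, src):
    """Shortest path distances and predecessor pointers from src."""
    dist, prev, pq = {src: 0}, {}, [(0, src)]
    while pq:
        d, u = heapq.heappop(pq)
        if d > dist.get(u, float("inf")):
            continue
        for v, c in graph[u].items():
            if d + c < dist.get(v, float("inf")):
                dist[v], prev[v] = d + c, u
                heapq.heappush(pq, (d + c, v))
    return dist, prev

def prim_mst(graph, nodes):
    """Prim's MST over the given node set; returns (u, v, cost) edges."""
    start = next(iter(nodes))
    seen, edges = {start}, []
    pq = [(c, start, v) for v, c in graph[start].items()]
    heapq.heapify(pq)
    while pq and len(seen) < len(nodes):
        c, u, v = heapq.heappop(pq)
        if v in seen:
            continue
        seen.add(v)
        edges.append((u, v, c))
        for w, cw in graph[v].items():
            if w not in seen:
                heapq.heappush(pq, (cw, v, w))
    return edges

def kmb_tree(graph, members):
    members = list(members)
    # Step 1: complete graph G1 over members, weighted by shortest path costs.
    sp = {m: dijkstra(graph, m) for m in members}
    g1 = {m: {n: sp[m][0][n] for n in members if n != m} for m in members}
    # Step 2: minimum spanning tree M1 of G1.
    m1 = prim_mst(g1, members)
    # Step 3: expand each M1 edge into its shortest path in G, giving G2.
    g2 = {}
    for u, v, _ in m1:
        path, node = [v], v
        while node != u:
            node = sp[u][1][node]
            path.append(node)
        for a, b in zip(path, path[1:]):
            g2.setdefault(a, {})[b] = graph[a][b]
            g2.setdefault(b, {})[a] = graph[a][b]
    # Step 4: minimum spanning tree of G2.
    adj = {}
    for u, v, c in prim_mst(g2, set(g2)):
        adj.setdefault(u, {})[v] = c
        adj.setdefault(v, {})[u] = c
    # Step 5: prune non-member leaves repeatedly.
    while True:
        leaves = [n for n in adj if len(adj[n]) == 1 and n not in members]
        if not leaves:
            return adj
        for n in leaves:
            if n in adj and len(adj[n]) == 1:
                (nbr,) = adj[n]
                del adj[nbr][n], adj[n]

# Example: members A, C, E; the tree routes through relays B and D (cost 4).
g = {"A": {"B": 1, "E": 10}, "B": {"A": 1, "C": 1, "D": 1},
     "C": {"B": 1}, "D": {"B": 1, "E": 1}, "E": {"D": 1, "A": 10}}
tree = kmb_tree(g, ["A", "C", "E"])
assert set(tree) == {"A", "B", "C", "D", "E"}
assert sum(c for nbrs in tree.values() for c in nbrs.values()) // 2 == 4
```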
David Wall's distributed version of KMB used a consistent tie-breaking
rule that required only one MST construction, and eliminated the pruning step
[30]. The impact on message complexity, and the stability of the algorithm in large
networks, are still unknown. Note that the tree computed is still a KMB tree; it is
only the control mechanism that is changed.
Doar & Leslie's Naive algorithm for dynamic multicast groups computes a
multicast route by first combining the shortest paths across initial multicast group
members, then joining new members to the nearest attachment point on the existing
tree. Their simulation results showed that the costs of the naive trees are within 1.5
times that of the KMB trees, and the averages of their maximum path lengths are
around 50% - 60% of the KMB trees'.
While this particular algorithm requires knowledge of the complete group
membership, it may be possible to modify it for use without such global knowledge.
However, before considering it as a candidate for real networks, one question that
needs to be answered is: how does it compare with the widely used shortest path
trees? The answer can be derived from the comparisons of KMB trees and shortest
path trees which we present in Chapter 4.
Center-Based Tree [30], as the name indicates, uses a shortest path tree rooted
at a node "in the center" of the network. It is proven in [30] that the maximum
delay of an optimal center based tree is bounded at 2 times the delay along
the shortest path between any pair of group members. It is also shown that if the
center is moved to a multicast group sender or receiver, the bound of the maximum
delay turns out to be 3 times the delay along the shortest path.

The fact that there exists a delay bound for this kind of tree is encouraging.
However, for practical purposes, what we really want to know is not only the worst
case bound, but also the average case maximum delay of such a tree, and the
distribution of such delays. Another unknown factor is the cost of such trees, that is,
the cost savings over the source-rooted shortest path trees for packet delivery.

It is difficult to use center based trees in their original form in the design of real
multicast protocols, because (1) finding the center for a group is an NP-complete
problem, and (2) it requires centralized knowledge of the entire network and group
membership. Alternative practical forms can be based on heuristic center placement
strategies [4], or simply use one shortest path tree rooted at a member of the
multicast group, if delay and cost requirements are not too critical. We call the optimal
member-rooted shortest path tree an MSPT.
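For small graphs, an MSPT root can be picked by brute force. In this hedged sketch we take "optimal" to mean minimizing the worst root-to-member delay, which is one plausible reading of the definition above; the helper names are ours.

```python
import heapq

# Illustrative brute-force MSPT root selection: among group members, choose
# the one whose shortest path tree gives the smallest worst-case delay to
# any member. Assumes all members are reachable from every candidate root.

def dijkstra_dist(graph, src):
    dist, pq = {src: 0}, [(0, src)]
    while pq:
        d, u = heapq.heappop(pq)
        if d > dist.get(u, float("inf")):
            continue
        for v, c in graph[u].items():
            if d + c < dist.get(v, float("inf")):
                dist[v] = d + c
                heapq.heappush(pq, (d + c, v))
    return dist

def mspt_root(graph, members):
    def worst_delay(root):
        d = dijkstra_dist(graph, root)
        return max(d[m] for m in members)
    return min(members, key=worst_delay)

# Example: on the path A-B-C-D-E with members {A, C, E}, the middle member
# C minimizes the worst-case delay to the other members.
g = {"A": {"B": 1}, "B": {"A": 1, "C": 1}, "C": {"B": 1, "D": 1},
     "D": {"C": 1, "E": 1}, "E": {"D": 1}}
assert mspt_root(g, ["A", "C", "E"]) == "C"
```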
Finally, to consider any algorithm for practical application, we want to know how
evenly it distributes the routes, i.e. are there serious traffic concentration problems?
Again, simulation will be used to answer these questions.
2.2 Protocols
This section reviews a few existing multicast routing protocols. These protocols
are classified into two categories: those that flood the information about a group
and those that restrict the group membership information to on-tree routers. The
next subsection discusses the traditional protocols that flood state information about
groups.1 Subsection 2.2.2 discusses protocols based on directed-membership-advertisement.
2.2.1 Traditional Flooding-Group-State Protocols
The simplest mechanism to achieve one-to-many delivery is broadcast. It is assumed
that every node within the network is a receiver. Dalal devised a reverse path
forwarding (RPF) mechanism [10] which essentially eliminated duplicate packets and
does not require computation of the broadcast tree; it relies on the unicast routing
information to get the shortest path information between pairs of nodes. This is the
major advantage of the RPF method. The disadvantage of this scheme is that it
sends the packets to all nodes within the network. This mechanism became the basis
of a number of newer schemes that can do multicast, i.e. only send packets to group
members.

Steve Deering, in his PhD dissertation, devised three types of multicast methods:
spanning tree multicast, distance vector multicast, and link state multicast. All
methods use source-specific shortest path trees for packet delivery.
The spanning tree multicast was motivated by the fact that LAN bridges use
spanning-tree algorithms to interconnect different LANs. Deering extended the basic
algorithm to truncated broadcast and multicast. In truncated broadcast, a leaf
network does not receive packets unless it has local multicast group receivers. In
spanning-tree multicast each router keeps a record of all membership reports, and
forwards data packets according to the interfaces that were marked when membership
reports were received. Spanning-tree algorithms only work in moderate sized
local networks due to their flooding behavior.

1 Either by flooding control packets, such as link state update packets, carrying the group
membership information, or by broadcasting the data packets to set up the state of the group in all
routers reachable by the broadcast data packets.
Deering's distance vector multicast is an extension of Dalal and Metcalfe's reverse-path
forwarding (RPF) algorithm. It achieves multicast by letting subnetworks absent
of multicast group members send prune messages toward the sources. Only
the first data packet is delivered via truncated-broadcast; when the packet arrives
at routers on subnetworks without group members it triggers the sending of prune
messages. The pruned branches grow back after a time-out period, and the subsequent
data packets are sent via truncated-broadcast again. This will trigger more
prune messages to prune the unwanted branches. For moderate sized networks and
moderate numbers of multicast groups, this algorithm turns out to perform
reasonably well. But this protocol requires all reachable routers within the scope to carry
periodic packets and maintain state for all groups, no matter what percentage of
nodes are group members. For larger networks, with millions of sparsely distributed
groups, the potential overhead of this protocol can be large.
Deering's link state multicast extends the unicast link-state algorithm to do
multicast by flooding dynamic membership reports together with link-state updates.
Consequently each router has an exact map of member locations, and can calculate
the whole tree to decide whether it is on a branch. However, the tree computations
are done only upon receipt of a packet from a new source to the group. MOSPF is a
protocol that implements Deering's original algorithm. There are two major scaling
problems with this protocol: First, it floods membership information to
all routers. Second, analysis of the implementation experience [23] shows that the
number of Dijkstra shortest path calculations is a major concern for this protocol,
even for networks with only around 200 nodes.
Another important contribution from Deering was the separation of the host
membership protocol (the host group model). This division effectively decoupled
the service interface from the implementation of the services. Therefore, application
software can be developed without regard to specific protocols.
2.2.1.1 Is link-state multicast a viable option for scalable multicast?
The recent trend of basing multicast route calculation on unicast routing tables (e.g.
RPF) has raised a question about link-state multicast routing protocols: how well
can they scale?
This makes it necessary to compare the RPF-based schemes and link state based
schemes. The following are some differentiating properties of these two classes of
algorithms:
1. Information Hiding. It is the nature of RPF based schemes to hide membership
information in intermediate routers. A router on a tree only needs
to know the existence of downstream members along certain outgoing interfaces,
and need not know how many members each interface leads to, nor where
precisely those members are located.
With link-state multicast, each and every router inside the whole network needs
to know precisely the global membership information, by receiving the flooded
link-state updates. Note that not only on-tree routers but also off-tree routers
need to carry the same information in order to carry out the all pair shortest
path tree calculations.
2. Data-driven state setup. With RPF mechanisms, it is relatively easy to
manage multicast routing state in a data-driven fashion; for example, DVMRP
and PIM maintain data-driven source-rooted shortest path trees.
However, there is no known method to do low-overhead data-driven link-state
multicast.
3. RPF based mechanisms incur very low processing overhead in router CPUs,
because there is no graph-related computation needed for multicast purposes.
But for a link-state protocol, the load on router CPUs is proportional to the
total number of groups and sources, whether the router is on a tree for those
groups and sources or not.
For these reasons, we do not expect link-state multicast to be a promising approach
for good scalable multicast routing techniques.
2.2.2 Protocols based on Directed-Membership-Advertisement
After the creation of the MBONE running DVMRP, it was recognized by a group
of researchers in the IETF that DVMRP's potential scaling problem lies in its
broadcast behavior, and that it disseminates unicast routing information using RIP, which
itself does not scale well. This motivated members of the Inter-Domain Multicast
Routing (IDMR) working group of the Internet Engineering Task Force (IETF) to search for
new solutions with better scaling properties. As a result, CBT and PIM have been
proposed as candidate scalable replacements.
2.2.2.1 Core Based Tree (CBT)
CBT is a center-based forwarding protocol. A router called the core is chosen as the
center for a multicast group. Each sender and multicast group receiver sends join
messages toward the group core. A center-based forwarding tree is set up along the
way when routers on the path receive positive acknowledgments from the core and
relay that response downstream toward the leaves of the tree.
The advantage of this scheme is that it introduced a certain degree of separation
between multicast protocols and unicast protocols: it doesn't depend on how the
shortest path tree is computed (unlike DVMRP and MOSPF), nor does it require a
separate unicast routing protocol. In addition, by maintaining one tree for each
multicast group, compared with DVMRP and MOSPF, CBT reduces the state storage
required for each group by a factor of the number of senders within the group.
The disadvantage of this scheme has been that it is not possible to use a single tree
to provide minimal delay and minimal cost routing for every sender of a multicast
group, and there is a tendency for more traffic concentration than if SPT schemes
were used. Also, optimal core placement is an NP-complete problem, though heuristic
placement was suggested in the CBT design document [4]. No backup mechanism
exists to counter the worst case scenario, when even the optimal CBT offers very
unsatisfactory end-to-end delays compared with the shortest path delays. Although
a better core placement strategy may improve the average case, it is not clear whether
an acceptable solution can be found in general.
2.2.2.2 Protocol Independent Multicast (PIM)
PIM has been motivated by the desire to accommodate the benefits of the traditional
Deering IP multicast architecture and CBT, and by the desire to provide the capability
to make flexible tradeoffs in a tangible way. PIM provides two basic modes of
operation: a sparse mode and a dense mode.
PIM sparse mode is capable of maintaining two types of packet delivery trees:
source-rooted shortest path trees and shared rendezvous-point (RP) centered trees.
In this mode, explicit directed join messages are used to establish tree branches.
PIM sparse mode starts by setting up a shared tree. Each group designates one
or more routers as Rendezvous Points (RPs). Each designated router with directly
connected members sends Join messages toward the RP. Such a Join message sets
up a shared tree branch extending from the RP back to the receiver. Each sender
sends Register messages to the RP, which responds by sending Join messages back
to the source.2 Once a receiver is on the RP tree, it will gain knowledge about all
sources sending to the group. To switch to a source-rooted shortest path tree, a
receiver's designated router sends a join towards the source, and a prune towards
the RP for the group.
PIM sparse mode operates efficiently when the multicast group members are
sparsely distributed across wide areas. Based on the nature of each application, the
tree type to be used can be selected independently for each group.
PIM dense mode (PIM-DM) is used when multicast group members are densely
populated within a network. The default assumption of dense mode is that every
node needs a copy of all data packets addressed to the group. A sender doesn't need
to register with any RP; it directly sends packets addressed to the group, which are
then broadcast using the reverse path forwarding technique. Every node
will receive a copy of the packet by default. Those nodes that don't need the packet
will prune the unnecessary branches explicitly.3

2 If the data rate is high enough; otherwise no Join messages are sent and data packets are
delivered to the RP, encapsulated in Register messages.
3 Note that a join message is also used in this mode to reconnect a branch if a router's
upstream branch is pruned by other routers on the same LAN.
Chapter 3
Topologies, Metrics and Tools
An algorithm's performance can be sensitive to the properties of the network topology,
group membership distribution and group size. There is a need to use a network
model that can be easily characterized. Section 3.1 discusses the topology model to
be used in subsequent simulation experiments. Section 3.2 presents the performance
metrics used to evaluate each tree type. Section 3.3 briefly describes the simulation
tools developed for most experiments reported in this thesis.
3.1 Network Topologies
We will use simulations over different classes of random graphs to capture the
comprehensive characteristics of the algorithms and trees. We adopted the random
graph model introduced in [31] that can generate a variety of different graphs with
classifiable features, i.e. connectivity degrees and different edge distributions. One
advantage of this model over a purely random model is that it can more easily be
correlated to real world networks.1

The n vertices are randomly distributed over a rectangular coordinate grid, and
are assigned integer coordinates. Edges are introduced according to the edge
probability function that takes a pair of nodes (u, v) as its variables:

P(u, v) = β · e^(−d(u,v)/(αL))    (3.1)

1 According to results from [5], when the size of random graphs is big enough, most graph
properties will hold when the graph size grows even bigger. We repeated our major experiments
with two different graph sizes, 50 and 200 nodes, and found the measured results consistent.
Figure 3.1: Random graphs of different α's: (a) α = 0.1, (b) α = 1.0
where d(u, v) is the distance from node u to v, L is the maximum shortest path
distance between any pair of nodes in the network (often called the network diameter),
1 > α > 0, β > 0.² A larger value of α increases the ratio of the number of long
edges vs. short edges, and a bigger β results in a bigger average node degree of the
whole graph. Fig 3.1 shows two 20-node graphs of the same average node degree
and under the same node placement, but generated with different α's.
We assign the delay of a link to be the distance between the two end nodes in
the simulations.3
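The model can be sketched directly from equation (3.1). The grid size, the seed handling, and the use of the maximum pairwise distance for L in the sketch below are our illustrative assumptions.

```python
import math
import random

# Sketch of the Waxman-style edge probability model of Eq. (3.1):
# P(u, v) = beta * exp(-d(u, v) / (alpha * L)). Grid size, seed, and the
# choice of L as the maximum pairwise distance are illustrative assumptions.

def waxman_graph(n, alpha, beta, grid=100, seed=1):
    rng = random.Random(seed)
    coords = [(rng.randint(0, grid), rng.randint(0, grid)) for _ in range(n)]

    def d(u, v):
        (x1, y1), (x2, y2) = coords[u], coords[v]
        return math.hypot(x1 - x2, y1 - y2)

    L = max(d(u, v) for u in range(n) for v in range(u + 1, n))
    edges = []
    for u in range(n):
        for v in range(u + 1, n):
            if rng.random() < beta * math.exp(-d(u, v) / (alpha * L)):
                # Link delay follows the distance between its end nodes,
                # matching the delay assignment used in the simulations.
                edges.append((u, v, d(u, v)))
    return coords, edges

# With a fixed seed, a larger beta can only add edges, never remove them.
coords, sparse = waxman_graph(20, 0.3, 0.3)
_, dense = waxman_graph(20, 0.3, 0.9)
assert len(dense) >= len(sparse)
```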
3.2 Performance Metrics
Let G = (V, E, C) be a directed graph, where V is a set of nodes, E = {(u, v) | u, v ∈ V}
is a set of edges, and C = {c(u, v) | (u, v) ∈ E} is a set of edge costs. Let M ⊆ V be the
2 The original model has the restriction β ≤ 1. We found that larger values of β beyond 1,
when combined with appropriately small α values, also generate graphs that, subjectively at least,
appeared to be of practical significance. This observation was made through the X-window based
Random Topology Generator/Previewer, rtg, that was developed to help visualize different graphs.
3 The assumption is that the propagation delays governed by the laws of physics dominate the
delays over links in a wide-area network. The difference in delay caused by different link speeds
and queuing delays can be treated as a relatively "small constant". For example, the difference
in transmission delay between a T1 and T3 link for a 1K byte packet is roughly 5 ms, which is
equivalent to the propagation delay of light over about 660 miles of optical fiber. The queuing delays
vary according to changes in traffic, and are not normally used in routing algorithms that don't
adapt dynamically to traffic loads.
set of multicast members, S ⊆ V be the set of senders for M, T_M(u) = (V_M, E_M, C_M)
such that T_M(u) ⊆ G be a multicast tree for M which source u uses for packet
delivery, T_M be the set of a specific kind of multicast trees for all sources in S for
M, and d(u, v, T_M(u)) be the path length from u to v via tree T_M(u). We define the
performance measures as follows:

1. Maximum and average delay measures due to Wall [30]. First, the maximum
delay for source u along tree T_M(u) is,

maxD(T_M(u)) = MAX{ d(u, v, T_M(u)) | for all v ∈ M }    (3.2)

the maximum delay for all sources in S for group M,

MaxD(T_M) = MAX{ maxD(T_M(u)) | for all u ∈ S }    (3.3)

and the average of the maximum delays for multicast group M,

AveD(T_M) = (1/n) · Σ_{u ∈ S} maxD(T_M(u)),  n = |S|    (3.4)

For convenience of comparison, we normalize the delays across different graphs,
and use R_MaxD, the ratio of maximum delays, and R_AveD, the ratio of average
maximum delays,4

R_MaxD(T_M) = MaxD(T_M) / MaxD(SPT_M)    (3.5)

where SPT_M is the set of shortest path trees for group M. And,

R_AveD(T_M) = AveD(T_M) / AveD(SPT_M)    (3.6)

4 Note that we could have defined this ratio as R_MaxD(T_M)' = MaxD(T_M) / maxD(SPT_M(u)),
where u ∈ S and maxD(T_M(u)) = MaxD(T_M). The alternative definition R_MaxD(T_M)' gives the
ratio of the maximum path length for an individual source, but it may not capture the change of
delays in a meaningful way: a source that has the largest maximum delay in T_M may not have the
largest maximum delay in an SPT. Definition (3.5) presents the change in maximum delay for a
whole group. The down side of definition (3.5) is that it does not show the change of fate among
the group members in different types of trees.
2. Link cost for traffic from source u along tree T_M,

Cost(T_M, u) = Σ_{(i,j) ∈ E_M} c(i, j)    (3.7)

For shared group trees, if the links are symmetric (c(i, j) = c(j, i)) and all
sources are receivers themselves, Cost(T_M, u) will be the same for all sources
u.

For a non-shortest-path-tree T_M, the ratio of link cost for traffic from source u is,

CostRatio(T_M, u) = Cost(T_M, u) / Cost(SPT_M, u)    (3.8)

where Cost(SPT_M, u) is the link cost of a shortest path tree rooted at u
extending to all members of group M. When there are multiple shortest path
trees, we pick one at random.

3. Traffic Concentration. Let num_flow(i, j) be the number of flows passing link
(i, j). The maximum link load in a graph G when there are n active groups is,

TCmax(G, n) = MAX{ num_flow(i, j) | for all (i, j) ∈ E }    (3.9)

The distribution function of all link loads of graph G under n active groups is,

Dist(G, i) = number of links with i flows    (3.10)
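Given per-source trees, the measures above reduce to a few lines of bookkeeping. The data representations below (per-receiver path delays and link/cost pairs) are illustrative assumptions, not the simulator's actual structures.

```python
from collections import Counter

# Hedged sketch of the Section 3.2 measures for one group, assuming each
# source's tree is summarized by a dict of receiver -> path delay, a list of
# (link, cost) pairs, and per-group lists of links carrying flows.

def max_delay(path_delay):                 # maxD(T_M(u)), Eq. (3.2)
    return max(path_delay.values())

def group_max_delay(trees):                # MaxD(T_M), Eq. (3.3)
    return max(max_delay(t) for t in trees.values())

def group_ave_delay(trees):                # AveD(T_M), Eq. (3.4)
    return sum(max_delay(t) for t in trees.values()) / len(trees)

def tree_cost(links):                      # Cost(T_M, u), Eq. (3.7)
    return sum(c for _, c in links)

def load_distribution(flows):              # Dist(G, i), Eq. (3.10)
    per_link = Counter(link for tree in flows for link in tree)
    return Counter(per_link.values())

# Example: two sources with per-receiver delays, plus two flows on a graph.
trees = {"u1": {"v1": 3, "v2": 5}, "u2": {"v1": 2, "v2": 4}}
assert group_max_delay(trees) == 5
assert group_ave_delay(trees) == 4.5
assert tree_cost([(("a", "b"), 1), (("b", "c"), 2)]) == 3
```

The delay ratios of equations (3.5) and (3.6) then follow by dividing these quantities for the tree under study by the same quantities computed on the shortest path trees.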
3.3 Topology Generator and Simulators

A random topology generator/previewer and two simulators have been developed
for the experiments with tree types and protocol mechanisms.

The random topology generator, RTG, produces random graphs. Waxman's edge
distribution model is implemented. RTG's output topologies can be sorted according
to the values of α and β in Waxman's model, as well as the network size and
average node degree. RTG has an X-based previewer that allows visual inspection
of topologies with a small number of network nodes (< 100), and has PostScript output
capabilities.
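A minimal sketch of Waxman's edge-distribution model as an RTG-style generator might use. The exact roles of α and β vary between formulations; here the edge probability is assumed to be P(u,v) = β · exp(−d(u,v) / (α·L)), with L the maximum node-to-node distance. All names are illustrative.

```python
# Waxman random graph sketch: nodes placed uniformly in the unit square,
# each candidate edge kept with probability beta * exp(-d / (alpha * L)).
import math
import random

def waxman_graph(n, alpha, beta, seed=0):
    rng = random.Random(seed)
    pos = [(rng.random(), rng.random()) for _ in range(n)]
    dist = lambda u, v: math.dist(pos[u], pos[v])
    L = max(dist(u, v) for u in range(n) for v in range(u + 1, n))
    edges = [(u, v) for u in range(n) for v in range(u + 1, n)
             if rng.random() < beta * math.exp(-dist(u, v) / (alpha * L))]
    return pos, edges

pos, edges = waxman_graph(50, alpha=0.2, beta=0.4)
avg_degree = 2 * len(edges) / 50
```

Small α favors short links over long ones, which is the "reasonableness" knob varied in the experiments below.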
A simulator named mtree simulator was developed for experiments with tree
types and different algorithms. Floyd's all-pairs shortest path algorithm, KMB[21],
and a few variations of center-based tree algorithms are implemented. A set of scripts
was also developed to gather experimental data and conduct post-simulation data
processing.
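The all-pairs shortest path computation mentioned above can be sketched with Floyd's algorithm; the graph layout (dict of per-node link costs) and names are illustrative.

```python
# Floyd-Warshall all-pairs shortest paths over a small weighted graph.
INF = float('inf')

def floyd_warshall(nodes, cost):
    # d[i][j] starts as the direct link cost (or INF), then is relaxed
    # through every intermediate node k.
    d = {u: {v: (0 if u == v else cost.get(u, {}).get(v, INF)) for v in nodes}
         for u in nodes}
    for k in nodes:
        for i in nodes:
            for j in nodes:
                if d[i][k] + d[k][j] < d[i][j]:
                    d[i][j] = d[i][k] + d[k][j]
    return d

nodes = ['a', 'b', 'c', 'd']
cost = {'a': {'b': 1}, 'b': {'a': 1, 'c': 2, 'd': 7},
        'c': {'b': 2, 'd': 1}, 'd': {'b': 7, 'c': 1}}
d = floyd_warshall(nodes, cost)   # e.g. d['a']['d'] == 4, via a-b-c-d
```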
The second simulator, PIMSIM, is a packet-level event-driven simulator developed
to exercise PIM mechanisms. It is based on the Maryland Routing Simulator
(MaRS)[3, 2], a routing simulation test bed, and incorporates part of the mtree
simulator. It utilizes MaRS's event management routines, basic network construction
models and X window user interface routines. It relies on the portion derived from
the mtree simulator to calculate unicast shortest path routing information. The goal
of this simulator is to capture the details of the dynamic behaviors of the basic PIM
mechanisms. The new simulator has PIM-specific routing modules, multicast traffic
modules, multicast-capable node modules and multiaccess LANs. For a more detailed
description of PIMSIM, please see the design document for PIMSIM[32].
Chapter 4
Evaluations of different multicast tree types
A simulator has been developed to simulate the performance of different tree calculation
algorithms. This makes it possible to directly measure the data delivery costs
and qualities for different types of trees.
4.1 Simulating Different Trees
Tree calculation algorithms simulated include: SPT, KMB, MIN-MAX*, CBT and
MSPT.

Since there are a number of different tree types to evaluate, we have to choose
one of them as a reference for comparison purposes. SPTs have already achieved
much success in practical protocols. Therefore, analysis of other schemes for the
same purposes will be more meaningful when compared directly with SPTs. We
constructed two sets of experiments to measure the costs and delays of trees. We
compare KMB trees with SPTs, then center based trees with SPTs. Comparison
of SPT and KMB trees can reveal how much more room for link cost savings we
may have beyond SPTs, and how much more link delay we should expect if such
cost savings are achieved using KMB trees. Comparisons between SPT and center
based trees illustrate how much more delay will be incurred by center based trees, on
average, and whether they use more or less bandwidth resources than the shortest
path trees.

*A low-cost version of a heuristic near-optimal Steiner tree algorithm I devised. It didn't
show a significant advantage in the quality of trees generated, compared with MSPT. The idea was
therefore dropped.

For traffic concentration experiments, since center based trees are the only practical
candidate for constructing group shared trees in real-world protocols [30][12],
we only compare traffic concentrations of center based trees with SPTs.
The parameters that might affect the performance of different distribution tree
types are: (1) reasonableness of graphs, i.e. the proportion of short links vs. long
links, which was suspected to have influence over the delay and cost of trees; (2)
graph node degree; (3) multicast group size (number of receiver members in the current
group); (4) number of sources sending to the group; (5) distribution of sources and
receiver members; and (6) graph size.
4.2 Results on Multicast Trees and Algorithms

Most experiments were run over graphs of two different sizes: 50 nodes and 200
nodes. In the delay and cost comparisons, for each random graph generated, a
randomly selected multicast group was put in the graph, then the shortest path
trees, the center based trees (CBT² and MSPT) and the KMB tree were computed.
We assume S = M.³ Figures 4.1 through 4.4 are the results for the delay and cost
comparisons. In these pictures, there are 500 runs at each data point. Figures 4.5
and 4.6 are the results for the traffic concentration experiments. There we
also investigate the situations when S ≠ M.
4.2.1 Delay and Cost: SPT, KMB, CBT and MSPT

Figure 4.1 shows the effects of different node degrees on the delay and cost properties.
All graphs have 50 nodes, and all multicast groups have 10 members. When processing
simulation data, the average node degrees are rounded to the nearest integer. The
vertical axes are the averages of R_MaxD, R_AveD and Cost_ratio respectively. Solid
lines are for graphs with α = 0.2, dotted lines are for graphs with α = 0.6. A few
observations that can be made from this picture are:
1. All algorithms are relatively insensitive to the reasonableness of graphs;
²In this section, CBT stands for a center based tree with optimal center placement, not the
Core Based Tree protocol in particular.
³I.e., each multicast group member takes the role of both sender and receiver.
Figure 4.1: Comparisons of delay and cost in 50-node graphs: (a) KMB tree vs SPT,
(b) center tree vs SPT, (c) MSPT vs SPT. (Each column plots R_MaxD, R_AveD
and Cost_ratio against node degree.)
Figure 4.2: Effect of group size for KMB trees, in 50-node graphs: (a) Ratio; (b)
Histograms for group size of 5; (c) Histograms for group size of 25
2. The maximum delay within a KMB tree tends to be larger than that of a CBT
tree. The maximum delay of a KMB tree increases faster when the average
node degree increases. Note that the MSPT R_MaxD and R_AveD curves are quite
close to those of CBT's;

3. The KMB Cost_ratio curve decreases faster. MSPT Cost_ratio is close to
CBT Cost_ratio.

Observations 2 and 3 above suggest that MSPT will on average be almost as
good as the optimal CBT.

The above experiments were rerun over 200-node graphs, also with 10-member
groups. The R_MaxD, R_AveD and Cost_ratio all have similar trends; KMB R_MaxD
and R_AveD are slightly higher, and the KMB Cost_ratio curve is slightly flatter. The
changes in MSPT and CBT curves are very small. Therefore at higher node degrees,
the differences between KMB R_MaxD and CBT R_MaxD, and the differences between
KMB R_AveD and CBT R_AveD, are slightly smaller than in 50-node graphs.
Figures 4.2, 4.3 and 4.4 show the effect of changes in group populations on
the KMB and center tree delays and costs, in comparison with shortest path trees.
The results were taken from 50-node graphs where the average node degree is restricted
to 4 ± 0.5.
Figure 4.3: Effect of group size for center trees, in 50-node graphs: (a) Ratio; (b)
Histograms for group size of 5; (c) Histograms for group size of 25
Figure 4.4: Effect of group size for MSPT, in 50-node graphs: (a) Ratio; (b)
Histograms for group size of 5; (c) Histograms for group size of 25
Histograms are plotted alongside to show the distributions of the three parameters
MaxD, AveD, and Cost_ratio at group sizes of 5 (figures 4.2 (b), 4.3 (b), 4.4
(b)) and 25 (figures 4.2 (c), 4.3 (c), 4.4 (c)). The bin size is fixed at 0.05. Note that
the horizontal axis of the histograms is normalized count. E.g. a bar of length 20
at ratio = 1.1 means that 20% of the measured data values lie at a ratio of 1.1 (to
1.1 + 0.05).

It can be observed from the pictures that the delays of KMB trees tend to grow
larger with larger groups, while their costs in comparison with shortest path trees
tend to be lower. The delay and cost curves of the center trees, however, are rather
flat. With large groups, it can be seen from the histograms that center trees tend
to have shorter delays than KMB trees, but the peaks of their Cost_ratio histograms
are higher than those of the KMB trees. Although there seems to be an inconsistency
in the picture of the MaxD, AveD histograms in fig 4.3 (b), where MaxD peaks
at ratio 1.0 with count ≈ 23% while AveD at ratio 1.0 is only at count ≈ 13%, this is
mainly due to the definition of R_MaxD.⁴

KMB trees in general have bigger variations in delay than center trees. The tails
of R_MaxD, R_AveD in fig 4.2 (b) and (c) extend to about 2.4, while in fig 4.3 (b) (c)
and fig 4.4 (b) (c), the tails are below 1.7, and larger groups have slightly shorter
tails.

The above observation exhibits an interesting aspect of the fate-sharing nature
of center trees: they are not optimal for everyone, but overall, they are also not
very bad for anyone either.

We repeated the above experiment in 200-node graphs, with group sizes ranging
from 20 to 80. The results have similar trends. KMB delay curves are higher; the
average ratio of delays is around 2. The KMB to SPT cost ratio curve is lower. CBT
and MSPT delay curves, though slightly higher, are not significantly different than
in 50-node graphs.

⁴An example of this anomaly can be shown with two sets of numbers, {5, 5, 5} and {1, 1, 5}. The
ratio of the maximum numbers of these two sets is 5 : 5 = 1, but the ratio of the averages is 15 : 7 ≈ 2.14.
Figure 4.5: Traffic Concentration, in 50-node graphs with 300 40-member groups,
32 senders per group: (a) TCmax vs. node degree for SPT, CBT and MSPT; (b)
ratio of TCmax (CBT/SPT, MSPT/SPT)
4.2.2 Traffic Concentration: SPT, CBT and MSPT

In the traffic concentration experiments we use a number of fixed-size multicast
groups randomly placed on a graph, and some of the group members are randomly
assigned the roles of senders. Assuming each source of the group generates traffic
at a constant unit rate, the total number of unit traffic flows that traverse a link is
counted. Each physical link is treated as two unidirectional links connecting the
same pair of nodes in opposite directions. For the same graph and groups, center
based trees and shortest path trees are computed and link loads counted.
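The flow counting just described can be sketched directly from equations (3.9) and (3.10): each source tree contributes one unit flow on every directed link it uses. The tree representation and names here are illustrative.

```python
# Count unit flows per directed link over a set of delivery trees, then
# derive TCmax (eq. 3.9) and the load distribution Dist (eq. 3.10).
from collections import Counter

def count_flows(trees):
    """trees: list of delivery trees, each a list of directed links (i, j)."""
    load = Counter()
    for tree in trees:
        for link in tree:
            load[link] += 1   # one unit flow per source tree on this link
    return load

# Two toy source trees that share the directed link ('b', 'c').
trees = [[('a', 'b'), ('b', 'c')],
         [('d', 'b'), ('b', 'c')]]
load = count_flows(trees)
tc_max = max(load.values())        # maximum link load
dist = Counter(load.values())      # how many links carry i flows
```

Here the shared link ('b', 'c') carries 2 flows while the other two links carry 1 each, a miniature of the traffic concentration effect measured below.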
The maximum link loads for SPT, CBT and MSPT are shown in fig 4.5 (a).
Figure 4.5 (a) shows that as the average node degree of the random graphs grows
from 3 to 8, the SPT's maximum link load decreases. This is because there are more
redundant links when the average node degree is higher, and there exist alternate paths
between many pairs of nodes. However, center based trees' maximum link loads
hardly change when the average node degree increases, which is not surprising
considering the fact that they are center- or single-source-rooted shortest path trees.
Fig 4.5 (b) shows the ratio of maximum link loads of CBT and MSPT vs. that of SPTs.
We have run the same experiment under identical configurations but with 2 senders
per group; the results are comparable.
Figure 4.6: Distribution of link loads in the same graph: (a) SPT; (b) CBT; (c)
MSPT. (Horizontal axes: link load; vertical axes: number of links.)
This experiment suggests that in networks with a low connectivity degree, the link
utilization pattern of center-based trees will be very close to that of shortest path
trees. However, when the average node degree increases, center-based trees maintain
almost flat maximum link loads, whereas the maximum link load of shortest path
trees drops significantly. This indicates that the added alternative paths are not being
sufficiently utilized by center based trees.

Figure 4.6 shows the distribution of link loads within one specific 50-node graph.
The average node degree is 4.8, and there are 300 active groups, all having 40 members.
There are 20 senders within each group. When the number of senders increases from
2 to 20, with all three types of trees, the distributions of link load roughly
retain the same shape except that the lengths of the tails extend to about ten times
those in the 2-sender cases. Only results from the 20-sender experiments are shown in fig 4.6.

The SPT histogram profile drops smoothly with larger link loads. The CBT
and MSPT histograms both have a narrow high peak near 0, followed by a long
tail with a rather significant portion at the end of the tail. If we draw a vertical
line in the two right pictures in fig 4.6 at the position equal to the maximum link
load of the corresponding SPT link load histogram (as shown in fig 4.6 (b) (c) in
dotted lines), the area to the right of the line represents the number of links that
need higher link capacities to service the same configuration of multicast groups
as in their SPT counterpart. The link load distribution is closely related to the
distribution of member locations of the multicast groups. Changing the set of
multicast groups may change some details of the histograms, but the profiles shown
in fig 4.6 remain relatively constant under a random distribution of multicast groups.

Table 4.1: Summary of experiments with different tree types

                            KMB               SPT                 CBT/MSPT
  Delay                     High, no bound    Lowest              Low average
  Sum of link cost          Lowest            High                Low
  Distribution of traffic   -                 Even distribution   Concentrates traffic
  Complexity                Most complicated  Simplest            Medium
This suggests that center based trees may not be ideal for high bandwidth
applications, which after the multiplying effect would create hot spots and reduce the
effective number of traffic flows that can be admitted into the network.
4.3 Conclusions from the Tree Simulation Results

Table 4.1 shows a summary of the experiments. SPT and shared tree (SMT &
center based) approaches are complementary in terms of algorithm complexity, link
costs, delays and traffic concentration densities. SPT offers minimal delay and is the
simplest to compute. SMT offers minimal cost and is the most difficult to compute.
Center based trees fall in between these two extreme cases. When doing distributed
computation of a multicast tree, the message complexity involved in constructing a
center based tree is much lower than that for a near-optimal SMT algorithm.

SMT delays are not bounded in theory; simulation of KMB trees measured widely
distributed values. It is not likely that algorithms derived from pseudo-optimal SMT
algorithms, such as KMB, can be used in real multicast protocols.

Center based tree delays are not adequately bounded, but (mostly) favorably
distributed. MSPT delays and costs are about the same as those of center based trees
with optimal center placement. MSPT and optimal CBT also have similar traffic
concentration characteristics. If the use of optimal center based trees is justified in a
practical protocol, it will be sufficient to use MSPT, which is significantly easier to
compute than the optimal center based tree.
We have shown that the performance of these different trees is sensitive to group
population, average node-degree and locations of the group members. It is relatively
insensitive to the “reasonableness” of the edge distributions of a graph.
Our simulations showed that different algorithms indeed lead to different degrees
of traffic concentration. Hence the selection of algorithms should not be based purely
on performance for individual groups. When there exists heavy traffic concentration,
the heavily loaded links become bottlenecks. Although source-specific SPTs
consume more link bandwidth for each individual multicast group, their demands
on bandwidth are more evenly distributed than those of the center based trees, especially in
networks with high connectivity degrees. Hence a network may support more high
bandwidth multicast groups if SPTs are used instead of center based trees.
Chapter 5

Loop-free Multicast Routing Supporting
Multiple Tree Types
A state based multicast protocol involving different tree types is susceptible to loops
when: (1) unicast routing has loops; (2) unicast routing information is inconsistent;
(3) unicast routing presents asymmetric routes; or (4) multicast data packets are
link-layer multicast delivered onto LANs. This chapter studies the conditions under
which a loop can occur, and analyzes a few loop prevention protocol mechanisms
proposed during the design of PIM.
5.1 Motivations and problem definition

Looping is one of the most destructive problems in any routing protocol. Looping in
multicast is worse than looping in unicast, due to the fact that each duplicated packet
may be delivered to multiple recipients and potentially consume more bandwidth
(referred to as the amplification effect). Multicast looping affects more routers than
unicast loops do.

This chapter establishes a set of rules and criteria such that certain protocol
mechanisms susceptible to loops can be analyzed and verified. We focus on state
based protocols, where the multicast tree is represented by state information installed
in the relevant routers, upon which data packet forwarding action is based. It is
assumed that multicast data packets are link-layer multicast delivered on multiaccess
LANs.
When multiple tree types are used, it is useful to consider the situations in which
multicast looping can occur based on the cause of the loop:

1. Multicast loops resulting from unicast loops.
When constructing multicast routes, a unicast loop can be "copied" into
multicast routes and form a multicast forwarding loop on a multicast "tree".

2. Loops within a single multicast tree.

3. Loops created by the overlap of shared and shortest path trees.

Multicast loops that result from unicast loops should be dealt with by fixing the
unicast loops, instead of by creating special multicast loop-prevention mechanisms.
Therefore we will only investigate cases 2 and 3 listed above. Section 5.2 discusses
loops within a single tree; section 5.3 addresses loops along mixed shared and shortest
path tree segments; and section 5.4 discusses a few loop prevention mechanisms.
5.2 Looping along branches of a "single" tree in
unicast loop-free networks

Figure 5.1 illustrates a scenario where unicast routing does not present loops, while
a naive multicast routing mechanism could create a forwarding loop along a shared
tree branch.¹

Multicast routes are initiated from the receivers toward a rendezvous point[12].
When D and C run a different unicast routing protocol from that run by A and B,
B chooses A as the next hop to RP; C chooses B through the point-to-point link as
the next hop to RP; D chooses C as the next hop to RP.

If the multicast protocol allows setting up of the path as indicated by arrows on
the links shown in figure 5.1, there will be a multicast forwarding loop created: LAN
→ B → C → LAN.

¹The discovery of this phenomenon was inspired by a discussion on DVMRP routing loops with
Kannan Varadhan. The DVMRP looping problem was presented by Steve Deering on an IETF
working group mailing list. As pointed out by Deering, the LAN does not have to be a physical
wire; it can be two subnetworks bridged together so that they appear to be one broadcast medium
for IP packets.
Figure 5.1: Loop along a branch of a single tree in unicast loop-free network. (Arrows
indicate the direction of Join messages from Receiver1 and Receiver2 toward RP.)
This type of loop can happen only when the unicast routing information is
inconsistent among different routers attached to the same LAN.² We define the term
unicast routing consistency as follows,

Definition 1 (Unicast routing consistency) The unicast routing protocol(s) on
a LAN is said to be consistent if for each possible destination in the network, all
routers attached to the LAN will choose the same next hop router if the LAN is the
next hop link toward such a destination.
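Definition 1 can be read as a simple check over the routing tables of the routers attached to a LAN. The sketch below assumes a per-router table mapping each destination to an (outgoing link, next hop) pair; the data layout and all names are illustrative.

```python
# Check the unicast routing consistency property of Definition 1 on one LAN:
# every router that routes a destination via the LAN must pick the same
# next hop router.
def lan_consistent(lan, routers, tables, dests):
    for d in dests:
        # next hops chosen by routers whose next hop link toward d is the LAN
        next_hops = {tables[r][d][1] for r in routers
                     if tables[r][d][0] == lan}
        if len(next_hops) > 1:   # two routers disagree: inconsistent
            return False
    return True

tables = {
    'B': {'RP': ('lan0', 'A')},
    'C': {'RP': ('lan0', 'A')},   # consistent: both pick A across lan0
}
ok = lan_consistent('lan0', ['B', 'C'], tables, ['RP'])
```

In the scenario of figure 5.1, the two routing protocols disagree on the next hop across the LAN, so this check would fail.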
Given the unicast routing consistency assumption we are able to make the
following statements.

Lemma 1 If the unicast routing is loop-free and consistent over all multiaccess
LANs, there will be no multicast forwarding loops along:

1. a pure shared tree branch; or

2. a pure shortest path tree branch.

Proof: (1)

²PIM will not create such a tree, since each LAN has only one "joiner". Although the purpose of
such a mechanism originally wasn't to prevent such loops, it happens to serve as a loop prevention
mechanism as well.
To construct a loop on an RPT, there must be a path leading from a
"downstream" link to an "upstream" link. In order for such a scenario
to occur, the Join messages would have to flow from the "downstream"
link to the "upstream" link and reappear on a downstream link.
This would require either a unicast loop or a LAN that is not unicast
routing consistent.

(2)

Same as the proof for (1).

Q.E.D.
Lemma 2 If there exists a multicast forwarding loop but the unicast routing is
consistent over all LANs, the loop must consist of both shortest path tree segments
and shared tree segments.

Proof:

This lemma is a direct consequence of lemma 1.
5.3 Multicast forwarding loop along segments of
mixed tree types

This section deals with multicast loops along mixed shared and shortest path tree
branches. We first show that a loop with mixed shared and shortest path tree
branches can exist only if unicast routing is asymmetric. Subsection 5.3.2 shows the
conditions under which unicast routing is asymmetric, and the RPF interface check³
is effective as a loop prevention mechanism. Subsection 5.3.3 discusses situations
involving multiaccess LANs in which the RPF check alone can not prevent loops,
and presents four necessary conditions and a sufficient condition for loop existence.

³When a router receives a data packet from a source S, a router that performs the RPF interface
check compares the interface the packet arrived on with the interface the router uses to send packets
toward S via the shortest path. If they are the same, the packet passes the RPF interface check.
A packet that fails the RPF interface check normally indicates an exception in the network and
often is either dropped or processed specially. A looping packet will be detected when it fails the
RPF interface check.
5.3.1 Looping under unicast asymmetry

It should be pointed out that looping along segments of different shortest path trees
of the same multicast group⁴ is impossible unless there are network component
failures. When branches of different trees form a loop, one segment must be a
source-rooted shortest path tree branch, and another must be a shared tree branch.

This section assumes that when a group's SPT intersects its RPT, the appropriate
RPT branches are merged into the SPT routing entry at the intersection router
for packet forwarding purposes.⁵ At data forwarding time, the router will forward
packets received from the SPT onto all downstream SPT and RPT receivers.

The following lemma says that looping along branches of different trees can exist
only if unicast routing is asymmetric: either the costs of the paths are different in
the two directions between two certain points in the loop, or there exists more than
one equal-cost path between those two particular points in the network, such that
after breaking the tie the resulting paths in the two directions are different.
Lemma 3 (Asymmetry) Assume unicast routing consistency across all multiaccess
LANs in the network. If there exists a multicast forwarding loop consisting of
segments of both SPT path and of RPT path, one of the following three is true:

1. there must exist two nodes u, v in the loop such that the shortest path from u
to v is different from the shortest path from v to u; or

2. there exist a node u and a multiaccess LAN L such that the shortest path from
u to some node n1 on L is different from the shortest path from some node
n2 on L to node u; or

3. there exist two multiaccess LANs L and M such that the shortest path from
a node n1 on L to some node m1 on M is different from that from a node m2
on M to some node l2 on L.

Proof:

⁴A router should drop a data packet if it does not have a shared-tree entry or a (S,G) entry for
the data packet.
⁵Either the RPT outgoing interfaces are copied into the SPT entries and tagged, or the forwarding
action uses a union of the interface lists from the two entries.
By contradiction. Assume unicast routing is symmetric, but there exists
a multicast forwarding loop. According to lemma 2, the loop must consist
of both RPT and SPT segments. Within the loop, the SPT and RPT must
intersect at two distinct links or nodes.

1. Intersections are at nodes u and v, u ≠ v.

(a) If u → v is on an SPT branch and v → u is on an RPT
branch, then v → u must be the unicast shortest
path from v to u and u → v must be the unicast
shortest path from u to v (due to the way the RPT and
SPT were constructed).
⇒ (Due to unicast routing symmetry) u → v is the
same as v → u.
⇒ The loop can't exist, since packets from the same
source will not travel in both directions along the
same path in a stable state.
⇒ Contradiction.

(b) If u → v is on an RPT path and v → u is on an SPT path,
the same contradiction as in the previous case.

2. Similar contradictions as in the previous case.

3. Similar contradictions as in the previous case.

Q.E.D.
5.3.2 RPF interface check as a loop prevention method

If the protocol mechanism does an RPF check when forwarding data packets, there
will be no multicast forwarding loops when all RPT and SPT intersections are at
routers (not at multiaccess LANs). Figure 5.2 illustrates the generic scenario in which
SPT and RPT branches can potentially form a loop.
Figure 5.2: RPF check prevents multicast looping
Packets from Src1 will first flow along the SPT path B → D → A. Then A will
forward them onto the RPT path A → C → B. Eventually B finds a longest match
on (S,G), does an RPF check on the incoming interface and rejects those Src1 packets
from the RPT, thus breaking a potential packet loop.

Note that there can be any number of intermediate hops in the paths A → C and
D → A in figure 5.2. There is also no topological restriction on the placement of
the source and receiver. Therefore, the above reasoning constitutes a generic proof
that an RPF check can prevent loops in networks without multiaccess LANs.
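The RPF interface check itself is a one-line rule: accept a packet from source S only if it arrives on the interface this router would itself use to reach S by the shortest path. A minimal sketch, with an illustrative table layout and interface names (not taken from any particular implementation):

```python
# RPF interface check sketch. unicast_table maps a source address to the
# interface this router uses to reach that source via the shortest path.
def rpf_check(unicast_table, src, in_iface):
    """Return True if the packet passes the RPF interface check."""
    return unicast_table.get(src) == in_iface

# Router B from the scenario above: it reaches Src1 via interface 'if_src'.
table_B = {'Src1': 'if_src'}
assert rpf_check(table_B, 'Src1', 'if_src')        # packet from the source side: passes
assert not rpf_check(table_B, 'Src1', 'if_to_C')   # looped copy arriving from the RPT: fails, dropped
```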
5.3.3 RPT and SPT branches intersecting at a multiaccess
LAN

When an RPT and an SPT intersect at a multiaccess LAN, an SPT receiver may
pick up packets forwarded by an RPT forwarder, and an RPT receiver may pick up
packets forwarded by an SPT forwarder. This may lead to looping in certain situations.
One such situation is a potential PIM routing loop discovered during the
earlier development of the LAN related mechanisms,⁶ as illustrated in figure 5.3. In
that picture, C chooses A as the best next hop to RP, and RP joins toward Src through
B → D → Src. Without a loop prevention mechanism, data packets will loop
along LAN → B → RP → A → LAN.
Note that when a receiver receives two packets with identical packet headers and
contents, there are two possible causes: (1) there are two different paths from the
source to the receiver, such that the two packets are independent of each other along
their forwarding paths; and (2) one packet is triggered by the other packet. Situation
(1) is referred to as duplicate packets due to parallel paths from the source, and (2)
indicates a potential loop. To distinguish duplicate packets generated by parallel
paths from packets that loop, we make the following definition.

⁶First discovered by members of the IETF IDMR working group.

Figure 5.3: A loop composed of an SPT segment and an RPT segment
Definition 2 (Packet Incarnation) An observed packet X is said to be an incarnation
of a packet Y that was observed at an earlier time than X, if and only if
packet Y is a replication of packet X: a change in bit values in packet X will result
in the same change of bit values in packet Y. A sequence of packets X1, X2, ...
is said to be a successive incarnation sequence if Xi is an incarnation of Xj for all
i > j >= 1.

We can formally define packet loops in terms of packet incarnations.

Definition 3 (Packet Forwarding Loop) A looping condition exists if more than
2 successive incarnations of a packet are observed on the same network interface.
In the following discussion, assume the LAN in question is the only LAN in the
whole network. The result can be generalized to networks with multiple LANs. We
also assume that there is no loop consisting of segments of a pure tree type, and
that there is only one RP for the multicast group.

There are four necessary conditions for loop existence involving multiaccess
LANs:
Condition 1 A data packet and at least one of its incarnations should be able to
appear on the same LAN.

Proof:

Trivial. Q.E.D.
Condition 2 There exist at least two forwarders for the same group on the LAN,
one for the SPT and one for the RPT.

Proof:

By contradiction. Assume there is only one forwarder for the LAN
but there exists a loop. As shown in figure 5.4, all packets and their
incarnations are injected into the LAN by the only forwarder B.

(a) If the forwarder is only an SPT forwarder, the loop must be along
a pure SPT path, which is impossible; or an intermixed SPT and
RPT path. The latter case is illustrated in figure 5.4. Note that there
exists a router (router A in the picture) such that the packets flowing
down the RPT would hit the SPT. But since router A would do an
RPF check on the incoming data packet, which matches its (S,G)
routing entry, router A will reject packets from the RPT incoming
interface due to an RPF check failure.⁷

(b) If the forwarder B is only an RPT forwarder, the loop must be along
a pure RPT path, which was shown to be impossible; or along
an intermixed RPT and SPT path. The latter case is shown in
figure 5.5. Router B is the only forwarder for the LAN and is
only on the RPT. The receivers are not drawn, but there must be
a receiver that joins the RPT and whose path would go through
router D (i.e. it should either be attached to D, or be to the right of
or above it in the picture). The SPT and RPT intersect at routers

⁷Note that even if routers don't do an RPF check, this loop is still not possible without asymmetric
paths: the RP would have to be both upstream on the RPT and downstream on the SPT
of the LAN, which is impossible (note that there is only one forwarder for the LAN).
Figure 5.4: LAN with a single forwarder on the SPT.

Figure 5.5: LAN with a single forwarder on the RPT.
A and D. Unlike (a), here there is no RPF failure to stop router A
from accepting packets from the SPT and forwarding them onto the
RPT. However, at router D, an RPF failure will occur that prevents
D from forwarding RPT packets onto the SPT again.⁸

(c) If the forwarder is both an SPT and an RPT forwarder, there are
two possible loop compositions:

1. The loop is along a pure SPT or RPT path;
2. The loop is along an intermixed SPT and RPT path, i.e.,
(i) From downstream of the LAN, the loop consists of an RPT
segment followed by an SPT segment before reaching the
LAN again;
(ii) From downstream of the LAN, the loop consists of an SPT
segment followed by an RPT segment before reaching the
LAN again.

Case (c)1 is not possible. Case (c)2(i) was shown in figure 5.5 and
dealt with in (b), and case (c)2(ii) was shown in figure 5.4 and dealt
with in (a).

Q.E.D.
C o n d itio n 3 There exist at least two recipients 9, one for a SP T and one for the
RPT, that receive packets from the L A N and forward them out other interfaces.
P ro o f:
By Contradiction. Assume there is only one recipient for the group on
the LAN and there exists a loop.
(a) If the recipient is only an SPT recipient, the loop must be composed
solely of SPT path segments. Otherwise, if part of the loop is along
the RPT, the RPT should either,
8But even if D does not do an RPF check, the loop cannot be set up as long as the shortest path between
A and D is the same from both A's and D's points of view; otherwise there would be a receiver that
is downstream of the LAN on the RPT and upstream of the LAN on the SPT.
9A recipient on a LAN for a group is either a router that accepts packets from the LAN and
forwards them toward other interfaces, or a host attached to the LAN that is a member of the
group.
Figure 5.6: Single SPT recipient: the RPT segment directly reaches the LAN.
Figure 5.7: Single SPT recipient: the RPT segment hits the upstream SPT.
Figure 5.8: LAN with a single recipient on the RPT. The SPT segment reaches the LAN again.
1. directly lead to the LAN as shown in figure 5.6; or
2. lead to the upstream SPT branch as shown in figure 5.7.
The scenario in figure 5.6 cannot happen unless there is another
receiver on the LAN that is on the RPT so that (*,G) state is set up
in router A, i.e. it mandates two recipients for the group.
The scenario in figure 5.7 does not have a loop, since router E would
reject packets from the RPT because of RPF failures.
(b) If the recipient is only an RPT recipient, the loop must be composed
solely of RPT path segments, which is impossible; or, if part of the
loop is along the SPT, the SPT segment should either,
1. directly lead to the LAN as shown in figure 5.8; or
2. lead to the upstream RPT branch as shown in figure 5.9.
Case 1 cannot happen, because there is no other receiver on the
LAN that could join the SPT and install (S,G) state on router A.
Case 2 cannot present a loop, since router C will reject the source's
packets from router B due to RPF failures.
(c) If the recipient is both an SPT and an RPT recipient, the loop can
either be a pure SPT(RPT) loop; or an intermixed loop with SPT
and RPT segments. We know the former is not possible. For similar
Figure 5.9: LAN with a single recipient on the RPT. The SPT segment hits the upstream RPT.
reasons to case (c) of the proof of condition 2, it is impossible to create
loops in the mixed-segment scenario.
Q.E.D.
Condition 4 There exists a router recipient on the LAN that will pick up and forward packets put on the LAN by any other "upstream" router even if the tree path the
recipient router set up did not go through that "upstream" router (i.e. the recipient
router did not send a join to that "upstream" router).
Proof:
By Contradiction. Assume all recipients will only accept data packets
that are put on the LAN by the upstream router on its RPT or SPT,
but there still exists a loop.
In this setup, each router on the LAN is only related to the upstream
router it joined to and is “isolated” from traffic from other routers on
the same LAN. We can safely replace the LAN with some point-to-point
links connecting each downstream router with its upstream router. This
reduces the problem to one in networks with only point-to-point
links. It was shown in the previous discussions that with RPF mechanisms
such networks do not present loops under our operating assumptions.
Contradiction.
Q.E.D.
To break a multicast forwarding loop, it is sufficient to prevent any one of the
necessary conditions 1 through 4 from becoming true. However, it should be noted
that the four necessary conditions are not independent of each other. They are
explicitly listed to facilitate examination of real-world protocol mechanisms.10 The
following lemma shows the relationship among the first three of them.
Lemma 4 (SPT/RPT connector) If necessary condition 1 is true, then necessary conditions 2 and 3 are also true.
Proof:
The proof is similar to the proofs of conditions 2 and 3.
The following is a sufficient condition for loop existence:
Lemma 5 (Sufficient Condition) If necessary conditions 1 through 4 are all
true, there exists a loop involving a multiaccess LAN.
Proof:
Let condition 1 be true for a multicast group on a LAN.
10Perhaps it would be much more pleasant for theoreticians to read this section if conditions
1 and 4 had been designated as the two necessary condition theorems, and conditions 2 and 3 as
corollaries of condition 1. But this arbitrary hierarchy makes it conceptually more complicated
than necessary. So I decided to use a less rigid form of presentation and used the label “condition”
instead of “theorem” for each statement.
Figure 5.10: Sufficient conditions for loop existence.
By Lemma 4, conditions 2 and 3 are true.
Under condition 1, let P be a packet from a source S for group G. See
figure 5.10. It is first forwarded onto the LAN by router X, and is picked
up from the LAN by router Y, which forwards the packet out a set of
outgoing interfaces SEToif not containing the LAN. One copy of these
packets will have a descendant incarnation packet P1 forwarded onto the
LAN. Let Z be the router that forwards P1 onto the LAN. (Note that we
do not mind whether X = Z or X ≠ Z.)
By condition 4, P1 will be picked up by router Y again and forwarded
onto the same set of outgoing interfaces SEToif, and one of the copies
will have a descendant incarnation packet P2 forwarded onto the LAN.
And so forth, until the time-to-live field of the packet decrements to zero.
Hence there exists a loop when conditions 1 through 4 are all true.
Q.E.D.
Figure 5.11: X sends an assert after it receives a potentially duplicate packet.
5.4 Analysis of a few loop-prevention protocol mechanisms
Preventing necessary conditions 1 or 3 from becoming true is difficult, though not
impossible. At least three mechanisms have been proposed within the PIM design
group to prevent multicast loops.
5.4.1 Assert mechanism
The assert mechanism used in PIM breaks necessary condition 2 by guaranteeing
that only one forwarder for each (source, G) routing entry is active on a multiaccess
LAN. Its major benefits are due to its data-driven nature. With the assert mechanism, a router first detects a duplicate packet, then arbitrates with the other sender
and removes one of the forwarding paths.
The assert process starts when a router on a multiaccess LAN detects a potential
duplicate packet11 by observing a data packet coming in an interface that is in the
11It is potentially a duplicate because it can be the first copy of such packets on the LAN; the
subsequent packets will appear later if no action is taken, or may even arrive
before the arbitration action takes effect.
outgoing interface list of a multicast routing entry. As shown in figure 5.11, both X
and Y are forwarders for group G on the LAN. If router X receives a data packet
addressed to the group G from the LAN (which is X's outgoing interface), it can
deduce that there exists another packet delivery path for packets sourced at S for
the same group.
In response to the potential duplicate packet, router X sends out an assert message onto the LAN that contains a metric value denoting the distance from router
X to the source S (or to the RP if it is a shared tree entry). All other routers that forward
packets onto this LAN will receive the assert and will compare the metric value in
the assert with their own metrics. If Y wins the arbitration, Y will be the forwarder on
the LAN for that routing entry.12
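The arbitration rule described above can be sketched as follows. This is a minimal sketch with hypothetical field names; the actual PIM assert processing (including administrative-distance comparison and address tiebreaking) is specified in [13].

```python
# Sketch of the assert arbitration: SPT entries beat RPT entries, then the
# shorter metric to the source (or RP) wins, then a deterministic tiebreak.
from dataclasses import dataclass

@dataclass
class Assert:
    metric: int          # distance to the source S (or to the RP for a shared tree entry)
    is_spt: bool         # True for an (S,G) entry; SPT always beats RPT (footnote 12)
    sender_addr: int     # tiebreaker when metrics are equal (hypothetical)

def wins(mine: Assert, theirs: Assert) -> bool:
    """Return True if 'mine' wins the arbitration and stays the forwarder."""
    if mine.is_spt != theirs.is_spt:
        return mine.is_spt                    # SPT entry always wins over RPT entry
    if mine.metric != theirs.metric:
        return mine.metric < theirs.metric    # shorter distance wins
    return mine.sender_addr > theirs.sender_addr  # deterministic tiebreak

# Router X (metric 3) loses to router Y (metric 2); Y remains the forwarder.
x = Assert(metric=3, is_spt=True, sender_addr=1)
y = Assert(metric=2, is_spt=True, sender_addr=2)
assert not wins(x, y) and wins(y, x)
```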
Although assert was originally designed to resolve the parallel path problem,
it was later used to arbitrate among forwarders of SPT and RPT branches on a
multiaccess LAN. This function makes it an effective loop prevention mechanism for
loops involving mixed tree types, in that it allows only one forwarder for each source
per multicast group.
The assert mechanism is robust mostly due to its data driven nature. When an
assert message is lost on a LAN, the subsequent data packet will trigger another
assert message.
Because the assert mechanism described above does rely on the sending of an
assert from a forwarder, subtle hardware failures with odd timing may result in
failures in the assert arbitration process. For example, in the mixed SPT/RPT
looping picture in figure 5.3, if the SPT forwarder router D went dead right after it
had forwarded the first few data packets onto the LAN but before any incarnations
of those data packets arrived on the LAN, and if the first few data packets were only
received by router B but not by router A, the loop can persist. The first few data
packets will be captured inside the loop.
However, this type of loop will not by itself cause the network to collapse. It
merely consumes a fixed portion of the bandwidth and makes the available bandwidth appear smaller on links affected by the loop. Furthermore, all multicast data
12An SPT entry always wins the arbitration against an RPT entry. See the PIM specification [13] for
details.
packets will carry a time-to-live (TTL) field in their packet headers that is decremented at each hop, so a data packet trapped in such a loop will not loop forever.
This is not as deadly as the other types of loops we have discussed so far.13
Obviously, the occurrence of this kind of hardware failure during the very small
timing window is extremely unlikely, and normally such loops are self-healing, either
when all packets run out of TTL or when the hardware failures are fixed.14
5.4.2 Control packet tag
This loop prevention approach also breaks necessary condition 2. Under this
scheme, an SPT routing entry is tagged as "RP-presence" if there is an RP
downstream of that particular router on the SPT. When an SPT branch tagged as
"RP-presence" intersects with an RPT branch at a multiaccess LAN, the scheme breaks the
potential loop by allowing only the SPT forwarder to forward packets from the given
source onto the LAN.
In the context of PIM control messages, each (S,G) join message carries an "RP-presence" flag indicating whether the RP is downstream relative to the current router
that sent the join. When the RP-presence tag is zeroed on an (S,G) entry, it means all
branches downstream from that point lead only to real receivers or other routers
that are not the RP for the group G.
In figure 5.3, the multiaccess LAN is in the (*,G) outgoing interface list of router
A. If router A receives an (S,G) join with the RP-presence tag set (from router B)
that is addressed to another router D on the LAN, router A should prune this LAN
off for source S and set up a negative cache (S,G) entry. This will cut the forwarding
path through A for packets from the source "Sender", and make router D the only
forwarder for the LAN.
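Router A's reaction can be sketched as follows. The data structures here are hypothetical, not taken from any PIM implementation; the sketch only shows the decision described above.

```python
# Sketch of the RP-presence reaction: overhearing an (S,G) join with the
# RP-presence tag set, addressed to another router on a LAN that is in this
# router's (*,G) outgoing interface list, installs a negative cache entry.
def on_sg_join(router, lan, join):
    """Handle an (S,G) join overheard on a multiaccess LAN."""
    if (join["target"] != router["id"]          # join addressed to some other router
            and join["rp_presence"]             # RP-presence tag is set
            and lan in router["star_g_oifs"]):  # we forward (*,G) traffic onto this LAN
        # Negative cache (S,G) entry: stop forwarding S's packets from the
        # RPT onto this LAN, leaving the SPT forwarder as the only forwarder.
        router["sg_negative_cache"].add((join["source"], join["group"], lan))

router_a = {"id": "A", "star_g_oifs": {"lan1"}, "sg_negative_cache": set()}
on_sg_join(router_a, "lan1",
           {"target": "D", "source": "S", "group": "G", "rp_presence": True})
assert ("S", "G", "lan1") in router_a["sg_negative_cache"]
```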
13Or, put another way, this kind of loop can only happen when the source is shut off, so
that no more packets can be injected into the existing loop. Other types of loops that accept more
packets into a looping path may cause the network to collapse or melt down.
14But when TTL mechanisms are not employed for certain reasons, and such rare loops are
absolutely not tolerable, we can use the following extra mechanism, called early arbitration.
The early arbitration scheme relies on the fact that joins are sent before data arrives and that joins
are sent by the recipients. If an (S,G) join is received on a (*,G) outgoing interface and is addressed
to other routers on the LAN, the router takes the same action the assert takes: it
sets up an (S,G) negative cache state and makes itself a non-forwarder for that LAN. Hence when a
forwarder fails, the downstream routers still keep sending joins, which will trigger assert-like actions.
The reliability of this scheme depends upon the reliability of the (S,G) join
message with RP-presence tag set. If the (S,G) RP-presence tag message is lost,
there will be a temporary packet loop until the next periodic message is sent and
received. This kind of loop, though of low probability, is not tolerable when the
source is sending at a high data rate. Unlike the looping situation with the assert
mechanism discussed in the previous subsection, this loop will accept all packets
from the source after the loop has formed and keep consuming more and more
bandwidth. All data packets trapped will loop for as long as the TTL allows.
With certain combinations of the loop perimeter, control message refresh period
and source rate, some links in the loop can be saturated with looping packets.
One possible fix for the above temporary loop in case of a lost control message is to
let the router whose (S,G) entry has the LAN on its outgoing interface list respond
to data packets on the LAN. See figure 5.3: if D receives a packet from "Sender" on
the LAN, it sends an (S,G) prune onto the LAN, thus creating a negative cache (S,G)
entry on router A.
In reality, most groups and routers will not be involved in this kind of scenario. It
is quite wasteful to require all routers to generate, process and store an RP-presence
tag when only very few of them will be used in practice.
5.4.3 Local Forwarder Tag
This approach breaks necessary condition 4.15 The scheme is based on the
assumption that packets on multi-access LANs carry the local forwarder's link-layer
address in the link-layer packet header; e.g., an Ethernet packet header has a source
address field that contains the packet forwarder's Ethernet address. When a router
sends a join to its next-hop upstream router, it also memorizes the link-layer address
of its upstream next hop. When receiving data packets from the LAN, a router has
to check the link-layer address and will only accept packets that carry the link-layer
address of its upstream next-hop router.
15This was one of the earliest schemes, proposed by Deborah Estrin to resolve the looping problem.
She referred to this scheme as the "link-layer address" based approach.
This scheme is very robust. It is immune to control packet losses and interface
failures. But it does require a router to check the link layer address of another
router, which isn’t a desirable requirement for a network layer protocol.
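The acceptance check itself is only a few lines; the sketch below uses a hypothetical frame representation and MAC addresses.

```python
# Sketch of the link-layer forwarder check: a multicast data frame from the
# LAN is accepted only if its link-layer source address is the upstream next
# hop this router actually sent its join toward, breaking condition 4.
def accept_from_lan(expected_upstream_mac: str, frame: dict) -> bool:
    """Accept a frame only if it was forwarded by our chosen upstream router."""
    return frame["src_mac"] == expected_upstream_mac

# The router joined via the upstream next hop aa:00:00:00:00:01; frames
# forwarded by any other router on the LAN are silently ignored.
assert accept_from_lan("aa:00:00:00:00:01", {"src_mac": "aa:00:00:00:00:01"})
assert not accept_from_lan("aa:00:00:00:00:01", {"src_mac": "aa:00:00:00:00:02"})
```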
5.5 Summary of looping conditions and loop-prevention mechanisms
Multicast routing loops can happen when there is unicast routing inconsistency or
unicast routing asymmetry. In the absence of unicast routing loops, the RPF check
mechanism can effectively prevent multicast routing loops when no multi-access
LANs are involved. Four necessary conditions for loop existence were presented.
Multicast routing loops can be eliminated by preventing any one of the four conditions
from becoming true. A sufficient condition was also presented.
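As a minimal sketch of the RPF check that these results rely on (interface names are illustrative; a real router derives the RPF interface from its unicast routing table):

```python
# Reverse path forwarding check: accept a multicast packet only if it arrived
# on the interface this router would itself use to reach the source.
def rpf_accept(source, arrival_iface, rpf_iface_toward):
    return arrival_iface == rpf_iface_toward(source)

# Hypothetical router whose unicast route back to every source is via eth0.
route_back = lambda src: "eth0"

# A packet arriving on the RPF interface is accepted...
assert rpf_accept("S", "eth0", route_back)
# ...while a copy that looped around and arrived elsewhere is dropped.
assert not rpf_accept("S", "eth1", route_back)
```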
Three multicast loop prevention mechanisms were investigated. Overall, the
assert mechanism was adopted by PIM because it is the simplest mechanism, has
the lowest overhead, and serves both as a parallel-path arbitration mechanism and
as a loop-prevention mechanism.
Chapter 6
Multicast Routing in Dense and Sparse Modes:
A Simulation Study of Tradeoffs and Dynamics
Chapters 4 and 5 focused on issues related to distribution trees. This chapter will
address the operating modes of the protocol mechanisms that maintain multicast
distribution trees. The discussions in this chapter will be based on PIM, which can
optimaly support multicast groups that are either sparsely distributed or densely dis
tributed. In sparse mode, PIM uses shared trees (RPT) or shortest path trees (SPT)
to deliver data packets. The availability of these various modes opens questions re
garding when each should be used, and the consequences of switching among them
dynamically. This chapter deals with two specific issues: 1) the overhead tradeoffs
between dense mode operations and sparse mode operations. 2) the behaviors of
PIM when receivers transition from RPT to SPT;
Our results illustrate the crossover point of sparse mode and dense mode overheads, which gives a hint for selecting protocol modes according to the group density
metric.
6.1 Introduction
PIM is capable of supporting sparse mode (SM) and dense mode (DM) operations.
The dense mode PIM operation is characterized by periodic broadcasts and prunes.
As in DVMRP [11], a source's data packets are broadcast delivered to network nodes
that have no prune state for that source and multicast group; then the few leaf
subnetworks that don't have group members will set up "negative" state and prune
the unwanted branches from the tree. The negative state will time out after a certain
period of time, resulting in the broadcasting of data packets and pruning of unwanted
branches again. In sparse mode operation, data packets are not broadcast and only
routers on the multicast tree need to keep state information for a group. The state
of a multicast tree is set up when receivers’ designated routers send join messages
toward a Rendezvous Point (RP) or a source. In sparse mode, receivers may stay
on the RP-rooted shared tree (abbreviated as RP tree or RPT) or switch to the
source-rooted shortest path trees (SPT). For more detailed protocol descriptions,
please see [12, 13].
This section investigates the tradeoffs of different operating modes and the boundary conditions for some transient processes. We will analyze the phenomena and provide formulae for some situations. When the complexity of the situation makes analytical methods inadequate, we use simulation to conduct experiments that closely
reflect the way the protocol operates in a real network. While our analysis of dense/sparse mode tradeoffs is in terms of PIM sparse mode, with minor modifications it
applies to DVMRP and most broadcast-and-prune protocols.
6.1.1 Overhead tradeoffs of sparse mode and dense mode
We evaluate the overhead of sparse mode and dense mode operations in terms of
control bandwidth consumption and state storage requirements in all routers.
Dense mode protocols use a broadcast-and-prune mechanism that generates "control traffic" on all links of the network. In contrast, sparse mode control traffic occurs
only along the branches of the active-source SPTs and the shared RPT. In both
modes, control traffic is sent periodically, normally with different frequencies. The
currently suggested dense mode SPT entry timer is 3 times the sparse mode periodic
refresh timer [13]. In dense mode, the number of broadcast packets can be reduced
by using a very long timer value for the negative state, at the cost of keeping state for
dormant groups or sources for longer times and not being able to adapt to routing
changes quickly.
PIM dense mode requires state, either positive or negative, in all routers. Sparse
mode groups, however, only maintain state in routers on packet delivery trees.
When evaluating both of the criteria, we must consider group membership distributions. There are obviously two extreme cases of group membership distributions:
one extreme is a very large group for which every router either is needed to forward
packets for some downstream members or itself has group members attached;
the other extreme is a group which only requires a small number of on-tree
routers to deliver packets to all members. Scenarios between these extreme cases
will be evaluated in detail in section 6.2.
Another factor that can affect an application’s preference for dense mode or
sparse mode is join latency. Additional join latency is incurred in two cases:
1. New member: In dense mode, the receiver’s designated router will send a graft
message toward existing sources according to the router’s negative cache state.
When a graft message reaches a router already on a multicast tree (attachment
point), data packets will be able to flow downstream to the new receiver. This
process takes a round-trip time between the new receiver and the attachment
point. In sparse mode, a new receiver joins a group by sending a join message
toward the Rendezvous Point (RP). When the join message reaches a router
that is already on the RP tree, data packets from all existing sources will be
able to flow downstream toward the new receiver. This also takes a round trip
time between the new receiver and the attachment point on the RP tree. The
difference in join latency is negligible in this case.1
2. New source: In dense mode, the first (or a few) packet is broadcast delivered to
all routers. An existing receiver receives the first packet after a one-way delay
time between the source and the receiver. In sparse mode, the new source
sends the first few data packets encapsulated in RP-register messages toward
the RP [13]. The RP will decapsulate the RP-register packets and forward the
data packets down the RP tree. The receivers will receive the data packets
after a one-way delay between the source and the receiver along the RP tree.
1The difference in join latency can in fact be around a few milliseconds. But since the new
receiver joins in the middle of the transmissions and has not received earlier packets anyway, such a
minor difference in joining time should be negligible.
The difference in join latency is the difference between the RP tree delay and
the SPT delay. Again this is negligible in general.
Based on this analysis it appears that join latency is not a significant issue when
choosing between dense and sparse mode operations. Therefore, in the rest of this
chapter, we will not concern ourselves with the join latency issues.
6.1.2 Sparse Mode Dynamics: the transition from RPT to SPT
In sparse mode PIM, all receivers join the RP tree first; the receivers then switch
to source-rooted shortest path trees when needed. We will assess the possibility of
packet losses when switching from the RP tree to the SPT. We call a time interval
a blackout period if during that interval a receiver cannot receive any data packets
from a source due to the lack of forwarding state inside the network.
6.1.2.1 Blackout periods during RPT to SPT transitions
Here we briefly review the PIM protocol actions when receivers switch from an RP
tree to a source-rooted shortest path tree. See the example in figure 6.1: the receiver
R, already on the RP tree, receives a series of data packets from source S and decides
to switch to the S-rooted shortest path tree. R's designated router initiates an SPT-join
toward the source, which will set up the SPT state from S to R. B continues to
accept packets from the RP tree before the upstream SPT state is fully set up.
When a data packet from S arrives at B along the SPT branch, B knows that the
upstream SPT state has been completely set up (because the incoming interface for
the SPT is different from that of the RPT), and will prune itself off the RP tree for
source S.2
In figure 6.1, once the transition from RPT to SPT is completed, router B will
reject all of S's data packets coming down the path RP → D → B, and will only
accept data packets from S that come down the SPT path. Assume packets sent
by S are numbered 1, 2, ..., n according to the order of departure times at S, and
2Such prunes will set up negative cache state along the RP tree so that packets from S will not
be forwarded along the RP tree any more.
Figure 6.1: The receiver leaves the RPT and joins the SPT.
the path A → RP → D → B is longer than the path A → C → B. After router
A has established SPT routing state, assume the first data packet A forwards down
the SPT path is packet #i. When packet #i arrives at B, B will set its SPT-bit,
indicating the completion of the RPT to SPT switch. If by this time packet #(i−1)
is still in the transmission path RP → D → B, it will be rejected by B when it
eventually arrives at B. Section 6.3 will give a quantitative analysis of such blackout
periods when they do exist.
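Under the assumption that the only packets lost are those rejected at B, the blackout window can be estimated from the two path delays. The notation and numbers below are illustrative only, not the quantitative analysis of section 6.3.

```python
# Back-of-the-envelope estimate of the blackout window: once B sets its
# SPT-bit on the arrival of packet #i, any older packet still in flight on
# the longer RPT path is rejected when it arrives.
def blackout_seconds(d_rpt: float, d_spt: float) -> float:
    """d_rpt: one-way delay along the RPT path (e.g. S -> A -> RP -> D -> B);
    d_spt: one-way delay along the SPT path (e.g. S -> A -> C -> B).
    Packets sent in the last (d_rpt - d_spt) seconds before #i are dropped."""
    return max(0.0, d_rpt - d_spt)

# Example: 120 ms via the RPT, 45 ms via the SPT -> about 75 ms of rejected
# packets; at 50 packets/s that is roughly 3-4 packets lost in the switch.
assert abs(blackout_seconds(0.120, 0.045) - 0.075) < 1e-9
assert blackout_seconds(0.040, 0.045) == 0.0   # shorter RPT path: no blackout
```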
6.1.2.2 Are there duplicate packets during the transitions?
Can there be duplicates when control messages are lost during the switch from
RPT to SPT? One possibility is for the negative cache state along the RPT to be
mistakenly deleted so that receivers receive packets coming down both the SPT
and the RPT. But in PIM, even if control messages are lost, such mistakes cannot occur.
This can be seen in figure 6.1: if both the SPT and RPT paths are forwarding data
packets from S to B, then B, with its SPT-bit set, will do an incoming interface check
and only accept packets from the SPT path, dropping all packets coming from the
RPT path, i.e. no duplicate packets. However, when B's incoming interface is a
multiaccess LAN, i.e. the incoming interfaces for the RPT and the SPT are the
same, the incoming interface check will not be able to distinguish packets forwarded
by the SPT from those forwarded by the RPT. In this situation, the assert mechanism
[13] will be activated so that the router that forwards packets from the RPT onto
the multiaccess LAN is pruned off. Therefore, in the rest of this chapter we will not
be concerned with duplicate packets.
6.2 Trade-off of overheads in sparse mode and dense mode
In this section, we define three basic metrics and formulate the overheads of sparse
mode and dense mode operations. We will also present simulation results showing
the temporal dynamics of bandwidth overhead in a real network.
6.2.1 Metrics and Formulae
The overhead comparisons of different modes are based on two measures: the total
state storage required network wide and the total control bandwidth consumed to
maintain certain multicast groups inside the network.
6.2.1.1 Storage Overhead
Let v1, v2, ..., vn be the nodes inside the network, and let ci(g, sj) be the storage cost on node
i for group g and source sj (or the RP). The overall storage cost under a certain mode is
defined as (assuming there are n nodes in the network and m sources for g):

Cost_storage(mode) = Σ_{i=1..n} Σ_{j=1..m} ci(g, sj)    (6.1)
Since the difference in dense mode and sparse mode routing entries is rather
small [13, 16], it is sufficient to compare the total number of entries under different
operating modes in the storage comparisons. In the rest of this section, we assume
dense mode and sparse mode entries are of the same size.
Let the cost of each routing entry be 1. In dense mode, each router must maintain
a routing entry for each (source, group) pair, either positive or negative. The total
storage cost of dense mode multicast routing entries in the network will be:

Cost_storage(dense_mode, G) = n × m    (6.2)
In sparse mode, the storage cost consists of the costs of the SPT entries, the RPT
(*,G) entries, and the negative cache (S,G,RPbit) entries. Let N(*,g)(G) be the number
of RPT entries for group set G, N(s,g)(G) be the number of SPT entries for group
set G, and N(s,g,RPbit)(G) be the number of negative cache entries.
In sparse mode, the storage cost will be:

Cost_storage(spt_mode, G) = N(*,g)(G) + N(s,g)(G) + N(s,g,RPbit)(G)    (6.3)

Note that if, due to low data rates, the RPs do not establish SPT state and receive all
data packets encapsulated in register messages [13], N(s,g)(G) and N(s,g,RPbit)(G) in the
above formula will be zero. This mode is representative of CBT as well [4].
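Formulas (6.2) and (6.3) amount to a small entry-counting comparison; the sketch below uses made-up example numbers.

```python
# Sketch of the storage comparison, counting routing entries at unit cost.
def dense_mode_entries(n_routers: int, m_sources: int) -> int:
    # Dense mode: every router holds one entry per (source, group) pair,
    # positive or negative -- formula (6.2).
    return n_routers * m_sources

def sparse_mode_entries(n_rpt: int, n_spt: int, n_neg_cache: int) -> int:
    # Sparse mode: only on-tree routers hold (*,G), (S,G) and (S,G,RPbit)
    # entries -- formula (6.3).
    return n_rpt + n_spt + n_neg_cache

# 47 routers and 5 sources: dense mode always costs 235 entries, while a
# small sparse mode group might need, say, 12 + 30 + 4 = 46 entries.
assert dense_mode_entries(47, 5) == 235
assert sparse_mode_entries(12, 30, 4) == 46
```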
6.2.1.2 Bandwidth (Control) Overhead
For a multicast group g, let P(t, l, g) be the number of tree maintenance packets
sent on link l from time 0 to time t. The total number of tree maintenance packets
in the network, summed over its k links, is:

Cost_ctrl_band(t, g) = Σ_{l=1..k} P(t, l, g)    (6.4)

The local IGMP or PIM query and report messages have no global effects and
are the same for dense mode and sparse mode. Such local messages will be ignored
in the definitions and experimental measurements in this chapter.
Since different protocol modes use different kinds of tree maintenance packets,
dense mode and sparse mode bandwidth overheads need to be measured in different
ways. In dense mode, data packets are broadcast delivered to propagate routing
information. A prune packet is triggered by an unwanted data packet, which will
delete an outgoing interface in a routing entry. The bandwidth overhead of dense
mode operation is thus defined as the total number (or bytes) of unwanted data
packets transmitted over all network links, plus the total number (or bytes) of periodic
prune messages. In the following discussions, bandwidth overhead is measured in
units of packet count. The bandwidth overhead in bytes can be estimated based on
overhead packet count and application packet sizes.
Let D_unwanted_pkt(t) be the total number of unwanted data packets from time 0
to t, and D_prune(t) be the total number of prunes sent during the same period; then
dense mode bandwidth overhead can be defined as:

Cost_DM_ctrl_band(t, G) = D_unwanted_pkt(t, G) + D_prune(t, G)    (6.5)

In sparse mode, the bandwidth overhead can be defined as the total number of PIM
control messages (D_pim_msg) sent (i.e. (S,G) join/prune, (*,G) join, (S,G) RPbit prune):

Cost_SM_ctrl_band(t, G) = D_pim_msg(t, G)    (6.6)
6.2.1.3 Overhead as Functions of Group Density
The density of a multicast group reflects the percentage of "on-tree" links vs. the
total number of links in the network.
Let g be a global-scope multicast group with only one sender s; let l be the total
number of links on the tree rooted at s, using a shortest path tree algorithm to
calculate the multicast tree.3 Let L be the total number of links within the network.
The following measure is called the density of g with respect to source s:4

Density(s, g) = l / L    (6.7)
If g has more than one sender, the density metric for the whole group is defined
as the average of the densities over all sources. Let s1, s2, ..., sm be the sources of group
g, where m is the number of sources. Let li (1 ≤ i ≤ m) be the number of links
on the shortest path tree rooted at source si. The density of group g, taking
all sources into account, is defined as:5

Density(g) = (Σ_{i=1..m} li) / (m × L)    (6.8)
3If there is more than one shortest path tree, choose the one with the least number of on-tree
links.
4The sparseness of g with respect to source s is defined as the reciprocal of Density(s, g).
5The corresponding sparseness metric considering all sources is defined as the reciprocal of
Density(g).
Figure 6.2: Arpanet topology used in dense mode and sparse mode simulations
For a group with known density metrics, the following provides a lower bound
for the dense mode control bandwidth cost (formula (6.5)), where the equality holds
when there is only one copy of each unwanted data packet traveling on each link
during each broadcast-and-prune cycle (assume the negative cache time-out interval is
T_DM):

Cost_DM_ctrl_band(t, g) ≥ (1 − Density(g)) × L × 2 × m × t/T_DM    (6.9)
When all receivers use SPTs in sparse mode operation, if the RP is placed at
a receiver,6 control messages will only travel through links that are on at least
one of the source-rooted shortest path trees. When different shortest path trees
and the RP tree overlap, control messages are aggregated into one control packet.
The following gives an upper bound on the sparse mode control message overhead,
in terms of number of control packets (cf. formula (6.6)). Assume the sparse mode
refresh interval is T_SM:

Cost_SM_ctrl_band(t, g) ≤ Density(g) × L × m × t/T_SM    (6.10)

6Or the receiver's first-hop router.
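Bounds (6.9) and (6.10) can be evaluated numerically to locate the crossover density. The timer values below are illustrative, using the suggestion earlier in the chapter that the dense mode timer is 3 times the sparse mode refresh timer.

```python
# Sketch of the bandwidth bounds (6.9) and (6.10) and their crossover.
def dm_lower_bound(density, L, m, t, T_dm):
    # One unwanted data packet plus one prune per off-tree link per cycle.
    return (1 - density) * L * 2 * m * (t / T_dm)

def sm_upper_bound(density, L, m, t, T_sm):
    # One refresh message per on-tree link per source per refresh period.
    return density * L * m * (t / T_sm)

# With T_dm = 3 * T_sm, the bounds cross where (1 - d) * 2/3 = d, i.e.
# d = 2/5: below density 0.4 sparse mode's upper bound is already cheaper
# than dense mode's lower bound.
L, m, t, T_sm = 100, 5, 600.0, 60.0
d = 0.4
assert abs(dm_lower_bound(d, L, m, t, 3 * T_sm) -
           sm_upper_bound(d, L, m, t, T_sm)) < 1e-6
```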
Figure 6.3: Tradeoffs of sparse mode (SPT) and dense mode in the arpanet topology (47 nodes), for 5-sender multicast groups. (a) State storage overhead; (b) control bandwidth overhead.
If the RP is placed at a member which is also a sender, the number of SPT entries
will be Density(g) × L × m, and the number of RPT (*,g) entries will be Density(s1, g) × L.
The sparse mode storage cost represented by formula (6.3) can be rewritten as:

Cost_storage(spt_mode, g) = Density(g) × L × m + Density(s1, g) × L + N(s,g,RPbit)    (6.11)

Since the number of negative cache entries is smaller than the total number of SPT
entries when the RP is placed at a member/sender, the following inequalities provide
an upper and a lower bound for Cost_storage(spt_mode, g):

Density(g) × L × m + Density(s1, g) × L
< Cost_storage(spt_mode, g) <
2 × Density(g) × L × m + Density(s1, g) × L    (6.12)
Figure 6.3 shows the memory and bandwidth overhead curves of the above formulae for the arpanet topology (fig 6.2) with an audio conference of 5 senders. Fig 6.3(a) shows the dense mode vs sparse mode memory tradeoff: the dense mode storage cost represented by formula (6.2) and the upper/lower bounds of the sparse mode SPT operation (inequality (6.12)). Figure 6.3(b) shows the sparse mode and
Figure 6.4: Example of two multicast groups having the same density. Left: group size = 4; right: group size = 3. In both cases l = 4, L = 8, density = 0.5.
Figure 6.5: Maximum and Minimum group size vs density
dense mode control bandwidth tradeoff: the lower bound of dense mode bandwidth
cost represented by inequality (6.9) and the sparse mode bandwidth overhead upper
bound from inequality (6.10)7. The tradeoffs of dense mode and sparse mode under different group densities are clearly visible in these two graphs.
The group size and the density of a tree are related, but there is no one-to-one
correspondence. For a certain network and a given source, there can exist a number
of groups with different sizes but with the same density. Fig 6.4 shows an example of
two multicast groups having the same density. In the topology shown, the maximum
size of a group with density 0.5 is 4, the minimum size of a group with the same
density is 3.
7 Note that in figure 6.3 the density axis does not extend to 1; this is because none of the multicast trees can include all network links.
It is easy to show that the maximum group size under a certain density is density x L8. The minimum group size under a certain density, however, is dependent on the topology and cannot be expressed in a simple formula. We constructed an experiment over the arpanet topology and two 100-node random topologies, and measured the corresponding minimum group sizes for each density value. Figure 6.5 shows three curves: the maximum group size, the minimum group size and the average size of random groups over the arpanet topology. The error bars for random groups represent the standard deviations; there are about 50-100 random groups for each density value.
Note that all curves terminate at a density of about 0.7: at that density there is no way to increase the density of the tree further, since all leaf domains are members of the group and all network nodes are on the tree. In fact the maximum group size is n in an n-node L-link strongly connected network. Since only n - 1 out of the L links are used to construct a tree with n nodes in this case, the maximum density of a group in an n-node network is (n - 1)/L. Therefore, for networks of different sizes, if they have the same node degree, the maximum group size curve (in units of percentage of nodes) should remain the same. We speculate that under the same node degree, their minimum group size curves and their random group size curves should also have little difference9.
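The breadth-first construction behind the maximum group size claim (footnote 8) can be sketched as follows: adding members in the order a BFS from the source visits them ensures each new member contributes exactly one new tree link, keeping the group as large as possible for each tree density. The graph representation and function names below are our own illustration, not code from the experiments.

```python
from collections import deque

def bfs_order(adj, source):
    """Return nodes in the order a breadth-first search visits them."""
    seen, order, q = {source}, [source], deque([source])
    while q:
        u = q.popleft()
        for v in adj[u]:
            if v not in seen:
                seen.add(v)
                order.append(v)
                q.append(v)
    return order

def max_size_group(adj, source, density, total_links):
    """Largest group achievable at the given tree density: take the
    first k BFS nodes, whose tree uses k - 1 links, choosing k so
    that (k - 1) / total_links stays within the density budget."""
    order = bfs_order(adj, source)
    k = min(int(density * total_links) + 1, len(order))
    return order[:k]

# 4-node ring: 4 links; density 0.5 allows a tree with 2 links.
ring = {0: [1, 3], 1: [0, 2], 2: [1, 3], 3: [2, 0]}
group = max_size_group(ring, 0, 0.5, 4)   # [0, 1, 3]
```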
6.2.2 SM/DM tradeoff experiments
The figure 6.3 upper bound and lower bound overhead curves for groups with certain densities are useful for estimating the ranges of overhead. We use simulations with a specific multicast group and a particular topology to measure precisely how the overhead is incurred over time.10
8 We can prove by construction that if the size of a group is incremented by adding new nodes as members in the same order a breadth-first-search algorithm starting from the source would visit the nodes, the size of the group is kept maximal for the density of the tree constructed.
9 We ran experiments on two 100-node random topologies, one with an average node degree of 3.1, the other with a node degree of 5. The result from the network with node degree 3.1 is very close to that shown in fig 6.5. The result from the other 100-node topology has similar shapes, except that every curve is "compressed" horizontally to the left, so that the X-axis values on the curves are reduced by about 40%.
10 The storage overhead for dense mode can be treated as a constant (fig 6.3(a)). The storage overhead for sparse mode depends on the number of on-tree nodes. It can be derived from the tree
Figure 6.6: SM/DM control bandwidth overhead supporting a sparse group in
arpanet
In the subsequent simulation experiments, sparse mode groups and dense mode
groups are used with the same network topologies. Reported here are simulations
over the Arpanet topology shown in figure 6.2.
The group involved in the simulation is assumed to have global scope. All sources
share the same sending behavior. The simulated scenario is a 5-participant audio
conference. The 5 participants are located at nodes 2, 24, 26, 27 and 29 of the
arpanet topology (fig 6.2). In sparse mode operation, node 24 is chosen as the RP
for the group. In the first run the group is started in dense mode, and all sources
start sending at 60 seconds offset from the network startup time. In the second run,
the group is started in sparse mode, and the same experiment repeated.
To log parameters as the experiments progress, a special monitor component is created in the simulator to periodically collect the storage usage and various packet counts at different nodes and links. The sampling times t_0, t_1, ..., t_n can be configured into the monitor component. For this particular experiment, all measurement periods are set to the same small constant (500 ms).
Figure 6.6 shows the bandwidth overhead when supporting this group with 1)
dense mode (thin solid line); 2) sparse mode with SPT (thick line). The step function
of the dense mode measurement reflects the dense mode periodic broadcast and
prune behavior. In this experiment, all timers are set to the default value suggested
costs as reported in [33]. Therefore we won’t replicate the experiments already done in [33] and
only present the results for control bandwidth overhead here.
66
in the PIM specification document [13]. In a real network, the timers can be set
differently in order to achieve a more suitable bandwidth/adaptability tradeoff. In
general a longer negative cache timer will result in less periodic broadcast and prune
traffic, but will also result in slower adaptation to routing changes. The small steps in the sparse mode measurement reflect an artifact of the way the simulator is initialized: all PIM routing modules are started at the same time. The group density for this conference is 0.17.
It can be seen that for this 5-sender/receiver group, the number of dense mode control packets increases much faster than the number of sparse mode control packets. The advantage of sparse mode for small groups is obvious in terms of control overhead.11
6.3 Packet Losses During the Blackout Period When Switching from RPT to SPT
If a multicast group operates in sparse mode, all receivers will join the RP tree first. When a source's packet rate is high enough, the receivers will switch to the source-rooted shortest path tree. As described in subsection 6.1.2, there may exist a blackout period during the switch from RPT to SPT.
The length of the blackout period depends on the difference between the relevant sections along the RPT and the SPT paths. The number of packets lost is proportional to the length of the blackout period and the source's rate.
First, we formally define the relevant parameters and metrics.
Let s be a source and r be a receiver, and let D_spt(s, r) be the propagation delay for packets from source s to receiver r along the shortest path. Let D_rpt(s, r) be the one-way delay along the RP tree path from s to r. Let R(s) be s's sending rate in units of packets/second.
11 The bandwidth overhead of a network supporting mixed sparse mode and dense mode groups has an additive property: the total bandwidth overhead is equal to the sum of the bandwidth overhead when the network has only sparse mode groups and the bandwidth overhead when the network has only dense mode groups. This is because dense mode control messages and sparse mode messages are always sent in separate packets. Storage overhead also has the additive property. Experimental results gained from networks with only sparse mode groups and results with only dense mode groups can be combined to predict network overhead with mixed sparse and dense mode groups.
The number of packets in flight along the shortest path tree from s to r at stable state is F_spt:

F_spt(s, r) = D_spt(s, r) * R(s)    (6.13)

The number of packets in flight along the RP tree from s to r at stable state is F_rpt:

F_rpt(s, r) = D_rpt(s, r) * R(s)    (6.14)

When a receiver switches from RPT to SPT, the number of packets lost during the transition, L_rpt->spt, is approximately:

L_rpt->spt(s, r) = F_rpt(s, r) - F_spt(s, r)    (6.15)
In the most favorable situation, the RPT to SPT switch can be free of packet loss if:
1. The path from the source to the receiver along the RP tree is exactly the same as the path along the SPT rooted at the source, i.e. the physical packet delivery path does not change during the RPT to SPT transition;
2. The difference in delay between the RPT path and the SPT path is smaller than the time interval between consecutive packets.
When loss does happen, the worst case is that the inter-packet intervals are much smaller than the delay difference between the RPT path and the SPT path. This worst case scenario may happen if an extremely unfavorable router is used as an RP for a widely dispersed multicast group. Note that since RP selection tools are being developed to avoid this case, it should occur relatively rarely.
When the source rate is fixed, packet loss during the RPT to SPT transition is directly related to the data packet size: for a given amount of data, the larger the packet size, the longer the inter-packet interval, and the fewer packets are lost. For example, for a vat12 pcm2 source (71 Kb/s, 40 ms frames) with a 50 ms SPT path to a receiver and a 100 ms RPT path, at most two 355-byte packets can be lost. But if vat uses pcm4 encoding (68 Kb/s, 80 ms frames), no packet will be lost during the RPT to SPT switch!
12 Vat is an audio terminal tool developed by Van Jacobson and Steve McCanne at LBL.
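The worked example above can be checked with a short calculation following formula (6.15). The ceiling and the loss-free cutoff below are our reading of the two conditions listed earlier, giving a worst-case count rather than an exact reproduction of the simulator:

```python
import math

def packets_lost(d_rpt_ms, d_spt_ms, frame_ms):
    """Worst-case packets dropped in the RPT -> SPT switch: the
    in-flight difference (delay difference x rate), rounded up.
    No loss is possible when the delay difference is smaller than
    the inter-packet interval (condition 2 above)."""
    delta = d_rpt_ms - d_spt_ms        # extra delay along the RP tree
    if delta <= frame_ms:
        return 0
    return math.ceil(delta / frame_ms)

# vat pcm2: 40 ms frames, 100 ms RPT path vs 50 ms SPT path -> 2 packets
# vat pcm4: 80 ms frames, same paths -> 0 packets
pcm2_loss = packets_lost(100, 50, 40)
pcm4_loss = packets_lost(100, 50, 80)
```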
Figure 6.7: Packet loss as a function of packet size (source rate fixed at 71 Kbps, PCM2 audio; RPT/SPT path difference: 50 ms)
To fully understand the transitioning process, it would be ideal if one simple experiment setup could cover all possible scenarios. The following statement effectively reduces the experiment space without sacrificing the generality of our simulation results.
Lemma 6 In PIM sparse mode, when a receiver switches from RPT mode to SPT mode, the number of packets dropped during the blackout period is only dependent on two factors:
1. the delay difference between the RPT path and SPT path from the source to the receiver; and
2. the source's sending behavior (rate, packet size).
Other factors, such as topological features in a particular network, are irrelevant to the packet loss during this period.
Hence it suffices to simulate the topology shown in figure 6.1 with ranges of
different link and source parameters. The results will hold in other topologies and
group membership distributions if the RPT path and SPT path delay difference and
the source’s sending behavior are the same.
Figure 6.7 shows simulated packet loss as a function of packet size in the network of figure 6.1. The source rate is fixed at 71 Kbps. The difference in delay along the RPT path and the SPT path is 50 ms, roughly the worst case scenario for arbitrary RP placement inside the continental United States. One useful fact in this figure is that for a PCM encoded audio source, there is no packet loss when the packet size is larger than 500 bytes.
Figure 6.8: Packet loss as a function of source rate and packet size (RPT/SPT difference: 50 milliseconds)
Figure 6.8 shows packet loss as a function of both source rate and packet size in the network of figure 6.1, for a PCM encoded audio source, when the RPT/SPT path length difference is 50 ms. The contour lines on the base plane show the boundaries of regions having the same drop rates.
6.4 Chapter conclusions
The tradeoffs between sparse mode and dense mode PIM operations are evaluated via two measures: the state storage overhead and the control bandwidth overhead. With known measures of group density, the state storage and control bandwidth overheads can be calculated for dense mode operations. The bounds for such overheads can be estimated for sparse mode operations. Simulations were run over the arpanet topology and results presented. The arpanet experiments showed that for groups with densities of less than 0.3, sparse mode operation has less storage and bandwidth overhead in general. A density value of 0.3 in the arpanet corresponds to groups with 10% to 20% of the total number of network nodes. We speculated, and verified with random topologies, that the results from the arpanet topology can be generalized to networks of different sizes with the same average node degree. To estimate the overhead tradeoff in networks with different node degrees, the results need to be factored along the density axis by the ratio of the two networks' node degrees.
The number of packets lost during the transition from RPT to SPT is a function of the path length difference between the RPT and SPT branches and the source's inter-packet intervals (or the source's rate and packet sizes). Slow to moderate rate applications such as audio sources normally suffer zero or insignificant packet losses in normal operating environments. High speed sources can incur higher packet loss rates, especially when the difference between the RPT path and SPT path is significant. Since such losses only occur when switching from RPT to SPT, it is advisable to avoid switching between the two tree types unnecessarily.
Chapter 7
Routing Information Reduction: Aggregation
and Data Driven Techniques
The previous chapters assumed that each multicast delivery tree was represented by
an independent state entry inside the network. In this chapter we will analyze a few
techniques that allow the state information to be shared across different multicast
delivery trees. One approach to reduce state is to use wild card addresses in multicast
routing entries. We investigate two schemes taking this approach: aggregation across
source addresses and aggregation across groups. The other approach to reduce state
is to create and maintain data-driven routing state allowing prompt deletion of state
for idle sources or groups. Two techniques taking this approach are investigated:
data-driven shared trees; and dense mode emulation.1
These state reduction techniques are important because the potential number of multicast routes is much larger than the number of potential unicast routes in the same network2. Unfortunately, aggregation techniques used in unicast routing cannot be applied directly to multicast routing due to the lack of topological structure in multicast group addresses.
A multicast tree can be uniquely identified by the pair (S, G), where S is a source (or wild card) address and G is a multicast group address. One way to reduce the number of entries is to use wild cards to replace specific addresses. Table 7.1 shows four possibilities in the design space using wild card addresses.
1 Our discussion uses PIM as its primary example but is more widely applicable.
2 For a network with N distinct destinations, there are 2^N different combinations of group memberships but only (N choose 2) different pairs for unicast communications.
Table 7.1: Using wild-card address entries: the design space

                              Group address in full   Group address wild card
  Source address in full      (S, G)                  (S, *)
  Source address wild card    (*, G)                  (*, *)
Among the design points in table 7.1, (S, G) and (*, G) were discussed in the previous chapters. In between (S, G) and (*, G), there is an intermediate (S-aggregate, G) scheme which we will discuss in section 7.1.1. (*, *) represents a broadcast scheme, which has been studied extensively [30, 10]. (S, *) is another design point that we will investigate in this chapter.
When there exists a large number of long-lived idle groups, i.e. groups that have active members but no sources, state storage compression is possible along the temporal dimension. Just as a virtual memory system can discard pages that have not been referenced for a period of time, a multicast router can also discard certain routing entries that have not been used to forward data packets for a certain period of time. Such a technique is referred to as data-driven state. The opposite of data-driven is persistent state, where state information is kept for the lifetime of each group, as determined by end systems outside the network. Although it can be argued that group state growth can ultimately be bounded by limiting multicast group lifetimes, we will investigate how to introduce additional constraints on state requirements.
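The page-discard analogy can be sketched as a routing table whose entries carry a last-forwarding timestamp and are reclaimed after an idle interval. This is a simplified illustration; the class, field names and timeout policy are our own, not PIM's actual state machine:

```python
class DataDrivenTable:
    """Multicast routing entries that expire when no data packet has
    used them for `timeout` seconds, analogous to a virtual memory
    system discarding pages that have not been referenced."""

    def __init__(self, timeout):
        self.timeout = timeout
        self.entries = {}          # (source, group) -> last-used time

    def forward(self, source, group, now):
        # Data-driven: a data packet creates the entry on first use
        # and refreshes its timestamp on every subsequent use.
        self.entries[(source, group)] = now

    def expire(self, now):
        # Reclaim state for idle sources/groups; returns the count.
        stale = [k for k, t in self.entries.items()
                 if now - t > self.timeout]
        for k in stale:
            del self.entries[k]
        return len(stale)

table = DataDrivenTable(timeout=210.0)     # assumed idle interval
table.forward("S1", "G1", now=0.0)
table.forward("S2", "G1", now=100.0)
table.expire(now=250.0)                    # S1 idle > 210 s: reclaimed
```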
Table 7.2 shows the design space for schemes using data-driven techniques. Both types of multicast trees can be maintained in a data-driven fashion, although not with equal effectiveness. Past experience has shown that data-driven source-rooted SPT mechanisms are easier to design. In fact DVMRP, PIM-SM and PIM-DM all create and maintain source-rooted SPTs in a data-driven fashion. On the other hand, group shared trees are more difficult to make data-driven: special mechanisms are needed. Among the four design points in table 7.2, case (a) does not involve any
Table 7.2: The design space of data-driven techniques

                              Persistent (S, G) state     Data-driven (S, G) state
  Persistent (*, G) state     (a) All state persistent    (d) (S, G) data-driven,
                                                              (*, G) persistent
  Data-driven (*, G) state    (b) (S, G) persistent,      (c) All state data-driven
                                  (*, G) data-driven
data-driven actions and all routing state information is kept for the group lifetime. Case (b) is not promising given that the number of sources is usually larger than 1. When the sources are silent, such a scheme will only be able to get rid of the shared tree state, while keeping all SPT state. Case (c), if realized, achieves the ultimate effect of maintaining zero state for a group when no source is sending to that group under idealized conditions. Case (d) is one valid design point that has already been adopted as a basic behavior in PIM-SM, where each group will only have its shared tree persist inside the network while the bulk of the state information is timed out when all sources are silent.
Since it is difficult to design a data-driven shared tree mechanism for case (c), an alternative is to convert the shared trees to SPTs and then rely on the data-driven SPT designs to achieve the goal of maintaining zero state for idle groups. In the context of PIM design, this method has been referred to as dense mode emulation3.
This chapter investigates the following state reduction techniques in more detail:
1. Route aggregation:
3 This method was originally presented by Van Jacobson in one of the PIM design meetings in early 1994.
(a) across sources;
(b) across different multicast groups;
2. Data driven techniques:
(a) Data driven setup and maintenance of shared trees4;
(b) Dense mode emulation for sparse mode groups;
3. Higher level group address usage control to limit group life time
The rest of this chapter is organized as follows. Section 7.1 discusses two aggregation techniques. In section 7.2 we explore two data driven techniques. Section 7.3 discusses the issue of using higher level mechanisms to limit group lifetimes. We conclude this chapter with section 7.4.
7.1 Aggregation Techniques
This section discusses aggregation techniques using wild-card addresses to reduce
the number of routing entries.
7.1.1 Aggregation across sources
The motivation for this scheme is based on the fact that there can be more than one source sending to a multicast group from a single routing domain. The number of SPT entries is equal to the number of active sources. By combining the routing entries of all SPTs rooted at sources from the same domain for the same group, we may reduce the number of routing entries for each group to the number of domains containing sources.
Under this aggregation scheme, all (S, G) entries with source S in the same domain are aggregated into one entry, and the outgoing interface list of the aggregated entry will be the union of the outgoing interface lists of all individual entries before aggregation. However, if the source aggregate is large and alternative paths are abundant, the (S, G) entries for sources in the same domain can have different incoming interfaces and outgoing interface lists. This will cause data packets to be flooded onto unwanted links. To avoid this problem, we may change the rule such that only entries with the same outgoing interface list are aggregated into one entry, and use longest match search when looking up the routing table at data forwarding time; i.e. for each source S in the same aggregate and group G, count the number of (S, G) entries for group G sharing each combination of outgoing interfaces in their outgoing interface lists. Those entries sharing an outgoing interface list with the largest number of other entries are aggregated into one (S-agg, G) entry. The rest of the source specific entries, with different outgoing interface lists, are left untouched.
4 Formed as a result of discussions with Deborah Estrin.
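The modified rule (merge only entries sharing the most common outgoing interface list, leave the rest source-specific) can be sketched as follows; this is our own illustration of the rule, with invented entry and interface names, not an implementation from the text:

```python
from collections import Counter

def aggregate_sources(entries):
    """entries: {(source, group): frozenset_of_outgoing_interfaces}.
    For each group, entries whose outgoing interface list matches the
    most common list collapse into one (S-agg, group) entry; entries
    with other lists are left untouched, and a longest-match lookup
    (source-specific entry first) is used at forwarding time."""
    by_group = {}
    for (s, g), oifs in entries.items():
        by_group.setdefault(g, []).append((s, oifs))
    aggregated = {}
    for g, lst in by_group.items():
        common_oifs, _ = Counter(o for _, o in lst).most_common(1)[0]
        aggregated[("S-agg", g)] = common_oifs
        for s, oifs in lst:
            if oifs != common_oifs:       # left untouched
                aggregated[(s, g)] = oifs
    return aggregated

# Three sources in one domain; S1 and S2 share an interface list.
entries = {("S1", "G"): frozenset({"if1", "if2"}),
           ("S2", "G"): frozenset({"if1", "if2"}),
           ("S3", "G"): frozenset({"if3"})}
agg = aggregate_sources(entries)   # 2 entries instead of 3
```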
One major problem with this scheme is the so-called address fan-out problem in routers close to the sources. A router close to the source aggregate may have multiple neighbor routers, all leading to sources in the same aggregate. If the join message such a router receives contains only the source aggregate address and the group address, it will be difficult for the router to decide precisely where the sources are and to which neighbors it should send joins.
Such source specific address information must be obtained either from the downstream direction (from the join messages) or from the upstream direction (from the sources).
When such information is obtained from the downstream join messages, routers with (S-agg, G) state could pass through (S, G) joins without establishing (S, G) state. Such join messages with source specific information can be used by the fanning-out router to create (S, G) state. The problem with this approach is that those joins passed through the intermediate routers cannot be aggregated with the normal periodic PIM refresh messages.
Another approach is to use source-proxies inside an intermediate cloud between the sources and the receivers5. A cloud border router, acting as a proxy for those sources, will inject packets into the cloud, placing its own address as the source address in the packet header6. The maximum number of sources 'visible' to routers internal to the cloud will be the number of border routers of the cloud, instead
5 Van Jacobson raised and discussed this concept with the PIM authors.
6 E.g. by encapsulating the data packets, placing the proxy's address in the source address field of the outer header, and the multicast group address in the destination address field.
Figure 7.1: A cloud to apply the source-proxy method. (a) Without proxy; (b) with proxy.
of all hosts in the whole network. In order for the upstream border routers to generate source specific join messages toward the sources, when an exit border router receives a source specific join message from downstream routers outside the cloud, it multicasts the join message to all other border routers. Therefore each border router can keep an accurate cache of all relevant source specific addresses. The overhead of multicast-delivered control messages is a trade-off for the potential state savings achieved. This approach is similar to that adopted for DVMRP [27].
One major problem with this source-proxy scheme is that when there are multiple entry points to a cloud, the proxy scheme degenerates. Fig 7.1 shows an example of a cloud with 5 border routers and a group with 3 sources. Fig 7.1(a) shows three source-rooted shortest path trees rooted at S1, S2 and S3 with different line styles. Note that the outgoing interface lists on internal routers C1 and C2 for the three trees are different. Fig 7.1(b) shows two proxy trees for this group, rooted at border routers B1 and B2 respectively.
Because only border router addresses are used to build the proxy tree inside the cloud, when there are multiple entry routers for the same source, data packets will be duplicated along certain downstream links. In the example in fig 7.1(b), S2 has two entry points to the cloud, therefore its packets will be forwarded along both proxy trees rooted at B1 and B2. If an RPF check is only performed on the proxy sources (to prevent looping), all downstream receivers will receive two copies of all
data packets from S2. Van Jacobson proposed performing an RPF check both on the proxy sources and on the specific source addresses (to reduce duplicates); then redundant data traffic will be carried by a smaller number of links inside the cloud. In fig 7.1(b), S2's data packets will only be unnecessarily forwarded on C1-C2, C2-C1, and C2-C3. One way to justify the extra overhead of data packets on unwanted links is to treat it as a bandwidth-state tradeoff. But a counter argument is that it trades off data bandwidth versus control state, instead of control bandwidth overhead versus state overhead. This means the routers should be able to identify the high data rate sources to avoid overloading certain links, and should only selectively apply the proxy mechanism to groups with only very low data rate sources. By introducing more source specific state inside the cloud, some of the extra data packets can be eliminated, but so will the benefit of the proxy scheme.
We have explored two possible ways of doing aggregation on source addresses in
downstream routers while still allowing certain upstream routers to obtain source
specific addresses from the downstream direction of the aggregation tree.
When the source specific address information is obtained from the upstream direction, there are a few possible designs that can let the sources make their identities known to the appropriate border routers. One approach is to let the sources register their existence with all border routers by sending special unicast control messages. This requires that all routers be able to clearly identify the correct border routers on behalf of local receivers. Such capability is not generally available in unicast routing protocols.
A second approach is to use a broadcast-based search scheme; i.e. run dense mode locally inside the source domain for local sources, and create a special mechanism to let the exit border routers appear as group members to sources inside the source aggregate domain, but not as group members to sources outside the domain. A second function of this mechanism is to distinguish local sources from non-local sources and only take dense mode action for local sources. Once again, these two functions are hard to design under the constraints of existing unicast routing.
We conclude that aggregation across sources based on a common source address
prefix (i.e. aggregate) is not a viable solution.
Note that the two methods discussed in this subsection and the next subsection represent techniques that use wild-card addresses; the number of routing entries is reduced through compression. Between these two methods, the effectiveness in terms of state storage reduction is influenced by the maximum number of sources that can send to a group and the maximum number of groups that can exist in a network. Since low rate sources can be aggregated via a shared tree, we only need to be concerned with high speed sources. However, the maximum number of high speed sources that can send to a group is limited by the capacity of the receivers; as a result, the reduction in state storage by aggregating across sources is also limited. On the other hand, the number of groups a low rate source can send to is virtually unlimited. Therefore state reduction by aggregation across groups is both more important and more promising.
7.1.2 Aggregation across groups
When a source S is sending to multiple multicast groups, say G_1, G_2, ..., G_k, the routing entries in a router for different groups (S, G_i1), (S, G_i2), etc., may be aggregated together into one entry (S, *), resulting in savings in state storage.
This scheme is potentially more interesting than the previous one in that the number of multicast groups a low rate source can send to is almost unlimited7.
There are two ways to realize this scheme: (1) all routing entries in a router for the same source are merged, with the union of the outgoing interface lists of all entries installed as the outgoing interface list of the aggregated entry; (2) routing entries for the same source are merged only if they share the same outgoing interface list, other entries with different outgoing interface lists are left intact, and longest match search is performed when forwarding data packets.
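Method (2) can be sketched in the same spirit for a single source: entries are grouped by outgoing interface list, the most common shared list collapses into one (S, *) wildcard entry, and the state saving is the before/after ratio measured in the experiments below. This is our own illustrative code, with invented group and interface names:

```python
from collections import Counter

def aggregate_groups(entries):
    """entries: {group: frozenset_of_outgoing_interfaces} for ONE
    source. Groups sharing the most common outgoing interface list
    collapse into a single wildcard entry (key "*"); the rest keep
    group-specific entries, matched first at forwarding time."""
    if not entries:
        return {}
    common, count = Counter(entries.values()).most_common(1)[0]
    out = {"*": common} if count > 1 else {}
    out.update({g: oifs for g, oifs in entries.items()
                if oifs != common or count == 1})
    return out

def state_saving(entries):
    """Fraction of routing entries eliminated by the aggregation."""
    before, after = len(entries), len(aggregate_groups(entries))
    return 1.0 - after / before

# One source sending to four groups; three share an interface list,
# so three entries collapse into one wildcard entry: saving = 0.5.
entries = {"G1": frozenset({"a"}), "G2": frozenset({"a"}),
           "G3": frozenset({"a"}), "G4": frozenset({"a", "b"})}
saving = state_saving(entries)
```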
Method (1) may incur excessive bandwidth consumption by data packets traveling over unwanted links. The gain in state savings increases as the number of groups gets larger. When the membership distribution of each group is random, the excess bandwidth consumption by data packets traveling along unwanted links will increase as the number of groups increases. When the number of groups is large enough, this method reduces to a wide area broadcast scheme. In general, excessive
7 It may also benefit other non-multicast applications that want to take advantage of the capability to choose different alternate routes of multicast join messages.
Figure 7.2: State savings of group-based aggregation under uniform random membership distribution. (a) Results for 3-member groups; (b) results for 10-member groups. Different curves correspond to results measured for trees rooted at different sources; the X-axis is the number of multicast groups rooted at the same source.
bandwidth consumed by broadcasting data packets is unpredictable and depends on
the source data rates8.
Method (2) does not suffer the extra bandwidth overhead of flooded data packets. However, the effectiveness of method (2) depends upon the group membership distributions and the number of groups. In the absence of a clear picture of what kind of group membership distributions will exist in the future and how many groups a network will need to support for each potential source, we simulated the scheme over the Arpanet topology consisting of 47 nodes, under different randomized membership distributions. Figure 7.2 shows the state savings for different numbers of groups. Fig 7.2(a) shows results from 3-member random groups; (b) shows results from 10-member random groups. For each source sending to the groups, the total state storage in all routers is calculated before and after the aggregation is applied. The ratio of these two storage totals is calculated and plotted versus the number of groups in the two graphs in fig 7.2. Different curves depict state reductions for different sources that send to the same set of groups in the network. The results confirmed our intuitive prediction that as the number of groups grows larger, the savings in terms of state storage increase. It also shows that in the experimental
8 This is one major reason multicast is always favored over broadcast as a generic one-to-many delivery paradigm in a wide area network.
80
topology, larger groups can have better state savings. Qualitatively, when the number of groups is large enough, the state storage can be reduced by as much as about 80% of the non-aggregated entry storage required.
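The aggregation effect measured above can be illustrated with a toy model (a hypothetical sketch, not the simulator used for figure 7.2): treat each group's on-tree routers as a random subset of the topology, count one entry per group per on-tree router without aggregation, and one merged entry per router under ideal group-based aggregation.

```python
import random

def state_saving_ratio(n_routers, n_groups, tree_size, seed=0):
    """Toy model of group-based aggregation at a single source.

    Without aggregation, each on-tree router stores one entry per group;
    with ideal aggregation, a router stores a single merged entry for all
    of the source's groups whose trees pass through it.
    """
    rng = random.Random(seed)
    trees = [set(rng.sample(range(n_routers), tree_size))
             for _ in range(n_groups)]
    before = sum(len(t) for t in trees)   # per-group (S,G) entries
    after = len(set().union(*trees))      # one merged entry per on-tree router
    return after / before

# The ratio falls as the number of groups grows, echoing the trend in fig 7.2.
ratios = [state_saving_ratio(47, g, 10) for g in (1, 5, 20, 80)]
```

With one group there is nothing to merge (ratio 1.0); as groups proliferate the union of on-tree routers saturates the 47-node topology while the unaggregated total keeps growing linearly.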
We repeated the above experiments with groups that have non-uniform membership distributions, i.e. members are more likely to appear in a particular subset of regions than in the other regions. These experiments are based on the speculation that since the network user population is not uniformly distributed and because of differences in working habits etc., the group membership distributions will usually be non-uniform [25]. We designated 27% of the routers to be more frequented
by group members, so that 80% of members of each group are randomly distributed
among these routers. The remaining 20% of members are uniformly distributed
among the other routers. The result for 3-member groups showed less than 10% state reduction relative to that shown in fig 7.2(a); the result for 10-member groups showed a reduction of about 50% compared with that in fig 7.2. The shapes of these
curves are roughly the same.
In summary, the experiments illustrated how the reduction in state storage can
increase as the number of groups a source is sending to increases. When the group
members are distributed unevenly, and when the number of groups is large enough,
the state storage savings can be more than 80% compared with that required without any aggregation.
Overall, we are not discouraged by the fact that aggregation across sources does
not work well, because low data rate sources can always stay on the shared tree
and high data rate sources don’t present an interesting case for aggregation: the
maximum number of sources sending to a group at high data rates is limited by the
capacity of the receivers. However, we are more concerned about aggregation across
groups because the number of groups with low traffic rates can be unlimited. It is
important that an aggregation scheme scale well in the number of groups.
7.2 D ata Driven Techniques
As stated before, there may exist long lived groups that are idle for most of their
life times. By removing routing entries that have been idle for a certain period of
time and recreating the entries when there is data traffic in need of those entries,
the state storage requirements can be reduced.
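The removal-and-recreation policy just described can be sketched as a small idle-timeout cache (hypothetical class and method names; a real router would drive this from the forwarding path and its timers):

```python
class MulticastStateCache:
    """Toy table of multicast routing entries that expire when idle."""

    def __init__(self, idle_timeout):
        self.idle_timeout = idle_timeout
        self.entries = {}   # entry key, e.g. ("*", "G") -> last time data used it

    def on_data(self, key, now):
        # Data traffic (re)creates the entry and refreshes its timestamp.
        self.entries[key] = now

    def expire_idle(self, now):
        # Remove entries idle longer than the timeout; they are rebuilt
        # on the next data packet that needs them.
        dead = [k for k, last in self.entries.items()
                if now - last > self.idle_timeout]
        for k in dead:
            del self.entries[k]
        return dead
```

For example, with a 30-second timeout an entry last used at time 0 is purged by a sweep at time 31, while an entry used at time 20 survives.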
7.2.1 Data-driven setup and maintenance of shared tree state
In this subsection, we explore a new mechanism that attempts to make both source-rooted shortest path tree state and group shared tree state data-driven. This scheme
represents a possible design point that can be valid when groups do not have limited
life times and the active sources only send during very small proportions of group
life times. There are two major goals for this mechanism:
1. To remove (*,G) in absence of data packets;
2. To not keep state information on off-tree routers, even when the source is
active.
That is, the scheme operates in sparse mode when sources are active, and keeps no state inside the network when no source is active.
Under this scheme, a designated router’s (*,G) state is still refreshed by the
local IGMP membership reports. When a member for a new group G appears, the
designated router sends a (*,G) join toward the RP. But when there are no data
packets arriving for this group G for a certain time interval, the designated router
will stop sending joins refreshing the upstream (*,G) state. This will cause the
upstream (*,G) state to be timed out and disappear.
When a source for group G starts sending again, it will send a data packet
toward the RP encapsulated in an RP register message. The RP will broadcast this
data packet using a reverse path forwarding algorithm, so that those LANs whose
designated routers have (*,G) state will restart sending (*,G) joins toward the RP.
There is a potential reliability problem with this scheme: if only the first data packet of each active period is broadcast, loss of that packet will black out certain regions of the network. One possible fix is to let the designated routers that have suppressed sending of (*,G) joins time out the suppression state and
start sending periodic (*,G) joins again without receiving a data packet. This fix,
however, would compromise the benefit of the data-driven mechanism: either the
82
timer that times out the suppression state is too long such that the join latency would
be too high in case of a loss; or the timer is too short such that the timed out state is
restored too quickly which makes the savings in state storage insignificant. Another
possibility is to repeat, a small number of times, the broadcast of a beacon packet signaling 'source alive' to the downstream leaf LANs when a source becomes active.
By arranging the time intervals between subsequent beacon packet broadcasts, the
probability of having a black-out in portions of the network could be reduced to
a negligible level. But since this represents an increase in the complexity of the
protocol design, it is questionable whether it is feasible or worthwhile to engineer
this mechanism.
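Under the usual assumption of independent losses, the chance that all of k repeated beacon broadcasts are lost on a given path with per-broadcast loss rate p is p^k, so a handful of repeats drives the blackout probability down quickly:

```python
def blackout_probability(loss_rate, repeats):
    """P(all `repeats` independent beacon broadcasts are lost on a path)."""
    return loss_rate ** repeats

# Even on a fairly lossy path, three beacons make a blackout very unlikely:
p = blackout_probability(0.05, 3)   # 5% per-broadcast loss, 3 beacons
```

With 5% loss, three beacons reduce the blackout probability to 0.000125, consistent with the claim that the risk can be made negligible at modest extra complexity.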
7.2.2 Dense mode emulation for sparse mode groups
PIM maintains data-driven source-rooted shortest path trees. However, as described
above, the shared tree is cumbersome to support in a data-driven manner.
This subsection discusses the dense mode emulation scheme that transforms the
part of a shared tree that is inside a particular network cloud into a set of source-rooted shortest path trees. These SPTs are maintained by data-driven dense mode
mechanisms.
As illustrated in figure 7.3, to transform the part of a shared tree that is inside
a dense mode emulation cloud, when a data packet comes down the shared tree
and arrives at an entry router to this cloud, the entry router multicasts this data
packet to all border routers of this cloud. Those border routers that are the RPF
entry routers for the packet source will inject the packet inside the cloud as if they
were forwarding a data packet along a dense mode source-rooted shortest path tree.
Routers inside the cloud operate in data-driven dense mode, and have no knowledge
of the shared tree outside the cloud.
In order to let the entry router of the shared tree (BR1 in fig 7.3) be able to
periodically send refresh messages upstream, it needs to maintain the proper shared
tree (*,G) state. Such state needs to be refreshed by joins from the downstream
routers with (*,G) state. To propagate the (*,G) joins across the dense mode emulation cloud, all shared tree (*,G) join messages from downstream routers outside
the cloud are passed through routers inside the cloud without creating any state.
[Figure 7.3 diagram: an RP-rooted shared tree entering a dense mode emulation cloud; border routers BR1–BR5, with (*,G) state outside the cloud and (S,G) state inside]
Figure 7.3: Dense Mode emulation: Transforming the part of shared tree inside a
cloud into source-rooted shortest path trees. BR1 is the entry router to the cloud
for the shared tree, BR2 and BR3 are two SPT entry routers for source S. BR4
and BR5 are two exit routers leading to downstream receivers on the RP tree. BR1
multicasts data packets received from the RP tree to all-border-routers group.
Table 7.3: State storage requirement for a group with a persistent-state shared tree
den                        0.1   0.2   0.3   0.4
Number of (*,G) entries      7    13    19    25
On the one hand, the state inside a dense mode emulation cloud is made data-
driven; on the other hand, when there are multiple active sources, there will be
multiple shortest path trees set up for all sources. Compared with only one set of
(*,G) state, this makes the savings in state storage dependent on the sources’ active-
idle duty cycles.
To formulate the storage requirements under different conditions, we use the
following notations:
N : number of nodes inside the dense mode emulation cloud
L : number of links inside the dense mode emulation cloud
den : density of the (*,G) tree inside the cloud without doing dense
mode emulation
s : number of sources for group G
d/T : the average active-idle duty cycle of sources for group G, i.e. an
entry (S,G) is maintained for d seconds out of each interval of T seconds
and is deleted for an interval of (T − d)
By definition of den9, the total number of shared tree entries for group G is
L × den + 1. When all sources are active, the number of (S,G) entries within the
cloud is (s × N). With dense mode emulation, and average source duty cycles of
d/T, the average number of (S,G) entries will be:

    s × N × (d/T)    (7.1)
To compare the state storage overhead of dense mode emulation versus that of the persistent-state shared tree mechanisms, we assume a cloud of 30 routers and
average node degree of 4. Table 7.3 shows the number of permanent-state shared
tree entries in the cloud without dense mode emulation.
9See chapter 6.
Table 7.4: Average number of (S,G) entries inside the cloud with dense mode emulation

d/T \ # of sources     1      2      3      4      5      6      7      8      9
0.01                 0.3    0.6    0.9    1.2    1.5    1.8    2.1    2.4    2.7
0.05                 1.5    3.0    4.5    6.0    7.5    9.0   10.5   12.0   13.5
0.10                 3.0    6.0    9.0   12.0   15.0   18.0   21.0   24.0   27.0
0.20                 6.0   12.0   18.0   24.0   30.0   36.0   42.0   48.0   54.0
0.40                12.0   24.0   36.0   48.0   60.0   72.0   84.0   96.0  108.0
Table 7.4 shows the average number of (S,G) entries inside the cloud running
dense mode emulation.
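The table entries follow directly from the two expressions above; a quick check in Python, using the figures stated in the text (N = 30 routers, average node degree 4, hence L = 60 links):

```python
N, L = 30, 60   # 30 routers, average node degree 4 -> 60 links

def shared_tree_entries(den):
    # Persistent-state shared tree: one (*,G) entry per on-tree link, plus one.
    return L * den + 1

def emulated_entries(s, duty_cycle):
    # Dense mode emulation, equation (7.1): s * N * (d/T) on average.
    return s * N * duty_cycle

row_73 = [shared_tree_entries(d) for d in (0.1, 0.2, 0.3, 0.4)]  # table 7.3
cell_74 = emulated_entries(1, 0.01)  # table 7.4: one source, 1% duty cycle
```

The computed values reproduce table 7.3 (7, 13, 19, 25 entries) and the first cell of table 7.4 (0.3 entries).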
Comparing tables 7.3 and 7.4, it can be seen that:
• The smaller the number of sources, the more advantageous dense mode emulation is. E.g. with a single source that has a duty cycle of 1%, the average (amortized) number of state entries required under dense mode emulation is only 0.3, a saving of more than 20 times compared with that of the permanent-state shared tree.
• The smaller the S,G duty cycle, the more advantageous dense mode emulation
is. In the example topology, the advantages of dense mode emulation gradually
disappear when the duty cycle increases above 20%.
For example, suppose the cloud supports an open audio conference session that is alive for 3 days (72 hours) and there are conference activities for 6 hours from 3 sources, i.e. each source is only active for 2 hours on average, which corresponds to d/T = 0.028. Dense mode emulation yields 2.5 entries. This is quite favorable compared to the
case without dense mode emulation (21 entries when den = 0.1, 75 entries when
den = 0.4).
While dense mode emulation can potentially save state storage, it incurs some
extra bandwidth overhead due to the fact that (*,G) join traffic is constantly passed
through the cloud and data packets are delivered by the dense mode broadcast and
prune process.
Table 7.5: Without dense mode emulation: number of control updates for (*,G) maintenance

(*,G) tree density       0.1    0.15     0.2    0.25     0.3    0.35
# of control updates   25920   38880   51840   64800   77760   90720
Table 7.6: With dense mode emulation: number of control packets with 2 sources

d/T \ (*,G) density     0.1    0.15     0.2    0.25     0.3    0.35
0.001                 26231   39173   52116   65059   78001   90944
0.01                  29030   41817   54604   67392   80179   92966
0.1                   57024   68256   79488   90720  101952  113184
0.15                  72576   82944   93312  103680  114048  124416
0.2                   88128   97632  107136  116640  126144  135648
Let T_G be the group life time, l_G be the average total length of the active
intervals of a source during the group life time (so l_G = (d/T) × T_G), I_refresh
be the interval at which refresh join messages are sent, and I_prune_time_out be
the length of the timer used to time out a negative cache entry in dense mode. The
following formula gives the total number of control updates within the cloud:

    (L × den) × T_G / I_refresh    (7.2)

Under dense mode emulation, the number of control packets will increase by the
extra flood and prune traffic on unwanted links:

    (L × den) × T_G / I_refresh + (1 − den) × L × 2 × (l_G / I_prune_time_out) × s    (7.3)
In the same cloud of 30 routers that was described above, the control message overhead of maintaining (*,G) state without dense mode emulation is shown in table 7.5. Table 7.6 shows the numbers of control packets inside the cloud with
dense mode emulation, when there are two sources sending to the group.
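The table values are consistent with equations (7.2) and (7.3) under specific timer settings that are my assumptions, not stated explicitly in the text: a 72-hour group life time T_G, a 60-second join refresh interval, and a 180-second prune time-out. A sketch under those assumptions:

```python
L = 60                  # links in the 30-router example cloud
T_G = 72 * 3600         # assumed group life time: 72 hours, in seconds
I_REFRESH = 60          # assumed join refresh interval (s)
I_PRUNE = 180           # assumed prune time-out (s)

def updates_no_emulation(den):
    # Equation (7.2): periodic (*,G) refreshes on every on-tree link.
    return L * den * T_G / I_REFRESH

def updates_with_emulation(den, duty_cycle, s):
    # Equation (7.3): refreshes plus flood-and-prune traffic on unwanted links.
    l_G = duty_cycle * T_G   # total active time of one source
    return (L * den * T_G / I_REFRESH
            + (1 - den) * L * 2 * (l_G / I_PRUNE) * s)

base = updates_no_emulation(0.1)             # table 7.5, density 0.1
emul = updates_with_emulation(0.1, 0.01, 2)  # table 7.6, d/T = 0.01, 2 sources
```

Under these assumed timers the sketch reproduces the corresponding table entries (25920 and 29030), which is why those values were chosen.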
Comparing tables 7.5 and 7.6, we see that the increase in the number of control packets due to dense mode emulation ranges from roughly 1% more in the best case to about 2.4 times more control traffic when the sources have larger duty cycles. Overall, when the source duty cycle is well below 0.2, dense mode emulation can normally achieve good state savings at the cost of small increases in control bandwidth
consumptions. Note that the results presented in this section are not dependent
on particular graph structures inside the topology. The only significant factors that
affect the results are the group density, network node degrees, number of links inside
the network, number of sources and idle-active duty cycles of the sources.
7.3 Limiting the group life times
Sections 7.2.2 and 7.2.1 have assumed that the group life times are unlimited, despite the fact that the active periods of the sources are limited. In the real world,
applications can be left running for many days or even months on a workstation,
without input from a human user. The communication channels provided by such
applications have sometimes been used as hailing channels for users to interact in
‘cyberspace’ across geographically distant locations10. One critical question is how
many of the future multicast applications will be operating in this mode? Unfortunately, there is no way to get a definite answer and to accurately project the application mixes of the future. When large numbers of such applications exist, data-driven techniques provide one solution to the state storage reduction problem.
It has been argued that if an efficient mechanism can be designed to limit the
life time a group’s control traffic and state information can sustain in the network
in the absence of data traffic11, we can achieve the same goals we tried to achieve with
data-driven techniques, but without the need to design specific mechanisms to be
10For example, the ’USC netlab audio’ vat channel has been used by people geographically spread
over ISI near the Pacific ocean, USC main campus near downtown Los Angeles, some students’ home
and Xerox Parc in Northern California and perhaps other places, to hold impromptu discussions
on things ranging from research topics to lab junk disposals.
11Note that we cannot limit the life time a group can exist — there may exist applications that
are long lived and are always active, such as a weather station that broadcasts its report constantly
through a wide area multicast group.
‘data-driven’. Assume this is doable. Since the presence of members of a group is controlled by applications outside the network, the only way the network can suppress the sending of join packets in the absence of data traffic is to not let join messages come out of the leaf subnetworks — either by suppressing the report of membership information, or by suppressing the sending of join messages by the designated routers of leaf LANs. To trigger the suppression of join messages, there needs to be a mechanism to inform the hosts or routers on these leaf LANs about the absence
of data traffic inside the network. This mechanism should also be able to inform
these LANs to terminate suppression of joins and restart sending join messages as
soon as there is presence of data traffic again. Since we don’t have yet another
point-to-multipoint delivery mechanism to deliver these notification messages from
the source to all leaf LANs with members, the only type of mechanism available is
a broadcast mechanism. By using a broadcast mechanism to deliver data-presence
or data-absence notification messages to leaf LANs, such machinery would be similar to the data-driven mechanisms introduced in section 7.2.1. Therefore, we don’t
consider this concept an independent option in the design space.
7.4 Summary of Multicast Routing Information
Reduction Techniques
In this chapter, we explored routing information reduction techniques. Five meaningful design points were investigated in detail.
Route aggregation across sources can be achieved by aggregating entries based
on the source address prefixes, or by employing source-proxy mechanisms within a
network cloud. Both methods suffer realization problems that make them unattractive.
Aggregation across groups is achievable and can have potentially significant savings in state storage. Experiments in the 47-node Arpanet topology have shown that
the savings can be as much as about 80% of the original storage required when the
number of groups originating from a source is large enough.
The data-driven setup and maintenance of shared tree state approach can remove
both state and control traffic for the shared tree in the absence of data traffic.
However, this scheme lacks a good simple reliability mechanism for operations over
networks with lossy links.
Dense mode emulation for sparse mode groups can be used in a transit network
cloud to avoid potential multicast routing table explosions. It converts sparse mode
shared trees into dense mode source-rooted shortest path trees when data traffic is
present, and removes state about all groups that have no data traffic. The analysis has shown that for small duty-cycle (active/idle) groups, the state saving is significant at moderately increased control bandwidth.
Overall, aggregation across groups and dense mode emulation are two promising
approaches for multicast routing information reduction. The effectiveness of aggregation across groups largely depends on how many groups a source normally sends to, and the effectiveness of dense mode emulation depends on how many multicast groups are long-lived with short active-idle duty cycles. I.e., the effectiveness of both methods depends on the composition of multicast traffic. Hopefully, multicast traffic characteristics will become known when the need for further reducing multicast
routing state (to below that required by PIM currently) arises.
Chapter 8
Analysis of a Resequencer Model for Multicast over ATM Networks
It is well-known that multicast delivery saves bandwidth and offers logical addressing capabilities to the applications. In an ATM network, the receivers of a multicast group need to differentiate cells sent by different sources. This demultiplexing requirement can be satisfied in an ATM environment using multiple dedicated point-to-multipoint virtual channel connections (VCs), but with certain shortcomings. This chapter discusses an alternative resequencing model to solve this problem. It scales well in large networks. Three resequencing methods are developed and simulation results reported. The strategy is useful for applications spanning large
regions where it is desirable to mix streams of cells from different bursty sources
onto the same virtual channel.1
8.1 Introduction
It is important for future ATM networks to have multicast capability, as they will
support such applications as teleconferencing and information distribution services.
Applications based on multicast have two major advantages over unicast: logical
addressing and bandwidth savings [14]. With logical addressing, an application uses
a single multicast group address to send and receive data. It is not necessary for
senders and receivers to know the number or location of group members. Multicast
1 We would like to thank Steve Deering, Bryan Lyles, Lixia Zhang and the referees for helpful
discussions and comments.
provides significant savings in bandwidth because the sender transfers only a single
copy of the data. Data cells are not replicated until they reach a branching point in
the multicast delivery tree, at a location closer to the destinations.
ATM is a connection oriented technology, in which connections must be explicitly set up before data can be transferred. An ATM connection can be viewed as a
cached route [28], in the sense that each cell carries only a small routing tag — the
virtual path identifier and virtual channel identifier (VPI/VCI). Unfortunately, if
cells from different sources are multiplexed onto the same virtual channel, they will
carry the same routing tag, or VPI/VCI, upon arrival at a receiver. There is not
a straightforward way to distinguish cells sent from different sources. This is often
referred to as the cell demultiplexing problem. Although several solutions have been
proposed, they have shortcomings and, most importantly, they do not scale well.
This work proposes a solution that both solves the cell demultiplexing problem
and scales well. The following section describes a few proposed solutions to ATM
multicasting and discusses their shortcomings. Section 8.3 develops the proposed
resequencer model. Section 8.4 presents simulation results for the performance of
several different resequencing methods.
8.2 Strategies for Implementing Multicast in ATM
The cell demultiplexing problem can be avoided if one VPI/VCI is used for each
source. In this case, each multicast VPI/VCI is a one-to-many connection representing one multicast tree and is independent from other VPI/VCI’s. Each tree can
be optimized according to some criterion such as least delay or least cost.
The VPI is an 8-bit field in the cell header. If it is used to identify a multicast
group, it restricts the number of multicast connections to even less than the number
available with VCI’s. In addition, VP switching may be used to bundle large numbers of VC connections, or to separate VCs of differing quality of service classes. Therefore, we assume that the VP will not be available for use as a multicast tree identifier. For the remainder of this discussion we refer to the use of VCI’s. Figure 8.1 shows
the one-vc-per-source multicast model for groups with multiple senders.
The one-vc-per-source strategy is compatible with the current IP multicast models [14], provided that the switch signaling mechanisms have access to the installed
Figure 8.1: One-vc-per-source multicast model
IP multicast routing tables [28]. The scheme is useful for (a) applications with small
numbers of sources; (b) local multicast groups where there is not a shortage of VCs
relative to the demand; and (c) applications where all sources continuously transmit
(e.g., video).
However this scheme uses a large number of VCIs and therefore does not scale
well to networks with many nodes and large numbers of long-lived multicast groups.
In particular, the scheme has a number of limitations. The number of virtual
channels is too limited to accommodate a sizable number of large, long-lived groups.
Furthermore, if bandwidth is reserved or allocated on a per-VC basis, the large number of VCs resulting from a multicast group will quickly use up the available bandwidth. Similarly, in the public network, if the cost of cell switching depends on the number of virtual channels, the additional VCs used in per-source-VC will
be a costly disadvantage. In addition, since each sender defines a tree, when a new
receiver joins a group, it must be added to all existing trees.
It is important for ATM multicast to scale to large size. Applications involving
hundreds, or even thousands, of end users globally are not too far distant. Thus it
is desirable to explore alternatives in which the number of available VCs in a switch
or host interface is not an upper bound on the number of sources an application can
support.
Recently several models have been proposed for ATM multicast, aimed at solving
the VC depletion problem, and achieving high throughput and scalability [26] [19]
[1] [29]. One proposal [19] is for all sources in a multicast group to share one
virtual channel, and to use collision detection techniques, such as Aloha, to avoid
intermixing cells from different sources. The CRC will detect when a cell is out of
order, i.e., not from the same sender, and the sender retransmits. In this scheme, the
common virtual channel is treated as a shared media, similar to an ethernet cable,
with some of the same advantages and disadvantages ethernet has. The scheme is
usable for local, low-bandwidth and non-real-time applications.
In another scheme [26], an identifier in the ATM payload field is used to distinguish cells from different sources. It is not clear how simple or complex the
procedures for negotiating unique identifiers among sources would be. This scheme
was developed for AAL4 in which there is a MID (multiplexing ID) field in the
cell information header. However, currently many developers of ATM plan to use
AAL5 [22] which does not have such a demultiplexing field in the header. In any
case, the width of the MID field, which is 10 bits, puts a limit on the number of
sources allowed in a group. Another cost is that the host interface is required to
have additional demultiplexing hardware.
8.3 The Resequencer Solution to Multicast
The scheme we introduce solves the VCI depletion and bandwidth allocation problems without requiring specialized bandwidth allocation schemes for certain common
cases of multicast. The traffic classes for which this approach is well suited are: (a)
transmission per source is not continuous, (b) small to modest size PDUs, (c) large
multicast groups. First we describe the basic resequencing strategy, followed by
several extensions and refinements to the basic idea. Next we present a group tree
model which extends the scheme to a larger network.
8.3.1 Resequencers
To achieve better bandwidth sharing without complicating the VC-based resource
management mechanisms, and to avoid the VC availability dilemma, cells from different sources are multiplexed onto a single VC. A designated source in the group is elected as a resequencer. All other sources send their multicast ATM cells to the designated resequencer, which buffers incoming cells from each source until all cells of
a PDU are received. After the last cell of a PDU has been received, the resequencer
forwards all cells of that PDU onto a single outgoing VC, without interleaving them with cells from other sources.
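The forwarding rule can be sketched as follows (a hypothetical cell representation: each cell carries its source, a payload, and a flag marking the last cell of a PDU, as an AAL5 end-of-PDU indication would):

```python
from collections import defaultdict

class Resequencer:
    """Buffers cells per source; emits a PDU's cells contiguously once complete."""

    def __init__(self):
        self.pending = defaultdict(list)  # source -> cells of its current PDU
        self.out_vc = []                  # single outgoing point-to-multipoint VC

    def on_cell(self, source, payload, last):
        self.pending[source].append(payload)
        if last:  # end-of-PDU seen: forward the whole PDU without interleaving
            self.out_vc.extend(self.pending.pop(source))

# Interleaved arrivals from two sources come out un-interleaved:
r = Resequencer()
r.on_cell("s1", "a1", False)
r.on_cell("s2", "b1", False)
r.on_cell("s1", "a2", True)   # s1's PDU complete: a1, a2 emitted together
r.on_cell("s2", "b2", True)   # s2's PDU complete: b1, b2 emitted together
```

The outgoing VC carries a1, a2, b1, b2 in that order, so a receiver can delimit PDUs by source even though all cells share one VCI.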
Note that the outgoing VC of the resequencer is a point-to-multipoint connection.
Cells are duplicated at the branching point(s) and the order of cells is preserved
across all links in this one-to-many connection. If a sender is also a receiver, there
is a branch of the one-to-many VC leading back to the sender.
Although this single resequencer scheme provides the benefits of multicast, it
has several problems. The processor and link speed of the source designated as a
resequencer could become a bottleneck. Choosing the resequencer is complex, and
there may be situations where no ideal selection of designated resequencer exists.
This scheme is best when all senders are close to the resequencer.
To extend the scheme to accommodate the case where members are spread over
different regions, and each region has many members, members in each region elect
a source to act as a resequencer for local senders. A one-to-many VC is created
for each resequencer which delivers cells to all other resequencers of the group, and
cells are therefore delivered to all receivers. The number of VCs required is now
reduced to the number of participating regions, instead of the number of sources (N
one-to-many VCs for N resequencers).
To make the scheme more efficient, resequencing can be done in switches instead
of in hosts. Normally switches sustain higher aggregate cell rates. To participate
in multicast, the host only needs to know the address of its nearest multicast resequencer. In a local area ATM, the location of a multicast resequencer can be either statically assigned, or dynamically discovered, so that the native nodes are kept updated about the available multicast servers. To avoid long call setup latency, the
resequencer information can be cached in the host. This approach eliminates the
necessity of going through a decision process to choose a host resequencer.
8.3.2 Multicast Group Trees
Our final refinement of this approach is to further reduce the number of VCs maintained for a group. We do this by building a group tree instead of per-source (or
[Figure 8.2 diagram: three multicast resequencers (M1, M2, M3) connected by a group tree, with senders and receivers attached to their local resequencers]
Figure 8.2: The resequencer and group multicast tree model
per-source-region) rooted trees. Connections among the resequencers are bidirectional. A resequencer serializes all cells destined for a common group and duplicates
them onto the appropriate outgoing ports with correct VPI/VCIs — some lead to
other resequencers, some go to local members. Figure 8.2 depicts a group tree of re
sequencers and receivers. The group tree is a delivery path in common for cells from
all sources. As Figure 8.2 shows, each source establishes a point-to-point connection
to its resequencer and sends cells towards a multicast group along that connection,
as if it were a unicast destination.
Either point-to-point or point-to-multipoint connections can be used to distribute
cells from a resequencer to its local receiving members, depending on support from
the local network. If a member is both a sender and a receiver, a bidirectional point-
to-point connection is used. The connections between s2, s3 and Resequencer M2 in
fig 8.2 are such examples. To reduce the bandwidth used in local multicast cell de
livery, local switches must support point-to-multipoint connections. The connection
between resequencer M3 and rO, r l and r2 in fig 8.2 is such a point-to-multipoint
connection, where r0, r1 and r2 are all receivers2. Note also that there may be many
“pure” ATM switches between any two resequencers, and between any resequencer
and its local members. The paths drawn in figure 8.2 are logical channels.
8.3.3 Discussion
The resequencer approach requires algorithms different from those used for IP multicast to compute the group multicast tree. The tree will not be optimal in terms
of shortest path or least delay. However, suboptimal solutions should suffice for a
number of applications [29] [4].
Since this resequencer and group tree model can be built upon one-to-one and
one-to-many virtual channel connections, it can coexist with the one-vc-per-source
model. An application can choose one of the two models. For example, video
applications where sources transmit continuously may choose to use the one-vc-per-
source model, whereas video conferencing with compressed streams and multiple
changing sources may choose the resequencer approach. The resequencer and group
tree model is also flexible in that it works both with and without point-to-multipoint
virtual channels. The difference is in the bandwidth savings of the local cell delivery
paths. This approach turns out to be especially well suited for voice conference
applications, where one bandwidth allocation for the whole multicast group may
be sufficient and is independent of the number of sources. In an audio conference,
it is unusual to have multiple simultaneous transmitters. Usually only one or two
persons speak at a time, as is consistent with higher level human protocols.
In order to set up and maintain the group tree, a higher layer protocol is needed, either a network layer protocol or a call control level protocol. Different protocols
may be appropriate for this purpose. One possibility is to adapt Deering’s host
membership protocol and multicast tree setup protocols [14].
The resequencer and group tree multicast model has a nice scaling property.
Only one virtual channel is needed for each receiver to receive cells from all sources.
When a new member (be it a sender or a receiver or both) joins, it only needs to do the setup necessary to reach a resequencer already in the group. The operations of join and leave are inherently local: growing or pruning a branch does not involve a distant part of the tree.

2Note that it is not suggested to replace the reverse channels of s2, s3 and s4 with a single point-to-multipoint connection, unless the senders do not care whether they receive duplicated cells looped back by resequencer M2.
Two performance parameters, maximum throughput and delay, will be used to assess the resequencer and group tree model, and will determine the range of applications of this model. Different implementation methods result in different maximum throughputs, ranging from tens of megabits per second to rates approaching the maximum unicast rate. We will show in the next section that as the need for higher speed multicast service appears, there are ways to increase the maximum throughput without changing the end system.
8.4 Three Approaches to Resequencing
In the resequencer and group tree model, resequencers are the crucial components
that determine performance. We describe three different schemes for resequencing
cell streams from different sources: in software at the network layer; in hardware; and
an optimized hardware scheme. The performance of these approaches is evaluated
with respect to throughput, delay and delay jitter.
8.4.1 Software Resequencing
Multicast cells from the same source can be reassembled at the resequencer in the same way that signaling messages are reassembled. An ATM switch with signaling capability has an associated CPU and memory to process and forward signaling messages [8]. Similarly, for resequencing of multicast cells, the reassembled PDUs are passed to the control processor, which makes routing decisions by inspecting the address field in the header of the PDU. The forwarding process is shown in figure 8.3(a). The route lookup and PDU forwarding are done in software, similar to a datagram router.
To examine performance, as shown in figure 8.3(b), the cell forwarding delay can be decomposed into three parts: (1) the delay to collect the cells of the PDU; (2) the delay due to the route lookup algorithm; and (3) the transmission delay.3 We

3PDU queuing delay need not be considered here, if we assume no congestion.
Figure 8.3: Software PDU forwarding. (a) Cells sent to the multicast group are reassembled at the AAL5 layer into a network layer PDU; (b) the PDU forwarding delay is the sum of the cell collection delay, the routing delay, and the transmission delay.
observe that the cell collection time and transmission time of consecutive PDUs can be overlapped. Therefore, route lookup is the dominant factor affecting throughput. With certain optimizations in the implementation, the software forwarding scheme can achieve high throughput. Consider a switch equipped with a 50 MIPS processor. Assume its multicast routing table has 1000 entries and it takes about 1000 instructions to forward one PDU4. The per-PDU route lookup and forwarding time will be ~20 μs. For a 256-byte PDU, this corresponds to a maximum throughput of 100 Mbps, if the cell collection times and transmission times of consecutive PDUs are completely overlapped.
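A quick back-of-the-envelope check of these numbers (the constants below are the ones assumed in the text):

```python
# Back-of-envelope throughput of software resequencing.
# Assumptions from the text: 50 MIPS CPU, ~1000 instructions per PDU.
MIPS = 50e6            # instructions per second
INSTR_PER_PDU = 1000   # rough estimate for one route lookup + forward
PDU_BYTES = 256

lookup_time = INSTR_PER_PDU / MIPS          # seconds per PDU
throughput_bps = PDU_BYTES * 8 / lookup_time

print(f"lookup time: {lookup_time * 1e6:.0f} us per PDU")   # 20 us
print(f"max throughput: {throughput_bps / 1e6:.0f} Mbps")   # ~100 Mbps
```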
The delay experienced by each individual PDU depends on a number of factors. In the one-vc-per-source model, a PDU is not reassembled until it reaches an end host. Using the software resequencing method, a PDU has to be reassembled in every resequencer it passes. The additional delay is the sum of the cell collection delays and routing delays the PDU experiences in all resequencers along its path. The PDU cell collection delay depends on the PDU size and the link speed.
Figure 8.4 shows the PDU cell collection delay under different link speeds and PDU sizes. Multimedia applications with delay jitter or other real-time requirements tend to use small packet sizes, between 128 and 256 bytes. Even with larger PDU sizes (1000 bytes) and a slower link speed (155 Mbps), the worst case aggregate reassembly and routing delay in the above example can still be less than ~7.7 ms, if fewer than 100 resequencers are involved in the longest path.
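As a rough check of the ~7.7 ms figure (assuming standard 53-byte ATM cells with 48-byte payloads; the other constants are the ones used in the text):

```python
# Rough check of the worst-case aggregate delay quoted in the text.
# Assumptions: 53-byte ATM cells with 48-byte payloads, 1000-byte PDUs,
# a 155 Mbps link, 20 us software route lookup, 100 resequencers on the path.
import math

CELL_BYTES, PAYLOAD_BYTES = 53, 48
PDU_BYTES = 1000
LINK_BPS = 155e6
LOOKUP_S = 20e-6
HOPS = 100

cells = math.ceil(PDU_BYTES / PAYLOAD_BYTES)       # 21 cells per PDU
collect_s = cells * CELL_BYTES * 8 / LINK_BPS      # cell collection delay
total_s = HOPS * (collect_s + LOOKUP_S)
print(f"per-resequencer delay: {(collect_s + LOOKUP_S) * 1e6:.1f} us")
print(f"aggregate over {HOPS} resequencers: {total_s * 1e3:.1f} ms")  # ~7.7 ms
```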
Thus the performance of the software resequencing method, in terms of speed and delay, is good for applications requiring less than 100 Mbps and involving fewer than thousands of regions, provided the longest path of the multicast group tree has fewer than a few hundred resequencers.
8.4.2 Hardware Resequencing
When higher throughput is required, we propose a method using hardware support in the switch. Instead of a software route lookup, multicast cells can be forwarded

4This is a rough estimate. If the operating system in ATM switches were better designed for communication protocol processing, the number of instructions would be smaller.
Figure 8.4: PDU reassembly delay versus PDU size (one curve per link speed, e.g. 155 Mbps)
directly by hardware switch fabrics, using mechanisms similar to those used for forwarding point-to-point connection cells. Incoming VCs carrying multicast cells are treated separately from those for normal point-to-point connections; their cells are queued in buffers, neither forwarded nor reassembled. The switch monitors for the End of Packet (EoP) cell. When the EoP cell arrives, the hardware dumps all cells of that PDU back-to-back onto the outgoing links.
Hardware resequencing requires extra buffers and buffer management mechanisms in the resequencing switch hardware. The amount of buffering required depends on a variety of factors, including the PDU size, traffic characteristics, congestion control algorithms, etc. We can estimate the number of buffers required by a resequencer in the restricted case, assuming no congestion and that sources obey negotiated resource allocations. In this case, a switch has sufficient bandwidth to carry the incoming traffic, and the queue size for each incoming VC should not grow beyond 2 PDUs. More buffer space is needed only when there is congestion. If the maximum number of active multicast groups is N, the maximum number of active source streams coming into a resequencer for each active group is S, and the maximum PDU size used by all multicast sources is P, then the maximum amount of buffering required in a resequencer is 2 × N × S × P.
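The bound can be expressed directly; the parameter values in the example below are illustrative assumptions, not taken from the text:

```python
# Upper bound on resequencer buffering, per the 2 x N x S x P formula.
# The example parameter values are illustrative, not from the text.
def buffer_bound(n_groups: int, sources_per_group: int, max_pdu_bytes: int) -> int:
    """Max buffer (bytes) with no congestion and well-behaved sources:
    at most 2 PDUs queued per incoming VC."""
    return 2 * n_groups * sources_per_group * max_pdu_bytes

# e.g. 100 active groups, 4 active sources each, 1000-byte maximum PDU:
print(buffer_bound(100, 4, 1000))  # 800000 bytes, i.e. 800 KB
```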
8.4.3 Optimized Hardware Cell Forwarding
The third method is an optimization of the second. When the resequencer receives the first cell of a PDU from a source, and there are no cells from other sources queued or being forwarded, it forwards all cells directly on the outgoing links without queuing them first, until it forwards the End of Packet cell. During this forwarding action, if cells from other sources arrive (i.e., on other incoming links), they are queued and processed according to either the software or hardware resequencing method. To reduce delay variance, after an EoP cell has been sent, the switch should try to process queued cells before processing newly arriving cells, which are themselves queued.
This optimized scheme is particularly suited for certain applications, such as voice conferences, where normally only one or very few participants are actively transmitting to the group at the same time. Both the queuing time and the route lookup time for such groups are eliminated. This scheme achieves almost the same cell delivery throughput and delay as the one-vc-per-source method; however, it uses only a single VC for the whole group. This method is less suitable for groups with many members simultaneously transmitting data at a constant rate to the group.
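A minimal sketch of this cut-through logic, assuming one resequencer output and illustrative class and method names (a real switch would implement this in the fabric hardware, not in software):

```python
from collections import deque

class OptimizedResequencer:
    """Sketch of the cut-through optimization; names and structure are
    illustrative assumptions, not from the text."""

    def __init__(self):
        self.active_vc = None   # source VC currently being cut through
        self.queues = {}        # per-VC queues of (cell, is_eop) awaiting service

    def on_cell(self, vc, cell, eop):
        """Handle one arriving cell; return the cells to emit on outgoing links."""
        q = self.queues.setdefault(vc, deque())
        idle = self.active_vc is None and not any(self.queues.values())
        if vc == self.active_vc or idle:
            # No other source queued or being forwarded: forward directly
            # without queuing, until the End of Packet cell.
            self.active_vc = None if eop else vc
            return [cell] + (self._drain_queued_pdu() if eop else [])
        # Another source holds the output: queue this cell instead.
        q.append((cell, eop))
        if eop and self.active_vc is None:
            return self._drain_queued_pdu()
        return []

    def _drain_queued_pdu(self):
        # After an EoP, serve already-queued cells before new arrivals:
        # emit the first complete queued PDU back-to-back.
        for q in self.queues.values():
            if any(e for _, e in q):
                pdu = []
                while q:
                    cell, e = q.popleft()
                    pdu.append(cell)
                    if e:
                        return pdu
        return []

# Example: source A cuts through; B's PDU is queued, then released complete.
r = OptimizedResequencer()
out = [r.on_cell(vc, c, e) for vc, c, e in
       [("A", "a1", False), ("B", "b1", False), ("A", "a2", True), ("B", "b2", True)]]
print(out)  # [['a1'], [], ['a2'], ['b1', 'b2']]
```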
8.5 Simulation Studies
Since this model is targeted towards applications with real-time constraints, we have built an event-driven simulator to study the cell forwarding delays and delay jitter. End-to-end delay and variance depend on per-switch delays and other parameters. We simulate and measure the switches under different traffic conditions and with each of the resequencing methods described.
The simulator uses a queuing model to characterize the behavior of the different buffers in a switch. The input queues buffer incoming cells; the PDU queue is used by the software resequencer to hold reassembled PDUs before route lookup; the output queues store outgoing cells when the cell arrival rate is larger than the link speed. For simplicity, a FIFO algorithm is used to schedule the resources whenever competition occurs. The link speed, link delay, network connectivity, software PDU forwarding speed, end-to-end route, multicast group information, etc., are entered by the user and stored statically.
A simple traffic model has been used to approximate audio/video style traffic sources in the simulations. The traffic model assumes that each source generates data at a constant rate, and every t seconds a PDU is ready to send. The PDU is then chopped into ATM cells and sent onto the network at the link speed, while at the same time the application (slower than the network) is generating data for the next PDU. The source stops generating data after sending several PDUs and stays silent for a randomly selected period of time. Then it starts sending again.
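The on/off source described above can be sketched as a generator; the parameter names are illustrative assumptions, and cell spacing at the link rate is omitted for brevity:

```python
import random

def on_off_source(pdu_cells, period_s, burst_pdus, mean_silence_s, seed=0):
    """Sketch of the simulated source: emit one PDU (a burst of cells)
    every `period_s` seconds for `burst_pdus` PDUs, then stay silent for
    a random time, then repeat. Yields (time, cell_index, is_eop).
    Parameter names are illustrative, not from the simulator."""
    rng = random.Random(seed)
    t = 0.0
    while True:
        for _ in range(burst_pdus):
            for i in range(pdu_cells):
                # Cells of one PDU share a timestamp here; a finer model
                # would space them by the cell transmission time.
                yield (t, i, i == pdu_cells - 1)
            t += period_s                           # next PDU ready t later
        t += rng.expovariate(1.0 / mean_silence_s)  # random silent period

# First PDU of a 20-cell source with a 20 ms period:
src = on_off_source(pdu_cells=20, period_s=0.02, burst_pdus=5, mean_silence_s=0.5)
first_pdu = [next(src) for _ in range(20)]
```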
A link is essentially idle during the interval between the delivery of two consecutive PDUs. This partially explains why, in the following simulation, even when the total active sending time of all the relatively slow members of the group exceeds 100%, the delay caused by resequencing colliding PDUs is not very significant. Of course, the number of collisions can be larger if the slow sources, together with the link delays, make different PDUs frequently arrive at a resequencer at the same time.
A fully connected 17-node network (including 11 hosts and 6 switches) was constructed. A total of 10 multicast groups were created. The resequencer switch located at the network bottleneck was set up for measurements. Figure 8.5 shows the distributions of the cell delays across that switch. The same traffic sources are used for the three runs with the different resequencing methods. The measured path has four sources sending to the same group, whose percentages of time actively sending are 80%, 50%, 50%, and 30%. Although the source rates are below the saturation point, the sources do collide inside the resequencer and get queued and reordered on exit. The tails of the curves are caused by the queuing delays. The horizontal distance between the hardware resequencing curve and the software resequencing curve reflects the route lookup delay (20 μs). The horizontal distance between the optimized and the hardware resequencing curves reflects the cell collection time (related to link speed and PDU size). The differences in the heights of the three curves reflect the fact that the more time a PDU spends in a resequencer, the more chance it has to collide with another PDU. The width of the delay distribution curves shows the jitter the cells experience.
Figure 8.5: Distribution of cell forwarding delays. Number of cells versus switch cell forwarding delay (microseconds), for software, hardware, and optimized resequencing.
Most of the sources simulated use 20-cell PDUs, except for one that uses smaller 9-cell PDUs and one that uses larger 30-cell PDUs. With larger PDU sizes, the curves in the figure would move horizontally: curves 1 and 2 would move towards the right, indicating larger cell collection delays, and more collisions would occur. The links at the measured resequencer are heavily utilized, at 50% to 90% of the available bandwidth.
8.6 Chapter Conclusions
We have introduced a resequencer and group tree multicast model for multicast over ATM networks and discussed the traffic classes for which this approach is well suited. The resequencer model solves the cell demultiplexing problem of group shared trees and the VC depletion problem. Three resequencing methods were studied and simulation results shown. This model is particularly suitable for low rate sources and for applications where it is beneficial to multiplex multiple sources onto the same channel (e.g. multiple audio sources in an audio conference).
Chapter 9
Summary of Contributions and Future Research
We conclude this thesis with a summary of the major conclusions and a discussion
of future research areas.
9.1 Summary of Main Contributions
The overhead of a multicast routing protocol can be affected by the type of multicast trees used, the protocol's tree calculation algorithms or mechanisms, and its membership advertisement and tree state maintenance mechanisms.
We presented a simulation based comparison of the performance of different multicast tree types. Three measures were used to evaluate the quality of each type of tree: delay, cost and traffic concentration. The source rooted shortest path trees (SPTs) by nature offer the best end-to-end delay, and have less traffic concentration when large numbers of groups are active. Center based trees offer satisfactory average end-to-end delays, with worst case delays approaching 2 times the shortest path delays. Each center based tree is shared by all senders sending to the same multicast group, resulting in savings in state storage overhead compared with the case when all sources use separate source-rooted shortest path trees for the same group. Center based trees also offer modest savings in the total bandwidth consumed for carrying the data traffic compared with source rooted shortest path trees. But center based trees do exhibit higher degrees of traffic concentration than the source rooted shortest path trees. In summary, the source-rooted shortest path trees and center based trees are complementary in terms of delay, cost and traffic concentration characteristics.
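A toy example of the delay tradeoff: with the center placed off the direct source-receiver path, the center based tree delay can reach twice the shortest path delay. The topology and weights below are made up for illustration:

```python
# Toy illustration of why a center based tree's worst-case delay can
# approach twice the shortest path delay. Topology and weights are made up.
import heapq

def dijkstra(graph, src):
    """Shortest path distances from src over a weighted adjacency dict."""
    dist = {src: 0}
    heap = [(0, src)]
    while heap:
        d, u = heapq.heappop(heap)
        if d > dist.get(u, float("inf")):
            continue
        for v, w in graph[u].items():
            if d + w < dist.get(v, float("inf")):
                dist[v] = d + w
                heapq.heappush(heap, (d + w, v))
    return dist

# Sender S and receiver R are adjacent; the center C sits off their path.
graph = {
    "S": {"R": 1, "C": 1},
    "R": {"S": 1, "C": 1},
    "C": {"S": 1, "R": 1},
}
spt_delay = dijkstra(graph, "S")["R"]                               # direct: 1
cbt_delay = dijkstra(graph, "S")["C"] + dijkstra(graph, "C")["R"]   # via C: 2
print(spt_delay, cbt_delay)  # 1 2 -> center based delay is 2x here
```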
In a protocol supporting both source rooted shortest path trees and group shared center based trees, there exists the possibility of destructive multicast loops. We assumed a naive multicast protocol that relies on unicast routing to set up multicast tree branches and that uses both tree types without special consideration for the overlapping branches of the different tree types. We demonstrated that multicast routing loops may exist when the unicast routing is inconsistent or asymmetric. We have shown that in the absence of unicast loops, an RPF check mechanism can effectively prevent multicast routing loops when no multi-access LANs are involved. When multi-access LANs do exist, an RPF check alone is not sufficient to prevent loops. Four necessary conditions and one sufficient condition for the existence of multicast loops involving multi-access LANs were presented. We then investigated three multicast loop prevention mechanisms that break some of the necessary conditions, and reasoned why the assert mechanism is favored over the other two mechanisms.
A multicast routing protocol may operate in sparse mode or dense mode and achieve different bandwidth-state tradeoffs for different group membership distributions. We used PIM as a reference protocol to investigate the tradeoffs of sparse mode and dense mode. With known measures of group density, the state storage and the control bandwidth overhead can be calculated for dense mode operation. We presented formulae calculating bounds on the sparse mode operation overhead. Simulations were run over the Arpanet topology and results presented. The Arpanet experiments showed that for groups with densities of less than 0.3, sparse mode operation has less storage and bandwidth overhead in general. A density value of 0.3 in the Arpanet corresponds to groups with 10% to 20% of the total number of network nodes. We speculated, and verified with random topologies, that the results from the Arpanet topology can be generalized to networks of different sizes with the same average node degree.
Aside from choosing different operating modes or different tree types, we also explored techniques to further reduce routing state information. We discussed the design space for routing information reduction techniques. Five meaningful design points were investigated in detail. Route aggregation across sources can be done either by aggregating entries based on the source address prefixes, or by employing source-proxy mechanisms within a network cloud. Both methods suffer realization problems that make them unattractive. Aggregation across groups is possible and can yield potentially significant savings in state storage. Experiments on the 47-node Arpanet topology have shown that the savings can be as much as about 80% of the original storage required when the number of groups originating from a source is large enough. The data-driven setup and maintenance of shared tree state can potentially remove both state and control traffic for the shared tree in the absence of data traffic; it operates in sparse mode in the presence of data packets. However, this scheme lacks a good, simple reliability mechanism for operation over networks with lossy links. Dense mode emulation for sparse mode groups can be used in a transit network cloud to avoid potential multicast routing table explosions. It converts sparse mode shared trees into dense mode source-rooted shortest path trees when data traffic is present, and removes state about all groups that have no data traffic. The analysis has shown that for small duty-cycle (active/idle) groups, the state saving is significant at moderately increased control bandwidth. Overall, we are not discouraged by the fact that aggregation across sources does not work well, because low data rate sources can always stay on the shared tree, and high data rate sources do not present an interesting case for aggregation: the maximum number of sources sending to a group at high data rates is limited by the capacity of the receivers. However, we are more concerned about aggregation across groups, because the number of groups with low traffic rates can be unlimited. It is important that an aggregation scheme scale well in the number of groups.
Finally, we looked at multicast routing in an ATM environment. We introduced a resequencer and group tree multicast model for multicast over ATM networks and identified the traffic classes for which this approach is well suited. The resequencer model solves the cell demultiplexing problem of group shared trees and the VC depletion problem. Three resequencing methods were studied and simulation results shown. This model is particularly suitable for low rate sources and for applications where it is beneficial to multiplex multiple sources onto the same channel (e.g. multiple audio sources in an audio conference).
9.2 Future Research
This thesis has mainly concentrated on tree types, the dynamics of operating modes, and routing state overhead reduction. The following are related problems that were not addressed in this thesis, but that will be essential for the efficient operation or support of a scalable multicast routing protocol such as PIM.
9.2.1 Open Issues in Multicast Routing Information Reduction Techniques
In chapter 7, we indicated that it would be ideal if multicast routing could support the same user population that unicast routing does, at comparable cost. The work presented in chapter 7 should be viewed as a first attempt to solve the potential multicast routing table explosion problem that will arise when multicast applications are widely used.
As stated in the conclusion of chapter 7, the effectiveness of the various routing information reduction techniques depends on the behavior of multicast applications. For example, the technique that aggregates multicast routing entries across groups performs better when the number of groups a source sends to is large, while a data-driven technique can be very effective when significant portions of multicast applications are long-lived and only have traffic for very small percentages of their group lifetimes. However, it is a challenge to find out precisely the characteristics of future multicast traffic.
In the meantime, it will be useful to study the social interactions and to extrapolate the user behaviors over the existing multicast facilities into the future. Predictions of user population and distribution might be possible based on the information available for the current Internet. When multicast routing becomes more widely available and used, we may collect statistics from the routers.
The impact of future network topological structures should also be considered. The future Internet might still be composed of a small number of distinguishable fast backbone routers and links, in which case the results from the experiments reported in this thesis can be easily extended to it. But if the future Internet is less structured and does not feature a top-level backbone (or a small number of such backbones), other systematic investigations will be necessary.
9.2.2 Scalable Timers
Soft-state, periodic-refresh based mechanisms have been favored by the designers of certain protocols (e.g. PIM, RSVP) due to the inherent advantages of the soft-state approach, such as simplicity, robustness and compatibility with data-driven techniques. However, a potential pitfall with the soft-state approach is that when the size of the state to be updated is huge, the bandwidth overhead of state updates can be unbounded. Note that not all pieces of state information to be updated have to correspond to active streams of data traffic. To tackle this problem, Van Jacobson presented a scalable timer approach to limit the total update traffic to below a certain percentage of a link's bandwidth capacity. With this approach, the period at which a state refresh message is sent can vary from router to router, so that low speed links will not be overwhelmed by state update traffic.
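A minimal sketch of such a timer calculation, with assumed parameter names and values (not taken from any protocol specification):

```python
# Sketch of the scalable-timer idea: choose the refresh interval so the
# total state-update traffic stays below a fixed fraction of the link
# bandwidth. Parameter names and defaults are illustrative assumptions.
def refresh_interval_s(state_bytes, link_bps, bw_fraction=0.01, min_interval_s=5.0):
    """Seconds between refreshes so refresh traffic uses at most
    `bw_fraction` of the link, never refreshing faster than min_interval_s."""
    needed = state_bytes * 8 / (link_bps * bw_fraction)
    return max(min_interval_s, needed)

# 1 MB of state over a 1.5 Mbps link, capped at 1% of bandwidth:
print(refresh_interval_s(1_000_000, 1.5e6))   # ~533 s between refreshes
# The same state over a 155 Mbps link can refresh much more often:
print(refresh_interval_s(1_000_000, 155e6))   # ~5.2 s
```

Slow links thus refresh less often, at the cost of slower convergence, which motivates the fast-response requirement below.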
The basic requirements of a scalable timer mechanism are:
1. Low overhead: The mechanism should impose minimal extra state and message overhead.
2. Fast response: When there is a state change in a certain part of the network, rapid propagation of the change to the relevant places is important. Inconsistent information can lead to non-optimal network resource utilization, blackout of regions for which usable alternative paths exist, and temporary loops before the information is refreshed. As discussed in chapter 5, certain loop prevention mechanisms may permit transient loops in rare cases. How fast such loops can be removed depends on how quickly the next state refresh message arrives.
3. Flexibility: The mechanism should be flexible enough to allow heterogeneous deployment and configurations.
4. Robustness: The scalable timer itself should be robust and resilient to different types of failures and different mixtures of independent configurations.
Although the concept of scalable timers is independent of specific protocols, its realization and impact will likely be protocol dependent. This is because different protocols have different mixes of state information, and in most cases some information is more critical than the rest. For example, in the case of PIM, when a unicast route change happens, it is more important to immediately refresh the multicast routing state for active groups; and when there are too many active groups to be refreshed at once, those with high data rates should be chosen first.
The scalable timer concept is still relatively new, and there is no prior analysis
or application in a real protocol. It is important to evaluate its potential scaling
benefits and to explore the various ways of applying this concept in real protocol
designs.
9.2.3 Multicast Group Address/Session Management
Another area that is crucial to the success of ubiquitous deployment of multicast
service is the management of multicast group addresses: the assignment and efficient
advertisement of these addresses.
The available space of multicast group addresses is limited compared to the
space for unicast addresses1. There is a need to investigate ways to accommodate
permanent group address assignments, and to accommodate dynamic assignments
for transient groups.
The advertisement of the assigned group addresses should be: (1) timely; in certain scenarios (e.g. certain transient groups) this can affect how fast a user can join a group, no matter how small a join latency the underlying multicast routing protocol has; (2) robust; (3) scalable, so that when the number of groups advertised grows large, the response time, robustness, cost, and other properties do not degrade unacceptably; (4) easy to use, so that an application or a user can efficiently find the group it needs to join; (5) low cost, since otherwise the session advertisement can become a major factor restricting the number of multicast groups a network can support; and (6) feasible to engineer in practice.
The current implementation of the LBL session directory tool sd offers good scaling properties across large geographically-distributed groups and across heterogeneous networks. It utilizes an efficient adaptive periodic refresh mechanism over a global multicast group. The intervals for refresh messages are dynamically adjusted according to the number of advertised sources, so that the total bandwidth consumption is maintained below a certain level.

1E.g. in IP version 4, the multicast address space (or class D address space) is only 1/16 of the total IP address space.
However, when the number of simultaneously advertised global groups or sessions grows to the thousands, it will not be easy for users to find and select the needed sessions, and the response time with the current implementation will be intolerable. Additional structures and transmission techniques will be needed to organize the addresses.
One possible direction to pursue is to introduce structure or logical hierarchies into the globally advertised groups, and to devise a method to propagate the session advertisements only to the small subregions of the global network that contain group members or senders. The logical structure or hierarchies should make it easier for users to find the wanted groups efficiently and quickly. By limiting the number of network routers/hosts involved with each advertised group, the network bandwidth/storage required for advertising the session information should be reduced.
Another possible direction to attack the scaling problem is to devise a massive
replication service, similar to the current domain name service system, that sets
up a hierarchy of servers that replicate the registered multicast session information.
This scheme would have different requirements from a DNS scheme. It must have a
much faster propagation time, and it must allow the end users to register and delete
entries.
In conclusion, a scalable multicast address/session advertisement mechanism will be another essential component in a full-featured multicast capable network architecture.
Reference List
[1] A. Segall, T. Barzilai, and Y. Ofek. Reliable multi-user tree setup with local identifiers. In Proceedings of the IEEE Infocom, 1992.
[2] Cengiz Alaettinoglu, A. Udaya Shankar, Klaudia Dussa-Zieger, and Ibrahim Matta. MaRS (Maryland Routing Simulator) - version 1.0 user's manual. Technical Report CS-TR-2687, Computer Science Department, University of Maryland, June 1991.
[3] Cengiz Alaettinoglu, A. Udaya Shankar, Klaudia Dussa-Zieger, and Ibrahim Matta. Design and implementation of MaRS: A routing testbed. Technical Report CS-TR-2964, Computer Science Department, University of Maryland, September 1992.
[4] A. J. Ballardie, P. F. Francis, and J. Crowcroft. Core based trees. In Proceedings of the ACM SIGCOMM, San Francisco, 1993.
[5] Bela Bollobas. Random Graphs. Academic Press, Inc., Orlando, Florida, 1985.
[6] Bob Braden, Dave Clark, and Scott Shenker. Integrated services in the internet architecture: an overview. RFC 1633, June 1994.
[7] Steve Casner. Second IETF internet audiocast. Internet Society News, 1(3):23, 1992.
[8] CCITT. Q.93b specification, March 1992. Draft text.
[9] T. Cormen, C. Leiserson, and R. Rivest. Single-Source Shortest Paths. The MIT Press, Cambridge, Massachusetts, 1992.
[10] Y. K. Dalal and R. M. Metcalfe. Reverse path forwarding of broadcast packets. Communications of the ACM, 21(12):1040-1048, 1978.
[11] S. Deering and D. Cheriton. Multicast routing in datagram internetworks and extended LANs. ACM Transactions on Computer Systems, pages 85-111, May 1990.
[12] S. Deering, D. Estrin, D. Farinacci, V. Jacobson, C. Liu, and L. Wei. An architecture for wide-area multicast routing. In Proceedings of the ACM SIGCOMM '94, London, September 1994.
[13] S. Deering, D. Estrin, D. Farinacci, V. Jacobson, C. Liu, and L. Wei. Protocol independent multicast (PIM), sparse mode protocol: Specification. Working draft, February 1995.
[14] Steven Deering. Multicast Routing in a Datagram Internetwork. PhD thesis, Stanford University, December 1991.
[15] Matthew Doar and Ian Leslie. How bad is naive multicast routing? In Proceedings of the IEEE Infocom '93, 1993.
[16] Dino Farinacci and Puneet Sharma. Private communications, 1995.
[17] Ron Frederick. IETF audio & videocast. Internet Society News, 1(4):19, 1993.
[18] E. N. Gilbert and H. O. Pollak. Steiner minimal trees. SIAM Journal on Applied Mathematics, 16(1):1-29, January 1968.
[19] D. H. Greene and J. Bryan Lyles. Reliability of adaptation layers. In Proceedings of the 3rd IFIP WG6.1/6.4 workshop on protocols for high-speed networks, Stockholm, May 1992.
[20] R. M. Karp. Reducibility among combinatorial problems. Plenum Press, New York, 1972.
[21] L. Kou, G. Markowsky, and L. Berman. A fast algorithm for Steiner trees. Acta Informatica, 15:141-145, 1981.
[22] T. Lyon. Simple and efficient adaptation layer (SEAL). Proposal T1S1.5/91-292 to ANSI working group T1S1.5, August 1991.
[23] J. Moy. MOSPF: Analysis and experience. RFC 1585, March 1994.
[24] J. Moy. Multicast extensions to OSPF. RFC 1584, March 1994.
[25] Vern Paxson. Growth trends in wide-area TCP connections. IEEE Network, pages 8-17, July 1994.
[26] R. Bubenik, M. Gaddis, and J. DeHart. Communicating with virtual paths and virtual channels. In Proceedings of the IEEE Infocom, 1992.
[27] D. Waitzman, C. Partridge, and S. Deering. Distance vector multicast routing protocol. RFC 1075, November 1988.
[28] T. Lyon, F. Liaw, and A. Romanow. Network layer architecture for ATM networks, June 1992. Working note.
[29] V. Kompella, J. Pasquale, and G. Polyzos. Multicasting for multimedia applications. In Proceedings of the IEEE Infocom '92, 1992.
[30] David Wall. Mechanisms for Broadcast and Selective Broadcast. PhD thesis, Stanford University, June 1980. Technical Report No. 190.
[31] Bernard M. Waxman. Routing of multipoint connections. IEEE Journal on Selected Areas in Communications, 6(9), December 1988.
[32] L. Wei. The design of the USC PIM simulator (pimsim). Technical Report TR 95-604, Computer Science Department, USC, February 1995.
[33] L. Wei and D. Estrin. The trade-offs of multicast trees and algorithms. In Proceedings of the 1994 International Conference on Computer Communications and Networks, San Francisco, September 1994.
[34] Pawel Winter. Steiner problem in networks: A survey. Networks, 17(2):129-167, 1987.