USC Computer Science Technical Reports, no. 682 (1998)
Large-Scale Weakly Consistent Replication using Multicast
Ramesh Govindan, Haobo Yu, Deborah Estrin
USC/Information Sciences Institute
Admiralty Way, Suite
Marina Del Rey, CA
Phone: Fax: Email: {govindan, haoboy, estrin}@isi.edu
Abstract
In today's Internet, there exist several repositories of resource allocation information. Specifically, these registries contain information about IP address space delegations, name space allocations, and inter-ISP routing policies. Such registries are useful for coordinating allocation of Internet names and addresses, and for debugging network routing.
For performance and availability reasons, there is an increasing need to replicate these registries Internet-wide. This paper describes the design of a weakly consistent replication scheme that uses IP multicast. Because the IP multicast service is unreliable, we use a stochastic wait with suppression technique to scalably recover lost updates. This technique reduces duplicate retransmissions of lost updates. Detailed simulations demonstrate that, even in large, sparsely distributed registries, less than of the losses result in duplicate retransmissions. For the expected update traffic patterns, our approach outperforms other multicast loss recovery mechanisms. We have implemented a distributed Internet routing policy registry using this replication scheme; a small experiment demonstrates the scheme's feasibility.
Introduction
In the Internet today, there exist several distributed resource allocation registries. One such, the Domain Name System, provides global record-oriented access to host name and address allocations. These allocations are registered and managed in local repositories, and DNS defines mechanisms and protocols that provide an approximately consistent global view of this information.
This paper considers a class of distributed resource allocation registries where, unlike in DNS, each repository is fully replicated at all others. The Internet routing registry is an instance of this class. Each repository in this registry contains one or more ISPs' routing policy. A routing policy describes, at a high level, the set of rules that govern the exchange of routing information (and consequently traffic) between that provider and its neighbors. ISPs specify these policies using a record-oriented language.
Routing policy can affect connectivity between sites; a registry of these policies enables each ISP to analyze the network-wide effects of its policies. The registry is also useful for diagnosing anomalous routing and automatically configuring routers. These analyses may need to access network-wide policy information, and full replication is essential for performance.
In this paper, we describe the design of a scalable mechanism for weakly consistent replication in these registries. This mechanism uses IP multicast to disseminate repository record updates to all other repositories. IP multicast delivery is unreliable, and is such that a single lost update may affect many repositories. Under these circumstances, classical acknowledgement-based loss recovery can result in an implosion of acknowledgements at the sender. We explore a mechanism that probabilistically reduces the likelihood of duplicates. In our mechanism, sites stochastically wait before requesting a lost update, and suppress their own requests if they hear an identical request from another site. This stochastic wait with suppression mechanism is closely related to earlier work on interactive Internet collaboration tools. In addition, however, our replication mechanism exploits the expected characteristics of registry traffic to scale better to a larger number of repositories.
How effectively this replication scheme reduces the frequency of duplicates, without significantly increasing update recovery latency, critically determines its applicability to a large, widely distributed registry. Simulation results show that, with several hundred repositories in an Internet-like environment, our scheme may result in fewer than one duplicate in thirty losses, at the cost of a recovery latency of between and seconds. Simulations also show that our scheme performs better than another closely related multicast loss recovery scheme over a wide range of conditions. Using our prototype implementation, we also present the results of a small Internet-wide experiment that validates many of our design decisions.
Section 2 motivates the use of IP multicast for weakly consistent replication. Section 3 discusses the design of our replication scheme. Section 4 presents the results of simulations evaluating its performance, and Section 5 describes the results of an experiment validating its feasibility. Finally, Section 6 describes related work and Section 7 presents our conclusions.
Background
Commit-based transactional replication would provide the ideal solution to full replication. However, recent analyses of point-to-point wide-area communication in the Internet suggest that transactional replication might not perform well over an Internet-like environment. A large-scale study of TCP dynamics shows that the Internet exhibits such variation for almost any measure: packet loss probabilities, congestion time scales, likelihood of packet duplication or out-of-order delivery. A study of path dynamics also illustrates the existence of pathological routing in the Internet: packet looping, long path outages, etc. Another study presents evidence for the degradation in Internet performance with growth.
Given widely varying connectivity characteristics, then, transactional replication can result in long update convergence times. By relaxing the need for such tight consistency between repositories, we may be able to rapidly propagate updates to a reasonably large proportion of repositories. We posit that scalable replication of repositories is more tractable if we allow for eventual convergence of all repositories; that is, repositories may be temporarily inconsistent, but our replication scheme ensures that, in the absence of further updates, all repositories eventually converge to having identical contents.
Figure 1: IP Multicast. (Legend: router, group member, source.) This figure illustrates sender-rooted shortest-path distribution trees for IP multicast. In the tree on the left, if the message were scoped with a hop count of two, it would reach all group members. However, in the tree on the right, a hop count of two would only reach two of the three group members. A loss on any link in the distribution tree would affect all group members in the corresponding subtree.
Usually, such a weakly consistent replication scheme may need to reconcile between conflicting updates; such reconciliation itself may not scale well to large registries.
Fortunately, we can assume that conflicting updates cannot occur in the class of resource allocation registries we consider. Each allocation record is naturally registered and managed at its local repository. In other words, there is always one master site for each record. This is also true of several other resource registries, such as the registry of global IP address allocations and the registry of domain name allocations. Today, these registries are implemented in a centralized manner. We believe our replication scheme would also work in a distributed implementation of these registries.
Having relaxed the requirement for transactional replication, and having assumed the absence of reconciliation in these registries, is there a problem left to solve? Existing solutions for eventually consistent full replication reliably flood updates to all repositories. This solution dynamically constructs a virtual topology consisting of reliable point-to-point communication links between repositories. Updates are flooded on a distribution tree that is overlaid on this virtual topology. This distribution tree is computed dynamically, based on message exchanges among repositories. The complexity of robustly computing this distribution tree compares with that of running a dynamic routing protocol instance among repositories. While not infeasible, this needlessly duplicates network-layer topology computation in order to achieve transmission efficiencies. We ask the question: given IP multicast, does there exist an alternative to flooding that scales well to large, widely distributed registries in the Internet?
What is IP multicast? Multicast is a network-layer primitive for group communication in the Internet. Members of a group explicitly register with the network their interest in receiving communication intended for the group. A group is identified by a label that is syntactically equivalent to an IP address. When a repository sends IP messages addressed to the group label, routers within the Internet conspire to forward those messages along a distribution tree that spans all group members. The multicast routing protocol used in the Internet today constructs, for each sender, shortest-path distribution trees rooted at that sender. IP multicast messages can also be scoped; that is, multicast messages carry a hop count and are discarded after traversing the corresponding number of levels in the distribution tree (Figure 1).
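As a rough illustration of hop-count scoping, the sketch below walks a sender-rooted distribution tree and collects the group members that a scoped message would reach. The tree, the node names, and the `members_reached` helper are all hypothetical; in a real network, scoping is enforced hop by hop in the routers rather than by a central traversal.

```python
from collections import deque

def members_reached(tree, source, members, hop_limit):
    """Return the group members within hop_limit levels of source.

    tree maps each node to the list of its children in the
    sender-rooted distribution tree (hypothetical representation).
    """
    reached = set()
    queue = deque([(source, 0)])
    while queue:
        node, hops = queue.popleft()
        if node in members:
            reached.add(node)
        if hops < hop_limit:  # the message is discarded once the hop count is used up
            for child in tree.get(node, []):
                queue.append((child, hops + 1))
    return reached

# Hypothetical tree: members a and b sit two levels below the source, c sits three.
tree = {"src": ["r1", "r2"], "r1": ["a", "b"], "r2": ["r3"], "r3": ["c"]}
members = {"a", "b", "c"}
print(sorted(members_reached(tree, "src", members, hop_limit=2)))  # ['a', 'b']
```

With a hop limit of three, the same call would also reach member `c`.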
Figure 2: RRM Overview. (a) Repositories $S_j$ and $S_k$ do not receive an update. (b) $S_j$'s request for the lost update suppresses $S_k$'s. (c) $S_l$'s response suppresses other potential responders. In RRM, any repository may respond to a retransmission request. To reduce the likelihood of implosion, repositories stochastically back off before sending requests or responses. Moreover, a repository suppresses its own request or response upon hearing another.
If all repositories register to receive communication addressed to the same multicast group, there exists a simple solution to weakly consistent replication. When a repository needs to update a record, it simply multicasts that update to the group. However, IP multicast is unreliable and updates may be lost. Moreover, since a loss can occur anywhere in the distribution tree, a single loss can affect a significant subset of group members. A loss recovery protocol is necessary to ensure that the convergence requirement is satisfied.
Designing efficient and scalable multicast loss recovery has received much recent attention. Several factors affect the design of loss recovery mechanisms: the size and topological distribution of group members, the proportion of senders to group size, the expected traffic characteristics of the application, and the location of congested network links and the duration of congestion.
In this paper, we describe the design and evaluation of a loss recovery protocol for registry replication. This protocol, which we label RRM (Registry Replication using Multicast), exploits some characteristics of expected update traffic to scale well to large, widely distributed registries.
Registry Replication using Multicast
RRM ensures efficient recovery from lost multicast transmissions. Before we describe this protocol, it is essential to understand the characteristics of multicast loss and its impact on loss recovery mechanisms.
Multicast Loss and Loss Recovery
Malfunctioning hardware and router congestion are usually understood to cause message losses in the Internet. Of these, congestion is widely assumed to be the predominant cause of message loss. Internet point-to-point communication protocols incorporate congestion avoidance strategies; for this reason, congestion is relatively short-lived in Internet routers. One study observed that congestion time scales vary between a few hundred milliseconds and two seconds.
$S_j$: Repositories are denoted by a numeric subscript
$N$: The total number of repositories
$u_{i,m}$: The $m$th update from repository $S_i$
$W$: The stochastic wait function
$D$: The maximal one-way distance between any pair of repositories
Figure 3: Notation. For brevity, we use some symbols to denote conceptual notions central to the description of RRM.
Multicast transmission is affected to a greater extent by message loss than point-to-point communication. Intuitively, because the multicast service delivers messages to multiple destinations, at least one group member is highly likely to experience a lost message from a single transmission. In one Internet-wide experiment consisting of repositories, at least one receiver experienced a loss in nearly of all transmissions. Moreover, a single message loss can be shared among a large fraction of group members. In Figure 1, for example, a loss near the sender can affect all group members.
What proportion of multicast losses are shared by two or more group members? No detailed analysis of loss location and extent exists for multicast transmissions. One study has reported seeing little or no shared loss in one experiment. In Section 5, our own experiments contradict this: we see significant shared loss, especially for transcontinental traffic. Intuitively, we would expect moderately high shared loss in wide-area multicast communication. This communication traverses backbone routers and transoceanic links, which are usually heavily utilized.
In point-to-point transmission, the receiver, upon detecting a lost message, sends an acknowledgement (or a negative acknowledgement); the original sender retransmits the lost message. This classical approach to loss recovery performs poorly when directly applied to multicast transmission. A single shared loss, for example, can result in several acknowledgements being returned to the sender, resulting in an implosion. Using this loss recovery mechanism for multicast-based replication can severely impede scaling to large distributed registries.
RRM Overview
Earlier work on interactive Internet collaboration has defined a multicast loss recovery framework called Scalable Reliable Multicast, or SRM. RRM is adapted from SRM for registry replication. In this and subsequent sections, we describe how RRM works and how it differs from SRM.
In the class of distributed registries we consider in this paper, each repository initiates updates only for records for which it is a master repository. In RRM, successive updates originated by a repository are assigned increasing sequence numbers. For brevity, we denote repositories by numeric subscripts attached to $S$ (Figure 3). Then $u_{i,m}$ denotes the $m$th update sent by $S_i$. If another repository $S_j$ receives $u_{i,m+1}$ without having received $u_{i,m}$, it assumes that the latter update has been lost.
To recover from this loss, $S_j$ must send a retransmission request for $u_{i,m}$. But other repositories may have experienced the same loss, and may simultaneously request a retransmission. In RRM (Figure 2), before sending a request, $S_j$ waits for a random interval determined by a stochastic wait function $W$. After this interval, $S_j$ multicasts its request for retransmission to the whole group. Another repository $S_k$ that sees a retransmission request for $u_{i,m}$ immediately suppresses its own retransmission request. This stochastic wait with suppression, in principle very similar to Ethernet collision avoidance, does not eliminate duplicates. $W$ crucially determines the efficacy of this technique in reducing the likelihood of duplicates.
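To make the role of the wait times concrete, the following minimal sketch counts how many extra requests are sent for one lost update. It assumes, purely for illustration, that a request reaches every other repository after the same fixed propagation delay; the function name and the sample numbers are hypothetical.

```python
def count_duplicate_requests(waits, propagation_delay):
    """Count requests sent in addition to the earliest one.

    waits: timer values (seconds) drawn by the repositories that lost the update.
    propagation_delay: assumed time for a request to reach every other repository.
    """
    earliest = min(waits)
    # A repository is suppressed only if the earliest request arrives before
    # its own timer expires; every other repository also sends a request.
    senders = [w for w in waits if w < earliest + propagation_delay]
    return len(senders) - 1

print(count_duplicate_requests([1.0, 1.05, 3.0], propagation_delay=0.1))  # 1
print(count_duplicate_requests([1.0, 2.0, 3.0], propagation_delay=0.1))   # 0
```

The second call shows the effect the wait function is meant to achieve: once the gap between the least wait and all others exceeds the propagation delay, no duplicate is sent.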
Upon receiving the request, $S_i$ can respond with the missing update. However, because the registry is fully replicated, any other repository $S_l$ that has received $u_{i,m}$ can also respond to this request. In the Internet, this greater response redundancy can greatly increase the overall robustness and efficiency of replication convergence. Now, however, many repositories can respond to the same request, creating a response implosion. RRM and SRM reduce the likelihood of this implosion using a similar stochastic wait with suppression.
Stochastic wait with suppression can, at the cost of increased recovery latency, allow multicast loss recovery to scale well to large, widely distributed groups. The choice of $W$, the function used to compute the stochastic wait, is dictated by application traffic characteristics. Since it was designed for interactive collaborative applications, SRM chooses a wait function that minimizes duplicate requests and responses while keeping recovery latency low. RRM's choice of wait function trades off increased recovery delay for fewer duplicates. In a later section, we argue that SRM is itself not appropriate for registry replication. We also validate this argument with a performance comparison of SRM and RRM.
Detailed Protocol Description
In RRM, updates are distributed in data messages. In the routing policy registry, records are small, and repositories can transmit the entire changed record in one data message. For situations in which this design is wasteful of bandwidth, the data message can contain the difference between the new version of the record and its previous version. All repositories register to receive updates sent to the same IP multicast group; each repository $S_i$ addresses its updates to this multicast group. Successive updates are assigned increasing sequence numbers. The data traffic on the multicast group is rate-limited.
A repository $S_j$ detects the loss of $u_{i,m}$ when it receives $u_{i,m+1}$. In slowly changing registries, the latency between updates can increase the perceived recovery delay. To avoid this, each repository also periodically transmits report messages. A report message from $S_k$ contains a list of all repositories from which $S_k$ has received updates recently, and the highest sequence number heard from each repository. It also contains the sequence number of the highest update that $S_k$ itself has originated. This redundant transmission of sequence information allows repositories to rapidly detect lost updates.
Footnote: Instead of artificially rate-limiting the source, we could have devised a multicast congestion avoidance mechanism that attempted to use as much of the available bandwidth as possible. However, the expected traffic patterns in our distributed registry do not appear to warrant such mechanisms.
Upon receiving a report message, repository $S_j$ scans each listed repository and compares sequence numbers to detect losses. Thus, if a report from $S_k$ indicates that $S_i$'s highest sequence number is $m$, and $S_j$ has only received updates up to $u_{i,m-1}$, it may infer the loss of $u_{i,m}$. To control the overhead associated with reports, each repository rate-limits the transmission of these messages. A repository's reporting frequency is inversely proportional to the number $N$ of repositories. This number can be inferred from received reports.
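The two detection paths (a gap seen in the data stream, and a gap revealed by a report) can be sketched as follows. This is an illustrative model only: the `LossDetector` class, its method names, and the convention that sequence numbers start at 1 are assumptions, not details given in the text.

```python
class LossDetector:
    """Tracks which updates have arrived from each origin repository."""

    def __init__(self):
        self.received = {}  # origin id -> set of sequence numbers received

    def _missing(self, origin, highest):
        # Assumption: sequence numbers start at 1 and increase by one.
        have = self.received.setdefault(origin, set())
        return [m for m in range(1, highest + 1) if m not in have]

    def on_update(self, origin, seq):
        """Record update `seq` from `origin`; return earlier updates now known missing."""
        self.received.setdefault(origin, set()).add(seq)
        return self._missing(origin, seq)

    def on_report(self, highest_heard):
        """highest_heard maps origin id -> highest sequence number the reporter has heard."""
        losses = {}
        for origin, highest in highest_heard.items():
            missing = self._missing(origin, highest)
            if missing:
                losses[origin] = missing
        return losses

detector = LossDetector()
detector.on_update("S1", 1)
print(detector.on_update("S1", 3))             # [2]
print(detector.on_report({"S1": 4, "S2": 1}))  # {'S1': [2, 4], 'S2': [1]}
```

The report path is what lets a repository notice a loss even when the origin sends no further updates for a long time.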
Thus, there are two ways in which a repository $S_j$ can detect the loss of $u_{i,m}$: receipt of $u_{i,m+1}$, or the receipt of a report which indicates that at least one other repository has received $u_{i,m}$. To recover from this loss, $S_j$ sends a retransmission request message. Before sending this message, however, $S_j$ sets a request timer to expire after an interval determined by the following stochastic wait function:
$$W(x) = 4 D \log_2(Nx + 1) \qquad (1)$$
where $x$ is uniformly distributed in the range $[0, 1]$ and $D$ is the maximal one-way delay between any two repositories. A later section justifies the choice of stochastic wait function and describes its properties.
If, before the timer expires:
- $S_j$ receives $u_{i,m}$, the request timer is canceled.
- $S_j$ receives a request message sent by some other repository $S_k$ for $u_{i,m}$, $S_j$ backs off its request timer to twice the previous value. Request messages themselves can be lost, and this back-off ensures that the request is eventually resent if $u_{i,m}$ is not retransmitted.
In the latter case, $S_k$'s request is said to suppress that of $S_j$.
When the request timer expires, $S_j$ multicasts the request message to all group members. The request message contains the identity of $S_i$ and the requested sequence number $m$. $S_j$ then exponentially backs off its request timer. As before, this ensures eventual recovery even if requests are lost.
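The request-side behavior described above amounts to a small timer state machine, sketched below. The class and method names are hypothetical, and the sketch assumes that the "exponential" back-off after sending a request is a doubling, the same factor the text gives for back-off on suppression.

```python
class RequestTimer:
    """Request timer for one lost update, as seen by one repository."""

    def __init__(self, now, initial_wait):
        self.interval = initial_wait
        self.deadline = now + initial_wait
        self.active = True

    def _back_off(self, now):
        self.interval *= 2  # assumed: exponential back-off means doubling
        self.deadline = now + self.interval

    def on_update_received(self):
        """The missing update arrived: cancel the timer."""
        self.active = False

    def on_peer_request(self, now):
        """Another repository requested the same update: suppress and back off."""
        if self.active:
            self._back_off(now)

    def on_expiry(self, now):
        """Return True if a request should be multicast now."""
        if not self.active or now < self.deadline:
            return False
        self._back_off(now)  # re-arm in case the request or its response is lost
        return True

timer = RequestTimer(now=0.0, initial_wait=1.0)
timer.on_peer_request(now=0.5)     # suppressed: next deadline is 0.5 + 2.0
print(timer.on_expiry(now=2.5))    # True: still no update, so send the request
```

Re-arming the timer in both branches is what gives eventual recovery: a repository keeps retrying, at a decreasing rate, until the update itself arrives.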
The response algorithm is similar to the request algorithm. When a repository $S_k$ that has received $u_{i,m}$ gets a request for that update, it sets a response timer to expire at a time given by Equation 1. If, before this timer expires, $S_k$ sees another response for that request, it cancels its own response timer. When the timer expires, $S_k$ multicasts to the entire group, on behalf of $S_i$, a data message containing the update $u_{i,m}$.
To use Equation 1, each repository must compute $D$, the maximal one-way delay between any two repositories. In the absence of globally synchronized clocks, RRM (like SRM) contains a simple technique to estimate $D$. This technique uses the following message exchange. Suppose repository $S_i$ sends out a message with a timestamp $T_i$; suppose further that the interval between when some other repository $S_j$ receives this message and responds to it is $t_i$. If $S_j$'s reply contains both $T_i$ and $t_i$, $S_i$ can estimate its round-trip, and consequently its one-way, delay to $S_j$. From these, $S_i$ can estimate $D$ locally. It then transmits this estimate to all other repositories; eventually all repositories converge to a shared estimate of $D$. In RRM, each repository $S_j$ periodically multicasts a single sonar message containing $S_j$'s current estimate of $D$, the time at which this message was sent, and, for every other repository $S_i$, a tuple containing $T_i$ and $t_i$. As with reports, sonar messages are also rate-limited. As we show in a later section, this scheme works well in practice, although it has two limitations. First, because many Internet paths are asymmetric, the estimation of one-way delay from round-trip delay can be inaccurate. Second, because sonar messages are rate-limited, convergence to $D$ may be slower in larger registries.
Footnote: Suppose $S_i$ receives the reply at $T_i'$; the one-way delay is half of $T_i' - T_i - t_i$.
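The delay arithmetic in this exchange is simple enough to sketch directly. The helper names below are hypothetical, and the merge rule (take the largest value measured or heard so far) is an assumption based on $D$ being defined as a maximum; the text does not spell out how estimates are combined.

```python
def one_way_delay(sent_at, reply_received_at, hold_time):
    """Half of the round trip, excluding the time the peer held the message.

    sent_at and reply_received_at are read from the sender's own clock;
    hold_time is measured on the peer's clock, so no clock synchronization
    between the two repositories is needed.
    """
    return (reply_received_at - sent_at - hold_time) / 2.0

def merge_estimates(local_delays, advertised_estimates):
    """Assumed merge rule: D is the largest delay measured or heard so far."""
    return max(list(local_delays) + list(advertised_estimates))

d_ij = one_way_delay(sent_at=10.0, reply_received_at=10.5, hold_time=0.25)
print(d_ij)                                  # 0.125
print(merge_estimates([d_ij, 0.04], [0.2]))  # 0.2
```

Halving the round trip is exactly where the asymmetric-path limitation enters: the sketch cannot tell a 0.05 s forward path with a 0.2 s return path from two 0.125 s paths.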
Choice of Stochastic Wait Function
What factors determine $W$? In the worst case, $N$ repositories may wait before sending a request or a response (e.g., when an update is lost near the sender or near one receiver). Suppose, of these $N$ repositories, $S_j$ selects the least wait $w$. Then, for $S_j$'s request or response to suppress another repository $S_k$, the latter's wait time must be greater than $w$ by the time taken to propagate the request or response from $S_j$ to $S_k$. If the wait times were chosen to ensure, probabilistically, that the least selected wait differed from the $N - 1$ others by at least $D$, very few duplicates would result. This argues that $W$ is a function of $D$ and $N$.
What constraints affect the choice of $W$? Clearly, excellent suppression would result if the repositories picked wait times from an arbitrarily long interval. But this could result in unacceptable recovery latencies, so it is necessary to bound this interval. It is non-trivial to choose a $W$ that minimizes duplicates without greatly increasing recovery latency.
In this paper, we considered the class of functions $W$ given by
$$W(x) = \alpha D \log_2(Nx + 1) \qquad (2)$$
where $\alpha$ is a parameter whose choice we discuss below. This choice of $W$ ensures that when $N$ values of $x$ are each drawn uniformly between $0$ and $1$, the resulting wait times are exponentially distributed in the range $[0, \alpha D \log_2(N + 1)]$. That is, more wait times are closer to the higher end of this range than to the lower. More importantly, the worst-case wait grows slowly with $N$, a very crucial performance requirement for large registries with several thousand repositories.
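Both properties claimed for the wait function can be checked numerically. The sketch below assumes a wait function of the form $W(x) = \alpha D \log_2(Nx + 1)$ with $x$ uniform on $[0, 1]$ and a default $\alpha$ of four; that form, the function names, and the chosen values of $N$ are assumptions made for illustration.

```python
import math

def stochastic_wait(x, n_repos, max_delay, alpha=4.0):
    """Assumed form W(x) = alpha * D * log2(N*x + 1), for x in [0, 1]."""
    return alpha * max_delay * math.log2(n_repos * x + 1)

def fraction_of_waits_above(fraction_of_max, n_repos):
    """Probability that W(x) exceeds the given fraction of the worst-case wait.

    W(x) > f * W(1)  <=>  x > ((N + 1)**f - 1) / N, and x is uniform on [0, 1].
    alpha and D cancel out, so they are not parameters here.
    """
    x_threshold = ((n_repos + 1) ** fraction_of_max - 1) / n_repos
    return 1.0 - x_threshold

for n in (100, 1000, 10000):
    worst_case = stochastic_wait(1.0, n, max_delay=1.0)  # in units of D
    print(n, round(worst_case, 2), round(fraction_of_waits_above(0.5, n), 3))
```

Under these assumptions the worst-case wait only about doubles (from roughly 27 to roughly 53 units of $D$) as $N$ grows from 100 to 10000, and well over nine tenths of the draws land in the upper half of the range, leaving the least wait relatively isolated.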
What determines the choice of $\alpha$? If $N$ nodes chose wait times based on Equation 2, the expected number of duplicates (defined as the number of wait times within $D$ of the least wait time) and the expected least wait time for large $N$ are
$$\lim_{N \to \infty} E[\text{duplicates}] = \cdots \qquad \lim_{N \to \infty} E[\text{least wait}] = \cdots\, D$$
Footnote: Suppose that the least wait time corresponds to a drawing of $x_{\min}$; we can also find, from Equation 2, the value $x'$ which corresponds to a wait which is greater than the least wait by $D$. The expected number of duplicates is the expected number of drawings between $x_{\min}$ and $x'$. Taking the limit of this expression, we arrive at the first identity. For the latter limit, we used Mathematica to iteratively compute the expected value for large $N$.
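The footnote's argument can be traced step by step under one assumed form of the wait function, $W(x) = \alpha D \log_2(Nx + 1)$ with $x$ uniform on $[0, 1]$. The closed form below follows from that assumption and from $E[x_{\min}] = 1/(N+1)$ for the minimum of $N$ uniform draws; it is a sketch of the reasoning, not a restatement of the identities printed in the report.

```latex
% Requires amsmath. Assumed wait function: W(x) = \alpha D \log_2(Nx + 1), x ~ Uniform[0, 1].
\begin{align*}
W(x') = W(x_{\min}) + D
  &\;\Longrightarrow\; \log_2\frac{Nx' + 1}{Nx_{\min} + 1} = \frac{1}{\alpha} \\
  &\;\Longrightarrow\; N(x' - x_{\min}) = \bigl(2^{1/\alpha} - 1\bigr)\bigl(Nx_{\min} + 1\bigr), \\
E[\text{duplicates}] \approx E\bigl[N(x' - x_{\min})\bigr]
  &= \bigl(2^{1/\alpha} - 1\bigr)\Bigl(\frac{N}{N + 1} + 1\Bigr)
  \;\xrightarrow{\,N \to \infty\,}\; 2\bigl(2^{1/\alpha} - 1\bigr).
\end{align*}
```

The approximation in the second line treats the remaining draws as uniform with density $N$ near $x_{\min}$, which is accurate for large $N$. For $\alpha = 4$ the limit evaluates to about 0.38 under these assumptions.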
Figure 4: Empirical Results for Stochastic Wait with Suppression. (a) Sample mean of duplicates for $N = 1000$; x-axis: number of repositories (0 to 1000); y-axis: expected number of duplicates (0 to 0.7). (b) Sample mean of least wait times for $N = 1000$, in units of $D$; x-axis: number of repositories (0 to 1000); y-axis: expected least wait in units of $D$ (0 to 25). These graphs were generated by randomly drawing wait times for different numbers of repositories, then computing the mean number of duplicates and the mean least wait times.
Up to a point, increasing $\alpha$ gives a greater reduction in the expected number of collisions than the corresponding increase in wait times. That is, $\alpha$ is a tunable parameter that trades off expected least wait for reduced duplicates. We have found that a value of four for $\alpha$ gives good performance over a large range of topologies. For this choice of $\alpha$, the expected number of duplicates is and the expected least wait is $D$.
Our choice of $W$ (Equation 1) was motivated by its expected behavior for large $N$. Not all $N$ repositories simultaneously compete to request an update or retransmit one. For this reason, it is instructive to evaluate the average number of duplicates and the average least wait for various $k$ in the range $[1, N]$. For $N = 1000$, we computed these values for $k$ data points evenly distributed across the range. For a particular value of $k$, we drew $k$ uniformly distributed values and computed one sample for the number of duplicates and the least wait time. Figure 4 shows the averaged samples over all drawings. This figure highlights two desirable properties of $W$:
- The average number of duplicates increases very slowly with the number of repositories experiencing loss. This property is crucial for scaling registries to large numbers of repositories.
- When few repositories experience loss, average least wait times are high. With an increase in the number of repositories experiencing a loss, the average least wait time decreases rapidly.
From Figure 4, then, a thousand-repository distributed registry in today's Internet might see one in five losses resulting in a duplicate retransmission and a least wait of between and seconds.
Footnote: Assuming that $D$ is seconds and that most least waits are between and $D$.
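The sampling procedure described above is easy to reproduce in outline. The sketch below assumes a wait function of the form $W(x) = \alpha D \log_2(Nx + 1)$ with $\alpha = 4$ and counts, as a duplicate, every wait that falls within $D$ of the least wait; the function name, trial count, and seed are arbitrary choices, so the numbers it prints are illustrative and are not the values plotted in the report.

```python
import math
import random

def sample_means(k, n_repos, alpha=4.0, trials=500, seed=0):
    """Mean duplicates and mean least wait (in units of D) when k of N repositories wait.

    Assumes W(x) = alpha * D * log2(N*x + 1) with x uniform on [0, 1], and counts
    as a duplicate every wait within D of the least wait.
    """
    rng = random.Random(seed)
    total_duplicates = 0
    total_least_wait = 0.0
    for _ in range(trials):
        # Waits are expressed in units of D, so D itself is 1.0 below.
        waits = [alpha * math.log2(n_repos * rng.random() + 1) for _ in range(k)]
        least = min(waits)
        total_duplicates += sum(1 for w in waits if w < least + 1.0) - 1
        total_least_wait += least
    return total_duplicates / trials, total_least_wait / trials

for k in (10, 100, 1000):
    duplicates, least_wait = sample_means(k, n_repos=1000)
    print(k, round(duplicates, 2), round(least_wait, 2))
```

Under these assumptions, the mean number of duplicates rises only gently as `k` grows, while the mean least wait falls sharply, which is the qualitative behavior the two properties describe.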
Figure 5: Preferred Responder. (a) Suppose $S_k$ hears $S_l$'s report. (b) $S_k$ then multicasts a request with $S_l$ as preferred responder. (c) $S_l$ responds without back-off. In the situation where a lost update is detected from a report message, this optimization can reduce both recovery latency and the number of duplicate responses.
The Preferred Responder Optimization
RRM differs from SRM in its choice of the stochastic wait function. In this section, we describe another difference: an optimization that leverages the expected characteristics of update traffic in large distributed registries. This optimization reduces, in some cases, both recovery latency and duplicate responses.
A repository $S_j$ detects a lost update $u_{i,m}$ in one of two ways: either when it receives an update $u_{i,m+1}$, or when it hears a report indicating that $S_i$'s current highest sequence number equals or exceeds $m$. In the latter case, the report may have been sent by any repository; assume, without loss of generality, that the sender was $S_k$. Then, clearly, $S_k$ must have $u_{i,m}$. $S_j$ can specify $S_k$ in its request message as a preferred responder for $u_{i,m}$. Upon receipt of a request, $S_k$ can respond immediately without the stochastic wait; other repositories $S_l$ (such that $l \neq k$) wait for a time determined by
$$D + 4 D \log_2(Nx + 1)$$
where, as before, $x$ is uniformly distributed in $[0, 1]$. The additional wait gives the preferred responder time to suppress other responders. This optimization not only reduces recovery latency, but also decreases the likelihood of duplicate responses by preferentially biasing one responder over others.
This technique improves overall performance if, on average, many losses are detected from reports. This happens if repository inter-update times are larger than the interval between reports. We expect this to be the common case for the class of distributed registries described in this paper.
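A minimal sketch of the response-side rule follows. It assumes that non-preferred responders wait $D$ plus a stochastic wait of the form $\alpha D \log_2(Nx + 1)$ with $\alpha = 4$; that form, the `response_wait` helper, and the repository names are all assumptions made for illustration.

```python
import math
import random

def response_wait(repo_id, preferred_id, n_repos, max_delay, alpha=4.0, rng=random):
    """Delay before repo_id answers a request that names preferred_id.

    Assumed form for non-preferred responders: D + alpha * D * log2(N*x + 1),
    with x uniform on [0, 1]; the preferred responder does not wait at all.
    """
    if repo_id == preferred_id:
        return 0.0
    x = rng.random()
    return max_delay + alpha * max_delay * math.log2(n_repos * x + 1)

rng = random.Random(7)
waits = {r: response_wait(r, "Sk", n_repos=1000, max_delay=0.1, rng=rng)
         for r in ("Sk", "Sl1", "Sl2", "Sl3")}
first = min(waits, key=waits.get)
print(first)  # Sk: every other responder waits at least D
```

Because every other responder waits at least $D$, the preferred responder's reply has one maximal one-way delay in which to reach and suppress them.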
Analysis of RRM
In this section, we analyze the performance of RRM over a range of Internet-like topologies. We also describe the results of a performance comparison between SRM and RRM. Given the multidimensional nature of the multicast loss recovery problem, we chose simulation as our performance evaluation methodology.
Simulation Methodology
We implemented RRM in ns, a detailed, extensible network simulator. ns users specify a network topology; nodes and links in this topology form the core constructs of ns. It simulates node-by-node message delivery, closely modeling queuing at routers and propagation delays across links. Users may extend this basic infrastructure by adding one or more modules. Modules can be attached to nodes or links. A module may, for instance, change the queueing behavior at a node, or simulate message loss along one or more specific links.
We implemented an RRM ns module. Each such module attaches to those nodes in the topology which contain repositories. For our comparisons, we used a pre-existing SRM ns module. In our simulations, we also used other pre-existing software:
- The Internet Topology Modeler produces random Internet-like hierarchical topologies, with wide-area links interconnecting metropolitan-area networks, which in turn connect local-area networks. From the generated topology, ns computes routing tables that effect shortest-path delivery between nodes.
- A repository placement generator that randomly places RRM (or SRM) modules in the topology. ns then invokes this module at each node to which it delivers an RRM (or SRM) protocol message. ns also precomputes, for each repository, its shortest-path distribution tree to all other repositories. Using these, ns can simulate multicast message forwarding at each node.
- A loss module that, when attached to a link, simulates the effect of selectively dropping updates. Users can effect shared losses by attaching this loss module to a link in a multicast distribution tree.
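The effect of attaching a loss module to one link of a distribution tree can be illustrated without a simulator. The sketch below is not ns code: the tree representation, the node names, and both helper functions are hypothetical, and it only computes which group members share a loss on a given link.

```python
def members_below(tree, node, members):
    """Group members in the subtree rooted at node (node itself included)."""
    found = {node} if node in members else set()
    for child in tree.get(node, []):
        found |= members_below(tree, child, members)
    return found

def affected_by_link_loss(tree, lossy_link, members):
    """Members that miss an update dropped on lossy_link = (parent, child)."""
    parent, child = lossy_link
    if child not in tree.get(parent, []):
        raise ValueError("lossy_link is not an edge of the distribution tree")
    return members_below(tree, child, members)

# Hypothetical sender-rooted distribution tree with three group members.
tree = {"src": ["r1", "r2"], "r1": ["a", "b"], "r2": ["r3"], "r3": ["c"]}
members = {"a", "b", "c"}
print(sorted(affected_by_link_loss(tree, ("src", "r1"), members)))  # ['a', 'b']
print(sorted(affected_by_link_loss(tree, ("r3", "c"), members)))    # ['c']
```

A loss close to the source is shared by every member below it, whereas a loss on a leaf link affects a single member; this is why the placement of loss modules shapes how many repositories compete to request the same update.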
To calibrate the simulator, we verified by hand the results of simulations on regular, small test topologies.

More than other simulators, the level of simulation detail in ns permits the realistic analysis of protocol performance. However, in our effort to conduct a controlled experiment, our simulations represent a simplified picture of reality. We do not simulate the effect of path dynamics on multicast distribution trees. These dynamics can affect both recovery latency and suppression. Moreover, our loss modules do not simulate losses of reports, requests, or responses. Request losses can result, for example, in longer perceived recovery latency or even increased duplication.

Our simulations also diverge from reality in three other ways. First, lacking realistic repository placement characterizations, we resort to random placement. Second, lacking an accurate characterization of multicast loss locations and durations, we randomly select loss module attachments to links. We expect such repository and loss placements to capture RRM performance in average-case situations; they may not capture worst-case scenarios well. Finally, given our focus on analyzing the effectiveness of stochastic wait, we do not simulate the loss of requests and responses. Doing so would have introduced second-order effects, thereby complicating the interpretation of the simulation results.
RRM Performance

Our analytic evaluation of RRM performance (Section ) is insufficient for two reasons. First, it cannot predict performance improvements attributable to the preferred responder optimization (Section ). Second, the results of our analysis may be perturbed by actual message transmission latencies in two ways:

- All repositories are assumed to begin their stochastic waits simultaneously. Realistically, different repositories will detect losses (or receive requests), and hence begin their stochastic waits, at different instants. This can adversely affect suppression.
- Moreover, repositories whose waits are greater by D than that of the least-wait repository are assumed to be suppressed. In practice, this suppression is also a function of the actual one-way delay between repositories. For example, a repository whose wait was only d < D greater than that of the least-wait repository could still be suppressed by the latter if its actual one-way delay to that repository was less than d. This feature can actually result in better suppression and lower recovery latencies than those predicted by Figure .

For these reasons, we chose to extensively evaluate RRM performance through simulation. Our goal in this section is to quantify RRM performance as a function of repository distribution. Our RRM simulations were divided into five batches, representing topologies of power-of-two sizes ranging from to nodes, both inclusive. For each batch, we conducted three runs. In each run, we generated a random repository placement. The number of repositories was fixed at one-fourth of the size of the corresponding topology; this models a relatively sparse distribution of repositories. On this repository placement, we randomly marked as lossy one-fourth of the links on one sender-rooted distribution tree. Each link randomly drops updates; the probability of an update being dropped by at least one link was fixed at . This choice stresses RRM's loss recovery mechanism: many nodes experience losses due to different lossy links. Each repository generates Poisson arrivals of updates. During each run, we simulated two different versions of RRM: without the optimizations described in Section (RRM), and with only the preferred responder optimization (RRMPR).
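The second timing effect discussed earlier in this section, suppression that depends on actual one-way delays rather than on the bound D, can be illustrated with a small sketch. This is an illustrative model and not the simulator code: `requests_sent`, the timer expiry times, and the delay matrices are all hypothetical, and request loss is ignored.

```python
def requests_sent(fire_times, delay):
    """Return the indices of repositories that actually send a request.

    fire_times[i] is the instant at which repository i's stochastic wait expires.
    delay[j][i] is the one-way delay from repository j to repository i.
    A repository is suppressed if a request from an earlier sender reaches it
    before (or exactly when) its own timer fires.
    """
    senders = []
    for i in sorted(range(len(fire_times)), key=lambda k: fire_times[k]):
        heard = any(fire_times[j] + delay[j][i] <= fire_times[i] for j in senders)
        if not heard:
            senders.append(i)
    return senders


# Three repositories with D = 1.0 (arbitrary time units).
fire_times = [0.0, 0.6, 1.5]

# Analytic assumption: every request takes the full bound D to arrive.
delay_bound = [[0.0, 1.0, 1.0], [1.0, 0.0, 1.0], [1.0, 1.0, 0.0]]

# Actual delays: repositories 0 and 1 are only 0.4 apart.
delay_actual = [[0.0, 0.4, 1.0], [0.4, 0.0, 1.0], [1.0, 1.0, 0.0]]

with_bound = requests_sent(fire_times, delay_bound)    # repositories 0 and 1 send
with_actual = requests_sent(fire_times, delay_actual)  # only repository 0 sends
```

Under the bound, repository 1 fires 0.6 after repository 0 and so sends a duplicate; with the actual delay of 0.4 it hears the first request in time and is suppressed.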
[Figure: three panels plotted against Number of Nodes in Topology. Panel (a): Average Recovery Latency (in units of D), with curves for 5% Responder, 50% Responder, RRMPR, and RRM. Panel (b): Average Duplicate Requests, for RRMPR and RRM. Panel (c): Average Duplicate Responses, for RRMPR and RRM.]

Figure : RRM Performance Summary. Depending on update traffic characteristics, only some of the losses can be recovered using the preferred responder optimization. So, the observed performance of RRM will lie between the RRM and the RRMPR curves shown. The dots indicate the values of the performance measures corresponding to each of the three runs.

In this and subsequent sections, three metrics define RRM and SRM performance:

- Average Duplicate Requests: For each run of the experiment, this is obtained by averaging the number of duplicate requests over the total number of losses.
- Average Duplicate Responses: For each run of the experiment, this is obtained by averaging the number of duplicate responses over the total number of losses.
- Average Recovery Latency: This is computed as follows. The per-loss recovery latency for a single repository is defined as the average time between when the loss is first detected by the repository and when the repository receives the retransmitted update. For a single loss, we average the recovery latencies over all the repositories that experience a loss. For the entire simulation run, recovery latencies are averaged over all losses.

(Footnote: This is quite different from when the update was first lost, since the detection of a loss depends on receipt of a report or the subsequent update. For the purposes of evaluating the cost of stochastic wait, however, our definition of recovery latency is appropriate.)

Intuitively, a loss recovery scheme that minimizes duplicate requests and responses, while still allowing reasonable recovery latency, can be said to scale well.
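The three metrics defined in this section can be computed from per-loss records as in the sketch below. The record layout (`requests`, `responses`, `latencies`) is hypothetical, and the sketch assumes that a "duplicate" is any request or response beyond the first one sent for a loss.

```python
from statistics import mean


def summarize(losses):
    """Compute the three per-run metrics from a list of per-loss records.

    Each record holds the number of requests and responses sent for that loss
    and the recovery latency observed at every repository that experienced it.
    """
    return {
        # Messages beyond the first one count as duplicates (an assumption).
        "avg_duplicate_requests": mean(max(loss["requests"] - 1, 0) for loss in losses),
        "avg_duplicate_responses": mean(max(loss["responses"] - 1, 0) for loss in losses),
        # Average over repositories within a loss, then over all losses.
        "avg_recovery_latency": mean(mean(loss["latencies"]) for loss in losses),
    }


# Two hypothetical losses; latencies are in units of D.
losses = [
    {"requests": 1, "responses": 1, "latencies": [4.0, 6.0]},
    {"requests": 2, "responses": 1, "latencies": [10.0]},
]
summary = summarize(losses)
```

For these two records the sketch gives 0.5 duplicate requests per loss, no duplicate responses, and an average recovery latency of 7.5 D.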
In each run of our experiment, we simulated nearly losses. Figure plots these measures, averaged over the three runs.

Figure (a) depicts recovery latency, normalized to units of D, as a function of the size of the topology (the number of repositories is one-fourth the size of the topology). As expected, the preferred responder optimization clearly reduces the recovery latency by almost a factor of two, regardless of the size of the distributed registry. Intuitively, the average recovery latency is composed of two almost equal parts: the request wait and the response wait. The preferred responder optimization reduces the latter dramatically. What values of recovery latency might we expect in a real Internet? A D value of milliseconds is not uncommon in an Internet-wide distributed registry (Section ). For this value of D, recovery latencies could be between and seconds for the range of repository distributions we have simulated.

Figure (a) also shows significant variation of the average latencies across runs. We believe this is a manifestation of a loss-placement effect. In our simulations, we randomly select lossy links. RRM recovery latency is rather sensitive to loss placement; for example, if a particular placement results in very few requestors, then recovery latencies can be high. (Footnote: See, for example, Figure , where the least waits are large when the number of repositories that draw stochastic waits is small.) It is rather difficult to generate loss placements in two different repository distributions that ensure comparable divisions of requestors and responders. However, we can estimate, by averaging repeated drawings of wait times, expected latencies for a loss placement which result when either (a) of the repositories are requestors, or (b) of the repositories are requestors. Figure (a) also shows these recovery latency estimates. These estimates are approximate bounds on the range of latencies we might expect from RRM. Incidentally, these estimates also validate our simulation.
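The idea of estimating expected waits by averaging repeated drawings can be sketched as follows. This is only an illustration of the Monte Carlo procedure: the uniform distribution used here is a stand-in chosen for simplicity and is not RRM's stochastic wait function, and the names `expected_least_wait` and `uniform_wait` are hypothetical.

```python
import random


def expected_least_wait(num_requestors, draw_wait, trials=2000, seed=0):
    """Estimate the expected least wait among num_requestors independent draws."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(trials):
        total += min(draw_wait(rng) for _ in range(num_requestors))
    return total / trials


def uniform_wait(rng):
    """Placeholder wait distribution on [0, 1]; not RRM's actual function."""
    return rng.uniform(0.0, 1.0)


few = expected_least_wait(2, uniform_wait)    # few requestors draw waits
many = expected_least_wait(20, uniform_wait)  # many requestors draw waits
```

With this stand-in distribution the estimate for two requestors is close to 1/3, while for twenty it is far smaller, which mirrors the observation that the least wait is large when few repositories draw stochastic waits.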
Figure (a) also exhibits another interesting trend. Even though the upper bound of our stochastic wait increases logarithmically with N, average recovery latencies appear to be largely independent of, or vary very slowly with, N. This is very desirable scaling behavior, in that average recovery latency is not affected by the number of repositories, but only by their distribution across the Internet.

For the repository distributions we studied, RRM's stochastic wait results in about one duplicate request for every five losses (Figure (b)). Some of these duplicates are secondary requests; that is, a repository was suppressed by the first, or primary, request and backed off its request timer, but did not see the response before that timer expired. This phenomenon also explains why the preferred responder optimization shows fewer average duplicate requests: because this scheme responds faster, fewer secondary requests are triggered.

The preferred responder optimization reduces the incidence of duplicate responses to less than one in thirty losses (Figure (c)). Why are duplicate responses generated even with the preferred responder scheme? One reason, of course, is that the least stochastic wait is so small that even the immediate response does not suppress it. In our simulations, however, most duplicate responses result from duplicate requests with different preferred responders.

How do the average duplicate requests and responses vary with the size of the topology? Notice that in both Figure (a) and Figure (b), the averages of the individual runs vary significantly; we attribute this to the loss placement effect discussed above. For the range of sizes we have considered, the average number of duplicates appears not to increase with N. It appears that the increased suppression from actual one-way delays being less than D outweighs, or at least offsets, the increased likelihood of duplicates with increasing N.

In our simulations of RRMPR, all losses are recovered using the preferred responder optimization. In a distributed registry, not all losses may be recovered using this optimization; recall that a node must detect a loss from a report message in order to indicate a preferred responder. Realistically, the performance of an RRM implementation will lie between the two curves shown in each of the graphs of Figure .

Comparison with SRM
SRM's stochastic wait function is parametrized by two quantities, $C_1$ and $C_2$. Different repositories can have different values of these parameters. When a repository $S_j$ detects the loss of $u_{im}$, it waits for a time randomly drawn from $[C_{1,j}\, d_{ij},\ (C_{1,j} + C_{2,j})\, d_{ij}]$, where $d_{ij}$ is the one-way delay between $S_i$ and $S_j$.

[Figure: three panels plotted against Number of Lossy Links, each with curves for SRM, RRMPR, and RRM. Panel (a): Average Recovery Latency (in units of D). Panel (b): Average Duplicate Requests. Panel (c): Average Duplicate Responses.]

Figure : repository performance comparison. SRM exhibits superior recovery latency performance. However, RRM is clearly more effective in suppressing duplicate requests and responses.

If all nodes had the same $C_1$ and $C_2$ values then, stochastically, the nodes closer to the source would wait less. In SRM, however, each repository $S_j$ independently adjusts the values of $C_1$ and $C_2$ by observing the number of duplicate requests and the recovery latency during successive losses. This adjustment tries to ensure that the repository closest to the lossy link is most likely to send the request. SRM adapts in a similar manner for responses. Intuitively, nodes closer to the lossy link learn to become more aggressive, and nodes further away to become more conservative.
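As a minimal sketch of the request wait described at the start of this section, the function below draws a wait uniformly from the interval bounded by C1 times the delay and (C1 + C2) times the delay. It assumes fixed parameter values and omits SRM's adaptation entirely; `srm_request_wait` and the numeric values are hypothetical.

```python
import random


def srm_request_wait(c1, c2, delay_to_source, rng=random):
    """Draw a request wait uniformly from [c1 * d, (c1 + c2) * d]."""
    low = c1 * delay_to_source
    high = (c1 + c2) * delay_to_source
    return rng.uniform(low, high)


rng = random.Random(1)
# With identical parameters, a node close to the source draws from an
# interval that lies entirely below that of a distant node.
near = srm_request_wait(2.0, 2.0, 0.01, rng)  # interval [0.02, 0.04]
far = srm_request_wait(2.0, 2.0, 0.10, rng)   # interval [0.20, 0.40]
```

Because the two intervals do not overlap in this example, the nearer node always fires first, which is the sense in which nodes closer to the source wait less when all nodes share the same parameter values.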
RRM is adapted from SRM. Why is SRM itself not suitable for registry replication? Fundamentally, SRM's adaptation, based on successive losses observed at a repository, is not well-suited to the expected update traffic patterns in distributed registries. Specifically, in these registries, updates from different repositories may arrive interleaved at a repository $S_j$. Therefore, successive losses experienced at $S_j$ may actually be attributable to different lossy links. If $S_j$ is near one lossy link and far away from another, it may alternately try to be aggressive or conservative in sending requests. This can result in high average duplicate requests. A similar argument holds for responses as well.

We performed extensive simulations to verify this hypothesis. The simulations are divided into batches, ranging from to repositories placed randomly on a node topology. Each batch comprises several runs, and each run corresponds to one assignment of lossy links to the topology. The number of lossy links ranges from to of the number of links in one distribution tree. Each link randomly drops updates. The probability of an update being dropped by at least one link is fixed at . Each repository generates Poisson update arrivals. Each run consists of three subruns, one each for SRM, RRM, and RRM with the preferred responder optimization. Each run simulated more than lost updates.
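One way to hold the probability of an update being dropped by at least one link fixed while the number of lossy links varies is sketched below. It assumes that every lossy link drops updates independently and with the same probability, which the text does not state; the function name and the value 0.2 are hypothetical. Under that assumption an update survives only if all k lossy links forward it, so the per-link probability p satisfies 1 - (1 - p)^k = P.

```python
def per_link_drop_probability(p_any, num_lossy_links):
    """Per-link drop probability so that P(at least one drop) equals p_any.

    Assumes independent, identically lossy links: 1 - (1 - p)**k == p_any.
    """
    return 1.0 - (1.0 - p_any) ** (1.0 / num_lossy_links)


# Hypothetical target probability and link count.
p = per_link_drop_probability(0.2, 4)
check = 1.0 - (1.0 - p) ** 4  # recovers the target, up to rounding
```

With a single lossy link the per-link probability equals the target, and it shrinks as more links are marked lossy.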
Figure describes the results of the repository batch. The graphs show per-loss averages for three performance measures: recovery latency, duplicate requests, and duplicate responses.

Clearly, SRM exhibits superior recovery latency, regardless of the number of lossy links (Figure (a)). RRM's average recovery latency with the preferred responder optimization is times greater than that of SRM. This difference is predicted by the design of these protocols. In SRM, a node $S_j$'s stochastic wait for requesting $u_{im}$ is a function of $d_{ij}$. In RRM, on the other hand, the stochastic wait is a function of D; in skewed topologies, D can be many times larger than $d_{ij}$.

(Footnote: Actually, we conducted three SRM subruns; our results are averaged over these. This is because the way SRM adapts to losses during one subrun can be slightly sensitive to the choice of initial seed for the update arrival generator.)

[Figure: Parameter Value versus Simulation Time, with curves for $C_1$ and $C_2$.]

Figure : Variation of SRM parameters. This figure shows the variation, with simulated time, of SRM's $C_1$ and $C_2$ parameters at one node in our simulations. These parameters never converge, resulting in high average duplicate requests.
As the number of lossy links increases, SRM's latency appears to degrade and RRM's to improve (Figure (a)). This is again a manifestation of the loss-placement effect (Section ). SRM, like RRM, is sensitive to the location of lossy links. Analyzing these graphs for trends in recovery latency may, therefore, be inappropriate. It is, however, entirely appropriate to compare the two protocols, since they are each simulated for the same loss configuration.

For some loss configurations, SRM transmits nearly one additional request for every loss; by comparison, RRM only transmits a duplicate request once in every five or more losses on average (Figure (b)). With more than one lossy link, SRM sends on average one duplicate response in every two losses (Figure (c)). RRM, with its preferred responder scheme, only transmits one duplicate response in more than losses.

With one lossy link, SRM outperforms RRM. However, the average number of duplicates increases sharply beyond the one-lossy-link scenario. We hypothesized earlier that such behavior was caused by a repository adapting incorrectly to interleaved losses. Figure graphically shows the continuous variation of the $C_1$ and $C_2$ parameters at one repository in our simulations. These parameters never converge, resulting in high average duplicate requests. On the other hand, RRM's performance is largely independent of the number of lossy links. This is the expected behavior: RRM adapts only coarsely to the repository distribution, but does not, unlike SRM, adapt to the placement of lossy links.
From the above discussion, we conclude that in the presence of more than one lossy link, RRM, particularly with its preferred responder optimization, exhibits significantly fewer duplicate requests and responses than SRM. This is more clearly demonstrated by the set of three-dimensional graphs shown in Figure . These graphs summarize our simulations comparing SRM with RRM. Although SRM's recovery latency is uniformly better than RRM's, its average duplicates increase with two or more lossy links, for the entire range of repository placements considered. For widely-distributed registries, we expect that updates will encounter multiple lossy links quite frequently.

(Footnote: To verify this, we simulated SRM in a topology with lossy links; the resultant average delay was not significantly greater than the delays shown in Figure (a).)

[Figure: three three-dimensional plots over Number of Sites and % of Lossy Links, each with surfaces for SRM, RRMPR, and RRM. Panel (a): Average Recovery Latency. Panel (b): Average Duplicate Requests. Panel (c): Average Duplicate Responses.]

Figure : Performance Comparison Summary. RRM uniformly outperforms SRM's duplicate suppression strategy in scenarios with more than one lossy link.

Implementation Performance
To demonstrate the feasibility of weakly consistent replication in the Internet, we implemented RRM in the distributed Internet routing registry (Section ). Our RRM implementation itself contains less than lines of C++ code. Our implementation has been ported to several UNIX-like operating systems, including SunOS x, Solaris, BSDI, FreeBSD, and Linux x. We intend to deploy our distributed routing registry on the Internet in a few months.

Using our implementation, we conducted a simple experiment involving seven repositories: six in the United States and one in Europe. We instrumented our implementation to trace all update transmission and loss recovery events. In our experiment, nearly updates were exchanged in all, with each repository sending approximately one-seventh of that number. The average update size was bytes. Inter-update times were uniformly distributed between zero and ten seconds; this choice allowed repositories to detect some losses from report messages and other losses from the receipt of the next update.

(Footnote: The United States repositories were oil.isi.edu and hidalgo.usc.edu in California, ns.uoregon.edu in Oregon, north.lcs.mit.edu in Massachusetts, looneast.isi.edu in Washington DC, and espresso.merit.edu in Michigan.)

(Footnote: The European repository was armonia.cnuce.cnr.it in Pisa, Italy.)

We do not believe that this experimental configuration is representative of the expected scale of Internet-wide distributed registries. Our goals in designing this widely-distributed experiment were:

- To validate the design of RRM.
- To obtain some indication of the recovery latencies, loss patterns, and suppression behavior that these registries can expect to encounter.
Before we describe the results of our experiment, it is instructive to analyze observed loss patterns. Each row in the table in Figure shows the percentage of losses observed at one repository attributable to every other repository. For most North American repositories, the majority of lost updates were from italy. The lone exception is the pair mit and umich; they observed more losses from each other than from italy. The losses observed at the Italian repository were nearly uniformly distributed among all North American repositories.

An update loss may be shared among repositories. The histogram in Figure counts the number of lost updates that were shared between repositories. A majority of the lost updates were only seen at a single repository. However, single-repository losses were predominantly observed at the Italian repository. From Figure , we conclude that the transcontinental link accounted for the majority of losses.

These loss patterns are not representative of the Internet's multicast infrastructure. However, they highlight two important design considerations:

- Losses in the Internet are hardly infrequent, and can be shared among different repositories. So, when using multicast for replication, careful design of the loss recovery mechanism is essential for good replication performance and scalability.
- The Internet infrastructure is highly diverse. In our experiment, relatively few losses were observed between the North American repositories. However, the transcontinental link was a cause of significant loss. This diversity can stress replication performance in globally distributed registries.
For each loss observed at a repository, we define its recovery latency to be the interval between when the loss is first detected (either from a report or from a subsequent update) and when the update was finally recovered at that repository. To obtain RRM's average recovery latency, we computed, for each lost update, the average. Figure shows the mean of all these averages. The average loss recovery latency of about four seconds is unsurprising; the worst-case latency, assuming a value of seconds for D, is seconds (Equation ).

(Footnote: We observed a few outliers in the distribution of latencies. These outliers were caused by a bug in the distance estimation code, as well as network partitions. In computing the average latency, we discarded these outliers.)

[Figure: a table whose rows and columns are the repositories isi, usc, umich, uoregon, dc, mit, and italy, with a final column for Total Updates Lost; and a histogram of Number of Losses versus Number of Sites Sharing Loss.]

Figure : Loss Statistics. The table on the left indicates the sources of loss seen at a particular repository. A single update loss may be visible at many repositories; the histogram on the right shows the frequency of these shared losses. In our experiment, most update losses were seen at a single repository.

Almost of the lost updates were detected by receiving a report message. These updates could take advantage of the preferred responder optimization described above. We computed the recovery latency for those updates that were recovered using this optimization, and those that were not. On average, the preferred responder optimization reduced recovery latency by nearly a second. Unlike the results in Section , this is not significantly lower than the overall mean recovery latency. That is because, in our experiment, italy observed most of the losses, and because the losses in that repository were not shared, its request latency dominates the average.
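The comparison between losses recovered with and without the preferred responder optimization amounts to grouping trace records by how the loss was detected. The sketch below shows that grouping; it is not the authors' analysis code, and the record layout and the sample values are hypothetical.

```python
from collections import defaultdict
from statistics import mean


def latency_by_detection(trace):
    """Mean recovery latency grouped by how each loss was detected.

    Each trace record is (detected_by, detected_at, recovered_at), where
    detected_by is "report" (a preferred responder can be named) or
    "update" (the loss was noticed from the next update).
    """
    groups = defaultdict(list)
    for detected_by, detected_at, recovered_at in trace:
        groups[detected_by].append(recovered_at - detected_at)
    return {kind: mean(latencies) for kind, latencies in groups.items()}


# Hypothetical trace records; times are in seconds.
trace = [
    ("report", 0.0, 3.0),
    ("report", 5.0, 9.0),
    ("update", 2.0, 7.0),
]
result = latency_by_detection(trace)
```

For these sample records the report-detected losses average 3.5 seconds and the update-detected loss takes 5.0 seconds.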
In RRM, primary requests are sent after the request timer expires. A repository may also send out secondary requests if the primary request failed to elicit an update recovery; this can happen, for instance, if either the primary request or the response failed to reach the repository (Section ). To evaluate the efficacy of our request suppression algorithm, we only count the number of primary requests sent for each loss. Figure shows the mean and standard deviation of this number.

In our experiment, most losses were not shared. However, the magnitude of the shared losses was such that, in the absence of suppression, we would expect the average number of duplicate requests to be . The observed mean of (Figure ) indicates that our request suppression works reasonably well for the particular experiment configuration. Finally, the average number of duplicate responses is . Since most of the losses were not shared, the number of potential responders is, for most losses, six (i.e., one less than the total number of repositories). The low average number of duplicate responses clearly indicates the efficacy of both stochastic wait and the preferred responder optimization.
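A plausible way to derive the expected number of duplicate requests in the absence of suppression from the shared-loss histogram is sketched below. It assumes that, without suppression, every repository that sees a loss sends its own primary request, so a loss shared by k repositories produces k - 1 duplicates. The text does not spell out this calculation, and the histogram counts used here are hypothetical.

```python
def expected_duplicates_without_suppression(shared_loss_counts):
    """Average duplicate requests per loss if no request were suppressed.

    shared_loss_counts maps k (number of sites sharing a loss) to the
    number of losses shared by exactly k sites. Each such loss yields
    k requests, of which k - 1 are duplicates.
    """
    total_losses = sum(shared_loss_counts.values())
    duplicates = sum((k - 1) * n for k, n in shared_loss_counts.items())
    return duplicates / total_losses


# Hypothetical histogram: most losses are seen at a single site.
histogram = {1: 6, 2: 3, 3: 1}
baseline = expected_duplicates_without_suppression(histogram)
```

For this hypothetical histogram the baseline is 0.5 duplicate requests per loss; an observed mean well below such a baseline would indicate effective suppression.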
Related Work

Earlier work on scalable wide-area replication has relied upon the establishment of reliable point-to-point connections between each repository and one or more other repositories. Repositories then dynamically compute a flooding topology that is a subgraph of these inter-repository connections. This subgraph redundantly spans all repositories, and is used to propagate updates globally. Our approach does not compute an optimal update propagation topology at the application level; instead, we directly use the distribution trees computed by multicast routing for propagating updates.

Most relevant to our work is the rather large body of research in reliable Internet multicast communication. Of direct relevance to full replication is Scalable Reliable Multicast (SRM), already described in some detail in an earlier section. Similar to our analysis of the asymptotic behavior of RRM's timers (Section ), other work has analytically evaluated the asymptotic behavior of SRM in certain simple topologies. To increase the scalability of SRM to larger groups, other work has examined localized loss recovery in the context of SRM. In this work, nodes that experience shared loss collectively establish special recovery multicast groups that include, probabilistically, the responder nearest to a lossy link. Responders can then retransmit updates to these recovery groups. Such localized loss recovery can be adapted to RRM as well.

[Figure: a table with rows Recovery Latency (seconds), Recovery Latency without Preferred Responder (seconds), Recovery Latency with Preferred Responder (seconds), Duplicate Requests, and Duplicate Responses.]

Figure : Averages. After discarding delay outliers, we found that the average loss recovery time in our scheme was about four seconds. For losses that specified a preferred responder, the average recovery time improved by nearly a second. Incidence of duplicates was negligible: for requests, and for responses.
To overcome the latency imposed by stochastic wait, and to localize recovery to the collection of nodes incurring loss, some loss recovery mechanisms propose specialized router support. In particular, PGM provides minimal router support for loss recovery; such support can augment RRM's application-level recovery mechanisms, for example. Other approaches dynamically elect representative responders or requestors.

Recently, other work has also analyzed the suppression behavior of exponentially distributed timers. This work shows that a stochastic wait function similar to W (Equation ) simultaneously reduces both latency and the number of duplicates when the number of receivers experiencing loss is known a priori. W is slightly more pessimistic than their choice, in that it is a function of N, the total number of group members, and not of the number of receivers experiencing loss.

Replication has been studied in a variety of contexts. Distributed systems research has considered group communication primitives that provide causal update ordering to preserve a variety of consistency semantics. File system research has focused on transactional replication techniques to preserve file update semantics for replicated files. More recently, research has investigated weakly consistent replication for maintaining file consistency or providing transaction support during disconnected operation. Other work has qualitatively described the implications of scale and Internet-wide distribution on large-scale replication and consensus. This work argues that transactional replication over the wide area may be infeasible. Our paper represents an existence proof of the feasibility of non-transactional replication.
Conclusions and Future Work

Beyond a certain number of repositories, RRM's average delay and number of duplicates appear to be relatively independent of the number of repositories. This is encouraging; we may expect RRM to scale well to several thousand repositories. The message exchange overhead required to estimate D or send reports may limit the ability of RRM to scale beyond this limit. Other research is considering the formation of hierarchies for scaling these message exchanges.

In describing RRM, we have simplified the consistency requirements to only allow one repository to update a record. It is desirable to allow more than one repository to update a record in our routing policy registry (Section ). However, we believe it will suffice for the registry to implement eventual, timestamp-based consistency.

Acknowledgements

Discussions with Cengiz Alaettinoğlu helped clarify the requirements of the routing policy registry. Charley Liu described the details of the SRM design, and Kannan Varadhan explained the internals of ns's SRM implementation. Mark Handley suggested investigating the exponentially distributed stochastic wait function, and Govindan Rajesh analyzed its asymptotic behavior. The experiment described in Section would not have been possible without the help of Suresh Bhogavilli, Mark Handley, Jake Khuon, Dave Meyer, Damir Pobric, Prashant Shenoy, and Wilfried Woeber.
References
C Alaettinolu Scalable Router Conguration for the In ternet Pr o c e e dings of the International
Confer enc e on Networking Pr oto c ols Octob er K Birman A Sc hip er and P Stephenson Ligh t w eigh t Causal and A tomic Group Multicast A CM
T r ansactions on Computer Systems K Calv ert M Doar and EW Zegura Mo delling In ternet T op ology In IEEE Communic ations Magazine June S Ceri and G P elegatti Distribute d Datab ases Principles and Systems McGra w Hill Bo ok Compan y S Deering Multicast Routing in In ternet w orks and Extended LANs In Pr o c e e dings of A CM SIG
COMM Confer enc e on Communic ation A r chite ctur es and Pr oto c ols pages August D DeLucia and K Obraczk a Multicast F eedbac k Suppression using Represen tativ es In Pr o c e e dings of
IEEE INF OCOM
S Flo yd V Jacobson S McCanne ChingGung Liu and Lixia Zhang A Reliable Multicast F ramew ork
for Ligh tW eigh t Sessions and Application Lev el F raming In Pr o c e e dings of the A CM SIGCOMM Aug
R Go vindan and A Reddy An Analysis of In ternet In terDomain T op ology and Route Stabilit y In Pr o c
IEEE INF OCOM K ob e Japan Apr J Gra yP Helland P ONeil and D Shasha The Dangers of Replication and a Solution In Pr o c e e dings of
the A CM SIGMOD International Confer enceonthe Management of Data pages Mon treal
Canada June P age
R. S. Guy, J. S. Heidemann, W. Mak, T. W. Page, G. J. Popek, and D. Rothmeier. Implementation of the Ficus Replicated File System. In USENIX Conference Proceedings, June.
M. Handley and J. Crowcroft. Network text editor (NTE): A scalable shared text editor for the MBone. In ACM SIGCOMM Computer Communication Review, Oct. Proceedings of the ACM SIGCOMM.
R. Ladin, B. Liskov, L. Shrira, and S. Ghemawat. Providing High Availability Using Lazy Replication. In ACM Transactions on Computer Systems, November.
J. C. Lin and S. Paul. RMTP: A Reliable Multicast Transport Protocol. In Proceedings of the IEEE INFOCOM, March.
B. Liskov, S. Ghemawat, R. Gruber, P. Johnson, L. Shrira, and M. Williams. Replication in the Harp File System. In Proceedings of the Thirteenth ACM Symposium on Operating System Principles, Pacific Grove, CA, October.
C. Liu, D. Estrin, S. Shenker, and L. Zhang. Local Error Recovery in SRM: Comparisons of Two Approaches. Technical Report, Computer Science Department, University of Southern California. Submitted to IEEE INFOCOM.
P. Mockapetris. Domain Names: Concepts and Facilities. Request for Comments, InterNIC Directory Services.
K. Obraczka. Multicast Transport Mechanisms: A Survey and Taxonomy. Internet Multicast Transport Protocol Survey, October.
K. Obraczka, P. Danzig, E. Y. Tsai, and D. DeLucia. A Tool For Massively Replicating Internet Archives: Design, Implementation, and Experience. In Proceedings of the International Conference on Distributed Computing Systems, Hong Kong, May.
J. J. Ordille and B. P. Miller. Database Challenges in Global Information Systems. In Proceedings of ACM SIGMOD International Conference on Management of Data, Washington, DC, May.
C. Papadopoulos, G. Parulkar, and G. Varghese. An Error Control Scheme for Large-Scale Multicast Applications. Submitted for publication.
V. Paxson. End-to-end Routing Behavior in the Internet. In Proceedings of the ACM SIGCOMM Symposium on Communication Architectures and Protocols, San Francisco, CA, September.
V. Paxson. End-to-end Internet Packet Dynamics. In Proceedings of the ACM SIGCOMM Conference on Communication Architectures and Protocols, September.
K. Peterson, M. Spreitzer, D. Terry, M. Theimer, and A. J. Demers. Flexible Update Propagation in Weakly Consistent Replication. In Proceedings of the ACM Symposium on Operating Systems Principles, September.
S. Raman, S. McCanne, and S. Shenker. Asymptotic Behavior of Global Recovery in SRM. In Proc. ACM SIGMETRICS/PERFORMANCE Joint International Conference on Measurement and Modeling of Computer Systems, Madison, WI, June.
M. Satyanarayanan. Coda: A Highly Available File System for a Distributed Workstation Environment. Technical Report CMU-CS, School of Computer Science, Carnegie Mellon University, July.
P. Sharma, D. Estrin, S. Floyd, and L. Zhang. Scalable Session Messages in SRM. Submitted for publication.
T. Speakman, D. Farinacci, S. Lin, and A. Tweedly. PGM Reliable Transport Specification. Internet-Draft, draft-speakman-pgm-spec.txt, January.
B. Walker, G. Popek, R. English, and C. Kline. The LOCUS Distributed Operating System. In Proceedings of the Symposium on Operating Systems Principles, October.
M. Yagnik, J. Kurose, and D. Towsley. Packet Loss Correlation in the MBone Multicast Network. Technical Report UM-CS, Computer Science Department, University of Massachusetts.
Description
R. Govindan, H. Yu, D. Estrin. "Large-scale weakly consistent replication using multicast." Computer Science Technical Reports (Los Angeles, California, USA: University of Southern California. Department of Computer Science) no. 682 (1998).
Asset Metadata
Creator: Estrin, D. (author), Govindan, R. (author), Yu, H. (author)
Core Title: USC Computer Science Technical Reports, no. 682 (1998)
Alternative Title: Large-scale weakly consistent replication using multicast (title)
Publisher: Department of Computer Science, USC Viterbi School of Engineering, University of Southern California, 3650 McClintock Avenue, Los Angeles, California, 90089, USA (publisher)
Tag: OAI-PMH Harvest
Format: 22 pages (extent), technical reports (aat)
Language: English
Unique identifier: UC16269112
Identifier: 98-682 Large-Scale Weakly Consistent Replication using Multicast (filename)
Legacy Identifier: usc-cstr-98-682
Rights: Department of Computer Science (University of Southern California) and the author(s).
Internet Media Type: application/pdf
Copyright: In copyright - Non-commercial use permitted (https://rightsstatements.org/vocab/InC-NC/1.0/)
Source: 20180426-rozan-cstechreports-shoaf (batch), Computer Science Technical Report Archive (collection), University of Southern California. Department of Computer Science. Technical Reports (series)
Access Conditions: The author(s) retain rights to their work according to U.S. copyright law. Electronic access is being provided by the USC Libraries, but does not grant the reader permission to use the work if the desired use is covered by copyright. It is the author, as rights holder, who must provide use permission if such use is covered by copyright.
Repository Name: USC Viterbi School of Engineering Department of Computer Science
Repository Location: Department of Computer Science. USC Viterbi School of Engineering. Los Angeles, CA, 90089
Repository Email: csdept@usc.edu
Inherited Values
Title: Computer Science Technical Report Archive
Coverage Temporal: 1991/2017