USC Computer Science Technical Reports, no. 819 (2004)
Xiaoliang Zhao, Dan Massey, Mohit Lad, Lixia Zhang. "ON/OFF model: A new tool to understand BGP update burst." Computer Science Technical Reports (Los Angeles, California, USA: University of Southern California. Department of Computer Science) no. 819 (2004).
ON/OFF Model: A New Tool to Understand BGP Update Burst

Xiaoliang Zhao, Daniel Massey
University of Southern California Information Sciences Institute
Email: {xzhao, massey}@isi.edu

Mohit Lad, Lixia Zhang
Computer Science Department, U. of California, Los Angeles
Email: {mohit, lixia}@cs.ucla.edu

Abstract

BGP, the inter-domain routing protocol, can exhibit complex behaviors under various conditions. Although BGP log data have been made available in recent years, the sheer size of the log data makes it difficult to interpret BGP behavior using only the raw BGP update messages, and understanding the global routing dynamics in today's Internet remains a great challenge.

In this paper we focus on the analysis of BGP update bursts, a commonly observed event that occurs at varying frequency. We define a BGP update burst as an occurrence of a large number of BGP updates that are separated by very short time intervals. To investigate the causes of such bursts we developed an ON/OFF model which can be used to classify BGP bursts into two classes: stable routing changes and transient route flapping. A stable routing change means an existing route is replaced by a new route that lasts for a long time period, while transient route flapping means a series of routing updates occurs for the prefix over a short time period, but at the end of the burst the route is the same as the original route. By applying our ON/OFF model to BGP routing updates over the last two years, we found that the ON/OFF model is an effective way to identify stable routing changes, such as those caused by physical failures in the network, and that about half of the update bursts are caused by transient route flapping. Further investigation reveals the specific causes for a number of the transient flappings. Overall, the development of the ON/OFF model helps us make a significant step towards a complete understanding of the global routing dynamics.

I. Introduction

The Internet consists of a large number of Autonomous Systems (AS) that exchange routing information with each other to learn the best path to the destinations.
Presently, BGP (Border Gateway Protocol) is the de facto inter-AS routing protocol and is designed to adapt to link failures, AS topology changes and routing policy changes. BGP is a path-vector routing protocol, and each BGP router advertises to its neighbors (peers) the entire AS path information to destinations. To exchange routing information, BGP peers establish peering sessions. Whenever a new BGP session is set up between two peers, the complete routing tables are exchanged between them. After this initial exchange, routers only send update messages for routes that change or new routes that are added. Information exchanged by BGP is used for global routing. Therefore, faults or attacks in the BGP infrastructure can lead to problems such as denial of service and misdirected traffic.

Ideally, as a protocol, there would be a solid understanding of BGP's behavior, its response to faults, and its vulnerabilities to attacks. But in practice, the BGP infrastructure constitutes a large-scale system and can exhibit complex behaviors under various conditions. BGP log data have been available in recent years, provided by Oregon Route-Views [1] and RIPE [2]. In their services, there are one or more monitoring points, which are BGP routers that peer with routers within ISPs. A monitoring point archives its BGP routing table snapshots and the BGP updates received from its peers. These update messages, which signal either a route change or some route attribute change, are caused by events such as a physical link failure, the emergence of a better route, or simply a policy change. Due to the large-scale deployment of BGP and of routing policies, events are hidden from the observers at the monitoring points. Instead, what we see at these monitoring points is the result of the events. For instance, a physical link failure is an event that would cause the ends of the link to send update messages to their neighboring routers.
Depending on how many of these routers use this link, we would have updates being propagated further. At a remote monitoring point, all we see is update messages, without any idea about what kind of event caused them. This problem, as well as the sheer size of the log data, makes it difficult to interpret BGP behavior using only the raw BGP update messages. Therefore, understanding BGP dynamic behavior continues to be an open challenge.

In this paper, we propose a model that is a significant step toward a complete understanding of the global routing dynamics. This paper is an attempt to demystify the events behind these updates as observed from monitoring points and to gain some high-level insight into what these updates can tell us about the type of changes in BGP routes. In particular we study the event of BGP update message bursts. A BGP burst refers to a series of updates triggered by routing changes. We show that with our model we can gain considerable insight into the events causing these bursts. We classify BGP bursts into two classes: transient routing changes and non-transient routing changes. A transient routing change refers to a change in which a route, after a series of routing updates, is eventually restored, while a non-transient change is one in which a route is replaced by another route for a significantly long time. Transient changes, if better understood, could be beneficial for operational practices, such as optimizing some BGP parameters to better handle such changes.

By applying our ON/OFF model to BGP routing updates over the last two years, we found that the ON/OFF model is an effective way to identify stable routing changes, such as those caused by physical failures in the network, and that about half of the update bursts are caused by transient route flapping. Further investigation reveals the specific causes for a number of the transient flappings. Overall, the development of the ON/OFF model helps us make a significant step towards a complete understanding of the global routing dynamics.

The paper is organized as follows.
Section II describes the methodology used for data processing. Section III presents the ON/OFF model. Section IV shows that, given an ON timer of five minutes, about 50% of all BGP bursts are transient changes, and presents some statistics on the duration distribution of BGP bursts. Section V studies some cases of BGP bursts and finds that some of them are caused by worm activities and faults, which may suggest that we look back at the protocol design more carefully to better respond to those changes.

II. Data Source

We analyzed BGP routing updates collected by RIPE NCC [2] during several months in 2001 and 2002. RIPE NCC has eight data monitoring points (rrc00 - rrc07). We selected the rrc00 monitoring point and gathered data from the BGP routers listed in Table I. Some of these routers are located in global ISPs and others are located in regional ISPs. Geographically, routers are located in different countries including the United States, Japan and three European countries.

We chose the rrc00 monitoring point because it receives full routing tables from ISPs. If an ISP only provides partial routing tables and then withdraws its route to a prefix, this may indicate that the ISP has lost its route to this prefix, or it may indicate that the ISP has simply changed routes and the new route does not match the partial export policy.

It should also be noted that BGP updates are sent to the monitoring point via multi-hop BGP connections. In the operational Internet, nearly all ISP peerings are through BGP routers sharing a common physical link, where BGP updates are sent via a TCP connection over a single link/hop. However, the BGP monitoring point RRC00 peers with ISP routers via TCP connections that cross multiple router hops and links.
When the multi-hop session fails, the monitoring point reports a session state change. Note that if a peering session is reset, all routes are implicitly withdrawn and, when the new peering session is started again, it involves a complete table exchange. In nearly all session resets we observed during the studied periods, the same routes are re-advertised when the session to the ISP router resumes. We attribute this behavior to the lower stability of the multi-hop BGP sessions. We pre-process the update files to remove the updates that are generated due to session resets, resulting in a cleaned data set of BGP updates for our analysis.

TABLE I: RRC00's peering ASes that we examined

Location      ASes that rrc00's peers belong to
US            AS7018 (AT&T), AS2914 (Verio)
Netherlands   AS3333 (RIPE NCC), AS1103 (SURFnet), AS3257 (Tiscali Global)
Switzerland   AS513 (CERN), AS9177 (Nextra)
Britain       AS3549 (Global Crossing)
Japan         AS4777 (NSPIXP2)

A routing change can be broadly of two types: one that changes the AS path for a given prefix, including withdrawal and announcement of a newly reachable prefix, and one that does not. AS path changes may be due to many reasons, such as hardware failures, operational BGP session resets, or policy changes. A BGP update which does not convey new path information may change other BGP attributes, such as the Multi-Exit Discriminator (MED) or Community attributes. Such updates may be caused by policy changes or bad software implementation choices. In this paper, we are only concerned with AS path changes and do not look into details of other attribute changes. In the rest of the paper, all route changes or updates refer to those with AS path changes, unless specified otherwise.

The data was collected over the months of July 2001, September 2001, November 2001, February 2002, July 2002, and August 2002. All of the data was examined by using the methods described in the following sections.
But due to the paper size limitation, we only present results for some particular months and from some particular peers' point of view; unless mentioned otherwise, the results for other months and peers are in general similar to the sample results.

III. ON/OFF Model

In this section, we develop the ON/OFF model and show its usefulness by analyzing BGP bursts.

A. Bursty Nature of BGP updates

Figure 1 shows the number of updates on an hourly basis. As can be seen, the total number of updates per hour is normally below 100,000, but there are some spikes. Those spikes are indications that a large volume of updates is coming as a burst.

[Fig. 1. Number of BGP Prefix Updates in Hourly Bins, from July 13th, 2001 to July 24th, 2001. Y-axis: number of prefix updates (0 to 800,000); X-axis: date (MM/DD).]

BGP burst, also noted in [3], is a phenomenon of interest because it could be an indication of routing instability, which might result from routing device failure, configuration error, or even malicious attack. During a BGP burst, routing paths may be altered, forwarding performance may be affected, and applications may experience delay and packet loss.

We need a measure of BGP burst to estimate what the level of instability is at any given instant of time. Given a trace of BGP updates, one straightforward way to analyze it would be to simply count the total number of updates for a given period. Such a simple count could give us some clues on the occurrence of some event. As a matter of fact, the spike on 07/18/2001 corresponds to a known topology event (footnote 1) and the spike on 07/19/2001 corresponds to a known worm attack (footnote 2). However, those spikes do not tell whether the bulk of the updates come from a small set of prefixes, or whether the updates come from a very large set of prefixes and are evenly distributed.
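The limitation of a raw update count can be illustrated with a minimal sketch. It assumes a simplified trace of (timestamp in seconds, prefix) records; the function name `hourly_bins` and the sample prefixes are hypothetical and are not taken from the RIPE data format.

```python
from collections import defaultdict


def hourly_bins(updates):
    """Bin (timestamp_seconds, prefix) records by hour.

    Returns {hour_index: (update_count, distinct_prefix_count)}.
    """
    counts = defaultdict(int)
    prefixes = defaultdict(set)
    for ts, prefix in updates:
        hour = int(ts // 3600)
        counts[hour] += 1
        prefixes[hour].add(prefix)
    return {h: (counts[h], len(prefixes[h])) for h in sorted(counts)}


# Hypothetical trace: hour 0 has four updates for one prefix,
# hour 1 has four updates spread over four prefixes.
trace = [
    (10, "10.0.0.0/8"), (20, "10.0.0.0/8"),
    (30, "10.0.0.0/8"), (40, "10.0.0.0/8"),
    (3700, "10.0.0.0/8"), (3710, "192.0.2.0/24"),
    (3720, "198.51.100.0/24"), (3730, "203.0.113.0/24"),
]
bins = hourly_bins(trace)
print(bins)  # {0: (4, 1), 1: (4, 4)}
```

Both hours show the same update count, yet only the second involves many prefixes, which is the distinction a plain hourly count cannot make.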
What we are really interested in is the number of prefixes that are in the process of change at any given instant of time. If a very large set of prefixes experiences routing changes simultaneously, it is a strong indication of the occurrence of a routing event warranting further investigation. A simple count of update messages is inadequate for this purpose. In the rest of this section we present the ON/OFF model, which captures exactly the number of prefixes that are in the process of change at any given instant of time.

Footnote 1: The Baltimore tunnel fire occurred at about 15:00 EST on July 18, 2001.
Footnote 2: The Code Red worm spread on July 19, 2001.

B. Definition of ON/OFF Model

We build the ON/OFF model with a hypothetical case of updates for a single prefix. Fig. 2 shows an example of how we can build the ON/OFF model. At a high level, the ON state corresponds to an active state, where a prefix may be expected to have routing changes, while OFF corresponds to a steady state, where the prefix's route is expected to stay for some time.

[Fig. 2. ON/OFF state transitions for a prefix P with a 5-minute timer, corresponding to updates received for that prefix. (a) Update messages for prefix P; every bar represents a single update. (b) ON/OFF state transitions for prefix P corresponding to the updates. X-axis: time (min). Annotations: an update message indicating a path change turns the prefix ON; an update message arrives before the timer expires and the prefix continues to be ON; the ON timer of 5 minutes since the last update expires and the prefix moves to the OFF state.]

Part (a) of the figure shows a sequence of update messages spaced in time. At minute 5, on the arrival of the first update message, the prefix is turned ON, and a timer, called the ON timer, is started.
This timer is used to account for temporary changes while alternate routes are being explored, as well as to account for convergence problems [4]. In this hypothetical case, we consider this timer to be 5 minutes, and thus the timer will expire at minute 10. As we can see, the second update arrives at t = minute 7, before the timer expires. This update keeps the prefix in the ON state. At this moment, the timer is restarted to wait for another 5 minutes to accommodate further updates. Similar action is taken at t = minute 8, when the third update message arrives. However, from t = minute 9 to t = minute 14 there is no further update, and the timer expires, thus pushing the prefix to the OFF state as shown in part (b) of Fig. 2. The prefix continues to be OFF till the next update announcing a route change arrives for the prefix, which is at t = minute 15. Again, the prefix is turned ON, and it turns OFF when there is no further update for the next 5 minutes. Thus the prefix, as observed from the monitoring point, moves between the ON and OFF states depending on the updates being generated and the time spacing between successive updates.

[Fig. 3. ON/OFF state transition diagram. Transition labels: update with no AS path change (OFF to OFF); path change update (OFF to ON); update message, ON timer restarted (ON to ON); ON timer expires (ON to OFF).]

Definition: We define a prefix to be ON if it has recently received an update indicating an AS path change and a timer with time t, which would turn it off, has not expired since its last update.

From the simple example discussed above, we can construct a state transition diagram for the ON and OFF states as in Fig. 3. Every prefix is by default in the OFF state. If we observe an update message for a prefix, but the new path announced is the same as the old one, the prefix will continue to be in the OFF state.
However, if the new path announced is different from the old path, then the prefix moves to the ON state. In this state, the prefix waits for one of two events to happen. Either the ON timer expires, in which case the prefix moves back to the OFF state, or there is an update message for the prefix before the ON timer expires. In the latter case, the prefix stays in the ON state and the ON timer is restarted.

Thus, a prefix that receives updates, all of which are within t of each other, will stay ON for the entire period of updates. However, if even one update arrives more than t after the last update for the same prefix, then the prefix would have turned off as soon as the ON timer expired. The choice of this timer t is very critical and will be discussed in detail later.

C. Implications of ON/OFF periods

The total amount of time a particular prefix remains ON is called an ON period. Similarly, the duration of time a prefix remains OFF is called an OFF period. An ON period for a prefix indicates that the prefix has very recently undergone a change of path. Thus, if a prefix has a long ON period, it implies that the prefix's path is changing more frequently than those of prefixes with shorter ON periods. Gaining an estimate of how many prefixes are ON at any instant would give us an idea of how stable the routes were at that instant. If an external event like a topological change affects BGP routes, then the instability should be reflected by a noticeable increase in the number of ON prefixes.

Initially, we set all prefixes to the OFF state until path changes turn some prefixes to the ON state. For a given prefix, the path used before the prefix enters the ON state is recorded and compared with the path used after the prefix returns to the OFF state. If the two paths are equal, we define it as a transient change; otherwise we call it a non-transient change.

Transient changes are of interest because they may reflect some unexpected network events, such as a transient failure, which normally will be repaired very quickly, slow routing convergence, and other events.

D. Choice of ON Timer

The choice of ON timer plays an important role in our model. If we choose the timer value to be very small, we might divide an ON period into several very short successive ON periods, which means the routing change is still ongoing but no single ON period covers it. If we choose the value to be very large, we would extend ON periods so far that they cover uncorrelated routing changes. We ran experiments with different values (5 minutes, 10 minutes, 15 minutes, 20 minutes and 1 hour) to obtain different results. In this paper, we mainly present the results for 5 minutes and 20 minutes, a choice based on the distribution of the inter-arrival time of path change updates. As shown in Figure 4, at least 50% of updates indicating path changes arrive within 300 seconds of the previous path change update. Therefore, if we choose an ON timer of 5 minutes, those updates will be covered by one ON period. If we choose an ON timer of 20 minutes, more than 65% of path change updates will be coalesced with other updates. The curve levels out roughly after 1200 seconds, so a longer timer may not make much difference.

[Fig. 4. Distribution of Inter-arrival Time of Path Change Updates. CDF Pr(X<x) versus inter-arrival time in seconds (ticks at 0, 300, 1200, 3600); August 2002, peer 192.205.31.33.]

IV. Results

This section shows the results obtained by applying the ON/OFF model to historical BGP updates. As we stated earlier, the count of ON prefixes at a particular time can give us a hint about how stable the routes were at that instant. The distribution of ON periods reveals how long it takes for a prefix to converge to a new path after a series of updates. Most importantly, we want to know how many changes are transient changes. The following sections attempt to answer these questions.

A. Count of ON Prefixes

Initially, it is assumed that every prefix is in the OFF state, i.e., the total count of ON prefixes is 0 at the start point. Whenever a prefix is turned ON, the total count is increased by one at that time, and the pair (time, count) is recorded. Similarly, whenever a prefix is turned OFF, the total count is decreased by one and a new pair is recorded as well. Figure 5 shows those pairs for August 2002 and September 2001 from AT&T's point of view, where the X-axis represents the time and the Y-axis shows the total count of ON prefixes at that time.

Given an ON timer of 5 minutes, Figure 5(a) shows that, on average, the total number of ON prefixes is less than 500 at any given time, with some exceptions. In fact, we obtained a total of 193,186 (time, count) pairs in August 2002, and 96% of them have a count of less than 500.

Figure 5(b) shows that when the ON timer is increased to 20 minutes, the count increases as well. In total there are 167,587 pairs, and 70% of them have a count of less than 500. We have fewer pairs because a longer ON timer may collapse multiple ON periods into one, and consequently reduce the number of both ON and OFF periods. In addition, if a prefix remains in the ON state longer, the population of ON prefixes at a particular time will increase. We also observe a few spikes, some of which are explained in Section V.

Figure 5(c) shows the count of ON prefixes for September 2001, with an ON timer of 20 minutes. One may quickly notice the sharp increase around September 18, which is explained later in this paper as well.

B. ON Period Distribution

The duration for which a prefix stays in the ON state is counted as an ON period. Figure 6 shows the distribution of ON periods for August 2002 and September 2001.

Given an ON timer of 5 minutes, we observed a total of 1,067,730 ON periods. Out of those ON periods, as shown in Figure 6(a), 38% are equal to 300 seconds (footnote 3), 75% are less than 409 seconds, and 95% are less than 665 seconds. Considering that we already artificially add 5 minutes to the ON periods, the actual duration may be even shorter. These numbers may provide a general idea about how long a BGP burst would last.

Footnote 3: This means that those ON periods contain only one update.
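The ON/OFF bookkeeping of Section III and the (time, count) pairs and ON period durations used here can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' tool: updates are assumed to arrive as time-sorted (time in seconds, prefix, AS path) tuples, `None` stands for a withdrawal or an unknown initial route, updates that repeat the current path are ignored in both states, and an update arriving exactly at the expiry instant still counts as being in time. The names `on_periods`, `count_pairs` and `OnPeriod`, and the AS numbers, are hypothetical.

```python
from collections import namedtuple

OnPeriod = namedtuple(
    "OnPeriod", "prefix start end updates path_before path_after")


def on_periods(updates, timer=300):
    """Derive ON periods from time-sorted (time, prefix, as_path) updates."""
    last_path = {}  # prefix -> path after the latest path change
    active = {}     # prefix -> bookkeeping for the ON period in progress
    periods = []

    def close(prefix):
        st = active.pop(prefix)
        # The ON period ends when the timer started by the last update expires.
        periods.append(OnPeriod(prefix, st["start"], st["last"] + timer,
                                st["updates"], st["before"],
                                last_path[prefix]))

    for t, prefix, path in updates:
        # The timer ran out before this update: the prefix had gone OFF.
        if prefix in active and t - active[prefix]["last"] > timer:
            close(prefix)
        if path == last_path.get(prefix):
            continue  # no AS path change, so no state change (assumption)
        if prefix in active:
            active[prefix]["last"] = t  # restart the ON timer
            active[prefix]["updates"] += 1
        else:
            active[prefix] = {"start": t, "last": t, "updates": 1,
                              "before": last_path.get(prefix)}
        last_path[prefix] = path

    for prefix in list(active):
        close(prefix)
    return periods


def count_pairs(periods):
    """Running number of ON prefixes at every ON/OFF transition."""
    events = sorted([(p.start, 1) for p in periods] +
                    [(p.end, -1) for p in periods])
    pairs, count = [], 0
    for t, delta in events:
        count += delta
        pairs.append((t, count))
    return pairs


# The hypothetical example of Fig. 2: updates at minutes 5, 7, 8 and 15.
P = "192.0.2.0/24"
path1, path2 = ("64500", "64501"), ("64500", "64502")
trace = [(300, P, path1), (420, P, path2), (480, P, path1), (900, P, path2)]
periods = on_periods(trace, timer=300)
pairs = count_pairs(periods)
print([p.end - p.start for p in periods])  # [480, 300]
print(pairs)  # [(300, 1), (780, 0), (900, 1), (1200, 0)]
```

The second ON period contains a single update and therefore lasts exactly as long as the timer, which is why a large share of the measured ON periods equals 300 seconds.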
Given a longer timer of 20 minutes, we make a similar observation: in total we obtained 758,248 ON periods, of which 27% are equal to 1200 seconds, 75% are less than 1519 seconds, and 95% are less than 2783 seconds. Note that ON periods tend to be longer in this case, because two or more consecutive and closely spaced ON periods may be combined into one by a long timer.

Although most ON periods are relatively short-lived, some ON periods last extremely long, even with a short ON timer. Such long-lived ON periods most likely indicate that the involved prefixes or networks suffered network problems. For example, the longest ON period obtained from the data is 53348 seconds. Further investigation revealed that one particular prefix flapped between two paths almost every minute, sometimes being withdrawn, from the early morning of August 11th until late at night, and the prefix finally ended up withdrawn. The same thing happened again on August 12th, but the prefix ended with a third path. Based on its flapping pattern and timing information, we conjecture it was caused by a configuration error for the prefix during those two days. This example shows that the ON/OFF model can be used to narrow down to a small set of prefixes which are more worthy of investigation than others.

Furthermore, we count how many updates are sent during one ON period. The results show that for a 5-minute ON timer, 75% of ON periods contain at most two updates, and 99% contain at most 8 updates. For a 20-minute timer, 75% contain at most 3 updates and 99% contain at most 15 updates.

C. Transient Changes

As we described earlier, an ON period implies that a prefix was in the course of convergence after routing changes. We are more concerned about which path will be used after the changes converge. If the new path is different from the previously used path, it clearly indicates that either the old path is experiencing some failures, or the policy has been changed to prefer a new path.
If the new path is the same as the previous one, i.e., a transient change occurred, it means that the routing path changed at least once but eventually the old path was restored to the routing table. The causes of such changes are not completely understood; some explanations are provided in the next section. But first, one may wonder whether transient changes occur frequently or rarely.

[Fig. 5. Count of ON Prefixes. Y-axis: number of ON prefixes (0 to 6000); X-axis: time (DD-HH); peer: Peer-A. (a) August 2002, timer = 300 seconds. (b) August 2002, timer = 1200 seconds. (c) September 2001, timer = 1200 seconds.]

[Fig. 6. Distribution of ON Periods. CDF Pr(X < x) versus the duration of the ON period, with curves for timer = 300 seconds and timer = 1200 seconds; peer: Peer-A. (a) August 2002. (b) September 2001.]

If we count one transition from an OFF period to the next OFF period as one change, transient changes can be identified by checking whether the path used in an OFF period is the same as the path used in the next OFF period. Figure 7 shows the daily percentage of transient changes over total changes in August 2002 and September 2001. From Figure 7, we can see that transient changes account for around 50% of total changes every day.
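The daily percentage described above can be sketched in a few lines. This is an illustrative sketch only: it assumes each OFF to OFF transition has already been reduced to a (day, path before, path after) record, and the function name `daily_transient_percentage`, the day labels and the AS paths are hypothetical.

```python
from collections import defaultdict


def daily_transient_percentage(changes):
    """changes: iterable of (day, path_before, path_after) records.

    A change is transient when the path used in the OFF period after the
    ON period equals the path used in the OFF period before it.
    """
    total = defaultdict(int)
    transient = defaultdict(int)
    for day, before, after in changes:
        total[day] += 1
        if before == after:
            transient[day] += 1
    return {day: 100.0 * transient[day] / total[day] for day in sorted(total)}


# Hypothetical records; None marks a prefix that was unreachable before.
changes = [
    ("0801", ("64500", "64501"), ("64500", "64501")),  # restored: transient
    ("0801", ("64500", "64501"), ("64500", "64502")),  # replaced: non-transient
    ("0802", ("64500", "64503"), ("64500", "64503")),
    ("0802", ("64500", "64503"), ("64500", "64503")),
    ("0802", ("64500", "64503"), ("64500", "64504")),
    ("0802", None, ("64500", "64504")),
]
result = daily_transient_percentage(changes)
print(result)  # {'0801': 50.0, '0802': 50.0}
```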
It is clear that transient changes happen quite often, which is consistent with observations from other studies [3]. If they were better understood, the causes of such changes might be better controlled, and the protocol might be tuned more finely to react to such changes.

Note that a transient change involves at least two updates: the first one turns the prefix ON, and the last one restores the old path. Combined with the previous result that most ON periods contain only 2 or 3 updates, this implies that the type of transient change which first fails over to a new path and then quickly returns to the old path accounts for a fair number of transient changes.

One may also note, by comparing the two curves in Figure 7(a), that when the timer becomes longer, transient changes tend to make up a larger proportion of total changes. It is similar for other months and other peers, as shown in Figure 8(a). We already know that a longer timer decreases the number of changes, and hence the number of transient changes. However, it seems that a longer timer reduces the total number of changes more than the number of transient changes. For example, when the ON timer is changed from 5 minutes to 1 hour, the total number of changes decreases by 44.27%, while the number of transient changes decreases by only 16.8% in August 2002. One reason might be that a longer timer combines two non-transient changes into one transient change. For example, a routing change like Path 1 → Path 2 → Path 1 is combined into Path 1 → Path 1; thus the total number of changes is decreased by one, but the number of transient changes is actually increased by one. The increase in the proportion of transient changes with a longer timer seems to support this explanation.

It is an interesting observation, since it reveals that the routing to a prefix seems to stick to a particular path; whatever the routing changes are and however long it takes, the path tends to be reused eventually. We can test this conjecture from another perspective by counting how many paths are used during OFF periods for each prefix. A path used during an OFF period is termed a non-transient path (footnote 4). A non-transient path normally will be used to forward traffic for a while, at least longer than the ON timer. Therefore, counting the number of non-transient paths will reveal the stability of reaching a prefix. If a prefix is reachable via fewer non-transient paths, especially if only via one path, it implies that the routing to the prefix is quite stable. When we increase the ON timer, if the conjecture is true, we should see more prefixes with only one non-transient path. Figure 8(b) shows the results. The X-axis shows the different values of the ON timer, while the Y-axis shows the percentage of prefixes with one non-transient path, as well as the percentage with more than one non-transient path. The figure shows that when the ON timer increases, more prefixes tend to be reachable via only one path, which supports our conjecture. Such a stickiness property of the routing system was also observed by other studies: [5] observed that the paths to reach top-level DNS servers are quite stable, and [6] made a similar observation for popular sites.

[Fig. 7. Percentage of Transient Changes. Y-axis: percentage of transient changes (%); X-axis: time (MMDD); curves for timer = 300 s and timer = 1200 s; peer: Peer-A. (a) August 2002. (b) September 2001.]

V. An Explanation for Transient Changes

This section presents some cases we investigated based on anomalies captured by our ON/OFF model, as well as studies of the routing impact of some known network events.

A. Code Red/Nimda Worm Attack

One event around September 18, 2001 that attracted a lot of attention was the Nimda worm.
According to the SANS Institute, the scanning activity of the Nimda worm dramatically increased at approximately 1pm GMT on September 18, and abated in the following hours [7]. The effect of the Nimda worm on BGP is examined in [8]. Figure 5(c) shows a sharp increase of ON prefixes around September 18, 2001, which is an indication that a large number of prefixes experienced routing changes simultaneously on that day. In fact, compared with the median daily number of ON prefixes in September, the number of changes on the 18th increased by 87.6%, from 41242 to 77373. The number of transient changes increased by 124.64%, from 19701 to 44586.

The increase in transient changes indicates that the excessive traffic caused by worm activities affected routing stability. For example, suppose AS A is multi-homed with two providers, say B and C. Normally, the incoming traffic follows the path (B A). The worm traffic could congest this path, so A switches to another path (C A) by withdrawing the prefix from B or by changing community attributes to inform B or C of the new route preference. Quickly, the new path is congested as well, A has to switch back, and so on. From the outside point of view, such routing changes are transient changes, given a proper ON timer.

Footnote 4: A non-transient path stands in contrast to a path that appears only in ON periods, which is considered a transient path.

In July 2001, the Code Red worm attacked the Internet. Our data also show a similar pattern of an increased number of transient changes.

Note that although the proportion of transient changes over total changes did increase with a longer ON timer as described in Section IV, the proportion remained almost the same as on other days, as shown in Figure 7(b). This means that the increase in transient changes is proportional to the increase in total changes, which implies that the number of non-transient changes was also increased by worm traffic. Revisiting the above example, in this case, after A switches to the new path (C A), A stays with the new path longer than the ON timer, which counts as a non-transient change.
Figure 7(b) suggests that both cases happened during the worm attack.

[Fig. 8. Stickiness Property of the Routing System. (a) Percentage of transient changes with different ON timers, for Peer-A, Peer-B and Peer-C in September 2001 and July 2002; X-axis: timer (seconds). (b) Percentage of prefixes with one or more non-transient paths, for Peer-A and Peer-B in August 2002; X-axis: timer (seconds).]

B. Spikes in August 2002

Figures 5(a) and (b) show that there are a few spikes in the total number of ON prefixes in August 2002, which may indicate some anomalies. However, if the prefixes involved in a spike end with non-transient changes, the spike may be caused by legitimate routing changes, such as a new path causing many prefixes to switch to it simultaneously. Thus prefixes involved in a spike that end with transient changes are more interesting for an investigation.

An ON prefix can end with a transient change or a non-transient change. Figure 9 shows the number of ON prefixes which ended with transient changes. Only one spike is singled out, which occurred around August 7, 19:18 to 19:23 GMT. Further investigation revealed that at that time 2352 prefixes switched to use AS 1239 as a transit AS, but very quickly (footnote 5) they switched back to their original paths. Because the transit AS was not used for very long, we believe this was caused by some kind of error, such as a misconfiguration.

C. Baltimore Tunnel Fire

On July 18th 2001 at about 18:10 GMT, a 60-car freight train carrying paper, wood and hazardous materials derailed and caught fire in the Baltimore tunnel.
The Baltimore tunnel carried communication fibers constituting part of the backbone network, and the fire damaged these fibers [9]. No data is available on the exact time of the fiber damage, but based on the delay and packet loss observed by operators on the NANOG mailing list [9], it should have occurred within several hours after the fire. The recovery and restoration of the fibers in the tunnel took much longer because of the dangerous materials and the extremely high temperatures inside (1500 degrees Fahrenheit). It was reported that operators and engineers worked overnight to reroute traffic to other cables and restored service to most customers by the afternoon of July 19th, 2001 [9]. Additional fibers were also laid outside the tunnel in order to restore some of the links.

Footnote 5: The ON periods for those changes ranged from 326 to 384 seconds.

However, based on communications with network operators, this event is regarded as an internal link failure rather than a link failure between different ASes. Consequently BGP, as an inter-domain routing protocol, would not be affected much. Our model also shows no obvious anomalies on that day, which is consistent with the common belief.

VI. Related Work

In [3], Maennel and Feldmann proposed a method to generate realistic BGP traffic in test labs. They first defined two concepts, instability originator and instability burst. An instability originator, meaning any routing event that affects prefixes, causes an instability burst, meaning a series of BGP updates due to the propagation of changes and the updating of routing tables. Their paper used a method similar to our ON timer to determine the end of a burst, but with a much longer time window of 4000 seconds.
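To see how the choice of time window changes what counts as one burst, the following sketch counts bursts in a list of update arrival times. It is a hypothetical illustration: the function name `count_bursts`, the gap rule (a new burst starts when the idle time exceeds the window), and the sample arrival times are assumptions, not data or code from either study.

```python
from typing import List


def count_bursts(timestamps: List[float], window: float) -> int:
    """Count bursts, assuming a burst ends after `window` idle seconds."""
    if not timestamps:
        return 0
    ordered = sorted(timestamps)
    bursts = 1
    for prev, cur in zip(ordered, ordered[1:]):
        if cur - prev > window:  # idle gap exceeds the window: new burst
            bursts += 1
    return bursts


# Made-up arrival times (seconds) for one prefix.
arrivals = [0, 30, 400, 430, 3000, 3020, 9000]

for window in (300, 4000):  # a 5-minute timer versus a 4000-second window
    print(window, count_bursts(arrivals, window))
```

On this sample the 300-second window yields four bursts and the 4000-second window yields two, since a longer window merges updates that a shorter one keeps apart.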
Their work and ours share some similar results; for example, they also report that most bursts are short-lived and that transient changes are pervasive. However, we examined various values of the ON timer and their implications, thoroughly studied the path change patterns, and examined a much longer period of BGP data to ensure the statistical significance of the results.

Rexford et al. [6] studied the routing behavior for popular destinations. To compare the instability of different prefixes, they also merge multiple updates into one "event" based on the spacing between two consecutive updates. However, much shorter time windows of 45 and 75 seconds were used in their study. They also found that the routes to the popular destinations are quite stable. Along the same line, Wang et al. [5] studied the reachability and routing changes of top-level DNS servers in order to protect those servers from route spoofing attacks. They also found that the routes to top-level DNS servers exhibit quite high stability. Compared with these studies, we are more interested in BGP behavior for general prefixes, because our ultimate goal is to gain a complete understanding of BGP behavior, which should not be limited to a particular set of prefixes.

[Fig. 9. Count of ON prefixes that ended with a transient change (ON timer = 5 minutes). Plot: number of ON prefixes ended with transient changes versus time (DD-HH), 2002-08-01 to 2002-08-31; Peer-A, Off Timer 300.]

VII. Conclusion

The sheer size of the BGP log makes it difficult to interpret BGP behavior using only simple analysis tools. In this paper we developed an ON/OFF model to study BGP behavior, which is an initial step toward a solid understanding of BGP's performance under both normal and stressful conditions, its response to faults, and its vulnerabilities to attacks.
By applying our ON/OFF model, we found that it is an effective way to identify two types of input to the BGP system: stable path changes and transient path changes. Transient changes are more likely to be caused by configuration errors, transient failures, and other unexpected events. We found that such changes are pervasive: about half of the update bursts can be classified as transient changes. Overall, the ON/OFF model is a useful tool that helps us make a significant step towards a complete understanding of the global routing dynamics.

References

[1] "The Route Views Project," http://www.routeviews.org/.
[2] RIPE, "Routing Information Service," http://www.ripe.net/ris/index.html.
[3] Olaf Maennel and Anja Feldmann, "Realistic BGP traffic for test labs," in Proceedings of the ACM SIGCOMM, 2002.
[4] C. Labovitz, A. Ahuja, A. Bose, and F. Jahanian, "Delayed Internet routing convergence," in Proceedings of the ACM SIGCOMM, August/September 2000.
[5] L. Wang, X. Zhao, D. Pei, R. Bush, D. Massey, A. Mankin, S. Wu, and L. Zhang, "Protecting BGP routes to top level DNS servers," in ICDCS03, 2003.
[6] J. Rexford, Jia Wang, Zhen Xiao, and Yin Zhang, "BGP routing stability of popular destinations," in Proceedings of the ACM IMW 2002, Oct. 2002.
[7] Networking System Administration and Security Institute (SANS), "Nimda worm/virus report," http://www.incidents.org/react/nimda.pdf.
[8] L. Wang, X. Zhao, D. Pei, R. Bush, D. Massey, A. Mankin, S. Wu, and L. Zhang, "Observation and analysis of BGP behavior under stress," in Proceedings of the ACM IMW 2002, 2002.
[9] NANOG, "The North American Network Operators' Group," http://www.nanog.org/.
Asset Metadata
Creator
Lad, Mohit (author), Massey, Daniel (author), Zhang, Lixia (author), Zhao, Xiaoliang (author)
Core Title
USC Computer Science Technical Reports, no. 819 (2004)
Alternative Title
ON/OFF model: A new tool to understand BGP update burst (title)
Publisher
Department of Computer Science,USC Viterbi School of Engineering, University of Southern California, 3650 McClintock Avenue, Los Angeles, California, 90089, USA
(publisher)
Tag
OAI-PMH Harvest
Format
9 pages (extent), technical reports (aat)
Language
English
Unique identifier
UC16269776
Identifier
04-819 ONOFF Model A New Tool to Understand BGP Update Burst (filename)
Legacy Identifier
usc-cstr-04-819
Rights
Department of Computer Science (University of Southern California) and the author(s).
Internet Media Type
application/pdf
Copyright
In copyright - Non-commercial use permitted (https://rightsstatements.org/vocab/InC-NC/1.0/)
Source
20180426-rozan-cstechreports-shoaf (batch), Computer Science Technical Report Archive (collection), University of Southern California. Department of Computer Science. Technical Reports (series)
Access Conditions
The author(s) retain rights to their work according to U.S. copyright law. Electronic access is being provided by the USC Libraries, but does not grant the reader permission to use the work if the desired use is covered by copyright. It is the author, as rights holder, who must provide use permission if such use is covered by copyright.
Repository Name
USC Viterbi School of Engineering Department of Computer Science
Repository Location
Department of Computer Science. USC Viterbi School of Engineering. Los Angeles\, CA\, 90089
Repository Email
csdept@usc.edu
Inherited Values
Title
Computer Science Technical Report Archive
Description
Archive of computer science technical reports published by the USC Department of Computer Science from 1991 - 2017.
Coverage Temporal
1991/2017