Improving User Experience on Today’s Internet via Innovation in Internet Routing
by
Brandon Schlinker
A Dissertation Presented to the
FACULTY OF THE USC GRADUATE SCHOOL
UNIVERSITY OF SOUTHERN CALIFORNIA
In Partial Fulfillment of the
Requirements for the Degree
DOCTOR OF PHILOSOPHY
(COMPUTER SCIENCE)
December 2021
Copyright 2021 Brandon Schlinker
Acknowledgements
I consider research to be the process of making headway on problems and questions that have
no clear solution. By this definition, research requires a significant investment of time — it is an
iterative process during which you incrementally build an understanding of “how things work,”
and then harness that understanding to make progress on a much bigger goal. As such, “doing
research” during my PhD has meant fully immersing myself into whatever topic I was focused
on at the time, working around the clock to build my understanding of the space and then using
the resulting insights to solve real-world problems. I spent countless nights in empty conference
rooms at Google drawing out datacenter topologies on whiteboards, at Facebook seven days a
week building systems and writing SQL queries to understand and address the dynamics of Internet
performance and routing, and in hotel rooms pulling together papers, presentations, and even this
dissertation. I worked on the weekends, my birthday, during holidays — practically every day of the
year — because I was committed to solving the problems that I worked on.
But while my steadfast commitment to my work played a key role in my success as a PhD
student, I know that it alone would have only taken me so far. Over the past eight years, others have
helped me refine, build, and communicate my ideas, navigate tough situations, gain
access to key opportunities, recover when I failed, and ensure that I didn’t give up when the
path forward became rough, all of which made this dissertation what it is today.
Stephanie Fung has tirelessly helped me throughout this entire journey. She has always encour-
aged me to pursue my ideas and repeatedly helped me refocus when I may have otherwise given
up. She spent one New Year’s Eve helping me understand Markov Chain Monte Carlo simulations
for a SIGCOMM paper.^1 Another time, she helped me prepare my application for the Facebook
Fellowship two hours before the deadline.^2,3 Throughout all of this, she reminded me of the bigger
picture whenever I was stressed, and helped me recognize that when you’re ambitious and take risks,
you will occasionally fail — and that’s OK.
Ítalo Cunha played a key role in practically every project I’ve worked on during my PhD and
his help was critical to realizing much of the work in this dissertation. In addition to spending
countless days, nights, and weekends working alongside me, Ítalo helped me refine my ideas when
they were in a nascent stage, serving as a sounding board and helping me recognize alternatives that
I would have otherwise not considered. Furthermore, Ítalo was an equal collaborator for much of
the systems building and analysis in this dissertation: I repeatedly went to him with problems and
only the faintest idea of how to solve them, and each time he rose to the occasion and developed a
principled, concrete solution.
^1 At midnight we went outside to explode some party poppers, and then went right back to work.
^2 I thought the deadline for the fellowship application was midnight, but realized on the date of the deadline that it was
actually due at noon. I was ready to give up (it was 10 AM when I realized my error), but Stephanie encouraged me to
submit and helped me quickly put together my application — I was awarded the Facebook Graduate Fellowship that year.
^3 I was once referred to as the “David Foster Wallace (DFW) of SIGCOMM” due to my use of footnotes, in part
because like DFW, the "most significant themes of Brandon’s work are often in the footnotes themselves". I hope this
observation holds true for this section as well.
There are multiple “PhDs” in the Schlinker family, but no one in my family pressured me to get
a PhD, so I can’t blame anything about this experience on them.^4 My Mom and Dad, and my sister
Alaina, have all been supportive and understanding — especially when it came to me needing to
work around the clock — and gave me the confidence to relentlessly pursue my ideas during my
PhD without fear of failure. I have often turned to my family when I have encountered challenges,
and they have always been willing to listen and help me define a strategy for moving forward.
Ariel Rao,^5 Elizabeth Camporeale,^6 Ethan Katz-Bassett,^7 Hyojeong Kim,^8 Jason Yap,^9 Jeff
Mogul,^8 Mr. (Otis) Frenchums,^10 Petr Lapukhov,^8 Srikanth Sundaresan,^8 and Yi-Ching Chiu,^11 all
played special roles during this journey that helped me get to the finish line, while Frank Lin,^12
^4 In fact, it’s the opposite: my parents assured me that I would be fine even if I didn’t get a PhD.
^5 Among other things, Ariel kept me company during countless early-morning breakfasts at Facebook. I was almost
certainly a real grump sometimes (a combination of being sleep deprived and — according to Ariel — my grumpy
demeanor), but after talking to Ariel, I always left the table feeling energized.
^6 Liz and I have been friends for over 25 years. She has always been supportive of my pursuits and acted as a sounding
board when I faced a dilemma. While I can’t define exactly how, I’m certain that she played a key role in getting me here.
^7 Ethan helped me articulate my ideas and work in both publications and presentations, and in particular devoted
significant time to help improve the clarity of my writing. I recognize that these revisions played a significant role in my
success as a PhD student — you need to be able to effectively communicate your ideas to have impact — and the skills
that I learned through this process continue to help me today.
^8 Jeff was my mentor during my time at Google, and Hyojeong, Petr, and Srikanth all acted as mentors during my time
at Facebook. Ultimately, this meant that they had to deal with a hardworking, full of ideas, but (occasionally) defiant
Brandon. Despite this formidable challenge, they all played key roles in my success by helping me plan, communicate,
and execute on my ideas — even when they were arguably a bit too ambitious — while in parallel helping me avoid
potential landmines and resolve inevitable conflicts.
^9 Jason has kept me company for many dinners. That has meant listening to me ramble about exciting topics such as
“Balanced Incomplete Block Design”.
^10 Mr. Frenchums kept me company and helped me stay active. He also chewed through my laptop’s charging cable the
night before a SIGCOMM paper deadline — I took that as his way of telling me it was bedtime.
^11 Yi-Ching helped me maintain some semblance of a life outside of work throughout much of my PhD, and helped
nudge me in the right direction when I needed it. She also endured countless hours of me sharing my frustrations on just
about every topic imaginable, and yet was often able to respond with insightful advice.
^12 During my undergraduate studies, Frank, Kartik, and Xiao helped me realize that I could be successful at research
by taking an engineering-driven approach to my work. I have applied this approach throughout my PhD, using my
engineering skill set to surface challenges in production environments that could benefit from research, and then having
impact by designing and deploying solutions — including measurement techniques and control systems — that fully
account for the realities of such environments.
Kartik Gopalan,^12 and Xiao Su^12 helped me get started way back during my undergraduate studies.
Alefiya Hussain, Bhaskar Krishnamachari, Hernan Galperin, John Heidemann, Ramesh Govindan,
and Wyatt Lloyd served as committee members and provided feedback throughout my PhD. Omar
Baldonado handled the logistics required for me to conduct research at Facebook during my PhD.
Thank you all.
Acknowledgement of funding. My tuition, stipend, and research at the University of Southern
California were funded in part by the Facebook Graduate Fellowship, and by NSF awards and
industry grants, including faculty research awards from both Google and Facebook.
Acknowledgement of prior publications. This dissertation includes work previously published
in conference proceedings [92, 376, 377, 378].
• Chapter 3 contains work previously published in the proceedings of the ACM Internet Mea-
surement Conference (ACM IMC); Yi-Ching Chiu was a co-first author of this work:
Yi-Ching Chiu, Brandon Schlinker, Abhishek Balaji Radhakrishnan, Ethan Katz-Bassett,
and Ramesh Govindan. “Are We One Hop Away from a Better Internet?” In: Proceedings
of the ACM Internet Measurement Conference. IMC ’15. ACM, 2015
• Chapter 4 contains work previously published in the proceedings of the Conference of the
ACM Special Interest Group on Data Communication (ACM SIGCOMM):
Brandon Schlinker, Hyojeong Kim, Timothy Cui, Ethan Katz-Bassett, Harsha V Mad-
hyastha, Italo Cunha, James Quinn, Saif Hasan, Petr Lapukhov, and Hongyi Zeng. “Engi-
neering Egress with Edge Fabric”. In: Proceedings of the Conference of the ACM Special
Interest Group on Data Communication. SIGCOMM ’17. ACM, 2017
• Chapter 5 contains work previously published in the proceedings of the ACM Internet Mea-
surement Conference (ACM IMC):
Brandon Schlinker, Ítalo Cunha, Yi-Ching Chiu, Srikanth Sundaresan, and Ethan Katz-
Bassett. “Internet Performance from Facebook’s Edge”. In: Proceedings of the Internet
Measurement Conference. IMC ’19. ACM, 2019
• Chapter 6 contains work previously published in the proceedings of the International Confer-
ence on Emerging Networking EXperiments and Technologies (ACM CoNEXT):
Brandon Schlinker, Todd Arnold, Italo Cunha, and Ethan Katz-Bassett. “PEERING: Virtu-
alizing BGP at the Edge for Research”. In: Proceedings of the International Conference
on Emerging Networking EXperiments and Technologies. CoNEXT ’19. ACM, 2019
These publications represent the output of extensive collaborations involving all named authors,
and I greatly appreciate the contributions made by my co-authors to these publications and this
dissertation as a whole.
Table of Contents
Acknowledgements ii
List of Tables xii
List of Figures xiii
Abstract xvi
Chapter 1: Introduction 1
1.1 Routes Between Users and Popular Content . . . . . . . . . . . . . . . . . . . . . 6
1.2 Challenges and Control Systems with Rich Interconnectivity . . . . . . . . . . . . 8
1.3 Performance and Opportunities with Rich Interconnectivity . . . . . . . . . . . . . 10
1.4 Advancing Internet Routing Research and Innovation . . . . . . . . . . . . . . . . 13
1.5 Summary of Contribution . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 16
Chapter 2: Background 21
2.1 The Border Gateway Protocol (BGP) . . . . . . . . . . . . . . . . . . . . . . . . . 21
2.1.1 Interconnecting autonomous systems (ASes) with BGP . . . . . . . . . . . 22
2.1.2 How BGP makes routing decisions . . . . . . . . . . . . . . . . . . . . . 23
2.2 Points of Presence and Interconnections . . . . . . . . . . . . . . . . . . . . . . . 25
2.2.1 Points of presence . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 25
2.2.2 Types of interconnection . . . . . . . . . . . . . . . . . . . . . . . . . . . 25
2.2.3 Growth in peering interconnections and the “flattening” of the Internet . . . 29
2.3 Routing Policies and Traffic Engineering . . . . . . . . . . . . . . . . . . . . . . . 30
2.3.1 How interdomain routing policies are designed . . . . . . . . . . . . . . . 30
2.3.1.1 Gao-Rexford model . . . . . . . . . . . . . . . . . . . . . . . . 31
2.3.1.2 Incorporating performance, cost, and backbone utilization . . . . 32
2.3.1.3 Optimizing routing decisions with software-defined networking . 33
2.3.2 How CDNs direct traffic to their points of presence . . . . . . . . . . . . . 34
2.4 Open Problems in Internet Routing . . . . . . . . . . . . . . . . . . . . . . . . . . 37
2.4.1 BGP’s design creates security vulnerabilities . . . . . . . . . . . . . . . . 38
2.4.1.1 Route validation (control-plane vulnerability) . . . . . . . . . . 38
2.4.1.2 Source address validation (data-plane vulnerability) . . . . . . . 42
2.4.2 BGP’s design limits route diversity, flexibility, and control . . . . . . . . . 44
2.4.3 BGP’s decision process does not consider route performance or capacity . . 47
2.4.4 BGP can take significant time to converge and recover after an event . . . . 50
2.5 Internet Routing Research Tools . . . . . . . . . . . . . . . . . . . . . . . . . . . 51
2.5.1 Measurement tools, platforms, and datasets . . . . . . . . . . . . . . . . . 52
2.5.2 Simulation and emulation . . . . . . . . . . . . . . . . . . . . . . . . . . 55
2.5.3 Key limitation of existing tools: lack of control and realism . . . . . . . . 56
Chapter 3: Are We One Hop Away from a Better Internet? 60
3.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 60
3.2 Measuring Internet Path Lengths . . . . . . . . . . . . . . . . . . . . . . . . . . . 63
3.2.1 Strawman approach: measuring from an academic testbed . . . . . . . . . 63
3.2.2 Approach used in this work . . . . . . . . . . . . . . . . . . . . . . . . . 64
3.2.2.1 Datasets and measurements . . . . . . . . . . . . . . . . . . . . 65
3.2.2.2 Processing traceroutes to obtain AS paths . . . . . . . . . . . . . 67
3.3 Internet Path Lengths . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 68
3.3.1 Measuring paths from the cloud . . . . . . . . . . . . . . . . . . . . . . . 68
3.3.2 Google’s interconnections . . . . . . . . . . . . . . . . . . . . . . . . . . 72
3.3.3 Estimating paths to a popular service (Google search) . . . . . . . . . . . . 74
3.3.4 Paths to other popular content . . . . . . . . . . . . . . . . . . . . . . . . 77
3.3.5 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 80
3.4 Can Short Paths be Better Paths? . . . . . . . . . . . . . . . . . . . . . . . . . . . 80
3.4.1 Short paths sidestep existing hurdles . . . . . . . . . . . . . . . . . . . . . 81
3.4.2 Short paths can simplify many problems . . . . . . . . . . . . . . . . . . . 81
3.5 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 84
Chapter 4: Engineering Egress with EDGE FABRIC 85
4.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 85
4.2 Background: Overview of Facebook’s CDN . . . . . . . . . . . . . . . . . . . . . 91
4.2.1 Points of presence . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 91
4.2.2 Mapping users to points of presence . . . . . . . . . . . . . . . . . . . . . 94
4.2.3 Routing traffic to users . . . . . . . . . . . . . . . . . . . . . . . . . . . . 96
4.2.3.1 Interconnections and route diversity . . . . . . . . . . . . . . . . 96
4.2.3.2 Facebook’s routing policy . . . . . . . . . . . . . . . . . . . . . 97
4.2.3.3 Prevalence and egress traffic per interconnection type . . . . . . 100
4.3 Problems, Goals and Design Decisions . . . . . . . . . . . . . . . . . . . . . . . . 101
4.3.1 How BGP’s limitations impact Facebook . . . . . . . . . . . . . . . . . . 101
4.3.2 Goals and design decisions . . . . . . . . . . . . . . . . . . . . . . . . . . 105
4.4 Avoiding a Congested Edge . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 109
4.4.1 Capturing network state (inputs) . . . . . . . . . . . . . . . . . . . . . . . 109
4.4.1.1 Routing information . . . . . . . . . . . . . . . . . . . . . . . . 110
4.4.1.2 Traffic information . . . . . . . . . . . . . . . . . . . . . . . . . 110
4.4.2 Generating overrides (decisions) . . . . . . . . . . . . . . . . . . . . . . . 112
4.4.3 Enacting overrides (output) . . . . . . . . . . . . . . . . . . . . . . . . . . 114
4.4.4 Deploying, testing, and monitoring . . . . . . . . . . . . . . . . . . . . . 115
4.4.5 Results on production traffic . . . . . . . . . . . . . . . . . . . . . . . . . 117
4.5 Towards Performance and Application Aware Routing . . . . . . . . . . . . . . . . 121
4.5.1 Placing traffic on alternate paths . . . . . . . . . . . . . . . . . . . . . . . 122
4.5.2 Potential use cases . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 125
4.5.2.1 Considering performance in primary and detour routing decisions 125
4.5.2.2 Optimizing use of limited capacity . . . . . . . . . . . . . . . . 126
4.6 Operational Experience . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 127
4.6.1 Evolution of EDGE FABRIC . . . . . . . . . . . . . . . . . . . . . . . . . 127
4.6.1.1 From stateful to stateless control . . . . . . . . . . . . . . . . . 128
4.6.1.2 From host-based to edge-based routing . . . . . . . . . . . . . . 129
4.6.1.3 From global to per-PoP egress options . . . . . . . . . . . . . . 131
4.6.1.4 From balanced to imbalanced capacity . . . . . . . . . . . . . . 132
4.6.2 Challenges at public IXPs . . . . . . . . . . . . . . . . . . . . . . . . . . 133
4.7 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 134
Chapter 5: A View of Internet Performance From Facebook’s Edge 135
5.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 135
5.2 Data Collection Overview and Traffic Characteristics . . . . . . . . . . . . . . . . 140
5.2.1 Facebook user traffic . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 141
5.2.2 Measurement infrastructure and dataset . . . . . . . . . . . . . . . . . . . 142
5.2.2.1 Why server-side passive measurements? . . . . . . . . . . . . . 142
5.2.2.2 Measurement approach and infrastructure . . . . . . . . . . . . 146
5.2.2.3 Dataset . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 151
5.2.3 Session characteristics . . . . . . . . . . . . . . . . . . . . . . . . . . . . 152
5.3 Quantifying Performance . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 156
5.3.1 Estimating round-trip propagation delay with MinRTT . . . . . . . . . . . 158
5.3.1.1 Defining round-trip propagation delay and its components . . . . 160
5.3.1.2 What transport RTT measurements capture . . . . . . . . . . . . 163
5.3.1.3 Estimating round-trip propagation delay from transport RTT . . 165
5.3.2 Measuring goodput with HDratio . . . . . . . . . . . . . . . . . . . . . . 167
5.3.2.1 Overview of approach . . . . . . . . . . . . . . . . . . . . . . . 168
5.3.2.2 Defining target goodput . . . . . . . . . . . . . . . . . . . . . . 170
5.3.2.3 Determining if a transaction tests for target goodput . . . . . . . 171
5.3.2.4 Measuring if a transaction achieved a testable goodput . . . . . . 178
5.3.2.5 Defining a session’s HDratio . . . . . . . . . . . . . . . . . . . 183
5.3.2.6 Other considerations . . . . . . . . . . . . . . . . . . . . . . . . 184
5.3.2.7 Limitations of approach . . . . . . . . . . . . . . . . . . . . . . 186
5.3.2.8 Alternative approaches considered . . . . . . . . . . . . . . . . 188
5.3.3 Other metrics considered . . . . . . . . . . . . . . . . . . . . . . . . . . . 190
5.3.3.1 Smoothed round-trip time . . . . . . . . . . . . . . . . . . . . . 191
5.3.3.2 Retransmissions and packet loss . . . . . . . . . . . . . . . . . 192
5.3.4 Aggregating measurements . . . . . . . . . . . . . . . . . . . . . . . . . . 194
5.3.4.1 Grouping measurements . . . . . . . . . . . . . . . . . . . . . . 195
5.3.4.2 Summarizing network conditions per aggregate . . . . . . . . . 199
5.3.5 Comparing performance . . . . . . . . . . . . . . . . . . . . . . . . . . . 200
5.3.5.1 Controlling statistical significance . . . . . . . . . . . . . . . . 202
5.3.5.2 Temporal behavior classes . . . . . . . . . . . . . . . . . . . . . 203
5.4 Does Facebook’s Rich Connectivity Yield Good Performance? . . . . . . . . . . . 204
5.5 How Does Performance Change Over Time? . . . . . . . . . . . . . . . . . . . . . 210
5.6 How Does Facebook’s Routing Policy Impact Performance? . . . . . . . . . . . . 216
5.6.1 Could performance-aware routing provide benefit? . . . . . . . . . . . . . 217
5.6.1.1 When and where are the opportunities for improvement? . . . . 220
5.6.1.2 Are opportunities practical and realizable? . . . . . . . . . . . . 228
5.6.2 Comparing peer and transit performance . . . . . . . . . . . . . . . . . . . 231
5.7 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 234
Chapter 6: PEERING: Virtualizing BGP at the Edge for Research 235
6.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 235
6.2 Goals and Key Challenge . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 239
6.2.1 Design goals . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 239
6.2.2 Challenge: native delegation with BGP and IP . . . . . . . . . . . . . . . . 241
6.2.3 Alternative approaches to delegation . . . . . . . . . . . . . . . . . . . . . 244
6.3 Virtualizing the Edge with VBGP . . . . . . . . . . . . . . . . . . . . . . . . . . 246
6.3.1 Key design decisions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 247
6.3.2 Delegation to experiments . . . . . . . . . . . . . . . . . . . . . . . . . . 248
6.3.2.1 Delegating the control plane . . . . . . . . . . . . . . . . . . . . 248
6.3.2.2 Delegating the data plane . . . . . . . . . . . . . . . . . . . . . 250
6.3.2.3 Summary of contribution . . . . . . . . . . . . . . . . . . . . . 254
6.3.3 Security and isolation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 255
6.4 PEERING: From a Router to an AS . . . . . . . . . . . . . . . . . . . . . . . . . . 257
6.4.1 Key design decisions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 258
6.4.2 Footprint and connectivity . . . . . . . . . . . . . . . . . . . . . . . . . . 261
6.4.3 Emulating a cloud provider . . . . . . . . . . . . . . . . . . . . . . . . . . 263
6.4.3.1 Backbone connectivity . . . . . . . . . . . . . . . . . . . . . . 263
6.4.3.2 Federation with CloudLab . . . . . . . . . . . . . . . . . . . . . 263
6.4.3.3 VBGP across the backbone . . . . . . . . . . . . . . . . . . . . 264
6.4.4 Experiment toolkit . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 265
6.4.5 Deploying experiments . . . . . . . . . . . . . . . . . . . . . . . . . . . . 266
6.4.6 Security policies . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 267
6.5 Development and Deployment . . . . . . . . . . . . . . . . . . . . . . . . . . . . 271
6.5.1 Engineering principles and lessons . . . . . . . . . . . . . . . . . . . . . . 272
6.5.2 Challenges in debugging and operation . . . . . . . . . . . . . . . . . . . 277
6.6 Scalability of PEERING . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 278
6.7 PEERING in Practice . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 282
6.7.1 How PEERING has been used . . . . . . . . . . . . . . . . . . . . . . . . 282
6.7.2 Native delegation is a cornerstone for generality . . . . . . . . . . . . . . . 285
6.7.3 Cooperation with network operators . . . . . . . . . . . . . . . . . . . . . 286
6.7.4 Experiments PEERING does not support . . . . . . . . . . . . . . . . . . . 287
6.8 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 289
Chapter 7: Literature Review 290
7.1 Internet Flattening . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 290
7.2 Traffic Engineering . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 295
7.2.1 Detour routing and overlay networks . . . . . . . . . . . . . . . . . . . . . 295
7.2.2 Egress traffic engineering . . . . . . . . . . . . . . . . . . . . . . . . . . 297
7.2.3 Ingress traffic engineering . . . . . . . . . . . . . . . . . . . . . . . . . . 304
7.2.4 WAN traffic engineering . . . . . . . . . . . . . . . . . . . . . . . . . . . 309
7.3 Characterizations of Internet Connectivity and Performance . . . . . . . . . . . . . 311
7.3.1 Interconnection congestion . . . . . . . . . . . . . . . . . . . . . . . . . . 311
7.3.2 Performance by route type . . . . . . . . . . . . . . . . . . . . . . . . . . 314
7.3.3 Identifying and debugging circuitous routing . . . . . . . . . . . . . . . . 318
7.4 Measuring Goodput . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 320
7.4.1 Using models to estimate the goodput a session can support . . . . . . . . 320
7.4.2 Using packet-pairs to estimate bottleneck and available bandwidth . . . . . 325
7.4.3 Using the congestion control algorithm’s estimate of bottleneck bandwidth 330
7.5 Virtualization of Network Control and Data Planes . . . . . . . . . . . . . . . . . 333
Chapter 8: Conclusions and Future Work 337
8.1 Contributions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 337
8.2 Future Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 340
8.2.1 Improvements for the remaining 20% of Internet traffic . . . . . . . . . . . 340
8.2.2 Determining the root cause of variations in performance . . . . . . . . . . 342
8.2.2.1 Possible causes of temporal degradation . . . . . . . . . . . . . 343
8.2.2.2 Opportunities to improve clustering of endpoints . . . . . . . . . 345
8.2.2.3 Opportunities in congestion detection . . . . . . . . . . . . . . . 350
8.3 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 351
Bibliography 353
List of Tables
3.1 Estimated vs. measured path lengths from RIPE Atlas vantage points to google.com
. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 76
4.1 Facebook traffic per interconnection type at example points of presence . . . . . . 99
5.1 Fraction of Facebook traffic for which a change in network performance was ob-
served during a ten day measurement study . . . . . . . . . . . . . . . . . . . . . 211
5.2 Fraction of Facebook traffic for which an opportunity to improve network perfor-
mance via performance-aware routing was observed during a ten day measurement
study . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 222
5.3 Fraction of Facebook traffic for which an opportunity to improve network perfor-
mance via performance-aware routing was observed during a ten day measurement
study, partitioned by default and alternate route interconnection types . . . . . . . 227
6.1 Capabilities provided by the PEERING experiment software to simplify and abstract
basic tasks for experimenters . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 266
8.1 The role of delegation of interdomain routing decisions in the design of PEERING
and EDGE FABRIC . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 339
List of Figures
2.1 Example scenario where a network cannot route around a failure, despite a healthy
route being available, because it is constrained by the decisions of its upstream
providers . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 45
3.1 Path lengths from a Google Compute Engine virtual machine and PlanetLab virtual
machine to iPlane and end-user destinations . . . . . . . . . . . . . . . . . . . . . 70
3.2 How many (and what fraction) of autonomous systems Google interconnects with,
by autonomous system size . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 73
3.3 Path lengths from Google.com and Google Compute Engine to end-users . . . . . 77
3.4 Path lengths from different cloud platforms to end-users. . . . . . . . . . . . . . 78
4.1 Architecture of a Facebook point of presence . . . . . . . . . . . . . . . . . . . . 92
4.2 Relative egress traffic volume for the 20 Facebook points of presence studied . . . . 93
4.3 Number of BGP prefixes that constitute 95% of traffic for each of the 20 Facebook
points of presence studied . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 94
4.4 Number of routes to BGP prefixes contributing 95% of traffic for each of the 20
Facebook points of presence studied . . . . . . . . . . . . . . . . . . . . . . . . . . 95
4.5 Distribution across Facebook points of presence of fraction of BGP prefixes that
would have experienced congestion had EDGE FABRIC not intervened . . . . . . . 102
4.6 Distribution of ratio of peak demand to capacity across interfaces at Facebook points
of presence that would have experienced congestion had EDGE FABRIC not intervened . . 102
4.7 EDGE FABRIC’s components . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 109
4.8 Utilization of interfaces relative to detour thresholds. . . . . . . . . . . . . . . . . 118
4.9 Fraction of time EDGE FABRIC detours from interfaces. . . . . . . . . . . . . . . . 119
4.10 Distributions of EDGE FABRIC detour period lengths across (PoP, prefix) pairs and
of time between detours. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 119
4.11 Fraction of traffic detoured by EDGE FABRIC across 20 PoPs and at the PoP with
the largest fraction detoured. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 120
5.1 Distribution of duration and busy times across HTTP sessions . . . . . . . . . . . 153
5.2 Distribution of bytes transferred across an entire HTTP session, and distribution of
response size for all responses and for media responses . . . . . . . . . . . . . . . 154
5.3 Distribution of transactions and bytes transferred per HTTP session . . . . . . . . 155
5.4 Sequence diagram for three back to back HTTP transactions over a single HTTP
session. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 172
5.5 Example of exponential CWND growth and ACK timings under ideal network
conditions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 181
5.6 Example of how shifts in client population can lead to changes in propagation delay
that can be misconstrued as changes in network conditions . . . . . . . . . . . . . 197
5.7 Distribution of propagation delay and HDratio over all HTTP sessions and split per
continent . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 205
5.8 Observed relationship between propagation delay and a session’s ability to support
2.5Mbps goodput . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 206
5.9 Distribution of MinRTT_P50 and HDratio_P50 degradation observed during ten day
measurement study . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 209
5.10 Time series showing diurnal degradation of goodput for clients in a mobile network 214
5.11 Time series showing diurnal degradation of propagation delay and goodput for
clients in a mobile network . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 215
5.12 Distribution of potential improvement in propagation delay and goodput that could
be achieved by shifting traffic to an alternate route, as observed during ten day
measurement study . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 219
5.13 Time series showing episodic opportunity to improve propagation delay for clients
served by a fiber ISP by shifting traffic to an alternate route . . . . . . . . . . . . . 221
5.14 Time series showing episodic opportunity to improve propagation delay for clients
served by a cable broadband ISP by shifting traffic to an alternate route . . . . . . 223
5.15 Time series showing diurnal opportunity to improve propagation delay for clients
served by a fixed broadband ISP by shifting traffic to an alternate route . . . . . . . 224
5.16 Time series showing diurnal opportunity to improve propagation delay for clients in
a mobile network by shifting traffic to an alternate route . . . . . . . . . . . . . . . 225
5.17 Difference in propagation delay and HDratio_P50 between the preferred route and the
alternate route for different groupings of primary and alternate route interconnection
types . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 232
6.1 Example experiment that PEERING was designed to support . . . . . . . . . . . . 242
6.2 How VBGP delegates control of egress route selection to experiments . . . . . . . 251
6.3 Logical locations of VBGP enforcement engines as they interpose on the data and
control planes between an experiment and the Internet . . . . . . . . . . . . . . . . 255
6.4 Architecture of a PEERING point of presence and the process that experiments use
to interface with PEERING . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 260
6.5 Example of an experiment routing its traffic across PEERING’s backbone and through
egress connectivity available at another PEERING point of presence . . . . . . . . . 264
6.6 How memory consumption and CPU utilization grow with number of routes and
rate of updates . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 281
Abstract
Today, over 80% of all Internet traffic is sourced from a small set of Content Distribution Networks
(CDNs). These CDNs have built globally distributed points of presence to achieve locality and to
facilitate regional interconnection, both of which are key to satisfying the increasingly stringent
network requirements of streaming video services and interactive applications. Content providers
rely heavily on CDNs, and many of the largest have built their own private CDNs.
Prior work has shed light on the rise of CDNs from multiple vantage points. However, we still
know little about how CDNs manage their connectivity and make routing decisions. Likewise, a
number of longstanding Internet routing problems centered around performance, availability, and
security can be attributed to fundamental issues in the design of the Border Gateway Protocol (BGP), the
protocol used to stitch together and route traffic across networks on the Internet. What implications
will the rise of CDNs have on such problems?
This dissertation sheds light on these unknowns by examining how CDN providers interconnect
and route traffic in today’s Internet, along with the opportunities and challenges that arise in this
environment. First, we execute a measurement study to uncover the connectivity of CDNs and
capture how traffic flows between CDNs and end-users on today’s Internet. We find that much of
the traffic on today’s Internet no longer traverses transit providers, a special set of networks that
interconnect all other networks on the Internet. This structural transformation has been referred to
as the flattening of the Internet’s hierarchy — while end-user ISPs and content networks historically
passed traffic (and dollars) upwards to transit providers, this hierarchy has collapsed as interconnec-
tions have been established directly between these networks. We explore how this flattening may
enable deployable solutions to longstanding Internet problems for the bulk of today’s Internet traffic.
Next, we characterize the connectivity and routing policies of Facebook, a popular content
provider that operates its own CDN, and examine the opportunities (performance-aware routing,
fault-tolerance) and challenges (capacity constraints) that arise on the flattened Internet. We explore
the design of EDGE FABRIC, a software-defined egress routing controller that we built and deployed
in Facebook’s production network that enables efficient use of Facebook’s peering interconnections
while preventing congestion at Facebook’s edge, and we develop and employ novel measurement
techniques to characterize performance for traffic between Facebook’s CDN and end-users.
Finally, we discuss how we democratized Internet routing research by building PEERING,
a community platform that enables experiments to interact with the Internet routing ecosystem.
PEERING has enabled 40 experiments and 24 publications, unblocking impactful experiments that
researchers have historically struggled to execute in areas such as security and traffic engineering.
Through this work, we demonstrate that it is possible to solve longstanding Internet routing
problems and ultimately improve user experience by combining the rich interconnectivity of CDNs
in today’s flattened Internet with mechanisms that enable routers to delegate routing decisions to
more flexible decision processes.
Chapter 1
Introduction
Internet traffic and the Internet’s structure as a whole have rapidly evolved in the past decade. As
of 2019, the vast majority of all Internet traffic is sourced from a small set of Content Distribution
Networks (CDNs), with five web properties accounting for 50% of all Internet traffic, and ten web
properties accounting for 75% [248].^1 In comparison, it took the combined traffic of 150 networks to
account for 50% of Internet traffic in 2009, and thousands of networks to do the same in 2006 [302].
The consolidation of Internet traffic and the rise of CDNs can be traced in part to the growth of
streaming video services and other applications with demanding network requirements [248]. For
instance, in 2019 streaming video services YouTube and Netflix accounted for 35% and 15% of
global Internet traffic respectively [369]. Users of these services expect playback of videos to start
quickly and proceed without stalls, even at high bitrates, and these quality of experience expectations
translate into high goodput and soft real-time latency demands of the underlying network [111].^2
^1 These networks are sometimes referred to as “hyper giants” [250], although the term is loosely defined [56].
Content providers rely on CDNs to meet these increasingly stringent network demands.
The largest content providers (e.g., Google/YouTube, Netflix, Facebook, and Microsoft, among
others) have built out their own CDNs [70, 72, 378, 442, 469], and the traffic of smaller content
providers is served by a handful of commercial CDNs [7]. For instance, in 2015 between 15% and
30% of all web traffic was served from Akamai’s CDN [88], and as of August 2020, 35% of the
top 1000 websites and 22% of the top 10,000 websites use Akamai as a CDN provider [451].
Combined, private and commercial CDNs now serve over 80% of the Internet’s traffic [248].
CDNs build points of presence (PoPs) around the world to bring content closer to users,
improving performance by decreasing latency and transfer times (§2.2 and chapter 4, [53, 70, 138,
317, 469]). In addition, CDNs use their local presence to interconnect directly with regional networks,
including establishing peering interconnections with end-user Internet Service Providers (ISPs, e.g.,
Comcast) (§2.2.3 and chapters 3 and 4). By aggressively establishing these peering interconnections,
CDNs have changed how much of the Internet’s traffic is routed. While a decade ago the Internet’s
structure followed a strict hierarchy, with most content providers relying on a handful of transit
providers (upstreams) to route traffic to users, today transit providers are cut out of the picture for
a significant fraction of Internet traffic. Instead of routing traffic (and dollars) upwards to transit
providers, CDNs use the aforementioned interconnections to pass traffic directly to end-user ISPs,
and in some cases even colocate PoPs within the networks of end-user ISPs (§2.2.3 and chapter 3, [70,
248]). As a result, the role of transit providers has diminished — the hierarchy in which they once
played a central role is no longer relevant for the bulk of Internet traffic.
^2 Likewise, multiplayer games are latency sensitive [199], and time to interaction — the amount of time required for a
webpage or application to be ready to respond to user input [435] — depends on goodput and latency [420, 471].
This structural transformation has been referred to as the “flattening” of the Internet’s hierarchy;
prior work has examined this transformation and its implications from multiple vantage points [7,
109, 159, 250]. However, a number of unknowns remain, in part due to the limitations of the vantage
points used by prior work. For instance, we know little about how CDNs manage their connectivity
and make routing decisions, and the opportunities and challenges in this environment. Likewise, a
number of longstanding Internet routing problems centered around performance, availability, and
security can be attributed to fundamental issues in the design of the Border Gateway Protocol (BGP), the
protocol used to stitch together and route traffic across networks on the Internet. What implications
will the flattening of the Internet have on these longstanding problems, and vice-versa?
Thesis statement. It is possible to solve longstanding Internet routing problems, and thereby
improve user experience, by pairing the rich interconnectivity of today’s content distribution
networks with novel mechanisms that enable routers to delegate interdomain routing decisions to
more flexible decision processes.
The work in this dissertation supports this thesis as follows:
(§3) In Chapter 3 we execute a measurement study to quantify the impact of the Internet’s flattening
on the paths between end-users and popular content. We find that direct, one hop paths
between content providers and user networks are increasingly common, with some CDNs
colocating servers inside the networks of end-user ISPs. Based on this insight, we sketch the
potential implications of the Internet’s flattening on longstanding problems and discuss how
the flattened Internet may provide footholds for simple solutions that provide benefit for the
majority of Internet traffic.
(§4) In Chapter 4 we characterize the connectivity and routing policies of a popular content
provider, Facebook. We discuss challenges associated with Facebook’s volatile traffic de-
mands and rich interconnectivity, and explain why providers like Facebook must employ
sophisticated traffic engineering systems to sidestep BGP’s limitations. We examine how
Facebook delegates routing decisions traditionally made by BGP on routers at the edge of a
network to EDGE FABRIC, a software-defined egress routing controller that we built. Criti-
cally, EDGE FABRIC enables Facebook to consider demand and capacity in routing decisions
and thereby allows Facebook to maximize utilization of preferred interconnections while
avoiding congestion that would otherwise occur due to capacity constraints. In addition,
EDGE FABRIC provides a foundation for performance and application-aware routing, both of
which we investigate further in Chapter 5.
(§5) In Chapter 5 we characterize the Internet performance observed from Facebook’s CDN
deployment, including regional and temporal trends. To do so, we developed novel techniques
to capture and interpret network performance from production traffic. In addition, we evaluate
the potential utility of performance-aware routing by using the extensions we built into
EDGE FABRIC to send a portion of production traffic via alternate paths. Our results suggest
that by establishing points of presence and rich interconnectivity around the world, CDNs
are commonly able to provide good performance for the vast majority of traffic and users,
sidestepping longstanding performance challenges.
(§6) In Chapter 6 we discuss how we democratized Internet routing research by building PEERING,
a globally distributed network that enables researchers to execute experiments that interact
with the Internet routing ecosystem. Researchers and network operators have long struggled
to make progress on well-known Internet routing problems, and one of the key challenges
is that BGP does not lend itself well to supporting experimentation. BGP is an information
hiding protocol, and thus provides little visibility into the connectivity and routing policies
of networks on the Internet. As a result, researchers have limited insight into the Internet’s
behavior, and lack the data required to accurately model it. We argue that gaining insights
into problems on today’s Internet and evaluating potential solutions requires experiments to
interact with and affect the Internet’s routing ecosystem. PEERING enables such experiments
by letting researchers take control of a production BGP network with connectivity qualitatively
similar to that of the CDNs that serve much of today’s Internet traffic. Building PEERING
required developing techniques to enable a production BGP router to multiplex and delegate
control of its data and control planes to multiple experiments in parallel, while also enforcing
safeguards to maintain security and stability. PEERING has enabled 40 experiments and 24
publications in key research areas such as security, traffic engineering, and routing policies.
Sections 1.1 to 1.4 provide more details of the supporting work, and Section 1.5 summarizes
the contributions of this work in relation to the thesis statement.
1.1 Routes Between Users and Popular Content
Content distribution networks have built global networks to deliver high-volumes of traffic to users
while achieving demanding performance and availability goals. By building points of presence
around the world, CDNs bring content closer to users, thereby reducing latency and response times.
In addition, these points of presence provide an opportunity to bypass transit providers and the
Internet’s traditional tiered structure by interconnecting directly with networks that serve end-users.
In Chapter 3, we gain insights into the connectivity of large cloud providers by executing
traceroute measurements from the vantage point of a tenant within their networks. Peering inter-
connections, especially of content providers like Google, are notoriously hard to uncover, with
previous work projecting that traditional measurement techniques miss 90% of these links [319].
Our approach — executing traceroutes from the cloud — provides us with a much broader view
than the approaches used by traditional techniques — such as traceroutes from vantage points in
end-user networks to cloud providers and CDNs — as we can measure outward to all networks
rather than being limited to a relatively small number of available vantage points. Our goal is to
understand routes between users and popular content, and so we focus our measurements by using a
trace file provided by a global CDN to winnow down the global IP address space to the portions
likely to contain users — prefixes that have previously accessed a CDN’s content are more likely to
contain users (and not servers, etc.).
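To make the methodology concrete, the sketch below shows the general shape of such a pipeline: issue a traceroute from a cloud vantage point toward an address in a candidate prefix, map each responding router to an origin AS, and collapse the router-level path into an AS-level path whose length can be counted. This is only an illustration under stated assumptions — the prefix-to-AS table is a tiny stand-in for a real dataset (e.g., one built from public BGP dumps), and ip_to_asn and as_path_from_traceroute are hypothetical helpers, not the tooling used in Chapter 3.

```python
# Hypothetical sketch (not the Chapter 3 pipeline): run a traceroute from a cloud VM
# toward a target in a client prefix, map each responsive hop to an origin AS using a
# prefix-to-AS table, and count the AS-level hops to the destination network.
import ipaddress
import re
import subprocess

# Stand-in prefix-to-AS table; a real study would load a full pfx2as dataset.
PFX2AS = {
    ipaddress.ip_network("8.8.8.0/24"): 15169,    # example entry
    ipaddress.ip_network("192.0.2.0/24"): 64500,  # example entry (documentation prefix)
}

def ip_to_asn(ip: str) -> int | None:
    """Longest-prefix match of a router IP against the prefix-to-AS table."""
    addr = ipaddress.ip_address(ip)
    matches = [n for n in PFX2AS if addr in n]
    return PFX2AS[max(matches, key=lambda n: n.prefixlen)] if matches else None

def as_path_from_traceroute(target: str) -> list[int]:
    """Run traceroute and collapse the router-level path into an AS-level path."""
    out = subprocess.run(["traceroute", "-n", "-q", "1", target],
                         capture_output=True, text=True, check=False).stdout
    hops = re.findall(r"^\s*\d+\s+(\d+\.\d+\.\d+\.\d+)", out, re.MULTILINE)
    path: list[int] = []
    for asn in (ip_to_asn(ip) for ip in hops):
        if asn is not None and (not path or path[-1] != asn):
            path.append(asn)  # de-duplicate consecutive hops inside the same AS
    return path

if __name__ == "__main__":
    path = as_path_from_traceroute("8.8.8.8")
    # A path containing only the source and destination ASes is a "one hop" path.
    print(path, "AS hops:", max(len(path) - 1, 0))
```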
Our analysis of three million traceroutes executed in mid-2015 from Google Cloud to prefixes
in the CDN trace reveals that 61% of the prefixes have one hop paths from Google Cloud, meaning
the network originating the prefix announcement peers directly with Google. In total, we estimate
that Google interconnects directly with over 5000 networks, and we estimate that paths to some
Google services are even shorter in the many networks where Google has deployed an off-net cache.
We repeat our analysis for other cloud providers and find that SoftLayer and Amazon have one
hop paths to 40% and 35% of prefixes in our CDN trace and interconnect with 1986 and 756
networks respectively.^3
To estimate the connectivity of content providers for which we cannot measure from an internal
vantage point, we execute traceroutes from RIPE Atlas vantage points (§2.5.1) inside end-user
networks towards google.com, bing.com and facebook.com. Since this approach is
limited by the coverage of RIPE Atlas vantage points, we calibrate our results by also executing
traceroutes from the same vantage points towards our Google Cloud instance. Our analysis of
these traceroutes reveals that paths to google.com are shortest, paths to bing.com and our
Google Cloud instance are nearly identical in terms of path length distribution, and paths to
facebook.com are longest (we examine Facebook’s connectivity further in Chapters 4 and 5).
^3 Our results represent estimates of lower bounds on these networks’ connectivity because we only executed traceroutes
from a single vantage point in each network.
While Google leads the pack, these numbers suggest that major cloud providers and CDNs are
expanding their connectivity, and that user traffic is increasingly sent via one hop paths. We examine
the potential implications of the flattening of the Internet hierarchy, including how these short paths
may provide footholds for simple, deployable solutions to longstanding Internet problems.
1.2 Challenges and Control Systems with Rich Interconnectivity
The traceroutes we executed in Chapter 3 enabled us to examine the routes between end-users and
popular content on today’s Internet, and our analysis establishes that CDNs often have “short”,
direct paths into user networks. However, traceroutes only capture how a network is currently
routing traffic — they do not provide insight into CDN path diversity or routing decisions, or the
opportunities and challenges that CDNs experience in this environment. We gain insights into these
aspects in Chapter 4 by taking an insider’s look at Facebook’s CDN.
Facebook is a large content provider with billions of users around the world. Facebook’s users
are served by Facebook’s private CDN, which includes dozens of points of presence around the
world. Facebook interconnects widely at each point of presence: each has at least two routes to
every destination on the Internet (via transit providers), and many points of presence have four or
more distinct routes to the users that it is designed to serve.
At first glance, we see that Facebook’s rich interconnectivity provides a number of benefits: fault
tolerance, path diversity, and potentially an opportunity for performance-aware routing. However,
because BGP is unable to consider demand or capacity when making routing decisions (§§ 2.4.2
and 2.4.3), making effective use of this connectivity — without causing congestion at the edge of
Facebook’s network — is challenging. Although most of Facebook’s interconnections operate within
their provisioned capacity, some interconnections have insufficient capacity to handle peak loads and
would become congested without intervention. During our measurement period in early 2017, we
found that 10% of such interconnections experienced a period in which Facebook’s routing policy
would lead to BGP assigning twice as much traffic as the interface’s capacity! Further complicating
matters, traffic from a Facebook PoP to a user prefix can be unpredictable, with traffic rates to the
same prefix at a given time exhibiting as much as 170x difference across weeks. Failures can also
yield sudden changes in capacity, and interconnection capacity can vary across routers in the same
point of presence. Thus, while Facebook’s rich connectivity provides shorter paths, more options for
routing, and significant capacity in aggregate, the limitations of BGP combined with the capacity
constraints of individual paths and irregular traffic make it difficult to use this connectivity.
In Chapter 4, we discuss how Facebook overcame these challenges by delegating decisions
typically made by BGP at routers to EDGE FABRIC, a software-defined egress route controller
that we built and deployed in production. EDGE FABRIC improves efficiency by placing as much
traffic on the interconnections preferred by Facebook’s routing policy as possible, and if needed,
dynamically shifts traffic to alternate routes to avoid congestion. With EDGE FABRIC, Facebook
can achieve interconnection utilization as high as 95% without packet loss. In addition, EDGE
FABRIC provides Facebook with flexibility in its routing decisions and serves as a foundation for
performance and application-aware routing, both of which we investigate further in Chapter 5.
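The sketch below illustrates, at a high level, the kind of greedy projection-and-detour loop such a controller can run: project per-interface load assuming every prefix follows its BGP-preferred egress, then detour the largest prefixes from any interface projected to exceed a utilization threshold onto alternate routes with spare capacity. The names, threshold, and data structures are illustrative assumptions, not Facebook's implementation; Chapter 4 describes EDGE FABRIC's actual design.

```python
# Illustrative greedy sketch of an egress-override controller in the spirit of the one
# described in Chapter 4; constants and structures here are assumptions for exposition.
from dataclasses import dataclass, field

DETOUR_THRESHOLD = 0.95  # start detouring once projected utilization exceeds 95%

@dataclass
class Interface:
    name: str
    capacity_bps: float
    projected_bps: float = 0.0

@dataclass
class Prefix:
    prefix: str
    demand_bps: float
    best: str                 # egress interface chosen by the BGP policy
    alternates: list[str] = field(default_factory=list)

def plan_overrides(interfaces: dict[str, Interface], prefixes: list[Prefix]) -> dict[str, str]:
    """Return {prefix: alternate_interface} overrides that relieve projected congestion."""
    # Project load assuming every prefix follows its BGP-preferred interface.
    for p in prefixes:
        interfaces[p.best].projected_bps += p.demand_bps

    overrides: dict[str, str] = {}
    for intf in interfaces.values():
        # Consider this interface's prefixes, largest demand first.
        candidates = sorted((p for p in prefixes if p.best == intf.name),
                            key=lambda p: p.demand_bps, reverse=True)
        for p in candidates:
            if intf.projected_bps <= DETOUR_THRESHOLD * intf.capacity_bps:
                break  # interface is no longer projected to be congested
            for alt in p.alternates:
                spare = DETOUR_THRESHOLD * interfaces[alt].capacity_bps - interfaces[alt].projected_bps
                if spare >= p.demand_bps:
                    overrides[p.prefix] = alt          # e.g., injected as a route override
                    intf.projected_bps -= p.demand_bps
                    interfaces[alt].projected_bps += p.demand_bps
                    break
    return overrides

# Example: one congested peering interface, one transit interface with spare capacity.
ifaces = {"peer0": Interface("peer0", 10e9), "transit0": Interface("transit0", 40e9)}
demand = [Prefix("203.0.113.0/24", 6e9, "peer0", ["transit0"]),
          Prefix("198.51.100.0/24", 5e9, "peer0", ["transit0"])]
print(plan_overrides(ifaces, demand))  # detours the larger prefix onto transit0
```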
EDGE FABRIC has been deployed in production since 2013, and our discussion of EDGE FABRIC
(as originally published in ACM SIGCOMM 2017, [378]) is the first public analysis shedding light
on such a system. We discuss how EDGE FABRIC’s design evolved over time, including changes
that improved flexibility and performance, and simplified the controller’s design.
1.3 Performance and Opportunities with Rich Interconnectivity
In Chapter 5, we expand upon our work in Chapter 4. We characterize Internet performance from
the vantage point of Facebook’s CDN and investigate the potential benefits and challenges of
incorporating real-time performance measurements into EDGE FABRIC’s routing decisions.
Our measurement study relies on a 10 day dataset composed of performance measurements
from trillions of HTTP sessions sampled at random from production traffic terminated at Facebook’s
points of presence. The dataset provides both the coverage required for a global analysis of
Internet performance and the high-volume of samples required to conduct granular analysis, such as
identifying spatial and temporal variations. Because a large share of global Internet traffic comes
from a small number of well connected CDNs with connectivity similar to Facebook’s (chapter 3),
performance measurements and conclusions that we draw from this vantage point are also likely
representative of other popular services.
Using production traffic measurements to quantify network performance and identify opportu-
nities for performance-aware routing presents several challenges. First, a connection’s ability to
provide Facebook users with a good experience is a function of the connection’s propagation delay
and ability to support a given goodput. For instance, loading a 50KB webpage in less than 200
milliseconds requires that the connection’s round-trip propagation delay — a function of the location
of Facebook’s point of presence and BGP route used to route traffic between the user and the point
of presence [243, 472] — must be less than 200 milliseconds, and further requires the connection be
capable of supporting 2Mbps+ goodput. Likewise, streaming a video encoded at 2.5Mbps requires
the connection be capable of supporting 2.5Mbps+ goodput. We want to capture the probability that
connections between Facebook’s points of presence and end-users can support the propagation delay
and goodput requirements of Facebook’s applications, and further understand how this probability
changes by end-user location, ISP, time-of-day, and the BGP route that Facebook uses to reach the
end-user. However, while a speedtest can determine the goodput that a connection can support by
exchanging data between a client and a server as quickly as possible, production traffic depends on
end-user and application behavior and thus can be restricted by non-network factors. For instance,
we find that most objects served by Facebook’s CDN are small and many connections are brief,
and we find that goodput estimates derived from such transfers will frequently under-estimate the
goodput that the connection is capable of supporting because the transfers may not exercise the
bandwidth-delay product due to their size or because congestion control is still in initial slow-start.
The novel approach that we introduce in Chapter 5 accounts for these intricacies and enables us to
differentiate between goodput restricted by network conditions (which we want to measure) and
goodput “only” restricted by sender behavior. Our approach is practical, and we have deployed it
worldwide in Facebook’s production CDN. Second, in order to separate measurement noise from
statistically significant differences, we must employ statistical tools when comparing aggregations
of performance across time and routes; this is non-trivial given the scale of our dataset.
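As a rough illustration of the first challenge, the sketch below filters individual responses before treating them as goodput measurements: a response that fits within the initial congestion window, or that spans only a few bandwidth-delay products at the target rate, can complete quickly without the path ever sustaining the target goodput, so it cannot test for it. The constants, the slow-start allowance, and the specific rule are illustrative assumptions rather than the HDratio definition developed in Chapter 5.

```python
# Hedged sketch of a per-response filter motivated above; constants and the rule are
# illustrative assumptions, not the methodology from Chapter 5.
TARGET_GOODPUT_BPS = 2.5e6   # e.g., an HD video bitrate
INIT_CWND_BYTES = 10 * 1460  # a typical initial congestion window (10 MSS)

def tests_target_goodput(response_bytes: int, min_rtt_s: float) -> bool:
    """Can this response exercise the target rate, or is it limited by its own size?"""
    bdp_bytes = TARGET_GOODPUT_BPS / 8 * min_rtt_s
    # Require the response to span several BDPs so that bytes delivered while
    # congestion control ramps up (slow start) do not dominate the measurement.
    return response_bytes > INIT_CWND_BYTES + 3 * bdp_bytes

def achieved_target_goodput(response_bytes: int, transfer_time_s: float) -> bool:
    """Did measured goodput meet the target? Only meaningful if the test above passes."""
    return (response_bytes * 8) / transfer_time_s >= TARGET_GOODPUT_BPS

# Example: a 50 KB response over a 100 ms path is too small to test 2.5 Mbps under this
# rule (the BDP is ~31 KB and slow start delivers much of it), so it is excluded.
print(tests_target_goodput(50_000, 0.100))        # False -> size/sender limited
print(tests_target_goodput(500_000, 0.100),       # True  -> large enough to test
      achieved_target_goodput(500_000, 1.2))      # True  -> ~3.3 Mbps measured
```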
Our analysis finds that the majority of user sessions have a sufficiently low round-trip propagation
delay (< 40ms) and achieve the goodput required to stream HD video. In addition, we find that
network performance between Facebook and end-user networks is relatively stable, with only 1.1%
of global traffic experiencing regular, repeated (temporal) instances of performance degradation.
We examine regional variances and find that users in Africa, Asia, and South America in particular
experience poorer performance and are more likely to experience variations in network conditions.
We investigate if incorporating real-time performance information into Facebook’s routing
decisions could yield performance benefits by executing a controlled experiment in which we
compare the performance of the route chosen by Facebook's routing policy (i.e., the primary route) against the performance of alternate routes. Using the footholds for performance-aware
and application-specific routing that we built into EDGE FABRIC in Chapter 4, we build a system
to randomly select and route a portion of Facebook’s production traffic to each end-user prefix
via alternate paths, enabling continuous measurement and comparison of the performance of the
primary and alternate routes. We find that the existing static BGP routing policy employed by
Facebook (§4.2.3) is close to optimal. Our analysis reveals that performance-aware routing decisions
(e.g., shifting traffic to an alternate route) could provide a modest improvement in latency and/or
goodput for only a few percent of global traffic — although our results do indicate more opportunity
in Africa, Asia, and South America. However, our results also show that it is not always possible to
take advantage of perceived opportunities given that a route’s performance is dependent on load,
and thus may change if (more of) Facebook’s traffic is shifted to it. These findings suggest that
sophisticated control systems would be required to take advantage of performance-aware routing.
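The sketch below (Python; not Facebook's implementation, and all names and percentages are illustrative) shows the kind of deterministic, randomized assignment such an experiment relies on: a small fraction of destination prefixes is routed via an alternate path for an epoch so that the performance of primary and alternate routes can be compared continuously.

    import hashlib

    ALTERNATE_FRACTION = 0.01   # route roughly 1% of prefixes via an alternate path

    def assign_route(prefix, routes, epoch):
        # A deterministic hash keeps a prefix on the same route for the whole
        # epoch, while re-randomizing assignments across epochs.
        digest = hashlib.sha256(f"{prefix}|{epoch}".encode()).digest()
        draw = int.from_bytes(digest[:8], "big") / 2**64
        if len(routes) > 1 and draw < ALTERNATE_FRACTION:
            # Pick one of the alternate routes, again deterministically.
            index = 1 + int.from_bytes(digest[8:16], "big") % (len(routes) - 1)
            return routes[index], "alternate"
        return routes[0], "primary"   # routes[0] is the route preferred by policy

    if __name__ == "__main__":
        routes = ["peer:AS64500", "transit:AS64510", "transit:AS64520"]
        for prefix in ["192.0.2.0/24", "198.51.100.0/24", "203.0.113.0/24"]:
            print(prefix, assign_route(prefix, routes, epoch="2021-12-01T00"))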
From our analysis, we conclude that CDNs are able to provide good performance for the
vast majority of traffic and users. By establishing points of presence around the world with
rich connectivity, CDNs have sidestepped longstanding problems that have traditionally degraded
performance.
1.4 Advancing Internet Routing Research and Innovation
A number of longstanding Internet problems centered around performance, availability, and security
can be attributed to fundamental issues in BGP’s design (§2.4), and the flattening of the Internet
raises new questions, challenges, and opportunities. For instance, in Chapter 3 we speculate that it
may be easier to make progress on some of these problems if we limit the focus of our solutions to
the paths that carry the majority of Internet traffic on today’s flattened Internet, while in Chapters 4
and 5 we examine opportunities and challenges CDNs face on the flattened Internet.
However, it has historically been difficult for researchers to make progress on such topics, in part
because BGP does not lend itself well to supporting experimentation: BGP is an information hiding
protocol and thus provides little visibility into the connectivity and routing policies of networks on
the Internet [65, 466]. As a result, emulation and simulation cannot accurately model the Internet,
and even when a tool, such as a looking glass interface, is able to provide visibility into BGP’s
state, that tool cannot predict how that state would change if an event occurred. The flattening of
the Internet further reduces the utility of existing tools given that they have limited visibility into
the peering interconnections between CDNs and end-users that now carry the bulk of the Internet’s
traffic (§2.5.3, [319]).
Given these limitations, we conclude that experiments need to interact with and affect the
Internet’s routing ecosystem, taking control of a real production network and its connectivity,
policies, and traffic. For instance, evaluating the opportunity for performance-aware routing in
Chapter 5 requires controlling Facebook’s production traffic — a model would have been unable to
predict the results discussed given that they depend on conditions beyond the edge of Facebook’s
network (§5.6.1.2). But executing this experiment required extensive analysis of risk, building a
production control system with a number of safeguards, and coordinating with Facebook’s network
operations. And the ability to execute such an experiment is uncommon: network operators are
often unwilling to allow experimentation on a production network due to the potential wide-ranging,
negative effects [357].
As a result, despite the clear value of experiments capable of interacting with and affecting
the Internet’s routing ecosystem, it is rare to have the opportunity to execute such experiments
on a production network, especially at a regular cadence, and the alternative — deploying a well
connected network to run an experiment — requires significant time and resources, making it
impractical in the vast majority of cases.
In Chapter 6, we discuss how we removed the barriers to such experiments and democratized
Internet routing research by building PEERING, a globally distributed network (autonomous system
AS47065) open to the research community. PEERING has points of presence at 15 locations,
each of which has a router that maintains at least one transit interconnection (and corresponding
BGP control-plane session) that connects PEERING to the real Internet. In addition, a subset of
PEERING’s points of presence maintain peering interconnections with tens or hundreds of other
networks via public fabrics at Internet Exchange Points (§2.2.2).
Akin to how a hypervisor multiplexes a physical host's resources across virtual machines,
PEERING virtualizes each point of presence router’s data and control plane interactions with other
networks and delegates control to experiments. This approach enables multiple experiments to
run in parallel while providing each experiment with the same control and visibility it would have
with its own (non-virtual) router and interconnections at the point of presence. In addition, this
virtualization layer enables PEERING to maintain security by interposing between experiments and
the Internet on both planes and blocking any potentially harmful actions.
However, building PEERING is non-trivial because such virtualization is inherently unsupported
by BGP’s design — because BGP is an information hiding protocol, a traditional BGP router applies
policy and makes routing decisions locally, routes all traffic to a destination via the route chosen by
its decision process, and only shares with neighbors (at most) its chosen route. If PEERING were to operate under this set of limitations, each PEERING experiment would need to modify the point
of presence router’s configuration to control data and control plane decisions; granting that ability
is equivalent to giving root access to experiments, which is untenable from a security perspective.
We discuss how our solution to these challenges enables a BGP router to delegate its data and
control plane interface via a novel combination of IP and Layer-2 manipulation and intradomain
BGP advertisements. By delegating the data and control plane in this manner, PEERING can support
multiple experiments simultaneously while allowing those experiments to make independent routing
decisions at a per-packet granularity. And because PEERING’s approach to delegation is protocol
compliant and does not rely on extensions to BGP or custom protocols, it is fully compatible
with existing routers and BGP implementations; experiments that run on PEERING are directly
transferable to native networks, and vice versa.
All combined, PEERING provides experiments with safe, turn-key control of a global network
with connectivity qualitatively similar to that of a CDN provider. PEERING’s rich connectivity and
flexibility have enabled it to support over 40 experiments and 24 publications to date [20, 21, 47, 48,
49, 50, 137, 142, 200, 263, 288, 297, 323, 347, 366, 378, 381, 392, 397, 406, 411, 412, 413, 439].
1.5 Summary of Contribution
In this dissertation, we examine how the rise of CDNs and peering interconnections have transformed
the Internet’s architecture, uncovering both opportunities and challenges. In addition, we describe
how the design of BGP — the Internet’s routing protocol — is responsible for longstanding Internet
performance, availability, and security problems, and how BGP’s design further makes it difficult to
take advantage of the opportunities that arise with a flattened Internet. In response, we design and
deploy novel mechanisms that enable routers to delegate routing decisions traditionally made by
BGP to more flexible decision processes, and demonstrate that combining such delegation with the
rich interconnectivity of today’s CDNs enables progress on longstanding Internet routing problems.
Through our work, we (1) inform the community of key changes in the Internet’s architecture
along with associated opportunities and challenges, and the implications on performance; (2) share
key insights into the design and deployment of control and measurement systems that are key to
operating a large CDN on today’s flattened Internet; and (3) remove longstanding barriers to Internet
routing research. We make the following contributions:
We show that a sizable fraction of Internet traffic now traverses a one-hop path. The vast majority of today's Internet traffic is between large CDNs and end-user ISPs, but capturing
how this traffic is routed across the Internet is challenging, in part because traditional measurement
techniques are often unable to observe peering interconnections. By measuring outwards from
vantage points within major cloud providers towards end-user prefixes, our measurements uncover
these interconnections, and our analysis reveals that Google and other major cloud providers and
CDNs are now able to send a significant fraction of user traffic via one hop paths. This architectural
shift has implications for the research and operational community, and we sketch how these short
paths can enable deployable solutions to longstanding Internet routing problems.
We share an insider’s look at a major CDN’s connectivity and uncover capacity constraints
that arise on a flattened Internet. We show that widely interconnecting can offer CDNs a number
of benefits — shorter paths, fault-tolerance, significant capacity in aggregate, and potentially an
opportunity for performance-aware routing — but that CDNs are unable to take advantage of these
opportunities due to the limitations of BGP. In particular, we show how capacity constraints of
peering interconnections combined with volatility in demand and failures can make it impossible for
a CDN to make effective use of its connectivity with traditional BGP.
We present the design of a novel software-defined egress control system that enables CDNs
to make efficient use of their connectivity, and we share insights from production. EDGE
FABRIC enables CDNs to sidestep the limitations of traditional BGP and make efficient use of
interconnections while avoiding congestion that would degrade performance. We discuss how our
novel design provides the flexibility and foundation necessary for CDNs to incorporate dynamic
signals — such as interface utilization, capacity, and real-time performance signals — into routing
decisions. We share architectural details, including how we delegated routing decisions from BGP
at routers to EDGE FABRIC, how EDGE FABRIC decides which traffic to shift, and how the design
of EDGE FABRIC’s control loop eases development and testing. We share insights into design
trade-offs and open challenges, such as why it is difficult for CDNs to make effective use of IXP
capacity. While EDGE FABRIC is deployed in production within Facebook’s network, any network
with similar connectivity likely requires a similar controller to make effective use of its connectivity [389, 469].
We develop novel techniques that enable the accurate characterization of network perfor-
mance from production traffic, and use these techniques to examine Internet performance
and opportunities from a CDN’s vantage point. We share a principled approach to capturing
insights into network conditions and comparing the performance of different routes to a destination.
We explain why capturing insight into achievable goodput from passive traffic measurements is
challenging, and share our novel approach, which enables us to differentiate between goodput
restricted by network conditions (which we want to measure) and goodput “only” restricted by
sender behavior. Our measurement techniques have been deployed in a CDN’s production network
for over two years and are applicable in a variety of environments. In addition, we share a concrete
methodology for capturing and converting network metrics into performance-aware routing deci-
sions. Using the techniques that we developed, we explore Internet performance and opportunities
for performance-aware routing using trillions of measurements captured from production traffic
from a CDN’s vantage point, while in parallel surfacing challenges that help guide future work.
Given the coverage of the dataset, and because a large share of global Internet traffic comes from a
small number of well connected content and cloud providers with similar connectivity, our analysis
likely reflects end-user performance to popular services in general.
We enable impactful routing research by building a production network and developing novel
mechanisms that give experimenters control while maintaining safety and stability. By build-
ing PEERING, we removed obstacles that have long beset research integral to understanding and
improving Internet routing. We share our design of novel mechanisms that we developed to multiplex
and delegate control of our production network to PEERING experiments, and how the resulting
control and realism has enabled over 40 experiments and played a role in 24 publications [20, 21,
47, 48, 49, 50, 137, 142, 200, 263, 288, 297, 323, 347, 366, 378, 381, 392, 397, 406, 411, 412, 413,
439]. In addition to democratizing Internet routing research, the mechanisms used by PEERING for
delegation are generalizable and demonstrate that such delegation is possible without the need for
extensions to BGP or custom protocols.
Chapter 2
Background
The Internet is composed of thousands of networks, including those of consumer Internet service
providers (e.g., Comcast, Verizon), universities, businesses, and transit providers. Networks
interconnect by establishing physical circuits between their routers and then using the Border
Gateway Protocol (BGP) to exchange routes to network address space. In this section, we introduce
the Border Gateway Protocol and other fundamentals of routing between such networks (known as
interdomain routing), and provide an overview of open problems in the space.
2.1 The Border Gateway Protocol (BGP)
The Border Gateway Protocol (BGP) is one of the foundational building blocks of the modern
Internet. BGP enables networks to exchange routes and traffic with minimal coordination, and its
design both enables and promotes fault-tolerance.
2.1.1 Interconnecting autonomous systems (ASes) with BGP
A group of routers and networks under a single administrative domain is known as an autonomous
system and is uniquely identified by an autonomous system number (ASN) [65, 192, 466]. Routing
between autonomous systems (ASes) is referred to as interdomain routing, while routing within an
autonomous system is referred to as intradomain routing [192].
When two ASes interconnect, they typically establish a physical circuit — sometimes referred
to as a private network interconnection (PNI) — between their routers, and then use this circuit to
establish a Border Gateway Protocol (BGP) control-plane session, which is used to exchange routes
between the ASes (§2.2.2, [65, 150, 177, 466]). BGP sessions are typically established between
hardware routers, but software routers, such as BIRD [427] and Quagga [222], are also used in
production networks (Chapters 4 and 6, [350, 469]).
Each AS uses BGP to announce a list of routes to neighboring ASes (neighbors); each route
contains a single block of contiguous network address space (an IP prefix) that the announcing AS
will accept traffic for [466]. An AS may announce (originate) a route for a prefix that it controls, and
may announce (redistribute) routes received from its neighbors for other prefixes. If the announcing
AS can no longer accept traffic for a prefix, it sends a withdraw message to its neighbors. Route
changes are communicated by announcing a route with updated information; updates invalidate any
previously announced route.
(In Chapter 4, we discuss how Facebook uses a combination of hardware routers and software controllers that speak BGP to manage interdomain routing, and in Chapter 6 we discuss how the PEERING research platform uses BIRD.)
BGP is a path-vector protocol: each time a route is exchanged between ASes, the ASN of the announcer is added to the end of the AS_PATH. As a result, the AS_PATH is a list that starts with the ASN that originated the route, followed by the ASNs of the ASes that the route was redistributed through.
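As a concrete illustration of this behavior (using example ASNs from the documentation range, not real networks), the following sketch shows the AS_PATH growing as a route is originated and then redistributed:

    def redistribute(route, asn):
        # Each AS that redistributes the route appends its ASN to the AS_PATH.
        return {"prefix": route["prefix"], "as_path": route["as_path"] + [asn]}

    if __name__ == "__main__":
        # AS 64500 originates a route for a prefix it controls.
        route = {"prefix": "192.0.2.0/24", "as_path": [64500]}
        # The route is then redistributed through AS 64510 and AS 64520.
        for asn in (64510, 64520):
            route = redistribute(route, asn)
        print(route)   # {'prefix': '192.0.2.0/24', 'as_path': [64500, 64510, 64520]}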
2.1.2 How BGP makes routing decisions
Each router stores all of the routes received from neighbors via BGP in a routing table. When a
router has multiple routes for the same prefix, the BGP best path selection algorithm selects a single
best path, making a decision based on the properties of each route and locally defined policy [466].
In Section 2.3.1 we discuss how an AS’s routing policy can be used to influence the best path
selection algorithm.
The BGP best path selection algorithm primarily uses three route attributes:
1. The Local Preference (LOCAL_PREF), an attribute assigned locally based on the router's ‘import’ policy. LOCAL_PREF is used to encode arbitrary routing policies, such as preferring routes from specific neighbors or with specific attributes (such as an AS in the AS_PATH); these preferences are enacted by raising the LOCAL_PREF of matching routes. (For example, LOCAL_PREF is used in Chapter 4 to force Facebook's peering routers to prefer routes injected by EDGE FABRIC, Facebook's egress routing controller.)
2. The AS_PATH length. Shorter paths are preferred.
3. The Multi-Exit Discriminator (MED), a metric set by the neighboring AS that exported the
route. The metric indicates the AS’s preference for receiving traffic for the given destination
via the corresponding interconnection, relative to other interconnections where the same AS
is exporting a route to the same destination [65, 124, 299, 466].
If necessary, additional attributes such as the BGP peer ID are used to break ties [466]. Once a
router selects a best path, it uses this path to forward all packets destined for the prefix. In addition,
the router may advertise the path to its neighbors, in accordance with its route export policy.
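A minimal sketch of this decision order is shown below (Python; heavily simplified, with each candidate route represented as a small record). Real implementations include additional steps, and by default only compare MED between routes received from the same neighboring AS; the routes here are hypothetical.

    def best_path(routes):
        # Simplified BGP best path selection: highest LOCAL_PREF, then shortest
        # AS_PATH, then lowest MED, then lowest peer ID as a final tie-breaker.
        return min(
            routes,
            key=lambda r: (-r["local_pref"], len(r["as_path"]), r["med"], r["peer_id"]),
        )

    if __name__ == "__main__":
        candidates = [
            {"peer_id": "198.51.100.1", "local_pref": 100,
             "as_path": [64500, 64510], "med": 0},
            {"peer_id": "198.51.100.2", "local_pref": 200,   # wins: higher LOCAL_PREF
             "as_path": [64500, 64530, 64520], "med": 50},
        ]
        print(best_path(candidates))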
When a router receives a BGP update, whether it is an update or withdrawal of an existing route
or the announcement of a new route, the BGP path selection algorithm may need to be re-executed,
and the routes announced to neighbors may change. If a router had previously announced a path for a prefix to a neighbor and, following a BGP update or policy change, chooses to no longer advertise
a route to that neighbor, it will withdraw the previously announced route. When all routers have
settled on a route to the destination, the BGP protocol has converged. As discussed in Section 2.3.1,
BGP convergence is not guaranteed but can be achieved if all autonomous system routing policies
are compliant with the Gao-Rexford Model.
In environments where BGP multipath is used, a router will apply the path selection logic to
identify all paths that are equivalent to the selected best path and then spread traffic across these paths (often using Equal Cost Multipath (ECMP) [203]), but will still only export (at most) a single best path to neighbors. (The criteria used to identify equivalent routes for multipath are not standardized and vary across vendors.)
2.2 Points of Presence and Interconnections
2.2.1 Points of presence
In the previous section we discussed how ASes interconnect by establishing physical circuits — com-
monly known as Private Network Interconnections (PNI) — between their routers. ASes may
establish a point of presence (PoP) at an Internet Exchange Point (IXP) to facilitate these intercon-
nections [6, 87, 273]. An IXP is a colocation facility, or a group of colocation facilities, specifically
designed to support the dense interconnection of networks.
In general, interconnecting widely increases path diversity, thereby improving fault-tolerance
and potentially providing opportunities to improve performance and/or reduce cost [9, 10, 243,
315, 424, 472]. ASes with large geographic footprints commonly interconnect at multiple PoPs for
redundancy and to minimize circuitous routing by enabling traffic to be exchanged regionally [243,
472]. CDNs also build out PoPs around the world to improve user experience and facilitate direct
interconnections with end-user ASes / Internet Service Providers (ISPs), and in doing so have
changed the structure of the modern Internet (§2.2.3).
2.2.2 Types of interconnection
Interconnections can be broadly classified into two categories: transit and peering [65, 149, 150,
177, 315]. The type of interconnection largely determines what routes are exchanged between ASes
and the commercial relationship (if any) between them.
Transit interconnections In a transit interconnection, one of the networks is a customer or
downstream and the other is a transit provider or upstream. Customers (typically) pay the transit
provider for connectivity to the entire Internet, and transit providers announce to their customers a
route to and accept traffic for every destination on the Internet. Likewise, customers announce routes
to destinations in their networks to their transit providers, and the transit providers are responsible
for propagating these announcements to the entire Internet. Transit interconnections are typically
established via PNIs, although in some cases they are established across the shared network fabric
of a public IXP (see Public IXPs and Route Servers, below) [316].
Peering interconnections In a peering interconnection, both parties only announce routes to
destinations within their networks or their customers' networks (their customer cone). Peering
interconnections can be established between the networks for free (known as settlement-free in-
terconnection), or one of the networks can pay the other (known as paid peering) [315]. Some
networks have an open peering policy and will establish a settlement-free interconnection with any
other network, while others have a restrictive peering policy and will only establish interconnections
under certain conditions [315].
In general, a network must have a PoP at an IXP to establish peering interconnections. IXPs
centralize networks in a way that reduces the overhead of setting up interconnections. Each network
makes the requisite investments to establish connectivity and set up physical infrastructure at the
IXP (e.g., backhaul and routers) and then can use their locality to other networks to quickly establish
interconnections. Some networks establish peering interconnections without having a physical
presence at an IXP by paying another network to interconnect and backhaul traffic on their behalf, a
practice known as remote peering [79].
Public IXPs and route servers Peering interconnections are commonly established via PNI.
However, peering interconnections can also be established via shared network fabrics offered by
public IXPs. A shared fabric removes the need to establish a PNI to interconnect. Instead, each
participant establishes a circuit to the shared switch provided by the public IXP, and then participants
establish bilateral BGP sessions and exchange traffic using the connectivity provided by the shared
switch. In addition to reducing overhead for local participants, public IXPs significantly reduce
overhead for remote peers, since a remote peer can establish multiple interconnections with just
a single backhaul connection to the public IXP’s shared fabric [79]. Public IXPs can be operated
by the existing colocation facilities that comprise an IXP, or by an independent organization that
establishes a presence and a network fabric within existing IXP facilities [87, 273, 315, 327, 350].
Some public IXPs also operate route servers connected to the shared fabric to remove the need
for participants to establish bilateral BGP sessions [6, 195, 350]. Instead, participants establish a
BGP session with a route server and the route server reflects routes received from other participants.
However, route servers can limit path diversity: if the route server has multiple paths to a destination,
it will perform the standard BGP best path computation process and only announce a single best
path to participants.
Types of peering interconnections We differentiate between the various subtypes of peering
interconnections as follows:
• Private peers: A dedicated private network interconnection (PNI) is used to set up a bilateral
BGP control-plane session, exchange routes, and exchange traffic.
• Public peers: The shared network fabric of a public IXP is used to set up a bilateral BGP
control-plane session, exchange routes, and exchange traffic.
• Route server peers: A BGP control-plane session is established with a public IXP’s route
server via a shared network fabric. For each session, the route server accepts advertised routes
and reflects routes received from other peers, enabling participants to exchange routes without
setting up bilateral control-plane sessions. Traffic is exchanged between participants directly
via shared network fabric.
(Almost) every network has a transit interconnection Because peering interconnections do
not provide connectivity to the entire Internet, every network must maintain at least one transit
interconnection [69, 428]. Tier-1s are the exception to this rule: these networks are large transit
providers that have agreed to establish settlement-free peering interconnections with one another on
a quid pro quo basis [69, 149, 150, 177, 315, 428].
Because all tier-1 networks are interconnected,
they form a global fabric interconnecting all of their direct and indirect customers.
(As of 2016, there are 16 tier-1 networks [69]. A few of these networks, such as the Energy Sciences Network (ESnet), are not transit providers, but instead have negotiated settlement-free interconnections with all other tier-1s. Likewise, some networks only have tier-1 status in certain parts of the world [315].)
2.2.3 Growth in peering interconnections and the “flattening” of the Internet
CDNs have built points of presence around the world to provide good user experience given the
network requirements of streaming video and other demanding content (§2.2.3 and chapter 3).
These points of presence reduce the distance (and thus latency) between users and content, and
enable CDNs to interconnect directly with regional networks, including establishing peering inter-
connections with end-user Internet Service Providers (ISPs, e.g., Comcast) (§2.2.3 and chapters 3
and 4, [55, 70, 174, 391, 461, 469]). Many CDNs are also participants at public IXPs, as they enable
CDNs to establish peering interconnections with smaller networks for which the overheads of a PNI
may not be justified (chapter 4, [6, 87, 327]).
Peering interconnections can reduce cost for both parties and improve performance by eliminat-
ing circuitous routing. However, large CDNs often pursue peering interconnections because they
are necessary for the CDN to be able to deliver traffic to end-user networks in a congestion-free
manner — the capacity of transit networks is often insufficient to deliver the volume of traffic served
by large CDNs (§§ 4.2 and 5.6.1.2, [462]). In some cases, CDNs even colocate caching appliances
within end-user ISP networks to reduce demand on the ISP’s infrastructure [55, 70, 174].
All combined, the demands of modern Internet traffic have spurred tremendous growth in peering
interconnections, and this growth in turn has changed the Internet’s structure (chapter 3, [70, 79,
109, 248, 250, 273, 302, 315, 350, 461, 462, 469]). While a decade ago the Internet's structure
followed a strict hierarchy, with traffic between content providers and end-users flowing through
a handful of transit providers (“upstreams”), today transit providers are cut out of the picture for
a significant fraction of Internet traffic. Instead of routing traffic (and dollars) upwards to transit
providers, CDNs use the aforementioned peering interconnections and caches to pass traffic directly
to end-user ISPs. As a result, much of the Internet’s traffic, including traffic served by commercial
CDNs and private CDNs operated by content providers such as Google, Facebook, and Netflix, is
now sent through peering interconnections (chapters 3 and 4, [461, 469]), and the role of transit
providers has diminished — the hierarchy in which they once played a central role is no longer
relevant for the bulk of Internet traffic.
This structural transformation has been referred to as the “flattening” of the Internet’s hierar-
chy, and prior work has examined this transformation and its implications from multiple vantage
points (§7.1). However, a number of unknowns remain, in part due to the limitations of the vantage
points used by prior work. For instance, we know little about how CDNs manage their connectivity
and make routing decisions, and the opportunities and challenges in this environment. We explore
these aspects further in Chapters 3 to 5, and develop a platform that enables the research community
to conduct impactful research on the flattened Internet in Chapter 6.
2.3 Routing Policies and Traffic Engineering
2.3.1 How interdomain routing policies are designed
An autonomous system can influence BGP's best path selection algorithm (§2.1.2) by defining a routing policy that changes route attributes such as LOCAL_PREF and MED based on conditionals that consider other route attributes, such as whether an ASN appears in the AS_PATH. In this section, we discuss common elements of interdomain routing policies.
2.3.1.1 Gao-Rexford model
Global convergence of BGP’s best path decision process (§2.1) is not guaranteed by the protocol’s
design; the routing policies used by networks can interact in ways that prevent convergence. Gao et
al. [150] proposed a set of guidelines that ASes can use when defining their routing policies. Following
the guidelines guarantees global convergence of BGP while still providing the flexibility and routing
policies required to support different types of interconnections and commercial relationships.
These guidelines, commonly referred to as the “Gao-Rexford model”, define export rules
(e.g., when and over which interconnections a network should announce / redistribute a route) and
preference rules (e.g., how a network should select among available routes).
The Gao-Rexford Model guidelines are as follows:
Preference guidelines:
• Prefer routes from customers.
• Commonly implemented by defining an import policy that increases the LOCAL_PREF of routes received from customers above that of any route from a peer or transit.
Export guidelines:
• Customer routes can be exported via all interconnections.
• Peer and provider routes are only exported to customers. This rule, also known as valley-free routing, stipulates that a network does not provide transit connectivity for its providers or peers.
• Commonly implemented by using communities to label routes received from providers and peers, and an export policy that allows such labeled routes to be redistributed only to customers (a minimal policy sketch follows this list).
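The sketch below (Python; LOCAL_PREF values and relationship labels are illustrative, not a production policy) shows how the preference and export guidelines above can be expressed when each interconnection is tagged with its business relationship:

    LOCAL_PREF = {"customer": 300, "peer": 200, "provider": 100}   # illustrative values

    def import_route(route, learned_from):
        # Preference guideline: routes learned from customers receive the highest
        # LOCAL_PREF, so they win in the best path selection of Section 2.1.2.
        return {**route, "local_pref": LOCAL_PREF[learned_from],
                "learned_from": learned_from}

    def should_export(route, to_relationship):
        # Export guidelines (valley-free routing): customer routes are exported
        # everywhere; peer and provider routes are exported only to customers.
        if route["learned_from"] == "customer":
            return True
        return to_relationship == "customer"

    if __name__ == "__main__":
        r = import_route({"prefix": "192.0.2.0/24", "as_path": [64500]}, "peer")
        for to in ("customer", "peer", "provider"):
            print("export peer-learned route to", to, "->", should_export(r, to))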
2.3.1.2 Incorporating performance, cost, and backbone utilization
BGP does not measure or expose a path’s performance or capacity. In addition, because the BGP
best path selection process makes decisions based on static policies that operate over the fields in
BGP route updates (e.g.,AS_PATH), there is no standardized method to incorporate such dynamic
signals into BGP’s decision process (§2.4.3).
As an alternative, network operators incorporate intuition and other insights into the process by
adding heuristics to the routing policy. Routing policies commonly implement such heuristics by manipulating
LOCAL_PREF such that routes are preferred in the following order: (1) routes from customers
(necessary for a network to be compliant with the Gao-Rexford model); (2) routes from peers; (3)
routes from transit providers [65, 356]. These preferences, combined with BGP’s preference for
paths with a shorter AS_PATH (§2.1), are likely to align with performance — especially if the peer or
customer is the destination network given that all other routes would be (at least logically) circuitous
under such circumstances — and cost in cases where the peering interconnection is free. In addition,
the traffic demands of large CDNs often necessitate the use of peering interconnections because
routes via transit providers may have insufficient capacity — in such environments, the establishment of peering interconnections, and a preference for routing traffic via them, is necessary to avoid congestion (§§ 2.2.3, 4.2 and 5.6.1.2, [462]).
Because BGP provides only coarse control of routing decisions, an AS can generally either
incorporate the MED values received from neighbors or discard them and optimize based on its own
priorities (e.g., minimizing backbone utilization) [399]. This decision determines whether an AS’s
routing policy is early-exit, in which an AS passes traffic as quickly as possible to the next AS in the
path, or late-exit, in which an AS backhauls traffic to the next AS's preferred interconnection point (as signaled by MED).
2.3.1.3 Optimizing routing decisions with software-defined networking (SDN)
Prior work has sought to improve interdomain and intradomain routing efficiency through the use of
software defined networking (SDN), a network architecture in which control-plane decisions, such
as routing and forwarding decisions, are delegated to a software controller. The SDN controller is
often logically centralized and responsible for control-plane decisions that would traditionally be
made by multiple devices.
Routing Control Platform (RCP) [64, 131] is commonly cited as one of the first examples of
SDN [134, 191, 359, 385]. RCP centralized aspects of an AS’s interdomain routing decisions by
delegating routing decisions from BGP on routers to a centralized software controller. RCP then
used this centralization to jointly consider the cost of backhauling traffic across the AS's backbone and the
MED values included in routes received from neighboring ASes in interdomain routing decisions,
enabling an AS to define a routing policy that operated between the extremes of early-exit and
late-exit [64, 131].
EDGE FABRIC (discussed in Chapter 4) is similar to RCP and another application of software-
defined networking. Both RCP and EDGE FABRIC decouple the BGP decision process from routers
and involve a controller receiving route updates and injecting decisions using BGP. However, while
RCP’s goal was to enable optimization of intradomain routing, EDGE FABRIC focuses on enabling
more flexible interdomain routing decisions, in part to prevent congestion. Other related applications
of SDN in interdomain and intradomain routing, and traffic engineering, are discussed in Section 7.2.
2.3.2 How CDNs direct traffic to their points of presence
In Section 2.2.3 we discussed how CDNs build PoPs around the world to improve user experience.
In order to best make use of this infrastructure, CDNs must direct user requests to the PoP that
provides the best performance. In parallel, CDNs must be able to manage which requests are routed
to a PoP to prevent overload and to be able to redirect traffic for maintenance. Each network makes
two architectural decisions that determine how this mapping occurs [35].
First, the CDN must decide how to announce its address space. A network can announce the
same address space from all PoPs or announce a separate address space from each PoP. Announcing
address space from multiple PoPs maximizes the number of route options available between an
end-user ISP and the CDN network, and may allow traffic to ingress into the CDN’s network via a
physically and/or logically shorter path. However, when address space is announced from multiple
PoPs, a CDN is unable to control the PoP at which traffic ingresses (§2.4.2). When address space is
announced at multiple PoPs, connections can either be terminated at the PoP where traffic ingresses
(Anycast), or if the CDN has a backbone, can be backhauled and terminated at another PoP/cluster
based on discriminators such as the source or destination IP address, or current load.
Second, the CDN must decide how to direct requests to its address space. Three approaches
are common [35]; in some cases a CDN may use a combination of approaches.
Anycast If Anycast is used (i.e., address space is announced from multiple PoPs and connections are terminated at the PoP where traffic ingresses), a CDN's authoritative DNS server can
direct all DNS requests to the same IP address and BGP will route requests to a PoP. Because
BGP does not incorporate performance information into its routing decision process (§2.4.3), traffic
may be routed to a suboptimal PoP, degrading performance [32, 46, 243, 399]. In addition, BGP
routing changes can cause active connections to be suddenly terminated at a different PoP, disrupting
ongoing TCP connections [15, 16, 57, 247, 340]. However, recent work suggests that Anycast
generally provides good performance and that disruptions are not common [71, 140, 262, 263, 455];
this may be because the growth in peering interconnections and the corresponding flattening of the
Internet (§2.2.3 and chapter 3) has reduced the length of Internet paths and correspondingly reduced
the potential for routing volatility.
DNS response based on LDNS and EDNS0 client-subnet A CDN’s authoritative DNS
server can respond to requests with the IP address assigned to compute at a specific PoP (or DC).
The IP address for each DNS request can be determined based on which DNS resolver (known as
the local DNS server or LDNS) forwarded the user’s request to the authoritative DNS server [35,
317], or based on the IP address of the user that the request originated from if EDNS0 client-subnet
support is available [88, 106]. The CDN must build a mapping between LDNS and/or client IP
addresses and the optimal PoP through a separate measurement system [35, 88]. If the address space
is announced from multiple PoPs, the DNS response will not determine the PoP at which traffic will ingress (BGP will still determine the latter).
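The sketch below (Python) illustrates this mapping step under the assumption that a separate measurement system has already produced a client-prefix-to-PoP mapping; the PoP names, prefixes, and addresses are hypothetical. The authoritative server prefers the EDNS0 client-subnet when present and otherwise falls back to the LDNS address.

    import ipaddress

    # Hypothetical mapping from client prefixes to the PoP judged best for them,
    # produced out-of-band by a measurement system.
    POP_FOR_PREFIX = {
        ipaddress.ip_network("203.0.113.0/24"): "pop-a",
        ipaddress.ip_network("198.51.100.0/24"): "pop-b",
    }
    POP_VIP = {"pop-a": "192.0.2.10", "pop-b": "192.0.2.20"}
    DEFAULT_POP = "pop-a"

    def resolve(ldns_addr, client_subnet=None):
        # Prefer the EDNS0 client-subnet when present; otherwise fall back to the
        # address of the resolver (LDNS) that forwarded the query.
        key = (ipaddress.ip_address(client_subnet.split("/")[0]) if client_subnet
               else ipaddress.ip_address(ldns_addr))
        for prefix, pop in POP_FOR_PREFIX.items():
            if key in prefix:
                return POP_VIP[pop]
        return POP_VIP[DEFAULT_POP]

    if __name__ == "__main__":
        print(resolve("198.51.100.53"))                     # mapped via LDNS address
        print(resolve("198.51.100.53", "203.0.113.0/24"))   # mapped via client-subnet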
URL rewriting Similar to DNS manipulation, a CDN can rewrite the URLs of content
embedded in webpages, such as images and videos, so that the content is fetched from a specific PoP. For instance, for a dynamically generated webpage the decision can be made
based on the IP address that the HTTP request was received from [2, 35]. This approach provides
more precision compared to LDNS as the provider knows the exact IP address of the client and
can make a decision on a per-object basis. With this precision, the CDN can consider more aspects
when deciding which PoP to send the request to, such as whether the object that the URL points
to is cached. However, this approach is only possible in cases where URLs are being generated,
such as dynamically generated webpages or in manifests used for video. Prior work has discussed
how URL rewriting has been used by Hulu, YouTube, and other video providers [2, 437]. Like
DNS-based redirection, if the address space is announced from multiple PoPs then URL rewriting
will not determine the PoP at which traffic ingresses.
2.4 Open Problems in Internet Routing
The limitations of BGP's decision process and overall architecture result in security vulnerabilities
and performance issues that degrade end-user experience and make it difficult for operators to
manage their network’s traffic. In this section, we introduce the landscape of open problems in
Internet routing. In Chapter 3, we speculate as to how the flattening of the Internet may provide
footholds for simpler solutions to some of these longstanding problems. In Chapters 4 and 5 we
examine the relevance of these problems and how they manifest in the flattened Internet from a
CDN’s vantage point. In Chapter 6, we discuss how PEERING helps researchers make progress on
these problems in today’s flattened Internet.
Problems include:
1. BGP's design makes it difficult for an AS to assess the validity of a route: any AS can originate or announce a route for any prefix, and a route's AS_PATH can be manipulated. This weakness
has been used to disrupt Internet traffic and launch Denial of Service attacks (§2.4.1).
2. BGP is an information-hiding protocol: it only communicates best paths to neighbors and
thus limits path diversity. In addition, BGP’s design makes it difficult to understand and often
impossible to predict the global impact of changes to routing policy and configuration (§2.4.2).
3. BGP routes do not contain information about demand, capacity, or performance, and its
decision process is not capable of incorporating dynamic signals such as these. As a result,
BGP can make suboptimal routing decisions, including placing traffic onto congested or
otherwise poorly performing routes (§2.4.3).
4. BGP convergence can take tens of seconds after an event such as a router failure, and, while
converging, traffic may be blackholed (§2.4.4).
Despite these known problems, replacing BGP outright is impossible in the near-term due to its
wide adoption and the barriers in transitioning networks to a new technology [128]. As a result, to
be deployable in the near-term Internet routing research must focus on sidestepping the fundamental
limitations of BGP’s design.
2.4.1 BGP’s design creates security vulnerabilities
BGP’s design creates security vulnerabilities at both the control and data-plane layers. In this
section, we discuss challenges in route validation (control-plane vulnerability) and source address
validation (data-plane vulnerability). While a number of mitigations exist [136, 212, 260, 261],
their effectiveness depends on the fraction of ASes that adopt them, and some require significant
investments in new protocols and infrastructure [148, 155, 158, 200, 278, 289, 301, 347, 452]. In
this manner, Internet security is arguably determined by the weakest link.
2.4.1.1 Route validation (control-plane vulnerability)
Network operators construct policies to filter routes on import and decide what routes to export. In
general, every AS has an export policy to ensure that the AS only announces routes that it should be
carrying traffic for (BGP will announce all received routes to all neighbors unless blocked by an
export filter) [466]. In addition, transit providers commonly configure filters that explicitly specify
the prefixes that can be imported from customers that are also stub ASes (e.g., an AS without any
customers of its own) such as a university [169]. However, transit providers typically do not apply
import filters to providers, peers, or larger customers, such as a large ISP that services a country, as
the prefixes announced by these ASes can quickly change for legitimate reasons, making it difficult to keep such filters current [169].
Beyond route filters, BGP’s design does not incorporate any additional form of route valida-
tion — BGP’s best path selection process considers all routes not removed by filters on import as
valid [466]. As a result, filters are the weakest link in BGP’s design; they are difficult to construct,
test, and maintain, and yet they are critical to the Internet’s security and stability [63, 141, 169, 285].
For instance, when an AS receives a route from a neighbor, the AS has no way of confirming if
the AS that the AS_PATH shows originated the route actually did so (a route's AS_PATH can be manipulated); if the AS that originated the route was authorized to do so (any AS can announce any route, including for address space it does not control); and whether the ASes that the AS_PATH shows the route traversing are appropriate (an AS can redistribute any route it receives).
Given this state of affairs, it is perhaps unsurprising that Internet-scale outages can be caused
by the actions of a single network. Two types of events, prefix hijacking and route leaks, can be
attributed to this fundamental weakness.
Prefix hijacking When an AS originates a route for a prefix without proper authorization, it is
said to have hijacked the prefix [33]. The impact of a hijacking event depends on the connectivity of
the network that originated the route along with the filtering policies of their neighbors and other
networks, but traffic routed via a hijacked route is typically blackholed, severing connectivity.
For instance, in 2008 an ISP in Pakistan attempted to block its customers' access to YouTube
by announcing routes for YouTube’s prefixes and blackholing the traffic [119, 470]. While the ISP
intended for this announcement to not be redistributed outside of its AS, a configuration error caused
these announcements to be redistributed to the ISP’s neighbors. The ISP’s neighbors did not filter
the route on import, and because BGP cannot distinguish between a valid announcement and one
that is erroneous or malicious, some of the ISP’s neighbors selected the route and subsequently
propagated the announcements further into the Internet. The mistake of a single ISP, combined with
the resulting cascading effect of BGP, caused YouTube to be inaccessible to users around the world.
In addition to causing a service to become unavailable, hijacking can allow attackers to intercept
traffic and degrade or outright compromise services [100, 330, 413]. Prior work has also shown
how hijacking can be used to attack TLS by tricking certificate authorities into issuing certificates to
attackers, which can then be used to decrypt user traffic [47, 153].
Route leaks In general, an AS leaks a route if it redistributes a route received from a provider
and/or peer to other providers and peers. Redistributing routes in this manner is a violation of
valley-free routing (e.g., the Gao-Rexford model, §2.3.1) and in most cases signals an error in
routing configuration given that the AS’s providers may send the AS traffic for a prefix that is
not in the AS's customer cone [38, 165, 388, 401, 403]. If other ASes select the leaked route(s) as the best path(s), they may redistribute them further. While the route to the destination is valid, because the leaking network is not designed to act as a transit provider, it often becomes congested by the resulting avalanche of incoming traffic [210, 388, 436]. Route leaks are often attributed to routes injected by an optimization system being improperly exported to other ASes due to misconfigured
filters [312, 407], although some instances of route leaks have been speculated to be intentional
efforts to intercept Internet traffic [172, 281].
Given that static route filters are not a tractable solution to preventing these events, recent work
has sought to generate route filters based on information provided by each network to a central
registry. For instance, the Internet Routing Registry (IRR) [212] and the Resource Public Key
Infrastructure (RPKI) [260] allow a prefix owner to specify which ASes can announce a prefix.
However, neither mechanism provides complete protection against prefix hijacking or route leaks,
and both require wide adoption to provide protection [97, 347]. Adoption has in general remained
limited due to a variety of technical and legal challenges, and due to concerns about the correctness
of registry data [148, 155, 158, 200, 289, 301, 347, 452]. However, in 2019, AT&T, a tier-1 network
and one of the world’s largest ASes, began to reject BGP announcements that were invalid based on
RPKI [267]. The adoption of RPKI, and in particular strict RPKI (i.e., outright rejection of invalid routes, instead of simply deprioritizing them), by a large network has been seen as a breakthrough
that may spur further adoption. Other industry initiatives, including the Mutually Agreed Norms for
Routing Security (MANRS), have been established to speed awareness and adoption of RPKI [145].
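As an aside, RPKI origin validation itself follows a simple rule, sketched below (Python; the Route Origin Authorizations are hypothetical): a route is "valid" if some ROA covers its prefix, authorizes the origin ASN, and permits the prefix length; "invalid" if it is covered only by ROAs that do not; and "not found" if no ROA covers it.

    import ipaddress

    # Hypothetical Route Origin Authorizations: (prefix, authorized origin ASN, maxLength).
    ROAS = [
        (ipaddress.ip_network("192.0.2.0/24"), 64500, 24),
        (ipaddress.ip_network("198.51.100.0/22"), 64510, 24),
    ]

    def origin_validation(prefix, origin_asn):
        prefix = ipaddress.ip_network(prefix)
        covering = [(p, asn, maxlen) for (p, asn, maxlen) in ROAS if prefix.subnet_of(p)]
        if not covering:
            return "not found"     # no ROA covers this prefix
        for _, asn, maxlen in covering:
            if asn == origin_asn and prefix.prefixlen <= maxlen:
                return "valid"
        return "invalid"           # covered, but origin or prefix length not authorized

    if __name__ == "__main__":
        print(origin_validation("192.0.2.0/24", 64500))     # valid
        print(origin_validation("192.0.2.0/24", 64666))     # invalid (wrong origin)
        print(origin_validation("198.51.100.0/25", 64510))  # invalid (too specific)
        print(origin_validation("203.0.113.0/24", 64500))   # not found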
More sophisticated mechanisms such as BGPSec [261] improve security further, but bring
deployment and adoption challenges. For instance, BGPSec requires vendors to make significant
investments to support the technology and widespread adoption across operators (which would
require upgrading and reconfiguring thousands of devices, among other things) to provide value [278].
Other tools can help identify occurrences of prefix hijacking but alone they do not help an operator
mitigate an attack [33, 97, 205, 251, 475, 476].
2.4.1.2 Source address validation (data-plane vulnerability)
When an AS receives a packet from a neighbor, it has no way to verify the accuracy of the packet’s
source IP address — only the AS in control of the corresponding address space can determine if
a packet originated from it. As a result, in general networks must blindly accept packets received
from neighbors with no authentication of the source address.
This permissive behavior represents a significant security vulnerability. For instance, the source
IP address of packets in denial of service attacks is commonly invalid (spoofed) to prevent the attack
traffic from being traced to its origin, hindering attribution and resolution of security issues [136].
Spoofing can also be used in reflection attacks in which an attacker sends requests to connection-less
services (such as DNS or NTP) with the source address set to the IP address of a target. Such attacks
are also known as amplification attacks because the requests are typically small — and thus require
few resources from the attacker — but result in large responses that cause the target to become
inundated by the response traffic [103, 337].
Spoofing can be prevented by filtering packets at or near the source: if every AS drops packets
that originate from within the AS but do not contain a source IP address controlled by the AS,
then traffic can always be accurately sourced to an AS. Stricter filtering (e.g., dropping packets at
the router or host level within the AS) can ensure that an AS can properly attribute the source of
packets within its network. This filtering was proposed over 20 years ago in May 2000 [136] and is
considered to be a Best Current Practice (BCP) by the IETF (RFC2827 [136] is BCP38, and has been updated as RFC3704 [29] and RFC8704 [400], BCP84).
However, while there is widespread
acceptance of the idea, its effectiveness depends on how many ASes actually deploy the required
filtering, and spoofing will remain a practical approach for attackers until such filtering is widely
deployed. Adoption has been challenging, in part due to the incentive structure: an AS that installs
filters does not gain any protection from an attack — the filters only ensure that the AS does not
originate attack traffic [41, 42, 43, 274].
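The filtering itself is conceptually simple, as the sketch below shows (Python; the prefixes are hypothetical): at the edge of an AS, an outbound packet is forwarded only if its source address falls within address space the AS or its customers are known to hold.

    import ipaddress

    # Hypothetical address space held by this AS and its customers.
    LEGITIMATE_SOURCES = [
        ipaddress.ip_network("192.0.2.0/24"),
        ipaddress.ip_network("198.51.100.0/24"),
    ]

    def permit_egress(source_addr):
        # BCP38-style check: forward outbound packets only if their source address
        # belongs to address space this network is responsible for.
        addr = ipaddress.ip_address(source_addr)
        return any(addr in net for net in LEGITIMATE_SOURCES)

    if __name__ == "__main__":
        print(permit_egress("192.0.2.7"))      # True: legitimate source, forward
        print(permit_egress("203.0.113.99"))   # False: spoofed source, drop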
Other solutions have focused on enabling the target network to identify the AS originating the
traffic via “traceback” mechanisms. However, these mechanisms require extending BGP, ICMP,
or other aspects of the Internet routing ecosystem [60, 152, 374, 393, 402], making them difficult
to deploy today. More recently, work supported by the PEERING platform (chapter 6) developed a
methodology to triangulate the source of traffic by using specialized BGP advertisements to change
the routes that attack traffic ingresses through [142].
Attacks involving spoofed traffic have decreased significantly in recent years: a study of DDoS attacks in 2019 concluded that less than 10% of DDoS attacks involved spoofed traffic, even as the volume of DDoS attacks has continued to grow [248]. It is unclear if this reduction is the result of filtering making it harder to spoof traffic, or if it only signals a shift in attacker strategy.
2.4.2 BGP’s design limits route diversity, flexibility, and control
BGP is described as an information hiding protocol because its design provides a given AS little
visibility into the connectivity and routing decisions of other ASes [65]. Additionally, an AS’s
egress routing options and the route that traffic ingresses into an AS are both (in part) functions of other ASes' routing decisions [65, 132, 342].
BGP only redistributes the best route to a destination, limiting route diversity. When two
ASes interconnect with BGP, each redistributes at most one route per destination and routes all
traffic via this path. As a result, ASes are constrained by the routing decisions of other networks.
To illustrate the practical implications of this lack of control, consider the example in Fig-
ure 2.1, in which USC has two transit providers, Hurricane Electric and PCCW. Networks com-
monly maintain multiple transit interconnections for redundancy and to diversify their routing
options (§4.2, [469]). However, while such connectivity offers resilience to failures that impact a
single interconnection or transit provider, connectivity via multiple transits does not necessarily
provide fully diverse routes to a destination. For instance, in the example illustrated in Figure 2.1,
USC’s route diversity is constrained by the connectivity and decisions of its transit providers. USC
has two routes to Yahoo, but both of them traverse Level(3), whose network is blackholing traffic
due to a failure. Despite a healthy route being available through Telia, network operators at USC
cannot resolve the problem on their own — one of the other ASes involved must realize the problem and take action.

Figure 2.1: In the illustrated example, USC is unable to route its traffic around a failure in Level(3)'s network, despite a healthy path being available through Telia. USC has two transit providers, both of which rely on a (failed) route through Level(3) to send traffic to Yahoo. USC cannot control how either of its transit providers routes traffic. PCCW could mitigate the failure by switching to the route via Telia, but is unaware of the failure because the route is still announced via the BGP control-plane. (Figure legend: route selected by BGP; unused route; failure.)
Prior work has proposed using overlay networks to provide options in such situations [18, 328,
373]. However, even with additional routing options, network operators still need to be able to
determine the performance and usability of each route, a challenge in itself (chapter 5). Extensions
to the BGP protocol and new Internet routing architectures have also been proposed to provide more
control [255, 284, 468], but these approaches are difficult to incrementally deploy [128, 445].
BGP provides limited control of ingress routing. BGP provides network operators with few
tools to control how traffic ingresses into an AS:
• MED: ASes can encode preferences in a route's MED value (§2.1.2), but some ASes may not honor MED values (§4.2, [399]), and MED values are typically only visible to — and thus can only directly influence the decisions of — an AS's immediate neighbors [65, 237, 299, 300].
• AS_PATH prepending: An AS can artificially increase the length of a route's AS_PATH by adding its own ASN multiple times, and in turn (potentially) change the routing decisions of other ASes by taking advantage of the BGP path selection algorithm's consideration of AS_PATH length [86, 151, 288]. However, such prepending may not have the desired impact [151, 237, 288]. For instance, an AS that receives a route to a prefix from both AS X and AS Y may always prefer the route from AS X — regardless of the route's AS_PATH length — because its import policy increases the LOCAL_PREF of routes received from AS X, perhaps because transit through AS X costs less (a small sketch after this list illustrates this).
• Selective advertising: Advertising a more specific route via the preferred interconnection can
be used to take advantage of the BGP path selection algorithm’s preference for more specific
routes [411]. However, Kaufmann [237] discussed scenarios that Akamai has encountered
during which more specific advertisements and other forms of ingress traffic engineering
yielded unexpected consequences, including blackholing and poor performance.
• AS_PATH poisoning and BGP communities: AS_PATH poisoning [62, 98] and BGP communities can be used to control route propagation [62, 98, 236, 392, 406]. However, BGP communities often do not propagate past one hop and may not have the desired impact [341, 406]. Likewise, there exist significant barriers to productionizing AS_PATH poisoning, including disagreement in the operational community about whether such announcements should be allowed [62, 98, 236, 392, 395].
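The sketch below (Python; ASNs, preferences, and prepend counts are hypothetical) illustrates why prepending is non-deterministic: prepending lengthens the AS_PATH, but a neighbor whose import policy assigns a higher LOCAL_PREF to one provider's routes still prefers that provider, since LOCAL_PREF is evaluated before AS_PATH length.

    def prepend_origin(asn, extra):
        # The originating AS places its own ASN in the AS_PATH (1 + extra) times.
        return [asn] * (1 + extra)

    def neighbor_choice(routes):
        # The neighbor's simplified decision: LOCAL_PREF first, AS_PATH length second.
        return min(routes, key=lambda r: (-r["local_pref"], len(r["as_path"])))

    if __name__ == "__main__":
        # Origin AS 64500 prepends on its announcement toward AS X (AS 64496),
        # hoping to shift inbound traffic toward AS Y (AS 64497).
        via_x = {"via": "AS X", "local_pref": 200,   # neighbor prefers AS X (e.g., cheaper)
                 "as_path": prepend_origin(64500, 3) + [64496]}
        via_y = {"via": "AS Y", "local_pref": 100,
                 "as_path": [64500, 64497]}
        # The prepending has no effect here because LOCAL_PREF is evaluated first.
        print(neighbor_choice([via_x, via_y])["via"])   # prints: AS X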
All of these solutions are non-deterministic — their efficacy depends on the connectivity and
routing policies of other ASes — and the use of Anycast (§2.3.2) brings even more challenges [132, 140, 450]. As a result, ingress traffic engineering is commonly achieved through a trial and error
process [151]. For instance, if a network wants to use ingress traffic engineering to reroute traffic
around a blackhole (e.g., if Yahoo wanted to resolve the situation illustrated in Figure 2.1), network
operators would need to iteratively test ingress traffic engineering policies to identify the problem
and a workaround solution [151, 234, 235, 236].
Finally, there is a potential for harmful interactions and oscillations when multiple ASes use dynamic traffic engineering systems to accomplish ingress and/or egress traffic engineering. For instance, if AS X uses an egress traffic controller (such as EDGE FABRIC, chapter 4) and AS Y uses an ingress traffic controller (such as Sprite, [411]), there is a potential for the two control systems to interact in a way that causes traffic to oscillate between interconnections (§5.6.1.2, [237]).
2.4.3 BGP’s decision process does not consider route performance or capacity
BGP route announcements do not contain information about a route’s performance or capacity,
and BGP’s static routing policies do not (natively) provide a way to incorporate dynamic signals
captured through other means into BGP’s decision process. As a result, BGP can make suboptimal
routing decisions that degrade network performance (chapters 4 and 5, [71, 328, 472]).
For instance, BGP’s decision process may:
• place more traffic on a route than the route’s capacity, leading to congestion (Chapter 4)
• route traffic via a poorly performing or otherwise suboptimal path (Chapter 5)
• route traffic via a path that is blackholing traffic, outright preventing connections [235, 473]
As discussed in Section 2.3.1, networks often incorporate heuristics into their routing policies to
improve performance. Beyond these simple policies, best practices suggest increasing or decreasing
a route’s LOCAL_PREF based on attributes, such as whether an ASN is in the path, to balance
traffic across links and incorporate operational insights on route performance [356]. These heuristics
may prove viable in simple scenarios — such as a small network attempting to load-balance traffic
across two transit providers — but they are not practical for networks with more than a handful of
interconnections because static policies cannot respond dynamically to changes in load, capacity, or
route performance, and because they do not enable efficient use of interconnection capacity. We
discuss these challenges and why today’s CDNs require more sophisticated control mechanisms in
Chapters 4 and 5.
BGP routes all traffic via the route selected by its selection algorithm. BGP does not na-
tively support load-balancing traffic across non-equivalent routes or making routing decisions on a
per-application basis (e.g., send performance-sensitive traffic via higher cost, but better performing Route A, and all other traffic via Route B) — these limitations hinder efficient use of an AS’s connectivity.
Furthermore, this simplicity also presents barriers to performance-aware routing. Because route
performance is not available through control-plane signals, and because a route’s performance may
change over time, performance-aware routing requires continuously measuring the performance
of candidate routes by sending traffic — either existing production traffic, or active measurement
traffic — over them. This is not natively possible with BGP given that it routes all traffic via the
single preferred route.
While Policy-Based Routing (PBR) does address some of these constraints by enabling traffic
to be routed based on identifiers such as DSCP label, BGP routing policies incorporating PBR are
still static and thus unable to incorporate real-time conditions. For instance, a CDN may want to
only route traffic via the next best route when the primary route is overloaded (e.g., demand exceeds
capacity), but PBR would only enable the CDN to statically (and thus continuously) route a portion
of traffic via a route chosen based on BGP attributes (and thus not necessarily the next best route).
Furthermore, keeping PBR rules up to date is non-trivial — since rules are based on route attributes,
rules inevitably become stale over time and may not reflect the current best decision. As a result, the
mapping between label and route must be continuously updated — this would require a separate control system to update the static PBR configuration. As a result, PBR alone is insufficient to make best use of connectivity in a dynamic environment. (Manufacturers use different terminology to describe Policy-Based Routing: Cisco supports Policy-Based Routing [93] and Virtual Routing and Forwarding [94], Juniper supports Routing Instances [228], and the Linux kernel supports Policy-Based Routing [265].)
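As a rough illustration of why static PBR needs an external controller, the sketch below is a hypothetical Python fragment (not from any production system) that periodically re-selects a next hop for a DSCP class and installs it into a dedicated Linux policy-routing table using iproute2 (ip rule / ip route); the measurement function, addresses, and table number are placeholders. Without such a refresh loop, the DSCP-to-route mapping reflects only whatever was best when the rule was written.

```python
# Hypothetical controller sketch: keeps a Linux policy-routing table pointed
# at whichever next hop currently measures best for one traffic class.
# Addresses, table number, and the probing function are placeholders.
import random
import subprocess
import time

EF_TOS = "0xb8"                                       # DSCP EF encoded as a ToS byte (assumption)
TABLE = "100"                                         # dedicated routing table for this class
CANDIDATE_NEXT_HOPS = ["192.0.2.1", "198.51.100.1"]   # routers on two egress interconnections (example addresses)

def next_hop_cost(next_hop):
    """Placeholder for a real signal (loss/RTT probes, link utilization); random stub here."""
    return random.random()

# Steer EF-marked traffic to the dedicated table (installed once; exact
# iproute2 selector syntax may vary across versions).
subprocess.run(["ip", "rule", "add", "tos", EF_TOS, "table", TABLE], check=False)

while True:
    best = min(CANDIDATE_NEXT_HOPS, key=next_hop_cost)
    # Point the table's default route at the currently best next hop.
    subprocess.run(["ip", "route", "replace", "default", "via", best, "table", TABLE], check=True)
    time.sleep(60)  # without this loop, the PBR configuration never adapts
```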
2.4.4 BGP can take significant time to converge and recover after an event
When a component such as a router or link fails, routers adjacent to the failure must realize the
failure and then select a new route to reach the destination [466]. This results in a cascading process:
each time a router receives a BGP update from a neighbor, the router must recalculate its best path.
If the router’s best path changes, it must send BGP updates to its neighbors.
It can take minutes for routers adjacent to the failure to realize that a failure has occurred and
begin selecting a new route, and minutes for all impacted routers to select a new route [61, 249].
Because remote routers are only informed that a route has been withdrawn (and not the root cause
of the route failure), a router may switch to a different route that is also impacted by the failure and
is subsequently withdrawn shortly thereafter. The resulting oscillations can extend the time required
to achieve convergence [201, 223, 247, 249, 375, 448, 454]. Until convergence is achieved, some
routers may continue to send traffic along a route that can no longer reach the destination, causing
traffic to be blackholed, and routes can change frequently, resulting in poor performance [247]. In
addition, configuration errors on backup paths can result in outages that last for hours or days [249].
While the flattening of the Internet has likely reduced convergence times for routes between
users and popular content by reducing path lengths and the number of routers that must update when
an outage happens (§§ 2.1.2 and 2.2.3 and chapter 3), expectations for availability have increased in
tandem. BGP convergence delays can still cause tens of seconds of downtime and remain a concern
for networks with high availability targets [201, 276]. In addition, interconnections established via
the shared network fabrics of public IXPs (§§ 2.2.2 and 2.2.3) can take longer to converge in the
case of a failure because such fabrics decouple link state from interconnection health [193, 194].
For instance, if AS X and AS Y interconnect via a PNI and AS Y’s router suffers a hardware failure, AS X’s router will often be able to immediately detect the failure via a change in link state, drop the corresponding BGP routes, and begin to send traffic via other routes — this quick reaction minimizes the amount of traffic that is blackholed. In comparison, if AS X and AS Y are connected via a shared fabric, AS X’s router will not be able to immediately detect the failure of AS Y’s router via a link flap because it is connected to the shared fabric — and the state of the shared fabric will not change due to the failure of a single participant. As a result, AS X’s router will only be able to determine that AS Y’s router failed when the BGP session’s hold timer expires, significantly increasing the amount of time following a failure during which traffic is blackholed [193, 194].
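The difference can be made concrete with a back-of-the-envelope sketch (Python); the timer values below are illustrative defaults rather than measurements, and mechanisms such as BFD can shorten detection on shared fabrics.

```python
# Back-of-the-envelope comparison of failure detection over a PNI vs. a
# shared IXP fabric. Timer values are illustrative; real deployments vary.

LINK_DOWN_DETECTION_S = 0.05   # PNI: the interface goes down almost immediately when the peer's router fails
BGP_HOLD_TIME_S = 90.0         # shared fabric: worst case, wait for the hold timer (commonly 90-180 s)

def blackhole_window(via_pni: bool) -> float:
    """Rough upper bound on how long traffic is blackholed before the failed route is dropped."""
    return LINK_DOWN_DETECTION_S if via_pni else BGP_HOLD_TIME_S

print(f"PNI:        up to {blackhole_window(True):.2f}s of blackholed traffic")
print(f"IXP fabric: up to {blackhole_window(False):.2f}s of blackholed traffic")
```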
2.5 Internet Routing Research Tools
In this section, we describe tools that have traditionally been used in Internet routing research.
We broadly classify existing tools into two categories: (i) measurement tools, platforms, and
datasets, and (ii) simulation and emulation tools. Prior work has frequently relied on tools from both
categories; for instance, a measurement study may be conducted and then used to define a model
that serves as an input to a simulation [161, 162, 310, 360, 424]. In Section 2.5.3, we discuss why
these tools are unable to effectively support a wide range of research.
2.5.1 Measurement tools, platforms, and datasets
Measurement tools, platforms, and datasets can be broadly classified based on:
1. Whether routing state is captured directly from the control-plane or inferred from the data-plane.
2. Whether experiments have control and, if so, the types of actions supported.
3. For platforms and datasets, underlying vantage point coverage and representativeness.
Route Collectors Route collectors establish and maintain BGP sessions with other ASes and
record the BGP route updates (advertisements and withdraws) received. The route updates received
by a route collector from an AS are a function of the routes available at the AS’s router, the
decisions made by the router’s BGP best path calculation algorithm (§2.1) — as influenced by
the AS’s routing policy (§2.3.1) — and the AS’s route export policy. As such, route updates
provide insights into connectivity of ASes that the route collector maintains a BGP session with,
and analysis of AS_PATHs can also shed light on the connectivity between ASes. A number of
organizations operate route collectors at vantage points around the world, including the University
of Oregon’s RouteViews [361], Packet Clearing House (PCH) [325], RIPE NCC’s RIPEstat [351],
and Colorado State University’s BGPmon [467], among others [338]. In addition to collecting route
advertisements, some route collectors also announce a beacon prefix that can be used to assess how
routes propagate across the Internet, global reachability of advertised routes, and routes from other
locations to the route collector [325, 351].
A route collector’s coverage and fidelity are determined by the BGP sessions the collector
maintains and the export policies of other ASes. The growth of peering interconnections (§2.2.3)
has reduced collector visibility into the routes taken by the bulk of today’s Internet traffic. For
instance, peering interconnections between content providers such as Google and end-user ISPs
such as Comcast will (generally) only be visible to RouteViews if one of the two ASes involved has
a BGP session with RouteViews and exports a route that traverses the interconnection. Yet as of
December 2020 many of the BGP sessions maintained by RouteViews are with transit and hosting
ASes [362] — RouteViews does not have BGP sessions with ISPs such as Comcast and does not
always receive a full view of an AS’s routing decisions [182], possibly because end-user ISPs and
CDNs attempt to keep details of their connectivity and routing decisions confidential [160, 272].
Distributed Measurement Platforms Distributed measurement platforms, including RIPE At-
las [353], CAIDA’s Archipelago (Ark) [68], and Speedchecker [398], along with others [368, 382,
415], enable experiments to execute data-plane measurements such as ping, traceroute, and
HTTP fetches from vantage points around the world. While these platforms provide an invaluable
resource, experiments are limited in four key ways. First, the control provided to researchers
is limited; researchers do not have direct access to the platform nodes and instead must execute
measurements through a restricted API provided by each platform. Second, measurements are
throttled based on resource quotas, limiting the temporal and spatial coverage of measurement
studies [102]. Third, while control-plane routing decisions can be inferred from traceroute measurements (though the translation is imperfect [89, 286, 287]), such measurements cannot be used to fully
characterize an AS’s connectivity or routing policy because they can only capture how an AS is
currently routing traffic. Finally, the representativeness of measurement studies executed from these
platforms depends on whether their vantage points have connectivity representative of end-users,
and wide coverage of end-user networks (chapters 3 and 5 and §7.1, [24]). Arnold et al. [24] found
that RIPE Atlas has vantage points in 56.5% of ASes containing end-users, and that Speedchecker
has vantage points in 91% of ASes containing end-users. While Speedchecker has a presence in the
vast majority of end-user ASes, performance can vary significantly within an AS [229], especially
for ASes with a large geographic footprint.
In addition to these specialized measurement platforms, general purpose platforms such as Plan-
etLab [329] and M-Lab [279] provide researchers with shared computing resources on infrastructure
around the world. These platforms have fewer limitations and enable researchers to run custom
software and conduct sophisticated data-plane measurement experiments with only self-imposed
rate-limiting. However, PlanetLab and M-Lab are predominantly connected to educational and large
transit networks (respectively), and thus measurements executed from these vantage points may not
be representative for users in residential end-user ISPs (chapter 3).
Looking Glass Interfaces Some ASes provide a looking glass service that enables users to query control-plane state and/or execute data-plane measurements such as traceroute and ping from vantage points within the AS [163]. Looking glass services often limit the frequency at which queries can be performed and prohibit automated queries, and are often located in transit, hosting,
and educational networks, reducing their visibility into end-user ISP connectivity and routing
decisions [83, 84, 85, 102, 163].
Datasets A number of datasets relevant to Internet routing research have been made available
by members of the research and operational communities. Some of these datasets are historical
while others are regularly updated. Datasets include information about the connectivity between
networks [69, 280, 428] that has been derived from route collector data and data-plane measure-
ments. Other datasets provide insight into Internet traffic patterns [429, 440], route quality and
performance [432], address space utilization [67], and the points of presence and connectivity of
large content and transit providers [70, 135].
2.5.2 Simulation and emulation
A handful of general purpose, openly available frameworks are available for simulating BGP’s best
path computation process (§2.1) and route redistribution [110, 343]. Prior work has relied on these
open frameworks and on ad-hoc simulators to study route propagation and oscillations in
networks with hundreds of routers [133, 335, 344].
Emulation frameworks such as Mininet [257], MinineXt [304], the Virtual Internet Routing
Lab [318], and GNS3 [168] provide more realism and flexibility than simulation frameworks in exchange for greater resource utilization. Unlike simulation frameworks, emulation frameworks
allow general purpose applications (such as a web server) to exchange traffic through an emulated
network composed of software routers and switches. Experiments can use these frameworks and
general purpose testbed infrastructure such as CloudLab [349], EmuLab [456], and DETERLab [305]
to emulate large networks and evaluate network and application behavior under different network
configurations.
2.5.3 Key limitation of existing tools: lack of control and realism
Researchers and network operators are well aware of BGP’s limitations and their impact on per-
formance, availability, and security, but progress towards solving these problems remains slow. A
significant barrier to understanding problems and exploring solutions arises from the design of BGP
as an information-hiding protocol, as this makes it difficult to execute experiments that are both
realistic and provide the necessary control:
1. Measurements provide realism but no control, and are limited by vantage points. For
instance, while traceroute measurements indicate how an AS is currently routing traffic,
they do not shed light on how that AS would route traffic if conditions changed (e.g., if a link
failed), and don’t provide the control necessary to find out. While longitudinal studies may
capture more of an AS’s connectivity, resource quotas present a significant barrier (§2.5.1)
and such studies will be unable to capture a large AS’s connectivity. Furthermore, the
growth in peering interconnections and the corresponding flattening of the Internet (§2.2.3
and chapter 3) has reduced the effectiveness of existing measurement tools and datasets. For
instance, CAIDA’s AS Relationships Dataset [428] is constructed from RouteViews BGP
route collectors. As discussed previously (§2.5.1), peering interconnections between content
providers such as Google and end-user ISPs such as Comcast will typically not be visible in
RouteViews — and thus will not appear in CAIDA’s dataset [181, 182].
2. Simulations and emulations provide control, but limited fidelity. Emulation and simula-
tion cannot accurately model the Internet due to the lack of transparency in BGP and the
proprietary nature of routing policies. Simulations and emulations provide absolute control,
in that the experimenter has complete control over the experiment’s conditions, such as the
routing policies, connectivity, and behavior of simulated/emulated components. However,
the realism of an experiment is dependent on whether the experiment’s conditions are rep-
resentative of real-world networks. Since many of these properties are not disclosed by
network operators [160, 272], researchers can only try to infer them via measurements or
make assumptions [161, 162, 310, 424], both of which degrade realism and fidelity [62, 360].
To gain insight into problems on today’s Internet and evaluate potential solutions, experi-
ments need to interact with and affect the Internet’s routing ecosystem. This requires taking
control of a real AS and its connectivity, policies, and traffic. A small set of prior work has
demonstrated the value of such control; we discuss two examples here:
1. Wang et al. [454] measured BGP convergence time following a route failure. The experiment
(i) announced a route to all providers of a multihomed BGP beacon, (ii) withdrew the
route announced to one of the providers to mimic a route failure, and (iii) measured the
convergence time of the control and data-plane following the withdraw. These events provided
the experiment with control: the experiment injected updates into the Internet’s control-plane,
and realism: all events observed were the result of routing policies and BGP implementations
used by actual networks on the Internet.
2. Colitti [98] announced routes with AS_PATHs that had been manipulated to uncover backup routes (e.g., routes that were not selected by the BGP best path selection algorithm, §2.1). Specifically, Colitti would announce a prefix and then use traceroutes or other tools (§2.5.1) to identify the AS_PATH in the route selected by a remote AS to route traffic to the announced prefix. Colitti would then announce an AS_PATH containing one or more of the ASes in the currently used AS_PATH, a process known as poisoning. When a router at one of the poisoned ASes received the route, BGP’s loop prevention logic would reject the route (by default, if a router receives a BGP route with its own ASN in the AS_PATH, it assumes that the route is a loop and rejects — i.e., ignores — the route [466]). As a result, the poisoned AS and any AS that previously relied on the poisoned route would either shift their traffic to a different route, or in the case of no alternative, have no route to the destination. This approach enabled Colitti to uncover backup paths in a controlled manner.
In addition to these concrete examples from prior work, such control can also help researchers
identify new problems and explore the design space. For instance, a researcher could build a
representative prototype CDN — operating as an AS on the real Internet — and announce routes,
exchange traffic, and run real services, and then execute experiments to explore aspects of CDN de-
sign such as ingress traffic engineering (§2.3.2). Prior work has shown that the hands-on experience
gained through building such prototypes can be invaluable and can spur new directions of research
that prove impactful [146].
However, while a small set of researchers have been able to take control of a real AS and
execute such experiments, such opportunities remain out of reach for most. Operators are
typically unwilling to allow experiments on a production network due to the potential wide-ranging
negative effects [357], and establishing a representative AS is an insurmountable barrier for most
researchers. In Chapter 6, we describe how we removed these barriers and democratized Internet
routing research by building PEERING, a globally distributed, multiplexed AS open to the research
community.
Chapter 3
Are We One Hop Away from a Better Internet?
3.1 Introduction
Over the past decade, Content Distribution Networks (CDNs) have become a central pillar in the
Internet ecosystem; as of 2019, the vast majority of all Internet traffic is sourced from a small set
of CDNs. The largest content providers (e.g., Google/YouTube, Netflix, Facebook, and Microsoft,
among others) have built out their own CDNs (Chapter 4, [70, 72, 378, 442, 469]), and the traffic of
smaller content providers is served by a handful of commercial CDNs [7, 248, 302].
Content providers rely on CDNs to meet the demanding network requirements of today’s
Internet applications. For instance, users expect playback of streaming videos to start quickly
and proceed without stalls, even at high bitrates, and these quality of experience expectations
translate into high goodput and soft real-time latency demands of the underlying network [111].
CDNs build points of presence (PoPs) around the world to bring content closer to users, improving
performance by decreasing latency and transfer times [53, 70, 138, 317, 469]. However, PoPs
provide CDNs with more than locality — they also enable CDNs to bypass the traditional Internet
hierarchy and interconnect directly with regional networks, including end-user Internet Service
Providers (ISPs, e.g., Comcast) (§2.2.3 and chapter 4).
In this chapter we characterize CDN connectivity and path lengths between users and popular
content on today’s Internet by executing measurements from major cloud providers and community
measurement platforms. We make two contributions:
We quantify the full degree of flattening for major CDNs and show that a sizable fraction of Internet traffic now traverses a short or one hop path. Our measurements (collected in
2015) show that, whereas the average arbitrary path on the Internet traverses 1-2 intermediate transit
ASes, most paths and the vast majority of traffic between Google’s network and end-users go directly
from Google’s network into the user’s ISP (§3.3). We execute similar measurements for other CDNs
and find that while Google leads the pack, other major CDNs are also expanding their connectivity
and are able to send a significant fraction of user traffic via such one hop paths.
While prior work suggested that the Internet has been “flattening” in this manner [159, 250],
our results are novel in a number of ways. First, whereas previous work observed flattening in
measurements sampling a small subset of the Internet, we quantify the full degree of flattening for
major CDNs from vantage points within their networks. Our measurements cover paths to 3.8M /24
prefixes — all of the prefixes observed to request content from a major CDN — whereas earlier
work measured from only 50 [159] or 110 [250] networks. Peering interconnections, especially of
content providers like Google, are notoriously hard to uncover, with previous work projecting that
traditional measurement techniques miss 90% of these links [319]. Our results support a similar
conclusion to this projection: Whereas a previous study found up to 100 links per content provider
across years of measurements [383] and CAIDA’s analysis lists 184 Google peers [428], our analysis
uncovers interconnections from Google to over 5700 ASes. We discuss related work in greater detail
in Section 7.1.
In addition, we provide context, showing that popular paths serving high volume client networks
tend to be shorter than paths to other networks, and that ASes that Google does not peer with often
have a local geographic footprint and low query volumes.
We sketch how short paths can be used to solve longstanding Internet problems. We consider
whether it may be possible to take advantage of short paths — in particular those in which the
CDN interconnects directly with the end-user’s ISP — to make progress on longstanding Internet
routing problems (§3.4); recall from Section 2.4 that many of these problems arise from BGP’s limitations and the challenge of building deployable solutions to them.
Prior work proposed solutions that work for any Internet path; while this
maximizes the solution’s coverage and effectiveness, it also complicates the solution. In addition,
general solutions often require widespread adoption to be effective. In this chapter, we consider if it
is easier to make progress on these problems if we limit the focus of our solutions to the paths that
carry the majority of Internet traffic — the paths between CDNs and end-users. For example:
• Prior work focused on improving the reliability of Internet routing required complex lockstep
coordination among thousands of networks [225]. Is coordination simplified when the
concerned parties are directly interconnected?
• The source and destination of traffic have incentives to improve the quality of the route
between them, but lack control of, and visibility into problems within intermediate transit
providers. With short, direct paths, can we design approaches that use the natural incentives
of the source and destination — especially of large providers — to improve Internet routing?
• Prior solutions to Internet routing challenges have often been complicated because they sought
to support all scenarios. However, simpler but less general techniques could provide benefit
for the bulk of Internet traffic [278]. Given the disproportionate role of a small number of
providers, can we solve open challenges and perhaps even achieve extra benefit by tailoring
our approaches to apply to these few important players?
We have not answered these questions, but we sketch problems where short paths might provide a
foothold for a solution.
3.2 Measuring Internet Path Lengths
How long are Internet paths? In this section, we first demonstrate that the answer depends on the
measurements used. We then discuss how our measurement methodology allows us to assess path
lengths for the paths that carry the bulk of Internet traffic.
3.2.1 Strawman approach: measuring from an academic testbed
Traceroutes from academic testbeds are commonly used in academic studies, so as a starting point,
we consider a set of iPlane traceroutes from April 2015 [280]. This dataset contains traceroutes from all PlanetLab sites to 154K BGP prefixes. The 154K prefixes in the iPlane dataset are derived by
clustering Internet end-hosts using BGP atoms [58]; each prefix represents one such cluster.
Figure 3.1 shows that only 2% of the paths measured from PlanetLab are one hop to the destination, and the median path is between two and three AS hops. (In addition to using PlanetLab, researchers commonly use BGP route collectors (§2.5) to measure paths; a study of route collector archives from 2002 to 2010 found similar results to the PlanetLab traceroutes, with the average number of hops increasing from 2.65 to 2.90 [121].) However, there is likely little
traffic between the networks hosting PlanetLab sites (mostly universities) and most prefixes in the
iPlane list, so these longer paths may not carry much traffic. Instead, traffic is concentrated on a
small number of links and paths from a small number of sources.
For example, in 2009, 30% of traffic came from 30 ASes [250]. Likewise, at a large IXP in
2014, 10% of links contribute more than 70% of traffic [367]. Many paths and links are relatively
unimportant: at the same IXP, 66% of links combined contributed less than 0.1% of traffic [350].
Summary. Characterizing Internet path lengths for the bulk of Internet traffic requires an approach that measures the paths that carry such traffic, instead of the paths between PlanetLab nodes
and clusters of arbitrary endpoints.
3.2.2 Approach used in this work
In the previous section, we concluded that traceroutes executed from PlanetLab nodes do not provide
representative insights into the Internet paths carrying the bulk of traffic on today’s Internet. In this
section, we introduce an approach that still relies on traceroutes, but differs in terms of the vantage
point that we measure from and how we weigh measurements during our analysis.
3.2.2.1 Datasets and measurements
Our approach relies on three inputs:
1. A CDN log containing user query volumes. Aggregated and anonymized queries to a large
CDN, giving (normalized) aggregate query count per /24 client prefix in one hour in 2013
across all of the CDN’s globally distributed servers. The log includes queries from 3.8M
client prefixes originated by 37496 ASes. The set has wide coverage, including clients in
every country in the world, according to MaxMind’s geolocation database [295].
While the exact per prefix volumes would vary across provider and time, we expect that
the trends shown by our results would remain similar. To demonstrate that our CDN log
has reasonable query distributions, we compare it with a similar Akamai log from 2014
(Fig. 21 in [88]). The total number of /24 prefixes requesting content in the Akamai log
is 3.76M, similar to our log’s 3.8M prefixes. If V^C_n and V^A_n are the percentage of queries from the top n prefixes in our CDN dataset and in Akamai’s dataset, respectively, then |V^C_n - V^A_n| < 6% across all n. The datasets are particularly similar when it comes to the contribution of the most active client prefixes: |V^C_n - V^A_n| < 2% for n ≤ 100,000, which accounts for approximately 31% of the total query volume.
2. Traceroutes from the cloud. In March and August/September 2015, we issued traceroutes
from cloud compute instances hosted in Google Cloud [Central US region], Amazon AWS
(EC2) [Northern Virginia region], and IBM SoftLayer [Dallas DAL06 datacenter] to all 3.8M
prefixes in our CDN trace and all 154K iPlane destinations. For each prefix in the CDN
log, we chose a target IP address from a 2015 ISI hitlist [126] to maximize the chance of a
response. We issued the traceroutes using Scamper [275], which implements best practices
like Paris traceroute [27].
Limitations. Our traceroutes may not capture all of the provider’s interconnections because
we only executed traceroutes from a single vantage point in each cloud provider’s network,
and because our traceroutes were subject to the cloud provider’s routing policy. For instance,
at the time of our measurements (2015) Amazon appeared to egress traffic early instead of
backhauling it to the Amazon point of presence closest to the destination (known as early-exit
routing, §2.3.1), this likely caused some of Amazon’s interconnections to be hidden from our
traceroutes.
3. Traceroutes from RIPE Atlas. The RIPE Atlas platform includes small hardware probes
hosted in thousands of networks around the world (§2.5.1). In April 2015, we issued tracer-
outes from Atlas probes in approximately 1600 ASes around the world towards our cloud
instances and a small number of popular websites. For all traceroutes executed from a RIPE
Atlas Probe to a hostname, DNS resolution of the hostname was performed by the vantage
point using its local DNS server configuration.
Limitations. Our traceroutes from the RIPE Atlas platform are limited by the platform’s
coverage and representativeness, e.g., the number of networks from which we execute a
traceroute from an Atlas node, and whether traceroutes from those networks (in aggregate)
are representative of paths between content providers and end-user networks. To address this
limitation, we calibrate measurements from RIPE Atlas by comparing the distribution of path
lengths measured from Atlas vantage points to our cloud instances against those measured
from our cloud instances to the prefixes in our CDN trace.
3.2.2.2 Processing traceroutes to obtain AS paths
Our measurements are IP-level traceroutes, but our analysis is over AS-level paths (AS_PATHs).
Challenges exist in converting IP-level paths to AS_PATHs [89, 286, 287]; we do not innovate on
this front and simply adopt widely-used practices.
First, we remove any unresponsive hops, private IP addresses, and IP addresses associated with Internet Exchange Points (IXPs); we filter IXPs because they simply facilitate connectivity between peers, identifying IXP ASNs and IP addresses using two CAIDA supplementary lists [164, 428]. Next, we use a dataset from iPlane [280] to convert the remaining IP addresses to the ASNs that originate them, and we remove any ASNs that correspond to IXPs. If the iPlane data does not include an ASN mapping for an IP address, we insert an unknown ASN indicator into the AS_PATH. We remove one or more unknown ASN indicators if the ASNs on both sides of the unknown segment are the same, or if a single unknown hop separates two known ASNs. After we apply these heuristics, we discard paths that still contain unknown segments. We then merge adjacent ASNs in a path if they belong to the same organization (known as siblings), using existing organization lists [66], since these ASNs are under shared administration. Finally,
we exclude paths that do not reach the destination AS. Post-filtering, we have traceroutes from the
cloud for 3M of the 3.8M /24 prefixes.
We compared the AS_PATHs inferred by our approach with those inferred by a state-of-the-art
approach designed to exclude paths with unclear AS translations [89], generating the results in our
paper using both approaches. The minor differences in the output of the two approaches do not
impact our results meaningfully, and so we only present results from our approach.
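A condensed sketch of these heuristics is shown below (Python). The IP-to-ASN map, IXP lists, sibling groups, and the convention that unresponsive hops appear as None are stand-ins for the iPlane and CAIDA inputs described above; the sketch is meant only to make the filtering and merging steps concrete.

```python
import ipaddress

UNKNOWN = "?"  # marker for hops we cannot map to an ASN

def to_as_path(hops, ip_to_asn, ixp_ips, ixp_asns, sibling_of, dst_asn):
    """Convert one IP-level traceroute into an AS-level path, or return None if it is unusable."""
    # 1. Drop unresponsive hops (None), private addresses, and IXP addresses.
    hops = [h for h in hops
            if h is not None
            and not ipaddress.ip_address(h).is_private
            and h not in ixp_ips]
    # 2. Map IPs to origin ASNs (UNKNOWN if unmapped), drop IXP ASNs, and
    #    normalize siblings (ASNs under shared administration) to one ASN.
    path = [ip_to_asn.get(h, UNKNOWN) for h in hops]
    path = [sibling_of.get(a, a) for a in path if a not in ixp_asns]
    # 3. Remove unknown runs bracketed by the same ASN, or a single unknown
    #    between two known ASNs; otherwise discard the path. Also collapse
    #    consecutive duplicates.
    merged = []
    i = 0
    while i < len(path):
        if path[i] == UNKNOWN:
            j = i
            while j < len(path) and path[j] == UNKNOWN:
                j += 1
            prev = merged[-1] if merged else None
            nxt = path[j] if j < len(path) else None
            if prev is not None and nxt is not None and (prev == nxt or j - i == 1):
                i = j          # safe to drop the unknown run
                continue
            return None        # unresolvable unknown segment: discard the path
        if not merged or merged[-1] != path[i]:
            merged.append(path[i])
        i += 1
    # 4. Keep only paths that reach the destination AS.
    return merged if merged and merged[-1] == sibling_of.get(dst_asn, dst_asn) else None
```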
3.3 Internet Path Lengths
3.3.1 Measuring paths from the cloud
To begin answering what paths look like for one of these popular source ASes, we use our traceroutes
from GCE, Google’s cloud offering, to the same set of iPlane destinations. We use GCE traceroutes
as a view of the routing of a major cloud provider for a number of reasons. First, traceroutes from
the cloud give a much broader view than traceroutes to cloud and content providers, since we can
measure outward to all networks rather than being limited to the relatively small number where we
have vantage points.
Second, we are interested in the routing of high-volume services. Google itself has a number of
popular services, ranging from latency-sensitive properties like Search to high-volume applications
like YouTube. GCE also hosts a number of third-party tenants operating popular services which
benefit from the interdomain connectivity Google has established for its own services. For the
majority of these services, most of the traffic flows in the outbound direction.
Third, Google is at the forefront of the trends we are interested in understanding, maintaining
open peering policies around the world, a widespread WAN [159], a cloud offering, and ISP-
hosted front end servers [70]. Fourth, some other cloud providers that we tested filter traceroutes
(Section 3.3.4 discusses measurements from Amazon and SoftLayer, which also do not filter).
Finally, previous work developed techniques that allow us to uncover the locations of Google servers
and the client-to-server mapping [70], enabling some of the analysis later in this chapter.
Result: Paths from the cloud are short. Compared to PlanetLab paths towards iPlane destina-
tions, GCE paths towards iPlane destinations are much shorter: 87% are at most two hops, and
41% are one hop, indicating that Google interconnects directly with the AS originating the prefixes.
Given the popularity of Google services in particular and cloud-based services in general, these
short paths may better represent today’s Internet experience. However, even some of these paths
may not reflect real traffic, as some iPlane prefixes may not host Google clients.
Result: Paths from the cloud to end-users are even shorter. In order to understand the paths
between the cloud and end-users, we analyze 3M traceroutes from GCE to client prefixes in our
CDN trace (§3.2.2). We assume that, since these prefixes contain clients of one CDN, most of them
host end-users likely to use other large web services like Google’s. As seen in Figure 3.1, 61% of
the prefixes have one hop paths from GCE, meaning their origin ASes interconnect directly with
Google, compared to 41% of the iPlane destinations.
Figure 3.1: Path lengths from a Google Compute Engine virtual machine and a PlanetLab virtual machine to iPlane and end-user destinations
Result: Prefixes with more traffic have shorter paths. The preceding analysis considers the
distribution of AS hops across prefixes, but the distribution across queries/requests/flows/bytes may
differ, as per prefix volumes vary. For example, in our CDN trace, the ratio between the highest and
lowest per prefix query volume is 8.7M:1. To approximate the number of AS hops experienced by
queries, the GCE to end-users, weighted line in Figure 3.1 weights each of the 3M prefixes by its
query volume in our CDN trace (§3.2.2), with over 66% of the queries coming from prefixes with a
one hop path.
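The weighting itself is simple; a toy sketch (Python, with made-up per-prefix hop counts and query volumes) shows how the unweighted and query-weighted distributions in Figure 3.1 differ.

```python
from collections import defaultdict

# Toy stand-ins: AS-hop count per /24 prefix (from the GCE traceroutes) and
# normalized query volume per prefix (from the CDN log).
hops_by_prefix = {"192.0.2.0/24": 1, "198.51.100.0/24": 2, "203.0.113.0/24": 1}
queries_by_prefix = {"192.0.2.0/24": 8.0, "198.51.100.0/24": 1.0, "203.0.113.0/24": 3.0}

def hop_distribution(weighted: bool):
    """Fraction of prefixes (unweighted) or of queries (weighted) per AS-hop count."""
    totals = defaultdict(float)
    for prefix, hops in hops_by_prefix.items():
        totals[hops] += queries_by_prefix[prefix] if weighted else 1.0
    denom = sum(totals.values())
    return {h: v / denom for h, v in sorted(totals.items())}

print(hop_distribution(weighted=False))  # share of prefixes per hop count, e.g. {1: 0.67, 2: 0.33}
print(hop_distribution(weighted=True))   # share of queries per hop count, e.g. {1: 0.92, 2: 0.08}
```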
While our quantitative results would differ with a trace from a different provider, we believe that
qualitative differences between high and low volume paths would hold. The dataset has limitations:
the trace is only one hour, so suffers time-of-day distortions, and prefix weights are representative of
the CDN’s client distribution but not necessarily Google’s client distribution. However, the dataset
suffices for our purposes: precise ratios are not as important as the trends of how paths with no/low
traffic differ from paths with high traffic, and a prefix that originates many queries in this dataset is
more likely to host users generating many queries for other services.
Insight: Path lengths to a single AS can vary. We observed traceroutes traversing paths of
different lengths to reach different prefixes within the same destination AS. Possible explanations
for this include: (1) traffic engineering by Google, the destination AS, or a party in between; and (2)
split ASes, which do not announce their entire network at every interconnection, often due to lack
of a backbone or a capacity constraint. Of 17,905 ASes that had multiple traceroutes in our dataset,
4876 ASes had paths with different lengths. Those 4876 ASes contribute 72% of the query volume
in our CDN trace, with most of the queries coming from prefixes that have the shortest paths for the
ASes. The GCE to end-users weighted, shortest path bars in Figure 3.1 show how long paths would
be if all traffic took the shortest observed path to its destination AS. With this hypothetical routing,
80% of queries traverse only one hop.
Insight: Path lengths vary regionally. Interconnections can also vary by region. For example,
overall, 10% of the queries in our CDN log come from end users in China, 25% from the US, and
20% from Asia-Pacific excluding China. However, China has longer paths and less direct peering,
so 27% of the 2 hop paths come from China, and only 15% from the US and 10% from Asia-Pacific.
3.3.2 Google’s interconnections
Result: We estimate that Google interconnects with over 5000 ASes. Based on our analysis of the traceroutes we collected in March 2015, Google interconnects with 5083 ASes (after merging siblings). Google publishes its peering policy and facilities list at http://peering.google.com and in PeeringDB, although these lists do not contain its actual peers, and Google does not necessarily maintain a private network interconnection with each of these ASes; some interconnections may be established via the shared fabric of a public IXP and with ASes using remote peering (§2.2.2, [79]).
Insight: Google interconnects with a larger fraction of higher volume ASes. By interconnect-
ing directly, networks may be able to improve performance, sidestep congestion in transit networks,
and potentially reduce costs. For instance, in Chapter 4 we discuss how Facebook must establish
peering connectivity in order to have sufficient capacity to deliver its content to users.
As such, we expect that Google will seek to interconnect with AS that it exchanges significant
traffic with. To confirm this, we investigate the projected query volume of ASes that do and do not
peer with Google. We form a flow graph by combining the end-user query volumes from our CDN
trace with the AS_PATHs defined by our GCE traceroutes. So, for example, the total volume for an AS will have both the queries from that AS’s prefixes and from its customers’ prefixes if traceroutes
to the customer went via the AS. We group the ASes into buckets based on this aggregated query
volume.
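The aggregation can be sketched as follows (Python; the prefixes, paths, and volumes are toy examples): each prefix’s query volume is credited to its origin AS and to every intermediate AS on the traceroute-derived path, and ASes are then bucketed by powers of ten as in Figure 3.2.

```python
import math
from collections import defaultdict

# Toy inputs: AS path from GCE to each /24 prefix (origin AS is the last hop,
# Google's own AS omitted) and the prefix's normalized query volume.
as_path_to_prefix = {"192.0.2.0/24": ["AS_T1", "AS_EYEBALL"],    # one intermediate transit AS
                     "198.51.100.0/24": ["AS_EYEBALL"]}          # direct interconnection
queries_by_prefix = {"192.0.2.0/24": 40.0, "198.51.100.0/24": 500.0}

# Credit each prefix's volume to every AS that its traffic traverses.
volume_by_as = defaultdict(float)
for prefix, path in as_path_to_prefix.items():
    for asn in path:
        volume_by_as[asn] += queries_by_prefix[prefix]

# Bucket ASes by aggregated volume (powers of ten), as in Figure 3.2.
bucket = {asn: int(math.floor(math.log10(v))) for asn, v in volume_by_as.items()}
print(dict(volume_by_as))  # {'AS_T1': 40.0, 'AS_EYEBALL': 540.0}
print(bucket)              # {'AS_T1': 1, 'AS_EYEBALL': 2}
```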
Figure 3.2: How many (and what fraction) of ASes Google interconnects with by AS size. AS size is the number of queries that flow through it, given paths from GCE to end-user prefixes and per prefix query volumes in our CDN trace. Volumes are normalized and bucketed by powers of 10.
Figure 3.2 shows the number of ASes within each bucket that do / do not interconnect with Google in our traceroutes. As expected, Google interconnects with a larger fraction of higher volume ASes. And while there are still high volume ASes that do not peer with Google, most ASes that do
not interconnect are small in terms of traffic volume and, up to the limitations of public geolocation
information, geographic footprint. We used MaxMind to geolocate the prefixes that Google reaches
via a single intermediate transit provider, then grouped those prefixes by origin AS. Of 20,946 such
ASes, 74% have all their prefixes located within a 50 mile diameter (geolocation errors may distort this result, although databases tend to be more accurate for end-user prefixes like the ones in question). However, collectively these
ASes account for only 4% of the overall query volume.
Insight: Google’s connectivity is increasing over time. We evaluated how Google’s intercon-
nections changed over time by comparing our March 2015 traces with an additional measurement
conducted in August 2015. In August, we observed approximately 700 more interconnections
than the 5083 we measured in March. While some of these interconnections may have been re-
cently established, others may have been previously hidden from our vantage point, possibly due to
traffic engineering or other limitations of our vantage point (§3.2.2). These results suggest that a
longitudinal study of cloud connectivity may provide new insights.
3.3.3 Estimating paths to a popular service (Google search)
The previous results measured the length of paths from Google’s GCE cloud service towards end-
user prefixes. However, these paths may not be the same as the paths from large web properties such
as Google Search and YouTube for at least two reasons. First, Google and some other providers
deploy front-end servers inside some end-user ASes [70], which we refer to as off-net servers. As
a result, some client connections terminate at off-nets hosted in other ASes than where our GCE
traceroutes originate. Second, it is possible that Google uses different paths for its own web services
than it uses for GCE tenants. In this section, we first describe how we estimate the path length
from end-users to google.com, considering both of these factors. We then validate our approach. Finally, we use our approach to estimate the number of AS hops from end-users to google.com
and show that some of the paths are shorter than our GCE measurements above.
Approach. First, we use EDNS0 client-subnet queries to resolve google.com for each /24
end-user prefix, as in our previous work [70]. Each query returns a set of server IP addresses for that
end-user prefix to use. Next, we translate the server addresses into ASes as described in §3.2.2.1. We
discard any end-user prefix that maps to servers in multiple ASes, leaving a set of prefixes directed
to servers in Google’s AS and a set of prefixes directed to servers in other ASes.
For end-user prefixes directed towards Google’s AS, we estimate the number of AS hops to
google.com as equal to the number of AS hops from GCE to the end-user prefix, under the
assumption, which we will later validate, that Google uses similar paths for its cloud tenants and its
own services. For all other traces, we build a graph of customer/provider connectivity in CAIDA’s
AS relationship dataset [428] and estimate the number of AS hops as the length of the shortest path
between the end-user AS and the off-net server’s AS (if the end-user AS and off-net AS are the same, the length is zero). Since off-net front-ends generally serve only
clients in their customer cone [70] and public views such as CAIDA’s should include nearly all
customer/provider links that define these customer cones [319], we expect these paths to usually be
accurate.
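The estimation logic can be sketched as follows (Python). Here resolve_frontend_as stands in for the EDNS0 client-subnet resolution step, and gce_hops, caida_graph, and origin_as stand in for our GCE measurements, CAIDA’s AS-relationship graph, and the prefix-to-origin mapping; none of these names come from the actual measurement pipeline.

```python
from collections import deque

def shortest_as_hops(graph, src_as, dst_as):
    """BFS over an undirected customer/provider graph; returns AS-hop count or None."""
    if src_as == dst_as:
        return 0
    seen, queue = {src_as}, deque([(src_as, 0)])
    while queue:
        node, dist = queue.popleft()
        for nbr in graph.get(node, ()):
            if nbr == dst_as:
                return dist + 1
            if nbr not in seen:
                seen.add(nbr)
                queue.append((nbr, dist + 1))
    return None

def estimate_hops_to_google(prefix, resolve_frontend_as, gce_hops, caida_graph, origin_as):
    """Estimate AS hops from an end-user prefix to google.com (sketch of the approach in §3.3.3)."""
    frontend_as = resolve_frontend_as(prefix)   # EDNS0 client-subnet resolution (stand-in)
    if frontend_as == "GOOGLE":
        return gce_hops[prefix]                 # assume google.com uses paths similar to GCE's
    # Off-net front end: shortest customer/provider path from the client AS
    # to the AS hosting the off-net (0 if they are the same AS).
    return shortest_as_hops(caida_graph, origin_as[prefix], frontend_as)
```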
Validation. To validate our methodology for estimating the number of AS hops to google.com, we used traceroutes from 1409 RIPE Atlas probes to google.com and converted them to AS_PATHs (traceroutes issued from the remaining 191 probes failed). We also determined the AS hosting the Atlas probe and estimated the number of AS hops from it to google.com as described above; because we are unable to determine the source IP address for some Atlas probes, we make these estimates at the AS level.
For the 289 ground-truth traces directed to off-nets, we calculate the difference between the estimated and measured number of AS hops. For the remaining 1120 traces that were directed to front-ends within Google’s network, we may have traceroutes from GCE to multiple prefixes in the
Atlas probe’s AS. If their lengths differed, we calculate the difference between the Atlas-measured AS hops and the GCE-measured path with the closest number of AS hops.

Type                 Count    no error    error ≤ 1 hop
all paths            1,409    81.26%      97.16%
paths to on-nets     1,120    80.89%      98.21%
paths to off-nets      289    82.70%      93.08%
paths w/ 1 hop         925    86.05%      97.62%
Table 3.1: Estimated vs. measured path lengths from RIPE Atlas vantage points to google.com

Table 3.1 shows the result of our validation: overall, 81% of our estimates have the same
number of AS hops as the measured paths, and 85% in cases where the number of hops is one
(front-end AS peers with client AS). We conclude that our methodology is accurate enough to
estimate the number of AS hops for all clients to google.com, especially for the short paths
we are interested in. Applying our estimation technique to the full set of end-user prefixes, we
arrive at the estimated AS hop distribution shown in the Google.com to end-users, weighted line in
Figure 3.3.
Insight: Google’s off-nets shorten some paths even more. The estimated paths between google.com and end-user prefixes are shorter overall than the traces from GCE, with 73% of queries coming from ASes that either interconnect with Google, use off-nets hosted in their providers, or themselves host off-nets. (As of 2020, Google no longer appears to direct traffic for google.com to such off-nets; however, YouTube videos and other forms of static content continue to be served by off-nets.)
For clients served by off-nets, the front-end to back-end
portions of their connections also cross domains, starting in the hosting AS and ending in a Google
datacenter. The connection from the client to front-end likely plays the largest role in client-perceived
10
As of 2020, Google no longer appears to direct traffic forgoogle.com to such off-nets. However, YouTube videos
and other forms of static content continue to be served by off-nets.
76
0 1 2 3 and above
Number of Hops
0
20
40
60
80
100
Percentage of Paths
GCE to end-users,
weighted
Google.com to end-users,
weighted
Figure 3.3: Paths lengths from Google.com and Google Compute Engine to end-users
performance, since Google has greater control of, and can apply optimizations to, the connection
between the front-end and back-end [138]. Still, we evaluated that leg of the split connection by
issuing traceroutes from GCE to the full set of Google off-nets [70]. Our measurements show that
Google has a direct connection to the hosting AS for 62% of off-nets, and there was only a single
intermediate AS for an additional 34%.
3.3.4 Paths to other popular content
In this section, we compare measurements of Google and other providers. First, in Figure 3.4, we
compare the number of AS hops (weighted by query volume) from GCE to the end-user prefixes
to the number of AS hops to the same targets from two other cloud providers. While SoftLayer
and AWS each have a substantial number of one hop paths, both are under 42%, compared to well over 60% for GCE. Still, the vast majority of SoftLayer and AWS paths have two hops or less.
Figure 3.4: Path lengths from different cloud platforms to end-users.
Our measurements and related datasets suggest that these three cloud providers employ different
interconnection strategies. For example, we find that Google interconnects widely: based on our
traceroutes, Google interconnects with 5083 ASes and uses a one hop route for 65% of the queries
in our CDN query log. According to CAIDA’s AS-relationships dataset [428], Google only has 5
providers, and our traceroutes indicate that they use those providers to reach end users responsible
for 10% of the queries in our CDN trace. In comparison, Amazon interconnects with 756 ASes
and uses a one hop route for 35% of the queries in our CDN query log. However, Amazon uses
routes through 20 providers for 50% of queries in our CDN query log. SoftLayer is a middle ground,
interconnecting with 1986 ASes, using a one hop route for 40% of the queries in our CDN query
log, and using routes through 11 providers for another 47% of the queries.
Figure 3.5: Path lengths from RIPE Atlas nodes to content and cloud
Other large CDNs and cloud providers are building networks similar to Google’s to reduce transit costs and improve quality of experience for end-users (we discuss Facebook’s CDN in Chapters 4 and 5), but we cannot issue traceroutes from within these providers’ networks (Microsoft’s Azure Cloud appears to block outbound traceroutes and we do not have vantage points in other CDN networks). As a workaround, we execute traceroutes from a set of RIPE Atlas probes towards facebook.com and Microsoft’s bing.com. We calibrate these results with our earlier ones by comparing to traceroutes from the Atlas probes towards google.com and our GCE instance.
Figure 3.5 shows the number of AS hops to each destination; the percentages are of total Atlas paths, not weighted.
The AS hop distribution to
bing.com is nearly identical to the AS hop distribution to GCE. Paths to bing.com are longer
than paths to google.com, likely because Microsoft does not have an extensive set of off-net servers like Google’s (to our knowledge, since the time of our study in 2015, Microsoft has added off-net servers similar to Google’s). Facebook trails the pack, with just under 40% of paths to facebook.com having 1 AS hop.
3.3.5 Summary
Path lengths for popular services tend to be much shorter than random Internet paths. For instance,
while only 2% of PlanetLab paths to iPlane destinations are one hop, we estimate that 73% of
queries to google.com go directly from the client AS to Google.
3.4 Can Short Paths be Better Paths?
Our measurements suggest that much of the Internet’s popular content traverses at most one interdo-
main interconnection (one hop) on its path from source to end-users, and we predict that competitive
pressures, capacity demands (§4.2.3), and the increasingly stringent network requirements of today’s
popular content will cause this trend to continue.
In this section, we sketch how these short paths may provide a foothold to making progress on
longstanding Internet routing problems, including some of those discussed in Section 2.4. Because
traffic is concentrating along short paths, solutions tailored for this setting can have significant
impact even if they do not work as well for (or are not deployed along) arbitrary Internet paths.
3.4.1 Short paths sidestep existing hurdles
One-hop paths only involve parties invested in path performance. The performance of web
traffic depends on the intra- and inter-domain routing decisions of every AS on the path. However,
while the source and destination are typically incentivized to improve performance given that it
determines user quality of experience and (potentially) revenue, transit ASes are arguably less
interested in (and potentially less aware of) a path’s performance.
Shorter paths better support cooperative routing. BGP provides network operators with lim-
ited control over how traffic ingresses an AS (§2.4.2). For instance, an AS can set MED values in its
advertisements and use BGP communities to express policy, but the impact of such signals ultimately
depends on the routing policies employed by other networks, and such signals typically only directly
influence (in the best case) the routing decisions of immediate neighbors. This limitation leaves
ASes with almost no ability to affect the routing past their immediate neighbors. However, one-hop
paths only consist of immediate neighbors, removing this problem.
3.4.2 Short paths can simplify many problems
Joint traffic engineering. Prior work found that instances of circuitous routing can arise due to
the challenges of mutually optimizing across AS boundaries and the limitations of BGP routing
policies [399] — an AS’s routing policy can be either early-exit or late-exit (§2.3.1.2).
In response, prior work proposed protocols to enable joint optimization of routing between
neighboring ASes [283]. Yet such protocols become more complex when they must be designed to
optimize routes that traverse intermediate ASes [282], to the point that it is unclear what fairness
and performance properties they guarantee. In comparison, one-hop paths between provider and
end-user ASes reduce the need for complicated solutions by enabling direct negotiation between the
parties that benefit the most. Furthermore, since the path is direct and does not involve the rest of
the Internet, it may be possible to use channels or protocols outside, alongside, or instead of BGP,
without requiring widespread adoption of changes.
Limiting prefix hijacks. Prefix origins can be authenticated with the RPKI, now being adopted,
but it does not enable authentication of the non-origin ASes along a path [278] (§2.4.1). So, an
AS having direct paths does not on its own prevent other ASes from hijacking the AS’s prefixes
via longer paths. While RPKI plus direct paths are not a complete solution by themselves, we
view them as a potential building block towards more secure routing. If an AS has authenticated
its prefix announcements, it seems reasonable for direct peers to configure preferences to prefer
one-hop, RPKI-validated announcements over competing advertisements — especially if those
announcements traverse a peering interconnection with a CDN, cloud provider, or end-user ISP.
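One way to express this preference is sketched below (Python; the route records and the rpki_valid flag are illustrative, not a real routing-policy language): among competing announcements for a prefix, an RPKI-valid announcement received directly from the origin over a peering wins over longer-path alternatives.

```python
# Illustrative preference among competing announcements for the same prefix.
# Each candidate records whether RPKI origin validation succeeded and whether
# the announcement was received directly from the origin AS over a peering.
candidates = [
    {"via": "transit-provider",    "as_path": ["T1", "T2", "ORIGIN"], "rpki_valid": True,  "direct_from_origin": False},
    {"via": "peering-with-origin", "as_path": ["ORIGIN"],             "rpki_valid": True,  "direct_from_origin": True},
    {"via": "suspicious-peer",     "as_path": ["HIJACKER"],           "rpki_valid": False, "direct_from_origin": False},
]

def preference(route):
    # Higher tuple wins: prefer RPKI-valid one-hop routes from the origin,
    # then any RPKI-valid route, then shorter AS_PATHs.
    return (route["rpki_valid"] and route["direct_from_origin"],
            route["rpki_valid"],
            -len(route["as_path"]))

best = max(candidates, key=preference)
print(best["via"])  # "peering-with-origin": the direct, RPKI-validated announcement wins
```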
Preventing spoofed traffic. Major barriers exist to deploying effective spoofing prevention (§2.4.1).
First, filters are only easy to deploy correctly near the edge of the Internet [43]. Second, existing
approaches do not protect the AS deploying a filter, but instead prevent that AS from originating
attacks on others. As a result, ASes lack strong incentives to deploy spoofing filters [43].
The short paths on today’s Internet create a setting where it may be possible to protect against
spoofing attacks for large swaths of the Internet by sidestepping the existing barriers. An AS like
Google that connects directly to most origins should know valid source addresses for traffic over
any particular peering and be able to filter spoofed traffic, perhaps using strict uRPF filters. The
direct connections address the first barrier by removing the routing complexity that complicates
filter configuration, essentially removing the core of the Internet from the path entirely. The AS is the destination (most CDN and cloud ASes are stub networks), removing the second barrier as it can protect itself by filtering spoofed traffic over
many ingress links. While these mechanisms do not prevent all attacks — for instance, an attacker
can still spoof traffic from a CDN’s address space in a reflection attack — they reduce the attack
surface and may be part of a broader solution.
Speeding route convergence. BGP can experience delayed convergence (§2.4.3), inspiring gen-
eral clean-slate alternatives such as HLP [410] and simpler alternatives with restricted policies that
have better convergence properties [375]. Our findings on the flattening of the path distribution may
make the latter class of solutions appealing. Specifically, it may suffice to deploy restricted policies
based on BGP next-hop alone [375] for one-hop neighbors. Here too, the incentive structure is
just right: delayed route failovers can disrupt popular video content, so the content provider wants
to ensure fast failover to improve the user’s quality of experience.
Faster isolation and repair of problems. The Internet is susceptible to long-lasting partial
outages in transit ASes [235]. The transit AS lacks visibility into end-to-end connections, so it may
not detect a problem, and the source and destination lack visibility into or control over transit ASes,
making it difficult to even discern the location of the problem (§2.4.2, [236]). With a direct path, an
AS has much better visibility and control over its own routing to determine and fix a local problem,
or it can know the other party — also invested in the connection — is to blame. Proposals exist to
enable coordination between providers and end-user networks [144], and such designs could enable
reactive content delivery that adapts to failures and changes in capacity.
3.5 Conclusion
This chapter examined how large CDN and cloud ASes have flattened the Internet’s hierarchy by
building out points of presence and widely establishing interconnections, and how this transformation
has led to much of the Internet’s traffic traversing just “one hop” to get to the end-user’s ISP. This
trend towards one-hop paths for important content will likely accelerate, driven by competitive
pressures, user expectations, and the demands of modern Internet traffic. The trend further suggests
that, in a departure from the current focus on general solutions, interdomain routing and traffic
engineering techniques may benefit from optimizing for the common case of one-hop paths, a
regime where simpler, deployable solutions may exist.
Chapter 4
Engineering Egress with EDGE FABRIC
Steering Oceans of Content to the World
4.1 Introduction
In Chapter 3, we found that CDNs have changed the Internet’s structure by establishing points of
presence around the world and interconnecting widely. While traffic traditionally passed through a
hierarchy in which transit providers played a central role — flowing from content providers upwards
to transit providers and then back down to end-user ISPs — our measurements revealed that on
today’s Internet, CDNs often have direct interconnections to end-user networks that enable their
traffic — which represents the bulk of global Internet traffic — to bypass this hierarchy altogether.
However, our measurements in Chapter 3 were limited in two ways. First, traceroutes only
capture how an AS is currently routing traffic and thus do not shed light on how the AS makes routing decisions, nor on related aspects such as path diversity and fault-tolerance. While longitudinal studies may be able to (partially) capture an AS’s path diversity and provide insights into some of these aspects, they cannot provide a broader understanding of the opportunities and challenges that CDNs face on today’s flattened Internet. Second, while we could execute traces to
any destination from our cloud instances, we still had to rely on traceroutes from vantage points in
end-user networks to characterize Internet paths between end-users and CDNs used by services like
Facebook and Google Search, limiting the fidelity and coverage of our estimates. In this chapter, we
take an insider’s look at Facebook’s CDN to gain deeper insights into all of these aspects.
Facebook’s CDN includes dozens of PoPs spread around the world that are used to serve over two
billion users across the vast majority of countries. In this chapter, we characterize Facebook’s connectivity at these PoPs and examine how Facebook uses EDGE FABRIC, a software-defined egress route controller (§2.3.1.3), to sidestep BGP’s limitations and make efficient use of Facebook’s connectivity.
We find that because Facebook interconnects widely at each PoP — establishing peering inter-
connections with regional networks and end-user ISPs — each of Facebook’s PoPs has substantial
path diversity and often has a short, direct path via a peering interconnection into end-user ISPs
served by the PoP. Unsurprisingly, this rich connectivity provides many benefits. The short, often
direct paths between Facebook and end-user ISPs provide more control and visibility, and the
aggregate connectivity of each point of presence provides the capacity Facebook needs to deliver
content to end-users and enables Facebook to avoid capacity constraints in transit networks. Path
diversity provides fault-tolerance and potential opportunities for performance-aware routing.
However, our study also reveals that it is challenging to make effective use of this rich connec-
tivity due to the limitations of the Border Gateway Protocol (BGP) (§2.1). Strikingly, despite the
massive changes in Internet traffic being delivered and the topology it is delivered over, the protocol
used to route the traffic — BGP — has remained essentially unchanged for over 20 years, and
significant barriers exist to replacing it (§2.4). While it is impressive that BGP has accommodated
these changes, BGP’s best path selection process and design as an information hiding protocol make
it ill-suited to the task of routing Facebook’s and other large CDNs’ traffic on the flattened Internet:
• BGP cannot consider capacity, utilization, or performance in its routing decisions.
BGP’s best path selection process makes decisions based on static policies that operate
over route attributes (e.g., AS_PATH). As a result, BGP cannot incorporate dynamic signals
such as capacity or performance into its decision process (§§ 2.1.2, 2.3.1 and 2.4.3). As
we discuss in this chapter, this limitation makes it challenging for CDNs like Facebook to
make efficient use of their connectivity, with the most pressing issue being that BGP’s routing
decisions can lead to interconnections becoming congested.
• BGP route updates do not communicate route performance or capacity. BGP route an-
nouncements do not include any information about a route’s performance or capacity (§2.4.3).
Thus, even if an AS delegates its routing decisions to a separate system, accounting for
performance and capacity requires that system to have additional sources of telemetry.
• BGP routes all traffic through the preferred route. While this simplicity provides some
design and operational advantages, it also prevents CDNs like Facebook from making the
most efficient use of constrained capacity. For instance, when a peering interconnection
with an end-user ISP is capacity constrained, Facebook’s routing policy should be able to
prioritize using the interconnection for performance-sensitive traffic — given that the short,
direct path may offer better performance (§§ 2.1.2 and 2.3.1) — and assign elastic traffic to
routes that traverse other interconnections with available capacity, but such an optimization is
not possible. This simplicity also presents barriers to performance-aware routing. Since route
performance is not available through control-plane signals, and because a route’s performance
may change, performance-aware routing requires continuously measuring path performance
by sending traffic — either existing production traffic, or active measurement traffic — over
alternate routes.
These limitations of BGP and others are discussed in further detail in Section 2.4. In this chapter,
we focus on their implications in Facebook’s environment and how they are addressed with EDGE
FABRIC. We make four contributions:
We characterize Facebook’s connectivity and egress traffic, and show that capacity con-
straints and volatility prevent CDNs from making efficient use of such connectivity with BGP.
Facebook invests heavily in building points of presence and establishing interconnections, and as a
result it is common for Facebook’s PoPs to have four or more routes to prefixes containing Facebook
users. However, our measurements show that despite these investments, Facebook’s interconnections
sometimes have insufficient capacity and would become congested if routing decisions were made
solely by BGP based on Facebook’s routing policy (§§ 4.2 and 4.3). While prior work has found
that interconnections in the US often have spare capacity [130] — and our measurements across
20 of Facebook’s PoPs generally agree with this conclusion — our measurements also reveal a
subset of interconnections for which Facebook’s routing policy would lead to BGP assigning an interconnection twice as much traffic as it has capacity! Moreover, our measurements show that the
traffic demand from a PoP to a prefix can be unpredictable, with traffic rates to the same prefix
at a given time exhibiting as much as 170x difference across weeks, likely due to a combination
of changes in user demand and decisions made by Facebook’s global load balancer. In addition,
failures can cause sudden changes in capacity that cause traffic to shift between interfaces and PoPs,
and can last on the order of seconds to days. This problem of insufficient capacity is not unique to
Facebook, with recent work showing the problem occurring for other CDNs as well [469].
Thus, while we find that the rich connectivity of CDNs like Facebook can provide numerous
benefits — shorter paths, more options for routing, and significant capacity in aggregate — we also
show that interconnection capacity constraints and irregular traffic combined with the limitations of
BGP make it difficult for CDNs to use this connectivity in its most basic form.
We present the design of EDGE FABRIC — a software-defined egress route controller de-
ployed in production — and show that it enables efficient use of Facebook’s connectivity.
Facebook wants to route as much of its traffic as possible via the peering interconnections that it
establishes with other networks so that it can gain the benefits of short, direct paths (chapter 3). How-
ever, Facebook’s interconnections are sometimes capacity constrained, and BGP cannot incorporate
capacity into its decision process. We discuss how EDGE FABRIC, an instance of software-defined
networking (SDN, §2.3.1.3), enables Facebook to sidestep BGP’s limitations by delegating control
of routing decisions from BGP at routers to a flexible, per-PoP software controller (§4.4).
EDGE FABRIC receives BGP routes from Facebook’s routers, monitors capacities and demand,
and determines how to assign traffic to routes. EDGE FABRIC enacts its decisions by injecting
selected routes using BGP into Facebook’s routers, overriding the router’s BGP decision process in
a way that is compatible with existing routers’ BGP implementations, and thus widely deployable.
We discuss how EDGE FABRIC considers a number of inputs when making decisions, and how it
uses a stateless decision process that eases development and testing. EDGE FABRIC is deployed in
production, and we show how it enables Facebook to route as much traffic as possible according
to its preferred routing policy — with EDGE FABRIC, Facebook interconnections can support
utilizations as high as 95% — while also preventing packet loss by shifting traffic as needed to
prevent congestion (§4.4.5).
We design footholds to support performance and application-aware routing. In addition to
resolving capacity constraints, our design of EDGE FABRIC provides footholds for performance and
application-aware routing. Prior work has supported such flexibility through host-based routing [23,
469] — or the use of SDN-enabled switches [187] — both of which represent a paradigm shift in how
an AS routes its traffic and bring new challenges in maintaining and synchronizing state (§4.6.1.2).
In comparison, much like our approach to handling capacity constraints, our approach to providing
such flexibility is designed to be compatible with existing BGP implementations and is incrementally
adoptable. With our approach (§4.5), hosts use labels to mark flow types — for instance, that a
flow is carrying latency sensitive traffic — but do not determine the explicit route that the flow will
traverse. In parallel, EDGE FABRIC makes routing decisions per label and enacts these decisions by
using BGP to inject routes into alternate routing tables at Facebook’s routers. This design does not
require hosts to be aware of current network conditions or available routes, eliminating the need
for synchronization and simplifying our system’s design while still supporting a wide range of use
cases. In Chapter 5 we use this capability to evaluate opportunities for performance-aware routing
by routing a small fraction of production traffic via alternate routes.
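As a concrete illustration of host-side labeling, the Python sketch below marks a flow with a DSCP code point via the standard socket API; routers can then be configured to map that mark to an alternate routing table. The specific code point and its meaning are assumptions for illustration, not the labels Facebook actually uses.

    import socket

    DSCP_LATENCY_SENSITIVE = 34   # e.g., AF41; an assumed label value

    def open_labeled_connection(host: str, port: int, dscp: int) -> socket.socket:
        """Open a TCP connection whose packets carry the given DSCP mark."""
        s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
        # The DSCP field occupies the upper six bits of the legacy ToS byte.
        s.setsockopt(socket.IPPROTO_IP, socket.IP_TOS, dscp << 2)
        s.connect((host, port))
        return s

    # The application states only what kind of traffic the flow carries; the
    # egress controller decides which route flows with this mark will take.
    # conn = open_labeled_connection("192.0.2.10", 443, DSCP_LATENCY_SENSITIVE)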
We share operational insights. EDGE FABRIC has been deployed in production since 2013. We
evaluate how EDGE FABRIC operates in production and share how EDGE FABRIC’s design has
evolved over time (§4.6.1). In addition, we discuss unique challenges that arise at Facebook’s scale
in making use of interconnections at public IXPs (§§ 2.2.2 and 4.6.2).
4.2 Background: Overview of Facebook’s CDN
Chapter 2 introduced the fundamentals of interconnection, including BGP (§2.1), points of pres-
ence (PoPs, §2.2.1), interconnection types (§2.2.2), and routing policies (§2.3.1), and additionally
discussed common approaches CDNs employ to direct user requests to points of presence (§2.3.2).
In this section, we discuss these concepts in the context of Facebook’s CDN.
4.2.1 Points of presence
To help reduce user latencies and improve performance, Facebook has deployed dozens of PoPs
across six continents. The use of multiple PoPs reduces latencies in two ways: (1) they cache content
to serve users directly, and (2) when a user needs to communicate with a data center, the user’s
Figure 4.1: A PoP has Peering Routers, Aggregation SWitches, and servers. A private WAN connects to datacenters and other PoPs.
connection terminates at the PoP which maintains separate connections with data centers, yielding
the benefits of split TCP [138] and TLS termination. A significant fraction of Facebook’s user traffic
is directed to these PoPs; our analysis in this chapter is centered around this traffic and these PoPs.
Each PoP includes multiple peering routers (PRs) which maintain interconnections and BGP
sessions with other ASes (§§ 2.1 and 2.2). User requests are processed by racks of servers in each
PoP, and each top-of-rack switch is connected via intermediate aggregation switches (ASWs) to all
PRs, as seen in Figure 4.1. ASWs maintain BGP sessions with PRs and rack switches.
Figure 4.2 depicts the relative volume of traffic served from 20 PoPs, a subset selected for geo-
graphic and connectivity diversity that combined serve most Facebook traffic. In this chapter, we
refer to the PoPs consistently by number, ordered by volume. At these PoPs, 95% of the traffic
Figure 4.2: Relative egress traffic volume (rounded) of 20 PoPs.
comes from clients in 65,000 prefixes (during a day in Jan. 2017). Considering just the client prefixes needed to account for 95% of a PoP’s traffic, Figure 4.3 shows that each PoP serves 700 to 13,000 prefixes, and 16 PoPs send 95% of their traffic to fewer than 6,500 prefixes.
Figure 4.3: Number of BGP prefixes that constitute 95% of each PoP’s traffic.
4.2.2 Mapping users to points of presence
Facebook’s global load balancing system, Cartographer [387], steers user traffic to PoPs based on
performance information. Cartographer collects measurements to capture PoP performance for each
end-user prefix, and then directs requests from each prefix to the “best” performing PoPs (subject
to constraints such as capacity; PoPs currently not serving users for maintenance, troubleshooting,
or testing; and agreements with other networks). Each of Facebook’s PoPs is assigned one or more prefixes from Facebook’s global IP space, and only that PoP announces those prefixes (a covering prefix announced across PoPs guards against blackholes if a more-specific route fails to propagate). Connections to IP addresses in the PoP’s prefix space are terminated at a server at the PoP. Cartographer enacts its decisions by controlling the IP addresses returned by DNS lookups — returning an IP address of a particular PoP in response to a DNS request for a general hostname — and by injecting URLs
Figure 4.4: Number of routes to the BGP prefixes contributing 95% of each PoP’s traffic (fraction of prefixes with 1-3, 4-6, 7-10, or 11-20 routes).
into HTTP responses that resolve only to IP addresses at a particular PoP (§2.3.2). Through this
process, Cartographer determines the PoP at which traffic ingresses into Facebook’s network and
is subsequently served from. Further details of Cartographer are out of this work’s scope, and we
design EDGE FABRIC assuming it has no control over which PoP serves a given user.
Based on geolocation by user IP address, we find that most user traffic is directed to a nearby
PoP. Half of all Facebook PoP egress traffic is to users within 500km of the serving PoP, and 90% is
to users within 2500km and in the same continent. The 10% of egress traffic served by a PoP in a
different continent than the user is composed predominantly of European PoPs serving users in Asia
(4.8% of all traffic) and Africa (2.1% of all traffic).
4.2.3 Routing traffic to users
4.2.3.1 Interconnections and route diversity
At each PoP, Facebook’s PRs maintain transit and peering interconnections with other ASes (§2.2.2).
Most PoPs connect to 2+ transit providers, with 2+ of the PoP’s PRs maintaining a PNI interconnection with each transit provider for capacity and failure resilience.
A PoP will often have multiple PNI interconnections with the same AS, and when possible
all PRs at a PoP maintain equal PNI capacity to each AS. However, PNI capacity to a given AS
can vary across PRs at a PoP; for the same AS, some PRs may have lower PNI capacity or no PNI
capacity at all. A PoP may also have multiple types of interconnections with the same AS (e.g., via a PNI and via the shared fabric of a public IXP); such interconnections may be maintained by the same PR or by different PRs.
In general, we configured Facebook’s network to egress a flow only at the PoP that the flow
enters at, rather than routing across the WAN from servers in one PoP to egress links at a different
PoP. Isolating traffic within a PoP reduces backbone utilization, simplifies routing decisions, and
improves system stability (§4.6.1.3). Even with this simplification, Facebook has diverse routing
options. Figure 4.4 shows the distribution of the number of routes that each PoP could choose from
to reach the prefixes that make up 95% of its traffic. If Facebook receives the same path through
a bidirectional interconnection established via the shared fabric of a public IXP and from a route
server at the same public IXP (as determined by the route’s nexthop IP address), or if multiple PRs
receive the same route (same AS_PATH and interconnection type), we only count it once. Although
not derivable from the graph (which combines destinations with 1-3 routes), all PoPs except one
have at least two routes to every destination, and many have four or more routes to most prefixes.
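The route-counting rule above can be made precise with a small sketch: routes to a prefix are deduplicated on their AS_PATH and interconnection type, so the same route received at multiple PRs counts once. The field names below are illustrative, not taken from any Facebook system.

    def count_routes(routes):
        """routes: iterable of dicts with 'as_path' (a tuple of ASNs) and 'kind' fields."""
        return len({(r["as_path"], r["kind"]) for r in routes})

    print(count_routes([
        {"as_path": (64500,), "kind": "public_ixp", "pr": "pr1"},
        {"as_path": (64500,), "kind": "public_ixp", "pr": "pr2"},   # same route at another PR
        {"as_path": (64500,), "kind": "private_pni", "pr": "pr1"},  # different interconnection type
    ]))  # prints 2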
4.2.3.2 Facebook’s routing policy
BGP does not expose a path’s performance or capacity, and it cannot incorporate dynamic signals
such as these into its decision process (§§ 2.3.1 and 2.4.3). Instead, Facebook’s routing policy
incorporates heuristics that make decisions based on other attributes, such as interconnection type
and AS_PATH length, that are assumed to align with performance. Facebook’s routing policy
prefers routes received via peering interconnections and thus is similar to policies commonly used
by other organizations (§2.3.1); the key differentiating factor is in how Facebook decides between
routes received via peering interconnections.
When multiple routes are available for a prefix, we decide among them by applying the following
tiebreakers in order:
1. Prefer longest matching prefix,
2. Prefer routes received via peering interconnections,
3. Prefer shorter AS_PATHs.
More specifically, we configure BGP at PRs to prefer routes received via peering interconnections by increasing their LOCAL_PREF (§2.1.2), and then use AS_PATH length as a tiebreaker. When routes remain tied, PRs prefer routes from the following sources in order: private peers (e.g., routes received via a PNI) > public peers (e.g., routes received via a bilateral BGP session with an AS, established via the shared fabric of a public IXP) > route servers. (Facebook de-prioritizes a handful of private peers relative to public peers for policy reasons, but the effect is minor in the context of this work.) We encode the peering interconnection type in MED, stripping MED values set by the neighboring AS, which normally express the AS’s preference of interconnection points (§2.1.2) but are largely irrelevant given that Facebook egresses a flow at the PoP where it ingresses.
We configured BGP at PRs and ASWs to use BGP multipath (§2.1.2). When a PR or an ASW
has multiple equivalent BGP best paths for the same destination prefix, it distributes traffic across
the equivalent routes using Equal Cost Multipath (ECMP) [203].
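The tiebreaking order above can be summarized as a sort key over the candidate routes for a destination prefix (after longest-prefix matching has already narrowed the candidates). The following Python sketch is a schematic restatement of that policy, not Facebook’s router configuration; the names and numeric rankings are assumptions.

    from dataclasses import dataclass

    PEER_TYPE_RANK = {"private_peer": 0, "public_peer": 1, "route_server": 2, "transit": 3}

    @dataclass(frozen=True)
    class Route:
        via_peering: bool      # received over a peering (vs. transit) interconnection
        as_path_len: int
        peer_type: str         # a key of PEER_TYPE_RANK

    def preference_key(route: Route):
        # Smaller tuples win: peering first (the LOCAL_PREF bump), then shorter
        # AS_PATH, then private peer > public peer > route server.
        return (0 if route.via_peering else 1,
                route.as_path_len,
                PEER_TYPE_RANK[route.peer_type])

    def best_routes(candidates):
        """Return all equally preferred routes (the set ECMP then splits traffic over)."""
        best = min(preference_key(r) for r in candidates)
        return [r for r in candidates if preference_key(r) == best]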
Why prefer routes via peering interconnections? Facebook’s routing policy prefers routes
received via peering interconnections because we assume that they are more likely to be short, direct
routes into the end-user AS with lower latency and better performance compared to routes via transit
interconnections. We discuss the potential value of short paths in solving longstanding Internet
problems in Chapter 3, and compare the performance of routes received from peering and transit
interconnections in Chapter 5 (§5.6.2).
In addition, from operational experience we know that routes via transit providers frequently lack
the capacity required to deliver Facebook’s traffic to a destination, resulting in congestion (§5.6.1.2).
More generally, while prior work has focused on assessing whether routes via peering interconnec-
tions offer better performance by reducing circuitous routing [8, 472], large CDNs like Facebook
are motivated to establish peering interconnections because such interconnections are necessary
PoP ID             1 (EU)          2 (AS)          11 (EU)         16 (AS)         19 (NA)
                 Frac.  Traffic   Frac.  Traffic   Frac.  Traffic   Frac.  Traffic   Frac.  Traffic
Peering via PNI   .12    .59       .25    .87       .02    .24       .21    .78       .13    .73
Peering via IXP   .77    .23       .39    .04       .45    .45       .54    .13       .85    .07
Rt Srvr           .10    —         .34    —         .52    —         .23    —         0      —
Transit           .01    .18       .01    .10       .01    .31       .02    .08       .01    .20
Table 4.1: Fraction of the ASes each PoP connects to per interconnection type, and fraction of the PoP’s traffic routed via each interconnection type, for example PoPs in EUrope, ASia, and North America. Each interconnection type is counted once per unique AS, even if multiple PRs maintain an interconnection of the same type with the AS. If a Facebook PoP maintains a peering interconnection with an AS via a PNI and also receives routes via a bilateral BGP session established via a public IXP, or from a route server at a public IXP, both the private and public interconnection will be counted. If Facebook maintains a bilateral BGP session with an AS via a public IXP and also receives routes from the same AS via a route server, the interconnection is only counted as public if the routes received from the route server are also received via the bilateral session (and thus redundant). All traffic exchanged via a public IXP, whether using a route received via a bilateral BGP session or from a route server, is counted as public.
given their capacity demands — transit providers simply do not have sufficient capacity to meet the
needs of today’s large CDNs (§5.6.1.2, [248, 469]).
Why prefer routes via PNIs? We prefer routes via PNIs to respect that the neighboring AS
dedicated resources to receiving Facebook traffic. In addition, the system described in this chapter,
EDGE FABRIC, can monitor PNI circuit capacity and utilization to prevent congestion (§4.4), but it
does not have visibility into other AS’s port utilization at an IXP fabric. As a result, by preferring
PNIs we avoid the possibility of cross-congestion at the egress (§4.6.2). Likewise, we prefer routes
learned via a bilateral BGP session established at a public IXP over routes received from a public
IXP’s route server because a bilateral peering indicates that the AS explicitly established a BGP
session with Facebook and announced the routes to Facebook, and thus is more likely to expect the
traffic and have a port provisioned that can handle the resulting traffic without becoming congested.
4.2.3.3 Prevalence and egress traffic per interconnection type
Facebook interconnects with thousands of ASes at each PoP. Table 4.1 shows, for example PoPs,
the fraction of interconnections that are of each type. Each of the five PoPs shown has hundreds
of interconnections in total, yielding rich connectivity, although the distribution of interconnection
types varies widely by count and by traffic. The table also shows the fraction of each PoP’s
traffic that would be routed via the given interconnection type if all traffic was assigned to the
route preferred by Facebook’s routing policy without considering capacity. Although PNI-based
peering interconnections make up at most a quarter of interconnections at any PoP, they receive
the majority of traffic at all but PoP-11; this is expected given that the dedicated capacity of PNI is
typically pursued for any high-volume peering interconnection (§2.2.2). At all but PoP-11, 80+%
of traffic egresses via peering interconnections rather than via transit interconnections, an example
of how large CDNs have “flattened” the Internet by establishing peering interconnections (§2.2.3
and chapter 3).
These results suggest that Facebook interconnects more widely than we concluded in our results in Section 3.3.4. We suspect this discrepancy exists for two reasons. First, traffic to facebook.com is steered by Cartographer using a different mechanism than traffic for static content (e.g., images, videos), which makes up the bulk of Facebook’s egress. Traffic to facebook.com is steered with DNS redirection, while traffic to static content is steered using URL rewriting, which enables Facebook to make more precise traffic engineering decisions (§2.3.2). As a result, our traceroutes in Section 3.3.4 to facebook.com may have observed a different route than what most Facebook egress traffic takes. Second, while a significant fraction of Facebook’s egress traverses peering interconnections, that does not mean that all traffic traversing those interconnections is taking a one-hop path to the destination AS.
4.3 Problems, Goals and Design Decisions
4.3.1 How BGP’s limitations impact Facebook
By establishing PoPs around the world and interconnecting widely at each location, we find that
Facebook has amassed rich connectivity that provides an array of benefits. Facebook’s PoPs often
have a short, direct path via a peering interconnection into the networks of end-user ISPs served
by the PoP, and Facebook’s route diversity provides fault-tolerance and potential opportunities for
performance-aware routing. Furthermore, the aggregate connectivity of each PoP should provide
Facebook with the capacity that it needs to deliver content to end-users while also enabling Facebook
to avoid capacity constraints in transit networks.
However, our study reveals that it is challenging for Facebook to make effective use of this rich
connectivity. Two problems — both of which stem from the limitations of BGP’s best path selection process
and BGP’s design as an information hiding protocol (§§ 2.1 and 2.4) — motivate our work in this
chapter. The most pressing problem is that BGP does not consider capacity in its decision process:
the interconnections preferred by Facebook’s routing policy are sometimes capacity constrained,
but due to this limitation, Facebook cannot make efficient use of these interconnections while also
avoiding congestion at the edge of its network. The second problem arises from Facebook’s desire
to match its traffic with the best performing route. BGP does not consider or expose performance
information, and BGP does not provide Facebook with the mechanisms required to measure and
compare the performance of available routes, or to use application-specific routing to prioritize
constrained capacity for performance-sensitive applications.
Figure 4.5: Distribution across PoPs of the fraction of prefixes that would have experienced congestion had EDGE FABRIC not intervened.
Figure 4.6: Distribution across interfaces of the ratio of peak load to capacity for interfaces that would have experienced congestion had EDGE FABRIC not intervened.
Problem: Interconnections can be capacity-constrained, but BGP is not capacity-aware. Al-
though Facebook builds PoPs and expands interconnection capacity whenever possible, an inter-
connection’s capacity may not suffice to deliver all traffic that Facebook would like to send over it.
Rapid growth in demand can quickly make the capacity of an existing interconnection insufficient,
and augmenting capacity depends on the cooperation of the neighbor AS; in some cases the process can take months or be outright impossible. In addition, even if there is typically sufficient capacity, sudden changes in demand and capacity can create brief shortages. Facebook traffic is susceptible to short-term spikes in demand (due to events or holidays), and capacity can change quickly due to failures. Likewise, because PoPs serve nearby users, diurnal patterns can lead to synchronized peaks in demand that can exceed interconnection capacity for brief periods; an interconnection may have sufficient capacity except for a single hour in a week.
Further complicating matters is that interconnection capacity may be unequally distributed
across peering routers, but Equal Cost Multipath (ECMP) at Facebook’s ASWs (§4.2.1) will be
unaware of this imbalance and will evenly distribute traffic across PRs, which can result in overload
at some PRs and poor utilization of capacity at others. (Section 4.6.1.4 describes why we do not use Weighted Cost Multipath, WCMP.) In general, assigning more traffic to an
egress interface than it (or the downstream path) can handle causes congestion delay and packet
loss, and it also increases server utilization (due to retransmissions) [108, 139, 229, 414, 462].
To understand the scale of the problem, we analyzed a two-day log from January 2017 of each
prefix’s per-PoP egress traffic rate (averaged over a 1 minute window) and compared the capacity of
Facebook’s egress links to the rate of traffic that BGP would assign to them (based on Facebook’s
configured BGP policy, §4.2.3), if EDGE FABRIC did not intervene to prevent overload.
For each PoP, Figure 4.5 shows the fraction of prefixes that would traverse a congested intercon-
nection without EDGE FABRIC. Most PoPs are capacity-constrained for at least one prefix, and a
small fraction of PoPs are capacity-constrained for most prefixes. For any interconnection interface
that would have been overloaded at least once, Figure 4.6 shows peak load per interface (computed
over 1-minute intervals, relative to interface capacity). While most of these interfaces experience
low amounts of overload (median load = 1.19x capacity), 10% experience a period in which BGP
policy would lead to BGP assigning twice as much traffic as the interface’s capacity! In many cases,
Facebook has found that it cannot acquire sufficient additional interconnection capacity to route
traffic via the preferred route; in other cases, Facebook has sufficient capacity for the preferred route,
but cannot make use of it due to asymmetry, in which case some interconnections become congested
while others remain underutilized.
Problem: BGP’s design does not support performance-aware or application-specific routing.
When demand for a route preferred by Facebook’s routing policy exceeds underlying interconnec-
tion capacity, Facebook should be able to prioritize assigning performance-sensitive traffic to the
route — given that it may offer better performance (§2.3.1.2) — and assign elastic traffic to other
routes with available interconnection capacity. However, such optimizations are not possible because BGP routes all traffic via the route selected by its best path selection algorithm (§2.1.2). Policy-Based Routing (PBR) [93] does address some of these constraints by enabling traffic to be routed based on identifiers such as a DSCP label; however, as discussed in Section 2.4.3, routing policies incorporating PBR are still static and thus unable to incorporate dynamic conditions such as demand and capacity.
Likewise, Facebook should be able to use its rich connectivity to opportunistically improve
performance by routing traffic to end-users via the best performing route. However, BGP route
announcements do not include performance signals and BGP is not capable of incorporating
dynamic performance signals into its routing decisions (§2.4.3); instead, Facebook’s routing policy
incorporates heuristics that make decisions based on other attributes that are assumed to align with
performance (§§ 2.3.1 and 4.2.3.2). (In fact, because BGP control-plane signaling is detached and independent from the data plane, a BGP router may select and announce a route that is blackholing traffic [235, 473].)
Since route performance is not available through control-
plane signals, and because a route’s performance may change, performance-aware routing requires
continuously measuring path performance by sending traffic — either existing production traffic,
or active measurement traffic — over alternate routes. Once again, BGP’s limitations stand in the
way: such measurements would require being able to direct a subset of flows via alternate routes
and being able to incorporate dynamic measurements into BGP’s decision process, neither of which
BGP is capable of supporting (§§ 2.1.2 and 2.4.3).
4.3.2 Goals and design decisions
Our goal is to enable Facebook to make efficient use of its rich connectivity, while in parallel
preventing congestion at the edge of Facebook’s network. In this chapter, we build a foundation
capable of supporting:
• Capacity-aware routing: Facebook’s routing decisions must be aware of interconnection
capacity, utilization, and demand to avoid congestion at the edge of Facebook’s network.
• Application-aware routing: When interconnection capacity is constrained, Facebook should
be able to prioritize assigning applications with performance-sensitive traffic to the best route.
• Performance-aware routing: Facebook should be able to measure route performance and
incorporate such measurements into its routing decisions.
In an effort to support these goals, in 2013 we began building EDGE FABRIC, a traffic engineering
system that manages egress traffic for Facebook’s PoPs worldwide. Section 4.4 describes how EDGE
FABRIC prevents interconnection congestion by incorporating capacity and demand into its routing
decisions. Section 4.5 discusses how EDGE FABRIC provides footholds to incorporate performance
signals into its routing decisions and EDGE FABRIC’s ability to perform application-specific routing;
we make use of these footholds in Chapter 5. Section 4.6 describes how EDGE FABRIC has evolved
over time in response to changing needs and insights from operational experience that led to
improvements in EDGE FABRIC’s stability and ability to combat congestion.
We now present the key design decisions of EDGE FABRIC:
Operate on a per-PoP basis. While EDGE FABRIC assigns traffic to egress routes, the global
load balancing system maps a user request to ingress at a particular PoP (§4.2.2), and a flow
egresses at the same PoP at which it ingresses. So, EDGE FABRIC need only operate at a per-PoP
granularity; it does not attempt to orchestrate global egress traffic. This design allows us to colocate
its components in the PoP, reducing dependencies on remote systems and decreasing the scope and
complexity of its decision process. We can then restart or reconfigure EDGE FABRIC at a PoP in
isolation (§4.4.4) without impacting other PoPs.
Centralize control with SDN. We chose to use an SDN-based approach (§2.3.1.3), in which
a centralized controller receives network state and then programs network routing decisions. This
approach brings benefits of SDN: it is easier to develop, test, and iterate compared to distributed
approaches. Because Facebook connects to its peers using BGP, part of the network state is the BGP
paths Facebook receives, which are continuously streamed to the controller (§4.4.1.1).
Support dynamic signals (capacity/performance) during the route selection process. The
controller receives measurements of capacity and demand multiple times per minute (§4.4.1.2),
enabling EDGE FABRIC to maximize utilization of preferred paths without overloading them (§4.4.2).
In addition, the controller can incorporate other signals into its decision process, and can make
routing decisions that only impact a subset of flows. This flexibility provides a foothold that
can be used to incorporate performance measurements into the decision process. It also enables EDGE FABRIC to perform application-specific routing in the case of capacity constraints;
EDGE FABRIC can prioritize placing traffic that is more sensitive to network conditions on the best
performing path. In Chapter 5, we estimate the value of incorporating performance signals into
Facebook’s routing decisions.
Build atop existing BGP infrastructure. Despite the centralized controller, every PR makes
local BGP route decisions and PRs exchange routes in an iBGP mesh; the controller only intervenes
when it wants to override default BGP decisions. To enact an override, EDGE FABRIC sets its
preferred route to have high LOCAL_PREF and announces it via BGP sessions to PRs (§4.4.3),
which prefer it given that BGP’s best path selection process ranks routes byLOCAL_PREF (§2.1.2).
Building EDGE FABRIC atop our established BGP routing simplifies deployment, lets the network
fall back to BGP for fault tolerance, and leverages existing operational teams, their expertise, and
network monitoring infrastructure.
Leverage existing vendor software and hardware. We use battle-tested vendor gear and
industry standards, avoiding the need for custom hardware or clean slate design. Sections 4.4 and 4.5
describe our use of BGP, BMP, IPFIX, sFlow, ISIS-SR, and eBPF, and Section 4.6.1 explains how
specifics of vendor support have influenced our design.
Summary. EDGE FABRIC’s design values simplicity and compatibility with existing infrastructure,
systems, and practices. Its approach to satisfy our primary goal — avoid overloading egress
interfaces — does not require any changes to our servers or (browser-based or app-based) clients,
adding only BGP sessions between the routers and a per-PoP software controller. Our secondary
goal of laying the groundwork required for performance-aware routing relies on straightforward
software changes at servers and the addition of alternate routing tables at routers, functionality
supported by our existing infrastructure.
4.4 Avoiding a Congested Edge
EDGE FABRIC consists of loosely coupled microservices (Figure 4.7). Every 30 seconds, by default,
the allocator receives the network’s current routes and traffic from other services (§4.4.1), projects
interface utilization (§4.4.1.2), and generates a set of prefixes to shift from overloaded interfaces
and for each prefix, the detour path to shift it to (§4.4.2). Another service enacts these overrides
by injecting routes into routers via BGP (§4.4.3). We use a 30 second period to make it easier to
analyze the controller’s behavior, but we can lower the period if required due to traffic volatility.
4.4.1 Capturing network state (inputs)
EDGE FABRIC needs to know all routes from a PoP to a destination, and which routes traffic will
traverse if it does not intervene. In addition, it needs to know the volume of traffic per destination
prefix and the capacity of egress interfaces in the PoP.
Figure 4.7: EDGE FABRIC’s components: the controller (BMP Collector, Traffic Collector, and Allocator) and the BGP Injector, which announces route overrides (e.g., prefix1 via X.X.X.X) to the PoP’s Peering Routers.
4.4.1.1 Routing information
All available routes per prefix. The BGP Monitoring Protocol (BMP) allows a router to share a
snapshot of the routes received from BGP peers (e.g., all of the routes in its route information base,
or RIB) and stream subsequent updates to a subscriber [380]. The BMP collector service maintains
BMP subscriptions to all peering routers, providing EDGE FABRIC with a live view of every peering
router’s RIB. In comparison, if EDGE FABRIC maintained a BGP peering with each router, it could only see each router’s best path. (Previously, we used the BGP add-path capability to collect multiple routes from each router, but some vendor equipment limits the number of additional paths exchanged, a limit we quickly exceeded given our rich interdomain connectivity.)
Preferred paths per prefix. BMP does not indicate which path(s) BGP has selected, and a BGP
peering only shares a single path, even if the router is using ECMP to split traffic across multiple
equivalent paths. The controller also needs to know what path(s) would be preferred without
any existing overrides; BGP may have selected a path that the controller previously injected. To
determine the paths that would have been selected by Facebook’s routing policy, the controller
performs BGP best path selection (including multipath computation and ignoring existing overrides
that would otherwise be preferred) for every prefix.
4.4.1.2 Traffic information
Current traffic rate per prefix. The Traffic Collector service collects traffic samples reported
by all peering routers in a PoP (using IPFIX or sFlow, depending on the router), groups samples
by the longest-matching prefix announced by BGP peers, and calculates the average traffic rate
for each prefix over a two-minute window. We use live rather than historical information because,
for example, the global load balancing system (§4.2) may have shifted traffic to/from the PoP,
destination networks may have changed how they originate their network space for ingress traffic
engineering, and traffic demands change over time on a range of timescales.
If the rate of a prefix exceeds a configurable threshold, for example 250 Mbps, the service will
recursively split the prefix (e.g., splitting a /20 into two /21s, discarding prefixes with no traffic)
until the rate of all prefixes is less than the threshold. Splitting large prefixes allows the allocator to
make more fine-grained decisions and minimize the amount of traffic that must be detoured when
interfaces are overloaded (§4.4.2).
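The splitting step can be sketched as a short recursive routine; the 250 Mbps threshold is the example value from the text, and the rate function and sample addresses below are hypothetical stand-ins for the per-prefix rates computed from IPFIX/sFlow samples. This is an illustration, not the Traffic Collector’s implementation.

    import ipaddress

    def split_prefix(prefix, rate_of, threshold_mbps=250):
        """Recursively split `prefix` until every kept subprefix is below the threshold.

        `rate_of(prefix)` returns the measured traffic rate (Mbps) for a prefix;
        subprefixes with no traffic are discarded.
        """
        rate = rate_of(prefix)
        if rate == 0:
            return []
        if rate < threshold_mbps or prefix.prefixlen >= prefix.max_prefixlen:
            return [(prefix, rate)]
        result = []
        for sub in prefix.subnets(prefixlen_diff=1):   # e.g., a /20 into two /21s
            result.extend(split_prefix(sub, rate_of, threshold_mbps))
        return result

    # Hypothetical samples: two busy client addresses inside a /20.
    samples = {ipaddress.ip_address("203.0.113.10"): 180,
               ipaddress.ip_address("203.0.113.200"): 120}

    def rate_of(prefix):
        return sum(mbps for ip, mbps in samples.items() if ip in prefix)

    print(split_prefix(ipaddress.ip_network("203.0.112.0/20"), rate_of))
    # -> [(203.0.113.0/25, 180), (203.0.113.128/25, 120)]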
Interface information. The allocator retrieves the list of interfaces at each peering router from
a network management service [421] and queries peering routers via SNMP every 6 seconds to
retrieve interface capacities, allowing the allocator to quickly adapt to capacity changes caused by
failures or provisioning.
Projecting interface utilization. The allocator projects what the utilization of all egress interfaces
in the PoP would be if no overrides had been injected, assigning each prefix to its preferred route(s)
from its emulated BGP path selection. The BGP best path computation process may return multiple
(equivalent) routes. These routes may be spread across multiple interfaces and/or peering routers.
In these cases, the allocator assumes that ECMP at both the aggregation layer and peering routers
splits traffic equally across the paths.
We project interface utilization instead of using the actual utilization to enable the allocation
process to be stateless. This approach simplifies our design: the allocator does not need to be aware
of its previous decisions or their impact — it generates a full allocation from scratch on each cycle
by projecting how traffic would flow in the absence of any overrides and then generates overrides to
prevent congestion. Section 4.6.1.1 discusses this design decision in detail.
Based on the projected utilization, the allocator identifies interfaces that will be overloaded if it
does not apply overrides. We consider an interface overloaded if utilization exceeds ~95% (the exact
threshold can vary based on interface capacity and peer type), striking a balance between efficient
utilization and headroom to handle volatility (including microbursts).
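The projection can be sketched as a simple aggregation: each prefix’s measured rate is spread evenly (the ECMP assumption above) across the interfaces of its preferred route(s), and an interface is flagged once its projected load exceeds a threshold fraction of capacity. The data structures below are illustrative rather than EDGE FABRIC’s.

    from collections import defaultdict

    def project_utilization(prefix_rates, preferred_ifaces, capacity,
                            overload_threshold=0.95):
        """Return (projected Mbps per interface, set of overloaded interfaces).

        prefix_rates: {prefix: Mbps}
        preferred_ifaces: {prefix: [interface, ...]}  # the BGP multipath set
        capacity: {interface: Mbps}
        """
        load = defaultdict(float)
        for prefix, rate in prefix_rates.items():
            ifaces = preferred_ifaces[prefix]
            for iface in ifaces:                       # even ECMP split
                load[iface] += rate / len(ifaces)
        overloaded = {i for i, mbps in load.items()
                      if mbps > overload_threshold * capacity[i]}
        return dict(load), overloaded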
4.4.2 Generating overrides (decisions)
The allocator generates overrides to shift traffic away from interfaces that it projects will otherwise
be overloaded. For each overloaded interface, the allocator identifies the prefixes projected to
traverse the interface and, for each prefix, the available alternate routes. (A prefix will have no alternate routes only if all routes to it are on interfaces that lack sufficient spare capacity, after accounting for earlier detours from this round of allocation; this scenario is rare, as transit interfaces have routes to all prefixes and, in our evaluation, always had at least 45% of their capacity free, §4.4.5.) It then generates all possible ⟨prefix, alternate route⟩ options, and applies a selection strategy that uses heuristics with the goal of minimizing any performance impact arising from detouring traffic. To accomplish this, the selection strategy jointly considers the prefix being detoured and the route that it will be shifted to, applying the following rules in order until a single ⟨prefix, alternate route⟩ emerges:
1. Prefer shifting IPv4 prefixes. We prefer shifting IPv4 prefixes over IPv6 prefixes because we
have experienced routes that blackhole IPv6 traffic despite advertising the prefix. If EDGE
FABRIC shifts traffic to such a route, end-users will fallback to IPv4 [460], causing traffic to
oscillate between IPv4 and IPv6.
2. Prefer prefixes based on signals provided by BGP neighbors. Neighbors can define a rank ordering of prefixes to detour and signal these preferences using Facebook-defined BGP communities.
3. Among multiple alternate routes for a given prefix, prefer routes with the longest prefix.
Unlike the standard BGP decision process, the allocator will consider using routes for less-
specific prefixes, just with lower preference.
4. Prefer prefixes based on their alternate routes and Facebook’s routing policy. For instance,
the allocator will prefer shifting a prefix with an available route via a public exchange over a
prefix that only has an alternate route via a transit provider (§4.2.3).
5. Prefer routes based on an arbitrary but deterministic tiebreaker. The tiebreaker selects
first based on the prefix value. If there are equally preferred alternate routes for the chosen
prefix, the allocator orders alternate routes in a consistent way that increases the likelihood of
detour traffic being balanced across interfaces.
Once a pairing has been selected, the allocator records the decision and updates its projections,
removing the prefix’s traffic from the original interfaces and placing all of the PoP’s traffic for the
prefix onto the selected alternate route’s interface. EDGE FABRIC detours all traffic for the prefix,
even if the prefix’s primary route was across multiple interfaces or routers. However, the total traffic
per prefix is always less than the threshold that Traffic Collector uses when splitting high traffic
prefixes (§4.4.1).
The allocator continues to select prefixes to shift until it projects that the interface is no longer
overloaded or until the remaining prefixes have no available alternate routes. Because the allocation
process is stateless, it generates a new allocation from scratch every 30 seconds. To minimize churn,
we implemented the preferences to consider interfaces, prefixes, and detour routes in a consistent
order, leading the allocator to make similar decisions in adjacent rounds. The remaining churn is
often due to changes in traffic rates and available routes.
The headroom left by our utilization thresholds allows interface utilization to continue to grow
between allocation cycles without interfaces becoming overloaded. If a route used by the controller
for a detour is withdrawn, the controller will stop detouring traffic to the withdrawn route to prevent
blackholing of traffic.
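Putting the pieces together, one allocation round can be sketched as the greedy loop below: for each overloaded interface, repeatedly pick the most preferred ⟨prefix, alternate route⟩ pair and move all of that prefix’s traffic, updating spare capacity as it goes. The selection rules above are abstracted into a single pair_key preference function, and the data layout is an assumption rather than EDGE FABRIC’s implementation.

    def generate_overrides(load, capacity, assigned, alternates, rate, pair_key,
                           threshold=0.95):
        """Return {prefix: detour interface} overrides for one allocation round.

        load: {iface: projected Mbps without overrides};  capacity: {iface: Mbps}
        assigned: {iface: [prefixes projected onto it]};  rate: {prefix: Mbps}
        alternates: {prefix: [candidate detour interfaces]}
        pair_key(prefix, iface): a sortable key encoding the selection rules
        """
        overrides = {}
        spare = {i: threshold * capacity[i] - load[i] for i in capacity}
        for iface in sorted(i for i in capacity if spare[i] < 0):
            while spare[iface] < 0:
                options = [(p, alt) for p in assigned[iface] if p not in overrides
                           for alt in alternates.get(p, []) if spare[alt] >= rate[p]]
                if not options:
                    break                       # nothing left that can be shifted
                prefix, alt = min(options, key=lambda o: pair_key(*o))
                overrides[prefix] = alt         # detour all of this prefix's traffic
                spare[iface] += rate[prefix]    # simplification: credit one interface
                spare[alt] -= rate[prefix]
        return overrides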
4.4.3 Enacting overrides (output)
In each round, the allocator generates a set of BGP updates for EDGE FABRIC’s overrides and
assigns each update a very high LOCAL_PREF. The allocator passes the BGP updates to the BGP
Injector service, which maintains a BGP connection with every peering router in the PoP and enacts
the overrides by announcing the BGP updates to the target routers. Because the injected updates
have a very high LOCAL_PREF and are propagated between PRs and the ASWs via iBGP, all
routers prefer the injected route for each overridden prefix. The injector service then withdraws any
overrides that are no longer valid in the current allocation.
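One way to picture the injector’s job is as a diff between rounds: announce overrides that are new or changed with a very high LOCAL_PREF, and withdraw overrides that the latest allocation no longer contains. The sketch below is illustrative; the LOCAL_PREF value, prefixes, and next hops are made up, and a real deployment would hand these updates to its BGP speaker rather than print them.

    OVERRIDE_LOCAL_PREF = 10_000   # an assumed "very high" value that wins best-path selection

    def diff_overrides(previous, current):
        """previous/current: {prefix: next_hop}. Returns (to_announce, to_withdraw)."""
        to_announce = [
            {"prefix": p, "next_hop": nh, "local_pref": OVERRIDE_LOCAL_PREF}
            for p, nh in current.items() if previous.get(p) != nh
        ]
        to_withdraw = [p for p in previous if p not in current]
        return to_announce, to_withdraw

    # Example round: one detour moves to a different next hop, one is no longer needed.
    prev = {"203.0.113.0/25": "198.51.100.1", "192.0.2.0/24": "198.51.100.1"}
    curr = {"203.0.113.0/25": "198.51.100.2"}
    print(diff_overrides(prev, curr))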
We configure EDGE FABRIC and the global load balancer such that their independent decisions
work together rather than at odds. First we need to protect against EDGE FABRIC decisions and
global load balancer decisions interacting in ways that cause oscillations. In selecting which PoP
to direct a client to, the global load balancer jointly considers performance from the PoP and
Facebook’s BGP policy’s preference for the best route from the PoP, but we configure it to ignore
routes injected by EDGE FABRIC and instead to consider the route that would be used in the absence
of an override. If the global load balancer was allowed to consider the override route, it could shift
client traffic away from a PoP in reaction to EDGE FABRIC detouring traffic from an overloaded
interface to a less-preferred route. This shift would reduce traffic at the PoP, lowering EDGE
FABRIC’s projection of interface load, which could cause it to stop detouring, opening the possibility
of an oscillation. Second, the global load balancer can track interface utilization and appropriately
spread traffic for a client network across all PoPs that it prefers equally for that network. So, EDGE
FABRIC need only intervene with overrides once interfaces are overloaded across all the PoPs.
4.4.4 Deploying, testing, and monitoring
We typically deploy EDGE FABRIC controller updates weekly, using a multi-stage release process to
reduce risk. First, because our design is modular and stateless, we write comprehensive automated
tests for individual components of EDGE FABRIC and EDGE FABRIC’s dependencies. Second,
because our controller is stateless and uses projected interface utilization instead of actual utilization,
we can run a shadow controller inside a sandbox that can query for the same network state as the
live controller, without needing any state from the live controller and without being dependent on its
earlier decisions. We continuously run shadow instances, built from the latest revision, for every
PoP and compare the decisions and performance of shadow instances against the controllers running
in production. We review these comparisons before beginning deployment of a new version. Third,
because we deploy EDGE FABRIC and all dependencies on a per-PoP basis, we can roll out new
versions of a controller and its dependencies on a PoP-by-PoP basis (an automated system performs
this). While the EDGE FABRIC controller is being updated, the BGP Injector service continues to
inject the previous round of decisions until the controller resumes (a process that takes less than
5 minutes). If we need to update the BGP Injector service, hold timers at PRs maintain existing
injections until the injector has restarted.
While the stateless controller is amenable to automated tests, it is particularly vulnerable to
errors in BGP route or traffic rate data, as these can cause the controller to misproject interface
utilization. To catch misprojections, a monitor compares the controller’s post-allocation projection
of interface utilization with actual interface utilization, and it raises an alarm if they differ by 5% for
more than a configurable period of time. Through this process, we identified and corrected bugs in
our routing policy and in how our PRs export IPFIX and sFlow samples. The controller projects that
ECMP will distribute traffic nearly evenly across links. The monitor identified instances of ECMP
unexpectedly distributing traffic in a highly-unbalanced manner, which EDGE FABRIC can mitigate
by overriding the multipath to send traffic to a single PR.
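The misprojection check itself is simple to sketch: compare projected and measured utilization each interval and alarm when they diverge by more than 5% for longer than a configured window. The window length and data layout below are assumptions for illustration.

    def projection_alarm(samples, max_gap=0.05, window=3):
        """samples: list of (projected, actual) utilization fractions, oldest first.

        Returns True when the last `window` samples all diverge by more than `max_gap`.
        """
        recent = samples[-window:]
        return len(recent) == window and all(
            abs(projected - actual) > max_gap for projected, actual in recent
        )

    # Three consecutive samples that disagree by ~10% would raise an alarm.
    print(projection_alarm([(0.60, 0.61), (0.70, 0.81), (0.72, 0.83), (0.75, 0.86)]))  # True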
Similarly, while EDGE FABRIC’s use of BGP and distributed route computation lets us build on
existing infrastructure, it also exposes us to the underlying complexity of the BGP protocol. In one
scenario, a configuration issue caused routes injected by EDGE FABRIC to not be reflected across the full PoP, which shifted traffic away from the previously overloaded interface but onto a different interface than desired.
EDGE FABRIC’s output against current network state and traffic patterns.
4.4.5 Results on production traffic
We deployed EDGE FABRIC for all production traffic, detouring traffic to avoid overloading interfaces
at PoPs around the world. In this section we discuss a two-day study executed in January 2017
that predates our current stateless controller. Instead, the study used our earlier stateful controller,
which also did not automatically split large-volume prefixes. We believe that our stateless controller
achieves better utilization than the stateful one (without negative side effects) but have not yet
formally evaluated it.
Does EDGE FABRIC achieve its primary goal, preventing congestion at edge interfaces while
enabling efficient utilization? EDGE FABRIC prevents congestion by detouring traffic to alternate
routes. During the study, non-overloaded alternate routes always existed, giving EDGE FABRIC
options to avoid overloading interfaces. In particular, transit providers can take detoured traffic
to any destination, and the maximum instantaneous transit utilization observed at any individual
PoP (sampled at one minute intervals) during the study was 55%. EDGE FABRIC successfully
Figure 4.8: Utilization of interfaces relative to detour thresholds, when detouring and when not detouring traffic.
prevented egress traffic from overloading egress interfaces, with no packet drops at an interface
when EDGE FABRIC was not detouring traffic from it, nor in 99.9% of periods in which it was
detouring. Figure 4.8 shows the utilization on these interfaces (relative to their detour thresholds)
during these periods; EDGE FABRIC keeps utilization of preferred routes high even while avoiding
drops, and utilization is below a safe threshold during periods in which EDGE FABRIC decides not
to detour traffic.
How much traffic does EDGE FABRIC detour? Figure 4.9 shows the distribution of the fraction
of time that EDGE FABRIC detoured traffic from each interface to avoid overloading it. During
our evaluation period, EDGE FABRIC detoured traffic from 18% of interfaces at least once, and it
detoured 5% of interfaces for at least half the period. Figure 4.10 shows how long each period of
detouring lasts, and how long the periods are between detours for a given (PoP, destination prefix).
Figure 4.9: Fraction of time EDGE FABRIC detours from interfaces. [CCDF of interfaces vs. fraction of time the interface is overloaded.]
Figure 4.10: Distributions of EDGE FABRIC detour period lengths across (PoP, prefix) pairs and of time between detours. [CDF of samples vs. duration in minutes, with curves for detour duration and gap between detours.]
Figure 4.11: Fraction of traffic detoured by EDGE FABRIC across 20 PoPs and at the PoP with the largest fraction detoured. [Fraction of traffic detoured vs. day of week, with curves for the PoP with the most detours and for global traffic.]
The median detour lasts 22 minutes, and 10% last at least 6 hours. Interestingly, the median time
between detours is shorter, only 14 minutes, but the tail is longer, with a gap of more than 3 hours
36% of the time and a sizable fraction of gaps long enough to suggest detouring during a short daily
peak. Figure 4.11 shows, over time, the fraction of traffic detoured across 20 PoPs and the fraction
of traffic detoured at the PoP (in this set of 20) that detours the highest fraction of its traffic. The
global and PoP detour volumes display diurnal patterns and remain a small fraction of overall traffic,
leaving spare capacity to absorb detours, as PoPs always had at least 45% of their transit capacity
free. EDGE FABRIC enables PoPs to dynamically detour traffic from interfaces that would otherwise
become heavily overloaded (see Figure 4.6) by taking advantage of available capacity elsewhere.
What is the impact on performance of EDGE FABRIC detours? We explore this question in
Chapter 5 by comparing the performance between the route preferred by Facebook’s BGP policy
and the best alternate path. As we discuss in Section 5.6, we find that there is typically an alternate
route that has similar performance to the route preferred by Facebook’s routing policy; this result
suggests that most detours will have little impact on performance.
4.5 Towards Performance and Application Aware Routing
EDGE FABRIC avoids performance problems due to congested links at the edge of Facebook’s
network. However, Facebook’s users can still suffer sub-optimal performance due to routing
decisions. First, EDGE FABRIC does not have visibility into link utilization and capacity beyond the
edge of Facebook’s network. While Facebook maintains PNI with many user networks, there are
still cases where Facebook routes traffic via IXPs or transit networks, increasing the potential for
congestion at downstream interconnections and performance problems that EDGE FABRIC cannot
account for. Even in cases when Facebook establishes a PNI with an end-user AS, alternate routes
may be able to bypass congestion, failures, or internal routing decisions degrading performance for
traffic traversing the PNI.
Second, BGP provides neither visibility into performance nor the explicit ability to make
decisions based on it (§2.4.3). EDGE FABRIC uses routing policies based solely on BGP signals and
heuristics, and thus may not choose the best-performing path for a given prefix’s default path, and
likewise, when detouring a prefix, may not choose the best-performing detour path. Even if BGP
does choose the best-performing path in all scenarios, performance can be degraded for detoured
traffic if the detour path has worse performance than the primary path.
Due to transient failures and volatility in traffic and congestion, which path performs best can
vary over time, and so making decisions based on performance requires continuously measuring
performance of available routes. In addition, there may be value in having EDGE FABRIC prioritize
certain types of content (for instance, a live video stream) when deciding which traffic will use the
limited capacity of the constrained interconnection(s), and which traffic will be shifted to
an alternative route (with potentially worse performance). However, because BGP only supports
destination-based routing (§2.4.2), BGP alone cannot realize either of these objectives.
In this section, we extend EDGE FABRIC’s design to support routing a subset of flows, either
selected at random (e.g., for measuring path performance) or based on flow properties (e.g., for
application aware routing) onto specific paths. We have deployed these extensions into production
and use them in Chapter 5 to measure Internet performance from Facebook’s CDN and to identify
possible opportunities for performance-aware routing.
4.5.1 Placing traffic on alternate paths
To sidestep BGP’s limitation that it only supports destination-based routing, we build a mechanism
that allows us to route specific flows via select paths. The mechanism requires only minimal
modifications at servers and PRs, and does not require coordination between servers and other
network elements: servers can make decisions on a per-flow basis (per-flow overrides avoid out-of-
order packets that slow TCP [51, 309, 324, 458]) without having any knowledge of network state at
PRs. Our approach involves the following:
Servers select and mark flows. Servers identify flows that should be treated specially, such
as a flow that is carrying real-time traffic and thus is performance-sensitive, and set the DSCP field
in the IP packets to a corresponding, pre-defined value. It is possible to apply DSCP markings
without requiring any application changes by using the Extended Berkeley Packet Filter (eBPF)
instruction set; eBPF programs can be loaded into the kernel and process all packets egressing
from the server. For instance, in Chapter 5 we implement an eBPF program that runs on Facebook’s
edge load balancers and randomly selects flows to be used to measure the performance of alternate
routes (§5.2.2.2). To support application-aware routing, we could add another eBPF program that
marks flows by type, based on the Facebook endpoint that the connection was established to.
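As a simplified illustration of DSCP marking, the snippet below sets the DSCP bits on a single socket using the standard IP_TOS socket option; in production the marking is applied by an eBPF program on the egress path rather than by per-socket application code, and the endpoint and code point shown are arbitrary examples.

```python
import socket

DSCP_REALTIME = 46  # example code point for performance-sensitive flows

def mark_flow(sock: socket.socket, dscp: int) -> None:
    # DSCP occupies the upper six bits of the IPv4 TOS byte.
    sock.setsockopt(socket.IPPROTO_IP, socket.IP_TOS, dscp << 2)

# Example usage: mark a connection so PRs can match on the DSCP value.
conn = socket.create_connection(("example.com", 443))
mark_flow(conn, DSCP_REALTIME)
conn.close()
```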
Policy routing at the PRs routes marked flows via alternate tables. PRs match on the
DSCP markings applied by servers and route corresponding packets based on the corresponding
alternate routing table. We install a unique routing table for each DSCP value; if the table does not
contain a route for a marked flow, the flow is routed via the default routing table.
A controller injects routes into the alternate tables. EDGE FABRIC or a separate controller
injects routes into the alternate routing tables at PRs to control how flows are routed for each DSCP
marking. If a controller has not injected a route for a particular destination, flows marked with the
corresponding DSCP value will be routed based on the PR’s default routing table.
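The resulting forwarding behavior amounts to a two-step lookup, sketched below with illustrative data structures; real PRs implement this with hardware policy routing, so the Python here only captures the lookup semantics (per-DSCP table first, default table as a fallback).

```python
import ipaddress

def longest_prefix_match(dst_ip, table):
    """table: {prefix_string: nexthop}; return the nexthop of the most
    specific covering prefix, or None if no prefix covers dst_ip."""
    addr = ipaddress.ip_address(dst_ip)
    best_len, best_nexthop = -1, None
    for prefix, nexthop in table.items():
        net = ipaddress.ip_network(prefix)
        if addr in net and net.prefixlen > best_len:
            best_len, best_nexthop = net.prefixlen, nexthop
    return best_nexthop

def select_nexthop(dst_ip, dscp, alternate_tables, default_table):
    """Marked packets try their DSCP's alternate table first, then fall back
    to the default table if no covering route was injected."""
    alt = alternate_tables.get(dscp, {})
    return (longest_prefix_match(dst_ip, alt)
            or longest_prefix_match(dst_ip, default_table))
```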
Advantages of our approach. Prior work has focused on achieving such flexibility through either
host-based routing — a model in which each host explicitly assigns the route that traffic will take
based either on local routing decisions or decisions made by a centralized controller [23, 469] — or
through the use of SDN-enabled switches [187]. Both of these approaches are starkly different
from traditional BGP based interdomain routing and come with their own share of operational
challenges (§4.6.1.2). In comparison to prior work, our approach does not require continued
synchronization between servers, routers, and controllers. Servers can continuously tag packets
based on flow properties without any knowledge of network state, and controllers can inject routes
into alternate routing tables to control the route of marked packets only as needed. And as long as a
network’s routers support alternate routing tables, no new hardware is needed. This compatibility
with existing BGP removes the need for a significant change in how a network routes its traffic,
enabling incremental adoption while still providing significant flexibility.
This approach also limits all changes to the PRs: the aggregation switch (ASW) layer (§4.2.1)
does not need to be aware of DSCP values assigned to flows and can blindly forward IP traffic
based on BGP’s best route for the destination prefix. However, this approach requires addressing
two problems. First, a PR with the best route may not be able to route traffic via the alternate
route selected by the controller, since different PRs at a PoP can have different interconnections.
In addition, although PRs are in a BGP mesh, routes injected into alternate routing tables are not
exchanged between PRs because BGP only exports a single path per session. To direct traffic to
the correct PR, the controller also injects the alternate route at PRs that do not have the desired route. A PR without an interconnection providing a route to the nexthop will resolve the nexthop via IGP
and forward the traffic to a PR that does have a matching interface route. Second, trying to forward
traffic from one PR to another via IP would cause a loop since the aggregation layer forwards traffic
based on BGP’s best route for the destination. To avoid this, we configure the PRs to address each
other via labels (using ISIS-SR) and tunnel traffic via ASWs using MPLS forwarding.
4.5.2 Potential use cases
In this section, we sketch how we can use the footholds that we developed and deployed to improve
performance. We investigate these opportunities and provide insight into corresponding challenges
that arise in production in more detail in Chapter 5 (§5.6).
4.5.2.1 Considering performance in primary and detour routing decisions
We may be able to improve performance by using these footholds to continuously measure route
performance and incorporating such measurements into how EDGE FABRIC detours traffic during
instances of interconnection congestion, and during steady state. First, we can use these footholds
to override BGP’s default decisions for a small percentage of traffic selected at random, and use this
traffic to continuously measure the performance of alternate routes. Second, we can incorporate
these measurements into the detour decision process described in Section 4.4.2 so that EDGE
FABRIC considers path performance if capacity is constrained — performance can be considered
when deciding which traffic to detour and which route to place detoured traffic onto. Third, EDGE
FABRIC can continuously incorporate performance measurements when deciding where traffic
should be placed — before considering capacity constraints — and inject overrides when there is an
opportunity to improve performance by making a decision different than Facebook’s BGP policy;
we explore this opportunity further in Chapter 5. Fourth, we can use these measurements to forecast
the impact of a traffic shift due to a failure (e.g., losing PNI capacity to a peer); such insights can be
used in network risk planning.
4.5.2.2 Optimizing use of limited capacity
When capacity is limited, EDGE FABRIC may be forced to detour traffic to paths with comparatively
worse performance. EDGE FABRIC can be extended to best use the limited capacity on the primary
path by shifting prefixes and/or flows that are less likely to be impacted.
First, if measurements of path performance are available, EDGE FABRIC can amend its decision
criteria (§4.4.2) to prefer shifting prefixes that will experience less performance degradation on their
detour route. Second, EDGE FABRIC can prioritize capacity on constrained routes for flows that
are more sensitive to network conditions, such as a live video stream. Front-end servers can assign
predefined DSCP values to flows that are higher priority. Then, when EDGE FABRIC injects routes
to shift default-routed traffic away from an overloaded interface, it can in parallel inject routes into
an alternate routing table to keep flows that are marked with a high priority DSCP value on the
better performing route. Both IPFIX and sFlow samples collected by Traffic Collector include the
DSCP field, thereby making it possible for EDGE FABRIC to determine the rate of traffic marked
with a given DSCP value and account for this in its projection. This prioritization can be employed
and provide benefit even without measurements of route performance if the route preferred by
Facebook’s routing policy is typically the best performing path and shifting traffic to alternate paths
will cause degradation at least some of the time. In Chapter 5, we compare the performance
of the route preferred by Facebook’s routing policy against the performance of alternate routes to
gain insights into the performance implications of EDGE FABRIC detours.
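The sketch below illustrates this prioritization using per-DSCP rates of the kind Traffic Collector derives from IPFIX/sFlow samples. The data structures, the chosen code point, and the largest-first ordering are assumptions made for the example; they are not EDGE FABRIC's production decision criteria.

```python
HIGH_PRIORITY_DSCP = 46  # assumed marking applied by front-end servers

def plan_priority_aware_detour(prefix_rates_bps, capacity_bps, threshold=0.95):
    """prefix_rates_bps: {prefix: {dscp: rate_bps}} for one overloaded interface.

    Returns (prefixes to detour via the default table,
             (prefix, dscp) pairs to pin to the primary route via the
             high-priority alternate routing table)."""
    projected = sum(sum(rates.values()) for rates in prefix_rates_bps.values())
    excess = projected - threshold * capacity_bps
    detour, pin_to_primary = [], []
    # Consider the largest prefixes first so that few overrides are needed.
    for prefix, rates in sorted(prefix_rates_bps.items(),
                                key=lambda kv: sum(kv[1].values()), reverse=True):
        if excess <= 0:
            break
        low_priority = sum(r for d, r in rates.items() if d != HIGH_PRIORITY_DSCP)
        detour.append(prefix)                                # default table: detour route
        pin_to_primary.append((prefix, HIGH_PRIORITY_DSCP))  # priority table: stay put
        excess -= low_priority  # only the low-priority share actually moves
    return detour, pin_to_primary
```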
4.6 Operational Experience
EDGE FABRIC has evolved over years in response to growth at our PoPs and from realizations
derived from operational experience. Our current design of EDGE FABRIC focuses on providing the
flexibility that we require to handle different egress routing scenarios, but prefers well understood
techniques and protocols over more complex approaches whenever possible.
4.6.1 Evolution of EDGE FABRIC
As the size and number of our PoPs have continued to grow, we strive for a simple, scalable design.
These desired traits have required the continuous evaluation and improvement of different pieces of
our design and the careful consideration of how a design decision will impact us in the long-term.
4.6.1.1 From stateful to stateless control
Our current implementation of EDGE FABRIC is stateless, meaning that it makes its allocation and
override decisions from scratch in each 30 second cycle, without being aware of its previous detours.
This approach has a number of advantages stemming from the simplicity of the design. For instance,
because the stateless allocator begins each cycle by gathering all information it needs and projecting
what utilization will be if the controller does not intervene (§4.4.1.2), it is straightforward to test,
restart, or failover the controller. The controller only needs to calculate what traffic should be moved
given its inputs and projection, and can be tested by simply providing input scenarios and checking
its decision (§4.4.4).
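To illustrate why statelessness simplifies testing, here is a toy, self-contained example: an allocator that is a pure function of one input snapshot, plus a unit test that feeds it a synthetic overload scenario. The heuristic shown (detour the smallest prefixes first) is only a placeholder and does not reflect EDGE FABRIC's actual decision criteria (§4.4.2).

```python
def allocate(snapshot):
    """Return {prefix: detour_interface} for a single 30-second cycle."""
    overrides = {}
    for intf, cfg in snapshot["interfaces"].items():
        demand = sum(rate for p, rate in snapshot["prefix_rates"].items()
                     if snapshot["preferred_interface"][p] == intf)
        excess = demand - cfg["detour_threshold"] * cfg["capacity_bps"]
        # Placeholder heuristic: detour the smallest prefixes until the
        # projected utilization fits under the detour threshold.
        for prefix, rate in sorted(snapshot["prefix_rates"].items(),
                                   key=lambda kv: kv[1]):
            if excess <= 0:
                break
            if snapshot["preferred_interface"][prefix] != intf:
                continue
            alternates = snapshot["alternate_routes"].get(prefix)
            if alternates:
                overrides[prefix] = alternates[0]
                excess -= rate
    return overrides

def test_detours_when_projected_overload():
    snapshot = {
        "interfaces": {"peer-A": {"capacity_bps": 10e9, "detour_threshold": 0.95}},
        "prefix_rates": {"203.0.113.0/24": 2e9, "198.51.100.0/24": 8.5e9},
        "preferred_interface": {"203.0.113.0/24": "peer-A",
                                "198.51.100.0/24": "peer-A"},
        "alternate_routes": {"203.0.113.0/24": ["transit-X"],
                             "198.51.100.0/24": ["transit-X"]},
    }
    # Projected demand (10.5 Gbps) exceeds 95% of capacity, so something
    # must be detoured; the small prefix suffices.
    assert allocate(snapshot) == {"203.0.113.0/24": "transit-X"}

if __name__ == "__main__":
    test_detours_when_projected_overload()
```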
In comparison, our previous stateful implementation required recording the allocator’s state
after each round both locally and remotely. If the controller was restarted due to an upgrade or a
failure, it had to recover its previous decisions from a remote log, increasing complexity. In addition,
the stateful controller’s decision process was more complicated, as the controller not only had to
decide which prefixes to shift when interfaces were overloaded but also which existing overrides to
remove given current interface load. Because the stateful controller would not consider removing
overrides until interface utilization dropped below a threshold, it could not backtrack while interface
utilization was still increasing, and its options for further detouring were restricted by the impact of
its previous actions. In some cases, the controller would shift a prefix to a detour interface, only to
have the detour interface become overloaded in a subsequent cycle (due to the natural growth of
traffic), requiring that the prefix be shifted yet again. Maintaining proper accounting of these states
and decisions complicated the implementation and testing, since the logic and tests had to reason
about and inject cascading decisions and states, ultimately providing the motivation for the stateless
redesign.
4.6.1.2 From host-based to edge-based routing
Our current implementation of EDGE FABRIC enacts overrides by using BGP to inject routes to
PRs (§4.4.3). In addition, our design of EDGE FABRIC provides footholds for application and
performance-aware routing without requiring host-based routing: hosts mark flows that may require
special routing and controllers use BGP to inject routes to alternate routing tables at peering routers
as needed (§4.5.1). In comparison, early implementations of EDGE FABRIC relied on host-based
routing to enact overrides. In this model, EDGE FABRIC installed rules on every server in the PoP.
These rules applied markings to traffic destined towards different prefixes. Corresponding rules at PRs
matched on these markings to determine which egress interface a packet should traverse, bypassing
standard IP routing.
During the host-based routing era, EDGE FABRIC evolved through three different marking
mechanisms in production: MPLS, DSCP, and GRE. (Our MPLS based implementation went a step
further than what EDGE FABRIC does today by routing all egress traffic based on decisions made
at the hosts, effectively shifting all IP routing decisions away from our PRs.) MPLS and DSCP
were compatible with our early PoP architectures, in which we strived for balanced peer and transit
connectivity across PRs, and any traffic that required detouring was sent to transit. Since traffic
was subject to ECMP, all PRs had identical rules for the same peers (e.g., DSCP value 12 would
detour traffic to transit X on all PRs). However, as our PoP architectures grew, we increasingly had
imbalanced transit and peering capacity across PRs and wanted control of which PR traffic egressed
at, and so we switched to using GRE tunnels between servers and PRs.
Challenges encountered with host-based routing. From our experience with these mechanisms,
we have found that it is non-trivial to obtain both host software and vendor software that provide
fast and robust support for these tunneling protocols. Shifting the responsibility of routing traffic via
a specific egress interface to end-hosts makes debugging and auditing the network’s behavior more
difficult, as configuration must be inspected at multiple layers. Further, when interfaces fail or routes
are withdrawn, end-hosts must react quickly to avoid blackholing traffic, making synchronization
among end-hosts, PRs, and controllers critical.
In comparison, EDGE FABRIC does not require hosts to be aware of network state and reduces
synchronization complexities by injecting overrides to PRs at the edge of the network. In addition,
this approach empowers PRs to invalidate controller overrides to prevent blackholing of traffic, since
an interface failure will cause the PR to begin routing traffic to the next best route.
Implications on performance-aware and application-specific routing. We believe our current
edge-based approach provides us with many of the advantages of host-based routing with minimal
complexity. While host-based routing gives hosts more control of how packets are routed, the
additional flexibility is not currently worth the added complexity for the following reasons.
First, our approach to overriding destination-based routing allows servers to tag select flows for
special treatment, and it allows controllers to decide whether the routes for these flows should be
overridden (§4.5.1). Although this decoupling limits the degree of control at servers, we believe
that it results in a system that is simpler to design and troubleshoot, and that it provides sufficient
flexibility for our intended use cases (§4.5.2) and the realities of our production environment. In
particular, most prefixes have at most four paths per PoP (§4.2.3), and there is likely little benefit to
splitting application decisions beyond three categories (low, medium, and high-priority). In addition,
our measurements in Chapter 5 reveal limited opportunity for performance-aware routing, reducing
the number of routing decisions we expect to encode and further motivating a simple design.
Second, because Facebook chooses to have traffic ingress, egress, and be served at the same
PoP, the choices for routes are limited to decisions that can be signaled to and enacted at the PRs. If
another PoP starts to provide better performance for a user network, the global traffic controller will
redirect the traffic to that PoP.
4.6.1.3 From global to per-PoP egress options
Previously, Facebook propagated routes from external peers between PoPs in an iBGP mesh, such
that a user’s traffic could ingress via one PoP and egress via another. The ability to route traffic
across the WAN to egress at a distant PoP can improve performance in some cases, but we had to
design mechanisms to keep it from causing oscillations. For instance, some of the traffic on an
overloaded egress interface may have ingressed at a remote PoP. If EDGE FABRIC overrode the
route at the egress PoP to avoid congestion, the override update propagated to the ingress PoP. If
we had allowed the override to cause the ingress PoP’s BGP process to stop preferring the egress
PoP, the projected demand for the overloaded egress interface could have dropped, which could
cause EDGE FABRIC to remove the override, which would cause the ingress PoP to again prefer the
original egress PoP, an oscillation. To prevent the override from causing the ingress PoP to change
its preferred egress PoP, the controller set the BGP attributes of the override route to be equal to the
original route. However, manipulating BGP attributes at the controller obfuscated route information,
making it difficult to understand and debug egress routing. We disabled route redistribution between
PoPs once we improved the accuracy and granularity of the global load balancer’s mapping, and now
traffic egresses at the same PoP as it ingresses. Since the global load balancer controls where traffic
ingresses, it can spread traffic for a client network across PoPs that it considers to be equivalent
to make use of egress capacity at multiple PoPs. This allows Facebook to avoid using backbone
capacity to route traffic between PoPs and simplifies EDGE FABRIC’s design.
4.6.1.4 From balanced to imbalanced capacity
As our PoPs have grown, we have had to extend our PoP design to handle corner-cases. As we
increased the size and scale of our PoPs, we began to have more peers with imbalanced capacity
(varying capacity to the same peer across PRs), partly due to the incremental growth of peering
connectivity, but also because of inevitable failures of peering links and long recovery times. These
imbalances created a problem because our ASWs use ECMP to distribute traffic evenly among PRs
with the best paths. Instead of extending EDGE FABRIC to handle these imbalances, we could have
chosen to use WCMP (Weighted Equal-Cost Multi-Path routing) at our ASWs and PRs.
However, we chose to instead extend EDGE FABRIC to handle these capacity imbalances for a
number of reasons. First, while a number of routing and switching chipsets support WCMP, it has far
less adoption and support from our vendors than ECMP, making it riskier to adopt. Even with vanilla-
ECMP, we have observed unequal traffic distributions (§4.4.4) and other erroneous or unexpected
behavior. Second, since the WCMP implementations used by vendors are proprietary, we cannot
predict how WCMP will behave, making projecting utilization and engineering traffic more difficult,
and potentially increasing the complexity of EDGE FABRIC. Finally, WCMP implementations
operate on the router’s entire routing table, can take minutes to converge after an event (such as
a link failure creating imbalanced capacity), and may not be efficient enough to balance traffic
properly [477]. In comparison, EDGE FABRIC can identify the subset of prefixes with traffic and
inject routes to mitigate the failure within seconds.
4.6.2 Challenges at public IXPs
Public Internet Exchange Points (§2.2.2) have been a focus in academia [6, 186, 187], but they
present unique challenges to a provider of Facebook’s scale. In contrast to a dedicated private
interconnect, a provider cannot know how much capacity is available at a peer’s port at an IXP, since
other networks at the IXP may be sending to it as well. This limited visibility makes it harder to
simultaneously avoid congestion and maximize interface utilization. EDGE FABRIC supports setting
limits on the rate of traffic sent to a peer via a public exchange to avoid congesting a peer’s public
exchange connection (§4.3.1). Some IXPs report the total capacity of each peer’s connection to
the shared fabric, but this information alone cannot be used to set a limit since we need to account
for traffic that the peer will receive from other peers on the exchange. As a result, we set capacity
constraints by contacting IXP peers and asking them for estimated limits on the maximum rate of
traffic that we can send to them, as the peers have more insight into their interface capacity and
utilization. We handle overload on these connections using the same approach as a regular PNI
interface, except we limit utilization per nexthop.
4.7 Conclusion
Today’s Internet traffic is dominated by a small number of large CDNs. How they interact with other
ASes largely shapes interdomain routing around the world.
This chapter discusses the design and implementation of EDGE FABRIC, a system that steers
Facebook’s CDN traffic to the world. EDGE FABRIC delegates routing decisions traditionally made
by BGP to a software-defined controller, and in doing so enables Facebook’s routing decisions to
consider interconnection capacity and demand. EDGE FABRIC maintains compatibility with existing
BGP infrastructure while also providing footholds for performance-aware and application-specific
routing. Our evaluation demonstrates that EDGE FABRIC is able to prevent congestion at the edge
of Facebook’s network while facilitating high interconnection utilization.
BGP will be the Internet’s interdomain routing standard for the foreseeable future. By sharing
our experience engineering Facebook’s egress traffic, including a detailed look at opportunities
and challenges faced by today’s large CDNs, we hope that the limitations of BGP can be better
understood and every Internet user’s experience can be improved.
Chapter 5
A View of Internet Performance From Facebook’s Edge
5.1 Introduction
In Chapter 4, we examined how Facebook, by way of establishing points of presence and peering
interconnections around the world, has built a CDN with short paths to users, path diversity that
provides options for routing, and significant capacity in aggregate.
Ideally, Facebook would be able to use this rich connectivity to improve user experience by
always routing traffic via the most performant path. However, before we could explore such
opportunities, we first had to address challenges that were making it difficult for Facebook to
use this connectivity with even a simple routing policy. In Chapter 4, we discussed how EDGE
FABRIC’s dynamic routing decisions make efficient use of Facebook’s capacity while also preventing
congestion at the edge of Facebook’s network. Any provider with connectivity similar to Facebook’s
likely requires the use of a similar controller to make effective use of their connectivity [389, 469].
With these challenges addressed, we return our focus to understanding the value of, and potential
opportunities that arise with, Facebook’s connectivity. This chapter focuses on two questions:
1. Does Facebook’s rich connectivity provide good performance for Facebook’s users?
2. Can Facebook improve performance by incorporating real-time performance measurements
into EDGE FABRIC’s routing decisions?
These questions naturally arise given the investments that companies like Facebook have made
to establish interconnections around the world and design control systems (like EDGE FABRIC) to
manage them. However, while an extensive body of prior work has investigated user performance on
the Internet across a variety of dimensions, prior work has typically focused on a specific region [91,
185], access technology [31, 34, 472], or specifically on interconnections [8, 185, 213]. Studies that
attempted to more broadly characterize user performance or opportunities for performance-aware
routing have been constrained in vantage points and measurements, limiting the conclusions they
could make [8, 9, 10, 71, 213, 416].
To get answers, we design and deploy a measurement system to collect performance insights
from existing production traffic at all of Facebook’s points of presence worldwide (§5.2.2), a subset
of Facebook’s CDN infrastructure. The measurement system runs continuously in production and
randomly samples user HTTP sessions at Facebook’s servers.
Our analysis relies on a 10 day dataset captured in September 2019. The dataset captures
performance for trillions of HTTP sessions and has global coverage with measurements from
hundreds of countries and billions of unique client IP addresses. This high volume of samples
is necessary for us to conduct granular analysis of performance, such as identifying spatial and
temporal variations. The dataset also contains measurements from production traffic continuously
routed via alternate routes using EDGE FABRIC capabilities previously introduced in Section 4.5.1.
We use these measurements to evaluate if Facebook’s extensive connectivity creates opportunities
for performance-aware routing, in which an alternate (non-default) route offers better performance.
We make four contributions:
We characterize Facebook’s user traffic. We begin with a characterization of Facebook’s user
traffic, which during the September 2019 study period was predominantly composed of TCP flows
carrying HTTP/1.1 or HTTP/2 traffic (§5.2.3). Most objects requested by users are small (50% of
objects fetched are less than 3 KB), and HTTP sessions can be idle for the majority of their lifetimes.
As we show, these insights are key to our methodology for measuring performance, and are relevant
to tangential work in congestion control and transport design.
We develop techniques to enable the accurate characterization of network performance from
production traffic. We quantify performance by measuring propagation delay and goodput from
existing production traffic (§§ 5.2.2 and 5.3). Propagation delay is important in interactive applica-
tions where a user is actively blocked awaiting a response, such as waiting for a search query to return
or a video to start playing. Goodput depends on loss, latency (propagation, queuing, and link/MAC
layer delays), available bandwidth, the behavior of the congestion control algorithm, and sender
behavior, and a connection must be able to support high goodput to facilitate streaming high-quality
video and to enable clients to quickly download large objects. However, capturing insights into
achievable goodput from measurements of existing production traffic is challenging given that most
objects served by Facebook are small; under such conditions goodput measurements often represent
a lower bound on what the underlying connection is capable of supporting because the transfer may
not be large enough to exercise the bandwidth-delay product, or because the congestion control
algorithm may still be in initial slow start. Our novel approach to goodput measurements accounts
for these intricacies by determining (1) if a transfer is capable of testing for a given goodput (we
focus on 2.5Mbps, the minimum bitrate for HD video) and (2) for capable transfers, if the transfer
achieved the target goodput after correcting for aspects such as CWND growth and transmission
time (§5.3.2). This approach enables us to differentiate between goodput restricted by network
conditions (which we want to measure) and goodput “only” restricted by sender behavior. Our
approach is practical and deployed in production at Facebook’s PoPs worldwide.
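As a rough illustration of this two-step idea (not the actual methodology of §5.3.2, whose corrections are more involved), the sketch below first asks whether a transfer is even large enough to exercise the target rate, and then evaluates goodput over the portion of the transfer that is not dominated by ramp-up; the slow-start allowance and other constants are assumptions made for the example.

```python
TARGET_BPS = 2.5e6  # minimum bitrate for HD video, the target discussed above

def can_test_target(response_bytes, min_rtt_s, init_cwnd_bytes=10 * 1460):
    """A transfer can only exercise the target rate if it is large enough to
    outgrow the initial congestion window and fill the bandwidth-delay
    product at that rate; otherwise measured goodput is sender-limited."""
    bdp_bytes = (TARGET_BPS / 8.0) * min_rtt_s
    slow_start_allowance = 4 * max(init_cwnd_bytes, bdp_bytes)  # assumed margin
    return response_bytes > slow_start_allowance

def achieved_target(response_bytes, transfer_time_s, rampup_time_s):
    """Evaluate goodput only over the post-ramp-up portion of the transfer so
    that CWND growth and transmission time are not blamed on the network."""
    effective_time = max(transfer_time_s - rampup_time_s, 1e-3)
    return (response_bytes * 8.0) / effective_time >= TARGET_BPS
```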
We characterize Internet performance from Facebook’s vantage point across temporal and
spatial dimensions. Using the results from our methodology, we characterize the performance
seen by Facebook’s users worldwide (§5.4). We find that a majority of Facebook user sessions have
low propagation delay (the global median MinRTT — an estimate of propagation delay — is less
than 40ms) and achieve the goodput required to stream HD video.
We aggregate measurements by geolocation information and time to facilitate spatial and tempo-
ral analysis, and employ statistical tools when comparing aggregations to separate measurement
noise from statistically significant differences. We examine regional variances and show that users in
Africa, Asia, and South America in particular experience poorer performance. We find that the vast
majority of traffic sees minimal degradation over the 10 days in the study period (§5.5). We identify
episodes of degradation that could be caused by failures and predictable periods of degradation that
could indicate recurring downstream congestion.
Given the coverage of the dataset, and because a large share of global Internet traffic comes
from a small number of well connected content and cloud providers with connectivity similar to
Facebook’s (Chapter 3), performance measurements and analysis from Facebook’s vantage point via this dataset likely provide a rough reflection of user performance to popular services in general.
We investigate the potential utility of performance-aware routing and compare the perfor-
mance of peer and transit routes. Finally, we investigate whether it is possible to improve
user performance by incorporating performance measurements into Facebook’s routing decisions.
Chapter 4 discussed how Facebook’s EDGE FABRIC provides footholds to support the measure-
ments and control systems required for performance-aware routing, and related work such as
Google’s Espresso [469] discusses a handful of cases where incorporating real-time performance
measurements provides value. However, to date no work has defined a concrete methodology for
capturing and converting network metrics into decisions, or exhaustively quantified the benefits of
performance-aware routing in such environments.
In this chapter, we use the aforementioned footholds in EDGE FABRIC to route a small fraction
of production traffic to every prefix along alternate paths throughout our 10 day study period. Our
analysis of these measurements finds that there is limited opportunity to improve performance
by incorporating such measurements into EDGE FABRIC’s decision process (§5.6). In particular,
we find that performance-aware routing could improve propagation delay by 5ms or more for
only a few percent of traffic, showing that the existing static BGP routing policy employed by
Facebook (§4.2.3) is close to optimal. In addition, we find that incorporating performance into
EDGE FABRIC’s decision process is non-trivial and can result in complex oscillations; this is
because route performance is in part a function of load and thus can change when Facebook — or
another organization — shifts traffic in response to performance observations. However, while
performance-aware routing provides little benefit, we find that establishing peering interconnections
yields significant performance benefits, with 40% of traffic egressing via peering routes having
statistically significant improvements in propagation delay over the best transit path.
Summary. Our results show that Facebook (and likely other CDNs with similar infrastructure)
can provide a good user experience by widely interconnecting and by using control systems,
such as EDGE FABRIC, to prevent congestion on these interconnections.
5.2 Data Collection Overview and Traffic Characteristics
This section presents an overview of the Facebook content serving infrastructure as it pertains to
serving client traffic (§5.2.1). We describe our passive measurement infrastructure and the data
that we collect and discuss how we use the EDGE FABRIC functionality introduced in Section 4.5
to continuously measure the performance of multiple egress routes in parallel with production
traffic (§5.2.2). We then characterize Facebook user connections at the application and transport
layers (§5.2.3). In Section 5.3, we use these insights to illustrate the challenges to quantifying
user performance from passive measurements in Facebook’s environment and how our approach
overcomes these challenges.
5.2.1 Facebook user traffic
At the time of this study (September 2019), the vast majority of Facebook’s user traffic was HTTP/1.1
or HTTP/2 secured with TLS atop a TCP transport, which we refer to as an HTTP session.¹ A client
establishes an HTTP session with an HTTP endpoint (distinguished by IP address) depending on
the application and the type of request. Each HTTP session can have one or more transactions, each
composed of an HTTP request from the client and response from the server.
As discussed in Chapter 4, Facebook’s global load balancing system, Cartographer [387], steers
client requests to PoPs based on performance information, and determines the PoP at which traffic
ingresses into Facebook’s network and is subsequently served from (§4.2.2). TCP connections
directed to a PoP’s address space are terminated at one of Facebook’s Proxygen load balancers [396]
located within the PoP. Once the transport-layer connection is established, the Proxygen load
balancer forwards HTTP requests to internal endpoints. Throughout this chapter, when we refer to a
server we are referring to a Proxygen load balancer.
¹ As of 2020, the majority of Facebook’s traffic is composed of QUIC flows carrying HTTP/3 traffic [226]. We do not believe this change in transport has significant implications for this study’s results; the largest implication — the removal of interference caused by Performance Enhancing Proxies — is discussed in Section 5.2.2.1 and footnote 2.
Section 4.2 provides more details on the architecture of Facebook’s points of presence, including
how Facebook uses Cartographer to direct client traffic to ingress at a specific point of presence, and
how EDGE FABRIC is used to route egress traffic to clients.
5.2.2 Measurement infrastructure and dataset
We measure performance by collecting metrics from existing production traffic at our servers. We
present the reasoning behind this design choice and its trade-offs relative to approaches used in prior
work. We also describe how we collect data and the properties of the dataset.
5.2.2.1 Why server-side passive measurements?
Prior work has characterized performance via instrumentation at clients that executes measurement
tasks such as downloading an object or pinging an endpoint [8, 72, 213] and/or network probing
measurements executed from the server side, such as pings or traceroutes [474]. In contrast, we
use server-side measurements of existing production traffic as they best support our measurement
goals. In this section, we discuss how this approach supports our goals, while also surfacing three
key drawbacks of server-side measurements:
Avoid overhead at clients: Capturing measurements at the server side from existing traffic
does not introduce any overhead at clients. In comparison, executing measurements from clients
must be done with extreme care to avoid negative consequences. For instance, using a speedtest
to determine a connection’s ability to support a given goodput (e.g., “Can the connection play HD
video without stalling?”) may require transferring a large volume of traffic (§5.3.2) and thus may
increase data usage and reduce battery life on user devices.
Facilitate granular time series analysis: Server side measurements can satisfy the high sam-
pling rates required for temporal analysis. Events such as congestion or failures can cause network
performance to quickly change. Detecting such events and evaluating options to mitigate requires
sufficient samples to make statistically significant conclusions at short-time scales. Capturing
sufficient samples with active measurements is challenging [417].
Ensure representative results: Active measurements such as pings may not produce repre-
sentative performance measurements and may further suffer from coverage limitations. For instance,
ICMP traffic may be de-prioritized, dropped, or routed over a different path than TCP traffic [206,
404, 409], and prior work has found that over 40% of hosts do not respond to ICMP probes, limiting
the coverage of systems that rely on such probing traffic [206, 474]. Likewise, active measurements
that have low overhead — such as pings and small transfers — do not provide insight into key
metrics, such as a connection’s ability to support a given goodput.
We sidestep these challenges by measuring performance from production traffic. First, we can
be confident that our measurements are representative of the network conditions our production
traffic is subject to. In addition, because we sample HTTP sessions, the probability of our dataset
having measurements for a client or group of clients is a function of the sampling rate that we
configure and the number of sessions created by clients. Finally, by capturing measurements from
production traffic we can observe when network conditions become a barrier to performance, such
as preventing clients from achieving the goodput required to stream an HD video without stalls.
Enable rapid experimentation: Changes at the server side are easy and quick to roll out
and maintain, in direct contrast with client changes which typically have longer rollout cycles and
require extensive testing. For instance, in this work we evaluate whether changes in how we route
traffic to users can improve performance (§5.6) without requiring any coordination with clients.
Knowledge of CDN conditions and congestion control behavior: Cache misses, high load,
and other conditions at the CDN can delay delivery of objects to clients. Yet measurements captured
at client devices are unaware of these delays, and thus attribute all delays to network conditions.
In contrast, measuring at the server enables us to differentiate between these two culprits. In
addition, the server’s perspective provides insight into the interplay between network conditions and
congestion control behavior, and the resulting impact on delivery of objects to the client — these
insights are key to identifying improvements to congestion control behavior.
(Drawback) May not capture end-to-end performance: Performance enhancing proxies
(PEPs) are commonly deployed in satellite and cellular networks and attempt to improve perfor-
mance by splitting the TCP connection between the user and server and then optimizing TCP
behavior for each segment’s characteristics [53, 465]. Under these conditions, server side perfor-
mance measurements reflect the performance between Facebook and the PEP instead of end-to-end
performance, and thus may overestimate goodput and underestimate latency relative to what would
be measured end-to-end.² However, since Facebook can only optimize for conditions between Facebook’s edge and the PEP, this limitation does not meaningfully affect our analysis.
² Facebook’s adoption of QUIC in 2020 nullifies this drawback, as QUIC’s encryption inherently prevents performance enhancing proxies from splitting connections [214, 245, 434].
(Drawback) Experiments can degrade connection performance: Experiments that impact
production traffic, such as shifting traffic to alternate routes, could (inadvertently) degrade perfor-
mance for users. Limiting the traffic impacted by an experiment reduces this risk, and remaining
aware of alternate route performance reduces a different class of risk, in that it provides Facebook
with continuous insight into the performance implications of the primary route becoming unavailable.
We acknowledge that in some environments, the risks of experimenting with production traffic
may remain a barrier; prior work has cited risks to production traffic in justifying separate active
measurement infrastructure for experiments [72]. However, server-side passive measurements can
still be used to capture real-time insight into current network conditions.
(Drawback) Clients, networks with better connectivity are overrepresented in dataset:
We expect users are more likely to interact with Facebook services when they expect that network
conditions will enable them to have a good experience. As a result, we expect that clients with better
connectivity to Facebook will be overrepresented in our dataset in both fraction of connections
and fraction of bytes transferred. For instance, a user that has found Facebook services perform
poorly over their cellular connection may use the service mostly on their home network. If this
behavior is common for other users with the same cellular service, then in aggregate we will receive
fewer samples and exchange fewer bits with the cellular provider — and receive more samples and
exchange more bits with other providers. While this in itself does not prevent us from evaluating
performance for that provider, it does impact the weight that the network receives during our analysis
in this chapter. For instance, in Section 5.4 we analyze the distribution of performance metrics over
measured sessions, and in Section 5.5 we look at how performance changes for a network aggregate
over time, weighting each aggregate by volume of traffic transferred. If our assumption holds true,
our analysis of these distributions may be optimistic in terms of global and regional performance, as
it gives more weight to users with better connectivity.
We attempt to mitigate this risk by performing spatial analysis by continent — this enables us to
better surface connectivity in regions that would likely have less weight in global distributions. Using
client triggered measurements could potentially help by enabling measurements to be uncoupled
from usage, but such a shift in our measurement methodology would inherently increase overhead
on the clients. In general, this is a non-trivial problem that requires considering other aspects,
such as subscriber counts, to identify cases where an aggregate is underrepresented by our current
methodology relative to the number of users that want to use Facebook in that aggregate. We defer
further exploration of this to future work.
5.2.2.2 Measurement approach and infrastructure
We want to evaluate if Facebook’s connectivity provides good performance for users (§5.4), and
whether Facebook observes instances of performance degradation that could indicate problems
downstream (§5.5). In addition, we want to evaluate if performance-aware routing could provide
additional benefit (§5.6). We developed a single unified measurement strategy that supports all of
these goals.
Our approach makes use of the footholds that we built into EDGE FABRIC to support performance
and application aware routing (§4.5). In addition, we install instrumentation at our edge load
balancers to capture performance metrics; this vantage point is optimal because all Facebook user
HTTP sessions and their underlying transport connections are terminated here (§5.2.1), thereby
providing complete coverage. Our approach works as follows:
Servers randomly select and mark flows to be used for measurement traffic. We implemented
an Extended Berkeley Packet Filter (eBPF) program that runs on all load balancers. The program
randomly selects the underlying TCP flows used by HTTP sessions and applies one of three DSCP
values. The DSCP value applied determines whether the flow will be used to measure the primary,
2nd best, or 3rd best route to the destination. The probability that a flow will be assigned one of
these DSCP values is configurable. This approach allows us to avoid any application-layer changes;
we discuss eBPF programs in more detail in Section 4.5.1.
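The selection logic itself is simple; the Python sketch below mirrors what the eBPF program does on the packet path. The DSCP code points and the exact sampling probability are illustrative assumptions (§5.2.2.3 notes that roughly 1 in 100 sessions were sampled, and the split across the three routes need not be uniform).

```python
import random
from typing import Optional

MEASUREMENT_DSCPS = {"best": 40, "second_best": 41, "third_best": 42}  # assumed values
SAMPLE_PROBABILITY = 0.01  # roughly 1 in 100 HTTP sessions

def maybe_mark_for_measurement(rng=random) -> Optional[int]:
    """Return a measurement DSCP value for this flow, or None to leave the
    flow on default (EDGE FABRIC-controlled) routing."""
    if rng.random() >= SAMPLE_PROBABILITY:
        return None
    # Spread sampled flows across the three candidate routes.
    return rng.choice(list(MEASUREMENT_DSCPS.values()))
```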
The ROUTEPERF controller injects routes. EDGE FABRIC’s modular design enabled us to use
the same framework to build a separate ROUTEPERF controller that is responsible for making routing
decisions for measurement traffic. Similar to EDGE FABRIC, the ROUTEPERF controller runs every
30 seconds, uses BGP routes retrieved from the BMP Collector service (§4.4.1) to determine the
routes available for each prefix, and uses the BGP Injector service (§4.4.3) to inject routes. The key
difference is that the ROUTEPERF controller’s injections only control traffic marked with one of the
aforementioned DSCP values; the controller generates and injects routes for each DSCP value into
the corresponding routing table at peering routers (PRs).³ The ROUTEPERF controller injects the
route that it determines would be selected by Facebook’s routing policy (repeating the projection
process described in Section 4.4.1.2 and thus ignoring capacity constraints) into the first routing
table. The ROUTEPERF controller also calculates the 2nd and 3rd best routes (again, as determined
by Facebook’s routing policy), and if available, injects them into their corresponding tables.
The ROUTEPERF controller is able to ignore capacity constraints because the eBPF program
installed at load balancers only marks a small fraction of flows with one of the DSCP values reserved
for measurements (approximately 1 in 100), and spreads marked flows across the 1st, 2nd, and
3rd best routes. EDGE FABRIC keeps interface utilization from exceeding ~95% (§4.4.1.2), and
the traffic potentially added to an interface by ROUTEPERF’s injections can only consume a small
fraction of the remaining headroom given the eBPF program’s configuration. This ensures that
interconnections at the edge of Facebook’s network remain congestion free, and that changes in
performance we observe are caused by events further downstream.
Because each Facebook PoP exchanges the bulk of its traffic with a few thousand prefixes — compared to the 700k+ in the global routing table (§4.2, [209, 459]) — the ROUTEPERF controller only injects routes for prefixes with significant traffic in the past 10 minutes, as defined by a configurable threshold and based on measurements collected by the Traffic Collector service (§4.4.1.2). This filter drastically reduces the size of the injected routing tables and corresponding resource requirements for Facebook’s peering routers.
³ The BGP Injector service establishes one BGP session per PR routing table and uses communities to tag routes; route policy at the PR uses the communities to propagate routes to the correct table.
As a safeguard, the ROUTEPERF controller takes as input a set of destination ASNs that it will
not inject alternate routes for. This list includes ASes that Facebook knows (from prior operational
experience) to have extremely poor performance on alternate paths to reduce the potential impact to
end-users.
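Putting the preceding pieces together, one 30-second ROUTEPERF cycle can be sketched as follows. The callbacks stand in for the BMP Collector, Traffic Collector, and BGP Injector interfaces described in Chapter 4, and the ranking, threshold, and table names are placeholders rather than the production configuration.

```python
TABLE_FOR_RANK = {0: "measure_best", 1: "measure_second", 2: "measure_third"}

def routeperf_cycle(prefixes_with_traffic, get_routes, policy_rank, origin_asn,
                    inject_route, min_traffic_bps=1e6, excluded_asns=frozenset()):
    """prefixes_with_traffic: {prefix: recent_rate_bps} from Traffic Collector.

    get_routes(prefix) returns candidate BGP routes (from the BMP Collector),
    policy_rank orders them per the routing policy, and inject_route hands the
    result to the BGP Injector for the named per-DSCP routing table."""
    for prefix, rate in prefixes_with_traffic.items():
        if rate < min_traffic_bps:
            continue  # only inject for prefixes with significant recent traffic
        ranked = sorted(get_routes(prefix), key=policy_rank)  # best route first
        # Safeguard: for some destination ASNs, never steer traffic onto alternates.
        limit = 1 if origin_asn(prefix) in excluded_asns else 3
        for rank, route in enumerate(ranked[:limit]):
            inject_route(prefix, route, table=TABLE_FOR_RANK[rank])
```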
PRs use alternate routing tables for marked flows. Facebook’s PRs attempt to route flows
marked with one of the three DSCP values via the corresponding routing table first. If the ROUTEPERF
controller has not injected a covering route to the table, the flow will be routed based on the PR’s
default routing table.
Load balancers capture performance measurements. For each new HTTP session, the load
balancer checks if the eBPF program is marking the flow’s packets with the DSCP value reserved
for measurement traffic. For marked flows, the load balancer enables additional instrumentation
that captures transport (TCP) state on accept of, and at termination of, the underlying transport
connection. In addition, for each HTTP transaction in a sampled session, the load balancer captures
socket timestamps and transport state at prescribed points to enable calculation of goodput. We
discuss the specific state collected and how it is used to evaluate performance in Section 5.3.
A post-processing pipeline adds BGP route information. When the load balancer detects a
sampled session’s underlying transport connection has been closed (e.g., for TCP when the connec-
tion enters the TCP_CLOSED state in the kernel’s finite-state machine), it captures the connection’s
termination state. The load balancer then sends all captured information to a separate process
that uses the DSCP marking and BMP information (§4.4.1.1) to determine which egress route the
flow traversed, and then adds the corresponding BGP information (such as the BGP IP prefix and
AS_PATH) to the sample. If the ROUTEPERF controller was not injecting a covering route (e.g., if
the sample’s egress route was not controlled by ROUTEPERF), the sample is discarded.
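A sketch of the join this pipeline performs is shown below, with hypothetical field names and a placeholder longest-prefix-match helper standing in for the BMP-derived routing state; it is meant only to show how the DSCP marking selects the injected table and how samples without a covering injection are discarded.

```python
def annotate_sample(sample, injected_routes_by_dscp, covering_route):
    """sample: {"dscp": int, "client_ip": str, ...}
    injected_routes_by_dscp: {dscp: {prefix: {"as_path": [...], ...}}}
    covering_route(ip, table): placeholder longest-prefix-match lookup that
    returns (prefix, route_info) or None."""
    table = injected_routes_by_dscp.get(sample["dscp"])
    if table is None:
        return None  # not one of the measurement DSCP values
    match = covering_route(sample["client_ip"], table)
    if match is None:
        return None  # ROUTEPERF was not steering this prefix: discard the sample
    prefix, route = match
    sample["bgp_prefix"] = prefix
    sample["as_path"] = route["as_path"]
    return sample
```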
Summary. Our approach enables load balancers to continuously measure primary and alternate
route performance without requiring the load balancers to maintain any information about network
state or to coordinate with the ROUTEPERF controller. In addition, because the ROUTEPERF
controller explicitly steers flows over the 1st, 2nd, and 3rd best route — as defined by Facebook’s
BGP policy — there is no need to coordinate with EDGE FABRIC. Specifically, EDGE FABRIC does
not control the route(s) used by ROUTEPERF controlled traffic, and the routes used by ROUTEPERF
controlled flows will not change due to EDGE FABRIC’s injections; flows assigned the ROUTEPERF
DSCP value associated with the best route will always traverse the best route according to Facebook’s
routing policy, even if EDGE FABRIC is actively steering traffic away from that route. This separation
ensures that changes in performance captured by our measurements are not caused by EDGE FABRIC
changing the path being measured.⁴,⁵
5.2.2.3 Dataset
The bulk of our analysis relies on a 10 day dataset captured in September 2019. During the collection
period, the eBPF program at our load balancers overrode the DSCP field for approximately 1 in
100 of all HTTP sessions so that the ROUTEPERF controller could assign flows to the 1st, 2nd, and
3rd best routes. ROUTEPERF routed approximately 47% of the samples in our dataset via the best
route to the destination as defined by Facebook’s routing policy, regardless of the decisions made by
EDGE FABRIC (§5.2.2.2). The remaining samples measure the performance of the 2nd and 3rd best
routes to the destination as defined by Facebook’s routing policy. We provide more details on the
signals that the load balancers capture along with how we transform and interpret these signals in
Section 5.3.
Since our focus is on performance between Facebook’s edge and users, we filter samples if a
third-party commercial service reports that the client IP address is controlled by a hosting provider
(~2% of measured traffic).⁶ After filtering, our dataset contains measurements for trillions of HTTP
sessions. This volume of samples and our sampling methodology ensures that the dataset has wide coverage; we find that it contains samples from billions of unique client IP addresses spread across hundreds of countries.
⁴ EDGE FABRIC injections could still impact the performance of a route measured by ROUTEPERF given that performance is a function of load, and EDGE FABRIC injections change the load on a path (§5.6.1.2).
⁵ It is possible for routes measured by ROUTEPERF to change if the set of available BGP routes changes (§2.1.2); ROUTEPERF always measures the 1st, 2nd, and 3rd best routes from the set of routes available.
⁶ We have found that these HTTP sessions are composed of API requests (from other organizations’ servers) and traffic relayed by VPN providers. VPN traffic can mislead temporal performance analysis because the composition of users and user locations behind the IP address sourcing VPN traffic can change drastically over time, which in turn changes performance.
5.2.3 Session characteristics
In order to inform the design of our performance measurements, we characterize Facebook’s user
traffic at the HTTP session and transaction level. Our analysis reveals that the majority of session
time (e.g., the time from establishment to termination of the underlying TCP connection) is spent
idle, and most sessions and transactions transfer small amounts of data. Section 5.3 discusses how
these insights shape our approach to measuring goodput.
HTTP sessions are idle for most of their durations. Figure 5.1a shows the distribution of client
session durations. Session times vary, with 7.4% lasting for less than a second, 33% lasting for
less than a minute, and 20% lasting more than 3 minutes. HTTP/2 sessions, which are commonly
used by web browsers and some of Facebook’s mobile applications, have fewer short sessions than
HTTP/1.1. For example, 44% of HTTP/1.1 sessions lasted for less than a minute, while only 26%
of HTTP/2 sessions did. Figure 5.1b shows the percentage of time that the load balancer is actively
sending data for an HTTP session (i.e., the load balancer has data to send to the client and/or there
is unacknowledged data in flight). For both HTTP/1.1 and HTTP/2, the majority of sessions are idle
for most of their lifetime; 80% of HTTP/2 sessions are active less than 10% of the time, while 75%
of HTTP/1.1 sessions are active less than 10% of the time.
Figure 5.1: Cumulative distribution across sessions of (a) session duration and (b) the percentage of session time spent sending traffic, per session HTTP protocol (All, HTTP/1.1, HTTP/2). Most sessions end within 60 seconds and spend most of their lifetime idle.
Figure 5.2: Distribution of bytes transferred across an entire HTTP session, and distribution of response size for all responses and for media responses.
A small fraction of sessions transfer the bulk of data volume. Figure 5.2 shows the distribution
of the number of bytes transferred per HTTP session. Over 58% of sessions transfer fewer than
10 kilobytes. However, there is a long tail with 6% of sessions transferring over 1 megabyte, the
majority of which we find contain transactions for streaming video objects (not shown).
Figure 5.2 also shows the distribution of response sizes across transactions. Over 50% of
responses are fewer than 6 kilobytes, the vast majority of which are responses to API calls, rendered
HTML, and other dynamically generated content. Responses from HTTP endpoints designated
for images and videos (labeled as “media responses”) have a slightly larger response size, with a
median response size of ~19 kilobytes, and 17% of responses transferring at least 100 kilobytes.
Figure 5.3: Distribution of (a) transactions and (b) bytes transferred per HTTP session, per session HTTP protocol (All, HTTP/1.1, HTTP/2). Over 80% of sessions have fewer than 5 transactions, but the majority of traffic is on sessions with greater than 50 transactions.
HTTP sessions comprise a small number of transactions. Figure 5.3 shows that most sessions
have a single transaction, and over 87% of HTTP/1.1 and 75% of HTTP/2 sessions have fewer than 5
transactions. For HTTP/1.1, web browsers may establish multiple connections to the same endpoint
to enable multiple objects to be requested in parallel. In comparison, multiplexing and pipelining are
supported within an HTTP/2 connection, so web browsers typically only establish a single HTTP/2
connection to an endpoint. As a result, HTTP/2 sessions have more transactions than HTTP/1.1 on
average. However, the bulk of total traffic is carried by sessions with many transactions — sessions
with 50 or more transactions account for more than half of all egress traffic.
5.3 Quantifying Performance
In this chapter, our goal is to characterize the network conditions for HTTP sessions between
Facebook’s points of presence and clients. Because Facebook’s global load-balancer directs clients
to the PoP that provides the best performance (§5.2.1), this characterization provides insight into
the ability of Facebook’s PoPs — a subset of Facebook’s global CDN infrastructure — to provide
the network performance required by Facebook’s applications. In addition, it provides insights into
how network performance changes due to factors outside of Facebook’s control (e.g., congestion
and route changes beyond the edge of Facebook’s network), and how network performance varies
regionally. We use two metrics to characterize network performance:
1. Round-trip network propagation delay. The round-trip network propagation delay is the time
required for a single minimal-sized TCP packet (e.g., a packet with no payload) to travel from
the client to the server at the Facebook PoP and back. The round-trip network propagation
delay defines the minimum network time for an HTTP transaction, and thus also determines
the minimum user-perceived latency for any request/response interaction [22, 112, 238]. The
round-trip network propagation delay depends on conditions of the network route between
the client and the PoP, and aspects related to the client’s access technology, such as the use of
interleaving [243, 399, 416, 471, 472].
2. Goodput ratio. The probability that a TCP connection between a client and a server at the
Facebook PoP can support a given goodput for transfers from the PoP to the client (footnote 7). This
probability is a function of available and bottleneck bandwidth, the provisioned speed of the
user's access link (e.g., the rate that the user subscribes to from their ISP, and at which the
ISP throttles the link), loss, latency (propagation, queuing, and link/MAC layer delays), the
behavior of the congestion control algorithm, sender behavior, the client's access technology,
and the ISP's network management practices [14, 139, 190, 215, 217, 293, 294, 321, 416,
471].
User quality of experience is a function of both of these components for every application — what
varies is each application's specific requirements. For instance, loading a webpage composed of a
single 50KB object in less than 200 milliseconds requires the sum of round-trip network propagation
delay and transmission time be less than 200 milliseconds (footnote 8). Streaming high-definition video has
high goodput requirements and soft real-time latency demands [111] — streaming a video encoded
at 2.5 Mbps requires the connection be capable of supporting at least 2.5 Mbps goodput, and
propagation delay determines the minimum time required for a video to start playing. Real-time
games require low propagation delay and a network capable of continuously delivering a
low-rate stream of time-sensitive updates [199] (footnote 9).
Footnote 7: While we focus on TCP, our approach is also applicable to QUIC [214] and any reliable, stream-oriented transport.
Footnote 8: More specifically, transferring 50KB in less than 200 milliseconds requires (i) the round-trip network propagation delay be less than 200 milliseconds and (ii) the network be able to support a goodput greater than 2 Mbps, given that transmission time at 2 Mbps is already 200 milliseconds (50 KB / 2 Mbps = 200 milliseconds).
5.3.1 Estimating round-trip propagation delay with MinRTT
When a user clicks a button on a Facebook web page, the minimum amount of time it will take for
the request to reach the Facebook server and for the result to arrive back at the user is determined by
the round-trip propagation delay between the user and the server (footnote 10). While other components — in-
cluding server processing time and the size of the request and response — also play a role in the
total time required for a request/response operation, these timings are in addition to the round-trip
propagation delay. Put another way, the round-trip propagation delay defines the minimum network
time for any size payload, and therefore also determines the minimum user-perceived latency [22,
112, 238, 471] for request/response interactions when the response is not cached or prefetched (footnote 11).
Our estimate of round-trip propagation delay enables us to identify when the minimum network
time presents a barrier to application performance and quality of experience. For instance, for
content that is not cached or prefetched, it will be impossible to achieve a user-perceived latency of
200 milliseconds or less if the round-trip propagation delay exceeds 200 milliseconds — even if the
client is connected to the Internet via a 1 Gbps fiber connection. We discuss the components that
play a role in round-trip propagation delay in Section 5.3.1.1.
Footnote 9: Since games commonly use UDP for transport, this property is often measured in terms of jitter [199].
Footnote 10: The propagation delay of an individual link is defined by physical constraints such as the speed of light through fiber [246]. The round-trip propagation delay is the time required for a single packet to traverse all components in the path, which includes each link's propagation delay.
Footnote 11: Caching and prefetching are commonly used to reduce user-perceived latency [30, 238, 244, 270, 441].
Rules of thumb for round-trip propagation delay. Prior work has both measured and quantified
the impact of user-perceived latency on e-commerce and other applications [22, 40, 99, 112, 238,
264, 306] and provides rules of thumb that we can use to understand the relationship between
round-trip propagation delay and quality of experience:
• Beyond 8Mbps, round-trip propagation delay is the primary bottleneck for page load times
in last-mile access networks [420].
• An online gaming services provider uses 80ms round-trip propagation delay as a cutoff for
good performance [296].
• ITU-T G.114 recommends at most a 150ms one-way propagation delay for telecommunica-
tions applications; higher delays significantly degrade user experience [346].
• Recent work suggests that fully immersive AR/VR applications may require round-trip
propagation time to be as low as 7ms, a requirement driven by the strict latency thresholds of
human senses [306]. Less immersive applications — such as remote surgery and teleoperated
vehicles — require a round-trip propagation time of less than 250ms [306].
5.3.1.1 Defining round-trip propagation delay and its components
We define the round-trip network propagation delay as the minimum time — meaning that it excludes
variable forms of delay, such as sporadic periods of queuing on the client’s access link — required
for a single TCP packet with no payload to travel from the client to a Facebook server at a PoP, and
back to the client. Given this definition, the round-trip network propagation delay depends on:
1. The sum of propagation delays for all links in the path. The propagation delay for a
single link is a function of physical constraints and the delay added by mechanisms such as
interleaving. The physical length of each link determines the propagation delay added by
constraints such as the speed of light through fiber; the delay incurred due to these constraints
can be significant if there is no Facebook PoP nearby, and for clients connected to the Internet
via a satellite connection (footnote 12). Even when a PoP is physically close to the client, the physical
distance the packet traverses may be significantly greater than the "as the crow flies" distance
between the client and the PoP, given that the former is a function of physical network
topology and routing decisions [91, 185, 243, 420, 471, 472].
Interleaving is used to provide protection against noise. For instance, in Digital Subscriber
Line (DSL) networks, interleaving is used in conjunction with forward error correction to
facilitate recovery of errors caused by line noise between a client's modem and the upstream
Digital Subscriber Line Access Multiplexer (DSLAM) [170]. However, this error handling
comes at a cost: interleaving can increase propagation delay by 20 milliseconds or more [28,
170, 416]. The overhead of interleaving varies by access technology and the type of noise
that interleaving is used to guard against. For instance, interleaving is used on DOCSIS
cable Internet connections for the downstream path — from the Cable Modem Termination
System (CMTS) to the client — and typically adds at most 1 millisecond of propagation
delay [37] (footnote 13). In comparison, for 2G and 2.5G data networks such as GPRS and EDGE,
interleaving increases propagation delay by hundreds of milliseconds [230, 358, 443].
Footnote 12: Traversing a satellite in geosynchronous orbit can add 500 milliseconds or more to the round-trip propagation delay [19, 178, 298, 354].
Footnote 13: Specifically, "[t]he deepest interleaving depth available in the DOCSIS RF specification provides 95 microsecond burst protection at the cost of 4 milliseconds of additional propagation delay" [242] for QAM 64, and 2.8 milliseconds of propagation delay for QAM256 [37, 242].
2. The sum of transmission delays for all links in the path. A packet will typically traverse
multiple devices, including switches and routers, each of which will forward a packet to its
next hop in the path. A switch or router cannot begin forwarding a packet until the packet has
completely arrived from the previous hop — a packet cannot be forwarded bit-by-bit. The
time required for a packet to traverse a link is determined by the link's per-bit transmission
delay. The link between the client and ISP or mobile network is typically the bottleneck link
in the path — the link with the highest per-bit transmission delay — and thus the primary
contributor of transmission delay. Transmission delay is typically a small component of the
round-trip propagation delay for modern networks. For instance, the transmission time for a
40 byte packet — the size of a TCP packet with no payload — would be 320 microseconds
through a 1 Mbps access link (footnotes 14, 15).
Footnote 14: The minimum TCP header size is 20 bytes and the minimum IPv4 header size is 20 bytes.
Footnote 15: The transmission time for a single packet is a function of the link's per-bit transmission delay; for radio networks, this delay varies with channel conditions. The time required for a sequence of packets to traverse a link is a function of the link's per-bit transmission delay and the link's provisioned rate, the latter of which is enforced through traffic shaping [139]. A shaper does not impact the per-bit transmission delay; rather, it delays packet transmission as needed to ensure that the number of bytes transmitted per interval does not exceed the link's provisioned rate. Thus, if a packet is not delayed by a shaper or queuing, it will traverse the link in transmission time.
3. Time spent queued due to resource contention. Resource contention at shared links in a
network’s backbone can lead to standing queues or delays in forwarding time that remain
present for extended periods of time. For instance, resource contention in cellular networks
can lead to situations where packets arriving at a base station are delayed considerable time
before being forwarded to the corresponding handset [31]. Likewise, a persistent standing
queue can form at an interconnection between autonomous systems or at a link within an
autonomous system when demand consistently exceeds capacity (chapter 4, [108]). Since
such delays increase the minimum network time for an extended period of time and are not
specific to any single client’s network conditions, they are part of the round-trip propagation
delay.
Round-trip propagation delay is expected to be constant for a session. We expect the round-
trip propagation delay to be constant for the vast majority of sessions because the components of
propagation delay are likely to remain constant over the lifetime of a session:
1. The underlying physical path and properties such as interleaving are unlikely to change.
Sessions are brief (§5.2.3) and thus the client's (rough) physical location and the network
route between the client and the Facebook PoP are unlikely to change for the vast majority
of sessions. Likewise, we expect the delay contributed by interleaving to remain constant
for most sessions given that it would only change if the client was changing between access
technologies, the probability of which is low given the brief session time.
2. The delay accrued due to transmission delay is unlikely to change. The transmission
delay is a constant for most links between Facebook and the client — it is likely only
variable if the client is using a radio link. Even then, changes in transmission delay are
likely inconsequential given that transmission delay contributes minimal time to the overall
round-trip propagation delay except in the case of very slow links.
3. Contention is likely to be stable over time. Prior work has found that traffic arrival rates in
backbone networks at small time scales are smooth because small variations in round-trip
time and processing time de-synchronize the large number of flows, and flow transfer rates
are slow relative to backbone link capacity [143, 339]. As a result, we expect that when
contention occurs, it will not be episodic but instead will be steady and related to the average
demand relative to available capacity. For instance, during peak hours contention may occur
at a cellular base station, and may remain until user demand decreases. Because sessions are
brief (§5.2.3), we expect sessions to either (a) exist during contention and thus be subject to
the (presumably relatively stable) effects of contention, or (b) experience no contention.
In Section 5.3.1.2 we discuss what RTT transport measurements capture, and in Section 5.3.1.3
we discuss how we use these measurements to estimate propagation delay.
5.3.1.2 What transport RTT measurements capture
Reliable transport protocols such as TCP and QUIC keep track of round-trip time by measuring
the time between packet transmission and receipt of the corresponding ACK. These measurements
are used to decide when to declare a packet lost and trigger retransmission [214, 371], and for
pacing [73]. The transport’s RTT measurement is not impacted by packet loss, as the measurements
reflect the time it takes for a specific packet to traverse the network and not the time it takes for a
payload of application data to traverse the network (the latter is a function of goodput) (footnote 16).
A transport RTT measurement captures round-trip propagation delay and other forms of variable
delay, some of which are a function of the sender’s behavior. For instance, a congestion control
algorithm probes for bandwidth by (i) increasing the rate at which it writes data to the network, and
then (ii) inferring whether sending at this new rate causes congestion at the bottleneck link [52, 215,
324] (footnote 17). If the sender writes a sequence of packets to the network at a rate that is faster than they
can traverse the bottleneck link, then the RTT measured by the transport will increase [73, 414].
However, this increase is not representative of a change in network conditions — rather, it is an
artifact of sender behavior, and more specifically, a function of the rate at which the sender wrote
data to the network relative to available bandwidth at the bottleneck link. Likewise, a congestion
control algorithm that does not pace its writes to the network may write in bursts that cause
queuing at the bottleneck link, even if its average sending rate is less than the available bandwidth
at the bottleneck link (footnote 18). These bursts and the subsequent queuing will also cause variations in
transport RTT measurements which are again a function of sender behavior and thus not indicative
of changes in underlying network conditions. MAC and link-layer delays, including 802.11 link-
layer retransmissions and contention [418, 419], and DOCSIS channel contention [107, 457], also
contribute to the delay captured by the transport's RTT measurements, but the impact of such delays
is difficult to interpret in isolation. Finally, transport behavior, such as delayed ACKs, can contribute
delay that is not related to network conditions at all.
Footnote 16: TCP does not use timing signals captured from retransmitted packets when estimating RTT [371] because following retransmission it is not possible to determine whether an ACK received is for the original packet (in which case, the retransmission was spurious) or if it is for the retransmitted packet, as both packets will have the same sequence number. While options such as TCP timestamps [54] can mitigate this, the TCP implementation in the Linux kernel does not currently make use of timestamps for this purpose. In comparison, when QUIC declares a packet lost it retransmits the data in a new packet with a new sequence number (which may also potentially have different contents from the original packet); the use of a new sequence number removes this limitation and enables QUIC to use every packet for RTT measurements without incorporating retransmission time into RTT measurements [214].
Footnote 17: The congestion control algorithm operating at the sender infers that the route between the sender and the receiver is congested by either (a) assuming that packet loss is indicative of congestion — the approach traditionally employed by loss-based congestion control algorithms [52, 190, 215, 324]; or (b) by looking for changes in latency — the approach used by more recent delay-based congestion control algorithms, such as BBR [73], and extensions to existing congestion control algorithms, such as CUBIC's Hybrid Slow Start [189].
Footnote 18: A congestion control algorithm that does not pace its writes into the network will immediately transmit pending data when bytes in flight is less than CWND — the data will be written to the network as fast as the sender's network interface will allow. While ACK-clocking naturally paces a sender's writes once a connection has entered pipelining [215], without pacing, bursts into the network can occur when an application writes a significant amount of data to a connection that has a high CWND and had become application-limited or completely idle. For instance, if a server has a 40Gbps NIC and a connection with a 100 packet CWND, the server may write 100 packets to the network at 40Gbps if the application writes 100 packets or more of data to the socket buffer. In comparison, a congestion control algorithm that paces its writes into the network will not burst when bytes in flight is less than CWND, regardless of the amount of data in the socket buffer. Instead, a congestion control algorithm that employs pacing will write packets to the network at a defined interval. For instance, if a congestion control algorithm paces at CWND/RTT, then it will attempt to write packets so that (i) a CWND worth of data is written to the network each RTT and (ii) the writes are spread over each RTT. By preventing large bursts into the network, pacing reduces the potential for packet loss caused by downstream queues overflowing and loss induced by policers due to the burst causing the packet arrival rate to exceed the policer's threshold [139].
5.3.1.3 Estimating round-trip propagation delay from transport RTT
The time captured by transport RTT measurements includes both round-trip propagation delay
and forms of variable delay. However, because we assume that the round-trip propagation delay
is constant (§5.3.1.1), it follows that the minimum RTT (MinRTT) measured by the transport
over a connection's lifetime reflects the RTT captured at the point during which variable forms
of delay (queuing on the client's access link, MAC delays, etc.) were lowest (or outright zero).
Thus, a connection's MinRTT represents an upper bound on, and our best estimate of, the round-trip
propagation delay. The Linux kernel's MinRTT metric captures the minimum round-trip time the
transport observed over a configurable window; in Facebook's environment this window is set to 5
minutes. Because the vast majority of HTTP sessions terminate within 5 minutes (§5.2.3), recording
this metric at session termination effectively captures the minimum RTT observed over the session's
lifetime.
Representativeness of MinRTT samples. While the MinRTT is our best estimate of the round-
trip propagation delay for an individual session, the MinRTT may still include time from variable
delay. We surmise that the potential for such overestimation is correlated with (i) the provisioned
speed of the user’s access link and the bottleneck link rate in the path; (ii) the connection workload;
(iii) the behavior of the congestion control algorithm; and (iv) the length of the connection in both
time and packets transferred. The first three components determine (in part) the potential for queuing
in the path. For instance, a client with a slower access link transferring large files is more likely
to experience queuing caused by self-induced congestion or cross traffic (potentially from other
connections created by the same application or endpoints) [156, 414]; although the potential for such
queuing is reduced with congestion control algorithms that consider latency in their decision process,
such as BBR [73, 74, 208]. The fourth component determines the opportunities the transport has to
measure RTT. Queuing delay on access links is expected to be transient, and MAC-layer delays will
change over time. Thus, we assume that a transport able to capture multiple RTT measurements over
a longer period of time is more likely to be able to measure RTT at a time when queuing and other
variable delays are lower (or zero), and thus that the MinRTT for such connections is more likely to
reflect the actual round-trip propagation delay as defined. We have not yet established thresholds to
use in evaluating whether a connection has had sufficient opportunity to measure propagation delay.
Since our goal is to understand the network conditions between Facebook and a group of clients,
and not the network conditions for any single client, we use the median MinRTT (MinRTT_P50)
when assessing how performance changes over time (§5.5) and opportunities to improve performance (§5.6).
In comparison to the average or other percentiles, the median MinRTT of an aggregate
is robust to measurements that reflect conditions specific to an individual client, such as congestion
on a client's access link or 802.11 wireless network. We discuss aggregation of measurements
further in Section 5.3.4.
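As a minimal sketch of this aggregation step (assuming per-session MinRTT samples have already been extracted and tagged with an aggregate key such as client ISP or BGP prefix; the field names are hypothetical):

    # Sketch: compute the median MinRTT (MinRTT_P50) for each aggregate of sessions.
    from collections import defaultdict
    from statistics import median

    def minrtt_p50_by_aggregate(samples):
        """samples: iterable of (aggregate_key, min_rtt_ms) pairs, one per session."""
        per_aggregate = defaultdict(list)
        for key, min_rtt_ms in samples:
            per_aggregate[key].append(min_rtt_ms)
        # The median is robust to a handful of sessions whose MinRTT reflects
        # client-specific conditions (e.g., a congested access link).
        return {key: median(values) for key, values in per_aggregate.items()}

    print(minrtt_p50_by_aggregate([("AS64500/v4", 23.0),
                                   ("AS64500/v4", 140.0),
                                   ("AS64500/v4", 25.0)]))   # {'AS64500/v4': 25.0}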
5.3.2 Measuring goodput with HDratio
User experience depends on Facebook’s being able to quickly deliver objects to end-users. For
instance, after a video has started playing, user experience is primarily dependent on the session's
ability to sustain the playback bitrate — the client must receive chunks from the CDN faster than it
plays them to avoid stalls [111]. Clients with low goodput will be unable to continuously stream
high bitrate video without frequent rebuffering and will experience delays in loading images and
other large objects, and clients with extremely low goodput may be unable to access Facebook
content if requests timeout before delivery of the response completes.
Goodput is a function of available and bottleneck bandwidth, the provisioned speed of the user’s
access link (e.g., the rate that the user subscribes to from their ISP, and at which the ISP throttles
the link), loss, latency (propagation, queuing, and link/MAC layer delays), the behavior of the
congestion control algorithm, sender behavior, the client’s access technology and the ISP’s network
management practices [14, 139, 190, 215, 217, 293, 294, 321, 416, 471].
We use goodput to measure connection quality instead of focusing on metrics such as loss,
as such metrics are difficult to interpret in isolation. For example, while MinRTT (§5.3.1) can
be used to estimate the minimum user-perceivable latency, MinRTT does not reflect the impact
of loss [409]. Likewise, the impact of packet loss cannot be easily discerned in isolation — the
potential for and impact of loss depends (among other things) on the congestion control algorithm,
available bandwidth, application workload, propagation delay, and whether the loss was the result
of self-induced congestion, congestion caused by cross-traffic, or random [14, 190, 215, 217, 293,
294, 321]. Goodput captures the interactions between these elements and their combined impact on
performance.
5.3.2.1 Overview of approach
As with round-trip propagation delay, goodput depends on conditions and properties of the backbone,
access, and customer premise networks. A session may have low goodput because of the end-user’s
Internet plan or access technology; insight into such limitations can be informative during the
development of Facebook’s applications and services. Likewise, goodput measurements provide
a powerful tool for temporal analysis of degradation and opportunities for performance-aware
routing, both of which may be actionable. For instance, if two routes to an end-user ISP have
similar MinRTT, but one route has a lower goodput (perhaps due to loss caused by backbone
congestion [108]), placing traffic on the route with higher goodput may yield better performance for
end-users (§5.3.5).
Our goodput metric estimates a session’s ability to support a given goodput while excluding
non-network factors related to sender behavior and the behavior of congestion control at the
start of a session. We differentiate between network and non-network factors by modeling TCP
and congestion control behavior under ideal network conditions, and assessing a session’s actual
performance relative to this ideal.
While we focus on TCP in this chapter, our approach is applicable to QUIC [214] as well.
Our model of congestion control behavior is compatible with loss and latency-based congestion
control algorithms that (i) exponentially grow the CWND when the transport is CWND limited until
congestion is detected, a phase known as initial slow-start [52], and (ii) do not pace, or pace
packets at a rate greater than or equal to CWND/RTT (footnotes 18, 20). Given these requirements, our approach is
compatible with all of the congestion control algorithms widely used on Linux, including Reno [52],
NewReno [188], Cubic [190] (the default on Linux), and BBR [73, 74] (footnote 19).
Our implementation of the approach described in this section is publicly available [422].
Footnote 19: BBR defines the initial phase of exponential growth as the startup phase instead of slow start [74], but the dynamics relevant to our model are not affected by this difference. Likewise, while BBR employs pacing, it is compatible with our second requirement as it paces traffic at a rate greater than CWND/RTT during the STARTUP phase and at an average of CWND/RTT during the PROBE_BW cycle. More specifically, during the PROBE_BW cycle BBR paces at 1.25x to probe for bandwidth, then at 0.75x to allow any queues to drain, and finally at 1x to cruise at the calculated send rate [266]. While BBR may limit goodput during the PROBE_RTT phase, this phase is only triggered if sender behavior does not periodically provide an opportunity to measure MinRTT; the BBR implementation notes that the idle timings of web applications often provide such opportunities [266] — matching what we observed in Section 5.2.3 — and thus we expect this phase to have little impact on our measurements.
5.3.2.2 Defining target goodput
Speedtests are often used to capture the maximum goodput that a connection between a client and
server can support, but this maximum does not correlate linearly with application performance
and user experience. A client that can achieve 100 Mbps goodput will rarely experience any
improvement in performance over a client with 10 Mbps goodput because neither connection is
likely to be saturated by the types of content served by Facebook (voice or video calls, live or
time-shifted streaming video, and photos) [420]. Furthermore, speedtests are intrusive and can have
negative impact on users (e.g., consuming mobile data and battery life), and thus are not suitable to
our setting (§5.2.2).
Instead of attempting to measure the maximum goodput a session can achieve, we define a
target goodput that suffices to provide good experience for Facebook’s services, and then design
our methodology to check whether sessions are able to achieve this goodput. Given the importance
of video on today’s Internet, we define our target goodput as 2.5 Mbps, the minimum required
to stream HD video [176]. We refer to this target goodput as HD goodput. Once HD goodput is
achieved, round-trip propagation delay (which determines the minimum user-perceived latency)
may be more important [40, 420]. Although we focus on HD goodput, our methodology is generic
and can work for any target goodput.
Because Facebook HTTP sessions are idle for most of their duration (§5.2.3), goodput at the
session level is not meaningful — we must calculate goodput at the transaction level. However,
even at the transaction level, capturing meaningful goodput measurements from small responses
is non-trivial compared to a speedtest. In particular, the goodput measured for a small response
may represent a lower bound on what the session is capable of supporting for multiple reasons,
including (i) the transfer may not be large enough to exercise the bandwidth-delay product, (ii) the
congestion control algorithm may still be in initial slow start, and thus the CWND was undersized
at the start of the transfer, and (iii) the effects of network transmission time, pacing applied by the
congestion control algorithm, and delayed ACKs, all of which can have a significant impact on
transfer time for small transfers. In the sections that follow, we describe how we determine if an
HTTP transaction is capable of testing for our target HD goodput (§5.3.2.3). Then, for the subset
of transactions capable of testing, we describe how we determine if the transaction achieved the
target HD goodput while correcting for transmission time and other aspects (§5.3.2.4). Finally, we
describe how we summarize these results into HDratio, a metric that reflects the ability of a given
HTTP session to deliver traffic at a rate sufficient to maintain HD goodput (§5.3.2.5).
5.3.2.3 Determining if a transaction tests for target goodput
In Facebook’s environment, a transaction’s response size is small relative to the amount of data
transferred in a speedtest (§5.2.3). The example scenario in Figure 5.4 illustrates how small
responses impact the design of our approach to measure goodput from the server’s perspective. In
this example, a client requests three objects in series via a single HTTP session that has a propagation
delay of 60ms (as estimated by MinRTT). We assume no pacing and ideal network conditions,
under which data and ACK packets traverse the network in constant time (e.g., propagation delay)
and thus are not subject to any transmission, queuing or MAC/link-layer delays. To keep things
simple, we set the TCP maximum segment size to 1500 bytes and set the initial congestion window
(CWND) to 10 packets. With no loss and fixed round-trip times, a loss or latency-based congestion
control algorithm will never exit initial slow start, and the CWND will grow exponentially when the
connection is CWND limited (footnote 20).
Footnote 20: The Linux kernel's TCP implementation grows the CWND when the connection was CWND limited in the last round-trip and is not in loss or recovery states. Growth is determined by the number of data packets ACKed, not the number of ACK packets received. A connection is CWND limited during slow start if it had more than half of the CWND in flight. After slow start, a connection is CWND limited if sending was blocked on CWND. The growth for partially-utilized CWNDs is difficult to model as it depends on precisely when acknowledgments are received, and thus we do not account for such growth in our model.
Figure 5.4: Sequence diagram for three back-to-back HTTP transactions over a single HTTP session. TXN1 starts with CWND = 10 and transfers 2 packets in 1 RTT (60 ms); TXN2 starts with CWND = 10 and transfers 10 + 14 packets over 2 RTTs (120 ms); TXN3 starts with CWND = 20 and transfers 14 packets in 1 RTT (60 ms).
We denote W_ideal as the expected CWND at the server at the start of each transaction's response.
W_ideal is initialized to the connection's initial congestion window (10 in our example). After each
response is complete, we model the expected CWND growth under the described ideal conditions
and update W_ideal to reflect the expected value.
Under the stated ideal conditions, goodput is limited by the response size relative to the bandwidth-
delay product or CWND growth under ideal network conditions (W_ideal), and we would observe the
following:
• Transaction 1: Goodput = 0.4 Mbps (2 packets / 60 ms).
Start: W_ideal = 10, greater than response size; 1 RTT to send all packets and receive ACK.
End: No CWND growth, so W_ideal = 10.
• Transaction 2: Goodput = 2.4 Mbps (24 packets / 120 ms).
Start: W_ideal = 10, less than response size; 2 RTTs to send all packets and receive ACKs.
End: CWND grows to at least 20, so W_ideal = 20.
• Transaction 3: Goodput = 2.8 Mbps (14 packets / 60 ms).
Start: W_ideal = 20, greater than response size; 1 RTT to send all packets and receive ACK.
End: W_ideal = 20.
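The per-transaction goodput figures above can be checked with a few lines of arithmetic (a sketch assuming, as in the example, 1500-byte packets and a 60 ms round-trip propagation delay):

    # Sanity check of the goodputs quoted for the example in Figure 5.4.
    MSS_BITS = 1500 * 8      # 1500-byte packets
    RTT_S = 0.060            # 60 ms round-trip propagation delay

    def goodput_mbps(packets, rtts):
        return packets * MSS_BITS / (rtts * RTT_S) / 1e6

    print(goodput_mbps(2, 1))    # transaction 1: 0.4 Mbps
    print(goodput_mbps(24, 2))   # transaction 2: 2.4 Mbps
    print(goodput_mbps(14, 1))   # transaction 3: 2.8 Mbps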
We make three observations from this example:
1. Goodput can be calculated per RTT. When delivery of a transaction's response triggers
CWND growth under ideal circumstances (i.e., W_ideal grows during the response), goodput for
a single RTT can be greater than the transaction's overall goodput. For instance, transaction 2
sends 10 packets in its first RTT, yielding a goodput of 2 Mbps, and 14 packets in its second
RTT, yielding a goodput of 2.8 Mbps.
2. Goodput per RTT depends on prior transactions. The CWND at the start of a response
restricts the goodput a transaction can achieve and depends on previous transactions. For in-
stance, transaction 3 is able to transfer 14 packets in its first and only RTT because transaction
2 grew the CWND to 20. In comparison, transaction 1 did not grow the CWND, so transaction
2 could only transfer 10 packets in its first RTT.
3. The highest per-RTT goodput under ideal network conditions is the highest goodput a
transaction can test for. This type of ideal-case analysis determines the maximum achievable
goodput in an RTT for a given W_ideal and response size, providing an upper bound on the
highest goodput the response can exercise under real network conditions. From the example,
we can observe that the maximum testable goodput for a given response is the maximum
number of bytes delivered in a single round-trip:
• Transaction 1 can test if 0.4Mbps goodput can be achieved.
• Transaction 2 (because of its second RTT) and transaction 3 can test whether a goodput
of 2.8Mbps can be achieved.
For a given transaction, the maximum testable goodput occurs either on the (idealized) last
round-trip or the penultimate round-trip if the last round-trip has fewer bytes to send.
Formalizing our intuition. We can determine the maximum goodput that each transaction can
test for, denoted G_testable, by maintaining a model of CWND growth under ideal conditions. At the
start of the session, W_ideal is initialized to the underlying connection's initial CWND in bytes. W_ideal is
updated at the end of each transaction to reflect expected CWND growth.
We define m as the number of round-trips required to transfer a response of B_total bytes if the CWND
size (in bytes) when the first byte of the transaction is sent is W_ideal:

m = \lceil \log_2 ( B_{total} / W_{ideal} + 1 ) \rceil    (5.1)

We define W_ideal(n) as the expected size of the CWND at the start of the n-th round-trip assuming ideal
network conditions:

W_{ideal}(n) = 2^{n-1} \cdot W_{ideal}    (5.2)

A transaction's maximum testable goodput is the maximum number of bytes transferred over each
of the last two round-trips assuming ideal network conditions, divided by the propagation delay:

G_{testable} := \begin{cases}
B_{total} / MinRTT & \text{if } m = 1 \\
\max\{ W_{ideal}(m-1),\; B_{total} - \sum_{i=1}^{m-1} W_{ideal}(i) \} / MinRTT & \text{if } m > 1
\end{cases}    (5.3)

In Figure 5.4, transaction 2 has m = 2, W_ideal(2) = 20, and G_testable = 2.8 Mbps.
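Equations 5.1-5.3 translate directly into code. The sketch below is a simplified Python rendering (working in packets rather than bytes and omitting the W_NIC adjustment described next), not the publicly released implementation [422]:

    import math

    def g_testable_mbps(b_total_pkts, w_ideal_pkts, min_rtt_s, mss_bytes=1500):
        """Maximum goodput (Mbps) a response can test for under ideal CWND growth."""
        # Eq. 5.1: round-trips needed under ideal exponential CWND growth.
        m = math.ceil(math.log2(b_total_pkts / w_ideal_pkts + 1))
        # Eq. 5.2: ideal CWND at the start of round-trip n.
        w_ideal = lambda n: (2 ** (n - 1)) * w_ideal_pkts
        if m == 1:
            pkts_in_best_rtt = b_total_pkts
        else:
            sent_before_last = sum(w_ideal(i) for i in range(1, m))
            # Eq. 5.3: the larger of the penultimate and final round-trips.
            pkts_in_best_rtt = max(w_ideal(m - 1), b_total_pkts - sent_before_last)
        return pkts_in_best_rtt * mss_bytes * 8 / min_rtt_s / 1e6

    # Transaction 2 from Figure 5.4: 24 packets, W_ideal = 10, MinRTT = 60 ms.
    print(g_testable_mbps(24, 10, 0.060))   # -> 2.8 Mbps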
Thus, if a transaction's G_testable is greater than or equal to HD goodput, then the transaction is
capable of testing for HD goodput. In Section 5.3.2.4 we discuss how we determine if a transaction
capable of testing for HD goodput was able to achieve it, which depends on actual network conditions.
Why G_testable is calculated using ideal CWND. For a transaction to test for HD goodput there
must be a period during which (B_inflight / MinRTT) > HD goodput; as illustrated in Figure 5.4, this
is contingent on the transaction's response size (B_total), the response size of previous transactions
and their impact on CWND growth, and MinRTT.
G_testable intentionally does not reflect the impact of actual network conditions on CWND growth.
To understand why, consider how the session illustrated in Figure 5.4 may behave in production.
Under poor network conditions, the CWND at the start of the third transaction may be 1 (instead of 20,
as predicted by W_ideal) because timeouts during the preceding transactions caused the actual CWND
to be reduced. If we (incorrectly) considered W_ideal = 1, then G_testable = 1.4 Mbps (7 packets / 60 ms)
and we would infer that the third transaction cannot test for HD goodput. This is problematic as
we only learn about a network's ability to support HD goodput when we are able to test for HD
goodput. Worse, saying that the third transaction cannot test for HD goodput would incorrectly
ignore strong evidence of poor performance that indicates that the network cannot support HD
goodput. Transactions in a session being unable to test for HD goodput is in itself not a signal of
network conditions; it simply reflects that the session transferred small objects that could not achieve
HD goodput due to sender and congestion control behavior.
To avoid this problem, we always calculate G_testable by setting W_ideal for each transaction
assuming CWND growth under ideal circumstances. Then, the transactions identified as capable of
testing HD goodput are the points in the session when, if a session has good performance, it will (a)
have a CWND capable of supporting HD goodput and (b) be delivering a large enough response to
achieve HD goodput. We define W_NIC as the CWND measured when a transaction's first response byte
is written to the NIC. For the first transaction, W_ideal is equal to W_NIC. For all subsequent transactions,
we define W_ideal as the maximum between W_NIC and the ideal CWND at the end of the previous
transaction (estimated as W_ideal(m), where m is the number of round-trips in the previous transaction
under ideal network conditions) (footnote 21).
Footnote 21: W_ideal(m) provides a lower bound on the ideal W_ideal of the next transaction because it ignores any growth of the CWND during the last round-trip (footnote 20). Taking the maximum between W_NIC and W_ideal(m) allows us to increase the maximum testable goodput when W_NIC is greater than the modeled W_ideal(m).
5.3.2.4 Measuring if a transaction achieved a testable goodput
The previous section described how to determine G_testable, the maximum goodput that each transac-
tion can test for under ideal network conditions. This section describes how we determine whether a
given transaction identified as being capable of testing for a given goodput achieved that goodput.
Making this determination requires accounting for the impact of real network conditions (e.g., packet
loss, queuing delays, and transmission time) and extrapolating behavior about goodput for one RTT
to the entire transaction.
For each transaction, we denote T_total as the time elapsed between the server's NIC transmitting
the first byte of the response and the server's TCP implementation processing a cumulative ACK
that covers the last byte of the response (we use socket timestamps [239] to capture these details).
Thus, T_total captures the time it took to successfully deliver the response to the client, including
propagation delay, transmission time, queuing delay, delays incurred due to pacing by the congestion
controller, retransmissions made necessary due to packet loss, and the time required to receive the
last ACK from the client.
We present our solution in two parts. First, we consider a simplified approach that can be used
to determine if a transaction was able to achieve G_testable or a lower G without explicitly accounting
for CWND growth that may have occurred during the transaction. We then extend our solution to
consider CWND growth to enable us to detect if a short transaction was able to achieve and sustain
G_testable or a lower G.
Determining the goodput a transaction achieved without considering CWND growth. All
responses will traverse a bottleneck link that will shape packets due to transmission delays and
may additionally delay packet transmissions as determined by a shaper or policer (footnote 15). Consider again
transaction 3 in the session illustrated in Figure 5.4, but this time assume there is a link with 3 Mbps
bottleneck bandwidth between Facebook and the client (footnote 22). Under ideal conditions, transmission
times at the bottleneck link will add ~55 ms to transaction 3's transfer time, increasing its T_total
to ~115 ms (footnote 23). If we calculate the achieved goodput as B_total / T_total, we will incorrectly infer that
transaction 3 only achieved a goodput of 1.46 Mbps (14 packets / 115 ms).
We do not know the capacity of the bottleneck link and therefore cannot correct for this directly.
Instead, we estimate the rate at which the response arrived at the client, denoted R. First, we
remove our estimate of propagation delay from T_total (i.e., T_total - MinRTT), thereby removing our
estimate of the minimum amount of time for a packet from the server to reach the client and for the
corresponding ACK from the client to arrive back at the server; the remaining duration contains
all other forms of delay. We then estimate R as B_total / (T_total - MinRTT). We can compare R to
G_testable to determine if G_testable was achieved. If R ≥ G_testable, then we can mark the transaction as
having achieved G_testable. For instance, if R is greater than HD goodput for a transaction capable of
testing for HD goodput, we can conclude that HD goodput was achieved.
Footnote 22: While we use bottleneck bandwidth in our examples throughout this section, the examples and approach hold if the link has a bottleneck bandwidth higher than 3 Mbps but is provisioned to operate at a rate of 3 Mbps through the use of a shaper, given that the shaper will cause the transfer to incur the same amount of delay. See footnote 15 for more details.
Footnote 23: MinRTT captures (at a minimum) the transmission time for TCP headers, but we also assume that MinRTT is for a smaller packet (potentially for a SYN/ACK packet), so we only consider the impact of transmission time at the bottleneck link on the payload. Even if the MinRTT does include transmission time for a non-minimal sized packet, we expect the error incurred to be minor given that transmission time will be small for common link speeds, and because goodput measurements are performed using transmissions that span multiple packets and therefore incur transmission time for multiple packets; this in turn minimizes the potential impact of MinRTT including the transmission time of a single packet.
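A minimal sketch of the simplified check described above, assuming T_total, B_total, MinRTT, and G_testable have already been captured for a transaction (2.5 Mbps is HD goodput; units are noted in the code):

    HD_GOODPUT_BPS = 2_500_000   # 2.5 Mbps target (HD goodput)

    def arrival_rate_bps(b_total_bytes, t_total_s, min_rtt_s):
        """Estimate R, the rate at which the response arrived at the client,
        after removing the round-trip propagation delay estimate (MinRTT)."""
        return b_total_bytes * 8 / (t_total_s - min_rtt_s)

    def tested_and_achieved_hd(b_total_bytes, t_total_s, min_rtt_s, g_testable_bps):
        """None if the transaction cannot test for HD goodput; otherwise whether
        the estimated arrival rate R reached HD goodput."""
        if g_testable_bps < HD_GOODPUT_BPS:
            return None
        return arrival_rate_bps(b_total_bytes, t_total_s, min_rtt_s) >= HD_GOODPUT_BPS

    # Transaction 3 behind a 3 Mbps bottleneck: 14 packets in ~115 ms, MinRTT = 60 ms.
    print(arrival_rate_bps(14 * 1500, 0.115, 0.060))   # ~3.05 Mbps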
Extending our approach to consider CWND growth. Recall that G_testable is the maximum good-
put that could be achieved on the last two round-trips assuming ideal CWND growth. For instance,
G_testable will be 2.8 Mbps for transaction 2 in the session illustrated in Figure 5.4. However, R may
be less than G_testable even if the network behaved ideally because R is calculated over the entire
response and thus represents an average of the rate at which the response arrived. Because the
goodput achieved on transaction 2's first RTT was 2 Mbps, the goodput achieved overall (under
ideal circumstances) would be only 2.4 Mbps.
To determine if a transaction was able to achieve and sustain G_testable through CWND growth,
we compare T_total against T_model(G), the expected value of T_total if CWND growth enables goodput G
to be achieved and sustained during delivery of the response. We model the entire response time
instead of trying to measure over a single round-trip because we want to look at as much of the
transaction's behavior as possible when measuring network conditions.
T_model(G) is determined by a model that consumes four inputs: (i) a goodput G, (ii) the actual
CWND at the start of the response (W_NIC), (iii) the MinRTT, and (iv) the response size. While the
model used in Section 5.3.2.3 to determine G_testable relied on W_ideal (the size of the CWND under
ideal conditions), the model in this section relies on W_NIC to determine the CWND growth required to
achieve goodput G given actual conditions at the start of the response.
Figure 5.5 illustrates our model of a response achieving goodput G after CWND growth. For
the given G, the model mimics the exponential growth of slow start, doubling the CWND (starting
from W_NIC) each round-trip (footnote 24). The model exits exponential growth when doubling the CWND again
would cause the CWND to grow beyond what is necessary to support goodput G. After exiting
exponential growth, the model assumes that the transaction sustains goodput G from then on — i.e.,
CWND/RTT = G.
Figure 5.5: Model of a transaction for which the CWND starts at W_NIC and exponentially grows (per RTT) to be equal to G × RTT, after which goodput G is sustained for the remainder of the transfer. In the illustrated example, n = 4 because exponential growth of the CWND ends after the fourth RTT.
If the model exits exponential growth after n round-trips, its transfer time is given by (i) the
round-trip propagation delay incurred during the n round-trips in slow start, (ii) the transmission
time of n full-size packets (footnote 25), (iii) plus the transmission time of the remaining bytes, (iv) plus one
round-trip for receiving the acknowledgment of the last packet (footnote 26):

T_{model}(G) = n \cdot \left( MinRTT + \frac{B_{MSS}}{G} \right) + \frac{B_{total} - \sum_{i=0}^{n-1} 2^{i} \cdot W_{NIC}}{G} + MinRTT    (5.4)

where B_MSS is the TCP maximum segment size for the session.
We estimate the goodput achieved and sustained for a response as the largest G for which
G ≤ G_testable and T_total ≤ T_model(G). If a transaction is capable of testing for HD goodput and
T_total ≤ T_model(HD goodput), then we can assert that the transaction achieved HD goodput because it
completed at least as fast as a session with a 2.5 Mbps bottleneck link would.
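The model and decision rule can be sketched as follows (a simplified Python rendering of Equation 5.4 under the stated assumptions about exponential CWND growth; the loop structure and edge-case handling are simplifications, and this is not the released implementation [422]):

    def t_model_s(g_bps, w_nic_bytes, min_rtt_s, b_total_bytes, mss_bytes=1500):
        """Sketch of Eq. 5.4: expected T_total if the CWND doubles each round-trip,
        starting from W_NIC, until it is large enough to sustain goodput G, after
        which G is sustained for the remainder of the response."""
        target_cwnd_bytes = g_bps * min_rtt_s / 8.0    # CWND needed to sustain G
        n, cwnd = 1, float(w_nic_bytes)
        sent_in_growth = cwnd
        # Keep doubling while another doubling would not overshoot G * RTT and the
        # response has not already been delivered during exponential growth.
        while cwnd * 2 <= target_cwnd_bytes and sent_in_growth < b_total_bytes:
            cwnd *= 2
            sent_in_growth += cwnd
            n += 1
        remaining_bits = max(b_total_bytes - sent_in_growth, 0.0) * 8
        return (n * (min_rtt_s + mss_bytes * 8 / g_bps)   # terms (i) and (ii)
                + remaining_bits / g_bps                  # term (iii)
                + min_rtt_s)                              # term (iv): last ACK

    def achieved_g(g_bps, g_testable_bps, t_total_s, w_nic_bytes, min_rtt_s, b_total_bytes):
        """True if the transaction could test for goodput G and completed at least
        as fast as the model predicts for a connection sustaining G."""
        return (g_bps <= g_testable_bps and
                t_total_s <= t_model_s(g_bps, w_nic_bytes, min_rtt_s, b_total_bytes))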
While the model is designed around cases where CWND growth enables a transaction to achieve
and sustain a higher goodput, the approach remains correct even if the connection's CWND does
not grow exponentially, or even if it shrinks. Under such conditions, the estimate of the goodput
achieved and sustained for a transaction, i.e., the largest G for which T_total ≤ T_model(G), will reflect
the average goodput achieved during delivery of the entire response.
Footnote 24: Because a transaction's testable goodput is the maximum goodput that could be achieved on the last two round-trips assuming exponential CWND growth and ideal network conditions, the model must also assume exponential growth when determining what the value of T_model(G) would be for a transaction that achieved goodput G. If it is not possible to grow the CWND exponentially from W_NIC to the value required for goodput G, then we can conclude that the transaction could not have achieved G given actual network conditions.
Footnote 25: This is a small correction that accounts for the added delay between consecutive congestion windows while in exponential growth due to the transmission time of a data packet at the bottleneck link. In other words, the round-trip time plus the transmission delay of one packet add up to the total time to receive the ACK for the first packet transmitted in each CWND, which triggers the beginning of the transmission of the next CWND. This term did not appear in the corresponding equation published in an earlier version of this work [377], but was included in the model implementation and thus was incorporated into the results included in that publication. The results in this chapter and in the earlier publication are the same.
Footnote 26: If there is no queuing at the bottleneck link (e.g., because of pacing induced by ACK-clocking or explicit pacing by the congestion control algorithm), it takes one MinRTT to receive the last ACK. If there is queuing at the bottleneck link (e.g., when transmitting a burst of packets in the initial congestion window, or because the congestion control algorithm paced at a rate greater than the bottleneck link's bandwidth), it takes one MinRTT plus the transmission time of the previous packets over the bottleneck link to receive the ACK of the last packet. Given that transmission time is accounted for separately by item (iii), we add only one MinRTT to receive the ACK of the last packet in both cases.
Validation using simulation. We validated that our goodput estimation technique accurately
approximates bottleneck bandwidth under ideal network conditions using simulations in ns3 (footnote 27).
We simulated transfers through 15,840 configurations, varying bottleneck bandwidth G_bottleneck
(0.5–5 Mbps), round-trip propagation delays (20–200 ms), initial CWND sizes (1–50 packets), and
transfer sizes (1–500 packets). We identify configurations whose transfers can test for the bottleneck
rate by checking if G_testable > G_bottleneck. For these configurations, the goodput G inferred by our
technique never overestimates the bottleneck rate, and usually only underestimates slightly: the
99th percentile of the distribution of the relative error (G_bottleneck - G) / G_bottleneck is 0.066. Under
realistic network conditions (e.g., including cross traffic, losses and jitter), the estimated goodput
G could be lower (never overestimating G_bottleneck) due to the reduced available bandwidth at the
bottleneck resulting in queuing and/or packet loss, and inefficiencies in the congestion control
algorithm. Thus, the estimate captures how fast data was delivered to the destination and allows us
to evaluate if a network can support a given goodput.
Footnote 27: ns3's TCP implementation grows the CWND by the number of ACKs received (instead of data packets ACKed). We disabled delayed ACKs to force the simulator to better match CWND growth in the Linux kernel's TCP implementation (described in more detail in footnote 20).
5.3.2.5 Defining a session's HDratio
We use the approach above to compute HDratio, a metric that captures the ability of an HTTP
session's underlying connection to sustain HD goodput. Our approach can estimate whether a
transaction can test for and deliver traffic at any rate G, but in HDratio we set G to HD goodput (i.e.,
G = 2.5 Mbps). We define HDratio for each HTTP session as the ratio between the number of its
transactions that tested for and achieved HD goodput and the number of its transactions that tested
for HD goodput (i.e., transactions with G_testable < 2.5 Mbps are ignored).
When we group sessions by network and other dimensions (§5.3.4), we compute HDratio for
each HTTP session in the group and then aggregate HDratio to give each session equal weight
in the resulting aggregation. The alternative approach — calculating HDratio as the ratio of all
transactions that tested for and achieved HD goodput and the number of transactions that tested
for HD goodput — would cause sessions with more tests for HD goodput to have greater weight
in the resulting aggregation. From our experience, the latter approach can result in a handful of
sessions defining the HDratio for an aggregation. If one of those sessions has abnormal performance
(e.g., very good or bad relative to other sessions in the same group, perhaps due to access-link
conditions specific to a single user) it will skew results for the entire group, making the HDratio
for the group unrepresentative. For instance, we find episodic temporal events (§5.3.5.2) are more
common with the latter approach, likely due to the aforementioned effects of abnormal connections.
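A minimal sketch of the per-session computation and the equal-weight aggregation (the data layout is hypothetical; each session is represented as a list of (G_testable, achieved-HD) transaction results):

    HD_GOODPUT_BPS = 2_500_000

    def session_hdratio(transactions):
        """HDratio for one HTTP session: fraction of HD-capable transactions that
        achieved HD goodput. Returns None if no transaction could test for it."""
        tests = [achieved for g_testable_bps, achieved in transactions
                 if g_testable_bps >= HD_GOODPUT_BPS]
        return sum(tests) / len(tests) if tests else None

    def aggregate_hdratio(sessions):
        """Average of per-session HDratios, so every session carries equal weight
        (rather than weighting sessions by how many HD-capable tests they ran)."""
        ratios = [r for r in (session_hdratio(t) for t in sessions) if r is not None]
        return sum(ratios) / len(ratios) if ratios else None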
5.3.2.6 Other considerations
Accounting for delayed ACKs. On some operating systems, TCP delays sending an ACK until
it has two packets to ACK or until an implementation-dependent timeout (30ms+ for Linux). If
the client's operating system delays the ACK for the last byte in a transaction's response, it may
inflate T_total (significantly for small responses) and lead to underestimation of the achieved goodput.
We avoid this by ignoring the last data packet (and its ACK). Instead, T_total is the interval between
the instant when the first byte of the response is written to the NIC and the instant when an ACK
covering the second-to-last packet of the transaction is received by the NIC, and B_total is the total
amount of bytes transferred minus the number of bytes in the last packet (footnote 28).
Accounting for HTTP/2 preemption and multiplexing. HTTP/2 preempts and multiplexes
transactions based on the transactions' priorities [39]. Sending of a transaction response is preempted
(paused) if a higher priority transaction is ready to send, and the HTTP/2 send window is multiplexed
when transactions have equal priority. When preemption or multiplexing occurs, a transaction's
T_total may include time spent transferring other transactions' bytes, but because those bytes will
not be included in B_total we will underestimate the achieved goodput at the transport layer. We
want to focus on the goodput that the underlying network is capable of supporting — doing so is
necessary to minimize the effects of changes in application and user behavior on our conclusions
about network conditions, especially temporal analysis — and so we coalesce transactions together
into a single larger transaction when their responses are multiplexed or preempted. In addition, we
coalesce transactions when their responses are written back-to-back with no gap at the transport
layer to enable a sequence of small responses to be considered as a single large response (footnote 29).
Accounting for bytes in flight. Our approach to calculating goodput assumes that no response bytes are in flight when a new transaction's first response byte is sent. To preserve this requirement, a transaction is ineligible to be used for goodput measurements if a previous transaction's response was still in flight (e.g., its last byte not yet ACKed) when the first byte of its response was sent and the conditions for coalescing were not met.
Footnote 28: If the ACK for the second-to-last packet is delayed, it will be sent when the timeout expires or when the last packet arrives at the receiver. Thus, the delay for the second-to-last ACK is no larger than the guaranteed timeout incurred for delayed last ACKs. This approach performs no worse than waiting for the last ACK, and is more accurate in the common case of the last two packets being transmitted close in time.
Footnote 29: If two responses are written in series and the last byte of the first response has not been transmitted by the NIC before the first byte of the second response is written to the send buffer, then there is no gap at the transport layer.
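The sketch below ties the coalescing and eligibility rules together. It is a simplified illustration with hypothetical response records, not the production bookkeeping: multiplexed, preempted, or back-to-back responses are merged into one larger logical transaction, and a response that started while a previous, non-coalescable response was still in flight is dropped from goodput measurement.

```python
# Simplified sketch of the coalescing/eligibility rules; field names are illustrative.
from dataclasses import dataclass
from typing import List

@dataclass
class Response:
    write_start: float           # first byte written to the send buffer
    nic_last_byte: float         # last byte transmitted by the NIC
    last_byte_acked: float       # last byte cumulatively ACKed
    nbytes: int
    multiplexed_with_prev: bool  # overlapped with the previous response (HTTP/2)

def coalesce_and_filter(responses: List[Response]) -> List[Response]:
    """Merge back-to-back or multiplexed responses; drop responses that started
    while a previous (non-coalescable) response was still in flight."""
    eligible: List[Response] = []
    for r in sorted(responses, key=lambda x: x.write_start):
        if not eligible:
            eligible.append(r)
            continue
        prev = eligible[-1]
        # No gap at the transport layer: the previous response's last byte had not
        # left the NIC when this response was written to the send buffer.
        back_to_back = r.write_start <= prev.nic_last_byte
        if r.multiplexed_with_prev or back_to_back:
            prev.nic_last_byte = max(prev.nic_last_byte, r.nic_last_byte)
            prev.last_byte_acked = max(prev.last_byte_acked, r.last_byte_acked)
            prev.nbytes += r.nbytes
        elif r.write_start < prev.last_byte_acked:
            continue  # previous response still in flight and not coalescable: ineligible
        else:
            eligible.append(r)
    return eligible
```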
5.3.2.7 Limitations of approach
Measurements are sensitive to minor variations in delay. A typical speedtest evaluates the goodput a connection can support by sending data as quickly as possible (footnote 30) for an extended period of time — such tests may run for ten seconds or longer. In comparison, our measurement methodology evaluates what a session can support by measuring how long it takes to transfer HTTP transactions. Given that these transactions often have small responses (§5.2.3), they will often traverse the network in less than a second.
In comparison to a speedtest, our approach allows us to measure network conditions without requiring the use of synthetic traffic. Additionally, by using all goodput samples, we avoid biases that would potentially be introduced by filtering transactions based on response size (i.e., clients with slower connections may transfer smaller objects, and considering only large transfers may remove samples from those clients entirely). However, compared to a traditional speedtest, our approach is significantly more sensitive to variations in T_total (e.g., queuing delay and MAC/link-layer delays), particularly for sessions with low propagation delay.
As an example, consider a client connected to the Internet via a cable modem with a provisioned downlink rate of 100Mbps. DOCSIS — the media access protocol used for cable Internet — requires cable modems to request access to the channel before they can send; it can take up to 5 milliseconds for a cable modem to receive the requisite "grant" from the Cable Modem Termination System [107, 457] (footnote 31).
Footnote 30: Typically bounded by the congestion control algorithm in use at the sender.
Now, imagine if this same client had a connection to Facebook with a propagation delay of 1 millisecond and for which the first transaction has a 3000-byte response. For this transfer, if W_ideal = 10 packets (the initial congestion window), then G_testable = 24Mbps. However, even if the response arrives at the client at 24Mbps, if DOCSIS delays prevent delivery of the ACK from the client to the server by 2 milliseconds or more, our methodology will estimate the achieved goodput to be 12Mbps or less. These types of problems are less likely to be an issue for a typical speedtest because a speedtest transfers more data over a longer period of time and triggers pipelining, both of which make it unlikely that such delays play a significant role in the measured transfer time.
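To spell out the arithmetic in this example, under the stated assumptions (a 3000-byte response that fits within the initial congestion window, 1 ms of propagation delay, 2 ms of added ACK delay, and achieved goodput estimated by subtracting MinRTT from the transfer duration as described earlier in this section), a rough worked calculation is:

\[ G_{\text{testable}} = \frac{3000\ \text{bytes} \times 8}{\text{MinRTT}} = \frac{24{,}000\ \text{bits}}{1\ \text{ms}} = 24\ \text{Mbps} \]
\[ R = \frac{B_{\text{total}} \times 8}{T_{\text{total}} - \text{MinRTT}} \approx \frac{24{,}000\ \text{bits}}{(1\ \text{ms} + 2\ \text{ms}) - 1\ \text{ms}} = 12\ \text{Mbps} \]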
Given that such delays are common in DOCSIS and other MAC protocols used in multi-access networks [457] and can have significant impact on the achieved goodput calculated for small transfers — while having little impact on actual user experience — it may be better to add a tolerance factor such that if a measurement fails to achieve G_testable by a small margin (a few milliseconds), the measurement is ignored instead of being considered as an indicator of a session's ability to support G_testable.
Measurements are weighted equally regardless of bytes transferred. Our methodology currently weights responses that test for a given G equally, even though transactions that transfer more bytes have more opportunity to observe network conditions and, due to pipelining and a longer transfer time, are less sensitive to minor variations in delay and packet loss. In the future, we plan to consider response size as well when aggregating samples.
Footnote 31: DOCSIS 1.1 onward also allows "Unsolicited Grant Synchronization", in which a cable modem can send without requesting a grant, but to our knowledge this is typically only used by providers for telephony services [107].
5.3.2.8 Alternative approaches considered
In this section, we discuss possible modifications to our measurement methodology and how they
compare with our current approach. We discuss other techniques for evaluating a session's ability to support a given goodput in Section 7.4.
Using the duration between receipt of the first and last ACK when calculating R. As an alternative to removing propagation delay by subtracting MinRTT from the transfer duration T_total, we could instead measure the time between the server's receipt of cumulative ACKs covering the first and last body bytes — i.e., calculating R as B_total / (T_lastByteACK − T_firstByteACK). However, we avoided this approach because it may overestimate R in a number of circumstances, including in the case of transport and link layer retransmissions, and in scenarios where packets become bunched together after traversing the bottleneck link due to MAC delays. For instance, if the spacing between the first and last data packets (from sender to receiver) is reduced due to 802.11 frame aggregation [179] or link-layer retransmissions, then the response may arrive at the destination at a rate that is faster than the bottleneck link, and thus is unsustainable. Packet loss at the transport layer can also create problems: if the first packet is lost, then the duration T_lastByteACK − T_firstByteACK may decrease because an ACK for the first byte will be delayed until the retransmission arrives, causing achieved goodput to be overestimated. ACK suppression — a technique commonly employed on asymmetric access links [1] — and ACK compression — a scenario in which the spacing between ACKs is (inadvertently) reduced to improve return path efficiency [258, 291] — can similarly cause the duration T_lastByteACK − T_firstByteACK to no longer be representative and lead to overestimates of achieved goodput (footnote 32).
Considering R_testable instead of G_testable. G_testable considers propagation delay and thus is limited by a transaction's response size relative to the bandwidth-delay product. In contrast, we remove our estimate of propagation delay when calculating the response arrival rate R to determine if a goodput tested for was achieved. As a result, it is possible for R to exceed G_testable if the congestion control algorithm does not pace data being written to the network, or if it paces at a rate greater than G_testable. For instance, if W_NIC is greater than a transaction's B_total and the B_total is written to the network at a rate greater than or equal to B_total / RTT — for instance, if the congestion control paces writes to the network at a rate of W_NIC / RTT — then G_testable is a conservative estimate of the maximum testable arrival rate and R may be greater than G_testable.
We define R_testable as an alternative to G_testable. To ensure that a transaction marked as being capable of testing for R can achieve R, we rely on our requirement that the congestion control algorithm paces bytes into the network at a rate greater than or equal to CWND / RTT or does not pace at all (§5.3.2.1 and footnote 18). Given this requirement, R_testable can be defined as the minimum rate that data would be written to the network by the congestion control algorithm under ideal network conditions during the final round-trip m:

\[ R_{\text{testable}} = \frac{W_{\text{ideal}}(m)}{\text{MinRTT}} \tag{5.5} \]

where W_ideal(m) is the expected CWND size at the start of the final round-trip.
Footnote 32: For instance, the DOCSIS standard requires cable modems to request a "grant" to send data upstream; ACKs will be queued until this grant arrives [107, 457]. During the grant period, a cable modem will typically attempt to send as much data as possible through concatenation [107, 258, 290, 291]. As a result, the spacing between individual ACKs — and more specifically, between the first and last ACK, particularly for small transfers — may not be representative of the rate at which data arrived at the client [291].
Using R_testable instead of G_testable would enable smaller responses to test for higher rates. However, we do not make use of R_testable in this work, in part because R_testable is more sensitive to variations in delay caused by MAC/link-layer conditions. As a result, there may be cases during which we could have tested for HD goodput but did not do so; we do not believe this has any impact on our results given that the absence of tests can only impact our results by reducing the number of samples available in our dataset and does not bias our samples. We plan to experiment with R_testable in future work in combination with other changes to reduce the model's sensitivity to minor variations in latency.
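For illustration, the sketch below contrasts G_testable and R_testable for a transaction, assuming an initial window of 10 MSS-sized segments and ideal slow-start doubling each round trip. This is a deliberately simplified stand-in for the W_ideal model described earlier in the chapter, not the production estimator.

```python
# Hedged sketch contrasting G_testable and R_testable; assumes an initial window of
# 10 MSS-sized segments and ideal slow-start doubling per RTT (a simplification).
MSS = 1460           # bytes (assumed)
INIT_CWND_PKTS = 10  # packets (assumed initial congestion window)

def ideal_schedule(response_bytes: int):
    """Return (round_trips, cwnd_bytes_in_final_round_trip) under ideal growth."""
    assert response_bytes > 0
    cwnd = INIT_CWND_PKTS * MSS
    sent, rtts, last_cwnd = 0, 0, cwnd
    while sent < response_bytes:
        rtts += 1
        last_cwnd = cwnd
        sent += cwnd
        cwnd *= 2  # ideal slow-start doubling
    return rtts, last_cwnd

def g_testable_mbps(response_bytes: int, min_rtt_ms: float) -> float:
    """Simplified G_testable: response bytes over the ideal number of round trips
    of propagation delay (matches the 3000-byte / 1 ms / 24 Mbps example above)."""
    rtts, _ = ideal_schedule(response_bytes)
    return response_bytes * 8 / (rtts * min_rtt_ms * 1000)

def r_testable_mbps(response_bytes: int, min_rtt_ms: float) -> float:
    """Equation 5.5: W_ideal(m) / MinRTT for the final round trip m."""
    _, w_ideal_final = ideal_schedule(response_bytes)
    return w_ideal_final * 8 / (min_rtt_ms * 1000)

# Example: a 3000-byte response with MinRTT = 1 ms can test for 24 Mbps under
# G_testable, but for a much higher rate under R_testable (the full initial window).
print(g_testable_mbps(3000, 1.0), r_testable_mbps(3000, 1.0))
```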
5.3.3 Other metrics considered
In this section, we discuss other metrics commonly used to characterize network conditions, and
why we do not use them in this work.
5.3.3.1 Smoothed round-trip time
TCP’s retransmission timeout (RTO) determines how long TCP waits before declaring a packet lost
and retransmitting it. The RTO is derived from a metric called the smoothed RTT (sRTT), which is
itself derived from transport RTT measurements. While we capture sRTT in production, we do not
use it in our analysis in this chapter (except in Section 5.6.1.2) because our goodput and propagation
delay metrics already capture the relevant network conditions.
As discussed in Section 5.3.1.2, we expect variations in transport RTT to be caused by last
mile/customer premise network conditions, including queuing delays caused by on-path buffers
at the access link becoming bloated due to (i) queuing caused by cross traffic or (ii) self-induced,
non-standing congestion caused by the dynamics of congestion control algorithms [156, 414],
and link/MAC layer delays caused by wireless/cellular signal quality issues triggering link-layer
retransmissions. For instance, loss and latency-based congestion control algorithms may induce
queuing when probing for available bandwidth, and often must induce queuing in the network to
drive a connection to the highest possible goodput [73, 74, 75, 156, 190, 215, 321, 414]. Thus, we
expect sRTT will often be higher than MinRTT, especially for clients with slower access links, as
this increases the potential for queuing for the same workload. The alternative would be for the
congestion control algorithm to attempt to avoid queuing outright, but such behavior would lead to
bottleneck starvation, which in turn would reduce goodput and overall transport efficiency [73, 74,
75, 321].
Variations in transport RTT measurements are unlikely to be caused by backbone conditions:
prior work has found that traffic arrival rates in backbone networks at small time scales are smooth
because small variations in round-trip time and processing time de-synchronize the large number of
flows, and flow transfer rates are slow relative to backbone link capacity [143, 339]. In addition,
congestion in the backbone is likely to result in a persistent standing queue [108] that causes
MinRTT and our estimate of propagation delay to increase, while also causing loss that decreases
goodput.
We conclude that it is difficult to reason about sRTT directly given sRTT is in part dependent
on the congestion control algorithm’s behavior and session workload. Thus, instead of trying to
reason about sRTT directly, we use our goodput metric to characterize a session’s ability to deliver
bytes in a reliable manner at a given rate for the estimated propagation delay. This approach still
captures the impact of cross traffic, RF conditions, and other factors that can lead to changes in RTT
measurements — but it enables us to focus on how such variations impact our ability to get data to a
client in a timely manner.
5.3.3.2 Retransmissions and packet loss
Similar to sRTT, we find that retransmissions are difficult to interpret in isolation. First, a retransmission does not necessarily indicate that a packet has been lost — TCP retransmits a packet when
it believes the packet may have been lost, or to guard against packet loss significantly extending
transfer time (e.g., tail loss probes [117]). Traditionally, TCP was unable to tell after retransmission
whether the originally transmitted packet was lost, and thus could not accurately discern whether a
retransmission was necessary or spurious. While TCP mechanisms such as DUPACK and timestamps enable implementations to identify spurious retransmissions, the TCP implementation in the Linux Kernel does not fully expose this information.
Second, even if a retransmission is not spurious, the fact that a packet was lost is not in itself a
signal of poor network conditions. A congestion control algorithm must probe for bandwidth, and
such probing naturally creates risk for loss [73, 74, 190, 215, 414]. The design of such probing and
steady state behavior of a congestion control algorithm must carefully balance risk and reward: it
may be possible to reduce or outright eliminate retransmissions by being less aggressive, but such
behavior may significantly reduce goodput, ultimately degrading application performance. In some
cases, an application’s performance may be better when the congestion control algorithm is more
aggressive — even if this leads to more retransmissions — as this behavior enables the congestion
control algorithm to make more efficient use of available resources. In particular, if propagation
delay is low, lost packets can be retransmitted quickly without risking significant increases in transfer
time and user-perceived latency.
Third, because retransmissions can occur due to self-induced congestion, changes in workload
can cause changes in retransmissions, even if network conditions have remained the same. An
application that exchanges network data at a low rate — perhaps because the user is not actively
using the application, and thus it only has keep-alive traffic — may have few retransmits. A
change which causes the application to more aggressively use network resources — such as the user
beginning to watch a streaming video — may also cause the retransmission rate to increase. If we
blindly assume that changes in retransmission rate signal changes in network conditions, we will inevitably conclude that network conditions have degraded when in reality the only change has been in how the application is using the network.
5.3.4 Aggregating measurements
Our focus is the performance of the routes between the edge of Facebook’s PoPs and groups of
clients, and not the performance of any single session, client, or user. There are two reasons for this:
1. Individual sessions may exhibit poor performance due to client-specific conditions that are not actionable. Facebook and other network operators may be able to make changes to address instances of poor performance that affect aggregates of clients, such as all clients served by ISP X in geographic location Y. For instance, Facebook may be able to improve connectivity between Facebook's network and ISP X by establishing peering interconnections, or by building a PoP closer to Y to reduce propagation delay. Likewise, ISP X may be able to increase capacity in their backbone to reduce congestion at shared bottlenecks, or upgrade access technology to reduce propagation delay and increase the goodput individual access links can support. However, there will inevitably be cases of poor performance that are specific to individual clients. For instance, a client may be connected to a poorly performing residential 802.11 wireless network, or be competing with other flows on the same access link, such as when multiple clients in a residence are competing for bandwidth.
2. Traffic engineering decisions need to be made at the granularity of groups of clients and applications. When evaluating the potential opportunity of performance-aware routing, we want to determine if changing the route used by Facebook to deliver traffic to a group of clients could improve performance. Given the design of end-user networks, we expect that such a change will equally impact clients in the same geographic area, served by the same ISP and access technology. Furthermore, while EDGE FABRIC gives us flexibility in routing decisions, we still need to make routing decisions at the granularity of a BGP prefix and application (footnote 33). Thus, we want to compare the performance of a given route for each of these aggregates and not for any specific client or user.
In the rest of this section we discuss how we aggregate measurements.
5.3.4.1 Grouping measurements
We define client groups as aggregates of clients in the same AS that are likely to experience similar performance (e.g., in the same geographic location and using the same access technology), using the client's BGP IP prefix and client country (as inferred from the client's IP address). Because we expect network performance observed for a client group may vary based on the Facebook PoP serving the traffic — the PoP's location determines propagation delay, and interconnections may vary across PoPs, leading to differences in routing — we also consider the PoP when aggregating measurements for temporal and spatial analysis. We refer to such aggregates as client-PoP groups.
Finally, we group measurements for each client-PoP group into 15 minute time windows to enable temporal analysis of both degradation (§5.5) and opportunity for performance-aware routing (§5.6). We refer to measurements for a client-PoP group during a time window as a client-PoP-time aggregation. We choose a 15 minute window to balance insights into brief events with the need to have sufficient measurements for statistically significant results.
Footnote 33: While EDGE FABRIC can split a BGP prefix (e.g., splitting a /20 into two /21s, §4.4.1.2), the possible routing decisions for each subprefix are constrained by the set of covering routes.
Summary of technique. Measurements are grouped into client-PoP-time aggregations based on the following dimensions; a minimal grouping sketch follows the list:
1. the Facebook point of presence,
2. the client’s BGP IP prefix (and thus inherently client AS),
3. the client’s country, and
4. the 15-minute time window during which the session terminated.
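A minimal sketch of this grouping, with illustrative field names rather than the production schema, might look like:

```python
# Sketch of the client-PoP-time aggregation key; field names are illustrative.
from collections import defaultdict

WINDOW_SECS = 15 * 60  # 15-minute time windows

def aggregation_key(pop: str, bgp_prefix: str, country: str, session_end_ts: float):
    """Group by PoP, client BGP prefix (which implies the client AS), client country,
    and the 15-minute window in which the session terminated."""
    window_start = int(session_end_ts // WINDOW_SECS) * WINDOW_SECS
    return (pop, bgp_prefix, country, window_start)

def group_sessions(sessions):
    """sessions: iterable of dicts with 'pop', 'bgp_prefix', 'country', 'end_ts'."""
    groups = defaultdict(list)
    for s in sessions:
        key = aggregation_key(s["pop"], s["bgp_prefix"], s["country"], s["end_ts"])
        groups[key].append(s)
    return groups
```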
Key considerations. Our approach balances the advantages of aggregating data finely (e.g., the
ability to see events that impact a small group of clients or that impact clients for a short period of
time) with the requirement to have sufficient samples for a statistically significant result:
• We include the BGP prefix in the grouping because we assume that users in the same BGP
prefix are more likely to have the same access technology, and because we must consider the
BGP prefix when assessing opportunity for performance-aware routing (§5.6) as the set of
routes available to clients in the same AS may vary by BGP prefix.
• We include geolocation information because we find that it reduces variability relative to aggregating to the BGP prefix alone. Network address space loosely correlates with location; two user IP addresses in the same /24 are likely to be in the same geographic location [147, 157, 259]. However, a BGP prefix can contain a large block of address space, and under such circumstances there is a higher probability of clients being spread over a wide geographic area. For instance, we have observed BGP prefixes belonging to multinational end-user ISPs in Europe that contain clients in multiple countries.
Figure 5.6: Example of how shifts in client population can lead to changes in a client-PoP group's MinRTT that can be misconstrued as changes in network conditions. Panel (a) shows MinRTT_P50 over time for a client-PoP group; panel (b) shows the number of samples over time for the same group (series: all clients, California clients, Hawaii clients). In this example, IP geolocation data was used to determine that the BGP prefix associated with the client-PoP group contained clients in two geographical regions: California and Hawaii. While clients in each region have a stable median MinRTT, the client-PoP group's median MinRTT decreases to ~20ms during peak hours in California, increases to ~60ms during peak hours in Hawaii, and oscillates between these two extremes during other periods.
• However, we have observed cases where aggregating at finer geographic granularities could prove beneficial. For example, Figure 5.6 shows a /16 prefix found to contain clients in California and Hawaii, which ultimately causes MinRTT_P50 to oscillate. We experimented with de-aggregating BGP prefixes (e.g., splitting a /16 into /18s or /20s) and geolocating clients at finer granularities (states and "tiles" [59]) to handle such cases. However, we found that these approaches yielded minimal reduction in variability while simultaneously reducing the coverage of our dataset due to cases where deaggregation left too few measurements for us to be able to make statistically significant conclusions (§5.3.5.1). More generally, our ability to cluster clients by location is limited by the availability and accuracy of IP-based geolocation information; prior work has found that IP geolocation results are often accurate at the country level, but that accuracy can quickly diminish at finer granularities [184, 241, 271, 332, 384]. While we were able to use geolocation data to split up the /16 prefix in Figure 5.6, we will inevitably encounter scenarios where geolocation information is wrong or unavailable, potentially resulting in erroneous conclusions about performance changes (footnote 34).
We discuss possible improvements to our aggregation approach in Section 8.2.2.
Footnote 34: It is possible that fetching information directly from network operators through BGP communities and other means could help, but there are barriers to the adoption and accuracy of any approach that requires such coordination.
5.3.4.2 Summarizing network conditions per aggregate
For each client-PoP-time aggregation, we capture the 50th percentile (median, p50) of MinRTT across all sessions, denoted MinRTT_P50, and the median HDratio across sessions that had at least one transaction test the session's ability to achieve HD goodput, denoted HDratio_P50 (footnote 35). We aggregate MinRTT and HDratio to percentiles and focus on the median rather than the average or higher percentiles so that our analysis is robust to network conditions impacting individual clients or sessions (footnotes 36, 37). In addition, we can calculate confidence intervals for the median without needing to assume normality (§5.3.5.1).
Weighting aggregates in distributions. When reporting performance across client-PoP-time
aggregations, we weight each aggregate by the volume of traffic in the corresponding HTTP
sessions. We focus on traffic volumes because prefixes are arbitrary units of address space whose
size may not map to the underlying userbase size [104, 197] and can be subdivided arbitrarily [44].
As discussed in Section 5.2.2.1, this approach to weighting aggregations may result in aggregates
with better connectivity being overrepresented. Ideally, aggregates would be weighted by the number
of users per aggregate that want to use the service, as this would ensure that the aggregates do not
filter out users with poor connectivity; we leave this to future work.
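For concreteness, the sketch below summarizes one client-PoP-time aggregation and builds a traffic-weighted distribution over aggregations; the session fields are illustrative, not the production schema.

```python
# Sketch of per-aggregation summaries and traffic weighting; field names are illustrative.
from statistics import median

def summarize_aggregation(sessions):
    """Return (MinRTT_P50, HDratio_P50, traffic_bytes) for one client-PoP-time
    aggregation. HDratio_P50 is computed only over sessions that tested for HD goodput."""
    minrtt_p50 = median(s["min_rtt_ms"] for s in sessions)
    hd = [s["hdratio"] for s in sessions if s["hdratio"] is not None]
    hdratio_p50 = median(hd) if hd else None
    traffic = sum(s["bytes"] for s in sessions)
    return minrtt_p50, hdratio_p50, traffic

def traffic_weighted_cdf(values_and_weights):
    """Turn [(value, traffic_bytes), ...] into a traffic-weighted CDF
    [(value, cumulative_fraction), ...], so each aggregation counts by its traffic."""
    total = sum(w for _, w in values_and_weights)
    cdf, cum = [], 0.0
    for v, w in sorted(values_and_weights):
        cum += w / total
        cdf.append((v, cum))
    return cdf
```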
Footnote 35: We find the distribution of HDratio is frequently bimodal as sessions often have HDratio at the extremes 0.0 and 1.0.
Footnote 36: We have also reproduced our analysis in §5.6 comparing the average HDratio across client-PoP-time aggregations (omitted), with qualitatively similar results and findings.
Footnote 37: For instance, we observe MinRTT values in the extreme tail of the distribution having values on the order of seconds, likely either due to bufferbloat [156] or last-mile/last-link problems [322].
5.3.5 Comparing performance
Internet performance can vary over time due to failures, routing changes, traffic engineering, transient congestion (e.g., during peak hours), and changes in client population (e.g., as users change networks at the end of working hours or as users in earlier time zones go to sleep). To capture these effects, we evaluate performance over time to assess whether we can identify temporal patterns of Internet performance. We characterize performance degradation to check how the performance for a client-PoP group changes over time — e.g., is performance at 6 AM significantly better than performance at 8 PM every day?
In addition, Facebook PoPs often have multiple routes to the client-PoP groups they serve (§4.2.3.1). For each client-PoP group, we compare the performance of available routes to determine if there exists opportunity to improve performance by using control systems such as EDGE FABRIC (chapter 4) or Google's Espresso [469] to incorporate dynamic performance signals into egress routing decisions.
We compute performance degradation over time by identifying the baseline performance of each client-PoP group, then comparing performance of the (BGP) preferred route in each time window against the baseline. We define the baseline MinRTT_P50 of a client-PoP group as the 10th percentile of the MinRTT_P50 distribution of its preferred route across all time windows, and the baseline HDratio_P50 as the 90th percentile of the corresponding distribution. We consider a client-PoP-time aggregation to be experiencing performance degradation whenever the lower bound of the confidence interval of the difference between the baseline performance and the current performance is above a configurable threshold (e.g., 5ms for MinRTT_P50). We compare the lower bound of the confidence interval to check for client-PoP-time aggregations where there likely is degradation at the chosen threshold.
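A minimal sketch of this degradation check, assuming the confidence interval of the difference is computed with the distribution-free technique described in Section 5.3.5.1, is shown below; thresholds and variable names are illustrative.

```python
# Sketch of the degradation check; the CI of the difference is assumed to come from
# the distribution-free technique of §5.3.5.1, and thresholds are illustrative.
import numpy as np

def baselines(pref_minrtt_p50_by_window, pref_hdratio_p50_by_window):
    """Baseline MinRTT_P50 = 10th percentile across windows of the preferred route;
    baseline HDratio_P50 = 90th percentile across windows."""
    return (np.percentile(pref_minrtt_p50_by_window, 10),
            np.percentile(pref_hdratio_p50_by_window, 90))

def is_degraded(diff_ci_low: float, threshold: float) -> bool:
    """Degradation is flagged only when the lower bound of the confidence interval
    of the degradation (current - baseline for MinRTT_P50, baseline - current for
    HDratio_P50) exceeds the threshold (e.g., 5 ms or 0.05)."""
    return diff_ci_low > threshold
```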
We also compute the opportunity to improve performance over time by using alternate routes. Within a client-PoP-time aggregation, we compare the performance of the preferred route with the performance of the best alternate route. We consider there to be an opportunity to improve performance whenever the lower bound of the confidence interval of the performance difference between the preferred and best alternate routes is above a configurable threshold (e.g., 0.05 for HDratio_P50).
We prioritize improving HDratio over MinRTT — that is, when an alternate route has better MinRTT_P50 at the specified threshold, we classify it as an opportunity for improving MinRTT_P50 only if the HDratio_P50 of the alternate route is statistically equal to or better than that of the preferred route. We prioritize HDratio because MinRTT is an estimate of propagation delay and thus does not include the impact of loss. In comparison, HDratio captures a route's ability to deliver bytes to the destination, a function of loss, latency (propagation, queuing, and MAC/link-layer delays), available bandwidth, and the behavior of the congestion control algorithm. As a result, HDratio offers a richer view of performance: a route with a better MinRTT may also have a worse HDratio if congestion along the route is causing packet loss; under such conditions, sessions traversing the route with the worse HDratio will likely yield worse user experience, as lower goodput will ultimately translate into higher user-perceivable latency and lead to stalls for streaming videos.
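This prioritization can be summarized with a small decision function; the sketch below uses illustrative thresholds matching those in the text.

```python
# Sketch of opportunity classification, prioritizing HDratio over MinRTT.
def classify_opportunity(hdratio_diff_ci_low: float,
                         minrtt_diff_ci_low: float,
                         hdratio_not_worse: bool,
                         hd_thresh: float = 0.05,
                         rtt_thresh_ms: float = 5.0) -> str:
    """hdratio_diff_ci_low: CI lower bound of (alternate - preferred) HDratio_P50.
    minrtt_diff_ci_low: CI lower bound of (preferred - alternate) MinRTT_P50.
    hdratio_not_worse: alternate's HDratio_P50 is statistically equal or better."""
    if hdratio_diff_ci_low > hd_thresh:
        return "improve_hdratio"
    if minrtt_diff_ci_low > rtt_thresh_ms and hdratio_not_worse:
        return "improve_minrtt"
    return "no_opportunity"
```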
5.3.5.1 Controlling statistical significance
The statistical significance of our observations depends on the underlying performance variance
and the number of session samples. If clients in a client-PoP group have similar performance, then
few samples are needed to obtain a good estimate of their performance; conversely, if clients in a
client-PoP group have highly variable performance, then more samples are needed. Our approach allows us to focus not on a target number of samples, but on whether the comparison is precise enough to support conclusions.
When comparing two client-PoP-time aggregations (baseline vs. current performance for degradation, or primary vs. best alternate for opportunity), we compute their performance difference and the α = 0.95 confidence interval of the difference. Since we cannot assume normality, we compute the confidence interval for the difference of medians for MinRTT_P50 and HDratio_P50 using a distribution-free technique [336] (footnote 38). We only consider client-PoP-time aggregations with at least 30 samples (e.g., an aggregation for MinRTT_P50 must have at least 30 sessions to be valid, and an aggregation for HDratio_P50 must be built from at least 30 sessions that were capable of testing for HD goodput), and we define comparisons of client-PoP-time aggregations to be valid for analysis when we can calculate "tight" confidence intervals: we require the confidence intervals of the differences to be smaller than 10ms for MinRTT_P50 and 0.1 for HDratio_P50. Using larger thresholds allows us to capture more traffic at a lower statistical significance, and smaller thresholds provide additional statistical significance at the cost of invalidating more time windows. We find that thresholds defining confidence intervals with half and double the chosen sizes yield qualitatively similar results (not shown).
Footnote 38: Traffic engineering systems in production need to be able to make these comparisons in near real-time (for instance, to compare the performance of routes to a network). t-digests (an on-line, probabilistic data structure [118]) can be used to efficiently calculate percentiles in streaming analytics frameworks and to calculate confidence intervals via the distribution-free technique [336] that we apply.
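As an illustration only, the sketch below computes a distribution-free (order-statistic) confidence interval for a median and a conservative Bonferroni-style interval for a difference of medians. This is a stand-in to show the flavor of such techniques; it is not necessarily the exact method of [336].

```python
# Illustrative distribution-free CIs; a stand-in, not necessarily the method of [336].
import numpy as np
from scipy.stats import binom

def median_ci(samples, alpha=0.05):
    """Order-statistic CI for the median: with n samples, Bin(n, 1/2) gives the
    order statistics whose interval covers the median with probability >= 1 - alpha.
    Needs enough samples (the text requires at least 30 per aggregation)."""
    x = np.sort(np.asarray(samples, dtype=float))
    n = len(x)
    j = max(int(binom.ppf(alpha / 2, n, 0.5)), 1)  # lower order statistic (1-based)
    k = n - j + 1                                   # upper order statistic (1-based)
    return x[j - 1], x[k - 1]

def median_diff_ci(samples_a, samples_b, alpha=0.05):
    """Conservative CI for median(a) - median(b): combine per-group CIs, each at
    alpha/2, via a Bonferroni-style bound."""
    lo_a, hi_a = median_ci(samples_a, alpha / 2)
    lo_b, hi_b = median_ci(samples_b, alpha / 2)
    return lo_a - hi_b, hi_a - lo_b
```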
5.3.5.2 Temporal behavior classes
After we compute degradation/opportunity over time, we try to identify temporal patterns. In particular, we identify client-PoP groups that have persistent or diurnal degradation/opportunity. We classify each client-PoP group into one of the following classes, checking the conditions for each in order such that a client-PoP group is assigned to the first class that it matches.
1. The persistent class includes client-PoP groups with degradation/opportunity for at least 75% of the time windows. This class captures client-PoP groups with frequent degradation or where the alternate route is often better than the preferred route.
2. The diurnal class includes client-PoP groups with degradation/opportunity for at least one fixed 15-minute time window (e.g., 11:00–11:15) on at least 5 days in our dataset. This class captures client-PoP groups where there is degradation/opportunity for part of the day over multiple days.
3. The episodic class includes all remaining client-PoP groups. It captures client-PoP groups with some degradation/opportunity but that do not fit into the persistent or diurnal classes.
4. The uneventful class includes client-PoP groups where no valid time window has degradation/opportunity. This class captures client-PoP groups where performance is stable over time (no degradation) or the preferred route is consistently better than the best alternate route (no opportunity).
As we need a representative view of a client-PoP group's behavior over time to classify its behavior, we ignore client-PoP groups that have valid client-PoP-time aggregations for less than 60% of the time windows. This can happen, for example, due to the client-PoP group only sporadically having client traffic or being predominately served by other PoPs.
The definitions above make the uneventful class restrictive (it excludes client-PoP groups with any opportunity/degradation), while the other classes are somewhat inclusive (e.g., any client-PoP group with one time window that experiences repeated opportunity/degradation will be classified as diurnal). Results presented in Sections 5.5 and 5.6 are robust to, and findings qualitatively similar for, different thresholds in the classification algorithm.
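The classification can be expressed compactly in code. The sketch below assumes per-window booleans ('valid' and 'eventful') have already been computed and mirrors the thresholds above (75% for persistent, the same 15-minute time-of-day slot eventful on at least 5 days for diurnal, and the 60% validity requirement); evaluating the 75% rule over valid windows rather than all windows is an assumption of this sketch.

```python
# Sketch of the temporal-behavior classification; thresholds mirror the text, and
# the choice to evaluate the 75% rule over valid windows is an assumption.
from collections import defaultdict

def classify_group(windows):
    """windows: list of dicts with 'start_ts' (epoch seconds), 'valid' (bool), and
    'eventful' (bool: degradation/opportunity detected in that window)."""
    valid = [w for w in windows if w["valid"]]
    if not windows or len(valid) < 0.6 * len(windows):
        return "ignored"  # too few valid windows to characterize the group
    eventful = [w for w in valid if w["eventful"]]
    if not eventful:
        return "uneventful"
    if len(eventful) >= 0.75 * len(valid):
        return "persistent"
    # Diurnal: the same fixed 15-minute time-of-day slot is eventful on >= 5 days.
    days_per_slot = defaultdict(set)
    for w in eventful:
        slot = int(w["start_ts"] % 86400) // 900
        days_per_slot[slot].add(int(w["start_ts"] // 86400))
    if any(len(days) >= 5 for days in days_per_slot.values()):
        return "diurnal"
    return "episodic"
```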
5.4 Does Facebook’s Rich Connectivity Yield Good Performance?
In this section we present a snapshot of Internet performance for users around the world, as measured
from Facebook’s perspective. The breadth and diversity of Facebook’s users — Facebook serves
billions of users from hundreds of countries every day — and the fact that Facebook is a large
content provider yields an opportunity to explore and compare network performance across regions
worldwide.
Figure 5.7: Distribution of MinRTT and HDratio over all sessions and split per continent. Panel (a) shows CDFs of HDratio and MinRTT over all sessions; panels (b) and (c) show the per-continent CDFs of MinRTT and HDratio, respectively (AF, AS, EU, NA, OC, SA). Most sessions have low propagation delay, and most sessions that test for HD goodput are able to achieve it every time they can test for it.
Figure 5.8: Observed relationship between MinRTT (different lines, grouped into ranges of 0–30, 31–50, 51–80, and 81+ ms) and HDratio (x-axis, cumulative fraction of sessions). Sessions with higher MinRTT values are often still able to achieve HD goodput.
Propagation delay is typically low. Figure 5.7a shows the distribution of MinRTT per session. We observe that 50% of sessions have MinRTT less than 39 milliseconds, and 80% of sessions have MinRTT of less than 78 milliseconds. Figure 5.7b depicts the MinRTT distribution by continent. The median latency in Africa is 58ms, in Asia is 51ms, and in South America is 40ms. The median in other continents is approximately 25 milliseconds or less. While the 90th percentile exceeds 100 milliseconds, we expect that some of the samples in the upper percentiles are not representative of the common experience for that client's client-PoP-time aggregation, but instead represent network conditions specific to that individual client or sample (e.g., poor cellular signal strength, access link congestion) (footnote 39). The extreme tail of the distribution (not shown) contains values on the order of seconds, likely either due to bufferbloat [156] or last-mile/last-link problems [322].
Footnote 39: Plotting the weighted distribution of MinRTT_P50 across client-PoP-time aggregations would likely capture this, but we found that such distributions are more difficult to interpret and thus chose to plot a distribution of sessions instead.
These results indicate that most sessions reach Facebook over routes with low MinRTT, enabling
real-time applications such as video calls. Performance tends to be better in continents with more
developed infrastructure [91], both in terms of access networks and density of Facebook PoPs.
Most user sessions are able to achieve HD goodput. Figure 5.7a shows the distribution of
HDratio across HTTP sessions. Over 82% of sessions have an HDratio greater than zero. This
means that at least some of these sessions’ transactions were able to achieve HD goodput, indicating
that the underlying routes can support HD goodput when there is no congestion. Further, 60% have
HDratio of 1, meaning most sessions have enough bandwidth to support HD video. Figure 5.7c
shows that HDratio follows a similar per-continent trend as MinRTT, with Africa, Asia and South
America standing out for having more sessions with HDratio equal to zero: 36% of African sessions,
24% of Asian sessions, and 27% of South American sessions. This result indicates a higher
concentration of clients with non-HD-capable access links in these regions.
While conducting this analysis, we evaluated the benefit of our approach to evaluating if a transaction achieved HD goodput by comparing the distribution produced with our approach (§5.3.2.5) against one built by estimating the achieved goodput as B_total / T_total (but still using our techniques from Sections 5.3.2.3 and 5.3.2.6 to identify eligible transactions and handle multiplexing). The simple approach underestimates which transactions reach HD goodput, yielding a median HDratio of 0.69; this is significantly lower than the median of 1 that we get when using our technique in full.
MinRTT does not directly correlate with HDratio. A transaction's ability to test for HD goodput is partially dependent on MinRTT; sessions with higher MinRTT require larger transfers to test for and achieve HD goodput (§5.3.2), and thus if we do not account for this dynamic, goodput will decrease as MinRTT increases if transfer sizes remain the same. However, because our approach considers whether a transaction is able to test for HD goodput based on the response size and the CWND growth under ideal circumstances relative to MinRTT, HDratio does not decrease linearly as MinRTT increases. Instead, a connection's ability to achieve HD goodput at higher MinRTT is dependent on other factors, including loss and congestion control behavior. For instance, Facebook uses BBR congestion control [74] for most connections; prior work has shown that BBR is capable of supporting high goodput at higher latencies because it uses latency changes — instead of loss — to detect congestion [73, 208].
Figure 5.8 shows how MinRTT and HDratio correlate. We group by ranges of MinRTT and show the distribution of HDratio for each range. HDratio degrades as propagation delay increases, but the majority of sessions achieve HD goodput for some transactions even at MinRTT above 80 milliseconds, indicating that users in these client-PoP groups have connections capable of supporting greater than 2.5Mbps. Thus, the largest barrier to these clients achieving HD goodput is likely jitter and loss caused by self-induced congestion and traffic policers — as MinRTT increases, even a small loss rate makes maintaining a high goodput difficult [73, 139, 208, 321].
Figure 5.9: Degradation in MinRTT_P50 and HDratio_P50, comparing the performance of each time window with the baseline performance for the same client-PoP group. Panel (a) plots the cumulative fraction of traffic versus MinRTT_P50 degradation (current − baseline, in ms); panel (b) plots the cumulative fraction of traffic versus HDratio_P50 degradation (baseline − current). The shaded areas show the distributions of the lower and upper bounds of the degradation confidence interval (not the confidence interval around individual points on the CDF), and provide an indication of where the distribution is.
5.5 How Does Performance Change Over Time?
Performance may vary for a number of reasons, including route changes, congestion, or changes in the client population. In this section we search for instances of performance degradation (§5.3.5) for each client-PoP group. After removing client-PoP-time aggregations with insufficient traffic or with too wide a confidence interval, and then removing client-PoP groups that have client-PoP-time aggregations for less than 60% of the time windows (§5.3.5), we are able to search for instances of performance degradation in MinRTT_P50 and HDratio_P50 for 94.8% and 89.5% of the traffic in our dataset, respectively.
Degradation relative to volume of traffic impacted. Figure 5.9 shows the distributions of degradation for MinRTT_P50 and HDratio_P50, comparing the difference in performance for each client-PoP-time aggregation to the baseline for the client-PoP group, weighted by the volume of traffic of each client-PoP-time aggregation. The vast majority of traffic sees minimal degradation over the 10 days in the study period, with only 10% of traffic experiencing a 4 millisecond or worse degradation in MinRTT_P50 and 10% experiencing a 0.065 or worse degradation in HDratio_P50, both of which can be the result of minor changes in client population or client behavior. However, in the tail we observe 1.1% of traffic experiencing degradation of at least 20 milliseconds in MinRTT_P50, and 2.3% of traffic with a degradation of at least 0.4 in HDratio_P50. These changes are more significant and may indicate congestion or a route change between the serving PoP and the client-PoP group.
Periods of degraded performance (§5.5)

CLASS /       MinRTT_P50 (§§ 5.3.1, 5.3.4 and 5.3.5)             HDratio_P50 (§§ 5.3.2, 5.3.4 and 5.3.5)
CONTINENT     +5ms         +10ms        +20ms        +50ms       0.05         0.1          0.2          0.5
Uneventful    .575         .705         .809         .929        .598         .625         .655         .742
  AF          .344         .561         .710         .837        .541         .541         .544         .713
  AS          .378         .518         .688         .880        .481         .487         .494         .507
  EU          .637         .747         .813         .932        .590         .634         .688         .754
  NA          .680         .813         .909         .984        .656         .671         .681         .817
  OC          .899         .955         .976         .993        .662         .662         .662         .672
  SA          .296         .454         .633         .817        .497         .501         .541         .721
Continuous    .008/.007    .002/.001    .000/.000    .000/.000   .019/.018    .009/.009    .008/.008    .001/.001
  AF          .017/.015    .002/.001    .000/.000    .000/.000   .212/.202    .212/.199    .212/.196    .052/.051
  AS          .006/.005    .001/.001    .000/.000    .000/.000   .035/.033    .035/.033    .035/.033    .001/.001
  EU          .007/.006    .003/.003    .000/.000    .000/.000   .021/.020    .001/.001    .000/.000    .000/.000
  NA          .009/.007    .000/.000    .000/.000    .000/.000   .003/.003    .002/.001    .000/.000    .000/.000
  OC          .000/.000    .000/.000    .000/.000    .000/.000   .000/.000    .000/.000    .000/.000    .000/.000
  SA          .010/.007    .001/.000    .000/.000    .000/.000   .011/.011    .011/.011    .011/.011    .000/.000
Diurnal       .175/.060    .091/.023    .043/.008    .010/.002   .134/.086    .135/.075    .089/.043    .017/.009
  AF          .312/.149    .183/.092    .126/.047    .069/.011   .091/.059    .091/.035    .085/.012    .083/.065
  AS          .322/.125    .166/.041    .064/.011    .022/.003   .075/.054    .069/.051    .065/.049    .085/.054
  EU          .149/.035    .082/.012    .045/.004    .002/.000   .135/.076    .143/.066    .059/.028    .006/.001
  NA          .075/.026    .033/.009    .011/.003    .002/.000   .154/.108    .151/.096    .132/.062    .005/.001
  OC          .034/.009    .018/.003    .008/.001    .002/.000   .002/.000    .000/.000    .000/.000    .000/.000
  SA          .383/.164    .174/.063    .082/.023    .018/.004   .234/.174    .234/.157    .197/.070    .020/.010
Episodic      .242/.007    .202/.005    .148/.003    .061/.001   .249/.002    .231/.001    .249/.002    .239/.001
  AF          .327/.007    .255/.003    .164/.002    .094/.002   .156/.001    .156/.000    .159/.000    .153/.000
  AS          .294/.012    .315/.012    .247/.006    .099/.002   .410/.002    .410/.002    .407/.002    .406/.001
  EU          .207/.006    .167/.004    .142/.003    .066/.001   .253/.002    .222/.001    .253/.001    .240/.001
  NA          .236/.004    .153/.003    .080/.001    .014/.000   .186/.002    .176/.001    .187/.002    .178/.001
  OC          .067/.001    .027/.001    .016/.000    .005/.000   .336/.001    .338/.001    .338/.001    .328/.001
  SA          .312/.011    .371/.015    .285/.010    .164/.004   .258/.001    .254/.001    .250/.001    .258/.004

Table 5.1: Fraction of traffic for client-PoP groups by temporal behavior class (§5.3.5.2) and client-PoP group location for periods of degraded performance (§5.5) at various thresholds of degradation (§5.3.4). Each client-PoP group is assigned a single class and a single continent. The first number in each pair weights the client-PoP group by its total traffic volume, normalized overall and per continent, so the classes sum to 1 per column (across classes, without continent breakdown) and to 1 per column per continent; it provides insight into the fraction of clients (weighted by their total traffic) that experienced the temporal behavior, and therefore into how widespread the events are. The second number in each pair shows the fraction of overall (or per continent) traffic sent to those client-PoP groups during the episodes of performance degradation, providing insight into the amount of traffic served during periods of degradation. Uneventful entries have no degraded periods and show a single number. For example (bottom leftmost entries), 31.2% of traffic to South American clients is for client-PoP groups that experience episodic degradation of at least 5ms, and 1.1% of traffic to South American clients is sent during periods of performance degradation.
Degradation per temporal behavior class. Table 5.1 shows degradation, computed at different thresholds, per temporal behavior class (§5.3.5). For each temporal behavior class and threshold, the table shows the fraction of clients (weighted by their total traffic) that experienced the temporal behavior — providing insight into how widespread the events are — and the fraction of overall (or per continent) traffic sent to those client-PoP groups during episodes of performance degradation — providing insight into the amount of traffic impacted by the episodes. For example, client-PoP groups responsible for 13.4% of overall traffic are classified as experiencing diurnal HDratio_P50 degradation of at least 0.05. However, only a fraction of this traffic experienced degradation: the second number shows that 8.6% of overall traffic was delivered for these client-PoP groups during these periods of diurnal degradation. The second (impacted-traffic) numbers in the table show that most of the performance degradation is diurnal, which could be caused by congestion at peak periods; as we discuss further in Section 5.6, we suspect that such instances of degradation occur due to congestion inside of the client group's AS (e.g., inside the end-user ISP's core or access network).
We observe that most instances of performance degradation are small: the fraction of traffic experiencing degradation decreases as the threshold increases. For example, only 0.8% of traffic is in a client-PoP group that experiences diurnal degradation in MinRTT_P50 of 20ms or more (footnote 40). Finally, while we find that a significant fraction of client-PoP groups experience episodic degradation, the volume of traffic impacted is small (compare the first, total-traffic number against the second, impacted-traffic number), indicating that episodic instances of degradation are prevalent across client-PoP groups, but are short-lived and thus have little impact on most traffic. Results for individual continents follow similar trends, with Africa, Asia, and South America experiencing above-average degradation, and Oceania experiencing below-average degradation.
Footnote 40: The fraction of traffic experiencing diurnal and episodic degradation can increase as the threshold increases — client-PoP groups identified as experiencing continuous or diurnal degradation at low thresholds may be identified as only experiencing diurnal or episodic degradation (or no degradation at all) at higher thresholds.
Example time series. Figures 5.10 and 5.11 show examples of diurnal degradation impacting client-PoP groups consisting of clients in large mobile provider networks. We observe instances of degradation, including diurnal degradation, that impact client-PoP groups for which traffic is routed through a direct, one-hop peering interconnection (chapter 3). For instance, Figure 5.10 shows a client-PoP group that experiences diurnal degradation in HDratio_P50 and for which traffic is routed via a direct, one-hop peering interconnection. Given that EDGE FABRIC prevents congestion of interconnections at the edge of Facebook's network, we know that this degradation is not caused by congestion at the peering interconnection between Facebook and the mobile provider. Instead, we suspect it may be due to congestion in the provider's backbone or radio access network. Degradation of HDratio_P50 may be limited for this client-PoP group in part due to Facebook's use of BBR, as BBR does not consider loss events to be indicative of congestion.
While we have not yet formally evaluated whether the probability of a client-PoP group experiencing degradation is correlated with the route used to deliver traffic, Figure 5.10 serves as a reminder that even in the ideal case in terms of interconnection, performance still ultimately depends on conditions inside of the destination network.
Figure 5.10: Example of diurnal degradation of goodput (as measured by HDratio_P50) for a client-PoP group consisting of clients in a large mobile provider's network (series plotted over 2019-09-11 through 2019-09-15: MinRTT_P50, HDratio_P50, HDratio_Avg, and normalized traffic volume). The route between Facebook and these clients traverses a direct (one-hop, chapter 3) peering interconnection. Each day when traffic is lowest, HDratio_P50 is 1.0 — this indicates that at least 50% of sessions that tested for HD goodput were able to achieve it every time they tested. However, HDratio_P50 decreases throughout the day — when traffic is highest, HDratio_P50 is less than 0.4, indicating that most clients are unable to consistently achieve HD goodput. While MinRTT_P50 is also higher at peak — moving from 75ms at off-peak to 80ms at peak — these variations alone are not indicative of a performance problem. Given the degradation in HDratio_P50 at peak with little change to MinRTT_P50, we surmise that loss increases at peak, potentially at a link with a short buffer. We know that this loss is not due to congestion at the peering interconnection between Facebook and the mobile provider given that EDGE FABRIC prevents congestion at Facebook's edge. Instead, we suspect it may be due to congestion in the provider's backbone or radio access network.
Figure 5.11: Example of diurnal performance degradation impacting both propagation delay and goodput (as measured by MinRTT_P50 and HDratio_P50 respectively) for a client-PoP group consisting of clients in a large mobile provider's network (series plotted over 2019-09-11 through 2019-09-15: MinRTT_P50, HDratio_P50, HDratio_Avg, and normalized traffic volume). The route between Facebook and these clients traverses a transit interconnection. Each day when traffic is lowest, HDratio_P50 is 0.6, indicating that most sessions are unable to consistently achieve HD goodput when they are able to test for it. While this is already poor performance, we observe that HDratio_P50 further decreases throughout the day, and that when traffic is highest, HDratio_P50 is zero, indicating that most sessions are unable to achieve HD goodput at all. MinRTT_P50 also changes throughout the day, moving from 60ms at off-peak to 120ms at peak. We surmise that the significant rise in MinRTT_P50 and drop in HDratio_P50 is caused by congestion and queuing in the path between Facebook and the clients in the destination ISP, and that this queuing results in loss that degrades network performance. While EDGE FABRIC can prevent congestion at interconnections at the edge of Facebook's network, this congestion is further downstream — such as between the transit provider and the destination ISP — and thus cannot be detected by EDGE FABRIC without the use of additional signals.
5.6 How Does Facebook’s Routing Policy Impact Performance?
This section evaluates whether incorporating performance signals into EDGE FABRIC's decision process could be beneficial, along with the potential utility of application-specific routing. Facebook and
other network operators have traditionally resorted to using heuristics to guide their routing decisions
given BGP’s limitations (§§ 2.3.1, 2.4.3 and 4.2.3.2). Furthermore, the routing decisions made at
each Facebook PoP impact all traffic to a given destination; this catchall behavior means that when
EDGE FABRIC detours traffic for a prefix to an alternate path to avoid congestion, all of that prefix’s
traffic, regardless of its sensitivity to network conditions, is shifted. Ideally, EDGE FABRIC would
be able to shift application traffic that is less sensitive to network conditions to the alternate path
when the alternate path will have worse performance.
The opportunity to shift away from coarse grained decisions made by heuristics and instead
optimize for performance and application needs is one potential benefit of traffic engineering systems
such as EDGE FABRIC. This use-case is discussed as one of the key motivations behind similar
systems, such as Google’s Espresso [469].
There are two ways such control systems could make use of performance signals and application-specific routing. First, systems could continuously incorporate performance signals into their routing decisions; in the extreme, performance would be used to select among routes, and other metrics (such as MED, AS_PATH length, and business policies) would only be used to break ties. Systems could also weigh performance and application needs alongside cost and business objectives, potentially making decisions on a per-application basis. Second, systems could use performance and application-specific routing when forced to shift traffic to alternate paths to minimize the impact of such shifts, such as by keeping performance-sensitive traffic (e.g., video calls) on the preferred path. We discuss both of these applications in the context of EDGE FABRIC in Section 4.5.2; in this section, we quantify their potential value in Facebook's environment.
In addition to searching for opportunities to improve performance, we also use our dataset to compare the performance observed for the same client-PoP group for routes traversing peering and transit interconnections, to understand the performance benefits provided by Facebook's peering connectivity.
5.6.1 Could performance-aware routing provide benefit?
To answer this question, we compare the performance observed by flows routed via the preferred
route (the route preferred by Facebook’s routing policy, §4.2.3) and via the 2nd and 3rd best routes.
We describe how we continuously measure the performance of the preferred and alternate routes in
Section 5.2.2.2; our approach relies on the ROUTEPERF controller controlling the egress route taken
by measurement traffic.
89.5% of the traffic has valid client-PoP-time aggregations (at least two routes and "tight" confidence intervals) for at least 60% of the time windows for computing opportunity to improve MinRTT_P50 (85.8% of traffic for HDratio_P50). For each valid client-PoP-time aggregation, we identify the best performing alternate route for MinRTT_P50 and HDratio_P50. We then compare MinRTT_P50 and HDratio_P50 between the preferred route and the best performing alternate route for each valid client-PoP-time aggregation to identify instances where performance-aware routing could have provided benefit.
Comparing preferred and best alternate routes. The solid line in Figure 5.12 shows the performance difference between the preferred and best alternate routes for all valid client-PoP-time aggregations for both MinRTT_P50 and HDratio_P50. The distributions are concentrated around x = 0, indicating that the preferred and best alternate routes often have similar performance. The MinRTT_P50 of the preferred route is within 3ms of the optimal (where optimal is defined as the lowest MinRTT_P50 observed for the client-PoP group between the preferred and best alternate routes) for 83.9% of traffic, and the HDratio_P50 of the preferred route is within 0.025 of optimal for 93.4% of traffic. Although we did not specifically look at the performance difference for prefixes detoured by EDGE FABRIC during our study period, this result suggests that when EDGE FABRIC shifts traffic to prevent interconnection congestion (§4.4), the shifted traffic is unlikely to experience any significant degradation in performance. Since degradation appears to be limited, there may be little value in employing application-specific routing during such detours.
Opportunities to improve performance by shifting traffic. We find few opportunities to improve performance by using performance instead of the existing heuristics in Facebook's routing policies: Figure 5.12 shows that MinRTT_P50 can be improved by 5ms or more for only 2.0% of traffic, and HDratio_P50 can be improved by 0.05 or more for only 0.2% of traffic. One possible explanation for finding less opportunity to improve HDratio_P50 compared to MinRTT_P50 is that the barrier to goodput is often within the last-mile network (e.g., congestion within the delivery network, or limitations of the access technology in use); in these cases, performance cannot be improved by using alternate routes, as all routes converge and traverse the same bottleneck. MinRTT_P50, however, is not defined by a single bottleneck and can improve whenever a better alternate route is available. The difference distribution for MinRTT_P50 has more density on x < 0 (i.e., it is skewed to the left), which means that the preferred route is more likely to outperform the best alternate route than the opposite.
Figure 5.12: Possible performance improvement, weighted by traffic, over 15 minute time windows. Panel (a) plots the cumulative fraction of traffic versus the MinRTT_P50 difference [Preferred − Alternate]; panel (b) plots the cumulative fraction of traffic versus the HDratio_P50 difference [Alternate − Preferred]. Positive values mean the alternate path is better than the primary path (lower MinRTT_P50, higher HDratio_P50). The shaded areas show the distributions of the lower and upper bounds of confidence intervals.
5.6.1.1 When and where are the opportunities for improvement?
Table 5.2 breaks down opportunity for improving MinRTT P50 by 5ms and HDratio P50 by 0.05 by temporal pattern and client continent. We prioritize HDratio P50 when assessing connection performance and exclude cases where MinRTT P50 improves if HDratio P50 degrades (§5.3.5). Results for higher improvement thresholds are qualitatively similar, but apply to smaller fractions of traffic (not shown). We report on lower thresholds as we identify few opportunities.
Opportunity per temporal behavior class. We find that most (1.2% of overall traffic) opportunity for improving MinRTT P50 is for client-PoP groups classified as continuous, meaning that the preferred route usually has a higher propagation delay than the best available route. We find few diurnal or episodic opportunities to improve MinRTT P50, and even fewer opportunities to improve HDratio P50. Similar to degradation (§5.4), we find more opportunity in Africa, Asia and South America, and less in Oceania. Figures 5.13 and 5.14 show examples of episodic opportunity to improve MinRTT P50, and Figure 5.15 shows diurnal opportunity to improve the same.

Figure 5.13: Time series showing episodic opportunity to improve propagation delay (as measured by MinRTT P50) by shifting traffic to an alternate route; the plot shows MinRTT P50 [ms] from 2019-09-10 to 2019-09-14 for the route preferred by policy and the best alternate route, along with normalized traffic volume and periods with a 10ms+ statistically significant difference. The client-PoP group consists of clients served by a fixed broadband ISP that exclusively provides 1 Gbps+ fiber optic Internet connections. The preferred and best alternate route between Facebook and these clients traversed peering and transit interconnections, respectively. The peering interconnection was direct to the destination AS while the transit interconnection had an AS_PATH length of (1). The period of opportunity lasts for approximately a day. During this time, MinRTT P50 increased for both the preferred and best alternate paths, but because MinRTT P50 increased less for the best alternate path, it ended up being better than the preferred path. There were no BGP route changes observed during this time. We suspect that the increase in MinRTT P50 is not indicative of queuing due to congestion and is likely instead the result of a routing anomaly given that (i) MinRTT P50 did not increase by the same amount for both routes and (ii) MinRTT P50 remained stable for both paths throughout the episode, despite the episode lasting for an entire day, during which time load changed.
Summary. Considering the conditions under which we observe opportunity, we conclude that:
• Opportunity to improve performance over alternate routes implies that the access network /
technology is not the barrier to performance. Thus, when there is opportunity to improve
MinRTT P50 or HDratio P50, it must be due to an issue or an event, such as congestion or a
failure, impacting a portion of the preferred route that is not shared with the alternate route.
Our analysis suggests that such events are relatively rare.
Opportunity for performance-aware routing (§5.6)

CLASS /                 MinRTT P50                HDratio P50
CONTINENT            5ms          10ms               +0.05
Uneventful          .890          .943               .844
  AF                .570          .740               .722
  AS                .711          .828               .798
  EU                .939          .977               .857
  NA                .916          .961               .839
  OC                .901          .976               .688
  SA                .583          .676               .913
Continuous       .013 .012     .006 .006          .000 .000
  AF             .119 .115     .043 .041          .000 .000
  AS             .049 .046     .036 .034          .000 .000
  EU             .004 .004     .000 .000          .000 .000
  NA             .004 .003     .001 .001          .000 .000
  OC             .004 .004     .004 .004          .000 .000
  SA             .072 .069     .046 .043          .000 .000
Diurnal          .016 .007     .005 .002          .005 .001
  AF             .094 .049     .069 .031          .000 .000
  AS             .035 .010     .012 .004          .003 .002
  EU             .005 .002     .001 .000          .009 .002
  NA             .015 .008     .001 .001          .000 .000
  OC             .000 .000     .000 .000          .000 .000
  SA             .108 .036     .062 .027          .000 .000
Episodic         .081 .001     .046 .001          .151 .001
  AF             .217 .002     .148 .006          .278 .001
  AS             .205 .007     .124 .003          .200 .001
  EU             .052 .001     .022 .000          .134 .001
  NA             .064 .001     .036 .000          .160 .001
  OC             .095 .001     .020 .000          .312 .002
  SA             .237 .004     .217 .003          .087 .000
Table 5.2: Fraction of traffic on prefixes by temporal behavior class (§5.3.5.2) and client geographic location for periods with opportunity for performance-aware routing (§5.6) at various thresholds of improvement (§5.3.4). In each pair of columns, each client-PoP group is assigned a single class and a single continent. The first (blue) column of each pair weights the client-PoP group by its total traffic volume, normalized overall and per continent, and so the classes sum to 1 per column (across classes, without continent breakdown) and to 1 per column per continent. This column provides insight into the number of users (weighted by their total traffic) that experienced the temporal behavior, and therefore provides insight into how widespread the events are. The second (orange) column shows the fraction of overall (or per continent) traffic sent from those PoPs to those prefixes during the episodes of opportunity for improvement via performance-aware rerouting. This column provides insight into the amount of traffic associated with the episodes.
Figure 5.14: Time series showing episodic opportunity to improve propagation delay (as measured by MinRTT P50) by shifting traffic to an alternate route; the plot shows MinRTT P50 [ms] from 2019-09-08 to 2019-09-14 for the route preferred by policy and the best alternate route, along with normalized traffic volume and periods with a 10ms+ statistically significant difference. The client-PoP group consists of clients served by a cable broadband ISP. Both the preferred and best alternate route between Facebook and these clients traversed a peering interconnection, but neither peering interconnection was direct to the destination AS. We focus on the period of opportunity on 2019-09-13, which lasts for approximately half a day. During this time, the destination AS began to advertise routes with AS_PATH prepending (§2.1); prepending was visible on all routes Facebook received. While this prepending did not change how Facebook routed traffic, it may have changed how other networks routed traffic, leading to congestion on the route preferred by Facebook’s policy. It is unclear why the destination AS began to use prepending. Based on the step-wise change in MinRTT P50 on 2019-09-12, we suspect the destination AS was changing routing or their network topology, and as part of this change was also attempting to load-balance ingress traffic across interfaces, but ultimately ended up (inadvertently) causing congestion. The opportunity disappeared when the prepending stopped.
Figure 5.15: Time series showing diurnal opportunity to improve propagation delay (as measured by MinRTT P50) by shifting traffic to an alternate route; the plot shows MinRTT P50 [ms] from 2019-09-13 to 2019-09-16 for the route preferred by policy and the best alternate route, along with normalized traffic volume and periods with a 10ms+ statistically significant difference. The client-PoP group consists of clients served by a fixed broadband ISP. Both the preferred and best alternate route between Facebook and these clients traversed a peering interconnection, but neither peering interconnection was direct to the destination AS. Both routes had an AS_PATH length of (2) with prepending removed. The destination AS’s use of AS_PATH prepending (§2.1) broke the tie between the two available peering routes and caused Facebook to prefer a route with relatively worse performance. We observe a diurnal trend up until 2019-09-16, with MinRTT P50 increasing during the peak hours of the day on the primary path. This pattern suggests queuing was likely occurring inside of the peer AS used for the primary path, at the interconnection between the peer AS and destination AS, or inside of a portion of the destination AS network that was only used by the preferred route. Since the increase in MinRTT P50 was isolated to the primary path, we can be confident that the increase was not caused by a change in client population and was not related to conditions in the last mile or access network. The increase was not directly correlated with Facebook’s traffic, suggesting that other traffic traversing the same route (including non-Facebook traffic) played a role in the congestion. On 2019-09-16, the destination AS changed its route advertisements in the middle of the day, removing prepending from one route and adding it to the other. This change caused Facebook’s routing policy to begin to prefer the route that was previously the best alternate route. Following this change, the propagation delay on the route preferred by Facebook’s BGP policy remained stable, while the propagation delay of the alternate route still increased throughout the day (although less so, perhaps due to less traffic now on that route). This result suggests that Facebook could have successfully employed performance-aware routing to improve propagation delay prior to the prepending change on the 16th.
Figure 5.16: Time series showing diurnal opportunity to improve propagation delay (as measured by MinRTT P50) by shifting traffic to an alternate route; the plot shows MinRTT P50 [ms] from 2019-09-10 to 2019-09-14 for the route preferred by policy and the best alternate route, along with normalized traffic volume and periods with a 10ms+ statistically significant difference. The client-PoP group consists of clients in a large mobile provider’s network. Both the preferred and best alternate route between Facebook and these clients had an AS_PATH length of (2) and traversed a transit interconnection, and the clients were in a different continent from the serving PoP. We observe a roughly diurnal trend, with MinRTT P50 likely to increase during the peak hours on the preferred path; this pattern suggests queuing was likely occurring inside of the transit network used for the primary path, at the interconnection between the transit network and destination AS, or inside of a portion of the destination AS network that was only used by the preferred route. Since the increase in MinRTT P50 was isolated to the primary path, we can be confident that the increase was not caused by a change in client population and was not related to conditions in the last mile or access network. The increase was not directly correlated with Facebook’s traffic, suggesting that other traffic traversing the same route (including non-Facebook traffic) played a role in the congestion.
• Opportunities to improve MinRTT P50 may also arise due to temporary route changes (e.g.,
when the route typically used is unavailable or the set of alternate routes briefly changes).
Since the events are episodic, they suggest that congestion or path changes may have occurred
due to a failure/maintenance, and not a longstanding bottleneck that impacts traffic regularly.
Opportunity by route properties. Table 5.3 breaks down the traffic with opportunity for performance improvement (orange columns in Table 5.2) by the peering relationships of the preferred and alternate routes. A significant fraction of opportunity happens when the preferred and alternate routes have the same relationship (blue rows). In these cases, the alternate routes are often less preferred (i.e., not chosen) due to having a longer AS_PATH compared to the preferred route. We also inspect how often the alternate route has more prepending than the preferred route, as this may be a signal of ingress traffic engineering (perhaps the route is better performing, but capacity constrained) [86], meaning that the alternate route should be de-prioritized and thus is not a good candidate for improving performance. An additional fraction of opportunity is on traffic sent over private and public exchange peering links that shows better performance on transit (orange rows). Results are qualitatively similar for MinRTT P50 and HDratio P50, although transit providers account for more opportunity for improving HDratio P50.
RELATIONSHIPS         ABSOLUTE   RELATIVE   LONGER   PREPENDED

MinRTT P50 (§§ 5.3.1, 5.3.4 and 5.3.5)
Private → Private       .0118      .489      .449      .327
Private → Transit       .0046      .191      .147      .012
Public → Public         .0001      .003      .003      .002
Public → Transit        .0021      .086      .085      .000
Transit → Transit       .0026      .108      .048      .027
Others                  .0029      .122      .032      .025

HDratio P50 (§§ 5.3.2, 5.3.4 and 5.3.5)
Private → Private       .0003      .081      .066      .023
Private → Transit       .0014      .398      .391      .006
Public → Public         .0000      .001      .001      .000
Public → Transit        .0003      .091      .068      .003
Transit → Transit       .0012      .361      .015      .043
Others                  .0002      .068      .024      .004
Table 5.3: Opportunity to improve MinRTT P50 and HDratio P50 by relationship type of preferred and alternate routes (private interconnects, public IXPs, and transit providers). Blue rows correspond to opportunity over alternate routes of the same relationship type, and orange rows correspond to cases where a transit performs better than a peer. The absolute column shows the fraction of total traffic with opportunity, and the other columns show the fraction of opportunity in each relationship (the relative column adds up to 1 for MinRTT P50 and 1 for HDratio P50). We show the fraction of opportunity where the alternate route’s AS_PATH was longer than the preferred route, as well as the fraction where it was prepended more than the preferred route. Because Facebook’s routing policy prefers routes from peers over transits before considering AS_PATH length (§4.2.3.2), there are cases where a transit provider has a route with a shorter AS_PATH that also has better performance. Such cases can occur when the peering interconnection is not between Facebook and the destination AS, but instead between Facebook and an AS that provides connectivity to the destination AS. For instance, some of the ASes that Facebook maintains peering interconnections with may act as transit providers to other ASes on the Internet; in these scenarios, Facebook’s routing policy’s preference for routes through peering interconnections over transit interconnections is less likely to be effective. Such cases also illustrate one scenario in which a peering interconnection may not have better performance.
5.6.1.2 Are opportunities practical and realizable?
Our measurements indicate that there exist client-PoP-time aggregations for which an alternate
path provides superior performance for measurement traffic. However, this does not guarantee that
Facebook can realize such performance improvements. In this section, we experiment with and
discuss challenges in productionizing performance-aware routing.
Thus far we have not shifted traffic based on performance measurements. In addition, because
ROUTEPERF only steers a fraction of Facebook’s traffic via alternate routes (§5.2.2.2), the vast
majority of Facebook’s traffic remains on the primary route unless there is an EDGE FABRIC override.
As a result, we have only observed the performance of alternate routes when they are carrying a
small amount of Facebook’s traffic and we do not know if these routes have the capacity required to
carry all of Facebook’s traffic for a given destination. However, from operational experience we
know that Facebook’s high traffic volumes can congest even Tier-1 transit providers (footnote 41), and thus it is
entirely possible that shifting traffic based on performance measurements will result in downstream
congestion, in which case performance may degrade, potentially beyond that of the primary path.
Likewise, when an alternate route traverses a peering interconnection, the peer may not expect (or
be willing to carry) the sudden increase in traffic.
Footnote 41: The aggregate capacity of CDNs has grown faster than that of transit providers. For instance, in 2019 it was reported that while a content provider’s network capacity had grown 25x in recent years, a transit provider’s capacity had only grown 12x over the same time period [248].

Experimenting with performance-based overrides. An experiment we executed in January 2017 illustrates the potential for a route’s performance to change when Facebook shifts traffic. We
used seven days of ROUTEPERF measurements collected from four of Facebook’s PoPs (§4.2.1) — one
in North America (PoP-19), one in Europe (PoP-11), and two in Asia Pacific (PoPs-2, 16) — to
quantify primary and alternate route performance. At the time of the experiment in early 2017,
ROUTEPERF steered less of Facebook’s traffic than it did during our experiment in September 2019.
In addition, we captured sRTT instead of MinRTT, and we did not measure goodput, instead relying
on average retransmission rate to consider loss (footnote 42).
During the seven day measurement period, we collected over 350M alternate path measurements
to 20K ASes and an average of 8000 measurements per client-PoP group (footnote 43).
From these measurements, we identified 400 client-PoP groups for which an alternate route had an sRTT P50 that was at least 20ms lower (and retransmission rate was not worse) than the route preferred by Facebook’s routing policy. We configured EDGE FABRIC to prefer these alternate routes, causing EDGE FABRIC to inject overrides that shifted production traffic for these client-PoP groups to the alternate routes. We left these overrides in place for 24 hours and then evaluated performance measurements.

Footnote 42: We discuss the difference between sRTT and MinRTT and the value of measuring goodput in Section 5.3. Although we used a different set of metrics in our 2017 study, we do not expect that this had any significant impact on our experimental results given that our goal was to materialize the performance improvements observed with ROUTEPERF measurements.

Footnote 43: Since we cannot consider user country when injecting override decisions (§4.4.3), we do not include it in our definition of client-PoP group in this section.
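The candidate-selection step of this experiment can be sketched as follows, assuming a per-group summary of the seven days of ROUTEPERF measurements. The thresholds come from the description above; the field names and data layout are hypothetical.

    # Illustrative selection of client-PoP groups to override (January 2017
    # experiment).  Field names (preferred, alternate, srtt_p50, retrans_rate)
    # are assumptions for this sketch.

    SRTT_IMPROVEMENT_MS = 20   # alternate must be at least 20ms better

    def select_override_candidates(groups):
        candidates = []
        for g in groups:   # g summarizes one client-PoP group over seven days
            srtt_gain = g.preferred.srtt_p50 - g.alternate.srtt_p50
            loss_ok = g.alternate.retrans_rate <= g.preferred.retrans_rate
            if srtt_gain >= SRTT_IMPROVEMENT_MS and loss_ok:
                candidates.append(g)   # shift this group via an override
        return candidates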
Experiment results. We used performance measurements captured during the override period to compare the performance of the alternate path (now carrying the majority of traffic for each client-PoP group) versus the path preferred by Facebook’s routing policy (which was carrying only a small amount of traffic for each client-PoP group for ROUTEPERF measurements). Our analysis revealed that while some client-PoP groups experienced an improvement in performance — a successful
instance of performance-aware routing — other client-PoP groups experienced a degradation in
performance:
• For 45% of shifted client-PoP groups, sRTT P50 improved by at least 20ms, and for 28% sRTT P50 improved by 100ms or more.

• However, for 17% of shifted client-PoP groups, sRTT P50 degraded by at least 20ms, and for 1% sRTT P50 degraded by 100ms or more.
We speculate that overrides yielded worse performance due to a combination of two factors. First, a route’s performance is a function of the load placed onto it (footnote 44). It is possible that the route for a client-PoP group was able to provide better performance when it was used by Facebook to deliver the fraction of the client-PoP group’s traffic controlled by ROUTEPERF, but became congested — and thus yielded worse performance — after all traffic for the client-PoP group was shifted. Second, a route’s performance can change over time. Changes in cross-traffic or the internal routing decisions made by intermediate networks may have had performance implications.
Summary. Based on the results of this experiment, we conclude that a traffic engineering system that simply shifts traffic onto the best performing alternate route will likely cause congestion and risk oscillations. For example, if the system used in our January 2017 experiment had dynamically adjusted routing in response to performance measurements — as opposed to the static overrides based on seven days of data — there would have been potential for oscillations. Thus, incorporating performance into the decision process of systems like EDGE FABRIC (chapter 4) would require a more sophisticated control loop in which the controller would gradually shift traffic, pausing at each step to measure and assess if the shift caused congestion that degraded performance. If multiple large content providers were to operate such controllers, the controllers may interact in complex ways, presenting significant challenges in achieving fairness, stability, and convergence to a stable state.

Footnote 44: This is not a linear relationship — a route’s performance will only degrade if a link along the route begins to queue or drop packets because demand exceeds the link’s capacity. For instance, with EDGE FABRIC (chapter 4), Facebook’s interconnections can operate at 95% utilization with no packet loss.
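One possible shape of such a control loop is sketched below. It is only an illustration of the “shift, pause, measure” structure described above, with the step size, pause duration, and the measurement and traffic-splitting functions all chosen arbitrarily for the example; it is not a description of a deployed system.

    # Hypothetical incremental-shift loop for performance-aware routing.
    import time

    STEP_FRACTION = 0.1   # shift 10% of the group's traffic per step (assumed)
    PAUSE_SECONDS = 900   # wait one 15-minute measurement window per step

    def gradual_shift(group, alternate_route, baseline, measure, apply_split):
        """Shift traffic onto alternate_route in small steps, rolling back if
        measurements indicate the shift itself degraded performance."""
        shifted = 0.0
        while shifted < 1.0:
            shifted = min(1.0, shifted + STEP_FRACTION)
            apply_split(group, alternate_route, shifted)   # e.g., via overrides
            time.sleep(PAUSE_SECONDS)                      # let measurements settle
            perf = measure(group, alternate_route)
            if perf.worse_than(baseline):                  # likely congestion
                apply_split(group, alternate_route, 0.0)   # roll back the shift
                return False
        return True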
5.6.2 Comparing peer and transit performance
In Section 4.2.3 we discussed how Facebook’s routing policy prefers routes that traverse peering
interconnections. In this section, we compare the performance of routes by interconnection type to
examine the value of such a heuristic.
Figure 5.17 shows the distribution of performance differences between primary and alternate routes for the same client-PoP group for different groupings of route interconnection types. Given two relationships r1 and r2, we consider client-PoP-time aggregations where the preferred route is of type r1 and there is at least one alternate route of type r2. As before, we compare the difference in MinRTT P50 and HDratio P50 and ignore time windows for which we cannot compute “tight” confidence intervals (10ms for MinRTT P50 and 0.1 for HDratio P50). Whereas our analysis of opportunity considered the best-performing alternate route, here, if multiple alternate routes of the same relationship type are available, we pick the most preferred one based on Facebook’s routing policy (§4.2.3).
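The pairing logic for this comparison can be sketched as follows; the relationship labels and the policy_rank attribute are hypothetical and stand in for Facebook’s actual policy ordering.

    # Sketch: select the (preferred, alternate) pair to compare for a given
    # combination of relationship types r1 and r2.  Attribute names are
    # illustrative assumptions.

    def pair_for_comparison(aggregation, r1, r2):
        preferred = next(r for r in aggregation.routes if r.is_preferred)
        if preferred.relationship != r1:
            return None
        alternates = [r for r in aggregation.routes
                      if not r.is_preferred and r.relationship == r2]
        if not alternates:
            return None
        # Unlike the opportunity analysis, pick the alternate that the routing
        # policy ranks highest, not the best-performing alternate.
        return preferred, min(alternates, key=lambda r: r.policy_rank)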
Figure 5.17: Difference in MinRTT P50 (panel a) and HDratio P50 (panel b) between the preferred route and the alternate route for different groupings of primary and alternate route interconnection types (peering vs transit, transit vs transit, private vs public), weighted by traffic. For instance, the “peering vs. transit” line compares performance of these two route types for client-PoP groups for which the primary route traverses a peering interconnection and there is a transit route available as an alternate route. Likewise, the “transit vs. transit” line compares the performance of these two route types for client-PoP groups for which the primary route is a transit route and a transit route is available as an alternate route. If multiple alternate routes of the same relationship type are available, we pick the most preferred based on Facebook’s routing policy.
The distributions for MinRTT P50 in Figure 5.17a are concentrated around x = 0, indicating that differences are frequently small. However, some routes through peering interconnections significantly outperform alternate transits, as 10% of traffic has peer routes with at least 10ms better MinRTT P50 than alternate transits. All distributions are also shifted to the left, particularly when comparing peering vs transit. This means that transit rarely has better MinRTT P50, which is intuitive as peer routes are usually direct (i.e., have an AS_PATH length of 1). The distribution for transit vs transit is less skewed, but the preferred transit routes (i.e., with either equal or shorter length compared to the less preferred transit route) are better than alternate transits slightly more often than not, suggesting that shorter routes correlate with better performance and giving weight to the heuristics used in Facebook’s routing policy (§4.2.3). The private vs public line shows that some IXP peers might present an opportunity to improve performance. In all these cases, however, utilizing the alternate route in practice requires solving the challenge of avoiding congestion and oscillations (§5.6.1.2).

Results for HDratio P50 in Figure 5.17b show that the difference between peering and transit is concentrated around x = 0 (comparable performance) and mostly symmetrical, indicating that cases where peering outperforms transit occur as often and by as much as cases where transit outperforms peering. The HDratio P50 differences for transit vs transit are qualitatively similar to the MinRTT P50 differences (small differences, and slightly skewed to the left).
5.7 Conclusion
This chapter investigates if the rich connectivity of Facebook’s points of presence yields good
performance for Facebook’s users, and if incorporating real-time performance measurements into
EDGE FABRIC’s routing decisions could provide further benefit.
We began by identifying aspects that make it challenging to measure performance from Face-
book’s existing production traffic. In response, we develop a novel approach to estimating the
probability that a connection between a user and a Facebook PoP can support a given goodput.
Equipped with a robust measurement methodology and dataset, we find that Facebook’s CDN is
able to provide good performance for most user sessions, although there are regional variances.
Next, we used the footholds developed during our design of EDGE FABRIC to evaluate the
potential utility of performance-aware routing. Our analysis reveals that there is limited opportunity
to improve performance by incorporating performance measurements into EDGE FABRIC’s decision
process; the decisions made by Facebook’s routing policy are largely optimal. In addition, we find
that incorporating performance measurements into a controller’s decision process is non-trivial due
to the potential for oscillations.
From our analysis, we conclude that today’s CDNs are able to provide good performance for
the vast majority of traffic and users. By establishing points of presence around the world with
rich connectivity, CDNs have sidestepped longstanding problems that have traditionally degraded
performance.
Chapter 6
PEERING: Virtualizing BGP at the Edge for Research
6.1 Introduction
A number of longstanding Internet problems centered around performance, availability, and security
can be attributed to fundamental issues in BGP’s design (§2.4), and the flattening of the Internet
raises new questions, challenges, and opportunities. For instance, in Chapter 3 we speculate that it
may be easier to make progress on some of these problems if we limit the focus of our solutions to
the paths that carry the majority of Internet traffic on today’s flattened Internet, while in Chapters 4
and 5 we examine opportunities and challenges CDNs face on the flattened Internet.
However, it has historically been difficult for researchers to make progress on such topics, in part
because BGP does not lend itself well to supporting experimentation: BGP is an information hiding
protocol and thus provides little visibility into the connectivity and routing policies of networks on
the Internet [65, 466]. As a result, emulation and simulation cannot accurately model the Internet due
to the lack of transparency provided by BGP and the proprietary nature of routing policies, and tools
235
that provide visibility into the current state of the routing ecosystem do not facilitate much needed
interaction with the ecosystem (§2.5.3, [272]). The flattening of the Internet further reduces the utility
of existing tools given that they have limited visibility into the peering interconnections between
CDNs and end-users that now carry the bulk of the Internet’s traffic (chapter 3 and §2.5.3, [319]).
To gain better insight into problems and how solutions will perform, experiments need to interact
with and affect the Internet’s routing ecosystem. Such interaction would require researchers to take
control of a real production Autonomous System (AS) and its connectivity, routing policies, and
traffic. However, few network administrators are willing to allow experimentation on a production
network due to the potential wide-ranging, negative effects [357], and few researchers have the
resources required to deploy a network with a footprint similar to that of a CDN.
Multiplexing and virtualization have repeatedly come to the rescue in similar scenarios to
provide researchers with access to necessary resources. For instance, EmuLab, CloudLab, and
XSEDE provide access to compute resources [96, 123, 463], PlanetLab shares machines around
the world [331], and FlowVisor enables multiplexing of layer 2 networks [386]. However, no such
platform exists to support the needs of Internet research, and two challenges stand in the way of the
development of such a platform.
First, multiplexing control of a BGP router’s interactions with other networks introduces control
and security challenges. A BGP router applies policy and makes routing decisions locally, routes
all traffic to a destination via a single “best” route, and only informs other routers of (at most)
that single option which limits visibility of available connectivity. For an experiment to change
a policy or decision would traditionally require manually modifying the router’s configuration;
granting that ability is equivalent to giving root access to experiments, which is untenable from a
security perspective. Second, operating a community AS and platform that enables turn-key Internet
routing research presents a number of operational challenges. Infrastructure and tooling must be
developed to model and actualize the configurations required to support experiments, maintain
safety, and manage interconnections — all of which are key to enabling the platform to safely scale.
The platform must also be able to safely evolve over time as researchers identify new capabilities
required to execute experiments.
In this chapter, we build a framework that enables virtualization and multiplexing of a production
BGP router, and then use that framework to build a community platform that provides researchers with turn-key control of a globally interconnected autonomous system on the real Internet. We
make two contributions:
We design VBGP, a framework for virtualizing the data and control planes of a BGP router.
Akin to a hypervisor multiplexing resources across VMs, VBGP (§6.3) virtualizes a router’s data
and control plane interactions with other networks, delegating them to multiple experiments (§6.3.2).
It provides control and visibility equivalent to if each experiment had its own (non-virtual) router
with a BGP session to each neighbor, and provides safety by interposing between experiments and
the Internet on both planes (§6.3.3).
VBGP is the first approach to delegate control of a BGP router to experiments running as BGP
routers themselves, including the ability for parallel experiments to specify routing decisions at a
per-packet level. We use a novel combination of IP and layer 2 manipulation and intradomain BGP
advertisements to delegate PEERING’s data and control plane interfaces to experiments (§6.3.2).
Because the mechanisms used are protocol compliant, they are fully compatible with existing
routers and BGP implementations: experiments that run on VBGP are directly transferable to native
networks, and vice versa.
We use VBGP to build PEERING, a globally distributed AS open to the research community.
PEERING has routers at 15 points of presence. Each PEERING Point of Presence (PoP) (§6.4.2)
connects with at least one AS on the Internet, a subset of PoPs are connected to tens or hundreds of
other ASes via the shared fabric of a public IXP (§2.2.2), and a subset are interconnected via a
backbone network (§§ 6.4.3 and 6.4.3.3). PEERING provides experiments with turn-key access to a
global AS (§§ 6.4 and 6.4.5) with connectivity qualitatively similar to that of a CDN and is capable
of supporting any exchange of routes and traffic that an experiment could perform with dedicated
control over the PEERING infrastructure. PEERING employs strict security policies on both the data
and control planes to prevent experiments from disrupting the Internet (§6.4.6). We employed a
principled approach to development, testing, deployment, and configuration management that eases
operation and supports extensibility (§6.5), and our current software stack can be deployed at even
the largest public IXPs for the foreseeable future on off-the-shelf servers (§6.6). To date, PEERING
has supported 24 publications (§6.7, [20, 21, 47, 48, 49, 50, 137, 142, 200, 263, 288, 297, 323, 347,
366, 378, 381, 392, 397, 406, 411, 412, 413, 439]).
6.2 Goals and Key Challenge
6.2.1 Design goals
Our overarching goal is the design, implementation, and deployment of a platform for routing
experimentation that can delegate control of a real AS to researchers, allowing them to exchange
routes and traffic with real networks on the Internet. We decompose this high-level goal into the
following subgoals:
Maintain safety. An experiment should be prevented from disrupting other experiments, the
platform, and, critically, the broader Internet. BGP allows for disruptive behaviors including prefix
hijacks, route leaks, interception attacks, blackholes, routing oscillations, and spoofed traffic. It is a
challenge even for experts to design BGP configurations that operate as intended [119, 141, 285], so
the platform must prevent even well-intentioned experiments from causing problems.
Allow parallel experiments. To support long-running studies, iteratively-refined experiments,
and the synchronized demand before conference submission deadlines, the platform should support
parallel experiments while isolating them. It should do so without compromising the degree of
control given to experiments, and without requiring coordination between experiments or adminis-
trators.
Multiplex and virtualize BGP, instead of abstracting away from it. An autonomous system
establishes interconnections and BGP sessions with other networks, and then employs a routing
engine to exchange routes, define routing policies, and make routing decisions (§2.1.1). Almost any
experiment will need to exchange routes and traffic, and the platform should enable experiments to do
so using the same technologies that production networks use today. Maintaining such compatibility
enables researchers to use existing tools to execute experiments. For instance, an experimenter
can use a software router — such as Quagga [222] or BIRD [427] — to establish BGP sessions,
exchange routes, and enact routing decisions.
Support a wide range of experiments. Since we cannot anticipate the full range of experiments
researchers may want to run, our goal is a flexible platform that provides researchers with the same
control over the data and control planes as they would have operating their own network (subject to
the safety requirements). The platform should be able to support the following (and more):
• Supporting experiments that use existing routers, to allow fidelity and transitioning of experi-
ments from the platform to non-multiplexed environments.
• Allowing experiments to host services (e.g., an HTTP or DNS server) which are accessible
from the Internet.
• Supporting settings qualitatively similar to content or cloud providers, with PoPs at geographi-
cally diverse locations, including Internet eXchange Points (IXPs) with many interconnections,
and a backbone that interconnects PoPs with data centers, since this setting is increasingly
important to academia [186, 187] and industry (chapters 3 and 4). This complex setting will
also suffice for a range of experiments not specific to cloud providers.
6.2.2 Challenge: native delegation with BGP and IP
Achieving our high-level goal requires developing an approach for multiplexing and delegating
control of an AS to experiments. In developing our approach, we consider that (1) the interface
between the platform and the Internet must be BGP and IP, since those are the protocols used by
every AS, and (2) to flexibly support a wide range of experiments, experiments should be able to
perform any (safe) action that they could do with direct control of an AS using standard protocols,
and should be able to use standard routing implementations.
As such, we posit that the interface between experiments and the platform should also be BGP
and IP: an experiment should get visibility by receiving BGP announcements and IP traffic, and it
should control its announcements and route its outgoing traffic just as it would with direct control of
the router.
However, the design of BGP presents a number of challenges to using it as an interface — in
particular, BGP does not natively support multiplexing or delegating control (§2.4). Understanding
how BGP’s design complicates multiplexing and delegation requires understanding the basic design
of BGP, how BGP makes routing decisions, how those decisions impact IP forwarding, and the
interactions between BGP speakers (§2.1). In this section, we discuss these topics in detail. In
Section 6.2.3 we discuss why, despite these challenges, native delegation is a superior option to
other approaches, such as custom protocols and out-of-band interfaces.
To understand why delegating control without use of a separate interface is challenging, consider the scenario illustrated in Figure 6.1, in which a single edge router (E1) has two neighbors (N1 and N2). The envisioned platform should be able to support delegating visibility and control to two experiments (X1 and X2). Further, our design should be able to accommodate various types of experiments without having to customize an interface for each. For instance, in our example scenario, X1 is an experiment using a standard software router and making BGP announcements to uncover backup routes [20], and X2 is evaluating the benefits of a more sophisticated routing control system, such as Espresso [469], and thus requires flexible per-packet forwarding.

Figure 6.1: Basic scenario for what our platform should support: two parallel experiments (X1, X2) competing to use the resources of a single BGP edge router (E1). E1 connects to two neighbor routers (N1, N2). Both X1 and X2 announce prefixes, while N1 and N2 announce a path for the same prefix and E1 selects N1’s path. In the figure, X1 is a router and X2 is a controller (e.g., Espresso); X1 announces 10.1.0.0/24, X2 announces 10.2.0.0/24, and N1 and N2 each announce a path for 192.168.0.0/24.
Challenge: Controlling announcements. In our example, each experiment is assigned a prefix to announce, 10.1.0.0/24 for X1 and 10.2.0.0/24 for X2. If each experiment had direct control of E1, it would be able to define policies to control what it announced to each neighbor on a per-prefix basis. For instance, experiment X1 could manipulate the AS_PATH to perform prepending [86] or BGP poisoning [62] for announcements forwarded to N1, and perform a different set of manipulations (or none at all) for announcements forwarded to N2. Likewise, X1 could decide to only announce a route to a subset of neighbors (e.g., just N1).
However, standard BGP advertises at most a single path for each destination to neighbors, and thus X1 can only advertise a single route to E1. By default, controlling announcement propagation or modifying announcement attributes would require configuration changes at E1, which does not meet our goal of providing experiments with dynamic control.
Challenge: Controlling packet forwarding. In our example, both neighbors announce a route to the same destination (192.168.0.0/24). Again, if each experiment had direct control of E1, it could define policies to control which route is used. For instance, experiment X2 could choose to send a subset of its traffic via the route provided by N1, and the rest via N2. However, per BGP’s default behavior, E1 will select a route (in this case, the route through N1), forward only this route to experiments, and route all traffic via the chosen route. There is no native mechanism in BGP that can be used to allow earlier hops (such as X1 or X2) to signal how they want E1 to route traffic to the destination.
The ADD-PATH option [453] solves part of this problem, but is not a complete solution. While standard BGP advertises at most a single path to neighbors, ADD-PATH allows a BGP speaker to advertise multiple routes. However, ADD-PATH does not provide a method for experiments to override E1’s local decision to forward all traffic destined for the Internet via N1, and thus while ADD-PATH extends visibility, it does not delegate control. This is because ADD-PATH is primarily intended for scenarios where there is value in learning multiple routes, such as when an aggregator (a route reflector or route server) collects each route from a different router to pass on as a collection. In such a scenario, the aggregator is on the control plane but not the data path, and a route is selected by forwarding traffic on a distinct data path. This scheme does not work for our scenario, where E1 must be on the data path for both routes. It is not feasible to deploy a distinct router for each neighbor, especially at IXPs with hundreds of neighbors.
6.2.3 Alternative approaches to delegation
The previous section proposed using native BGP and IP to multiplex and delegate control to experi-
ments and explored the associated challenges. In this section, we explore alternative approaches and
explain why we do not choose to pursue them.
Option: Provide direct control of PEERING router configuration. By far, the simplest ap-
proach to delegate a router would be to give experiments direct access to the router’s configuration
interface. This approach, however, makes it impossible to guarantee safety and significantly com-
plicates execution of parallel experiments. Having administrators configure routers on behalf of
experiments addresses some of these concerns, but does not scale to shared environments and makes
it impossible to run experiments that require dynamic control of routes or traffic.
Option: Provide indirect control of PEERING router configuration. Providing indirect control
of the router via a separate protocol can address some of the concerns around security and control,
but such protocols would be incompatible with existing routing components. For instance, the
platform could require that experiments communicate their routing decisions using OpenFlow; prior
work provides a foundation for multiplexing control of forwarding decisions with OpenFlow [386].
However, BGP routing engines (e.g., BIRD, Quagga, and hardware routers) expect to receive routes
via BGP and then enact their decisions via local mechanisms (e.g., BIRD programs the Linux kernel
via Netlink [364]). Requiring decisions to be enacted via OpenFlow or other non-standard interfaces
would necessitate either the complex modification of existing routing engines, or development of a
custom routing engine, both of which would decrease fidelity and neither of which is practical. In
addition, BGP (or yet another protocol) would remain necessary to exchange route information.
Option: Use tags (e.g., MPLS) and/or data-plane tunnel per option. Another approach in-
volves using tags or tunnels to signal routing decisions. For example, Google’s Espresso encapsu-
lates packets with MPLS to convey which route should be used [469], and Transit Portal attempted
delegation by having clients maintain multiple VPN tunnels, each corresponding to a single BGP
neighbor (sending traffic via a specific tunnel would send it to the corresponding neighbor) [444].
However, these approaches introduce additional complexity and may not be supported by existing
routing engines. For instance, using MPLS labels requires a separate label redistribution protocol,
an MPLS enabled kernel (only recently available [438]), and a routing engine that supports MPLS
(not natively supported by BIRD or Quagga). Likewise, using tunnels requires communicating a
mapping of tunnel to BGP neighbor (or tunnel to route) via an out-of-band protocol and necessitates
the use of a custom routing engine to select and install routes (existing routing engines do not
support such mappings), or manual installation of routes.
Summary. None of the approaches discussed are capable of supporting our design goals of ensuring safety and enabling parallelism while also allowing the platform to scale and supporting the use of standard routing implementations. In addition, the approaches considered only address delegation of the data plane; BGP or other custom protocols would remain necessary for the control plane. We conclude that maintaining conformity and compatibility by devising solutions to support delegation natively with existing BGP and IP is ideal given the overhead and challenges these alternate approaches present.
6.3 Virtualizing the Edge with VBGP
To address the challenges of multiplexing control of a single BGP router for multiple experiments, we
present VBGP, a framework to virtualize the data and control planes of a BGP router by providing (1)
mechanisms for delegating complete visibility and control of data and control planes to experiments
and (2) an architecture capable of enforcing sophisticated security policies required to prevent
experiments from performing unsafe actions. Analogous to hypervisors in other virtualization
domains, VBGP multiplexes experiments over the same BGP router to support parallel experiments;
provides safety by isolating experiments from the underlying router and each other, and interposing
on experiment interactions with the rest of the Internet; and exposes data and control plane interfaces
to experiments that are equivalent to having sole control over the router’s BGP process (akin to a
hypervisor exposing x86).
We use VBGP at all PoPs in our implementation of PEERING, described in §6.4. VBGP is
generalizable and compatible with hardware or software routers; our deployment instantiation of
VBGP runs atop Linux and uses an open-source software router.
6.3.1 Key design decisions
Three architectural decisions are key to realizing our goals:
Untether experiment logic from router infrastructure (§6.3.2) Virtualization for experimen-
tation traditionally involves partitioning a machine’s resources and then granting an experimenter
control of a partition. In comparison, VBGP virtualizes a router’s data and control plane decisions
and delegates them to a system under the control of the experimenter. Decoupling experiment logic
from the router enables VBGP to support a variety of experimental setups, letting researchers enact
their experiment logic via hardware or software routers or SDN controllers, at their university, in
the cloud, or in a container on the VBGP router.
Devise protocol-compliant mechanisms to natively provide complete data and control plane
visibility and control (§6.3.2) As stated in §6.2.2, our interface between experiments and VBGP
should be BGP and IP, since this is the interface any native (non-virtualized) experiment would have
with the Internet. We developed mechanisms within BGP and layer 2 protocols to let experiments
innately convey their decisions to VBGP without modifying protocol logic, or using out-of-band
communication. From the perspective of an experiment, interactions with VBGP “just work” as
they would if the experiment was the edge router, exposing the complete range of (security-policy-
compliant) data and control plane interactions with the rest of the Internet.
Interpose on experiment data and control plane activity (§6.3.3) Experiments can exchange
data and control plane traffic with the Internet, so VBGP must take measures to mediate between
experiments and the Internet to enforce security policies, prevent dangerous activity, and perform
logging necessary for attribution [207]. In particular, VBGP must be able to (1) enforce a wide
range of security policies that are atypical (due to our use case) on the data and control planes
and (2) intercede in ways that let us prevent prohibited activities without otherwise affecting
experiment control and capabilities. Supporting these requirements is challenging, as our security
policies require functionality beyond what is provided by existing router policy frameworks. Our
solution is platform-specific software that interposes between experiments and the Internet on both
the data and control planes, separately, to enforce security.
6.3.2 Delegation to experiments
VBGP delegates both the data and control plane decisions of a BGP edge router to experiments,
which are logically (and can be physically) separate from the VBGP router. The control plane
mechanisms we employ involve adapting existing BGP mechanisms for our setting and are not
particularly groundbreaking on their own. However, they combine with our novel data plane
mechanisms to delegate control in a way that addresses longstanding limitations in BGP routing.
6.3.2.1 Delegating the control plane
BGP has no intrinsic mechanisms for delegating visibility or control. We adapt two mechanisms,
one for inbound announcements from the Internet and one for outbound announcements from
experiments.
Announcements from the Internet to experiments A BGP router may receive routes for the
same destination from multiple BGP neighbors. For instance, Figure 6.1 shows E1 receives a route from both N1 and N2, but E1 would only forward its preferred route to X1 and X2, limiting their visibility (§6.2.2).
VBGP uses the BGP ADD-PATH extension [453] to send each experiment all received routes
within a single BGP session. As a result, experiments see multiple routes coming from the VBGP
node, as depicted in Figure 6.2a.
Announcements from experiments to the Internet VBGP delegates control of which BGP
neighbors an experiment’s announcement will propagate to through the use of BGP communities.
Communities are labels that a router can attach to a BGP announcement [82]. VBGP defines
whitelist/blacklist BGP communities for neighbors at every PoP, and experiments label prefix
announcements with communities that specify whether or not to announce the prefix to specific
neighbors. If no communities are attached, VBGP forwards the announcement to all neighbors.
Experiments can couple this control with BGP ADD-PATH to send different announcements for the same prefix to different neighbors to support more sophisticated policies, such as those described in §6.2.2. For example, in Figure 6.1, X1 can announce an update for its prefix, 10.1.0.0/24, with AS_PATH prepending and tagged with a community to export the announcement only to N1. The experiment can also make an announcement for the same prefix, without any prepending, tagged with a community to export only to N2.
Communities could be used to signal behavior such as prepending or poisoning, but to maximize
flexibility and realism we chose to allow experiments to directly announce the routes they want.
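As an illustration, the pair of announcements in this example could be represented as follows before being sent to the VBGP router over a single BGP session. The community values, ASN, and dictionary layout are hypothetical placeholders; the actual per-neighbor whitelist/blacklist communities are defined by VBGP at each PoP.

    # Illustrative encoding of X1's two announcements for 10.1.0.0/24 from the
    # example above.  Community values and ASN are placeholders, not the
    # communities VBGP actually defines.

    EXPORT_ONLY_TO_N1 = (64511, 11)   # hypothetical whitelist community for N1
    EXPORT_ONLY_TO_N2 = (64511, 12)   # hypothetical whitelist community for N2

    def build_announcements(local_asn=64500, prefix="10.1.0.0/24"):
        return [
            {   # prepended announcement, exported only to N1
                "prefix": prefix,
                "as_path": [local_asn, local_asn, local_asn],
                "communities": [EXPORT_ONLY_TO_N1],
            },
            {   # unmodified announcement, exported only to N2
                "prefix": prefix,
                "as_path": [local_asn],
                "communities": [EXPORT_ONLY_TO_N2],
            },
        ]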
6.3.2.2 Delegating the data plane
Routing traffic to the Internet Although BGP ADD-PATH provides X1 and X2 with visibility of all BGP routes and updates at E1, it alone does not empower the experiments to control which route is used for traffic; outgoing traffic from X1 and X2 remains subject to the routing decision(s) made at E1 based on E1’s configuration (§6.2.2, footnote 1).

Footnote 1: We have found that there is sometimes confusion over whether BGP communities can be used to address this challenge. BGP communities can be used to signal policies for routes announced by experiments, not to select which route (received from an upstream neighbor) to use for forwarding traffic from an experiment to a neighbor.
We want experiments to be able to control how their traffic is routed in a manner that “just works,”
and thus do not want to introduce additional protocols, encapsulation, or out-of-band communication
that is incompatible with existing routers, BGP implementations, and production tooling (§6.2.3).
Realizing a solution that operates within these constraints is non-trivial; previous work claimed
that “forwarding traffic on different paths requires the data packets to carry an extra header or label”
[196].
Our key insight is that routers already add an extra header to packets: the layer 2 header. We develop a technique that encodes an experiment’s decision of how to route a packet via the layer 2 header and results in existing routers automatically encoding the decision via their normal behavior. While the technique only conveys the decision across a single layer 2 domain, our target setting naturally bridges a single domain, from an experiment router to a VBGP router. In §6.4.3.3 we extend the technique across multiple domains, although it is still not completely general.

Figure 6.2: Control plane (a) and data plane (b) delegation in VBGP. Figure 6.2a shows how VBGP overwrites BGP next-hops to delegate control to experiments: the next-hops for announcements from the neighbors N1 and N2 (1, 2) are rewritten to IP addresses that are local to E1 (3, 4). Figure 6.2b shows how VBGP forwards packets from experiments: X1 prefers to route via N2 (5), so when sending a packet to 192.168.0.1, it first ARPs for the MAC of the next-hop (6), to which E1 responds with the MAC it locally assigned to N2 (7). When the frame arrives at E1 (8), it knows based on the destination MAC (DMAC) to look up the route in its local routing table for N2 (9, 10).
Understanding our approach requires considering how a router typically enacts forwarding
for its choice of a route. Although platforms can optimize the process to minimize lookups, the
process is generally as follows: a router performs a lookup on the packet’s destination to find its
preferred route, which maps to a next-hop IP address from the route’s BGP announcement. The
router forwards the packet towards the next-hop, which it must be able to reach without BGP (e.g.,
directly connected or via an IGP). Typically, a router announcing BGP routes to a neighbor either
keeps the next-hops unchanged (if the neighbor can reach them, which they cannot in the VBGP
setting) or sets them all to a specific local IP address (e.g., a loopback address).
In our design, VBGP systematically modifies next-hop IP addresses and manipulates layer 2
interactions in a manner that results in the experiment’s routing choices being naturally conveyed
per packet to the router. Specifically, VBGP assigns distinct private IP and MAC addresses for each
BGP neighbor. It also maintains one routing table per BGP neighbor. As depicted in Figure 6.2a,
when a VBGP router receives an announcement from a neighbor (1, 2), it stores the route in the table for the neighbor, then rewrites the next-hop to the IP address it assigned to the neighbor before exporting the route to experiments (3, 4).
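The rewrite step can be made concrete with a small sketch. The Python fragment below is illustrative only: the class name, data structures, and the 127.65.0.0/16 alias pool follow the example in Figure 6.2, but this is not PEERING's implementation, which realizes the same logic inside BIRD and the Linux kernel.

```python
# Illustrative sketch: allocate a local alias IP per BGP neighbor and rewrite
# next-hops before exporting routes to experiments. Names and the address
# pool are assumptions, not PEERING's actual code.
import itertools

class NeighborAlias:
    def __init__(self):
        self._pool = (f"127.65.0.{i}" for i in itertools.count(1))
        self.alias = {}   # neighbor next-hop IP -> local alias IP
        self.table = {}   # local alias IP -> {prefix: original route}

    def _alias_for(self, neighbor_ip):
        if neighbor_ip not in self.alias:
            local_ip = next(self._pool)
            self.alias[neighbor_ip] = local_ip
            self.table[local_ip] = {}
        return self.alias[neighbor_ip]

    def import_route(self, prefix, next_hop, as_path):
        """Store the route in the per-neighbor table and return the
        announcement exported to experiments (next-hop rewritten)."""
        local_ip = self._alias_for(next_hop)
        self.table[local_ip][prefix] = {"next_hop": next_hop, "as_path": as_path}
        return {"prefix": prefix, "next_hop": local_ip, "as_path": as_path}

vbgp = NeighborAlias()
print(vbgp.import_route("192.168.0.0/24", "1.1.1.1", ["N1"]))
# {'prefix': '192.168.0.0/24', 'next_hop': '127.65.0.1', 'as_path': ['N1']}
print(vbgp.import_route("192.168.0.0/24", "2.2.2.2", ["N2"]))
# {'prefix': '192.168.0.0/24', 'next_hop': '127.65.0.2', 'as_path': ['N2']}
```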
When an experiment selects a route to send a packet towards a destination, it resolves the next-
hop’s MAC address, just as any BGP router would to forward a packet. The VBGP instance offering
the next-hop responds to an ARP or NDP query with the MAC, and the experiment forwards
a layer 2 frame containing the packet to that MAC. Since the process is identical to standard
BGP forwarding, the experiment can use a standard software or hardware router (X1) or a more
sophisticated controller that uses BGP to interface with the Internet (X2). Once the VBGP router
receives the frame, it inspects the destination MAC to determine which BGP neighbor’s route the
experiment selected. VBGP then routes the packet using the table corresponding to the neighbor.
Figure 6.2 illustrates the process. In Figure 6.2a, X1 and X2 receive routes from E1 with next-hops of 127.65.0.1 and 127.65.0.2, which correspond to neighbors 1.1.1.1 (N1) and 2.2.2.2 (N2), respectively. X1 has configured policy to prefer routes to the destination network via N2. In Figure 6.2b, when X1 wants to forward a packet to 192.168.0.1, it looks up the next-hop in its routing table (5) and sends an ARP query for the next-hop (6), equivalent to what it would do if directly connected to N2. E1 responds with MAC(127.65.0.2) (7), which X1 sets as the destination for its frame (8). Upon receipt of the frame, E1 uses that MAC to determine which routing table to use (9). E1 performs a lookup in the routing table corresponding to N2 (10) and forwards the packet to next-hop 2.2.2.2 (11).
Although X2 uses a more sophisticated process to decide which route to use (such as deciding
per application), it still uses BGP to exchange routing information and performs the same process as
X1 to encapsulate a packet within a frame to forward. Because all routing decisions are delegated to
experiments, the VBGP node does not need to make any routing decisions of its own.
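The per-packet behavior at the VBGP node is equally simple in concept. The sketch below only illustrates the lookup logic (destination MAC selects the per-neighbor table, then a normal longest-prefix match yields the real next-hop); the addresses and MACs follow Figure 6.2, and production VBGP implements this with kernel routing tables rather than Python.

```python
# Illustrative sketch of VBGP's per-packet forwarding decision: the frame's
# destination MAC selects the per-neighbor routing table, and a longest-prefix
# lookup in that table yields the real next-hop. Values follow Figure 6.2.
import ipaddress

mac_to_neighbor = {
    "11:11:11:11:11:11": "N1",
    "22:22:22:22:22:22": "N2",
}
neighbor_tables = {
    "N1": {"192.168.0.0/24": "1.1.1.1"},
    "N2": {"192.168.0.0/24": "2.2.2.2"},
}

def forward(dst_mac, dst_ip):
    table = neighbor_tables[mac_to_neighbor[dst_mac]]
    dst = ipaddress.ip_address(dst_ip)
    # Longest-prefix match over the table selected by the destination MAC.
    best = max((p for p in table if dst in ipaddress.ip_network(p)),
               key=lambda p: ipaddress.ip_network(p).prefixlen)
    return table[best]

# X1 preferred N2, so it ARPed for 127.65.0.2 and used N2's locally-assigned MAC.
print(forward("22:22:22:22:22:22", "192.168.0.1"))  # -> 2.2.2.2
```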
Routing traffic to experiments VBGP forwards traffic received from neighbors towards the experiment announcing the corresponding address space. Normally, the source MAC address of the frame that arrives at the experiment would be MAC(E1), not the MAC address of E1's neighbor that delivered the traffic. To provide experiments with visibility into which neighbor delivered the traffic, VBGP rewrites the source MAC address of each packet received from a neighbor with the MAC address it assigned to the neighbor, e.g., MAC(127.65.0.2).
6.3.2.3 Summary of contribution
The delegation provided by VBGP does not follow from the simple combination of existing components. The building blocks used in VBGP are commonly known; for instance, VBGP relies on functionality provided by BGP ADD-PATH, BGP communities, and policy-based routing to enable delegation. However, these building blocks have not previously been synthesized to achieve the same goals, and the delegation VBGP provides does not follow from their simple combination.²
Through our design of VBGP, we demonstrate that delegation is possible without requiring
additional protocols or mechanisms. Prior work concluded that delegation would require addi-
tional protocols or mechanisms outside BGP, such as those discussed in Section 6.2.3 [196]. Our
design of VBGP demonstrates that this is not true. By carefully considering layer 2 interactions that
already occur for a BGP session, and by manipulating these interactions through changes localized
to the VBGP node and widely supported through existing building blocks, we demonstrate that
delegation can be achieved without requiring custom software, encapsulation, or changes to protocol
headers; VBGP thus supports experiments using existing hardware and software routers and modern
² Section 2.4.3 describes why Policy-Based Routing alone is insufficient.
Figure 6.3: Logical locations of the enforcement engines as they interpose on the data and control planes
between an experiment and the Internet.
controllers that speak BGP. VBGP has operational uses outside of PEERING and influenced the
design of part of Facebook’s BGP control system (§§ 4.5.1, 5.2.2.2 and 6.7.2).
6.3.3 Security and isolation
In order to maintain safety and isolation, VBGP supports limiting experiment data and control
plane activity based on any discretionary stateful or stateless policy, not just those supported by
conventional routers. This approach supports more sophisticated policies that balance experimenter
control with the need to maintain safety, enables evolution of policy to account for new capabilities
or concerns, and allows capabilities to be enabled on a per-experiment basis, in keeping with the
principle of least privilege.
Policy enforcement architecture VBGP uses policy enforcement engines that operate alongside
the routing engine and interpose on all experiment activities. The engines have non-volatile storage
to maintain state.
VBGP separates policy enforcement from the router for two reasons. First, most router im-
plementations can only support a limited set of policies. Decoupling the enforcement engine and
implementing it separately allows VBGP maximum flexibility in the policies it supports, including
stateful policies, and ensures VBGP is not tied to a specific router implementation. This allows
VBGP to use a variety of industry standard, hardened software and hardware routing engines to
communicate with neighbors without being limited by the routing engine’s policy capabilities.
Second, it is difficult to validate the correctness/behavior of policies enforced by traditional
router implementations; testing frequently requires setting up an emulated network with multiple
BGP routers to create the desired test conditions [36, 129]. In comparison, we can validate the
behavior of our decoupled implementation using unit tests that inject test conditions. Figure 6.3
depicts where the data and control plane enforcement engines fall logically in the VBGP architecture.
Control plane enforcement The enforcement engine receives all routes announced by exper-
iments from the router, evaluates whether each route is policy-compliant, and announces only
compliant routes back to the router. The router only forwards announcements received from the
enforcer to its neighbors. Our implementation uses ExaBGP [125], which is a BGP engine that
allows execution of Python code inside the BGP pipeline. We capture the policy in Python, allowing
a great deal of flexibility in what the administrator of the VBGP instance can enact and facilitating
easier testing. For instance, state can be synchronized among VBGP instances to enable AS-wide
policies, such as limiting the total number of times a prefix can be announced or withdrawn across
all PoPs during a 24 hour period. Our current policies are defined in Section 6.4.6.
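A minimal sketch of such a policy process is shown below. It assumes ExaBGP's JSON API (route messages arrive as JSON lines on stdin, and announcements are emitted as text commands on stdout); the message fields, the allowed-prefix set, and the rate limit are simplified placeholders rather than PEERING's actual policy code.

```python
#!/usr/bin/env python3
# Illustrative ExaBGP-style policy process: read experiment announcements,
# drop non-compliant ones, and re-announce compliant ones to the router.
# The JSON fields, prefix, and limits below are simplified placeholders.
import json, sys, time
from collections import defaultdict

ALLOWED_PREFIXES = {"184.164.224.0/24"}   # example experiment allocation
MAX_UPDATES_PER_DAY = 144
updates = defaultdict(list)               # prefix -> timestamps of updates

def compliant(prefix):
    now = time.time()
    updates[prefix] = [t for t in updates[prefix] if now - t < 86400]
    if prefix not in ALLOWED_PREFIXES:
        return False                      # not experiment space (hijack attempt)
    if len(updates[prefix]) >= MAX_UPDATES_PER_DAY:
        return False                      # exceeds per-prefix update budget
    updates[prefix].append(now)
    return True

for line in sys.stdin:
    msg = json.loads(line)
    if msg.get("type") != "update":
        continue
    # Schema simplified: assume each update carries a list of announced prefixes.
    for prefix in msg.get("announced", []):
        if compliant(prefix):
            sys.stdout.write(f"announce route {prefix} next-hop self\n")
            sys.stdout.flush()
```

Because the policy is ordinary Python, stateful checks (such as the per-prefix update budget above) and cross-PoP state synchronization can be expressed and unit-tested without involving a router.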
Data plane enforcement VBGP’s data plane is run in an isolated container, so it can either be
collocated with a software router or run on a separate server. It interposes on experiment data plane
traffic through the use of extended Berkeley Packet Filters (eBPF), which allows loading simple
programs into the kernel to inspect packets. The eBPF program can make a stateless or stateful
decision to allow, transform, or block each packet, enabling policies such as rate limiting experiment
traffic on a per-PoP or per-neighbor basis.
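The eBPF programs themselves are written in restricted C and loaded into the kernel; the pure-Python fragment below only illustrates the kind of stateful decision (here, a per-neighbor token bucket) such a program can encode. The rate and burst values are arbitrary examples.

```python
# Pure-Python illustration of a stateful per-neighbor rate limit, the kind of
# policy the in-kernel eBPF program enforces on experiment traffic. Rates and
# burst sizes are arbitrary examples, not PEERING's configured limits.
import time

class TokenBucket:
    def __init__(self, rate_bps, burst_bytes):
        self.rate = rate_bps / 8.0        # bytes per second
        self.capacity = burst_bytes
        self.tokens = burst_bytes
        self.last = time.monotonic()

    def allow(self, pkt_len):
        now = time.monotonic()
        self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= pkt_len:
            self.tokens -= pkt_len
            return True                   # forward the packet
        return False                      # drop: experiment exceeded its rate

buckets = {"N1": TokenBucket(rate_bps=100_000_000, burst_bytes=1_500_000)}
print(buckets["N1"].allow(1500))          # True until the bucket drains
```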
6.4 PEERING: From a Router to an AS
While the design of VBGP is generally applicable to different infrastructures, we used VBGP to
build PEERING, a platform for routing research that we make available to the community. Figure 6.4
shows an overview of PEERING’s architecture. PEERING maintains infrastructure at PoPs around
the world; each PoP consists of a commodity server running VBGP, from which we interconnect
with one or more networks using BGP. We implement VBGP using common open source software,
i.e., the BIRD software router [427] for our BGP routing engine and OpenVPN [320] for VPN
tunnels with experiments. Section 6.5 presents engineering aspects of PEERING, some of which
would benefit other networks. VBGP adds manageable overhead, allowing a commodity server
to virtualize a router at even the largest IXPs today and in the foreseeable future. Section 6.6
demonstrates scalability.
6.4.1 Key design decisions
Previous Internet routing research platforms were limited in the type of research experiments they
could support, their ease-of-use and accessibility to experimenters, and their long-term maintain-
ability. Our approach accounts for these challenges and focuses on addressing them in multiple
ways.
Deploy at IXPs and universities (§6.4.2) To achieve a good representation of today’s widely
interconnected content and cloud providers (§2.2.3 and chapters 3 and 4), our approach for deploying
PEERING focuses on a mix of both university and IXP sites. This allows us to sidestep the limitations
of each by combining their strengths. In particular, IXP sites provide many interconnections and
university sites allow easy federation with other resources that offer complementary functionality.
Federate with other platforms (§§ 6.4.3 and 6.4.3.3) To better approximate the cloud provider
setting—interdomain connectivity at locations around the world, data centers providing computa-
tional resources, and a backbone connecting them all—PEERING federates with CloudLab [96] to
provide researchers with cloud-like data centers, and with educational networks to provide slices of
their multiplexing backbones to interconnect PEERING PoPs. These federations support a wider
range of experiments while ensuring that the platform’s resources remain predominantly focused on
expansion of the AS and its connectivity.
Low-overhead, turn-key experiment and infrastructure setup and deployment (§§ 6.4.4 and 6.4.5)
To democratize Internet routing research, we designed and implemented standardized workflows
to allow easy provisioning and deployment of new experiments, new VBGP sites, and new peer
networks. Further, we provide experimenters with a toolkit that can be used to instantiate a wide
variety of experimental setups without requiring prior experience with BGP, VBGP, or PEERING.
Follow the principle of least privilege (§§ 6.4.5 and 6.4.6) To balance our goals of maintaining
safety while supporting a wide range of experiments, by default we tightly restrict what an experiment
can do, especially in terms of the range of announcements it can make to the Internet. We carefully
review experiments that need more functionality, including consulting commercial network operators
for feedback as needed, and PEERING supports per-experiment capabilities for those that can safely
justify richer functionality.
[Figure 6.4 omitted: experiments, each running the experiment toolkit with BIRD and OpenVPN, connect over VPN tunnels to PEERING servers at IXP and university PoPs; each server runs VBGP (network controller, BIRD, OpenVPN, and the security enforcement engines) and interconnects with neighbor ASes and, over the AL2S backbone, with other PoPs.]
Figure 6.4: PEERING’s architecture. Experiments connect via VPN to one or more PEERING servers, or
Points of Presence, and use a software router to establish BGP sessions with a VBGP router at each PoP.
Experiments exchange routes and traffic via the tunnels and the corresponding BGP session, where VBGP
delegates data and control plane decisions while enforcing policy. All PoPs run VBGP, which consists of
the networking controller (§6.5), routing engine (§6.3.2), OpenVPN (§§ 6.4.4 and 6.4.5), and enforcement
engines (§6.3.3).
6.4.2 Footprint and connectivity
Numbered resources PEERING has 8 AS numbers (ASNs), including three 4-byte ASNs, and is
allocated a total of 40 /24 IPv4 prefixes and one /32 IPv6 prefix. We dedicate one or more prefixes
to each approved experiment for a specified duration.
Points of presence As of June 2019, PEERING has thirteen operational PoPs on three continents,
four at IXPs and nine at universities. PEERING servers at five additional PoPs are projected to come
online. At IXPs, we establish bilateral peering with tens or hundreds of networks, connect to many more via interconnections with route servers, and pursue partnerships to obtain transit interconnections.
At universities, the platform has a transit interconnection with the university’s AS. The sites have
different tradeoffs: IXPs offer richer connectivity, but universities can present opportunities such
as our CloudLab federation. Universities add operational overhead to debug connectivity (§6.5.2),
whereas IXPs add operational overhead to negotiate hosting and transit.
Getting from the initial agreement to a server inside a colocation facility with connectivity was the biggest challenge at IXPs; the business aspects, bureaucratic procedures
on both sides of placing a server within the colocation facility, and negotiating transit consumed
months on average. Once the paperwork was completed, it took very little time to connect the server
and to establish peering links.
Peer networks Obtaining bilateral peering agreements with other members of the IXP was easier
than we expected. Most member ASes at an IXP have open peering policies, but that is not a
guarantee that they will peer blindly with everyone. Taking a direct approach and reaching out to the
other IXP members resulted in several hundred direct peering connections. PEERING currently has
12 transit providers and 923 unique peers (129 via bilateral BGP sessions and the rest only via IXP
route servers [350]). We peer with 854 ASes at AMS-IX (106 bilaterally), 306 (63) at Seattle-IX,
140 (10) at Phoenix-IX, and 129 (6) at IX.br/MG in Brazil.
According to PeeringDB [327], our peers are balanced across diverse types of networks: 33% of
our peers are transit providers, 28% are cable/DSL/ISPs, and 23% are content providers. Of the
remaining 17%, 8% cannot be classified, and the rest are a mix of education/research networks,
enterprise networks, non-profits, and route servers. An industry study published in 2016 reported
that 60% of traffic comes from a small number of content delivery networks. PEERING connects
directly to 7 of the 10 named [302].
PEERING announcements can reach all ASes via transit providers, so experiments can exchange
traffic with all ASes. If network P is a transit provider for network C (either directly or transitively),
C is in P’s customer cone. ASes in the customer cones of our peers receive announcements made
by experiments to peers. This reach is of interest to researchers due to the importance of peering
routes on today’s Internet (§2.2.3 and chapters 3 and 4) and because it reflects ASes towards which
PEERING has “extra” route diversity, as they are reachable both via all PEERING transits and via at
least one peer.
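Customer cones can be computed directly from AS-level provider-to-customer edges: the cone of P is every AS reachable from P by repeatedly following such edges. The short sketch below is a generic illustration with a made-up toy edge list, not PEERING or CAIDA data.

```python
# Illustrative computation of an AS's customer cone from provider->customer
# edges. The edge list below is a made-up toy example.
from collections import defaultdict

p2c_edges = [("P", "C1"), ("C1", "C2"), ("P", "C3")]  # provider -> customer

customers = defaultdict(set)
for provider, customer in p2c_edges:
    customers[provider].add(customer)

def customer_cone(asn):
    cone, stack = set(), [asn]
    while stack:
        for c in customers[stack.pop()]:
            if c not in cone:
                cone.add(c)
                stack.append(c)
    return cone

print(sorted(customer_cone("P")))  # ['C1', 'C2', 'C3']
```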
6.4.3 Emulating a cloud provider
To support experiments in environments similar to those used by content and cloud providers,
experimenters need to be able to pair PEERING’s rich interdomain connectivity with backbone
connectivity and compute resources.
6.4.3.1 Backbone connectivity
We worked with research and education networks, such as Internet2, to establish backbone connec-
tivity between PEERING PoPs and configured VBGP routers on the backbone to exchange routes
in a BGP mesh. Our US PoPs connect to Internet2’s Advanced Layer 2 Services (AL2S) [4], and
our Brazilian site uses RNP’s equivalent in Brazil [355]. These services allow us to create VLANs
between sites (including bridging between the US and Brazil), with provisioned capacity across the
educational networks. In the future, we will use Geant [154] to integrate our European PoPs. An
experiment connected to one PoP has visibility into routes at all other PoPs in the BGP mesh, and
it can direct announcements and traffic across the backbone to BGP neighbors at any of the PoPs
(§6.4.3.3). Section 6.6 provides an overview of TCP throughput over the backbone.
6.4.3.2 Federation with CloudLab
CloudLab provides researchers with access to bare-metal systems for establishing their own clouds
to conduct experiments. PEERING has PoPs with backbone connectivity at all CloudLab locations.
By colocating PEERING PoPs at CloudLab sites, CloudLab experiments can select from routes
available at any PEERING PoP to reach destinations across the Internet, then route across the
backbone to reach the selected PoP. Combined, PEERING and CloudLab provide experiments with
edge PoPs, a backbone, and compute resources, enabling experiments to operate in environments
that are qualitatively similar to those of large content or cloud providers.
6.4.3.3 VBGP across the backbone
Experiments should be able to route traffic across the backbone. For example, in Figure 6.5,
experiment X1 is controlling VBGP instance E1 and should be able to send traffic to N2 via the backbone link between E1 and E2. For this to work, N2's BGP announcement must reach E1 and X1 with next-hops they can reach, either via layer 2 or via an IGP. Neither approach works out of the box. With an IGP, the next-hop would be an interface on E2, and X1 would make a forwarding decision by looking up the next-hop in its IGP table to learn that it should forward to E1. It would then forward a frame with a destination MAC belonging to E1. In order to avoid losing X1's decision, this MAC would have to uniquely encode the next-hop and egress (E2 and N2), which would add significant complexity to the IGP configuration. Making an interface on E2 reachable via layer 2 from X1, on the other hand, would require tunneling (e.g., VLANs), which adds its own complexity.
[Figure 6.5 omitted: X1 connects to E1, which peers with N1 and, over the backbone, with E2; E2 peers with N2. N1 announces 192.168.0.0/24 to E1 with next-hop 1.1.1.1, and N2 announces it to E2 with next-hop 2.2.2.2; E2 exports the route to E1 with next-hop 127.127.0.2, and E1 exports both routes to X1 with next-hops 127.65.0.1 and 127.65.0.2.]
Figure 6.5: Example connectivity for an experiment (X1) using PEERING's backbone connectivity. E1 and E2 are VBGP routers in PEERING, and each has a single neighbor (N1 and N2). The prefix announcements demonstrate how VBGP's delegation rewrites next-hops to enable X1 to control E2's connectivity to N2.
Instead, we extend our next-hop-based control hop-by-hop without requiring an IGP. We use
a common pool of IPs to assign a unique global (to PEERING) IP to each external neighbor. This
allows E1 to recognize E2's next-hop 127.127.0.2 and overwrite it with IP 127.65.0.2 from its local pool, prior to announcing the prefix to X1. It maintains a separate routing table for this local IP (and its corresponding MAC), containing the routes with next-hop 127.127.0.2. When X1 wants to send a packet to 192.168.0.0/24 via N2, it uses its routing table to find the next-hop 127.65.0.2 and sends an ARP request, to which E1 responds with the MAC address. X1 sets this as the destination for its frame, then E1 uses the routing table corresponding to the MAC to look up the route to 192.168.0.0/24, which has next-hop 127.127.0.2. The process now repeats as E1 sends an ARP request for 127.127.0.2, and E2 responds with a MAC. E1 transmits the frame, E2 performs its lookups based on the MAC, and forwards the frame to N2.
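Conceptually, each VBGP hop applies the same rewrite again whenever the incoming next-hop already belongs to the PEERING-wide pool. The fragment below sketches the two lookups for the Figure 6.5 example; the addresses follow that example, and the data structures are assumptions rather than PEERING's implementation.

```python
# Illustrative sketch of the hop-by-hop next-hop rewrite across the backbone.
# Addresses follow the Figure 6.5 example; the data structures are assumptions.

# E2 assigns its external neighbor N2 (2.2.2.2) the PEERING-wide alias
# 127.127.0.2 and announces routes over the backbone with that next-hop.
global_alias_at_e2 = {"127.127.0.2": "2.2.2.2"}

# E1 recognizes the PEERING-wide alias and rewrites it to an alias from its
# local pool (127.65.0.2) before exporting the route to experiment X1.
local_alias_at_e1 = {"127.65.0.2": "127.127.0.2"}

# Forwarding: X1 ARPs for 127.65.0.2 and sends the frame to E1, which looks up
# 127.127.0.2 in the table for that MAC and forwards towards E2; E2 then
# resolves the frame's MAC to N2's real next-hop.
hop_at_e1 = local_alias_at_e1["127.65.0.2"]   # -> "127.127.0.2" (lookup at E1)
hop_at_e2 = global_alias_at_e2[hop_at_e1]     # -> "2.2.2.2"     (lookup at E2)
print(hop_at_e1, hop_at_e2)
```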
6.4.4 Experiment toolkit
We provide experimenters with a toolkit to connect to PEERING and execute experiments. Ex-
periments establish BGP sessions with PEERING routers over VPN tunnels. The toolkit contains
wrappers for OpenVPN and BIRD that implement a turn-key interface for common tasks such as
establishing BGP sessions or making prefix announcements. Table 6.1 provides a full list of the
functionality provided by the wrapper software. Advanced features such as per-packet routing
(§6.3.2.2) must be configured by experimenters.
Category             Functionality
OpenVPN              Open/close/check status of tunnels
BGP/BIRD             Start/stop BIRD v4 and v6 sessions
                     Status of BGP connections
                     Access BIRD CLI
Prefix Management    Announce/withdraw prefix
                     Manipulate community attribute
                     Manipulate the AS-path attribute
Table 6.1: A list of capabilities provided by the PEERING experiment software to simplify and abstract basic
tasks to configure and set up experiments.
While the toolkit is open source and designed around OpenVPN and BIRD, experiments are
free to use any software that can establish BGP sessions and VPN tunnels with PEERING servers
(past experiments have used Quagga and ExaBGP for BGP). Because security policies are enforced
at PEERING servers, we do not need to enforce any behavior at the experiment side.
6.4.5 Deploying experiments
Experiments execute on infrastructure that is separate from PoPs. This decoupling promotes
flexibility and enables the platform to support a variety of experimental setups. For instance, the
BGP announcements of a measurement experiment can be managed by custom logic executed from
a laptop. Experiments can also run applications such as a web server or a Tor relay, or combine
PEERING connectivity with emulated intradomain topologies using tools such as Mininet [257].
Experiments requiring more computational resources can run on CloudLab [96] or cloud providers.
Before receiving access and prefixes to execute an experiment, experimenters must submit a
proposal that outlines the experiment’s goals, resource requirements, and execution plan via a simple
web form. This process of manual approval mimics the one that was successful for PlanetLab, except that PlanetLab outsources approval to the PIs at individual sites. We considered automatic approval and allocation of an IPv6 prefix (IPv6 prefixes are plentiful and would let experimenters start using PEERING immediately), since VBGP's security architecture and filters will prevent misbehavior. However, the current rate of proposals is manageable with manual review, so we have not invested the development effort to automate this limited form of approval.
Once we approve the experiment via a simple management web interface, the management sys-
tem we built automatically generates credentials for the experimenters that enable VPN connections
to VBGP routers. The system updates the policies and configuration each VBGP router needs to al-
low the experiment (and to filter disallowed traffic and announcements that the experiment might try
to send). It pushes the updates to VBGP routers without disrupting ongoing experiments or running
BGP sessions, since we are only modifying configurations relative to individual experiments.
Although the number of experiments varies over time, during the past 12 months PEERING has typically hosted 3 to 6 concurrent experiments. Concurrency is limited by available IPv4 address
space: although PEERING controls plenty of IPv6 space, most experiments to date concentrated on
IPv4. Fortunately, no experiment has had to wait due to insufficient IPv4 address space thus far.
6.4.6 Security policies
PEERING experiments exchange routes and traffic with other production networks on the Internet
that are outside of our control. As a result, we cannot formally verify the safety of PEERING
because we cannot guarantee that all possible interactions are safe, even if they comply with all
relevant protocol specifications. This fundamental challenge exists in any environment in which
there are interactions between varying implementations and configurations of a protocol standard
[326]. For instance, it is fundamentally impossible to guarantee that BGP announcements will
not trigger bugs in remote routers. Even announcements that are fully compliant with the BGP
specification can cause widespread outages due to bugs in router implementations [268, 357, 433],
and it is not feasible to test whether a given announcement may cause disruption given the plethora
of implementations and configurations.
Because we cannot formally verify PEERING’s safety, our approach is to define conservative
security policies which match current best practices and verify correct enforcement. VBGP’s design
supports data and control plane policies. For PEERING we define two dimensions for security
policies: the rate of traffic and BGP updates, and the content of packets and BGP updates.
Ensuring control-plane and data-plane activity is safe PEERING prevents experiments from
performing activities that are harmful or that would prevent attribution back to the experiment. An
experiment cannot announce a BGP update or source traffic using address space that is not part of
the experiment’s allocation (hijacking and spoofing), which also means experiments cannot transit
non-experiment traffic; cannot originate announcements from an ASN it is not authorized to use; and
cannot manipulate BGP attributes in ways not allowed by our capability framework. We ensure that
all IP traffic sent by the experiment has a source IP address from a range allocated to the experiment.
Throttling control-plane and data-plane activity PEERING shapes traffic at (two) sites with
bandwidth constraints to the rates agreed upon with the sites' operators. To date, no experiment has exercised these bandwidth limits, and experiments that would exceed them could still be deployed on sites without bandwidth constraints. To limit overhead on routers in the Internet, PEERING limits experiments to 144 BGP updates/day for each prefix and PoP pair. This corresponds to an average rate of
one update every 10 minutes, which amounts to a small fraction of the “background noise” in the
interdomain routing system [45].
Capability framework In keeping with the principle of least privilege, the management system
has a capability-based framework that defaults to limiting experiments to “basic” announcements
and supports adding capabilities on a per-experiment basis. Experiments request capabilities via the
experiment web form, and admins can simply add the capability on the approval web form. Current
capabilities include:
• Allow a limited number of poisoned ASes [62].
• Allow attaching a limited number of BGP communities or large communities to announce-
ments [82, 198, 406].
• Allow optional BGP transitive attributes [357].
• Allow experiments to announce routes learned from one network to another network, for
experiments that require legitimately providing transit for an experimental prefix.
Our existing capabilities suffice for most experiments. When an experiment requires novel
capabilities, we work with experimenters to deploy them and add them to our capability framework
for future experiments. For example, an experiment recently required the ability to announce 6to4
IPv6 addresses [76].
Implementation of security policies On the control plane, we implement security policies in
BIRD whenever possible, as it provides better performance than the general ExaBGP engine. We
currently use the ExaBGP engine to limit the rate of announcements from experiments and to filter
BGP updates with disallowed (any non-standard) BGP attributes. Similarly, we implement data
plane security policies using Linux’s built-in tools whenever possible.
Testing security policies In the interest of safety, we do not verify enforcement of our security
policies by executing adversarial tests against the production PEERING platform. Instead, we
deploy our production configurations and software stack in a test environment (§6.5) that includes
(emulated) PEERING experiments, servers, and BGP neighbors, and then use a custom framework to
execute automated tests of our security policies and handling of experiment capabilities. Through
this process, we verify that our security implementations correctly enforce the expected policies in
terms of what traffic and announcements are (dis)allowed.
For each capability, we deploy two (emulated) experiments in our controlled environment: one
that does not require the capability and one that does. We execute both experiments twice, with and
without the capability. We check that the routes exported and traffic exchanged in each execution
match the configured policy and are safely handled by the software routers we test them against
within the test environment. For example, we deploy an experiment that makes announcements with
BGP communities with and without the corresponding capability, and check that communities are
stripped from exported announcements when the capability is missing.
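The with/without-capability pattern can be sketched as follows. This self-contained fragment only illustrates the shape of the check; apply_export_policy() is a toy stand-in for the deployed VBGP configuration, whereas the real tests drive emulated routers with production configurations.

```python
# Self-contained sketch of the with/without-capability test pattern. The
# toy export policy stands in for the deployed VBGP configuration; the real
# framework exercises emulated software routers instead.
def apply_export_policy(announcement, capabilities):
    """Toy export policy: strip BGP communities unless the capability is granted."""
    out = dict(announcement)
    if "communities" not in capabilities:
        out["communities"] = []
    return out

def run_experiment(capabilities):
    announcement = {"prefix": "184.164.224.0/24", "communities": [(65000, 100)]}
    return apply_export_policy(announcement, capabilities)

# Execute the experiment twice, with and without the capability, and check
# that communities are only exported when the capability is granted.
with_cap = run_experiment(capabilities={"communities"})
without_cap = run_experiment(capabilities=set())
assert with_cap["communities"] == [(65000, 100)]
assert without_cap["communities"] == []
print("export policy strips communities exactly when the capability is absent")
```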
Impact of misbehaving experiments PEERING’s security engine and deployed policies protect
the Internet from misbehaving experiments. However, misbehaving experiments may put strain on
PEERING’s infrastructure and negatively impact its performance. For example, experiments sending
an excessive rate of updates could overwhelm our control plane security engine (§6.3.3), impacting
other experiments. To the extent possible, we designed PEERING to isolate the performance
experienced by our upstreams from that experienced by experiments, and our security mechanisms
are designed to protect our upstreams and the broader Internet, even if it leads to an outage of
the platform itself. For instance, if the security enforcement engine were to become overloaded
due to a misbehaving experiment, the enforcement engine would fail closed, thereby blocking any
experimental announcement from propagating upstream. However, as of November 2019, we have
not experienced any scenario in which a misbehaving experiment caused a platform outage.
6.5 Development and Deployment
PEERING’s design requires weaving together many components, and building PEERING required
addressing a number of technical and logistical challenges that arose as we scaled the testbed in
terms of the number of PoPs, experiments supported, and experiment capabilities.
6.5.1 Engineering principles and lessons
With a growing number of heterogeneous servers deployed over multiple years into highly diverse en-
vironments, it has been essential for us to invest in tooling that enables us to maintain a standardized
deployment and automate the numerous processes required to support experiments.
In this section we describe three pillars of PEERING’s engineering, some of which we believe can
be applied to other networks to positive effect. Our solutions allow a small research team to develop
and operate a distributed infrastructure with hundreds of interconnections that services a dynamic
and sophisticated set of research experiments. These solutions were key to enabling a production
platform and are distinguishing elements from the prototype designs and implementations in prior
work.
Intent-based configuration PEERING’s components have complex configuration files; for exam-
ple, the configuration files for BIRD alone can exceed 10,000 lines at large PoPs. PEERING
configuration files are dynamic; for example, BGP sessions must be enabled and disabled on BIRD
whenever an experiment connects and disconnects from a PoP. Finally, PEERING configuration files
have specificities; for example, only two PEERING PoPs have traffic bandwidth limits. As a result,
it is not possible to maintain these complex and dynamic configurations by hand.
We employ intent-based configuration best-practices [379, 421] to transform a model containing
desired configuration (such as experiment capabilities) into service-specific (e.g., network controller,
BIRD, OpenVPN, and policy enforcers in Figure 6.4) configuration files. The desired configuration
is stored on a centralized database accessible through a web service. The database has information
including approved experiments and their capabilities (§6.4.6), network configuration at each PoP,
and interconnection information for BGP sessions with peers at each PoP. The desired configuration
in the database is used to generate service configuration files automatically through a templating
engine, and the resulting configuration files are used by the services.
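To make the pipeline concrete, the snippet below renders a BIRD-style session stanza from a small "desired configuration" model using the Jinja2 templating library. The template, model fields, and ASNs are invented for this illustration and are not PEERING's actual schema or templates.

```python
# Simplified illustration of intent-based configuration: a desired-state model
# is rendered into a BIRD-style config stanza with Jinja2. The model fields and
# template are invented for this example, not PEERING's schema.
from jinja2 import Template

BIRD_SESSION_TEMPLATE = Template("""\
protocol bgp {{ name }} {
    local as {{ local_asn }};
    neighbor {{ neighbor_ip }} as {{ neighbor_asn }};
    {% if not transit %}import filter peer_routes_only;{% endif %}
}
""")

desired = {
    "name": "peer_amsix_n1",     # placeholder session name
    "local_asn": 65000,          # placeholder ASNs and addresses
    "neighbor_ip": "1.1.1.1",
    "neighbor_asn": 65001,
    "transit": False,
}

print(BIRD_SESSION_TEMPLATE.render(**desired))
```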
As an example, consider how control plane capabilities vary by experiment. When we add
support for a new experiment capability, such asAS_PATH poisoning, we add support for expressing
the capability in our desired configuration model. Then we identify how the configurations of
different services (Figure 6.4) need to be transformed so that the capability is enabled for authorized
experiments (and blocked for others). We use the resulting insights to modify the configuration
template, and then use the templating engine to generate configurations and test in an offline
development environment (discussed later in this section) to ensure that the generated configuration
works as expected. Following local testing, we update the production templating configuration and
regenerate configuration files for all PEERING servers.
All configuration files deployed to PEERING servers are stored in a version-control system where
they can be inspected and rolled back if needed. When we make templating changes, we canary
the new configuration on a subset of our production fleet as a safeguard. We use Ansible to fetch
configuration files from the version-control system, deploy them to a subset of servers, and then
reload the impacted services. Once we are confident in the new configuration, we instruct Ansible
to deploy updated configurations to all PEERING servers. A similar templating process is used at the
servers to update BGP session configuration as experiments connect / disconnect.
Network configuration with transactional semantics In order to maintain BGP sessions with
PoP neighbors and delegate the data plane, VBGP must configure (1) physical interfaces used
for interconnecting with upstream neighbors, (2) virtual interfaces used for delegating control of
the data plane to experiments, (3) routing tables and rules, and (4) filters used to enforce security
policies (§§ 6.3.2.2 and 6.4.3.3).
Given that VBGP network configuration is dynamic, we developed a network controller program
that updates the server’s network configuration such that it aligns with the high-level, intent-based
description modeled in our centralized configuration database. However, the interface to configure
Linux networking (Netlink) does not support expressing intents; it provides a request-response
interface that allows querying, adding, and removing network configuration (e.g., routes and
addresses). When the network controller receives a configuration update, simply resetting the
network configuration and applying the new configuration from scratch would reset BGP sessions
and VPN connections, interrupting ongoing experiments and interdomain connectivity with PoP
neighbors. Instead, the controller attempts to minimize the amount of configuration changes by
implementing logic, unavailable in Linux’s networking tools, that (i) removes configuration that is
incompatible with the intended state, (ii) keeps any configuration compatible with the intended state,
and (iii) adds any missing configuration.
Two requirements complicate our network controller’s design further. First, we enforce trans-
actional semantics, where either all configuration changes are successfully applied or no changes
are applied (e.g., partially complete changes are rolled back) to ensure that a server is never in an
inconsistent state. Second, Linux network interfaces can have one primary address and an arbitrary
number of secondary addresses. PEERING needs to control the primary address as it is used when
generating ICMP error messages, particularly TTL Exceeded replies to traceroute probes. Because
the Linux kernel does not support changing the primary address (it is set based on the order in which
addresses are added), PEERING’s network controller verifies each interface’s primary address and, if
incorrect, removes and re-adds the interface's addresses in the proper order.
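The core of the controller's reconciliation can be summarized as a diff between the intended state and the state read back from Netlink. The following fragment is a schematic of that logic, with the state modeled as simple sets of configuration items; the real controller additionally orders operations and rolls back on partial failure.

```python
# Schematic of the network controller's reconciliation step: compute what must
# be removed and what must be added so the live state converges to the intended
# state, touching nothing that is already correct. The real controller also
# orders operations and rolls back on partial failure (transactional semantics).
def reconcile(intended, live):
    to_remove = live - intended      # incompatible with the intended state
    to_keep = live & intended        # already correct: leave untouched
    to_add = intended - live         # missing configuration
    return to_remove, to_keep, to_add

intended = {("route", "192.168.0.0/24", "table", "100"),
            ("addr", "127.65.0.2/32", "dev", "tap0")}
live = {("route", "192.168.0.0/24", "table", "100"),
        ("addr", "10.0.0.1/32", "dev", "tap0")}

remove, keep, add = reconcile(intended, live)
print(sorted(remove))  # stale address to delete
print(sorted(add))     # missing address to install
```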
Standardization and isolation We standardize configuration, deployment, and upgrades to our
infrastructure by running servers with stripped down operating systems and packaging PEERING’s
services (e.g., OpenVPN, BIRD, and PEERING’s network controller service and enforcement
engines) into containers. Linux namespaces isolate specific functionality of the Linux kernel, which
container and lightweight virtualization technologies employ to control resource sharing. PEERING
uses containers to isolate VBGP services’ process, file system, and network namespaces from that
of the host, allowing us to isolate each service and its dependencies, preventing conflicts and easing
upgrades.
This isolation is key given that implementing VBGP requires tight integration with Linux’s
networking stack (§§ 6.3.2.2 and 6.4.3.3). If VBGP made configuration changes to the host’s
networking namespace, then configuration errors, software bugs, or failures in VBGP could put the
host’s networking stack in a dysfunctional state and prevent all in-band access. However, because
VBGP configures an entirely separate networking namespace, it is logically isolated from the host’s
network configuration, significantly reducing risk and also enabling our tooling to reset the state of
the namespace if needed.
We deploy containers to servers using Ansible. Our Ansible playbooks reset the server’s
operating system to a known, desired state, including managing the host’s networking stack. Ansible
is executed periodically and verifies that every PEERING server is in compliance, redeploying out of
date containers and upgrading configuration files as needed (when configuration files are updated,
BGP sessions are not reset and thus experiments are not impacted). This results in a single Ansible
and operating system configuration for all servers, while supporting diverse PoPs.
In addition to maintaining standardization in production, using containers simplifies platform
development and testing. During development, we need to be able to execute experiments and test
changes in an environment that is representative of a PEERING PoP. Thus, the environment must
have the same packages and configuration that are used in production, must run the VBGP network
controller, and must also have interactions between a PEERING PoP, its neighbors, and experiments.
We accomplish this by using the same containers and automatically-generated configurations used in
production on our personal development machines, and using additional containers running software
routers and PEERING’s client toolkit (§6.4.4) to emulate a PoP’s neighbors and experiments. This
allows us to systematically test changes to PEERING (such as a new experiment capability) in an
environment that is guaranteed to be representative (in both software and configuration) of our
production environments without the risk of problems on our development machines due to (for
instance) the VBGP network controller reconfiguring the machine’s network interfaces.
6.5.2 Challenges in debugging and operation
Debugging route propagation We found instances of PEERING announcements not being glob-
ally reachable due to improperly configured or out-of-date filters in other networks. Networks
employ route import and export filters to prevent route leaks and prefix hijacks [100, 330]. When
debugging these situations, our goal is to identify the network that is incorrectly filtering, but the
process is manual and relies heavily on looking glass servers [163]. Although looking glasses
help, they cannot accurately pinpoint filters because they only provide a restricted command line
interface. Even in the optimistic scenario where two directly-connected networks A and B have
looking glasses, if network A has the route and network B does not, the looking glasses do not allow
us to disambiguate between (1) network A not exporting the route to B or (2) network B filtering the
route received from A. In practice, debugging usually requires emailing our transit providers. They,
in turn, may have to email their providers.
As future work, we plan to investigate the more general problem of identifying whether networks
do not appear on routes to our prefixes because they are misconfigured or because they are less
preferred than other providers. We plan to evaluate methods for automated filter troubleshooting.
Debugging layer 2 connectivity Systems like AL2S promote automation in educational back-
bones. However, at university sites, PEERING servers may be deployed to facilities (e.g., ‘server
rooms’ or labs) that are not managed by the university’s core network operators. PEERING servers
may also interact with equipment that is not under complete control of the university’s networking
team, such as when federating with other testbeds (e.g., CloudLab switches). These facilities are
often out of the reach of automated management, and relatively straightforward tasks, like trunking
a VLAN from a core router connected to Internet2 to a server at a (possibly unmanaged) location,
can be surprisingly difficult. Operators would benefit from tools to ease debugging and network
management systems suited to these environments.
6.6 Scalability of PEERING
We evaluate the performance of our VBGP instantiation used in PEERING, implemented using
the BIRD software router. Despite BIRD’s limitations (most notably running a single thread),
PEERING can support hundreds of peers and thousands of updates per second. Our implementation
can virtualize routers in the largest IXPs today and in the foreseeable future. We also evaluate the
achievable throughput of our interconnectivity across our backbone, and conclude that our backbone
capacity can support a variety of experiments.
Known routes and memory utilization VBGP employs BGP ADD-PATH to inform experiments
of all available routes. This makes the number of routes managed by VBGP, and the memory
requirements in PEERING, proportional to the number of routes learned across all upstream in-
terconnections. Figure 6.6a shows the memory utilization of BIRD’s routing tables as a function
of the number of known routes. The control plane line corresponds to a minimal configuration
with a single global Routing Information Base (RIB), required for BGP operation (but without the
Forwarding Information Base (FIB) necessary to forward traffic). The per-interconnection data
plane line adds in the overhead of VBGP, which maintains one FIB entry (in the Linux kernel)
for each known route to allow experiments to choose their own routes when sending traffic. The
per-interconnection data plane with default line additionally adds in the overhead of the router
maintaining its own best-path routing table and keeping it synchronized with a kernel FIB. This
additional overhead is not necessary for PEERING because VBGP experiments receive all routes via
ADD-PATH and make their own routing decisions, but would be necessary if the VBGP node was
also routing production traffic. Memory use in BIRD scales linearly with the number of routes, at a
rate of approximately 327B/route, allowing a server with 32GiB of RAM to support 100 million
routes. PEERING’s VBGP router at AMS-IX, one of the largest IXPs in the world, exchanges routes
with all 4 route servers, 2 transit providers, plus 235 routers in 104 member networks, and currently
has 2.7 million routes from 854 ASes (combining route server and bilateral peers).
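As a quick sanity check on those figures (illustrative arithmetic, not a new measurement):

```python
# Back-of-the-envelope check of the memory claim: at ~327 bytes per route in
# BIRD's tables, a 32 GiB server can hold on the order of 100 million routes,
# far more than the 2.7 million currently seen at AMS-IX.
bytes_per_route = 327
ram_bytes = 32 * 1024**3
print(ram_bytes // bytes_per_route)             # ~105 million routes
print(2_700_000 * bytes_per_route / 1024**2)    # AMS-IX: ~842 MiB of BIRD table memory
```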
Rate of updates and CPU utilization VBGP needs to rewrite the BGP next hop of announce-
ments received from the Internet and filter invalid announcements from experiments (§6.3), which
incurs additional CPU overhead in PEERING. We focus our analysis on BIRD’s CPU utilization for
two reasons: (i) most of our route processing filters are implemented in BIRD (§6.4.6) and (ii) only
experiment announcements are processed by the ExaBGP-based security engine (which is invoked
infrequently as we limit experiments to 144 announcements/day per prefix).
Figure 6.6b shows BIRD’s CPU utilization for different filter configurations as a function of the
rate of BGP updates processed. We consider different filter configurations in a worst-case scenario
where BIRD runs all filters to completion without rejecting any routes. The accept line shows results
when BIRD is configured to simply accept all routes without any checks (the lower bound on CPU
utilization). The single-router VBGP line shows the CPU utilization for the filters VBGP applies to
announcements from experiments. This overestimates the complexity in actual VBGP deployments,
as filters applied to updates received from the Internet (the majority) are significantly simpler than
filters applied to announcements from experiments to the Internet. Finally, the multi-router VBGP
line shows the CPU utilization for BIRD in the BGP mesh configuration on PEERING’s backbone,
which requires a more complex handling of BGP next hops (§6.4.3).
The results show that CPU utilization grows linearly with the rate of updates and, more im-
portantly, is not significantly impacted by VBGP’s safety filters. During an 18h period in March
2018, PEERING’s VBGP router in AMS-IX processed an average of 21.8 updates/sec (with a 99th
percentile of approximately 400 updates/sec). Although filters incur additional propagation delays
for BGP update messages, this applies to any implementation and is small relative to global propa-
gation delays [236] and delays imposed by BGP’s built-in minimum route advertisement interval
(MRAI) [466].
Data plane performance PEERING relies on Linux’s networking stack for data plane forwarding,
with PEERING-specific configuration for virtual interfaces, multiple routing tables, and our BPF-
based security framework. PEERING’s performance benefits from any improvements to the Linux
networking stack. Significantly better performance could be achieved by techniques such as kernel
bypass [430, 431] and offloading our security framework to BPF-capable NICs [303]; to date no
experiments have required such capabilities and thus we leave these optimizations as future work.
[Figure 6.6 omitted: (a) memory (MB) vs. known routes (thousands) for three configurations (control plane; per-interconnection data plane; per-interconnection data plane with default); (b) CPU utilization (%) vs. updates per second for three configurations (accept; single-router vBGP; multi-router vBGP).]
Figure 6.6: Memory consumption and CPU utilization grow linearly with number of routes and rate of
updates. Results indicate VBGP can be deployed using commodity servers in the largest IXPs.
To verify forwarding performance of our backbone between PoPs, we conducted throughput
measurements using iperf3. Between sites, the average TCP throughput measurement was
approximately 400 Mbps, with a minimum of 60 Mbps and a maximum of 750 Mbps between
all PoP pairs. While not equivalent to the capacity available from the dedicated fiber connectivity
for large content providers, the backbone connectivity between PEERING locations has sufficient
capacity available for supporting various experiments (including all experiments proposed to date)
except those requiring enormous data transfers.
6.7 PEERING in Practice
6.7.1 How PEERING has been used
Since 2015, PEERING has supported experiments with different goals, exposing an array of different
behaviors to the Internet — includingAS_PATH poisoning,AS_PATH prepending, and controlled
hijacks (of PEERING’s own address space) — and from teams with varying degrees of experience
with BGP and the Internet routing ecosystem. We rejected as risky an experiment proposal that
required a large number of AS poisonings and one that planned to announce AS_PATHs with
thousands of ASes. We granted all other requests for access to the testbed, with many experiments
running for months at a time. The research performed on PEERING was part of 24 publications:
three at SIGCOMM, four at IMC, two at IEEE S&P, three at USENIX Security, three at NDSS, two at SOSR, and one each at ToN, HotPETS, IFIP, SecureComm, TMA, CCS, and CCR [20,
21, 47, 48, 49, 50, 137, 142, 200, 263, 288, 297, 323, 347, 366, 378, 381, 392, 397, 406, 411,
412, 413, 439]. In addition, PEERING supported multiple projects at a BGP hackathon [105]. The
majority of these experiments were conducted by researchers not affiliated with PEERING. Earlier
studies used a rudimentary version of our infrastructure that did not fully support multiplexing and
helped inspire PEERING’s requirements [187, 223, 232, 236, 328, 446].
Almost all experiments announce routes (a small set have used PEERING as a looking glass) and
many exchange traffic. Several experiments have used BGP poisoning [20, 21, 48, 137, 413]. A few
experiments have used BGP communities [49, 366, 406] and fine-grained control of announcements
or traffic [49, 263, 347, 381], capabilities more recently added to the platform. We find that PEERING
offers the following key benefits to research using it:
Controlled experiments. Barriers exist to conducting controlled experiments on the Internet
because it is a huge system with properties reliant on the complex interactions among tens of
thousands of ASes with opaque topologies and policies. This challenge manifests in two ways:
First, researchers rely on uncontrolled, natural experiments, in which conditions of interest vary
outside of the control of the researchers and independent of the current research question. These
experiments can lead to challenges in isolating the cause of observations, as measurements are often
consistent with multiple explanations, especially given researchers’ limited visibility. For example,
a recent study relied on passive observation of Internet routes to identify networks that deployed
BGP security techniques to prevent prefix hijacks [158], but, because it relied on uncontrolled
experiments without isolating security policy as the cause for routing decisions, it could misdiagnose
unrelated traffic engineering as evidence of security policies [347].
Second, and closely related, researchers often lack ground truth to use in evaluating the accuracy
of a system’s inferences. Both cases are compounded in situations when the phenomena of interest
are rare or hard to identify.
PEERING gives researchers the ability to control aspects of routing as a means to conduct
controlled experiments or systematically generate ground truth data. A recent study demonstrated
how to use PEERING to manipulate announcements in order to probe the security policies of ASes
in a controlled manner, isolating the cause of decisions by varying only whether an announcement
was valid [347]. Other work issued requests to a Content Delivery Network (CDN) over different
paths while concurrently manipulating the performance of each path to measure the sensitivity of a
traffic engineering system [378], and to generate known, ground-truth events for evaluating a system
to protect Tor from routing attacks [412].
In-the-wild demonstrations. To have a better chance of adoption, extensions or alternate ap-
proaches to the Internet routing system must be incrementally deployable, backwards compatible,
and should provide benefit to early adopters. Traditionally, however, evaluations are limited to
emulation or simulation, and the community’s limited ability to measure or model the Internet’s
topology or policies means that the fidelity of the evaluations can be unclear. PEERING allows
prototypes to interact with the real Internet to show their compatibility and capabilities, such as
connecting an intradomain network to the Internet to measure the benefits of ingress traffic engi-
neering for a multihomed AS [411], assessing a technique to identify and neutralize BGP prefix
hijacking [381], or evaluating the BGP-compatibility of future Internet architectures [366]. Such
demonstrations can lend credibility and encourage adoption, especially since network operators can
often be conservative about change. They can also uncover pragmatic concerns that would not turn
up in a lab setting.
The security community places value on real-world demonstrations of attacks, and researchers
used our platform to demonstrate traffic interception attacks [49] and attacks on applications
such as Tor [413], TLS certificate generation [48], and cryptocurrencies [21]. This line of work
demonstrates how vulnerabilities in BGP can be exploited to create attacks on Internet-based
systems for anonymity, security, and currency. The real-world demonstrations led to the adoption of
countermeasures by Tor and by Let’s Encrypt, the world’s largest certificate authority.
Measurements of hidden routes. The design of BGP leads to routes only showing up in mea-
surements if they are being used, providing limited visibility into backup routes, route diversity, or
the underlying topology. PEERING can manipulate which routes are available to reach it by using
selective advertisements, AS-path prepending, BGP poisoning, or BGP communities for traffic
engineering. Researchers used this ability to reverse engineer routing policy preferences at finer
granularities [20] than was possible previously [149, 277].
6.7.2 Native delegation is a cornerstone for generality
The decision to seek out flexible solutions using existing layer 2 protocols, IP, and BGP instead of
non-native approaches (§§ 6.2.2 and 6.2.3) proved to be a good design decision that enabled a wide
range of experiments. Many experiments focused on the interactions between applications and BGP
and required flexibility and full delegation of both the data and control planes [21, 47, 48, 263, 378,
411, 412, 413]. Alternative interfaces, such as a custom out-of-band mechanism or an API with a
BGP beacon [352], would not suffice for many experiments and would need to be extended for each
experiment with new requirements. In addition, we find that experimenters are generally familiar
with and expect to be able to use an existing routing engine (such as BIRD)—because our approach
is inherently compatible with existing BGP deployments and tooling, PEERING can support these
experiments without modification.
A transit provider could adapt our approach to allow customers to choose among multiple routes.
It is a possible deployment pathway for flexible routing schemes proposed in research [196, 464] or
SDN control over BGP via existing devices, whereas other approaches, like Espresso [469], require
replacing infrastructure. For instance, our approach inspired a variation that we helped deploy in
production at Facebook to route traffic via alternate routes (§§ 4.5.1 and 5.2.2.2). The variant uses
per-packet data plane signaling and multiple routing tables at routers, but a centralized controller
decides which routes to use and injects them into tables at routers.
6.7.3 Cooperation with network operators
Researchers must push boundaries to test and evaluate protocols, implementations, and potential
solutions. Safely conducting experiments is challenging due to the large variation in implementations
and configurations of routers [120, 357]. Conversely, the operational community’s primary goal is
stability and security. Although the goals seem diametrically opposed, operators are supportive and
appreciative of research in the area, especially when researchers announce experiments and take
feedback prior to execution.
PEERING provides an environment for “white-hat” hackers to conduct experiments that rely on
BGP manipulation [21, 48, 406, 413]. The experiment review process, conservative security policies,
and cooperation with the operator community combine to enforce ethical use of the platform. When
in doubt, we err on the side of safety. For example, one recent experiment requested feedback
from NANOG in advance and proceeded to make (standards-compliant) announcements on a fixed
schedule [268]. The announcements identified a vulnerability in an open-source routing daemon
which caused BGP sessions to reset [433]. Although the majority of operator responses on the
NANOG mailing list supported continuing the experiment, the experiment was halted until affected
systems could be patched.
6.7.4 Experiments PEERING does not support
No direct control over other networks. Although PEERING delegates its data and control planes
to experiments, an experiment’s announcements and traffic are subject to policies enforced by
other networks in the Internet, which experiments have no control over. This limitation is not
specific to PEERING; it is intrinsic to the Internet's architecture. Experiments need to plan for the
lack of control; e.g., an experiment that studied routing policies deployed hundreds of different
announcement configurations to exercise and observe policies in different scenarios [20]. However,
even without direct control over other networks, PEERING can still influence their routes and traffic
towards its prefixes.
PEERING can support limited types of experiments with multiple ASes: PEERING operates
multiple ASNs, which allows experiments to emulate multiple networks that interact with the
Internet as customers of PEERING’s main AS or of each other.
No high volume, production, or transit traffic. PEERING employs servers as routers and does
not operate any network links. PEERING’s backbone links are provisioned on top of other networks’
infrastructures, and researchers connect their experiments to PEERING routers using VPN tunnels
over the Internet. Although PEERING can support experiments that exchange traffic at moderate rates
(hundreds of Mbps, §6.6), capacity varies by PoP and is ultimately limited. PEERING, as a research
platform, also does not provide any guarantees on availability. PEERING also blocks announcements
and traffic for prefixes outside its IP space, and so experiments cannot transit traffic that is neither
from nor to a PEERING address.
Limited support for latency-sensitive or real-time experiments. Experiments connect to PEER-
ING PoPs via an OpenVPN tunnel. As a result, experiment traffic traverses the tunnels, incurring
additional latency and impacting latency-sensitive or real-time experiments. Experiments desiring
low latency can deploy on (and tunnel from) CloudLab (with which we federate and colocate PoPs)
or cloud providers with low-latency paths to PEERING PoPs (e.g., PEERING peers directly with some
cloud providers). We also have a preliminary implementation of an extension to our platform to
support lightweight applications on experiment-controlled containers running directly on PEERING
servers [227].
6.8 Conclusion
Internet routing research has been limited by obstacles in executing experiments in the Internet.
Without control of an AS, researchers are limited to simulations, which cannot realistically capture
Internet properties, and measurements, which can observe routes as they are but cannot manipulate
them to study the impact.
This chapter presented PEERING, a production platform that realizes our vision of enabling
turn-key Internet routing research. PEERING is built atop VBGP, a system that we designed to
virtualize the data and control planes of a BGP edge router, while providing security mechanisms to
prevent experiments from disrupting the Internet or each other. VBGP supports parallel experiments,
each with control and visibility equivalent to having sole ownership of the router, using standard
interfaces, which provide realism and flexibility.
With PEERING, experiments operate in an environment that is qualitatively similar to that of a
cloud provider, and can exchange routes and traffic with hundreds of other networks at locations
around the world. To date, PEERING’s rich connectivity and flexibility have enabled it to support
over 40 experiments and 24 publications in research areas such as security, network behavior, and
route diversity [20, 21, 47, 48, 49, 50, 137, 142, 200, 263, 288, 297, 323, 347, 366, 378, 381, 392,
397, 406, 411, 412, 413, 439].
Chapter 7
Literature Review
7.1 Internet Flattening
In Chapter 3, we examine how a significant volume of today's Internet traffic is delivered over short,
direct paths between content providers and end-user ISPs. The measurement study in Chapter 3 was
executed in 2015, and the Internet has continued evolving since. We discuss historical work that uncovered the first signs of a flattening Internet and captured it at various stages over the past two decades, along with work executed since our 2015 study.
In 2008, Gill et al. [159] used traceroute measurements from public traceroute servers across 30
countries to quantify the prevalence of Tier-1 ISPs in paths to popular content. Their measurements
showed that paths between the traceroute vantage points and Google, Microsoft, and Yahoo — three
of the largest content providers at that time — were less likely to traverse a Tier-1 ISP, with 60%
of paths containing no Tier-1 hops (but potentially still traversing transit providers). In addition,
they identified that these networks were beginning to interconnect widely, noting that the collected
traceroutes showed Microsoft and Google interconnecting with 20+ ASes. The authors hypothesized
that interconnecting widely may enable these networks to adopt technologies that have traditionally
been stymied due to lack of support from major Tier-1 networks [166, 167].
Our analysis in Chapter 3 is based on measurements collected in 2015, and thus we expect
the results to differ from those of Gill et al.’s 2008 study. However, the work also differs in
methodology, the impact of which illustrates the challenges of accurately characterizing a network’s
interconnections from an external vantage point. Gill et al. resolve popular domain names to IP
addresses once and then execute traceroutes from a handful of public traceroute servers to these
addresses; this methodology has two drawbacks. First, public traceroute servers are often located
in transit, hosting, and educational networks (§2.5.1); these networks often do not have the same
connectivity as end-user ISPs, and thus routes observed from these vantage points may not be
representative of user experiences (as discussed in Section 3.2.1). Second, the IP addresses returned
during DNS resolution may have been influenced by the LDNS resolver (§2.3.2). For instance, the
authoritative DNS server for “facebook.com” will change its response based on the requesting LDNS
server in an effort to direct each user to a nearby point of presence, because Facebook announces a separate address space from each PoP to steer traffic (Chapter 4, [387]). Combined, these factors
may have caused some of the executed traceroutes to traverse an abnormally long path relative to
what a typical user in the same country would observe. In comparison, Chapter 3 primarily relies on
measurements executed from the networks of major cloud providers to end-user address space. This
enables a single vantage point to have visibility into many interconnections, and is likely one of the
reasons why we identified more peering interconnections and shorter AS_PATHs.
In 2010, Labovitz et al. [250] found that “the majority of inter-domain traffic by volume now
flows directly between large content providers, data center / CDNs and consumer networks”. Their
conclusion was based on a dataset captured from July 2007 to July 2009 that included traffic flow
measurements from 3,095 peering routers spread across 110 ASes. This dataset captured the rapid consolidation of Internet traffic into the networks of large cloud and hosting providers. In July 2007, the dataset showed that the top 150 ASNs contributed 30% of all interdomain traffic. By July
2009, the dataset showed that just 30 ASNs originated more than 30% of all interdomain traffic and
150 ASNs originated more than 50%. Our findings in Chapter 3 are similar to those of Labovitz
et al., although our vantage point within cloud networks likely provided more complete visibility of
interconnections. In addition, we consider the implications of content serving infrastructure being
colocated directly within an end-user ISP’s network; this traffic was not visible in the methodology
used by Labovitz et al., as they only measured at peering routers.
In 2010, Dhamdhere et al. [109] used a model to study the potential implications of flattening,
defining criteria based on traffic volumes and requirements for global reachability that determined
when an interconnection would be established. In addition, they hypothesized the implications of
a single AS, such as Google, controlling the majority of Internet traffic. In Chapter 4, we discuss
one implication that was not anticipated by Dhamdhere et al.: peering becomes necessary for large
providers like Facebook due to the capacity limitations of the existing hierarchical Internet (§4.2.3).
In 2011, Ager et al. [7] developed a methodology to derive content-based AS rankings. Unlike
topology-driven rankings which focus on a network’s connectivity (e.g., CAIDA’s AS-Rank [69]),
Ager et al.’s rankings focus on the amount of content a network hosts. Their analysis showed that
beyond Google, other Internet content is increasingly served by a small set of hosting and CDN
providers, thereby reinforcing the value of focusing on the connectivity of, and paths between these
key providers and users. More recent work shows that this evolution has quickly continued. For
instance, in 2015 Chen et al. [88] reported that between 15% and 30% of all web traffic was served
from Akamai’s CDN. And in 2019 Labovitz [248] found that the vast majority of all Internet traffic
is sourced from a small set of CDNs, with five web properties accounting for 50% and ten web
properties accounting for 75% of all Internet traffic. Labovitz further found that transit growth over
the past decade had been considerably slower than CDN network growth. For instance, while a
CDN’s network capacity had grown 25x, a transit provider’s capacity had only grown 12x.
The 2020 work of Arnold et al. [25] is arguably most relevant to our work in Chapter 3. Arnold
et al. used a similar methodology (e.g., traceroutes from cloud providers) to characterize cloud
provider connectivity. Like our 2015 study, their analysis examined how many networks cloud
providers interconnect with; their results show that Google interconnects with 7757 ASes, compared
to the 5083 found in our 2015 study. However, their analysis went a step further by also considering
the types of interconnections (e.g., peering or transit) and networks (e.g., Tier-1, Tier-2, or other)
that cloud providers use to reach a destination, with the goal of capturing how close cloud providers
are to achieving global reachability without requiring traffic to traverse Tier-1 and Tier-2 ISPs. For
instance, their results found that Google can reach all but 174 ASes without needing to rely on
its transit providers, and 89.9% of ASes without traversing a Tier-1.¹ Arnold et al. also examined the reliance of cloud providers on each of their peers for routes to ASes, and used simulations to
assess whether this rich interconnectivity naturally provides resilience to route leaks, a possibility
we discussed in Chapter 3 (§3.4).

¹ While Tier-1 ISPs commonly act as transit providers, Arnold et al. found that Google has peering interconnections with 15 Tier-1 ISPs.
As the Internet has flattened, a number of efforts have sought to map the serving infrastructure
of large content providers. In 2013 Calder et al. [70] designed a methodology to enumerate
and geolocate Google’s serving infrastructure and executed a longitudinal measurement study
that captured Google’s expansion of serving infrastructure into end-user networks around the
world. Calder et al. exploited the EDNS-client-subnet extension to determine how the authoritative
nameserver for google.com would map clients in different /24 prefixes. By querying for all possible client prefixes, they were able to enumerate 1400 Google points of presence, many of which were
located within end-user ISP networks. In 2018 Böttger et al. [55] similarly developed a methodology
to enumerate Netflix’s OpenConnect serving infrastructure. In recent years, some CDNs have
publicly disclosed their network connectivity; for instance, Google maintains a webpage showing all
of their PoPs worldwide, including caching appliances hosted within end-user ISP networks [174].
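As a rough sketch of the query step behind this style of enumeration, the snippet below uses dnspython's EDNS client-subnet support to ask a resolver to answer as if the query originated from a chosen client prefix. The domain, resolver, and prefix are placeholders, and studies such as Calder et al.'s iterate over all client /24s and target the authoritative nameservers; this is not their exact methodology.

```python
import dns.edns
import dns.message
import dns.query
import dns.rdatatype


def resolve_with_client_subnet(qname: str, client_prefix: str, prefix_len: int,
                               resolver: str = "8.8.8.8") -> list[str]:
    # Attach an EDNS client-subnet option so the answer reflects the chosen
    # client prefix rather than the querying host's location.
    ecs = dns.edns.ECSOption(client_prefix, prefix_len)
    query = dns.message.make_query(qname, "A", use_edns=0, options=[ecs])
    response = dns.query.udp(query, resolver, timeout=2.0)
    return [item.address
            for rrset in response.answer if rrset.rdtype == dns.rdatatype.A
            for item in rrset]


if __name__ == "__main__":
    # Hypothetical example: which addresses are returned for a client prefix
    # located far from the querying host?
    print(resolve_with_client_subnet("www.google.com", "203.0.113.0", 24))
```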
7.2 Traffic Engineering
A large body of prior work has investigated how to improve Internet performance and availability
through the use of traffic engineering. In this section, we summarize this work along with tangential
work in the datacenter space.
7.2.1 Detour routing and overlay networks
Much of the earliest related work focused on detour routing and overlay networks, along with the
benefits of multihoming. In 1999, Savage et al. [373] used active measurements of “path quality”
taken between pairs of Internet hosts to compare the performance of the route used to connect
each pair against synthetic routes that could be constructed by circuits through other measurement
hosts. They found an alternate path with significantly superior quality in 30-80% of the cases,
depending on the metric used. In 2001, Andersen et al. [18] built upon the insights of Savage et al.
and introduced Resilient Overlay Network (RON), an architecture for constructing, maintaining,
and using overlay networks formed from a small group of nodes. RON continuously executed
active probing measurements between nodes to measure path performance and update overlay
routes. Around this time, Akamai, one of the first commercial CDN providers, incorporated overlay
networks into their design of SureRoute to bypass performance problems and outages [390, 461].
In 2004, Akella et al. [9, 10, 11] compared the performance benefits of overlay networks and
multihoming from the perspective of Akamai. They found that they could extract “good wide-area
performance” from multihoming alone, although they could not achieve all of the benefits of overlay
networks, particularly in terms of availability. More recently, in 2007 Duffield et al. [116] evaluated
the benefits of performance-aware routing from the perspective of a large Tier-1 ISP. They measured
performance of all available routes between 15 vantage points inside of the ISP's network and 738
DNS servers that were randomly selected from a larger corpus. Their analysis showed opportunities
to reduce loss (at least 2%) and delay (at least 20ms) for time periods that were long enough to be
acted on by a control system.
The Internet has evolved in the time since this work was conducted, and that evolution requires
that we consider the conclusions of this work in context. In particular, this early work investigated
opportunities to improve performance when the Internet was in its nascent stage, during which
content providers had not yet built out global points of presence, most traffic flowed through transit
providers, and peer-to-peer traffic made up a significant volume of global Internet traffic [370]. As
a result, much of this work focused on the performance and availability of routes that may not be
critical to Internet performance today — as discussed in Chapter 3, the vast majority of traffic on
the Internet today flows between users and a small number of content, cloud, and CDN networks.
Given this evolution and broader improvements in Internet reliability and performance, the
opportunities reported by this early work may no longer be actionable or relevant. For instance,
while early work proposed using sophisticated overlay networks to route around Internet problems
(e.g., RON in 2001), in 2013 Peter et al. [328] posited that many routing problems can be lessened
by tunneling to a network that offers reliable transit, and that the flattening of the Internet means
that one such tunnel is often enough to find a reliable path.
More recent work has focused on optimizing the performance of routes between users and the
networks that host popular content, along with the challenges that arise for such networks in today’s
flattened Internet. The rest of this section focuses on work in this setting.
7.2.2 Egress traffic engineering
Early work on traffic engineering discussed techniques for controlling how flows are routed through
an AS, such as for a transit AS. Feamster et al. [131] proposed a Routing Control Platform (RCP)
that centralized an AS’s routing decisions, decoupled the BGP decision process from routers, and
eliminated the need to maintain a full mesh of BGP adjacencies between every pair of routers in an AS. Caesar
et al. [64] describe a concrete RCP implementation that is often cited as an early example of SDN.
Although RCP is focused on intradomain routing, it has similarities to EDGE FABRIC in that it
decouples the BGP decision process from routers and has a controller receive route updates and
inject decisions using BGP. Van der Merwe et al. [447] and Verkaik et al. [449] proposed an
Intelligent Route Service Control Point that takes an explicit ranking of egress routers per destination
and dynamically updates intradomain routing decisions. Teixeira et al. [423, 425] discussed egress
traffic engineering in the context of “hot-potato routing”, in which one AS attempts to pass traffic to
another AS as quickly as possible.
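In hot-potato routing, the egress point with the lowest IGP cost from the traffic's ingress is chosen among otherwise equally preferred BGP routes, handing traffic off to the next AS as early as possible. The sketch below illustrates this selection with hypothetical egress routers and IGP costs.

```python
# Candidate egress routers for a destination, assumed to have otherwise
# equally preferred BGP routes (same local preference, AS path length, etc.).
# IGP costs are hypothetical values seen from the ingress router.
igp_cost_from_ingress = {"egress_nyc": 120, "egress_chicago": 40, "egress_seattle": 310}


def hot_potato_egress(candidates: dict[str, int]) -> str:
    """Pick the egress with the lowest IGP cost from the ingress point."""
    return min(candidates, key=candidates.get)


print(hot_potato_egress(igp_cost_from_ingress))  # egress_chicago
```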
Our work in Chapters 4 and 5 differs from this earlier work in that we do not focus on intradomain
routing. Instead, we focus on preventing congestion of interdomain interconnections at the edge of
Facebook’s network and opportunities for performance-aware routing. This difference in setting has
implications in controller design. For instance, while intradomain routing decision processes can
rely on measurements collected directly from devices to determine route conditions, EDGE FABRIC
and similar egress routing systems must continuously infer route conditions from measurements.
Espresso [469], Entact [474], and CASCARA [389] are similar in environment and goals to
EDGE FABRIC, in that all of these systems focus on control of interdomain routing and incorpo-
rate load and performance into their decision processes. Software Defined Internet Exchanges
(SDX) [186, 187] also provide some of EDGE FABRIC’s functionality to exchange participants
and may provide solutions to some of the challenges that EDGE FABRIC faces today in Internet
Exchange Points (§4.6.2).
Espresso. Espresso (Yap et al. [469]) is Google’s SDN-based system to control egress routing.
Espresso and EDGE FABRIC are both designed by huge content providers needing to overcome
challenges with BGP as they expand their PoP and peering footprint in the face of massive traffic
growth. They take a similar top-level approach, centralizing control of routing while retaining BGP
as the interface to peers. However, the two systems prioritize different tradeoffs in many other
important design decisions, presenting an interesting case study of how the ranking of priorities
can impact a design. Espresso uses a bespoke architecture to remove the need for BGP routers
that support full Internet routing tables, whereas EDGE FABRIC relies on BGP and vendor BGP
routers to build on existing experience and systems. EDGE FABRIC restricts the size of its multiple
routing tables by isolating PoPs, such that the number of prefixes carrying user traffic per PoP is low
(Figure 4.3). Whereas Facebook achieves simplicity by isolating prefix announcements, ingress,
egress, and control to individual PoPs, Espresso uses a single global controller and can route traffic
across the WAN to egress at distant PoPs, providing flexibility. EDGE FABRIC’s controller pushes
its egress decisions only to peering routers, allowing us to isolate Facebook’s hosts from network
state. Espresso, on the other hand, pushes routing decisions to hosts, maximizing flexibility but
requiring a more sophisticated controller architecture and the continuous synchronization of routing
state at hosts to prevent blackholing of traffic. Section 4.6.1 discusses these tradeoffs (relative to
our goals) in more detail, based on our past experience with routing egress traffic between PoPs
and with host-based routing (earlier EDGE FABRIC designs that were more similar to Espresso).
Espresso includes approaches for mitigating some of the challenges that section describes.
Espresso discusses incorporating performance measurements into routing decisions as a goal and
notes that a smoothed aggregation of "bandwidth, goodput, RTT, retransmits and queries-per-second
reports" is incorporated into the decision process. However, Espresso does not define a concrete
methodology for capturing and converting these metrics into decisions, nor does it evaluate how often
incorporating performance measurements into its decision process yields benefit. In Section 5.2.2.2,
we describe how we used footholds we introduced in EDGE FABRIC’s design (§4.5.1) to build
ROUTEPERF, a system that steers a fraction of Facebook’s production traffic over alternate egress
routes. In Section 5.3 we discuss how we capture latency and goodput from this traffic along with
a principled approach to incorporating them into a decision process. Finally, in Section 5.6, we
conduct a global analysis and find that performance-aware routing provides limited benefit, and
discuss how simply shifting the traffic to the best path based on performance measurements can
result in congestion and oscillations.
Entact. Entact (Zhang et al. [474]) overrides BGP’s default routing decisions through a well-
designed approach that balances performance, load, and cost, evaluating the approach via emulation.
Similar to Facebook’s ROUTEPERF, Entact directs some traffic to alternate paths to measure their
performance. We build on this idea, working through the details of deploying such a system in
production, at scale, in a way that applies to all our services and users. For example, while Entact
measured alternate path performance by injecting override routes for individual IP addresses within
prefixes, ROUTEPERF assigns flows at random to alternate paths, guarding against cases in which
addresses within the same BGP prefix experience different performance (see Figure 5.6 for an
example). In addition, because EDGE FABRIC’s design supports making decisions on a per-flow
basis, it can also support application-specific routing. Entact uses active measurements (ICMP
pings) to measure path performance, but is unable to find responsive addresses in many prefixes
and so can only make decisions for 26% of MSN traffic. The need for responsive addresses also limits
the number of alternate paths that Entact can measure in parallel and keeps it from increasing the
granularity of its decisions by de-aggregating prefixes (both would require finding more responsive
addresses). These atomic assignments may not be a good approximation of Entact’s optimal traffic
assignments, which assume a provider can split traffic to a prefix arbitrarily across multiple paths.
By applying an approach similar to Entact’s but based on passive measurement of production traffic,
we build ROUTEPERF and server-side measurement infrastructure to collect latency and goodput
measurements that cover and are representative of our entire user base, can split prefixes to increase
decision granularity, and can use as many paths in parallel as our peering routers can support. Finally,
we expose challenges that arise in practice (§5.6.1.2), including the potential for oscillations, that
Entact noted could occur, but could not evaluate in emulation (Section 2.5.3 discusses how
emulation limits the fidelity of experimentation).
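To make the flow-level random assignment concrete, the sketch below hashes a flow identifier to deterministically place a small fraction of flows onto each alternate path. The path labels, fractions, and hashing scheme are illustrative assumptions, not the production ROUTEPERF logic.

```python
import hashlib

# Hypothetical alternate paths and the fraction of flows to steer onto each.
ALTERNATE_PATHS = {"transit_A": 0.005, "transit_B": 0.005}  # 0.5% of flows each
PRIMARY_PATH = "bgp_best_path"


def assign_path(src_ip: str, src_port: int, dst_ip: str, dst_port: int) -> str:
    """Deterministically map a flow to a bucket in [0, 1) and pick its path."""
    key = f"{src_ip}:{src_port}->{dst_ip}:{dst_port}".encode()
    bucket = int.from_bytes(hashlib.sha256(key).digest()[:8], "big") / 2**64
    threshold = 0.0
    for path, fraction in ALTERNATE_PATHS.items():
        threshold += fraction
        if bucket < threshold:
            return path  # this flow is measured on an alternate path
    return PRIMARY_PATH  # the vast majority of flows stay on the default route


# Example: most flows map to the primary path, a few to the alternates.
print(assign_path("198.51.100.7", 443, "203.0.113.9", 51812))
```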
CASCARA. CASCARA (Singh et al. [389]) performs joint optimization of both cost and perfor-
mance to determine which egress route option to select for Microsoft Azure’s cloud traffic. The
work focuses on how to enable this joint optimization, along with the generalizability of the results.
Commercial appliances. Commercial route optimization appliances have existed since the early
days of the Internet. These appliances include Internap’s Managed Internet Route Optimizer (MIRO)
and Flow Control Platform (FCP) [211, 394], RouteScience’s PathControl [173, 175], and Noction’s
Intelligent Routing Platform (IRP) [313]. Route optimization appliances have been used by content
and hosting providers. For instance, press releases indicate that Google was using RouteScience's
PathControl technology in 2002 [173, 175], while in 2008 SoftLayer was using Internap’s FCP [394].
Route optimization appliances are commonly described as enabling performance and capacity-
aware routing [173, 175, 313, 394], but few details are disclosed about how they operate. Documen-
tation for Noction’s IRP [122, 314] indicates that it uses traffic flow data reported by edge routers
via IPFIX or sFlow to determine the top destinations of production traffic, and then measures propa-
gation delay and loss to these destinations for each available route using ICMP probes. As discussed
earlier in this section, using active measurements with ICMP probes requires finding representative
endpoints that are responsive to probing traffic, and ICMP traffic may be prioritized differently and
thus not representative. In addition, using active measurements likely reduces the appliance’s ability
to detect changes in network conditions at short time-intervals, and also reduces the appliance’s
sensitivity to low rates of loss that can impact the behavior of congestion control algorithms for
production traffic.² Finally, we suspect that such appliances are designed for environments where the volume of traffic shifted in response to performance measurements is low, and thus unlikely to cause downstream congestion. As a result, these appliances alone are insufficient for environments like Facebook's.

² Noction's IRP uses five ICMP probes to measure a route's performance, and only executes additional measurements if those probes indicate discrepancies [122, 314]. This low number of probes inherently reduces the appliance's ability to detect low rates of loss along a path.
Software Defined Internet Exchanges (SDX). In Section 2.2.2 we discussed how route servers
(RS) at public IXPs enable mass interconnection among participants without requiring each pair of
participants to establish a bilateral BGP session. Instead, participants establish a BGP session with
an RS and announce routes via this session. The RS performs BGP best path computation over all routes received and announces to all participants a path for each destination.³

³ Some route servers support control over filtering and route propagation via BGP communities, RPSL attributes, and other mechanisms [17]. For instance, a participant can choose to not receive routes from a specific participant or to only announce a route to a specific participant. However, because the route server still calculates a single best path per destination (a limitation of traditional BGP, similar to the challenges encountered in the development of PEERING, §6.2.2), such controls only determine to which participants the route server announces its selected route.
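To make the single-best-path behavior concrete, the toy sketch below mimics a route server that runs a simplified decision process over all routes received for a destination and announces only the winner to every participant. The attributes and tie-breakers are heavily simplified relative to real BGP, and the participant names and AS numbers are hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class Route:
    prefix: str
    from_participant: str
    local_pref: int = 100
    as_path: list[str] = field(default_factory=list)


def best_path(routes: list[Route]) -> Route:
    """Simplified BGP decision: highest local preference, then shortest AS path,
    then lowest announcing participant as a deterministic tie-breaker."""
    return min(routes, key=lambda r: (-r.local_pref, len(r.as_path), r.from_participant))


# Every participant receives the same single winner per prefix, regardless of
# which alternative each of them might individually have preferred.
received = [
    Route("203.0.113.0/24", "participant_A", as_path=["64500", "64511"]),
    Route("203.0.113.0/24", "participant_B", as_path=["64502"]),
]
print(best_path(received).from_participant)  # participant_B
```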
Software Defined Internet Exchanges (SDX) (Gupta et al. [186, 187]) extend the route server
concept, removing limitations and offloading more of the decisions that would typically be made
on a participant’s router. For instance, SDX participants can define sophisticated routing policies,
such as application-specific routing, that the SDX enacts. A participant’s policy can specify that
the exchange should route traffic for application A (identified based on port numbers or some other
supported discriminator), with destination D, via the route provided by exchange participant P. Such
policies are possible because, unlike traditional RS architectures, an SDX also exercises control over the
data-plane of the shared fabric interconnecting participants. However, this control of the data-plane
is likely to make it difficult for the networks most likely to benefit from SDX capabilities to be able
to incorporate SDX into their egress routing.
First, an SDX can only enact policies on traffic that traverses the exchange fabric. In addition
to having connections to the exchange fabric, CDNs will commonly maintain Private Network
Interconnections (PNI) with peers and transits at the same point of presence (§§ 2.2 and 4.2) and are
unlikely to replace such PNI with exchange connections due to concerns around cost, redundancy,
control, and visibility into demand/capacity (§4.6.2). As a result, the CDN router cannot simply
offload all routing decisions to the SDX; routing decisions — whether defined via static policy, dynamically injected by a system such as EDGE FABRIC, or encoded in the packet header by
hosts through the use of MPLS and GRE tunnels à la Espresso [469] — must still be enacted at the
CDN’s router so that when the router receives a packet, it knows how to forward it (e.g., via a PNI
or to the SDX). This creates additional overhead: a CDN would need to manage its egress routing
policy and the routing policy configured within the SDX and keep the two policies synchronized.
Second, given that CDNs are unlikely to offload all routing decisions to the SDX, and given
that CDNs are likely to need control systems like EDGE FABRIC to prevent congestion at the edge
of their networks, it is unclear if SDX policies would provide value. For instance, a CDN could
instead use its own control system to perform application-specific routing (e.g., §4.5). Likewise,
the value of application-specific routing policies is unclear. For instance, our results in Chapter 5
show that there are limited opportunities for performance-aware routing, and a sizeable portion of
the opportunities that do exist are brief and sporadic. Further, we found that blindly shifting all
traffic to the best path can result in oscillations. Thus, defining SDX policies to route via the best
performing path would still require a network operator to continuously measure the performance
of all routes and to build a control loop that would slowly shift traffic by updating SDX policies.
The only piece that could be offloaded to SDX — assuming that all of the relevant traffic already
traversed the SDX — is enacting the routing decisions.
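As a rough illustration of the kind of control loop such an operator would need, the sketch below shifts traffic toward a better-performing route only gradually and with hysteresis, rather than moving everything at once. The step size and threshold are arbitrary assumptions, not values used by EDGE FABRIC or any SDX deployment.

```python
SHIFT_STEP = 0.05       # move at most 5% of traffic per control interval
IMPROVEMENT_MS = 5.0    # require a meaningful latency gap before acting (hysteresis)


def update_split(split_to_alt: float, primary_rtt_ms: float, alt_rtt_ms: float) -> float:
    """Return the new fraction of traffic sent via the alternate route.

    Traffic moves only a small step per interval, and only when one route is
    clearly better, so that measurements taken after each shift can reveal
    whether the alternate route is becoming congested before more traffic moves.
    """
    if alt_rtt_ms + IMPROVEMENT_MS < primary_rtt_ms:
        split_to_alt = min(1.0, split_to_alt + SHIFT_STEP)
    elif primary_rtt_ms + IMPROVEMENT_MS < alt_rtt_ms:
        split_to_alt = max(0.0, split_to_alt - SHIFT_STEP)
    return split_to_alt  # otherwise hold, avoiding oscillation on measurement noise


# Example: starting at 0%, a persistent 20 ms advantage moves traffic slowly.
split = 0.0
for rtt_primary, rtt_alt in [(60, 40), (60, 40), (58, 41)]:
    split = update_split(split, rtt_primary, rtt_alt)
print(split)  # 0.15 after three intervals
```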
7.2.3 Ingress traffic engineering
In Chapter 3, we discussed how CDNs are building out PoPs around the world to reduce the distance
between users and content. CDNs have a variety of techniques to direct user traffic to PoPs, including
Anycast, controlling DNS responses, and rewriting URLs (§2.3.2). The latter two techniques require
building a mapping between LDNS and/or client IP addresses and the optimal PoP through a separate
measurement system (§2.3.2, [35, 88]). In most implementations of these two techniques, this
mapping determines the PoP that traffic will be terminated at. Our work in Chapters 4 and 5 is
complementary in that it focuses on mapping traffic to, and quantifying the performance of, egress
routes from a PoP to users, and identifying opportunities to improve performance through changes
to traditionally simple egress routing policies.
Prior work has focused on challenges related to Anycast and understanding how CDNs make
and enact mapping decisions from an outsider’s perspective. More relevant to our work in Chapters 4
and 5 is work that has focused on the problem from a CDN’s perspective, and in particular on
transforming performance measurements into mapping decisions.
Anycast stability and load-shedding. A large body of prior work has focused on Anycast, in-
cluding potential disruptions caused by route changes and performance relative to other options. In
2006 Levine et al. [262] examined the prevalence of route changes disrupting Anycast connections
and found that such disruptions were rare. In 2009 Al-Qudah et al. [340] examined how TCP and
application-layer protocols behave in the case of a routing change causing an Anycast connection to
be suddenly terminated. They found that application-layer retries can enable graceful degradation
in most cases, with TCP tuning and more aggressive closing of dormant connections providing
additional benefit. In 2015 Flavel et al. [140] developed a technique to enable CDNs to use Anycast
while also maintaining control over the amount of load served by a PoP (a challenge given the
limitations of BGP, §2.4.2). Their approach involves a layered model in which nodes at each layer
redirect load inwards as needed to handle excess load. The redirection technique relies on the
colocation of authoritative DNS servers at each PoP and on DNS requests from users being served
by the same PoP that Anycast directs the user’s other traffic to, an invariant the authors show is
typically true.
Exploring mapping decisions from an outsider’s perspective. Fan et al. [127] examined how
the mapping between user and PoP changes over time for Google and Akamai. They found that
50-70% of prefixes switch between PoPs that are distant from each other, and that these shifts can
result in large increases in latency (e.g., 100+ms). It is possible that such changes occur due to
capacity limitations (e.g., insufficient compute, egress capacity, etc.) at the preferred PoP. Adhikari
et al. [3] and Torres et al. [437] analyzed how YouTube directs video requests to PoPs from different
perspectives. Adhikari et al. focused on how YouTube’s architecture provides control over which
PoP traffic is served from. Torres et al. found that latency between user and PoP plays a role in the
decision process, but that other factors, such as content popularity, are also considered.
Mapping and routing decisions from a CDN’s perspective. Most relevant to our work in Chap-
ters 4 and 5 is prior work that has focused on measuring the performance implications of different
mapping and routing decisions, including techniques and metrics for quantifying and comparing
each option’s performance.
In 2015 Chen et al. [88] discussed the performance improvements observed by Akamai after
rolling out the EDNS0 client-subnet extension [106]. The EDNS0 client-subnet extension enables
Akamai’s authoritative DNS server to make decisions based on the client’s IP address, rather than
the IP address of the client’s LDNS server. Akamai reported a 2x decrease in RTT and a 30%
improvement in time to first byte.
In 2015 Calder et al. [71] compared the performance of Anycast and DNS redirection for
directing requests to bing.com. Similar to how our work in Chapter 5 compared the performance
of the primary and alternate paths, Calder et al. compared the performance of the primary path (as
chosen by Anycast) against the performance of three other PoPs chosen based on LDNS information.
Upon search results being displayed on bing.com, the user's browser would fetch an object from
the selected CDN endpoints and track how long it took to retrieve it. Calder et al. found that
on average, 19% of prefixes observed better performance for requests directed to a different PoP
than the one used by Anycast. However, these improvements were typically minor: while 12% of
clients observed 10ms or more improvement, only 4% of clients saw 50ms or more improvement. In
addition, temporal analysis showed that instances of the Anycast endpoint having worse performance
were typically sporadic, potentially due to transient events such as failures and/or route changes in
the end-user ISP’s network.
In 2018 Calder et al. [72] discussed the design of Odin, Microsoft’s CDN measurement system.
Odin has similar goals to our work in Chapter 5 in that it seeks to quantify user-perceived performance for each prefix at short timescales to detect instances of degradation, and to facilitate A/B tests.
However, while Chapter 5 focuses on the performance of egress routes and comparisons between
them, Odin focuses on the performance between users and different PoPs, along with comparisons
between PoPs. Measurements collected by Odin are also used as input to Microsoft’s Azure
DNS redirection service, to investigate instances of poor Anycast routing, and to monitor service
availability. Calder et al. discussed problems similar to those raised in Chapter 5, including the
coverage and representativeness of Layer 3 measurements such as ICMP pings. However, the
approach used by Odin differs in that it relies on active measurements executed from the client side
application, instead of using existing production traffic to measure and characterize performance.
Part of this difference in approach is required given the goal of measuring the performance of
different PoPs: the client must initiate connections to different PoPs for this to occur. However,
the use of active measurement traffic — instead of redirected production traffic — is described as
necessary given the risks of redirecting production enterprise traffic.
PECAN (Valancius et al. [446]) proposed joint optimization of the point of presence that
users are routed to and the interconnections used at that PoP. Specifically, PECAN hosted content
replicas at five TransitPortal points of presence⁴ and used BGP AS_PATH poisoning to influence how other networks routed traffic to prefixes advertised by the evaluation AS.⁵
PECAN measured
performance for each combination of (client, PoP, ingress route) and compared latency and goodput
for the best combination against the performance that would have been achieved without control
of ingress routes. The results showed that joint optimization of PoP and ingress route yielded
an additional 4.3% reduction in latency. Chapter 5 similarly examines the potential to improve
Facebook user performance by optimizing egress route selection, although we do not investigate
joint optimization of PoP and route selection because it is outside of the scope of EDGE FABRIC’s
control. Unlike PECAN, we find little opportunity to improve performance through changes in
Facebook’s egress routing (§5.6). This difference in results may be in part because Facebook PoPs
are typically physically close to the users they serve and route the majority of their traffic through
peering interconnections with end-user ASes. This setting likely reduces the potential benefits of
performance-aware routing, given that peering routes often provide equal or better performance
than transit routes (§5.6.2). In comparison, because PECAN clients were PlanetLab nodes spread
around the world, the paths between PECAN clients and the five TransitPortal points of presence
(all of which were located in the United States) were unlikely to be the short, direct paths that
Facebook often has, and thus had a higher probability of traversing a transit provider. In addition to differences in results, Chapters 4 and 5 also focus on addressing measurement challenges that arise in production systems, and investigate potential benefits for performance-aware routing at finer granularity, including temporal aspects.

⁴ An early version of PEERING, known as TransitPortal [444], was used to evaluate PECAN's ability to improve performance on the Internet; PEERING is discussed in Chapter 6 and TransitPortal is discussed in Section 7.5.

⁵ PECAN's design also considered optimization of egress routes and other approaches to steer ingress routing, such as AS_PATH prepending. However, these aspects could not be evaluated because TransitPortal did not support control of egress routing and did not have points of presence with multiple peers. We have addressed these limitations in our design and implementation of PEERING (§§ 6.2.1, 6.2.2, 6.3.2 and 6.4.2).
7.2.4 WAN traffic engineering
Production traffic engineering systems are commonly found managing datacenter and WAN traffic
inside of large networks. These settings naturally provide the requisite visibility and administrative
control that is necessary for such controllers.
Footprint (Liu et al. [269]) is a WAN traffic engineering system that jointly optimizes assignment
of users to edge proxies, proxies to datacenters, and the interconnecting WAN routes. Footprint is
designed to account for the impact of session stickiness, in which sessions remain assigned to the same (edge proxy, datacenter) pairing for their lifetime. Due to this stickiness, Footprint's decision
process must account for session decay times. Prototype evaluation of Footprint in Microsoft’s
network showed that joint optimization prevents congestion that occurred when decisions were
made by independent processes that did not consider the impact of their decisions on WAN load or
application latency. As a result, Footprint is shown to reduce the latency experienced by user traffic
and improve efficiency. While Footprint does not focus on preventing congestion on interconnections
between Microsoft and other ASes, it does show how aspects of application design, such as stickiness,
can have significant impact on control system design. In addition, it discusses challenges around
maintaining state and multiple control loops that are tangential to the discussions in Chapter 4 about
EDGE FABRIC’s design, including stateful vs. stateless control loops (§4.6) and the challenges of
multiple controllers making decisions independently (§5.6.1.2).
B4 (Jain et al. [221]) and SWAN (Hong et al. [202]) centralize control of inter-datacenter
networks to maximize utilization without hurting performance of high-priority traffic. EDGE
FABRIC has a similar goal, and it also uses centralized control. However, the difference in the
setting introduces new challenges. In particular, B4 and SWAN operate in a closed environment
in which they have complete visibility into network conditions, including link utilization, failures,
and properties of the underlying physical topology that impact performance. In addition, all hosts
and network devices are under unified administration, and the majority of the traffic can tolerate
delay and loss. In contrast, EDGE FABRIC controls egress traffic to networks and users outside
its control, and can only gain visibility into conditions beyond the edge of Facebook’s network
through measurements (Chapter 5). Further, much of the traffic has performance constraints that
make packet loss untenable; for instance, adaptive bitrate video has soft latency demands far beyond those of the elastic traffic on inter-datacenter WANs.
7.3 Characterizations of Internet Connectivity and Performance
7.3.1 Interconnection congestion
A 2014 Measurement-Lab (M-Lab) report concluded that diurnal performance degradation of several
access networks was caused by congested interconnections between user ISPs and major transit
providers [213]. M-Lab’s dataset is composed of Network Diagnostic Test (NDT) measurements
executed ad-hoc by users around the world. Among other things, these tests execute a speedtest
during which the user connects to an M-Lab server with transit connectivity through a Tier-1 provider
such as Level(3). The report showed instances of median download speed dropping for tests between
users in specific metropolitan areas, such as New York City, and M-Lab servers connected to specific
transit providers.
Sundaresan et al. [417] examine the conclusions of the M-Lab report; critically, they estimate that
the M-Lab’s infrastructure does not have visibility into between 79% and 90% of AS interconnections
in the United States that popular content traverses. This insight calls into question the impact of
the observed degradation on the average user’s Internet experience. In addition, this insight aligns
with our findings in Chapters 3 and 4 that a significant volume of Internet traffic traverses peering
interconnections, and not the transit interconnections that M-Lab’s study examined. In addition,
Sundaresan et al. discuss how the earlier report's use of crowdsourced measurements may have
resulted in erroneous conclusions. First, users with connection problems may be more likely to
execute NDT tests, introducing confounding factors and the potential for the dataset to not be
representative of typical user experience in an AS. Second, the M-Lab dataset in 2014 contained
only 40 thousand measurements per day.⁶
This volume of samples is insufficient to execute
both spatial and temporal analysis, and is one of the reasons the 2014 report relied on samples
aggregated per (AS, M-Lab cluster) grouping. Instances of interdomain congestion often show
regional effects [229], but such effects may have been hidden at these aggregations. Finally,
Sundaresan et al. noted that it is difficult to define a threshold on throughput degradation that can be
used to detect interdomain congestion; many networks show such variations in the M-Lab dataset
and it is unclear whether they signify interdomain congestion, congestion within the ISP’s network,
congestion at the user’s access link, or are the result of some other confounding factor(s).
Many of the concerns raised by Sundaresan et al. can be attributed to the use of crowdsourced
measurements in the M-Lab dataset. We sidestep these issues in our analysis of Internet performance
from Facebook’s edge (Chapter 5) by sampling connections at random from production traffic.
Using production traffic avoids confounding factors that may have biased M-Lab’s dataset and
enables us to collect the volume of samples required to perform temporal and spatial analysis. In
addition, while the M-Lab report looked at changes in speedtest throughput in isolation, our analysis
estimates the impact of such changes on Facebook users’ ability to stream HD videos.
Feamster [130] used measurements collected from the peering routers of seven large end-user
ISPs in the United States (Bright House Networks, Comcast, Cox, Mediacom, Midco, Suddenlink,
and Time Warner Cable) to evaluate the prevalence of congestion on the interconnections between
content providers and end-user ISPs. From their analysis of measurements captured from October
2015 to February 2016, they found less than 10% of links exceeded 90% utilization, and less than 4%
of links exceeded 95% utilization. In addition, they found that aggregate interconnection utilization
was approximately 50% at peak. Their finding that a small number of interfaces experience high
utilization aligns with our analysis in Chapter 4, in which we evaluated the prevalence of congestion
on interconnections at the edge of Facebook’s network. However, their methodology also illustrates
how evaluating the prevalence of interconnection congestion can be challenging — even with
visibility into peering router link utilizations. First, the dataset used by Feamster is composed
exclusively of United States ISPs and thus the results may not hold for other countries. Second,
if traffic engineering systems like EDGE FABRIC intervened to prevent congestion, then a link
that would otherwise have insufficient capacity may never appear congested in the dataset used
by Feamster, leading to erroneous conclusions. Third, as discussed in Chapter 3, the majority of
traffic on today’s Internet is between end-users and a small set of content providers. Due to this,
the majority of interconnections may be over-provisioned (and thus never congested) but this does
not necessarily mean that the small set of interconnections carrying the majority of traffic do not
become congested.⁷
Feamster’s finding that 10% of overall interconnect capacity experienced a
95th percentile peak utilization exceeding 95% appears to show that links with higher capacities
(and thus carrying more traffic) are more likely to become congested.
⁷ As an example, Richter et al. [350] found that at a large IXP 66% of links contributed less than 0.1% of traffic.
7.3.2 Performance by route type
Peer vs. transit. Ahmed et al. [8] compared the performance of routes via peering and transit
interconnections from the vantage point of a global CDN provider. The characterization was
performed in an environment similar to Facebook’s, with each point of presence having up to four
routes to a destination, with three via transit interconnections and one via a peering interconnection.
Servers selected the route via which traffic egressed using a labeling scheme. Performance was assessed
via JavaScript executed in the client’s browser that fetched a small object (less than one packet
worth of data) from different endpoints, each associated with a specific interconnection. Their
analysis concludes that peering interconnections outperform transit interconnections for 91% of
ASes. Our results in Chapter 5 are similar, although they are weighted by volume of traffic. We
find that peering interconnections outperform transit interconnections for 40% of traffic, and
are statistically indistinguishable for 50% of traffic (§5.6.2). Ahmed et al. attempt to separate
propagation delay from queuing delay based on temporal analysis. However, their analysis is limited
by the number of measurements in their dataset (1 million over a 22 month period); they must
aggregate measurements to the AS level and use the minimum latency observed over a day to
estimate the propagation delay. We found in Chapter 5 that such aggregations can be misleading:
Figure 5.6 showed an example of how temporal changes in user population can result in misleading
performance measurements at even finer resolutions. In addition, prior work [229] also notes that instances of interdomain congestion often show regional effects; such regional differences may be hidden or result in erroneous conclusions when AS-level aggregation is employed.
In 2018, Wohlfart et al. [461] analyzed the latency and goodput observed for client connections to
Akamai servers for a handful of ISPs. They group observations based on the route used to serve
the traffic: on-net (when the Akamai server is hosted within the end-user ISP), private, public, and
transit. In general, they conclude that traffic served from servers via an on-net or private route
typically experiences better performance, although transit provides better performance for one ISP.
However, it is difficult to draw conclusions from this analysis due to the methodology employed.
First, Wohlfart et al. did not exert any control over routing during experiments. Instead, they relied
on traffic being served — by chance — by a point of presence where an alternate route type would
be used. This methodology does not enable accurate comparison of performance by route type given
that such samples may show performance differences for other reasons (e.g., the location of the PoP
relative to the user). In comparison, Ahmed et al. [8] and Chapter 5 measured route performance by
explicitly assigning traffic at a given point of presence to egress via the routes to be compared.
Second, because Wohlfart et al. measure goodput at the application layer, their results are subject
to many of the confounding factors discussed in Chapter 5. For instance, the differences in throughput
observed between on-peak and off-peak may not be caused by congestion as hypothesized, but
instead be due to temporal variations in the response size distribution.
Backbone vs. Internet. Google Cloud Platform (GCP), Amazon Web Services (AWS), and a
handful of other cloud providers allow customers to select between two tiers of network service:
premium,⁸ in which the cloud provider uses their WAN/backbone to backhaul customer traffic to the optimal interconnection point, and basic, in which customer traffic is backhauled to the closest available interconnection point. Cloud providers pitch the premium tier as offering performance benefits
and correspondingly charge a higher price. Arnold et al. [24] evaluated how performance varied
between these two tiers of service for GCP and AWS by executing ICMP pings from Speedchecker (a distributed measurement platform, §2.5) vantage points (VP) around the world to multiple GCP and AWS virtual machines (VM), with each VM using either the basic or premium network tier,
and executing bidirectional traceroutes for each (VP, VM) pairing. Performance was assessed based
on ICMP latency measurements collected and aggregated over a multi-month period, and thus did not incorporate the impact of loss or temporal effects.⁹ In Section 5.3.2 we discuss how it is possible for a route with lower latency to have worse performance due to higher loss, and how performance can change over time due to transient events and changes in load.

⁸ Google Cloud Platform uses the terminology Premium Tier; Amazon Web Services refers to this service as AWS Global Accelerator.

⁹ The authors note that they executed HTTP GET measurements, but they are not used in the paper's analysis.
For GCP, Arnold et al. confirm that premium tier traffic is significantly more likely to remain on
Google’s network. In addition, they find that premium traffic is more likely to traverse a short, direct
path into the network hosting the VP, while basic tier traffic is more likely to traverse a route via a
transit provider. Performance implications were mixed: 48% of VP saw a reduction in latency with
the premium tier, 43% had statistically indistinguishable performance, and 9% had lower latency on
the basic tier. The authors further found that the potential for performance benefits of the premium
tier grew as the distance between the VP and VM increased.
In instances of the basic tier providing better performance, Arnold et al. found that GCP’s
premium tier appeared to prefer routes announced via peering interconnections and/or with shorter
AS paths (both common preferences in routing policies, as discussed in Section 2.3.1) over routes
announced by global transit providers and employed by the basic tier. However, in these instances
the premium tier routes had higher latency due to circuitous routing inside of Google's and others'
networks. These instances of degradation on the premium tier are surprising given that Google’s
egress route controller Espresso [469] is capable of incorporating performance information into its
routing decisions.
While existing work has offered limited details into precisely how Espresso measures and
incorporates performance into its decision process, we speculate that it may have been unable to do
so in these instances due to insufficient measurements and may have instead made decisions based
on other criteria, including longitudinal measurements and/or heuristics such as preferring
peers over transits to avoid congestion (§§ 2.3.1, 4.3.1 and 5.6.1.2). Specifically, instances of the
basic tier outperforming the premium tier reported by Arnold et al. often occurred when the VM and
VP were far away from each other (e.g., a VP in Asia and a VM in Europe). Given that web services
are often colocated close to users, there may be minimal traffic traversing such routes. If Google’s
Espresso relies on measurements from production traffic to evaluate route performance like EDGE
FABRIC (Chapter 5), or if it performs active measurements only for routes with high volumes of
traffic, then it may not have the measurements required to incorporate performance into its routing
decisions — especially in real-time.
Arnold et al.’s analysis is related to our work in Chapter 5 in that it examines how a network’s
routing decisions impact performance, and further shows the value of peering widely and exchanging
traffic in the closest available PoP. However, because Facebook already (1) interconnects widely,
(2) serves traffic from points of presence located around the world, and (3) steers user traffic to
the closest, best performing PoP, Facebook cannot further improve performance by routing traffic
across its backbone. Arnold et al. also found that performance differences are less likely to be
observed when the client is close to the serving infrastructure; this aligns with our finding of minimal
performance difference between peering and transit routes for Facebook users.
7.3.3 Identifying and debugging circuitous routing
Prior work has explored how suboptimal routing decisions and the locations of interconnections can
degrade Internet performance. As the Internet has flattened, the locality of popular Internet content
has increased and in turn the potential for such problems has decreased. However, in Chapter 5, we
observed that such problems can still degrade performance in the long-tail, such as in emerging
markets.
In 2009 Krishnan et al. [243] investigated why clients directed to a nearby Google frontend had higher than expected latency.¹⁰
From their analysis of RTT samples, geolocation information, and
route data, they concluded that many clients experience latency overhead for two reasons: (1)
inefficient routes to the nearby frontend, and (2) queuing of packets. They developed WhyHigh, a
tool to debug such issues, and identified four main culprits for higher latency: (1) lack of peering;
(2) routing misconfiguration; (3) traffic engineering; and (4) queuing related to congestion, caused
by insufficient capacity. Similarly, Zarifis et al. [472] evaluated the prevalence of path inflation
for mobile Internet traffic for the four major US cellular carriers. Their analysis, executed on
a dataset captured during 2011-12, revealed instances of circuitous routing caused by routing
policies and insufficient interconnection points between cellular carriers and content/CDN networks.
These circuitous routes increased page load times for webpages like Google.com by hundreds of
milliseconds.
Gupta et al. [185] examined ISP interconnectivity in Africa in 2014. They executed bi-directional
traceroutes between Measurement Lab servers deployed in South Africa, Kenya, and Tunisia, and 17
BISMark routers [415] hosted in 7 different ISPs and physically located across 9 provinces in South
Africa. Their analysis found that this traffic — which should remain in-continent — often traversed
circuitous routes through Europe. Using information from BGP collectors, PeeringDB [273, 327],
and additional BISMark traceroutes, they determined that these circuitous routes occur because many
ASes in Africa do not peer with each other anywhere on the continent. Some local ASes had no presence
in IXPs, while others had a presence but did not interconnect. The nascent stage of interconnectivity
in Africa aligns with our observations in Chapter 5, in which we find that the distribution of latency
between Facebook's edge and users in Africa skews higher than in other continents. However, while we
find that there is little difference between peering and transit performance for the vast majority of
Facebook traffic, Gupta et al.’s findings show that a lack of peering connectivity can have significant
implications in some scenarios, particularly in emerging markets.
7.4 Measuring Goodput
In Chapter 5, one of our goals is to evaluate whether sessions between Facebook and a group
of clients can support HD video. Specifically, we want to evaluate a session’s ability to support
2.5 Mbps goodput, the minimum goodput required to stream HD video without stalls or buffering.
Evaluating this from existing production traffic requires distinguishing between goodput restricted
by network conditions (which we want to measure) and goodput “only” restricted by sender behavior.
Our solution, discussed in Section 5.3.2, relies on socket timestamps to capture when bytes are sent
by the server and ACKed by the client and modeling of CWND growth under ideal circumstances to
determine when a session has an opportunity to send at a rate of 2.5 Mbps.
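To make the distinction concrete, the minimal sketch below illustrates the two signals in simplified form: an achieved goodput computed from hypothetical send and ACK timestamps, and a crude stand-in for the opportunity check that assumes ideal slow-start growth from an assumed 10-segment initial window. It is an illustration of the idea, not the model used in Section 5.3.2.

```python
HD_GOODPUT_BPS = 2.5e6  # 2.5 Mbps target for HD video

def achieved_goodput_bps(bytes_acked, t_first_send, t_last_ack):
    """Goodput actually delivered: ACKed payload bytes over elapsed time."""
    elapsed = t_last_ack - t_first_send
    return 8 * bytes_acked / elapsed if elapsed > 0 else 0.0

def had_opportunity(response_bytes, min_rtt_s, init_cwnd_bytes=10 * 1460):
    """Crude stand-in (an assumption for illustration, not the actual model):
    walk ideal slow start (window doubles each RTT, no loss) and report whether,
    at some point, the window supported the target rate while enough unsent
    bytes remained to actually test it."""
    target_window = HD_GOODPUT_BPS * min_rtt_s / 8  # bytes in flight needed per RTT
    cwnd, remaining = init_cwnd_bytes, response_bytes
    while remaining > 0:
        if cwnd >= target_window and remaining >= target_window:
            return True   # the ideal sender could have demonstrated 2.5 Mbps
        remaining -= min(cwnd, remaining)
        cwnd *= 2         # ideal slow-start growth
    return False          # response too small to ever test the target rate
```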
Given the intricacies of our approach and the wide body of existing techniques for measuring
goodput and related metrics, it is reasonable to ask why an existing technique could not be applied.
In this section, we examine existing techniques and explain why they are ill-suited given our goal
and setting.
7.4.1 Using models to estimate the goodput a session can support
Prior work has developed models that can be used to predict the goodput that would be achieved
by a given congestion control algorithm under specified network conditions. Mathis et al. [294]
proposed a model to estimate goodput for the TCP Tahoe [215] and TCP Reno [52] congestion
control algorithms:
\[
\mathrm{Throughput} = \frac{MSS \cdot C}{RTT \cdot \sqrt{p}} \tag{7.1}
\]
where MSS is the maximum segment size, C is a constant that incorporates elements of the congestion
control algorithm’s behavior, and p is the number of congestion signals per acknowledged packet
(e.g., the probability of the congestion control algorithm changing behavior in response to perceived
congestion).¹¹
The equation can be further simplified to an upper bound as:
\[
\mathrm{Throughput} < \frac{MSS}{RTT \cdot \sqrt{p}} \tag{7.2}
\]
The model assumes that (i) the connection is (at worst) subjected to "light to moderate packet
losses"; (ii) that occurrences of loss are random, and that congestion is alleviated through the use of
active queue management triggering random loss; (iii) that retransmission timeouts (RTOs) are rare,
and thus that congestion avoidance events (e.g., triple DUPACK) are the majority of loss events;
and (iv) that the congestion control algorithm exits the slow start phase and spends the majority
of the connection’s time in the congestion avoidance phase (e.g., that the connection consists of a
bulk-transfer between a source and a sink). The model further requires the use of TCP’s internal
measurements of loss because the number of congestion signals per acknowledged packet cannot be
determined from link layer loss measurements.¹¹
¹¹ Mathis et al. [294] note that p is not necessarily equal to the packet loss rate because some congestion control
algorithms treat multiple packet losses in a single RTT as a single congestion signal.
Mathis et al. [294] found it difficult to apply their model to predict goodput for Internet transfers
because timeouts were frequent on the Internet, likely in part because active queue management
(AQM) is uncommon on the Internet. Without AQM, queue growth at the bottleneck link is not
proactively mitigated through the use of random loss, but is instead relieved through tail loss and
timeout events.
For a loss-based congestion control algorithm such as Reno [52], CWND exhibits sawtooth
behavior during steady state as (i) the congestion control algorithm probes for available bandwidth
by increasing the CWND and then (ii) eventually causes packet loss due to self-induced congestion,
causing the CWND to be decreased [52, 215]. CWND will also decrease if packet loss is caused due
to competition between flows (e.g., cross traffic). CWND can be sampled and averaged over time
during steady state, and this average can be used in an alternative form of the model:
\[
\mathrm{Throughput} = \frac{CWND}{RTT} \tag{7.3}
\]
The model states that a congestion window (CWND) worth of data will be delivered each round-trip
time (RTT).
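As a concrete illustration of Equations 7.1 and 7.3, the following sketch evaluates the model for example inputs. The default constant C = sqrt(3/2) is the textbook value for Reno-style halving with one congestion signal per sawtooth cycle; treat it as an assumption, since the appropriate constant depends on the congestion control algorithm being modeled.

```python
import math

def mathis_throughput_bps(mss_bytes, rtt_s, p, c=math.sqrt(1.5)):
    """Steady-state estimate from Equation 7.1, returned in bits per second:
    Throughput = (MSS * C) / (RTT * sqrt(p))."""
    return 8 * mss_bytes * c / (rtt_s * math.sqrt(p))

def window_throughput_bps(cwnd_bytes, rtt_s):
    """Alternative form from Equation 7.3: one CWND of data delivered per RTT."""
    return 8 * cwnd_bytes / rtt_s

# Example: 1460-byte MSS, 50 ms RTT, 1% congestion-signal rate -> roughly 2.9 Mbps.
print(round(mathis_throughput_bps(1460, 0.050, 0.01) / 1e6, 2), "Mbps")
```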
Altman et al. [14] and Padhye et al. [321] improved upon the work of Mathis et al., proposing
higher-fidelity models incorporating more of the intricacies of TCP’s congestion control behavior.
Cardwell et al. [75] recognized that “the majority of TCP flows traveling over the wide-area
Internet are very short, [...]. Since the steady-state models assume flows suffer at least one loss, they
are undefined for this common case”. When packet loss is zero, Cardwell et al.’s model estimates
throughput based on other events that can account for a significant amount of a transfer’s time in
a short connection, including the connection setup time, CWND growth, and delayed ACKs. Part
of our approach in Section 5.3.2 is similar, in that we consider the impact of CWND growth and
delayed ACKs when estimating goodput.
Applicability. These models were intended to be used to analyze the behavior of a congestion
control algorithm for a given set of network conditions, in part to facilitate research on congestion
control algorithm design. For instance, using these models, one can reason about the theoretical
best-case performance of a congestion control algorithm, and identify how changes to the algorithm
design could improve or degrade performance under different circumstances. However, they are
not intended to be used as a substitute for directly measuring a session's ability to support a given
goodput for use cases such as ours.
More specifically, these models are not applicable to our use-case for two reasons. First, none of
these models are representative for the congestion control algorithms employed by Facebook today,
which include BBR [73] and CUBIC with Hybrid Slow Start [189, 190]; they were all designed for
other congestion control algorithms, including TCP NewReno [52, 324].
Second, with the exception of the work of Cardwell et al., all of these techniques require insights
captured by the congestion control algorithm (e.g., number of triple DUPACK, number of packets
lost, CWND size). Yet it is unclear how to capture values for such inputs that will be representative.
For instance, the size of a connection’s CWND is a function of the session’s workload, propagation
delay, and other components. If we were to capture the average or median CWND across sessions
and plug it into $\frac{CWND}{RTT}$, the result may be “low” simply because most connections transfer little
data, and thus do not have the opportunity to grow a large CWND (as discussed in Chapter 5).
Similarly, for some sessions $\frac{CWND}{RTT}$ may be greater than the maximum goodput that the session can
actually achieve, as congestion control algorithms grow the CWND to probe for bandwidth, and
generally attempt to send faster than the bottleneck link speed to avoid bottleneck starvation [52,
73, 190, 215, 324, 414]. While models that take packet loss as an input can have higher fidelity,
we find that packet loss is rare in production. Packet loss is likely rare because the probability of
it occurring is in part a function of the session and congestion control algorithm’s behavior. For
instance, Facebook uses both BBR [73] and CUBIC with Hybrid Slow Start [189, 190], both of
which attempt to avoid packet loss caused by self-induced congestion through the use of latency
signals. In addition, because Facebook connections transfer little data, they are inherently less likely
to create self-congestion and cause packet loss. This makes it difficult to determine what value to
use for the packet loss rate.¹² And while the model proposed by Cardwell et al. can work without a
packet loss rate, it limits the model’s fidelity and ability to predict goodput for transfers that will
incur significant transmission time.
Instead of measuring packet loss or CWND and using those signals to infer the goodput that
a session can support, we directly measure a session’s ability to achieve a given goodput when it
has an opportunity to do so. While we use a model to determine the latter (§5.3.2.3), this model is
significantly simpler, as it is only used to determine whether the congestion control algorithm had
an opportunity to grow the connection's CWND to the size required to achieve a given goodput.
¹² The models assume that flows enter steady state, in which case these parameters can be measured more easily. While
bulk-transfer flows are assumed to enter steady-state [292], connections at Facebook's edge are brief and spend
most of their time idle.
7.4.2 Using packet-pairs to estimate bottleneck and available bandwidth
Prior work has used packet-pairs and similar forms of probing traffic to estimate a path’s bottleneck
bandwidth and/or available bandwidth without incurring the overhead of a speedtest. The bottleneck
and available bandwidths for a path between a source and destination are defined as follows:
• Bottleneck bandwidth: A link’s bandwidth is the rate at which it can transit data; a path’s
bottleneck bandwidth is the minimum bandwidth of its links [217, 218, 220, 224, 333].
• Available bandwidth: A link’s available bandwidth is the fraction of its capacity that is
unused over a sampling interval; a path’s available bandwidth is the minimum available
bandwidth over all links in the path for the same sampling interval [217, 218, 220, 224, 333].
In this section, we discuss this category of techniques and then consider their applicability relative
to our goals in Chapter 5.
Packet-pairs. Keshav [240] designed a technique to probe bottleneck bandwidth with packet-pairs.
To probe, the source sends two packets, back-to-back, and the receiver sends an ACK immediately
upon receipt of each packet. Assuming no cross-traffic or media access delays, the time between
the probe packets arriving at the receiver represents the path’s bottleneck bandwidth from sender to
receiver. If the path’s bandwidths are symmetric, the time between ACKs arriving at the sender can
also be used to derive the bottleneck bandwidth.¹³
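The arithmetic behind the packet-pair idea is simple; the sketch below (our illustration, with example numbers) computes the bottleneck bandwidth implied by the dispersion of a single pair observed at the receiver.

```python
def bottleneck_bandwidth_bps(probe_size_bytes, t_arrival_first, t_arrival_second):
    """Keshav-style packet-pair estimate: with no cross traffic, back-to-back
    packets leave the bottleneck spaced by the time it takes to serialize the
    second packet, so the receive-side gap implies the bottleneck bandwidth."""
    gap_s = t_arrival_second - t_arrival_first
    return 8 * probe_size_bytes / gap_s

# Example: 1500-byte probes arriving 1.2 ms apart imply a 10 Mbps bottleneck.
print(bottleneck_bandwidth_bps(1500, 0.0000, 0.0012))
```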
A large body of subsequent work follows from this idea; Guerrero et al. [183] and Prasad et al.
[333] provide a detailed overview of work in this area, including a taxonomy and discussion of
applicability. In this section, we categorize approaches based on whether they measure properties
per-link or end-to-end.
Per-link measurements. These approaches estimate the bottleneck bandwidth of each link in a
path [77, 115, 216, 253, 363] using derivatives of the packet-pair probing scheme. For instance,
pathchar (Jacobson [216]) sends precisely timed, TTL-limited packet-pair probes to infer the latency
and capacity of each link in the path, along with the probability of packet loss; Downey [115]
discussed using pathchar to measure Internet bottlenecks.
The accuracy of these approaches degrades in the presence of cross-traffic or if a path is longer
than a few hops [253], and subsequent work focused on making measurements more robust to the
presence of cross traffic [363]. Per-link measurements are further complicated by the fact that they
require each device along a path to respond to ICMP or other probing traffic. As a result, ICMP rate
limits [345] constrain the speed at which a path's capacity can be measured. Taken together, these factors
make it difficult to apply such approaches in a production environment.
¹³ The idea of packet-pairs naturally follows from Jacobson [215]'s work on TCP congestion control — Jacobson
showed that in an environment where links have symmetric bandwidth, the ACK arrival rate is a function of the sender’s
rate and the rate at which packets travel through the network, and therefore can be used to estimate the bottleneck
bandwidth. This property enables TCP to be a self-clocking protocol.
End-to-end measurements. End-to-end measurements are more closely related to our work in
Chapter 5. These measurements estimate a path’s end-to-end bottleneck bandwidth [77, 80, 81,
231, 252, 254, 372, 478] or available bandwidth [12, 77, 204, 217, 219, 348, 405]. Techniques
vary in whether they assume symmetrical bandwidth or can handle asymmetry (common in access
networks), and whether they require control (or cooperation) of both endpoints.
Techniques to measure end-to-end bottleneck bandwidth and available bandwidth have employed
derivatives of Keshav’s packet-pair probing scheme. For instance, Cprobe (Carter et al. [77])
estimates available bandwidth by having one endpoint send a burst or train of eight full-size
packets, back-to-back, and the other endpoint measure the time between receiving the first and last
packets. Available bandwidth, also known as the average dispersion rate (ADR), is calculated as
$\mathit{AvailBW} = \frac{\mathit{BytesTransferred}}{\mathit{RecvDuration}}$.
In contrast to prior work, Dovrolis et al. [113, 114] showed that in the presence of cross traffic the
ADR does not represent available bandwidth as previously believed, but instead represents a lower
bound of the path’s capacity and an upper bound of the path’s available bandwidth. Subsequent
work has attempted to correct for the impact of cross-traffic by using statistical techniques [81, 231,
254] and/or by detecting and accounting for cross traffic [80, 478].
Other work has focused on measuring available bandwidth using self-loading periodic streams.
Pathload (Jain et al. [217, 219]), pathChirp (Ribeiro et al. [348]), and BFind (Akella et al. [12])
send a continuous stream or a burst of packets at a given rate and then evaluate whether it resulted
in queuing by either using pings to detect a corresponding increase in latency, or based on packet
arrival timings captured at the receiver (if queuing occurs, interpacket arrival times will be shaped
by the bottleneck link and can be used to measure the average dispersion rate). If no queuing is
detected, they conclude that the available bandwidth must be higher than the current sending rate,
and test again at a higher rate.
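The sketch below illustrates the self-loading idea in simplified form. The `probe_at_rate` argument is a hypothetical hook standing in for the paced stream and queuing check, and the binary search is our simplification of the rate-adjustment strategies these tools actually use.

```python
def estimate_available_bw_bps(probe_at_rate, lo_bps=1e5, hi_bps=1e8, iters=12):
    """Binary-search the highest sending rate that does not induce queuing.
    probe_at_rate(rate) is a hypothetical hook that would send a short paced
    stream at `rate` and return True if delays trended upward (queuing)."""
    for _ in range(iters):
        rate = (lo_bps + hi_bps) / 2
        if probe_at_rate(rate):   # queuing observed: rate exceeds available bandwidth
            hi_bps = rate
        else:                     # no queuing: the path sustained this rate
            lo_bps = rate
    return lo_bps

# Toy stand-in for the probe hook: pretend the true available bandwidth is 18 Mbps.
print(estimate_available_bw_bps(lambda rate: rate > 18e6))  # converges near 18 Mbps
```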
Applicability. Techniques to measure end-to-end available bandwidth are most relevant to our
work in Chapter 5, in which we attempt to determine whether the underlying connection for a
given session — and, by extension, an aggregate of sessions — can support the 2.5 Mbps
goodput required to stream HD video. Prior work has explored using these techniques for similar
purposes [101, 307, 308], motivated by goals similar to those discussed in Chapter 5, including
limiting the amount of data that must be exchanged to derive a measurement. However, while these
approaches can be used to derive estimates with minimal overhead, this advantage also presents
tradeoffs in terms of accuracy. Prior work has evaluated these techniques and explored these aspects
in further detail [13, 171, 183]. In this section, we focus on aspects specific to Facebook’s production
environment.
First, a path’s end-to-end available bandwidth may be lower or higher than the goodput a reliable
transport could achieve. It may be lower, since existing flows on the path may relinquish capacity
once they detect congestion [220]. Likewise, it may be higher if the behavior of the congestion
control algorithm and other network conditions — such as loss caused by interactions with cross-
traffic or policers [139] — reduce transport efficiency and prevent it from achieving a goodput equal
to the path’s available bandwidth. For instance, it is well known that loss-based congestion control
algorithms struggle to operate efficiently under certain network conditions [14, 73, 217, 294, 321].
However, probing schemes do not consider such dynamics.
Second, while some of these approaches are designed to estimate available bandwidth with a
small number of probing packets, we suspect that in a production environment they may struggle to
do so accurately given their sensitivity to packet arrival timings. More specifically, policers [139],
802.11 frame aggregation [180], and Interrupt Coalescence [334], along with MAC and link-layer
delays, including 802.11 link-layer retransmissions and contention [418, 419], and DOCSIS channel
contention [107, 457] can all wreak havoc on such measurements, as they can cause the spacing
between individual packets to be unrepresentative of the available bandwidth at the bottleneck
link and of the goodput that the session can sustain. Furthermore, if the measurements are done using
TCP connections, then ACK suppression — a technique commonly employed on asymmetric access
links [1] — and ACK compression — a scenario in which the spacing between ACKs is (inadvertently)
reduced to improve return path efficiency [258, 291] — can create additional challenges.
Finally, our goal is to assess network conditions from production traffic, i.e., we do not want to
require specialized probing traffic. While it may be possible to manipulate production traffic to meet
the requirements of these probing techniques (e.g., by changing pacing), we would still contend
with the challenges associated with requiring precise packet timings, as described above.
In comparison to these approaches, we evaluate a session’s ability to support 2.5 Mbps goodput
by evaluating whether it was able to deliver data across the network at this rate when it had the
opportunity to do so. Because this approach directly measures a session's ability to deliver data,
it is inherently representative and incorporates all relevant components, including loss, propagation
delay, and the behavior of the congestion control algorithm. In addition, by measuring over a larger
amount of data we are less sensitive to the timings of individual packets.
7.4.3 Using the congestion control algorithm’s estimate of bottleneck bandwidth
BBR, a rate-based congestion control algorithm [73, 74], relies on continuous measurements of
“delivery rate” to set transport parameters. The delivery rate estimates are integral to BBR’s ability to
detect congestion and more specifically if a bottleneck link is saturated. In this section, we discuss
what the delivery rate is and why we cannot use it to determine if the network between Facebook
and a user can support HD goodput.
What is the “delivery rate”? The delivery rate estimates the rate at which the underlying network
can deliver the flow’s data given current conditions. The delivery rate is updated on each ACK
and is the ratio of the number of packets delivered (ACKed, via either cumulative or selective
acknowledgement) between the transmission of the ACKed packet and receipt of the ACK, to the
corresponding duration [90]. Therefore, the delivery rate measures the throughput of the connection
over the last RTT, not the goodput.
BBR assumes that a bottleneck is saturated and a queue is growing if the delivery rate remains
steady but transport RTT measurements begin to grow [73]. Under such conditions, the delivery rate
reflects the amount of bandwidth available to the flow. BBR uses each flow’s measured RTT and
delivery rate to derive the CWND and pacing rate, with the goal of maximizing the flow’s delivery
rate while minimizing queueing and buffer bloat [73]. BBR periodically “probes” the network’s
ability to support a higher delivery rate by changing these parameters and observing the impact on
the delivery rate and RTT [73].
The estimator that derives the delivery rate also sets a flag if the transport becomes “application
limited”, meaning that the transport is ready and waiting for the application to send more bytes.
Specifically, the transport is considered to be application limited if (1) it is not CWND limited, (2)
all packets assumed to be lost have been retransmitted, and (3) there is no data in the transport or
pacing buffer. All delivery rate estimates taken from packets transmitted when the flag is set are
marked as application rate limited. This flag is unset once the transport sends and receives an ACK
for the first byte in the latest send window. The application limited flag helps BBR determine when
the delivery rate may have changed due to sender behavior.
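The following sketch restates that per-ACK computation in code. The names and structure are ours and simplified for illustration; they are not the Linux or mvfst implementation.

```python
from dataclasses import dataclass

@dataclass
class SentPacketSnapshot:
    sent_time: float        # when this packet was transmitted
    delivered_bytes: int    # connection-wide delivered count at send time
    app_limited: bool       # the sender had nothing queued when this was sent

def delivery_rate_sample(snapshot, delivered_now, ack_time):
    """Return (rate_bps, app_limited) for one ACK: bytes delivered since the
    packet's send-time snapshot, divided by the elapsed time."""
    interval = ack_time - snapshot.sent_time
    if interval <= 0:
        return None
    rate_bps = 8 * (delivered_now - snapshot.delivered_bytes) / interval
    return rate_bps, snapshot.app_limited

# Example: 30 kB delivered over 50 ms yields roughly 4.8 Mbps, not app-limited.
snap = SentPacketSnapshot(sent_time=10.000, delivered_bytes=50_000, app_limited=False)
print(delivery_rate_sample(snap, delivered_now=80_000, ack_time=10.050))
```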
The Linux Kernel TCP implementation exposes the most recent delivery rate estimate and appli-
cation limited flag in the structure returned by the TCP_INFO socket option. QUIC implementations,
including Facebook’s mvfst [311], also expose this information.
Why delivery rate is not a replacement for our approach. At first glance the delivery rate and
application-limited signals may appear similar to those discussed in Section 5.3.2 — the signals are
captured at the transport-layer and provide insight into when sender behavior is a limiting factor.
However, these signals do not provide the same information as our achieved and achievable goodput
signals.
First, the delivery rate measures throughput, not goodput. On a connection with high loss — such
as a connection subject to the effects of a policer [139] — the throughput can be significantly higher
than goodput. Second, the application limited flag does not tell us the achievable goodput (§5.3.2.3).
Because the flag is just one bit, a common misconception is that when the flag is false, the delivery
rate represents a speed test, or the maximum rate at which the underlying network can deliver data.
However, the flag only indicates that the transport entered pipelining.
Third, while the signal is available through TCP_INFO and related APIs, applications are
unaware of when data is ACKed at the transport-layer and thus do not know when to query this
interface. This is further complicated by the fact that the signal only reflects conditions over the
past RTT. For instance, if the application queries the transport immediately after writing data to the
socket, data may not have been delivered yet, in which case there is no current delivery rate estimate
or the delivery rate estimate is stale (from a preceding transfer). If application writes become
blocked, then the application can fetch the delivery rate after each subsequent write and assume that
it reflects information for previously ACKed bytes (the ability to write to the transport after being
blocked is a strong signal that write buffer utilization has decreased due to packets being ACKed by
the remote). However, it is uncommon for Facebook’s load balancers to become blocked writing
to the transport given that most responses are small relative to typical write buffer sizes (§5.2.3).
In addition, only using measurements from connections where the load balancer becomes blocked
while writing would likely bias the resulting dataset, as slower connections are more likely to exhibit
such conditions. If the application instead uses socket timestamps to determine when the ACK
for the last byte of a transfer arrives and captures the signal at that time, the delivery rate may be
significantly lower than the rate achieved during most of the transfer if any bytes in the final window
were retransmitted.
Applicability. The delivery rate signal was designed to be used by congestion control algorithms
such as BBR to estimate the link’s bottleneck bandwidth. Because it is focused on throughput
and not goodput, and because the signal does not indicate what goodput a connection could have
achieved over the last window, it is not a replacement for our approach to measuring goodput.
7.5 Virtualization of Network Control and Data Planes
In Chapter 6 we discuss how PEERING, a globally distributed AS open to the research community,
provides researchers with the control and realism required to execute a wide class of impactful
Internet routing research on today’s Internet. While prior work has established ASes to perform
such research [32, 62, 444, 450], these ASes were built ad-hoc to support experiments envisioned by
their designers — they were not designed to serve as community platforms. For most researchers,
designing and deploying an AS is an arduous task that likely represents an insurmountable barrier for
most research projects. Furthermore, network operators are typically unwilling to allow experiments
on a production AS due to the potential wide-ranging negative effects [357], so researchers cannot
realistically turn to their sponsoring organization (e.g., university, company) for help. As a result,
the development of PEERING has truly opened the door to the larger research community, providing
many researchers with their first opportunity to execute experiments that require the control of an
AS connected to the real Internet.
Transit Portal. Transit Portal (Valancius et al. [444]) shared our vision of multiplexing the
connectivity, traffic, and policies of an AS. However, Transit Portal did not fully develop or
implement the technologies required to multiplex an AS as we did in Chapter 6 with VBGP. As
a result, Transit Portal had limited fidelity and flexibility. For instance, Transit Portal envisioned
delegating control of egress routing decisions by (i) creating a separate VPN endpoint for each
interconnection at a point of presence and (ii) having experiments route traffic by establishing
VPN connections with the appropriate endpoints and manually installing rules to route traffic via
connections as desired. This solution is incompatible with standard routers and BGP implementations
(e.g., it is not possible to configure BIRD to install such routes based on BGP control-plane signals)
and significantly increases the overhead of setting up a new experiment — instead of simply setting
up an existing routing daemon and configuring a routing policy, one must manually identify and
install routes, and keep installed routes up to date. In addition, Transit Portal did not support fine-
grained control over BGP announcements — such as whether an announcement is sent to a specific
neighbor at a point of presence — and did not include the security measures necessary to prevent
experiments from causing harm. Finally, Transit Portal’s prototype design and implementation
required considerable manual intervention to support each experiment, which in turn both made the
prototype brittle and increased the risk of potentially catastrophic configuration errors (§2.4.1).
In short, Transit Portal was unable to reliably support or scale to the needs of a community platform.
FlowVisor. FlowVisor (Sherwood et al. [386]) has similar goals to our work in that it seeks to
safely multiplex a production network to enable experimentation. However, while PEERING focuses
on multiplexing and delegating control of a production AS’s interdomain connectivity — including
the BGP control-plane and the data-plane — FlowVisor focuses on multiplexing an organization’s
internal network — specifically, FlowVisor is used to multiplex a portion of a university’s campus
network. Furthermore, FlowVisor was designed to take advantage of the (at the time) emerging
capabilities of OpenFlow-based Software-Defined Networks, which are inherently more hospitable
to such multiplexing and delegation. In stark contrast, our design of PEERING faces a different set
of challenges given our interdomain setting and the need to interoperate with other networks via
BGP, a protocol with intrinsic limitations.
Other uses of layer-2 signaling. PEERING has been under development since 2014, and in parallel
other network operators and researchers have discovered opportunities to use layer 2 mechanisms
(and specifically MAC addresses) for signaling data-plane decisions [5, 23, 187]. Agarwal et al. [5]
propose the use of shadow MACs in which an SDN controller associates routes through a network
with MAC addresses, and installs corresponding rules so that edge devices can select a path through
the network by simply sending packets with the destination MAC address set to a route’s assigned
MAC address.¹⁴
Gupta et al. [187] use MAC addresses to memoize and encode routing state and
decisions in a Software-Defined Internet Exchange. The work of Araújo [23] is most relevant in
that their work uses MAC addresses and policy-based routing to enable servers in a CDN point of
presence to select which egress route to use for a given flow. However, as discussed in Section 2.4.3,
policy-based routing only addresses a portion of the challenges that we faced in building PEERING:
achieving seamless compatibility with existing routing engines using standard protocols required
addressing other challenges related to the BGP control-plane, and building a community platform
required our solution to provide safety and scalability.
¹⁴ The idea of a “simple core” has long been pursued in the networking space [78].
Chapter 8
Conclusions and Future Work
8.1 Contributions
In this dissertation we examine how CDN providers interconnect and route traffic in today’s flattened
Internet, along with the opportunities and challenges that arise in this environment. In addition, we
examine the challenges that researchers face in executing impactful Internet routing research.
In Chapter 3 we executed a measurement study to understand the impact of the Internet’s
flattening on the paths between end-users and popular content. We found that direct, one hop paths
between CDNs and end-user ISPs are increasingly common and likely represent the bulk of today’s
Internet traffic. Based on this observation, we sketched the potential implications of the Internet’s
flattening on longstanding problems and discussed how the flattened Internet may provide footholds
for simple solutions that can benefit the majority of Internet traffic.
In Chapter 4 we characterized Facebook’s connectivity and routing policies, and examined
challenges that arise with Facebook’s volatile traffic demands and rich interconnectivity. We found
that, while Facebook’s connectivity offers a number of benefits, making efficient use of it requires
Facebook to employ sophisticated traffic engineering systems to sidestep BGP’s limitations. We
examined how Facebook delegated routing decisions traditionally made by BGP on routers at the
edge of its network to EDGE FABRIC, a software-defined egress routing controller that we built and
deployed in Facebook’s production network. EDGE FABRIC enables Facebook to make efficient use
of the interconnections preferred by Facebook’s routing policy, while dynamically shifting traffic if
needed to avoid congestion. With EDGE FABRIC, Facebook can achieve interconnection utilization
as high as 95% without packet loss.
In Chapter 5 we characterized the Internet performance observed from Facebook’s CDN de-
ployment, including regional and temporal trends. To do so, we developed novel techniques to
capture and interpret network performance from production traffic. In addition, we evaluated the
potential utility of performance-aware routing by using the extensions we built into EDGE FABRIC
to send a portion of production traffic via alternate routes. Our results suggest that, on the flattened
Internet, CDNs are able to provide good performance for the vast majority of traffic and end-users,
and that incorporating performance into Facebook’s routing decisions could provide some benefit in
emerging markets.
Finally, in Chapter 6 we discussed how we democratized Internet routing research by building
PEERING, a globally distributed network that enables researchers to execute experiments that interact
with the Internet routing ecosystem. PEERING provides researchers with control of a network with
connectivity qualitatively similar to that of the CDNs that serve much of the traffic on today’s
flattened Internet. To date, PEERING has enabled researchers to execute experiments including 40
                        EDGE FABRIC                              PEERING
approach                use BGP to exchange routes with the Internet;
                        employ novel mechanisms to delegate interdomain
                        routing decisions to more flexible decision processes
control delegated to    software controller                      experimenters
delegation enables      capacity- and perf-aware routing         control by experiments
multiplexing enables    application-specific routing,            multiple experiments
                        measurement traffic                      to run in parallel
constraints and goals   simplicity, flexibility;                 scalability, safety, realism;
                        compatibility with existing BGP tooling (shared by both)

Table 8.1: The role of delegation of interdomain routing decisions in the design of PEERING and EDGE
FABRIC. While each system serves a different use case, they both rely on delegation and multiplexing to
enable technologies that have ultimately facilitated improvements to Internet user experience.
experiments and 24 publications in key research areas such as security, traffic engineering, and
routing policies [20, 21, 47, 48, 49, 50, 137, 142, 200, 263, 288, 297, 323, 347, 366, 378, 381, 392,
397, 406, 411, 412, 413, 439]. Many of the experiments executed on the PEERING platform were
considered infeasible prior to the platform’s development.
In summary, we demonstrate that it is possible to solve longstanding Internet routing problems
and ultimately improve user experience by combining the rich interconnectivity of CDNs on
today’s flattened Internet with novel mechanisms that enable routers to delegate interdomain routing
decisions to more flexible decision processes. Table 8.1 shows how both EDGE FABRIC and
PEERING rely on such mechanisms to delegate and multiplex control of routing decisions, and in
doing so remove barriers to innovation. For instance, the limitations of BGP’s static routing policies
motivated the design of both EDGE FABRIC and PEERING. BGP’s static policies prevented Facebook
from making efficient use of its rich interdomain connectivity, motivating the development of EDGE
FABRIC, and presented a barrier to delegating control of an AS’s routing decisions to experimenters,
motivating the development of VBGP and PEERING. In both cases, designing novel mechanisms to
delegate control enabled us to sidestep the limitations of BGP’s static routing policies, while still
using BGP to facilitate the exchange of routes with other AS.
8.2 Future Work
8.2.1 Improvements for the remaining 20% of Internet traffic
In this dissertation we focused on the paths between users and CDNs; these paths carry over 80% of
today’s Internet traffic and thus undoubtedly play a key role in user experience. Our analysis showed
that these short, direct paths can often provide good performance and enable simple solutions to
longstanding Internet problems (or in some cases sidestep them entirely). However, while these
paths are the focus of our work, they alone do not define the health of the Internet.
While the vast majority of Internet traffic may be between users and CDNs, a sizeable portion
of traffic continues to traverse other paths. For instance, video calls commonly use peer-to-peer
connections in an effort to avoid circuitous routing and minimize propagation delay, and to reduce
the number of calls that must traverse the service provider’s network [95]. Likewise, VPN traffic,
whether for users telecommuting to work or using a personal VPN for privacy and/or freedom,
commonly traverses the traditional Internet. The rapid transition to work-from-home in 2020 due to
COVID-19 created challenges as more traffic flowed over these less optimized routes [365].
More generally, for the foreseeable future there will be cases where there is little or no opportu-
nity for optimization by a CDN, and/or for which using a CDN is not cost-effective. For instance,
when a student at home downloads a document from a web page hosted on a university server,
that document will likely be served via a path that traverses the “traditional Internet hierarchy”,
transit providers and all. In most cases, such content is likely to have a low hit-rate, preventing it
from being cached and limiting the potential benefit of CDNs to their ability to serve as an overlay
network when fetching the request from the origin server [256].
In addition, CDN points of presence and/or interconnections may become unavailable or
congested, and under such conditions traffic may fall back to a more “traditional route” (e.g., a route
traversing a transit provider) such as when EDGE FABRIC detours traffic to a transit provider (§4.4).
CDNs need to be cognizant of the impact on user experience that such fallback will have and
optimize in advance. In Chapter 5 we estimated the impact of such a fallback by comparing the
performance of the primary route and best alternate route and found that in most cases, the best
alternate route had comparable performance (§5.6). However, our analysis did not look at the
implications of shifting traffic to a different point of presence — such a shift may also change the
transit providers available, and have a larger impact on traffic.
The importance of the traditional Internet hierarchy is perhaps best showcased by the impact of
outages. In August 2020, Level(3) / CenturyLink — a Tier-1 ISP (§2.2), and the Tier-1 ISP with the
largest customer cone — had a widespread failure of IP routing services; the impact of the failure
was felt across the Internet by CDNs and end-users alike [26].
To ensure that the Internet remains healthy, we need to continue to invest in understanding and
solving the many longstanding Internet routing problems discussed in Section 2.4. (Un)surprisingly,
some of the solutions to these problems may not be new, but instead just refinements of ideas
introduced decades ago. For instance, Teridion [426] uses overlay networks to improve performance
of long-distance WAN connections, much in the same way that RON (§7.2.1) did two decades
prior. In addition, while we found that performance-aware routing would likely provide limited
performance benefits for Facebook’s users (§5.6), large transit providers continue to incorporate
route optimization appliances into their networks [408], suggesting that these systems may provide
benefit for traffic traversing the traditional Internet hierarchy.
8.2.2 Determining the root cause of variations in performance
In Chapter 5 we searched for instances of performance degradation in network conditions from the
vantage point of Facebook’s CDN. In this section we discuss how improvements in clustering and
congestion detection could aid in understanding such variations and help CDNs identify opportunities
to improve performance.
Most traffic during our 10-day study period did not experience any significant degradation.
However, 1.1% of traffic experienced at least one 15-minute period of at least 20 milliseconds
degradation in MinRTT_P50, and 2.3% of traffic experienced at least one 15-minute period of
degradation of at least 0.4 in HDratio_P50. In addition, we found client-PoP groups that experienced repeated
instances of degradation for the same 15-minute time window for at least five days: 2.3% of traffic
had a recurring instance with 10 milliseconds or worse degradation in MinRTT_P50, and 0.9% of
traffic experienced a recurring instance with 0.5 or worse degradation in HDratio_P50 — the latter of
which suggests a significant change in performance.
8.2.2.1 Possible causes of temporal degradation
Instances of temporal degradation are of particular interest because they are less likely to be caused
by failures. Determining the root cause behind such instances of degradation can help CDNs
identify possible mitigations, but while the measurements captured in Chapter 5 reveal a change in
performance, they are often insufficient to determine the root cause with certainty. In particular, we
speculate that the temporal instances of performance degradation observed could be caused by:
1. Changes in the population of a client-PoP group. In Chapter 5 we show an example of
a /16 prefix that serves clients in California and Hawaii (Figure 5.6); throughout the day,
the MinRTT_P50 for the corresponding client-PoP group changes based on whether sessions
from California or Hawaii dominate the client-PoP-time aggregation for a time window. In
addition to such organic changes, Cartographer (Facebook’s global load balancer, §4.2.2),
incorporates PoP load into its decision process, and as a result the clients in a given client-
PoP group — which is based on the serving PoP along with the client’s BGP prefix and
country — may change over time.¹ These shifts can change the performance observed for the
corresponding client-PoP-time aggregations.
2. Traffic engineering decisions made by other networks. Our analysis looked at how perfor-
mance varied for the primary route — the route that would be chosen by Facebook’s BGP
policy by default (§4.2.3) — and thus the variations observed should not have been caused by
EDGE FABRIC overrides. However, other networks downstream may use systems similar to
EDGE FABRIC to shift traffic to alternate paths, and such shifts could cause MinRTT_P50 to
increase and HDratio_P50 — a function of propagation delay, loss, congestion control behavior,
among other aspects (§5.3.2) — to decrease.
¹ Like EDGE FABRIC, Cartographer can split BGP prefixes to make decisions at finer granularities.
3. Self-induced congestion and changes in client behavior. Users may congest their access
links when watching streaming videos or downloading large files, a phenomenon known as
self-induced congestion [414]; self-induced congestion can increase loss and latency. The
potential for self-induced congestion depends on the access link speed and user behavior. For
example, in the evening it may be more likely for multiple users to be home and sharing the
same access link, and it may also be more likely for users to transfer large files and watch
streaming videos, all of which increases the potential for self-induced congestion, particularly
for slower access links. If many connections in a client-PoP-time aggregation traverse access
links experiencing self-induced congestion, performance for the aggregate may deteriorate.
In addition to self-induced congestion, the HTTP response size distribution can vary over the
day due to user behavior, and such variations can result in changes to goodput measurements
even if underlying network conditions have not changed. However, the approach we describe
in Section 5.3.2 accounts for the response size when calculating the tested goodput, and
therefore should be robust to this potential source of bias.
4. Congestion in a shared portion of the path. EDGE FABRIC does not have visibility into the
capacity and utilization of links beyond the edge of Facebook's network (§4.5), and so it is
possible for links further downstream to be congested, including a downstream interconnection
between two other ASes in the path, a link in a downstream AS’s backbone, or a link in the
core or access network of the end-user’s ISP. Such instances of congestion have received
significant attention in the past decade [108, 213, 229, 414, 417, 462].
Based on the insights we have gained through our work in this dissertation, we believe that
efforts to find the root cause of degradation would benefit from advancements in two areas: (i)
clustering of endpoints and (ii) congestion detection.
8.2.2.2 Opportunities to improve clustering of endpoints
Ideally, each client group would be a homogeneous group of clients with the following in common:
• Access technology: The goodput that a client connection can support is (in part) a function
of the underlying access technology. For instance, an ISP may offer both DSL and fiber
service, but clients connected via fiber will likely have connections capable of supporting
higher goodput. In addition, when clients in a client group have the same access technology,
it increases the likelihood of shared fate. For instance, an equipment failure may degrade
performance for clients connected via an ISP’s fiber service, but not impact performance for
clients connected via the same ISP’s DSL service.
• Geographic region: A client’s location determines the best-case propagation delay between
the client and a Facebook PoP and also determines the underlying infrastructure and logical
routes that connections between the client and the PoP traverse. This is important because the
route between clients and content for clients in large ISPs will often vary based on the client’s
location, and thus instances of congestion may only impact clients in specific regions [229].
In addition, if a client group contains clients spread across geographic regions it will degrade
our ability to detect instances of degradation localized to specific regions, and variations in
performance observed may not be caused by changes in underlying network conditions but
instead be the result of changes in the aggregate’s population (Figure 5.6).
• Network path to Facebook: The network performance of a connection between a client and
Facebook is (in part) a function of ISP core and access network conditions, and the condition
of other components along the path (e.g., conditions in intermediate AS). Ideally, a client
group consists of clients for which all of these components are common, as this increases the
likelihood of shared fate in the case of a route change or failure and improves the sensitivity
of detectors to such changes.
In Chapter 5 we attempt to build such homogeneous groups but are subject to limitations arising
from the information available in our dataset:
• We account for access technology with IP prefix: We do not know a client’s access technol-
ogy; instead, we assume that clients in the same BGP prefix have the same access technology.²
² While datasets exist to infer a client's access technology from the client's IP address, we found issues with the
accuracy of these datasets and chose to not rely on them.
• We account for geographic region with client country and IP prefix: We account for a
client’s location by mapping the client’s IP address to a country. While we experimented with
finer granularities (§5.3.4), we found they yielded minimal reduction in variability but reduced
coverage in cases where additional deaggregation left too few measurements for us to draw
statistically significant conclusions (§5.3.5.1). More generally, we are limited by
the accuracy of IP geolocation: prior work has found that IP to geolocation results are likely
accurate at the country level, but that accuracy can quickly degrade at finer granularities [184,
241, 271, 332, 384].
Considering the IP prefix enables us to benefit from decisions made by Cartographer (§4.2.2).
Because Cartographer directs requests to PoPs based on performance measurements captured
and aggregated at fine granularities, connections originating from clients in the same BGP
prefix and terminated at the same PoP are more likely to be from clients in similar geographic
locations — even if the BGP prefix contains clients from many different locations.
• We account for the path between the client and Facebook with IP prefix: Our dataset
does not capture details such as the IP route that a connection traverses, so we cannot group
connections at such granularity.³ Instead, we assume that connections between a Facebook
PoP and clients in the same IP prefix — which together ensure that connections traverse the
same BGP route — and in the same country will largely traverse the same downstream
infrastructure and experience shared fate.
³ Even with the requisite data, deriving groupings is non-trivial [259].
Using these heuristics enabled us to conduct analysis at scale in Chapter 5, but correspondingly degraded
our ability to identify the root cause of observed changes in performance:
• We struggle to differentiate instances of degradation caused by changes in client group
population from those caused by changes in network conditions: For instance, in Sec-
tion 5.3.4 we show an example (Figure 5.6) in which a single BGP prefix contains endpoints
in Hawaii and California. Because the prefix is served by the same PoP, and because our
definition of client group considers only the client’s country, measurements from clients
in Hawaii and California are aggregated into the same client-PoP group. As a result, the
client-PoP group shows temporal performance variations in MinRTT_P50 that are unrelated to
changes in network conditions. Similar problems arise from our reliance on Cartographer to
split up large BGP prefixes. For instance, if Prefix_Z is typically served by PoP_X and PoP_Y, and
Cartographer redirects clients typically served by PoP_X to PoP_Y (perhaps due to maintenance
at PoP_X), the performance observed at PoP_Y for Prefix_Z may change — even though there
has been no change in underlying network conditions.
• We can only construct groups at the granularity of BGP IP prefix: A BGP IP prefix
represents a logically arbitrary aggregate of endpoints in the same destination network,
and prefixes can be subdivided arbitrarily [44]. In some cases, aggregating small prefixes
(e.g., adjacent /24s) together may enable us to have sufficient samples to draw statistically
significant conclusions when that would otherwise not be possible — but such aggregation is only valid if clients in
those prefixes have the same properties (access technology, geographic location, path).
Prior work has used traceroutes to aggregate IPv4 address space with a common last-mile
router [259]. We believe that it is possible to design a similar aggregation technique, but employing
existing control-plane and data-plane signals, such as BGP routes along with performance measure-
ments from ROUTEPERF (§5.2.2) and Cartographer. For instance, instead of relying on geolocation
information inferred from IP addresses, we could cluster clients by region by using measurements
of propagation delay between clients and PoPs captured by Cartographer. Specifically, we could
aggregate measurements to /24 and /48 granularity for IPv4 and IPv6 respectively — granularities
which are likely to contain clients in the same geographic region [147, 157, 259] — and then
recursively merge adjacent prefixes when measurements indicate that they contain clients in the
same geographic region [233, 259]. This approach would enable us to sidestep accuracy issues
associated with IP-based geolocation services [184, 241, 271, 332, 384] and barriers to adoption
that make it difficult to pursue approaches involving coordination with network operators. We
could extend the process to incorporate access technology by checking whether clients in adjacent
address prefixes had a similar probability of supporting a given goodput, and only merging when
this condition is true.
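The sketch below illustrates one round of such a merge for IPv4 /24 aggregates. The similarity thresholds, field names, and unweighted averaging are assumptions made for brevity; they are not parameters from a deployed system.

```python
import ipaddress

def similar(a, b, rtt_tol_ms=5.0, hd_tol=0.05):
    """Treat two aggregates as mergeable when their latency-to-PoP and
    HD-capability signals agree within a tolerance (thresholds are ours)."""
    return (abs(a["median_rtt_ms"] - b["median_rtt_ms"]) <= rtt_tol_ms
            and abs(a["hd_fraction"] - b["hd_fraction"]) <= hd_tol)

def merge_once(stats):
    """One merging pass: stats maps IPv4 /24 networks to their signals.
    Adjacent siblings that look alike are replaced by their /23 parent;
    applying this repeatedly yields progressively coarser client groups."""
    out, consumed = {}, set()
    for net, sig in stats.items():
        if net in consumed:
            continue
        parent = net.supernet()
        sibling = next(s for s in parent.subnets(new_prefix=net.prefixlen) if s != net)
        if sibling in stats and sibling not in consumed and similar(sig, stats[sibling]):
            other = stats[sibling]
            out[parent] = {k: (sig[k] + other[k]) / 2 for k in sig}  # unweighted, for brevity
            consumed.update({net, sibling})
        else:
            out[net] = sig
    return out

stats = {
    ipaddress.ip_network("203.0.113.0/24"): {"median_rtt_ms": 22.0, "hd_fraction": 0.91},
    ipaddress.ip_network("203.0.112.0/24"): {"median_rtt_ms": 24.0, "hd_fraction": 0.93},
}
print(merge_once(stats))  # the two /24s collapse into one /23 aggregate
```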
Deriving client-PoP groups organically through such means would likely reduce variance and
the size of confidence intervals, increase the likelihood that temporal variations reflect changes
in the underlying network's performance — and not changes in client-PoP group population — and
improve our sensitivity to events that only impact a subset of clients in an ISP.
8.2.2.3 Opportunities in congestion detection
It is difficult to determine if an instance of performance degradation is caused by traffic engineering
decisions made by other networks (e.g., a route change increasing propagation delay), congestion
in a shared portion of the path (e.g., inside of the end-user ISP’s network), or changes in client
behavior (e.g., self-induced congestion) because all three potential culprits can have a similar effect
on network conditions. For instance, self-induced congestion and congestion in an ISP’s backbone
will both cause average packet loss to increase, and all three possible culprits will cause goodput to
decrease given that goodput is a function of loss, latency (propagation, queuing, and MAC/link-layer
delays), available bandwidth, and the behavior of the congestion control algorithm.
Prior work has attempted to identify instances of shared congestion by using active measurements
and by analyzing congestion control behavior:
• Time Series Latency Probes (TSLP, Dhamdhere et al. [108]) uses TTL-limited probes to esti-
mate queuing delay at interconnections between AS. However, TSLP’s accuracy diminishes
for interconnections beyond the edge of the vantage point’s network and the approach’s use of
active measurements makes it difficult to scale. Given these realities, Facebook likely cannot
use TSLP to continuously monitor for congestion at each element between Facebook’s edge
and groups of end-users.
• Sundaresan et al. [414] considered when a loss event occurs during a connection to discrim-
inate between self-induced congestion and congestion in the backbone or another shared
portion of the path. However, packet loss is uncommon given Facebook’s workload (§5.2)
and because the congestion control algorithms that Facebook uses — BBR [73, 74] and
CUBIC [190] with Hybrid Slow Start [189] — attempt to prevent loss caused by buffer
overflows by considering latency signals.
We hypothesize that it is possible to distinguish between these culprits without requiring the use
of active measurements given that congestion in a shared portion of the path will impact all clients
in a client group equally, regardless of the client's activity. For instance, if a backbone link inside
of an ISP’s network is congested and dropping packets, every packet that traverses that link will
have an equal probability of being dropped. By building a detector capable of recognizing such
a loss process based on existing production traffic, Facebook could quickly identify instances of
degradation caused by congestion in shared portions of the path and then execute more expensive
active instrumentation to localize the problem. While some (perhaps most) instances of congestion
may not be addressable, knowledge of the problem along with its impact on network performance
and user experience can help Facebook identify opportunities to improve end-user experience
through investments in PoPs and connectivity, and through collaborations with other networks.
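As a starting point, such a detector could compare the pooled loss rates of light and heavy sessions within a client group: under a shared loss process the two should match, whereas self-induced congestion concentrates loss in the heaviest sessions. The sketch below is a simplified illustration of that idea, with an arbitrary volume split and gap threshold, not a validated detector.

```python
def shared_loss_plausible(sessions, volume_split_bytes=1_000_000, max_gap=0.01):
    """sessions: iterable of (bytes_sent, packets_sent, packets_retransmitted).
    Returns True if light and heavy sessions lose packets at similar rates,
    which is consistent with (but does not prove) a shared bottleneck."""
    def pooled_rate(rows):
        sent = sum(p for _, p, _ in rows)
        lost = sum(r for _, _, r in rows)
        return lost / sent if sent else 0.0

    light = [s for s in sessions if s[0] < volume_split_bytes]
    heavy = [s for s in sessions if s[0] >= volume_split_bytes]
    if not light or not heavy:
        return None  # not enough of both kinds of traffic to tell
    return abs(pooled_rate(light) - pooled_rate(heavy)) <= max_gap

# Example: both groups lose roughly 1% of packets, so shared loss is plausible.
sessions = [(5_000_000, 4000, 40), (200_000, 160, 2), (300_000, 240, 2)]
print(shared_loss_plausible(sessions))
```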
8.3 Summary
A decade ago, the vast majority of traffic between end-user ISPs and content providers flowed
through transit providers: end-user ISPs and content networks historically passed traffic (and dollars)
upwards to transit providers, which in turn interconnected these parties. Fast-forward to today and
we observe a significant shift in the Internet’s architecture: most content is now hosted by content
distribution networks that directly interconnect with end-user ISPs at points of presence around the
world, and the hierarchy in which transit providers once played a central role is no longer relevant for
the bulk of the Internet’s traffic. In this dissertation we examined the implications of this flattening
of the Internet’s structure and designed systems to address challenges faced by content distribution
networks. In addition, we developed PEERING, a platform that is enabling impactful research on
longstanding Internet problems and new problems arising from the shift to a flattened Internet. We
hope that our work, including our characterizations of connectivity, performance, and routing on the
flattened Internet, can provide valuable insights to future work.
Bibliography
[1] [aqm] TCP ACK Suppression. IETF Mail Archive. 2015. URL: https://mailarchive.ietf.org/
arch/msg/aqm/XJ4SemDwXB2SPYWO7CKSLcX0IQE/.
[2] Vijay Kumar Adhikari, Yang Guo, Fang Hao, Volker Hilt, and Zhi-Li Zhang. “A tale of
three CDNs: An active measurement study of Hulu and its CDNs”. In: Proceedings of IEEE
INFOCOM Workshops. IEEE, 2012.
[3] Vijay Kumar Adhikari, Sourabh Jain, Yingying Chen, and Zhi-Li Zhang. “Vivisecting
YouTube: An Active Measurement Study”. In: Proceedings of Annual Joint Conference of
the IEEE Computer and Communications Societies. INFOCOM ’12. IEEE, 2012.
[4] Advanced Layer 2 Services. URL: https://www.internet2.edu/products-services/advanced-
networking/layer-2-services/.
[5] Kanak Agarwal, Colin Dixon, Eric Rozner, and John Carter. “Shadow MACs: Scalable
Label-switching for Commodity Ethernet”. In: Proceedings of the ACM Workshop on Hot
Topics in Software Defined Networking. HotSDN ’14. ACM, 2014.
[6] Bernhard Ager, Nikolaos Chatzis, Anja Feldmann, Nadi Sarrar, Steve Uhlig, and Walter
Willinger. “Anatomy of a Large European IXP”. In: Proceedings of the Conference of the
ACM Special Interest Group on Data Communication. SIGCOMM ’12. ACM, 2012.
[7] Bernhard Ager, Wolfgang Mühlbauer, Georgios Smaragdakis, and Steve Uhlig. “Web
Content Cartography”. In: Proceedings of the ACM Internet Measurement Conference. IMC
’11. ACM, 2011.
[8] Adnan Ahmed, Zubair Shafiq, Harkeerat Bedi, and Amir Khakpour. “Peering vs. Transit:
Performance Comparison of Peering and Transit Interconnections”. In: Proceedings of the
IEEE International Conference on Network Protocols. ICNP ’17. 2017.
[9] Aditya Akella, Bruce Maggs, Srinivasan Seshan, and Anees Shaikh. “On the Performance
Benefits of Multihoming Route Control”. In: IEEE/ACM Transactions on Networking (TON)
16.1 (2008), pp. 91–104.
[10] Aditya Akella, Bruce Maggs, Srinivasan Seshan, Anees Shaikh, and Ramesh Sitaraman. “A
Measurement-Based Analysis of Multihoming”. In: Proceedings of the Conference of the
ACM Special Interest Group on Data Communication. SIGCOMM ’03. ACM, 2003.
[11] Aditya Akella, Jeffrey Pang, Bruce Maggs, Srinivasan Seshan, and Anees Shaikh. “A
Comparison of Overlay Routing and Multihoming Route Control”. In: ACM SIGCOMM
Computer Communication Review 34.4 (2004), pp. 93–106.
[12] Aditya Akella, Srinivasan Seshan, and Anees Shaikh. “An Empirical Evaluation of Wide-
Area Internet Bottlenecks”. In: Proceedings of the ACM Internet Measurement Conference.
IMC ’03. ACM, 2003.
[13] Ahmed Ait Ali, Fabien Michaut, and Francis Lepage. “End-to-End Available Bandwidth
Measurement Tools: A Comparative Evaluation of Performances”. In: arXiv preprint
arXiv:0706.4004 (2007).
[14] Eitan Altman, Konstantin Avrachenkov, and Chadi Barakat. “A Stochastic Model of TCP/IP
with Stationary Random Losses”. In: Proceedings of the Conference of the ACM Special
Interest Group on Data Communication. SIGCOMM ’00. ACM, 2000.
[15] Hussein A Alzoubi, Seungjoon Lee, Michael Rabinovich, Oliver Spatscheck, and Jacobus
Van Der Merwe. “A Practical Architecture for an Anycast CDN”. In: ACM Transactions on
the Web (TWEB) 5.4 (2011), pp. 1–29.
[16] Hussein A Alzoubi, Seungjoon Lee, Michael Rabinovich, Oliver Spatscheck, and Jacobus
Van der Merwe. “Anycast CDNs Revisited”. In: Proceedings of the International Conference
on World Wide Web. WWW ’08. ACM, 2008.
[17] AMS-IX Route Servers. URL: https://www.ams-ix.net/ams/documentation/ams-ix-route-
servers.
[18] David G. Andersen, Hari Balakrishnan, M. Frans Kaashoek, and Robert Morris. “Resilient
Overlay Networks”. In: Proceedings of the ACM Symposium on Operating Systems Princi-
ples. SOSP ’01. ACM, 2001.
[19] Yaw Anokwa, Colin Dixon, Gaetano Borriello, and Tapan Parikh. “Optimizing High Latency
Links in the Developing World”. In: Proceedings of the ACM Workshop on Wireless Networks
and Systems for Developing Regions. WiNS-DR ’08. ACM, 2008.
[20] Ruwaifa Anwar, Haseeb Niaz, David Choffnes, Italo Cunha, Phillipa Gill, and Ethan Katz-
Bassett. “Investigating Interdomain Routing Policies in the Wild”. In: Proceedings of the
ACM Internet Measurement Conference. IMC ’15. ACM, 2015.
[21] Maria Apostolaki, Aviv Zohar, and Laurent Vanbever. “Hijacking Bitcoin: Routing Attacks
on Cryptocurrencies”. In: Proceedings of IEEE Symposium on Security and Privacy. S&P
’17. 2017.
[22] Ioannis Arapakis, Xiao Bai, and B Barla Cambazoglu. “Impact of Response Latency on
User Behavior in Web Search”. In: Proceedings of the International ACM SIGIR Conference
on Research & Development in Information Retrieval. SIGIR ’14. ACM, 2014.
[23] João Taveira Araùjo. Building and scaling the Fastly network, part 1: Fighting the FIB. 2016.
URL: https://www.fastly.com/blog/building-and-scaling-fastly-network-part-1-fighting-fib.
[24] Todd Arnold, Ege Gürmeriçliler, Georgia Essig, Arpit Gupta, Matt Calder, Vasileios Giotsas,
and Ethan Katz-Bassett. “(How Much) Does a Private WAN Improve Cloud Performance?”
In: Proceedings of Annual Joint Conference of the IEEE Computer and Communications
Societies. INFOCOM ’20. IEEE, 2020.
[25] Todd Arnold, Jia He, Weifan Jiang, Matt Calder, Italo Cunha, Vasileios Giotsas, and Ethan
Katz-Bassett. “Cloud Provider Connectivity in the Flat Internet”. In: Proceedings of the
ACM Internet Measurement Conference. IMC ’20. ACM, 2020.
[26] August 30th 2020: Analysis of CenturyLink/Level(3) Outage. The Cloudflare Blog. Aug.
2020. URL: https://blog.cloudflare.com/analysis-of-todays-centurylink-level-3-outage/.
[27] Brice Augustin, Xavier Cuvellier, Benjamin Orgogozo, Fabien Viger, Timur Friedman,
Matthieu Latapy, Clémence Magnien, and Renata Teixeira. “Avoiding Traceroute Anomalies
with Paris Traceroute”. In: Proceedings of the ACM Internet Measurement Conference. IMC
’06. ACM, 2006.
[28] Vaibhav Bajpai, Steffie Jacob Eravuchira, and Jürgen Schönwälder. “Dissecting Last-mile
Latency Characteristics”. In: ACM SIGCOMM Computer Communication Review 47.5
(2017), pp. 25–34.
[29] F. Baker and P. Savola. Ingress Filtering for Multihomed Networks. Internet Requests for
Comments, RFC 3704. RFC. Mar. 2004.
[30] Abdullah Balamash, Marwan Krunz, and Philippe Nain. “Performance analysis of a client-
side caching/prefetching system for web traffic”. In: Computer Networks 51.13 (2007),
pp. 3673–3692.
[31] Arjun Balasingam, Manu Bansal, Rakesh Misra, Kanthi Nagaraj, Rahul Tandra, Sachin
Katti, and Aaron Schulman. “Detecting if LTE is the Bottleneck with BurstTracker”. In:
Proceedings of the Annual International Conference on Mobile Computing and Networking.
MobiCom ’19. ACM, 2019.
[32] Hitesh Ballani, Paul Francis, and Sylvia Ratnasamy. “A Measurement-Based Deployment
Proposal for IP Anycast”. In: Proceedings of the ACM Internet Measurement Conference.
IMC ’06. ACM, 2006.
[33] Hitesh Ballani, Paul Francis, and Xinyang Zhang. “A Study of Prefix Hijacking and In-
terception in the Internet”. In: Proceedings of the Conference of the ACM Special Interest
Group on Data Communication. SIGCOMM ’07. ACM, 2007.
[34] Dziugas Baltrunas, Ahmed Elmokashfi, and Amund Kvalbein. “Measuring the Reliabil-
ity of Mobile Broadband Networks”. In: Proceedings of the ACM Internet Measurement
Conference. IMC ’14. ACM, 2014.
[35] A. Barbir, R. Nair, and O. Spatscheck. Known Content Network (CN) Request-Routing
Mechanisms. Internet Requests for Comments, RFC 3568. RFC. July 2003.
[36] Ryan Beckett, Aarti Gupta, Ratul Mahajan, and David Walker. “A General Approach to
Network Configuration Verification”. In: Proceedings of the Conference of the ACM Special
Interest Group on Data Communication. SIGCOMM ’17. ACM, 2017.
[37] Iljitsch Van Beijnum. Meet DOCSIS, Part 2: the jump from 2.0 to 3.0. May 2011. URL:
https://arstechnica.com/information-technology/2011/05/meet-docsis-part-2-the-jump-
from-20-to-30/.
[38] Steve Bellovin, Marcus Leech, and Tom Taylor. Methods for Detection and Mitigation of
BGP Route Leaks. Internet-Draft draft-ietf-grow-route-leak-detection-mitigation-03. IETF
Secretariat, July 2020. URL: https://tools.ietf.org/html/draft-ietf-grow-route-leak-detection-
mitigation-03.
[39] M. Belshe, R. Peon, and M. Thomson. Hypertext Transfer Protocol Version 2 (HTTP/2).
RFC 7540. May 2015.
[40] Mike Belshe. More bandwidth doesn’t matter (much). 2010. URL: http://docs.google.com/a/
chromium.org/viewer?a=v&pid=sites&srcid=Y2hyb21pdW0ub3JnfGRldnxneDoxMzcyOWI1N2I4YzI3NzE2.
[41] R. Beverly, R. Koga, and k. claffy. “Initial Longitudinal Analysis of IP Source Spoofing
Capability on the Internet”. In: Internet Society (July 2013).
[42] Robert Beverly and Steven Bauer. “The Spoofer project: Inferring the extent of source
address filtering on the Internet”. In: Proceedings of the Steps to Reducing Unwanted Traffic
on the Internet Workshop. SRUTI ’05. USENIX, 2005.
[43] Robert Beverly, Arthur Berger, Young Hyun, and kc claffy. “Understanding the Efficacy
of Deployed Internet Source Address Validation Filtering”. In: Proceedings of the ACM
Internet Measurement Conference. IMC ’09. ACM, 2009.
[44] Robert Beverly, Arthur Berger, and Geoffrey G Xie. “Primitives for Active Internet Topology
Mapping: Toward High-Frequency Characterization”. In: Proceedings of the ACM Internet
Measurement Conference. IMC ’10. ACM, 2010.
[45] BGP Analysis Reports - BGP Table Data. URL: https://bgp.potaroo.net/index-bgp.html.
[46] Rui Bian, Shuai Hao, Haining Wang, Amogh Dhamdere, Alberto Dainotti, and Chase Cotton.
“Towards Passive Analysis of Anycast in Global Routing: Unintended Impact of Remote
Peering”. In: ACM SIGCOMM Computer Communication Review 49.3 (2019), pp. 18–25.
[47] Henry Birge-Lee, Yixin Sun, Annie Edmundson, Jennifer Rexford, and Prateek Mittal.
“Bamboozling Certificate Authorities with BGP”. In: Proceedings of USENIX Security
Symposium. USENIX Security ’18. USENIX, 2018.
[48] Henry Birge-Lee, Yixin Sun, Annie Edmundson, Jennifer Rexford, and Prateek Mittal.
“Using BGP to Acquire Bogus TLS Certificates”. In: Proceedings of Workshop on Hot
Topics in Privacy Enhancing Technologies. HotPETS ’17. 2017.
[49] Henry Birge-Lee, Liang Wang, Jennifer Rexford, and Prateek Mittal. “SICO: Surgical
Interception Attacks by Manipulating BGP Communities”. In: Proceedings of ACM SIGSAC
Conference on Computer and Communications Security. CCS ’19. London, United Kingdom:
ACM, Nov. 2019.
[50] Kyle Birkeland, Jared M Smith, and Max Schuchard. “Peerlock: Flexsealing BGP”. In:
Proceedings of Network and Distributed System Security Symposium. NDSS ’21. Internet
Society, 2021.
[51] Ethan Blanton and Mark Allman. “On Making TCP More Robust to Packet Reordering”. In:
ACM SIGCOMM Computer Communication Review 32.1 (2002), pp. 20–30.
[52] Ethan Blanton, Dr. Vern Paxson, and Mark Allman. TCP Congestion Control. RFC 5681.
Sept. 2009. DOI: 10.17487/RFC5681. URL: https://rfc-editor.org/rfc/rfc5681.txt.
[53] J. Border, M. Kojo, J. Griner, G. Montenegro, and Z. Shelby. Performance Enhancing
Proxies Intended to Mitigate Link-Related Degradations. RFC 3135. June 2001.
[54] David Borman, Robert T. Braden, Van Jacobson, and Richard Scheffenegger. TCP Ex-
tensions for High Performance. RFC 7323. Sept. 2014. DOI: 10.17487/RFC7323. URL:
https://rfc-editor.org/rfc/rfc7323.txt.
[55] Timm Böttger, Felix Cuadrado, Gareth Tyson, Ignacio Castro, and Steve Uhlig. “Open
Connect Everywhere: A Glimpse at the Internet Ecosystem Through the Lens of the Netflix
CDN”. In: ACM SIGCOMM Computer Communication Review 48.1 (2018), pp. 28–34.
[56] Timm Böttger, Felix Cuadrado, and Steve Uhlig. “Looking for hypergiants in peeringDB”.
In: ACM SIGCOMM Computer Communication Review 48.3 (2018), pp. 13–19.
[57] P Bret, K Prashanth, J Samir, and AK Zaid. TCP over IP anycast—pipe dream or reality.
2010. URL: https://engineering.linkedin.com/network-performance/tcp-over-ip-anycast-
pipe-dream-or-reality.
[58] Andre Broido and kc claffy. “Analysis of RouteViews BGP data: Policy atoms”. In:
Network Resource Data Management Workshop. 2001.
[59] Ricky Brundritt, Saisang Cai, and Chris French. Bing Maps Tile System. 2018. URL: https:
//docs.microsoft.com/en-us/bingmaps/articles/bing-maps-tile-system.
[60] Hal Burch and Bill Cheswick. “Tracing Anonymous Packets to Their Approximate Source.”
In: Proceedings of Large Installation System Administration Conference. LISA ’00. USENIX,
2000.
[61] Randy Bush, T. Griffin, and M. Mao. “Route flap damping: harmful?” In: Proceedings of
NANOG. NANOG 26. Oct. 2002.
[62] Randy Bush, Olaf Maennel, Matthew Roughan, and Steve Uhlig. “Internet Optometry:
Assessing the Broken Glasses in Internet Reachability”. In: Proceedings of the ACM Internet
Measurement Conference. IMC ’09. ACM, 2009.
[63] Kevin Butler, Toni R Farley, Patrick McDaniel, and Jennifer Rexford. “A survey of BGP
security issues and solutions”. In: Proceedings of the IEEE 98.1 (2009), pp. 100–122.
[64] Matthew Caesar, Donald Caldwell, Nick Feamster, Jennifer Rexford, Aman Shaikh, and
Jacobus van der Merwe. “Design and Implementation of a Routing Control Platform”. In:
Proceedings of USENIX Symposium on Networked Systems Design and Implementation.
NSDI ’05. USENIX, 2005.
[65] Matthew Caesar and Jennifer Rexford. “BGP routing policies in ISP networks”. In: IEEE
Network 19.6 (2005), pp. 5–11.
[66] Xue Cai, John Heidemann, Balachander Krishnamurthy, and Walter Willinger. An Organization-
Level View of the Internet and its Implications (Extended). Tech. rep. USC/ISI TR, June
2012.
[67] Xue Cai and John S. Heidemann. “Understanding Block-level Address Usage in the Visible
Internet”. In: Proceedings of the Conference of the ACM Special Interest Group on Data
Communication. SIGCOMM ’10. ACM, 2010.
[68] CAIDA Archipelago Measurement Infrastructure.
[69] CAIDA-ASRank. URL: http://as-rank.caida.org/.
[70] Matt Calder, Xun Fan, Zi Hu, Ethan Katz-Bassett, John Heidemann, and Ramesh Govindan.
“Mapping the Expansion of Google’s Serving Infrastructure”. In: Proceedings of the ACM
Internet Measurement Conference. IMC ’13. ACM, 2013.
[71] Matt Calder, Ashley Flavel, Ethan Katz-Bassett, Ratul Mahajan, and Jitendra Padhye.
“Analyzing the Performance of an Anycast CDN”. In: Proceedings of the ACM Internet
Measurement Conference. IMC ’15. ACM, 2015.
[72] Matt Calder, Ryan Gao, Manuel Schröder, Ryan Stewart, Jitendra Padhye, Ratul Mahajan,
Ganesh Ananthanarayanan, and Ethan Katz-Bassett. “Odin: Microsoft’s Scalable Fault-
Tolerant CDN Measurement System”. In: Proceedings of USENIX Symposium on Networked
Systems Design and Implementation. NSDI ’18. USENIX, 2018.
[73] N. Cardwell, Y. Cheng, C. Stephen Gunn, S. Hassas Yeganeh, and V. Jacobson. “BBR:
Congestion-Based Congestion Control”. In: ACM Queue 14.5 (2016), 50:20–50:53.
[74] Neal Cardwell, Yuchung Cheng, Soheil Hassas Yeganeh, and Van Jacobson. BBR Congestion
Control. Internet-Draft draft-cardwell-iccrg-bbr-congestion-control-00. IETF Secretariat,
July 2017. URL: https://tools.ietf.org/id/draft-cardwell-iccrg-bbr-congestion-control-
00.html.
[75] Neal Cardwell, Steven Savage, and Thomas Anderson. “Modeling TCP Latency”. In: Pro-
ceedings of Annual Joint Conference of the IEEE Computer and Communications Societies.
INFOCOM ’00. IEEE, 2000.
[76] B. Carpenter and K. Moore. Connection of IPv6 Domains via IPv4 Clouds. Internet Requests
for Comments, RFC 3056. RFC. Feb. 2005.
[77] Robert L Carter, Mark E Crovella, et al. “Measuring Bottleneck Link Speed in Packet-
Switched Networks”. In: Performance evaluation 27.4 (1996), pp. 297–318.
[78] Martin Casado, Teemu Koponen, Scott Shenker, and Amin Tootoonchian. “Fabric: A
Retrospective on Evolving SDN”. In: Proceedings of the ACM Workshop on Hot Topics in
Software Defined Networking. HotSDN ’12. ACM, 2012.
[79] Ignacio Castro, Juan Camilo Cardona, Sergey Gorinsky, and Pierre Francois. “Remote
Peering: More Peering Without Internet Flattening”. In: Proceedings of the International
Conference on Emerging Networking EXperiments and Technologies. CoNEXT ’14. ACM,
2014.
[80] Edmond WW Chan, Ang Chen, Xiapu Luo, Ricky KP Mok, Weichao Li, and Rocky KC
Chang. “TRIO: Measuring Asymmetric Capacity with Three Minimum Round-trip Times”.
In: Proceedings of the International Conference on Emerging Networking EXperiments and
Technologies. CoNEXT ’11. ACM, 2011.
[81] Edmond WW Chan, Xiapu Luo, and Rocky KC Chang. “A Minimum-Delay-Difference
Method for Mitigating Cross-Traffic Impact on Capacity Measurement”. In: Proceedings
of the International Conference on Emerging Networking EXperiments and Technologies.
CoNEXT ’09. ACM, 2009.
[82] R. Chandra and R. Traina. BGP Communities Attribute. Internet Requests for Comments,
RFC 1997. RFC. Aug. 1996.
[83] Di-Fa Chang, Ramesh Govindan, and John Heidemann. “Locating BGP Missing Routes
Using Multiple Perspectives”. In: Proceedings of the ACM SIGCOMM workshop on Network
troubleshooting: research, theory and operations practice meet malfunctioning reality. ACM,
2004.
[84] Hyunseok Chang, Ramesh Govindan, Sugih Jamin, Scott Shenker, and Walter Willinger. On
Inferring AS-level Connectivity from BGP Routing Tables. Tech. rep. UM-CSE-TR-454-02.
University of Michigan, 2002.
[85] Hyunseok Chang, Ramesh Govindan, Sugih Jamin, Scott J Shenker, and Walter Willinger.
“Towards Capturing Representative AS-level Internet Topologies”. In: Computer Networks
44.6 (2004), pp. 737–755.
[86] Rocky K.C. Chang and Michael Lo. “Inbound Traffic Engineering for Multihomed ASes
Using AS Path Prepending”. In: IEEE Network 19.2 (2005), pp. 18–25.
[87] Nikolaos Chatzis, Georgios Smaragdakis, Anja Feldmann, and Walter Willinger. “There is
More to IXPs than Meets the Eye”. In: ACM SIGCOMM Computer Communication Review
43.5 (2013), pp. 19–28.
[88] Fangfei Chen, Ramesh K. Sitaraman, and Marcelo Torres. “End-User Mapping: Next
Generation Request Routing for Content Delivery”. In: Proceedings of the Conference of
the ACM Special Interest Group on Data Communication. SIGCOMM ’15. ACM, 2015.
[89] Kai Chen, David R. Choffnes, Rahul Potharaju, Yan Chen, Fabian E. Bustamante, Dan
Pei, and Yao Zhao. “Where the Sidewalk Ends: Extending the Internet AS Graph Using
Traceroutes from P2P Users”. In: Proceedings of the International Conference on Emerging
Networking EXperiments and Technologies. CoNEXT ’09. ACM, 2009.
[90] Yuchung Cheng, Neal Cardwell, Soheil Hassas Yeganeh, and Van Jacobson. Delivery Rate
Estimation. Internet-Draft draft-cheng-iccrg-delivery-rate-estimation-00. IETF Secretariat,
July 2017. URL: https://tools.ietf.org/id/draft-cheng-iccrg-delivery-rate-estimation-00.html.
[91] Marshini Chetty, Srikanth Sundaresan, Sachit Muckaden, Nick Feamster, and Enrico Ca-
landro. “Measuring Broadband Performance in South Africa”. In: Proceedings of the 4th
Annual Symposium on Computing for Development. 2013.
[92] Yi-Ching Chiu, Brandon Schlinker, Abhishek Balaji Radhakrishnan, Ethan Katz-Bassett,
and Ramesh Govindan. “Are We One Hop Away from a Better Internet?” In: Proceedings
of the ACM Internet Measurement Conference. IMC ’15. ACM, 2015.
[93] Cisco. Understanding Policy Routing. 2005. URL: https://www.cisco.com/c/en/us/support/
docs/ip/border-gateway-protocol-bgp/10116-36.html.
[94] Cisco. Virtual Route Forwarding Design Guide. 2008.
[95] Cloud-Based and Peer-to-Peer Meetings. URL: https://blog.zoom.us/cloud-based-and-peer-
peer-meetings/.
[96] CloudLab. URL: http://www.cloudlab.us/.
[97] Avichai Cohen, Yossi Gilad, Amir Herzberg, and Michael Schapira. “Jumpstarting BGP
security with path-end validation”. In: Proceedings of the Conference of the ACM Special
Interest Group on Data Communication. SIGCOMM ’16. ACM, 2016.
[98] Lorenzo Colitti. “Internet Topology Discovery Using Active Probing”. PhD thesis. Università
degli Studi Roma Tre, 2006.
[99] Lorenzo Corneo, Nitinder Mohan, Aleksandr Zavodovski, Walter Wong, Christian Rohner,
Per Gunningberg, and Jussi Kangasharju. “(How Much) Can Edge Computing Change
Network Latency?” In: Proceedings of the IFIP Networking Conference. IFIP ’21. IEEE,
2021.
[100] Jim Cowie. The New Threat: Targeted Internet Traffic Misdirection. Dyn Research Blog.
Nov. 2013. URL: http://research.dyn.com/2013/11/mitm-internet-hijacking/.
[101] Daniele Croce, Taoufik En-Najjary, Guillaume Urvoy-Keller, and Ernst W Biersack. “Ca-
pacity Estimation of ADSL links”. In: Proceedings of the International Conference on
Emerging Networking EXperiments and Technologies. CoNEXT ’08. ACM, 2008.
[102] Ítalo Cunha, Pietro Marchetta, Matt Calder, Yi-Ching Chiu, Brandon Schlinker, Bruno V A
Machado, Antonio Pescapè, Vasileios Giotsas, Harsha V Madhyastha, and Ethan Katz-
Bassett. “Sibyl: A Practical Internet Route Oracle”. In: Proceedings of USENIX Symposium
on Networked Systems Design and Implementation. NSDI ’16. USENIX, 2016.
[103] Jakub Czyz, Michael Kallitsis, Manaf Gharaibeh, Christos Papadopoulos, Michael Bailey,
and Manish Karir. “Taming the 800 Pound Gorilla: The Rise and Decline of NTP DDoS
Attacks”. In: Proceedings of the ACM Internet Measurement Conference. IMC ’14. ACM,
2014.
[104] A. Dainotti, K. Benson, A. King, B. Huffaker, E. Glatz, X. Dimitropoulos, P. Richter,
A. Finamore, and A. C. Snoeren. “Lost in Space: Improving Inference of IPv4 Address
Space Utilization”. In: IEEE Journal on Selected Areas in Communications 34.6 (2016),
pp. 1862–1876.
[105] Alberto Dainotti, Ethan Katz-Bassett, and Xenofontas Dimitropolous. “The BGP Hackathon
2016 Report”. In: ACM SIGCOMM Computer Communication Review 46.3 (2018), pp. 1–6.
[106] J. Damas, M. Graff, and P. Vixie. Extension Mechanisms for DNS (EDNS(0)). Internet
Requests for Comments, RFC 7854. RFC. Apr. 2013.
[107] Data-Over-Cable Service Interface Specifications DOCSIS 3.1: MAC and Upper Layer
Protocols Interface Specification. Tech. rep. Cable Television Laboratories, Inc., 2015.
[108] Amogh Dhamdhere, David D Clark, Alexander Gamero-Garrido, Matthew Luckie, Ricky
KP Mok, Gautam Akiwate, Kabir Gogia, Vaibhav Bajpai, Alex C Snoeren, and kc claffy.
“Inferring Persistent Interdomain Congestion”. In: Proceedings of the Conference of the
ACM Special Interest Group on Data Communication. SIGCOMM ’18. ACM, 2018.
[109] Amogh Dhamdhere and Constantine Dovrolis. “The Internet is Flat: Modeling the Transition
from a Transit Hierarchy to a Peering Mesh”. In: Proceedings of the International Conference
on Emerging Networking EXperiments and Technologies. CoNEXT ’10. ACM, 2010.
[110] Xenofontas Dimitropoulos and George Riley. “Efficient Large-Scale BGP Simulations”.
In: Computer Networks, Special Issue on Network Modeling and Simulation 50.12 (2006),
pp. 2013–2027.
[111] Florin Dobrian, Vyas Sekar, Asad Awan, Ion Stoica, Dilip Joseph, Aditya Ganjam, Jibin
Zhan, and Hui Zhang. “Understanding the Impact of Video Quality on User Engagement”. In:
Proceedings of the Conference of the ACM Special Interest Group on Data Communication.
SIGCOMM ’11. ACM, 2011.
[112] Josep Domenech, Julio Sahuquillo, José A Gil, and Ana Pont. “The Impact of the Web
Prefetching Architecture on the Limits of Reducing User’s Perceived Latency”. In: Proceed-
ings of the IEEE/WIC/ACM International Conference on Web Intelligence. WI ’06. IEEE,
2006.
[113] Constantinos Dovrolis, Parameswaran Ramanathan, and David Moore. “Packet-dispersion
techniques and a capacity-estimation methodology”. In: IEEE/ACM Transactions on Net-
working (TON) 12.6 (2004), pp. 963–977.
[114] Constantinos Dovrolis, Parameswaran Ramanathan, and David Moore. “What do packet
dispersion techniques measure?” In: Proceedings of Annual Joint Conference of the IEEE
Computer and Communications Societies. INFOCOM ’01. IEEE, 2001.
[115] Allen B Downey. “Using pathchar to estimate Internet link characteristics”. In: Proceedings
of the Conference of the ACM Special Interest Group on Data Communication. SIGCOMM
’99. ACM, 1999.
[116] Nick Duffield, Kartik Gopalan, Michael R Hines, Aman Shaikh, and Jacobus E Van Der
Merwe. “Measurement Informed Route Selection”. In: Proceedings of the International
Conference on Passive and Active Network Measurement. PAM ’07. Springer, 2007.
[117] Nandita Dukkipati, Neal Cardwell, Yuchung Cheng, and Matt Mathis. Tail Loss Probe
(TLP): An Algorithm for Fast Recovery of Tail Losses. Internet-Draft draft-dukkipati-tcpm-
tcp-loss-probe-01. Work in Progress. Internet Engineering Task Force, Feb. 2013. 20 pp.
URL: https://datatracker.ietf.org/doc/html/draft-dukkipati-tcpm-tcp-loss-probe-01.
[118] Ted Dunning and Otmar Ertl. “Computing Extremely Accurate Quantiles Using t-Digests”.
In: arXiv preprint, arXiv:1902.04023 (2019).
[119] Dyn. Pakistan hijacks YouTube. URL: https://dyn.com/blog/pakistan-hijacks-youtube-1/.
[120] Dyn. Reckless Driving on the Internet. URL: https://dyn.com/blog/the-flap-heard-around-
the-world/.
[121] Benjamin Edwards, Steven Hofmeyr, George Stelle, and Stephanie Forrest. “Internet topol-
ogy over time”. In: arXiv preprint (2012).
[122] Efficient and effective probing of Internet subnetworks. June 2015. URL: https://www.
noction.com/blog/probing-internet-subnetworks.
[123] Emulab. URL: https://www.emulab.net/.
[124] Deborah Estrin, Yakov Rekhter, and Steven Hotz. “Scalable Inter-domain Routing Ar-
chitecture”. In: Conference Proceedings on Communications Architectures & Protocols.
SIGCOMM ’92. 1992.
[125] ExaBGP. URL: https://github.com/Exa-Networks/exabgp.
[126] Xun Fan and John Heidemann. “Selecting Representative IP Addresses for Internet Topology
Studies”. In: Proceedings of the ACM Internet Measurement Conference. IMC ’10. ACM,
2010.
[127] Xun Fan, Ethan Katz-Bassett, and John Heidemann. “Assessing Affinity Between Users
and CDN Sites”. In: Proceedings of the International Workshop on Traffic Monitoring and
Analysis. TMA ’15. Springer, 2015.
[128] Dino Farinacci. “A Decade of Technology Pitfalls and Successes”. In: Proceedings of
NANOG. NANOG 30. 2004.
[129] Seyed K. Fayaz, Tushar Sharma, Ari Fogel, Ratul Mahajan, Todd Millstein, Vyas Sekar, and
George Varghese. “Efficient Network Reachability Analysis Using a Succinct Control Plane
Representation”. In: Proceedings of USENIX Symposium on Operating Systems Design and
Implementation. OSDI ’16. USENIX, 2016.
[130] Nick Feamster. “Revealing Utilization at Internet Interconnection Points”. In: CoRR abs/1603.03656
(2016).
[131] Nick Feamster, Hari Balakrishnan, Jennifer Rexford, Aman Shaikh, and Jacobus Van Der
Merwe. “The Case for Separating Routing from Routers”. In: Proceedings of the ACM
SIGCOMM workshop on Future Directions in Network Architecture. 2004, pp. 5–12.
[132] Nick Feamster, Jay Borkenhagen, and Jennifer Rexford. “Controlling the impact of BGP
policy changes on IP traffic”. In: AT&T Labs-Research, Tech. Rep. HA173000-011106-02TM
(2001).
[133] Nick Feamster and Jennifer Rexford. “Network-wide prediction of BGP routes”. In: IEEE/ACM
Transactions on Networking (TON) 15.2 (2007), pp. 253–266.
[134] Nick Feamster, Jennifer Rexford, and Ellen Zegura. “The Road to SDN: An Intellectual
History of Programmable Networks”. In: ACM SIGCOMM Computer Communication
Review 44.2 (2014), pp. 87–98.
[135] Andrew D Ferguson, Jordan Place, and Rodrigo Fonseca. “Growth analysis of a large ISP”.
In: Proceedings of the ACM Internet Measurement Conference. ACM, 2013, pp. 347–
352.
[136] P. Ferguson and D. Senie. Network Ingress Filtering: Defeating Denial of Service Attacks
which employ IP Source Address Spoofing. Internet Requests for Comments, RFC 2827.
RFC. May 2000.
[137] Julián Martín Del Fiore, Pascal Merindol, Valerio Persico, Cristel Pelsser, and Antonio
Pescapè. “Filtering the Noise to Reveal Inter-Domain Lies”. In: Proceedings of the Network
Traffic Measurement and Analysis Conference. TMA ’19. 2019.
[138] Tobias Flach, Nandita Dukkipati, Andreas Terzis, Barath Raghavan, Neal Cardwell, Yuchung
Cheng, Ankur Jain, Shuai Hao, Ethan Katz-Bassett, and Ramesh Govindan. “Reducing Web
Latency: The Virtue of Gentle Aggression”. In: Proceedings of the Conference of the ACM
Special Interest Group on Data Communication. SIGCOMM ’13. ACM, 2013.
[139] Tobias Flach, Pavlos Papageorge, Andreas Terzis, Luis Pedrosa, Yuchung Cheng, Tayeb
Karim, Ethan Katz-Bassett, and Ramesh Govindan. “An Internet-Wide Analysis of Traffic
Policing”. In: Proceedings of the Conference of the ACM Special Interest Group on Data
Communication. SIGCOMM ’16. ACM, 2016.
[140] Ashley Flavel, Pradeepkumar Mani, David Maltz, Nick Holt, Jie Liu, Yingying Chen, and
Oleg Surmachev. “FastRoute: A Scalable Load-Aware Anycast Routing Architecture for
Modern CDNs”. In: Proceedings of USENIX Symposium on Networked Systems Design and
Implementation. NSDI ’15. USENIX, 2015.
[141] Ari Fogel, Stanley Fung, Luis Pedrosa, Meg Walraed-Sullivan, Ramesh Govindan, Ratul
Mahajan, and Todd D Millstein. “A General Approach to Network Configuration Analysis”.
In: Proceedings of USENIX Symposium on Networked Systems Design and Implementation.
NSDI ’15. USENIX, 2015.
[142] Osvaldo Fonseca, Ítalo Cunha, Elverton Fazzion, Wagner Meira, Brivaldo Junior, Ronaldo A
Ferreira, and Ethan Katz-Bassett. “Tracking Down Sources of Spoofed IP Packets”. In:
Proceedings of the IFIP Networking Conference. IFIP ’20. IEEE, 2020.
[143] Chuck Fraleigh, Fouad Tobagi, and Christophe Diot. “Provisioning IP Backbone Networks
to Support Latency Sensitive Traffic”. In: Proceedings of Annual Joint Conference of the
IEEE Computer and Communications Societies. INFOCOM ’03. IEEE, 2003.
[144] Benjamin Frank, Ingmar Poese, Yin Lin, Georgios Smaragdakis, Anja Feldmann, Bruce
Maggs, Jannis Rake, Steve Uhlig, and Rick Weber. “Pushing CDN-ISP Collaboration to the
Limit”. In: ACM SIGCOMM Computer Communication Review 43.3 (2013), pp. 34–44.
[145] David Freedman, Brian Foust, Barry Greene, Ben Maddison, Andrei Robachevsky, Job
Snijders, and Sander Steffann. Mutually Agreed Norms for Routing Security (MANRS)
Implementation Guide. 2019.
[146] Michael J. Freedman, Eric Freudenthal, and David Mazieres. “Democratizing Content
Publication with Coral”. In: Proceedings of USENIX Symposium on Networked Systems
Design and Implementation. NSDI ’04. USENIX, 2004.
[147] Michael J. Freedman, Mythili Vutukuru, Nick Feamster, and Hari Balakrishnan. “Geographic
Locality of IP Prefixes”. In: Proceedings of the ACM Internet Measurement Conference.
IMC ’05. ACM, 2005.
[148] Andrew Gallo. RPKI: BGP Security Hammpered By A Legal Agreement. PacketPushers.
Dec. 2014. URL: https://packetpushers.net/rpki-bgp-security-hammpered-legal-agreement/.
[149] Lixin Gao. “On Inferring Autonomous System Relationships in the Internet”. In: IEEE/ACM
Transactions on Networking (TON) 9.6 (2001), pp. 733–745.
[150] Lixin Gao and Jennifer Rexford. “Stable Internet routing without global coordination”. In:
IEEE/ACM Transactions on Networking (TON) 9.6 (2001), pp. 681–692.
[151] Ruomei Gao, Constantinos Dovrolis, and Ellen W Zegura. “Interdomain ingress traffic
engineering through optimized AS-path prepending”. In: Proceedings of the International
Conference on Research in Networking. Springer, 2005.
[152] Zhiqiang Gao and Nirwan Ansari. “A practical and robust inter-domain marking scheme for
IP traceback”. In: Computer Networks 51.3 (2007), pp. 732–750.
[153] Artyom Gavrichenkov. “Breaking HTTPS with BGP Hijacking”. In: Blackhat Security
Conference. 2015.
[154] GÉANT. URL: https://www.geant.org/.
[155] Wes George. “Adventures in RPKI (non) deployment”. In: Proceedings of NANOG. NANOG
62. Oct. 2014. URL: https://www.nanog.org/sites/default/files/wednesday_george_
adventuresinrpki_62.9.pdf.
[156] Jim Gettys. “Bufferbloat: Dark buffers in the internet”. In: IEEE Internet Computing 3
(2011), p. 96.
[157] M. Gharaibeh, H. Zhang, C. Papadopoulos, and J. Heidemann. “Assessing Co-locality of IP
Blocks”. In: IEEE INFOCOM Workshops. INFOCOM ’16. IEEE, 2016.
[158] Yossi Gilad, Avichai Cohen, Amir Herzberg, Michael Schapira, and Haya Shulman. “Are
We There Yet? On RPKI’s Deployment and Security”. In: Proceedings of Network and
Distributed System Security Symposium. NDSS ’17. Internet Society, 2017.
[159] Phillipa Gill, Martin Arlitt, Zongpeng Li, and Anirban Mahanti. “The Flattening Internet
Topology: Natural Evolution, Unsightly Barnacles or Contrived Collapse?” In: Proceedings
of the International Conference on Passive and Active Network Measurement. PAM ’08.
Springer, 2008.
[160] Phillipa Gill, Michael Schapira, and Sharon Goldberg. “A Survey of Interdomain Routing
Policies”. In: ACM SIGCOMM Computer Communication Review 44.1 (2013), pp. 28–34.
[161] Phillipa Gill, Michael Schapira, and Sharon Goldberg. “Let the Market Drive Deployment: A
Strategy for Transitioning to BGP Security”. In: ACM SIGCOMM Computer Communication
Review 41.4 (2011), pp. 14–25.
[162] Phillipa Gill, Michael Schapira, and Sharon Goldberg. “Modeling on Quicksand: Dealing
with the Scarcity of Ground Truth in Interdomain Routing Data”. In: ACM SIGCOMM
Computer Communication Review 42.1 (2012), pp. 40–46.
[163] Vasileios Giotsas, Amogh Dhamdhere, and Kimberly C. Claffy. “Periscope: Unifying
Looking Glass Querying”. In: Proceedings of the International Conference on Passive and
Active Network Measurement. PAM ’16. Springer, 2016.
[164] Vasileios Giotsas, Matthew Luckie, Bradley Huffaker, and kc claffy. “Inferring Complex
AS Relationships”. In: Proceedings of the ACM Internet Measurement Conference. IMC
’14. ACM, 2014.
[165] Vasileios Giotsas and Shi Zhou. “Valley-free violation in Internet routing — Analysis
based on BGP Community data”. In: Proceedings of IEEE International Conference on
Communications. ICC ’12. IEEE. 2012.
[166] Lenny Giuliano. “Internet Multicast: It’s Still a Thing”. In: Proceedings of NANOG. NANOG
62. Oct. 2014.
[167] Lenny Giuliano. “Multicast to the Grandma (MTTG): It’s finally here!” In: Proceedings of
IETF. IETF 104. IETF, 2019.
[168] GNS3. URL: https://gns3.com/.
[169] Sharon Goldberg. “Why Is It Taking So Long to Secure Internet Routing?” In: Communica-
tions of the ACM 57.10 (2014), pp. 56–63.
[170] Philip Golden, Hervé Dedieu, and Krista S Jacobsen. Fundamentals of DSL Technology.
CRC Press, 2005.
[171] Emanuele Goldoni and Marco Schivi. “End-to-End Available Bandwidth Estimation Tools,
An Experimental Comparison”. In: Proceedings of the International Workshop on Traffic
Monitoring and Analysis. TMA ’10. Springer, 2010.
[172] Dan Goodin. BGP event sends European mobile traffic through China Telecom for 2 hours.
ArsTechnica. June 2019. URL: https://arstechnica.com/information-technology/2019/06/bgp-
mishap-sends-european-mobile-traffic-through-china-telecom-for-2-hours/.
[173] Google chooses RouteScience Internet technology. July 2002. URL: https://www.computerweekly.
com/news/2240046663/Google-chooses-RouteScience-Internet-technology.
[174] Google Edge Network. URL: https://peering.google.com/#/infrastructure.
[175] Google Picks RouteScience. July 2002. URL: https://www.lightreading.com/ethernet-
ip/google-picks-routescience/d/d-id/582262.
[176] Google Video Quality Report. 2018. URL: https://www.google.com/get/videoqualityreport/
#methodology.
[177] Ramesh Govindan and Anoop Reddy. “An Analysis of Internet Inter-Domain Topology and
Route Stability”. In: Proceedings of Annual Joint Conference of the IEEE Computer and
Communications Societies. INFOCOM ’97. IEEE, 1997.
[178] Matthew Graydon and Lisa Parks. “‘Connecting the unconnected’: a critical assessment of
US satellite Internet services”. In: Media, Culture & Society 42.2 (2020), pp. 260–276.
[179] Carlo Grazia and Natale Patriciello. TCP small queues and WiFi aggregation — a war story.
June 2018. URL: https://lwn.net/Articles/757643/.
[180] Carlo Augusto Grazia, Natale Patriciello, Martin Klapez, and Maurizio Casoni. “BBR+:
Improving TCP BBR Performance over WLAN”. In: Proceedings of IEEE International
Conference on Communications. ICC ’20. IEEE, 2020.
[181] Enrico Gregori, Alessandro Improta, Luciano Lenzini, Lorenzo Rossi, and Luca Sani. “On
the Incompleteness of the AS-level Graph: a Novel Methodology for BGP Route Collector
Placement”. In: Proceedings of the ACM Internet Measurement Conference. IMC ’12. ACM,
2012.
[182] Enrico Gregori, Alessandro Improta, and Luca Sani. “On the African peering connectivity
revealable via BGP route collectors”. In: International Conference on e-Infrastructure and
e-Services for Developing Countries. Springer, 2017.
[183] Cesar D Guerrero and Miguel A Labrador. “On the applicability of available bandwidth
estimation techniques and tools”. In: Computer Communications 33.1 (2010), pp. 11–22.
[184] Bamba Gueye, Steve Uhlig, and Serge Fdida. “Investigating the Imprecision of IP Block-
Based Geolocation”. In: Proceedings of the International Workshop on Passive and Active
Network Measurement. PAM ’07. Springer, 2007.
[185] Arpit Gupta, Matt Calder, Nick Feamster, Marshini Chetty, Enrico Calandro, and Ethan
Katz-Bassett. “Peering at the Internet’s Frontier: A First Look at ISP Interconnectivity in
Africa”. In: Proceedings of the International Conference on Passive and Active Network
Measurement. PAM ’14. Springer, 2014.
[186] Arpit Gupta, Robert MacDavid, Rüdiger Birkner, Marco Canini, Nick Feamster, Jennifer
Rexford, and Laurent Vanbever. “An Industrial-Scale Software Defined Internet Exchange
Point”. In: Proceedings of USENIX Symposium on Networked Systems Design and Imple-
mentation. NSDI ’16. USENIX, 2016.
[187] Arpit Gupta, Laurent Vanbever, Muhammad Shahbaz, Sean P. Donovan, Brandon Schlinker,
Nick Feamster, Jennifer Rexford, Scott Shenker, Russ Clark, and Ethan Katz-Bassett. “SDX:
A Software Defined Internet Exchange”. In: Proceedings of the Conference of the ACM
Special Interest Group on Data Communication. SIGCOMM ’14. ACM, 2014.
[188] Andrei Gurtov, Tom Henderson, Sally Floyd, and Yoshifumi Nishida. The NewReno Modifi-
cation to TCP’s Fast Recovery Algorithm. RFC 6582. Apr. 2012. DOI: 10.17487/RFC6582.
URL: https://rfc-editor.org/rfc/rfc6582.txt.
[189] Sangtae Ha and Injong Rhee. “Hybrid Slow Start for High-Bandwidth and Long-Distance
Networks”. In: Proceedings of PFLDnet. 2008.
[190] Sangtae Ha, Injong Rhee, and Lisong Xu. “CUBIC: A New TCP-friendly High-speed TCP
Variant”. In: ACM SIGOPS Operating Systems Review 42.5 (July 2008), pp. 64–74.
[191] Evangelos Haleplidis, Kostas Pentikousis, Spyros Denazis, Jamal Hadi Salim, David Meyer,
and Odysseas Koufopavlou. Software-Defined Networking (SDN): Layers and Architecture
Terminology. RFC 7426. Jan. 2015. DOI: 10.17487/RFC7426. URL: https://rfc-editor.org/
rfc/rfc7426.txt.
[192] S. Hares and D. Katz. Administrative Domains and Routing Domains A Model for Routing
in the Internet. Internet Requests for Comments, RFC 1136. RFC. Dec. 1989.
[193] W. Hargrave, M. Griswold, J. Snijders, and N. Hilliard. Mitigating the Negative Impact of
Maintenance through BGP Session Culling. Internet Requests for Comments, RFC 8327.
RFC. Mar. 2018.
[194] Will Hargrave. “Reducing the impact of IXP maintenance”. In: Proceedings of RIPE. RIPE
67. RIPE, 2013. URL: https://ripe67.ripe.net/presentations/374-WH-IXPMaintReduce.pdf.
[195] Dimitry Haskin. “A BGP/IDRP Route Server alternative to a full mesh routing”. In: (1995).
[196] Jiayue He and Jennifer Rexford. “Toward Internet-Wide Multipath Routing”. In: IEEE
Network 22.2 (Mar. 2008), pp. 16–21.
[197] John Heidemann, Yuri Pradkin, Ramesh Govindan, Christos Papadopoulos, Genevieve
Bartlett, and Joseph Bannister. “Census and Survey of the Visible Internet”. In: Proceedings
of the ACM Internet Measurement Conference. IMC ’08. ACM, 2008.
[198] J. Heitz, J. Snijders, K. Patel, I. Bagdonas, and N. Hilliard. BGP Large Communities
Attribute. Internet Requests for Comments, RFC 8092. RFC. Feb. 2017.
[199] Tristan Henderson. “Latency and User Behaviour on a Multiplayer Game Server”. In: Intl.
Workshop on Networked Group Communication. 2001.
[200] Tomas Hlavacek, Italo Cunha, Yossi Gilad, Amir Herzberg, Ethan Katz-Bassett, Michael
Schapira, and Haya Shulman. “DISCO: Sidestepping RPKI’s Deployment Barriers”. In:
Proceedings of Network and Distributed System Security Symposium. NDSS ’20. Internet
Society, 2020.
[201] Thomas Holterbach, Stefano Vissicchio, Alberto Dainotti, and Laurent Vanbever. “SWIFT:
Predictive Fast Reroute”. In: Proceedings of the Conference of the ACM Special Interest
Group on Data Communication. SIGCOMM ’17. ACM, 2017.
[202] Chi-Yao Hong, Srikanth Kandula, Ratul Mahajan, Ming Zhang, Vijay Gill, Mohan Nanduri,
and Roger Wattenhofer. “Achieving High Utilization with Software-driven WAN”. In:
Proceedings of the Conference of the ACM Special Interest Group on Data Communication.
SIGCOMM ’13. ACM, 2013.
[203] C. Hopps. Analysis of an Equal-Cost Multi-Path Algorithm. Internet Requests for Comments,
RFC 2992. RFC. Nov. 2000.
[204] Ningning Hu and Peter Steenkiste. Estimating Available Bandwidth Using Packet Pair
Probing. Tech. rep. Carnegie Mellon University School of Computer Science, 2002.
[205] X. Hu and Z. Morley Mao. “Accurate Real-time Identification of IP Prefix Hijacking”. In:
Proceedings of IEEE Symposium on Security and Privacy. S&P ’07. 2007.
[206] Cheng Huang, David A Maltz, Jin Li, and Albert Greenberg. “Public DNS system and global
traffic management”. In: Proceedings of Annual Joint Conference of the IEEE Computer
and Communications Societies. INFOCOM ’11. IEEE, 2011.
[207] Mark Huang, Andy Bavier, and Larry Peterson. “PlanetFlow: Maintaining Accountability
for Network Services”. In: ACM SIGOPS Operating Systems Review 40.1 (2006), pp. 89–94.
[208] Geoff Huston. BBR, the new kid on the TCP block. APNIC Blog. May 2017. URL: https:
//blog.apnic.net/2017/05/09/bbr-new-kid-tcp-block/.
[209] Geoff Huston. BGP Routing Table Analysis Reports. URL: https://bgp.potaroo.net/.
[210] Geoff Huston. Leaking Routes. Potaroo: The ISP Column. Mar. 2012. URL: https://www.
potaroo.net/ispcol/2012-03/leaks.html.
[211] Internap Brings Performance Optimization to Enterprise Networks with Intelligent Traffic
Routing Appliance. June 2015. URL: https://www.inap.com/press-release/optimizing-
enterprise-networks-with-intelligent-traffic-routing-appliance/.
[212] Internet Routing Registry Tutorial. APNIC. 2008.
[213] ISP Interconnection and its Impact on Consumer Internet Performance. Tech. rep. Mea-
surement Lab Consortium (M-Lab), Oct. 2014. URL: https://www.measurementlab.net/
publications/isp-interconnection-impact.pdf.
[214] Jana Iyengar and Martin Thomson. QUIC: A UDP-Based Multiplexed and Secure Transport.
RFC 9000. May 2021. DOI: 10.17487/RFC9000. URL: https://rfc-editor.org/rfc/rfc9000.txt.
[215] Van Jacobson. “Congestion Avoidance and Control”. In: Proceedings of the Conference on
Communications Architecture and Protocols. SIGCOMM ’88. ACM, 1988.
[216] Van Jacobson. Pathchar: A tool to infer characteristics of Internet paths. 1997.
[217] Manish Jain and Constantinos Dovrolis. “End-to-End Available Bandwidth: Measurement
Methodology, Dynamics, and Relation with TCP Throughput”. In: Proceedings of the
Conference of the ACM Special Interest Group on Data Communication. SIGCOMM ’02.
ACM, 2002.
[218] Manish Jain and Constantinos Dovrolis. “End-to-end estimation of the available bandwidth
variation range”. In: ACM SIGMETRICS Performance Evaluation Review 33.1 (2005),
pp. 265–276.
[219] Manish Jain and Constantinos Dovrolis. “Pathload: A measurement tool for end-to-end
available bandwidth”. In: Proceedings of the Passive and Active Measurements Workshop.
PAM ’02. 2002.
[220] Manish Jain and Constantinos Dovrolis. “Ten Fallacies and Pitfalls on End-to-End Available
Bandwidth Estimation”. In: Proceedings of the ACM Internet Measurement Conference.
IMC ’04. ACM, 2004.
[221] S. Jain, A. Kumar, S. Mandal, J. Ong, L. Poutievski, A. Singh, S. Venkata, J. Wanderer,
J. Zhou, M. Zhu, J. Zolla, U. Hölzle, S. Stuart, and A. Vahdat. “B4: Experience with a
Globally-deployed Software Defined Wan”. In: Proceedings of the Conference of the ACM
Special Interest Group on Data Communication. SIGCOMM ’13. ACM, 2013.
[222] Paul Jakma and David Lamparter. “Introduction to the quagga routing suite”. In: IEEE
Network 28.2 (2014), pp. 42–48.
[223] Umar Javed, Italo Cunha, David R. Choffnes, Ethan Katz-Bassett, Thomas E. Anderson,
and Arvind Krishnamurthy. “PoiRoot: Investigating the Root Cause of Interdomain Path
Changes”. In: Proceedings of the Conference of the ACM Special Interest Group on Data
Communication. SIGCOMM ’13. ACM, 2013.
[224] Guojun Jin. Algorithms and requirements for measuring network bandwidth. 2002.
[225] John P. John, Ethan Katz-Bassett, Arvind Krishnamurthy, Thomas Anderson, and Arun
Venkataramani. “Consensus Routing: The Internet As a Distributed System”. In: Proceedings
of USENIX Symposium on Networked Systems Design and Implementation. NSDI ’08.
USENIX, 2008.
[226] Matt Joras and Yang Chi. How Facebook is bringing QUIC to billions. Oct. 2020. URL:
https://engineering.fb.com/2020/10/21/networking-traffic/how-facebook-is-bringing-quic-
to-billions/.
[227] Brivaldo Junior, Ronaldo A. Ferreira, Ítalo Cunha, Brandon Schlinker, and Ethan Katz-
Bassett. “High-Fidelity Interdomain Routing Experiments”. In: ACM SIGCOMM Posters
and Demos. SIGCOMM ’18. ACM, 2018.
[228] Juniper. Routing Instances Overview. 2017. URL: https://www.juniper.net/documentation/
en_US/junos/topics/concept/routing-instances-overview.html.
[229] k. claffy, D. Clark, S. Bauer, and A. Dhamdhere. “Policy Challenges in Mapping Internet
Interdomain Congestion”. In: Telecommunications Policy Research Conference (TPRC).
Oct. 2016.
[230] Theo Kanter and Christian Olrog. “VoIP in applications for wireless access”. In: Proceedings
of the IEEE Workshop on Local and Metropolitan Area Networks. LANMAN ’99. IEEE,
1999.
[231] Rohit Kapoor, Ling-Jyh Chen, Li Lao, Mario Gerla, and M Young Sanadidi. “CapProbe: A
Simple and Accurate Capacity Estimation Technique”. In: Proceedings of the Conference of
the ACM Special Interest Group on Data Communication. SIGCOMM ’04. ACM, 2004.
[232] Ethan Katz-Bassett, David R Choffnes, Ítalo Cunha, Colin Scott, Thomas Anderson, and
Arvind Krishnamurthy. “Machiavellian Routing: Improving Internet Availability with BGP
Poisoning”. In: Proceedings of the ACM Workshop on Hot Topics in Networks. HotNets ’11.
ACM, 2011.
[233] Ethan Katz-Bassett, John P John, Arvind Krishnamurthy, David Wetherall, Thomas An-
derson, and Yatin Chawathe. “Towards IP Geolocation Using Delay and Topology”. In:
Proceedings of the ACM Internet Measurement Conference. IMC ’06. ACM, 2006.
[234] Ethan Katz-Bassett, Harsha V. Madhyastha, Vijay Kumar Adhikari, Colin Scott, Justine
Sherry, Peter van Wesep, Thomas E. Anderson, and Arvind Krishnamurthy. “Reverse
Traceroute”. In: Proceedings of USENIX Symposium on Networked Systems Design and
Implementation. NSDI ’10. USENIX, 2010.
[235] Ethan Katz-Bassett, Harsha V. Madhyastha, John P. John, Arvind Krishnamurthy, David
Wetherall, and Thomas Anderson. “Studying Black Holes in the Internet with Hubble”. In:
Proceedings of USENIX Symposium on Networked Systems Design and Implementation.
NSDI ’08. USENIX, 2008.
[236] Ethan Katz-Bassett, Colin Scott, David R. Choffnes, Ítalo Cunha, Vytautas Valancius, Nick
Feamster, Harsha V. Madhyastha, Thomas Anderson, and Arvind Krishnamurthy. “LIFE-
GUARD: Practical Repair of Persistent Route Failures”. In: Proceedings of the Conference
of the ACM Special Interest Group on Data Communication. SIGCOMM ’12. ACM, 2012.
[237] Christian Kaufmann. “BGP and Traffic Engineering with Akamai”. In: Proceedings of
MENOG. MENOG 14, Apr. 2014. URL: http://www.menog.org/presentations/menog-
14/282-20140331_MENOG_BGP_and_Traffic_Engineering_with_Akamai.pdf.
[238] Conor Kelton, Jihoon Ryoo, Aruna Balasubramanian, and Samir R Das. “Improving User
Perceived Page Load Times Using Gaze”. In: Proceedings of USENIX Symposium on
Networked Systems Design and Implementation. NSDI ’17. USENIX, 2017.
[239] Kernel documentation network timestamping. URL: https://www.kernel.org/doc/Documentation/
networking/timestamping.txt.
[240] Srinivasan Keshav. “A Control-Theoretic Approach to Flow Control”. In: Proceedings of
the Conference on Communications Architecture and Protocols. SIGCOMM ’91. ACM,
1991.
[241] Dan Komosny, Miroslav Voznak, and Saeed Ur Rehman. “Location Accuracy of Commercial
IP Address Geolocation Databases”. In: Information Technology And Control 46.3 (2017),
pp. 333–344.
[242] Bill Kostka. An Overview of the DOCSIS Two Way Physical Layer. Tech. rep. Cable
Television Laboratories, Inc., 1998.
[243] Rupa Krishnan, Harsha V Madhyastha, Sridhar Srinivasan, Sushant Jain, Arvind Krishna-
murthy, Thomas Anderson, and Jie Gao. “Moving Beyond End-to-End Path Information to
Optimize CDN Performance”. In: Proceedings of the ACM Internet Measurement Confer-
ence. IMC ’09. ACM, 2009.
[244] Jonathan Kua, Grenville Armitage, and Philip Branch. “A Survey of Rate Adaptation
Techniques for Dynamic Adaptive Streaming Over HTTP”. In: IEEE Communications
Surveys & Tutorials 19.3 (2017), pp. 1842–1866.
[245] Nicolas Kuhn, Gorry Fairhurst, John Border, and Stephan Emile. QUIC for SATCOM.
Internet-Draft draft-kuhn-quic-4-sat-00. IETF Secretariat, July 2019. URL: http://www.ietf.
org/internet-drafts/draft-kuhn-quic-4-sat-00.txt.
[246] James F Kurose and Keith Ross. Computer Networking: A top-down approach. Pearson
Education, 2021.
[247] Nate Kushman, Srikanth Kandula, and Dina Katabi. “Can you hear me now?!: it must be
BGP”. In: ACM SIGCOMM Computer Communication Review 37.2 (2007), pp. 75–84.
[248] Craig Labovitz. “Internet Traffic 2009-2019”. In: Asia Pacific Regional Internet Conference
on Operational Technologies. Feb. 2019.
[249] Craig Labovitz, Abha Ahuja, Abhijit Bose, and Farnam Jahanian. “Delayed Internet routing
convergence”. In: IEEE/ACM Transactions on Networking (TON) 9.3 (2001), pp. 293–306.
[250] Craig Labovitz, Scott Iekel-Johnson, Danny McPherson, Jon Oberheide, and Farnam Jaha-
nian. “Internet Inter-domain Traffic”. In: Proceedings of the Conference of the ACM Special
Interest Group on Data Communication. SIGCOMM ’10. ACM, 2010.
[251] Mohit Lad, Dan Massey, Dan Pei, Yiguo Wu, Beichuan Zhang, and Lixia Zhang. “PHAS: A
Prefix Hijack Alert System”. In: Proceedings of USENIX Security Symposium. USENIX
Security ’06. USENIX, 2006.
[252] Kevin Lai and Mary Baker. “Measuring Bandwidth”. In: Proceedings of Annual Joint
Conference of the IEEE Computer and Communications Societies. INFOCOM ’99. IEEE,
1999.
[253] Kevin Lai and Mary Baker. “Measuring Link Bandwidths Using a Deterministic Model of
Packet Delay”. In: Proceedings of the Conference of the ACM Special Interest Group on
Data Communication. SIGCOMM ’00. ACM, 2000, pp. 283–294.
[254] Kevin Lai and Mary Baker. “Nettimer: A Tool for Measuring Bottleneck Link Bandwidth”.
In: Proceedings of USENIX Symposium on Internet Technologies and Systems. USITS ’03.
USENIX, 2001.
[255] Karthik Lakshminarayanan, Ion Stoica, Scott Shenker, and Jennifer Rexford. Routing as a
Service. Computer Science Division, University of California Berkeley, 2004.
[256] Rustam Lalkaka. Introducing Argo — A faster, more reliable, more secure Internet for
everyone. 2017. URL: https://blog.cloudflare.com/argo/.
[257] Bob Lantz, Brandon Heller, and Nick McKeown. “A Network in a Laptop: Rapid Prototyping
for Software Defined Networks”. In: Proceedings of the ACM Workshop on Hot Topics in
Networks. HotNets ’10. ACM, 2010.
[258] Wei-Tsong Lee, Kuo-Chih Chu, Chin-Ping Tan, Kuo-Kan Yu, et al. “How DOCSIS Protocol
Solves Asymmetric Bandwidth Issue in Cable Network”. In: Journal of Applied Science
and Engineering 9.1 (2006), pp. 55–62.
[259] Youndo Lee and Neil Spring. “Identifying and Aggregating Homogeneous IPv4 /24 Blocks
with Hobbit”. In: Proceedings of the ACM Internet Measurement Conference. IMC ’16.
ACM, 2016.
[260] Matt Lepinski, Richard Barnes, and Stephen Kent. “An infrastructure to support secure
internet routing”. In: (2012).
[261] Matt Lepinski and Kotikalapudi Sriram. BGPsec Protocol Specification. RFC 8205. Sept.
2017. DOI: 10.17487/RFC8205. URL: https://rfc-editor.org/rfc/rfc8205.txt.
[262] Matt Levine, Barrett Lyon, and T Underwood. “TCP Anycast-Don’t believe the FUD”. In:
Proceedings of NANOG. NANOG 37. 2006.
[263] Zhihao Li, Dave Levin, Neil Spring, and Bobby Bhattacharjee. “Internet Anycast: Perfor-
mance, Problems, & Potential”. In: Proceedings of the Conference of the ACM Special
Interest Group on Data Communication. SIGCOMM ’18. ACM, 2018.
[264] Jim Liddle. “Amazon found every 100ms of latency cost them 1% in sales”. In: The
GigaSpaces 27 (2008).
[265] Linux Advanced Routing & Traffic Control HOWTO: 4.1. Simple source policy routing. URL:
https://tldp.org/HOWTO/Adv-Routing-HOWTO/lartc.rpdb.simple.html.
[266] Linux Kernel net/ipv4/tcp_bbr.c. URL: https://github.com/torvalds/linux/blob/master/net/ipv4/
tcp_bbr.c.
[267] NANOG Mailing List. AT&T/as7018 now drops invalid prefixes from peers – Thread. Feb.
2019. URL: https://web.archive.org/web/20200410155304/https://mailman.nanog.org/
pipermail/nanog/2019-February/099501.html.
[268] NANOG Mailing List. BGP Experiment – Thread. Jan. 2019. URL: https://mailman.nanog.
org/pipermail/nanog/2019-January/098761.html.
[269] Hongqiang Harry Liu, Raajay Viswanathan, Matt Calder, Aditya Akella, Ratul Mahajan,
Jitendra Padhye, and Ming Zhang. “Efficiently Delivering Online Services over Integrated
Infrastructure”. In: Proceedings of USENIX Symposium on Networked Systems Design and
Implementation. NSDI ’16. USENIX, 2016.
[270] Xuanzhe Liu, Yun Ma, Yunxin Liu, Tao Xie, and Gang Huang. “Demystifying the imperfect
client-side cache performance of mobile web browsing”. In: IEEE Transactions on Mobile
Computing 15.9 (2015), pp. 2206–2220.
[271] Ioana Livadariu, Thomas Dreibholz, Anas Saeed Al-Selwi, Haakon Bryhni, Olav Lysne,
Steinar Bjørnstad, and Ahmed Elmokashfi. “On the Accuracy of Country-Level IP Geoloca-
tion”. In: Proceedings of the Applied Networking Research Workshop. ANRW ’20. ACM,
2020.
[272] Aemen Lodhi, Nikolaos Laoutaris, Amogh Dhamdhere, and Constantine Dovrolis. “Com-
plexities in Internet Peering: Understanding the “Black” in the “Black Art””. In: Proceedings
of Annual Joint Conference of the IEEE Computer and Communications Societies. INFO-
COM ’15. IEEE, 2015.
[273] Aemen Lodhi, Natalie Larson, Amogh Dhamdhere, Constantine Dovrolis, and kc claffy.
“Using peeringDB to Understand the Peering Ecosystem”. In: ACM SIGCOMM Computer
Communication Review 44.2 (2014), pp. 20–27.
[274] M. Luckie, R. Beverly, R. Koga, K. Keys, J. Kroll, and k. claffy. “Network Hygiene,
Incentives, and Regulation: Deployment of Source Address Validation in the Internet”. In:
ACM Computer and Communications Security (CCS). Nov. 2019.
[275] Matthew Luckie. “Scamper: A Scalable and Extensible Packet Prober for Active Measure-
ment of the Internet”. In: Proceedings of the ACM Internet Measurement Conference. IMC
’10. ACM, 2010.
[276] Matthew Luckie and Robert Beverly. “The impact of router outages on the AS-level In-
ternet”. In: Proceedings of the Conference of the ACM Special Interest Group on Data
Communication. SIGCOMM ’17. ACM, 2017.
[277] Matthew Luckie, Brian Huffaker, Amogh Dhamdhere, Vasileios Giotsas, and kc claffy.
“AS Relationships, Customer Cones, and Validation”. In: Proceedings of the ACM Internet
Measurement Conference. IMC ’13. ACM, 2013.
[278] Robert Lychev, Sharon Goldberg, and Michael Schapira. “BGP Security in Partial Deploy-
ment: Is the Juice Worth the Squeeze?” In: Proceedings of the Conference of the ACM
Special Interest Group on Data Communication. SIGCOMM ’13. ACM, 2013.
[279] M-Lab. URL: https://www.measurementlab.net.
[280] H. Madhyastha, T. Isdal, M. Piatek, C. Dixon, T. Anderson, A. Krishnamurthy, and A.
Venkataramani. “iPlane: an Information Plane for Distributed Services”. In: Proceedings of
USENIX Symposium on Operating Systems Design and Implementation. OSDI ’06. USENIX,
2006.
[281] Doug Madory. Large European routing leak sends traffic through China Telecom. APNIC
Blog. June 2019. URL: https://blog.apnic.net/2019/06/07/large-european-routing-leak-
sends-traffic-through-china-telecom/.
[282] Ratul Mahajan, David Wetherall, and Thomas Anderson. “Mutually Controlled Routing
with Independent ISPs”. In: Proceedings of USENIX Symposium on Networked Systems
Design and Implementation. NSDI ’07. USENIX, 2007.
[283] Ratul Mahajan, David Wetherall, and Thomas Anderson. “Negotiation-based routing be-
tween neighboring ISPs”. In: Proceedings of USENIX Symposium on Networked Systems
Design and Implementation. NSDI ’05. USENIX, 2005.
[284] Ratul Mahajan, David Wetherall, and Thomas Anderson. “Towards Coordinated Interdomain
Traffic Engineering”. In: Proceedings of the ACM Workshop on Hot Topics in Networks.
HotNets ’04. ACM, 2004.
[285] Ratul Mahajan, David Wetherall, and Tom Anderson. “Understanding BGP Misconfigu-
ration”. In: Proceedings of the Conference of the ACM Special Interest Group on Data
Communication. SIGCOMM ’12. ACM, 2012.
[286] Zhuoqing Morley Mao, Jennifer Rexford, Jia Wang, and Randy H. Katz. “Towards an
Accurate AS-level Traceroute Tool”. In: Proceedings of the Conference of the ACM Special
Interest Group on Data Communication. SIGCOMM ’03. ACM, 2003.
[287] Pietro Marchetta, Valerio Persico, Antonio Pescapé, and Ethan Katz-Bassett. “Don’t Trust
Traceroute (Completely)”. In: Proceedings of the ACM CoNEXT Student Workshop. ACM,
2013.
[288] Pedro Marcos, Lars Prehn, Lucas Leal, Alberto Dainotti, Anja Feldmann, and Marinho
Barcellos. “AS-Path Prepending: there is no rose without a thorn”. In: Proceedings of the
ACM Internet Measurement Conference. IMC ’20. ACM, 2020.
[289] Ciprian Marginean and Aris Lambrianidis. “Meet the Falcons”. In: Proceedings of NANOG.
NANOG 66. Feb. 2016. URL: https://www.nanog.org/sites/default/files/Marginean_Meet_
The_Falcons.pdf.
379
[290] Jim Martin. DOCSIS Performance Issues. Tech. rep. NCTA Technical Papers, 2005. URL:
https://www.nctatechnicalpapers.com/Paper/2005/2005- docsis- performance- issues/
download.
[291] Jim Martin and Nitin Shrivastav. “Modeling the DOCSIS 1.1/2.0 MAC protocol”. In:
Proceedings of the International Conference on Computer Communications and Networks.
ICCCN ’03. IEEE, 2003.
[292] M. Mathis and M. Allman. A Framework for Defining Empirical Bulk Transfer Capacity
Metrics. Internet Requests for Comments, RFC 3148. RFC. July 2001.
[293] Matt Mathis, John Heffner, and Raghu Reddy. “Web100: extended TCP instrumentation for
research, education and diagnosis”. In: ACM SIGCOMM Computer Communication Review
33.3 (2003), pp. 69–79.
[294] Matthew Mathis, Jeffrey Semke, Jamshid Mahdavi, and Teunis Ott. “The Macroscopic
Behavior of the TCP Congestion Avoidance Algorithm”. In: ACM SIGCOMM Computer
Communication Review 27.3 (1997), pp. 67–82.
[295] MaxMind GeoLite Database. URL: http://dev.maxmind.com/geoip/legacy/geolite/.
[296] Peyton Maynard-Koran. Fixing the Internet for Real Time Applications: Part II. URL:
https://technology.riotgames.com/news/fixing-internet-real-time-applications-part-ii.
[297] Tyler McDaniel, Jared M Smith, and Max Schuchard. “The Maestro Attack: Orchestrating
Malicious Flows with BGP”. In: International Conference on Security and Privacy in
Communication Systems. SecureComm 2020. Springer, 2020.
[298] Margaret M McMahon and Robert Rathburn. Measuring Latency in Iridium Satellite Con-
stellation Data Services. Tech. rep. NA V AL ACADEMY ANNAPOLIS MD DEPT OF
COMPUTER SCIENCE, 2005.
[299] D. McPherson and V . Gill. BGP MULTI_EXIT_DISC (MED) Considerations. Internet
Requests for Comments, RFC 4451. RFC. Mar. 2006.
[300] D. McPherson, V . Gill, D. Walton, and A. Retana. Border Gateway Protocol (BGP) Persistent
Route Oscillation Condition. Internet Requests for Comments, RFC 3345. RFC. Aug. 2002.
[301] Danny McPherson, Larry Blunk, Eric Osterweil, Shane Amante, and Dave Mitchell. Consid-
erations for Internet Routing Registries (IRRs) and Routing Policy Configuration. Internet
Requests for Comments, RFC 7682. RFC. Dec. 2015.
380
[302] Stefan Meinders. “The New Internet”. In: RIPE NCC Regional Meeting: Eurasia Network
Operators Group (ENOG 11). 2016.
[303] Metronome Systems Inc. eBPF Offload Getting Started Guide. 2018. URL: https://www.
netronome.com/m/documents/UG_Getting_Started_with_eBPF_Offload.pdf.
[304] Mininet eXtended. URL: http://mininext.uscnsl.net/.
[305] Jelena Mirkovic and Terry Benzel. “Teaching cybersecurity with DeterLab”. In: IEEE
Security & Privacy 10.1 (2012), pp. 73–76.
[306] Nitinder Mohan, Lorenzo Corneo, Aleksandr Zavodovski, Suzan Bayhan, Walter Wong, and
Jussi Kangasharju. “Pruning Edge Research with Latency Shears”. In: Proceedings of the
ACM Workshop on Hot Topics in Networks. HotNets ’20. ACM, 2020.
[307] Ricky KP Mok, Weichao Li, and Rocky KC Chang. “IRate: Initial Video Bitrate Selection
System for HTTP Streaming”. In: IEEE Journal on Selected Areas in Communications 34.6
(2016), pp. 1914–1928.
[308] Ricky KP Mok, Xiapu Luo, Edmond WW Chan, and Rocky KC Chang. “QDASH: a QoE-
aware DASH system”. In: Proceedings of the Multimedia Systems Conference. MMSys ’12.
ACM, 2012.
[309] Al Morton, Gomathi Ramachandran, Stanislav Shalunov, Len Ciavattone, and Jerry Perser.
Packet Reordering Metrics. RFC 4737. Nov. 2006. DOI: 10.17487/RFC4737. URL: https:
//rfc-editor.org/rfc/rfc4737.txt.
[310] Wolfgang Mühlbauer, Anja Feldmann, Olaf Maennel, Matthew Roughan, and Steve Uhlig.
“Building an AS-topology model that captures route diversity”. In: Proceedings of the
Conference of the ACM Special Interest Group on Data Communication. SIGCOMM ’06.
ACM, 2006.
[311] mvfst. GitHub. URL: https://github.com/facebookincubator/mvfst.
[312] Lily Hay Newman. The Infrastructure Mess Causing Countless Internet Outages. WIRED.
June 2019. URL: https://www.wired.com/story/bgp-route-leak-internet-outage/.
[313] Noction Intelligent Routing Platform. URL: https://www.noction.com/intelligent-routing-
platform-bgp-network-optimization.
[314] Noction Intelligent Routing Platform: Installation and Configuration Guide, v3.11.
381
[315] William B Norton. The Internet peering playbook: connecting to the core of the Internet.
DrPeering Press, 2011.
[316] William B Norton. Transit Traffic at Internet Exchange Points? May 2013. URL: http:
//drpeering.net/AskDrPeering/blog/articles/Ask_DrPeering/Entries/2013/5/17_Transit_
Traffic_at_Internet_Exchange_Points.html.
[317] Erik Nygren, Ramesh K. Sitaraman, and Jennifer Sun. “The Akamai Network: A Platform
for High-performance Internet Applications”. In: ACM SIGOPS Operating Systems Review
44.3 (2010), pp. 2–19.
[318] Joel Obstfeld, Simon Knight, Ed Kern, Qiang Sheng Wang, Tom Bryan, and Dan Bourque.
“VIRL: the virtual internet routing lab”. In: Proceedings of the Conference of the ACM
Special Interest Group on Data Communication. SIGCOMM ’14. ACM, 2014.
[319] Ricardo Oliveira, Dan Pei, Walter Willinger, Beichuan Zhang, and Lixia Zhang. “The
(in)completeness of the Observed Internet AS-level Structure”. In: IEEE/ACM Transactions
on Networking (TON) 18.1 (2009), pp. 109–122.
[320] OpenVPN. URL: https://openvpn.net/.
[321] Jitendra Padhye, Victor Firoiu, Don Towsley, and Jim Kurose. “Modeling TCP Throughput:
A Simple Model and Its Empirical Validation”. In: Proceedings of the Conference of the
ACM Special Interest Group on Data Communication. SIGCOMM ’98. ACM, 1998.
[322] Ramakrishna Padmanabhan, Patrick Owen, Aaron Schulman, and Neil Spring. “Timeouts:
Beware Surprisingly High Delay”. In: Proceedings of the ACM Internet Measurement
Conference. IMC ’15. ACM, 2015.
[323] Mengying Pan, Robert MacDavid, Shir Landau-Feibish, and Jennifer Rexford. “Memory-
Efficient Membership Encoding in Switches”. In: Proceedings of the Symposium on SDN
Research. SOSR ’20. ACM, 2020.
[324] Dr. Vern Paxson, Mark Allman, and W. Richard Stevens. TCP Congestion Control. RFC
2581. Apr. 1999. DOI: 10.17487/RFC2581. URL: https://rfc-editor.org/rfc/rfc2581.txt.
[325] PCH Daily Routing Snapshots. URL: https://www.pch.net/resources/Routing_Data/.
[326] Luis Pedrosa, Ari Fogel, Nupur Kothari, Ramesh Govindan, Ratul Mahajan, and Todd
Millstein. “Analyzing Protocol Implementations for Interoperability”. In: Proceedings
382
of USENIX Symposium on Networked Systems Design and Implementation. NSDI ’15.
USENIX, 2015.
[327] PeeringDB. URL: https://www.peeringdb.com/.
[328] Simon Peter, Umar Javed, Qiao Zhang, Doug Woos, Arvind Krishnamurthy, and Thomas
Anderson. “One Tunnel is (Often) Enough”. In: Proceedings of the Conference of the ACM
Special Interest Group on Data Communication. SIGCOMM ’14. ACM, 2014.
[329] Larry Peterson, Andy Bavier, Marc E. Fiuczynski, and Steve Muir. “Experiences Building
PlanetLab”. In: Proceedings of USENIX Symposium on Operating Systems Design and
Implementation. OSDI ’06. USENIX, 2006.
[330] Alex Pilosov and Tony Kapela. “Stealing The Internet: An Internet-Scale Man In The Middle
Attack”. In: Proceedings of DEFCON Security Conference. DEFCON 16. 2008.
[331] PlanetLab. URL: https://www.planet-lab.org/.
[332] Ingmar Poese, Steve Uhlig, Mohamed Ali Kaafar, Benoit Donnet, and Bamba Gueye.
“IP Geolocation Databases: Unreliable?” In: ACM SIGCOMM Computer Communication
Review 41.2 (2011), pp. 53–56.
[333] Ravi Prasad, Constantinos Dovrolis, Margaret Murray, and KC Claffy. “Bandwidth estima-
tion: metrics, measurement techniques, and tools”. In: IEEE network 17.6 (2003), pp. 27–
35.
[334] Ravi Prasad, Manish Jain, and Constantinos Dovrolis. “Effects of Interrupt Coalescence on
Network Measurements”. In: Proceedings of the International Workshop on Passive and
Active Network Measurement. PAM ’04. Springer, 2004.
[335] Brian J Premore. “An analysis of convergence properties of the border gateway protocol
using discrete event simulation”. PhD thesis. Dartmouth College Hanover, New Hampshire,
2003.
[336] Robert M. Price and Douglas G. Bonett. “Distribution-Free Confidence Intervals for Differ-
ence and Ratio of Medians”. In: Journal of Statistical Computation and Simulation 72.2
(2002), pp. 119–124.
[337] Matthew Prince. Technical Details Behind a 400Gbps NTP Amplification DDoS Attack.
Feb. 2014. URL: https://blog.cloudflare.com/technical-details-behind-a-400gbps-ntp-
amplification-ddos-attack/.
383
[338] Public Route Servers. URL: http://routeserver.org/.
[339] Lili Qiu, Yin Zhang, and Srinivasan Keshav. “Understanding the performance of many TCP
flows”. In: Computer Networks 37.3-4 (2001), pp. 277–306.
[340] Zakaria Al-Qudah, Seungjoon Lee, Michael Rabinovich, Oliver Spatscheck, and Jacobus
Van der Merwe. “Anycast-Aware Transport for Content Delivery Networks”. In: Proceedings
of the International Conference on World Wide Web. WWW ’09. ACM, 2009.
[341] Bruno Quoitin and Olivier Bonaventure. A survey of the utilization of the BGP community
attribute. Internet-Draft draft-quoitin-bgp-comm-survey-00. IETF Secretariat, Feb. 2002.
URL: https://tools.ietf.org/html/draft-quoitin-bgp-comm-survey-00.
[342] Bruno Quoitin, Cristel Pelsser, Louis Swinnen, Olivier Bonaventure, and Steve Uhlig.
“Interdomain traffic engineering with BGP”. In: IEEE Communications Magazine 41.5
(2003), pp. 122–128.
[343] Bruno Quoitin and Steve Uhlig. “Modeling the routing of an autonomous system with
C-BGP”. In: IEEE Network 19.6 (2005), pp. 12–19.
[344] Veena Raghavan, George Riley, and Talal Jaafar. “Realistic Topology Modeling for the In-
ternet BGP Infrastructure”. In: Proceedings of IEEE International Symposium on Modeling,
Analysis and Simulation of Computers and Telecommunication Systems. MASCOTS 2008.
IEEE. 2008, pp. 1–8.
[345] Riccardo Ravaioli, Guillaume Urvoy-Keller, and Chadi Barakat. “Characterizing ICMP
Rate Limitation on Routers”. In: Proceedings of IEEE International Conference on Commu-
nications. ICC ’15. IEEE, 2015.
[346] Recommendation G. 114, One-way transmission time. Series G: Transmission Systems and
Media, Digital Systems and Networks, Telecommunication Standardization Sector of ITU.
2000.
[347] Andreas Reuter, Randy Bush, Italo Cunha, Ethan Katz-Bassett, Thomas C. Schmidt, and
Matthias Wählisch. “Towards a Rigorous Methodology for Measuring Adoption of RPKI
Route Validation and Filtering”. In: ACM SIGCOMM Computer Communication Review
48.1 (2018), pp. 19–27.
[348] Vinay Joseph Ribeiro, Rudolf H Riedi, Richard G Baraniuk, Jiri Navratil, and Les Cottrell.
“pathChirp: Efficient Available Bandwidth Estimation for Network Paths”. In: Proceedings
of the Passive and Active Measurements Workshop. PAM ’03. 2003.
384
[349] Robert Ricci and Eric Eide. “Introducing CloudLab: Scientific infrastructure for advancing
cloud architectures and applications.” In: USENIX; login: (2014).
[350] Philipp Richter, Georgios Smaragdakis, Anja Feldmann, Nikolaos Chatzis, Jan Boettger, and
Walter Willinger. “Peering at Peerings: On the Role of IXP Route Servers”. In: Proceedings
of the ACM Internet Measurement Conference. IMC ’14. ACM, 2014.
[351] NCC RIPE. RIPEstat. 2011.
[352] RIPE NCC. Current RIS Routing Beacons. 2018. URL: https://www.ripe.net/analyse/internet-
measurements/routing-information-service-ris/current-ris-routing-beacons.
[353] RIPE Atlas. URL: https://atlas.ripe.net/.
[354] Greg Ritchie and Thomas Seal. “Why Low-Earth Orbit Satellites Are the New Space Race”.
In: The Washington Post (July 2020). URL: https://www.washingtonpost.com/business/why-
low-earth-orbit-satellites-are-the-new-space-race/2020/07/10/51ef1ff8-c2bb-11ea-8908-
68a2b9eae9e0_story.html.
[355] Rede Nacional de Ensino e Pesquisa. URL: https://www.rnp.br/.
[356] Dani Roisman. “Effective BGP Load Balancing Using "The Metric System": A real-world
guide to BGP traffic engineering”. In: Proceedings of NANOG. NANOG 45. 2009.
[357] Erik Romijn. RIPE NCC and Duke University BGP Experiment. Aug. 2010. URL: https:
//labs.ripe.net/Members/erik/ripe-ncc-and-duke-university-bgp-experiment.
[358] Peter Romirer-Maierhofer, Fabio Ricciato, Alessandro D’Alconzo, Robert Franzan, and
Wolfgang Karner. “Network-wide measurements of TCP RTT in 3G”. In: Proceedings of
the International Workshop on Traffic Monitoring and Analysis. TMA ’09. Springer, 2009.
[359] Christian Esteve Rothenberg, Marcelo Ribeiro Nascimento, Marcos Rogerio Salvador,
Carlos Nilton Araujo Corrêa, Sidney Cunha de Lucena, and Robert Raszuk. “Revisiting
Routing Control Platforms with the Eyes and Muscles of Software-Defined Networking”. In:
Proceedings of the ACM Workshop on Hot Topics in Software Defined Networking. HotSDN
’12. ACM, 2012.
[360] Matthew Roughan, Walter Willinger, Olaf Maennel, Debbie Perouli, and Randy Bush. “10
Lessons from 10 Years of Measuring and Modeling the Internet’s Autonomous Systems”.
In: IEEE Journal on Selected Areas in Communications 29.9 (2011), pp. 1810–1821.
385
[361] The University of Oregon Routeviews Project. URL: http://www.routeviews.org.
[362] RouteViews Peering Statius. Dec. 2020. URL: http://www.routeviews.org/peers/peering-
status.html (visited on 12/07/2020).
[363] Khondaker M Salehin and Roberto Rojas-Cessa. “Schemes to Measure Available Bandwidth
and Link Capacity with Ternary Search and Compound Probe for Packet Networks”. In:
Proceedings of IEEE Workshop on Local and Metropolitan Area Networks. LANMAN ’10.
IEEE, 2010.
[364] J. Salim, H. Khosravi, A. Kleen, and A. Kuznetsov. Linux Netlink as an IP Services Protocol.
Internet Requests for Comments, RFC 3549. RFC. July 2003.
[365] Jim Salter. Yesterday’s corporate network design isn’t working for working from home. Oct.
2020. URL: https://arstechnica.com/gadgets/2020/10/future-of-collaboration-01/.
[366] Raja R Sambasivan, David Tran-Lam, Aditya Akella, and Peter Steenkiste. “Bootstrapping
Evolvability for Inter-Domain Routing with D-BGP”. In: Proceedings of the Conference of
the ACM Special Interest Group on Data Communication. SIGCOMM ’17. ACM, 2017.
[367] Mario A. Sanchez, Fabian E. Bustamante, Balachander Krishnamurthy, Walter Willinger,
Georgios Smaragdakis, and Jeffrey Erman. “Inter-Domain Traffic Estimation for the Out-
sider”. In: Proceedings of the ACM Internet Measurement Conference. IMC ’14. ACM,
2014.
[368] Mario A Sánchez, John S Otto, Zachary S Bischof, David R Choffnes, Fabián E Bustamante,
Balachander Krishnamurthy, and Walter Willinger. “Dasu: Pushing Experiments to the
Internet’s Edge.” In: Proceedings of USENIX Symposium on Networked Systems Design and
Implementation. NSDI ’13. USENIX, 2013.
[369] Sandvine. Sandvine Global Internet Phenomena Report 1H2019. 2019.
[370] Sandvine. Sandvine: P2P Traffic Swamps Networks. 2002. URL: https://www.lightreading.
com/sandvine-p2p-traffic-swamps-networks/d/d-id/583590.
[371] Matt Sargent, Jerry Chu, Dr. Vern Paxson, and Mark Allman. Computing TCP’s Retransmis-
sion Timer. RFC 6298. June 2011. DOI: 10.17487/RFC6298. URL: https://rfc-editor.org/rfc/
rfc6298.txt.
[372] Stefan Saroiu, P Krishna Gummadi, and Steven D Gribble. “Sprobe: A Fast Technique
for Measuring Bottleneck Bandwidth in Uncooperative Environments”. In: Proceedings of
386
Annual Joint Conference of the IEEE Computer and Communications Societies. INFOCOM
’02. IEEE, 2002.
[373] Stefan Savage, Andy Collins, Eric Hoffman, John Snell, and Thomas E. Anderson. “The
End-to-End Effects of Internet Path Selection”. In: Proceedings of the Conference of the
ACM Special Interest Group on Data Communication. SIGCOMM ’99. ACM, 1999.
[374] Stefan Savage, David Wetherall, Anna Karlin, and Tom Anderson. “Network Support for IP
Traceback”. In: IEEE/ACM Transactions on Networking (TON) 9.3 (2001), pp. 226–237.
[375] Michael Schapira, Yaping Zhu, and Jennifer Rexford. “Putting BGP on the Right Path:
A Case for Next-hop Routing”. In: Proceedings of the ACM Workshop on Hot Topics in
Networks. HotNets ’10. ACM, 2010.
[376] Brandon Schlinker, Todd Arnold, Italo Cunha, and Ethan Katz-Bassett. “PEERING: Virtual-
izing BGP at the Edge for Research”. In: Proceedings of the International Conference on
Emerging Networking EXperiments and Technologies. CoNEXT ’19. ACM, 2019.
[377] Brandon Schlinker, Ítalo Cunha, Yi-Ching Chiu, Srikanth Sundaresan, and Ethan Katz-
Bassett. “Internet Performance from Facebook’s Edge”. In: Proceedings of the Internet
Measurement Conference. IMC ’19. ACM, 2019.
[378] Brandon Schlinker, Hyojeong Kim, Timothy Cui, Ethan Katz-Bassett, Harsha V Madhyastha,
Italo Cunha, James Quinn, Saif Hasan, Petr Lapukhov, and Hongyi Zeng. “Engineering
Egress with Edge Fabric”. In: Proceedings of the Conference of the ACM Special Interest
Group on Data Communication. SIGCOMM ’17. ACM, 2017.
[379] Brandon Schlinker, Radhika Niranjan Mysore, Sean Smith, Jeffrey C. Mogul, Amin Vahdat,
Minlan Yu, Ethan Katz-Bassett, and Michael Rubin. “Condor: Better Topologies through
Declarative Design”. In: Proceedings of the Conference of the ACM Special Interest Group
on Data Communication. SIGCOMM ’15. ACM, 2015.
[380] J. Scudder, R. Fernando, and S. Stuart. BGP Monitoring Protocol (BMP). Internet Requests
for Comments, RFC 7854. RFC. June 2016.
[381] Pavlos Sermpezis, Vasileios Kotronis, Petros Gigis, Xenofontas Dimitropoulos, Danilo
Cicalese, Alistair King, and Alberto Dainotti. “ARTEMIS: Neutralizing BGP Hijacking
within a Minute”. In: IEEE/ACM Transactions on Networking (TON) 26.6 (2018), pp. 2471–
2486.
387
[382] Yuval Shavitt and Eran Shir. “DIMES: Let the Internet measure itself”. In: ACM SIGCOMM
Computer Communication Review 35.5 (2005), pp. 71–74.
[383] Yuval Shavitt and Udi Weinsberg. “Topological Trends of Internet Content Providers”. In:
Proceedings of the Workshop on Simplifying Complex Networks for Practitioners. SIMPLEX
’12.
[384] Yuval Shavitt and Noa Zilberman. “A Geolocation Databases Study”. In: IEEE Journal on
Selected Areas in Communications 29.10 (2011), pp. 2044–2056.
[385] Scott Shenker. Software-Defined Networking (SDN), CS168 Lecture Notes, Lecture 23. 2014.
URL: https://inst.eecs.berkeley.edu/~cs168/fa14/lectures/lec23-public.pdf.
[386] Rob Sherwood, Glen Gibb, Kok-Kiong Yap, Guido Appenzeller, Martin Casado, Nick
McKeown, and Guru Parulkar. “Flowvisor: A network virtualization layer”. In: OpenFlow
Switch Consortium, Tech. Rep 1 (2009), p. 132.
[387] Patrick Shuff. “Building A Billion User Load Balancer”. In: USENIX SREcon. 2015.
[388] Muhammad Shuaib Siddiqui, D Montero, Marcelo Yannuzzi, René Serral-Gracià, and
Xavier Masip-Bruin. “Route leak identification: A step toward making Inter-Domain routing
more reliable”. In: Proceedings of International Conference on the Design of Reliable
Communication Networks. DRCN ’14. IEEE, 2014.
[389] Rachee Singh, Sharad Agarwal, Matt Calder, and Paramvir Bahl. “Cost-Effective Cloud
Edge Traffic Engineering with Cascara”. In: Proceedings of USENIX Symposium on Net-
worked Systems Design and Implementation. NSDI ’21. USENIX, 2021.
[390] Ramesh K Sitaraman, Mangesh Kasbekar, Woody Lichtenstein, and Manish Jain. “Overlay
Networks: An Akamai Perspective”. In: Advanced Content Delivery, Streaming, and Cloud
Services 51.4 (2014), pp. 305–328.
[391] Ben Treynor Sloss. Expanding our global infrastructure with new regions and subsea cables.
URL: https://blog.google/topics/google-cloud/expanding-our-global-infrastructure-new-
regions-and-subsea-cables/.
[392] Jared M Smith, Kyle Birkeland, and Max Schuchard. “Withdrawing the BGP Re-Routing
Curtain: Understanding the Security Impact of BGP Poisoning via Real-World Measure-
ments”. In: Proceedings of Network and Distributed System Security Symposium. NDSS
’20. Internet Society, 2020.
388
[393] Alex C Snoeren, Craig Partridge, Luis A Sanchez, Christine E Jones, Fabrice Tchakountio,
Beverly Schwartz, Stephen T Kent, and W Timothy Strayer. “Single-packet IP traceback”.
In: IEEE/ACM Transactions on Networking (TON) 10.6 (2002), pp. 721–734.
[394] SoftLayer’s network gets more awesometastic. Dec. 2008. URL: https://www.networkworld.
com/article/2233533/softlayer-s-network-gets-more-awesometastic-.html.
[395] someone is using my AS number. June 2019. URL: https://mailman.nanog.org/pipermail/
nanog/2019-June/101373.html.
[396] Daniel Sommermann and Alan Frindell. Introducing Proxygen, Facebook’s C++ HTTP
framework. 2014. URL: https://engineering.fb.com/production-engineering/introducing-
proxygen-facebook-s-c-http-framework/.
[397] Raffaele Sommese, Leandro Bertholdo, Gautam Akiwate, Mattijs Jonker, Roland van
Rijswijk-Deij, Alberto Dainotti, KC Claffy, and Anna Sperotto. “MAnycast2: Using Anycast
to Measure Anycast”. In: Proceedings of the ACM Internet Measurement Conference. IMC
’20. ACM, 2020.
[398] Speedchecker. URL: https://www.speedchecker.com/.
[399] Neil T. Spring, Ratul Mahajan, and Thomas E. Anderson. “The Causes of Path Inflation”. In:
Proceedings of the Conference of the ACM Special Interest Group on Data Communication.
SIGCOMM ’03. ACM, 2003.
[400] K. Sriram, D. Montgomery, and J. Haas. Enhanced Feasible-Path Unicast Reverse Path
Forwarding. Internet Requests for Comments, RFC 8704. RFC. Feb. 2020.
[401] Kotikalapudi Sriram, Alexander Azimov, Brian Dickson, Doug Montgomery, Keyur Patel,
Andrei Robachevsky, Eugene Bogomazov, and Randy Bush. Design Discussion of Route
Leaks Solution Methods. Internet-Draft draft-sriram-idr-route-leak-solution-discussion-03.
IETF Secretariat, Mar. 2020. URL: https://tools.ietf.org/html/draft-sriram-idr-route-leak-
solution-discussion-03.
[402] Kotikalapudi Sriram, Alexander Azimov, Brian Dickson, Doug Montgomery, Keyur Patel,
Andrei Robachevsky, Eugene Bogomazov, and Randy Bush. ICMP Traceback Messages.
Internet-Draft draft-ietf-itrace-04. IETF Secretariat, Feb. 2003. URL: https://tools.ietf.org/
html/draft-ietf-itrace-04.
389
[403] Kotikalapudi Sriram, Doug Montgomery, and Brian Dickson. Methods for Detection and Mit-
igation of BGP Route Leaks. Internet-Draft draft-sriram-idr-route-leak-detection-mitigation-
01. IETF Secretariat, July 2015. URL: https://tools.ietf.org/html/draft-sriram-idr-route-leak-
detection-mitigation-01.
[404] Richard A Steenbergen. “A practical guide to (correctly) troubleshooting with traceroute”.
In: North American Network Operators Group (2009), pp. 1–49.
[405] Jacob Strauss, Dina Katabi, and Frans Kaashoek. “A Measurement Study of Available Band-
width Estimation Tools”. In: Proceedings of the ACM Internet Measurement Conference.
IMC ’03. ACM, 2003.
[406] Florian Streibelt, Franziska Lichtblau, Robert Beverly, Anja Feldmann, Cristel Pelsser,
Georgios Smaragdakis, and Randy Bush. “BGP Communities: Even more Worms in the
Routing Can”. In: Proceedings of the ACM Internet Measurement Conference. IMC ’18.
ACM, 2018.
[407] Tom Strickx. How Verizon and a BGP Optimizer Knocked Large Parts of the Internet Offline
Today. June 2019. URL: https://blog.cloudflare.com/how-verizon-and-a-bgp-optimizer-
knocked-large-parts-of-the-internet-offline-today/.
[408] Tom Strickx. How Verizon and a BGP Optimizer Knocked Large Parts of the Internet Offline
Today. The Cloudflare Blog. June 2019. URL: https://blog.cloudflare.com/how-verizon-and-
a-bgp-optimizer-knocked-large-parts-of-the-internet-offline-today/.
[409] Stephen D. Strowes. “Passively Measuring TCP Round-trip Times”. In: Communications of
the ACM 56.10 (2013), pp. 57–64.
[410] Lakshminarayanan Subramanian, Matthew Caesar, Cheng Tien Ee, Mark Handley, Morley
Mao, Scott Shenker, and Ion Stoica. “HLP: A Next Generation Inter-domain Routing
Protocol”. In: Proceedings of the Conference of the ACM Special Interest Group on Data
Communication. SIGCOMM ’05. ACM, 2005.
[411] Peng Sun, Laurent Vanbever, and Jennifer Rexford. “Scalable Programmable Inbound Traffic
Engineering”. In: Proceedings of the ACM SIGCOMM Symposium on Software Defined
Networking Research. SOSR ’15. ACM, 2015.
[412] Yixin Sun, Anne Edmundson, Nick Feamster, Mung Chiang, and Prateek Mittal. “Counter-
RAPTOR: Safeguarding Tor Against Active Routing Attacks”. In: Proceedings of IEEE
Symposium on Security and Privacy. S&P ’17. 2017.
390
[413] Yixin Sun, Anne Edmundson, Laurent Vanbever, Oscar Li, Jennifer Rexford, Mung Chiang,
and Prateek Mittal. “RAPTOR: Routing Attacks on Privacy in Tor”. In: Proceedings of
USENIX Security Symposium. USENIX Security ’15. USENIX, 2015.
[414] Srikanth Sundaresan, Mark Allman, Amogh Dhamdhere, and Kc Claffy. “TCP Congestion
Signatures”. In: Proceedings of the ACM Internet Measurement Conference. IMC ’17. ACM,
2017.
[415] Srikanth Sundaresan, Sam Burnett, Nick Feamster, and Walter De Donato. “BISmark: A
Testbed for Deploying Measurements and Applications in Broadband Access Networks”. In:
Proceedings of USENIX Annual Technical Conference. ATC ’14. USENIX, 2014.
[416] Srikanth Sundaresan, Walter De Donato, Nick Feamster, Renata Teixeira, Sam Crawford,
and Antonio Pescapè. “Broadband Internet Performance: A View From the Gateway”. In:
ACM SIGCOMM Computer Communication Review 41.4 (2011), pp. 134–145.
[417] Srikanth Sundaresan, Xiaohong Deng, Yun Feng, Danny Lee, and Amogh Dhamdhere.
“Challenges in Inferring Internet Congestion Using Throughput Measurements”. In: Pro-
ceedings of the ACM Internet Measurement Conference. IMC ’17. ACM, 2017.
[418] Srikanth Sundaresan, Nick Feamster, and Renata Teixeira. “Home Network or Access
Link? Locating Last-Mile Downstream Throughput Bottlenecks”. In: Proceedings of the
International Conference on Passive and Active Network Measurement. PAM ’16. Springer,
2016.
[419] Srikanth Sundaresan, Nick Feamster, and Renata Teixeira. “Measuring the Performance of
User Traffic in Home Wireless Networks”. In: Proceedings of the International Conference
on Passive and Active Network Measurement. PAM ’15. Springer, 2015.
[420] Srikanth Sundaresan, Nick Feamster, Renata Teixeira, and Nazanin Magharei. “Measur-
ing and Mitigating Web Performance Bottlenecks in Broadband Access Networks”. In:
Proceedings of the ACM Internet Measurement Conference. IMC ’13. ACM, 2013.
[421] Yu-Wei Eric Sung, Xiaozheng Tie, Starsky HY Wong, and Hongyi Zeng. “Robotron: Top-
down Network Management at Facebook Scale”. In: Proceedings of the Conference of the
ACM Special Interest Group on Data Communication. SIGCOMM ’16. ACM, 2016.
[422] tcp-goodput. https://github.com/bschlinker/tcp-goodput.
391
[423] Renata Teixeira, Timothy G Griffin, Mauricio GC Resende, and Jennifer Rexford. “TIE
breaking: Tunable interdomain egress selection”. In: IEEE/ACM Transactions on Networking
(TON) 15.4 (2007), pp. 761–774.
[424] Renata Teixeira, Keith Marzullo, Stefan Savage, and Geoffrey M V oelker. “In Search of Path
Diversity in ISP Networks”. In: Proceedings of the ACM Internet Measurement Conference.
IMC ’03. ACM, 2003.
[425] Renata Teixeira, Aman Shaikh, Tim Griffin, and Jennifer Rexford. “Dynamics of Hot-
Potato Routing in IP Networks”. In: Proceedings of the joint international conference
on Measurement and modeling of computer systems. SIGMETRICS ’04/Performance ’04.
ACM, 2004.
[426] Teridion.com. URL: https://www.teridion.com/.
[427] The BIRD Internet Routing Daemon. URL: http://bird.network.cz/.
[428] The CAIDA AS Relationships Dataset. cited February 2015. URL: http://www.caida.org/
data/as-relationships/.
[429] The CAIDA UCSD Anonymized Internet Traces. 2014.
[430] The Linux Foundation. Data Plane Development Kit. 2018. URL: https://www.dpdk.org.
[431] The Linux Kernel Documentation. AF_XDP Overview. 2018. URL: https://www.kernel.org/
doc/html/latest/networking/af_xdp.html.
[432] The M-Lab NDT Data Set. URL: https://measurementlab.net/tools/ndt.
[433] The Mitre Corporation. CVE-2019-5892. 2019. URL: http://cve.mitre.org/cgi-bin/cvename.
cgi?name=CVE-2019-5892.
[434] Ludovic Thomas, Emmanuel Dubois, Nicolas Kuhn, and Emmanuel Lochin. “Google
QUIC performance over a public SATCOM access”. In: International Journal of Satellite
Communications and Networking (2019).
[435] Time to Interactive. URL: https://web.dev/interactive/.
[436] Andree Toonk. Massive route leak causes Internet slowdown. BGPMon. 2015. URL: https:
//www.bgpmon.net/massive-route-leak-cause-internet-slowdown.
392
[437] Ruben Torres, Alessandro Finamore, Jin Ryong Kim, Marco Mellia, Maurizio M Munafo,
and Sanjay Rao. “Dissecting Video Server Selection Strategies in the YouTube CDN”. In:
Proceedings of the 31st International Conference on Distributed Computing Systems. IEEE,
2011.
[438] Linus Torvalds. Linux Kernel 4.1—Release Notes. 2015. URL: https://kernelnewbies.org/
Linux_4.1.
[439] Muoi Tran, Akshaye Shenoi, and Min Suk Kang. “On the Routing-Aware Peering against
Network-Eclipse Attacks in Bitcoin”. In: Proceedings of USENIX Security Symposium.
USENIX Security ’21. USENIX, 2021.
[440] UCSD Network Telescope. 2010. URL: http://www.caida.org/data/passive/network_
telescope.xml.
[441] Understanding prefetching and how Facebook uses prefetching. 2021. URL: https://www.
facebook.com/business/help/1514372351922333 (visited on 05/29/2021).
[442] USC CDN Coverage. http://usc-nsl.github.io/cdn-coverage.
[443] Francesco Vacirca, Fabio Ricciato, and René Pilz. “Large-scale RTT measurements from
an operational UMTS/GPRS network”. In: Proceedings of the International Conference on
Wireless Internet. WICON ’05. IEEE, 2005.
[444] Vytautas Valancius, Nick Feamster, Jennifer Rexford, and Akihiro Nakao. “Wide-Area
Route Control for Distributed Services”. In: Proceedings of USENIX Annual Technical
Conference. ATC ’10. USENIX, 2010.
[445] Vytautas Valancius, Cristian Lumezanu, Nick Feamster, Ramesh Johari, and Vijay V Vazi-
rani. “How many tiers?: pricing in the internet transit market”. In: ACM SIGCOMM Com-
puter Communication Review 41.4 (2011), pp. 194–205.
[446] Vytautas Valancius, Bharath Ravi, Nick Feamster, and Alex C. Snoeren. “Quantifying the
Benefits of Joint Content and Network Routing”. In: Proceedings of the ACM SIGMETRICS
international conference on Measurement and modeling of computer systems. SIGMETRICS
’13. ACM, 2013.
[447] J. Van der Merwe, A. Cepleanu, K. D’Souza, B. Freeman, A. Greenberg, D. Knight, R.
McMillan, D. Moloney, J. Mulligan, H. Nguyen, M. Nguyen, A. Ramarajan, S. Saad,
M. Satterlee, T. Spencer, D. Toll, and S. Zelingher. “Dynamic Connectivity Management
393
with an Intelligent Route Service Control Point”. In: Proceedings of the 2006 SIGCOMM
workshop on Internet Network Management. ACM, 2006.
[448] Kannan Varadhan, Ramesh Govindan, and Deborah Estrin. “Persistent Route Oscillations in
Inter-domain Routing”. In: Computer Networks 32.1 (2000), pp. 1–16.
[449] Patrick Verkaik, Dan Pei, Tom Scholl, Aman Shaikh, Alex C Snoeren, and Jacobus E
Van Der Merwe. “Wresting Control from BGP: Scalable Fine-Grained Route Control”. In:
Proceedings of USENIX Annual Technical Conference. ATC ’07. USENIX, 2007.
[450] Wouter B. de Vries, Ricardo de O. Schmidt, Wes Hardaker, John Heidemann, Pieter-Tjerk
de Boer, and Aiko Pras. “Broad and Load-aware Anycast Mapping with Verfploeter”. In:
Proceedings of the ACM Internet Measurement Conference. IMC ’17. ACM, 2017.
[451] W3Techs. Usage of web hosting providers broken down by ranking. Aug. 2020. URL:
https://w3techs.com/technologies/cross/web_hosting/ranking.
[452] Matthias Wählisch, Robert Schmidt, Thomas C Schmidt, Olaf Maennel, Steve Uhlig, and
Gareth Tyson. “RiPKI: The Tragic Story of RPKI Deployment in the Web Ecosystem”. In:
Proceedings of the ACM Workshop on Hot Topics in Networks. HotNets ’15. ACM, 2015.
[453] D. Walton, A. Retana, E. Chen, and J. Scudder. Advertisement of Multiple Paths in BGP.
Internet Requests for Comments, RFC 7911. RFC. July 2016.
[454] Feng Wang, Zhuoqing Morley Mao, Jia Wang, Lixin Gao, and Randy Bush. “A Measurement
Study on the Impact of Routing Events on End-to-End Internet Path Performance”. In:
Proceedings of the Conference of the ACM Special Interest Group on Data Communication.
SIGCOMM ’06. ACM, 2006.
[455] Lan Wei and John Heidemann. Does Anycast Hang up on You? (extended). Tech. rep. ISI-
TR-716. johnh: pafile: USC/Information Sciences Institute, Feb. 2017. URL: http://www.isi.
edu/%7ejohnh/PAPERS/Wei17a.html.
[456] Brian White, Jay Lepreau, Leigh Stoller, Robert Ricci, Shashi Guruprasad, Mac Newbold,
Mike Hibler, Chad Barb, and Abhijeet Joglekar. “An Integrated Experimental Environ-
ment for Distributed Systems and Networks”. In: Proceedings of USENIX Symposium on
Operating Systems Design and Implementation. OSDI ’02. USENIX, 2002.
[457] Grey White. Latency in DOCSIS Networks. CableLabs. Sept. 2013.
394
[458] Rick Whitner, Tarun Banka, Abhijit A. Bare, Nischal M. Piratla, and Professor Anura P.
Jayasumana. Improved Packet Reordering Metrics. RFC 5236. June 2008. DOI: 10.17487/
RFC5236. URL: https://rfc-editor.org/rfc/rfc5236.txt.
[459] Robin Whittle. [rrg] Geoff Huston’s BGP/DFZ research. URL: https://www.ietf.org/mail-
archive/web/rrg/current/msg06163.html.
[460] D. Wing and A. Yourtchenko. Happy Eyeballs: Success with Dual-Stack Hosts. Internet
Requests for Comments, RFC 6555. RFC. Apr. 2012.
[461] Florian Wohlfart, Nikolaos Chatzis, Caglar Dabanoglu, Georg Carle, and Walter Willinger.
“Leveraging Interconnections for Performance: The Serving Infrastructure of a Large CDN”.
In: Proceedings of the Conference of the ACM Special Interest Group on Data Communica-
tion. SIGCOMM ’18. Budapest, Hungary: ACM, 2018.
[462] Edward Wyatt and Noam Cohen. “Comcast and Netflix Reach Deal on Service”. In: The
New York Times (Feb. 2014). URL: https://www.nytimes.com/2014/02/24/business/media/
comcast-and-netflix-reach-a-streaming-agreement.html.
[463] XSEDE. URL: https://www.xsede.org/.
[464] Wen Xu and Jennifer Rexford. “MIRO: Multi-path Interdomain ROuting”. In: Proceedings
of the Conference of the ACM Special Interest Group on Data Communication. SIGCOMM
’06. ACM, 2006.
[465] Xing Xu, Yurong Jiang, Tobias Flach, Ethan Katz-Bassett, David Choffnes, and Ramesh
Govindan. “Investigating transparent web proxies in cellular networks”. In: Proceedings
of the International Conference on Passive and Active Network Measurement. PAM ’15.
Springer, 2015.
[466] Y . Rekhter and T. Li and S. Hares. A Border Gateway Protocol 4 (BGP-4). Internet Requests
for Comments, RFC 4271. RFC. Jan. 2006.
[467] He Yan, Ricardo Oliveira, Kevin Burnett, Dave Matthews, Lixia Zhang, and Dan Massey.
“BGPmon: A real-time, scalable, extensible monitoring system”. In: Conference For Home-
land Security, 2009. CATCH’09. Cybersecurity Applications & Technology. IEEE. 2009,
pp. 212–223.
[468] Xiaowei Yang, David Clark, and Arthur W Berger. “NIRA: a new inter-domain routing
architecture”. In: IEEE/ACM Transactions on Networking (TON) 15.4 (2007), pp. 775–788.
395
[469] Kok-Kiong Yap, Murtaza Motiwala, Jeremy Rahe, Steve Padgett, Matthew Holliman, Gary
Baldus, Marcus Hines, Taeeun Kim, Ashok Narayanan, Ankur Jain, Victor Lin, Colin
Rice, Brian Rogan, Arjun Singh, Bert Tanaka, Manish Verma, Puneet Sood, Mukarram
Tariq, Matt Tierney, Dzevad Trumic, Vytautas Valancius, Calvin Ying, Mahesh Kallahalla,
Bikash Koley, and Amin Vahdat. “Taking the Edge off with Espresso: Scale, Reliability
and Programmability for Global Internet Peering”. In: Proceedings of the Conference of the
ACM Special Interest Group on Data Communication. SIGCOMM ’17. ACM, 2017.
[470] YouTube Hijacking: A RIPE NCC RIS case study. URL: https://www.ripe.net/publications/
news/industry-developments/youtube-hijacking-a-ripe-ncc-ris-case-study.
[471] Yasir Zaki, Jay Chen, Thomas Pötsch, Talal Ahmad, and Lakshminarayanan Subramanian.
“Dissecting Web Latency in Ghana”. In: Proceedings of the ACM Internet Measurement
Conference. IMC ’14. ACM, 2014.
[472] Kyriakos Zarifis, Tobias Flach, Srikanth Nori, David R. Choffnes, Ramesh Govindan, Ethan
Katz-Bassett, Zhuoqing Morley Mao, and Matt Welsh. “Diagnosing Path Inflation of Mobile
Client Traffic”. In: Proceedings of the International Conference on Passive and Active
Network Measurement. PAM ’14. Springer, 2014.
[473] Ming Zhang, Chi Zhang, Vivek Pai, Larry Peterson, and Randy Wang. “PlanetSeer: Internet
Path Failure Monitoring and Characterization in Wide-Area Services”. In: Proceedings of
USENIX Symposium on Operating Systems Design and Implementation. OSDI ’04. USENIX,
Dec. 2004.
[474] Zheng Zhang, Ming Zhang, Albert G Greenberg, Y Charlie Hu, Ratul Mahajan, and Blaine
Christian. “Optimizing Cost and Performance in Online Service Provider Networks”. In:
Proceedings of USENIX Symposium on Networked Systems Design and Implementation.
NSDI ’10. USENIX, 2010.
[475] Zheng Zhang, Ying Zhang, Y Charlie Hu, Z Morley Mao, and Randy Bush. “iSPY: detecting
IP prefix hijacking on my own”. In: Proceedings of the Conference of the ACM Special
Interest Group on Data Communication. SIGCOMM ’08. ACM, 2008.
[476] Changxi Zheng, Lusheng Ji, Dan Pei, Jia Wang, and Paul Francis. “A Light-Weight Dis-
tributed Scheme for Detecting IP Prefix Hijacks in Realtime”. In: Proceedings of the
Conference of the ACM Special Interest Group on Data Communication. SIGCOMM ’07.
ACM, 2007.
[477] Junlan Zhou, Malveeka Tewari, Min Zhu, Abdul Kabbani, Leon Poutievski, Arjun Singh, and
Amin Vahdat. “WCMP: Weighted Cost Multipathing for Improved Fairness in Data Centers”.
396
In: Proceedings of the Ninth European Conference on Computer Systems. EUROSYS ’14.
2014.
[478] Zou ZiXuan, Lee Bu Sung, Fu Cheng Peng, and Song Jie. “Packet triplet: an enhanced
packet pair probing for path capacity estimation”. In: Proceedings of Network Research
Workshop. 2003.
397
Abstract
Today, over 80% of all Internet traffic is sourced from a small set of Content Distribution Networks (CDNs). These CDNs have built globally distributed points of presence to achieve locality and to facilitate regional interconnection, both of which are key to satisfying the increasingly stringent network requirements of streaming video services and interactive applications. Content providers rely heavily on CDNs, and many of the largest have built their own private CDNs.

Prior work has shed light on the rise of CDNs from multiple vantage points. However, we still know little about how CDNs manage their connectivity and make routing decisions. Likewise, a number of longstanding Internet routing problems centered around performance, availability, and security can be attributed to fundamental issues in the design of the Border Gateway Protocol (BGP), the protocol used to stitch together and route traffic across networks on the Internet. What implications will the rise of CDNs have for such problems?

This dissertation sheds light on these unknowns by examining how CDN providers interconnect and route traffic in today’s Internet, along with the opportunities and challenges that arise in this environment. First, we execute a measurement study to uncover the connectivity of CDNs and capture how traffic flows between CDNs and end-users on today’s Internet. We find that much of the traffic on today’s Internet no longer traverses transit providers, a special set of networks that interconnect all other networks on the Internet. This structural transformation has been referred to as the flattening of the Internet’s hierarchy: while end-user ISPs and content networks historically passed traffic (and dollars) upwards to transit providers, this hierarchy has collapsed as interconnections have been established directly between these networks. We explore how this flattening may enable deployable solutions to longstanding Internet problems for the bulk of today’s Internet traffic.

Next, we characterize the connectivity and routing policies of Facebook, a popular content provider that operates its own CDN, and examine the opportunities (performance-aware routing, fault tolerance) and challenges (capacity constraints) that arise on the flattened Internet. We explore the design of Edge Fabric, a software-defined egress routing controller that we built and deployed in Facebook’s production network and that enables efficient use of Facebook’s peering interconnections while preventing congestion at Facebook’s edge, and we develop and employ novel measurement techniques to characterize performance for traffic between Facebook’s CDN and end-users.

Finally, we discuss how we democratized Internet routing research by building PEERING, a community platform that enables experiments to interact with the Internet routing ecosystem. PEERING has enabled 40 experiments and 24 publications, unblocking impactful experiments that researchers have historically struggled to execute in areas such as security and traffic engineering.

Through this work, we demonstrate that it is possible to solve longstanding Internet routing problems and ultimately improve user experience by combining the rich interconnectivity of CDNs in today’s flattened Internet with mechanisms that enable routers to delegate routing decisions to more flexible decision processes.