Anycast Stability, Security and Latency in the Domain Name System (DNS) and Content Delivery
Networks (CDNs)
by
Lan Wei
A Dissertation Presented to the
FACULTY OF THE USC GRADUATE SCHOOL
UNIVERSITY OF SOUTHERN CALIFORNIA
In Partial Fulfillment of the
Requirements for the Degree
DOCTOR OF PHILOSOPHY
(COMPUTER SCIENCE)
December 2020
Copyright 2021 Lan Wei
Dedication
My love is dedicated to the whole cosmos.
As a terran, I dedicate my love to Earth.
As a human, I dedicate my love to my hubby, my parents, my grandparents, and my friends.
As an animal lover, I dedicate my love to pandas, pigs, dinosaurs, orcas, and platypuses.
Acknowledgements
At ISI,
Melissa Snearl-Smith (the first colleague I met at ISI, who introduced me to all the others)
Alba Regalado (whom I ran to after experiencing a robbery, and who encouraged me to report it to the police)
Joseph Kemp (who would put my poster in a poster holder, while I noticed a lot of students outside my
group were carrying their posters without a holder)
Jeanine Yamazaki (who told me I could always count on her to fix the printer, though it turned out she could not always fix it either)
At USC,
Lizsl De Leon (who helped me twice with after-deadline class registration)
Andy Chen (who helped me a lot upon graduation)
At Edgecast CDN (Verizon Media),
Marcel Flores, Harkeerat Bedi, Evita Bakopoulou, Marc Warrior, Anant Shah, Paulo Tioseco, and a transition manager whose hair is red.
At DNS B-root,
Wes Hardaker, Robert Story, John Heidemann. Great seniors to work with and learn from.
In my lab/team,
Yuri Pradkin (who can almost manage and fix every technical thing in the lab)
Liang Zhu (who always encourages me to graduate sooner)
Calvin Ardi (who cheers me with his passion for research and meticulous documenting)
Abdul Qadeer (who helps proofread my papers multiple times)
Hang Guo (who has a wife who chats with me a lot)
A S M Rizvi (who counts on me),
Guillermo Baltra (who makes me realize there are actually people getting up at 6am every day)
Aqib Nisar, Asma Enayet, Song Xiao, Basileal Imana
Especially, to the first two colleagues I work with,
Ricardo Schmidt (who has guided me throughout these five years, and who also serves on my defense committee)
Wouter de Vries (who develops great work)
My guiding committee and my defense committee,
for great questions, so that both my projects and I as a researcher and an engineer can improve.
Guiding: John Heidemann, Ramesh Govindan, Kostas Psounis, Ricardo Schmidt, Ethan Katz-Bassett,
Muhammad Naveed
Defense: John Heidemann, Ramesh Govindan, Kostas Psounis, Ricardo Schmidt
My advisor,
John Heidemann
If I could turn my calendar back five years, I would still choose John Heidemann as my advisor. As an international student, it is great luck to have an advisor who not only has professional opinions in the technical fields but also cares that I can live a happy Ph.D. life.
Friends helped me so much in job hunting.
Thanks 2e10 times for getting me through the special year, 2020:
Matt Calder,
Liang Zhu,
Xue Cai,
John Heidemann,
Ali Khayam.
My friends are like family,
Yuting Jiang (I feel at ease that my panda toy has become your godson. Our differences make me realize how great you are and I am)
Qiujia Wang (I recall so many times that you reached out your strong arm to drag me along in our middle-school PE class. I am still a coward sometimes, so please keep building your muscle, for me and for Qiukui.)
Shiqi Quan (Our stupid emoji are filling up the cache of my instant messaging app every day. My stupidity is your fault. Fart and rise shoulder slope!)
My family is like no other,
Thanks for taking great care of my grandma!
Table of Contents
Dedication ii
Acknowledgements iii
List of Tables ix
List of Figures xi
Abstract xiii
Chapter 1: Introduction 1
1.1 Thesis Statement . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5
1.2 Demonstrating The Thesis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5
1.2.1 First study: Anycast Stability . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6
1.2.2 Second study: Anycast Security . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7
1.2.3 Third study: Anycast Latency . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7
1.3 Contributions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8
Chapter 2: Anycast Stability in UDP and TCP 10
2.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11
2.2 Anycast Routing Instability . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 14
2.3 Methodology . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 14
2.3.1 Sources and Targets . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 15
2.3.2 Queries from RIPE Atlas . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 17
2.3.3 Other Sources: High Precision Queries, TCP and BGP . . . . . . . . . . . . . . . . 18
2.3.4 Detecting Routing Flips . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 19
2.3.5 Identifying Load Balancers with Paris Traceroute . . . . . . . . . . . . . . . . . . . 19
2.4 Evaluation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 20
2.4.1 What Does Anycast Instability Look Like? . . . . . . . . . . . . . . . . . . . . . . 20
2.4.2 Is Anycast Instability Long Lasting, and for How Many? . . . . . . . . . . . . . . . 22
2.4.3 Is Anycast Instability Persistent for a User? . . . . . . . . . . . . . . . . . . . . . . 24
2.4.4 Is Anycast Instability Near the Client? . . . . . . . . . . . . . . . . . . . . . . . . . 26
2.4.5 Higher Precision Probing Shows More Frequent Flipping . . . . . . . . . . . . . . . 29
2.4.6 Does Per-Packet Flipping Occur? . . . . . . . . . . . . . . . . . . . . . . . . . . . 31
2.4.7 Are TCP Connections Harmed by Anycast Flipping? . . . . . . . . . . . . . . . . . 33
2.4.7.1 The relationship between UDP flipping and TCP flipping . . . . . . . . . 33
2.4.7.2 Why Does J-Root See More TCP Timeouts? . . . . . . . . . . . . . . . . 35
2.4.7.3 Can we locate the per-packet balancer? . . . . . . . . . . . . . . . . . . . 36
2.5 Related Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 38
2.6 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 39
Chapter 3: Anycast Security in DNS Spoofing 41
3.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 42
3.2 Threat Model . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 44
3.2.1 Goals of the Spoofer . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 44
3.2.2 Spoofing Mechanisms . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 45
3.3 Methodology . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 46
3.3.1 Targets and Queries . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 46
3.3.2 Finding Spoofed DNS responses . . . . . . . . . . . . . . . . . . . . . . . . . . . . 47
3.3.2.1 Detecting Overt Spoofers By Server ID . . . . . . . . . . . . . . . . . . . 47
3.3.2.2 Detecting Covert Delayers with Latency Difference . . . . . . . . . . . . 48
3.3.3 Identifying Spoofing Mechanisms . . . . . . . . . . . . . . . . . . . . . . . . . . . 49
3.3.4 Spoofing Parties from Server IDs . . . . . . . . . . . . . . . . . . . . . . . . . . . 51
3.4 Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 51
3.4.1 The Root DNS system and Datasets . . . . . . . . . . . . . . . . . . . . . . . . . . 51
3.4.2 Spoofing Is Not Common, But It Is Growing . . . . . . . . . . . . . . . . . . . . . 53
3.4.2.1 Spoofing is uncommon . . . . . . . . . . . . . . . . . . . . . . . . . . . 53
3.4.2.2 Growth . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 54
3.4.3 Where and When Are These Spoofers? . . . . . . . . . . . . . . . . . . . . . . . . 55
3.4.4 Who Are the Spoofing Parties? . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 56
3.4.5 How Do Spoofing Parties Spoof? . . . . . . . . . . . . . . . . . . . . . . . . . . . 57
3.4.6 Does Spoofing Speed Responses? . . . . . . . . . . . . . . . . . . . . . . . . . . . 58
3.5 Validation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 59
3.5.1 Validation Methodology . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 60
3.5.2 Validation of Overt Spoof Detection . . . . . . . . . . . . . . . . . . . . . . . . . . 61
3.5.3 Validation of Covert Delayers . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 62
3.5.4 Non-Anycast Mechanism: Proxy or Injection? . . . . . . . . . . . . . . . . . . . . 63
3.6 Related Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 63
3.7 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 65
Chapter 4: Anycast Latency in A CDN 66
4.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 67
4.2 Problem Statement . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 69
4.2.1 Observations to Find Problems . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 69
4.3 RTT Inequality between Anycast/Unicast . . . . . . . . . . . . . . . . . . . . . . . . . . . 71
4.4 Bidirectional Anycast/Unicast Probing . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 72
4.4.1 BAUP Measurements . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 72
4.4.2 Detecting Improvable Latency . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 73
4.4.3 Locating the Problems . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 74
4.4.3.1 Detecting Slow Hops . . . . . . . . . . . . . . . . . . . . . . . . . . . . 74
4.4.3.2 How RTT Surge Reveals A Slow Hop . . . . . . . . . . . . . . . . . . . 75
4.4.3.3 Avoiding False Slow-Hops . . . . . . . . . . . . . . . . . . . . . . . . . 76
4.4.3.4 Circuitous Path Detection . . . . . . . . . . . . . . . . . . . . . . . . . . 76
4.4.4 From Problems to Solutions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 77
4.5 Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 78
4.5.1 Case Studies: Using BAUP to Identify Problems . . . . . . . . . . . . . . . . . . . 79
4.5.1.1 Intra-AS Slow Hop . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 79
4.5.1.2 Inter-AS Slow Hop . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 80
4.5.1.3 Problem near the CDN hop . . . . . . . . . . . . . . . . . . . . . . . . . 80
4.5.1.4 A Circuitous Path . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 81
4.5.2 How Often Does BAUP Find Latency Differences? . . . . . . . . . . . . . . . . . . 82
4.5.3 Root Causes and Mitigations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 83
4.6 Improving Performance with BAUP . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 84
4.7 BAUP Evaluation Of DNS B-Root . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 86
4.7.1 How Often Does BAUP Find Latency Differences? . . . . . . . . . . . . . . . . . . 86
4.7.2 Some Case Studies: Root Causes and Potential Mitigations . . . . . . . . . . . . . . 88
4.8 Related Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 90
4.9 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 91
Chapter 5: Conclusions 93
5.1 Future Directions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 93
5.2 Conclusions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 95
Bibliography 98
List of Tables
2.1 Details of datasets (UDP CHAOS, TCP CHAOS, traceroute) . . . . . . . . . . . . . . . . . 15
2.2 Observed sites of 13 Root Letters . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 16
2.3 Number of flips per VP, for each Root Letter . . . . . . . . . . . . . . . . . . . . . . . . . . 24
2.4 Overlap of anycast instability for specific VPs in half-weeks. . . . . . . . . . . . . . . . . . 26
3.1 Mechanisms for DNS spoofing. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 46
3.2 Query detail . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 52
3.3 Data Coverage . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 53
3.4 DNS spoof observations. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 53
3.5 Countries with largest fraction of VPs experiencing spoofing in 2019. . . . . . . . . . . . . 56
3.6 Classification of spoofing parties. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 57
3.7 How many queries reach B-root based on spoof detection . . . . . . . . . . . . . . . . . . . 61
3.8 The range of true positive rate of spoof detection . . . . . . . . . . . . . . . . . . . . . . . 61
3.9 Covert delayer validation: how many reached B-Root . . . . . . . . . . . . . . . . . . . . . 62
4.1 Basic information about routing problems . . . . . . . . . . . . . . . . . . . . . . . . . . . 79
4.2 An intra-AS slow hop from a VP to PoP FRA . . . . . . . . . . . . . . . . . . . . . . . . . 79
4.3 An inter-AS slow hop from a VP to PoP FRA . . . . . . . . . . . . . . . . . . . . . . . . . 80
4.4 A near-CDN slow hop from a VP to PoP VIE . . . . . . . . . . . . . . . . . . . . . . . . . 80
4.5 A circuitous path from a VP to PoP FRA . . . . . . . . . . . . . . . . . . . . . . . . . . . . 81
4.6 BAUP results on the CDN . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 82
4.7 VPs affected by AS-H before and after fixing . . . . . . . . . . . . . . . . . . . . . . . . 82
4.8 BAUP results on B-root . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 86
4.9 Two intra-AS slow hops in one path from a VP to B-root site AMS . . . . . . . . . . . . . . 88
4.10 An intra-AS slow hop from a VP to B-root site LAX . . . . . . . . . . . . . . . . . . . . . 88
List of Figures
1.1 A simple illustration of an anycast infrastructure and its users. . . . . . . . . . . . . . . . 2
2.1 Sites accessed by 140 VPs during a time period . . . . . . . . . . . . . . . . . . . . . . . . 20
2.2 Cumulative distribution of mean flip time for each VP . . . . . . . . . . . . . . . . . . . . . 23
2.3 The percentage of anycast unstable VPs for each day in a week. . . . . . . . . . . . . . . . 25
2.4 The CDF of unstable VPs for how many root DNS services . . . . . . . . . . . . . . . . . . 28
2.5 Counting site flips from 100 VPs to D-Root . . . . . . . . . . . . . . . . . . . . . . . . . . 30
2.6 Fraction of time one VP spends at the JFK site of C-Root . . . . . . . . . . . . . . . . . . . 32
2.7 Mean and standard deviation of site hit ratio in sliding windows . . . . . . . . . . . . . . . 32
2.8 The CDF of timeout responses of TCP query (varied base sets) . . . . . . . . . . . . . . . . 34
2.9 The CDF of timeout responses of TCP query (a same set) . . . . . . . . . . . . . . . . . . . 35
3.1 CDF of root counts seen overtly-spoofed . . . . . . . . . . . . . . . . . . . . . . . . . . . . 54
3.2 Fraction of all available VPs . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 54
3.3 Fraction of fixed 3000 VPs over time . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 54
3.4 Fraction of spoofing per country . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 55
3.5 Location of spoofed VPs over time for each year . . . . . . . . . . . . . . . . . . . . . . . 55
3.6 Number of VPs with different spoofing mechanisms over time. . . . . . . . . . . . . . . . 58
3.7 CDF of RTT_ping minus RTT_dns from spoofed VPs on 2019-08-24 . . . . . . . . . . . . . 59
4.1 Four one-way delays in BAUP traceroute . . . . . . . . . . . . . . . . . . . . . . . . . . . 70
4.2 CDF of RTT before and after applying fix . . . . . . . . . . . . . . . . . . . . . . . . . . . 85
4.3 CDF of RTT before minus RTT after applying fix . . . . . . . . . . . . . . . . . . . . . . . 85
Abstract
Clients’ performance is important for both Content Delivery Networks (CDNs) and the Domain Name System (DNS). Operators would like their services to meet the expectations of their users. CDNs that provide stable connections prevent users from experiencing download pauses caused by connection breaks. Users expect DNS traffic to be secure, neither intercepted nor injected. Both CDN and DNS operators care about short network latency, since users can become frustrated by slow replies.
Many CDNs and DNS services (such as the DNS root) use IP anycast to bring content closer to users. Anycast-based services announce the same IP address(es) from globally distributed sites. In an anycast infrastructure, Internet routing protocols naturally direct users to a nearby site. The path between a user and an anycast site is formed on a hop-by-hop basis: at each hop (a network device such as a router), routing protocols like the Border Gateway Protocol (BGP) decide which next hop to take. The ISP at each hop imposes its routing policies to influence BGP’s decisions. Without global knowledge of (and with no ability to modify) the distributed BGP routing tables of every ISP on the path, anycast infrastructure operators cannot predict or control in real time which specific site a user will visit and what the routing path will look like. Also, any change in routing policy along the path may change both the path and the site visited by a user. We refer to such minimal control over routing towards an anycast service as the uncertainty of anycast routing. Using anycast spares operators the extra traffic management needed to map users to sites, but can operators provide a good anycast-based service without precise control over the routing?
This routing uncertainty raises three concerns: routing can change, breaking connections; uncertainty about global routing means spoofing can go undetected; and lack of knowledge of global routing can lead to suboptimal latency. In this thesis, we show how we confirm the stability, how we confirm the security, and how we improve the latency of anycast to answer these three concerns. First, routing changes can cause users to switch sites, and therefore immediately break a stateful connection such as a TCP connection. We study routing stability and demonstrate that connections in anycast infrastructure are rarely broken by routing instability. Of all vantage points (VPs), fewer than 0.15% have TCP connections that frequently break due to 5 s timeouts during the 17 hours we observed, and we observe such frequent TCP connection breaks in only 1 of the 12 anycast services studied. A second problem is DNS spoofing, where a third party can intercept a DNS query and return a false answer. We examine DNS spoofing to study two aspects of security—integrity and privacy—and we design an algorithm to detect spoofing and distinguish the different mechanisms used to spoof anycast-based DNS. We show that DNS spoofing is uncommon, happening to only 1.7% of all VPs, although it is increasing over the years. Among the three ways to spoof DNS—injection, proxies, and third-party anycast sites (prefix hijacking)—we show that third-party anycast sites are the least popular. Last, diagnosing and improving poor latency can be difficult for CDNs. We develop a new approach, BAUP (bidirectional anycast unicast probing), which detects inefficient routing and suggests better routing alternatives. We use BAUP to study anycast latency. By applying BAUP and changing peering policies, a commercial CDN was able to significantly reduce latency, cutting median latency in half from 40 ms to 16 ms for regional users.
Chapter 1
Introduction
Both the Domain Name System (DNS) and Content Delivery Networks (CDNs) are critical components of the Internet. The DNS is the name system for the Internet, receiving queries about every entity on the Internet by that entity’s domain names. The DNS is able to reply with a variety of information, including an entity’s underlying addresses (translating a human-readable URL to a numeric address) [60], its name server record (returning an ID of the DNS server) [21], or an alias of the domain name [60], and so on. CDNs are an important layer in the Internet ecosystem that serves content to users, such as video streaming and software downloads. The majority of web traffic today is served through CDNs, including huge portions of web traffic from companies such as Facebook, Netflix, and YouTube [86]. There are commercial CDNs such as Akamai [3] and Verizon Media [32]; such CDN companies host and deliver content for other business partners. There are also companies that choose not to use commercial CDN services but to build their own CDNs. Facebook [34], a social media company, invests in building its own CDN to improve user experience.
CDNs and DNS operators use globally distributed sites to bring content closer to users. A site refers to the physical location where the edge servers are located in the network infrastructure. Ideally, the infrastructure should map the user to a nearby one of multiple available sites. As illustrated in Figure 1.1,
if an anycast infrastructure has two sites in the United States, one in Los Angeles and one in Miami, a good user-site mapping will map a Los Angeles user to the site in Los Angeles rather than the site in Miami. The end-to-end latency depends on the networking distance, which is equal to or larger than the physical straight-line distance between two points. It takes more time to deliver data from the Miami site to a Los Angeles user, since Miami is farther from the user than the Los Angeles site [89].
Figure 1.1: A simple illustration of an anycast infrastructure and its users.
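To make the distance argument concrete, the following back-of-the-envelope sketch (Python) computes a lower bound on round-trip time from propagation delay alone; the rough 3,760 km Los Angeles-Miami distance and the two-thirds-of-c fiber propagation speed are illustrative assumptions, not figures from this dissertation.

# Lower bound on RTT from propagation delay alone.  Assumptions (illustrative,
# not from this dissertation): a straight fiber run, an LA-Miami distance of
# about 3,760 km, and signals propagating at roughly 2/3 of the speed of light.
C_VACUUM_KM_PER_S = 299_792
FIBER_KM_PER_S = C_VACUUM_KM_PER_S * 2 / 3

def min_rtt_ms(distance_km: float) -> float:
    """Minimum possible round-trip time in milliseconds over distance_km of fiber."""
    return 2 * distance_km / FIBER_KM_PER_S * 1000

print(f"LA user to a nearby LA site (~50 km):  >= {min_rtt_ms(50):.1f} ms")
print(f"LA user to a Miami site (~3760 km):    >= {min_rtt_ms(3760):.1f} ms")

Real paths add queueing, routing detours, and last-mile delay on top of this bound, but the gap between the two numbers is already why the Los Angeles site is the better catchment for a Los Angeles user.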
CDNs use two major mechanisms to direct users’ traffic to a nearby site—DNS and anycast. DNS-based redirection uses DNS answers to map users to sites; it was pioneered by Akamai [65]. Akamai has deployed a globally distributed system of highly-available authoritative DNS servers. These DNS servers provide customized DNS answers containing the address of the closest CDN site, or the best site according to Akamai’s own criteria. Users’ local DNS servers contact Akamai’s DNS servers for the address of a site, and Akamai replies with the address of a site chosen based on the location of the users’ local DNS server. To provide proper answers for worldwide users, such a distributed DNS system requires large global coverage and considerable investment.
The other major redirection method, anycast, relies on BGP to map users to sites. Some newer CDNs like Verizon Media, Cloudflare, and Azure [?, 16, 20, 32] rely on anycast (Azure and Verizon Media use both DNS and anycast to map users to sites), announcing the same IP address(es) from multiple locations. In an anycast infrastructure, Internet routing associates users with a nearby site via the Border Gateway Protocol (BGP). Anycast offers only minimal control over the mapping between the users and sites. However, it is easy and cheap to deploy an anycast-based CDN—it requires no infrastructure investment in DNS servers, only the multiple distributed sites.
In anycast routing, both users and infrastructure operators are uncertain about the site selection and the routing path until the users’ traffic arrives at the destination site. When a user queries the service, the query will go to one of the multiple available sites, and multiple users may be directed to the same site. We call such a mapping between a group of users and a single site a catchment. Site selection is controlled by neither the users nor the service providers; it depends on the routing policies of every ISP on the path. Each ISP on the path is able to impose its own policy decisions (there can be multiple ISPs on the path). Business relationships (customer-provider, peer-peer, backup) and traffic engineering between ISPs and their neighbors may influence the BGP decision process of next-hop selection [14]. An ISP can make a policy change at any time, and such a change may alter routing paths. A routing path change may lead to a catchment change, that is, a new routing path may direct a user to a new anycast site that is different from the last site visited.
There are concerns about anycast since site selection is uncertain to users and operators. Service operators and Internet users expect services to provide stable connections, but routing uncertainty leads to doubts about whether a change in site selection will break a current connection. In a TCP connection, if a user initiates the connection to one site but later switches to another site, the new site will send a TCP RST to the user, causing the connection to break. Most DNS today is sent over UDP, but zone transfers use TCP, and recent work has suggested widespread use of TCP and TLS for DNS privacy [46, 100]. For DNS, such a connection break results in a much larger response time. For a CDN or video streaming, it might result in playback stalls and “buffering” messages. Playback stalls and larger response times frustrate users [25]. Both the DNS and CDNs would like to avoid such poor user experiences.
Internet users expect DNS replies to come from the authoritative servers, but DNS can be spoofed by a third party intercepting queries or injecting answers. DNS spoofing creates security problems. In our study, we look at two aspects of security—the integrity of the answer could be compromised, and a third party may eavesdrop on traffic, compromising privacy. DNSSEC can protect against some aspects of spoofing by ensuring the integrity of DNS responses [31]. Unfortunately, DNSSEC deployment is far from complete, with the names of many organizations (including Google, Facebook, Amazon, and Wikipedia) still unprotected [68]. Increasing use of DNSSEC [31], challenges in DNSSEC deployment [18], and increased study of DNS privacy encryption [55] reflect interest in DNS integrity. Compromised integrity of a DNS answer may cause property loss to users; for example, a user queries an online bank but is diverted to a fake website hosted on a third party’s computer. Even if the integrity of the DNS answers is intact, there are still concerns about privacy breaches; for example, eavesdropping on a specific user’s DNS queries helps an attacker learn about the user’s private life—most frequently visited websites, working hours, and so on. To our knowledge, there has been little public analysis of general spoofing of DNS over time.
Both CDN providers and DNS providers care about optimizing the latency between servers and users [47, 50], but the routing path in an anycast infrastructure is not guaranteed to have the lowest latency. While multiple large CDNs directly serve 1k to 2k ASes [3, 15, 17], and there are more than 1000 root DNS instances [83], with more than 67k ASes on the Internet [5], the majority are served indirectly through other ASes. In addition, some CDNs and DNS providers operate fewer, larger PoPs, or cannot deploy in some ASes or countries due to financial or legal constraints, so optimizing performance across multiple ASes is essential.
In this thesis, we show how we understand anycast routing uncertainty in order to reliably and efficiently use anycast, by performing measurements and analysis with data collected from both the root DNS and a commercial CDN. We confirm that anycast infrastructure is able to offer stable (subsection 1.2.1) and secure (subsection 1.2.2) connections, and that service operators are able to improve its latency (subsection 1.2.3).
1.1 Thesis Statement
The thesis of this dissertation is that, by understanding anycast routing uncertainty, we confirm the stability and the security of anycast and improve its latency.
We define stability as the property that, when a user contacts an anycast address, the responding site is usually the same one over time. Such stability ensures that an anycast infrastructure can provide service without connections being disrupted by routing changes. The confirmation of stability supports our subsequent study of security: since we show that catchment changes are very rare, we do not need to also consider stability when we evaluate two queries sent to one anycast service (the latency of the two queries should be similar, assuming they reach the same site).
We study two aspects of security: integrity and privacy. Integrity of a DNS answer means the query is answered by the authoritative server, not by a third party. Privacy of a DNS query means the query is not injected, intercepted, or eavesdropped on. Given anycast stability, we are able to analyze historical datasets to confirm that DNS spoofing is uncommon, although increasing over time. We also show that third-party anycast sites are not a popular way to attack an anycast service.
Stability and security of anycast infrastructure show that anycast is a reliable way to deploy services. In addition to reliability, operators care about performance. We define improving the latency as finding a path with lower latency for a given user than the current path in use. The ability to improve latency in an anycast-based CDN helps operators deliver better services. This study of improving the latency shows that anycast infrastructure is not only reliable but also improvable.
1.2 Demonstrating The Thesis
We support the thesis statement by showing that anycast can provide stable connections, is secure from frequent third-party spoofing, and allows operators to provide better services with lower latency. First, we confirm the stability and evaluate security risks by performing new measurements and developing new approaches that characterize the traffic in anycast services. This shows that anycast is a reliable way to deploy a service. Further, we improve an anycast service by reducing its latency through a new methodology we design, BAUP. Our studies show that anycast infrastructure is stable, secure, and improvable for network services such as DNS and CDNs.
1.2.1 First study: Anycast Stability
In our first study (Chapter 2), we answer this question: in anycast services, do connections break often? Based on the work to answer this question, our first study confirms the stability of anycast by understanding anycast routing uncertainty, which supports the first part of the thesis statement. We show that instability rarely happens over UDP and rarely leads to connection breaks over TCP. We look at three aspects of stability. First, by studying queries using UDP, we find that only about 1% of combinations of vantage point and anycast service are anycast unstable, frequently changing routes to different sites of a service. Second, we point out that anycast instability is very specific to paths: almost all VPs with instability see it in only a few services (one to three), not in all 12 services studied. A catchment change only disturbs UDP exchanges, but it will break TCP connections. Third, we show that anycast instability is even rarer for TCP: of the VPs that experience anycast instability, fewer than 0.15% time out in TCP connections.
Given anycast stability, two queries sent to one anycast service are highly likely to reach the same site. Based on this stability result, researchers can observe and compare multiple queries to detect whether or not a third party intercepts or injects replies. We next study root DNS spoofing (the root DNS is deployed with anycast) based on our first work about stability.
1.2.2 Second study: Anycast Security
In our second study (Chapter 3), we answer this question: is spoofing common in anycast infrastructures? Based on the work to answer this question, we confirm the security of anycast by studying the anycast-based root DNS, which supports the second part of the thesis statement. This work builds on our prior evaluation of anycast stability showing that catchments rarely change. We quantify DNS spoofing by designing a methodology to detect spoofing and differentiate the mechanisms used to spoof. We show that DNS spoofing is not common, although it happens worldwide and keeps increasing over the years. We study three aspects of DNS spoofing: how many queries are spoofed, how spoofing is done, and who is spoofing. First, we design a new method that can recognize spoofed DNS replies. We find that from 2014 to 2020, as an anycast service, the root DNS is rarely spoofed, but spoofing keeps growing, with the fraction of VPs experiencing DNS spoofing rising from 0.7% (2014-02-04) to 1.7% (2020-05-03). Second, we develop a new approach that combines analysis of replies to DNS, ICMP, and traceroute queries to distinguish the mechanism a spoofer uses to spoof the DNS. We find that of the three mechanisms, prefix hijacking is the least frequently used, compared with proxies and injection. Third, we find that most identifiable spoofing organizations are ISPs that benignly use spoofing to reply faster, although the root DNS is already optimized by being geographically distributed using anycast. Moreover, by comparing authoritative and spoofed replies, we find the distributions of latency from A-root to M-root vary.
The first and second studies demonstrate that anycast is a reliable way to deploy services, but we still wonder: is there room to improve anycast services? Although we confirm that anycast is stable, we have not yet looked at routing latency for anycast performance. In the third work, we study the routing latency of a commercial anycast-based CDN.
1.2.3 Third study: Anycast Latency
In our third study (Chapter 4), we answer this question: can we improve the latency of an anycast infrastructure? Based on the work to answer this question, our third study improves the latency of an anycast CDN, which supports the third part of the thesis statement. We design and apply a new and effective approach to detect and improve latency for anycast-based CDNs, and we reduce latency by half for regional users of a commercial CDN. We accomplish four tasks to improve the latency. First, we design a new methodology, BAUP (Bidirectional Anycast Unicast Probing), that detects latency that is potentially improvable and suggests better alternative routing. Our methodology studies the routing paths to the unicast and anycast addresses of the CDN PoP, on both the forward and reverse paths. Second, our new methodology suggests that slow hops or circuitous routing can be the reason for poor latency, and provides examples of such cases. Third, we find that routing inefficiency is not common in the CDN studied: only about 1.59% of VPs are detected with heavily under-performing routing. Fourth, with BAUP, the CDN studied achieves a tangible performance improvement for a popular group of regional users, with median latency dropping from 40 ms to 16 ms.
We confirm anycast stability and anycast security, and improve anycast latency, through the above three studies. The three studies suggest anycast can be trusted as an infrastructure to deploy services such as the DNS and CDNs. While the uncertainty of anycast is built into its nature—routing is at the mercy of BGP—researchers and operators are able to rely on its stability in TCP connections and to tune its latency to be lower.
1.3 Contributions
The first contribution of the above three studies is to support our thesis statement: each study supports one aspect of the thesis—anycast stability, anycast security, or improving anycast latency. Additionally, each work has its own research contributions.
We make two contributions in our stability study. First, we provide evidence for operators to trust anycast infrastructure for applications that depend on TCP connections. We confirm that almost no TCP connections will time out due to catchment shifts caused by routing changes. Second, we demonstrate that instability happens for specific pairs of (VP, anycast infrastructure), and is not a feature of a specific VP or a specific infrastructure. We study 11 anycast infrastructures with different numbers and locations of sites. For the small group (1%) of VPs that ever experience instability, these VPs only experience instability in one or two services out of the 12 infrastructures studied, but not in the others.
The study of anycast security has three contributions. First, we warn that DNS spoofing has been increasing over the 6 years from 2014 to 2020. In May 2020, about 1.7% of VPs received spoofed DNS answers; with only 0.7% spoofed in 2014, the fraction of spoofed VPs more than doubled during those 6 years. Second, we detect spoofing by checking server IDs, and we validate this methodology using B-root server logs, showing that our detection methodology has a high true-positive rate, over 0.96. Third, we show that today proxies (which drop the original query packet) are a more popular way to spoof than injection, and third-party anycast sites remain an unpopular way to spoof; we see in B-root logs that the majority of spoofed queries never reach B-root.
In our last study, about anycast latency, we make two contributions. First, our methodology, BAUP, is a general approach for anycast operators to diagnose and improve the latency of their current infrastructure. Operators can use BAUP to find improvable latency and to locate slow hops. Second, by applying BAUP on the Verizon Media Platform CDN, we made tangible latency changes, cutting the median latency of regional users by more than half, from 40 ms to 16 ms.
Chapter 2
Anycast Stability in UDP and TCP
In this chapter, we examine data from more than 9000 geographically distributed clients to 12 anycast services to evaluate one question: in anycast infrastructures, how often does instability happen due to routing changes? Two studies [12, 51] looked at the instability of anycast, and their conclusions are largely qualitative. Our analysis of this data provides the first quantification of anycast instability. We explore where, why, and how often anycast instability occurs. This study supports the first part of the thesis by confirming the stability of anycast.
Our study provides the first quantification of anycast stability. We see that about 1% of VPs are anycast unstable, reaching a different anycast site frequently (sometimes every query). Flips back and forth between two sites within 10 seconds are observed in selected experiments for given services and VPs. Moreover, we show that anycast instability is persistent for some VPs—a few VPs never see stable connections to certain anycast services during a week or even longer. The vast majority of VPs only saw unstable routing towards one or two services instead of instability with all services, suggesting the cause of the instability lies somewhere in the path to the anycast sites. We point out that for highly-unstable VPs, the probability of hitting a given site is constant, suggesting load balancing might be the cause of anycast route flipping. Finally, we directly examine TCP flipping and show that it is much rarer than UDP flipping, but does occur in about 0.15% of (VP, letter) combinations. Moreover, we show concrete cases in which TCP connections time out in anycast connections due to per-packet flipping. Our findings confirm the common wisdom that anycast almost always works well, but provide evidence of a small number of locations in the Internet where specific anycast services are never stable.
This study of anycast stability in both UDP and TCP connections supports our thesis statement (subsection 1.2.1) by confirming that anycast is stable. First, we provide evidence for operators to trust TCP connections to be stable most of the time in anycast infrastructures. We confirm that almost no TCP connections will time out due to site shifts caused by routing changes. Second, we demonstrate that instability happens for specific pairs of (VP, anycast infrastructure), and is not a feature of a specific VP or a specific infrastructure. For the small group (1%) of VPs that experience instability, these VPs only experience instability in 1 to 3 services out of the 12 infrastructures studied, which all have different numbers and locations of sites.
Part of this chapter was published in The Network Traffic Measurement and Analysis Conference (TMA) 2017 [95], and in IEEE Transactions on Network and Service Management (TNSM) 2018 [96].
2.1 Introduction
A concern about anycast is that BGP routing changes can silently shift traffic from one site to another—we call this problem potential anycast instability. Without centralized control, such a shift will cause the connection to break. Yet this problem cannot possibly be widespread—anycast’s wide use across many commercial providers suggests it works well. This observation is supported by multiple studies that have shown that routing changes rarely interrupt connections [12, 45, 51]. Internet applications must already include some form of recovery from lost connections to deal with server failures and client disconnections, so anycast instability should not be a problem provided it is infrequent. Moreover, most web connections are only active for short periods of time, so the fraction of time when a route change will directly affect users is small.
In addition to BGP changes that may cause anycast instability, load balancers are widely used in many places in the Internet. While load balancing at the destination is usually engineered to provide stable destinations for each client, load balancing in the wide-area network is not always so careful. Prior work has observed that WAN-level load balancing can disrupt RTT estimation [67]; we believe it can also result in anycast instability. While such problems may be very rare (affecting only users that cross a specific link, and perhaps only certain traffic types), such effects in the WAN are particularly concerning because they happen outside the control of both the user and the service provider. It is extraordinarily difficult to detect problems that affect a tiny fraction of users while still providing service to the vast majority of users. With billions of users, even a fraction of a percent is a serious problem.
This chapter provides the first quantitative evaluation of the stability of anycast routing. While very rare, we find that about 1% of combinations of vantage point and anycast service are anycast unstable, frequently changing routes to different sites of a service (subsection 2.4.2). We call these route changes anycast flips, and they can disrupt anycast service by losing state shared between the VP and the server with which it was previously communicating.
This result follows from the study of 11 different anycast deployments, each a global Root DNS Letter with an independent architecture, with sizes varying from 5 to about 150 anycast sites (each site a location with its own anycast catchment and one or more servers).
Our second contribution is to demonstrate the severity and potential causes of route flips through a number of measurement studies. This study provides a broad view by examining all combinations of about 9000 VPs and 11 anycast services. We use several measurement methods to examine how frequent flips are, proving that they often flip between anycast sites within tens of seconds (subsection 2.4.5), and strongly suggesting they may flip more frequently, perhaps every packet, as shown in our data (subsection 2.4.6). We also find that anycast instability is often continuous and persistent: 80% of unstable pairs of VP and anycast service are unstable for more than a week. For a few, these problems are very long lasting: 15% of VPs are still unstable with some service even 8 months later (subsection 2.4.3). We show that anycast instability is specific to paths: almost all VPs with instability see it in only a few services (one to three), not all 11 (subsection 2.4.4). Although we cannot definitively know the root causes of anycast instability, we do show from our measurements that certain (VP, service) pairs flip very frequently, likely every packet (subsection 2.4.6).
In earlier work we suggested a possible explanation is load balancers on WAN links [95]. Here we report an evaluation of both UDP and TCP connections (subsection 2.4.7), showing that UDP flipping is rare, occurring in about 1% of (VP, service) pairs, and that TCP flipping is rarer still, with regular TCP flipping occurring for only about 15 pairs (less than 0.15% of VPs). However, when TCP flipping occurs, more than 20% of DNS queries over TCP do not complete.
Our results have three important implications. First, anycast almost always works without routing problems: for 99% of combinations of VP and anycast service, routes are stable for hours, days, or longer. With multiple successful commercial CDNs, this result is not surprising, but it is still important to quantify it with a clear, public experiment. Second, we show that anycast does not work for all locations: a few VP/anycast combinations (about 1%) see persistent UDP route instability, with paths flipping frequently. Third, TCP anycast flipping does occur in certain cases, resulting in terminated connections, although it does not happen for every anycast service in our data. We also show that some users behind a per-packet balancer might be badly affected by one anycast service, but not by other anycast services. These results suggest that commercial anycast CDNs and providers who want to serve all users may wish to study locations that have anycast-unstable routes and investigate ways to reduce this instability.
2.2 Anycast Routing Instability
In IP anycast, an anycast service uses a single IP address, and a user’s traffic is directed to a “nearby” site selected by BGP routing. Typically, “nearby” is defined by the length of the path in AS hops, but BGP supports multiple mechanisms that allow service operators and ISPs to impose policy decisions on routing (for details, see an overview [14]). Policies can be political (this path is only for academic traffic), commercial (send more traffic to the less expensive peer), or technical (load balance across these links).
Anycast flips can be a problem because changes in routing shift a client to a new server without any notification to either. If the client and server have shared state, such as an active TCP connection, that state will break because of a TCP reset and need to be re-established.
CDNs often keep persistent TCP connections open to clients when sending streaming media such as video. While applications need to be prepared for unexpected termination of TCP connections, a route flip will greatly increase latency as the problem is discovered and a new connection is built.
Most DNS today is sent over UDP, but zone transfers use TCP, and recent work has suggested widespread use of TCP and TLS for DNS privacy [46, 100]. For DNS, a route flip results in a much larger response time. For a CDN or video streaming, it might result in playback stalls and “buffering” messages.
A key factor affecting the degree of impact that an anycast flip has on a client is how long its TCP connections are open, plus how long they are active. For video, connections may be open for many tens of minutes, during which they may be active around 10% of the time. For DNS, connections may be open for tens of seconds and active briefly, but multiple times.
2.3 Methodology
This section explains the essential features of the datasets and the methodology we use to analyze them.
dataset  start             duration    number of VPs  probing interval  targets (root letters)
a        2015-12-05 00:00  7 days      9184           240 s             --CDEFG-IJKLM
b        2016-08-01 00:00  7 days      9254           240 s             A-CDEFG-IJKLM
c        2017-01-29 21:00  30 minutes  100            20 s              ---D---------
d        2016-08-01 00:00  7 days      192 peers      continuous        A-CDEFG-IJKLM
e        2017-07-08 00:00  17 hours    100 per root   20 min            ABCDEFG-IJKLM
f        2017-07-19 22:30  17 hours    15              20 min            ABCDEFG-IJKLM
g        2017-07-19 22:30  17 hours    15              30 min            ABCDEFG-IJKLM
Table 2.1: Datasets used in this chapter. a, b, c: datasets observing catchments from UDP-based CHAOS queries; d: BGP routing updates; e, f: datasets observing catchments from TCP-based CHAOS queries; g: traceroute datasets.
2.3.1 Sources and Targets
This chapter uses five CHAOS query datasets, listed in Table 2.1. All use the RIPE Atlas infrastructure [74]. Two are existing public datasets that RIPE Atlas collects via UDP queries [75]; the third is an additional publicly-available dataset we collect to improve time precision [79]. We also collect another two datasets [76, 77], both using RIPE Atlas VPs to query DNS letters via TCP. Additionally, we use the RouteViews dataset [85] to check updates to BGP routing tables, and we use Paris traceroute records from RIPE Atlas VPs [78] to study routing paths.
The targets of RIPE data collection are all 13 Root DNS Name Servers (or Root Letters), shown in Table 2.2. Of these services, our study considers all Root Letters that use anycast at the time of measurement. We omit A-Root from the 2015 dataset because at that time it was only probed every 30 minutes. We omit B- and H-Root from both datasets because, at these times, B is unicast and H uses primary/secondary routing. Root Letters are operated by 12 organizations and use 13 different deployment architectures with a wide range of sites (5 to 144), providing a diverse set of targets.
We actively probe each anycast letter, sending queries from more than 9000 RIPE Atlas probes, embedded computers we call Vantage Points (VPs). VPs are geographically distributed around the world, although North Africa and China are only sparsely instrumented. Our results may underrepresent anycast problems in these two areas.
                                sites observed
letter  operator      reported    2015    2016
A       Verisign          5        —        5
C       Cogent            8        8        8
D       U. Maryland      87       63       71
E       NASA             71       74       66
F       ISC              59       51       48
G       U.S. DoD          6        6        5
I       Netnod           49       51       56
J       Verisign         98       65       89
K       RIPE             33       32       40
L       ICANN           144      110      118
M       WIDE              7        6        6
Table 2.2: Targets of our study are most of the 13 Root Letters, with their reported number of sites [81], and how many sites we observe in each dataset.
We use active queries rather than passive analysis of BGP because we are most concerned about frequent flipping (subsection 2.4.5), and prior work has shown that BGP changes are relatively infrequent, often hours or more apart [62]. We also use BGP data from RouteViews to confirm BGP stability (subsection 2.4.6).
The targets of our queries are 11 root letters operated by 10 organizations, listed in Table 2.2. Different letters have different deployment architectures and different numbers of anycast sites. Although we study most letters, we do not see all anycast sites of each letter. We sometimes miss sites because RIPE Atlas VPs are sparse in some parts of the world (particularly Africa), and because some anycast sites are local-only and so will be seen only by a VP in the same AS. Fortunately, answers to our research questions do not require complete coverage.
We do not directly study anycast-based CDNs. Like root letters [81], CDNs vary widely in size, from ten or a few tens of sites [13] to nearly 1000 sites [15], although hybrid architectures may use a subset of all sites [39]. In subsection 2.4.4 and subsection 2.4.6 we show that instability often results from load balancers in the middle of the network, so our results likely apply to CDNs.
2.3.2 Queries from RIPE Atlas
Each VP queries each Root Letter every 4 minutes (except for A-Root in the 2015 dataset). The query is a DNS CHAOS-class query for a TXT record with the name hostname.bind; this query is standardized to report a string, determined by the server administrator, that identifies the server and site [99]. Queries are directed at the anycast IP addresses served by a specific Root Letter. (RIPE Atlas queries each specific IP address of the root servers, allowing us to study each letter as an independent service.)
The above query results in a record listing the time, the VP’s identity, and the response to the CHAOS
query (or an error code if there is no valid response). The responses are unique to each server in use. There
is nothing to prevent third parties from responding on an anycast service address, and we see evidence of
that in our data. We call responses by third parties other than the operator spoofed.
We map the CHAOS responses we see to the list of sites each letter self-reports [81], following practices in prior studies [36]. While CHAOS responses are not standardized, most letters follow regular patterns, and seeing the same pattern from many different VPs gives us some confidence that it is valid. For example, if lax1a.c.root-servers.org and lax1b.c.root-servers.org regularly appear, we assume city.c.root-servers.org is C-Root’s pattern.
CHAOS responses usually identify specific servers, not sites. Some letters have multiple servers at a given site. Continuing the above example, lax1a.c.root-servers.org and lax1b.c.root-servers.org suggest C-Root has two servers (1a and 1b) at the lax site. Not all letters identify servers inside large sites, but all provide unique per-site responses.
We study flipping between sites and ignore changes between servers in each site, since operators can
control server selection if they desire (perhaps with a stateful or consistent load balancer), but not changes
between sites.
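As a minimal sketch of this kind of probing and site mapping (an assumption-laden illustration: it uses the dnspython library rather than RIPE Atlas, C-Root's public anycast address, and a C-Root-style regular expression; none of these represent the dissertation's actual tooling):

# Send a CHAOS-class TXT query for hostname.bind to an anycast root address and
# reduce the per-server identifier (e.g. "lax1a.c.root-servers.org") to its site
# code ("lax").  dnspython and the hard-coded address/regex are illustrative only.
import re
import dns.message
import dns.query
import dns.rdataclass
import dns.rdatatype

C_ROOT_ANYCAST = "192.33.4.12"   # c.root-servers.net; any letter's anycast IP works
SITE_RE = re.compile(r"^([a-z]{3})\w*\.c\.root-servers\.org$")

def query_hostname_bind(server_ip: str) -> str:
    q = dns.message.make_query("hostname.bind", dns.rdatatype.TXT,
                               rdclass=dns.rdataclass.CH)
    resp = dns.query.udp(q, server_ip, timeout=5)
    return resp.answer[0][0].strings[0].decode()   # first TXT string of the answer

def site_of(reply: str) -> str | None:
    """Collapse a server identifier to its site, ignoring per-server suffixes."""
    m = SITE_RE.match(reply)
    return m.group(1) if m else None               # None: unknown pattern (possibly spoofed)

reply = query_hostname_bind(C_ROOT_ANYCAST)
print(reply, "->", site_of(reply))                 # e.g. lax1a.c.root-servers.org -> lax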
We detect spoofed strings as those that do not follow the pattern used by that letter and those seen only from a few VPs in specific networks; spoofers also typically reply with very low latency (a few ms instead of tens of ms). Typically, about 0.7% of VPs see spoofed replies, and those VPs always see the same replies, suggesting their ISPs intercept DNS. While we work to remove spoofed CHAOS replies from our data, our methods do not prevent a malicious party from generating correct-looking replies.
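These three signals can be thought of as a simple per-reply filter; the sketch below is hypothetical (the 10 ms latency cutoff, the 5-VP rarity threshold, and the way the signals are combined are illustrative choices, not the dissertation's exact rules).

# Illustrative spoof filter combining the three signals above: the reply string
# does not match the letter's naming pattern, it is seen by only a handful of VPs,
# and it arrives implausibly fast.  All thresholds here are assumptions.
import re

C_ROOT_RE = re.compile(r"\.c\.root-servers\.org$")

def looks_spoofed(reply: str, rtt_ms: float, vps_seeing_reply: int) -> bool:
    bad_pattern = not C_ROOT_RE.search(reply)
    rare = vps_seeing_reply <= 5
    too_fast = rtt_ms < 10.0
    # Treat the pattern mismatch as necessary, and require one corroborating signal.
    return bad_pattern and (rare or too_fast)

print(looks_spoofed("dns-cache.example.net", rtt_ms=2.1, vps_seeing_reply=1))        # True
print(looks_spoofed("lax1a.c.root-servers.org", rtt_ms=35.0, vps_seeing_reply=400))  # False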
2.3.3 Other Sources: High Precision Queries, TCP and BGP
In addition to the standard RIPE Atlas probes of the Root Letters, we also request our own measurements at a more frequent time interval via UDP, later perform experiments via TCP, and gather BGP information to understand routing.
For high-precision queries we use the RIPE Atlas infrastructure, but select 100 VPs of interest, based on those that see anycast instability. For these VPs, we request that they query D-Root every 60, 70, 80, and 90 s for 30 minutes. Although RIPE Atlas limits queries to once per minute, by scheduling concurrent measurement tasks on the same VPs we can get results that provide precision approaching 20 s or even less from the unevenly-distributed queries.
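The effect of interleaving those four schedules can be checked with a small sketch; assuming (simplistically) that all four measurements start at the same instant, the merged sample stream already has gaps far smaller than any single 60-90 s period.

# Merge the four per-VP probing schedules (periods 60, 70, 80, 90 s) over 30 minutes
# and report the gaps between consecutive samples.  Real RIPE Atlas measurements
# start at arbitrary offsets, which spreads the samples even more evenly than this.
DURATION_S = 30 * 60
PERIODS_S = [60, 70, 80, 90]

samples = sorted({t for p in PERIODS_S for t in range(0, DURATION_S + 1, p)})
gaps = [later - earlier for earlier, later in zip(samples, samples[1:])]

print(f"{len(samples)} distinct samples in {DURATION_S} s")
print(f"mean gap {sum(gaps) / len(gaps):.1f} s, min gap {min(gaps)} s, max gap {max(gaps)} s")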
For the two datasets e and f in Table 2.1, collected from TCP queries, we use the same RIPE Atlas infrastructure. We select 100 different VPs of interest for the first dataset, based on those that see the most UDP anycast instability. For these VPs, we request that they query all DNS root letters every 20 minutes for 17 hours. For the second dataset, we select 15 VPs based on the previous datasets and repeat the experiment.
To rule out BGP as the cause of flipping, we use data from RouteViews [85] from 2016-08-01 to 2016-08-07. Although the peers that provide routing data are in different locations than our RIPE VPs, the multiple RouteViews peers provide a guiding picture of Internet BGP routing.
2.3.4 Detecting Routing Flips
We define a routing flip as occurring when a prior response for a VP’s query indicates one site and the next response indicates a different site. For missing replies, we assume that the VP is still associated with the same site as in the prior successful reply.
Most VPs miss ten or fewer replies per day, so loss does not change our results. About 200 VPs (around 2%) miss all or nearly all replies; we exclude these from our datasets.
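A minimal sketch of this counting rule (the list-of-site-strings representation is hypothetical; the real analysis runs over the RIPE Atlas result records described above):

# Count routing flips in one VP's time-ordered observations of one letter.
# Each entry is the site derived from the CHAOS reply, or None for a missing reply;
# missing replies are assumed to leave the catchment unchanged.
def count_flips(sites: list[str | None]) -> int:
    flips = 0
    last = None
    for site in sites:
        if site is None:                      # no valid response: keep the prior site
            continue
        if last is not None and site != last:
            flips += 1
        last = site
    return flips

print(count_flips(["lax", "lax", None, "jfk", "lax"]))   # 2 flips: lax->jfk, jfk->lax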
2.3.5 Identifying Load Balancers with Paris Traceroute
Although our end-to-end measurements suggest flipping somewhere on the path, these tests do not identify specific load balancers. We therefore use traceroutes [78] to identify potential locations of load balancers that may cause flipping. We confirm the presence of multipath routing by detecting changes to the paths observed at consecutive timestamps in dataset g in Table 2.1. If an alternate path keeps appearing at a certain proportion of observations in any rolling time window, such as half or two-thirds in any one-hour or two-minute window, we consider that a load balancer shapes the traffic this way. The load balancer is likely the hop before this intermittent hop, or in a private network before this hop.
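A minimal sketch of this rolling-window test follows; the window size of six traceroutes and the one-third minimum fraction are illustrative parameters, not the exact values used in our analysis.

    from collections import Counter

    # A minimal sketch of the rolling-window test for multipath routing at one
    # traceroute hop. Input: time-ordered hop addresses seen at a fixed TTL; we
    # check whether an alternate address keeps appearing in every window, rather
    # than showing up only during a one-time path change.
    def persistent_multipath(hop_addrs, window=6, min_fraction=1 / 3):
        if len(hop_addrs) < window:
            return False
        for i in range(len(hop_addrs) - window + 1):
            counts = Counter(hop_addrs[i:i + window])
            top, top_count = counts.most_common(1)[0]
            alt_fraction = 1 - top_count / window
            if alt_fraction < min_fraction:   # a window without alternates looks
                return False                  # like a routing change, not a
        return True                           # per-packet/per-flow balancer

    hops = ["10.0.0.1", "10.0.0.2"] * 6       # alternates keep appearing
    print(persistent_multipath(hops))         # True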
We check whether a load balancer alters the final destination of packets. Load balancers may send packets over different paths, but if the packets end up at the same destination, problems may be limited. On the other hand, if a load balancer sends alternate packets to different sites, connection-oriented communication will not be successful (TCP will fail with a reset). To see whether the destination changes, we look up the geolocation of the penultimate routers before the final destinations, checking their IP addresses when they are public. The root servers usually drop ICMP packets, so the last hops in the traceroute are often the penultimate routers of the root servers.
Figure 2.1: Sites accessed by 140 VPs: each row represents a VP, each column represents a 40-minute period, and the colors show what site that VP reaches for C-Root (yellow: MAD (Madrid), orange: ORD (Chicago), gray: CDG (Paris), red: BTS (Bratislava), pink: FRA (Frankfurt), purple: IAD (Herndon), blue: JFK (New York), green: LAX (Los Angeles), white: no response). Panel (a) shows 140 VPs for 1 week; panel (b) shows the top-left 50 VPs for 20 hours. Dataset: 2015.
2.4 Evaluation
We next apply our analysis to evaluate anycast stability. We first identify examples of routing instability, then quantify how often it happens and how long it persists, and then discuss possible causes of the instability.
2.4.1 What Does Anycast Instability Look Like?
We first look to see if there is any anycast instability. While successful anycast-based CDNs suggest that most users will be stable, perhaps a few are less fortunate. We look at the data from RIPE Atlas to the Root Letters as described in section 2.3, looking for VPs that change sites between consecutive queries.
Before looking at stability statistics, we first show a sample of direct observations to characterize typical
anycast stability. We selected 140 VPs from the C-Root dataset in 2015 and plotted which sites
they access for each 40-minute period of the week-long dataset. To better present the data, we select C-root
because we can assign each of its 6 sites a unique color (or shade of gray). We choose 140 VPs that mainly
associate with the MAD and ORD sites as representative of all sites for C. To show our full week of data on
the page, we report only the last site selected by each VP in each 40 minute period. (This summarization
actually reduces the apparent amount of changes.)
Figure 2.1 is a timeseries showing which sites 140 VPs reach over 1 week in the 2015 dataset. Each row
is a VP, and each column is a 40-minute period, and color indicates the currently active catchment (or white
if no reply). Figure 2.1b zooms in on the top left 50 VPs for 20 hours. We see similar results for other letters, and
for other datasets.
Overall stability: Figure 2.1 shows very strongly that anycast usually works well—most VPs are very
stable. Many of these VPs access only one site: most of the top 39 VPs reach MAD (the yellow band), while most of the bottom 101 VPs reach ORD (orange). We expect general stability, consistent with the wide, successful use of anycast.
While most VPs are stable, we next look at three groups of routing flips as shown by color changes in
the figure.
Groups of Routing Flips: These routing changes affect many VPs at the same time; if they are occasional they are benign. On the left of Figure 2.1a, there is a tall vertical “stripe” affecting many VPs for ORD (orange), and another wider stripe affecting many of the VPs for MAD (yellow). In each of these cases we believe there was a change in routing in the middle of the network that affected many (but not all) users, changing them from ORD or MAD to blue JFK. In both cases, the routes changed back fairly quickly (after 36 minutes for MAD-CDG-MAD, and 6 hours for ORD-LAX/IAD/JFK-ORD). Group flips that happen occasionally will require TCP connection restarts, but two events in two weeks will have minimal impact on users. This kind of normal routing change reflects anycast automatically re-routing as ISPs reconfigure due to traffic shifts or link maintenance.
Individual, Long-term Changes: Other times we see individual VPs change their active site, perhaps
reducing latency. For example, in the bottom-right of Figure 2.1a, about 10 VPs change from ORD to IAD or LAX
(orange to purple or green), and stay at that site for the remainder of the period, about three days. Again, we
believe these changes in routing represent long-term shifts in the network, studied elsewhere [87]. Because
these changes are infrequent, long-term shifts cause minimal harm to users, and sometimes they may help if
they result in a lower latency path. They may also represent routing changes by operators to re-balance load
on sites.
Frequent Routing Flips: Finally, we see a few cases where VPs see persistent routing flips, suggesting that, for them, anycast will not work well. In Figure 2.1a we see four cases: three VPs flip between MAD and CDG
and back (gray and yellow, all in the yellow band), and one VP alternates between FRA-BTS-MAD (pink,
red and yellow, shown at the boundary of the yellow and orange bands). This behavior continues throughout
the week. While the VPs sometimes reach the same site in consecutive measurements, this kind of frequent
flipping greatly increases the chances of breaking TCP connections.
2.4.2 Is Anycast Instability Long Lasting, and for How Many?
We have seen some unstable users (Figure 2.1a), but how many are unstable? To answer that question, we must
first consider how long instability lasts.
To evaluate the stability of each VP, we compute the mean duration that VP is at each site, then report
the cumulative distribution for each root letter for the 2015 and 2016 datasets in Figure 2.2.
The result confirms the prior observation that overall, anycast is very stable for most VPs. The y-axis of
the CDFs (Figure 2.2) does not start at zero, and we see that 90% of VPs see two or fewer changes for all
Root Letters we study but one (Table 2.3). In fact, A-Root barely saw any route changes for any VP in the week starting from 2016-08-01 (Figure 2.2b).
Stability means most VPs are in one catchment for a long time. Table 2.3 shows overall statistics per
letter, for each dataset. Most VPs are very stable.
Figure 2.2: Cumulative distribution of mean flip time (routing flip frequency) for each VP, broken down by anycast service: (a) week of 2015-12-05; (b) week of 2016-08-01. (Note the y-axis does not start at zero.)
However, it also confirms a few VPs experience frequent routing flips. We define a VP as anycast
unstable when the mean time between flips is 10 minutes or less. We select this threshold because it is
slightly longer than two measurement intervals (each 4 minutes), tolerating some measuring jitter. Based on
Root     flips per VP          percent of VPs with flips
Letter   mean (sd)          =0      ≤1      ≤2      ≤3
A         2.0  (21.2)      23%     25%     98%     98%
C        16.7 (133.2)      80%     80%     90%     91%
D        32.4 (188.5)      50%     52%     89%     91%
E        30.9 (190.0)      66%     69%     90%     90%
F         7.1  (81.8)      81%     82%     91%     92%
G        11.3  (93.5)      12%     12%     51%     52%
I        17.2 (134.3)      72%     76%     89%     90%
J        15.6 (128.3)      69%     72%     90%     92%
K        14.5 (124.8)      76%     78%     86%     86%
L        17.1 (137.8)      71%     75%     90%     92%
M         8.9  (98.1)      90%     91%     95%     95%
Table 2.3: Number of flips per VP, for each Root Letter, for the week of 2016-08-01.
the threshold of 10 minutes, we see that about 1% of VPs are anycast unstable for almost all Root Letters for both the 2015 and 2016 datasets. One exception is A-Root, which shows high stability in the 2016 dataset. This analysis suggests that, at least in these datasets, some VPs will have a difficult time using anycast and may experience TCP connection breaks.
To confirm these results are typical, Figure 2.3 examines the fraction of anycast unstable VPs each day.
Most letters consistently have about 1% of their VPs unstable, although there is some variation for a few letters (for example, D reaches 3.2% in part of Figure 2.3a).
The precision of results in Figure 2.2 is limited by the 4-minute frequency of basic RIPE observations. We later return to this question with a more frequent request rate and further analysis, suggesting that flipping rates are actually much higher than once per 4 minutes (subsection 2.4.5), and likely per-packet (subsection 2.4.6).
2.4.3 Is Anycast Instability Persistent for a User?
We have shown that about 1% of VPs are anycast unstable, and that this count is relatively consistent over time
(Figure 2.3). But does instability haunt specific users, or does it shift from user to user over time? That is:
is the set of unstable users itself stable or changing?
Figure 2.3: The percentage of anycast unstable VPs for each day in a week: (a) week of 2015-12-05; (b) week of 2016-08-01.
To evaluate if anycast instability is persistent, we split each week into its first half and second half. We
identify anycast unstable VPs in each half using our 10 minute threshold, then we compare the two sets to
see how much overlap they have.
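The sketch below shows this overlap computation; the sets are toy data sized to match the C-Root row of Table 2.4, and the overlap percentage is computed as the mean of the two directional overlaps, which appears consistent with the table but is our assumption about the exact formula.

    # A minimal sketch of the persistence check: given the sets of anycast
    # unstable VPs found in each half of the week (using the 10-minute
    # threshold), report how many are unstable in both halves.
    def instability_overlap(unstable_first, unstable_second):
        both = unstable_first & unstable_second
        if not unstable_first or not unstable_second:
            return len(both), None
        # assumed formula: mean of the two directional overlap fractions
        pct = 50 * (len(both) / len(unstable_first) + len(both) / len(unstable_second))
        return len(both), round(pct)

    first = {f"vp{i}" for i in range(102)}         # e.g. C-Root, first half: 102 VPs
    second = {f"vp{i}" for i in range(4, 112)}     # second half: 108 VPs, 98 shared
    print(instability_overlap(first, second))      # (98, 93), matching Table 2.4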
Table 2.4 shows the number of unstable VPs in each half of the week, for both datasets. While the absolute number of unstable VPs varies by letter, most VPs that are unstable remain unstable over the whole week: the percent that overlap in the two halves of the week is at least 63% and typically around 90%. Anycast instability is a stable property between a VP and its anycast service. Although the two weeks we checked are more than half a year apart, we also checked the overlap across the two different weeks and found that around 13% of unstable VPs are still in common.
week of 2015-12-05 week of 2016-08-01 both weeks
Root unstable VPs Overlap unstable VPs Overlap Overlap
Letter 1st 2nd both (percent) 1st 2nd both (percent) (percent)
A — — — — 2 2 2 100% —
C 102 108 98 93% 97 113 67 64% 10%
D 301 106 99 63% 190 186 142 75% 14%
E 119 180 110 76% 173 183 129 72% 13%
F 107 111 107 98% 34 35 26 75% 7%
G 82 82 76 92% 44 49 30 64% 16%
I 84 85 76 89% 84 107 68 72% 9%
J 68 157 48 50% 94 74 64 77% 12%
K 99 100 94 94% 86 93 75 89% 18%
L 87 67 62 81% 93 102 80 82% 20%
M 53 52 46 87% 55 57 32 57% 24%
Table 2.4: Overlap of anycast instability for specific VPs in half-weeks.
It is also possible we see large amounts of overlap because many VPs are on the same networks—we
rule this case out with additional validation. To check for bias from clustered VPs, we manually examined
unstable VPs and their ISPs. We found these VPs belong to different ISPs.
This analysis shows unlucky VPs (those that are anycast unstable) are likely to continue to be unlucky.
This result suggests that we must take care in interpreting the commercial success of anycast CDNs. Although they work well for most users, and analysis of their own data shows few broken TCP connections, it may be that their sample does not cover enough vantage points, because we have just shown that instability sticks to specific VPs over time. People will use anycast CDNs that work, but unlucky people that are anycast unstable for a particular CDN may simply turn away from that CDN (or its clients) because it doesn’t “work” for them.
2.4.4 Is Anycast Instability Near the Client?
We next look at where in the network anycast instability appears to originate. Is it near the VP (the client),
or near the anycast service’s sites (the servers)? This question is of critical importance, because we have
shown that some VPs are consistently anycast unstable. If the problem is located near the VP, it is likely that they will be unstable with many anycast services, and if an important service (like a CDN or Root DNS service) is provided only by anycast, then it might be impossible for that VP to get service.
To explore this question, we use the same approach we used to study the persistence of anycast instability
(subsection 2.4.3), but rather than comparing two halves of the same week, we compare different anycast services (different Root Letters). We consider three cases: (1) If instability occurs near an anycast site, then
many VPs reaching that site should see instability. (2) If anycast instability is near a specific VP, we expect
that this VP will be unstable with many services. (3) On the other hand, if a VP is unstable with only one
service, then flipping likely occurs on the path between that VP to its current site.
We define near as occurring within the first (or last) three hops of the path, likely within the same ISP.
The rest of the path is the middle of the network, since few organizations have more than a few hops before
reaching another ISP.
We also assume that operators do not configure routers to treat different root letters differently. For example, operators will not configure a router for per-packet load balancing for traffic to one root server but per-flow for other root servers.
We rule out case (1), since the service operator would notice and correct a site-specific problem. In addition, our study of the number of unstable VPs per service showed that there are at most a few anycast unstable VPs for each service (Figure 2.2).
For the 2015 dataset we identify 416 VPs that are anycast unstable for some root letter. Anycast instability is a property between the VP and a specific service, and Figure 2.4 shows how many anycast services each of these VPs finds to be unstable.
Our first observation is that almost half of VPs are only unstable with one service. Of the 416 VPs, 200
(48%) are unstable with only one of the 11 IP anycast services we study. We conclude that the most common
location of anycast instability is the middle of the network, somewhere on a unique network path, not near
the VP or an anycast site.
Figure 2.4: The CDF of unstable VPs by the number of root DNS services with which they are unstable. The vast majority of VPs only experience instability towards one to three services.
About the same number are anycast unstable with two or three services—202 of the 416 VPs, again
about 48%. We conjecture that in these cases the problem is closer to the VP. Fortunately, it does not affect
all services.
Only 2% of VPs are anycast unstable with more than 3 services, and none are unstable with more than 7. Since very few VPs have problems with all anycast services, we rule out case (2): in general, the problem is not located near the VP.
The distribution of the 2016 dataset is similar to 2015. The new weeks saw 494 unstable VPs, more than the 2015 datasets. The fact that the highest number of letters with which a VP simultaneously experiences instability goes from 8 to 7 is normal, considering we add another letter, B-Root, to our dataset.
One source of instability is paths that are load balanced over multiple links, where link selection is a function of packet fields that change with each packet. For example, the UDP source port is randomized in each of our queries; if it is included in a hash function for load balancing, packets can take different links. This problem has previously been observed in ping-based latency measurements [67]. Additional analysis in subsection 2.4.6 tries to understand these root causes.
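The following sketch illustrates this effect: a toy equal-cost load balancer hashes the 5-tuple, so queries that differ only in their randomized UDP source port can be sent over different links. The hash function, addresses, and link count are illustrative, not a real router implementation.

    import random
    import zlib

    # A minimal illustration of equal-cost load balancing that hashes the
    # 5-tuple. Because our DNS queries randomize the UDP source port,
    # consecutive queries can hash to different links, and hence can reach
    # different anycast sites downstream.
    def pick_link(src, dst, sport, dport, proto, n_links=2):
        key = f"{src}|{dst}|{sport}|{dport}|{proto}".encode()
        return zlib.crc32(key) % n_links

    random.seed(1)
    for _ in range(5):
        sport = random.randint(1024, 65535)          # randomized per query
        # 203.0.113.1 is a documentation address standing in for an anycast service
        link = pick_link("192.0.2.7", "203.0.113.1", sport, 53, "udp")
        print(f"sport={sport} -> link {link}")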
We do not consider correlations between the number of sites and the degree of flipping. One might look at Figure 2.2 for correlations, but with only 11 architectures, each unique, it seems difficult to make statistically strong comparisons.
We conclude anycast instability does not correlate with a specific VP, or with a specific anycast site; instead, instability is a property of the path for particular (VP, service) combinations, depending on their relative locations. The good news is that this conclusion means clients that see problems with one anycast service will likely be successful using some alternative services. Since Root DNS resolution is implemented by 13 underlying
IP anycast services, this result implies that anycast instability is unlikely to impede access to Root DNS
resolution. For CDNs [39], this result suggests the same content should be provided by multiple independent
anycast deployments.
2.4.5 Higher Precision Probing Shows More Frequent Flipping
The long-term RIPE Atlas datasets (examined in subsection 2.4.1) provide broad coverage for years, but each
VP observes its catchment every 4 minutes, and we would like greater precision on flip frequency. We use
this data to identify anycast unstable VP/service pairs, but these measurements are hugely undersampled—
we expect some sites are flipping every packet, but 4 minute measurements of a VP flipping between two
sites every packet will see a median flip time of 8 minutes. Improving the precision of this estimation
is important because TCP connections are active for short times, often a few tens of seconds, so proof of 4
minute flipping does not demonstrate TCP problems. In this section we take additional, direct measurements
from RIPE Atlas to evaluate if these pairs are actually flipping more frequently than standard RIPE Atlas
measurements are able to observe. (We cannot run custom measurement code on the VPs because RIPE
does not support that, we have no way of contacting VP owners, and we require data from the few, specific
VPs that show frequent flipping.)
Figure 2.5: Counting site flips from 100 VPs to D-Root. Measurements with about 20 s intervals (blue open squares on top) are compared to measurements every 4 minutes (green filled dots on bottom). Two VPs with no flips in the 4-minute data are marked with an asterisk (*).
To test this question we select 100 VPs to probe D-Root over 30 minutes with 95 unevenly distributed probes, roughly one query every 20 seconds.
Figure 2.5 compares how many flips we see when the same VPs probe at 4 minute intervals (green filled
dots on the bottom) compared to probes sent with about 20 s intervals (top open squares), for these 100 VPs
with frequent flips. (We report counts of flips rather than mean flip duration because it is difficult to assess
mean duration with this hour-long measurement.)
This data shows that more observations result in more flips—the open squares are always above filled
dots. Of course more observations make more flips possible, but this data shows that 4-minute measurements are undersampled and the path is flipping much more often. In fact, two VPs marked with asterisks show no flips during 4-minute observations, even though they flip frequently, at least about every 30 seconds.
If we assume every packet flips, then with fewer samples, these VPs just get “lucky” and appear stable with
undersampling.
Since our measurements are not synchronized, sometimes we take measurements very close in time. As
two specific examples, we saw one VP (84.246.12.69) flip from London to Frankfurt and back with three
measurements in 7 s, and another (201.217.128.115) flip from Miami to Virginia and back in 10 s. In the
next section we provide statistical evidence suggesting per-packet flipping is happening for specific VPs.
2.4.6 Does Per-Packet Flipping Occur?
We have shown that some VPs see very frequent flipping to some anycast services—as short as tens of
seconds (subsection 2.4.5). It seems unlikely that BGP is changing so frequently, since route flap damping
is usually configured to suppress multiple changes within a few minutes.
To rule out BGP as the source of instability we analyse the dataset of RouteViews [85] from 2016-08-01 to 2016-08-07, finding that BGP is quite stable. Of the 192 BGP RouteViews peers we studied, only 0% to 23% ever see a BGP change, for each root letter, on each day. The mean rate of BGP changes seen by those RouteViews peers is always fewer than two per day, so such infrequent routing changes cannot explain anycast flips that occur multiple times per minute. This new analysis supports previous studies that show BGP changes are relatively infrequent for Root DNS [62]. Instead, we suggest that these very frequent flips result from per-packet decisions made by load balancers in the path.
We cannot directly evaluate very frequent flips, because they occur only from specific VPs to certain
anycast services. While we find them with RIPE Atlas, it limits probing intervals to 60 s, and even with
multiple concurrent experiments, sub-second probing is impossible on RIPE. Neither can we reproduce
these flips from another site, since they are specific to the path from that VP to the service.
However, we can indirectly show it is likely that these flips are per-packet by looking at how they behave in sliding time windows over time. If the path is flipping every packet, then the probability of reaching a specific site should be almost constant over time. We measure consistency by sliding a window over all observations and looking at the fraction of queries that go to different anycast sites. If flipping is per-packet, the fraction should be similar for any window duration and time period.
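The sketch below shows this sliding-window computation on toy data; the window size of 20 observations matches our analysis, while the synthetic site sequence is only for illustration.

    import random

    # A minimal sketch of the consistency test: slide a 20-observation window over
    # one VP's time-ordered site observations and compute the fraction that hit a
    # chosen site in each window. Per-packet flipping should keep this fraction
    # roughly constant (near the long-term average) at every timescale.
    def window_fractions(sites, target, window=20):
        fractions = []
        for i in range(len(sites) - window + 1):       # windows overlap (step 1)
            hits = sum(1 for s in sites[i:i + window] if s == target)
            fractions.append(hits / window)
        return fractions

    def downsample(sites, factor):
        """Keep every factor-th observation (e.g. 4-minute data -> 16 minutes)."""
        return sites[::factor]

    random.seed(0)
    sites = [random.choice(["jfk", "ord"]) for _ in range(400)]  # toy per-packet flips
    print(window_fractions(sites, "jfk")[:3])                    # values near 0.5
    print(window_fractions(downsample(sites, 4), "jfk")[:3])     # still near 0.5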
To evaluate this hypothesis, we return to the 2016 dataset and focus on the 100 VPs that have frequent site flips towards C-Root (measured as those with a time-to-flip around 10 minutes, roughly twice the measurement interval). For each VP, we compute how many times their requests reached a specific anycast site in a window of 20 observations. We slide the window forward one observation at a time, so windows overlap.

Figure 2.6: Fraction of time one VP (146.186.115.74) spends at the JFK site of C-Root. Each point is the mean of a 20-observation sliding window, computed at three timescales: 4 minutes (wide blue), 8 minutes (red), and 16 minutes (blue).

Figure 2.7: Mean and standard deviation of the site hit ratio across all sliding time windows in a week, for each of the 100 selected VPs.
Figure 2.6 shows a representative example for one VP (146.186.115.74), which flips between the JFK
and ORD sites of C-Root. We report the fraction of time the VP is at JFK, measured with a 20-observation
moving window. We compute this moving window at three timescales, first using all the data (4 minute
samples, the wide line), and also downsampled two times (8 and 16 minutes). First, we see that the long-
term average is around 0.5, consistent with each packet going one way or the other. There are peaks and valleys, as we expect with any long-term average; sometimes we get a run of one site or the other, but the standard deviation is 0.1184 (shown as the dashed lines), and most of the time the average is within this range. However, the lack of any repeating pattern suggests that the flips are not long-term, but per-packet.
In addition, when we compare the three timescales, all show similar properties. This result is consistent
with all being drawn from random samples of per-packet flipping. These trends support the suggestion that we would see similar results if we increased the sampling frequency, as we showed experimentally down to 20 s in
subsection 2.4.5.
We see the same behavior as this example VP in most of the other 100 VPs we observed. Figure 2.7
shows the mean and standard deviation of all selected VPs, sorted by mean. Most of these VPs show a ratio
around 0.5 and a standard deviation around 0.1, consistent with our example, and consistent with random
selection per-packet. Some VPs on the right of the graph show an uneven split; we expect these are due to
uneven load balancing, or multiple load-balanced paths.
Taken together, our experiments and this analysis present a strong case for per-packet flipping. Ex-
periments at 4 minutes and 20 s (subsection 2.4.5) directly support this claim, and our analysis at multiple
timescales and across many VPs indirectly supports it.
2.4.7 Are TCP Connections Harmed by Anycast Flipping?
When a route flips, an active TCP connection will shift from one server to another, almost certainly resulting
in a connection reset. In the previous sections we studied route flipping with long-term UDP data; we next
compare these UDP observations with TCP connections and TCP flipping.
2.4.7.1 The relationship between UDP flipping and TCP-flipping
Because TCP connections interact badly with route changes, most load balancers try to send TCP flows to
the same destination, balancing flows instead of packets. Measurements of UDP flipping are therefore likely
to overestimate the degree of TCP flipping. We expect that TCP instability will occur only on a subset of
the (VP, letter) combinations that show UDP instability.
We next measure TCP connection success directly. We send TCP-based DNS queries from RIPE Atlas
VPs to Root DNS letters, selecting the combinations that UDP-based measurements suggest are most un-
stable. We are looking for TCP connection failures, measured by how many TCP connections time out. We
believe these timeouts indicate per-packet load balancing, while (VP, letter) combinations that show UDP
flipping but not TCP flipping indicate load balancers that use per-flow scheduling.
Figure 2.8: Cumulative distribution of the fraction of timed-out TCP queries for each VP, broken down by anycast service. For each letter, the selected VPs are different, depending on the UDP flipping case. (Note the y-axis does NOT start at zero.)
For our experiment we begin with the 100 VPs for each letter that show the most frequent UDP flipping.
(These VPs often are different for different letters, although there is some overlap.) We then make 50 TCP
DNS queries from each VP to its letter, each 20 minutes apart (thus the experiment lasts 17 hours).
Figure 2.8 shows a CDF of what fraction of all TCP queries time out for all VPs and queries, broken out
by letter. This experiment shows that TCP is very stable for most (VP, letter) combinations, except for J-Root. In Figure 2.8, for all letters including J-Root, more than 75% of (VP, letter) combinations show no timeouts.
This experiment shows that most load balancers are switching flows as a whole, not packets.
However, J-Root shows many more timeouts than the others, with 15 VPs showing many timeouts (20% or more of all tries), and 7 VPs timing out on half or more of their queries. We later refer to these 15 VPs that time out for more than 20% of all queries as frequent TCP flippers. We use 20% as a safe threshold,
because in Figure 2.8, for all letters other than J, the normal timeout range is less than 20%. We emphasize
that TCP timeouts are still very, very rare for J-Root: 99% of VPs see neither UDP nor TCP flipping to
J-Root, so only about 0.15% see frequent TCP timeouts. The next sections examine this case to understand it better.

Figure 2.9: Cumulative distribution of the fraction of timed-out TCP queries for each VP, broken down by anycast service. For each letter, the selected VPs are from the same set. (Note the y-axis DOES start at zero.)
2.4.7.2 Why Does J-Root See More TCP Timeouts?
We find J-Root sees more TCP timeouts than other letters, and that this problem occurs only for a few
VPs. We next examine if these VPs see problems with other roots, and look at the root cause of their TCP
instability.
First, to see if these 15 frequent TCP flippers have problems with other letters, we repeat the experiment
in subsection 2.4.7 by using the same 15 VPs to query all 12 roots that use anycast at the time of this
experiment.
For these, almost all see frequent timeouts only in TCP queries to J-Root, not to other root letters. In Figure 2.9, out of the 15 VPs, 12 have more than 20% of their queries to J-Root time out, but do not see such frequent timeouts towards other roots. We refer to these 12 VPs as J-Root-TCP-flippers. This
observation implies these timeouts are not caused by per-packet load balancing near the VPs, since load
balancing near the VP would affect many letters.
One or two VPs time out in TCP queries to other letters besides J-root. In Figure 2.9, for every root
except K-root, one or more VPs time out frequently as well. Our previous UDP-based sample selection in
Figure 2.8 might leave out a few VPs that possibly saw frequent TCP query timeout as well (a result of
per-packet flipping), but those VPs are included in Figure 2.9.
We believe TCP flipping is closely associated with the routing path between a VP and its possible anycast destinations, and is caused by a router in the path. The number of frequent TCP flippers varies for different services. Also, VPs that time out for one root DNS service usually do not time out for other letters.
2.4.7.3 Can we locate the per-packet balancer?
We next examine the 12 J-Root-TCP-flippers in Figure 2.9 to identify what makes them unusual. From the RIPE Atlas data recording the VPs’ information, we find 11 of them are in Iran in several different ISPs, and 1 is in Finland.
We consider two possible reasons why these 11 VPs time out frequently: DNS-based filtering, or a load balancer. First, while some countries do DNS-based filtering to censor Internet access, we rule out filtering because no other root letters see timeouts. Second, we do not see frequent timeouts on any other Iranian VPs in Figure 2.9 and Figure 2.8, suggesting this is not caused by a country-wide policy. Third, the 11 Iranian J-Root-TCP-flippers sometimes see successful replies from J-Root, suggesting an intermittent problem and not systematic censorship.
To look for a load balancer in common across these 11 VPs, we use 34 traceroutes from each of these VPs to each DNS root letter. These traceroutes are already taken by RIPE Atlas and are from the same time period as the other Atlas data we use.
All VPs have traceroutes that show their traffic passes through private network address space [69] (except two VPs that have a router on their path that blocks ICMP echo requests). Although the VPs have public IP addresses, traffic of each VP passes through 1 to 9 hops of private address space before reentering public IP address space in a neighboring country. Paths often diverge in the private address space, but never at the same IP address, so we cannot identify an obvious specific router doing per-packet load balancing. Sometimes several paths share routers with common private /24 prefixes, but with private address space, it is difficult to identify devices for certain.
We believe per-packet load balancing happens in these private networks, although we cannot say more. Moreover, the traceroutes directly show that those Iranian VPs’ penultimate routers to the J-Root destination shift among different sites because of the load balancing.
Traceroute records also explain why timeouts from those per-packet balancers occur only for J-Root and not for other letters. If we look at traceroutes from a specific VP to all letters, we see the penultimate routers to the J-Root destination shift between Malaysia and South Africa, two different J-Root sites. However, for A-Root, this same VP traverses the same private networks in Iran, exiting at two different places in the public Internet, but both paths terminate at the same anycast site in Virginia, US. The same is true for B-Root, with most traffic terminating in Los Angeles (and not the recently added Miami site). C- and D-Root both have multiple sites, but all Iranian traffic terminates at one IP address in Los Angeles for C-Root and in Tokyo for D-Root. Other letters show similar patterns with different paths, but generally one consistent destination. Other VPs show the same pattern as the above VP: they traverse different traceroute paths, and for J-Root they end at different penultimate routers, but for other roots they end at a single penultimate router or multiple routers in one city.
These observations show that a per-packet load balancer will often cause TCP connections to time out and be unavailable for flows that cross it, but that such configurations are very, very rare. This experimental result agrees with our findings in subsection 2.4.4 that load balancers are in the middle of the network. We believe one or more hops are configured for per-packet load balancing in the middle of the private network space. In our dataset, a few Iranian VPs time out only in TCP queries to J-Root but not to other roots, and not all Iranian VPs time out in TCP queries to J-Root.
2.5 Related Work
Prior studies have considered many aspects of anycast: latency [16, 87], geography [12], usage and traffic
characteristics [19, 36, 43, 45], CDN load balancing [39], and performance under DDoS attack [62]. How-
ever, only a few studies have considered the stability of anycast [12, 51], and their conclusions are largely
qualitative. Unlike this prior work, our goal is to quantify the stability of anycast.
Direct measurements: Prior stability studies either directly or indirectly measured catchments. Direct measurement studies of anycast stability use data from end-users or monitors that contact the anycast site. Microsoft has used Bing clients to study anycast and evaluate latency and load balancing. They observed 21% of end-users change sites at least once per week [16]. However, the FastRoute system is concerned with small file downloads in its availability studies [39]. They also showed that anycast availability dipped from 99.9% to 99.6% once during their week-long observation, but do not discuss why.
LinkedIn [12] evaluated anycast with a synthetic monitoring service to evaluate latency and instability,
and did not find “substantial instability problems”. Our results suggest that most VPs are stable, so long-
duration observation is unlikely to see new results, unless one studies from more vantage points located at
other different places in the Internet.
Finally, recent studies of DNS Root anycast showed frequent routing flips during DDOS [62], but that
paper did not study stability during normal periods.
Our work is also direct measurement like these prior studies, but unlike prior work we use many geographically dispersed VPs (more than 9000 from RIPE Atlas) and multiple services (the 11 anycast Root DNS services, some with 100 sites), under normal behavior.
Indirect evaluation: Inference can estimate changes in anycast catchments by looking for changes in latency or hop counts (IP time-to-live). Cicalese and Giordano examined anycast CDN traffic by actively sending queries to each prefix announced by 8 CDN providers [19]. They found anycast stable, with nearly constant RTT and time-to-first-byte, and consistent TTLs over a month. They later studied the duration of TCP connections for DNS and showed that most last tens of seconds, suggesting that DNS will not be affected by infrequent anycast catchment changes [43]. Unlike their work, we directly observe site flips with CHAOS queries, rather than inferring them. More important, we use 9000 VPs geographically dispersed across the world, while their study is based on VPs only in Europe.
2.6 Summary
In this chapter we used data from more than 9000 vantage points (VPs) to study 11 anycast services and examine the stability of site selection. Consistent with the wide use of anycast in CDNs, we found that anycast almost always works: in our data, 98% of VPs see few or no changes. However, we found a few VPs (about 1%) that see frequent route changes and so are anycast unstable. We showed that anycast instability in these VPs is usually “sticky”, persisting over a week of study. The fortunate fact that most unstable VPs are only affected by one or two services shows the causes of instability may lie somewhere in the middle of the routing path. By launching more frequent requests from the unstable VPs we discovered in our earlier analysis, we captured very frequent routing changes (back and forth within 10 s), and our statistical analysis shows they are possibly affected by per-packet flipping, which is potentially caused by a load balancer in the path. We also performed experiments with the same sources and targets but over TCP connections. We find TCP anycast instability is even rarer, but it exists and causes harm. Our results confirm that anycast generally works well, but for a specific service, there may be a few users whose routing is never stable.
In this chapter, we have confirmed anycast stability. Next, we look at the second part of our thesis statement, which concerns anycast security.
Chapter 3
Anycast Security in DNS Spoofing
In this chapter, we describe methods to identify DNS spoofing, distinguish the mechanism being used, and identify organizations that spoof, from six years of historical data. Our analysis of this data provides a longitudinal study of DNS spoofing over the recent six years and explores how it happens, who does it, and who is affected.
Our contribution is that we design a methodology to detect DNS spoofing, provide results by analysing
the dataset, and finally validate our methodology. First, we describe methods to identify DNS spoofing,
infer the mechanism being used, and identify organizations that spoof from historical data. Our methods
detect overt spoofing and some covertly-delayed answers, although a very diligent adversarial spoofer can
hide. We use these methods to study more than six years of data about root DNS servers from thousands of
vantage points. We show that spoofing today is rare, occurring only in about 1.7% of observations. However,
the rate of DNS spoofing has more than doubled in less than seven years, and it occurs globally. Finally, we
use data from B-Root DNS to validate our methods for spoof detection, showing a true positive rate over
0.96. B-Root confirms that spoofing occurs with both DNS injection and proxies, but proxies account for
nearly all spoofing we see.
This study about DNS spoofing supports our thesis statement (subsection 1.2.2) by confirming that we
can increase our confidence in two aspects of anycast security: integrity and privacy. 13 DNS Root letters
are deployed in anycast, and third-parties can use an anycast server to spoof. In this work, we show that
most of the time DNS answers are returned from the authoritative servers, suggesting that they have not
been manipulated. This increases confidence in the integrity of DNS answers (against interception and injection) and the privacy of DNS queries (against spoof-based eavesdropping).
We detect spoofing by checking server IDs. We validate this methodology with B-Root server logs, showing that our detection has a true-positive rate over 0.96. We show that today proxying (dropping the original query packet) is more popular than injection as a mechanism to spoof, and that anycast remains an unpopular way to spoof.
As of November 2020, we are planning to submit this work for peer review. The current version of this
work is released at arXiv [97].
3.1 Introduction
The Domain Name System (DNS) plays an important part in every web request and e-mail message. DNS
responses need to be correct as defined by the operator of the zone. Incorrect DNS responses from third parties have been used by ISPs to inject advertising [59]; by governments to control Internet traffic and enforce government policies about speech [42] or intellectual property [22]; to launch person-in-the-middle attacks by malware [23]; and by apparent nation-state-level actors to hijack content or for espionage [49].
DNS spoofing is when a third-party responds to a DNS query, allowing them to see and modify the re-
ply. DNS spoofing can be accomplished by intercepting and modifying traffic (proxying); by DNS injection, where responses are returned more quickly than those of the official servers [30]; or by modifying configurations in end hosts (section 3.2). Regardless of the mechanism, spoofing creates privacy and security risks
for end-users.
DNSSEC can protect against some aspects of spoofing by ensuring the integrity of DNS responses [31].
It provides a cryptographic signature that can verify each level of the DNS tree from the root. Unfortunately,
DNSSEC deployment is far from complete, with names of many organizations (including Google, Facebook,
Amazon, and Wikipedia) still unprotected [68], in part because of challenges integrating DNSSEC with
DNS-based CDN redirection. Even for domains protected by DNSSEC, the client software used by many end-users fails to check DNSSEC.
While there has been some study of how DNS spoofing works [30], and particularly about the use
of spoofing for censorship [42], to our knowledge, there has been little public analysis of general spoof-
ing of DNS over time. (Wessels currently has an unpublished study of spoofing [98].) Increasing use of
DNSSEC [31], and challenges in deployment [18] reflect interest in DNS integrity.
This chapter describes a long-term study of DNS spoofing in the real world, filling this gap. We analyse six years and four months of data about the 13 DNS root “letters” as observed from RIPE Atlas’s 10k observers around the globe, and augment it with one week of server-side data from B-Root to verify our results. Our first contribution is to define methods to detect spoofing (subsection 3.3.2) and characterize spoofing mechanisms from historical data (subsection 3.3.3). We define overt spoofers and covert delayers. We detect overt spoofers by atypical server IDs; they do not hide their behaviors. We expected to find covert spoofers, but instead found covert delayers: third-parties that consistently delay DNS traffic but do pass it to the authoritative server.
Our second contribution is to evaluate spoofing trends over more than six years of data, showing that
spoofing remains rare (about 1.7% of observations in recent days), but has been increasing (subsection 3.4.2)
and is geographically widespread (subsection 3.4.3). We also identify organizations that spoof (subsec-
tion 3.4.4).
Finally, we are the first to validate client-side spoofing analysis with server-side data. We use one week
of data from B-Root to show that our recall (the true-positive rate) is over 0.96. With the end-to-end check
with B-Root data, we are able to learn whether or not a query actually reaches the server. Server-side analysis
confirms that proxying is the most common spoofing mechanism. DNS injection [30] and third-party anycast
are rare.
Our methodology builds on prior work that used hostname.bind queries and the penultimate router [36, 48], but we provide the first longitudinal study over 6 years of all 13 root letters, compared to prior work that used a single scan [36] or a day of DNS and traceroute and a week of pings [48]. In addition, we are the first to use server-side data to provide end-to-end validation, and to classify spoofer identities and evaluate whether spoofing is faster.
All data from this chapter is publicly available as RIPE Atlas data [71–73] and from USC [9]. We will provide our tools as open source and make our analysis available at no cost to researchers. Since we use only
existing, public data about public servers, our work poses no user privacy concerns.
3.2 Threat Model
DNS spoofing occurs when a user makes a DNS query through a recursive resolver and that query is an-
swered by a third party (the spoofer) that is not the authoritative server. We call the potentially altered
responses spoofed. We detect overt spoofers who are obvious about their identities. We look for covert
spoofers, but find only covert delayers, where DNS takes noticeably longer than other traffic. We look at
reasons and mechanisms for spoofing below.
3.2.1 Goals of the Spoofer
A third party might spoof DNS for benign or malicious reasons.
Web redirection for captive portals: The most common use of DNS spoofing is to redirect users to a
captive portal so they can authenticate to a public network. Many institutional wifi basestations intercept all
DNS queries to channel users to a web-based login page (the portal). After a user authenticates, future DNS
traffic typically passes through.
We do not focus on this class of spoofing in this chapter because it is transient (spoofing goes away
after authentication). Our observers (see subsection 3.4.1) have static locations (e.g. home) that will not
see captive portals. However, our detection methods would, in principle, detect captive portals if run from
different vantage points (e.g. hotels).
Redirecting applications: DNS spoofing can be used to redirect network traffic to alternate servers. If used to redirect web traffic or OS updates, such spoofing can be malicious, as part of injecting malware or exploits. Alternatively, it can reduce external network traffic.
Faster responses: Some ISPs intercept DNS traffic to force it through their own recursive resolver. This redirection may have the goal of speeding responses, of reducing external traffic (a special case of redirecting applications), or of implementing local content filtering (described next).
Network Filtering and Censorship: DNS spoofing is a popular method to implement network filtering, allowing an ISP to block destinations to enforce local laws (or organizational policies, when done inside an enterprise). DNS spoofing has been used to control pornography [1, 2], for political censorship [24], and to implement other policies. Spoofing for network filtering can be considered a beneficial technique or malicious censorship, depending on one’s point of view about the policy. Spoofing for traffic filtering can be detected by DNSSEC validation, if used.
Eavesdropping: Since DNS is sent without being encrypted, spoofing can be used to eavesdrop on DNS
traffic to observe communications metadata [38].
3.2.2 Spoofing Mechanisms
Table 3.1 summarizes three common mechanisms used to spoof DNS: DNS proxies (in-path), on-path injection, and unauthorized anycast (off-path), following prior definitions [30, 48]. We review each mechanism and how we
identify them in subsection 3.3.3.
mechanism                   how                                   spoofer                      spoofee
DNS proxies (in-path)       a device intercepts traffic and       ISPs, universities,          users of the organization
                            returns requests                      corporations
On-path injection           a device observes traffic and         hackers, ISPs, governments   anyone whose traffic
                            injects responses                                                  passes the device
Unauthorized anycast site   a server announcing the BGP           ISPs, governments            anyone who accepts the
(off-path)                  prefix of the anycast service                                      BGP announcement
Table 3.1: Mechanisms for DNS spoofing.
3.3 Methodology
We next describe our active approach to observe probable DNS spoofing. This is challenging because, in the worst case, spoofers can arbitrarily intercept and reply to traffic, so we use multiple methods (subsection 3.3.2). Moreover, we classify spoofing mechanisms (subsection 3.3.3) from what we observe in historical data. Finally, we identify the spoofing organizations from the server IDs they return (subsection 3.3.4 and Table 3.6). We caution that our methods are best effort, and not fool-proof against a sophisticated adversary.
3.3.1 Targets and Queries
Our goal is to identify spoofing in a DNS system with IP anycast. In this chapter we study the Root DNS
system because it is well documented. Our approach can apply to other, non-root DNS anycast systems,
provided we have access to distributed VPs that can query the system, and the system replies to server-id
queries (e.g. DNS CHAOS-class), ping, and traceroute. In principle, our approach could work on any-
cast systems other than DNS, provided they support a query that identifies the server, as well as ping and
traceroute.
We probe from controlled vantage points (VPs) that can initiate three kinds of queries: DNS, ping, and
traceroute. We use RIPE Atlas probes for our vantage points since they provide a public source of long-term
data, but the approach can work on other platforms that make regular queries. In practice, recursive resolvers
communicate directly with nameservers on behalf of web clients, so these VPs represent recursive resolvers.
For each VP, we first examine basic DNS responses with Server IDs to detect overt spoofers with false-looking server IDs; second, we test the timing of replies to search for covert spoofers and detect covert
delayers, adversaries who process and forward legitimate replies. Third, we combine information from all
three types of query responses to distinguish the spoofing mechanisms used by the spoofer.
For each hour we observe, we analyze all three datasets (DNS, ping, traceroute). In that hour, DNS
and ping have 15 observations, and traceroute has two observations. (See how we sample data over time in
subsection 3.4.1.).
Our targets are authoritative DNS servers using IP anycast. DNS has three methods to identify server ID:
CHAOS-class hostname.bind [91], id.server [21], and NSID [8]. Each returns a server-specific string,
which we call the Server ID. We use hostname.bind because it is supported on all root servers from 2014 to today.
We identify latency via ICMP echo request (ping) to the service address. We also identify penultimate
hops of the destination from traceroute.
3.3.2 Finding Spoofed DNS responses
We examine server ID and ping and DNS latency to identify overt spoofers and covert delayers.
3.3.2.1 Detecting Overt Spoofers By Server ID
We detect overt spoofers because they use Server IDs that differ from what we expect.
DNS root operators use server IDs that follow an operator-specific pattern. Often they indicate the
location, a server number, and the root letter. For example, A-root operators have a naming convention
where the Server ID starts with nnn1-, followed by three letters representing a site/city, and ends with a number, with examples like nnn1-lax2 and nnn1-lon3. Other root letters follow similar patterns.
By contrast, overt spoofers use other types of names, often with their own identities. Examples include:
sawo, hosting, or chic-cns13.nlb.mdw1.comcast.net, 2kom.ru.
We build a list of regular expressions that match replies from each root operator, based on what we observe and the known sites listed at root-servers.org. We find that server IDs defined by each DNS root operator provide a reliable way to detect spoofing, since our study of years of data shows operators tend to keep their server IDs in similar formats across multiple sites. Also, far fewer vantage points receive atypical server IDs than valid server IDs. In section 3.5, we show that recognizing spoofing by atypical server IDs is reliable.
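As a concrete illustration, the sketch below flags overt spoofers whose Server IDs do not match a letter's naming convention; the regular expressions are simplified examples derived from the patterns described above, not the complete list we build from root-servers.org.

    import re

    # A minimal sketch of overt-spoofer detection by Server ID. The patterns
    # below are illustrative examples of operator conventions (A-Root's
    # "nnn1-<site><n>" convention is described above); real patterns come from
    # observed replies and the site list at root-servers.org.
    VALID_ID = {
        "a": re.compile(r"^nnn1-[a-z]{3}\d+$"),                        # e.g. nnn1-lax2
        "c": re.compile(r"^[a-z]{3}\d*[a-z]\.c\.root-servers\.org$"),
    }

    def is_overt_spoof(letter, server_id):
        pattern = VALID_ID.get(letter)
        if pattern is None:
            return False                 # no pattern known: cannot judge
        return pattern.match(server_id.strip().lower()) is None

    print(is_overt_spoof("a", "nnn1-lon3"))                          # False (valid)
    print(is_overt_spoof("a", "chic-cns13.nlb.mdw1.comcast.net"))    # True (atypical)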
3.3.2.2 Detecting Covert Delayers with Latency Dierence
Although regular expressions can identify spoofers that use obviously different Server IDs, covert spoofers could reuse known Server IDs to hide their behavior.
We look for covert spoofers by comparing DNS and ping latency (as described below), assuming that
a covert spoofer will intercept DNS but not ICMP. While we find delay differences, in all cases we see that sites with a delay difference actually pass the query to the authoritative server. We therefore call what we
identify a covert delayer.
Our test for covert delayers is to compare DNS and ICMP latency for sites that have a good-appearing
server ID. We compare DNS and ICMP latency by considering all measurements of each type for one
hour. (We use multiple measurements to tolerate noise, such as from queueing delay, in any given obser-
vation.) For each group of RTTs, we take their median values ($\mathrm{median_{dns}}$, $\mathrm{median_{ping}}$) and median absolute deviations ($\mathrm{mad_{dns}}$, $\mathrm{mad_{ping}}$) in an hourly window. We exclude measurements that observe catchment changes (based on Server IDs that indicate another location), although we know they are rare [95]. We then define the difference as $\Delta = |\mathrm{median_{dns}} - \mathrm{median_{ping}}|$ and require three checks:

\begin{equation}
\Delta > 0.2\,\min(\mathrm{median_{dns}}, \mathrm{median_{ping}}), \quad\text{and}\quad
\Delta > 3\,\max(\mathrm{mad_{dns}}, \mathrm{mad_{ping}}), \quad\text{and}\quad
\Delta > 10\ \mathrm{ms}
\tag{3.1}
\end{equation}
The comparison of medians looks for large (20%), stable differences in latency. The change also must exceed the median absolute deviation to avoid overreacting to noisy measurements. Finally, the check for 10 ms avoids differences that are around measurement precision. These specific thresholds are based on our evaluation of the data, and the sense that 10 ms is well beyond normal jitter. Potentially, a sophisticated adversary could intentionally increase response latency in an effort to bypass these three checks; however, they cannot reduce latency.
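A minimal sketch of this test, applied to one hour of RTT samples from one VP, follows; the helper names are ours, and the thresholds are exactly those of Equation 3.1.

    from statistics import median

    # A minimal sketch of the covert-delayer test in Equation 3.1, applied to one
    # hour of DNS and ping RTTs (milliseconds) from one VP.
    def mad(values):
        m = median(values)
        return median(abs(v - m) for v in values)

    def is_covert_delayer(dns_rtts, ping_rtts):
        med_dns, med_ping = median(dns_rtts), median(ping_rtts)
        delta = abs(med_dns - med_ping)
        return (delta > 0.2 * min(med_dns, med_ping)      # large relative difference
                and delta > 3 * max(mad(dns_rtts), mad(ping_rtts))  # beyond noise
                and delta > 10)                           # beyond measurement precision

    dns = [85, 84, 86, 85, 87, 85]     # DNS noticeably slower than ping
    ping = [30, 31, 29, 30, 30, 32]
    print(is_covert_delayer(dns, ping))   # True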
While this test is designed to detect covert spoofing, in practice (details in section 3.5) we see that most cases with a large $\Delta$ pass the query on to the authoritative server. The majority of such queries have a larger DNS latency than ping latency, implying the DNS queries are being processed differently by a third party. In this chapter, we do not consider such interference as DNS spoofing, but consider these cases covert delayers. We therefore count them as valid, non-spoofers in all of section 3.4 and Table 3.4.
3.3.3 Identifying Spoofing Mechanisms
Once we detect a spoof, we next identify the spoofing mechanism as anycast or non-anycast (injection or proxy, from subsection 3.2.2).
Spoofers can use anycast to intercept DNS by announcing the same prefix as the official DNS servers. Anycast will affect not only DNS queries, but all traffic sent to the prefix being hijacked. Other spoofing mechanisms typically capture only the DNS traffic.
When we look at traceroutes to the site, a penultimate hop that diers from known legitimate sites indi-
cates anycast-based spoofing. We use the list at root-servers.org to identify known sites. This method is
from prior work [36]. We consider a VP is under influence of anycast spoof when it meets two conditions.
First, the penultimate hop of its traceroute should not match that of any VP with an authentic reply, sug-
gesting the site that the query goes to is not any authentic site. Second, its DNS RTT is the same as its Ping
RTT, suggesting in fact anycast is capturing all trac, not just DNS.
DNS injection is a second way to spoof DNS [24]. For DNS injection, the spoofer listens to DNS
queries (without diverting traffic), then replies quickly, providing an answer to the client before the official
answer arrives from an authoritative server. The querying DNS resolver accepts the first, spoofed reply and
ignores the additional, real reply.

DNS injection has two distinguishing features: responses are fast and doubled [88]. Without a platform
that preserves multiple responses for one query, it is hard to recognize the injection mechanism from
historical data.
Proxies are the final spoofing mechanism we consider. A DNS proxy intercepts DNS traffic, then diverts
it to a spoofing server. Unlike DNS injection, the original query never reaches the official server because
the proxy simply drops the query after returning its answer.

Using historical data collected from the VP side, we can only differentiate anycast from non-anycast
mechanisms. With further validation from Root DNS server-side data in section 3.5, we can differentiate
injection from proxy.
3.3.4 Spoofing Parties from Server IDs
Spoofing is carried out by multiple organizations; we would like to know who they are and identify unique
spoofing parties. We use patterns of Server IDs to identify overt spoofers, and after knowing who they are,
we classify them into seven categories based on their functions.
We identify unique spoofing parties through several steps. First, for Server IDs that have a recognizable
DNS name, we group them by common prefix or suffix (for example, rdit.ch for njamerson.rdit.ch and
ninishowen.rdit.ch). We handle Server IDs and IP addresses the same way, after looking up their reverse
DNS name. We manually identify and group recognizable company names. Remaining Server IDs are
usually generic (DNS13, DNS-expire, etc.), for which we group by the AS of the observing VPs.
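A rough sketch of this grouping step is below; the use of the last two DNS labels as the grouping key is a simplification of the prefix/suffix matching described above, and the field names are illustrative.

    from collections import defaultdict

    def grouping_key(server_id, vp_asn):
        """Group by a registered-domain-like suffix when the ID looks like a DNS
        name, otherwise fall back to the AS of the observing VP (generic IDs)."""
        labels = server_id.strip().lower().split(".")
        if len(labels) >= 2 and all(labels):
            return ".".join(labels[-2:])        # e.g., njamerson.rdit.ch -> rdit.ch
        return "AS%s" % vp_asn                   # e.g., DNS13, DNS-expire, ...

    def cluster_spoofers(observations):
        """observations: iterable of (server_id, vp_asn) pairs from spoofed replies."""
        groups = defaultdict(set)
        for server_id, vp_asn in observations:
            groups[grouping_key(server_id, vp_asn)].add(server_id)
        return groups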
We classify identifiable spoofing parties by examining their websites. Each class is based on the goal or
function of the organization or person. Table 3.6 shows seven classes of different spoofing parties (ISPs,
DNS tools, VPNs, etc.), with specific examples. Our work maintains a full table of spoofers seen
over the years and their corresponding webpages. An example table is in subsection 3.4.4, with clickable example
URLs showing the identity of spoofing organizations.
3.4 Results
We next study six years and four months of Root DNS data to look for spoofing. First, we study the quantity
of DNS spoofing; we show it is uncommon but is becoming more common over time. Second, we study
the locations and identities of the spoofers. Finally, we discuss whether spoofing always provides a faster
response than the authoritative servers.
3.4.1 The Root DNS system and Datasets
We observe the Root DNS system using RIPE Atlas [70].
type         frequency
DNS          every 240 s
Ping         every 240 s
Traceroute   every 1800 s

Table 3.2: Query detail.
Background about the Root DNS system: Root DNS is provided by 13 independently operated services,
named A-root to M-root [82]. All of the root letters use IP anycast (H-Root's sites are primary/secondary,
so only one is visible on the general Internet at a time), where locations, typically in different
cities, share a single IP address. The number of locations for each letter varies, from a few (2 for H, 3 for B,
fewer than 10 for C and G, and 28 for A) to hundreds (D, F, J, and L all operate over 100) as of August 2019 [82].
We use the list of anycast locations at root-servers.org as ground truth.
RIPE Atlas: Our observations use public data collected by RIPE Atlas probes from 2014-02 to 2020-05
(six years and four months). RIPE Atlas has standard measurements of DNS server ID (hostname.bind),
ICMP, and traceroute (UDP) to each Root Letter for most of this period. Exceptions are that G-Root never
responds to ICMP, and E-Root data is not available from 2014-02 to 2015-01.
We show the frequency of each type of query in Table 3.2. We sample each of the three types of datasets in
a random one-hour window. Over the multi-year period, we extract 4 observations each month.
Each measurement is a randomly chosen hour in a different week of the month, with the first in the 1st to
7th of the month, the second in the 8th to the 14th, then the 15th to the 21st, and finally the 22nd to the end of the
month ("weeks" are approximate, with the fourth week sometimes longer than 7 days). We choose the same
hour for all letters, but the hour varies in each week to avoid bias due to time-of-day.
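As a concrete sketch (not the exact scripts we used), the per-month sampling can be expressed as picking one random starting hour inside each approximate week:

    import random
    from datetime import datetime, timedelta

    def sample_hours(year, month):
        """Pick one random starting hour in each approximate week of a month."""
        week_starts = [1, 8, 15, 22]
        next_month = datetime(year + month // 12, month % 12 + 1, 1)
        samples = []
        for i, day in enumerate(week_starts):
            start = datetime(year, month, day)
            end = datetime(year, month, week_starts[i + 1]) if i < 3 else next_month
            hours = int((end - start).total_seconds() // 3600)
            samples.append(start + timedelta(hours=random.randrange(hours)))
        return samples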
The exact number of VPs we use varies, since VPs sometimes disconnect from the RIPE Atlas infrastructure
(VPs are individually owned by volunteers), and RIPE adds VPs over the measurement period. The number
of VPs, ASes, and countries measured over time is shown in Table 3.3. Our results are limited by the coverage
of RIPE Atlas VPs.
Year             2014   2015   2016   2017    2018    2019    2020
Vantage Points   7473   9223   9431   10311   10336   10492   10988
ASes             2616   3322   3370   3633    3605    3590    3397
Countries         168    184    186    183     181     180     175

Table 3.3: Data Coverage.
2020-05-03
active VPs             10882   100.00%
  timeout                260     2.39%
  answered             10622    97.61%
    valid              10430    95.85%
      covertly-delayed    19     0.17%
    spoofed              192     1.76%

Table 3.4: DNS spoof observations.
3.4.2 Spoofing Is Not Common, But It Is Growing
In this section, we describe how much spoofing occurs today and how spoofing has trended over the
six years and four months we study.
3.4.2.1 Spoofing is uncommon
Spoofing today is uncommon: 192 of the 10882 responding VPs (about 1.76%) are spoofed. Table 3.4
shows that on 2020-05-03 about 95.85% of all VPs received a valid answer (of which 19 VPs, 0.17%,
experience delayed DNS answers), while 1.76% of VPs are overtly spoofed. More VPs (2.39%) time out than see
spoofing.
Most spoofers spoof on all root letters. Figure 3.1 shows the CDF of how many letters are spoofed for each
VP. Of the VPs that see spoofing, more than 70% see spoofing on all root letters
throughout the six years we observed; in 2020, 83% of spoofed VPs experience spoofing across all
root letters.
Figure 3.1: CDF of the number of root letters seen overtly spoofed per VP (2014 not shown because E-root data is missing).
Figure 3.2: Fraction of spoofed VPs over all available VPs at each measurement date, overall and per root letter.
Figure 3.3: Fraction of spoofed VPs among a fixed set of 3000 VPs, 2014-02 to 2020-05.
3.4.2.2 Growth
We see an increasing amount of spoofing over the six years and four months we study. Figure 3.2 shows
the fraction of VPs that see any root servers spoofed (the thick black line), and for each root letter spoofed
(colorful dots). Although we see some variations from day to day, the overall fraction of spoofed VPs rises
from 0.007 (2014-02-04) to 0.017 (2020-05-03), more than doubling over six years.
Because the set of active VPs changes and grows over time, we confirm this result with a fixed group
of 3000 VPs that occur most frequently over the six years, shown in Figure 3.3. This subset also more than
doubled over the six years, from a slightly lower baseline (0.005) to 0.014, confirming our
findings. In subsection 3.4.3, we show that location affects the absolute fraction of spoofing.
Figure 3.4: Fraction of spoofing per country (green shades: <0.1, <0.2, <0.3), spoofed but with under-sampled VPs (pink), not spoofed (white).
Figure 3.5: Locations of spoofed VPs, slightly jittered. More recent years are darker, and each location is marked with the last digit of the observation year. Overlapping digits indicate spoofs over multiple years.
3.4.3 Where and When Are These Spoofers?
We next consider where spoofing happens. If spoofing is legally required in some countries, we expect
spoofing to be concentrated there.
                       VPs
Country          spoofed   active    %
Indonesia             23       87   26
Iran                  48      198   24
Tanzania               2       10   20
Albania                8       40   20
Philippines            4       26   15
Ecuador                2       15   13
Bosnia & Herz.         2       18   11
China                  3       27   11
Egypt                  1       10   10
Lebanon                2       20   10

Table 3.5: Countries with the largest fraction of VPs experiencing spoofing in 2019.
Figure 3.4 shows the fraction of VPs that see spoofing, by country (countries with fewer than 10 active
VPs that are spoofed are listed as "insufficient" and are excluded from our ranking). From 2019-01 to 2019-08,
we see spoofing is most common in the Middle East and Eastern Europe, Africa, and Southeast Asia,
but we see examples of spoofing worldwide. The top ten countries by fraction of spoofing are in Table 3.5.
Most areas show spoofing activity over multiple years. Figure 3.5 shows our six years (excluding 2020)
of spoofing occurrences, with the last digit of each year as the symbol and more recent years in darker shades.
(Points in oceans are actually on islands.) Labels that overlap show VPs that are spoofed multiple times
over different years.
3.4.4 Who Are the Spoofing Parties?
Goals of spoofers (subsection 3.2.1) include faster responses, reduced traffic, or censorship. With more than
1000 root instances, a strong need to spoof for performance seems unlikely, although an ISP might spoof
DNS. We next identify spoofing parties and classify them, to perhaps infer their motivation,
using the methodology in subsection 3.3.4.
Type                Example URLs                 Number of clustered spoofers
ISPs                skbroadband.com, 2kom.ru      32 (16.16%)
network providers   softlayer.com, level3.com     24 (12.12%)
education-purpose   eenet.ee                       1 (0.5%)
DNS tools           dnscrypt.eu                    1 (0.5%)
VPNs                nordvpn.com                    1 (0.5%)
hardware            eero.com                       1 (0.5%)
personal            yochiwo.org                    1 (0.5%)
unidentifiable      DNS13, DNS-Expire            137 (69.19%)

Table 3.6: Classification of spoofing parties.
Table 3.6 shows the spoofing parties we found. More than two-thirds, spanning 137 ASes, show generic
Server IDs and are unidentifiable. Of identifiable spoofers, most are end-user (eyeball) ISPs (32, about half)
or network providers (24 providing cloud, datacenter, or DNS service).
Sometimes spoofing parties do not affect all VPs in the same AS. T-Mobile spoofs a VP in Hungary, but
not elsewhere. In Comcast, 2 of the 322 VPs see spoofing, and in FrontierNet, 1 of the 28 VPs sees spoofing.
Identifying spoofing parties suggests possible reasons for spoofing: the ISPs may be improving performance,
or they may be required to filter DNS. The five classes with one example each are likely spoofing out of
professional interest, because those organizations work with DNS or provide VPNs.
3.4.5 How Do Spoofing Parties Spoof?
We next examine spoofing mechanisms, following subsection 3.3.3.
Figure 3.6 shows how many VPs see non-anycast (injection or proxy, lightest area, on the bottom) or
anycast spoofing (darkest, on top), from 2014-02 to 2020-05.
Figure 3.6: Number of VPs affected by each spoofing mechanism (anycast vs. non-anycast) over time.
We see that non-anycast (injection or proxy) is by far the most popular spoofing mechanism, accounting
for 87% to 100% of the VPs that see spoofing. We believe that non-anycast methods are popular because
they do not involve routing and can target a specific group of users; they can be deployed as a "bump-in-
the-wire". Anycast is the least popular: because the anycast catchment depends on BGP, spoofers cannot
precisely control whom they spoof.
We see 2 VPs that alternate between overt spoofing and authentic replies, often with timeouts in
between. We speculate these VPs may have a mechanism that sometimes fails, e.g., a slow DNS injection,
or a catchment change between the authoritative and the third-party anycast site.
3.4.6 Does Spoofing Speed Responses?
Finally, we examine if spoofing provides faster responses than authoritative servers, since most of our iden-
tifiable spoofing parties are ISPs.
For each overt spoofer, we compare the median DNS response time with the median ping time to the
authoritative root on 2019-08-24. In Figure 3.7, we see that spoofing is almost always faster: only about
15% of spoofed VPs see equal or worse latency in spoofed answers.
Figure 3.7: CDF of RTT_ping minus RTT_dns from spoofed VPs on 2019-08-24, shown per root letter.
This result is consistent with spoofing occurring near the VP. In general, the amount of performance improvement is
inversely related to the size of a root letter's anycast footprint: letters with more anycast sites see less improvement,
while letters with only a few anycast sites (e.g., H-root and B-root) see much faster spoofed responses. This
result is what one would expect for anycast latency [87], and is consistent with overt spoofers
improving user performance. Except for A-, B-, H-, and M-root, we also see that half of the VPs see
less than 20 ms of latency improvement from spoofers, showing that even though spoofers improve performance,
half of the VPs would still do well without them.
3.5 Validation
In this section, we validate our spoofing detection by whether the query receives an answer from the
authoritative B-Root server. First, we show our detection method achieves a true positive rate over
0.96. Second, we show that, beyond the spoofing we detect, about 13 VPs (0.14%) may
experience covert delaying of their DNS queries, and in most of these cases the DNS reply is slower than the
ping reply. Third, we show that proxying is now far more popular than injection: packets of
about 98% of spoofed queries are dropped on 2019-01-10.
3.5.1 Validation Methodology
Our spoof detection looks at traffic from VPs as DNS clients (section 4.4). We validate it by looking at the
destination side, from the authoritative server. We expect queries from VPs that are intercepted and spoofed
to not reach the server, while regular queries will appear in server traffic (unless there is packet loss or
timeout). For DNS injection, we expect the query to reach the server and two replies to return to the VP
(first from the injector, then from the authoritative server).

To validate our spoof detection we use server-side data from one week (2019-01-10 to -16, the only week
available) of B-Root [9]. That dataset uses host-only anonymization where the low 8 bits of the IPv4 address
are scrambled, so we look for matches that have the same query type (DNS) and field name (hostname.bind)
from the same IPv4 /24 prefix as the public address of each VP. We also require that the timestamps be
within 4 minutes, the RIPE Atlas querying interval. (There are always multiple queries per second, so we
cannot match by timestamp alone.) We use RIPE queries that are made directly to B-Root's IP address, not
queries made through the VP's recursive resolvers.
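A sketch of this matching, under the assumptions above (same query name, same /24, timestamps within a 4-minute window), might look like the following; the record fields are illustrative, not the actual dataset schema.

    import ipaddress
    from datetime import timedelta

    WINDOW = timedelta(minutes=4)          # the RIPE Atlas querying interval

    def same_slash24(ip_a, ip_b):
        """True if two IPv4 addresses fall in the same /24 (host bits are scrambled)."""
        return (ipaddress.ip_network(f"{ip_a}/24", strict=False) ==
                ipaddress.ip_network(f"{ip_b}/24", strict=False))

    def query_reached_broot(vp_query, broot_records):
        """vp_query and each record carry .src_ip, .qname, and .timestamp."""
        for rec in broot_records:
            if (rec.qname.lower() == vp_query.qname.lower() and      # hostname.bind
                    same_slash24(rec.src_ip, vp_query.src_ip) and
                    abs(rec.timestamp - vp_query.timestamp) <= WINDOW):
                return True
        return False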
We apply our detection method during the same week to match the B-Root dataset period. Following
subsection 3.4.1, we select four random full hours from each day of the week, each starting at a random offset
(smaller than 3600 seconds) into the day. We evaluate each full hour. We compare queries sent from
the RIPE Atlas VPs that get a response or timeout to those seen at B-Root. For each VP in an hourly window
where some queries time out and some succeed, we examine only the successes; when all queries time out,
we classify that VP as timed out. We classify a timeout or spoof as correct if it does not show up at B-Root,
and any other query as correct if it does appear in B-Root traffic.
2019-01-10T03:52:49Z    sent   received   true positive rate
active VPs              8981       8449   -
  timeout                241         47   0.81
  spoofed                142          3   0.98
    non-anycast          140          3   0.98
    anycast                2          0   1
  not spoofed           8598       8399   -

Table 3.7: How many queries reach B-Root based on spoof detection, for a sample hour.
                         True-Positive Rate
detection    range [min, max]   0.25th   0.50th   0.75th quantile
timeout      [0.79, 0.84]       0.8071   0.8198   0.8249
spoof        [0.96, 0.99]       0.9719   0.9787   0.9859
not spoof    [0.90, 0.99]       0.9138   0.9297   0.9534

Table 3.8: The range of true positive rates of spoof detection from 2019-01-10 to 2019-01-16.
A false positive is a query that is detected as spoofed where we can see that it actually reaches B-Root.
If a query does not reach B-Root yet receives an answer, that query is definitely spoofed (a true positive).
Because of DNS injection, though, a spoofer may reply quickly to a query but allow it to proceed to
B-Root, where it then generates a second reply. The scenario of DNS injection means that we cannot get
a definitive count of false-positive spoof detections, even with server-side data. These potential false
positives therefore place an upper bound on the actual false positive rate for spoofing; that upper bound is
0.02 (1 − 0.98) in Table 3.7.
3.5.2 Validation of Overt Spoof Detection
We first verify detections of overt spoofers (subsubsection 3.3.2.1). We expect queries that see overt spoofing
to not reach B-Root. Since overt spoofing is obvious with atypical server IDs, we expect a high true positive
rate.
Table 3.7 shows a representative hour (other sample hours are similar), and Table 3.8 shows the range
of true positive rates for all 28 sample hours over the week.
2019-01-10T03:52:49Z         detected   received   mean (RTT_dns − RTT_ping)
covert delayers                    13         13   -
  RTT_dns > RTT_ping               12         12    40.52 ms
  RTT_dns ≤ RTT_ping                1          1   -10.25 ms

Table 3.9: Covert delayer validation: how many reached B-Root, with the mean difference for cases where DNS or ICMP is faster.
The week of samples in Table 3.8 shows that spoofing detection is accurate, with a true positive rate
consistently around 0.97. Examining a sample hour starting at 2019-01-10 3:52:49 GMT in Table 3.7, 142
of the 8981 VPs see spoofing. For almost all VPs that see spoofing (139 of the 142), their queries do not
arrive at B-Root, making a true positive rate over 0.98. For the mechanism, 140 of the 142 VPs experience
either proxy or injection, and only 3 of the 140 reached B-Root, suggesting possible DNS injection
(a proxy drops the packet, so the query cannot reach B-Root). The remaining 2 VPs suggest third-party anycast,
and we confirm their queries are not seen at B-Root.
When examining VPs that time out, the true positive fraction is around 0.82, with a wider range from
0.79 to 0.84 (see Table 3.8). Some queries that time out at the VP still reach B-Root. It is possible that a
query reached B-Root and was answered, but the VP still timed out, perhaps because the reply was dropped
by a third party. The default timeout of a RIPE Atlas probe is 5 s. In our example hour (Table 3.7), we see
that of the 241 VPs that time out, 47 have queries that reach B-Root, making a true positive rate of 0.81.
There are 199 VPs (about 2%) in Table 3.7 that were neither spoofed nor timed out, but for which we
did not find a match on the B-Root side. It is possible that the IP-address metadata of those VPs is outdated
or that those VPs are multi-homed, so their queries arrive at B-Root from an IP address we do not know about.
3.5.3 Validation of Covert Delayers
We now examine covert delayers (subsubsection 3.3.2.2). Table 3.9 shows analysis from one sample hour
(other hours were similar).
First, we see that in all cases with a large delay, the queries do get through to B-Root. We originally
expected differences in time to indicate a covert spoofer, but these networks pass the query to the
authoritative server without interfering with it.

However, we see there is a very large delay for the DNS replies. Most of the time (12 of 13 cases)
the DNS latency is larger than the ICMP latency, and the median difference is 40 ms. This consistent, large delay suggests that
the difference is not just queueing delay or other network noise, and it is possible that a third party is
processing the traffic.
Although we do not have server-side data for other letters, we do see that about one-third of VPs that
experience covert-delaying for B-Root also see covert-delaying with at least one other letter.
Finally, in one case we see a 10 ms delay of ICMP relative to DNS. It is possible that this delay is due to
a router processing ICMP on the slow path.
3.5.4 Non-Anycast Mechanism: Proxy or Injection?
Server-side data also allows us to distinguish DNS proxies from DNS injection. DNS injection will respond
quickly to the query while letting it pass through to the authoritative server (on-path processing), while a
DNS proxy will intercept the query without passing it along (in-path processing).

Table 3.7 shows that queries from 139 of 142 VPs (98%) detected as spoofed never reach B-Root,
suggesting a DNS proxy instead of injection. The remaining 3 VPs (only 2%) are likely using DNS
injection. (Unfortunately, we cannot confirm injection with a double reply at the receiver because we cannot
modify the RIPE Atlas software.)
3.6 Related Work
Our work is inspired by prior work in improving DNS security, anycast location-mapping, and DNS spoofing
detection.
Several groups have worked to improve or measure DNS security. DNSSEC provides DNS integrity [31].
Recent work by Chung et al. [18] shows under-use and mismanagement of DNSSEC in about 30% of
domains. This work indicates that securing DNS involves actions by multiple parties. Several groups explored
DNS privacy and security, suggesting use of TLS to improve privacy [100], and methods to counter injection
attacks [30]. Others have identified approaches to hijack or exploit DNS security [64, 90, 92, 93], or studied
censorship and multiple methods to spoof DNS [33, 42]. Our work considers a narrower problem, and explores
it over more than six years of data: we study who, where, and how DNS spoofing occurs. Our work
complements this prior work by motivating deployment of defences. Liu et al. look at DNS spoofing when
users use public DNS servers [53]. This work points out that interception happens about 10 times more often than
injection in TLD DNS queries. This finding agrees with our conclusion that proxies (interception) account
for nearly all spoofing we see. Our work goes beyond theirs to characterize who the third parties are, and
to study longitudinal data.
Several groups have studied the use of anycast and how to optimize performance. Fan et al. [36] used
traceroute and open DNS resolvers to enumerate anycast sites, as well as Server ID information. They
mention spoofing, but do not study it in detail. Our work also uses Server ID and traceroute to study
locations, but we focus on identifying spoofers, and how and where they are. Other prior work uses Server
ID to identify DNS location to study DNS or DDoS [63, 87, 95].
Work of Jones et al. [48] aims to find DNS proxies and unauthorized root servers. They study B-root
because it was unicast at the time, making it easy to identify spoofing. Our work goes beyond this work to
study spoofing over all 13 letters over more than six years, and to identify spoofing mechanisms.
Closest to our work, the Iris system is designed to detect DNS manipulation globally [66]. They take
on a much broader problem, studying all methods of manipulation across all of the DNS hierarchy, using
open DNS resolvers. Our work considers the narrower problem of spoofing (although we consider three
mechanisms for spoofing), and we study the problem with active probing from RIPE Atlas. Although our
approach generalizes, we analyse only the DNS Root.
Finally, we recently became aware that Wessels is currently studying spoofing at the DNS Root with
RIPE Atlas [98]. His work is in progress and not yet generally available, but to our knowledge, he does
not look at spoofing mechanisms and has not considered six years of data.
3.7 Summary
This chapter developed new methods to detect overt DNS spoofing and some covert delayers, and to identify
and classify parties carrying out overt spoofing. In our evaluation of about six years of spoofing at the
DNS Root, we showed that spoofing is quite rare, affecting only about 1.7% of VPs. However, spoofing
is increasing, growing by more than 2× over more than six years. We also show that spoofing is global,
although more common in some countries. By validating against logs of the authoritative B-Root server, we show
that our detection method has a true positive rate of at least 0.96. Finally, we show that proxies are a more
common method of spoofing today than DNS injection.
We draw two recommendations from our work. First, based on the growth of spoofing, we recommend
that operators regularly look for DNS spoofing. Second, interested end-users may wish to watch for spoofing
using our approach.
In this chapter, we have addressed anycast security. Next, we look at the third part of
our thesis statement, improving anycast latency.
Chapter 4
Anycast Latency in A CDN
In this chapter, we propose Bidirectional Anycast/Unicast Probing (BAUP), a new approach that detects
anycast routing problems by comparing anycast and unicast latencies. Our design and application of this
methodology lead to a tangible latency optimization of a commercial CDN.

The design and application of BAUP significantly reduce the latency of a CDN. BAUP measures
both anycast and unicast latency to help us identify problems experienced by clients, triggering traceroutes
to localize the cause and suggest opportunities for improvement. Evaluating BAUP on a large, commercial
CDN, we show that problems happen to 1.59% of observers, and we find multiple opportunities to improve
service. Prompted by our work, the CDN changed peering policy and was able to significantly reduce
latency, cutting median latency in half (40 ms to 16 ms) for regions with more than 100k users.
This study of anycast latency in a commercial CDN supports our thesis statement by confirming that
latency in anycast infrastructure is improvable. First, our methodology, BAUP, is suitable for anycast operators
to diagnose and improve the latency of their current infrastructure: BAUP finds improvable latency and
the current problem simply by comparing anycast and unicast latency and taking bidirectional traceroutes. Second,
applying BAUP to the Edgecast CDN (by Verizon Digital Media Services) leads to a tangible latency
reduction, from 40 ms to 16 ms, for regional users.
Part of this chapter was published in The Network Traffic Measurement and Analysis Conference (TMA)
2020 [94].
4.1 Introduction
Content-Delivery Networks (CDNs) and Domain Name System (DNS) operators use globally distributed
PoPs (Points of Presence) to bring content closer to users. Ideally, users are directed to the PoP that can
provide the lowest possible latency for the desired content. Many CDNs [16, 20, 32, 39] and DNS services
(such as the DNS root [83]) use IP anycast to direct users to PoPs. With anycast, services are announced
on one IP address (or block of addresses), and Internet routing associates users to PoPs by BGP. Border
Gateway Protocol (BGP) is influenced by routing policies set by the CDN and ISPs [14]. Previous evalua-
tions of CDNs suggest that anycast does not always find the lowest latency [16, 50], and studies of anycast
infrastructure suggest that BGP does not always select lowest latency [40, 52, 54, 87].
While several large CDNs directly serve 1k to 2k ASes [3, 15, 17] and there are more than 1000 root
DNS instances [83], with more than 67k ASes on the Internet [5] the majority are served indirectly through
other ASes. In addition, some CDNs and DNS providers operate fewer, larger PoPs, or cannot deploy in
some ASes or countries due to financial or legal constraints, so optimizing performance across multiple
ASes is essential.
It is challenging to detect problems in IP anycast deployments, much less identify root causes and deploy
corrections. Problem identification is difficult because one must distinguish large latencies that are
due to problems (say, a path that misses a shorter route) from latency that is inherent (for example, users
connecting via satellite or over long distances). Root causes are challenging to find because even though
measurement can provide latencies and paths, we do not know why problems occur. Finally, once problems
have been identified (for example, a provider known to have frequent congestion), the CDN must determine
solutions to those problems. Solutions are not always possible and can be difficult to determine, particularly
when the problem is multiple hops away from the CDN.
Our first contribution is to design Bidirectional Anycast/Unicast Probing (BAUP), a method to evaluate
anycast performance for CDNs and DNS from both inside and outside. BAUP allows operators to detect
potential performance problems caused by congestion or unnecessary routing detours, and to learn of an alternative,
better route. BAUP first detects potential problems from differences in anycast and unicast latency (section 4.4).
When a potential problem is detected, it then classifies the problem with traceroutes, checking
both the forward and reverse paths for both anycast and unicast between vantage points and the CDN, while
considering potential path asymmetry [26]. We show that this information allows us to identify slow hops
and circuitous paths, two classes of problems that occur in anycast CDNs (subsection 4.5.1).
Our second contribution is to evaluate how often performance problems occur for a large, commercial
CDN (section 4.5). We see that about 1.59% of observers show potential latency problems. While this
number seems small, the CDN implemented changes in response to our work and saw improvements across
91 ASes in 19 countries, affecting more than 100k users.
Our final contribution is to show that BAUP can result in noticeable improvements to service. BAUP is
a tool to help CDN and DNS operators detect and correct congestion and unnecessary routing detours in
complicated and changing global routing, ultimately improving tail latency for their users [28]. We find
three such cases where solutions are possible in our candidate CDN to improve performance. While the
constraints of operational networks mean that routing changes cannot always be made, our work prompted
one set of peering policy changes in the CDN deployment (section 4.6).

After this change, latency was significantly reduced in regions with a large number of users, falling by
half, from 40 ms to 16 ms. This large improvement was in tail latency: before the improvement, the
observers among the 100k users that improved showed median latency at the 86th percentile of all users.
The measurements in this chapter use RIPE Atlas. While we cannot make CDN-side data public, all
external measurements toward the CDN are publicly available [80]. To preserve privacy, IP addresses in this
chapter use prefix-preserving anonymization [4, 35], and we replace Autonomous System (AS) numbers
with letters.
4.2 Problem Statement
Two common problems in anycast CDNs are paths that use congested links, and high-latency paths that
take more hops (or higher-latency hops) than necessary. We call using such a path an unnecessary routing
detour. In both cases, end-users experience reduced performance. We can detect both of these problems
with BAUP.

Congested links occur when a path has persistent congestion, so traffic suffers queueing delay. Such
congestion can often occur at internal links, private peerings, or IXPs [29] that have insufficient capacity.

High-latency paths occur when the selected path has larger latency than other possible paths for reasons
other than congestion: typically because it follows more hops or higher-latency hops. Anycast paths are
selected at the mercy of BGP, and while BGP selects paths to minimize hop counts, it does not always provide
the lowest latency, and routing policy can override minimal-hop-count paths.

Other problems, like high-loss links, are outside our scope.
4.2.1 Observations to Find Problems
We define congested links and high-latency paths as problems of interest, but they cannot be directly ob-
served. We next define two more specific behaviors we can actually measure with network traceroutes: a
slow hop and a circuitous path.
Figure 4.1: The four one-way delays measured in BAUP traceroutes: VP to anycast CDN, anycast CDN to VP, VP to unicast CDN, and unicast CDN to VP. All four can differ from each other.
A slow hop is a hop in a traceroute that shows unusually high latency; the specific threshold for
"abnormal" is a function of the path, described in subsubsection 4.4.3.1. Link congestion, long physical distance,
or high latency on the return path can all lead to a slow hop observed in a traceroute. Our measurements
search for slow hops that can be fixed, and try to identify and dismiss distance-based latency that cannot be
improved. We classify slow hops by where they occur: intra-AS and inter-AS slow hops happen inside an
AS and between ASes, respectively. Near-CDN slow hops are a special case of inter-AS slow hops where
the CDN operator can change peering policies directly.

A circuitous path is a high-latency path for which we know a lower-latency path exists. In our
observations, a circuitous path contains different hops from the alternative path and has a longer measured
end-to-end latency.

We find slow hops and circuitous paths by looking for asymmetric A/U latency and by checking
both the forward and reverse paths (considering Internet asymmetry) for both anycast and unicast (Figure 4.1).
We describe our detection methods in section 4.4 and show examples of slow hops and circuitous paths in
subsection 4.5.1. Our goal is to find problems that a CDN can address (subsection 4.4.4), that is, slow hops
or circuitous paths where other routes exist. We call these cases improvable latency.
4.3 RTT Inequality between Anycast/Unicast
To provide context for how our new probing method detects problems, we next explore why anycast
and unicast addresses would ever produce unequal RTTs from the same clients. This question is covered by a
larger question: how can the round-trip time toward one location differ when the sender measures twice,
toward two different IP addresses at that location? The path taken to reach two different
addresses from different BGP prefixes can differ, regardless of the physical location of the destination,
and therefore the round-trip time can also differ. Next, we examine in detail why the route can be
different, as it is determined by two major factors: BGP and network asymmetry.

In particular, we note that a single round trip contains two constituent one-way trips. So when measuring
from a Vantage Point (VP) to a CDN, one can target two different IP addresses at the CDN, the anycast and
the unicast address. Together, there are two potentially different round trips and four one-way trips. Using
Figure 4.1 as an example, the four one-way trips are: VP to anycast CDN (via hop_1 and hop_2),
anycast CDN to VP (by a'_3, with no hops marked in the figure), VP to unicast CDN (via hop_4 and hop_3), and unicast
CDN to VP (by u'_3, with no hops marked in the figure).
BGP determines the route toward the unicast and anycast addresses, and the route can be different for
each address. In Figure 4.1, when the VP connects to the CDN site via its anycast address, the forwarding
path routes to hop_1 first. BGP selects hop_1 based on factors such as AS-path length and local
preference, which in turn may be determined by the originating announcements and subsequent propagation.
The same holds when the VP tries to reach the same CDN site via its unicast address, with hop_4 as the first hop.
Factors such as AS-path length and local preference can vary with the destination address, so this
difference in address may result in hop_1 and hop_4 being different. For the same reason, hop_2 and hop_3 may
vary as well. In fact, the number of hops may also differ in the two forward paths to the anycast and unicast
addresses of one CDN site. So the forward portions of the two round trips may differ.
With asymmetric network routing, the reverse path may differ from the forward path [27], although
the two forward paths being different already suffices to show that the round trips to the unicast and anycast addresses can
differ. Since the two forward paths may differ, and the reverse paths may differ from the forward paths, all four
one-way trips (VP to anycast CDN, anycast CDN to VP, VP to unicast CDN, unicast CDN to VP)
may differ from each other.
4.4 Bidirectional Anycast/Unicast Probing
Bidirectional Anycast/Unicast Probing (BAUP) is a new method to observe slow hops and circuitous routes,
suggesting congested links and high-latency paths that perhaps can be avoided. We use Vantage Points
(VPs) that carry out active latency measurements to anycast and unicast addresses in the CDN, providing
two latency estimates. We detect potential routing problems when a VP sees consistently higher latency on
one of those two paths. Once a potential problem has been detected, we take bidirectional traceroutes (three
in total), including VP to the unicast CDN, unicast CDN to VP, VP to the anycast CDN. This information
helps us identify problems and suggest potential changes to routing that can allow the CDN to improve
performance.
4.4.1 BAUP Measurements
BAUP requires VPs that can carry out active measurements (pings and traceroutes) under our control. We
assume some control at the CDN: we assume each CDN PoP has a unique unicast address in addition to its
shared anycast address, and that the CDN can send traceroutes out of the unicast address. We use VPs from
RIPE Atlas (where they are called "probes") to set up BAUP, and we work with a commercial CDN
network. Our study maximizes the path set between users and the CDN by using all available VPs that do not
have duplicate source IP addresses.
We first identify each VP's catchment in the CDN's anycast network. Methods for such identification
may vary by CDN: DNS services may use NSID queries [7, 37], or one may use a tool like Verfploeter [27].
We determine a VP's catchment by taking traceroutes from the VP to the CDN anycast address and searching
a database of CDN BGP sessions for the final hop in the traceroute that lies outside of the CDN network. If
we find a match, we label the VP as within the catchment of the unique PoP where the BGP session exists.
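As a sketch under these assumptions (a set of CDN-owned prefixes and a map from BGP-neighbor IP to PoP, both hypothetical simplifications of the CDN's internal data), catchment identification scans the traceroute for the last hop outside the CDN and looks it up:

    import ipaddress

    def identify_catchment(traceroute_hops, cdn_prefixes, bgp_sessions):
        """traceroute_hops: ordered hop IPs toward the anycast address.
        cdn_prefixes:    iterable of ipaddress networks owned by the CDN.
        bgp_sessions:    dict mapping BGP-neighbor IP -> PoP name (e.g., "FRA").
        Returns the PoP whose session matches the last non-CDN hop, or None."""
        last_external = None
        for hop in traceroute_hops:
            addr = ipaddress.ip_address(hop)
            if any(addr in net for net in cdn_prefixes):
                break                            # we have entered the CDN network
            last_external = hop
        return bgp_sessions.get(last_external)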
We initially take latency measurements (pings) from the VP to both the anycast and unicast addresses
for the VP's catchment. We use differences in that latency to detect potential improvements, as described
next in subsection 4.4.2.
For VPs that show potential improvement, we traceroute from the VP to the unicast and anycast
addresses in the CDN, and from the CDN's unicast address to the VP. We would like to traceroute from the
CDN's anycast address, but because the anycast addresses are in production, we cannot easily take non-
operational measurements there. We study the path with information about both IP addresses and ASNs
suggested by Route Views [84]. IP addresses and ASNs shown in this chapter are anonymized with a prefix-preserving
method [4, 35].

BAUP thus provides us hop-wise information about three one-way paths, as shown in Figure 4.1: VP to
the CDN's anycast address, VP to its catchment's unicast address in the CDN, and from that unicast CDN
address to the VP.
4.4.2 Detecting Improvable Latency
BAUP detects improvable latency by finding asymmetry between the A- and U-latency. We consider
U-latency smaller than A-latency the indicator of improvable latency, since CDN users reach the CDN over the
anycast (A-) path.

We can define several types of latency from our measurements (Figure 4.1). A-probing, the VP-to-anycast
latency, defines RTT_A as a + a', where a is the end-to-end unidirectional latency from the VP to
the CDN, and a' is the unidirectional latency on the reverse path from the CDN to the VP.
U-probing gives the VP-to-unicast latency, RTT_U = u + u', with u and u' the VP-to-unicast-CDN and
unicast-CDN-to-VP latencies, respectively. We detect improvable routing when RTT_U < RTT_A.
As individual pings are often noisy, we repeat A- and U-probing to look for consistently unequal RTTs.
We define the difference Δ between RTT_A and RTT_U as large enough when (Δ > 10 ms) or (Δ > 0.15 · max(RTT_A, RTT_U) and Δ > 5 ms). We chose these
two factors, an absolute gain of 10 ms and a proportional gain of 0.15 · RTT, to focus on improvements that are
meaningful to users and therefore worth attention. We consider results consistent when 80% of observations
meet this criterion. The specific thresholds are not critical: 10 ms, 0.15, and 80% are based on operational
experience at balancing true and false positives, and others may choose different thresholds. In our
experiments we observe RTTs every two hours for 48 hours, giving 24 samples, but we think 12 samples and
24 hours would be sufficient; the requirement is to observe long enough to identify network topology and not just
transient congestion.
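A minimal sketch of this test, assuming the A- and U-probing RTTs are paired per observation round, might be:

    def improvable(rtt_a_samples, rtt_u_samples, frac_required=0.8):
        """Flag a VP when U-probing is consistently and meaningfully faster."""
        def large_enough(rtt_a, rtt_u):
            delta = rtt_a - rtt_u                # only positive deltas are improvable
            return delta > 10.0 or (delta > 0.15 * max(rtt_a, rtt_u) and delta > 5.0)

        hits = sum(large_enough(a, u) for a, u in zip(rtt_a_samples, rtt_u_samples))
        return hits >= frac_required * len(rtt_a_samples)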
4.4.3 Locating the Problems
After we detect VPs with the potential for latency improvement (subsection 4.4.2), we next need to localize
the problem, identifying a specific slow hop or circuitous route. Our three traceroutes (Figure 4.1) provide
information to identify these events. We first look for slow hops, and if we find none, it suggests a circuitous
path (longer networking distance without specific slow hops). We next review how we find these events,
using examples we expand upon later in subsection 4.5.1.
4.4.3.1 Detecting Slow Hops
We find slow hops by examining traceroutes. Traceroutes report the IP address of each hop and the RTT
from the source to that hop. Slow hops occur when there is a sudden increase in latency (for example, the
hops marked with * in Table 4.2, Table 4.3, and Table 4.4). For each traceroute record, we compute the
incremental RTT change (usually a small rise) hop by hop. If, for a specific hop, the incremental change from
its previous hop is larger than the median plus twice the median absolute deviation of all incremental RTT
changes in that traceroute record, we consider the hop a slow hop.
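A sketch of this per-traceroute check (incremental change above the median plus twice the MAD) follows, assuming hop_rtts holds the per-hop RTTs of one traceroute in order:

    import statistics

    def slow_hop_indices(hop_rtts):
        """Return indices of hops whose RTT increase over the previous hop is unusually large."""
        increments = [curr - prev for prev, curr in zip(hop_rtts, hop_rtts[1:])]
        med = statistics.median(increments)
        mad = statistics.median(abs(x - med) for x in increments)
        return [i + 1 for i, inc in enumerate(increments) if inc > med + 2 * mad]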
The observation of a slow hop can point back to three possible root causes: a distant next hop, a congested
link, or a high-latency reverse path. Of these, a distant next hop is not a problem, but perhaps
unavoidable to bridge the distance between source and destination. (On the other hand, a shorter U-path proves
that the slow hop can be avoided and is therefore not due to physical distance.) A congested
link or high-latency reverse path, however, is a problem that can perhaps be addressed by taking a different path.
In some cases, the hops after a slow hop may show lower RTTs than the slow hop itself, suggesting a long
delay in the reverse path of that slow hop (a reverse path not shared by the subsequent hops). We
consider such cases false slow hops, and discuss how we identify and avoid them below.
4.4.3.2 How an RTT Surge Reveals a Slow Hop

To show how a slow hop forms, consider Figure 4.1: we have two RTTs, between (VP, hop_1) and between (VP,
hop_2), which we name R_1 (a_1 + a'_1) and R_2 (a_1 + a_2 + a'_2). If we assume R_1 is a reasonable value and R_2 is
surprisingly larger than R_1, this could mean either that a_2 is large or that (a'_2 − a'_1) is large. A large a_2 means either
hop_1 is congested or the path of a_2 is long, but we rule out hop_1 being congested because R_1 is assumed to be
reasonable. A large (a'_2 − a'_1) means that either hop_2 is congested or a'_2 takes an inferior route (while a'_1, as
assumed, does not). Symbols such as a_n and u_n in Figure 4.1 indicate the path
segments where long latency can occur. We do not need the latency of each exact path segment to detect
improvable latency.
4.4.3.3 Avoiding False Slow-Hops
Some hops appear "slow" but do not affect the end-to-end RTT. The reverse paths can be different for
different traceroute hops, and the reverse path from an intermediate router may not overlap the reverse path
from the CDN. A false slow hop occurs when a hop has a high-latency reverse path that does not overlap with
the destination's reverse path. We exclude those false slow hops because their increased latency does not
pass to later hops. Fortunately, true slow hops (due to congestion, distance, or other consistent latency) can
be identified because their latency appears in subsequent hops.
In Figure 4.1, we consider three RTTs, between (VP, hop_1), (VP, hop_2), and (VP, CDN-anycast),
which we name R_1 (a_1 + a'_1), R_2 (a_1 + a_2 + a'_2), and R_3 (a_1 + a_2 + a_3 + a'_3). Suppose R_2 is surprisingly larger
than R_1, making hop_2 look like a slow hop, but R_3 is much smaller than R_2. We learn that a'_3 is smaller than
a'_2, which means the CDN takes a much faster return route than the earlier hops. A slow hop like this, which does
not affect the final VP-to-CDN RTT, we call a false slow hop. Table 4.5 provides a specific example of a false
slow hop; note the hop marked with †: although the RTT of this hop increases by about 10 ms
from its previous hop, the increase does not continue to the next hop (16.54 ms drops to 10.23 ms, and
17.0 ms to 13.18 ms). This suggests that the sudden RTT increase at this hop is due to the hop taking
a slower reverse path that is not used by the final destination.
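On top of the slow-hop check sketched earlier, this filter can be expressed as requiring that the latency increase persist to later hops; the sketch below flags a hop as false when some later hop drops back below it.

    def is_false_slow_hop(hop_rtts, idx):
        """A flagged slow hop is 'false' when later hops drop back below its RTT,
        i.e., its latency increase does not pass on to subsequent hops."""
        later = [rtt for rtt in hop_rtts[idx + 1:] if rtt is not None]
        return bool(later) and min(later) < hop_rtts[idx]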
4.4.3.4 Circuitous Path Detection
A VP suffers from a circuitous path when no slow hops are detected but there is still improvable latency. If
we look at the middle column of Table 4.5, we can see the VP-to-unicast path has a 9 ms RTT, much shorter than
this hop's 17 ms. In this case, packets on the anycast-CDN-to-VP path (not included in BAUP) encounter
additional delay. Although we cannot see the details of the anycast-CDN-to-VP path, the fact that
VP-to-anycast and VP-to-unicast match (the left and middle columns are the same) means the unicast-CDN-to-VP
path does not encounter the same latency as the anycast-CDN-to-VP path.
IP aliasing can result in inaccurate ASes in traceroutes [44]. Although there has been progress reducing
aliasing [58], it seems impossible to eliminate. Fortunately, our work keeps IP addresses to study the path,
and requires only rough matches of ASes and /24 prefixes to classify problems (subsection 4.5.1), and does
not require correct AS identification for mitigations (subsection 4.4.4).
4.4.4 From Problems to Solutions
A CDN can resolve slow hops and circuitous routes by changing its routing policies, or by asking its peers to
change theirs. The presence of an existing, lower-latency U-path suggests a better path does exist.

For circuitous paths, when U-probing suggests a better route, the CDN can shift its anycast traffic to
follow the path shown by U-probing to reduce anycast latency.

For paths with slow hops, the CDN needs to influence routing to avoid the slow hop, perhaps by not
announcing at a given PoP or to a given provider at that PoP, by poisoning the route, or by prepending at
that PoP. Again, the existence of the lower-latency U-path motivates change by proving a better path exists,
but the U-path may not be the only solution for operators to follow. As long as the slow hop is avoided,
operators may find other good paths besides the U-path. The best mitigation varies depending on the
location of the problem: if the slow hop is in a network that directly peers with the CDN, the CDN has immediate
control over use of that network. It is more difficult to make policy changes that route around slow hops that
are multiple hops from the CDN.
In wide-area routing, BAUP must be prepared to handle load balancing in the WAN and at anycast
sites, and potential routing changes. Prior work has shown that catchment changes are rare [96], so wide-area,
load-balanced links and routing changes are unlikely to interfere with BAUP analysis. Load balancing
inside an anycast site is common, but unlikely to offer alternate paths that would appear in BAUP's wide-area
analysis. BAUP can detect and ignore cases where the A- and U-paths end at different sites.
We show specific case studies next (subsection 4.5.1) and later show an example where a CDN was
able to provide a significant reduction in latency to certain regions (section 4.6). In fact, we show that the
improvements made in response to the problems we found benefited a broader set of clients, not all of which were directly
detected by BAUP. The advantage of BAUP is that it finds VPs that have opportunities for lower-latency paths.
It serves to automate identification of such locations, allowing CDN operators to focus on networks that are
likely to show improvement.
Limitations Our methodology has two limitations. First, BAUP cannot discover all available paths between
a single VP and the CDN. Instead it knows only the current A-path and the alternate U-path. Future work
may study one-way latency (BAUP studies round trips) to isolate each direction, find more improvable
cases, and use other methods to find more alternate paths. Second, sometimes it may be hard for the operator
to use the U-path for anycast. The majority of improvements we found were slow hops. For slow hops, the
operator does not need to adopt the U-path (and sometimes cannot, perhaps if load balancers hash A- and
U-paths differently). In these cases, the operator must influence the A-path to avoid the slow hop.
4.5 Results
We next evaluate a CDN from all available RIPE Atlas probes (our VPs); our goal is to identify opportunities
to reduce latency. Measurements begin at 2019-07-29T00:00 UTC and run for 48 hours, with each VP
running an AU probe every two hours (so 24 observations per VP). We confirm the anycast catchment and
valid RTT_A and RTT_U values for comparison (see subsection 4.4.2) for 8350 of the 9566 probes.
Given the concentration of RIPE Atlas VPs in Europe [10], it is more likely that we will find problems
there. The goal of our experiments is to show that BAUP finds real-world problems and to provide a lower
bound on how many problems exist. We claim only a lower bound on the number of anycast problems we find,
not a tight global estimate, so any European emphasis in our data does not change our conclusions.
problem           CDN PoP   RTT_A   RTT_U
intra-AS          FRA       32.91   27.32
inter-AS          FRA       20.33   11.38
near-CDN          VIE       25.86    2.20
circuitous path   FRA       22.47    9.44

Table 4.1: Basic information about routing problems (RTTs in ms). PoPs are given as nearby airport codes.
        VP → anycast CDN                    VP → unicast CDN                    unicast CDN → VP (read bottom-up)
AS     IP                RTT (ms)    AS     IP                RTT (ms)    AS     IP                RTT (ms)
—      207.213.128.248      1.43     —      207.213.128.248      1.32
AS-B   115.66.46.99         1.38     AS-B   115.66.46.99         3.30     —      —                    —
AS-B   115.66.46.204        1.81     AS-B   115.66.46.204        1.75     AS-B   115.66.46.206       27.79
AS-B   115.73.130.248       3.44     AS-B   115.73.130.248       1.83     AS-B   115.73.130.250      71.78
AS-A   35.12.227.158        2.15
—      —                    —        —      —                    —        —      —                    —
AS-A   35.12.227.158       34.49 *
CDN    146.98.248.120      37.30     CDN    146.98.249.115      28.70     CDN    146.98.249.114       0.71
anycast CDN 101.208.74.51  32.91     unicast CDN 146.98.187.229 27.32     unicast CDN 146.98.251.152  1.13

Table 4.2: An intra-AS slow hop (marked *) from a VP to PoP FRA (discussion: subsubsection 4.5.1.1). Each
column is one traceroute between the VP (top) and the CDN (bottom); hops in the same AS are aligned
horizontally, and blanks mean no hops match with the same AS, showing routing differences.
4.5.1 Case Studies: Using BAUP to Identify Problems
Before our general results, we show example problems (from section 4.2) that BAUP revealed. Table 4.1
provides example latencies.

4.5.1.1 Intra-AS Slow Hop

Our first example is an intra-AS slow hop in Table 4.1, visible as a 5 ms difference between RTT_A and RTT_U
(in Table 4.2, each hop is labeled with its AS). Table 4.2 shows each hop of the paths into and out of the CDN. All
three paths have unreported hops (the dashes), and the slow hop is in AS-A (a large transit provider) on the
inbound path.

This problem may be inside AS-A or on its reverse path (see subsection 4.4.3). Since the CDN peers
directly with AS-A, the CDN may be able to influence the path.
        VP → anycast CDN                    VP → unicast CDN                    unicast CDN → VP (read bottom-up)
AS     IP                RTT (ms)    AS     IP                RTT (ms)    AS     IP                RTT (ms)
—      207.213.136.212      1.38     —      207.213.136.212      1.23
AS-D   127.129.232.23       5.90     AS-D   127.129.232.23       5.87     AS-D   115.227.215.27      10.17
AS-D   127.129.233.106      5.89     AS-D   127.129.233.106      5.76     AS-D   127.129.233.107      6.26
AS-E   35.130.33.74        27.68 *
AS-E   35.130.248.142      28.20
AS-E   35.130.89.0         28.34
AS-E   211.205.94.59       33.13
                                     AS-F   126.82.128.149      13.39     AS-F   126.82.128.201       6.28
CDN    146.98.249.115      30.21     CDN    146.98.249.115      12.76     CDN    146.98.249.114       1.44
anycast CDN 101.208.74.51  20.33     unicast CDN 146.98.251.67  11.38     unicast CDN 146.98.251.152  0.74

Table 4.3: An inter-AS slow hop (marked *) from a VP to PoP FRA (discussion: subsubsection 4.5.1.2).
        VP → anycast CDN                    VP → unicast CDN                    unicast CDN → VP (read bottom-up)
AS     IP                RTT (ms)    AS     IP                RTT (ms)    AS     IP                RTT (ms)
AS-K   206.13.250.110       0.43     AS-K   206.13.250.110       0.39     AS-K   206.13.250.111       1.99
                                     AS-K   107.219.4.113        0.51     —      —                    —
AS-L   72.58.248.181        0.78
—      207.122.137.141      0.50
—      —                    —
AS-H   7.213.227.109       27.10 *
                                     —      112.154.229.152      2.64     —      112.154.229.69       1.77
CDN    146.98.108.78       27.12     CDN    146.98.108.78        3.28     CDN    146.98.108.79        3.64
anycast CDN 101.208.74.51  25.86     unicast CDN 146.98.110.124  2.20     unicast CDN 146.98.110.208  0.70

Table 4.4: A near-CDN slow hop (marked *) from a VP to PoP VIE (discussion: subsubsection 4.5.1.3).
4.5.1.2 Inter-AS Slow Hop
Slow hops may also happen between ASes, not only inside one AS. Our second example is an inter-AS slow
hop with a 9 ms difference between RTT_A and RTT_U (Table 4.1). Table 4.3 shows a slow hop when a packet leaves
AS-D and enters AS-E (a large transit provider).

Although this problem may be at AS-E or on its reverse path, U-probing shows a much faster path through
AS-F (a route through an Internet exchange provider). While the CDN does not currently announce anycast
to this provider, it may consider doing so to take advantage of this direct route.
4.5.1.3 Problem near the CDN hop
A case we are especially interested in is slow hops near the CDN, since they can often be addressed easily.
Our third example is a slow hop identified by a 23 ms difference between RTT_A and RTT_U (Table 4.1).
Vantage Points
src AS IP RTT src AS IP RTT dst AS IP RTT
# AS-M 119.204.17.68 0.46 # AS-M 119.204.17.68 0.45 " AS-M 119.204.17.70 9.35
# # " AS-M 119.204.23.5 9.0
# # " AS-G 205.125.195.218 11.56
# # " AS-G 35.82.197.190 10.0
# # " AS-G 35.82.246.105 11.47
# AS-N 62.245.207.153 7.51 # AS-N 62.245.207.153 7.48 " AS-G 62.214.105.74 9.71
# AS-N 82.135.16.136 7.47 # AS-N 82.135.16.136 7.72 " AS-G 62.214.38.165 11.59
# AS-N 210.69.126.74 16.54 # AS-N 210.69.126.74 17.0 " AS-G 62.214.37.137 0.66
# # " AS-F 126.82.128.45 0.66
# CDN 146.98.249.115 10.23 # CDN 146.98.249.115 13.18 " CDN 146.98.249.114 1.01
anycast CDN 101.208.74.51 22.47 unicast CDN 146.98.251.67 9.44 unicast CDN 146.98.251.152 0.81
CDN
Table 4.5: A circuitous path from a VP to PoP FRA (discussion: subsubsection 4.5.1.4). (Struck-out text
means the RTT of that hop is a false slow hop; see subsection 4.4.3.)
In Table 4.4, we cannot tell the specific location of the slow hop; it may be the fifth hop, in AS-H, or the fourth
hop, which is not shown in the traceroute information.
Luckily, since the slow hop is near the CDN, U-probing suggests an alternative path through a different
provider that would reduce the 25 ms latency to 2 ms.
4.5.1.4 A Circuitous Path
Next, we look at an example of a circuitous path. We identify this problem because of a 13 ms difference
in RTT_A and RTT_U (Table 4.1). We can infer the existence of a circuitous path in the anycast-CDN-to-VP direction,
although our data does not provide the details of the path. BAUP provides three of the four possible one-way
delays, but we lack the anycast-CDN-to-VP delay. Table 4.5 shows the VP-to-anycast-CDN (a) and VP-to-unicast-CDN
(u) paths are the same, but the round-trip time of the former is much larger. We also learn the path unicast CDN to
VP (u′). We therefore infer that the higher latency occurs in the one-way delay from anycast CDN to VP
(a′). To express what we learn mathematically: if (a + a′) > (u + u′) and a = u, then it must be that a′ > u′.
U-probing tells us that unicast-CDN-to-VP is faster than anycast-CDN-to-VP. It proposes an explicit
path via AS-F and AS-G, suggesting that the CDN operator use a path via AS-F to improve
performance.
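The inference above can be expressed as a short check. The sketch below is our own illustration rather than the thesis's implementation; the function name, the hop lists, and the 5 ms noise margin are assumptions used only for illustration.

# Minimal sketch of the circuitous-return-path inference in subsubsection 4.5.1.4.
# Assumptions: rtt_a and rtt_u are round-trip times (ms) from the same VP to the
# anycast and unicast addresses; forward_hops_* are the intermediate IP hops of the
# two VP-to-CDN traceroutes; the 5 ms margin is an illustrative noise threshold.
def infer_slow_return(rtt_a: float, rtt_u: float,
                      forward_hops_a: list[str], forward_hops_u: list[str],
                      margin_ms: float = 5.0) -> bool:
    """True if the anycast return path (a') is likely slower than the unicast
    return path (u'): a + a' > u + u' together with a = u implies a' > u'."""
    same_forward_path = forward_hops_a == forward_hops_u      # a = u
    return same_forward_path and (rtt_a - rtt_u) > margin_ms

# Example with the round-trip times from Table 4.5 (shared intermediate hops):
shared = ["119.204.17.68", "62.245.207.153", "146.98.249.115"]
print(infer_slow_return(22.47, 9.44, shared, shared))          # True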
2019-07-29
VPs Percent
VPs in use 8350 100.00%
equal RTTs 7967 95.37%
consistent unequal RTTs 383 4.59%
A-probing faster 250 2.99%
U-probing faster 133 1.59%
slow hops 130 1.56%
inter-AS 51 0.63%
intra-AS 16 0.19%
either intra- or inter-AS 63 0.77%
circuitous path 3 0.04%
Table 4.6: BAUP results on the CDN
scenario VP count
Seen via AS-H 171
inactive after fix 33
active after fix 138
improved 130
still via AS-H 1
not via AS-H 129
got worse 8
still via AS-H 7
not via AS-H 1
Table 4.7: VPs affected by AS-H before and after fixing
4.5.2 How Often Does BAUP Find Latency Differences?
Generally BGP works well to select anycast PoPs, with AU probing detecting unequal latency relatively infrequently.
Table 4.6 shows the results of our evaluation: most of the time (more than 95%), the A- and U-paths show
similar latencies. About 4.6% show differences and therefore potential routing problems. When RTTs are
unequal, A-probing is faster than U-probing about twice as often, suggesting that anycast routing is already
generally well optimized.
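To make the categories in Table 4.6 concrete, the sketch below classifies one VP from its repeated paired measurements. It is our own illustration: the consistency rule and the 4 ms threshold are assumptions, not the exact parameters used in section 4.4.

# Sketch: classify one VP from paired A-/U-probing RTT samples taken over 48 hours.
# Assumption: a VP counts as "consistent unequal" only if every paired sample
# disagrees in the same direction by more than a noise threshold (value illustrative).
def classify_vp(rtt_pairs, threshold_ms=4.0):
    """rtt_pairs: list of (rtt_anycast_ms, rtt_unicast_ms) samples."""
    diffs = [a - u for a, u in rtt_pairs]
    if all(abs(d) <= threshold_ms for d in diffs):
        return "equal RTTs"
    if all(d > threshold_ms for d in diffs):
        return "consistent unequal: U-probing faster"   # anycast latency can improve
    if all(d < -threshold_ms for d in diffs):
        return "consistent unequal: A-probing faster"
    return "inconsistent"                                # mixed; not a routing problem

print(classify_vp([(32.9, 27.3), (33.1, 27.5), (32.7, 27.0)]))
# -> consistent unequal: U-probing faster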
4.5.3 Root Causes and Mitigations
We next examine the 133 cases where there is a difference and U-probing is faster, since those are cases
where anycast routing can be improved. Nearly all of these cases are due to slow hops (subsection 4.4.3).
For the cases where we find U-probing is faster in Table 4.6, we are able to locate the slow hops for most
(130 out of 133) of them. We enumerate all the ASes that appear in slow hops (one AS for an intra-AS slow
hop, and two for an inter-AS one). We find 77 ASes appear in slow hops. Three of these appear very frequently,
each affecting about 20 VPs, while the others affect only one or two VPs. We focus remediation efforts on these
three ASes, since changes there will improve service for many users. We consider each of these three cases
next.
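Before turning to the three cases, the sketch below shows one way a slow hop could be located from a single traceroute and attributed as intra- or inter-AS. It is our own illustration of the idea in subsection 4.4.3; the 10 ms hop-to-hop jump threshold is an assumption.

# Sketch: flag slow hops in one traceroute, given per-hop (asn, rtt_ms) records.
# Hops with missing RTTs are skipped; the 10 ms jump threshold is illustrative.
def find_slow_hops(hops, jump_ms=10.0):
    """hops: list of (asn, rtt_ms) along the path; returns (hop index, kind)."""
    slow, prev_asn, prev_rtt = [], None, None
    for i, (asn, rtt) in enumerate(hops):
        if rtt is None:
            continue
        if prev_rtt is not None and rtt - prev_rtt > jump_ms:
            kind = "intra-AS" if asn == prev_asn else "inter-AS"
            slow.append((i, kind))
        prev_asn, prev_rtt = asn, rtt
    return slow

# Example modeled on Table 4.3: the RTT jump happens when entering AS-E.
path = [("AS-D", 5.9), ("AS-D", 5.89), ("AS-E", 27.68), ("AS-E", 28.2), ("CDN", 30.21)]
print(find_slow_hops(path))   # [(2, 'inter-AS')]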
A Single Inbound Provider In the first case, an inter-AS slow hop happens at a large regional
provider, labeled AS-H. RTT increases by about 20 ms, affecting about 19 VPs over time. For each of the
19 VPs, we see the increase happen at one of three router interfaces, suggesting congestion or other
challenges at three places. U-probing suggests a better route is available through an alternative provider, since
the traceroute from the CDN to the VP avoids the delay.

We considered several remediation options, but choices are somewhat limited because the slow hop is
on the inbound path and the CDN must convince its clients to take a new path to the CDN. In the case
that the slow hop occurs on a hop adjacent to the CDN, two primary routing options are available to the
CDN operator. First, it could withdraw announcements to AS-H completely, but that risks leaving some
clients of that AS with poor connectivity to the CDN. Alternatively, the CDN can use a community string
to request that its peer refrain from propagating the CDN's anycast route to AS-H's peers. Such a change may
then encourage distant clients to consider alternative inbound paths. Prompted by our work, the CDN made
changes to address this problem, as we describe in section 4.6.
Internal Routing Policy Our second case results from internal policies at the CDN that cause clients
to use an indirect route rather than a direct peering link. Subsubsection 4.5.1.1 and Table 4.2 show traceroutes for
this case. Here, we observe an intra-AS slow hop within AS-A, a large transit provider. The
slow hop has an inflation of about 20-30 ms, affecting 15 VPs over time. Each of the 15 VPs sees the increase
happen at one of two router interfaces, suggesting problems in two places. Unfortunately, existing CDN
policies withhold anycast announcements from these peers, causing them to use a transit route. Addressing
this case requires changes to the CDN routing policy, a topic currently under evaluation.
External Routing Policy Our final example is the result of policy determined by an external network.
Here, the slow hop is within AS-J, a large regional ISP. In this case, we see hop-to-hop RTT increases
of between 15 ms and 25 ms, at different router interfaces, for 17 different VPs. U-probing suggestions also
vary, with lower-RTT return paths passing through a handful of other networks.

This case appears to be the result of external peering policy, outside the control of the CDN opera-
tor, which prefers certain inbound routes. Changing these policies requires inter-operator negotiation, so
while BAUP cannot suggest CDN-only mitigations, it does help identify the problem. Identification helps
operators detect and quantify the impacts of policies, helping prioritize potential resolutions.
Circuitous Paths Circuitous paths can have solutions similar to those for slow hops. With only three cases
(Table 4.6), we have not yet examined specific mitigations.
4.6 Improving Performance with BAUP
Following the inbound provider example in subsection 4.5.3 and directly motivated by the results of BAUP,
we worked with the CDN operators to adjust routing to AS-H. Changes to routing must be made carefully:
even though we expect them to improve performance for the 19 VPs found by BAUP, they must not degrade
performance for other CDN users. The change made by the CDN operators was to
add a community string requesting that AS-H refrain from announcing the CDN's anycast route to its peers.
Figure 4.2: CDF of RTT before and after applying fix. (Plot: x-axis is RTT in ms, 0 to 100; y-axis is CDF; one curve before and one after the fix.)
Figure 4.3: CDF of RTT before fix minus RTT after applying fix. (Plot: x-axis is RTT difference in ms, -60 to 80; y-axis is CDF.)
Although our goal was to improve the 19 VPs found by AU probing, in fact we found 171 VPs whose
traceroutes to the CDN (as of 2019-07-29) pass through AS-H and stood to improve. BAUP flags only 19 of them
because of its strict requirement for consistent, unequal RTTs, but examination shows that AS-H contains a
slow hop for all 171 VPs.
After the CDN made the routing changes, we re-examined the 138 VPs (of the 171 behind AS-H) that
were still active. We found latency significantly decreased for nearly all VPs in this group, in some cases
2020-07-04
VPs Percent
VPs in use 10759 100.00%
equal RTTs 10343 96.22%
consistent unequal RTTs 406 3.77%
A-probing faster 193 1.80%
U-probing faster 213 1.98%
slow hops 212 1.97%
inter-AS 32 0.30%
intra-AS 120 1.12%
either intra- or inter-AS 60 0.56%
circuitous path 1 0.01%
Table 4.8: BAUP results on B-root
reducing it by half. Figure 4.2 shows latency before and after the change, with most VPs reducing latency by
about 20 ms and the median latency falling from 39.77 ms to 16.27 ms. Figure 4.3 shows the change in
performance for each VP: a few (8 of 138) show slightly larger latency, but 80% show their RTT
drop by 10 ms or more, and 55% by 20 ms or more. When we examine the 8 VPs that show higher latency,
7 of them still reach the CDN via AS-H and so were not actually affected by our routing changes; we believe
the last is measurement noise.
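The per-VP comparison behind Figures 4.2 and 4.3 can be summarized with a few lines. The sketch below is our own illustration (the function name and the input format are assumptions); it takes per-VP median RTTs before and after the change and reports the medians and the fractions of VPs improving by given amounts.

# Sketch: summarize the effect of the routing change from per-VP median RTTs (ms)
# measured before and after the fix; only VPs active in both rounds are compared.
from statistics import median

def summarize_change(before: dict, after: dict):
    """before/after: {vp_id: median RTT in ms}; positive delta means improvement."""
    common = set(before) & set(after)
    delta = {vp: before[vp] - after[vp] for vp in common}
    n = len(common)
    return {
        "median_before_ms": median(before[vp] for vp in common),
        "median_after_ms": median(after[vp] for vp in common),
        "frac_improved_10ms": sum(d >= 10 for d in delta.values()) / n,
        "frac_improved_20ms": sum(d >= 20 for d in delta.values()) / n,
        "vps_worse": [vp for vp, d in delta.items() if d < 0],
    }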
4.7 BAUP Evaluation Of DNS B-Root
A major CDN [6] has used BAUP to reduce median latency for users in some regions. In this section, we
evaluate BAUP on another anycast infrastructure, B-Root. In doing this evaluation, we show that BAUP can
diagnose the cause of high latency for both a CDN and a root DNS service.
4.7.1 How Often Does BAUP Find Latency Differences?
We repeat the BAUP experiment as we did for the CDN (section 4.4), but this time we use B-root as our
anycast infrastructure. We use the VPs from the same platform, RIPE Atlas, as we did for the commercial
CDN. We first identify each VP's catchment in the B-root anycast network. We determine the catchment
by querying the hostname.bind field of B-root, matching server IDs to each site. We take latency mea-
surements (pings) from the VP to both the anycast address and the unicast address of B-root matching the VP's
catchment. We keep observing the RTTs for 48 hours at 2-hour intervals. For VPs that show potential im-
provement, we collaborate with the B-root team to traceroute from the VP to the unicast and anycast address
of B-root, and from B-root's unicast addresses to the VP. Next, we present the evaluation results for B-root.
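As a concrete illustration of the catchment step above, the sketch below issues the hostname.bind CHAOS-class TXT query with the dnspython library. The B-root address and the returned server-ID format are assumptions for illustration; in the actual measurements the query is issued from each RIPE Atlas VP, not from a local script.

# Sketch: learn which B-root site answers from this vantage point by querying the
# CHAOS-class TXT record "hostname.bind" (the same mechanism used in this section).
# Requires the dnspython package; the address below is assumed to be B-root's
# public anycast address, and the server-ID strings shown are illustrative.
import dns.message
import dns.query

B_ROOT_ANYCAST = "199.9.14.201"   # assumption: B-root IPv4 anycast address

def broot_server_id(server: str = B_ROOT_ANYCAST, timeout: float = 3.0):
    query = dns.message.make_query("hostname.bind", "TXT", "CH")
    reply = dns.query.udp(query, server, timeout=timeout)
    for rrset in reply.answer:                 # expect a single TXT RRset
        for rdata in rrset:
            return rdata.strings[0].decode()   # e.g. "b1-lax" (illustrative format)
    return None

if __name__ == "__main__":
    print("server ID:", broot_server_id())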
BAUP detects improvable latency for B-root as it does for the commercial CDN. BAUP detects 1.59%
of VPs with improvable latency for the commercial CDN (Table 4.6). Similarly, BAUP detects 1.98% of VPs
that may improve latency towards B-root (Table 4.8). In both the commercial CDN and B-root, about 96%
of VPs have equal RTTs in A-probing and U-probing.
There are two major differences between the result for B-root and the result for the commercial CDN.
First, the CDN has a higher ratio of faster A-probing to faster U-probing than B-root has. For the commercial
CDN (Table 4.6), faster A-probing (2.99%) happens about twice as often as faster U-probing (1.59%). For B-root
(Table 4.8), faster A-probing (1.80%) happens about as often as faster U-probing (1.98%).
The ratio of faster A-probing to faster U-probing is about 2:1 for the CDN, and about 1:1 for B-root.
Such a difference in ratios between the CDN and B-root suggests that B-root does not optimize its anycast routing
as carefully as the CDN. In the CDN result, we see more VPs have shorter latency over anycast
than over unicast, suggesting the CDN may have already optimized its anycast routing.
Second, most slow hops for B-root happen inside a single AS (intra-AS), while for the CDN most slow hops
happen between ASes (inter-AS). This difference suggests that in the current
infrastructure of B-root, its sites mainly peer with large transit providers, where slow hops may exist
since those transit providers may not prioritize DNS traffic.
We have shown that BAUP can detect improvable latency for B-root as well as for the CDN. We suggest
that BAUP can detect improvable latency for general anycast infrastructures. BAUP also suggests that
routing policies of anycast services may affect the location of slow hops.
Vantage Points
src AS IP RTT src AS IP RTT dst AS IP RTT
# AS-P 127.122.81.50 1.452 # AS-P 127.122.81.50 0.894 " AS-P 127.122.81.50 106.672
# AS-Q 115.144.2.52 2.538 # AS-Q 115.144.2.52 0.26 " AS-Q 115.144.2.54 107.765
# AS-R 117.208.87.165 17.489 # AS-R 117.208.87.165 10.238 " AS-R 117.208.86.20 105.529
# # AS-R 117.208.87.199 3.866 "
# AS-Q 119.79.198.191 17.033 # "
# AS-S 123.59.34.34 43.089 # AS-T 53.174.216.171 53.565 " AS-U 181.34.220.67 101.509
# AS-S 123.59.36.187 100.833 # AS-T 123.34.148.195 81.005 " AS-U 181.34.220.67 98.58
# AS-V 205.248.70.188 100.382 # AS-E 35.130.204.242 81.072 " — 21.254.194.217 49.609
# AS-V 205.248.67.81 99.232 # AS-E 35.130.89.238 87.951 " AS-X 187.196.4.37 46.7025
# AS-V 205.248.67.70 221.619 # "
# — — — # — — 96.779 " — — —
# — — — # — — 96.737 "
# AS-V 192.32.154.5 217.371 # AS-E 35.130.225.180 98.272 " AS-Y 221.91.198.43 43.622
# AS-V 205.248.89.248 214.963 # "
# AS-V 205.248.89.130 218.786 # "
# # " AS-W 211.229.182.249 7.915
# # " AS-W 135.227.127.92 8.253
# # AS-W 135.227.125.137 98.478 " AS-W 135.227.127.32 8.232
# # AS-W 135.227.123.61 109.878 " AS-W 135.227.123.60 8.293
# # AS-W 135.227.123.176 106.66 " AS-W 135.227.123.177 0.481
anycast AS-Z 188.221.159.93 215.456 unicast AS-W 211.229.191.60 106.2 unicast AS-W 211.229.191.61 11.862
B-Root
Table 4.9: Two intra-AS slow hops in one path from a VP to B-root site AMS
Vantage Points
src AS IP RTT src AS IP RTT dst AS IP RTT
# — 207.213.129.144 0.912 # — 207.213.129.144 0.819 "
# — — — # — — — " — — —
# AS-a 160.149.130.45 5.718 # AS-a 160.149.136.150 5.684 " AS-a 160.149.138.126 3.194
# AS-a 109.254.39.162 2.879 # AS-a 109.254.39.162 2.871 " AS-a 109.254.39.161 6.632
# AS-a 109.254.37.86 31.313 # AS-a 109.254.39.6 3.18 "
# AS-a 109.254.35.204 31.886 # "
# AS-a 109.254.35.206 33.342 # "
# # " — — 1.149
# # — 199.81.120.222 3.182 " — 199.81.121.231 0.946
# # AS-c 97.29.193.118 4.4762 " AS-c 220.65.85.144 0.634
# # AS-d 133.194.254.123 4.694 "
anycast unicast AS-d 133.194.254.39 4.443 unicast AS-d 133.194.254.36 0.675
B-Root
Table 4.10: An intra-AS slow hop from a VP to B-root site LAX.
4.7.2 Some Case Studies: Root Causes and Potential Mitigations
With the commercial CDN, we located specific slow hops in the vast majority of the improvable cases, and only
a few cases showed a circuitous path (subsection 4.5.1). In BAUP's evaluation of B-root, we also find slow hops
to be a common reason that latency may be improved. In this section, we show that BAUP is a more capable
tool than we expected: BAUP not only detects improvable latency for a new
anycast infrastructure, DNS B-root, but also detects new types of cases that we did not observe in studying
the CDN. Next, we present two specific cases of slow hops, each providing either new network conditions
or new mitigations that we did not observe in the CDN study. The first case shows extremely poor
network connections that we did not encounter in the study of the CDN. The second case shows that we can
find new mitigations by adding peers instead of avoiding peers, as we did to improve latency
for the CDN.
The first case shows a latency in A-probing about 100 ms larger than in U-probing. This case could have
a large latency improvement after mitigation. It also indicates that extremely poor routing exists in
anycast services; in such routing, there can be more than one slow hop in multiple
locations, hurting latency. The first column in Table 4.9 shows RTT_A is over 215 ms,
while the second and third columns show RTT_U is over 106 ms. If we can change the routing in this
case, latency would drop by about 100 ms. In the path from the VP to the anycast B-root (first
column), there are two intra-AS slow hops, in AS-S and AS-V, but neither appears in the unicast-
probing round trip (second and third columns). We also see that the penultimate hop near the anycast B-root
is through AS-Z. AS-Z does not appear in the unicast-probing round trip either. One possible mitigation is
to withdraw B-root's anycast announcement to AS-Z. Such a mitigation may encourage the VP's routing to shift
to a new path that avoids both AS-S and AS-V. If the new path looks the same as the path in
U-probing, the latency gain can be as high as 100 ms. This case shows that BAUP can detect new cases
of extremely poor latency containing multiple slow hops, something that we did not observe when we
studied the CDN. BAUP also suggests mitigations with potential latency improvements as high as 100 ms.
The second case shows a mitigation in which latency can be improved by peering with an additional ISP instead of
withdrawing peering from a certain ISP. This case has a latency in A-probing about 30 ms larger than in U-probing.
The first column in Table 4.10 shows RTT_A is over 33 ms, while the second and third columns show
RTT_U is over 4 ms. If we can change the routing in this case, latency would drop by about 30 ms. In the
path towards the anycast B-root, we see a slow hop exists inside AS-a. However, in the round trips between
the VP and the unicast B-root (second and third columns), the traffic also passes through AS-a but
without any slow hops inside AS-a. This could happen if AS-a prioritizes traffic to AS-c: in the anycast
probing, AS-a exchanges traffic directly with B-root, while in the unicast probing, AS-a exchanges traffic
with AS-c instead of directly with B-root. We also see that when the penultimate hop to B-root
is AS-d, the traffic can go through AS-c and then to AS-a. A possible mitigation in this case is for
B-root to announce anycast addresses to AS-d. This could lead the path from the VP to the anycast B-root
to still go through AS-a but without experiencing any slow hops. The latency gain would be about 30 ms. This
case shows that BAUP can find novel ways to mitigate slow hops: instead of refraining from peering with
certain peers, BAUP sometimes shows options to peer directly with an ISP to reduce latency.
4.8 Related Work
We build upon three categories of prior studies of anycast and CDN performance: evaluation of
overall anycast performance, optimization of anycast latency, and prediction of end-to-end latency.
Prior evaluations of anycast motivate our work by suggesting potential routing inefficiency and
possibilities to improve latency, and studies of latency motivate our study of hop-by-hop paths.
Prior studies evaluated several production CDNs, each with different architectures. Google's WhyHigh
found most users are served by a geographically near node, but regional clients can have widely different laten-
cies even when served by the same CDN node [50]. Microsoft found that roughly 20% of clients are routed to a suboptimal
front-end in their CDN [16]. Other work has studied the latency and geographic anycast catchment of
the root DNS infrastructure [40, 52, 54, 87]. Fontugne et al. detect network disruptions and report them
in near-real-time with traceroute [41]. Our work extends theirs by using information from both directions
and for both anycast and unicast paths, allowing us to find not only network disruptions, but also routing
detours without disruptions. We also share WhyHigh's motivation to find slow hops from congestion and
circuitous routing. While Li et al. found routing was often inefficient [52], we found that latency problems
were relatively rare, perhaps because we focus on available network paths while they consider geographic
distance, and because they examine denser anycast networks, which have more opportunities for suboptimal
routing. In addition, the CDN we studied employs regional announcements, where anycast announcements
are restricted to a single continent, limiting how far off latency can be. Schmidt et al. [87] showed that
additional anycast instances show diminishing returns in reducing latency, and suggest per-continent de-
ployments (as seen in the CDN we study). Bian et al. [11] showed that 19.2% of global anycast prefixes
have been potentially impacted by remote peering.
Our work emphasizes a lightweight evaluation method. WhyHigh diagnoses problems by clustering
nearby clients together and picking the shortest latency to compare, which requires precise client location
data [50]. FastRoute optimized anycast usage, using multiple addresses for different areas in
a hybrid of anycast with DNS-based selection [39]. Like WhyHigh, FastRoute also diagnoses latency problems
based on user locations. Our work focuses on diagnosis, rather than prediction like iPlane [56, 57]. Our
methodology uses the difference between routing segments in A- and U-probing via a simple RTT-inequality
indicator. Moreover, we do not require the precise location of each client (vantage point) or of each router on
the path. Each detection compares measurements from the same VP to the same destination, so there is no risk of error
due to an incorrect IP-geolocation mapping.
4.9 Summary
BAUP allows general anycast infrastructures such as anycast-based CDNs and DNS to detect opportunities
to improve latency to their clients caused by congested links and routing detours. By comparing the route
taken towards a CDN and a root-DNS infrastructure, for both anycast and unicast addresses, BAUP detects
opportunities to improve latency, and with bidirectional traceroutes we observe slow hops and circuitous
paths. We show that these observations allow BAUP to identify opportunities for improvement. Working
with a CDN operator, we show that changes identified by BAUP halved latency for some VPs, affecting 91
ASes in 19 countries with more than 100k users. Since Internet routing is always changing, we suggest that
BAUP should be used to test anycast deployments regularly. It can help debug performance problems and
detect regressions.

BAUP's success in reducing latency for a commercial CDN supports our thesis statement by improving
the latency of anycast. Next, we discuss future work and conclude the thesis.
Chapter 5
Conclusions
This thesis has shown how we can confirm the stability and security of anycast, and how we can improve
the latency of anycast. In this chapter, we discuss possible future directions and remaining challenges,
and then present our conclusions.
5.1 Future Directions
Longitudinal study of anycast instability. We have shown that instability happens rarely in all 12 DNS-
root infrastructures. We have shown that some instability is sticky over time, occurring in some (VP, root)
pairs over weeks in UDP connections. However, a longitudinal study of anycast instability over a longer
time span would be interesting. First, we could see whether the instability persists over longer periods,
e.g., months or years; this study would contribute to the discovery of Internet load balancers. Second, we
could also see how much instability changes during certain historical DDoS events; this study
would contribute to the study of DDoS in general and to the prediction of DDoS events.
What are possible solutions if an anycast infrastructure has some users that frequently time out in TCP
connections due to site changes? In our previous study of stability (in Chapter 2), we saw that although
extremely rare, it is possible for a specific anycast infrastructure to have users that frequently time out in
TCP connections. In our study, TCP timeouts caused by instability happen to only 1 service out of the 12 we
checked and only to 0.15% of VPs. However, routing in the Internet keeps changing, so it is possible that
some users may experience such timeouts one day even if they never experienced them before. One challenge in
solving the problem is to keep monitoring the traffic of users that experience anycast instability. Anycast
instability and timeouts could be recognized from server logs or from regular probing on the users' side.
Once operators learn that instability and timeouts are happening to a group of users, there are a few solutions.
First, operators can direct those users' traffic to a backup server. Second, operators can
add a site near this group of users to shape a new catchment for them.
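One possible shape for the monitoring described above: periodically record which site answers (for example, via hostname.bind as in Chapter 2) and flag a client as anycast-unstable when the site flips too often. The sketch below is our own illustration; the sampling window and the threshold of two changes per hour are assumptions.

# Sketch: flag anycast instability from a chronological list of observed site IDs
# (e.g., hostname.bind answers sampled every few minutes; None marks a timeout).
def count_site_changes(site_ids):
    changes, prev = 0, None
    for sid in site_ids:
        if sid is not None and prev is not None and sid != prev:
            changes += 1
        if sid is not None:
            prev = sid
    return changes

def is_unstable(site_ids, hours: float, max_changes_per_hour: float = 2.0) -> bool:
    return count_site_changes(site_ids) / hours > max_changes_per_hour

print(is_unstable(["b1-lax", "b1-lax", "b2-mia", "b1-lax", "b2-mia"], hours=1))  # True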
Compare BAUP's effectiveness across different anycast infrastructures. We have shown in Chap-
ter 4 that BAUP is able to detect improvable latency for both a commercial CDN and DNS B-root. Through
the study, we see that some improvements shown for B-root can be as high as 100 ms, while the improvements
for the commercial CDN are usually tens of ms. One possibility is that B-root has poorer tail latency than
the commercial CDN. Another possibility is that BAUP is more effective on anycast infrastructures that
have fewer sites. The commercial CDN we studied has more than 100 sites globally, yet still significantly reduced
latency after applying BAUP. It would be interesting to learn whether BAUP is as effective on other large-scale
anycast CDNs.
Given a specific user set and a latency requirement, what is the most economical way (fewest
PoPs) to deploy an anycast infrastructure for a certain service? Our study of latency (in
Chapter 4) proposes BAUP, which helped cut the median latency of regional users (86%ile of all users before
improvement) by half. BAUP significantly improved tail latency for a candidate CDN. Assuming that a
CDN only needs to reach a certain service level with a tail-latency limit, is it possible for this CDN to
decrease the number of PoPs after reducing the tail latency? A previous study by Schmidt et al. [87] shows
that in anycast infrastructures a few sites can provide performance nearly as good as many, and that network
locations have a far stronger effect on latency than having many sites. Studies of saving cost by using fewer
PoPs could benefit anycast services. However, operators also need to consider throughput and other factors,
such as privacy issues related to PoP locations.
Evaluating DNS spoofing using a different set of VPs. Our study of DNS spoofing (in Chapter 3) is
the first general work to quantify DNS spoofing across Internet users. However, such quantification
may be affected by the VPs conducting the measurements. A measurement from a different set of VPs would
reduce the possible bias from a single set of VPs.
Since DNSSEC is not deployed to all zones and not used by every Internet user, is there a way
to detect spoofing in real time? In our study of anycast security (in Chapter 3), we found that although
DNS spoofing is not common, it is growing year by year. In future work, one could regularly observe from
a single VP to tell when there is a high likelihood that a DNS query is dropped, injected, or directed to a
third-party site. Besides the server IDs (the hostname.bind field) we used in our work, there are possibly other
indicators of spoofing if we frequently observe query replies. For example, if a query gets a reply
with unusually low latency compared to prior replies, this could indicate injection. If a
DNS query receives a reply indicating it comes from a new site physically farther from the VP but with a smaller
latency, this could mean that a third-party anycast site is interfering with the traffic.
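The latency- and site-based heuristic sketched in this paragraph might look like the following. This is our own illustration of the idea, not a validated detector; the baseline size and the "unusually low" factor of 0.5 are assumptions.

# Sketch: flag a DNS reply as suspicious if it is much faster than the historical
# baseline, or if it reports a previously unseen server ID together with a latency
# drop. Both conditions only suggest spoofing and would need further checking.
def suspicious_reply(history, new_rtt_ms, new_server_id,
                     factor=0.5, min_history=20):
    """history: list of (rtt_ms, server_id) for past replies seen by this VP."""
    if len(history) < min_history:
        return False                      # not enough baseline yet
    past_rtts = [rtt for rtt, _ in history]
    past_ids = {sid for _, sid in history}
    unusually_fast = new_rtt_ms < factor * min(past_rtts)
    new_site_and_faster = (new_server_id not in past_ids
                           and new_rtt_ms < min(past_rtts))
    return unusually_fast or new_site_and_faster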
5.2 Conclusions
In this thesis, we have proven our thesis statement: we confirm the stability and security of anycast
and improve the latency of anycast by understanding anycast routing uncertainty. In our work studying
anycast stability, we showed that site changes rarely affect TCP connections and that instability happens to
pairs of specific VPs and specific anycast infrastructures. In our work studying anycast security through DNS
spoofing, we found that DNS spoofing is uncommon but increasing over the years, and that third-party anycast
sites are rarely used. In our work improving anycast latency, we reduced the median latency of regional
users of a commercial CDN from 40 ms to 16 ms.
In Chapter 2, we used data from more than 9000 vantage points (VPs) to study 11 anycast services and
examine the stability of site selection. Consistent with the wide use of anycast in CDNs, we found that anycast
almost always works: in our data, 98% of VPs see few or no changes. This finding confirmed anycast
stability as stated in our thesis statement. However, we found a few VPs, about 1%, that see frequent route
changes and so are anycast unstable. We showed that anycast instability in these VPs is usually "sticky",
persisting over a week of study. Fortunately, the fact that most unstable VPs are affected by only one or two services
suggests that the causes of instability lie somewhere in the middle of the routing path. By launching more frequent
requests from the unstable VPs discovered in the earlier analysis, we captured very frequent routing changes
(back and forth within 10 s); statistical analysis shows these VPs are likely
affected by per-packet flipping, potentially caused by a load balancer in the path. We also conducted
experiments using the same sources and targets but with TCP connections, and found that TCP anycast instability
is even rarer, although it exists and is harmful. This finding confirmed anycast stability in TCP connections.
Our results confirm that anycast generally works well with good stability, but for a specific
service there might be a few users whose routing is never stable.
In Chapter 3, we developed new methods to detect overt DNS spoofing and some covert delayers, and
to identify and classify parties carrying out overt spoofing. In our evaluation of about six years of spoofing
at the DNS Root, we showed that spoofing is quite rare, affecting only about 1.7% of VPs. This finding
confirmed anycast security on the subject of DNS spoofing as stated in the thesis statement. However,
spoofing is increasing, growing by more than a factor of two over more than six years. We also showed that spoofing is
global, although more common in some countries. By validating against logs of the authoritative server B-root,
we showed that our detection method has a true-positive rate of at least 0.96. Finally, we showed that proxies
are a more common method of spoofing today than DNS injection and third-party anycast. We drew two
recommendations from our work. First, based on the growth of spoofing, we recommend that operators
regularly look for DNS spoofing. Second, interested end-users may wish to watch for spoofing using our
approach.
In Chapter 4, we showed that BAUP allows general anycast infrastructures such as anycast-based CDNs and DNS to
detect opportunities to improve latency to their clients caused by congested links and routing detours. By com-
paring the route taken towards a CDN and a root-DNS infrastructure, for both anycast and unicast addresses,
BAUP detects opportunities to improve latency, and with bidirectional traceroutes we observe slow hops and
circuitous paths. We showed that these observations allow BAUP to identify opportunities for improvement.
The development and use of BAUP confirmed that we are able to improve anycast latency as stated in the
thesis statement. Working with a CDN operator, we showed that changes identified by BAUP halved latency
for some VPs, affecting 91 ASes in 19 countries with more than 100k users. Since Internet routing is always
changing, we suggest that BAUP should be used to test anycast deployments regularly. It can help debug
performance problems and detect regressions.
This thesis shows that anycast can serve both CDNs and DNS well. First, for CDNs, we see that without
the extra cost of traffic management to direct users to sites, anycast is a good fit and is popular for new and even small
CDNs. Our work has shown that anycast can provide stable service. We also provide a tool, BAUP, for
reducing latency in anycast infrastructures. The development of BAUP shows that there are methods to
tune an anycast infrastructure to provide better service. Second, we see that the current anycast-based DNS
root letters are only uncommonly spoofed. However, we provide a longitudinal analysis of DNS spoofing,
which shows spoofing has been increasing over the years and occurs globally. DNS-related protocols are
not perfect. Attackers, organizations, and ISPs will keep using the weak spots of DNS for their own benefit.
Researchers should keep studying the mitigation of DNS spoofing and how to improve Internet security
in general.
Bibliography
[1] ABC. Australia bans 220 video games in 4 months as government adopts new classifica-
tion model. http://www.abc.net.au/news/2015-06-30/australia-bans-220-video-games-in-
four-months/6582100. 2015-06-30.
[2] ABC. Internet companies forced to block The Pirate Bay, bittorrent websites in Australia, Fed-
eral Court rules. http://www.abc.net.au/news/2016-12-15/federal-court-orders-pirate-
bay-blocked-in-australia/8116912. 2016-12-15.
[3] Akamai-Content Distribution Network. https://www.akamai.com/us/en/resources/content-
distribution-network.jsp, January 2020.
[4] ANT/ISI. https://ant.isi.edu/software/cryptopANT/index.html, July 2018.
[5] APNIC. BGP-stats routing table report—japan view. https://mailman.apnic.net/mailing-
lists/bgp-stats/archive/2020/02/msg00085.html, Feb. 13 2020.
[6] APNIC. Reducing latency at CDNs with bidirectional anycast/unicast probing.
https://blog.apnic.net/2020/08/06/reducing-latency-at-cdns-with-bidirectional-anycast-unicast-probing/,
2020.
[7] R. Austein. DNS name server identifier (NSID) option. RFC 5001, Internet Request For Comments,
August 2007.
[8] Rob Austein. DNS Name Server Identifier (NSID) Option. RFC 5001, August 2007.
[9] B-root. B-root server logs. Contact B-root operators.
[10] Vaibhav Bajpai, Steffen Jacob Eravuchira, and Jürgen Schönwälder. Lessons learned from using the
RIPE Atlas platform for measurement research. ACM SIGCOMM Computer Communication Review,
45(3):35–42, 2015.
[11] R. Bian, S. Hao, H. Wang, A. Dhamdhere, A. Dainotti, and C. Cotton. Towards Passive Analysis
of Anycast in Global Routing: Unintended Impact of Remote Peering. ACM SIGCOMM Computer
Communication Review (CCR), 49(3), Jul 2019.
[12] Palsson Bret, Kumar Prashanth, Jafferali Samir, and Ali Kahn Zaid. TCP over IP anycast - pipe dream
or reality? https://engineering.linkedin.com/network-performance/tcp-over-ip-anycast-
pipe-dream-or-reality, September 2010.
[13] CacheFly Network Map. https://web1.cachefly.net/assets/network-map.html, April 2017.
[14] Matthew Caesar and Jennifer Rexford. BGP routing policies in ISP networks. IEEE Network Maga-
zine, 19(6):5–11, November 2005.
[15] Matt Calder, Xun Fan, Zi Hu, Ethan Katz-Bassett, John Heidemann, and Ramesh Govindan. Mapping
the expansion of Google’s serving infrastructure. In Proceedings of the ACM Internet Measurement
Conference, pages 313–326, Barcelona, Spain, October 2013. ACM.
[16] Matt Calder, Ashley Flavel, Ethan Katz-Bassett, Ratul Mahajan, and Jitendra Padhye. Analyzing
the Performance of an Anycast CDN. In Proceedings of the 2015 ACM Conference on Internet
Measurement Conference, pages 531–537. ACM, 2015.
[17] Yi-Ching Chiu, Brandon Schlinker, Abhishek Balaji Radhakrishnan, Ethan Katz-Bassett, and Ramesh
Govindan. Are we one hop away from a better Internet? In Proceedings of the 2015 Internet
Measurement Conference, pages 523–529, 2015.
[18] Taejoong Chung, Roland van Rijswijk-Deij, Balakrishnan Chandrasekaran, David Choffnes, Dave
Levin, Bruce M Maggs, Alan Mislove, and Christo Wilson. A Longitudinal, End-to-End View of the
DNSSEC Ecosystem. In 26th USENIX Security Symposium (USENIX Security 17), pages 1307–1322,
2017.
[19] Danilo Cicalese, Danilo Giordano, Alessandro Finamore, Marco Mellia, Maurizio Munafò, Dario
Rossi, and Diana Joumblatt. A First Look at Anycast CDN Traffic. arXiv preprint arXiv:1505.00946,
2015.
[20] CloudFare. https://www.cloudflare.com, October 2019.
[21] David R. Conrad and Suzanne Woolf. Requirements for a Mechanism Identifying a Name Server
Instance. RFC 4892, June 2007.
[22] Steve Crocker, David Dagon, Dan Kaminsky, Danny McPherson, and Paul Vixie. Security and other
technical concerns raised by the DNS filtering requirements in the PROTECT IP bill. Technical
report, RedBarn, May 2011.
[23] Team Cymru. SOHO pharming. Technical report, Team Cymru, February 2014.
[24] Alberto Dainotti, Claudio Squarcella, Emile Aben, Kimberly C. Claffy, Marco Chiesa, Michele
Russo, and Antonio Pescapé. Analysis of Country-wide Internet Outages Caused by Censorship.
In Proceedings of the 2011 ACM SIGCOMM Conference on Internet Measurement Conference, IMC
’11, pages 1–18, New York, NY , USA, 2011. ACM.
[25] Ransi Nilaksha De Silva, Wei Cheng, Wei Tsang Ooi, and Shengdong Zhao. Towards understanding
user tolerance to network latency and data rate in remote viewing of progressive meshes. In Proceed-
ings of the 20th international workshop on Network and operating systems support for digital audio
and video, pages 123–128, 2010.
[26] Wouter de Vries, José Jair Santanna, Anna Sperotto, and Aiko Pras. How asymmetric is the Internet?
In IFIP International Conference on Autonomous Infrastructure, Management and Security, pages
113–125. Springer, 2015.
[27] Wouter B. de Vries, Ricardo de O. Schmidt, Wes Hardaker, John Heidemann, Pieter-Tjerk de Boer,
and Aiko Pras. Verfploeter: Broad and load-aware anycast mapping. In Proceedings of the ACM
Internet Measurement Conference, London, UK, 2017.
[28] Jeffrey Dean and Luiz André Barroso. The tail at scale. Communications of the ACM, 56(2):74–80,
February 2013.
[29] Amogh Dhamdhere, David D Clark, Alexander Gamero-Garrido, Matthew Luckie, Ricky KP Mok,
Gautam Akiwate, Kabir Gogia, Vaibhav Bajpai, Alex C Snoeren, and Kc Claffy. Inferring persistent
interdomain congestion. In Proceedings of the 2018 Conference of the ACM Special Interest Group
on Data Communication, pages 1–15. ACM, 2018.
[30] Haixin Duan, Nicholas Weaver, Zongxu Zhao, Meng Hu, Jinjin Liang, Jian Jiang, Kang Li, and Vern
Paxson. Hold-on: Protecting against on-path DNS poisoning. In Proceedings of the Workshop on
Securing and Trusting Internet Names (SATIN), Teddington, UK, March 2012.
[31] D. Eastlake. Domain Name System security extensions. RFC 2535, Internet Request For Comments,
March 1999. Obsoleted by RFC 4033.
[32] Edgecast. https://www.verizondigitalmedia.com, October 2019.
[33] Roya Ensafi, David Fifield, Philipp Winter, Nick Feamster, Nicholas Weaver, and Vern Paxson. Ex-
amining How the Great Firewall Discovers Hidden Circumvention Servers. In Proceedings of the
2015 Internet Measurement Conference, IMC ’15, pages 445–458, New York, NY , USA, 2015. ACM.
[34] Facebook. www.fbcdn.com, August 2020.
[35] Jinliang Fan, Jun Xu, Mostafa H. Ammar, and Sue B. Moon. Prefix-preserving IP address anonymiza-
tion: measurement-based security evaluation and a new cryptography-based scheme. Computer Net-
works, 46(2):253 – 272, 2004.
[36] Xun Fan, John Heidemann, and Ramesh Govindan. Evaluating anycast in the domain name system.
In INFOCOM, 2013 Proceedings IEEE, pages 1681–1689. IEEE, 2013.
[37] Xun Fan, John Heidemann, and Ramesh Govindan. Evaluating anycast in the Domain Name System.
In 2013 Proceedings IEEE INFOCOM, pages 1681–1689, Turin, Italy, April 2013. IEEE.
[38] S. Farrell and H. Tschofenig. Pervasive monitoring is an attack. RFC 7758, Internet Request For
Comments, May 2014. (also Internet BCP-188).
[39] Ashley Flavel, Pradeepkumar Mani, David A. Maltz, Nick Holt, Jie Liu, Yingying Chen, and Oleg
Surmachev. FastRoute: A scalable load-aware anycast routing architecture for modern CDNs. In
Proceedings of the USENIX Symposium on Network Systems Design and Implementation, Oakland,
CA, USA, May 2015. USENIX.
[40] Marina Fomenkov, Kimberly C Claffy, Bradley Huffaker, and David Moore. Macroscopic Internet
Topology and Performance Measurements from the DNS Root Name Servers. In LISA, pages 231–
240, 2001.
[41] Romain Fontugne, Cristel Pelsser, Emile Aben, and Randy Bush. Pinpointing delay and forwarding
anomalies using large-scale traceroute measurements. In Proceedings of the 2017 Internet Mea-
surement Conference, IMC '17, pages 15–28, New York, NY, USA, 2017. Association for Computing
Machinery.
[42] Phillipa Gill, Masashi Crete-Nishihata, Jakub Dalek, Sharon Goldberg, Adam Senft, and Greg Wise-
man. Characterizing Web Censorship Worldwide: Another Look at the OpenNet Initiative Data. ACM
Transactions on the Web, 9(1), January 2015.
[43] Danilo Giordano, Danilo Cicalese, Alessandro Finamore, Marco Mellia, Maurizio Munafò, Diana
Zeaiter Joumblatt, and Dario Rossi. A first characterization of anycast traffic from passive traces.
In Proceedings of the IFIP Traffic Monitoring and Analysis Workshop (TMA), 2016.
[44] Ramesh Govindan and Hongsuda Tangmunarunkit. Heuristics for Internet map discovery. In Pro-
ceedings IEEE INFOCOM 2000. Conference on Computer Communications. Nineteenth Annual Joint
Conference of the IEEE Computer and Communications Societies (Cat. No. 00CH37064), pages
1371–1380, Tel Aviv, Israel, March 2000. IEEE.
[45] James Hiebert, Peter Boothe, Randy Bush, and Lucy Lynch. Determining the cause and frequency
of routing instability with anycast. In Proceedings of the Asian Internet Engineering Conference
(AINTEC), pages 172–185, Pathumthani, Thailand, November 2006. Springer-Verlag.
[46] Z. Hu, L. Zhu, J. Heidemann, A. Mankin, D. Wessels, and P. Hoffman. Specification for DNS over
Transport Layer Security (TLS). RFC 7858, Internet Request For Comments, May 2016.
[47] Yuchen Jin, Sundararajan Renganathan, Ganesh Ananthanarayanan, Junchen Jiang, Venkata N Pad-
manabhan, Manuel Schroder, Matt Calder, and Arvind Krishnamurthy. Zooming in on wide-area
latencies to a global cloud provider. In Proceedings of the ACM Special Interest Group on Data
Communication, pages 104–116. Association for Computing Machinery, 2019.
[48] Ben Jones, Nick Feamster, Vern Paxson, Nicholas Weaver, and Mark Allman. Detecting DNS root
manipulation. In International Conference on Passive and Active Network Measurement, pages 276–
288. Springer, 2016.
[49] Brian Krebs. A deep dive on the recent widespread DNS hijacking attacks. Krebs-on-Security
blog at https://krebsonsecurity.com/2019/02/a-deep-dive-on-the-recent-widespread-dns-
hijacking-attacks/, February 2019.
[50] Rupa Krishnan, Harsha V. Madhyastha, Sushant Jain, Sridhar Srinivasan, Arvind Krishnamurthy,
Thomas Anderson, and Jie Gao. Moving Beyond End-to-End Path Information to Optimize CDN
Performance. In Internet Measurement Conference (IMC), pages 190–201, Chicago, IL, 2009.
[51] Matt Levine, Barrett Lyon, and Todd Underwood. TCP anycast—don’t believe the FUD. Presentation
at NANOG 37, June 2006.
[52] Zhihao Li, Dave Levin, Neil Spring, and Bobby Bhattacharjee. Internet anycast: performance, prob-
lems, and potential. In SIGCOMM, pages 59–73, 2018.
[53] Baojun Liu, Chaoyi Lu, Haixin Duan, Ying Liu, Zhou Li, Shuang Hao, and Min Yang. Who is
answering my queries: Understanding and characterizing interception of the DNS resolution path. In
27th USENIX Security Symposium (USENIX Security 18), pages 1113–1128, 2018.
[54] Ziqian Liu, Bradley Huffaker, Marina Fomenkov, Nevil Brownlee, et al. Two days in the life of the
DNS anycast root servers. In International Conference on Passive and Active Network Measurement,
pages 125–134. Springer, 2007.
[55] Chaoyi Lu, Baojun Liu, Zhou Li, Shuang Hao, Haixin Duan, Mingming Zhang, Chunying Leng,
Ying Liu, Zaifeng Zhang, and Jianping Wu. An End-to-End, Large-Scale Measurement of DNS-over-
Encryption: How Far Have We Come? In Proceedings of the Internet Measurement Conference, IMC
'19, pages 22–35, New York, NY, USA, 2019. Association for Computing Machinery.
[56] Harsha V Madhyastha, Thomas Anderson, Arvind Krishnamurthy, Neil Spring, and Arun Venkatara-
mani. A structural approach to latency prediction. In Proceedings of the 6th ACM SIGCOMM con-
ference on Internet measurement, pages 99–104. ACM, 2006.
[57] Harsha V Madhyastha, Tomas Isdal, Michael Piatek, Colin Dixon, Thomas Anderson, Arvind Kr-
ishnamurthy, and Arun Venkataramani. iPlane: An information plane for distributed services. In
Proceedings of the 7th symposium on Operating systems design and implementation, pages 367–380.
USENIX Association, 2006.
[58] Alexander Marder, Matthew Luckie, Amogh Dhamdhere, Bradley Huffaker, kc claffy, and
Jonathan M. Smith. Pushing the boundaries with bdrmapIT: Mapping router ownership at Internet
scale. In Proceedings of the ACM Internet Measurement Conference, Boston, Massachusetts, USA,
October 2018. ACM.
[59] Cade Metz. Comcast trials (domain helper service) DNS hijacker. The Register, July 2009.
[60] Paul V Mockapetris. Domain names-implementation and specification. RFC1035, Internet Engineer-
ing Task Force, 1987.
[61] Giovane Moura, Ricardo de O Schmidt, John Heidemann, Wouter B de Vries, Moritz Muller, Lan
Wei, and Cristian Hesselman. Anycast vs. DDoS: Evaluating the November 2015 root DNS event. In
Proceedings of the 2016 Internet Measurement Conference, pages 255–270. ACM, 2016.
[62] Giovane C. M. Moura, Ricardo de O. Schmidt, John Heidemann, Wouter B. de Vries, Moritz Müller,
Lan Wei, and Cristian Hesselman. Anycast vs. DDoS: Evaluating the November 2015 root DNS
event. In Proceedings of the ACM Internet Measurement Conference, November 2016.
[63] Giovane C. M. Moura, Ricardo de O. Schmidt, John Heidemann, Wouter B. de Vries, Moritz Müller,
Lan Wei, and Cristian Hesselman. Anycast vs. DDoS: Evaluating the November 2015 root DNS
event. In Proceedings of the ACM Internet Measurement Conference, November 2016.
[64] Gabi Nakibly, Jaime Schcolnik, and Yossi Rubin. Website-Targeted False Content Injection by Net-
work Operators. In 25th USENIX Security Symposium (USENIX Security 16), pages 227–244, Austin,
TX, 2016. USENIX Association.
[65] Erik Nygren, Ramesh K Sitaraman, and Jennifer Sun. The Akamai network: a platform for high-
performance Internet applications. ACM SIGOPS Operating Systems Review, 44(3):2–19, 2010.
[66] Paul Pearce, Ben Jones, Frank Li, Roya Ensafi, Nick Feamster, Nick Weaver, and Vern Paxson.
Global measurement of DNS manipulation. In 26th USENIX Security Symposium (USENIX Security
17), pages 307–323, Vancouver, BC, 2017. USENIX Association.
[67] Cristel Pelsser, Luca Cittadini, Stefano Vissicchio, and Randy Bush. From Paris to Tokyo: On the
suitability of ping to measure latency. In Proceedings of the ACM Internet Measurement Conference,
Barcelona, Spain, October 2013. ACM.
[68] Joel Purra and Tom Cuddy. DNSSEC Name-and-Shame. website https://dnssec-name-and-
shame.com, 2014.
[69] Y . Rekhter, B. Moskowitz, D. Karrenberg, G. J. de Groot, and E. Lear. Address Allocation for Private
Internets. RFC 1918, Internet Request For Comments, February 1996.
[70] RIPE Atlas. https://atlas.ripe.net/, May 2020.
[71] RIPE NCC. RIPE Atlas root server dns data. https://atlas.ripe.net/measurements/ID. ID is
the per-root-letter experiment ID: A: 10309, B: 10310, C: 10311, D: 10312, E: 10313, F:10304, G:
10314, H: 10315, I: 10305, J: 10316, K: 10301, L: 10308, M: 10306.
[72] RIPE NCC. RIPE Atlas root server ping data. https://atlas.ripe.net/measurements/ID. ID is
the per-root-letter experiment ID: A: 1009, B: 1010, C: 1011, D: 1012, E: 1013, F: 1004, G: 1014, H:
1015, I: 1005, J: 1016, K: 1001, L: 1008, M: 1006.
[73] RIPE NCC. RIPE Atlas root server traceroute data. https://atlas.ripe.net/measurements/ID.
ID is the per-root-letter experiment ID: A: 5109, B: 5010, C: 5011, D: 5012, E: 5013, F: 5004, G:
5014, H: 5015, I: 5005, J: 5016, K: 5001, L: 5008, M: 5006.
[74] RIPE NCC. DNSMON. https://atlas.ripe.net/dnsmon/, 2015.
[75] RIPE NCC. RIPE Atlas root server data. https://atlas.ripe.net/measurements/ID, 2015. ID is
the per-root-letter experiment ID: A: 10309, B: 10310, C: 10311, D: 10312, E: 10313, F:10304, G:
10314, H: 10315, I: 10305, J: 10316, K: 10301, L: 10308, M: 10306.
[76] RIPE NCC. RIPE Atlas self tcp measurement 1. https://atlas.ripe.net/measurements/ID, 2017.
ID is the per-root-letter TCP experiment ID: A: 9177091, B: 9177098, C: 9177102, D: 9177106, E:
9177109, F:9177113, G: 9177116, I: 9177122, J: 9177126, K: 9177130, L: 9177135, M: 9177140.
[77] RIPE NCC. RIPE Atlas self tcp measurement 2. https://atlas.ripe.net/measurements/ID, 2017.
ID is the per-root-letter TCP experiment ID: A: 9203157, B: 9203162, C: 9203163, D: 9203164, E:
9203165, F:9203166, G: 9203167, I: 9203168, J: 9203158, K: 9203159, L: 9203160, M: 9203161.
[78] RIPE NCC. RIPE Atlas self traceroute measurement. https://atlas.ripe.net/measurements/ID,
2017. ID is the per-root-letter traceroute record ID: A: 5009, B: 5010, C: 5011, D: 5012, E: 5013,
F:5004, G: 5014, I: 5005, J: 5016, K: 5001, L: 5008, M: 5006.
[79] RIPE NCC. RIPE Atlas self udp measurement. https://atlas.ripe.net/measurements/ID,
2017. ID is the per-frequency UDP experiment ID: 60s : 7788714, 70s : 7788717, 80s: 7788718,
90s:7788729.
[80] RIPE NCC. RIPE Atlas measurement about vdms. https://atlas.ripe.net/measurements/ID,
2019. ID: towards anycast: 22385847, towards unicast: contact the paper authors for RIPE experi-
ment IDs.
[81] Root Operators. http://www.root-servers.org, April 2016.
[82] Root Operators. http://www.root-servers.org, April 2019.
[83] Root Operators. http://www.root-servers.org, February 2020.
[84] Route Views. http://archive.routeviews.org/, July 2019.
[85] RouteViews. routeviews2, routeviews3, routeviews4. http://bgpmon.io/archive/help, August
2016.
[86] Sandvine. Sandvine global Internet phenomena report, September 2019.
[87] Ricardo de O. Schmidt, John Heidemann, and Jan Harm Kuipers. Anycast latency: How many sites
are enough? In Proceedings of the Passive and Active Measurement Workshop, page to appear,
Sydney, Australia, May 2017. Springer.
[88] Kyle Schomp, Tom Callahan, Michael Rabinovich, and Mark Allman. Assessing DNS vulnerability
to record injection. In International Conference on Passive and Active Network Measurement, pages
214–223. Springer, 2014.
[89] Puneet Sharma, Zhichen Xu, Sujata Banerjee, and Sung-Ju Lee. Estimating network proximity and
latency. ACM SIGCOMM Computer Communication Review, 36(3):39–50, 2006.
[90] Sooel Son and Vitaly Shmatikov. The hitchhiker’s guide to DNS cache poisoning. In International
Conference on Security and Privacy in Communication Systems, pages 466–483. Springer, 2010.
[91] Douglas B. Terry, Mark Painter, David W. Riggle, and Songnian Zhou. The Berkeley Internet Name
Domain Server. Technical Report UCB/CSD-84-182, EECS Department, University of California,
Berkeley, May 1984.
[92] Thomas Vissers, Timothy Barron, Tom Van Goethem, Wouter Joosen, and Nick Nikiforakis. The
wolf of name street: Hijacking domains through their nameservers. In Proceedings of the 2017 ACM
SIGSAC Conference on Computer and Communications Security, pages 957–970. ACM, 2017.
[93] Nicholas Weaver, Robin Sommer, and Vern Paxson. Detecting Forged TCP Reset Packets. In NDSS,
2009.
[94] Lan Wei, Marcel Flores, Harkeerat Bedi, and John Heidemann. Bidirectional Anycast/Unicast Prob-
ing (BAUP): Optimizing CDN Anycast. In 2020 Network Traffic Measurement and Analysis Confer-
ence (TMA), pages 1–9. IFIP, 2020.
[95] Lan Wei and John Heidemann. Does anycast hang up on you? In IEEE International Network Traffic
Measurement and Analysis Conference, Dublin, Ireland, 2017.
[96] Lan Wei and John Heidemann. Does anycast hang up on you (udp and tcp)? IEEE Transactions on
Network and Service Management, 15(2):707–717, 2018.
[97] Lan Wei and John Heidemann. Whac-A-Mole: Six Years of DNS Spoofing. arXiv preprint,
https://arxiv.org/pdf/2011.12978.pdf, 2020.
[98] Duane Wessels. About DNS Spoofing. Private communication, April 2019.
[99] S. Woolf and D. Conrad. Requirements for a mechanism identifying a name server instance. RFC
4892, Internet Request For Comments, June 2007.
[100] Liang Zhu, Zi Hu, John Heidemann, Duane Wessels, Allison Mankin, and Nikita Somaiya.
Connection-oriented DNS to improve privacy and security. In Proceedings of the 36th IEEE Sympo-
sium on Security and Privacy, pages 171–186, San Jose, Californa, USA, May 2015. IEEE.