MITIGATING ATTACKS THAT DISRUPT ONLINE SERVICES
WITHOUT CHANGING EXISTING PROTOCOLS
by
ASM Rizvi
A Dissertation Presented to the
FACULTY OF THE USC GRADUATE SCHOOL
UNIVERSITY OF SOUTHERN CALIFORNIA
In Partial Fulfillment of the
Requirements for the Degree
DOCTOR OF PHILOSOPHY
(COMPUTER SCIENCE)
August 2024
Copyright 2024 ASM Rizvi
Dedication
To my daughter Ruzaina Afroz Rizvi
To my wife Dr. Sabrina Afroz, M.D.
To my mother Saleha Khatoon
To my father Mohammad Abdur Razzaque
To my elder brother Professor Raisul Islam, and
To my younger brother Dr. ASM Iftekhar
Acknowledgements
When I look back on my PhD journey, I find many collaborators, faculty members, friends, and family
members who supported me throughout this journey. I want to express my gratitude to every one of
them.
A huge thanks to my beloved wife, Dr. Sabrina Afroz, M.D., for everything you did, especially during
the difficult time of my ACL surgery. Thanks for being the biggest support system of my Ph.D. life, while
completing the difficult USMLE exams, and then while doing your residency with a newborn. You are the
most resilient person that I have ever seen in my life. Thanks to my little daughter, Ruzaina, for being
the lucky charm of our life. I could not give you as much time as you deserved, but I will make it up to you. Thanks to my mother, Saleha Khatoon, and father, Mohammad Abdur Razzaque, for their
continuous support and encouragement, especially for taking care of Ruzaina during the last phase of my
Ph.D. I am grateful to my elder brother, Professor Raisul Islam, for all the support, suggestions, and tutoring you have provided, going back to my high school days. You are the role model of my academic life. Thanks to my younger
brother, ASM Iftekhar, Ph.D., for being a great brother and a reliable person who will always be there for me
in every situation. I want to express my gratitude to my sister-in-law, Shamsunnaher Tanwi, for the help
in the last part of my Ph.D., by taking care of Ruzaina when we both were busy with our work. Thanks to
my sister-in-law, Anika Sharin, and Nishat Subah Peau, for making exciting plans during the holidays. I
want to thank my mother-in-law, Nahida Sultana, my father-in-law, Abdus Satter, and my brother-in-law,
Ashraful Islam, for their love, support, and affection. I am thankful to my sister Sabrina Karim, Ph.D., and
my brother-in-law Mohammad Rifat Haider, Ph.D., for their encouragement and suggestions during my
Ph.D. life.
I want to thank my advisor, Prof. John Heidemann, for his continuous support. From day one to the
final day of this thesis submission, I received advice, suggestions, and feedback from him. This thesis
would not be possible without his guidance and support. It was a rewarding journey where I could learn
many, many new things. Among many other things, I learned how to do methodological research and
write technical articles from him, which I can use for the rest of my life.
I would like to thank my committee members, Professor Bhaskar Krishnamachari, Professor Harsha
V. Madhayastha, and Professor Jelena Mirkovic, for being a part of my thesis committee and for their
feedback. I would also like to thank Professor Ramesh Govindan and Professor Barath Raghavan for being
a part of my thesis proposal committee and for their suggestions regarding my thesis direction.
During this PhD, I was lucky to have some great collaborators who made my journey easier. I am
grateful to Professor Jelena Mirkovic for her input in the anti-DDoS project. I also want to thank Wes
Hardaker and Robert Story for their support in the anti-DDoS filtering project. During my anycast project,
I had a great time working with Leandro Bertholdo from the University of Twente and Joao M. Ceron from
the SIDN lab. I want to thank Professor Ethan Katz-Bassett from Columbia University, Professor Italo
Cunha from UFMG, Brazil, and the whole PEERING testbed team for their feedback and help while using
PEERING testbed for my anycast project. I am grateful to Wouter De Vries from Cloudflare for assisting
me in running Verfploeter.
During my anycast polarization and mobile latency characterization project, I got enormous help from
multiple engineers of Akamai Technologies. For my anycast polarization project, I got unwavering support from Tingshan Huang, my internship manager at Akamai. I am grateful to Rasit Esrefoglu for his
collaboration and input in the anycast polarization project. I want to thank David Plonka, Philipp Richter,
Nic Jansma, and Arthur Berger from Akamai Technologies for their input and support in the mobile latency characterization project. I am grateful to Lincoln Lavoie from the University of New Hampshire and
Lincoln Thurlow and Geoff Lawler from ISI for their support in running experiments in different testbeds.
I want to thank Yuri Pradkin from ISI for all the support, suggestions, and technical help throughout this
Ph.D.
Thanks to all my fellow Ph.D. students at USC/ISI for their friendship, companionship, and feedback.
I want to thank Basileal Imana, Guillermo Baltra, Asma Enayet, Kicho Yu, Shefali Kulkarni, Xiao Song,
Abdul Qadeer, Calvin Ardi, Lan Wei, Hang Guo, Rajat Tendon, Fawad Ahmad, Sulagna Mukherjee, Becky
Pham, Sulafa Zidani, Nathan Bartley, and Liang Zhu for their companionship throughout this long journey.
I want to thank the USC Bangladeshi community for their friendship, help, and support throughout
my Ph.D. Special thanks to Asma Enayet, Rafsan Hossain, Baishakhi Biswas, Professor Ratul Das, Tasnim
Fabiha, Sadid Khan, Tamim Ahmed, Orin Ahmed Lisa, Tasnim Pasha, Sarfaraz Alam, Ph.D., Mashnoon
Sakib, Samiha Karim, and Fazle Mohammed Tawsif from Belair apartments, for being there like a family.
I am hugely grateful to all my Soccer friends for coming to the Soccer field every weekend afternoon. I
am thankful to all my friends in the Cricket field as well, for keeping my Ph.D. life exciting. These people
absorbed my Ph.D. stress and energized me for the week’s work. Last but not least, I want to thank my
old friends—Shamir, Ifty, Anik, Rifat, Turjo, Abrar, Tawseef, Papel, Munna, Tawsif, Mohaimin, and Himel.
The studies in this thesis are supported by different organizations. I want to thank all these funding
organizations for supporting this thesis.
Our anti-DDoS study with anycast (Chapter 2) is supported, in part, by the DHS HSARPA Cyber Security Division via contract number HSHQDC-17-R-B0004-TTA.02-0006-I, by the Netherlands Organisation for Scientific Research (4019020199), and by the European Union's Horizon 2020 research and innovation program (830927). I am also grateful to the Peering and Tangled testbed admins who allowed us to run measurements. Special thanks to the Dutch National Scrubbing Center for sharing DDoS data with us.
Our anti-DDoS study with filtering (Chapter 3) is partially supported by the National Science Foundation (grant NSF OAC-1739034) and DHS HSARPA Cyber Security Division (grant HSHQDC-17-R-B0004-
TTA.02-0006-I), in collaboration with NWO.
Our moving target defense study (Chapter 4) was supported, in part, by the DHS HSARPA Cyber
Security Division via contract number HSHQDC-17-R-B0004-TTA.02-0006-I (PAADDoS), and by DARPA
under Contract No. HR001120C0157 (SABRES). Thanks to Rayner Pais, who prototyped an early version of Chhoyhopper and a version in IPv4 that hops over ports.
Our malicious detour detection work (Chapter 5 and Chapter 6) was partially supported by DARPA under Contract No. HR001120C0157 and by the NSF projects CNS-2319409, CRI-8115780, and CNS-1925737.
I am grateful to the anonymized CDN for sharing mobile latency data with us. Special thanks to the
University of New Hampshire 5G testbed admins for allowing and helping me to run measurements in
their testbed.
Our anycast polarization study (Chapter 7) was partially supported by DARPA under Contract No. HR001120C0157 and by the NSF projects CNS-2319409, CRI-8115780, and CNS-1925737. This work was begun while I was
on an internship at Akamai. I thank Akamai leadership, Akamai legal, and USC legal team for supporting
this collaboration.
Table of Contents
Dedication . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . ii
Acknowledgements . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . iii
List of Tables . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . xii
List of Figures . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . xiv
Abstract . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . xvii
Chapter 1: Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1
1.1 Thesis Statement . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2
1.2 Demonstrating the Thesis Statement . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3
1.2.1 Studies . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3
1.2.2 Studies applied to the thesis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4
1.3 Research Contributions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6
1.4 Thesis Organization . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7
Chapter 2: Mitigating DDoS Using Anycast . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8
2.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9
2.2 Related Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11
2.3 Anycast and BGP Background . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12
2.4 Threat Model . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13
2.4.1 Attackers . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 14
2.4.2 Target . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 14
2.5 Mechanisms to Defend Against DDoS . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 14
2.5.1 Overview and Decision Support . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 15
2.5.2 Measurement: Mapping Anycast . . . . . . . . . . . . . . . . . . . . . . . . . . . . 16
2.5.3 Verfploeter Mapping . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 17
2.5.4 Estimation of the Attack Size . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 17
2.5.5 Traffic Engineering as a Defense Strategy . . . . . . . . . . . . . . . . . . . . . . . 19
2.5.5.1 Traffic Engineering to Manage an Attack . . . . . . . . . . . . . . . . . . 20
2.5.5.2 Automatic Defense Selection . . . . . . . . . . . . . . . . . . . . . . . . . 22
2.5.5.3 Operator Assistance System . . . . . . . . . . . . . . . . . . . . . . . . . 23
2.6 Evaluation of Offered Load Estimation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 24
2.6.1 Case Studies . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 24
2.6.2 Case Studies: 2016-06-25 Event . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 26
2.6.3 Testbed Experiment . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 27
2.7 Evaluation Approach . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 28
2.7.1 Anycast Testbeds . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 28
2.7.2 Measuring Routing Changes . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 28
2.8 Traffic Engineering Coverage and Control . . . . . . . . . . . . . . . . . . . . . . . . . . . 29
2.8.1 Control With Path Prepending . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 30
2.8.1.1 Prepending coverage . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 30
2.8.1.2 Does prepending work? . . . . . . . . . . . . . . . . . . . . . . . . . . . . 30
2.8.1.3 What granularity does prepending provide? . . . . . . . . . . . . . . . . 33
2.8.2 Control with BGP Communities . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 34
2.8.2.1 Community string coverage . . . . . . . . . . . . . . . . . . . . . . . . . 35
2.8.2.2 At what granularity do community strings work? . . . . . . . . . . . . . 37
2.8.3 Control with Path Poisoning . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 38
2.8.3.1 Poisoning coverage . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 39
2.8.3.2 What granularity does poisoning provide? . . . . . . . . . . . . . . . . . 39
2.8.4 Playbook Construction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 41
2.8.5 Load Distribution . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 43
2.9 Deployment Stability and Constraints . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 44
2.9.1 Effects of Choice of Anycast Sites . . . . . . . . . . . . . . . . . . . . . . . . . . . . 44
2.9.1.1 Peering: A Small Site in Europe . . . . . . . . . . . . . . . . . . . . . . . 47
2.9.1.2 Peering: Sites in Nearby Location . . . . . . . . . . . . . . . . . . . . . 48
2.9.2 Effects of Number of Anycast Sites . . . . . . . . . . . . . . . . . . . . . . . . . . . 48
2.9.2.1 More Sites in Tangled . . . . . . . . . . . . . . . . . . . . . . . . . . . . 52
2.9.3 Playbook Stability Over Time . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 53
2.10 Defenses at Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 54
2.11 Limitations and Future Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 59
2.12 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 60
Chapter 3: Mitigating DDoS Using Filtering . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 62
3.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 64
3.2 Background: DNS and DDoS . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 66
3.2.1 DNS Root Traffic . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 67
3.2.2 The DNS Root and DDoS . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 68
3.3 Related Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 69
3.3.1 Flash-Crowd DDoS Defenses . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 69
3.3.2 Spoofed Traffic Filtering . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 70
3.3.3 DDoS on DNS . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 71
3.4 DDiDD Design . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 72
3.4.1 Threat Model . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 72
3.4.2 DDiDD Operation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 73
3.4.2.1 Attack detection . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 74
3.4.2.2 Filter priming and selection . . . . . . . . . . . . . . . . . . . . . . . . . 76
3.4.3 DDiDD Filters . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 76
3.4.3.1 Frequent query name filter (FQ) . . . . . . . . . . . . . . . . . . . . . . . 77
3.4.3.2 Unknown recursive filter (UR) . . . . . . . . . . . . . . . . . . . . . . . . 78
3.4.3.3 Hop count filter (HC) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 79
3.4.3.4 Wild recursive filter (WR) . . . . . . . . . . . . . . . . . . . . . . . . . . 80
3.4.3.5 Response code filter (RC) . . . . . . . . . . . . . . . . . . . . . . . . . . . 81
3.4.3.6 Aggressive recursive filter (AR) . . . . . . . . . . . . . . . . . . . . . . . 82
3.4.4 Parameter Validation for the Unknown Resolver Filter . . . . . . . . . . . . . . . . 82
3.4.5 Filter Selection and Synchronization . . . . . . . . . . . . . . . . . . . . . . . . . . 85
3.5 Evaluation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 87
3.5.1 Datasets . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 87
3.5.2 Metrics . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 89
3.5.3 DDiDD Performance . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 89
3.5.4 Impacts On Resource Consumption . . . . . . . . . . . . . . . . . . . . . . . . . . . 91
3.5.4.1 Experimental setup . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 92
3.6 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 95
Chapter 4: A Moving Target Defense against Brute-Force Attacks Using IPv6 . . . . . . . . . 96
4.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 97
4.2 Background . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 99
4.3 Threat Model . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 99
4.3.1 Attackers . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 100
4.3.2 Service . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 100
4.4 Prior Related Studies . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 100
4.5 Chhoyhopper Design . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 101
4.5.1 Design Requirements . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 101
4.5.2 Design overview . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 103
4.5.3 Address hopping pattern . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 104
4.5.4 Server-side hopping and connection persistence . . . . . . . . . . . . . . . . . . . . 105
4.5.5 Client discovery of the hopping address in SSH . . . . . . . . . . . . . . . . . . . . 106
4.5.6 Challenges with HTTPS . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 106
4.5.7 Server-side certificate handling with hopping HTTPS . . . . . . . . . . . . . . . . . 107
4.5.8 Client discovery of the hopping address in HTTPS . . . . . . . . . . . . . . . . . . 109
4.6 Example Use . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 110
4.6.1 SSH . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 110
4.6.2 HTTPS . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 112
4.7 Analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 113
4.7.1 Risk of Discovery . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 113
4.7.2 Risk of Collisions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 115
4.7.3 Run-time Costs . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 115
4.8 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 116
Chapter 5: Third-Party Assessment of Mobile Performance in 4G and 5G Networks . . . . 118
5.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 119
5.2 Related Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 120
5.3 Architectural Considerations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 121
5.4 Data Sources And Measurements . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 122
5.4.1 CDN HTTP Statistics . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 122
5.4.1.1 CDN Logs from Server Side . . . . . . . . . . . . . . . . . . . . . . . . . 123
5.4.1.2 CDN Logs from Client Side . . . . . . . . . . . . . . . . . . . . . . . . . . 124
5.4.2 UE-based Measurement . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 125
5.5 Methodology: Identifying Mobile Devices and Stability Analysis . . . . . . . . . . . . . . . 125
5.5.1 Identifying Mobile UE from IPv6 Addresses . . . . . . . . . . . . . . . . . . . . . . 125
5.5.1.1 Carrier Labels . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 126
5.5.1.2 Geolocation Labels . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 127
5.5.1.3 Access Network Technology . . . . . . . . . . . . . . . . . . . . . . . . . 127
5.5.1.4 Apparent HTTP(S) Proxying . . . . . . . . . . . . . . . . . . . . . . . . . 128
5.5.2 Distinguishing 4G and 5G . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 128
5.5.3 Measuring Latency Stability . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 129
5.6 End-to-End Results: Latency, Throughput, and Stability . . . . . . . . . . . . . . . . . . . . 130
5.6.1 How Low is the Latency? . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 130
5.6.2 How Good is Throughput? . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 134
5.6.3 Can We Distinguish 4G and 5G? . . . . . . . . . . . . . . . . . . . . . . . . . . . . 136
5.6.4 How Stable is Latency? . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 138
5.6.4.1 Evaluating Latency Stability . . . . . . . . . . . . . . . . . . . . . . . . . 138
5.6.4.2 Stability over Three Weeks . . . . . . . . . . . . . . . . . . . . . . . . . . 140
5.7 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 141
Appendix 5.A Ethical Considerations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 142
Chapter 6: Finding Malicious Routing Detours . . . . . . . . . . . . . . . . . . . . . . . . . . . 143
6.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 144
6.2 Problem Statement and Threat Model . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 146
6.3 Related Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 147
6.4 Data Sources And Measurements . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 149
6.5 Detecting Routing Detours . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 150
6.5.1 Initializing Landmarks . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 150
6.5.2 Measurements to Landmarks . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 151
6.5.3 Varying Window Size . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 152
6.5.4 Learning Baseline . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 153
6.5.5 Using the Baseline to Select Good Landmarks and Window Size . . . . . . . . . . . 154
6.5.6 Detection Methodology . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 155
6.6 Confirming Detour Detection is Possible . . . . . . . . . . . . . . . . . . . . . . . . . . . . 156
6.6.1 Confirming Stability within Carrier Network . . . . . . . . . . . . . . . . . . . . . 156
6.6.1.1 Stability to the edge router . . . . . . . . . . . . . . . . . . . . . . . . . . 157
6.6.1.2 Can We Observe Internal Hops? . . . . . . . . . . . . . . . . . . . . . . . 158
6.6.2 Confirming End-to-End Latency Stability . . . . . . . . . . . . . . . . . . . . . . . 160
6.7 Evaluating Parameters . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 163
6.7.1 Evaluating False Positives and Accuracy . . . . . . . . . . . . . . . . . . . . . . . . 163
6.7.1.1 k value to avoid false positives during normal traffic . . . . . . . . . . . 164
6.7.1.2 k value to detect smaller detours . . . . . . . . . . . . . . . . . . . . . . . 164
6.7.2 Evaluating Window Size . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 165
6.7.2.1 Window size to get stable baseline . . . . . . . . . . . . . . . . . . . . . . 166
6.7.2.2 Variable window size makes bypassing harder for attackers . . . . . . . 168
6.8 Evaluating Detour Detection . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 169
6.8.1 Testbed Setup . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 170
6.8.2 Testbed Scenarios and Evaluation . . . . . . . . . . . . . . . . . . . . . . . . . . . . 170
6.8.3 Efficacy when Destinations are at Different Distances . . . . . . . . . . . . . . . . . 172
6.8.4 Detection Success as Detour Distance Varies . . . . . . . . . . . . . . . . . . . . . . 173
6.8.5 Detection Success based on Jitter . . . . . . . . . . . . . . . . . . . . . . . . . . . . 174
6.9 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 174
Appendix 6.A List of Landmarks . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 176
Chapter 7: Anycast Polarization in The Wild Internet . . . . . . . . . . . . . . . . . . . . . . . 177
7.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 178
7.2 Related Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 179
7.3 Defining Anycast Polarization and its Root Causes . . . . . . . . . . . . . . . . . . . . . . . 180
7.3.1 Defining Polarization . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 181
7.3.2 The Multi-PoP Backbone Problem . . . . . . . . . . . . . . . . . . . . . . . . . . . . 181
7.3.3 The Leaking Regional Routes Problem . . . . . . . . . . . . . . . . . . . . . . . . . 184
7.4 Detecting and Classing Polarization in the Wild . . . . . . . . . . . . . . . . . . . . . . . . 185
7.4.1 Discovering Anycast Services . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 185
7.4.2 Finding Potential Polarization . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 186
7.4.3 Finding Root Causes . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 186
7.4.4 Finding Impacts . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 188
7.5 Measurement Results and Impacts of Polarization . . . . . . . . . . . . . . . . . . . . . . . 188
7.5.1 Detecting Polarization in Anycast Services . . . . . . . . . . . . . . . . . . . . . . . 189
7.5.2 Detecting Root Causes . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 189
7.5.3 Impacts of the Number of Sites on Polarization . . . . . . . . . . . . . . . . . . . . 191
7.5.4 Impacts of multi-pop backbone problems . . . . . . . . . . . . . . . . . . . . . . . . 193
7.5.4.1 Incomplete Tier-1 connections in Anon-CDN-2 . . . . . . . . . . . . . . . 193
7.5.4.2 Multiple incomplete Tier-1 in Anon-Cloud-1 . . . . . . . . . . . . . . . 194
7.5.4.3 Incomplete Tier-1: peers working as transits in Anon-CDN-6 . . . . . . . 194
7.5.4.4 Exceptional incomplete Tier-1: incomplete inter-AS connections . . . . . 195
7.5.5 Impacts of Leaking Regional Problems . . . . . . . . . . . . . . . . . . . . . . . . . 196
7.5.5.1 Leaking by regional transits in Anon-CDN-1 . . . . . . . . . . . . . . . . 196
7.5.5.2 Leaking by regional transits in Anon-DNS-6 . . . . . . . . . . . . . . . . 197
7.5.5.3 Possible route leakage by regional AS in Anon-DNS-1 . . . . . . . . . . . 198
7.5.5.4 Route leakage by a regional AS in Anon-CDN-5 . . . . . . . . . . . . . . 198
7.5.5.5 Regional route leakage: a special case when organizations merge . . . . 199
7.5.6 Combination of Problems . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 200
7.6 Improvement by Routing Nudges . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 200
7.6.1 Anycast Configuration . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 201
7.6.2 Routing problems . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 201
7.6.3 Solving Problems . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 202
7.6.3.1 Changes in the catchments . . . . . . . . . . . . . . . . . . . . . . . . . . 204
7.6.3.2 Impacts over Performance . . . . . . . . . . . . . . . . . . . . . . . . . . 205
7.6.3.3 Community strings are important . . . . . . . . . . . . . . . . . . . . . . 207
7.7 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 208
Chapter 8: Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 210
8.1 Future Directions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 210
8.1.1 Future work related to our existing studies . . . . . . . . . . . . . . . . . . . . . . . 211
8.1.2 Potential future directions beyond our studies . . . . . . . . . . . . . . . . . . . . . 213
8.2 Conclusions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 214
Bibliography . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 218
List of Tables
1.1 Demonstrating thesis statement and corresponding studies . . . . . . . . . . . . . . . . . . 5
2.1 Estimating sizes of offered load . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 24
2.2 Testbed and respective sites used in our experiments. Transit providers (*) and IXP (†). . . 29
2.3 Experiment summarization and findings. . . . . . . . . . . . . . . . . . . . . . . . . . . . . 30
2.4 Traffic engineering options on each testbed sites. . . . . . . . . . . . . . . . . . . . . . . . 36
2.5 Policies and traffic distribution (in 10% bins) . . . . . . . . . . . . . . . . . . . . . . . . . . 41
2.6 Peering playbook (AMS, BOS, and CNF) . . . . . . . . . . . . . . . . . . . . . . . . . . . . 42
2.7 Load distribution with Peering catchment and B-Root load. . . . . . . . . . . . . . . . . . 43
2.8 Percent blocks in each catchment over time. . . . . . . . . . . . . . . . . . . . . . . . . . . 53
3.1 Filter parameters . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 77
3.2 DDiDD performance . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 88
5.1 IPv6 address pattern from server for different US carriers . . . . . . . . . . . . . . . . . . . 123
5.2 CDN dataset in numbers . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 123
5.3 Observing 4G and 5G network with respect to device type and network coverage . . . . . 128
5.4 Latency (ms) of the top clients in different countries . . . . . . . . . . . . . . . . . . . . . . 131
6.1 Latency stability to a fixed landmark . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 160
6.2 % of windows detecting false detours based on k . . . . . . . . . . . . . . . . . . . . . . . . 163
6.3 Impact of k in various detour scenarios . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 163
6.4 Detour detection in different scenarios . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 169
6.5 Landmark list. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 176
7.1 Detected polarization and inferred root causes. . . . . . . . . . . . . . . . . . . . . . . . . . 190
7.2 Top anycast services with potential polarization problems . . . . . . . . . . . . . . . . . . 190
7.3 Polarization in real-world anycast services . . . . . . . . . . . . . . . . . . . . . . . . . . . 192
7.4 Continent-wise improvement in latency by routing changes . . . . . . . . . . . . . . . . . 205
List of Figures
2.1 An example of a three-site Anycast deployment with possible catchments. . . . . . . . . . 13
2.2 Overview of the decision process. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 15
2.3 Overview of the Verfploeter approach (from [25]). . . . . . . . . . . . . . . . . . . . . . . . 18
2.4 TE techniques to shift traffic from Site-1 to Site-2. . . . . . . . . . . . . . . . . . . . . . . . 22
2.5 Operator assistance system. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 23
2.6 Estimating real-world attack events: estimating Nov. 2015 event with 5.59% access fraction. 25
2.7 Estimating real-world attack events: estimating June 2016 event with 0.91% access fraction. 26
2.8 Topology with two upstream providers. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 26
2.9 Peering: Impact of path prepending in catchment distribution with AMS, BOS and CNF . 31
2.10 Tangled: Effect of path prepending on catchments. . . . . . . . . . . . . . . . . . . . . . . 33
2.11 Peering: Community strings (at AMS) on catchments for AMS, BOS, CNF on 2020-02-25. 34
2.12 Tangled: using different communities to shift traffic on site LHR on 2020-04-05. . . . . . . 35
2.13 Peering: Impact of path poisoning (from AMS on 2021-04-09). . . . . . . . . . . . . . . . . 40
2.14 Tangled: Impact of path poisoning (from MIA on 2021-04-11). . . . . . . . . . . . . . . . . 40
2.15 Peering: Impact of choosing BOS, SEA and SLC sites on 2020-02-28 . . . . . . . . . . . . . 45
2.16 Peering: Impact of path prepending in catchment distribution with ATH, BOS and CNF . 49
2.17 Peering: Impact of path prepending in catchment distribution with BOS, ATL and MSN . 50
2.18 Peering: Impacts of changing the number of anycast sites from 2020-04-07 to 2020-04-10. 51
2.19 Tangled: Impacts of changing the number of anycast sites. . . . . . . . . . . . . . . . . . . 51
2.20 One month of catchment stability in B-Root . . . . . . . . . . . . . . . . . . . . . . . . . . 54
2.21 Different attacks with various responses. . . . . . . . . . . . . . . . . . . . . . . . . . . . . 56
3.1 Complementary CDF of the number of requests per hour sent to B-Root . . . . . . . . . . 68
3.2 CDF of source query rates, showing a wide range of rates. Data: 2015-11-29 . . . . . . . . 78
3.3 Pseudocode for filter selection . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 80
3.4 Rcode trend during normal and attack traffic in root A . . . . . . . . . . . . . . . . . . . . 81
3.5 Impact of the duration to build the resolver list . . . . . . . . . . . . . . . . . . . . . . . . . 83
3.6 CDF of new IP address with time . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 83
3.7 Impacts of the accept list creation time for 2017-03-06 event . . . . . . . . . . . . . . . . . 85
3.8 Swiss cheese model of defense . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 86
3.9 DDiDD evaluation for a synthetic polymorphic attack. . . . . . . . . . . . . . . . . . . . . . 92
3.10 Experimental setup and the interaction with our automated system . . . . . . . . . . . . . 92
3.11 Resource consumption comparison for 2017-03-06 event . . . . . . . . . . . . . . . . . . . 93
4.1 Client and server interaction in Chhoyhopper. . . . . . . . . . . . . . . . . . . . . . . . . . 103
4.2 Getting the rendezvous address. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 104
4.3 Server for HTTPS. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 107
4.4 Client-server interaction for SSH. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 111
4.5 Client-server interaction for HTTPS. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 114
5.1 5G architecture . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 121
5.2 Difference between TCP handshake RTT and minimum RTT from data-ACK . . . . . . . . 131
5.3 CDF of RTT (ms) in different countries . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 132
5.4 CDF of throughput . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 134
5.5 Latency observed from 4G and 5G devices . . . . . . . . . . . . . . . . . . . . . . . . . . . 136
5.6 Standard deviation among the minimum values . . . . . . . . . . . . . . . . . . . . . . . . 139
5.7 Standard deviation among the minimum values . . . . . . . . . . . . . . . . . . . . . . . . 140
5.8 Latency from one UE over three weeks. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 141
6.1 5G architecture and threat model . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 146
6.2 UNH testbed topology . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 156
6.3 UE to edge router latency . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 157
6.4 Number of IPv4 traceroute hops within carrier networks . . . . . . . . . . . . . . . . . . . 158
6.5 RTT (ms) measured in every 5 s to two different web pages . . . . . . . . . . . . . . . . . . 161
6.6 Evaluating how different window sizes result in different baseline . . . . . . . . . . . . . . 165
6.7 Variable window size increases the difficulty for the attackers . . . . . . . . . . . . . . . . 166
6.8 Percentage (%) of time windows that detect detour in different scenarios . . . . . . . . . . 175
7.1 Two scenarios of multi-pop backbone problems . . . . . . . . . . . . . . . . . . . . . . . . 182
7.2 Regional leakage problem . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 183
7.3 Steps to find polarization problems and root causes . . . . . . . . . . . . . . . . . . . . . . 185
7.4 Percent of anycast services that see polarization. . . . . . . . . . . . . . . . . . . . . . . . . 191
7.5 Anon-CDN-3: incomplete inter-AS connection . . . . . . . . . . . . . . . . . . . . . . . . . 195
7.6 Changes in Anycast catchment for an anycast site due to a routing change . . . . . . 203
7.7 CDF of all the VPs with respect to latency difference (ms) . . . . . . . . . . . . . . . . . . . 206
Abstract
Service disruption is undesirable in today's Internet because of its impact on enterprise profits, reputation, and user satisfaction. We define service disruption as any targeted interruption, caused by malicious parties, of regular user-to-service interactions and functionality that degrades service performance and user experience. In this thesis, we propose new methods that tackle service-disrupting attacks using measurement without changing existing Internet protocols. Although our methods do not guarantee defense against every attack type, our example defense systems show that our methods generally work to handle diverse attacks. To validate our thesis, we demonstrate defense systems against
three disruptive attack types. First, we mitigate Distributed Denial-of-Service (DDoS) attacks that target
an online service. Second, we handle brute-force password attacks that target the users of a service. Third,
we detect malicious routing detours to secure the path from the users to the server. We provide the first public description of DDoS defenses based on anycast and filtering for network operators. Then, we show the first moving target defense utilizing IPv6 to defeat password attacks. We also demonstrate how regular observation of latency helps cellular users, carriers, and national agencies find malicious routing detours. As a supplemental outcome, we show the effectiveness of measurement in finding performance issues and in improving performance using existing protocols. These examples show that our approach applies to different parts of the network, even though we cannot mitigate every attack type.
Chapter 1
Introduction
Internet users want a continuous, uninterrupted, and fast experience when they browse and use Internet services. Service disruption may hamper regular user-to-service interactions, ultimately resulting in poor service performance such as increased latency, complete outages, or eavesdropping. Any Internet service disruption has impacts on enterprise reputation, user satisfaction, and company profits. Amazon reported that a service disruption causing 100 ms of extra latency results in a 1% decrease in their sales; Google showed that a 0.5 s delay in search results might cost them 20% of their traffic; an electronic trading platform could lose $4 million in revenue per millisecond [68].
Malicious parties attack victims to disrupt their regular services on the Internet. These malicious
entities can be individuals, groups, or nation states with technical or monetary power to cause financial
loss, promote competitors, or show their capabilities to damage the victims’ regular operations. Victims
are Internet services like DNS, websites, search engines, online gaming platforms, streaming services, or
Content Delivery Networks (CDNs), and the users of these services [141, 38]. As disruptive events, we tackle attacks that target Internet services to degrade performance with increased latency (for example, DDoS), attacks that target end users to block their access and cause a complete outage (for example, brute-force password attacks), and attacks that target routing paths for eavesdropping (for example, malicious routing detours).
Many security solutions require changes to existing Internet protocols or widespread adoption of new operational practices by several parties. For example, RPKI for route validation requires widespread adoption in routers to filter out unwanted BGP announcements [30]. Similarly, the transition from HTTP to HTTPS required changes in protocols and in web browsers, and clients had to upgrade their browsers. DNSSEC adoption for DNS authentication is still ongoing, even after 15 years [46].
Such security solutions are necessary, but they demand protocol changes and additional collaboration from other parties. Protocol changes take time to deploy, and collaboration among many parties creates inter-dependencies. As a result, it is important for service administrators to consider defensive systems that do not require protocol changes or collaboration from many parties.
In this thesis, our goal is to propose easily and quickly deployable solutions. Changing a protocol generally requires time-consuming, widespread deployment. We therefore build systems on existing protocols so that we do not need significant collaboration from other parties.
1.1 Thesis Statement
In this thesis, we propose new methods utilizing measurement to mitigate attacks that disrupt
online services without changing any existing Internet protocols.
In this thesis, we propose two different kinds of new methods. First, we implement new network defenses,
but we constrain our approaches to use existing Internet protocols to make them easier to deploy. Second,
we provide the first public documentation and evaluation of methods that have not previously been publicly known, even though they may be used internally by network operators in proprietary defenses.
By attacks that disrupt, we mean any targeted attacks that hamper regular user-to-service interaction
and performance. We cannot cover all possible attacks, but we study three that cover different parts of an
online service—services, users, and the path between the users and services. These attacks have a broad
range of impacts. A DDoS attack can cause poor performance, ranging from increased latency to partial or complete outage. A successful brute-force attack can compromise a user's privacy. Eavesdropping via routing detours raises privacy concerns as well.
Our goal is to mitigate these attack events without changing any existing Internet protocols. We should
be able to deploy a security feature easily and quickly when we do not need protocol changes or collaboration from other parties. We utilize real measurements to design our systems. Service administrators and
users can deploy our solutions with minimum operational changes and without making additional contracts with third parties. DNS or CDN operators, system administrators for SSH and HTTPS applications
in an organization, mobile operators, and government security agencies should be able to deploy and use
our systems independently. While we show six example studies/systems utilizing existing protocols, we
believe researchers or developers can build their own systems for their networks using measurements.
1.2 Demonstrating the Thesis Statement
In this section, we first introduce our five studies against three attack types, along with a supplemental study. To prove our thesis statement, we then show how the key elements of the thesis statement map to the corresponding studies. In the next section, we discuss our research contributions (Section 1.3).
1.2.1 Studies
In our first study (Chapter 2), we propose a new technique to estimate the true attack rate, and we develop a BGP playbook: a guide that allows operators to anticipate how traffic engineering (TE) actions rebalance load
across a multi-site anycast system to mitigate the impacts of a DDoS attack. Together, these two elements
provide a system that can automate response to DDoS attacks by adjusting anycast routing according to
the playbook, or recommend actions to a human operator.
In our second study (Chapter 3), we show that a library of defensive filters is necessary and that automatic selection can identify good filters to defend against DDoS attacks. Our automated system (named DDiDD) detects an attack, evaluates the attack pattern, selects the best filter, deploys it, and continuously evaluates its performance, all quickly and automatically. We show that our automated system successfully mitigates DDoS attacks on DNS to protect server resources. To the best of our knowledge, our work is
the first to describe an automated defense in detail with multiple filters, and to show the importance of
matching appropriate filters to attacks.
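To make the selection step concrete, the Python sketch below shows one way the selection step can be structured; the filter library, query representation, and thresholds are placeholders rather than DDiDD's actual implementation, which Chapter 3 describes.
```python
# Minimal sketch of an automated filter-selection step. The filter library,
# query representation, and thresholds are illustrative placeholders, not
# DDiDD's actual implementation (Chapter 3 describes the real design).

def score_filter(filter_fn, attack_sample, legit_sample, max_collateral=0.01):
    """Fraction of attack traffic dropped, or None if the filter would also
    drop too much legitimate traffic (collateral damage)."""
    attack_dropped = sum(1 for q in attack_sample if filter_fn(q)) / max(len(attack_sample), 1)
    legit_dropped = sum(1 for q in legit_sample if filter_fn(q)) / max(len(legit_sample), 1)
    return attack_dropped if legit_dropped <= max_collateral else None

def select_filter(filter_library, attack_sample, legit_sample):
    """Pick the filter that drops the most attack traffic within the
    collateral-damage budget; return None if no filter qualifies."""
    scores = {name: score_filter(fn, attack_sample, legit_sample)
              for name, fn in filter_library.items()}
    candidates = {name: s for name, s in scores.items() if s is not None}
    return max(candidates, key=candidates.get) if candidates else None

# Toy library keyed by the filter names of Chapter 3 (the logic is placeholder):
library = {
    "FQ": lambda q: q.get("qname_matches_attack_pattern", False),
    "UR": lambda q: not q.get("known_recursive", True),
}
print(select_filter(library, [{"known_recursive": False}], [{"known_recursive": True}]))
```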
In our third study (Chapter 4), we show that IPv6 address hopping can be used to protect existing
services. We design a moving target defense utilizing the IPv6 address space and deploy this defense for
SSH and HTTPS applications. For HTTPS, we show how to support web security with TLS by adding
support for DNS-based TLS certificates to our core hopping protocol. To the best of our knowledge, this
is the first design of a moving target defense for SSH and HTTPS utilizing IPv6.
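As an illustration of the core hopping idea, the sketch below derives a temporal rendezvous address inside a /64, assuming the interface ID comes from an HMAC of a shared secret and the current time slot; the actual hopping pattern and its inputs are specified in Section 4.5.3.
```python
# Minimal sketch of time-based IPv6 address hopping within a /64 prefix.
# Deriving the interface ID from an HMAC of a shared secret and the current
# time slot is an illustrative assumption; Section 4.5.3 gives the actual
# hopping pattern.
import hashlib
import hmac
import ipaddress
import time

def hopping_address(prefix: str, secret: bytes, slot_seconds: int = 60) -> str:
    slot = int(time.time()) // slot_seconds                 # current time slot
    digest = hmac.new(secret, str(slot).encode(), hashlib.sha256).digest()
    iid = int.from_bytes(digest[:8], "big")                 # low 64 bits (interface ID)
    net = ipaddress.IPv6Network(prefix)                     # e.g., "2001:db8:1:2::/64"
    return str(net.network_address + iid)

# A client and server sharing the secret compute the same address each minute:
print(hopping_address("2001:db8:1:2::/64", b"example-shared-secret"))
```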
In our fourth study (Chapter 5), we characterize end-to-end mobile latency and throughput along with
their stability using a globally distributed Content Delivery Network (CDN). We evaluate the limits of
latency, throughput, and stability that clients can achieve.
Building on this observation of stable end-to-end latency, our fifth study (Chapter 6) aims to detect malicious routing detours in 5G networks using historical latency. We show that 5G latency is stable and that detours cause deviations from this stable latency. We compare historical latency to current latency, identify possible detours when we observe a latency change, and filter out legitimate causes of latency changes to isolate detour events.
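The sketch below illustrates the comparison step, assuming a detour is flagged when the current RTT exceeds a baseline learned over a sliding window by a factor k; the real baseline, window size, and k are evaluated in Sections 6.5 and 6.7.
```python
# Minimal sketch of latency-based detour flagging. Using the windowed minimum
# as the baseline and a multiplicative threshold k is an illustrative choice;
# Sections 6.5 and 6.7 evaluate the actual baseline, window size, and k.
from collections import deque

class DetourDetector:
    def __init__(self, window_size: int = 100, k: float = 1.5):
        self.history = deque(maxlen=window_size)   # recent RTTs (ms) to one landmark
        self.k = k

    def observe(self, rtt_ms: float) -> bool:
        """Return True if this RTT suggests a possible detour."""
        suspicious = bool(self.history) and rtt_ms > self.k * min(self.history)
        if not suspicious:
            self.history.append(rtt_ms)            # only normal samples update the baseline
        return suspicious

detector = DetourDetector(window_size=50, k=1.5)
for rtt in [32.1, 31.8, 33.0, 90.4]:               # a jump to ~90 ms would be flagged
    print(rtt, "possible detour" if detector.observe(rtt) else "normal")
```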
As a supplemental outcome of this thesis, we measure anycast polarization in the wild Internet (Chapter 7). We show that anycast polarization is common among known anycast networks. While our other anycast study mitigates DDoS using traffic engineering, here we show that latency can also be improved by routing changes.
1.2.2 Studies applied to the thesis
Table 1.1 shows how our six studies relate to the key ideas in the thesis.
Feature / Study        | Anti-DDoS anycast (Chapter 2) | Anti-DDoS filtering (Chapter 3)    | Anti brute-force (Chapter 4)   | Detour detection (Chapters 5 and 6) | Anycast polarization (Chapter 7)
New methods?           | Estimation and BGP playbook   | Description of automated filtering | First moving defense with IPv6 | First for cellular networks         | First in the wild
Attack?                | Yes                           | Yes                                | Yes                            | Yes                                 | No
Disrupting services?   | Poor performance or outages   | Poor performance or outages        | Unavailable user access        | Eavesdropping                       | Increased latency
No change in protocol? | Measurement                   | Measurement                        | IPv6 address space             | Measurement                         | Measurement
Table 1.1: Demonstrating thesis statement and corresponding studies
We provide five new methods utilizing measurement to mitigate three disruptive attacks, along with one supplemental study in which measurement improves the performance of an anycast network by reducing user latency. Our first study of anycast for
anti-DDoS brings two new ideas—estimating the offered load during an attack and having a BGP playbook
for defense (Chapter 2). Our second study provides a new detailed description of an automated system
using filtering against DDoS (Chapter 3). Although network operators use similar systems, we provide
the first detailed description. In our third study, we describe a new moving target defense utilizing IPv6
address space (Chapter 4). We provide the first SSH and HTTPS implementations for our moving defense.
In our fourth study, we characterize end-to-end latency from the UE to the CDN servers (Chapter 5). In our
fifth study, we bring two new ideas—evaluating latency as a stable parameter to detect malicious detours
in 5G, and a system to detect malicious detours using latency observation (Chapter 6). Our sixth study
shows two key reasons behind polarization problems in the wild Internet (Chapter 7). We provide more
details about our contribution in Section 1.3.
In this thesis, we tackle attacks that disrupt online services. We mitigate three different attack types
that disrupt regular user-to-service communication. DDoS attacks cause poor performance with increased
latency or complete service outages. Brute-force password attacks make user access unavailable to legitimate clients. Malicious routing detours may increase user-perceived latency and may enable eavesdropping. Polarization is caused by improper routing configuration, which results in increased latency.
We design our new methods without changing any existing protocols. All of these methods require only measurement, observation, planning, and operational changes. Many security systems require changes in protocols and support from external parties. Our thesis complements those approaches of modifying protocols or outsourcing security solutions to third parties. Utilizing existing protocols helps operators deploy security systems quickly without requiring collaboration with third parties.
1.3 Research Contributions
Each of our five studies contributes to proving different aspects of our thesis statement, as described above. In addition, each study makes its own contributions.
Our anti-DDoS work with anycast routing (Chapter 2) has two additional contributions. First, while
TE techniques are well known, it is not widely understood how available and effective TE mechanisms
are. Our study explores the availability and effectiveness of TE mechanisms and shows that a BGP playbook is important to guide defenders in selecting a TE mechanism (Section 2.8). We show this contribution in our full
paper (Section 2.8 and [190]). Second, we demonstrate successful defenses in practice (Section 2.10). We
replay real-world attacks in a testbed and show that TE can defend against them. Of course, no single defense can protect against all attacks, but these examples show that our approach provides a successful defense against many volumetric
and polymorphic DDoS attacks.
Our anti-DDoS work with filtering (Chapter 3 and [197, 198]) has one additional contribution. We
show the performance of each individual filter against different attack types (Section 3.5.3): how well each filter identifies malicious and legitimate traffic, and its collateral damage (legitimate traffic that is misclassified and discarded).
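Both quantities reduce to simple fractions over a labeled trace; the sketch below shows only that arithmetic (the names are illustrative, and ground truth comes from the labeled evaluation datasets).
```python
# Minimal sketch of the two per-filter metrics named above, computed over a
# labeled trace. Function and label names are illustrative, not Chapter 3's code.
def filter_metrics(dropped, kept, is_attack):
    """dropped, kept: lists of queries after applying one filter;
    is_attack: ground-truth label from the evaluation dataset."""
    attack_dropped = sum(1 for q in dropped if is_attack(q))
    legit_dropped = len(dropped) - attack_dropped
    attack_total = attack_dropped + sum(1 for q in kept if is_attack(q))
    legit_total = (len(dropped) + len(kept)) - attack_total
    return {
        "attack_traffic_identified": attack_dropped / max(attack_total, 1),
        "collateral_damage": legit_dropped / max(legit_total, 1),
    }
```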
We have one additional contribution from our moving target defense (Chapter 4 and [191]). We propose a new approach to accommodate long-lived connections in the face of frequent address changes
(Section 4.5.4). We use ip6tables rules to retain existing connections to a fixed internal address, while changing NAT rules allow new connections only to the current IPv6 address.
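A minimal sketch of this rule rotation is shown below; the addresses, port, and exact ip6tables invocation are illustrative assumptions, and the deployed rules are described in Section 4.5.4.
```python
# Minimal sketch of rotating a DNAT rule so that only the current hopping
# address accepts new connections while conntrack keeps established flows
# alive. The addresses, port, and exact ip6tables rules are illustrative;
# Section 4.5.4 describes the deployed rules.
import subprocess

INTERNAL_ADDR = "fd00::22"   # fixed internal address the SSH daemon listens on
PORT = "22"

def dnat_rule(action: str, external_addr: str) -> list:
    return ["ip6tables", "-t", "nat", action, "PREROUTING",
            "-d", external_addr, "-p", "tcp", "--dport", PORT,
            "-j", "DNAT", "--to-destination", INTERNAL_ADDR]

def rotate(old_addr: str, new_addr: str) -> None:
    subprocess.run(dnat_rule("-A", new_addr), check=True)      # accept the new address
    if old_addr:
        subprocess.run(dnat_rule("-D", old_addr), check=True)  # stop new flows to the old one
    # Established connections persist: conntrack already holds their NAT mapping.
```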
In our detour detection work (Chapter 5 and Chapter 6), we make one additional contribution. We
show a method to identify cellular users from CDN logs using the IPv6 address space (Section 5.5).
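The sketch below illustrates the idea: client IPv6 addresses from the logs are matched against per-carrier prefixes. The prefixes shown are documentation placeholders, not the carrier patterns of Table 5.1.
```python
# Minimal sketch of labeling log entries as cellular by matching client IPv6
# addresses against per-carrier prefixes. The prefixes below are documentation
# placeholders, not the actual carrier patterns from Table 5.1.
import ipaddress

CARRIER_PREFIXES = {
    "carrier-A": ipaddress.ip_network("2001:db8:a000::/36"),
    "carrier-B": ipaddress.ip_network("2001:db8:b000::/36"),
}

def carrier_label(client_ip: str):
    addr = ipaddress.ip_address(client_ip)
    if addr.version != 6:                 # the method relies on IPv6 address structure
        return None
    for carrier, prefix in CARRIER_PREFIXES.items():
        if addr in prefix:
            return carrier
    return None

print(carrier_label("2001:db8:a000::1"))  # -> "carrier-A"
```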
Finally, we make two additional contributions in our anycast polarization work (Chapter 7 and [196]).
First, while a prior study showed the existence of the polarization problem, our study shows that polarization is a common problem (Section 7.5). Second, we show how a CDN can make simple routing changes to address the polarization problem (Section 7.6).
1.4 Thesis Organization
This thesis is organized in the following way. First, we show two defense approaches against DDoS attacks: one based on anycast traffic redistribution (Chapter 2) and one based on filtering (Chapter 3). Then we show a moving target defense against brute-force password attacks using the IPv6 address space (Chapter 4). After that, we describe a detour detection system against malicious routing detours using historical latency (Chapter 6). All of these defense systems prove our thesis statement (Section 1.1). After presenting the four defense systems against three attack types, we present our supplemental work on finding anycast polarization in known anycast services (Chapter 7).
Next, we present each of these studies against disruptive attack events to prove our thesis statement, beginning with our anti-DDoS system using anycast traffic redistribution.
Chapter 2
Mitigating DDoS Using Anycast
To prove our thesis statement that we can build security systems without changing existing protocols, we
show four example defense systems in this thesis. We begin by showing our first example defense system
against DDoS attacks utilizing measurements.
In this chapter, we tackle DDoS attacks that disrupt the operation of a service. We want to serve
as many requests as possible before starting any filtering of the malicious traffic because filtering may
have false positives and we do not want to filter legitimate traffic. Here, we describe a redistribution-based
approach against DDoS so that we can utilize the capacity of all anycast sites. Then, in the next chapter, we describe a filtering-based solution for cases where redistribution does not perform as expected (Chapter 3).
In this work, we use anycast to mitigate DDoS by spreading load among anycast sites. We pre-compute
a BGP playbook, estimate the attack size, and select a Traffic Engineering (TE) response based on the
estimated traffic load and capacity in other anycast sites.
This work was published in the USENIX Security Symposium in 2022 [196]. We have released our
datasets and software tools as an artifact, and our results are reproducible [192]. As a supplemental outcome of this work, we release a tool that generates an anycast mapping from datasets collected with the Verfploeter tool [194].
2.1 Introduction
Anycast routing is used by services like DNS and CDNs, where multiple sites announce the same prefix from geographically distributed locations. Defined in 1993 [173], anycast was widely deployed by the DNS roots in the early 2000s [212, 95, 18], and today it is used by many DNS providers and Content Delivery
Networks [247, 74, 78, 47, 48].
In IP anycast, BGP routes each network to a particular anycast site, dividing the world into catchments.
BGP usually associates networks with nearby anycast sites, providing generally good latency [206]. Anycast also helps during Distributed Denial-of-Service (DDoS) attacks, since each site adds to the aggregate capacity at lower cost than a single very large site. Each site is independent, so should a DDoS attack overwhelm one site, sites that are not overloaded remain unaffected.
DDoS attacks are getting larger and more common. Root servers and anycast services frequently report DDoS events [167, 168, 136, 49]. Automated tools make it easier to generate attacks [249], and some offer DDoS-as-a-Service, allowing attacks from unsophisticated users for as little as
US$10 [217]. DDoS intensity is still growing, with the 2020 CLDAP attack exceeding 2.3 Tb/s in size [15],
and the 2021 VoIP.ms attack lasting for over 5 days [214, 178]. The reservoir of attack sources grows with the millions of Internet-of-Things devices whose vulnerabilities fuel botnets [122].
Operators depend on anycast during DDoS attacks to provide capacity to handle the attack and to
isolate attackers in catchments. Service operators would like to adapt to an ongoing attack, perhaps shifting
load from overloaded sites to other sites with excess capacity. Prior studies of DDoS events have shown
that operators take these actions but suggested that the best action to take depends on attack size and
location compared to anycast site capacity [144]. While prior work suggested countermeasures, and we
know that operators alter routing during attacks, to date there has been only limited evaluation of how
routing choices change traffic [183, 81, 19, 124]. Only very recent work examined path poisoning to avoid
congested paths [219]; there is no specific public guidance on how to use routing during an attack.
The goal of this chapter is to guide defenders in traffic engineering (TE) to balance traffic across anycast
during DDoS.
Our first contribution is a system with novel mechanism to estimate true attack rate and plan responses.
First, we propose a new mechanism to estimate the true offered load, even when loss happens upstream of
the defender. Estimating the relative load on each site (Section 2.5.4) is the first step of defense, so that
the defender can match load to the capacities of different sites, or decide that some sites should absorb
as much of the attack as possible. Second, we develop a BGP playbook: a guide that allows operators to
anticipate how TE actions rebalance load across a multi-site anycast system. Together, these two elements
provide a system that can automate response to DDoS attacks by adjusting anycast routing according to
the playbook, or recommend actions to a human operator.
The second contribution is to understand how well routing options for multi-hop TE work: AS prepending, community strings and path poisoning. While well known, it is not widely understood how available
and effective these mechanisms are. In Section 2.8 we show that while AS prepending is available almost
anywhere, community strings and path poisoning support varies widely. We also show that their effectiveness varies greatly, in part because today’s “flatter Internet” [42] means AS prepending often shifts either
nearly all or nearly no traffic. Community strings provide finer granularity control, but we show their
support is uneven. Path poisoning may provide control multiple hops away, but like community strings
it is often filtered, particularly for Tier-1 ASes. When these factors combine with the interplay among the multiple sites of an anycast system, a BGP playbook is important to guide defenders. Since the effects
of TE are often specific to the peers and locations of a particular anycast deployment, we explore how
sensitive our results are to different locations and numbers of anycast sites (Section 2.9).
Our final contribution is to demonstrate successful defenses in practice. We replay real-world attacks
in a testbed and show TE can defend (Section 2.10). Of course, no single defense can protect against all attacks; these examples show that our approach provides a successful defense against many volumetric and polymorphic DDoS attacks. They show that our algorithm and process contributions (attack size estimation
and playbook construction) have practical application.
Our work uses publicly available datasets. Datasets for the input and results from our experiments are
available at no charge. Because our data concerns services but not individuals, we see no privacy concerns.
2.2 Related Work
Anycast routing has been studied for a long time from the perspective of routing, performance, and DDoS prevention.
BGP to steer traffic: Prior work showed BGP is effective at steering traffic to balance load on links [184, 31, 81]. However, Ballani et al. showed that anycast requires planning and care for effective load balancing [19]. Others proposed to manipulate BGP based on packet loss, latency, and jitter [183, 151]. We build
on Ballani’s recommendation to plan anycast, proposing a BGP playbook, and studying how well it can
work.
Chang et al. [40] suggested using BGP Communities for traffic engineering [39, 226, 33]. Recent work
has examined BGP communities for blackhole routing in IXPs and ISPs [58, 85]. Smith and Glenn examined
path poisoning to address link congestion [219]. While each of these is an important option in routing for defense, we show a system that guides the operator to select among them. A system with multiple choices
is necessary because no single method works against all attacks. For example, we show path poisoning
does not work when we poison a Tier-1 AS.
Anycast performance: Most anycast research focused on efficient delivery and stability [202, 133,
34, 246, 131]. Later studies explicitly investigated the proximity of clients [19, 34, 131].
Some studies try to improve anycast through topology changes [206, 140]. Anycast for DDoS defense is already used in commercial solutions, e.g., Amazon [208], Akamai [228], and AT&T [223]. However, none of them address how to use routing manipulation as a DDoS defense mechanism.
Anycast catchment control as a DDoS mitigation tool: To our knowledge, the idea of handling
DDoS attacks by absorbing or shifting load across anycast sites was first published in 2016 [144]. Kuipers
et al. [124] refined that work, defining the traffic shifting approaches that we review in Section 2.5.5 and
explore through experiment. We develop the idea of a BGP playbook to guide responses, describe a new approach to estimating attack size, and finally show that responses can be effective against real-world events.
Commercial and automated solutions: Most published commercial anti-DDoS solutions use routing to steer traffic towards a mitigation infrastructure [61]. Sometimes there is a requirement for all the
sites to be connected through a private backbone to support traffic analysis [208]. Another defense uses
BGP to divert all traffic to a scrubbing center, then tunnels good traffic to the destination [218]. Other
methods use DNS manipulation [35] or anycast proxies [100], which cannot be used in DNS anycast deployments themselves. Rather than outsourcing the problem, we explore how one can defend against it directly. Other automated defenses include responsive resource management [75], client-server reassignment [110], and filtering approaches [195]. Our method uses TE approaches to efficiently use the available resources of an anycast deployment.
2.3 Anycast and BGP Background
IP anycast is a routing method that directs incoming requests to different locations (sites). Each site announces the same IP address from a different geographic location. Anycast then uses Internet routing with BGP to determine how to associate users with sites; this association is known as the site's anycast catchment. BGP has
a standard path selection algorithm that considers routing policy and approximate distance [31].
Figure 2.1: An example of a three-site Anycast deployment with possible catchments.
Figure 2.1 shows a conceptual version of one of our three-site anycast deployments. Clients are split
into three sites from three continents. Although we illustrate catchments by continent here, in practice
they follow BGP routing rules and not geography, with users intermixed.
Although BGP is not perfect, in practice it often does a reasonably good job at associating users to
nearby sites and thereby minimizing service latency [131, 120]. Moreover, anycast increases the service
resiliency since it spreads traffic over multiple sites. If one site goes offline, perhaps due to maintenance,
that site withdraws its BGP route and routing automatically redistributes users previously going to that
site to other sites. Thus, anycast helps avoid service interruptions in addition to expanding capacity.
Operators can influence the routing decision process using different traffic engineering (TE) techniques to manipulate BGP. We describe TE techniques in Section 2.5.5.1 and how they can be used to rebalance
the load during a DDoS attack.
2.4 Threat Model
We next describe the threat model for this study: distributed attackers attempt to exhaust resources at the
target running a service like DNS or CDN.
2.4.1 Attackers
DDoS is a threat because attackers attempt to make a service at the target unavailable, often to extort
money, disadvantage a competitor, or simply show their power. Attackers can send malicious traffic directly to a target, or they can hide themselves by spoofing their source addresses, making them difficult to trace.
Active adversaries can utilize distributed compromised devices or use DDoS-as-a-Service to conduct these
attacks.
2.4.2 Target
In this study, our target network includes an anycast network with multiple sites or a network with a single
service location. In an anycast network, a client receives its service from a single site, determined by BGP,
known as the catchment of that client. In contrast, single-site networks serve users from a single geographic location, so all clients are served from that location.
The main threat of a DDoS attack is that the attacker may find a specific resource whose exhaustion will block other services. As an example, in the 2015 DDoS attacks on the root DNS, some operators could reply to all queries, but others failed to receive queries or could not respond to them, typically because of bandwidth limitations [138,
144]. In an anycast network, one or more sites may be overwhelmed. A DDoS attack may overwhelm different resources of the server. We consider ingress/egress network bandwidth and CPU/memory usage
as server resources. When a site is exhausted, legitimate clients observe the impact as slow or
unavailable service.
2.5 Mechanisms to Defend Against DDoS
In this section we describe our BGP mitigation process: how we pre-compute a BGP playbook, estimate the attack size, and select a TE response.
Figure 2.2: Overview of the decision process: (1) mapping (compute the BGP playbook, before the attack), (2) detection (detect the DDoS attack), (3) estimation (attack size estimation), (4) defense strategy (pick a rule from the playbook to shift or absorb), and (5) deploy (deploy the selected BGP-TE and measure impacts).
2.5.1 Overview and Decision Support
In Figure 2.2 we show how defense against DDoS works. Defense against a DDoS begins with detection (step 2), then defenders plan a defense (step 4), carry it out (step 5), and repeat this process until the attack is mitigated
or it ends (bottom cycle in Figure 2.2). Detecting the attack is straightforward, since large attacks affect
system performance. The challenge is selecting the best response and quickly iterating.
We bring two new components to attack response (colored light green in Figure 2.2): mapping before
the attack, and estimating attack size when the attack begins. Mapping (step 1, discussed in Section 2.5.2)
provides the defender with a playbook of planned responses and the information about how they will
change the traffic mix across their anycast system. Size estimation (step 3, discussed in Section 2.5.4) allows
the defender to determine how much traffic should be moved and select a promising response from the
playbook. Together, these tools help to understand not only how to reduce traffic at a given site, but also
the sites where that traffic will go.
These components come together in our automated response system (Section 2.5.5) that iterates among measurement and attack-size estimation, defense selection, and deployment. Defense uses the
playbook built during mapping; we provide an example playbook in Section 2.8.4. We show how these
defenses operate in testbed experiments in Section 2.10.
Our system is designed for services that operate with a fixed amount of infrastructure on specific
anycast IP addresses and do not employ a third-party scrubbing service. Operators of CDNs with multiple
anycast services, DNS redirection, or scrubbing services may use our approach, but also have those other
tools. However, many operators cannot or prefer not to use scrubbing and DNS redirection: all operators
of single-IP services (all DNS root servers), many ccTLDs who value national autonomy, and scrubbing
services themselves. Our approach defends against volumetric attacks when there is spare capacity at other sites. Since DDoS causes unavailability of services, suboptimal site selection during an attack is not
a concern.
2.5.2 Measurement: Mapping Anycast
We map the catchments of the anycast service before an attack so that the defender can make an informed
choice quickly during an attack, building a BGP playbook (Section 2.8.4).
To map anycast catchments we use Verfploeter [56]. As an active prober, Verfploeter sends ICMP echo requests to all ping-responsive IPv4 /24s and maps which anycast site receives each response. We provide a detailed description of anycast, BGP, and Verfploeter in Section 2.3 and Section 2.5.3. Since mapping happens before the attack, mapping speed is not an issue.
Alternatively, we can map traffic by observing which customers are seen at each site over time, or
measuring from distributed vantage points such as RIPE Atlas [225, 17]. (Operators may already collect
this information for optimization.)
Mapping should consider not only the current catchments but also potential shifts we might make
during the attack. This full mapping is easy to do with Verfploeter, which can run continuously on an adjacent BGP prefix to map the possible shifts. This mapping process is important for anticipating
how traffic may shift. We will show later that BGP control is limited by the granularity of routing policy
(Section 2.8) and by the deployment of the anycast sites (Section 2.9).
A challenge in pre-computed maps with routing alternatives is that routing is influenced by all ASes.
Thus, the maps may shift over time due to changes in the routing policies of other ASes. Fortunately,
prior work shows that anycast catchments are relatively slow to change [246]. We also show that our BGP
playbook is stable over time (Section 2.8.4).
2.5.3 Verfploeter Mapping
We use Verfploeter [56] to find the client-to-anycast-site mapping. Using Verfploeter we build our BGP
playbook with various BGP changes (Section 2.8.4).
The main intuition behind Verfploeter is to send pings using an anycast prefix as source address. The
replies to these pings will be routed to the nearest anycast site by the inter-domain routing system. Figure 2.3 shows how this works. One of the sites (green) of the anycast service runs a packet generator that
sends pings (ICMP Echo Requests) to a hitlist of IP addresses. The replies (ICMP Echo Replies) from these
IPs are then routed to the “closest” (in terms of routing distance) anycast site. The catchment of the site is
thus determined by the IP prefixes from which ping replies arrive at that particular site.
While in principle Verfploeter can work with any type of IP hitlist, for our measurement we use a publicly available hitlist [241] based on the Fan et al. [73] methodology. This hitlist includes the IP addresses
that are most likely to respond to pings for each /24 prefix in the IPv4 address space.
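To make this mapping concrete, the following minimal sketch (in Python; the record format and function names are our own illustration, not the released Verfploeter code) aggregates replies collected at each anycast site into a per-/24 catchment table:

    from collections import defaultdict
    from ipaddress import ip_network

    def build_catchment_map(replies):
        """Aggregate (site, source_ip) reply records into a /24 -> site catchment map.

        `replies` is an iterable of (site_name, source_ip) tuples, one per ICMP Echo
        Reply collected at an anycast site; the field names here are hypothetical.
        """
        catchments = {}
        for site, src in replies:
            block = ip_network(f"{src}/24", strict=False)   # collapse to the /24 prefix
            catchments[block] = site                        # the replying block belongs to this site
        return catchments

    def catchment_shares(catchments):
        """Return the fraction of /24 blocks routed to each site."""
        counts = defaultdict(int)
        for site in catchments.values():
            counts[site] += 1
        total = sum(counts.values())
        return {site: n / total for site, n in counts.items()}

    # Example: three replies observed at two sites.
    demo = [("AMS", "192.0.2.7"), ("BOS", "198.51.100.9"), ("AMS", "203.0.113.200")]
    print(catchment_shares(build_catchment_map(demo)))   # {'AMS': 0.66..., 'BOS': 0.33...}

Counting /24 blocks in this way is also how we summarize catchments later in this chapter (Section 2.8).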
2.5.4 Estimation of the Attack Size
After the detection of an attack, the first step in DDoS defense is to estimate the attack size, so we can then
select a defense strategy for how much traffic to shift. Our goal is to measure offered load, the traffic that is sent to (offered to) each site. During a DDoS, the offered load balloons with a mix of attack and legitimate traffic,
and loss upstream of the service means we cannot directly observe true offered load. We later evaluate our
approach with real-world DDoS events (Section 2.6).
Figure 2.3: Overview of the Verfploeter approach (from [25]).
Idea: Our insight is that we can estimate true offered load based on changes in some known traffic
that actually does arrive at the service, even when there is upstream loss.
To know how much offered load actually arrives at the service, we need to estimate some fraction of
legitimate traffic. We can then observe how much this traffic drops during the attack, inferring upstream
loss. Unfortunately, there is no general way to determine all legitimate traffic, since legitimate senders
change their traffic rates, and attackers often make their traffic legitimate-appearing. Our goal is to reliably estimate some specific legitimate traffic; we describe several sources next.
Traffic sources: There are several possible sources of known legitimate traffic: we consider known measurement traffic and regular traffic sources that are heavy hitters [23].
For DNS, our demonstration application, RIPE Atlas provides a regular source of known-good traffic, sent from many places. RIPE Atlas generates continuous traffic from around 10k publicly available vantage points [188]. Each RIPE vantage point queries every 240 s, and there is enough traffic (about 2500 queries/minute)
to provide a good estimate of offered load. (Although RIPE Atlas is specific to DNS, other commercial services often have similar types of known monitoring traffic.)
To find the known-good traffic at each site, we use the catchments of RIPE vantage points with predeployed RIPE DNS CHAOS queries (one exists for each root DNS IP, such as measurement ID 11309 for
A-root). We can also use Verfploeter or captured traces in the anycast sites. An advantage of using RIPE
traffic is that it does not place any new load on the service.
Heavy hitters can provide an additional source of known-good traffic. Many services have a few consistently large-volume users with regular traffic patterns, and while they vary over time, many are often
stable. For DNS, we find that most heavy hitters have a strong diurnal variation in rate; we model them
with TBATS (Trigonometric seasonality, Box-Cox transformation, ARMA errors, Trend and Seasonal) [55]
to factor out such known variation. While an adversary could spoof heavy hitters, that requires a large
and ongoing investment to succeed.
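As a simplified stand-in for the TBATS modeling described above, the sketch below (Python) builds a per-hour-of-day median baseline for one heavy hitter to factor out diurnal variation; it is an illustration under our own assumptions, not the model used in the deployed system.

    import statistics
    from collections import defaultdict

    def diurnal_baseline(samples):
        """Build a simple per-hour-of-day baseline for one heavy hitter.

        `samples` is a list of (hour_of_day, queries_per_second) pairs from recent
        non-attack days; the median per hour stands in for a full seasonal model.
        """
        by_hour = defaultdict(list)
        for hour, rate in samples:
            by_hour[hour].append(rate)
        return {hour: statistics.median(rates) for hour, rates in by_hour.items()}

    # Example: expected rate at 14:00 from three prior days of observations.
    history = [(14, 120.0), (14, 132.0), (14, 125.0), (3, 40.0), (3, 38.0)]
    baseline = diurnal_baseline(history)
    print(baseline[14])   # 125.0 queries/second expected at 14:00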
Estimation: Our goal is to estimate the offered load, Toffered. We can measure the observed traffic rate, Tobserved, at the access link. We define α as the access fraction, the fraction of traffic that is not dropped. Therefore Tobserved = α · Toffered.
To estimate the access fraction (α), we observe that known-good traffic sees the same loss on incoming links as other good traffic and attack traffic. We estimate the known traffic rate (from RIPE Atlas measurement traffic, from heavy hitters, or both) as Tknown. Then α · Tknown,offered = Tknown,observed, and our estimate of offered load is T̂offered = Tobserved · Tknown,offered / Tknown,observed.
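As a concrete illustration of this estimate, a minimal sketch (in Python; the function and variable names are ours, not from the released tools) computes the access fraction from the known-good traffic and scales the observed rate accordingly:

    def estimate_offered_load(t_observed, t_known_offered, t_known_observed):
        """Estimate the true offered load at one site during an attack.

        t_observed:        total traffic rate seen at the site (queries/s)
        t_known_offered:   expected rate of known-good traffic (e.g., RIPE Atlas)
        t_known_observed:  rate of that known-good traffic actually seen under attack
        """
        alpha = t_known_observed / t_known_offered        # access fraction (not dropped)
        return t_observed / alpha                         # Toffered = Tobserved / alpha

    # Example using the testbed numbers from Table 2.1 (rates in queries/second):
    print(estimate_offered_load(16_300, 425.2, 207.0))    # ~33.5 kq/s, in line with Table 2.1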
2.5.5 Traffic Engineering as a Defense Strategy
With knowledge of the offered load, the defender can select an overall defense strategy that will drive
traffic engineering decisions. The defender first must determine if the attack exceeds overall capacity or
not.
For attacks that exceed overall capacity, the defender’s goal is to preserve successful service at some
sites, while allowing other sites to operate in degraded mode as absorbers [144]. The defender may also
choose to shift traffic away from some degraded sites to ease their pain. Unloading overloaded sites in this way is known as the breakwaters approach [124].
For moderate-size attacks, the defender should try to serve all traffic, rebalancing to shift traffic from
overloaded sites to less busy sites. In heterogeneous anycast networks, where some sites have more capacity than others, the defense approach can be different. In these cases, larger, “super”-sites can attract
traffic from smaller sites. For moderate-size attacks, it may even be best for smaller sites to shut down if
the super-sites can handle the traffic.
Regardless of attack size, traffic engineering allows the defender to shift attack traffic to absorber or
breakwater sites. We next describe traffic engineering options, and then how one can automate response.
For operators unwilling to fully automate response, our system can still provide recommendations for
possible actions and their consequences.
2.5.5.1 Traffic Engineering to Manage an Attack
Given an overall defense strategy (absorb or rebalance), the defender will use traffic engineering to shift
traffic, either automatically (Section 2.5.5.2) or as advice under operator supervision. For anycast deployments connected by the public Internet, BGP [31] will be the tool of choice to control routing and influence
anycast catchments. Organizations that operate their own wide-area networks may also be able to use SDN
to manage traffic on their internal WAN [205, 101]. Fortunately, BGP has well-established mechanisms to manage routing policy. We use three BGP mechanisms in this work: AS-Path prepending, BGP communities, and path poisoning.
AS-Path Prepending is a way to de-prefer a routing path, sending traffic to other catchments. BGP’s
AS-Path is the list of ASes back to the route originator. The AS-Path both prevents routing loops and also
serves as a rough estimate for distance, with BGP preferring routes with shorter AS-Paths. By artificially
inserting extra ASes into the AS-Path, the route originator can de-prefer one site in favor of others. Path
prepending is known to be a coarse routing technique for traffic engineering. We measure how fine a control AS-Path prepending provides for anycast in Section 2.8.1.
We define Negative Prepending as the use of AS-Path prepending to draw traffic towards a site, preferring one site over others. Prepending can only increase path lengths, but an anycast operator in control of all anycast sites can prepend at all sites except one, in effect giving that site a shorter AS-Path (relative to the other sites) than it had before. “Negative prepending by one at site S” is, therefore, shorthand for prepending by one at all sites other than S.
Long AS-Paths due to prepending can make prefixes more vulnerable to route hijacking [135]. However, this issue has a small impact on anycast prefixes, since there is always a site announcing without any prepend, keeping the path length limited. We suggest that formal defenses to hijacking such as RPKI are
needed even without prepending, and when they are in place, prepending can be an even more valuable
tool for TE.
BGP Communities (or community strings) label specific BGP routes with 32 or 64 bits of information. How this information is interpreted is up to the ASes. While not officially standardized, a number
of conventions exist where part of the information identifies an AS and the other part a policy such as
blackholing, prepend, or set local-preference. Community strings are widely supported to allow ISPs to
delegate some control over routing policy to their customers [9, 226].
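As a small illustration of this tagging convention, the following sketch (Python; the AS number and action code are purely hypothetical examples, not values used by any real provider) formats a community in the common “ASN:value” form:

    def community(asn, action_code):
        """Format a conventional 32-bit BGP community as "ASN:value".

        How `action_code` is interpreted (e.g., blackhole, selective prepend,
        set local-preference) is entirely up to the neighboring AS; the values
        used here are illustrative only.
        """
        return f"{asn}:{action_code}"

    # Hypothetical example: ask example AS 64500 to apply its "prepend once" policy.
    print(community(64500, 80))   # -> "64500:80"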
Path Poisoning is another way to control incoming traffic. This technique consists of adding the AS of another carrier to the AS-Path. Paths that repeat ASes in different parts of the AS-Path indicate routing loops and must be discarded by BGP.
When using path poisoning we announce a path with both the poisoned AS and our own AS (otherwise neighbors may filter our announcement as not originating from us). We must therefore also prepend twice at all other anycast sites; otherwise, poisoning also results in a longer AS-Path.
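The sketch below (Python; the origin AS number, site names, and poisoned AS are illustrative, and this is our simplification rather than the deployed tooling) shows how the AS-Paths announced by each site change under the three techniques described above:

    MY_AS = 65001          # the anycast origin AS (example value)
    SITES = ["AMS", "BOS", "CNF"]

    def prepend(site, n):
        """AS-Path prepending: de-prefer `site` by repeating our AS n extra times."""
        return {s: [MY_AS] * (1 + n if s == site else 1) for s in SITES}

    def negative_prepend(site, n):
        """Negative prepending: prefer `site` by prepending n times at every other site."""
        return {s: [MY_AS] * (1 if s == site else 1 + n) for s in SITES}

    def poison(site, poisoned_as):
        """Path poisoning at `site`: announce MY_AS, poisoned AS, MY_AS so the poisoned
        AS drops the route as a loop; other sites prepend twice to keep lengths equal."""
        paths = {s: [MY_AS] * 3 for s in SITES}
        paths[site] = [MY_AS, poisoned_as, MY_AS]
        return paths

    print(prepend("AMS", 2))          # AMS announces a 3-AS path, BOS and CNF announce 1
    print(negative_prepend("BOS", 1)) # BOS keeps the short path, the other sites prepend by one
    print(poison("AMS", 64999))       # AMS poisons (fictional) AS 64999; BOS and CNF prepend twice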
Figure 2.4 shows how traffic engineering can be applied to an anycast system in order to modify the catchments. In this example, site-1 is overwhelmed by an attack. Aiming to shift bins of traffic to site-2, which has spare capacity, we can change our BGP announcements: site-1 poisons AS3, prepends (showing only to AS4), and prevents announcement to AS5 using the no-export BGP community. These changes decrease the load at site-1, shifting traffic to site-2.
Figure 2.4: TE techniques to shift traffic from Site-1 to Site-2.
2.5.5.2 Automatic Defense Selection
To automate defense we use a centralized controller. The controller collects observations for all sites (from
external measurements, or assuming the site is saturated if it cannot reach the site), then takes action, if required (step 4 of Figure 2.2): (1) The controller identifies sites that are over capacity by comparing estimated
load to expected capacity and observed resources at each site. (2) The controller identifies all playbook
options that will reduce load at any impacted sites without overloading currently acceptable sites. (3) It
selects from any viable options, favoring a uniform distribution and smallest change (or selecting arbitrarily
if necessary). If all changes leave some sites overwhelmed, it can choose the “least bad” scenario, or request
operator intervention.
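To make the selection steps above concrete, here is a minimal sketch in Python; the playbook fractions, capacities, and function names are illustrative, and this is our own simplification rather than the deployed controller.

    def select_playbook_entry(playbook, est_load, capacity):
        """Pick a TE configuration from the playbook, or None to escalate.

        playbook:  config name -> {site: expected fraction of total load}
        est_load:  site -> estimated offered load (queries/s)
        capacity:  site -> capacity (queries/s)
        """
        total = sum(est_load.values())
        overloaded = {s for s in est_load if est_load[s] > capacity[s]}          # step (1)
        viable = []
        for name, shares in playbook.items():                                    # step (2)
            new_load = {s: shares[s] * total for s in shares}
            reduces = all(new_load[s] < est_load[s] for s in overloaded)
            fits = all(new_load[s] <= capacity[s] for s in new_load)
            if reduces and fits:
                viable.append((name, new_load))
        if not viable:
            return None                  # choose the "least bad" option or escalate
        # Step (3): among viable options, favor the most uniform distribution.
        def spread(entry):
            ratios = [load / capacity[s] for s, load in entry[1].items()]
            return max(ratios) - min(ratios)
        return min(viable, key=spread)[0]

    playbook = {"baseline":      {"AMS": 0.65, "BOS": 0.15, "CNF": 0.20},
                "1xPrepend AMS": {"AMS": 0.37, "BOS": 0.25, "CNF": 0.38}}
    print(select_playbook_entry(playbook,
                                est_load={"AMS": 90, "BOS": 20, "CNF": 30},
                                capacity={"AMS": 60, "BOS": 60, "CNF": 60}))
    # -> "1xPrepend AMS"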
After deploying a new routing policy, the decision machine continues to evaluate the traffic level at
each site (step 5 of Figure 2.2). If any site is still overwhelmed after 5 minutes, we try again, repeating size
estimation, decision, and action. In the subsequent iterations, the controller only considers the routing
options that were considered in the previous iteration (from step (2) of this decision process). We allow
Figure 2.5: Operator assistance system.
time between attempts so announcements can propagate [126]. To avoid oscillation or interference with
route flap dampening, after three attempts we escalate the problem to the human operator. We choose these
values for timer duration and number of retries based on operator recommendations to avoid oscillation; other settings are possible, and exploring them is future work.
Return to service: After a period with no overloaded sites, we can automatically revert any interventions, on the assumption that default routing provides users the best service. Leaving interventions in place
for some time can help with polymorphic attacks (Section 2.10).
2.5.5.3 Operator Assistance System
We discussed our approach with operators of root DNS and cloud services to get their feedback. While they were enthusiastic about automated defenses to deal with common DDoS events, and
to handle events during non-business hours, some operators prefer human-supervised (non-automated)
response, and all expected human supervision of response during initial deployment to build trust before
full automation.
To support human-supervised response, we design an operator assistance system as an alternative
(or precursor) to automation. This system provides a web-based interface that activates route changes,
Scenario      Dur.   Known-good traffic           Offered load during attack                α̂     est./
/Date                normal   observed  α         normal   observed  reported  estimated          reported
2015-11-30    3h     33.08    1.85      0.0559    0.03 M   0.37 M    5.1 M     6.6 M       0.07   1.3
2016-06-25    3h     36.58    0.33      0.0091    0.03 M   0.10 M    10 M      11 M        0.01   1.1
Testbed       5min   425.2    207.0     0.4900    8.5 k    16.3 k    29.2 k    33.2 k      0.56   1.1
Table 2.1: Estimating sizes of offered load (the “estimated” column) based on known-good traffic (the left group of columns) with real-world attacks at B-Root and a testbed experiment. Traffic rates are in queries/second (reporting only the peaks).
coupled with a playbook lookup that recommends good options based on current sensor status. To react and reconfigure the anycast network, operators can use this web interface like an equalizer, choosing the percentage of load to be increased or dropped at an anycast site. The possible ranges of slider positions are based on the playbook alternatives or presets of routing policies. This process hides the playbook complexity from the operator, making the process less error-prone and more intuitive, while still giving the operator full control of the BGP routing.
Figure 2.5 shows a snapshot of this interface. Each slider represents an anycast site, and each site has predetermined settings indicated by “notches”. The positions of these notches are the results of all the measurements obtained to create our playbook. The bar graph shows the results of the measurement process, indicating how many networks will be attracted to each anycast site. Operators can visualize the forecasted traffic for each position and then apply the configuration on the production network.
2.6 Evaluation of Offered Load Estimation
We next evaluate estimating offered load with real-world events; a testbed evaluation is in Section 2.6.3.
2.6.1 Case Studies
We test our approach with two large DNS DDoS events from 2015-11-30 and 2016-06-25. The November 2015 event was a DNS flood, and the June 2016 event was a SYN and ICMP flood attack. B-Root exhibited significant upstream loss in both these events, so we estimate the true offered load to B-Root and compare to observations at other roots for ground truth.
Figure 2.6: Estimating real-world attack events: estimating the Nov. 2015 event with a 5.59% access fraction.
To apply our system we measure the access fraction (α) using the known-good traffic. Table 2.1 shows
the expected typical known-good traffic (“normal”), the observed rate under attack (“observed”) and the
computed α. Here we use RIPE Atlas as the known-good traffic [187]. We see similar results when using the top 100
heavy hitters.
Figure 2.6 compares the observed load (the bottom blue line) with the estimated offered load (the
middle, varying, orange line) from our system, as compared to the attack rate reported from other roots
(the dashed purple line). The offered load columns of Table 2.1 give numeric values.
Even though the attack was large, we see that the estimated attack size of the 2015 event of 4–6.5 Mq/s
is close to the reported 5.1 Mq/s [144, 167]. We also see similar results from the 2016 event [168], where
we estimate 8–11 Mq/s of total traffic, compared to the 10 Mq/s reported rate (details with figure in Section 2.6.2). We also include the result from the testbed experiment, which shows good accuracy (details in Section 2.6.3).
Figure 2.7: Estimating real-world attack events: estimating the June 2016 event with a 0.91% access fraction.
Figure 2.8: Topology with two upstream providers (RIPE Atlas traffic can be replaced by heavy hitters or other monitoring tools).
These two events show that even with high rates of upstream loss we are able to get reasonable estimates of total offered load. Our results provide good accuracy when the known-good traffic has 2500 queries/minute with RIPE, and additional known-good traffic can improve accuracy. Use of additional
known-good traffic (such as heavy hitters) improves accuracy in these cases by providing a larger signal.
However, in practice, even a rough estimation allows a far better response than using directly observed
load.
2.6.2 Case Studies: 2016-06-25 Event
We already showed real-world case studies in Section 2.6.1. Here we show that our approach also works for another event, from 2016-06-25 (Figure 2.7). We observe that our estimation (the varying orange line) is close to the
reported line (dashed purple line). We can also see that our observation is only a tiny fraction of the true
offered load (bottom blue line).
Both these results show the effectiveness of our approach with both testbed and real-world events.
We conclude that attack size estimation is close enough to help plan response to DDoS events.
2.6.3 Testbed Experiment
We validate our model with experiments in a testbed (DETER [24]) where we can control all factors: the actual offered load is known and the topology is fixed.
We consider a simple topology (Figure 2.8) where the two access links from R1 and R2 towards R3 have a capacity of 100 Mb/s each. We assume the link from the service router (R3) to the servers has 1 Gb/s
capacity, so the internal network is never a bottleneck.
Here we use slightly unequal legitimate traffic: 40 Mb/s from R1 to R3 and 60 Mb/s from R2 to R3. As part of the legitimate traffic we generate known-good traffic that we use for estimation (Section 2.5.4): 2 Mb/s on the R1-R3 link and 3 Mb/s on R2-R3.
The attack consists of 250 Mb/s of traffic following the distribution of 100 Mb/s on R1-R3 and 150 Mb/s
on R2-R3. Offered load is therefore 140 Mb/s (1.4× link capacity) and 210 Mb/s (2.1× link capacity) on the
two links, for a total offered load of 350 Mb/s (29.2 k queries/second).
We show our estimation works well with the testbed experiment in Table 2.1. Our estimation needs to
know the access fraction or α (how much traffic arrives at the system during an attack). We observe the
rate of known-good traffic at the server to find α. Table 2.1 shows the expected typical known-good
traffic (“normal”), the observed rate under attack (“observed”) and the computed α.
Using the observed offered load (“observed”) of 16.3 kq/s at the server, and α of 0.49, we estimate
33.2 kq/s offered load (“estimated”), which is close to the actual reported 29.2 kq/s (“reported”). Normal
offered load (“normal”) is 8.5 kq/s, which is lower than what we observe during an attack. Our observed rate is significantly lower than the reported or estimated rate, which underscores the importance of estimation.
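As a worked check of these testbed numbers from Table 2.1: α = 207.0 / 425.2 ≈ 0.49, and the estimate is 16.3 kq/s / 0.49 ≈ 33.2 kq/s, close to the 29.2 kq/s that was actually offered.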
We carried out other testbed experiments, varying topology, traffic ratios, and distribution of attackers.
In general, we find our approach works well unless attack traffic is highly unbalanced (one or few sources,
not a distributed DoS).
2.7 Evaluation Approach
We next describe how we evaluate the effectiveness of TE (Section 2.8) and show that the results generalize to different deployments (Section 2.9). Traffic engineering in response to DDoS depends on the anycast deployment: where sites are and with whom they peer. We evaluate on two different testbeds. Our approach (estimation, TE, and playbook construction) can be applied anywhere, with different anycast setups.
We expect network operators will execute our approaches on a test prefix (in parallel with their operational
network) prior to an event so that no service interruption happens.
2.7.1 Anycast Testbeds
We evaluate our ideas on testbeds to see the constraints of real-world peering and deployments. We use
two independent testbeds: Peering [204] and Tangled [25]. Table 2.2 summarizes information about each
testbed with its own set of geographically distributed sites and their locations (Peering supports more sites, but we used 8). These sites have different connectivity, with one or more transits and IXP peers. Most Peering sites have academic transits, while Tangled has more commercial providers. Our testbeds are about the same size as many operational networks, since nearly half of real-world networks
have five or fewer sites [47].
2.7.2 Measuring Routing Changes
To measure the effect of a BGP change, we first change the routing announcement at a site, give some time
to propagate, confirm that the announcement is accepted, and finally start the anycast measurement.
Testbed   Used Sites                                                                       #
Peering   Amsterdam*† (AMS), Boston* (BOS), Belo Horizonte*† (CNF), Seattle* (SEA),        8
          Athens* (ATH), Atlanta* (ATL), Salt Lake City* (SLC), Wisconsin* (MSN)
Tangled   Miami* (MIA), London* (LHR), Sydney* (SYD), Paris* (CDG),                        8
          Los Angeles* (LAX), Enschede* (ENS), Washington* (IAD), Porto Alegre*† (POA)
Table 2.2: Testbeds and the respective sites used in our experiments. Transit providers (*) and IXPs (†).
Route convergence: After a change, we allow some time for BGP route propagation. We know that
routing and forwarding tables can be inconsistent (resulting in loops or black holes) while a prefix is updating [126, 232, 216]. Although routing updates are usually stable within 5 minutes [216], we wait 15 minutes for routing to settle when building our playbook, since that happens outside of an attack. When the attack is not mitigated after deploying a routing policy, our system moves to a different approach after 5 minutes.
Propagation of BGP policies: Policy filtering could limit the acceptance of announced routes, although in practice these limits do not affect our traffic engineering. Best practices call for networks at the edge to filter out AS-Paths longer than 10 hops, and ASes in the middle often accept up to 50 hops; both allow more prepends than we need. Based on routing observations from multiple global locations using RIPE RIS, we confirm that the configurations in our experiments are never blocked by route filtering multiple hops away from our anycast sites.
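The following sketch (Python) outlines this measurement procedure; announce(), route_is_visible(), and measure_catchments() are hypothetical stand-ins for the site-specific BGP tooling, a RIPE RIS or RouteViews visibility check, and the Verfploeter run, respectively.

    import time

    def measure_policy(policy, announce, route_is_visible, measure_catchments,
                       settle_seconds=15 * 60):
        announce(policy)                      # change the routing announcement at a site
        time.sleep(settle_seconds)            # allow time for BGP propagation (non-attack period)
        if not route_is_visible(policy):      # confirm the announcement is accepted
            raise RuntimeError(f"route for {policy!r} not visible; possible filtering")
        return measure_catchments()           # finally, start the anycast measurement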
2.8 Traffic Engineering Coverage and Control
From an estimate of attack load, operators use BGP to shift traffic. We next evaluate three TE mechanisms: AS-Path prepending, community strings, and path poisoning. For each, we consider when it works
and what degree of control it provides. Table 2.3 summarizes our key results from tests on two testbeds
(Section 2.7.1); in Section 2.9 we evaluate generalizability.
Experiment            Key Takeaways
Path prepending       Works everywhere to effectively de-prefer a site (Section 2.8.1.2), but shifts traffic in large amounts (Section 2.8.1.3) and offers few traffic levels (Figure 2.10).
Neg. prepending       Works everywhere to prefer a site (Section 2.8.1.2).
BGP communities       Although widely implemented, well-known communities are not universal (Section 2.8.2.1). When supported, they provide finer-granularity control than prepending (Section 2.8.2.2).
BGP path poisoning    Many Tier-1 ASes drop announcements when they see other Tier-1 ASes in the path (Section 2.8.3.1). Control over traffic is limited by the filters of other ASes (Section 2.8.3.2).
Table 2.3: Experiment summary and findings.
2.8.1 Control With Path Prepending
First we consider AS-Path prepending as a defense strategy.
2.8.1.1 Prepending coverage
Support for AS-Path prepending is quite complete: it requires no explicit support from the upstream provider, so we found prepending worked at all sites in both of our testbeds. In Peering, we are allowed to use a maximum of three prepends, and in Tangled we use up to five prepends. A previous study [40] shows a maximum of 5 prepends is sufficient because 90% of active ASes are located fewer than six AS hops away. We use RIPE RIS [189] to check routing visibility when prepends are in place, and we do not observe changes in route propagation for either testbed; a change there would have revealed the existence of AS-path length filters [108, 103].
2.8.1.2 Does prepending work?
Since AS-Path prepending is widely supported, we next evaluate this attractive TE method.
We explore this question for a representative scenario on Peering with three sites from three continents: Europe (Amsterdam, AMS), North America (Boston, BOS), and South America (Brazil, CNF). In
Section 2.9 we generalize to other configurations. We estimate load by counting /24 blocks in catchments,
then compare the baseline with TE options. (We also explored traffic weighted by traffic loads instead of
blocks, getting the same qualitative results and shapes with different constants, Section 2.8.5.)
Figure 2.9: Peering: Impact of path prepending on the catchment distribution with the AMS, BOS, and CNF sites on 2020-02-24; panels: (a) AMS site, (b) BOS site, (c) CNF site.
Figure 2.9 shows the traffic from each site under different conditions. The middle bar in each graph
is the baseline, the default condition with no prepending. We then add prepending at each site, with one,
two or three prepends in each bar going to the right of center. We also consider negative prepending
(Section 2.5.5.1) in one to three steps, with bars going left of center.
We first consider the baseline (the middle bar) of all three graphs in Figure 2.9. Amsterdam (AMS, the
bottom, maroon part of each bar) gets about 68% of the traffic. AMS receives more traffic than BOS and
CNF because that site has two transit providers and several peers, and Amsterdam is very well connected
with the rest of the world.
We next consider prepending at each site (the bars to the right of center). In each case, prepending
succeeds at pushing traffic away from the site, as expected. For AMS, each prepend shifts more traffic away,
with the first prepend cutting traffic from 68% to 37%, then to 29%, then to about 16%. BOS and CNF
start with less traffic and prepending has a stronger effect, with one prepend sending most traffic away (at
BOS, from 15% to 7%) and additional prepends showing little further change. These non-linear changes
are because changing BGP routing with prepending is based on path length, and the Internet’s AS-graph
is relatively flat [13, 42].
The bar graphs also show that when prepending pushes traffic away from a site, it all goes to some other site. Where it goes depends on routing and is not necessarily proportional to the split in other
configurations. For example, after one prepend to AMS, more traffic goes to CNF (the top sky blue bar)
than to BOS (the middle yellowish bar). These unexpected shifts are why we suggest pre-computing a
“playbook” of routing options before an attack (Section 2.5.2) to guide decisions during an attack and
anticipate the consequences of a change.
We also see that negative prepending succeeds at drawing traffic towards a site: in each case, the bars to the left of center show more traffic at the site that is not prepending while the others prepend. AMS sees relatively little change (68% to 89%) since it already has most traffic, while BOS and CNF each gain up to 68% of traffic.
Figure 2.10: Tangled: Effect of path prepending on catchments (sites CDG, LHR, MIA, POA, and SYD).
All three sites show some networks that are “stuck” on that site, regardless of prepending. One reason for this stickiness is that some networks are only routable through one site because they are downstream of that site's exchange. We confirm this by running traceroutes to two randomly chosen blocks that are stuck at BOS. Traceroutes and geolocation (with Maxmind) confirm they are in Boston, at MIT and a Comcast
network (based on the penultimate traceroute hop). We have used the local-preference BGP attribute to
move such stuck blocks, but a systematic exploration of that option is future work.
In summary, the experiment shows that AS-Path prepending does work and can shift traffic among sites; however, the traffic shift is not uniform.
2.8.1.3 What granularity does prepending provide?
Having established that prepending can shift traffic, we next ask: how much control does it provide? This
question has two facets: how much traffic can we push away from a site or attract to it, and how many
different levels are there between minimum and maximum.
Figure 2.11: Peering: Community strings (at AMS) on catchments for AMS, BOS, and CNF on 2020-02-25.
Limits: Figure 2.9 suggested that in Peering, with those three sites, there is a limit to the traffic that can shift. AMS, BOS, and CNF always retain about 16%, 7%, and 3% of blocks, respectively, regardless of prepending. Figure 2.10 confirms this result with a 5-site deployment (two sites from Europe, one from North America, one from South America, and one from Australia) in our other testbed (Tangled). The x-axis shows the number of prepends applied to each site: zero (0) represents the baseline, the positive numbers (1 to 5) are the number of prepends applied, and the negative numbers represent negative prepends. As depicted, each site can capture at most 55–65% of blocks and can shed at most 95% of blocks, even with up to 5 prepends. We can also see that we do not get granular control, as only three points lie between the minimum and maximum.
We conclude that while prepending can be a useful tool to shift traffic, it provides relatively limited
control.
2.8.2 Control with BGP Communities
We next show that BGP community strings have the opposite trade-off: the options they support vary from site to site, but when available, they provide more granular control over traffic. We use whatever community strings are supported at each site; the specific values for the same concept often vary.
Figure 2.12: Tangled: Using different communities to shift traffic at site LHR on 2020-04-05.
2.8.2.1 Community string coverage
ASes must opt in to exchange community strings with peers, as opposed to prepending's near-universal support (since AS-Paths are used for loop detection, prepending works unless it is explicitly filtered out). Explicit support is required because communities are only a tagging mechanism; the actions they trigger are at the discretion of the peering AS. Prior work has studied the diverse options supported by community
strings [85].
To evaluate coverage, we review support for BGP communities in the testbeds we use. The testbeds
provide information about two dozen locations with diverse peers. We evaluated each of these peers for its support of this feature.
In Table 2.4 we describe path prepending and poisoning support and what types of community strings
are supported at each site. We group communities by class: advertisement options (no-peer, no-export to
customers, and no export to anyone), selective prepending, and peers and transits that support selective
advertisement. We also show the number of non-transit peers and transits.
Peering allows selective announcement to the transits and peers at each site, although the number
of peers and transits varies. Many sites with one transit provide no alternatives. We considered selective
announcement options at AMS, with 854 peers (106 bilateral peers including 2 route servers with 748 peers),
Site: Peering | Tangled
Routing policy AMS BOS CNF SEA ATH ATL SLC MSN | MIA LHR IAD CDG LAX ENS SYD POA
AS-path prepend ✓ ✓ ✓ ✓ ✓ ✓ ✓ ✓ | ✓ ✓ ✓ ✓ ✓ ✓ ✓ ✓
no-peer ✓ – ✓ – – – – – | ✓ ✓ – ✓ – – ✓ ✓
no-export △ – – – △ – – – | ✓ ✓ – ✓ – – ✓ ✓
no-client – – – – – – – – | ✓ – – – – – – –
Selective prepend ✓ ✓ ✓ ✓ ✓ ✓ ✓ ✓ | ✓ ✓ – ✓ – – ✓ ✓
Selective announcement ✓ ✓ ✓ ✓ ✓ ✓ ✓ ✓ | ✓ ✓ – ✓ – – ✓ ✓
Path poisoning ✓ ✓ ✓ ✓ ✓ ✓ ✓ ✓ | ✓ – – – – – – ✓
# non-transit peers 854 0 129 0 0 0 0 0 | 0 0 0 0 0 0 0 250
# transits 2 1 1 1 1 1 1 1 | 1 1 1 1 1 1 1 2
# options 856 1 130 1 1 1 1 1 | 1 1 1 1 1 1 1 252
Table 2.4: Traffic engineering options at each testbed site. ✓: supported, –: not supported, △: not tested.
and 2 transit providers [204]. CNF has one transit provider and 129 peers (with only 6 bilateral peers, other
peers are connected through 2 route servers). For our Verfploeter measurement, we consider the peers
and route servers with bilateral BGP sessions. A single peer covers a small fraction of the address space
in our Verfploeter measurement. For some peers, we observed no coverage at all, which requires further investigation with those peers to confirm our observation. Hence, the selective announcement options do not all make a difference in the catchment distribution (see the catchment at AMS with 12 peers compared to transit-1 in Figure 2.11). The options column of Table 2.4 summarizes these results, showing how
many routing options we have using community strings.
We evaluate Tangled to provide a second deployment with different peers. Tangled built its anycast
network over cloud providers, crowd-sourced transit providers, and IXPs. All transit providers and IXP sites support communities, as described in Table 2.4. In Tangled, the POA site has 250 peers, and most of them support community strings.
We conclude that the number of options at each anycast site may vary depending on the number
of connections with peers and transits. This uncertainty shows the need for a playbook that shows the
possible options.
2.8.2.2 At what granularity do community strings work?
We next examine how well community strings work and what granularity of control they provide. We use
community strings to make BGP selective announcements, where we propagate our route only to specific
transit providers or IXP peers.
For our experiment, we use Peering, varying announcements at AMS and observing traffic when
anycast is provided from AMS, BOS and CNF (the same topology as Section 2.8.1.2). As described in
Section 2.8.2.1, selective announcement community strings are provided only at AMS and CNF, and they
affect our Verfploeter measurement only at AMS with several peers together, two transits one by one, and
route servers.
To select the target ASes for selective announcement, we sort all the working peers of the AMS site based on the size of their customer cone using CAIDA's AS rank list [32]. We then choose the 6 largest IXP peers and the 12 largest, shown as the two leftmost bars in Figure 2.11. We then examine the route server, announced
separately (the next bar), and then all IXP peers including route servers. Finally, we see the coverage with
each of the two transit providers, announced separately.
First, we see that selective announcement provides more control than prepending, as AMS shifts from its baseline of 68% of blocks to anywhere from 53% down to 6% of blocks in other configurations.
Second, we see that there is some overlap in some combinations. For example, each transit reaches
more than half of all blocks reachable from AMS, so we know some blocks are reachable from both transit
providers. Thus, while there is some control over how many blocks to route to AMS, some peers are very
“strong” and will pick up many blocks if they are allowed to announce our prefix.
Third, we see the important role of route servers. While direct coordination with 12 IXP peers brings only 7% of blocks to AMS, a route server alone lets AMS reach more ASes and 14% of the blocks.
Finally, we see that transit providers play an important role. The AMS site has two transit providers: BIT BV (AS12859) and Netwerkvereniging Coloclue (AS8283). Announcing to AS8283 attracts more traffic to AMS than announcing to AS12859. The different AS relationships of these two transits with their upstreams result in different traffic distributions.
As our experiments show, compared to AS-Path prepending, BGP communities provide much finer control over traffic distribution.
To investigate whether the results found on Peering generalize, we ran a set of experiments on Tangled. As with Peering, we select 3 sites from three continents: London (LHR), Miami (MIA), and Porto Alegre (POA), and use communities for selective prepending and selective announcement from LHR. In Figure 2.12, we show the catchment distribution after using community strings from LHR. In the baseline, when no communities are used, LHR handles 69% of traffic. From right to left, we see a gradual decrease in the catchment distribution from 69% to 33%. Stopping announcements to IXP peers reduces traffic from 69% to 64%. Using prepending and no-export communities at AS2914 (NTT America), AS1299 (Telia Company), and AS3356 (Level 3), we can bring LHR's catchment to 30–60%.
Both testbeds show that community strings are not available at all sites, and that even well-known communities are not fully adopted. However, community strings can provide finer-grained control. Selective announcement mostly provides more “flexibility”, depending on how many IXP peers and transits are connected. We also find that some sites do not provide the support that we expect, which means community strings may require an extra step, such as contacting the transit provider for an explicit agreement.
2.8.3 Control with Path Poisoning
We next turn to path poisoning, and show that like community strings, coverage and granularity are limited
by routing filters deployed in upstream peers.
2.8.3.1 Poisoning coverage
Support for path poisoning is dependent on the ASes we are poisoning and on route filters deployed by
our upstream ASes.
We find that many ISPs, especially Tier-1 ASes, filter out AS paths that poison any Tier-1 AS. Tier-1
ASes deploy these filters to block BGP announcements from customers that contain other Tier-1 ASes in
the path, preventing route leaks [220, 139]. This filtering often makes path poisoning ineffective for controlling traffic.
To verify that poisoning Tier-1 ASes is often ineffective due to filtering, we poison Tier-1 ASes while announcing only from AMS in Peering (a unicast setup that removes the influence of other sites), and run traceroutes from 1000 RIPE vantage points to our prefix. Our measurements show evidence of filters when we poison Tier-1 ASes: AS7018 (AT&T), AS6453 (Tata Communications America), and AS1299 (Telia Company). We observe that many vantage points fail to reach our prefix, as they depend on Tier-1 ASes for their routes. Others change their paths to avoid the poisoned Tier-1 ASes. We also validate route disappearance via most Tier-1 ASes using RouteViews telescopes [239].
Although poisoning Tier-1 ASes is often ineffective, poisoning is effective with most non-Tier-1 ASes. Unfortunately, these ASes carry little traffic when they are not immediate upstreams, so poisoning these small ASes has little impact on traffic. We again run traceroutes after poisoning a non-Tier-1 AS (AS57866), and observe that Tier-1 ASes propagate the poisoned path. This shows that poisoned paths with Tier-1 and non-Tier-1 ASes are treated differently by other ASes.
2.8.3.2 What granularity does poisoning provide?
Path poisoning coverage is limited because one cannot usually poison a Tier-1 AS. This same filtering
limits the granularity that poisoning allows: poisoning Tier-1 ASes is not allowed, and poisoning non-Tier-1 ASes has little impact when they are multiple hops away because they represent little traffic. Poisoning immediate neighbors may shift traffic, but it is more complex than just not announcing to them.
Figure 2.13: Peering: Impact of path poisoning (from AMS on 2021-04-09).
Figure 2.14: Tangled: Impact of path poisoning (from MIA on 2021-04-11).
With poisoning coverage limited by filters (Section 2.8.3.1), we next examine what granularity of control
it provides. We expect to see limited range since we cannot poison Tier-1 ASes, and small ASes carry little
traffic.
We test path poisoning in both Peering and Tangled using three sites from each testbed. As expected,
we observe the same traffic distribution when we poison any Tier-1 AS: 30–35% load at AMS (Peering, in Figure 2.13) and 1–3% load at MIA (Tangled, in Figure 2.14).
When we poison a non-Tier-1 AS that is more than one hop away, we observe a small change in the
traffic distribution. In Peering, we can see that poisoning AS57866 reduces a small fraction of traffic from
AMS (Figure 2.13). We observe a similar outcome in Tangled (Figure 2.14).
Our results show that poisoning Tier-1 ASes is limited by filters, and that poisoning non-Tier-1 ASes that are multiple hops away changes only a small fraction of traffic. Poisoning an immediate upstream is equivalent to not announcing to it, so we do not consider that case here. We conclude that path
poisoning is not generally an effective tool for traffic engineering.
Routing Policy                              AMS   BOS   CNF   (Traffic to Site, %)
(a) 6peers, 12peers ∼5 ∼35 ∼55
(b) Route-server 15 35 55
(c) All-IXP-Peers/Poison transits 15 35 45
(d) 3xPrepend AMS 15 35 45
(e) 2xPrepend AMS 25 35 45
(f) 1xPrepend AMS 35 25 35
(g) -3xPrepend BOS 25 65 5
(h) -2xPrepend BOS 35 65 5
(i) -1xPrepend BOS 45 45 15
(j) -3xPrepend CNF 25 15 65
(k) -2xPrepend CNF 35 5 55
(l) -1xPrepend CNF 45 5 45
(m)Transit-1 45 25 35
(n) Transit-2 55 15 25
(o) Poison Tier-1/Transit-2 35 25 35
(p) Poison Transit-1 55 25 25
(q) Baseline 65 15 15
(r) 1,2xPrepend BOS 65 5 25
(s) 3xPrepend BOS 75 5 25
(t) 1,2,3xPrepend CNF 75 15 5
(u) -1,-2,-3xPrepend AMS 85 5 5
Table 2.5: Policies and traffic distribution (in 10% bins); groups sorted by rough fraction of traffic to AMS,
and colors showing the traffic compared to the baseline distribution.
2.8.4 Playbook Construction
Based on our understanding of prepending, communities and poisoning, we can now build a playbook
of possible traffic configurations for this anycast network. In practice, we build the playbook automatically using scripts that connect to BGP, then iterate through different BGP configurations, then run Verfploeter [56] to measure new catchments. Playbooks are necessarily specific to each anycast deployment,
but we show in Section 2.9 that the process generalizes. Using a playbook, an operator does not need a
single “best” approach, rather a combination of approaches in the playbook ensures a greater control over
traffic distribution.
A playbook is a list of variations of routing policy and the resulting traffic distributions. Table 2.5
shows the playbook for our testbed, with the baseline of 65% blocks to a site shown in white. We group
different levels of prepending (positive or negative) at each site, and show selected community string and
poisoning configurations.
Traffic to Site (%) | AMS        | BOS           | CNF
0-10                | a          | k, l, r, s, u | g, h, t, u
10-20               | b, c, d    | j, n, q, t    | i, q
20-30               | e, g, j    | f, m, o, p    | n, r, p, s
30-40               | f, h, k, o | a, b, c, d, e | f, m, o
40-50               | i, l, m    | i             | c, d, e, l
50-60               | n, p       | –             | a, b, k
60-70               | q, r       | g, h          | j
70-80               | s, t       | –             | –
80-90               | u          | –             | –
90-100              | –          | –             | –
Traffic options     | 9          | 6             | 7
Table 2.6: Peering playbook (AMS, BOS, and CNF)
To summarize the many configurations from Table 2.5, Table 2.6 identifies which combinations result in specific traffic ratios at each site. Each letter in this table refers back to a specific configuration from Table 2.5. During an attack, if the anycast system begins at the baseline configuration (q) and AMS is overloaded, the operator could select a TE configuration higher in the table (perhaps ‘e’, ‘g’, or ‘j’). The operator can then see the implications of that TE choice on the other sites (for example, ‘e’ increases load on both other sites, while ‘g’ increases load on BOS but decreases it at CNF).
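To make this table-driven response concrete, the sketch below shows one way a script could consult such a playbook during an attack. The traffic shares mirror a few rows of Table 2.5, while the capacities, offered load, and the function itself are hypothetical illustrations, not our deployed tooling.

```python
# Hypothetical playbook lookup: given an overloaded site and an estimated
# offered load, list pre-measured routing configurations whose expected
# per-site traffic stays within capacity. Shares mirror rows of Table 2.5.

PLAYBOOK = {                      # config label -> expected % of traffic per site
    "Baseline (q)":       {"AMS": 65, "BOS": 15, "CNF": 15},
    "1xPrepend AMS (f)":  {"AMS": 35, "BOS": 25, "CNF": 35},
    "2xPrepend AMS (e)":  {"AMS": 25, "BOS": 35, "CNF": 45},
    "Transit-1 (m)":      {"AMS": 45, "BOS": 25, "CNF": 35},
}

def pick_config(overloaded_site, offered_load, capacity):
    """Return feasible configs, best relief for the overloaded site first."""
    candidates = []
    for name, shares in PLAYBOOK.items():
        expected = {site: offered_load * pct / 100 for site, pct in shares.items()}
        if all(expected[s] <= capacity[s] for s in expected):
            candidates.append((name, expected))
    candidates.sort(key=lambda c: c[1][overloaded_site])
    return candidates

if __name__ == "__main__":
    capacity = {"AMS": 60_000, "BOS": 60_000, "CNF": 60_000}   # packets/s (assumed)
    for name, expected in pick_config("AMS", offered_load=150_000, capacity=capacity):
        print(name, expected)
```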
An operator may also use a playbook with traffic load for two reasons. First, loads in most interesting services have a diurnal pattern. Second, loads from each /24 prefix may vary because of the number of clients behind each prefix (more in Section 2.8.5). Building the playbook with load is computationally simple; an operator can just reuse the same catchment mapping along with the per-prefix load.
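A load playbook can be derived offline from the existing catchment mapping, as described above; the sketch below illustrates the aggregation, with a hypothetical mapping format and made-up per-prefix loads.

```python
# Hypothetical derivation of a load playbook from an existing catchment mapping.
# catchment[config][prefix] gives the site serving that /24 under a routing
# configuration; load[prefix] is its measured query rate (e.g., queries/s).

from collections import defaultdict

def load_playbook(catchment, load):
    """Aggregate per-prefix load into per-site load for every configuration."""
    playbook = {}
    for config, mapping in catchment.items():
        per_site = defaultdict(float)
        for prefix, site in mapping.items():
            per_site[site] += load.get(prefix, 0.0)
        playbook[config] = dict(per_site)
    return playbook

# Toy example: two prefixes with very different client populations behind them.
catchment = {
    "baseline":      {"192.0.2.0/24": "AMS", "198.51.100.0/24": "BOS"},
    "1xPrepend AMS": {"192.0.2.0/24": "CNF", "198.51.100.0/24": "BOS"},
}
load = {"192.0.2.0/24": 900.0, "198.51.100.0/24": 100.0}
print(load_playbook(catchment, load))
```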
Even with attack size estimation, attacks are accompanied by uncertainty, and attacker locations may
be uneven. However, the playbook provides a much better response than “just relying on informal prior
experience” in two ways: the defender can anticipate the consequences of the TE action (that traffic will go
somewhere!), and the defender can choose between different possible outcomes if the first is incomplete.
Playbook flexibility and completeness: Table 2.6 helps quantify the “flexibility” that traffic engineering allows us in this anycast deployment. Using these 10% traffic bins, we see that AMS has 9 options,
CNF 7, and BOS only 6. Because AMS and CNF mostly swap traffic after TE changes, and because BOS is
less well connected, no configuration with three sites allows BOS to take traffic in the 50–60% range, and no 3-site configuration can drive BOS or CNF over 70%.
Policy layout: Day-1 and Day-2 loads are per-site percentages at 00, 06, 12, and 18 GMT, grouped as AMS | BOS | CNF; Catchment gives the percentage of blocks per site (AMS | BOS | CNF).
Baseline        Day-1 load: 77 84 84 84 | 10  8  8  7 | 13  8  8  9
                Day-2 load: 77 84 84 80 | 10  8  7  9 | 13  8  9 11
                Catchment:  68 | 15 | 17
1xPrepend AMS   Day-1 load: 43 49 49 58 | 18 20 18 13 | 39 32 33 29
                Day-2 load: 43 46 46 50 | 18 18 18 18 | 39 36 36 32
                Catchment:  37 | 25 | 38
1xPrepend BOS   Day-1 load: 78 85 83 83 |  4  3  4  3 | 18 12 13 14
                Day-2 load: 78 85 83 79 |  4  4  4  4 | 18 12 13 16
                Catchment:  70 | 7 | 23
1xPrepend CNF   Day-1 load: 83 88 87 87 | 11 10  9  8 |  6  2  3  5
                Day-2 load: 83 89 87 85 | 11  9  9 10 |  6  2  4  5
                Catchment:  77 | 19 | 4
-1xPrepend AMS  Day-1 load: 88 93 92 91 |  5  4  5  3 |  6  3  4  5
                Day-2 load: 88 93 92 90 |  5  4  4  5 |  7  2  4  6
                Catchment:  87 | 8 | 5
-1xPrepend BOS  Day-1 load: 58 65 63 69 | 33 30 31 24 |  9  5  6  7
                Day-2 load: 54 60 62 60 | 37 35 32 32 |  9  5  6  8
                Catchment:  42 | 49 | 9
-1xPrepend CNF  Day-1 load: 45 51 51 58 |  6  4  4  4 | 49 45 45 38
                Day-2 load: 45 48 48 52 |  5  4  4  5 | 50 47 48 43
                Catchment:  42 | 9 | 49
Transit-1       Day-1 load: 41 57 55 55 | 22 22 23 23 | 31 21 22 22
                Day-2 load: 48 59 57 48 | 21 18 21 26 | 41 23 22 26
                Catchment:  38 | 24 | 38
Transit-2       Day-1 load: 64 72 73 75 | 13 10 10  9 | 23 18 17 16
                Day-2 load: 64 72 73 70 | 12 11  9 11 | 23 18 18 19
                Catchment:  53 | 19 | 28
Table 2.7: Load distribution with Peering catchment and B-Root load. Catchment: 2020-02-24, Load: 2020-02-25 and 2020-02-26 (only showing selected policies). Catchment distribution remains similar over the course of the day, so it is shown as a single value.
This analysis shows the central role of well-connected sites like AMS, and it may suggest the need for topology changes (perhaps adding another site in Europe or Asia to share AMS’s load).
2.8.5 Load Distribution
Our catchment playbook (Section 2.8.4) gives an adequate prediction of traffic distribution, which we successfully apply in Section 2.10. Since services care about load, we want to see how load is distributed under different routing changes. An operator can simply build the load playbook from the already-computed catchment mapping without making additional BGP announcements.
Table 2.7 shows different routing changes and their impact on load distribution at different times of the day. Load varies over the day—the AMS site sees less load at 00 GMT since most of Europe is asleep at that time, while BOS and CNF receive more load at 00 GMT as that is a busy hour for those regions. We can also observe that some prefixes contribute more load because of differences in the number of clients behind each prefix; for this reason, BOS prefixes (mostly North American prefixes) contribute less load than the prefixes at the other two sites. Load also remains stable at the same time on different days (varying within 5% most of the time).
We can also see that the relative catchment distribution follows the load distribution, although not exactly. Decisions will be even better when an operator considers different load playbooks at different times of the day. Building multiple load playbooks is simple since we can reuse the same catchment mapping, which remains stable (Section 2.9.3).
2.9 Deployment Stability and Constraints
In Section 2.8 we showed BGP-based TE provides considerable flexibility. Building playbooks supports
defenders by allowing them to explore how transit providers, prepending, community strings, and poisoning affect their specific deployment. We next look at how stable the results are depending on choice of
sites and the number of sites. While the details of the playbook vary for each deployment, and we do not
claim our testbeds represent all possible deployments, we show our approach is flexible and can respond
to attacks in different deployments—our approach generalizes.
2.9.1 Effects of Choice of Anycast Sites
First, we see how the choice of sites affects our playbook. New sites change catchments because catchments depend on location and peering.
In Section 2.8.1, we studied catchments with three specific Peering sites on three continents: AMS, at
a large, commercial IXP in Europe; CNF with an academic backbone transit in Brazil; and BOS, an academic
site in the U.S. We now switch to three educational sites, all in the United States: SEA, at the University of Washington on the west coast; SLC, at the University of Utah in the Rockies; and BOS, at Northeastern University in Boston on the east coast.
[Figure 2.15: Peering: Impact of choosing BOS, SEA and SLC sites on 2020-02-28. Percentage of catchment at BOS, SEA, and SLC under -3x to 3x prepending from (a) BOS, (b) SEA, and (c) SLC.]
More important than geographic location alone, site connectivity is the key factor in choosing sites. Multiple transit providers increase the chance of having more BGP options for traffic control and granularity, while a poorly connected site inside a university network tends to provide fewer traffic-control options.
Prepending baseline: Figure 2.15 shows catchment sizes for the three North American sites with
positive and negative prepending. Now the baseline distribution is unbalanced, but less so than before,
with SEA capturing 50% of blocks. We discussed SEA’s heavy traffic with the Peering operators. They suspect that SEA is near the Seattle IXP, making its paths one hop from many commercial providers.
Which site has the greatest visibility depends on its peering and will vary from deployment to deployment.
Prepending coverage and granularity: As with our prior experiments, we can adjust prepending to see how traffic shifts. With these three sites, traffic shifts very quickly for BOS and SEA after one positive or negative prepend. SLC has more flexibility, perhaps because it has the smallest catchment at the baseline, and it gains more coverage with each step of negative prepending, to 42%, 63%, and 91% of blocks. Often (but not always), academic sites exhibit less granularity because either they have few peers or their peers are academic networks with similar connectivity; as a result, minor changes in AS-path length place one site much further from the others. This less granular control also shows the importance of building a playbook that is specific to a given deployment, and of rebuilding it when the anycast topology changes.
Community coverage: While communities are common at IXPs and transit providers, academic networks (NRENs) offer a simpler set of communities; none of these academic sites provides community strings. This confirms our prior coverage observation: community string support is not uniformly available. We also looked at other combinations of sites in Peering and found similar results (Section 2.9.1.1 and Section 2.9.1.2).
Path poisoning: We repeated our path poisoning experiments with three sites in Boston, Salt Lake City, and Seattle. We confirm that Tier-1 ASes typically cannot be poisoned (Section 2.8.3.1). We also see that filters designed to prevent route leaks [220] interfere with poisoning.
2.9.1.1 Peering: A Small Site in Europe
AMS in Peering is well connected, with two transits and several IXP peers. Next, instead of AMS, we take ATH in Europe, which is connected through a research network in Greece. Our goal is to see whether the findings from Section 2.8 still hold in this anycast setup. As in the previous setup, we also use BOS and CNF.
Figure 2.16 shows the catchment distribution. In the baseline case, since ATH is not as well connected as AMS, it gets only 20% of traffic, while CNF serves almost 50% of traffic in this setup. So the traffic distribution is still skewed. Even though both AMS and ATH are in Europe, we see different catchment control, which indicates the importance of a site’s connectivity.
Prepending works similarly in this setup. BOS and CNF can cut most of their traffic after the first prepend. However, ATH can shift only 7% of traffic after the first prepend, and further prepends show no additional effect. At all of these sites, some blocks are always “stuck” to a particular site. Using negative prepending, we can push most of the traffic to BOS and CNF, but we can push only 40% of traffic to the ATH site.
2.9.1.2 Peering: Sites in Nearby Location
Next, we take three sites that are in nearby locations and have similar connectivity: Boston (BOS), Atlanta (ATL), and Wisconsin (MSN) in Peering. All of these sites are located within the eastern half of the U.S., and they are connected through educational networks—BOS with Northeastern University, ATL with the Georgia Institute of Technology, and MSN with the University of Wisconsin–Madison.
When sites are located in nearby geographic locations and connected by similar networks, path prepending can produce an “all or none” outcome. When we prepend from ATL or BOS, most traffic moves away from these sites: prepending once from ATL leaves no traffic at ATL, and prepending once from BOS cuts traffic from 42% to 14%. With negative prepending, BOS can get over 90% of traffic and ATL nearly 90%, so BOS and ATL can shed or gain almost all traffic with positive and negative prepending.
MSN receives a small fraction of traffic in the baseline, and some blocks are always “stuck” at MSN. With two negative prepends, MSN receives only 27% of the catchment; with a third negative prepend, however, MSN receives almost 80%. This slow and then sudden increase in catchment shows why we need a BGP “playbook” for an anycast setup.
Our experiments confirm that while catchments are deployment-specific, our qualitative results hold—
prepending works but is coarse, and community strings and poisoning are not supported everywhere.
2.9.2 Effects of Number of Anycast Sites
Next, we vary the number of sites and see how that changes traffic control. We select 3, 5, and 7 sites from
each testbed, and build a playbook to evaluate defense options. Figure 2.18 shows selected configurations,
grouped by number of sites.
Baseline: With more sites, overall capacity increases and baseline load at each site falls. For example, in Figure 2.18, the baseline (marked with an asterisk) at the largest site (AMS) shifts from 70% of blocks with
three sites to 61% and 56% with 5 and 7 sites. Smaller sites shift less (BOS goes from 14% to 6% and 6%, and CNF from 15% to 8% and 6%). Greater capacity and distribution require a larger, more distributed attacker to exhaust the overall service. We see similar results on our alternate testbed Tangled (Section 2.9.2.1).
[Figure 2.16: Peering: Impact of path prepending in catchment distribution with ATH, BOS and CNF sites on 2020-05-30. Panels: (a) ATH site, (b) BOS site, (c) CNF site; percentage of catchment under -3x to 3x prepending.]
[Figure 2.17: Peering: Impact of path prepending in catchment distribution with BOS, ATL and MSN sites on 2020-05-29. Panels: (a) BOS site, (b) ATL site, (c) MSN site; percentage of catchment under -3x to 3x prepending.]
[Figure 2.18: Peering: Impacts of changing the number of anycast sites from 2020-04-07 to 2020-04-10. Percentage of catchment at AMS, BOS, CNF, ATH, SEA, SLC, and ATL for 3-, 5-, and 7-site configurations under selected policies.]
[Figure 2.19: Tangled: Impacts of changing the number of anycast sites. Percentage of catchment at LHR, MIA, SYD, CDG, LAX, ENS, and POA for 3-, 5-, and 7-site configurations under selected policies.]
Traffic flexibility: With more sites, the largest site usually shows the largest changes and has the fewest catchment sizes. Comparing the baseline to one prepend in Figure 2.18, AMS shifts from 70% to 37% with three sites, from 61% to 29% with five, and from 56% to 23% with seven, always dropping by about half.
Even with more sites, some blocks are often “stuck” at a particular site. With three negative prepends,
AMS gets most of the traffic, but it tops out at 90% with three sites, and only 87% and 84% with five and
seven. We conclude that each site has its own set of “stuck blocks” that are captive to it and will not move
with traffic engineering.
With more sites, the fine control of BGP communities becomes more important because path prepending provides less fine-grained control: for example, selective announcements with communities are needed for AMS with 5 or 7 sites, since prepending three times shifts away all of its traffic.
New sites: Adding more sites also shows how our playbook can help guide deployment of new sites.
Predicting traffic shifts for a new site is difficult, but experimenting with a test prefix can build a playbook
pre-deployment.
2.9.2.1 More Sites in Tangled
We want to confirm that increasing the number of sites in Tangled shows results similar to those in Section 2.9.2. As with Peering, we take 3, 5, and 7 sites in the Tangled testbed (Figure 2.19). As overall capacity increases with more sites, baseline traffic at each Tangled site also falls as traffic spreads out. LHR gets 55%, 31%, and 25% of traffic with 3, 5, and 7 sites, respectively.
Months AMS(%) BOS(%) CNF(%)
2020-02 68.1 14.6 17.3
2020-04 70.4 14.2 15.4
2020-06 65.3 14.1 20.6
Table 2.8: Percent blocks in each catchment over time.
With more sites, since there is more capacity at other sites, one site can shed almost all of its traffic. For example, LHR can shed all of its traffic when there are 7 sites, which is not possible with 3 or 5 sites.
We can see a “shadowing” instance in the 7-site testbed. After adding ENS and POA, all traffic from the LAX site disappears (Figure 2.19). We believe Cogent traffic now shifts from LAX to POA, and academic traffic shifts from LAX to ENS; we discuss this issue in Section 2.9.2, where IAD shadows LAX.
As in Peering, adding more sites creates new options for where shifted traffic goes. For example, with 3 sites, LHR traffic goes to MIA when we prepend; but with 7 sites, a significant amount of LHR traffic goes to POA—POA traffic increases from 26% to 40% when we prepend from LHR. Hence, it is necessary to keep a “playbook” that shows the traffic distribution after each BGP change.
2.9.3 Playbook Stability Over Time
A playbook has limited use if routing changes soon after it is built. We know routing changes when links fail, or when ISPs begin new peering or purchase new transit. For how long is a playbook applicable?
To answer this question, Table 2.8 shows the fraction of /24 blocks going to each catchment over time
for the baseline configuration. We see that the fraction of blocks is generally quite stable, with only about
5% of blocks shifting in or out of a site. In addition, prior work has shown very strong anycast stability
over hours to days [246]. We also checked the stability of the B-Root catchment, using one month of B-Root catchment mappings with test and production prefixes.
From Figure 2.20, we can see that the catchment remains stable over time. Compared with day 1, and considering ∼2 million prefixes, only 0.35% of prefixes changed their catchment within two weeks, and only 0.65% within one month. Only a tiny fraction of prefixes change their catchment even after a month, irrespective of routing changes made by other ASes; hence, rebuilding the playbook once every week or month should be sufficient.
[Figure 2.20: One month of catchment stability in B-Root: percentage of changes (among ∼2 million prefixes) vs. days after 2021-07-01, for test and production prefixes.]
We also built catchment mappings at different times of the day and found that the catchment distribution remains similar. While catchments are relatively stable, we expect operators will refresh playbooks periodically (perhaps weekly or monthly).
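As a rough illustration of this stability check, the snippet below compares two catchment mappings and reports the fraction of prefixes whose site changed; the mapping format is an assumption made for the example, not our operational data layout.

```python
# Hypothetical stability check: fraction of prefixes whose catchment changed
# between two mappings (e.g., day 1 vs. day 30). Mappings are assumed to be
# dicts of prefix -> anycast site.

def changed_fraction(day1, day30):
    common = day1.keys() & day30.keys()
    changed = sum(1 for p in common if day1[p] != day30[p])
    return 100.0 * changed / len(common) if common else 0.0

day1 = {"192.0.2.0/24": "AMS", "198.51.100.0/24": "BOS", "203.0.113.0/24": "CNF"}
day30 = {"192.0.2.0/24": "AMS", "198.51.100.0/24": "CNF", "203.0.113.0/24": "CNF"}
print(f"{changed_fraction(day1, day30):.2f}% of prefixes changed catchment")
```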
2.10 Defenses at Work
In this section we describe real-world attacks replayed through our system, showing that we can successfully respond to different types of attacks in different ways.
Methodology: We use real-world attacks from the B-Root server operator, the Dutch National Scrubbing Center, and an anonymized enterprise network. These events include polymorphic, adversarial, and volumetric attacks.
We evaluate these events by simulating traffic rates against a three-site anycast network. The first two events use Peering with our AMS, BOS, CNF configuration from Section 2.8; we vary this topology, using BOS, SEA, SLC from Section 2.9.1, in one event (Figure 2.21c). We replay the traffic in simulation, assigning traffic to each anycast site based on catchments measured in our experiments. We do not simulate gradual route propagation, but instead have routing take effect 300 s after a change (a conservative bound; most routing changes happen in half that time). We then evaluate traffic levels at each site and compare them to a target capacity.
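The sketch below captures the spirit of this replay: each query is assigned to the site given by the catchment active at its timestamp, and a routing change issued at time t only takes effect at t + 300 s. The data structures and example values are illustrative, not our simulator’s actual interface.

```python
# Simplified replay: assign each query to a site according to the catchment
# that is active at its timestamp; a routing change issued at time t only
# takes effect at t + 300 s (the conservative propagation bound above).

ROUTE_DELAY = 300  # seconds

def replay(queries, catchments, changes):
    """queries: list of (time, prefix); catchments: config -> {prefix: site};
    changes: time-sorted list of (time_issued, config). Returns per-site counts."""
    per_site = {}
    active = changes[0][1]                  # initial configuration
    pending = changes[1:]
    for t, prefix in sorted(queries):
        while pending and t >= pending[0][0] + ROUTE_DELAY:
            active = pending.pop(0)[1]      # this change has now propagated
        site = catchments[active].get(prefix, "AMS")   # default site for unmapped prefixes
        per_site[site] = per_site.get(site, 0) + 1
    return per_site

catchments = {"baseline":      {"192.0.2.0/24": "AMS"},
              "1xPrepend AMS": {"192.0.2.0/24": "CNF"}}
queries = [(0, "192.0.2.0/24"), (200, "192.0.2.0/24"), (400, "192.0.2.0/24")]
changes = [(0, "baseline"), (50, "1xPrepend AMS")]
print(replay(queries, catchments, changes))   # routing change visible after 350 s
```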
For each attack we run our system in defense, estimating the attack size and selecting a pre-computed playbook response. Since our playbook allows different responses, when we have choices we select different methods of defense: prepending, negative prepending, or community strings (Figure 2.21).
A 2017 polymorphic attack: Our first event is a DNS flood from 2017-03-06 at B-Root [180] (Figure 2.21a). This event was a volumetric, polymorphic attack where the attack queries have common formats like RANDOM.qycl520.com\032 (from 0 s) and RANDOM.cailing168.com\032\032 (changed at 4750 s, hence polymorphic). We assume 60k packets/s (30 Mb/s) capacity at each anycast site. The event was small enough that B-Root was able to fully capture it across all anycast sites active at the time. The event lasted about 5 hours, but we show only the first 2.25 hours. Service and attack capacities today are both much larger; we use a small attack, but scaling the attack and capacity up would show similar results.
In Figure 2.21a we can see that the AMS site receives 100k packets/s of traffic, more than its capacity (shown as the maroon striped area). Our system notices the attack from bitrate alerts. It then estimates the AMS overload by computing the offered load from the observed load and the access fraction. The system maps networks to the number of packets arriving at each site using the pre-computed playbook (Table 2.6). Using this mapping, our system or the operator can then select a response. From Figure 2.21a, we can see the impact of the selected routing approach—announcing only to Transit-1 using a community string. After 300 s there is no striped area, which indicates the attack is mitigated.
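The overload estimate sketched below follows the description above: offered load is approximated as observed load divided by the access fraction (the share of offered traffic that still reaches the server). The numbers are illustrative, not measurements from this event.

```python
# Rough attack-size estimation, as described in the text: if only a fraction
# of offered traffic reaches the server (the access fraction), the offered
# load is approximately observed_load / access_fraction. Values are examples.

def estimate_offered_load(observed_pps, access_fraction):
    if access_fraction <= 0:
        raise ValueError("access fraction must be positive")
    return observed_pps / access_fraction

capacity_pps = 60_000
observed_pps = 58_000          # near line rate, suggesting upstream drops
access_fraction = 0.58         # e.g., inferred from known-good query success
offered = estimate_offered_load(observed_pps, access_fraction)
print(f"estimated offered load: {offered:.0f} packets/s "
      f"(overloaded: {offered > capacity_pps})")
```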
[Figure 2.21: Different attacks with various responses. (a) A polymorphic attack at B-Root defended with community strings. (b) An adversarial event at an enterprise mitigated using positive prepending. (c) An event captured at the Dutch National Scrubbing Center defended using negative prepending. (d) A 2017 event at B-Root mitigated using community strings. (e) A 2020 event at B-Root defended using positive prepending. (f) A 2021 event captured at the Dutch National Scrubbing Center mitigated using community strings. (g) A 2021 event at B-Root defended using negative prepending. (h) A 2021 event at the Dutch National Scrubbing Center mitigated using community strings.]
The attacker changes the query names at 4750 s, making this attack polymorphic. Filtering on query
names would need to react, but our routing changes can still mitigate the attack regardless of this type of
change.
A 2021 variable-length polymorphic attack: We next examine an HTTP-attack launched on an
enterprise network on 2021-09-05 in Figure 2.21b. This polymorphic attack changes after each of three
pauses. The initial attack consists of millions of HTTP GETs (15k packets/s) launched from an IoT botnet;
it terminates when the enterprise’s operator deploys IP-based filtering. About 1000 s later, a different
botnet launched a multi-vector attack combining HTTP GETs using random paths (to avoid caching) and
spoofed TCP ACKs. We then see a lull, brief burst, another lull, and a burst to the end.
The initial attack at time 0 overloads one site (AMS), prompting our routing response. After the estimation, we begin a route shift away from AMS, but the attack ends quickly (after 90 s), while routes are
still changing.
Since the normal traffic sources originate from Europe, most traffic went to AMS even after three prepends. At 1020 s the attack botnet changes, with more attack traffic from Asia and South America (based on IP geolocation from MaxMind). Our route changes in response to the initial attack are still in place, and the renewed attack is successfully spread over all three sites, allowing AMS to tolerate the new attack.
Shifting attacks like this are common with more sophisticated adversaries. Any approach (including ours) that defends with routing changes is limited by route propagation times, so the applicability of such defenses is limited for short-lived attacks like the one at 0 s. However, spreading traffic protects against many types of attack, as we see with the renewed attacks after 1000 s. Varying attacks like this show the importance of reviewing defense effectiveness as the attack continues.
An example attack on a different anycast topology: We consider an LDAP amplification attack at the Dutch National Scrubbing Center on 2021-08-25.
In this case we simulate a super-site at BOS, capable of absorbing 1500k packets/s, while the other sites
(SEA and SLC) support about half (700k packets/s). In Figure 2.21c, the purple cross-hatched area shows
how much the traffic will overwhelm SEA, a smaller site, but can be handled at the super-site. We respond
with negative prepending, with the traffic shift to BOS visible at 300 s. This response mitigates the attack
(no striped area).
A polymorphic event at B-Root: We observed a polymorphic event at B-Root on 2017-02-21 where
the attackers used three different query names—RANDOM.phone.tianxintv.cn\032, RANDOMclgc88.com\032,
and RANDOM.jiang.com\032. The total offered load at the AMS site exceeds the capacity of 60k packets/s (striped area).
Our system announces only to Transit-1 using community strings to mitigate this attack (Figure 2.21d). There is no striped area after the deployment of the new routing policy. Also, when the attackers change their query pattern, our system does not need to make any further routing changes, demonstrating the applicability of TE approaches to polymorphic events.
A 2020 volumetric attack at B-Root: We observed an ephemeral volumetric event at B-Root on 2020-02-14 where the attackers used a single query name—peacecorps.gov. This event lasted only 3 minutes. In practice, no routing approach can work against such short-lived attacks due to the propagation delay of BGP. We stretched the event with a similar traffic rate to see the impact if the attack were to continue longer.
In this event, too, AMS is overloaded with 60k packets/s when the assumed capacity is 40k packets/s (Figure 2.21e). We prepend AMS by 1 so that traffic shifts away from AMS. After 300 s, there is no overloaded striped area at AMS.
These volumetric attacks are common at root servers. Routing-based approaches can defend against such attacks.
A DNS amplification attack: We evaluate another DNS amplification attack collected at the Dutch National Scrubbing Center on 2021-08-22. In this event, too, the AMS site receives traffic far exceeding its capacity (large striped area). By announcing only to IXP peers, our system can mitigate this attack (Figure 2.21f). This is another example where a community string helps us mitigate the attack.
A 2021 B-Root event where our system iterates: We evaluate another event at B-Root that occurred on 2021-05-28. In this event, the queries were IP fragmented (large packet size), and the common query name was pizzaseo.com (we stretched the event since it was short-lived). When the attack starts, our system finds the AMS site overloaded (Figure 2.21g) and determines that prepending from AMS is the best approach to reduce its traffic. However, after prepending AMS by 1, the CNF site gets most of the redirected traffic and becomes overloaded, because the redirected attack sources prefer CNF over BOS. When our system finds the CNF site overloaded, it deploys an approach to reduce traffic from CNF: negative prepending to push more traffic towards the BOS site. After 900 s, there is no overloaded site. This event shows how our system can gradually find the best routing approach.
Defending with community strings: We next consider an attack observed at the Dutch National
Scrubbing Center on 2021-08-27. This attack was a volumetric DNS amplification.
In this attack, AMS is overloaded. Consulting the playbook, we select a response using community
strings to shift traffic, retaining six IXP peers at AMS, while dropping all other peers and transits. The
impact of this change is visible at 300 s in Figure 2.21h, as the attack is successfully spread across all sites.
This example shows how different community strings provide control over traffic distribution.
2.11 Limitations and Future Work
Our playbook of routing options (Section 2.5) is effective against many attacks (Section 2.10). However,
like any defense, it is not impervious. We next describe known limitations and areas of future work.
First, Internet routing is distributed, requiring time to converge. The effects of routing defenses cannot
be seen until convergence. We do not make changes faster than 5 minutes.
Routing convergence time implies that routing changes will have limited applicability to short-lived
attacks (less than 5 minutes). Although routing changes will not hurt the service, their benefits may not
occur until routing shifts.
In addition, routing convergence means that polymorphic attacks that shift traffic sources quickly will be more effective. Routing changes are robust to polymorphic attacks that change method but still act through traffic volume: they will spread load regardless of its content, as we show in the events in Section 2.10. However, when defending against an attack where traffic shifts locations faster than routing converges, one must provision for the worst-case volume at any site under the heaviest traffic it sees. Rapid shifts make defense harder, but not impossible.
Finally, we assume the anycast catchments of the underlying service change slowly (over days). We
showed in Section 2.9.3 that this assumption generally holds.
Although we change routing during an attack to balance load across catchments, we do not explicitly attempt to locate attack origins. As future work, we could use such information to improve defense
selection.
Attack response depends on human factors in service operators and attackers. Explicitly studying such
human factors is potential future research. Our current work focused on the technical feasibility of our
defenses.
2.12 Conclusion
This chapter provides the first public evaluation of multiple anycast methods for DDoS defense (Section 2.10). Our system estimates attack size, selects a strategy from a pre-computed playbook, and automatically performs traffic engineering (TE) to rebalance load or to advise the operator. Our contributions
are attack size estimation and playbook construction. We experimentally evaluate TE mechanisms (Section 2.8), showing that prepending is widely available but offers limited control (Section 2.8.1), while BGP
communities (Section 2.8.2) and path poisoning are the opposite (Section 2.8.3).
In this thesis, our goal was to show new methods without changing existing Internet protocols (Section 1.1). To prove our thesis statement, here we showed our first anti-DDoS system, in which we utilized measurements. In the following chapter, we describe an automated system to filter malicious traffic when redistribution cannot keep traffic within capacity.
Chapter 3
Mitigating DDoS Using Filtering
We have already described a redistribution-based approach using a BGP playbook (Chapter 2). This redistribution method helps an operator utilize the capacity of an anycast network without any collateral damage (filtering legitimate traffic). However, non-anycast networks serve from a single location, where we do not have the option of traffic engineering. Also, there can be attacks where a BGP playbook fails to provide an option that keeps traffic within the limit at all anycast sites. As a result, we need multiple approaches to handle different types of DDoS attacks, because one approach is not sufficient for all attack types. This also demonstrates the breadth of our thesis statement: we want to show multiple systems against a single attack without changing existing protocols. In this chapter,
we propose a layered DDoS defense for DNS root nameservers. Our defense uses a library of defensive
filters, which can be optimized for different attack types, with different levels of selectivity. We further
propose a method that automatically and continuously evaluates and selects the best combination of filters
throughout the attack. We show that this layered defense approach provides exceptional protection against
all attack types using traces of ten real attacks from a DNS root nameserver. Our automated system can
select the best defense within seconds and quickly reduces traffic to the server within a manageable range,
while keeping collateral damage lower than 2%. We show our system can successfully mitigate resource
exhaustion using replay of a real-world attack. We can handle millions of filtering rules without noticeable
operational overhead.
Of the two DDoS defenses, our first priority is redistribution (Chapter 2), since it ensures no rejection of legitimate clients. The redistribution-based approach works well in an anycast network where we have extra capacity at other locations and BGP options to redistribute the traffic. When redistribution does not work, or when we have only a single site, we use an automated system at the overloaded location to serve more legitimate users. Since the main consequence of a DDoS attack is to exhaust resources of the victim [141, 121], in this study we want to free resources by filtering malicious traffic. These two defense systems show that multiple security systems against a single attack are possible using measurements.
The study in this section focuses on the filters that are most relevant to the Domain Name System (DNS) and the DNS roots. However, other systems can adopt this idea with different filters. DNS is particularly challenging because most DNS requests use UDP, making spoofing attacks difficult to counter. Moreover, the DNS root service is a high-profile, critical service, and so it has been subject to repeated DDoS attacks [243, 167, 144]. Yet defenses are vital, since a DNS outage can prevent users from reaching an otherwise active service [207].
The DDiDD design was significantly enhanced and modified by Jelena Mirkovic [197], especially the designs in Section 3.4, the wild recursive filter in Section 3.4.3.4, and the aggressive recursive filter in Section 3.4.3.6. Robert Story also contributed the hop-count filters in Section 3.4.3.3. This thesis incorporates those modifications.
This work was published in the International Conference on COMmunication Systems & NETworkS
(COMSNETS), 2023 [197], and won the best paper award. An extended version was also published in the
Ad Hoc Networks journal [198]. As an outcome of this work, we released DDoS datasets [11] and the
DDiDD tool [240].
3.1 Introduction
Distributed-Denial-of-Service (DDoS) attacks remain a serious problem [109, 157, 235, 69], in spite of
decades of research and commercial efforts to curb them. The ongoing Covid-19 pandemic and our society’s increased reliance on network services have further increased opportunities for DDoS attacks. According to the security company F5 Labs, DDoS attacks increased by 55% between January 2020 and March 2021 [54]. While some large-volume DDoS attacks make front-page news (for example, the 1.35 Tb/s attack [159] on GitHub in February 2018, or the 2021 attack of 17.2 M requests per second detected by Cloudflare [259]), many more attacks occur daily and disrupt operations of thousands of targets [233, 5].
This chapter focuses on protecting the Domain Name System (DNS) root servers against DDoS attacks.
The root-DNS service is a high-profile, critical service, and it has been subject to repeated DDoS attacks
in the past [243, 167, 168, 144, 207]. In addition, because the DNS root “bootstraps” DNS, it is served on
specific IP addresses that cannot be easily modified, thus precluding use of many traditional DDoS defenses
that redirect traffic to clouds to distribute load [34].
There are many types of DDoS attacks. Some attacks are conceptually easy to mitigate with firewalls,
assuming upstream capacity is sufficient, such as volumetric attacks using junk traffic. Others, such as
exploit-based attacks, remain pernicious, but automated patching and safer coding practices offer promise.
Most challenging are attacks using legitimate-seeming application traffic, since a flash-crowd attack from
millions of compromised hosts (also known as layer-7 or application-layer attacks) can resemble a legitimate flash crowd, when many legitimate clients access popular content. At DNS root servers, flash crowd
attacks would generate excessive DNS queries. Because legitimate clients also generate DNS queries, it is
challenging to filter out attack traffic. We focus on mitigation of flash-crowd attacks on DNS root servers.
In flash-crowd attacks, attack traffic often appears identical in content to legitimate traffic. Approaches
to handle flash-crowd attacks thus focus on withstanding the attack using cloud-based services [59, 172,
143, 190]. Other approaches aim to separate legitimate from attack clients, e.g., via CAPTCHAs [164], or by
using models of typical client behavior [185, 230]. These defenses work poorly for DNS root servers. First,
the DNS root operates at a small number of fixed IP addresses that cannot be easily changed. This restriction
precludes use of traditional defenses that redirect traffic to clouds [34]. Second, DNS traffic to roots is
generated by recursive resolvers. Since there is neither direct interaction with a human nor a web-based
user interface, CAPTCHAs cannot be interposed. Third, aggressive client identification requires modeling
a typical legitimate client. Building a typical client model at roots is challenging, because client request
rates vary by five orders of magnitude, from a few queries per day to thousands of queries per second. A
model that spans all types of clients can be too permissive, while a model that captures a majority of clients
may drop legitimate traffic from large senders. Since most DNS traffic is currently UDP-based, spoofing
also is a challenge and spoofers can masquerade as legitimate clients.
In this chapter, we propose a multi-layer approach to DNS root server defense against DDoS attacks,
called DDiDD – DDoS Defense in Depth for DNS. Our first contribution is to propose an automated approach
to select the best combination of filters for a given attack. Selecting from a library of possible filters is
important, since different filters are effective against different attacks, and each filter has a different false-positive rate and a different operational cost, which precludes its continuous use. DDiDD selects the best combination of filters quickly (within 3 s) and continuously re-evaluates filtering effectiveness. When attack traffic changes (e.g., in the case of polymorphic attacks), DDiDD quickly detects the decrease in filtering effectiveness and re-selects a new, better combination, thereby adjusting to dynamic attacks.
Our second contribution is to propose a novel wild client filter for DNS. We provide the first open
description and evaluation of a filter that models per-client behavior for DNS clients. Client modeling is
widely used to protect web servers [231] where a single model for a “typical” web client suffices. DNS
shows a huge range of rates (over 5 orders of magnitude) across clients, so any model that captures this
entire range will be too permissive. Instead, we model each client separately during pre-attack periods, and
identify as attackers the clients that become more aggressive during attacks. In deployment we combine
this filter with anti-spoofing filters to establish trust in client identities.
Our final contribution is to evaluate each candidate filter, including our wild resolver filter
and six other filters proposed in prior work [209, 244, 112, 148]. While prior work quantified performance
of some individual filters for general DDoS attacks [244, 112, 148], and other work qualitatively described
commercial deployments (such as Akamai’s [209]), we are the first to evaluate each filter quantitatively
against real DDoS attacks on a DNS root. We are also the first to propose and evaluate a dynamic multi-filter system for protection of DNS roots against DDoS. Our evaluation uses real-world attacks and normal
traffic taken over 6 years from B-Root, as well as an adversarial, polymorphic attack we have synthesized.
Our evaluation confirms that no single filter outperforms the others, but together they provide a stable
defense against different attack types, converging in 3 s or less, with low collateral damage (at most 2%). Our
analysis provides evidence for the DNS operators about the importance of having an automated system,
and it provides insights about individual filter performance against different types of attacks.
We focus our work on the DNS root server system to meet its unique challenges, but our results also
apply to other self-hosted, authoritative DNS servers.
We release the DDoS datasets and our DDIDD tool that we use in this chapter [11, 240].
3.2 Background: DNS and DDoS
The Domain Name System (DNS) is part of the critical Internet infrastructure. It maps between resource names
and IP addresses, using a hierarchical database distributed across authoritative DNS nameservers (authoritatives for short). The DNS root is on top of the hierarchy, followed by top-level domain (TLD) servers
and subdomain servers. Each authoritative nameserver is responsible for maintaining mapping of some
portion of the DNS namespace, and for replying to queries about that portion to any DNS client.
Users usually do not directly query the DNS, but instead use recursive resolvers (“recursives” for short) that resolve names on their behalf. There are many recursives, and each serves some local users by proxying the translation of DNS names to IP addresses for them, and caching any new data learned from the authoritative servers. Users’ computers are clients of recursives, and recursives are clients of the authoritatives.
Each DNS name consists of multiple components, separated by periods, such as www.example.com.
The rightmost segment denotes a top-level domain or TLD, such as .com or .us. When a recursive resolver
looks up a name, it parses each component, querying authoritative nameservers, if it does not have a
resolution for that given suffix in its cache. Components and their resolutions are cached for durations
specified by their owner, and can be overridden by the recursive’s configuration.
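As a small illustration of this suffix-by-suffix lookup, the sketch below lists the suffixes a recursive would consider, from the TLD downward; it is a simplification that ignores caching details such as TTLs.

```python
# Illustrative suffix enumeration for a DNS lookup, from the TLD downward.
# A recursive checks its cache for each suffix before querying the
# corresponding authoritative server (TTL handling is omitted).

def lookup_order(name):
    labels = name.rstrip(".").split(".")
    return [".".join(labels[i:]) for i in range(len(labels) - 1, -1, -1)]

print(lookup_order("www.example.com"))
# ['com', 'example.com', 'www.example.com']
```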
3.2.1 DNS Root Traffic
Because recursives cache responses from the DNS root, and there are only a few thousand TLDs that can be cached for 24 hours or more, one expects that recursives query the root authoritatives infrequently. The actual traffic from resolvers, however, defies this expectation by a large margin. Figure 3.1 illustrates the complementary cumulative distribution function (ccdf) of the number of queries to B-Root per hour at a random hour in each of the years 2015, 2016, 2017, 2018, and 2019. There is a wide range of rates across 5 orders of magnitude. While the majority of resolvers exhibit the behavior we expect—95% send fewer than 62 queries per hour and 99% send fewer than 1,500 queries per hour—a small number of resolvers send excessive numbers of queries, up to 100,000 queries per hour!
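For concreteness, the snippet below shows one way such a per-resolver rate distribution (as plotted in Figure 3.1) could be computed from an hour of query sources; the input format is assumed for the example.

```python
# Sketch of how a per-resolver query-rate ccdf (as in Figure 3.1) could be
# computed from one hour of query logs; the log format here is assumed.

from collections import Counter

def rate_ccdf(source_ips):
    """source_ips: one source address per query in the hour.
    Returns (queries_per_hour, fraction_of_resolvers_at_or_above) pairs."""
    per_resolver = Counter(source_ips)           # queries per resolver
    rates = sorted(per_resolver.values())
    n = len(rates)
    return [(r, (n - i) / n) for i, r in enumerate(rates)]

queries = ["10.0.0.1"] * 3 + ["10.0.0.2"] * 60 + ["10.0.0.3"] * 100_000
for rate, frac in rate_ccdf(queries):
    print(f"{frac:.2f} of resolvers send >= {rate} queries/hour")
```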
Why are roots receiving so much more traffic than one would expect? There are several possible
explanations. First, roots receive many queries (36–39% in our dataset) that do not have a valid TLD [199,
200, 36]. Chrome browsers make such random DNS queries to detect DNS hijacking [64]. Roots will
return a no-such-domain (NXDomain) reply to these queries, but such replies are not cached by resolvers.
[Figure 3.1: Complementary cumulative distribution function of the number of requests per hour sent to B-Root on five random dates between 2015 and 2019 (ccdf vs. queries per hour, one line per year).]
Second, some recursives may not cache properly and so reissue queries that could be cached. Finally, some
resolvers query the roots directly, perhaps to monitor them. More research is needed to establish root causes
of this excessive traffic.
Root servers operate as a service to the Internet and are committed to serving the root DNS zone as
defined by IANA to all queriers (for example, see [166]). Due to this policy, root server operators prioritize
responding to all queries, with the exception of obvious attacks and operational threats.
3.2.2 The DNS Root and DDoS
Historically there have been several large attacks on DNS root servers. In 2002 [26], a large volumetric attack hit all 13 DNS root servers for an hour, with nine of 13 root servers largely inaccessible. In 2007 [105],
a volumetric attack hit six DNS root servers, and lasted 3 h and 5 h. Two servers were noticeably affected.
In November/December 2015 [144], most of the root name servers were hit by two volumetric attacks containing millions of spoofed queries per second. While some root servers were lightly hit, others saw severe
traffic loss of 95% or more. Analysis showed the attacks inflicted collateral damage to services collocated
with root servers [144]. Although caching of root contents at recursives reduces the end-user impact of
these attacks [146, 120], DNS outages at CDNs have impacted prominent user-facing services [233]. Effective DDoS defense for the DNS root is thus necessary. We use data from eight attacks in the years following
2015.
While DDoS attacks have been studied for decades, DDoS on DNS root servers poses some unique challenges. First, unlike web traffic, most DNS queries use UDP and UDP support is required, so DNS is vulnerable to IP spoofing, making filtering by source IP address ineffective. Second, root servers see a huge diversity of query rates—legitimate traffic spans five orders of magnitude—complicating traffic modeling and the filtering of big senders. Third, root DNS uses a small number of fixed IP addresses, so shifting traffic to other anycast rings is not feasible. Finally, the DNS root has a very high commitment to serving all queries, so collateral damage is a large concern: good queries should be answered.
3.3 Related Work
DDoS attacks have been a problem for more than two decades, and many research and commercial defenses
have been proposed. This section reviews only those solutions that are closely related to our approaches
and to protecting DNS servers against DDoS.
3.3.1 Flash-Crowd DDoS Defenses
CAPTCHAs [21, 117] are a popular defense against flash-crowd attacks. They can be used together with
other indicators of human user presence, to differentiate between humans and bots. However, DNS queries
come from recursives, not directly from human users, so there is no opportunity for a CAPTCHA intervention. FRADE [231] is a flash-crowd DDoS defense, which builds models of how human users interact
with a Web server, including query rates and query content, and uses them to detect bot-generated traffic.
FRADE models a typical client’s behavior. While this works for Web servers, which are browsed by humans, request rates and contents of DNS recursives vary widely. FRADE thus cannot protect DNS servers
against DDoS.
Creating an allow-list of known-good clients is suggested in several studies and RFCs [41, 257, 175, 76,
129] for general protection from unwanted traffic. However, the approaches to create a list of known-good
recursives for DNS roots have not been described nor evaluated. We evaluate this idea in this chapter
under the name “unknown recursive filter,” in conjunction with hop-count filtering [112], and show that
it works well to filter out spoofed attack traffic, but cannot handle attacks that do not use IP spoofing.
Many companies provide DDoS solutions, which may combine signature-based filtering, rate limiting,
and traffic distribution using cloud resources and anycast. Such solutions are offered by Akamai [209,
84], Verizon [28], and Cloudflare [256, 72], for example. Since these solutions are proprietary, we cannot
compare against them directly. In addition, they often collect traffic with DNS-based redirection or route
announcement (friendly hijacking). Neither of these redirections is possible for the root DNS service, which
must operate at a fixed IP address, and cannot easily be re-routed.
3.3.2 Spoofed Traffic Filtering
Several filters to remove spoofed traffic have been proposed: hop-count filtering [244, 112, 148], route
traceback [227], route-based filtering [63], path identifier [255], unknown client filtering [257, 175], and
client legitimacy based on network, transport and application layer information [234]. Of these approaches,
only hop-count filtering and unknown client filtering can be deployed on or close to the target, and thus
show promise for protection of DNS root servers. In hop-count filtering, the filter learns which IP TTL
values are used in packets from a given source IP address, and uses this to filter out spoofed packets. The
original approach [244] advocates for storing one expected hop-count per source. Mukaddam et al. show
that recording a list of possible hop-counts improves the precision of TTL filters [148]. These studies are
performed on 10- to 20-year-old traceroute measurements, and they assume reliable inference of TTL filters from established TCP connections. Both Internet topology and application dynamics have since evolved, and DNS traffic is predominantly UDP. Our work fills this gap by evaluating hop-count filtering against
DDoS with real attack and legitimate traffic, spanning six years and ten attack events.
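As an illustration of the hop-count idea discussed above, the sketch below infers a hop count from a packet’s TTL (its distance from a guessed initial TTL), remembers the set of hop counts seen from each source, and rejects packets whose hop count was never observed. The initial-TTL guessing heuristic and the handling of unknown sources are simplifying assumptions, not the exact filters evaluated in this chapter.

```python
# Simplified hop-count filter: learn plausible hop counts per source during
# normal operation, then drop packets whose observed hop count never appeared.
# Common operating systems start TTLs at 32, 64, 128, or 255 (a heuristic).

from collections import defaultdict

INITIAL_TTLS = (32, 64, 128, 255)

def hop_count(observed_ttl):
    # Guess the sender's initial TTL as the smallest common value >= observed.
    initial = min(t for t in INITIAL_TTLS if t >= observed_ttl)
    return initial - observed_ttl

class HopCountFilter:
    def __init__(self):
        self.learned = defaultdict(set)   # source IP -> set of hop counts seen

    def learn(self, src, ttl):
        self.learned[src].add(hop_count(ttl))

    def accept(self, src, ttl):
        known = self.learned.get(src)
        # Sources never seen before are passed through here; a deployment
        # could instead hand them to another filter (e.g., unknown recursive).
        return known is None or hop_count(ttl) in known

f = HopCountFilter()
f.learn("198.51.100.7", 49)           # 64 - 49 = 15 hops learned in peacetime
print(f.accept("198.51.100.7", 49))   # True: hop count matches what was learned
print(f.accept("198.51.100.7", 120))  # False: 8 hops, likely a spoofed packet
```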
3.3.3 DDoS on DNS
BIND pioneered Response Rate Limiting (RRL) to avoid excessive replies [1] and conserve outgoing network capacity during a volumetric query DDoS. RRL addresses a few misbehaving clients and outgoing
amplification attacks, but it does not address well-distributed, volumetric attacks from large botnets.
Akamai uses sophisticated scoring and priority queuing to protect their authoritative DNS servers
from floods [84, 209]. Akamai scores queries using the source’s expected rate, whether the resolver participated in prior attacks, the source’s NXDomain fraction, query similarity from that source, and an evaluation of TTL
consistency. While two of these scoring approaches are similar to our unknown resolver and wild resolver
filters, there are three major differences. First, Akamai provides no quantitative data about how various
scoring approaches perform against real attack events. We contribute a careful quantitative evaluation of
how well different filters work against playback of real attacks.
Second, we propose a specific mechanism to select filter combinations, and reevaluate them when necessary. Akamai’s approach uses all filters at once to calculate each query score, and Schomp et al. [209]
do not describe how the filters interact. Finally, key parts of Akamai’s scoring system run inline with processing, requiring high-speed packet handling. Our approach operates in parallel with packet processing,
evaluating resolvers to identify potential attackers (or known-good resolvers), simplifying deployment,
particularly for lower-end hardware.
Prior work has studied real DDoS events, inferring operator responses using anycast, and suggesting
possible anycast options in DNS roots [144]. Recent work has taken this idea further, suggesting that a
network playbook can pre-evaluate routing options to shift traffic across anycast sites [190]. Our work
complements this line of research, by studying how filters can reduce load at each anycast site.
Finally, several groups have suggested fully distributing the root to all recursives [93, 8, 125]. Such
wide replication would greatly reduce the threat of DDoS on the root, but not on other DNS authoritative
servers. As a result, on-site defense is still necessary to mitigate DDoS attacks on DNS.
3.4 DDiDD Design
Our goal is to design an automated system, which continuously evaluates suitability of multiple filters to
handle an ongoing DDoS attack on a DNS root server. Our system needs to quickly select the best filter
or the combination of filters, reasoning about the projected impact on the attack, the collateral damage
from the filter on legitimate recursives’ traffic and the operational cost. The system should also be able to
adjust its selection as attack changes. Finally, individual filters need to be configured to achieve optimal
performance – high effectiveness against attacks they are designed to handle and low collateral damage.
The DNS root may also experience a legitimate flash crowd, e.g., when many clients access some popular online content. Due to caching, queries for existing TLDs should not create a flash-crowd effect, but queries
for non-existing TLDs may, since their replies are not cached. DDiDD will only activate when excessive
queries overwhelm server resources. Unless the server can quickly draft more resources (e.g., through
anycast) some queries have to be dropped. Without DDiDD, random legitimate queries would be dropped.
DDiDD (Section 3.5) mostly drops queries from sources causing the legitimate flash crowd.
3.4.1 Threat Model
We assume that an attacker’s goal is to exhaust some key resource at a target by sending legitimate-like
requests to the server. Current authoritative servers (including root) do not store state between requests,
so the attacker can target CPU resources, incoming bandwidth or outgoing bandwidth. In all cases, the
attacker generates more requests than the server can process per second. The attacker may spoof these
requests, or they may compromise new or rent existing bots and send non-spoofed requests.
A spoofing attacker may spoof at random, or they may choose specific IP addresses to spoof. In some
cases, the attacker may choose to spoof addresses of existing, legitimate recursives.
A non-spoofing attacker compromises or rents bots to use in the attack. Drafting new bots carries
non-negligible cost for the attacker.
The features of attack requests depend on the resource that the attack targets. If the targeted resource is
CPU, the attacker may generate many requests per second. If the target is incoming bandwidth, the attacker
may generate large requests to quickly consume the bandwidth. In both of these cases, the content of the
requests is not important, just their rate and size. Finally, if the target is outgoing bandwidth, the attacker
may generate requests that maximize the size of replies, using the ANY query type.
Some attacks are polymorphic – they change their features during the attack event. Any attack features
may change: how spoofing is done, which sources generate attacks, and the content of attack requests.
A naive attacker does not have knowledge about DDiDD and is focused only on overwhelming the
target server. A sophisticated attacker may obtain information about types and parameters of the filters
that our defense uses, and they may try to adjust their attack to bypass the defense, or to trick the defense
into filtering a legitimate recursive’s traffic.
DDiDD works well against both naive and sophisticated attackers, and against both spoofing and
non-spoofing attackers, due to its layered defense and multiple filters, as we show in our evaluation.
3.4.2 DDiDD Operation
To avoid any operational impact on a DNS root server, DDiDD consumes packet captures, operating offline
to get required parameters, independently of the actual DNS server software. DDiDD’s analysis detects
an attack, selects a filter or a combination of filters, then deploys filters via iptables and ipset rules on
the server. We consider six filters, described in Section 3.4.3, and implement four that perform well with
DNS root traffic: the frequent query filter, unknown recursive, wild recursive, and hop-count filters. iptables
works well when the number of rules is small (up to a 2% delay increase for 5 rules) and matching is needed on
query content. We use iptables to implement the frequent query filter, for 1–5 frequent query names.
ipset uses an indexed data structure and provides efficient matching of thousands or even millions of
rules, without added delay. We use it when blocking attack sources, identified by unknown recursive,
wild recursive and hop-count filters. iptables/ipset or their equivalents are available on all modern
operating systems, so DDiDD is highly deployable by any interested DNS root operator. If a root is anycast over multiple points-of-presence (PoPs), DDiDD should be deployed at each PoP independently. No
synchronization or information exchange is required across instances deployed at different PoPs.
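As a concrete illustration of this deployment path, the following minimal sketch (ours, not the released DDiDD code; the set name and example values are placeholders) shows how ipset and iptables rules might be installed from Python. IPv6 sources would use ip6tables and an ipset created with family inet6 in the same way.

import subprocess

def run(cmd):
    # Run a firewall command, raising an error if it fails.
    subprocess.run(cmd, check=True)

def block_sources(sources):
    # Block attack sources (e.g., unknown or wild recursives) with one ipset
    # and a single iptables rule, so matching cost stays constant.
    run(["ipset", "create", "ddidd-block", "hash:ip", "-exist"])
    for src in sources:
        run(["ipset", "add", "ddidd-block", src, "-exist"])
    run(["iptables", "-I", "INPUT", "-m", "set", "--match-set",
         "ddidd-block", "src", "-j", "DROP"])

def block_query_label(label):
    # Drop DNS queries whose payload contains a frequent (attack) query label;
    # DNS encodes names label-by-label, so we match a single label here.
    run(["iptables", "-I", "INPUT", "-p", "udp", "--dport", "53",
         "-m", "string", "--algo", "bm", "--string", label, "-j", "DROP"])

# Example with placeholder values:
# block_sources(["192.0.2.1", "198.51.100.7"])
# block_query_label("attack-suffix")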
DDiDD automatically selects filters to meet two goals. First, we prefer filters that will remove most
attack traffic with low or zero collateral damage to legitimate queries. Second, we aim to select filters
quickly, because most DDoS attacks are short [113]. We then revise our selection if the attack changes, or if
we learn that another filter combination works better. This decision process is fully automated. Further,
DDiDD is flexible and modular, allowing addition of new filters in the future.
3.4.2.1 Attack detection
DDoS attacks cause problems because they exhaust some resource at the target. For example, in the 2015
DDoS attacks on the root DNS, some operators could reply to all queries, but others failed to receive queries
or could not respond to them, typically because of bandwidth limitations [138, 144]. DDiDD detects possible
attacks by monitoring the status of critical resources and recognizing when a resource is overloaded. We
use collectd to periodically collect status information from several resources (CPU, memory, inbound and
outbound network capacity). We identify possible attacks when any resource exceeds a fraction of its
maximum capacity, which we denote as critical load. Our system detects attacks considering different
resources:
Ingress network bandwidth: Volumetric attacks like UDP flooding can saturate the ingress network
bandwidth [107]. The memcached attack did not even send DNS queries, but exhausted channel capacity [159]. If a root server has a capacity of I Gb/s, then attack traffic at a rate I_A (where I_A > I) will result
in the loss of approximately a fraction (I_A − I)/I_A of legitimate queries.
Physical memory: Several types of attacks target server memory, forcing kernels to buffer IP fragments [118, 268], or TCP connections from a TCP-SYN flood attack [65]. Today’s operating systems are
generally hardened to these attacks and drop partially-complete information when resources are limited.
CPU usage: CPU usage increases in proportion to query rates, and while non-DNS traffic may be
filtered at a firewall, DNS queries require some application-level processing. Query processing can incur
an asymmetric cost: queries are cheap for zombies to generate but much more expensive for servers to detect
and discard or handle.
Egress network bandwidth: It is typical in DNS deployments that responses are larger than the corresponding
queries [268, 237]. This amplification can exhaust the egress network bandwidth before the ingress
network is exhausted. Moreover, if the source IP address is spoofed, the reply can more easily exhaust
the target victim. According to Arbor Networks, DNS-based amplification is the most common form of
DDoS attack [157]. The wide adoption of DNSSEC also provides a way to elicit larger DNS responses,
so DNSSEC might be used in such attacks [186]. If the server has E Gb/s egress capacity, then outgoing
traffic greater than E Gb/s can overwhelm the egress network bandwidth.
We detect attack termination by monitoring the amount of traffic blocked by the deployed filters. We
declare the attack over when the traffic blocked by DDiDD decreases significantly, and the load on the
server stays low as well, for an extended period of time. More details are given in [195].
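The core of this detection step is a simple threshold test; the sketch below (ours, with an assumed threshold of 0.8, since the deployed critical-load fraction is site-specific) shows the idea applied to metrics gathered by a monitor such as collectd.

CRITICAL_FRACTION = 0.8  # assumed value; the operational threshold is tuned per site

def possible_attack(usage, capacity, critical_fraction=CRITICAL_FRACTION):
    # usage and capacity map a resource name to its current and maximum value;
    # flag a possible attack when any resource exceeds the critical load.
    return any(usage[r] > critical_fraction * capacity[r] for r in capacity)

# Example: CPU at 92% of its capacity triggers detection.
print(possible_attack({"cpu": 92, "ingress_gbps": 2.1, "egress_gbps": 1.4},
                      {"cpu": 100, "ingress_gbps": 10, "egress_gbps": 10}))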
3.4.2.2 Filter priming and selection
All filters (e.g., frequent query filter, unknown recursive, wild recursive filter, hop-count filter) require
information that must be learned continuously, in absence of attacks. DDiDD continuously learns these
parameters from packet collection and uses them when the corresponding filter is deployed. Some filters
(e.g., frequent query name) also require a short learning phase during an attack. DDiDD triggers a short
learning phase for these filters when the attack is detected, and repeats it regularly to update filter parameters. After the detection, DDiDD uses the incoming traffic to select the filter parameters (for example,
finding the frequent query name to filter). For some filters like unknown resolver filter, DDiDD uses known
legitimate traffic (we provide more details when we describe the filters).
During an attack, each filter and some filter combinations are continuously evaluated for potential deployment. We emulate the effect of each filter or combination on a sample of captured packets. We
estimate the success of each filter based on the acceptable query load at the server, calculated as the server's
average query load times a small multiplicative factor f_ACC. Because root servers operate well below
their capacity, this approach guarantees that query rates below the acceptable load will also not exhaust
the server’s CPU or bandwidth resources, and will not trigger attack detection.
We also estimate collateral damage when the filter is parameterized using peace-time (non-attack)
traffic. The collateral damage depends on the legitimate traffic’s blend and we have verified that it does
not change sharply over time. Thus, we can calculate it once and use this estimate for a long time (e.g.,
months). Based on the estimated effectiveness of the given filter or their combination, and their projected
collateral damage, new filters may be selected for deployment and existing filters may be retired.
3.4.3 DDiDD Filters
In DDiDD we have implemented the following filters: (FQ) frequent query name filter, (UR) unknown
recursive filter, (HC) hop-count filter and (WR) wild recursive filter. In addition to these, we have also
parameter             meaning                     rec. values
L_FQ                  num. queries for learning   10 K
f_FQ                  freq. change threshold      0.3
L_UR, L_HC, L_WR      learn. period               2 h (20 m for WR)
U_UR, U_HC, U_WR      use period                  2 h
w_1, ..., w_N         observ. windows             2^0, 2^1, ..., 2^8
t_WR                  deviance threshold          0.5
Table 3.1: Filter parameters
considered (RC) response-code filter and (AR) aggressive recursive filter. Since these two filters do not
perform well on root server traffic, we do not include them in DDiDD, but we evaluate them on our dataset
and summarize results in this section. We show our recommended filter parameters in Table 3.1. For each
filter, we measure the performance and operational cost.
3.4.3.1 Frequent query name filter (FQ)
In our datasets many attacks have queries that follow a given pattern, e.g., have a common suffix. Thus,
in practice it is useful to develop filters that remove frequent queries during attack periods.
Approach: We use a simple algorithm to identify frequent query names. We continuously observe
L_FQ queries of incoming traffic and learn the frequency of top-level domains, subdomains, and full query names.
Under attack, we repeat the calculation and look for segments (TLDs, subdomains, or full queries) whose
frequency has increased by more than a threshold f_FQ. These segments are candidates for frequent query
names. Segment frequency prior to the attack serves to estimate collateral damage. We evaluated a range
of values for L_FQ and f_FQ. An L_FQ shorter than 10,000 reduced mitigation delay but increased the chance of
mis-identifying frequent queries. Similarly, an f_FQ lower than 0.3 led to some collateral damage. These
values should be calibrated for each server.
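The sketch below (our illustration, not the DDiDD implementation) makes this comparison concrete: it computes per-segment frequencies before and during an attack and flags segments whose share of traffic grew by more than f_FQ, using the recommended threshold from Table 3.1; treating the change as an absolute difference in frequency is our simplifying assumption.

from collections import Counter

F_FQ = 0.3  # frequency-change threshold (Table 3.1)

def segments(qname):
    # Yield the TLD, each longer suffix (subdomain), and the full query name.
    labels = qname.rstrip(".").split(".")
    for i in range(1, len(labels) + 1):
        yield ".".join(labels[-i:])

def frequencies(qnames):
    counts = Counter(s for q in qnames for s in segments(q))
    total = max(len(qnames), 1)
    return {s: c / total for s, c in counts.items()}

def frequent_query_candidates(before, during, f_fq=F_FQ):
    # Segments whose frequency increased by more than f_fq during the attack.
    base = frequencies(before)
    return [s for s, f in frequencies(during).items() if f - base.get(s, 0.0) > f_fq]

# Example: a suffix that jumps from rare to dominant is flagged.
normal = ["a.example.com", "b.example.org", "c.example.net"]
attack = ["x1.attack.example", "x2.attack.example", "x3.attack.example", "a.example.com"]
print(frequent_query_candidates(normal, attack))  # ['example', 'attack.example']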
Operational cost: We can filter frequent query names directly using iptables, or we can identify
sources that send frequent queries and block them using ipset. We denote these two implementation
approaches as FQt and FQs. The FQt
(iptables) implementation imposes added processing delay, which
[Figure: CDF of the number of sources (y-axis) vs. query rate in queries/second and the equivalent inter-arrival time in seconds (x-axes, log scale).]
Figure 3.2: CDF of source query rates, showing a wide range of rates. Data: 2015-11-29
greatly increases once we go past five filtering rules, but it minimizes collateral damage. The FQs (ipset)
implementation adds no measurable delay, but it may create collateral damage if spoofing is present, and
thus must be deployed together with anti-spoofing filters (UR and HC).
3.4.3.2 Unknown recursive filter (UR)
An allow-list with IP addresses of recursives present prior to the attack can be an effective measure against
random-spoofing attacks or those that rent bots. This filter passes traffic from recursives on the allow-list to
the server, and drops all other traffic.
Approach: An allow-list is built by processing incoming traffic to the DNS root server over a period L_UR
prior to an attack event. The list is then ready to be used for some time U_UR, and after that it can be
replaced by a new list.
DDiDD builds allow-lists proactively at all times, observing traffic over period L_UR. We experimented
with L_UR ranging from 10 minutes (capturing around 90% of traffic sources) to 6 hours (capturing 99% of traffic
sources). We also tested values of U_UR of up to 1 day, and the allow-lists were very stable. We selected
2 hours for L_UR, and 1 day for U_UR.
Operational cost: An allow-list can be implemented efficiently using ipset, which adds no processing
delay.
The unknown resolver filter has a number of parameters that we study in Section 3.4.4.
3.4.3.3 Hop count filter (HC)
A hop-count filter builds the TTL-table, containing source IP addresses, along with one or more TTL values
seen in the incoming traffic from each given source. This kind of filter can be effective for attacks that
spoof IP addresses of existing recursives. The filter drops traffic from sources that exist in the TTL-table,
but whose TTL value does not match the values in the table. All other traffic is forwarded.
Approach: We build the TTL-table by processing incoming traffic to the DNS root server over a period
L_HC. The table is then ready to be used for some time U_HC, and after that it can be replaced by a new table.
One could use hop counts [244, 148] or TTL values for filtering. TTL values are a better choice, since
they have a larger value space, which improves filter effectiveness. DDiDD builds its TTL-table by using each
packet in the incoming traffic to the server during the learning period. Such traffic could be spoofed.
Prior approaches [244, 148, 16] rely on established TCP connections or they probe sources to reliably learn
TTL-table values. These approaches do not work for DNS root servers, which serve mostly UDP traffic
and whose policy forbids generation of unsolicited traffic. Hop-count filter parameter values have similar
properties to known-recursive parameter values.
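A minimal sketch of the data structure (ours, abstracting away packet capture) is shown below: it records the TTL values seen per source during learning and then drops traffic from known sources whose TTL no longer matches, while letting unknown sources pass.

from collections import defaultdict

def build_ttl_table(observations):
    # observations: iterable of (source_ip, ttl) pairs from the learning period.
    table = defaultdict(set)
    for src, ttl in observations:
        table[src].add(ttl)
    return table

def should_drop(table, src, ttl):
    # Drop only if the source is in the table and the TTL does not match.
    return src in table and ttl not in table[src]

# Example: a spoofed packet that claims a known source but carries the wrong TTL.
table = build_ttl_table([("192.0.2.1", 57), ("192.0.2.1", 58), ("203.0.113.9", 241)])
print(should_drop(table, "192.0.2.1", 120))    # True: known source, wrong TTL
print(should_drop(table, "198.51.100.5", 64))  # False: unknown sources pass through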
Operational cost: We implement this filter efficiently by adding a new ipset module to match on an
IP address and TTL value (or range).
function select_filters():
    select_candidates()
    deployed = deploy_single()
    if not deployed:
        deploy_combo()

function select_candidates():
    for F in filters:
        if F can reduce load to AL:
            candidates.append(F)

function deploy_single():
    current_fp = 1, best = null
    for C in candidates:
        if C.fp < current_fp:
            best = C
            current_fp = C.fp
    if best is not null:
        deployed.clear()
        deployed.append(best)
        return true
    return false

function deploy_combo():
    tofilter = CL - AL; deployed.clear()
    for T in ur, hcf, fq, wild:
        for C in candidates:
            if C.type not T:
                continue
            if C is effective:
                deployed.append(C)
                tofilter -= C.filtered
                if tofilter <= 0:
                    return

filters: array of all possible filters
candidates: array of filters that can be deployed
deployed: array of currently deployed filters
AL: acceptable load
CL: current load

Figure 3.3: Pseudocode for filter selection
3.4.3.4 Wild recursive filter (WR)
While the query rates of different DNS recursives toward a DNS root server vary widely, an individual recursive's behavior is mostly consistent over short time periods (e.g., several hours). We leverage this observation to build a model of each individual recursive's behavior. The model for a given recursive, along with
the recursive's IP address, is stored in the rate-table. During an attack, we identify recursives that
send more aggressively than their rate-table model predicts as wild recursives. The wild recursive filter drops traffic
from wild recursives, and it forwards all other traffic.
Approach: A wild-recursive filter learns the rate of a DNS recursive's interaction with the DNS root
server over multiple time windows, w_1, w_2, w_3, ..., w_N, during a learning period L_WR. For each window,
the filter learns the mean and standard deviation of the number of queries observed and stores them in the
rate-table. The rate-table can be used for some time U_WR, and after that it can be replaced by a new table.
When the attack is detected, the filter measures the current query rates over the same windows. It
then calculates the difference between the current rate rc_{w_i} in window w_i and the rate expected by
the model, mean_{w_i} + 3 × std_{w_i}. We then calculate a smoothed, normalized deviance score d_t at time t
as:
d_t = 0.5 × d_{t−1} + 0.5 × Σ_i (rc_{w_i} − mean_{w_i} − 3 × std_{w_i}) / std_{w_i}
Those recursives whose deviance score exceeds threshold t_WR are identified as wild recursives.
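The sketch below (ours; window handling is simplified and the standard deviations are assumed to be non-zero) shows the deviance computation and the threshold test, with t_WR from Table 3.1.

T_WR = 0.5  # deviance threshold (Table 3.1)

def deviance(prev_score, current_rates, means, stds):
    # Smoothed, normalized deviance across observation windows w_1..w_N.
    excess = sum((rc - m - 3 * s) / s
                 for rc, m, s in zip(current_rates, means, stds))
    return 0.5 * prev_score + 0.5 * excess

def is_wild(score, threshold=T_WR):
    return score > threshold

# Example: a recursive sending far above its modeled rate in every window.
score = deviance(prev_score=0.0, current_rates=[50, 90, 160],
                 means=[5, 9, 16], stds=[2, 3, 5])
print(score, is_wild(score))  # approximately 34.65 True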
We experimented with values for L_WR between 10 minutes and 6 hours. While performance was
relatively stable, lower values led to lower collateral damage, since they captured recent traffic trends.
[Figure: response ratio per day for Response code 0 and Response code 3; panel (a) shows the Rcode ratio changing during the November 30, 2015 and December 1, 2015 events, and panel (b) shows the Rcode ratio changing during the attack event of March 6, 2017.]
Figure 3.4: Rcode trend during normal and attack traffic in root A
We experimented with uniformly distributed and exponentially distributed (powers of two) window sizes.
Exponentially distributed windows led to lower mitigation delay, because they capture both aggressive
and stealthy attackers. We also experimented with 1–9 windows. A higher number of windows had slightly
higher collateral damage, but significantly improved filter effectiveness, because it enabled us
to identify sporadic attackers. Learned models quickly become outdated, so we set U_WR = L_WR. We
experimented with values for the threshold t_WR from 0.1 to 16. Values higher than 0.5 minimized collateral
damage.
Operational cost: This filter is implemented by processing the traffic incoming to the DNS server offline.
When the attack starts, the filter identifies wild recursives and inserts corresponding ipset rules to block
their traffic.
3.4.3.5 Response code filter (RC)
For some DNS servers, queries with missing names are rare. For example, at Akamai only a small fraction
of legitimate queries result in NXDomain [209] replies, while attackers often query for random query
names.
We therefore considered a filter based on response codes that discards NXDomain responses (Response
code 3 as shown on 2017-03-06 in Figure 3.4b). Unfortunately, more than 60% of root DNS traffic involves
non-existing TLDs, as shown in the seven days of 2015 and 2017 (Figure 3.4). We also find that sometimes
the attack query names receive a “NoError” response (Response code 0 in Figure 3.4a), and
we cannot filter out “NoError” replies. Thus for root DNS traffic, a response code filter would have large
collateral damage, and we do not currently include it in DDiDD.
3.4.3.6 Aggressive recursive filter (AR)
This filter blocks aggressive clients during an attack, starting with the client that sends the highest
query rate and moving down. The filter adds addresses to the block-list until the query load falls to acceptable levels. We evaluated this filter on our dataset. It performs well when attacks use non-spoofed traffic,
but its performance is consistently worse than that of the wild recursive filter. We thus do not include it in
DDiDD.
3.4.4 Parameter Validation for the Unknown Resolver Filter
Next, we validate our choice of parameters for the unknown resolver filter. We show how we choose
the observation period, L_UR, and the observation frequency, U_UR.
How long to observe?: We want to observe long enough to get most legitimate sources, so we first
consider how often each source makes queries. Figure 3.2 shows that sources place queries at many different rates with a long tail, with about 80% having inter-arrivals of 1000 s or more, while a few place
thousands of queries per second. This wide range of rates is also visible in earlier studies [36, 248].
We are certain to capture all frequent queriers in our accept list, as a relatively short observation period
(2 hours) provides a list that covers the majority of queries. However, a short observation will miss the
infrequent queriers. We next quantify how many queries and unique sources we can cover for a given
observation duration.
[Figure: percentage of future queries covered (y-axis, %) vs. the duration used to build the accept list (x-axis, 10–60 minutes).]
Figure 3.5: Impact of the duration to build the resolver list
[Figure: CDF of new IP addresses (y-axis) vs. time from start in minutes (x-axis, 0–1500).]
Figure 3.6: CDF of new IP addresses with time
We expect shorter observations to capture only the most frequent queriers and, hence, most
legitimate queries. To evaluate how long we must observe to cover most queries, we examine normal
traffic of the day 2015-11-29 and build lists from data lasting 10 to 60 minutes, starting from
00:00:00 UTC. We evaluate how many future queries these lists can identify in traffic from 02:00:00 UTC to 23:59:59
UTC. Figure 3.5 shows that only 20 minutes of traffic covers over 90% of future queries. Hence, only
a small amount of traffic covers the sources that will make most of the day's future queries.
Although a short duration covers the frequent queriers that make most queries, we still miss many non-frequent queriers. Figure 3.6 shows that, within a ∼1400-minute time frame, we see 50% of
unique sources within the first ∼250 minutes. This implies that even if we create the accept list from 250 minutes
of traffic, we still miss 50% of the future unique legitimate sources (though not 50% of the queries). However,
these infrequent sources send very few queries.
We choose to take 2 hours of traffic to build the known resolver list. According to Figure 3.6, we expect
to miss around 60% of the unique sources of the day (though attacks do not persist the whole day). We
choose this value because it is sufficient to cover most of the legitimate future queries (Figure 3.5).
Also, non-frequent queriers mostly have well-configured caches, and most of the time they get their query
responses from their own cache.
How often to build?: We next consider how often we need to build a known resolver list, confirming
that once a day is sufficient (Section 3.4.3.2).
How often the list is built can influence its success. We must build the list frequently enough that it
reflects current queriers, yet list generation has some cost so we cannot build it continuously. To evaluate
how list age changes accuracy, we create lists from 2 hours of data starting 0 to 960 minutes before the
attack. We compute the confusion matrix to see how much malicious traffic we can block along with the
collateral damage for 2017-03-06 event. From Figure 3.7, we can see little change in the confusion matrix
even if we build the known resolver list 960 minutes before the attack event. Sensitivity and specificity
[Figure: sensitivity (left y-axis, %) and specificity (right y-axis, %) vs. accept-list start time before the attack (x-axis, 0–960 minutes).]
Figure 3.7: Impacts of the accept-list creation time based on the confusion matrix for the 2017-03-06 event, considering all queries
values differ by 1% to 2% based on the creation time (sensitivity means how much malicious traffic we can
detect and specificity means how much legitimate traffic we can detect).
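For clarity, the sketch below (ours, with made-up counts) computes these two values exactly as they are defined here: sensitivity is the fraction of malicious queries that are blocked, and specificity is the fraction of legitimate queries that are passed.

def sensitivity(blocked_malicious, total_malicious):
    # Fraction of malicious traffic the filter detects (blocks).
    return blocked_malicious / total_malicious

def specificity(passed_legitimate, total_legitimate):
    # Fraction of legitimate traffic the filter detects (passes).
    return passed_legitimate / total_legitimate

# Example with made-up counts:
print(sensitivity(9_400, 10_000))    # 0.94
print(specificity(99_000, 100_000))  # 0.99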
Since time of list construction has relatively little effect on effectiveness, we build the resolver accept
list once a day.
3.4.5 Filter Selection and Synchronization
In this section we discuss how filters are selected for deployment and why their learning periods have to
be synchronized.
Filter selection: Our goal was to design an effective filter-selection process that minimizes collateral
damage to legitimate traffic. Our pseudocode for filter selection is given in Figure 3.3. At each time interval
(e.g., one second), if the current query load (CL) on the server (queries per second) is higher than the
acceptable load (AL), we first select candidate filters. We continuously emulate the operation of all filters,
producing for each filter an estimate of the number of queries it would drop. Our candidate
filters are those whose drop estimates are positive. If among the candidate filters there are any that could
reduce the load to AL, we will select the filter with the lowest estimated collateral damage (described in
Section 3.4.2) and deploy only this filter (function deploy_single).
[Figure: layered defense with filter layers FQ, UR, HCF, and WR; bypass paths p1–p5 correspond to random queries, spoofing known IPs, and poisoning the models of the later filters.]
Figure 3.8: Swiss cheese model of defense
If no such filters exist, we will consider combinations of multiple filters (function deploy_combo). Not
all combinations are valid, which greatly reduces the complexity of this step. The HC filter must be deployed after
the UR filter, since HC passes through addresses that do not exist in the TTL-table. The UR filter removes queries
that spoof unknown recursives, thus guaranteeing that the addresses of queries that pass will be present in
the TTL-table. FQt could be deployed together with any other filter. FQs and WR filters must be deployed
after UR and HC, because they make per-source blocking decisions, and require reliable source identities.
Since both FQt and FQs filter frequent query names, only one of them should be deployed. FQt has zero
collateral damage and is considered first. If it cannot be supported operationally (there are more than
five query names, and thus there will be added processing delay), FQs will be considered. In addition to
considering filters in a specific order for deployment, we only consider filters that are effective, i.e., those that filter at
least 5% of excess traffic (function effective). Deployment is finalized as soon as the filter combination can
reduce the load below AL.
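The following sketch (ours, paralleling Figure 3.3 rather than reproducing the deployed code) illustrates this selection logic; each candidate filter is assumed to carry pre-computed estimates of the queries per second it would drop (filtered), its estimated collateral damage (fp), and its type.

def select_filters(candidates, current_load, acceptable_load):
    # Return the filters to deploy: the single sufficient filter with the lowest
    # collateral damage, or an ordered combination if no single filter reaches AL.
    to_filter = current_load - acceptable_load
    sufficient = [c for c in candidates
                  if current_load - c["filtered"] <= acceptable_load]
    if sufficient:
        return [min(sufficient, key=lambda c: c["fp"])]
    deployed, remaining = [], to_filter
    for ftype in ("ur", "hc", "fq", "wr"):    # ordering constraints from this section
        for c in candidates:
            if c["type"] != ftype:
                continue
            if c["filtered"] < 0.05 * to_filter:  # skip filters that are not effective
                continue
            deployed.append(c)
            remaining -= c["filtered"]
            if remaining <= 0:
                return deployed
    return deployed

# Example: no single filter suffices, so UR and FQ are combined.
cands = [{"type": "ur", "filtered": 600, "fp": 0.01},
         {"type": "fq", "filtered": 500, "fp": 0.0}]
print(select_filters(cands, current_load=2000, acceptable_load=1000))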
Filter synchronization: DDiDD may engage one or multiple filters to mitigate an attack. When some
filter combinations are engaged, it is important that their learning periods match, so that each filter has
entries for the same recursives in its table. Because we need a shorter learning period for the wild recursive
filter than for the unknown recursive and hop-count filters, we learn parameters over 2 hours and then
keep updating WR entries every 20 minutes to keep them as recent as possible.
Sophisticated adversary: Each of the filters we consider could be bypassed by a sophisticated adversary. We now discuss how their combination makes this challenging (Figure 3.8).
The FQ filter could be bypassed by an attacker sending random queries. The UR filter could be bypassed by
an attacker spoofing existing (known) recursives. The UR, HC, and WR filters could each be bypassed by
poisoning their models during learning. One way to counter poisoning attacks could be to learn over longer
time periods, from random traffic samples. While this works for UR and HC, whose data is fairly stable,
it would greatly diminish the effectiveness of the WR filter, and it would complicate filter synchronization. Our
approach is to handle poisoning attacks only at the WR filter, and to rely on the Swiss cheese defense model
(Figure 3.8) to capture attackers that bypass one filter layer, but can be stopped at the other. Thus random
queries may bypass FQ, but will be stopped at UR if they are from new sources, or at HCF if they are
spoofed. At WR, queries sent by recursives at high rate (spoofed or not) can be detected and dropped. This
leaves poisoning attacks at WR filter (thin red arrow at the top right of Figure 3.8), where each bot poisons
the rate model for itself by sending sporadic traffic during learning, with high fluctuations. This can lead
the filter to model a large expected rate for the bot in each window, due to the large standard deviation. To
address this attack, we learn only when the load on the server is low (below its average plus one standard deviation). This forces the attacker
to engage their bots very sporadically, which becomes an outlier and is excluded from the model.
3.5 Evaluation
We use datasets containing real DNS root traffic and attacks (Section 3.5.1) to calculate success metrics
(Section 3.5.2) that characterize DDiDD performance (Section 3.5.3).
3.5.1 Datasets
We use datasets collected at B-Root, one of 13 root identifiers. These datasets are publicly available [11],
in both pcap and text formats. The operators of B-Root identify attacks based on unusual traffic rates and
system load, as seen from operational monitoring. Our evaluation uses ten diverse attack events spanning
six years (see Table 3.2). During events in 2017 and later, B-Root employed an anycast network with multiple
PoP  date  start(UTC)  dur(sec)  ULQ  DNSmon  FQ(con cd)  UR(con cd)  HC(con cd)  WR(con cd)  DDiDD-F(con cd)  DDiDD-P(con cd)
LAX 2015-11-30 06:50 8,918* 98 100 100 0 99.1 1.8 0.3 1.4 0 5.5 99.1 0.4 99.3 1.7
LAX 2015-12-01 05:10 3,781* 100 100 98.7 0 99.1 0 0.6 0 0 0 99.3 0 99.4 0
LAX 2016-06-25 22:18 2,436* 52 99 0 0 100 0.1 0 0 0 0 100 0.1 100 0.1
LAX 2017-02-21 06:40 6,992* 2 1 98.4 0 0.1 1.8 0.1 1.5 98.4 0 99 0 98.8 0
LAX 2017-03-06 04:43 19,835* 6 5 98.8 0 0 1.1 0 0.4 91.6 1.5 100 0 92.3 1.5
LAX 2017-04-25 09:54 10,414* 3 4 98.3 0 0 1.1 0 0.7 94.9 2 99.1 0 95.1 2
ARI 2019-09-07 06:45 80 0 5 0 0 93.3 0.6 0 0.8 0 0.1 93.7 0.6 93.1 0.6
LAX 2019-09-07 06:45 80 23 5 0 0 100 0.9 0 0.2 0 0.2 100 0.9 100 0.9
MIA 2019-09-07 06:45 80 8 5 0 0 100 0.6 0 0 0 0.4 100 0.6 100 0.6
SIN 2020-02-13 08:05 206 14 2 100 0 0 0.3 4.8 0 38.5 0.5 100 0 97.5 0.8
ARI 2020-10-24 02:55 445 67 7 0 0 100 1.3 0 0 0 0.8 100 1.3 100 1.3
ARI 2021-05-28 02:35 70 25 3 0 0 100 1.1 0 0 0 0.1 100 1.1 100 1.1
IAD 2021-05-28 02:35 70 63 3 0 0 100 0.4 0 0 2.7 0 100 0.5 100 0.5
LAX 2021-05-28 02:35 70 3 3 0 0 100 0.4 0 0 0 0 100 0.4 100 0.4
MIA 2021-05-28 02:35 70 2 3 0 0 100 1.5 0 0 0 0 100 1.7 100 1.7
SIN 2021-05-28 02:35 61 41 3 0 0 100 0 0 0 0 0 100 0 100 0
Table 3.2: DDiDD performance: comparing load control (con) and collateral damage (cd) for each possible
filter and DDiDD as a whole. We highlight results within 1% of the best performance in bold. For long
attacks (*) we simulate only the first 600 seconds.
points-of-presence (PoPs). Some attacks affected only one PoP (e.g., 2020-02-13), while others targeted all
PoPs (e.g., 2020-05-28).
We confirm that our selected events are DDoS attacks based on DNSmon observations shown in the
“DNSmon” column Table 3.2. DNSmon reports the fraction of responses received by many (about 100)
physically distributed probers, which query each DNS root every 10 minutes. In Table 3.2, the first three
attack events had a large impact, showing 99–100% of unanswered queries, as publicly reported [144, 167,
168]. The other seven events had smaller impacts (1–7% unanswered queries), because they were shorter
(5 minutes and less) and sent at a lower rate, and because B-Root’s capacity had increased.
DNSmon reports reflect aggregate performance across all PoPs, so the percentage of unanswered
queries at each PoP might be higher than measured by DNSmon. We include traces from all the PoPs
in our analysis, and simulate running of DDiDD at each PoP. We use ground truth for attack start and stop
times to start and stop DDiDD's simulation, and use f_ACC = 2.5. During attacks, the query rate at the server
increases more than 10-fold, so using f_ACC = 2.5 is reasonable.
While attackers could generate any random traffic to port 53, attacks in our dataset had unique content
or traffic signatures, which enabled us to establish ground truth during evaluation. Attacks on 2015-11-30,
2015-12-01, 2017-02-21, 2017-03-06, 2017-04-25, and 2020-02-13 used either several specific queries or
a random prefix with a common, specific suffix. The attack on 2016-06-25 was a TCP SYN flood. Attacks
on 2019-09-07, 2020-10-24, and 2021-05-28 sent malformed UDP traffic to port 53, which consumed
resources at the server but did not parse into a legitimate query format.
Ethical considerations: Our analysis is performed on packet traces incoming to and outgoing from
B-Root. Both source and destination IP addresses are anonymized using Crypto-PAn [253, 266]. Packet
payloads are not anonymized, which allows us to establish ground truth in evaluation. After ground truth
is established, analysis is automated and we report only aggregate results. These steps preserve resolver
privacy.
3.5.2 Metrics
Our goal is to reduce load on the DNS root server, by filtering malicious traffic, to allow serving more
legitimate users when under duress. We therefore consider two success metrics: (1) controlled load, the
percent of time when server load is at or below acceptable load due to defense’s actions, ideally 100%; (2)
collateral damage, the percent of legitimate queries filtered, with an ideal of 0%.
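Both metrics are straightforward to compute from a trace; the sketch below (ours, with made-up values) makes the definitions precise.

def controlled_load(per_second_load, acceptable_load):
    # Percent of time intervals in which server load stays at or below AL.
    ok = sum(1 for load in per_second_load if load <= acceptable_load)
    return 100.0 * ok / len(per_second_load)

def collateral_damage(legit_dropped, legit_total):
    # Percent of legitimate queries dropped by the defense.
    return 100.0 * legit_dropped / legit_total

print(controlled_load([900, 1200, 950, 980], acceptable_load=1000))  # 75.0
print(collateral_damage(70, 10_000))                                 # 0.7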
3.5.3 DDiDD Performance
Table 3.2 shows DDiDD’s performance per each PoP affected by a given attack. We show several defense
configurations: first, each filter by itself (FQ, UR, HC, or WR), then the full DDiDD with all four filters
and a partial DDiDD with only UR, HC, and WR filters. Removing the FQ filter from the partial DDiDD
simulates a smart adversary, which randomizes queries for each attack.
These experiments confirm that no single defense does well in all attack cases. The FQ filter does very
well in attacks that use similar queries, but has no effect otherwise. The UR filter performs well in many
attacks. HC does not work well by itself, but enhances other filters. Finally, WR does well in a few attacks, where some recursives, which are present prior to the attack, modify their behavior to become more
aggressive. This evaluation demonstrates that we need multiple filters to handle all attack events.
We further show that the full DDiDD automatically chooses the best filter or combination of filters for
each attack, always achieving 93% or higher controlled load and at most 1.7% collateral damage. DDiDD
selects the optimal filter combination in 1–3 seconds.
Partial DDiDD’s performance (the right-most column) shows how well it would handle an adversary
that randomizes queries. DDiDD controls load for most of the time (92.3%–100%), with low collateral damage
(2% or lower), with all filters selected in 3 s or less.
We compare collateral damage of DDiDD with percentage of legitimate queries at the affected PoP
that fail to receive a response during the original attack, without DDiDD. We calculate this percentage
from our datasets and show it in the fifth column (ULQ) of Table 3.2. This is an internal measure of DoS
impact, and it can differ from the external measurements by DNSmon for several reasons. First,
DNSmon averages measurements over 10 minutes and across all PoPs for a given root, while our internal DoS measure is per PoP and is averaged over the duration of the attack. For these reasons DNSmon will
often underestimate attack impact, as is the case for many of our attacks. Second, if B-Root’s incoming
bandwidth were overloaded, DNSmon could measure higher loss rate than our internal-DoS measure. This
is the case, for example, for the 2019-09-07 attack.
Full DDiDD’s and partial DDiDD’s collateral damage is always lower than DNSmon (external) and ULQ
(internal) measures. Thus DDiDD improves legitimate traffic’s handling during DoS attacks. DDiDD is also
effective, reducing resource consumption by controlling server load, 93–100% of time, after a short initial
delay of 1–3 seconds.
Legitimate flash crowds: While three attacks in 2017 overloaded B-Root, they involved a large
number of recursives (around 50 K per event), a large difference in rates per recursive, and no
spoofing. Legitimate flash crowds would show similar patterns. In the 2017 events, DDiDD dropped only traffic
that was causing the overload, and only as much as needed to free server resources from overload.
Polymorphic attacks: In our evaluation events, DDiDD changes defenses as the attacks change.
During the 2015-11-30 attack there were periods where existing clients were spoofed with incremental TTL
values, traversing the entire TTL value space. Partial DDiDD correctly switched from UR to the UR+HC combination
to handle these cases. During the 2020-02-13 attack, no single UR, HC, or WR filter could sufficiently reduce
the load. Partial DDiDD deployed all three filters, which managed to reduce the load.
We demonstrate how DDiDD can nimbly adjust filter selection by using an artificial polymorphic attack in Figure 3.9. We create a synthetic attack by mixing legitimate traffic from February 2017 with five
synthetic attacks, which correspond to p1–p5 labels in Figure 3.8: (p1) a random-spoofed attack with a
fixed query name, (p2) an attack with random query names, (p3) same as (p2) but also spoofs only known
recursives using random TTL values, (p4) same as (p3) but spoofs with correct TTL values, (p5) same as
(p1) but 90% of queries are random and 10% use a fixed query name. We find that DDiDD quickly converges
to the best single filter for each attack strategy: FQt
, UR, HC, WR and FQs, respectively. Figure 3.9 shows
passed and filtered legitimate and attack traffic for our synthetic attack—overall controlled load was 99.1%,
collateral damage was 0.7%, and selection delay was 1–4 s.
3.5.4 Impacts On Resource Consumption
We next look in detail at how DDiDD handles a DDoS attack. To evaluate the performance, we conduct controlled experiments in the DETER testbed (Section 3.5.4.1), evaluating resource consumption at a
simulated target.
[Figure: normalized load (y-axis, 1 = average) vs. time in seconds across five attack phases, showing passed and filtered attack and legitimate traffic.]
Figure 3.9: DDiDD evaluation for a synthetic polymorphic attack.
Figure 3.10: Experimental setup and the interaction with our automated system
3.5.4.1 Experimental setup
We replay traffic with LDplayer [267], using 22 clients to reproduce the full, original bitrate. Figure 3.10
shows the experimental setup to replay an attack event to the server. Figure 3.10 also illustrates how the
different components of our system interact with each other to mitigate the attack.
The attack target is an emulated DNS root server. We implement it with BIND, using the LocalRoot
method of providing root service [94].
Reproducing viable events: To show the effects of attacks that drive a production system to resource
exhaustion requires many servers to attack and to emulate the service. It is also difficult to perfectly
reproduce attacks since the stored traces are often unable to capture the entire attack because of limitations
[Figure: ingress network bandwidth (Gb/s), CPU usage (%), and egress network bandwidth (Gb/s) over the 120 minutes after 04:30:00, for three cases: without the system, with a non-adaptive system, and with the adaptive system.]
Figure 3.11: Resource consumption comparison for the 2017-03-06 event: the top row shows resources when we do not deploy the automated system, the middle row shows when our system is not adaptive to changes during an attack, and the bottom row shows resources when we deploy the automated adaptive system. We omit the memory graph, as memory remains consistent over the experiment.
of the capture system. We therefore scale down the server capacity to match the stored traces. We measure
resource consumption under regular traffic, and trigger mitigation when a resource reaches double its
regular consumption.
After filtering, can we reduce the resource consumption?: We consider the resource consumption before and after deploying our system.
Figure 3.11 compares resource consumption for the 2017-03-06 event. CPU
usage drops from ∼62% (top row, second graph from left) to ∼48% (bottom row, second graph from left)
with our system. For egress network bandwidth, we reduce usage from ∼0.8 Gb/s (top
row, third graph from left) to ∼0.3 Gb/s (bottom row, third graph from left). We cannot reduce ingress
network bandwidth, since traffic must arrive before we can filter it. We find no effect on
memory during the attack or after deploying the system (so we omit it from Figure 3.11).
Responding to polymorphic attacks: Our system periodically evaluates the traffic to address polymorphic attacks that change attack methods during the event. We next look at the 2017-03-06 event to see
how our system copes with changing attacks.
The top-left graph of Figure 3.11 shows the polymorphic nature of the 2017-03-06 event: the attack starts
(first red line), pauses (green line), and then starts again with a new query name (last red line). From
the bottom-left graph of Figure 3.11, we can see that our system deploys the best filter quickly
(first blue line from the left), keeps it until a temporary stop in the attack at ∼89 minutes, reacts
accordingly by stopping filtering (middle blue line), and deploys the best filter again when a different attack
starts (last blue line from the left). This shows our system is adaptive to polymorphic attack
events.
3.6 Conclusion
This chapter provides the first in-depth design (Section 3.4.2) and evaluation of an automated, layered
approach to mitigate DDoS on the DNS root (Section 3.5). Evaluated on ten real-world DDoS attacks on B-Root, DDiDD quickly selects the best filter or filter combination from a library of filters, and deploys it
automatically. DDiDD reduces server load to acceptable levels within seconds, with collateral damage
under 2% (Section 3.5.4). DDiDD is adaptive to polymorphic attack events, which change attack patterns
during an ongoing attack, and nimbly makes a new filter selection in up to 4 seconds. It further has low
operational cost, working offline to process incoming traffic at the server and producing filtering rules
that can be implemented with no added processing delay using ipset.
We have now shown two defense systems against DDoS attacks. To support our thesis statement (Section 1.1), we showed that our systems only require measurements. Next, we describe a moving target defense
against brute-force password attacks that utilizes the IPv6 address space. Our moving target defense shows that
we can build systems against different attack types without changing existing protocols.
Chapter 4
A Moving Target Defense against Brute-Force Attacks Using IPv6
Our two defense systems against DDoS attacks utilize measurements that do not require any protocol
changes. Next, we show that our idea of utilizing existing protocols to build defense systems works for
other attack types.
To show our idea works for other attack types, we propose a moving target defense against brute-force
password attacks. Password attacks cause service disruption by blocking user access to a system. We utilize
the huge IPv6 address space to build the first moving target defense for SSH and HTTPS applications. This
solution shows how we can leverage existing address space for security design. The goal of this study is
to secure the users (Chapter 4) while the prior two studies secured the service from DDoS (Chapter 2 and
Chapter 3).
In this chapter, we propose a discovery-resistant moving target defense named “Chhoyhopper” that
utilizes the vast IPv6 address space to conceal publicly available services. Services on the public Internet are
frequently scanned, then subject to brute-force password attempts and Denial-of-Service (DoS) attacks. We
would like to run such services stealthily, where they are available to friends but hidden from adversaries.
The client meets the server at an IPv6 address that changes in a pattern based on a shared, pre-distributed
secret and the time of day. By hopping over a /64 prefix, services cannot be found by active scanners, and
passively observed information is useless after two minutes. We demonstrate our system with two
important applications, SSH and HTTPS, and make our system publicly available.
This work was published in the NDSS Workshop on Measurements, Attacks, and Defenses for the Web
(MADWeb), 2022 [191]. As an outcome of this work, we release Chhoyhopper as an open source tool [193].
4.1 Introduction
Attackers frequently scan for services on the public Internet, then make brute-force password attempts
and Denial-of-Service (DoS) attacks. IPv4 scanning has been possible for more than a decade [96] and
recent tools allow scanning all of IPv4 in minutes [4, 91]. Mass scanning of IPv4 is done regularly by many
parties [252]. Scanning is increasing in IPv6 as well, with evidence of IPv6 address space scanning [79] and
development of public lists of responsive IPv6 addresses [150]. Once scanning detects an active service,
attackers can carry out brute-force password attacks to get access [169, 29]. Services with a static address
can also be targeted by DoS attacks [69].
We would like to provide discovery-resistant stealthy services on the public Internet, available to
friends but hidden from adversaries.
IPv6 adoption has been increasing over the years [53]. As of December 2021, 37% of Google accesses
use IPv6 [90], and APNIC shows 29.4% of all global users are capable of using IPv6 [14]. From May 2018
to February 2020, Akamai reports a 4× increase in IPv6 traffic volume [161].
IPv6 provides a huge address space in which we can hide services. Even with clever scanning, when
each LAN has 2^64 addresses (or more), active discovery of services on intentionally obscure addresses is
intractable (see Section 4.7.1). With IPv6 prefixes of /48 as the recommended minimum size of a publicly
routable prefix [245], and /56 recommended for homes [156], even with a million devices in a home,
quintillions of addresses remain unused on every network.
Our insight is that only a discovery-resistant moving target can elude scanners. We describe Chhoyhopper∗
, using the vast IPv6 address space to conceal publicly available services. The server hops to different
IPv6 addresses in a pattern based on a shared, pre-distributed secret and the time-of-day. A client with
the shared secret can match this pattern to find the server. As with SSH [221], we target services for small
groups where out-of-band sharing of secrets (our hop key, or ssh’s per-user keys) is viable; our approach
can scale to support millions of such small groups. By hopping over a /64 prefix, a service cannot be
found by active scanners, and passively observed information is useless after two minutes. We expect our
system to be used by small organizations who want to protect their specific services used by their group
from active scanners and brute-force attacks. Since the server hops over addresses, our system provides
protection against DDoS attacks targeted to a fixed address.
We make three new contributions: first, we show that IPv6 address hopping can be used to protect
existing services (Section 4.5). Prior work suggested daily address changes for IoT devices with new services [115]. We instead propose changing addresses every minute, and show how to apply this approach
to existing popular services like SSH and HTTPS. We provide a common hopping design that can be used
by multiple services. To the best of our knowledge, this is the first design of a moving target defense for
SSH and HTTPS utilizing IPv6. Second, we show how to support web security with TLS by adding support
for DNS-based TLS certificates to our core hopping protocol (Section 4.5.6). Finally, we propose a new approach to accommodate long-lived connections in the face of frequent address changes (Section 4.5.4). We
use ip6tables rules to retain existing connections to a fixed internal address, while changing NAT rules
allow new connections only on the current IPv6 addresses. Our deployment is user friendly and works
similarly to current client applications (Section 4.6).
∗Chhoy is the number “six” in Bengali, since we hop in IPv6.
Availability: Our implementation is freely available at https://ant.isi.edu/software/chhoyhopper/.
We provide a server module as a Python script for both SSH and HTTPS. Our client implementation has
a Python script for SSH, and a browser extension for HTTPS.
4.2 Background
We next briefly review how IPv6 makes our solution possible. Full details about IPv6 are in its specification [97].
The defining characteristic of IPv6 is its much larger address space relative to IPv4, with 128 bits per
address instead of only 32. IPv6’s larger address space was chosen to address the expected exhaustion of
IPv4 addresses, realized in May 2014 [106]. Of the 128 bits, 64 are dedicated to LAN-specific information
to support automatic address assignment [98]. We exploit these plentiful LAN addresses in our hopping
mechanism.
Global IPv6 addresses contain a routing prefix (normally 48 bits or shorter), a subnet identifier (16 bits
more), and an interface identifier (64 bits). The interface identifier can be static or can be generated by
stateless autoconfiguration [155], or assigned using DHCPv6 [147]. In our work, the server uses a fixed /64
prefix (combining both the routing prefix and subnet identifier), generates the interface identifier dynamically,
and changes it every minute. A client needs to determine the current interface identifier to get the service.
4.3 Threat Model
Like the previous attack models, we have two entities in this threat model. The attackers use a brute-force
method to get user passwords. Using our defense, the service can protect its users from being compromised.
4.3.1 Attackers
Attackers make brute-force attempts to retrieve users' passwords. These attackers may use a dictionary to try
common passwords, or they may have enough computational power to generate all possible passwords
and then try them at a safe interval so that they remain stealthy. Attackers sometimes use
open-source tools to carry out these attacks [213].
4.3.2 Service
The service generally runs at a fixed IP address. An attacker targets this IP and tries different passwords
to steal a user's credentials. Most services restrict the number of password attempts in a given time. Also,
many secure services mandate strong passwords combining characters and numbers. However, brute-force
attempts can be stealthy when the attackers make password attempts at a fixed interval; such
stealthy attempts can go unnoticed by the service. Also, even nominally strong passwords may be common
or reused.
Despite all these efforts, brute-force password attacks are still common. We design a moving target
defense where attackers cannot make brute-force attacks. In this study, we assume that services
can run at an IPv6 address and that clients have IPv6 access.
4.4 Prior Related Studies
Our work is motivated by our desire to improve security using the unique properties of IPv6. As such, it
augments existing IPv6 security and privacy, and is related to other moving target defenses.
Several studies document the dramatic growth in IPv6 adoption and suggest that IPv6 is
no longer an “uninteresting rarity” [53, 50, 160, 90]. This widespread adoption of IPv6 implies that our use
of IPv6 is viable and timely.
Though IPsec in IPv6 provides data integrity and confidentiality, it can expose the link-layer address,
creating a new privacy risk [238]. To fix this, clients can choose random and ephemeral addresses using the IPv6 addressing privacy extension [154]. As an alternative, providers use prefix rotation,
changing the entire allocated prefix to improve address privacy [147, 201]. Our goal is the opposite:
we provide service on changing addresses, and clients need to find the current address.
We build on privacy-preserving IPv6 address assignment [88, 89], but while that work proposes updating addresses daily with a fixed pattern, we accelerate hopping to every minute to serve as an active defense
against scanning. Our work is similar to port knocking [123, 57], but it hides in IPv6 rather than requiring
“wake-up” packets. Closest to our work is IPv4-based port-hopping [128]; we take advantage of the much
larger IPv6 space (2^64) compared to the quite limited IPv4 port space (2^16). Work by Judmayer et al. uses a
similar technique for IoT devices, where they assume IoT devices use publicly routable IPv6 addresses [115].
Our solution does not interrupt running services, and is applicable to many other applications.
4.5 Chhoyhopper Design
Our goal is to enable discovery-resistant public services. To accomplish this goal, clients will rendezvous
with servers on a public but temporary IPv6 address. By allocating the temporary address from a large
space (2^64 addresses), scanning is impractical, as we show in Section 4.7.1. By changing the address frequently, reuse of a passively observed temporary address is only possible for a very brief window of time.
The hopping pattern is cryptographically secure, so prior active addresses reveal nothing about future
addresses.
4.5.1 Design Requirements
The Chhoyhopper design has a primary requirement of discovery resistance, and several secondary requirements:
Discovery resistance: Our primary goal is that services should be discovery resistant. An adversary
should not be able to contact the service and carry out a brute-force or DoS attack, even if they know the
network where the service is running.
Application support: Our moving target defense should support many existing, real-world applications. We describe a common hopping strategy as a simple core defense, and then apply this design to
applications such as SSH and HTTPS. Our design should be adaptable so that new applications can be added
easily.
Transparency: Our system should be compatible with the current application clients without protocol
changes. For example, an HTTPS client should be able to connect to a Chhoyhopper server using a web
browser. Similarly, a client should be able to use SSH from a command-line interface. Extensions to
support hopping should be possible without changes to core programs.
Support for collateral services: Often IP addresses are exposed in collateral services, and we must
ensure that a hopping IP address does not break other, related services. For example, HTTPS should
support TLS authentication, and clients should be able to identify services by domain names, but both TLS
and DNS must continue to function even if the underlying IP addresses hop. We integrate hopping in DNS
lookup (Section 4.5.2) and describe how TLS can support hopping addresses (Section 4.5.4).
Uninterrupted connection while hopping: Our system should be able to hop over addresses seamlessly without breaking an already established connection. Our system should not require a restart to
hop over to a different address so that no service interruption occurs. We meet this requirement by using
ip6tables NAT rules.
Finally, a non-requirement is direct support for millions of clients. We depend on a shared secret, but
securing a secret across a large group is challenging. One could support large groups by splitting them
into many small groups, each with a separate, revocable hopping secret.
[Figure: the client generates the rendezvous address from (1) the shared key, (2) a timestamp, and (3) a salt. Chhoyhopper runs at the external addresses (current: 2001:db8::5054:ff:fe80:634, prior: 2001:db8::5054:ff:fe80:123) and NATs traffic to the server, which runs at the internal address 2001:db8::5054:ff:fe80:1: (i) new flows to the current or prior address and (ii) existing connections are passed through, while (iii) traffic to any other addresses, including the internal address, is filtered.]
Figure 4.1: Client and server interaction in Chhoyhopper.
4.5.2 Design overview
A hopping IPv6 address must be understood at both the client and the server: the server will move service
to a new address frequently, and a client must be able to find that server’s current IPv6 address to start
a new session. In addition, existing, long-lived sessions must continue even when the server moves to a
new address.
Figure 4.1 shows the components of our system. The client and server must follow the same hopping
pattern to rendezvous. We assume they share a pre-distributed secret key. We expect secret distribution
to use common methods, such as out-of-band distribution of ssh keys today. Several methods are possible,
including face-to-face sharing, secure interactive communication such as secure instant messaging, or
other secure channels such as encrypted e-mail or an authenticated website. While we welcome new
approaches to secret distribution, they are out of scope for this paper. Our requirement for this secret means
Chhoyhopper cannot be used by anonymous clients to discover a server, since scanners could exploit any
discovery process. It also means Chhoyhopper does not apply to very large groups where secret sharing
becomes untenable. The design of the server ensures access only for legitimate clients who have
the correct secret key and salt value. In this way, our service becomes discovery-resistant, which is our
Figure 4.2: Getting the rendezvous address. The client looks up hop.example.com in DNS to get the fixed first 64 bits (2001:db8::f) and generates the last 64 bits using the key, salt, and timestamp, yielding 2001:db8::5054:ff:fe80:634.
Clients who lose the secret key no longer have access to the service and need to obtain the key again.
Next, we describe the selection and lifetime of the temporary address, hopping on the server, and
hopping by the client.
4.5.3 Address hopping pattern
The server and the client compute the same temporary address by computing a cryptographic hash of the
shared secret, a salt value, and the current time in minutes. We use the SHA-256 algorithm for hashing; the timestamp is the time in seconds since the Unix epoch, truncated to the minute. The salt value prevents rainbow-table attacks [163] and can vary by service or deployment.
We compute the IPv6 address in two parts. We take the DNS name of the service address and look up a
full IPv6 address, but replace the low 64 bits of the address with the top 64 bits of the hash result. Figure 4.2
shows how a client and a server converge to a single rendezvous address. Our system gets an example IPv6
address (2001:db8::f) from the domain name using DNS. Then it keeps the first 64 bits (marked by green
letters), and computes the last 64 bits (marked by red letters) using the SHA-256 algorithm (in our example the computed address is 2001:db8::5054:ff:fe80:634). The client and the server can only converge to a single address when they use the same secret key, salt value, and timestamp.
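To make this computation concrete, the following is a minimal sketch in Python; the exact hash-input encoding (field order and separator) is an assumption for illustration, not necessarily the encoding our released implementation uses.

import hashlib
import ipaddress
import time

def hopping_address(dns_ipv6: str, secret: str, salt: str, when=None) -> str:
    """Compute the current rendezvous address: keep the top 64 bits of the
    DNS-provided IPv6 address and replace the low 64 bits with the top 64
    bits of SHA-256(secret, salt, time-in-minutes)."""
    minutes = int((when if when is not None else time.time()) // 60)
    # Assumed input encoding: secret | salt | minutes, joined with '|'.
    digest = hashlib.sha256(f"{secret}|{salt}|{minutes}".encode()).digest()
    low64 = int.from_bytes(digest[:8], "big")           # top 64 bits of the hash
    base = int(ipaddress.IPv6Address(dns_ipv6))
    hopped = (base >> 64 << 64) | low64                 # keep /64 prefix, swap suffix
    return str(ipaddress.IPv6Address(hopped))

# Example: client and server derive the same address for the same minute.
print(hopping_address("2001:db8::f", secret="shared-key", salt="per-service-salt"))

Both sides run the same function; as long as their clocks agree to within the tolerance discussed below, they converge on the same address.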
Use of DNS allows the service to move in the Internet and provides a user-friendly name. DNSSEC should be used to ensure that the DNS lookup of the top IPv6 address bits is not subject to a person-in-the-middle attack. If clients prefer, our system can also take a direct IPv6 service address. We discuss the potential of collisions in Section 4.7.2.
The server tracks its current address, changing it every minute. To avoid problems with clock skew, the
server listens to two addresses, one for the current minute and the other for the nearest adjacent minute.
(Larger clock skew can be handled by increasing the duration addresses are kept active, if desired.) We use
NAT rules (in ip6tables) to track live connections as addresses change.
4.5.4 Server-side hopping and connection persistence
Hopping over addresses seamlessly without interrupting active connections is one of our service requirements. It is cumbersome for server software to change its service address every minute, we would rather not modify server software, and we cannot break active connections. We therefore operate the server on a fixed address that is firewalled from the public Internet. Traffic to the current hopping address must then be translated to the internal address so the server can respond to clients. A daemon uses network address translation to map the currently active addresses through the firewall to the internal fixed address. The ip6tables rules also ensure that once a connection is established it continues to operate, even after the server moves to other addresses for new connections.
The Chhoyhopper server restricts access to new clients that use the right IPv6 address, while continuing to serve existing clients that previously connected. To summarize server processing in Figure 4.1: (i) new flows to the current and prior address are detected by NAT rules, and establish new connection state before being passed to the internal server address, (ii) existing flows are detected by ip6tables rules and pass through to the internal address, (iii) any other addresses, including external traffic sent to the “internal” server address, are dropped by the server’s firewall. When external traffic is sent to the “internal” address, our NAT rule translates the “internal” address to a different, non-responsive address so that the server’s firewall drops that traffic. Thus our system can defend itself even if attackers know the internal service address.
Our NAT-manipulation daemon for the server is a simple Python program that modifies Linux ip6tables. The daemon assigns the NAT rules to a particular external interface on the server. Other OSes (like Windows or FreeBSD) would need to use their own native NAT mechanisms; that is potential future work.
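As a rough sketch of what such a daemon does each minute (the rule set, chain, and port below are illustrative assumptions, not our production rules), it installs a DNAT rule for the new hopping address and later removes the rule for the expired one:

import subprocess

IP6TABLES = "/sbin/ip6tables"

def run(*args):
    """Run one ip6tables command and fail loudly if it errors."""
    subprocess.run([IP6TABLES, *args], check=True)

def add_hop(current_addr: str, internal_addr: str, port: int = 22):
    """DNAT new connections arriving at the hopping address to the internal server.
    (The hopping address must also be configured on the external interface,
    e.g., with `ip -6 addr add`, which we omit here.)"""
    run("-t", "nat", "-A", "PREROUTING",
        "-d", current_addr, "-p", "tcp", "--dport", str(port),
        "-j", "DNAT", "--to-destination", internal_addr)

def drop_hop(old_addr: str, internal_addr: str, port: int = 22):
    """Remove the DNAT rule for an expired address; established connections
    continue because connection tracking keeps translating existing flows."""
    run("-t", "nat", "-D", "PREROUTING",
        "-d", old_addr, "-p", "tcp", "--dport", str(port),
        "-j", "DNAT", "--to-destination", internal_addr)

Because Linux connection tracking retains the mapping for flows that are already established, deleting the DNAT rule for an expired address stops only new connections, which is exactly the behavior our uninterrupted-connection requirement calls for.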
4.5.5 Client discovery of the hopping address in SSH
The client must compute and use the server’s current IPv6 address to begin a new connection. We assume
the server’s secret key and the salt are known to the client, so the client does the same hash computation
as the server. As with the server, the client looks up an IPv6 address from DNS and replaces the low 64
bits with the current temporary hash.
When a client is done with a connection, the server keeps the existing connection state and the current interface address. However, to start a new connection after terminating an old one, the client must repeat the same computation to get the current address.
To be transparent, our client implementation for SSH is a simple Python program that invokes the native SSH client with appropriate arguments. A client can run this program directly from the command line.
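A minimal sketch of such a wrapper follows; the function and argument names are illustrative rather than the exact interface of our released script, and the hash-input encoding is the same assumed encoding as in the earlier sketch.

import hashlib
import ipaddress
import socket
import subprocess
import sys
import time

def current_address(dns_ipv6: str, secret: str, salt: str) -> str:
    """Same derivation as on the server: the low 64 bits come from SHA-256."""
    minutes = int(time.time() // 60)
    digest = hashlib.sha256(f"{secret}|{salt}|{minutes}".encode()).digest()
    base = int(ipaddress.IPv6Address(dns_ipv6))
    return str(ipaddress.IPv6Address((base >> 64 << 64) | int.from_bytes(digest[:8], "big")))

def ssh_to_hopping_server(hostname: str, secret: str, salt: str, user: str):
    """Resolve the service's AAAA record, derive the current hopping address,
    and hand off to the unmodified OpenSSH client."""
    info = socket.getaddrinfo(hostname, 22, socket.AF_INET6, socket.SOCK_STREAM)
    target = current_address(info[0][4][0], secret, salt)
    sys.exit(subprocess.call(["ssh", f"{user}@{target}"]))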
4.5.6 Challenges with HTTPS
In addition to SSH, our system supports HTTPS. We believe our core hopping technique is generalizable
to many applications. Some applications require additional support—we extend the core design of Chhoyhopper to meet the requirements of HTTPS.
Figure 4.3: Server for HTTPS. The server updates ip6tables rules, NAT rules, and interface addresses; updates the DNS entry based on the current address; and holds a TLS certificate for a wildcard domain name.
The HTTPS deployment has two unique challenges. Our first challenge is transparency: a user should get the service like any other HTTPS service, using a standard web browser.
Our second challenge is TLS authentication, which HTTPS requires. Since our server hops every minute, it is not feasible to obtain an SSL certificate for each IPv6 address, and IP-based TLS does not support wildcard certificates, so we cannot generate a wildcard certificate for a /64 prefix. Traditional use of a static domain name is not possible either, because a static DNS name would reveal the hop destination.
We provide transparent access with a new browser extension that rewrites Chhoyhopper web requests to the current hopping address without the user noticing. We currently provide this extension for Mozilla Firefox. An extension for Google Chrome is technically feasible but requires DNS support (we currently use Firefox-specific DNS APIs).
We solve the certificate problem by obtaining a TLS certificate for a wildcard domain name, and then dynamically creating a changing hopping name under that wildcard. Next, we describe the changes to the server and the client for HTTPS.
4.5.7 Server-side certificate handling with hopping HTTPS
The core design for HTTPS is similar to what we described in Section 4.5.4. The HTTPS server also runs NAT rules to translate the currently allowed addresses to the internal server address, and ip6tables rules to filter out traffic that does not pass the NAT rules. Now we need
to extend the core idea to enable support for TLS authentication (support for collateral services in Section 4.5.1).
We enable TLS support by obtaining an SSL certificate for a wildcard domain name. The server then opens service at dynamic domain names under that wildcard. For example, the server needs an SSL certificate for “*.example.com” if the domain name is “example.com”.
The server uses the same hash algorithm, with the same secret key, salt value, and timestamp, to derive a domain name under the wildcard. We take 40 characters from the hash value to make the domain name label (a label can be up to 63 characters long [60]). The server puts the generated characters in the wildcard part of the domain name, and it generates a new domain name every minute.
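A minimal sketch of this name generation follows; as before, the exact hash-input encoding and the use of the hexadecimal digest are assumptions for illustration.

import hashlib
import time

def hopping_name(base_domain: str, secret: str, salt: str, when=None) -> str:
    """Derive the current hopping hostname under the wildcard: 40 hex
    characters of SHA-256(secret, salt, time-in-minutes) as the label."""
    minutes = int((when if when is not None else time.time()) // 60)
    digest = hashlib.sha256(f"{secret}|{salt}|{minutes}".encode()).hexdigest()
    label = digest[:40]          # 40 characters, within the 63-character label limit
    return f"{label}.{base_domain}"

print(hopping_name("example.com", "shared-key", "per-service-salt"))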
The Chhoyhopper server must update the DNS entry for the generated domain name periodically. Dynamic DNS maps the hopping name to the changing IPv6 address, updating the DNS entry at a fixed interval. Only clients with the secret key can guess the hopping URL. Since the server has already updated the DNS entry for the hopping URL, the clients get the right IPv6 address and pass the filters. The clients can also authenticate the response because of the wildcard certificate provided by the server. Besides adding a new DNS entry, the server also deletes old entries to limit the number of DNS entries. Since each subdomain uses a unique name, the system is not affected by DNS caching. While updating DNS every minute has some overhead, the cost is quite modest and is similar to the frequent DNS updates seen in CDNs.
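One way to perform such an update is sketched below, assuming a BIND-style authoritative server that accepts RFC 2136 dynamic updates via nsupdate; the server name, key file path, and TTL are placeholders, not values from our deployment.

import subprocess

def update_dns(new_name: str, old_name: str, addr: str,
               zone: str = "example.com", server: str = "ns1.example.com",
               keyfile: str = "/etc/chhoyhopper/tsig.key", ttl: int = 60):
    """Add an AAAA record for the new hopping name and delete the old one,
    using an RFC 2136 dynamic update signed with a TSIG key."""
    script = "\n".join([
        f"server {server}",
        f"zone {zone}",
        f"update delete {old_name} AAAA",
        f"update add {new_name} {ttl} AAAA {addr}",
        "send",
    ])
    subprocess.run(["nsupdate", "-k", keyfile], input=script.encode(), check=True)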
Figure 4.3 shows the server extension for Chhoyhopper in HTTPS. The green box shows the design
that is common for all applications. The two blue boxes show that HTTPS requires extensions for TLS
authentication and DNS updates.
4.5.8 Client discovery of the hopping address in HTTPS
As described above, the Chhoyhopper server opens its service at a dynamic domain name, so a client needs to generate that domain name to reach the intended web page.
Clients use the same technique to generate the domain name: they feed the same shared secret key, salt value, and timestamp to SHA-256 and use the resulting hash value to generate the domain name.
We want an automated way to generate the dynamic domain name and use it in the browser to reach web pages running Chhoyhopper, so we provide a browser extension that hops over dynamic domain names. Our browser extension is lightweight; it takes as input the Chhoyhopper base domain name, the shared secret key, and the salt value. When the user types a domain name that matches the configured Chhoyhopper base domain name, the extension generates the dynamic domain name and rewrites the request to it. For example, when the client types “example.com”, the browser extension redirects the request to “generated_hashed_chars.example.com”, as sketched below. The extension prevents recursive redirection by keeping track of recent translations.
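The rewrite decision reduces to a small amount of logic, sketched here in Python for clarity (the shipped extension is JavaScript, and the function and state names are illustrative):

import hashlib
import time

recent = set()   # hopping hostnames we generated ourselves

def hopping_label(secret: str, salt: str) -> str:
    """40 hex characters of SHA-256(secret, salt, current minute)."""
    minutes = int(time.time() // 60)
    return hashlib.sha256(f"{secret}|{salt}|{minutes}".encode()).hexdigest()[:40]

def maybe_rewrite(requested_host: str, base_domain: str, secret: str, salt: str):
    """Return the hopping hostname for a request to the Chhoyhopper base
    domain, or None if the request should pass through unchanged."""
    if requested_host in recent:
        return None            # we generated this name ourselves; avoid a redirect loop
    if requested_host != base_domain and not requested_host.endswith("." + base_domain):
        return None            # not a Chhoyhopper site
    target = f"{hopping_label(secret, salt)}.{base_domain}"
    recent.add(target)
    return target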
Since the server has already added a DNS entry, the browser extension redirects to the current domain
name, and DNS translation is done using the newly added entry.
A web page may have multiple links to other pages. Depending on the deployment, these links can be relative to the current page or complete links to other pages. In both cases, our browser extension looks for the base domain name and redirects if it finds a match. Thus links from terminated connections work as well, because the extension regenerates a new domain name. To prevent the URL from leaking through a referrer header [251], we recommend that servers set their “Referrer-Policy” to “no-referrer”.
4.6 Example Use
In this section, we discuss the implementation details of Chhoyhopper. Currently, we provide support for SSH and HTTPS, and we demonstrate our implementation with runtime screenshots.
4.6.1 SSH
We provide the SSH service without directly modifying the standard SSH client. At the same time, we want
to keep the Chhoyhopper SSH client simple so that the users can use it through a command line interface.
Thus we provide a script that takes input parameters for the Chhoyhopper domain name, secret key, and
salt value. Then the script computes the current IPv6 address, and provides the standard SSH client with
the computed IPv6 address to make the connection.
We also provide a script for the server that takes similar inputs for internal address, secret key, and
salt value. The server script then periodically assigns interface address, deploys ip6tables and NAT rules
for access control, and deletes obsolete addresses and rules.
Figure 4.4 demonstrates the implementation of Chhoyhopper for SSH.
To meet the discovery-resistant requirement (Section 4.5.1), at a fixed interval, the server opens its
service at a temporary IPv6 address, and drops the prior minute’s active address. The server log in Figure 4.4a shows that the server opens service at an address ending with 11ba (highlighted black). At the
same time, the server also drops the prior running address ending with 8f48. A client with the same secret
key, salt value and timestamp will recreate the same IPv6 address and successfully connect to the server.
Figure 4.4b shows a successful connection where the client uses the same secret key, generates the same
IPv6 address (see the highlighted address ending with 11ba), and connects to the server using a command
line interface (transparency requirement in Section 4.5.1). If a client uses a wrong key, the client cannot
make a successful connection. Figure 4.4c shows an unsuccessful connection attempt where the client
uses a wrong secret key named “random-secret” and makes a request to the address ending with 6d5e (not
the current address with 11ba).
Figure 4.4: Client-server interaction for SSH. (a) Server log for SSH. (b) Client connecting to server. (c) Unsuccessful connection with a wrong key. (d) NAT rules in server.
The server deploys a destination NAT rule to translate the current IPv6
address to the internal address, and another rule to keep the existing connections (uninterrupted service
requirement from Section 4.5.1). This translation is shown in the list of ip6tables NAT rules (highlighted in
Figure 4.4d). To test uninterrupted service, we establish a new SSH connection to a temporary IPv6 address and wait until that temporary address stops accepting new connections. We confirm that the old connection continues even when the original IPv6 address no longer accepts new connections.
4.6.2 HTTPS
To meet the application support requirement (Section 4.5.1), we show how we implement HTTPS, and how
it is different from SSH. For HTTPS, the clients need a browser extension, and the server needs additional
steps like getting a TLS certificate and updating DNS entries periodically. We also show an example use
case where a client connects to the server using a web browser.
We provide a browser extension that intercepts the Chhoyhopper domain name and redirects requests to the current domain name. Unlike SSH, clients provide the inputs for the domain name, salt, and key value using the extension's input page. The browser saves these inputs and uses them later to redirect the client.
A script on the server updates its IP address (as with SSH). It also updates dynamic DNS, adding a
unique name for each new IP address. Before running the script, the server also needs to generate an SSL
certificate for the wildcard domain name.
The client-server interaction for HTTPS is shown in Figure 4.5. Like SSH, the server utilizes similar
ip6tables NAT entries for access control. At the same time, the server adds a DNS entry for the generated
domain name. We can see the generated domain name with the wildcard part before the first dot along
with the corresponding IPv6 address in Figure 4.5a (highlighted in black).
A client needs to use a browser extension for getting the Chhoyhopper HTTPS service (transparency
requirement from Section 4.5.1). Figure 4.5b shows the input page for the clients. A client needs to provide
the Chhoyhopper base domain name, secret key, and salt value. The browser saves these options for future
use.
When a client types the Chhoyhopper base domain name, the browser checks the saved domain name,
and if the browser finds a match, it generates the current domain name using the shared secret key, salt
value, and timestamp. The browser then successfully redirects the request to the current domain name
(see the domain name in Figure 4.5c). Since the server already updates the DNS entry, the browser will get
the current IPv6 address after the DNS resolution. The padlock symbol in the address bar of Figure 4.5c
indicates that transport layer security is in place, meeting the collateral-service requirement (Section 4.5.1).
When a client uses a wrong secret key, the redirection does not work. The request is then redirected
to a different domain name which cannot get the current IPv6 address (Figure 4.5d).
4.7 Analysis
We analyze our system to find out the risk of discovery and collision. We show that the chance of getting
discovered or having a collision is vanishingly small, even if there are millions of servers under the same
IPv6 prefix.
4.7.1 Risk of Discovery
To estimate the difficulty of brute-force scanning, consider a scanner scanning at 100 Gb/s looking for a
server hopping in one /64 with 64B TCP SYNs.
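As a back-of-the-envelope check of these numbers (a sketch that ignores framing overhead and treats the scan as exhaustive over the /64):

# 100 Gb/s of 64-byte SYNs is about 2e8 probes per second.
probes_per_sec = 100e9 / (64 * 8)                      # ~1.95e8
seconds_to_cover_slash64 = 2**64 / probes_per_sec
years = seconds_to_cover_slash64 / (3600 * 24 * 365)
print(round(probes_per_sec), round(years))             # ~2e8 probes/s, ~3000 years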
At that rate (scanning $2 \times 10^{8}$ addresses per second) the expected time to discover one server is about 3,000 years, at which point the adversary will have at most two minutes to exploit it. Since the address
space is huge compared to the scanning rate, we are confident that brute-force scanning is impractical. Since the address is hopping randomly, intelligent scanning is not possible.
Figure 4.5: Client-server interaction for HTTPS. (a) Server log for HTTPS. (b) Client extension. (c) Redirection in client. (d) Unsuccessful connection using a wrong key.
An adversary that observes traffic will know prior hop addresses. If the hopping pattern were predictable, such knowledge could be used to discover future hopping addresses. Our assertion of hopping unpredictability rests on the cryptographic security of our hash function, SHA-256. As of 2022, SHA-256 is regarded as secure, but the algorithm may need to be replaced in the future.
4.7.2 Risk of Collisions
When multiple servers share the same /64 address prefix, it is possible that they could collide and hop to
the same address. A concerned operator should assign each server a unique /64 prefix (operators can get
a /48 prefix or so, and then assign a unique /64 prefix to each server). However, we suggest that the odds of collision are so low that collision avoidance is unnecessary.
Collisions of hopping addresses are equivalent to the well-known Birthday Problem, but rather than $n$ people in 365 days of the year, we have $k$ servers in $N = 2^{64}$ addresses. Using a simplified approximation, the probability of a hash collision in any given minute is $1 - e^{-k(k-1)/(2N)}$ [179]. Using this formula, the probability of a collision among $k = 1$ million servers in any given minute is only about 1 in 37 million. As we generate an address every minute, we can expect a collision among these million servers only about once every 70 years. This failure rate is considerably less than DRAM failures due to cosmic radiation [224].
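A quick numerical check of these figures, using the same approximation:

import math

k, N = 1_000_000, 2**64
p_per_minute = 1 - math.exp(-k * (k - 1) / (2 * N))    # birthday-bound approximation
minutes_between_collisions = 1 / p_per_minute
print(round(minutes_between_collisions / 1e6, 1),      # ~36.9 million minutes
      round(minutes_between_collisions / (60 * 24 * 365)))  # ~70 years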
4.7.3 Run-time Costs
Runtime overhead for Chhoyhopper is either incurred out-of-band from new connections or is very small.
The server selects new IP addresses every minute, but this cost is out-of-band of new connections (so
it does not affect clients), and small (a cryptographic hash and local ip6tables manipulation).
Clients starting a new connection must read the secret and carry out a cryptographic hash, but this
overhead is small relative to the already required Diffie-Hellman key exchange and SSH protocol negotiation.
Live connections require IP address translation from the hopped address to the internal address. This
cost is exactly one NAT mapping. Most cloud services already have at least two levels of address translation
(for example, see VL2 [92]), so the overhead of an additional mapping is quite modest.
4.8 Conclusion
In this chapter, we provide an implementation (Section 4.5) of a discovery-resistant moving target defense named “Chhoyhopper” that provides security by utilizing the huge IPv6 address space. To the best of our knowledge, this is the first deployment of a hopping defense with IPv6, applicable to both SSH and HTTPS.
Using our system, a service will hop over different IPv6 addresses, and a client needs to find the current
IPv6 address to connect. Our implementation is publicly available and we provide support for SSH and
HTTPS applications (Section 4.6).
Currently, we support the SSH client with a Python program and HTTPS with a Firefox extension. Potential future work is to integrate a Chhoyhopper client with OpenSSH, to provide HTTPS extension support for Chrome, and to port server support to non-Linux operating systems.
To support our thesis statement against different attack types, we have now described three defense systems against two classes of attacks. First, we showed two defense systems against DDoS attacks (Chapter 2 and Chapter 3), and in this chapter we described a defense against brute-force password attacks that utilizes IPv6 address patterns. Like the previous two studies, here we also utilize existing protocols: we show how we can leverage the existing IPv6 address space to design a new defense system. We built the previous two systems using measurements; here, we show how we can build systems with existing protocols without even requiring rigorous measurements. The attacks described so far target the service and its users. Next, we show how we can protect the path between the users and the service.
Chapter 5
Third-Party Assessment of Mobile Performance in 4G and 5G Networks
We have already used existing protocols to design three defense systems protecting against DDoS and brute-force password attacks. Next, we show how we protect the path from the User Equipment (UE) to destinations through a mobile network.
The security of mobile users is important since a significant portion of Internet traffic is initiated from mobile devices. In the era of 5G, much new hardware and software is deployed by different enterprises in different parts of the world. Enterprises deploy these heterogeneous infrastructures quickly to cope with business competition, and as a result, it is hard to standardize all security features consistently. We therefore anticipate that a malicious entity can exploit these potentially untrusted components to create malicious routing detours for eavesdropping and traffic injection. In this thesis, we design a system that can detect malicious routing detours without changing existing protocols. Our insight for detecting detours is that 5G latency is generally stable, and any detour will perturb that stable latency.
As a first step toward this goal, in this chapter we show the limits of latency, throughput, and stability of mobile users from a CDN perspective. We show that clients’ end-to-end latency to a CDN can be very low and stable. As of June 2024, we plan to submit this work to a conference for peer review.
5.1 Introduction
Mobile providers today offer increasingly high-speed Internet service [71, 70]. They aim to provide low
latency and high throughput to support multimedia streaming, Internet-of-Things (IoT) connectivity, and
vehicle-to-vehicle (V2V) communication. To fulfill these service requirements, they have added new technologies in radio spectrum (mmWave), edge computing, and network slicing. Today’s 5G theoretically provides up to 20 Gb/s of throughput [182, 10, 177] and targets end-to-end latencies as low as 2 ms [12, 116]. However, achieving the theoretical best in practice remains elusive.
While 5G allows new capabilities, how quickly do 5G operators deploy them, and how available are
they to users? Market pressures encourage rapid deployment of “5G”, but early hardware may not include
all features, and operators may delay feature availability while they gain confidence in their stability. New
features often must be explicitly enabled, and operators may delay feature roll-out pending integration
with new billing models or specific commercial opportunities.
After several years of global 5G deployment, our goal is to assess the actual performance of 5G networks, both to gauge their current status and to explore their potential.
Content Delivery Networks (CDNs) offer a unique opportunity to provide a third-party assessment
for 5G across multiple mobile operators. CDNs are responsible for delivering popular content to users
from their distributed infrastructure. A globally distributed CDN receives traffic from almost all the mobile carriers around the globe. As a result, CDNs can observe the performance of the mobile users as a
third-party observer, without requiring any direct measurement from the mobile users. Although direct
measurements of specific CDN devices are valuable, broad measurements of many 5G users from a CDN
can avoid potential bias that can arise from direct measurements of a few users.
In this chapter, we characterize mobile latency and throughput and make two contributions. Our first
contribution is to describe an approach to identify existing mobile user equipment (UE) traffic measurements in a globally distributed CDN. As CDN logs aggregate traffic from various devices, we rely on the
IPv6 address pattern to differentiate mobile User Equipment (UE) from other sources (Section 5.5.1). Our
second contribution is to characterize end-to-end mobile latency and throughput along with their stability (Section 5.6). We evaluate the limits of latency (Section 5.6.1), throughput (Section 5.6.2), and stability
(Section 5.6.4) that clients can achieve. Our goal is to examine the extent to which 5G approaches the
targets of throughput and latency, as well as how closely the current latency aligns with the anticipated
expectations of 6G.
Anonymization, Data, and Ethics: We do not reveal the carrier names when we report the results
from the CDN data, since our goal is to evaluate latency and throughput relevant to 5G targets, not between
carriers. Because CDN data is anonymized and reflects proprietary details, we regret that we cannot make
our data available. Our work poses no ethical concerns.
5.2 Related Work
Related studies explore measurements from mobile UE and CDNs, and CDN performance.
Measurement from UE: Previous studies showed performance measurements from real mobile devices to evaluate 5G latency and throughput [152, 153, 258, 83]. Several studies took the mobile device to
different locations in the US, and measured latency and throughput while they moved [83, 262]. Other
studies measured latency and throughput within a limited geo-coverage from the UE. Some of these studies also measured latency, throughput, and power efficiency with Stand-Alone (SA) and Non-Stand-Alone
(NSA) 5G networks [258]. By leveraging CDN logs instead, we achieve broader coverage across a greater
number of carriers and geolocations.
Measurement from CDN: Closer to our work, one prior study compared SA-5G and NSA-5G delay,
download speed, and energy consumption using a Chinese CDN with streaming capabilities [258]. We too
study using a CDN, but we study global mobile phones and use a CDN with a global footprint. While that
work did a good job of evaluating CDN performance in China, our work instead looks at the performance of
mobile providers internationally, considering operators in four different countries from three continents. In contrast to the previous study, we utilize a globally distributed CDN. Considering the CDN’s server deployment near Mobile Edge Computing (MEC) facilities [102], the latency analysis will inform us about the proximity of CDNs’ server placement to the UE.
Figure 5.1: 5G architecture (UE, base station, internal routers, edge router, and packet gateway in the 5G core network, connecting to the Internet and the destination).
CDN performance: Prior studies evaluated CDN performance to show that the general Internet users
of the CDN see a good selection of CDN sites and client proximity to the nearby CDN front-end servers [34,
2, 3]. One study confirmed most clients normally get their service from the nearby CDN site [34]. Another
study showed how these CDNs are connected with a different number of peers [250]. In this study, we
show mobile latency and throughput from a CDN perspective. We demonstrate what UE can expect when
they get their service from a global CDN.
5.3 Architectural Considerations
In this section, we describe the components of the mobile networks and their interaction with edge computing and CDN servers.
Mobile networks: In mobile networks, users’ traffic must first pass through the radio access network
(RAN) and the carrier’s IP backhaul network before reaching the Internet. A typical architecture of a mobile network is shown in Figure 5.1: components connect the user equipment (UE), base station, and edge router to reach the gateway, and then the IP network. The UE connects to a base station through a wireless Radio Access Network (RAN). The RAN communicates
over a radio channel, with 4G LTE using less than 6 GHz, but 5G having more than 30 GHz channels in
mmWave [77]. Most mobile carriers in the United States promise to provide 5G coverage in metropolitan
areas, but even within metropolitan areas 4G and 5G co-exist to ensure backward compatibility.
Edge computing is a new architectural feature in 5G networks, where services such as CDNs are placed
inside or immediately adjacent to the mobile operator’s network (Figure 5.1). 5G suggests that a widespread
deployment of mobile edge computing can reduce latency and increase throughput. We may observe
performance variation depending on the edge computing’s location.
Delivery Networks: To minimize client latency, Content Delivery Networks (CDNs) sometimes place
their servers geographically close to the users. Ideally, UE that request web or streaming services get their content expeditiously through match-making methods that pair them with nearby servers.
The path from the base station to the edge computing or packet gateway is known as the backhaul
network [176]. The base stations can have wired or wireless connections to the edge router and packet
gateway [261]. The backhaul network contributes to the observed latency from the user device.
The relative distance and interaction among the UE, backhaul network, IP network, and the location
of the destination CDN servers have an impact on the end-to-end latency. In this chapter, we investigate
end-to-end latency when mobile UE interacts with a CDN.
5.4 Data Sources And Measurements
We use two datasets to characterize end-to-end latency and latency inside mobile networks.
5.4.1 CDN HTTP Statistics
We observe 5G performance from a global, commercial CDN. This CDN provides both web and streaming
data and hosts DNS. Our goal is to determine latency and throughput distribution from 5G devices. We
observe both client- and server-side data between the CDN and 5G devices.
Carrier | Country | Observable from server? | Carrier label (bits) | Geo label (bits) | WiFi or mobile label (bits)
1 | U.S. | Yes | 1 to 32 | 33 to 40 | 41 to 56
2 | U.S. | Yes | 1 to 24 | 25 to 32 | 33 to 36
3 | U.S. | No for HTTP traffic | 1 to 32 | NATed | NA
4 | Germany | Yes | 1 to 32 | 33 to 40 | 41 to 64
5 | Germany | No | 1 to 32 | NATed | NA
6 | Spain | Yes | 1 to 32 | Not found | 33 to 56
7 | India | Yes | 1 to 32 | 33 to 48 | 33 to 40
Table 5.1: IPv6 address patterns observed at the server for different carriers
Country | Carrier | # of client /48s | # of clients | # of serving /24s | # of CDN host addresses | Duration
U.S. | Carrier 1 | 1,830 | 1,412,325 | 769 | 27,018 | 8.4 h
U.S. | Carrier 2 | 1,327 | 1,416,445 | 639 | 21,584 | 8.4 h
Germany | Carrier 4 | 409 | 2,540,339 | 419 | 8,579 | 24 h
Spain | Carrier 6 | 246 | 620,969 | 211 | 2,840 | 24 h
India | Carrier 7 | 7,709 | 4,901,684 | 574 | 9,139 | 9 h
Table 5.2: CDN dataset in numbers
5.4.1.1 CDN Logs from Server Side
From the server side, we analyze server logs of sampled HTTP(S) sessions. The CDN receives millions of HTTP
GET requests every second. The CDN collects a 1% sample on a specific day, but this sampled dataset is
large enough with about a billion samples per day.
The CDN samples sessions at the servers. For each sample we identify the client’s IP prefix, BGP origin
AS number, and the server’s IP address. From the client’s origin AS number and IP prefix we identify its
provider. We identify server physical locations from its IP address and CDN internal records. For each
TCP connection, the log reports the number of packets, information about the round trip time (RTT),
bandwidth, connection protocols, and congestion.
We analyze data from multiple countries to understand global trends. Table 5.1 shows the carriers
from different countries. We identify the use of Network Address [Port] Translation (NAT) and non-NAT
for IPv4 and IPv6 addresses from address assignment patterns, as verified with data from devices inside
the carriers. We choose five carriers from four different countries, each with non-NATed IPv6 addresses,
to examine (Table 5.2). In the logs for these five carriers, the clients’ full IPv6 addresses are visible since
they are non-NAT addresses. From Table 5.2, we observe between about 0.6 M and 4.9 M unique IPv6 UE per carrier.
Each carrier uses 246 to 7,709 /48 IPv6 prefixes (shown by the # of client /48s column in Table 5.2). The
number of CDN host addresses with which these clients interact varies, of course, as shown in Table 5.2.
For instance, Carrier 1 was served from 27,018 unique IPv4 server addresses.
Different RTTs collected by the CDN: From the CDN data, we examine RTTs collected by CDN in
two different ways. First, we get RTTs passively from TCP handshakes where the server kernel reports
the RTT from SYN-ACK and ACK packets. Second, TCP reports mid-flow RTTs when an ACK arrives and
is not discarded. These RTTs generate the statistics from multiple TCP ACKs received by the servers. TCP
handshakes provide a single data point but TCP data-ACK RTTs provide multiple observations during a
session to measure the minimum, maximum, and mean RTT along with the variance within that session.
5.4.1.2 CDN Logs from Client Side
To complement server-side logs and to show the difference between 4G and 5G observed latency and
throughput, we use real-time logs measured from UE to different CDN-hosted services. CDNs collect
these performance logs from user devices to evaluate the network condition and to find out the places
for improvement. We use the detailed device and connection information, along with the latency data
from user devices collected by the CDN’s real-time user monitoring system. This dataset reports access
network information which helps us to distinguish UE using WiFi from those using mobile data networks
(Section 5.5.1.3).
5.4.2 UE-based Measurement
To complement CDN client logs (Section 5.4.1.2) and to analyze the stability for a longer duration, we
measure latency from real UE. While the CDN collects client logs data from real UE, they do not contain
continuous measurements.
These measurements from UE include latency for transactions with multiple targets over various timeframes. We use a Samsung Galaxy A52 device with 5G capabilities, on the AT&T carrier, to evaluate latency stability. Unlike the CDN-collected data from UE, with our own UE we can collect data for a longer duration under our own control.
5.5 Methodology: Identifying Mobile Devices and Stability Analysis
Before we use CDN logs (Section 5.4.1.1) to characterize end-to-end latency, throughput, and stability, we
must understand what the CDN is observing. A CDN receives traffic from many clients, so our first goal is
to identify mobile UE in the data. Traffic source IP addresses may identify the originating AS as a mobile
operator, but mobile operators may support a mix of clients using mobile data, WiFi, and even wired
networks. We next describe how we use IPv6 address patterns to distinguish access networks (Section 5.5.1),
and how to differentiate 4G and 5G (Section 5.5.2). Finally, we describe how we examine the stability of
latency (Section 5.5.3).
5.5.1 Identifying Mobile UE from IPv6 Addresses
We use patterns in IPv6 addresses to identify a UE’s access method (mobile or WiFi) and its geographic
location.
Mobile providers use both IPv4 and IPv6 address space for their clients. Clients who use IPv4 addresses
normally use carrier-grade NAT, often mixing clients using many different access network technologies
into the same IPv4 prefix used by the NAT. However, we find that IPv6 addresses are usually unique
for each specific UE. We therefore use IPv6 addresses so that we can readily distinguish and characterize
individual clients’ traffic.
Even for IPv6, the CDN sees a mix of NATed and non-NATed addresses. Some carriers use NAT even
for IPv6, in which case we see only translated addresses at CDN servers. Unfortunately, carriers using
IPv6 NAT seem to do NAT at the edge of their network, hiding internal structure of the internal UE IPv6
address that may offer a clue of the access technology. Also, the latency to these NATed addresses may
not represent the actual end-to-end latency to the client if the NAT is also doing split-connection TCP or
using a web proxy. By contrast, non-NATed addresses show end-to-end latency in CDN server logs and
imply that no web proxy is being used.
We show below how we discriminate NAT from non-NAT client IPv6 addresses. A typical NATed
address heavily aggregates traffic behind an address since many clients behind the NAT use the same
address. However, with a non-NATed address, the traffic is significantly lower than the traffic from a
NATed address. We can also observe more individual client IPs when there is no NAT. Also, split-TCP
connections result in unrealistic and consistent end-to-end latency in the CDN logs, as they originate
from the same NAT location. For our analysis, we choose carriers where the query frequency from each
source IP is notably lower compared to the query frequency observed with carriers using NATed addresses.
Next, we show how we use IPv6 address patterns to identify carriers, geolocations, and access networks.
5.5.1.1 Carrier Labels
Mobile operators assign UE to fixed subsets of their IPv6 address space. A prior study also discovered IPv6
address patterns to identify client addresses and packet gateways [262]. In this chapter, we add to this
work by identifying patterns in three non-US carriers. We also demonstrate how address patterns provide
insights into geolocations, WiFi networks, and mobile carrier identities.
From Table 5.1, we can see that among three popular US carriers, we find two where the IPv6 addresses of the UE interfaces are directly observable from the CDN servers. The other carrier uses NATed IPv6 addresses for its HTTP(S) traffic. Among the two carriers from Germany, we observe one with NATed IPv6 addresses in the server logs. For each address, the /24 or /32 IPv6 address prefix identifies the carrier. Each carrier has different labels (subsets of contiguous bits in the client address), and these labels can identify the carrier.
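As an illustration of how such bit labels can be read out of an address (a sketch: the bit ranges below are the ones reported for Carrier 1 in Table 5.1, the example address is made up, and the meaning of each label value is carrier-internal):

import ipaddress

def bits(addr: str, start: int, end: int) -> int:
    """Return bits [start, end] of an IPv6 address as an integer label,
    counting from 1 at the most-significant bit."""
    value = int(ipaddress.IPv6Address(addr))
    width = end - start + 1
    return (value >> (128 - end)) & ((1 << width) - 1)

ue = "2001:db8:1234:5678:9abc:def0:1234:5678"   # illustrative client address
carrier_label = bits(ue, 1, 32)    # bits 1-32: carrier (Carrier 1 in Table 5.1)
geo_label     = bits(ue, 33, 40)   # bits 33-40: geolocation
access_label  = bits(ue, 41, 56)   # bits 41-56: WiFi vs. mobile access
print(hex(carrier_label), hex(geo_label), hex(access_label))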
5.5.1.2 Geolocation Labels
We have identified patterns of geolocation in all four carriers that do not use IPv6 NAT, including two
non-U.S. carriers.
We confirm geolocations are consistently used by comparing our knowledge of carrier geolocation
prefixes with CDN server location, and from ground-truth locations of specific UE from client-side data
(Section 5.4.1.2). We see the two non-NATing U.S. carriers use the middle of their IPv6 addresses for
consistent geographic regions. For example, Carrier-1 uses 8 bits for the geolocation, and we consistently
see UE in California with one label and those in New York with a different label.
5.5.1.3 Access Network Technology
We also find that non-NATing carriers use a label to identify the access network type as mobile or WiFi. For example, we find a carrier that uses a fixed 4-bit label to distinguish mobile from WiFi access networks (Carrier-2 in Table 5.1). However, we did not find such characteristics for carriers with NATed addresses. With non-NATed carriers, a fixed label is used for either the mobile or the WiFi access network, but not for both. We validate this finding based on measurements from real user devices. To make sure that we identify the mobile access network correctly, we only label an IP as a mobile-device IP when we have ground truth about it, measured from real user devices. To get the ground truth, we utilize the CDN’s
measurement from real user devices with device information mentioning mobile or WiFi networks as the current access network (Section 5.4.1.2).
Device / Coverage | 4G area | 5G area
4G device | 4G | 4G
5G device | 4G | 5G
Table 5.3: Observing 4G and 5G network with respect to device type and network coverage
From Table 5.1, we can see different patterns for each carrier; Table 5.2 shows the non-NATed carriers and the total number of UE that we identify for these carriers from the CDN data.
5.5.1.4 Apparent HTTP(S) Proxying
We found one carrier that apparently proxies HTTP(S) traffic. While non-HTTP traffic appears to come
from end-device IPv6 addresses, HTTP(S) traffic comes from different, NATed addresses, and is identifiable
by a fixed address pattern.
To confirm this implied proxying is only for HTTP(S), we started HTTP service on three different ports:
80, 443, and 8500. We found that when the service is open at port 80 or port 443, Carrier-3 of Table 5.1
uses a NAT, hides the real IPv6 address, and sends the requests from the NATed IP address. From the server, we can only see the addresses after NAT translation for HTTP(S) traffic. However, when the requests go
to a different port (like 8500), we observe the unique IPv6 client addresses.
5.5.2 Distinguishing 4G and 5G
We show how address patterns can tell us about the carrier names, geolocations, and sometimes network
type—mobile or WiFi. But we did not find any evidence in the address pattern that can tell us whether
the address is from 4G or 5G networks. Often 4G and 5G co-exist in the same physical locations, and are
supported by the same UE. Devices can move from one to another without changing the address pattern.
Also, some configurations of 5G networks use a software stack composed largely of legacy 4G protocols
(non-stand-alone mode), since stand-alone 5G has not yet been widely deployed [132].
The difference in architecture and co-existence of 4G and 5G raises the question, “can we distinguish
4G and 5G operation?” We suggest that performance can identify 5G use. We show possible combinations
in Table 5.3. Our first expectation is that a 4G-only device can only experience 4G, irrespective of the
network coverage in that location. Our second expectation is that a user with a 5G device may or may not
experience 5G capabilities depending on the 5G coverage within an area.
Based on this expectation, we use observed latency to distinguish between 4G and 5G devices. While
latency may vary, if we look at the minimum latency for each device, we hypothesize that the minimum for
a 4G-only device will be higher than the minimum for a 5G-enabled device. We validate that this method
works using 8 device models with known capabilities (some 4G-only and some 5G-capable) in Section 5.6.3.
5.5.3 Measuring Latency Stability
Latency stability means that observed latency is consistent, i.e., it determines how much jitter occurs. Stable latency benefits transport protocols’ performance, since it often corresponds with receiving packets in order (helping data buffering) and with fewer unnecessary retransmissions (from incorrectly inferred packet loss), resulting in better user experiences.
Unfortunately, evaluating stability of latency from CDN logs is challenging for two reasons. First, client IPv6 addresses are not necessarily stable over time or across subsequent sessions, due to dynamic address assignment practices [155]. Second, the sampled CDN logs may miss subsequent new flows from the same IPv6 address.
To overcome these two challenges, we utilize long-lasting TCP connections and measurements from
real devices to evaluate stability. While CDN logs are sampled, we can observe multiple entries in the CDN
logs for the same connection when there is a long-lasting TCP connection. We consider a TCP connection
long-lasting when the connection exists for more than 30 minutes. Additionally, we observe stability in the
minimum latency. Since the routing path from the source to the destination should usually be stable, the
minimum latency is expected to be “often observed”. We use the term “stability” to show how frequently
the minimum value appears in the observed latency.
For the CDN logs, we divide the whole duration of each TCP connection into multiple time windows. We pick the long-lasting TCP connections (those over 30 minutes), divide their duration into windows of length W, find the minimum latency in each time window, and calculate the stability across these per-window minima (more on latency parameters in Section 5.6). We use 10 minutes for W when we measure stability from CDN logs. To complement the CDN log-based latency assessment, we also measure stability for a 5G device across three weeks.
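The per-connection computation reduces to a few lines; this is a sketch with illustrative names, assuming each connection is a list of (timestamp-in-seconds, RTT-in-ms) samples:

import statistics
from collections import defaultdict

def stability(samples, window_s=600):
    """Group the RTT samples of one long-lived connection into fixed windows
    (default 10 minutes), take the minimum RTT per window, and report the
    standard deviation across those per-window minima."""
    per_window = defaultdict(list)
    for ts, rtt_ms in samples:
        per_window[int(ts // window_s)].append(rtt_ms)
    minima = [min(rtts) for rtts in per_window.values()]
    return statistics.pstdev(minima) if len(minima) > 1 else 0.0

# Example: three 10-minute windows with minima 12, 13, and 12 ms -> small deviation.
print(stability([(0, 15), (30, 12), (610, 13), (1300, 12), (1350, 20)]))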
5.6 End-to-End Results: Latency, Throughput, and Stability
In this section, we characterize end-to-end latency, throughput, and stability. First, we examine the end-to-end latency measured from a CDN. Our key question is: does 5G meet its target of achieving ultra-low latency and high throughput? We find that the achieved end-to-end latency can be as low as 6 ms—not at the target of 2 ms [116], but close. Mobile clients are also able to achieve throughput exceeding 100 Mb/s, representing a notable advancement towards delivering high-throughput mobile services.
5.6.1 How Low is the Latency?
We first examine latency to report the best performance we see today. We will report two kinds of latency:
handshake latency (from the connection setup’s initial SYN / SYN-ACK / ACK exchange), and then data-ACK latency extracted during data exchange for TCP. Each connection provides one estimate of handshake
latency and many of data-ACK latency, so we report CDFs over all connections by a carrier for handshake
latency, and minimum and mean data-ACK latency. (We recognize that median is more robust than mean given outliers, but the logs contain only the mean within the data-ACKs.)
Figure 5.2: Difference between TCP handshake RTT and minimum RTT from data-ACK (CDF of all flows of US Carrier-1; positive differences mean the TCP handshake RTT is higher).
Carrier | Country | Long-lived TCP conns | Min (TCP handshake) | Top 5% (TCP handshake) | Min (data-ACK) | Top 5% (data-ACK)
Carrier 1 | USA | 4,876 | 9 ms | 25 ms | 6 ms | 17 ms
Carrier 2 | USA | 3,561 | 8 ms | 34 ms | 6 ms | 22 ms
Carrier 4 | Germany | 160 | 12 ms | 27 ms | 8 ms | 15 ms
Carrier 6 | Spain | 761 | 9 ms | 28 ms | 7 ms | 20 ms
Carrier 7 | India | 42,516 | 8 ms | 30 ms | 6 ms | 20 ms
Table 5.4: Latency (ms) of the top clients in different countries
Figure 5.3: CDF of RTT (ms) in different countries: (a) RTT from TCP handshakes; (b) minimum RTT from ACKs; (c) mean RTT from ACKs.
Comparing metrics: We start with handshake latency, since it is the easiest and most commonly used latency measurement method. We find the minimum handshake latency is low. Figure 5.3a shows the CDF of handshake latency for carriers in different countries. We find handshake latency can be as low as 9 ms and the 5th-percentile latency is between 25 ms and 34 ms (Table 5.4). So, clients that are close to the CDN server with good 5G coverage can expect to observe handshake latency of less than 30 ms. On the other hand, over 50% of the clients observe more than 40 ms of TCP handshake latency.
While TCP handshakes are commonly used, data-ACK latency measurement is more robust because
it considers multiple observations over the connection lifetime, rather than a single observation at the
connection start.
We show the difference between the RTTs measured from TCP handshakes and data-ACK packets in
Figure 5.2. We measure the difference between TCP handshake RTT and data-ACK minimum RTT for
the same flow. On the green side to the right, the TCP handshake RTT exceeds the minimum RTT for
data-ACK packets. Conversely, on the no-color side to the left, the minimum RTT from data-ACK packets
surpasses the TCP handshake RTT. Overall, in fewer than 5% of flows, we note that the TCP handshake
RTT is lower than the minimum RTT from data-ACK packets. In 50% of the flows, the minimum RTT from data-ACK packets and the TCP handshake RTT differ by less than 10 ms.
With data-ACK latency, we observe 6 ms as the minimum latency, and the 5th-percentile latency is between 15 ms and 22 ms (Figure 5.3b and Table 5.4), compared with a minimum TCP handshake latency of 8 ms and a 5th-percentile handshake latency between 25 ms and 34 ms.
Finally, we also examine the CDF of mean data-ACK latency. Because this mean reflects all observations
over the flow lifetime, it captures variation in latency, and when mean is much larger than 5th percentile,
it suggests high variance in latency (Figure 5.3c).
Figure 5.4: CDF of throughput (Mbps) over full IPv6 addresses, per carrier.
Variation by country: In all the countries, the median latency is around 50 ms. This 50 ms median is sufficient for most web applications, but only the top 5% to 10% of clients would have a good experience for latency-sensitive applications.
Among all the countries, the German mobile carrier shows a narrower distribution; more clients observe similar latency. We find 50% of the German clients observe a minimum latency of 25 ms or less (Figure 5.3b). The tail latency for the Indian and U.S. carriers is long: around 15% of the U.S. and Indian UE observe more than 100 ms of minimum latency. The large geographic area of these countries means that propagation delay can be large for some UE. On the other hand, Germany and Spain have a lower tail latency: less than 5% of the UE observe more than 100 ms of latency. The Indian mobile carrier has the highest jitter (over 70% of clients show more than 50 ms of mean latency in Figure 5.3c). However, in all cases, the similar minimum and 5th-percentile latencies reflect global 5G deployment and CDN proximity to mobile users.
5.6.2 How Good is Throughput?
We measure the throughput from the transferred bytes and transfer duration, assuming uniform transfer speeds. Although transfer speed may vary over a connection, this method estimates actual observed
134
throughput. We only consider the TCP sessions where more than 1 MB of data is transferred, since shorter
transfers may under-utilize channel capacity due to startup overhead (TCP slow start). We conclude that
the observed 5G throughput of 100 Mb/s is still far under the advertised peak of 20 Gb/s [182, 177, 10].
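The per-session estimate is straightforward; a sketch follows (field names are illustrative, assuming each log record carries the bytes transferred and the transfer duration):

MIN_BYTES = 1_000_000   # ignore transfers under 1 MB (slow-start dominated)

def throughput_mbps(bytes_sent: int, duration_s: float):
    """Effective throughput of one session, assuming a uniform transfer rate;
    returns None for short transfers that would understate channel capacity."""
    if bytes_sent < MIN_BYTES or duration_s <= 0:
        return None
    return bytes_sent * 8 / duration_s / 1e6

# Example: 5 MB delivered in 0.4 s is 100 Mb/s of effective throughput.
print(throughput_mbps(5_000_000, 0.4))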
We compute the maximum throughput over all flows for each UE, and show the CDF of this value over
all UE. We show the throughput distribution in Figure 5.4. The median UE from all carriers sees 40 Mb/s of effective throughput or less (Figure 5.4). In the best case of U.S. Carrier 1, only 40% of users get more than
50 Mb/s throughput. However, some UE see much better performance: the fastest 10% see 100 Mb/s or
better.
There are several possible reasons UE throughput may not exceed 100 Mb/s: insufficient data may not
allow the window to open fully, either because of small application buffers or slow data generation rates by
the application, or it may represent a bottleneck in either the radio-access or mobile operator’s backhaul
network.
The Indian carrier shows a lower throughput distribution than the other countries (the purple line in Figure 5.4): only 25% of its clients observe more than 25 Mb/s of throughput. We suspect 5G deployment is still not very mature in India, or perhaps the CDN delivers the content from a distant server. Since the CDN deploys over 9,000 well-distributed server machines in India, we anticipate that the limited maturity of 5G deployment is the more likely contributor to low throughput.
To check the impacts of device types over throughput, we measure from devices we control, configured
to use 4G or 5G only with the same mobile provider from the same location in Los Angeles County. We
select an iPhone 7 as a 4G device and iPhone 13 Pro as a 5G device, and we put the server within 30 miles
from the source. We find up to 30 Mb/s throughput with iPhone 7, and up to 65 Mb/s throughput with
iPhone 13 Pro. This result shows throughput may vary depending on the device type. In this controlled
experiment, we vary the content size, and the server provides enough content so that we can reach the
maximum throughput. Our observed throughput of 65 Mb/s is within the top 35% throughput that we could
observed in Figure 5.4. Getting a throughput within the top 35% of the observed throughput is expected with a 5G-enabled device and within a metropolitan area like Los Angeles.
Figure 5.5: Latency observed from 4G and 5G devices (CDF of IPv6 /64 prefixes by latency, for 4G and 5G on two days).
5.6.3 Can We Distinguish 4G and 5G?
Do we observe different latency patterns for 4G and 5G devices? While IPv6 addresses seem to distinguish
WiFi from cellular access networks, we do not know how to use them to identify 4G vs. 5G. Here we use data collected by a measurement system running on user devices, which reports device information, access network type, and latency data (Section 5.5.2) back to the CDN.
We evaluate the latency from a set of 4G and 5G devices located across the U.S. We select two different sets of user devices—the first is only 4G-enabled, and the second consists of 5G devices (however, existing 5G devices may operate in 4G mode when required). As 4G devices, we choose Samsung Galaxy S8,
Samsung Galaxy S9, Samsung Galaxy S10, and Samsung Galaxy A12. As 5G devices, we choose Samsung
Galaxy S21 5G, Samsung Galaxy A32 5G, Samsung Galaxy A13 5G, and Samsung Galaxy Note20 5G. To
verify consistency, we evaluate latency for two days. We expect the same set of CDN servers since we
chose the same web target to compare latency from 4G and 5G devices. We exclude the cases when the
browser gets the web pages from the cache.
We show that latency can distinguish 4G and 5G networks, based on latency distributions that we
report in Figure 5.5. The 4G and 5G devices show a different latency distribution. We observe the latency
data collected from user devices to a commercial website hosted by the CDN. Within a day, we find around
150 unique IPv6 /64 prefixes that requested the commercial website. We observe multiple requests from a
single IPv6 /64 prefix. Multiple requests give us around 400 data points to the target website for a carrier
on a particular day.
We find that 20% of the requests from 5G devices observe less than 25 ms of latency. On the other hand, no 4G device observes less than 25 ms of latency during the two days of our measurements. 5G devices have a wider latency range since they may experience both 4G and 5G capabilities. The tail is similar for both 4G and 5G: around 30% of requests experience more than 50 ms of latency for both.
Our results show a distinction in the latency distribution between 4G and 5G devices. To confirm that this difference is caused by the cellular network technology and not the mobile UE hardware, we examined controlled experiments with two devices. We selected two Samsung Galaxy models with similar hardware
specifications—Samsung Galaxy A12 and Samsung Galaxy A32 5G. They both have similar numbers of
cores (8 cores each) and CPU clock speeds (2.3 GHz for Samsung Galaxy A12 and 2.0 GHz for Samsung
Galaxy A32 5G).
Comparing the latency distribution between these two models, we find that Samsung Galaxy A32 5G
devices show better latency compared to the Samsung Galaxy A12. We find Samsung Galaxy A32 5G
devices show 15 ms of minimum latency, while Samsung Galaxy A12 devices show 28 ms of minimum
latency. In terms of median latency, Samsung Galaxy A32 5G devices show 48 ms of median latency and
Samsung Galaxy A12 devices show 71 ms of median latency. Since the main difference is phone cellular
technology (4G vs. 5G) and not CPU or memory, this comparison suggests that cellular network technology
can cause latency variation.
While 4G and 5G distributions are different, these distributions overlap, and we do not identify a
specific threshold. If we see latencies below 20 ms, the device is likely a 5G device in a 5G-enabled area; however, latencies above 50 ms are common to both 4G and 5G.
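To make this observation concrete, the following Python sketch (our own illustration, not a definitive classifier; the function name and the 20 ms and 50 ms cut-offs simply restate the thresholds discussed above) shows how one might label a device's likely access technology from its observed RTTs:

```python
def guess_access_technology(rtts_ms):
    """Label a device's likely access technology from its observed RTT samples (ms)."""
    min_rtt = min(rtts_ms)
    if min_rtt < 20:
        return "likely 5G (in a 5G-enabled area)"
    if min_rtt > 50:
        return "indeterminate: latencies above 50 ms are common to 4G and 5G"
    return "indeterminate: 4G or 5G"
```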
5.6.4 How Stable is Latency?
Finally, we evaluate the stability of latency measured at a CDN. As outlined in Section 5.5.3, we use long-lasting TCP connections and direct measurements from UE to analyze the stability.
We expect IPv6 addresses to be ephemeral, because privacy preserving addresses change frequently,
often daily [155]. We confirm this result when we look at data where we have UE IPv6 addresses, and
we see that only 748 of 497,191 IPv6 addresses (only 0.15%) retain the same IPv6 address after 24 hours,
although a prior study categorized IPv6 address assignment as stable [170]. This almost complete lack of
address persistence suggests that UEs frequently change their IP address assignment. Investigation of
latency over time is hampered by dynamic IP addresses because each address is associated with a client
UE only briefly. Since IP addresses are known to change, we examine the stability of latency in long-lived
TCP connections, since the same TCP connection must go to the same device endpoint.
Next, we show the stability of minimum latency at different time windows.
5.6.4.1 Evaluating Latency Stability
By definition, an IP address must remain fixed for a long-lived TCP connection. We see a few TCP connections that last 30 minutes or more.
Minimum latency remains stable at different time windows for long-lasting TCP connections. We
observe 160 and 761 long-lasting TCP connections for the German and Spanish carriers, respectively;
Figure 5.6: Standard deviation among the minimum values (CDF of full IPv6 addresses vs. standard deviation, for US Carrier-1, US Carrier-2, Germany Carrier-4, Spain Carrier-5, and India Carrier-6)
4,876 and 3,561 for the two U.S. carriers; and 42,516 for the Indian carrier (Table 5.4). These long-lasting
connections are over 30 minutes long. Figure 5.6 shows the standard deviation of minimum latencies in
each 10-minute window collected from these TCP connections by the CDN. 40% of these connections show
less than 5 ms of standard deviation. So, long-lasting TCP connections show a stable minimum latency for
time windows of around 30 minutes. The global standard deviation in the minimum latency measured
from the data-ACK packets is low. In 60% of the long-lasting TCP connections, we observe less than 10 ms
of standard deviation (Figure 5.7). This is true for all the countries while the German carrier shows the
highest stability. We find that 80% of the German flows observe less than 10 ms of standard deviation.
Stability in minimum latency reflects a stable end-to-end distance. Latency can become unstable for many reasons, such as device movement, congestion, or poor network coverage. Despite all these possible causes, we observe stable end-to-end minimum latency.
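As a concrete illustration of this analysis, the sketch below (our own re-creation, not the CDN's actual pipeline; the function name and data layout are hypothetical) computes the standard deviation of per-window minimum RTTs for one long-lasting connection:

```python
import statistics
from collections import defaultdict

def window_min_std(samples, window_s=600):
    """samples: list of (timestamp_s, rtt_ms) pairs from one long-lasting TCP
    connection.  Compute the minimum RTT in each 10-minute window and return
    the standard deviation of those per-window minimums."""
    mins = defaultdict(lambda: float("inf"))
    for ts, rtt in samples:
        w = int(ts // window_s)          # index of the 10-minute window
        mins[w] = min(mins[w], rtt)
    values = list(mins.values())
    return statistics.pstdev(values) if len(values) > 1 else 0.0
```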
We observe low baseline latency correlates with stability. Figure 5.7 shows how minimum latency and
standard deviation are related to each other. We calculate the minimum latency and standard deviation
among the round trip times (RTTs) measured from the data-ACK packets within a TCP connection. We find that when the minimum latency is low, the standard deviation is also low. When
Figure 5.7: Standard deviation among the minimum values (CDF of full IPv6 addresses vs. standard deviation in latency (ms), with one line for each minimum-latency bin from 10.0–15.0 ms to 70.0–75.0 ms)
the minimum latency is 10-15 ms, the standard deviation is around 5 ms for 50% of the IPv6 addresses.
However, the standard deviation is around 8 ms for 50% of the addresses, when the minimum latency is
70-75 ms. The lines gradually shift to the right (more standard deviation) as the minimum latency shifts
from 10 ms to 70 ms as we can see from Figure 5.7.
5.6.4.2 Stability over Three Weeks
We next look at the stability of one device for three weeks. This data complements our prior examination
of thousands of devices for tens of minutes.
To check the stability for an even longer period, we measure the latency from a single piece of UE
(Section 5.4.2) for over three weeks. We select several popular top-level webpages as our targets,
and ping these targets every day from a specific location. Since we use the domain names as our targets,
the target IPs may contain both IPv4 and IPv6 addresses. Then we measure the lowest latency within the
day to see whether this minimum latency remains stable for three weeks.
Figure 5.8 shows latencies from the same UE to top six websites, measured up to 30 times per day for
three weeks. For each site we consider the minimum RTT observed each day, then summarize those 21
Figure 5.8: Latency from one UE over three weeks (boxplots of RTT to google.com, facebook.com, apple.com, amazon.com, hulu.com, and twitter.com).
days with a boxplot showing median and quartiles (the P25 and P75 percentiles) and whiskers showing
the largest and smallest values within 1.5 × IQR beyond the P25 and P75 values, where IQR is P75 − P25.
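For reference, the following sketch (a hypothetical helper of our own, not the plotting code used for Figure 5.8) computes the boxplot statistics described above from one site's 21 daily minimum RTTs:

```python
import numpy as np

def boxplot_summary(daily_min_rtts_ms):
    """Median, quartiles, and whiskers (most extreme points within 1.5*IQR of
    P25/P75) for one site's daily minimum RTTs."""
    x = np.asarray(daily_min_rtts_ms, dtype=float)
    p25, p50, p75 = np.percentile(x, [25, 50, 75])
    iqr = p75 - p25
    low_whisker = x[x >= p25 - 1.5 * iqr].min()
    high_whisker = x[x <= p75 + 1.5 * iqr].max()
    return {"median": p50, "p25": p25, "p75": p75,
            "whisker_low": low_whisker, "whisker_high": high_whisker}
```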
Overall, we see latencies vary quite a bit for each site, and even more between sites. While Google,
Facebook, and Apple are all consistently around 20 ms, Amazon, Hulu, and Twitter are 2–4× that value,
suggesting some sites are located at or near mobile provider connection points, while others are remote
and accessed over longer paths. Figure 5.8 shows different ranges of latencies for different targets. Pinging
to google.com gives us the lowest and most stable latency from a specific location.
5.7 Conclusion
In this chapter, we show how we use measurements to evaluate mobile latency, throughput, and stability
(Section 5.6). We utilize a globally distributed CDN’s logs and direct measurements from UE to characterize
end-to-end latency. We demonstrate how IPv6 address patterns can help us identify UE with mobile access networks (Section 5.5). Then, upon isolating mobile traffic, we analyze mobile latency, throughput, and
stability from a globally distributed CDN. We study mobile carriers in four countries and three continents.
We show end-to-end mobile latency can be as low as 6 ms, and exceeding 100 Mb/s of throughput is not
rare from a CDN. We also show minimum mobile latency remains fairly stable when the baseline latency
is low. Our measurements and analysis suggest many mobile users are still far from the performance one
might expect of this 5G era. Ongoing use of our carrier-independent methods may tell if and when this
has improved.
In this chapter, we do not change any existing protocols and use only measurements to characterize mobile latency. Although not always stable, our analysis of end-to-end mobile latency to a CDN shows cases where latency is stable. We use this characteristic of stability to design our detour detection system
in Chapter 6.
Appendix 5.A Ethical Considerations
Our work poses no ethical concerns to the best of our knowledge. Our work contributes to the community
by offering insights into the performance of 5G networks reaching a globally distributed CDN. It poses no
risks to individuals or organizations. We preserve the anonymity of the operators’ names since our goal
was not to compare cellular networks or scrutinize the CDN provider. Some of our measurements reexamine data from CDN traffic, but we do not access user identities and report only aggregate information. Our
measurements from specific UE are carried out by ourselves, with devices we selected and with our consent.
Chapter 6
Finding Malicious Routing Detours
In this chapter, we describe our algorithm to detect malicious detours in 5G operator networks. Our insight is that 5G latency is generally stable, and any detour will cause this latency to deviate. We use this insight
to design a system that analyzes latency patterns from UE to different landmark destinations, and finds
destinations with stable latency. We then detect a detour if the stable latency significantly deviates from
the known historically stable latency. Through testbed experiments, we demonstrate the scenarios where
our proposed algorithm effectively detects malicious detours. We show latency to a landmark destination
can be very stable, with less than 1 ms of standard deviation. Utilizing this stable latency, we demonstrate
that we can even detect small detours that add only 2.5 ms of additional latency.
In this chapter, like the other defense systems of this thesis, we utilize only measurements to design
our detection system (Section 1.1). We only need to run measurements from the UE without changing
other existing protocols. The defenses in Chapter 2 and Chapter 3 secure servers, and the defense in Chapter 4 secures clients against brute-force password attacks. This chapter secures the
path from the User Equipment (UE) to the landmark destinations.
We plan to submit this work to a conference for peer-review as of June 2024.
6.1 Introduction
The demand for high-speed Internet access over cellular data networks has been growing over the years. Cellular
network providers aim to provide low latency and high throughput for multimedia streaming, IoT devices,
and vehicle-to-vehicle (V2V) communication. The advancement of mmWave, edge computing, and network slicing aims to provide low latency and high throughput to meet service requirements. As these
new technologies and standardization involve different parties from different nation states, 5G security
has become a concern for many [7, 37, 52, 43, 165].
To provide lower latency, high bandwidth, and service requirements, 5G brings new hardware and
software, developed and standardized by different enterprises. This new development includes deployment of small network cells, connected through third-party networking devices, and edge computing that
brings service close to the users. Some enterprises have already made progress with the deployment of
5G infrastructure in different parts of the world. Even if a network operator deploys its infrastructure
with trusted devices, the end-to-end network cannot be trusted because of the heterogeneous deployment
by different parties. As a result, network operators need to be vigilant about the presence of untrusted
devices.
Technologies new to 5G raise security concerns and possible new risks—for example, network slicing
and SDN can be used to detour traffic for eavesdropping [44, 51, 62]. In addition, new 5G hardware is sometimes shipped from vendors with uncertain security and testing practices. Although these new technologies can help users and network operators, they also represent a new attack surface and new opportunities
for attackers to exploit.
In this paper, we focus on detecting the risk of routing detours in 5G networks. Routing detours have
been studied in the general Internet [66, 211], but they are a new concern in 5G networks. Since it is
hard to standardize all security features consistently, we anticipate attacks where network devices can be
manipulated to observe traffic, gain control over traffic, and detour traffic to other locations. Detoured
traffic can be used for eavesdropping and then routed back to the original destination. In 5G, many network services and slices will have specific service requirements like latency and bandwidth. Malicious parties can learn these requirements and keep their detours and other malicious activities within the required service limits. Since no service degradation occurs, service clients and operators may continue to trust the malicious network, and the detour activity may remain undetected. As a result, we need a mechanism to
detect detour events.
To detect detours in 5G networks, our insight is that 5G latency is generally stable due to the new 5G spectrum and the deployment of edge computing. Any detour activity tends
to change the regular service latency. Since 5G is supposed to provide lower stable latency, any latency
change due to the routing detours should be visible if we carefully analyze the historic latency. In this
work, we characterize 5G latency as low and stable, and then we propose an algorithm to detect malicious
routing detours using historic latency.
We make three contributions in this paper. Our first contribution is to characterize cellular latency
and their stability (Section 6.6). We show evidence of a low and stable minimum latency, which is the
primary requirement of our detour detection algorithm. We characterize end-to-end latency and latency
inside the cellular networks. Our analysis using the 5G testbed shows latency inside the cellular network
is stable.
Our second contribution is to propose an algorithm to detect malicious routing detours using cellular
latency (Section 6.5). Our algorithm utilizes low and stable latency to detect malicious routing detours.
Our third contribution is to simulate our algorithm in different scenarios using a testbed to show the
effectiveness of our algorithm (Section 6.7). We vary different parameters to demonstrate the detour events
when our algorithm successfully detects the detour events.
Figure 6.1: 5G architecture and threat model (UE, base station, internal routers, edge router, and packet gateway in the 5G core network, connected through the Internet to the destination)
6.2 Problem Statement and Threat Model
Our assumption is that there can be untrusted devices installed in the path from the UE to the destination.
The malicious behavior of the untrusted devices can be unknown to the network operators, and can be
exploited to detour the traffic through an intermediate hop (Figure 6.1). These detours may not be identified
using traceroutes if there are ping-unresponsive devices inside the carrier network. The detoured traffic
can then be exploited for malicious reasons, and later forwarded to the actual destination. To keep within Service Level Agreements (SLAs), we expect detours to happen only occasionally so that SLAs remain within their limits most of the time.
A 5G network has different components (Figure 6.1), and a detour may happen in different parts of the
network.
Path from the UE to base station: The user equipment (UE) connects to a base station through a
wireless Radio Access Network (RAN) running over 5G channel frequency—a difference between 4G LTE
and 5G. We anticipate a lower probability of having a detour since we do not expect third-party devices
within this communication channel. However, surveillance devices (such as Stingray [260]) simulate a cell phone tower to force cellular devices to connect to the illegitimate tower. Although normally available only
to law enforcement for authorized intercepts, this technology may also be exploited by a sophisticated
attacker.
Path from base station to the packet gateway: The path from the base station to the packet gateway
is known as the backhaul network [176]. The base stations can have wired or wireless connections to the
edge router and packet gateway [261, 119], and then the packet gateway connects to the Internet. The mobile
edges can be located inside mobile networks to bring computation closer to the users. Rapid deployment of edge computing requires installing new infrastructure inside mobile networks—possibly near the base
station.
This new infrastructure brings hardware and software that are not well trusted. Even for the trusted
providers, pressures to capture the global 5G market may result in incomplete device testing or imperfect
device security. As a result, these devices can be exploited by active adversaries. Additionally, different
content providers or enterprises use 5G edge computing as-a-service without even testing the underlying
infrastructure [104]. In Figure 6.1, we can see such a scenario where a detour happens in the path from
the base station to the packet gateway.
Detours in the Internet: A malicious detour in the Internet depends on how the carrier is connected to the Internet. Here, a malicious party needs to manipulate BGP to reroute the traffic. However,
BGP manipulations are normally observable from public telescopes and should appear in the traceroutes;
so BGP manipulation is mostly tractable. Also, we do not expect any mobile-network-specific deployment
here.
6.3 Related Work
Our detour detection system relies on low and stable latency.
5G security related studies: 5G security has become a concern in recent years. The adoption of
devices from multiple vendors and nation states has intensified trust issues [52, 86]. The 5G security concerns, challenges, places of vulnerabilities, and possible solutions have been discussed in several previous
studies [6, 114]. These studies provide a general overview of 5G security. Our work utilizes historic latency
to make a network operator watchful and vigilant about any routing detours and misconfigurations.
Use of latency for different purposes: Latency in mobile data networks has been studied from different perspectives. Studies show the impact of buffer queuing on latency [254, 111], latency and throughput
variations in 3G [229], and design solutions for lower latency in 5G [174, 130]. User latency can be used
for other purposes like detecting geo-movement of IP blocks [82]. Latency changes may happen due to malicious routing activities, and finding these activities is an active research topic [264, 263, 181, 215, 203, 27]. We
utilize latency to detect suspicious routing detours.
Measurement from cellular devices: Previous studies showed performance measurement from real
cellular devices to evaluate 5G latency and throughput [152, 153, 258, 83]. Several studies took cellular devices to different locations in the US, and measured latency and throughput while traveling [83]. Another
prior study shipped a mobile device across the US, and made traceroutes to infer cellular topology [262].
Other studies measured latency and throughput within a limited geo-coverage from the user’s devices.
Some of these studies also measured latency, throughput, and power efficiency with StandAlone (SA) and
Non-StandAlone (NSA) 5G networks [258]. Our focus is different; we analyze stability in minimum latency
to check our detour detection algorithm.
Detour detection related studies: The closest to our work evaluated international BGP detours
where traffic crosses a national boundary and returns to the same country [211, 67]. This work helped privacy-concerned states to minimize malicious routing traversing different jurisdictions. Our work is different: it focuses on detours in mobile networks, and to the best of our knowledge this is the first work
that attempts to characterize mobile latency and utilize that latency to detect malicious detours. There
are multiple other studies related to BGP prefix hijacking—to understand the possibility of intercepting
traffic to conduct man-in-the-middle attack [20], surveying operators to understand awareness against
BGP hijacking [210], studying route manipulation to attract traffic [87], RPKI adoption to validate route
origin [45], and showing the evidence of hijacking attacks in the wild Internet [242]. So, finding detours
and evaluating their possibility with BGP is an active research topic. Our work is specific to detours in
mobile networks.
Prior studies evaluated detoured paths for performance improvement [99] or to solve security and
privacy concerns [211, 67, 236]. A prior work attempted to find alternative detoured paths for improving
performance and resilience [99]. Another study utilized detoured paths for DDoS defense [236].
6.4 Data Sources And Measurements
We use three datasets to characterize end-to-end latency and latency inside cellular networks.
Measurements from UE: We measure latency from real UEs to find out the stable component in the
cellular latency. We use an iPhone 13 Pro with 5G capabilities to evaluate latency stability. We use the T-mobile carrier for this measurement. We use HTTP GET requests from our UE to characterize latency
and stability.
CAIDA/UCSD ShipTraceroute data: We want to find the internal hops within a cellular network,
and find out the latency towards these internal hops. But can we really find out the internal hops within
a cellular network using traceroutes and learn their latency?
A previous study from CAIDA, UC San Diego has made an analysis by shipping a mobile device to
different geo-locations to make traceroute measurements at a fixed interval [262]. Although ShipTraceroute measurements infer network topology, we use this dataset to check how many internal hops we can observe and what the latency is to these internal hops. This dataset contains traceroutes from three popular
carrier networks in the US—AT&T, Verizon, and T-mobile (we are not anonymizing here since the previous
study reported the carrier names). Since the user device travels different parts of the US, we can learn the
internal hops in different locations of the US.
Measurements from 5G testbed: Our goal is to learn latency from user devices to different cellular
network components. The ShipTraceroute dataset shows different internal hops inside the cellular network.
However, we cannot be sure which network components these internal hops refer to. Thus we use a testbed
from the University of New Hampshire (UNH) to test latency inside the cellular network [158] where we
know the cellular network component. In the testbed, the 5G UE is connected to the edge router through
an Amarisoft gNodeB (base station) and an intermediate router that forwards traffic to the edge router.
6.5 Detecting Routing Detours
Our insight is that if mobile latency is consistent, then we can use it to detect detours. We show mobile
latency can be stable (Section 6.6). Any detour that sends traffic out of its way will increase this stable
latency, and we can detect the detours based on this increase. Our insight leads to an algorithm to detect
detours that first learns the baseline and then detects detours from deviation. Our algorithm uses some
of the same subroutines in both the learning and detection phases. Our algorithm learns the baseline using five steps—
initializing landmarks (Section 6.5.1), measuring latency to landmarks (Section 6.5.2), windowing to divide
the time-series (Section 6.5.3), learning baseline and jitter from the landmark latencies (Section 6.5.4), and
then selecting landmarks with stable latency (Section 6.5.5). Then we use the baseline latency to detect
detours by evaluating the deviation from the baseline (Section 6.5.6).
6.5.1 Initializing Landmarks
We define landmarks as the destination addresses to which we make latency measurements from our UE.
Our algorithm uses these input landmarks to measure baseline latency. We require stable regular latency
to these landmarks from UE. We anticipate that the landmarks are near to the UE or even within the
mobile networks (for example, as proposed by 5G Edge Computing, or if the mobile operator hosts mobile-network-specific data centers, or chooses to peer in a commodity data center where the CDN already has
servers). The UE should experience stable latency to these nearby landmarks, as there will likely be less
interference from the external Internet. Thus we choose landmarks that are hosted by Content Delivery
Networks (CDNs), because CDNs normally host their contents close to the mobile users.
As the landmarks, we use ten popular top-level webpages like www.google.com, www.amazon.com, or www.youtube.com (we provide our list in Table 6.5 in Section 6.A). We use multiple webpages because when they are hosted by different CDNs, they provide different paths, some of which may be near the edge of the mobile operator's network. This diversity makes it likely that we can find one or more low-latency paths, an important input to detour detection (Section 6.5.6). The set of landmarks we consider can vary by mobile operator, to get landmarks that are popular for that operator's users. (In our stability analysis, we use a fixed
set of landmarks.)
Even if a landmark is hosted by a CDN near the mobile operator’s network, it may not experience stable
latency due to load balancing. This instability is because the content may be load balanced across CDN
sites located in different datacenters. As our goal is to find some landmarks with stable UE-to-landmark
latency, we measure latency to all the initialized landmarks (Section 6.5.2), and use the ones with stable
latency (Section 6.5.5) in the detection phase (Section 6.5.6).
6.5.2 Measurements to Landmarks
Both the learning and detection phases (described in the following sections) require that we measure latency to landmarks, as described here. We considered TCP handshake RTTs, ping RTTs, and RTTs from HTTP(S) requests as measurement methods. Our active measurements use pings (ICMP echo request messages) to determine RTTs, since pings are easy to deploy and do not require additional tracking of outgoing and incoming packet timers as TCP handshakes and HTTP(S) requests do. We avoid passive measurements since passive
measurements collect segmented data in the learning period.
Next, we evaluate the baseline based on the measured latency to the landmarks. For both learning and
detection, we take a number of latency observations (we define the number in Section 6.5.3) in the time series, and evaluate the minimum latency from these observation points. We define this minimum latency from a window as the window latency. Since outliers may happen due to measurement error or unusual network conditions, we select the 5th percentile value to avoid an unrealistically low minimum. We combine multiple such 5th percentile values (more about window latency in Section 6.5.4) to find our baseline.
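A minimal sketch of this step, assuming the RTT samples for one window are already collected (the function name is ours):

```python
import numpy as np

def window_latency(window_rtts_ms):
    """Window latency: the 5th-percentile RTT of one observation window,
    used instead of the raw minimum to discard outliers."""
    return float(np.percentile(window_rtts_ms, 5))
```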
6.5.3 Varying Window Size
This section outlines our approach to segment the entire time series into multiple windows with varying
window sizes so that attackers cannot plan their attacks based on a fixed window size. For each pair of
observer and landmark, we divide the learning period into multiple time windows. While a typical algorithm
uses a fixed window size, we select a potentially different window size for each (observer, landmark)-pair
each time we train. Our windowing mechanism serves three purposes—finding a window that typically
sees stability in the baseline, selecting a window that is sensitive to small detour events, and hiding the
true window size from the attacker.
Our algorithm needs to maintain a sufficiently large number of samples to ensure statistical significance and stability in the baseline for each individual window. Our algorithm tries different window sizes
with different numbers of samples, ranging from small to large, typically choosing a window of 8 or more
observations. Then for each window size, we calculate the baseline and standard deviation based on the
method mentioned in Section 6.5.4. Then the algorithm allows the window sizes that show stability—
having a standard deviation below a threshold (Section 6.5.5). Among multiple allowable window sizes, we avoid excessively large window sizes, and then select one for our learning and detection.
We avoid setting the window size excessively large for two reasons. First, a large window size results in a long delay before we can detect attacks (Section 6.5.6). Second, a large window size makes detection insensitive to
short detour events. If attackers know the window size is large, a clever attacker will devise a detour schedule
within that large duration to circumvent our defense measures. Hence, we use a threshold for the maximum
number of data points in each window. We place an arbitrary upper bound of 50 data points per window.
Now, we have a minimum and maximum window size that provides stability and sensitivity to smaller
detour events. How do we choose one for the detection phase? Our algorithm goes one step further by
taking a random window size from the allowable window sizes. The threat is that the attacker can know
the window size if we fix one, and adjust their attack strategy accordingly. We choose to take a random
window size from the allowable windows, rather than fix it globally. Since our algorithm is public, an
attacker who knows the window size can detour for a brief-enough duration that our detection threshold
(the window latency described below in Section 6.5.4) is unchanged. We opt for variable window size to
increase the difficulty for the attacker in determining durations for both detour and non-detour routes.
Next, we show how we measure the baseline for each time window, and decide the baseline with
window size for the detection phase.
6.5.4 Learning Baseline
We next show how we calculate the baseline latency and jitter for each (landmark, window) pair.
This step uses the landmarks (Section 6.5.2) and different window sizes (Section 6.5.3) we have selected.
We assess the latency by estimating the minimum latency, L (5th percentile) over each window. We
define this 5th percentile latency as window latency (as mentioned in Section 6.5.2). We combine multiple
windows to measure the mean and standard deviation of the window latency (5th percentile) values to
build the baseline B and the jitter σ. To check how the window latency estimate varies across
time windows, we compute baseline and jitter from estimates from at least 10 windows. We find the mean
and standard deviation of window latency for all the window sizes and for all the landmarks.
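The following sketch (our own illustration, with hypothetical names) restates this computation: split the learning-period RTTs into windows, take the 5th-percentile window latency of each, and summarize them as the baseline B and jitter σ:

```python
import numpy as np

def learn_baseline(rtts_ms, window_size):
    """Split the learning-period RTTs into windows of `window_size` samples,
    take the 5th-percentile window latency of each, and return the baseline B
    (mean of the window latencies) and the jitter sigma (their std deviation)."""
    chunks = [rtts_ms[i:i + window_size]
              for i in range(0, len(rtts_ms) - window_size + 1, window_size)]
    window_latencies = [np.percentile(c, 5) for c in chunks]
    return float(np.mean(window_latencies)), float(np.std(window_latencies))
```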
Now, we have the baseline and jitter for different landmarks and window sizes. Next, we show how
we pick the stable landmarks and window size.
6.5.5 Using the Baseline to Select Good Landmarks and Window Size
Our algorithm evaluates latency to multiple landmarks (Section 6.5.1) with different window sizes (Section 6.5.3). Some of these landmarks and window sizes do not provide the expected stability from the UE to
the landmarks. Also, we show the tradeoffs between small and large window sizes in Section 6.5.3. Thus
we need to pick the landmarks and corresponding window sizes so that we can get a stable latency to the
landmarks and fulfill window size restrictions.
We use a threshold-based filter for standard deviation to ensure a stable ⟨landmark, window⟩ pair.
For a landmark with multiple window sizes, we select the ones that provide jitter below a threshold,
Tσ. For Tσ, we utilize 5% of the mean of the window latency values measured from the time windows. If
we have a landmark with multiple such window sizes, then we just take a random window size for that
landmark. We repeat the same approach to select a window size for all the landmarks. If there is no such
landmark, then detour detection is not feasible in the network.
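A sketch of this filtering step, assuming the learning phase has already produced a (baseline, jitter) estimate per ⟨landmark, window⟩ pair (the data layout and function name are hypothetical):

```python
import random

def select_stable_pairs(learned, rel_threshold=0.05):
    """`learned` maps (landmark, window_size) to (baseline, jitter) from the
    learning phase.  Keep pairs whose jitter is below T_sigma = 5% of the
    baseline, then pick one allowable window size at random per landmark."""
    allowable = {}
    for (landmark, window_size), (baseline, jitter) in learned.items():
        if jitter <= rel_threshold * baseline:
            allowable.setdefault(landmark, []).append((window_size, baseline, jitter))
    # landmarks with no stable window size are dropped entirely
    return {lm: random.choice(options) for lm, options in allowable.items()}
```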
After filtering the unstable landmarks, we now have one or more landmarks for the detection phase.
We utilize all the stable landmarks and corresponding window size for the detection phase. Different
landmarks take different internal paths, so they may detect detours in different parts of the network. In the
detection phase, our algorithm shows a detour event for each individual landmark.
Bypassing our detection is challenging for the attackers: We show how we select stable landmarks and a random window for that landmark. Our defense makes it significantly hard for an attacker
to predict the window size in a location, and schedule their attacks accordingly (we show an example of
how we mitigate this challenge in Section 6.7.2.2). Two UE from the same location may select different
window sizes for the same landmark. So, planning an attack would be significantly harder for the
attackers.
6.5.6 Detection Methodology
We receive one or more stable landmarks, and a corresponding window size from the learning phase
(Section 6.5.4) to detect detours. Next, we show how our detour detection algorithm works based on the
inputs from the learning phase.
Given our model of baseline latency defined by a stable mean and low standard deviation, we now detect changes when observations exceed a function of those factors for at least three consecutive observation
windows.
We detect a detour by latency that is noticeably higher than the baseline we learned—that is, L >
B + kσ where L is the window latency, B is the baseline from learning, σ is the standard deviation (as
mentioned in Section 6.5.4), and k is the threshold. We measure L as the window latency (Section 6.5.3)
to the landmark. The detection phase receives the landmark address with stable latency from the learning
phase along with the window duration (Section 6.5.5).
The magnitude of k dictates the balance between the smallest detectable detour and the potential for
false positives generated by our system. When k is smaller, we are capable of detecting minor detour
incidents, yet our system may produce false positives. At higher values of k, our system might overlook
minor detour occurrences, yet it will be less susceptible to false positives. We discuss the
choice of k in Section 6.7. Finally, we wait for three consecutive time windows before declaring a detour
event.
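A minimal sketch of this detection rule, assuming a stream of window latencies and the learned B and σ (the names are ours):

```python
def detect_detour(window_latencies, baseline, sigma, k=3, consecutive=3):
    """Flag a detour when the window latency L exceeds baseline + k*sigma for
    at least `consecutive` observation windows in a row."""
    run = 0
    for latency in window_latencies:
        run = run + 1 if latency > baseline + k * sigma else 0
        if run >= consecutive:
            return True
    return False
```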
We detect the detour event for each of the selected landmarks. Our detection algorithm finally alerts
the users about the possible detour event along with the corresponding landmark. Since we have multiple
landmarks, we can get multiple detections for different landmarks.
Figure 6.2: UNH testbed topology (UE, base station, internal router, edge router, and packet gateway in the UNH 5G testbed, connected through the Internet to the destination)
6.6 Confirming Detour Detection is Possible
Before we evaluate how well detour detection works, we first explore actual latencies observed in the 5G
carrier networks to confirm that we can observe their backhaul network, and that it has latency that is
small and stable enough that detours will be visible. We therefore examine latency from the UE to different
landmarks, and show a stable latency that is a requirement to detect detour events.
We start with the latency inside the carrier network (Section 6.6.1), and then we characterize end-to-end latency (Section 6.6.2). Because the mobile operators are fully responsible for their backhaul network,
they have the ability and motivation to provision it to avoid congestion, so we expect stable latency inside
mobile networks.
6.6.1 Confirming Stability within Carrier Network
At first, we evaluate how stable latency is inside mobile networks. We expect latency to usually be
stable for the reasons given above. We begin by considering the best-case scenario, where we can measure
latency to the edge router (Section 6.6.1.1). Subsequently, we assess the observability of latency to the
internal hops within the mobile network (Section 6.6.1.2).
Figure 6.3: UE to edge router latency showing raw values with window latencies (left), CDF of raw and window latencies (middle), and the boxplot of the window latencies (right)
6.6.1.1 Stability to the edge router
We will demonstrate the stability within the mobile network by illustrating the consistency in latency to
the edge router.
We test latency stability over a testbed with real 5G hardware and radios. Our algorithm requires
stability, and proving stability to the edge router is the first step of showing 5G latency is generally stable.
So, how stable is the latency if we can measure to the edge router? To answer this question, we use a
5G testbed from the University of New Hampshire. Our testbed setup includes a UE along with a base station
connected to the edge router through an intermediate router (Figure 6.2). The testbed base station runs
standard 5G protocols, thus the wireless component mirrors real-world hardware and software, albeit with
a reduced number of User Equipment (UE). To measure the stability, we make ping requests to the edge
router from the UE.
Figure 6.3 shows the latency distribution from the UE to the edge router. From the raw values by
blue dots, we can see that the lowest latency is around 10 ms and the latency is stable—showing variation
Figure 6.4: Number of IPv4 traceroute hops within carrier networks (percentage of traceroutes vs. number of hops for (a) AT&T, (b) Verizon, and (c) T-mobile)
between 9 ms and 14 ms (Figure 6.3). From the CDF with the blue line, we can see that the variation is generally linear between the 10th and 90th percentiles, or so.
We observe a low and stable latency to the edge router which is the requirement of our detour detection
algorithm. We also show the window latency with red dots, when we consider 10 pings in each time
window (Section 6.5.3). According to our algorithm (Section 6.5), the mean of the window latency values
is 11.8 ms with a 0.3 ms standard deviation. The window latency values are stable as shown by the red
CDF line. We show the boxplot of the window latency values from multiple windows in Figure 6.3. This
boxplot shows median and quartiles (the P25 and P75 percentiles) and whiskers showing the largest and
smallest values within 1.5 × IQR beyond the P25 and P75 values, where IQR is P75 − P25. We can see all
the window latency values remain stable in all the windows which is represented by the tiny height of the
boxplot. Various configurations and models of base stations could potentially yield even lower minimum
latency to the edge router. The low standard deviation reflects low jitter.
We get a low and stable window latency if we can measure latency from the UE to the edge router. Our
next question is, can we measure latency to the internal hops in a real operator’s network with multiple
hops?
6.6.1.2 Can We Observe Internal Hops?
We have already shown that latency is stable to an internal hop like the edge router. Now, we will check whether we can
really observe the internal hops inside mobile networks.
To understand the feasibility of getting latency to internal hops, we analyze the ShipTraceroute dataset
(see Section 6.4 and [262]). Measuring latency to internal carrier hops within the mobile networks is challenging for three reasons. First, some of the router interfaces do not respond to ping requests. Second,
carriers occasionally utilize addresses from private address space to reply to ping requests, which may lack
carrier identification. Third, layer-2 networking will remain hidden from the traceroute measurements.
The ShipTraceroute dataset combines traceroutes to the neighboring ASes of the cellular carriers. The
traceroute hops before the destination AS represent hops within the mobile networks. CAIDA made these
measurements with three carriers by shipping the UE to different geolocations. To check whether an
intermediate hop is within the mobile networks, we use IP-to-AS organization mapping. When we do not find a public IP address in the traceroutes, we consider the first few hops within private address space before
a public address as carrier hops.
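A sketch of this classification heuristic (our own illustration; the IP-to-AS lookup is assumed to be supplied by the caller, for example from a routing-table dump):

```python
import ipaddress

def carrier_hop_count(hop_ips, carrier_asns, ip_to_asn):
    """Count leading traceroute hops that appear to be inside the carrier:
    private addresses are treated as carrier hops, public addresses are checked
    with an IP-to-AS lookup, and counting stops at the first outside hop.
    `ip_to_asn` is a caller-supplied lookup function (hypothetical here)."""
    count = 0
    for ip in hop_ips:
        if ipaddress.ip_address(ip).is_private:
            count += 1
        elif ip_to_asn(ip) in carrier_asns:
            count += 1
        else:
            break
    return count
```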
We find evidence of internal carrier hops in the traceroute measurements. Figure 6.4
shows a histogram of how many hops we observe from the traceroutes for three mobile carriers. We can
see that in most cases traceroutes with AT&T show 5 to 10 internal hops. This path length varies with the
geolocations but is mostly fixed within a certain geographic area. With Verizon, we observe 5 to 8 internal
hops. For T-mobile, we consider private IPs as the internal hop address since T-mobile traceroutes do not
show public IP addresses. In over 60% of traceroutes, we observe 6 internal hops inside the cellular networks.
Although we observe the internal hops, determining the network components (Figure 6.1) represented by
these hops is challenging.
Since the ShipTraceroute dataset was collected in 2019-2020, we reexamine traceroutes from a single
location. While we were able to observe the internal AT&T hops using the ShipTraceroute dataset, we observed unresponsive AT&T IPv4 hops when we conducted traceroute measurements again in 2023. We
find AT&T internal IPv4 hops drop the ping packets. Since internal hops are no longer detectable, next
Date              City            Network signal strength   5%ile (ms)  25%ile (ms)  50%ile (ms)  75%ile (ms)  95%ile (ms)
2023-12-16T15:00  Los Angeles     Excellent                 35.9        40.3         46.3         59.7        74.5
2023-12-17T13:00  Los Angeles     Excellent                 36.7        41.6         45.2         55.0        72.1
2023-12-18T15:00  Los Angeles     Excellent                 33.5        39.9         44.3         59.0        74.1
2023-12-21T08:00  Los Angeles     Excellent                 33.4        38.6         42.6         55.8        73.4
2024-01-25T13:00  Marina Del Rey  Poor                      15.0        22.3         32.4         49.2        75.1
2024-01-25T17:00  Marina Del Rey  Poor                      17.5        24.4         53.4         90.0        138.6
Table 6.1: Latency stability to a fixed landmark
we consider end-to-end latency to check the stability for a longer period of time. Long-term stability is important so that we can utilize the same learned baseline for a longer period of time.
6.6.2 Confirming End-to-End Latency Stability
The ShipTraceroute dataset only contains short-term data from any single location since the device was moving while the dataset was collected. As a result, the dataset combines measurements from a wide geographic
area, but does not include observations from a single location for a longer duration. Consequently, this
dataset cannot indicate long-term stability. We therefore collect new data over five days from our own
device (Section 6.4) so we can evaluate long-term stability.
We make measurements from the UE to different landmarks for multiple days. To show stability with
other measurement methods, we conduct HTTP GET requests every 5 seconds to multiple landmarks to
observe latency stability. We measure the latency from the TCP SYN and SYN/ACK packets. We use
multiple popular web pages as the landmarks, hosted by a CDN in a nearby data center (we identify the CDN
service by observing the DNS CNAME). We take the measurements from two different locations.
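As a rough user-space approximation of this measurement (a sketch under our own assumptions, not the tool we used; measuring the actual SYN and SYN/ACK packets requires packet capture), one can time a TCP handshake to a landmark:

```python
import socket
import time

def tcp_connect_rtt_ms(host, port=443, timeout=5.0):
    """Approximate the SYN/SYN-ACK RTT by timing a full TCP handshake to the
    landmark from user space."""
    start = time.monotonic()
    sock = socket.create_connection((host, port), timeout=timeout)
    rtt_ms = (time.monotonic() - start) * 1000.0
    sock.close()
    return rtt_ms
```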
We expect a low latency since the landmark webpages are hosted within the same city as the UE.
SYN-measured latencies may fluctuate due to network conditions or congestion, but we anticipate a
consistent window latency that reflects the distance between the source and the destination [162]. We
expect a stable window latency since we take 5th percentile latency as the window latency, and even if the
SYN latencies have jitter, the window latency is stable and repeats over time.
Figure 6.5: RTT (ms) measured every 5 s to two different web pages hosted by the CDN: the left figure shows raw values and window latencies, the middle figure shows the CDF of raw values and window latencies, and the right figure shows a boxplot of the window latencies. (a) Measurement from Los Angeles to webpage-1 hosted by the CDN; (b) measurement from Marina Del Rey to webpage-1 hosted by the CDN.
We see the raw RTT values measured from two different cities along with their CDF in Figure 6.5. As
expected, the RTT varies more compared to our observations when we measured RTT to the edge router.
Figure 6.5a shows the latency varies from 35 ms to 75 ms when we measure from Los Angeles while the
latency to the testbed edge router varies between 11 ms and 15 ms, as we showed in Figure 6.3. We observe a
lower minimum of 15 ms in Figure 6.5b when we measure from Marina Del Rey. However, Marina Del Rey
RTTs exhibit greater variability, as evident in the scattered datapoints. We notice poor mobile network
signals when conducting measurements from Marina Del Rey. This poor signal could potentially account
for the variation among datapoints. Although the raw values include a wide range of datapoints, we can
see a stable window latency by the red points. This distinction is visible by the CDF graphs in Figure 6.5.
While the blue line with raw values has a high slope, we observe a flat red line indicating stable window
latency.
The stability in window latency (boxplots in Figure 6.5) justifies our choice of defining 5th percentile
latency as the window latency. We find 5th, 25th, 50th, 75th, and 95th percentile RTT combining all the
measurements taken on a specific date and time. We show these values in Table 6.1. Measurements from
Los Angeles show a very stable latency measured on four different days. The stability remains consistent
across all percentile values. However, we find a different outcome when we measure from Marina Del Rey.
We observe that only the 5th and 25th percentile values are consistent at two different times on a given day. The
remaining percentiles exhibit significant deviation even within a short time frame. As stated earlier, we
suspect the weak network signal causes this variation.
The consistency in end-to-end latency validates the practicality of utilizing it to detect malicious routing detours. Our analysis shows we need to use 5th percentile latency as the window latency to evaluate
stable latency and avoid occasional jitter.
       % of windows w detour (false positive)
Day    k = 2    k = 3
Day 1  9.1      0.0
Day 2  0.0      0.0
Day 3  4.5      4.5
Day 4  31.8     4.5
Day 5  4.5      0.0
Day 6  36.4     4.5
Day 7  50.0     0.0
Table 6.2: % of windows detecting false detours based on k
                   % of windows w detour (true positive)
Detour             k = 2    k = 3
1 ms + 0.2 std     68.1     0.0
2 ms + 0.4 std     100.0    59.0
2.5 ms + 0.5 std   86.4     45.5
5 ms + 1.0 std     100.0    95.5
Table 6.3: Impact of k in various detour scenarios
6.7 Evaluating Parameters
Our measurements of 5G networks in Section 6.6 show that detour detection is possible. Next, we evaluate
our choice of parameters. Then we evaluate the effectiveness of our detour detection algorithm across
various scenarios (Section 6.8).
Our choice of parameters results in different success rates for our detection algorithm. Here we discuss the impacts of two parameters on our detection success—the multiplier (k) that we use with the expected jitter (Section 6.5.6) and the window size (Wi) (Section 6.5.3).
6.7.1 Evaluating False Positives and Accuracy
A key parameter in our algorithm is k, the amount of deviation from the baseline we use to detect detours.
We analyze the value of k from two perspectives. First, we show k’s value needs to be sufficiently large to
avoid false positives during normal traffic. Second, we show k’s value should not be too large so that we
can detect smaller detour events.
6.7.1.1 k value to avoid false positives during normal traffic
We select the value of k to ensure that it does not generate false positive signals during regular traffic
conditions. We evaluate the false positive rate as k varies for normal traffic (with no detours). As we
mentioned in Section 6.5.6, our detection method detects a detour event when the window latency exceeds
the baseline, B + (kσ). Depending on the value of k, we demonstrate how we can either increase the
likelihood of detecting small detour events at the expense of a higher rate of false positives, or miss smaller
detour events while minimizing false positives.
To assess the influence of the parameter k, we conduct measurements across various days without
introducing any detour events. We look for false positives over several days of data with two different
values of k. We take data for 4 minutes, providing 220 observations. Our window sizing algorithm shows
10 observations per window is stable, so each experiment has 22 windows, each of 10 observations. We
assess the baseline and acceptable jitter. Then we calculate the percentage of total windows (approximately
22 in total) that detect a detour during regular traffic for two values of k, where any detections indicate a
false positive.
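A sketch of this per-window false-positive count, assuming the normal-traffic window latencies and the learned baseline and jitter are given (the names are ours):

```python
def false_positive_rate(window_latencies, baseline, sigma, k):
    """Percentage of normal-traffic windows whose window latency exceeds
    baseline + k*sigma; with no detour present, any such window is a false positive."""
    flagged = sum(1 for latency in window_latencies if latency > baseline + k * sigma)
    return 100.0 * flagged / len(window_latencies)
```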
We show the percentage of windows that detects a false detour in Table 6.2. Employing a threshold of
2 times the value of σ (k = 2) for detour detection results in a notable proportion of windows exhibiting
false detour detection. With k = 3, we observe a significantly lower number of false positives. Thus we
recommend using k = 3 to avoid false positives during normal traffic.
6.7.1.2 k value to detect smaller detours
One can potentially use a high value of k to avoid any false positives during the normal traffic. However,
a high value of k would cause our detection method to miss smaller detour events. Next, we assess
the trade-offs associated with detecting minor detour events and false positives, considering the parameter
value k.
Figure 6.6: Evaluating how different window sizes result in different baseline latency and standard deviation (mean of window latency with standard deviation (ms) vs. window size in number of datapoints, for the UE in Los Angeles, the UE in Marina Del Rey, and the testbed to a webpage)
We test the trade-offs by injecting variable detour delays and jitters. Varying the detour delay represents different distances to which the attacker shifts traffic. We then compute the percentage of windows
indicating a detour (more scenarios and experiment details in Section 6.8.3).
From Table 6.3, we can see that more windows can detect the detour event when we use k = 2. With k = 3, fewer windows can detect the detour events. Small detour events are hard to
detect when we use a larger multiplier (k = 3). So, having k = 2 helps us to detect smaller detour events
but may cause false positives. We use k = 3 as a multiplier to avoid false positives although we may miss
smaller detours. Using an even higher value of k would result in even more false negatives—missing more detour events.
6.7.2 Evaluating Window Size
Our algorithm utilizes different window sizes, and selects one for each landmark in the detection phase
(Section 6.5.3).
Figure 6.7: Variable window size increases the difficulty for the attackers (% of detected windows vs. window size in number of datapoints, for a 1 ms detour with 0.5 ms std and a 2.5 ms detour with 0.5 ms std; a vertical line marks the attacker-assumed window size)
We evaluate window sizes to find a stable baseline (Section 6.7.2.1). Then we show how a variable
window size makes it harder for the attackers to bypass our detection approach (Section 6.7.2.2).
6.7.2.1 Window size to get stable baseline
Next, we show how the window size impacts the baseline and corresponding standard deviation. Depending on the window sizes, the stability of the window latency varies, and results in different baselines. We
demonstrate instances where certain window sizes are ineffective for getting a stable baseline, and illustrate
that increasing the window size beyond a certain point becomes unimportant.
We employ varying window sizes in three of our collected datasets—end-to-end latency from Los Angeles and Marina Del Rey (Section 6.6.2), and RTT to a webpage in the testbed (Section 6.6.1). We use
window sizes ranging from 3 to 12. We compute the window latency from each window, then calculate the
mean and corresponding standard deviation of these window latency values across multiple windows. We
expect a high mean and jitter when employing a small window size, as the estimate is limited to a few data
points.
As the window size increases, both the mean and standard deviation of the window latency values
at different time windows gradually decrease (Figure 6.6). This trend occurs because larger window sizes
encompass more data points, so the per-window 5th-percentile latency is both lower and more stable, leading to a lower mean and standard deviation.
A window size containing only a few data points proves ineffective because of the possible unstable
baseline latency within the few datapoints. As a result, we cannot utilize that window size in the detection
phase. In environments characterized by high jitter, such as Marina Del Rey, we note that the standard
deviation tends to be excessively high at the beginning. As we can observe from the right side of Figure 6.6,
the standard deviation σ is over 7 ms when we use 3 packets in a window. Since we use k = 3, a standard deviation of 7 ms means our algorithm allows up to k × σ = 21 ms of deviation from the mean. So, we expect
a high number of false negatives since we cannot identify a detour that does not add over 21 ms of latency.
With an increase in the window size, the jitter rapidly decreases. Beyond a certain window size when
the standard deviation is less than a threshold (Tσ as mentioned in Section 6.5.5), the jitter stabilizes and
remains relatively consistent. So, beyond a certain point, we can select any window size because any
window size results in a stable baseline. As we can see from Figure 6.6, the blue and green lines stabilize
when the window size is over 8 in the X-axis. The mean of the window latency in the Y-axis does not vary
that much when the window size is over 8.
We do not use a very large window size (capped by a threshold mentioned in Section 6.5.3) since it would make the detection phase longer, and attackers could make intelligent attacks (Section 6.5.3) within that large window. Our algorithm chooses a window size randomly from the allowable window sizes. In this way, we avoid using a fixed window size for a specific landmark, which an attacker could otherwise learn and exploit. Next, we show how we make it harder for the
attackers to bypass our detection system.
6.7.2.2 Variable window size makes bypassing harder for attackers
Here, we show how variable window size makes it harder for the attackers to bypass our detection system.
A clever attacker can make intelligent attack events by combining both detoured and non-detoured
traffic. Since we take 5th percentile latency as a window latency, an attacker can keep some traffic in the
regular path to keep the window latency similar to that of normal traffic, and bypass our detection mechanism. We
mitigate the risk of such bypassing by using a variable window size.
We consider two attack events—one with a 2.5 ms detour having 0.5 ms of standard deviation in the detour delay, and the other a 1 ms detour with 0.5 ms standard deviation (we describe more about these
attacks in Section 6.8.3). Since detoured paths show different latency, normally more than the regular path,
we use the term detour delay to indicate the additional latency added because of the longer path through
the malicious entity.
We assume the attackers predict the window size is 20, and so they push two packets in every 20
packets in the regular path to keep the window latency similar to the baseline.
We measure how many detection windows correctly alert us (true positives) about the detour event
during an attack event. More true positives indicate the difficulty for the attackers to bypass our detection
system.
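A sketch of this experiment (a simplified simulation under our own assumptions, not the exact testbed procedure): the attacker delays all but a few packets per presumed window, and we count how many defender windows still exceed B + kσ:

```python
import random
import numpy as np

def simulate_bypass(base_rtts, baseline, sigma, detour_ms, detour_std,
                    attacker_window=20, clean_per_window=2,
                    defender_window=10, k=3):
    """Inject a detour into all but `clean_per_window` packets of every group of
    `attacker_window` packets, then report the percentage of defender windows
    whose 5th-percentile latency still exceeds baseline + k*sigma."""
    rtts = []
    for i, r in enumerate(base_rtts):
        if i % attacker_window < clean_per_window:
            rtts.append(r)                                   # sent on the regular path
        else:
            rtts.append(r + random.gauss(detour_ms, detour_std))
    chunks = [rtts[i:i + defender_window]
              for i in range(0, len(rtts) - defender_window + 1, defender_window)]
    flagged = sum(1 for c in chunks if np.percentile(c, 5) > baseline + k * sigma)
    return 100.0 * flagged / len(chunks)
```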
We show the percentage of windows where our algorithm alerts about the detour event in Figure 6.7.
If the attacker guesses the window size correctly, they can hide most attacks (at most 25% of windows detect them in our two scenarios). But if they guess incorrectly, we detect them with a higher probability in each observation window (we observe up to 50% of windows detect the detour correctly).
We show two scenarios in Figure 6.7.
In the first scenario, with 1 ms detour with 0.5 ms standard deviation (blue solid line), we observe only
17% of windows can detect the detour when the attackers correctly predict the window size (the perpendicular
Scenario                          Detour                         Mean of the window   Std. dev.   % of windows   Can the detection
                                                                 latency (ms)         (ms)        w detour       algorithm detect?
Scenario (i):                     Normal                         11.8                 0.3         –              Normal condition
Edge router is the destination    1.0 ms detour + 0.2 ms dev.    12.4                 0.2         0              No
                                  2.5 ms detour + 0.5 ms dev.    14.5                 0.5         100            Yes
Scenario (ii):                    Normal                         23.2                 0.6         –              Normal condition
Distances of the destinations     1.0 ms detour + 0.2 ms dev.    24.4                 0.7         0              No
(google.com)                      2.5 ms detour + 0.5 ms dev.    24.9                 0.6         45             Yes
Scenario (iii):                   Normal                         88.1                 0.6         –              Normal condition
Distances of the destinations     1.0 ms detour + 0.2 ms dev.    89.5                 0.2         9.1            No
(West Coast)                      2.5 ms detour + 0.5 ms dev.    90.5                 0.7         91             Yes
Other:                            Normal                         22.6                 0.6         –              Normal condition
Testing jitter                    1.0 ms detour + 0.4 ms dev.    23.6                 0.5         90             Yes
(for scenario (ii))               1.0 ms detour + 0.6 ms dev.    23.5                 0.6         68             Yes
                                  1.0 ms detour + 0.8 ms dev.    23.3                 0.6         68             Yes
Table 6.4: Detour detection in different scenarios
However, we observe 50% of windows detect the detour event when we pick a window size of 10 (the leftmost point on the blue line).
In the second scenario, we evaluate a detour that adds a longer delay of 2.5 ms with a 0.5 ms standard deviation (shown by the red dashed line in Figure 6.7). The two detour scenarios (blue solid line and red dashed line) show a similar pattern: any window size lower than the attacker-predicted window size (the perpendicular dashed line) provides more detection success.
6.8 Evaluating Detour Detection
Next, we explore how well our detour detection works in different network configurations. We first describe the experimental setup, where we measure through a 5G testbed (Section 6.8.1) along with the outside Internet to create different scenarios (Section 6.8.2) that test our algorithm (Section 6.8.3, Section 6.8.4, Section 6.8.5).
Then we show the efficacy of our algorithm in these scenarios.
6.8.1 Testbed Setup
We evaluate our algorithm in settings that closely resemble real-world deployments: the testbed component simulates the wireless environment, while the other destinations mirror real-world conditions. We utilize the same setup as shown in Figure 6.2, where the wireless section is from the UNH testbed, and
the additional destinations are situated at varying distances. We have a UE linked to the edge router via
a base station and an intermediate router (Figure 6.2). The testbed facilitates connectivity to the broader
Internet through the 5G base station, edge router, and gateway. The gateway establishes connections
with upstream providers, which, in turn, interface with other Autonomous Systems (ASes) to offer global
connectivity for accessing various destinations.
6.8.2 Testbed Scenarios and Evaluation
We test our algorithm in four scenarios to answer three questions. First, we examine how our algorithm
performs with destinations at varying distances (Section 6.8.3). Second, we evaluate our algorithm when
the detour shifts traffic to different distances (Section 6.8.4). Third, we assess our algorithm’s effectiveness
when there are different levels of jitter in the detoured path (Section 6.8.5).
We incorporate three distinct destinations to cover a wide array of real-world scenarios and to check our algorithm's effectiveness when the destinations are at different distances. A UE typically accesses various
destinations to retrieve content. Therefore, we utilize destinations at varying distances and introduce
detours to assess whether our algorithm can detect them effectively.
In the first scenario with a (i) nearby destination, we illustrate the integration of edge computing facilitated by 5G, with the destination positioned within the mobile network, akin to embedding edge computing before the packet gateway. In scenario (ii), the destination is a top-level webpage hosted by a CDN, where we expect the server to be located outside the mobile network but close to the users. As the top-level webpage, we use www.google.com, which we expect to be hosted in close proximity to the users. Lastly, scenario (iii)
represents a distant destination, where the server is on the West Coast while our UE resides on the East Coast, representing scenarios where destinations are far from the UE's location.
In scenario (iv), we use the same destination as scenario (ii)—a top-level webpage hosted by a CDN (we use www.google.com). However, in this scenario, we vary the jitter in the detoured path to show the impacts of jitter.
Injecting detours: To check how our algorithm performs with different detour distances, we vary the detour delay in each of the above scenarios. We vary the detour delay from 0.5 ms to around 10 ms, with a standard deviation of 20% of the detour delay. Detour delay is the added delay caused by the detour event; variable detour delay represents variable detour distances.
To test our algorithm with different jitter levels (scenario (iv)), we inject different jitters with a fixed
detour delay. We assume a fixed detour delay of 1 ms with variable standard deviation as different jitter
levels (from 0.1 ms to 2 ms).
In each of the aforementioned scenarios, we inject detours from an intermediate router—the router
linking to the edge router, as illustrated in Figure 6.2. To inject a detour, we utilize the Linux tc package [171] to add delay along with jitter. We introduce variable detour delays to emulate the behavior of malicious
entities at different distances with various jitter levels.
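For illustration, a minimal sketch of this kind of injection (the interface name and the specific delay values below are our assumptions for this example, not the exact testbed configuration) wraps the standard tc/netem commands from Python:

    import subprocess

    IFACE = "eth0"        # assumed egress interface on the intermediate router
    DETOUR_MS = 2.5       # emulated detour delay
    JITTER_MS = 0.5       # standard deviation (jitter) of the detour delay

    def set_detour(delay_ms, jitter_ms, iface=IFACE):
        # Add or replace a netem qdisc that delays every packet by delay_ms +/- jitter_ms.
        subprocess.run(["tc", "qdisc", "replace", "dev", iface, "root", "netem",
                        "delay", f"{delay_ms}ms", f"{jitter_ms}ms"], check=True)

    def clear_detour(iface=IFACE):
        # Remove the emulated detour.
        subprocess.run(["tc", "qdisc", "del", "dev", iface, "root"], check=True)

    if __name__ == "__main__":
        set_detour(DETOUR_MS, JITTER_MS)   # requires root privileges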
Running algorithm and evaluation approach: In all the scenarios, our algorithm learns the baseline and typical jitter during normal traffic, as described in Section 6.5. As the landmarks (Section 6.5.1), we utilize destinations at varying distances, as described in the three scenarios.
During an attack, we utilize this learned baseline and jitter to detect detours.
For our testbed experiment, we evaluate over 220 data points, using 10 data points in each window. We iterate our algorithm over around 22 windows to assess the percentage of windows signaling
a detour. We detect a detour event when three consecutive windows alert us about the possible detour
incident.
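A minimal sketch of this evaluation loop, under our reading of Section 6.5 (the B + 3σ threshold comes from Section 6.5.6; variable names and the alert-counting logic are illustrative, not our implementation):

    import numpy as np

    WINDOW_SIZE = 10         # data points per window in the testbed experiment
    CONSECUTIVE_ALERTS = 3   # windows in a row needed to declare a detour

    def detect_detour(latencies, baseline, jitter):
        # Return (detour_detected, fraction_of_windows_alerting).
        threshold = baseline + 3 * jitter        # B + 3*sigma (Section 6.5.6)
        windows = [latencies[i:i + WINDOW_SIZE]
                   for i in range(0, len(latencies) - WINDOW_SIZE + 1, WINDOW_SIZE)]
        alerts, streak, detected = 0, 0, False
        for w in windows:
            if np.percentile(w, 5) > threshold:  # 5th-percentile window latency
                alerts += 1
                streak += 1
                if streak >= CONSECUTIVE_ALERTS:
                    detected = True
            else:
                streak = 0
        return detected, alerts / len(windows)

    # Example using scenario (i) numbers from Table 6.4 (baseline 11.8 ms, jitter 0.3 ms):
    rng = np.random.default_rng(1)
    attack = 11.8 + 2.5 + rng.normal(0, 0.5, 220)   # 2.5 ms detour, 0.5 ms deviation
    print(detect_detour(attack, baseline=11.8, jitter=0.3))   # (True, ~1.0)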
Next, we demonstrate our algorithm under various detour conditions across different scenarios.
6.8.3 Efficacy when Destinations are at Different Distances
In this section, we evaluate our algorithm when the destinations are at different distances. We evaluate three scenarios—(i) a nearby destination, (ii) a top-level webpage hosted by a CDN, and (iii) a distant destination on the West Coast, as mentioned in Section 6.8.2. For the three scenarios, we inject different detour events and evaluate the efficacy of our detour detection algorithm.
When we have a (i) nearby destination inside the mobile network, our detection algorithm successfully detects detour events when the detour adds more than 1 ms of latency. As the baseline, we observe a low and stable latency: Table 6.4 shows the mean of the window latencies is only 11.8 ms with a 0.3 ms standard deviation. We are able to identify any detour event that adds more than 1 ms of detour delay. Figure 6.8a shows over 90% of the windows identify the detour event when the detour delay is more than 1 ms.
In the second scenario as well, (ii) a top-level webpage hosted by a CDN, our algorithm successfully detects most detour events. We still observe a stable latency, which is a requirement of our detour detection algorithm: a baseline (mean of the window latency) of 23.2 ms with a standard deviation of 0.6 ms (Table 6.4). So, even when the traffic goes outside of the mobile network, we can still get a stable baseline latency (although with slightly higher jitter compared to scenario (i)). Like the previous scenario, we cannot identify a 1 ms detour event. With a 2.5 ms detour with 0.5 ms standard deviation, our algorithm detects detours in 45% of the windows. Compared to the nearby destination inside the mobile network (scenario (i)), our algorithm identifies detours in fewer windows (45% compared to ∼100%) when we have a 2.5 ms detour. So, in this scenario as well, our algorithm can successfully detect detour events when the detour delay is sufficiently large (over 1 ms).
Figure 6.8b shows our algorithm identifies the detour in over 90% of the windows when the detour delay is over 5 ms.
We observe a high baseline when we have a distant destination on the West Coast ((iii) in Table 6.4): the mean of the window latencies is 88.1 ms. Even though the destination is far from the UE, we still observe a very stable baseline latency with only 0.6 ms of standard deviation. In this scenario as well, we can detect the detour when it is sufficiently large in magnitude. Figure 6.8c shows the success of our algorithm as the detour size increases. Like the first scenario, over 90% of windows detect detours when we have over 1 ms of detour delay. Across all these scenarios, our finding is that we can successfully detect detour events when the detour adds over 1 ms of extra latency.
6.8.4 Detection Success as Detour Distance Varies
We show our detection success by varying the distance of the detoured traffic. This variable distance
indicates different locations where the attackers shift the traffic before forwarding to the actual destination.
We add different delays to simulate this environment with different jitter levels. We test detour distances
in all the first three scenarios mentioned in Section 6.8.3.
Our detection algorithm is effective in most scenarios, except when the detour is minimal (e.g., a 1 ms detour with 0.2 ms jitter). When the detour adds an insignificant delay, our algorithm fails to register three consecutive detour detection signals. By comparing each mean and standard deviation in Table 6.4 to the baseline and jitter from normal conditions, we can discern why smaller detours remain undetected. In all scenarios, with a 1 ms detour and 0.2 ms deviation, the mean is within B + (3 × σ) (Section 6.5.6).
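For example, in scenario (i) the detection threshold is B + 3σ = 11.8 + 3 × 0.3 = 12.7 ms; the 1.0 ms detour yields a mean window latency of 12.4 ms, which stays below the threshold, while the 2.5 ms detour yields 14.5 ms, which exceeds it. Similarly, in scenario (iii) the threshold is 88.1 + 3 × 0.6 = 89.9 ms, so the 1.0 ms detour (89.5 ms) is largely missed while the 2.5 ms detour (90.5 ms) is caught.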
As the detour delay increases, more windows detect the detour event (the first three panels of Figure 6.8). In the first three scenarios, we observe an upward trend in the percentage of windows that detect the detour.
6.8.5 Detection Success based on Jitter
Next, we vary the jitter with a fixed detour distance to evaluate the detection success of our algorithm.
We use www.google.com as the landmark, as in scenario (ii), fix the detour distance at 1 ms, and vary the standard deviation up to 2 ms.
As depicted in Figure 6.8d, increasing the jitter size correlates with a reduction in the percentage of
detour signals (shown by the blue line). Elevated jitter introduces latency variations within each window,
leading to latency values closely aligned with the baseline. Consequently, in a highly variable environment
characterized by high jitter, our algorithm may fail to detect detours in numerous windows (represented
by the blue dots showing fewer windows that detect detours), as it relies on the window latency value
within a window to signal a detour.
As future work to address detours with high jitter, our algorithm could incorporate additional parameters, such as the percentage of packets exhibiting latency similar to the window latency value, in addition to the window latency alone.
6.9 Conclusion
In this chapter (and also in Chapter 5), we characterize cellular latency (Section 6.6) and propose an approach to detect malicious routing detours (Section 6.5). At first, we characterize the stability of cellular latency (Section 6.6) and show that window latency remains fairly stable (Section 6.6.2). After observing stable window latency, we propose an algorithm to detect malicious routing detours utilizing historic latency (Section 6.5). Using simulation, we show our method works with small detour events (Section 6.8.2). We also show that jitter may make our detour detection system less accurate.
We already showed DDoS defenses in Chapter 2 and Chapter 3. Then we showed a defense against brute-force password attacks (Chapter 4). In this chapter, we show a detection system against malicious routing detours.
Figure 6.8: Percentage (%) of time windows that detect detour in different scenarios. (a) Scenario (i): nearby destination; (b) Scenario (ii): destination is a top-level webpage hosted by a CDN; (c) Scenario (iii): distant destination in the West Coast; (d) Scenario (iv): impacts of different jitter levels in scenario (ii). Panels (a)–(c) plot the detection success rate (% of time windows) against a variable detour with a fixed fraction of jitter; panel (d) plots it against variable jitter with 1 ms of detour.
Landmark
www.google.com
www.facebook.com
www.facebook.com
www.amazon.com
www.hulu.com
www.netflix.com
www.twitter.com
www.ebay.com
www.oracle.com
www.youtube.com
Table 6.5: Landmark list.
All these defenses utilize measurement and existing protocols to prove our thesis statement (Section 1.1).
Appendix 6.A List of Landmarks
Our algorithm uses different landmarks to measure historic latency. Among these landmarks, our algorithm finds landmarks with stable latency. In Section 6.5.1, we show how we initiate the landmarks for our
algorithm. In this appendix, we show the complete list of the landmarks that we use for our experiments.
We use ten landmarks for our testing; they are all web pages that have globally distributed customers. Table 6.5 shows the list of landmarks. These web pages are normally hosted by CDNs with distributed footprints around the globe.
Chapter 7
Anycast Polarization in The Wild Internet
In this chapter, we describe a supplemental study that develops methods to detect and mitigate polarization problems in the wild Internet. Our previous study in Chapter 2 described a BGP playbook to mitigate DDoS attacks using routing changes in an anycast network. Here, we use routing changes to improve the user
latency of an anycast network.
This chapter describes the causes of polarization in real-world anycast and shows how to observe
polarization in third-party anycast services. We use these methods to look for polarization and its causes
in 7986 known anycast services. We find that polarization occurs in more than a quarter of services, and
identify incomplete connectivity to Tier-1 transit providers and route leakage by regional ISPs as common
problems. Finally, working with a commercial CDN, we show how small routing changes can often address
polarization, improving latency for 40% of clients, by up to 54%.
This chapter provides a new method that operates without changing existing protocols, but, unlike the prior chapters, we address a performance problem rather than an attack. Thus, this chapter provides evidence supporting part of the thesis, but not attack mitigation. Like all the previous studies, we show that measurements are important not only for designing security systems but also for improving user experience. This work is also a follow-up study to our anti-DDoS work using anycast (Chapter 2): we used traffic engineering to mitigate DDoS in Chapter 2, and in this chapter we use traffic engineering to improve user latency.
This work was published in the Passive and Active Measurement (PAM) conference in 2024 [196].
This work was also presented in the Internet Engineering Task Force (IETF) Measurement and Analysis
for Protocols (maprg) group meeting [137], and in the Latin America and Caribbean Network Information
Centre (LACNIC) Internet Measurements Working Group (IMWG) meeting [127].
7.1 Introduction
Anycast is a routing approach where anycast sites in multiple locations announce the same service. First
defined in 1993 [173], today anycast is used by many DNS and CDN services to reduce latency and to
increase capacity [247, 74, 78, 47]. With sites at many Points-of-Presence (PoPs), clients can often find one
physically nearby, providing low latency [206, 120]. With a large capacity of servers that are distributed
across many locations, anycast helps handle Distributed-Denial-of-Service (DDoS) attacks [144, 190].
With an anycast service, each client is associated with one of the anycast sites. For IP anycast, neither
clients nor the service explicitly choose this association. Instead, each anycast site announces the same IP
prefix and BGP Internet routing selects which site each client reaches, that site’s anycast catchment. While
BGP has flexible policies [31], these do not always optimize for latency or consider server performance [205,
131].
Polarization is a pathology that can occur in anycast where all traffic from an ISP ignores nearby, low-latency anycast sites and prefers a distant, high-latency site [22, 131, 120, 145]. Polarization can add 100 ms or more to the round-trip time if traffic goes to a different continent.
Polarization may happen for different reasons—incomplete transit connections, preference for a specific neighbor, and unexpected route propagation. A prior study showed that peering from a hypergiant picked one global exit from their corporate WAN, and so did not use the many local anycast sites of the .nl TLD [145]. That study showed the existence of polarization in two anycast networks; we still need to understand the root causes of polarization and evaluate these causes over many networks to understand how widespread this problem is. In this chapter, we close this gap by providing a longitudinal analysis of polarization in anycast networks.
This study makes three contributions. First, we describe two key reasons for polarization in anycast
services and show how they can be observed remotely in real-world anycast services (Section 7.3 and
Section 7.4).
Second, we look for polarization in 7986 known anycast networks, finding it in at least 2273 (28.5%), showing that polarization is a common problem (Section 7.5.1). We show incomplete connectivity of Tier-1
providers (Section 7.5.4) and unwanted route leakage by regional ASes (Section 7.5.5) are the key reasons
behind polarization. Our measurements show that polarization can have a large latency cost, often adding
100 ms latency for many users (Section 7.5).
Finally, we show how a commercial CDN provider uses traffic engineering techniques to address polarization (Section 7.6) for their DNS service. We show small routing changes may significantly improve the
overall anycast performance, and that community strings are an essential tool when other routing changes
do not work. We demonstrate that simple routing changes can produce a 54% improvement in the mean
latency, improving performance for 40% of all clients (Section 7.6.3).
Anonymization: We anonymize all the names of the anycast services for privacy reasons.
7.2 Related Work
Extensive prior work has studied anycast, with a focus on anycast topology, efficient routing, performance improvement, and DDoS mitigation.
Anycast topology: Different organizations design their anycast services in different ways. Anycast
services have topological differences in number of anycast sites [246], number of providers [140], and
regional or global deployment [265]. These topological differences affect the performance of anycast services. In this work, we show how topological relationships with Tier-1 and regional ASes affect anycast polarization, and how the same topology with different routing configurations can mitigate polarization problems.
Anycast latency: Providing low latency is one of the goals of anycast services, and multiple studies focus on anycast performance [131, 246, 34]. Prior studies described the importance of the number of sites [206],
upstream transit providers [140], selection of paths [131], stability in path selection [246], the impacts
of polarization [145], and path inflation [120] over anycast latency. Polarization can increase latency for
many clients [22, 145], a problem sometimes described as path inflation [131]; the cost of poor site selection
can be large [19, 34, 131]. Although a prior study shows the cost of polarization [145] in two networks, to
our knowledge we are the first to examine polarization across thousands of anycast services.
Anycast performance improvement: Prior studies showed different possible ways to improve the
performance of an anycast service. Li et al. proposed to use BGP hints to select the best possible routing
path [131]. Removing peering relationship [145] and selective announcement [140] also helped others
with performance improvement. In our study, we use multiple traffic engineering techniques to improve
the performance of anycast services.
Traffic engineering has been used in prior studies for other purposes like load balancing [184, 31, 81], traffic shifting [39, 226, 33], DDoS mitigation [124, 190], and blackhole routing in IXPs and ISPs [58, 85].
In our study, we show multiple traffic engineering techniques are required and BGP community strings
are essential to improve performance in cases when other traffic engineering methods do not work.
7.3 Defining Anycast Polarization and its Root Causes
Recent work defined polarization [145], long a bane of anycast services (for example, [22]). We next give our definition of polarization, and add to prior work with a characterization of the two primary root causes of polarization. These steps pave the way for our measurement of polarization (Section 7.4) and evaluation of its consequences in the wild (Section 7.5).
7.3.1 Defining Polarization
Polarization is when a source ends up in a distant site in an anycast service, even though there is a nearby
site that could provide lower latency.
We usually focus on polarization continent-by-continent, and many anycast services have sites on each
continent to avoid large inter-continental latency. We therefore ignore high latency on continents when
the anycast service has no sites there.
Many anycast sites are global and are willing to serve any client. However, some larger anycast services have sites that are local, deployed to serve only users in their host ISP (and sometimes its customers). We ignore this distinction because a service with local sites is usually large enough to have some global sites on each continent. We focus on continent-level latency increases of 50 ms or more, so this simplification has minimal effect on our results. (Other work considers “optimal” latency, but we
consider differences of a few milliseconds to be operationally unimportant.) In addition, it is often difficult
to correctly identify local sites from only third-party observation.
7.3.2 The Multi-PoP Backbone Problem
The multi-PoP backbone problem happens when a backbone network has points-of-presence (PoPs) in many places around the world, but forwards all traffic to a limited number of anycast sites that are not geographically distributed. We consider this case polarization when at least some clients on some parts of the backbone could get lower latency while other clients have to go to a distant site through the same backbone.
Figure 7.1: Two scenarios of multi-PoP backbone problems. (a) Incomplete Tier-1 connections: the EU site is a local site with only private peers, while the NA site is a global site connected to a Tier-1 transit, so some European clients are routed to North America. (b) Routing to a distant site: both the EU and NA sites are global sites connected to Tier-1 transits, yet some clients are still routed to the distant site.
By backbone, we mean both the Tier-1 providers (for example, as reported in customer-cone analysis [32]), and hypergiants (for example, Microsoft or Google), since both operate global backbone networks
and have many PoPs. Organizations with large backbone networks often peer in multiple Points of Presence (PoPs) around the globe. They often have many customers, either by providing transit service, or
operating large cloud data centers, or both. Connectivity of an anycast network with these backbones is
important since these backbones carry a significant amount of user traffic.
Polarization with multi-PoP backbones can occur for two reasons. First, although both the anycast
service and backbone have many PoPs, if they share or peer in only a few physical locations, traffic may
be forced to travel long distances, creating high latency.
Second, even when backbones and the anycast service peer widely, the backbone may choose to route
traffic to a single site. This scenario was described when both the Google and Microsoft backbones
connected to the .nl DNS service, with global Google traffic going to Amsterdam and ignoring anycast
sites in North America [145].
Figure 7.1 shows two examples of multi-PoP backbone problems. Often local anycast sites are deployed
in certain ASes for the benefit of that ISP’s customers. Such sites may serve the host AS, or a few of its customers, but if they are not connected to an IXP or a transit provider, they have limited
Figure 7.2: Regional leakage problem. (a) A regional AS connected as a private peer to the NA (local) site leaks the anycast prefix to its Tier-1 transit, attracting clients from Europe. (b) A regional AS connected as a transit at the NA (global) site relies on its Tier-1 upstream to propagate the prefix, and that well-connected Tier-1 attracts clients from Europe.
scope. Given no-valley routing policies [80], these sites are not widely visible. These sites will not be the preferred sites, even for clients on the same continent. Figure 7.1a shows a multi-PoP backbone problem,
where the site in Europe is a local anycast site, meaning it has only a few private peers, is not connected to
a popular IXP, and does not have any transit connections. On the other hand, the site in North America is
a global site connected to a Tier-1 provider. Since the Tier-1 AS has missing connectivity in Europe, we call this problem an incomplete Tier-1 connection. Due to the incomplete Tier-1 connection, some clients from Europe will go to North America (marked by two sad faces), where the Tier-1 AS is connected, resulting
in two different latency levels inside Europe.
Multi-PoP backbone polarization due to backbone routing choices is shown in Figure 7.1b. In this
scenario, a Tier-1 provider or hypergiant connects multiple sites in Europe and North America. However,
due to routing preference, we can see a client from Europe going to the North American site (marked by a
sad face).
7.3.3 The Leaking Regional Routes Problem
Our second class of polarization problems is leaking regional routes. In this scenario, the anycast operator
peers with a regional AS at some location, but that peering attracts global traffic and so incurs unnecessarily
high latency.
By regional ASes, we mean non-Tier-1 ASes that purchase transit from another provider.
This scenario causes polarization because of the prefer-customer routing policy common in many ASes.
Because the regional AS purchases transit, presumably from a Tier-1 AS, that transit provider will prefer
to route to the regional network and its customer, the anycast service, over any other anycast sites. Often
this preference will influence all customers of the Tier-1, and most Tier-1 ASes have many customers. In
addition, by definition, a regional AS has a limited geographic footprint, so its connectivity does not offset
the shift of the Tier-1’s customer cone. Thus the choice of the anycast service to peer with a regional
network can have global effects.
Regional ASes may be connected to the anycast service as a private peer or as a transit provider. As a
private peer, the regional ASes are not expected to propagate the anycast prefix to their Tier-1 upstream.
However, sometimes a regional AS with private peering may violate this assumption and propagate the anycast prefixes to its Tier-1 transit provider. We can see a route leakage event in Figure 7.2a where a
regional AS peer propagates routes to its Tier-1 transit and brings traffic from Europe (as shown by the
sad client in Europe).
The regional ASes may also serve as the anycast service’s transit provider at this site. As a transit
provider, these regional ASes rely on their upstream Tier-1 ASes to propagate their customer prefixes. We
can see such polarization in Figure 7.2b, where a regional AS is connected as a transit in North America
and propagates anycast prefix to its upstream Tier-1 provider. The upstream Tier-1 transit is globally
well-connected, and attracts traffic from Europe (illustrated by two sad faces in Europe).
Figure 7.3: Steps to find polarization problems and root causes. Starting from a full anycast list with 7986 prefixes (a recently published dataset), we find potential problems using pings from RIPE probes and an algorithm, and then, for the list with potential problems, find root causes using traceroutes from RIPE probes and an algorithm.
7.4 Detecting and Classing Polarization in the Wild
To meet our goal to study polarization in the wild, we must take third-party observations that can detect
polarization and its root causes.
Our measurement approach has three steps: First, we use prior work that identified anycast services [222] to get a list of /24 anycast prefixes to study. Second, we test each /24 anycast prefix for polarization by measuring latency from many locations with RIPE Atlas. Finally, for prefixes that demonstrate
polarization, we take traceroutes and use what they find to identify root causes for polarization. Figure 7.3
shows the steps to find polarization problems and their causes.
7.4.1 Discovering Anycast Services
We first need to find anycast services to search. Fortunately, prior work has developed effective methods
to discover anycast services [222]. We directly use the results of their work and begin with their list of
7986 anycast services with 3 to 60 anycast sites distributed around the world.
We evaluate each of these known anycast services in our study. However, we expect that the chance
of polarization is low for services with many anycast sites (more than 25); with many sites, there is often
one nearby. For services with few sites (less than 5), the sites are sparsely distributed and many clients
will observe high latency, even without polarization (Section 7.5.3).
7.4.2 Finding Potential Polarization
To find anycast polarization, we ping from the RIPE Atlas Vantage Points (VPs) to the known anycast
services and look for latency variability among the VPs of a continent. We use 100 RIPE Atlas VPs—72 worldwide VPs and, to ensure global coverage, 4 VPs from each of the Asia, South America, Africa, and Oceania continents—to analyze the latency from the RIPE VPs to the anycast destinations.
We detect polarization when some VPs get good latency and other nearby VPs see bad latency. We define “nearby” as all VPs on the same continent, and set fixed thresholds Tlo at 50 ms and Thi at 100 ms. Latency lower than Tlo guarantees that at least some VPs have access to a nearby anycast site, while the high threshold Thi shows that other VPs miss this site and reach a distant site. The combination of VPs with latency lower than Tlo and higher than Thi indicates polarization. Different thresholds may result in a different number of detected polarization problems.
Filtering to reduce false positives: Next, we show how we reduce the number of false positives for more accurate results. We may identify false polarization because of regional anycast services—covering only one or two continents, like Edgio [265]. Also, the initial anycast prefix list may not be 100% accurate because of the reassignment of addresses or falsely identified anycast prefixes. To ensure global coverage, we pick VPs from all the continents to verify that the network is a meaningful anycast service before looking for polarization. We evaluate the anycast services where some VPs from at least three continents get a low latency (< Tlo). Low latency from at least three continents ensures global coverage of the anycast sites.
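A minimal sketch of this detection rule (the data layout and function names are illustrative; the actual pipeline works on RIPE Atlas measurement results):

    T_LO_MS = 50    # some VPs on the continent reach a nearby site
    T_HI_MS = 100   # other VPs on the same continent reach a distant site

    def polarized_continents(latency_by_continent):
        # latency_by_continent: dict continent -> list of per-VP minimum RTTs (ms).
        return [c for c, rtts in latency_by_continent.items()
                if any(r < T_LO_MS for r in rtts) and any(r > T_HI_MS for r in rtts)]

    def is_meaningful_anycast(latency_by_continent):
        # Filter: VPs from at least three continents must see low latency (< Tlo).
        low = [c for c, rtts in latency_by_continent.items()
               if any(r < T_LO_MS for r in rtts)]
        return len(low) >= 3

    def has_potential_polarization(latency_by_continent):
        return (is_meaningful_anycast(latency_by_continent)
                and len(polarized_continents(latency_by_continent)) > 0)

    # Example: EU VPs split between a nearby site (~15 ms) and a distant site (~160 ms).
    measurements = {"EU": [12, 15, 160, 170], "NA": [20, 25, 30], "AS": [35, 40]}
    print(has_potential_polarization(measurements))   # True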
7.4.3 Finding Root Causes
Next, we describe the measurement methods that indicate each of the root causes for polarization that we
identified in Section 7.3. We take traceroutes to anycast services with potential polarization and examine
the penultimate AS hop as seen from different VPs.
Finding penultimate AS hop and its type: First, we find the penultimate AS hop in the AS path to the destination. We observe the whole AS path and pick the AS that is present just before the anycast service’s AS number. Multiple routers before the final destination may represent the same penultimate AS hop.
After getting the penultimate AS hop, we determine its type. We consider the CAIDA top 10 ASes as Tier-1 ASes, and the others as regional ASes. Hypergiants, like Google, Microsoft, Netflix, or Facebook, have heavy outbound traffic [149]; we classify them by their AS numbers.
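As a minimal sketch of this step (the AS sets below are hypothetical placeholders for illustration, not the CAIDA ranking or hypergiant list [149] that we actually use):

    # Hypothetical AS sets for illustration only.
    TIER1_ASES = {174, 1299, 2914, 3356, 6453, 6461, 6762, 6939}
    HYPERGIANT_ASES = {15169, 8075, 2906, 32934}   # Google, Microsoft, Netflix, Facebook

    def penultimate_as(as_path, service_asn):
        # Return the AS just before the anycast service's ASN in a traceroute-derived AS path.
        hops = [a for a in as_path if a is not None]   # drop unresolved hops
        if service_asn in hops and hops.index(service_asn) > 0:
            return hops[hops.index(service_asn) - 1]
        return None

    def classify_as(asn):
        if asn in TIER1_ASES:
            return "tier1"
        if asn in HYPERGIANT_ASES:
            return "hypergiant"
        return "regional"

    # Example AS path ending at a hypothetical anycast service, AS 64500:
    path = [64496, 3356, 1299, 64500]
    pen = penultimate_as(path, 64500)
    print(pen, classify_as(pen))   # 1299 tier1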
Finding multi-pop backbone problems: Multi-pop backbone problems happen when Tier-1 providers
or hypergiants are either partially connected or route traffic to a distant site. We identify partial connectivity when we find VPs with a Tier-1 AS or hypergiant as the penultimate AS hop in their paths to a distant site, and when no other VPs from the same continent reach a nearby site using the same penultimate
AS. To find routing problems to a distant site, we check for VPs from the same continent that get both good
and poor latency with the same Tier-1 AS or hypergiants in the AS path. We use our latency thresholds
(Section 7.4.2) to understand whether VPs are going to a nearby site or a distant cross-continent site. This
method cannot find poor routing instances when all the VPs are going to a distant site even though they
have nearby sites. These nearby sites cannot be identified by the traceroutes since their routing sends all
the VPs to the distant site. We must talk to the operators to learn about such poor routing cases.
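A minimal sketch of this classification (the record format and labels are illustrative; the records are assumed to be pre-filtered to VPs whose penultimate AS is a Tier-1 or hypergiant):

    T_LO_MS, T_HI_MS = 50, 100   # latency thresholds from Section 7.4.2

    def classify_backbone_problems(vp_records):
        # vp_records: list of dicts with keys 'continent', 'rtt_ms', 'penultimate_asn'.
        labels = {}
        for continent, asn in {(r["continent"], r["penultimate_asn"]) for r in vp_records}:
            group = [r for r in vp_records
                     if r["continent"] == continent and r["penultimate_asn"] == asn]
            near = any(r["rtt_ms"] < T_LO_MS for r in group)
            far = any(r["rtt_ms"] > T_HI_MS for r in group)
            if far and not near:
                labels[(continent, asn)] = "incomplete connection (all VPs reach a distant site)"
            elif far and near:
                labels[(continent, asn)] = "routing to a distant site (mixed good and poor latency)"
        return labels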
Finding leaking regional problem: We identify a leaking regional problem when a regional AS with
a smaller customer cone attracts a lot of traffic to a specific anycast location.
The regional AS can be connected as a transit, or as a private peer that is leaking routes to its upstreams. To identify a private peer leaking routes, we search for other Tier-1 ASes that are present in the penultimate hops and connected in multiple locations. The existence of other penultimate Tier-1 ASes indicates the likely real transit providers: when there is another Tier-1 transit, it is unlikely that the regional AS is connected as a transit provider, so the regional AS is probably a private peer that is leaking routes to its upstream. Conversely, if we do not find other Tier-1 ASes in multiple locations, the regional AS is possibly connected as a transit.
This approach can predict possible regional route leaking, but it may also happen that these regional
ASes are connected as transit providers. Knowing the exact peering relationship is only possible when we
can talk to the operators, as the inferred peering relationships from public databases are not always accurate, and corner cases like route leakage may lead to wrong inferred relationships [134]. Hence, to
validate actual route leakage we must talk to the operators.
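A minimal sketch of this heuristic (a rough inference only; as noted above, confirming the peering relationship requires talking to the operators, and the input structures are illustrative):

    def classify_regional_leak(regional_asn, tier1_penultimate_locations):
        # tier1_penultimate_locations: dict Tier-1 ASN -> set of locations where that
        # Tier-1 appears as a penultimate hop for this anycast service.
        other_tier1_widely_connected = any(len(locs) > 1
                                           for locs in tier1_penultimate_locations.values())
        if other_tier1_widely_connected:
            return (regional_asn, "private peer likely leaking routes "
                                  "(another Tier-1 is the probable transit)")
        return (regional_asn, "regional AS likely connected as a transit")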
7.4.4 Finding Impacts
After getting the polarization problems and their root causes, we want to see the impacts of polarization.
We find out the penultimate AS hop that is common in the paths to the distant site. We measure the
median and 95th percentile latency from a continent. Polarization results in inter-continental traffic, and
its impact is expected to show in high 95th percentile latency. Using the difference between the median
and 95th percentile latency, we show the impact of polarization.
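A minimal sketch of this impact calculation (the example numbers only roughly mimic the Anon-CDN-2 row of Table 7.3 and are not actual measurements):

    import numpy as np

    def impact_summary(rtts_by_continent):
        # Median and 95th-percentile latency per continent; a large gap between the
        # two suggests a tail of clients being routed to a distant site.
        return {c: (round(float(np.median(r)), 1), round(float(np.percentile(r, 95)), 1))
                for c, r in rtts_by_continent.items()}

    # 87% of EU VPs stay in Europe (~12 ms); 13% cross to North America (~170 ms).
    eu = [12] * 87 + [170] * 13
    print(impact_summary({"EU": eu}))   # {'EU': (12.0, 170.0)}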
Next, we show that polarization problems are common and that they have a significant impact on anycast performance.
7.5 Measurement Results and Impacts of Polarization
We next show how common polarization is in known anycast services, then how polarization affects service performance.
7.5.1 Detecting Polarization in Anycast Services
Our first goal is to understand how common polarization is. Following our methodology in Section 7.4,
we first examine all known anycast services for polarization. We see that of the 7986 examined services,
about 28% show potential polarization (Table 7.1).
In this study, we focus on the 626 anycast services that show polarization problems in Europe and
North America. We focus on these anycast services to detect root causes because these continents have
mature Internet service and multiple IXPs, so they are locations where polarization can occur and we
hope it can be addressed. We leave the study of polarization on other continents as future work, since
polarization that occurs due to incomplete in-country peering will be addressed as the domestic Internet
market matures and interconnects. (An anycast service that peers with one ISP cannot avoid polarization
with other ISPs in the same country if there is incomplete domestic peering.)
7.5.2 Detecting Root Causes
Given potential polarization in anycast services, we then apply root cause detection (Section 7.3) to these
services.
We find multi-pop backbone problems in 376 anycast /24 prefixes out of 626 prefixes (Table 7.1). We
observe a Tier-1 AS in the penultimate AS hop for these 376 anycast prefixes. Among these multi-pop backbone problems, our methodology finds 218 instances where a Tier-1 provider is incompletely connected.
We suspect this behavior when nearly all VPs from a continent have catchment in a distant site through
the same Tier-1 AS. We also find 158 cases when VPs route to a distant site. We suspect this event when
some VPs from a continent experience good latency (<50 ms) while others observe poor latency (>100 ms),
and when these VPs utilize the same Tier-1 AS in the penultimate AS hop.
Our methodology shows 233 cases where a regional AS leaks routes. Among these, in 177 cases we find
other Tier-1 ASes in the penultimate AS hop. As there are other Tier-1 ASes in the path, we suspect these
Category | Count | % | Conf.
Known anycast services | 7986 | 100 |
No observed potential polarization | 5713 | 72 |
Potential polarization | 2273 | 28 |
  In continents outside EU and NA | 1647 | 20 |
  In EU and NA | 626 | 8 | 18
    No class found (a) | 161 | 2 |
    Only multi-pop backbone problem (b) | 232 | 3 |
    Only regional leakage (c) | 89 | 1 |
    Both classes (d) | 144 | 2 |
    Multi-pop backbone problem (b+d) | 376 | 5 | 9
      Incomplete Tier-1 connections | 218 | 3 | 9
      Routing to a distant site | 158 | 2 | 0
    Leaking regional problem (c+d) | 233 | 3 | 9
      Leakage by regional | 177 | 2 | 6
      Leakage by regional transits | 56 | 1 | 3
Table 7.1: Detected polarization and inferred root causes (“Conf.” gives the number of cases confirmed with operators).
Provider | Potential problems | Total anycast prefixes
Anon-DNS-2 | 214 | 216
Anon-DNS-3 | 94 | 159
Anon-CDN-1 | 20 | 47
Anon-DNS-4 | 12 | 75
Anon-DNS-5 | 9 | 9
Table 7.2: Top anycast services with potential polarization problems
Tier-1 ASes are the real transits, and the regional ASes are possibly leaking routes. In 56 other instances,
we find no other Tier-1 ASes in the path. We suspect a regional AS is connected as a transit in these 56
cases.
We contacted the operators of 18 of these 626 cases. The operators confirmed all these polarization
events.
An anycast service provider may have multiple /24 prefixes, and because of their topological similarity
we find polarization problems in many of their /24 prefixes. Table 7.2 shows top providers who have
polarization problems in many of their /24 anycast prefixes. We can see some providers have polarization
in almost all of their anycast /24 prefixes.
Figure 7.4: Percent of anycast services that see polarization, as a function of the size of the anycast service (number of sites, in bins of 5). (a) Europe and North America only; (b) all other continents. Each bin is annotated with the number of anycast services it contains.
7.5.3 Impacts of the Number of Sites on Polarization
Does polarization correlate with the total number of anycast sites? Prior work [206] suggested that 12
sites can provide good geographic latency, but those results assume good in-continent routing. Does that
assumption hold in practice?
To answer this question, we explore the relationship between the number of anycast sites and the
degree of polarization. To get the number of anycast sites, we utilize the count reported by the recent
study [222] that we used to get the anycast prefixes.
Figure 7.4 shows how much polarization occurs relative to the number of anycast sites. We group
anycast services by number of sites into bins of 5 (so the first bin is 5 sites or less, the next is 6 to 10,
etc.). The number of services in each bin varies and is shown at the top of each bin, but it is always at least 52 services, and often hundreds. For each bin, we show the percentage of services that see polarization.
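A minimal sketch of this binning step (function and variable names are illustrative):

    from collections import defaultdict

    def polarization_by_bin(services, bin_width=5):
        # services: list of (num_sites, is_polarized) tuples.
        bins = defaultdict(lambda: [0, 0])
        for num_sites, polarized in services:
            upper = ((max(num_sites, 1) - 1) // bin_width + 1) * bin_width   # 1-5 -> 5, 6-10 -> 10, ...
            bins[upper][0] += 1
            bins[upper][1] += int(polarized)
        return {b: (n, round(100.0 * p / n, 1)) for b, (n, p) in sorted(bins.items())}

    print(polarization_by_bin([(3, True), (4, False), (12, True), (14, False), (15, False)]))
    # {5: (2, 50.0), 15: (3, 33.3)}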
We see some polarization in services of all sizes. But for Europe and North America, polarization is
most common in services with 30 or fewer sites (the left part of Figure 7.4a). For other continents, some
Provider | Reason | Common AS to the distant site | Med. / 95th (ms), EU | Med. / 95th (ms), NA | Example
Anon-CDN-2 | Incomplete Tier-1 | AS1299 | 12 / 173 | 25 / 88 | Poland to USA
Anon-Cloud-1 | Incomplete Tier-1 | AS3356, AS6939 | 51 / 158 | 55 / 162 | Canada to Germany
Anon-CDN-6 | Incomplete Tier-1 (peers as transits) | AS6762 | 16 / 122 | 39 / 141 | Greece to USA
Anon-CDN-3 | Exceptional incomplete Tier-1 | AS6453 | 32 / 113 | 25 / 39 | Germany to USA
Anon-CDN-1 | Leaking regional transit | AS1273 | 21 / 53 | 88 / 150 | USA to UK
Anon-DNS-1 | Leaking regional peer | AS4826 | 174 / 314 | 37 / 214 | USA to Australia
Anon-CDN-5 | Leaking regional peer | AS7473 | 39 / 248 | 24 / 252 | USA to Singapore
Anon-CDN-7 | Leaking regional peer (merging org) | AS209 | 29 / 84 | 26 / 65 | Finland to USA
Anon-DNS-2 | Leaking regional + Incomplete Tier-1 | AS4637, AS1299 | 165 / 301 | 81 / 254 | Canada to Tokyo
Table 7.3: Polarization in real-world anycast services (median and 95th-percentile latency by source continent)
polarization occurs regardless of how many sites the service has; in Figure 7.4b we see polarization even for services with 30 to 70 sites.
We conclude that services with many sites generally get good routing in continents with mature Internet markets. However, routing is more challenging, as shown by greater polarization, for services with
only a few sites, and when operating globally. In mature Internet markets, high rates of AS interconnectivity decrease the risk of polarization. However, outside EU and NA, anycast is more difficult to deploy well, likely because poor local inter-AS connectivity means some local customers cannot use a local anycast site.
Next, we show examples of such connectivity issues that cause polarization and their impacts on latency.
7.5.4 Impacts of multi-pop backbone problems
We next look at the problem of multi-PoP backbones (Section 7.3.2). We examine several examples of this
in real-world anycast services in Table 7.3 and show how it can result in extra latency of 100 ms or more
due to inter-continental traffic.
7.5.4.1 Incomplete Tier-1 connections in Anon-CDN-2
We find an anycast network of Anon-CDN-2 where a fraction of traffic from Europe was going to the US.
We believe this event is an example of polarization for incomplete Tier-1 connections. We find 87% of the
European VPs remain within the continent, and 13% of the other VPs end up in San Jose, USA using a fixed
Tier-1 AS.
We find European sites are globally well connected. We observe a Tier-1 AS (AS3356 - Level-3) connected to a European site. We believe the European site is a global site attracting VPs from many locations.
As proof, we find that African VPs have catchments in Europe. European sites also increase their connectivity by having many private peers connecting most small ASes within the continent. Even with this good connectivity, 13% of the European VPs have catchment in the US. As a result, we believe this is an
example of polarization.
The VPs that are going to North America have AS1299 (Arelion Sweden AB) in common within their
paths. We believe AS1299 is working as a transit although we do not know the contract type between
Anon-CDN-2 and AS1299 for that anycast site. Based on the other traceroutes, we did not find any path
that remains within Europe through AS1299. That means AS1299 has a missing connection in Europe.
Impacts: This cross-continent traffic has a significant impact on latency. As an example, we find a VP from Poland goes to San Jose, USA. The cross-continent traffic results in a high 95th percentile latency: while the median is only 12 ms, the 95th percentile latency is 173 ms (Table 7.3).
7.5.4.2 Multiple incomplete Tier-1 in Anon-Cloud-1
Next, we show an example from Anon-Cloud-1 where an anycast service uses two different Tier-1 ASes
in two different continents.
We find two global sites in Europe and North America connected through two Tier-1 ASes. We observe
traffic going to Dallas, USA using AS3356 (Level-3), and to Frankfurt, Germany through AS6939 (Hurricane
Electric). Both these ASes are Tier-1 ASes and have a wide range of connectivity. We are unsure about
their contract type with Anon-Cloud-1. These may be incomplete transit connections, or one of these Tier-1 ASes may have a private peering relationship with Anon-Cloud-1 but work like a transit.
Impacts: When two big Tier-1 ASes have incomplete transit connections in two continents, we observe a significant fraction of cross-continent traffic: 40% of VPs from Europe and North America end up in a different continent. The cross-continent traffic using AS3356 and AS6939 adds over 100 ms of latency, making the 95th percentile latency 3× the median latency (Table 7.3). This example shows how bad the impact can be when two Tier-1 providers are
connected incompletely in two continents. As an example of cross-continent traffic, we observe traffic
from Denmark goes to Dallas, USA, and traffic from Canada goes to Germany.
7.5.4.3 Incomplete Tier-1: peers working as transits in Anon-CDN-6
We find another case where a Tier-1 AS has incomplete connections with Anon-CDN-6. We contacted the
operators, and they confirmed a private peering relationship with this Tier-1 AS. But that Tier-1 AS has a huge customer cone, and its policy is to propagate the routes to its global customer cone.
In this event, Anon-CDN-6 has a site in Miami, USA, which is connected to AS6762
(Telecom Italia) as a private peer. The operators confirmed that AS2914 (NTT America) is the real transit provider since it is connected to three continents—Asia, Europe, and North America. AS6762 is only
Figure 7.5: Anon-CDN-3: incomplete inter-AS connection. AS6453 is the penultimate AS at both the EU and NA sites; AS3356 reaches AS6453 in both continents, while AS1299 reaches AS6453 only in North America, pulling some European clients to the NA site.
connected in one location. The anycast operators did not expect the prefix to be propagated out of the
continent by AS6762 since it is connected as a private peer. But in reality, it was propagated to other ASes
connected in other continents.
Impacts: AS6762 propagates the anycast prefix out of the continent. As a result, we found 10% of VPs from Europe have catchment in Miami, USA, and experience over 100 ms of extra latency.
7.5.4.4 Exceptional incomplete Tier-1: incomplete inter-AS connections
We find a case when we observe a polarization incident even with complete Tier-1 connections. We find
this case using manual observation of the traceroutes when we could not find a proper classification of the
polarization problem.
Traceroutes to Anon-CDN-3 show AS6453 (Tata Communications) as the penultimate AS hop in many different paths to Europe and North America. This behavior indicates AS6453 is working as a transit
connected in both Europe and North America with Anon-CDN-3 (Figure 7.5). So, we define this connectivity
as a complete transit connection. Even with this complete connection, we observe some European VPs have
catchments in North America, which results in polarization and high latency.
Many traceroutes show Tier-1 ASes like AS3356 and AS1299 just before the penultimate hop AS6453,
as we can see from Figure 7.5. The presence of these Tier-1s two hops back from the anycast site suggests
that this anycast network mostly relies on the Tier-1 ASes to get into AS6453 and then to Anon-CDN-3.
We find the VPs from Europe with AS1299 in the paths end up in North America through AS6453. We
suspect AS1299 is connected to AS6453 only in North America, or for Anon-CDN-3, AS1299 has a preference
for the North American connection. In contrast, AS3356 has connectivity with AS6453 in both continents, and as a result, we do not observe any performance issues through it.
Impacts: We find 18% of the European VPs choose a path through AS1299 to go to North America to
connect to AS6453 and Anon-CDN-3. While the median and 95th percentile latency differ little for the North American VPs, because of this polarization, European VPs observe a 95th percentile latency over 3× the median (Table 7.3).
7.5.5 Impacts of Leaking Regional Problems
We next turn to leaking regional (Section 7.3.3), our second class of polarization problems. We find several
cases where regional ASes (non-Tier-1 ASes) send a great deal of traffic to a distant site. These regional ASes purchase transit from a Tier-1, and so polarization results when their transit-providing Tier-1 adopts a prefer-customer routing policy. Alternatively, these regional ASes are private peers but make unwanted route propagation to their upstreams.
7.5.5.1 Leaking by regional transits in Anon-CDN-1
We find a polarization instance because of the leaking of an anycast prefix by a regional AS in one of
Anon-CDN-1's anycast networks. We confirmed this event with the anycast operators, who informed us that the regional AS is connected as a transit provider.
In this polarization problem, we find a significant portion of traffic from North America ends up in
Europe using AS1273 (Vodafone). We find VPs from all over the world use big Tier-1 ASes like AS1299
(Arelion Sweden AB) to reach European sites of Anon-CDN-1. Contacting Anon-CDN-1 operators about this
issue, we confirmed that AS1273 is indeed connected as a transit for their anycast network. We did not find
any VPs having AS1273 in the path that stays within North America. We believe AS1273 is only connected
in Europe, not in North America, resulting in incomplete regional transit connections and polarization. We
also confirmed this finding with Anon-CDN-1, and they informed us that AS1273 is connected to multiple
countries in Europe.
Impacts: Since AS1273 is only connected in Europe as a transit, we find around 64% of the North
American VPs end up in Europe; only 36% of VPs stay within North America. As an example of cross-continent traffic, we find a VP from the USA that goes to London, UK. As a result, both the median and 95th percentile latency are high for this polarization (Table 7.3).
7.5.5.2 Leaking by regional transits in Anon-DNS-6
A global DNS provider (Anon-DNS-6) peers in several locations. A South American site peers with a
regional network (AS65112, PIT-Chile) who purchases transit from a Tier-1 provider (AS174, Cogent).
Because of AS174’s prefer-customer routing policy, peering with the regional network causes all customers
of the Tier-1 provider in North America and Europe to go to South America.
Impact: Cross-continent routing adds 100 ms or more latency, so Anon-DNS-6 instead prevents route
announcements to this Tier-1 from this site. With limited routing, this anycast site is unavailable to other
regional networks in South America that can be reached via transit.
A better solution to this problem would be for Anon-DNS-6 to peer with all regional networks, or to influence routing inside the Tier-1 AS, neither of which is easy.
7.5.5.3 Possible route leakage by regional AS in Anon-DNS-1
We identify polarization in Anon-DNS-1 anycast network due to route leakage by a regional peer.
We suspect this anycast network has route leakage because many VPs have a non-Tier-1 AS (AS4826:
Vocus Connect, Australia) as the penultimate AS hop in all the paths to the distant site, and some other
VPs have a Tier-1 AS (AS3356) in multiple AS paths. Our assumption is that AS3356 is the actual transit,
and AS4826 is connected as a peer but does not behave as one; rather, it leaks Anon-DNS-1 routes to its peers and transits. We find AS4826 propagates its routes to many different peers, including a Tier-1
AS (AS6939). Since AS6939 is well-connected to the rest of the world, it brings a large fraction of VPs to
AS4826.
We must talk to the operators to know whether AS4826 is really a private peer or a transit provider.
However, having a transit connection with a smaller AS is highly unlikely since this anycast network has
connectivity with another Tier-1 AS with a large geographic presence.
Impacts: We observe severe polarization due to this route leakage by AS4826, where AS4826 is only
connected in Australia. Since AS4826 is propagating its routes to AS6939, and since AS6939 is a well-connected Tier-1 AS, traffic from all over the world ends up in Australia. We find a case where traffic
from the Ashburn, USA connects to AS6939 in Los Angeles, USA, then travels to AS4826 in San Jose, USA,
and then ends up in Sydney, Australia, resulting in over 200 ms of latency. From Europe, we find a case
when a VP from Switzerland travels to Ashburn, USA, Los Angeles, USA, San Jose, USA, and then Sydney,
Australia, which takes over 300 ms latency.
7.5.5.4 Route leakage by a regional AS in Anon-CDN-5
We find another polarization event due to route leakage by AS7473 (Singapore Telecom). We confirm with Anon-CDN-5 that AS7473 is connected as a private peer but propagates the Anon-CDN-5 prefix to a big Tier-1 AS (AS6461 - Zayo Bandwidth). Since AS6461 is connected heavily with
the rest of the world, we find many VPs from Europe and North America end up in Singapore. From the
traceroutes, we find a Tier-1 AS (AS1299) is the real transit connected to different locations worldwide.
Impacts: We find 7% of VPs from Europe and 6% of VPs from North America end up in Singapore, even though this anycast network has multiple sites in those continents. While VPs that stay within the continent
observe low latency, due to this long path to Singapore, we observe over 200 ms difference between the
median and 95th percentile latency (Table 7.3).
7.5.5.5 Regional route leakage: a special case when organizations merge
We find a polarization problem in one of the CDN services (Anon-CDN-7) due to the merging of two organizations.
We find this case by looking over the traceroutes from the VPs to a distant site for a specific prefix
of Anon-CDN-7. We find “AS3356 AS209” in many of the paths that show bad latency to a distant site.
In that particular anycast network, AS209 (Century Link) is connected only in one location. Since AS209
propagates its routes to AS3356 (Level-3), and since AS3356 is a big Tier-1 AS with global connectivity,
we observe a significant fraction of cross-continent traffic to the distant site where AS209 is connected.
Contacting the CDN, we learn that AS209 is connected as a private peer, and so it should not propagate
the CDN prefix to a big Tier-1 provider. We suspect that this issue occurs because of the merging of AS209
and AS3356 [142]. We confirmed this incident with the operators of Anon-CDN-7.
Impacts: The Anon-CDN-7 network peers with AS209 at its Sterling, USA location. Since AS209 propagates its
route to AS3356, we find VPs from Europe and even from Asia have catchment in the USA, resulting in
over 200 ms of latency.
7.5.6 Combination of Problems
We find several Anon-DNS-2 prefixes where we believe multiple connectivity problems exist that cause
severe polarization.
Anon-DNS-2 has multiple anycast prefixes that are announced from different locations connected to
different transits and peers. We are certain that Anon-DNS-2 has anycast sites at least in North America,
Europe, and Asia since some of the VPs from these continents experienced good latency. However, the
big Tier-1 ASes like AS1299 or AS174 are only connected in North America. Based on our traceroutes, we also find European and Asian sites have peers that are not big Tier-1 ASes. With connectivity like this,
we expect most North American VPs will stay within North America because of the Tier-1 AS, and some
European and Asian VPs will go to North America because of the incomplete transit connectivity.
In reality, we observe cross-continent traffic in different directions. We find the peer in Asia (AS4637, Telstra Global) leaks routes (or works as a transit) to its Tier-1 providers. Since that Tier-1 provider
is well-connected, we find North American traffic goes to Asia. We also find Asian VPs going to North
America since North American sites are connected to Tier-1 ASes. We find only a few European VPs stay
within the continent because of their local peers, but a significant portion moves to North America and
Asia. This cross-continent traffic results in increased latency, in some cases they add more than 200 ms of
latency.
7.6 Improvement by Routing Nudges
We have already shown that polarization exists in many different anycast services and that it can have a significant impact on performance. However, small changes in routing can often address polarization, even without adding new peering relationships with other ASes. In this section, we show two examples of how Anon-CDN-4 improves anycast performance in two anycast systems.
7.6.1 Anycast Configuration
Anon-CDN-4 uses multiple anycast networks for their DNS services to ensure reliability. These anycast networks are served from about 268 sites located in 93
cities distributed around the world. Each site has multiple machines to serve the client load. Each site
has different upstream connectivity, with different numbers of peers, network access points, and transit
providers. Anon-CDN-4 uses multiple Tier-1 ASes as the transits for their anycast networks. These transits
differ in each anycast network.
We show improvements in two anycast networks after Anon-CDN-4 makes routing changes. The first anycast network has a presence in 15 cities covering 9 countries. AS2914 (NTT America) provides transit connectivity and is connected in multiple geographic locations covering Asia, Europe, and North America. Other peers and network access points are also connected to the anycast sites. The second anycast network covers 17 cities in 11 countries. AS1299 (Arelion Sweden) provides transit for this anycast network, covering Europe and North America.
7.6.2 Routing problems
The two anycast services mentioned in Section 7.6.1 have different routing problems. In the first anycast
service, Anon-CDN-4 encounters two different problems: multi-pop backbone problems (Section 7.3.2), and
leaking regional problems (Section 7.3.3). The second anycast service only has a leaking regional problem
(Section 7.3.3).
First anycast service has both multi-pop backbone and leaking regional problems: The first anycast service has multi-pop backbone problems with multiple Tier-1 ASes. Anon-CDN-4 connects to AS1299
(Arelion Sweden) and AS3356 (Level-3) as private peers in Dallas, USA. We also find AS6762 (Telecom Italia)
as a private peer connected in Virginia, USA, and Milan, Italy. Anon-CDN-4 operators expected these peers
to confine their announcements within a smaller customer cone. But we find that these peers propagate
Anon-CDN-4 routes to the rest of the world. We suspect these peers treat Anon-CDN-4 as their customers
since they are also connected as a transit for other anycast prefixes of the same anycast service.
In the second problem of the first anycast service, Anon-CDN-4 connects to AS209 (Lumen) as a private peer in Virginia, USA, and AS209 propagates these routes to AS3356. Since AS3356 is well connected to the rest of the world, Anon-CDN-4 observes cross-continent traffic to Virginia, USA.
Second anycast service has a leaking regional problem: In the second anycast service, Anon-CDN-4 has a regional private peer (AS7473) connected in Singapore that leaks its routes to other upstream Tier-1 ASes. As a result, Anon-CDN-4 observes cross-continent traffic from other continents to Singapore.
7.6.3 Solving Problems
Anon-CDN-4 solves these performance issues by changing their routing configuration.
Solving two problems in the first anycast service: To solve the two problems in the first anycast service, Anon-CDN-4 stops announcing to the peers that were causing the polarization problem. Anon-CDN-4
blocks announcements to each of the Tier-1 private peers (AS1299, AS3356, and AS6762), and to AS209 to
prevent the propagation of routes to AS3356.
Solving the leaking regional problem in the second anycast service: For the second anycast service, Anon-CDN-4 tries two things. Since many local VPs were benefiting from AS7473, Anon-CDN-4 realized that blocking AS7473 might result in even worse performance overall. That is why, instead of blocking the announcement to AS7473, Anon-CDN-4 takes a more cautious approach. In one change, they prepend twice from the Singapore location so that fewer VPs end up in Singapore. In another change, they use community strings to tell AS7473 to keep the Anon-CDN-4 prefix within the Asia and Oceania regions. Only the second change results in better performance overall, which we describe next.
[Figure 7.6: Changes in anycast catchment for an anycast site due to a routing change. Panels: (a) Anycast 1: Dallas, USA (before); (b) Anycast 1: Dallas, USA (after); (c) Anycast 1: Virginia, USA (before); (d) Anycast 1: Virginia, USA (after); (e) Anycast 1: Milan, Italy (before); (f) Anycast 1: Milan, Italy (after); (g) Anycast 2: Singapore (before); (h) Anycast 2: Singapore (after). Each world map shows VPs over land and ocean, colored by RTT: 0-50, 50-100, 100-150, and >150 ms.]
This example shows the importance of having multiple routing configurations for traffic engineering, and using the one that results in the best performance.
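As an illustration of this idea, the sketch below compares two candidate routing configurations by their measured per-continent latency improvement and selects the better one. The improvement values mirror Table 7.4, but the data structure, weighting, and function names are our own illustration, not Anon-CDN-4's tooling.

    # Minimal sketch: given measured per-continent improvements (%) for each
    # candidate routing configuration, pick the one with the best overall outcome.
    # Values follow Table 7.4; the equal weighting is an illustrative choice.

    candidates = {
        "community strings (Asia-Pacific only)": {
            "Africa": 10.5, "Asia": 12.3, "Europe": 34.8,
            "North America": 19.7, "Oceania": 2.3, "South America": 1.9},
        "prepend twice from Singapore": {
            "Africa": 3.8, "Asia": -9.6, "Europe": 6.5,
            "North America": 6.4, "Oceania": 6.3, "South America": 3.8},
    }

    def mean_improvement(per_continent):
        return sum(per_continent.values()) / len(per_continent)

    best = max(candidates, key=lambda name: mean_improvement(candidates[name]))
    print(best)   # "community strings (Asia-Pacific only)"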
Measuring performance, before and after: We use Anon-CDN-4’s internal measurement system, observing latency from about 2300 global VPs. These vantage points have global coverage, with 74 African, 839 Asian, 772 European, 315 North American, 90 Oceanian, and 171 South American VPs.
7.6.3.1 Changes in the catchments
After making the routing changes, Anon-CDN-4 observes significant changes in the catchment distribution. We show the catchment distribution based on Anon-CDN-4’s internal catchment measurements. We examine the VPs going to an anycast site before and after the routing change. In Figure 7.6, we visualize the geolocations of the VPs of an anycast site, along with the measured latency. Each point on the map represents a VP, and the colors represent the latency from that VP to the anycast site.
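One way to produce maps like Figure 7.6 from such per-VP data is sketched below. The CSV file and its column names are hypothetical, the RTT buckets match the figure legend, and we plot a plain latitude/longitude scatter rather than the map projection used in the figure.

    import csv
    import matplotlib.pyplot as plt

    # RTT buckets matching the Figure 7.6 legend (ms), with one color per bucket.
    BUCKETS = [(0, 50, "tab:green"), (50, 100, "tab:olive"),
               (100, 150, "tab:orange"), (150, float("inf"), "tab:red")]

    def color_for(rtt):
        for low, high, color in BUCKETS:
            if low <= rtt < high:
                return color
        return "gray"

    lons, lats, colors = [], [], []
    with open("catchment_vps.csv") as f:          # hypothetical: vp,lon,lat,rtt_ms
        for row in csv.DictReader(f):
            lons.append(float(row["lon"]))
            lats.append(float(row["lat"]))
            colors.append(color_for(float(row["rtt_ms"])))

    plt.scatter(lons, lats, c=colors, s=8)
    plt.xlabel("longitude")
    plt.ylabel("latitude")
    plt.title("VPs in one anycast site's catchment, colored by RTT")
    plt.savefig("catchment_map.png", dpi=150)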
Catchment changes in the first anycast service: For the first service, Anon-CDN-4 blocks their private peering with AS1299 and AS3356 from Dallas.
The topmost left graph (Figure 7.6a) shows the catchment before making the change. As we can see, many VPs from Europe and Asia (shown by red dots) end up in Dallas, USA. As a result, these VPs experience bad latency (over 100 ms). Traceroutes confirm that cross-continent VPs use AS1299 and AS3356 to reach Dallas, USA. After blocking the announcement to AS1299 and AS3356, we observe no cross-continent VPs from Europe and Asia (Figure 7.6b). Only VPs from the US have their catchment in Dallas, USA, and they experience better latency (less than 50 ms).
Anon-CDN-4 also blocks the private peers AS209 and AS6762 from Virginia, USA. With these two ASes, Virginia, USA was receiving traffic from different continents (Figure 7.6c). After blocking the announcement, the Virginia, USA site receives traffic mostly from North and South America (Figure 7.6d).
Anycast service | Routing changes | Improvement (%): Africa / Asia / Europe / North America / Oceania / South America
Service-1 | Blocking AS1299 and AS3356 from Dallas, USA; AS209 and AS6762 from Sterling, USA; and AS6762 from Milan, Italy | 5.4 / 23.4 / 54.6 / 23.0 / 2.2 / -0.15
Service-2 | Announcement to AS7473 only within Asia Pacific, blocking others (using community strings) | 10.5 / 12.3 / 34.8 / 19.7 / 2.3 / 1.9
Service-2 | Announcement to AS7473 with two prepends | 3.8 / -9.6 / 6.5 / 6.4 / 6.3 / 3.8
Table 7.4: Continent-wise improvement in latency by routing changes
We also observe fewer cross-continent VPs when we block AS6762 from Milan, Italy (Figure 7.6e and Figure 7.6f).
Catchment changes in the second anycast service: In the second anycast service, the Singapore site was receiving traffic from other continents (Figure 7.6g). Anon-CDN-4 uses community strings to keep the announcement propagation within the Asia and Oceania continents. As a result, we can see fewer VPs going to Singapore from other continents (Figure 7.6h).
7.6.3.2 Impacts on Performance
We have shown above the catchment changes after deploying the new routing configurations. Next, we
show how the performance changes.
Improvement in the first anycast service: After the changes in the first anycast service, we find the most improvement in Europe, Asia, and North America (Table 7.4). We find 54.6%, 23.4%, and 23.0% improvement in mean latency in these continents, respectively. Many European and Asian VPs were going to Dallas, USA, and Virginia, USA. After the new announcement, we no longer see this cross-continent traffic.
[Figure 7.7: CDF of all the VPs with respect to latency difference (ms); each panel compares the normal run-to-run difference against the difference after the routing change. Panels: (a) Service 1: blocking in Dallas, USA, Virginia, USA, and Milan, Italy; (b) Service 2: changes in Singapore, announcing only within Asia Pacific.]
Even though we block announcements from the North American sites, we still observe 23.0% improvement among the North American VPs. This is because many North American VPs on the West Coast had catchments at the Dallas and Virginia sites. After the routing change, their traffic goes to nearby sites instead. Performance in the other continents remains mostly stable.
Improvement in the second anycast service: After the change in the second anycast service using community strings, we find the most improvement among European and North American VPs. From Table 7.4, we can see that European VPs observe 34.8% improvement and North American VPs observe 19.7% improvement. This is because several European and North American VPs had catchments in Singapore. After the new announcement, we do not observe this type of cross-continent traffic. We observe improvement in the other continents as well.
New latency distribution in both anycast services: Since Anon-CDN-4 blocks routing announcements to different peers in different geo-locations, VPs that depend on the blocked peers may observe worse performance. However, if there are nearby sites, these VPs may be redirected to them after the routing changes and still observe good latency. To understand how many VPs get worse latency after the routing change, we show a CDF of the per-VP latency difference in Figure 7.7. We measure the latency decrease for each VP after the routing change. A positive difference indicates improved performance (light green region in Figure 7.7), and a negative difference indicates degraded performance (light red region in Figure 7.7). Since latency may vary slightly between two measurements, the graphs include a normal-difference line (blue lines) that shows the regular latency variation without any routing change. The orange lines show the latency decrease after the routing changes.
In the first anycast service, we can see that 40% of the VPs get lower latency after the routing change (Figure 7.7a). At the other end, the blue and orange lines overlap, which indicates that no significant number of VPs gets worse latency. For the second anycast service, around 15% of the VPs observe lower latency (Figure 7.7b), while most other VPs observe only the regular difference (the blue and orange lines overlap).
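The per-VP latency-difference CDF in Figure 7.7 can be computed directly from two rounds of measurements. The sketch below assumes two hypothetical dictionaries mapping VP IDs to median RTTs before and after the change; running the same computation on two "before" rounds gives the normal-difference baseline.

    import numpy as np
    import matplotlib.pyplot as plt

    def latency_decrease(before, after):
        """Per-VP latency decrease (ms); positive means the VP improved."""
        common = before.keys() & after.keys()
        return np.array([before[vp] - after[vp] for vp in common])

    def plot_cdf(values, label):
        xs = np.sort(values)
        ys = np.arange(1, len(xs) + 1) / len(xs)
        plt.plot(xs, ys, label=label)

    # hypothetical inputs: {vp_id: median RTT in ms}
    before = {"vp1": 180.0, "vp2": 45.0, "vp3": 210.0}
    after  = {"vp1":  60.0, "vp2": 44.0, "vp3":  70.0}

    plot_cdf(latency_decrease(before, after), "Routing change")
    plt.xlabel("Latency difference (ms)")
    plt.ylabel("CDF of all probes")
    plt.legend()
    plt.savefig("latency_diff_cdf.png", dpi=150)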
7.6.3.3 Community strings are important
Anon-CDN-4 also attempts to use path prepending at their Singapore location to stop cross-continent traffic to Singapore. (Since many local VPs depend on AS7473 to reach the Singapore site, Anon-CDN-4 did not want to fully block the announcement through AS7473.) Table 7.4 shows the outcome after they prepend the path twice. Even after prepending twice from Singapore, they could reduce mean latency by only 6.5% in Europe and 6.4% in North America. At the same time, mean latency becomes 9.6% worse in Asia. Path prepending is an easily available traffic engineering tool: anycast operators can prepend without requiring support from their upstream providers. However, as this result shows, path prepending may not always be effective.
On the other hand, when we restrict the announcements to the Asia and Oceania regions using community strings, we observe significant performance improvement for all continents (Table 7.4). We recommend that anycast operators choose transit providers that support BGP community strings.
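BGP communities (RFC 1997 [39]) are 32-bit tags conventionally written as ASN:value, where the upper 16 bits carry the AS number and the lower 16 bits carry a provider-defined action code. The sketch below shows this encoding; the specific ASN and action value are hypothetical placeholders, since each provider documents its own action communities.

    # Encode and decode RFC 1997 community values (ASN:value form).
    def encode_community(asn, value):
        assert 0 <= asn < 2**16 and 0 <= value < 2**16
        return (asn << 16) | value

    def decode_community(raw):
        return raw >> 16, raw & 0xFFFF

    # Hypothetical example: ask a provider (AS64500 here) to keep the announcement
    # regional; the action code 120 is a placeholder, not a real provider community.
    raw = encode_community(64500, 120)
    print(hex(raw), decode_community(raw))   # 0xfbf40078 (64500, 120)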
Anon-DNS-6 has also explored the use of community strings to adjust their routing. However, in some locations, their sites are hosted within research and educational networks that do not provide broad support for community strings for routing adjustment. Simple routing changes then require customized network administration and coordination with other networks, which may not always be possible, making it challenging to apply community strings. This example calls for community string conventions to be standardized, and small or non-traditional IXPs should be encouraged to follow the conventions used at large IXPs.
7.7 Conclusion
This chapter proposes a way to discover and resolve polarization problems in anycast services. We evaluate 7,986 anycast prefixes and show that the polarization problem is common in the wild Internet. We present our method to classify polarization problems and demonstrate two different classes of them. Our evaluation shows that the causes of both are common in known anycast prefixes. Polarization can send clients to a distant site, often on a different continent, even when the clients have a nearby anycast site. Because of these cross-continent routes, clients can observe over 100 ms of extra latency. We show that some polarization problems can be solved using traffic engineering techniques: small changes in routing policy can improve latency for many VPs. We also show that network operators should have multiple traffic engineering techniques available, including BGP community strings, to improve the performance of their anycast services.
To prove our thesis statement, in this chapter we show an approach to improve anycast performance without changing existing protocols. We use measurements to detect performance issues, and we utilize traffic engineering techniques to improve performance. This work is a follow-up to Chapter 2, which also uses traffic engineering techniques, there to mitigate DDoS impacts. Unlike other studies, this supplemental work shows that measurement techniques are useful not only to improve security, but also generalize to related domains, such as improving anycast performance.
Chapter 8
Conclusion
This dissertation describes methods utilizing measurement to tackle “service disruptive attacks” without changing existing protocols (Section 1.1). We proved this thesis using five studies along with a supplemental study. First, we propose two defense systems against DDoS attacks in which we utilize existing protocols and measurement techniques (Chapter 2 and Chapter 3). Second, we propose a defense system against brute-force password attacks utilizing the existing IPv6 address space (Chapter 4). Third, we use measurements to characterize mobile latency (Chapter 5) and then design a system to detect malicious routing detours (Chapter 6). Fourth, in our supplemental study (Chapter 7), we show that Internet measurements are useful not only to improve security, but also to detect and resolve performance issues.
Next, we discuss possible future directions and remaining challenges. Then we conclude this thesis by summarizing the key concepts and contributions.
8.1 Future Directions
This thesis shows measurement techniques to improve the security and performance of different systems.
However, to advance the field, there are several directions and challenges that remain open for future
study.
8.1.1 Future work related to our existing studies
First, we describe future work arising from the studies in this thesis. Then we show the new directions that this thesis opens up.
Next steps in DDoS defense using anycast: In Section 2.5.4, we show how we can estimate the attack size using the logs from a server. We show the effectiveness of this approach in Section 2.6, evaluating it with two DDoS events from the B-root server. As future work, we need to evaluate attacks with more diverse intensities.
In Section 2.8.2, we show that BGP communities can give us better control over traffic distribution; however, these communities are not available through all ISPs. Identifying the ISPs that provide better BGP community options is possible future work.
We show how different traffic engineering techniques help us build a playbook that we can use to mitigate DDoS attacks (Section 2.10). From consulting with network operators, we know that many operators use these techniques to redistribute traffic among anycast sites. A measurement study of operator behavior would be useful future work to understand how operators actually run their networks. Finally, we rely on manual routing changes by the network operators; a completely automated system design would be very useful for them.
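One way to think about automating such a playbook is as a lookup from the estimated attack size to a pre-computed routing response. The sketch below only illustrates that idea under assumed thresholds and configuration names; it is not the system from Chapter 2.

    # Illustrative playbook: map an estimated attack size (Gb/s) to a pre-computed
    # routing response. Thresholds and actions are hypothetical placeholders.
    PLAYBOOK = [
        (10,  "baseline announcement (no change)"),
        (50,  "prepend at overloaded sites to shed load"),
        (200, "withdraw from overloaded sites; absorb at large sites"),
    ]

    def select_response(estimated_gbps):
        for threshold, action in PLAYBOOK:
            if estimated_gbps <= threshold:
                return action
        return "engage upstream filtering or scrubbing"

    print(select_response(75))   # "withdraw from overloaded sites; absorb at large sites"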
Next steps in DDoS defense using filtering: We base our evaluation (Section 3.5) on B-root DNS traffic. We assumed other root servers have similar traffic patterns and can use our defense system; this assumption requires validation, which needs collaboration from other DNS root operators.
Our filtering parameters are based on root DNS traffic (Section 3.4). However, other authoritative DNS servers, such as TLDs and lower-level authoritative servers, may show different distributions of DNS features such as Rcode, number of unique IP addresses, or hop counts. A comparative analysis of the traffic arriving at different levels of the DNS hierarchy is necessary to understand the effectiveness of our defense system for different DNS servers. Investigating other DNS servers to re-evaluate the system parameters would be useful future work.
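Of these features, hop count is the one that must be inferred rather than read directly from packets: the common approach in hop-count filtering [112] assumes the sender used one of a few standard initial TTL values and subtracts the received TTL. The sketch below follows that common approach; the exact parameters our system uses may differ.

    # Estimate the hop count of a received packet from its remaining IP TTL,
    # assuming the sender started from one of the common initial TTL values.
    COMMON_INITIAL_TTLS = (32, 64, 128, 255)

    def estimated_hops(received_ttl):
        for initial in COMMON_INITIAL_TTLS:
            if received_ttl <= initial:
                return initial - received_ttl
        return None   # TTL larger than any known initial value

    print(estimated_hops(52))    # 12 hops, assuming an initial TTL of 64
    print(estimated_hops(115))   # 13 hops, assuming an initial TTL of 128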
Future work related to moving target defense: Our current implementation supports an SSH client (Section 4.6.1) with a Python program, and HTTPS using a Firefox extension (Section 4.6.2). As future work, we want to integrate the Chhoyhopper client with OpenSSH. We also need to provide HTTPS extension support for Chrome and port the server support to non-Linux operating systems. We also need to test our design on different operating systems, since we have only tested the system on Linux.
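To make the moving-target idea concrete, the sketch below shows one plausible way a client and server could agree on a time-varying IPv6 address inside a /64: hash a shared secret with the current time slot and use the result as the interface identifier. This is only an illustration of the concept; the actual Chhoyhopper construction (Section 4.5) may differ in its inputs and details.

    import hmac, hashlib, time, ipaddress

    def hopping_address(prefix, shared_key, period_s=60):
        """Derive the current rendezvous address in `prefix` from a shared key.
        Both ends compute the same address as long as they agree on the time slot."""
        slot = int(time.time() // period_s)
        digest = hmac.new(shared_key, str(slot).encode(), hashlib.sha256).digest()
        iid = int.from_bytes(digest[:8], "big")          # low 64 bits of the address
        net = ipaddress.IPv6Network(prefix)
        return ipaddress.IPv6Address(int(net.network_address) | iid)

    # hypothetical prefix and key; the server listens on this address for one slot,
    # then both sides hop to the next one.
    print(hopping_address("2001:db8:1234:5678::/64", b"example-shared-secret"))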
Future work related to 5G detour detection: We use a 5G testbed (Section 6.8) to test our detour detection algorithm (Section 6.5). As future work, we want to develop a mobile application for different operating systems that collects latency data from the UE in the background and then uses that latency for detour detection. Collaborating with mobile operators, we want to test our algorithm in the future by injecting real-world detours into 5G networks.
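As a simple illustration of latency-based detour detection, the sketch below flags a destination when its current RTT rises well above the learned historic distribution. The percentile and margin are arbitrary placeholders; the dissertation's actual algorithm (Section 6.5.6) may differ.

    def detour_suspected(historic_rtts, current_rtt, margin_ms=20.0):
        """Flag a possible detour when the current RTT is well above the historic
        distribution for this destination (placeholder rule: 95th percentile + margin)."""
        ranked = sorted(historic_rtts)
        p95 = ranked[min(len(ranked) - 1, int(0.95 * len(ranked)))]
        return current_rtt > p95 + margin_ms

    history = [31.0, 29.5, 33.2, 30.1, 35.0, 32.4, 30.8, 31.9, 34.1, 29.9]
    print(detour_suspected(history, 36.0))    # False: within normal variation
    print(detour_suspected(history, 120.0))   # True: likely detoured path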
We analyze the stability of 5G latency measured from a specific location. As future work, we should measure the stability when the device moves from one place to another at different velocities. Evaluating different signal strengths, along with the impacts of 4G and 5G mobile networks in different parts of the world, may give us a complete picture of the efficacy of our algorithm in different situations.
An analysis to identify the internal hops within mobile networks using traceroutes would be important future work. Using the CAIDA ShipTraceroutes dataset, we observed internal hops. However, our recent measurements reveal that a carrier becomes unresponsive when we make traceroute measurements. As a result, an evaluation of more carriers across different countries would be useful future work.
Future work related to anycast polarization: We confirmed our findings of polarization for only 18 anycast prefixes out of the 2,273 unique anycast prefixes with potential polarization (Section 7.4.2). As future work, we should confirm with more anycast operators.
Finding the number of users impacted by high latency due to polarization can also be important future work. We show that a good number of anycast prefixes have potential polarization problems, but we never explored how many active users are behind these prefixes. Finding this number of impacted users is important to understand the real impact of polarization.
8.1.2 Potential future directions beyond our studies
This thesis can lead us in different directions. Potential future work includes measurement studies of current practices by network operators, learning from these practices, automation and prediction models for traffic engineering, and Internet performance studies. Although we cannot imagine all the directions our work may go in, we next show some example use cases for future directions from this thesis.
A clear understanding of how operators use different traffic engineering techniques is still not well documented. For our anti-DDoS system using anycast (Chapter 2) and for our anycast polarization improvement (Chapter 7), we utilize different traffic engineering techniques that operators can use to address security and latency issues for users. As future work, we want to explore how network operators use these traffic engineering techniques in real-life use cases: which operators use different traffic engineering techniques, what techniques they normally use, and for what purposes. A survey of operators to understand their needs, their understanding, and their choices in selecting upstream providers can also be an important topic for further investigation.
A system that can predict the traffic distribution of an anycast network after a BGP change can also be important future work. We use a test prefix to evaluate the outcome of a BGP change in our systems in Chapter 2 and Chapter 7. This process is cumbersome because it requires deploying a new BGP policy, generating traffic for that test prefix, and observing the impacts. We also need a test prefix for this approach, which can be a scarce resource. A new system that predicts the outcome of a BGP change without requiring a test prefix would be important future work.
A possible future work from this thesis is to understand Internet equality. In Chapter 7, we show the reasons behind polarization, ways to find these polarization problems, and scenarios where we reduce the impact of polarization. We found that continents other than Europe and North America have more instances of polarization problems because operators normally deploy servers with connectivity to the local ASes only. While this local connectivity helps keep some local traffic local, a significant portion of the local traffic may go to a distant server because of the unavailable path to the local servers. In those cases, a transit connection to the local server is necessary, but many current deployments do not have this connectivity. A possible future work is to understand the reason behind this missing transit connection to these local sites. Is it because of cost? Or is it because many Tier-1 ASes do not support connectivity on these continents?
8.2 Conclusions
In this thesis, we prove our thesis statement: “new methods utilizing measurement to mitigate attacks that disrupt online services without changing any existing Internet protocols”. We prove this thesis statement with five studies and a supplemental study. Our studies show defense systems against attacks that target clients, servers, and the path between clients and servers.
Our thesis statement has four key features, as listed in Table 1.1: new methods that mitigate attacks which disrupt online services, without changing any existing protocols. Next, we show how our studies cover these key features.
In Chapter 2, we show a defense system that can mitigate DDoS attacks on an anycast server based on a BGP playbook. DDoS attacks can disrupt a service by overwhelming server capacity with malicious traffic. We show two new methods, estimating attack size (Section 2.6) and using a BGP playbook (Section 2.8.4), to defend against DDoS (Section 2.10). Through this work, we prove our thesis statement because we only require measurements to build the pre-computed playbook (Section 2.8.4), and we did not make any changes to existing protocols to design our system. We show that we can utilize this system to mitigate real-world DDoS attacks.
In Chapter 3, we use filtering to design an automated system to mitigate DDoS attacks on root DNS servers. We propose new methods that evaluate root DNS traffic, define filtering parameters, and automatically select the correct filter for an attack (Section 3.4). This system does not require any changes to existing protocols and uses only measurement, which proves our thesis statement.
In Chapter 4, we show a new method to design a moving target defense (Section 4.5) against brute-force password attacks targeting clients. Password attacks can cause disruption of user access. We use the existing IPv6 address space, and we provide the first SSH and HTTPS applications. This work proves the thesis statement by showing applications built from existing protocols.
In Chapter 6, we show how historic latency can be used to design a system for detour detection. Detours disrupt services by adding latency, and attackers can eavesdrop on the detoured traffic. We make two new contributions: first, characterizing mobile latency in the 5G era (Chapter 5), and second, designing a detour detection algorithm (Chapter 6). First, using measurements, we show the stability of mobile latency to different destinations (Section 5.6.4 and Section 6.6). Then we take measurements to learn the historic latency in mobile networks (Section 6.5.2). Using this learned historic latency, we design a new algorithm (Section 6.5.6) to detect detours in mobile networks. This work proves the thesis statement since it utilizes only measurement to design the defense system, without changing existing protocols.
As supplemental work, we show how measurement can be used to improve the performance of an anycast network (Chapter 7). This work is not directly related to attacks; however, polarization disrupts online services by adding latency. Unlike other studies, this work shows how we can use measurement (Section 7.6) to improve the performance of an anycast network using existing protocols.
Next, we show how our studies together cover the four key features of our thesis statement listed in Table 1.1: new methods that mitigate attacks which disrupt online services, without changing any existing protocols.
We provide multiple new methods to mitigate different attacks. We propose a new method for attack estimation (Section 2.5) along with the new idea of using a BGP playbook in DDoS defense (Section 2.8.4). Our anti-DDoS study with filtering provides the first public description of an automated system that combines multiple filters to defend against DDoS (Section 3.4). We also provide the first design of a moving target defense (Section 4.5) utilizing the IPv6 address space for SSH and HTTPS applications (Section 4.6). Then we provide the first analysis of mobile latency stability from a globally distributed CDN (Section 5.6), confirm the stability of mobile latency (Section 6.6), and propose a novel algorithm to detect malicious routing detours in mobile networks (Section 6.5). Lastly, we provide the first analysis of anycast polarization in known anycast networks (Section 7.4).
All the studies, except the anycast polarization study, mitigate attacks that disrupt online services. To protect these services, the first two studies (Chapter 2 and Chapter 3) mitigate DDoS attacks. DDoS attacks disrupt online services by overwhelming resources; our defenses mitigate attacks to keep resource consumption within capacity. We also provide a moving target defense against brute-force password attacks targeting clients (Chapter 4). Brute-force password attacks disrupt user access, and our defense system prevents this attack. We also provide a detection mechanism (Chapter 6) for malicious detour attacks. Malicious detours may disrupt user privacy through eavesdropping. The anycast polarization study (Chapter 7) is not related to attacks; rather, it identifies performance issues. Polarization may disrupt online services by adding latency.
Lastly, all these studies utilize existing protocols. The DDoS defenses (Chapter 2 and Chapter 3) use measurement and existing protocols and tools such as BGP and iptables. The anti-brute-force password study (Chapter 4) utilizes the IPv6 address space, and our defense is easily deployable using a shell terminal for SSH and web browsers for HTTPS. Detecting malicious detours (Chapter 6) only requires measurement of historic latency without changing existing protocols; we can use simple ping measurements for this system. Lastly, the anycast polarization study (Chapter 7) only requires measurement; we improve the performance issues with simple routing changes, without changing existing protocols.
All these studies prove our thesis statement: we use measurement to design security systems against DDoS, brute-force, and malicious detour attacks. Our purpose was to design systems that can be easily deployed without changing existing protocols. Researchers and network operators can use our designs directly, or they may use the concept of using measurement to build their own defense systems for their networks.
Bibliography
[1] Internet Systems Consortium (ISC). Using the Response Rate Limiting Feature.
https://kb.isc.org/docs/aa-00994. [Online; accessed 05-May-2019]. 2018.
[2] Vijay K Adhikari, Yang Guo, Fang Hao, Volker Hilt, Zhi-Li Zhang, Matteo Varvello, and
Moritz Steiner. “Measurement study of Netflix, Hulu, and a tale of three CDNs”. In: IEEE/ACM
Transactions On Networking 23.6 (2014), pp. 1984–1997. doi: 10.1109/TNET.2014.2354262.
[3] Vijay Kumar Adhikari, Yang Guo, Fang Hao, Matteo Varvello, Volker Hilt, Moritz Steiner, and
Zhi-Li Zhang. “Unreeling Netflix: Understanding and Improving Multi-CDN Movie Delivery”. In:
2012 Proceedings IEEE INFOCOM. IEEE. 2012, pp. 1620–1628. doi: 10.1109/INFCOM.2012.6195531.
[4] David Adrian, Zakir Durumeric, Gulshan Singh, and J. Alex Halderman. “Zippier ZMap:
Internet-Wide Scanning at 10 Gbps”. In: Proceedings of the USENIX Workshop on Offensive
Technologies. San Diego, CA, USA: USENIX, Aug. 2014. url:
https://www.usenix.org/system/files/conference/woot14/woot14-adrian.pdf.
[5] Lyle Adriano. Canadian comms company suffers DDoS attack | Insurance Business Canada.
https://www.insurancebusinessmag.com/ca/news/breaking-news/canadian-comms-company-suffersddos-attack-310819.aspx. Sept. 2021.
[6] Ijaz Ahmad, Tanesh Kumar, Madhusanka Liyanage, Jude Okwuibe, Mika Ylianttila, and
Andrei Gurtov. “5G security: Analysis of threats and solutions”. In: 2017 IEEE Conference on
Standards for Communications and Networking (CSCN). IEEE. 2017, pp. 193–199.
[7] Ijaz Ahmad, Shahriar Shahabuddin, Tanesh Kumar, Jude Okwuibe, Andrei Gurtov, and
Mika Ylianttila. “Security for 5G and Beyond”. In: IEEE Communications Surveys & Tutorials 21.4
(2019), pp. 3682–3722. doi: 10.1109/COMST.2019.2916180.
[8] Mark Allman. “On Eliminating Root Nameservers from the DNS”. In: Proceedings of the ACM
Workshop on Hot Topics in Networks. Princeton, NJ, USA: ACM, Nov. 2019. doi:
https://doi.org/10.1145/3365609.3365863.
[9] AMPATH. BGP Resources. https://ampath.net/AMPATH_BGP_Policies.php. [Online; accessed
12-Oct-2021].
[10] Xia An, Chao Zhang, Kewu Peng, Zhitong He, and Jian Song. “Adaptive Quantized and
Normalized MSA Based on Modified MET-DE and Its Application for 5G-NR LDPC Codes”. In:
IEEE Access (2023). doi: 10.1109/ACCESS.2023.3315610.
[11] Analysis of Network Traffic (ANT) group, ANT Datasets. https://ant.isi.edu/datasets/all.html.
Datasets with Anomaly keywords, [Online; accessed 19-Feb-2022]. 2022.
[12] Anritsu. Faster Low-Latency 5G Mobile Networks.
https://web.archive.org/web/20230130081437/https://www.anritsu.com/en-us/testmeasurement/solutions/mt1000a-05/index. [Online; accessed 28-Feb-2024]. 2024.
[13] APNIC. BGP-Stats Routing Table Report—Japan View.
https://mailman.apnic.net/mailing-lists/bgp-stats/archive/2020/05/msg00001.html. May 2020.
url: https://mailman.apnic.net/mailing-lists/bgp-stats/archive/2020/05/msg00001.html.
[14] APNIC. IPv6 Capable Rate by country (%). https://stats.labs.apnic.net/ipv6. [Online; accessed
13-December-2021]. 2021.
[15] AWS. AWS Shield - Threat Landscape Report – Q1 2020.
https://aws-shield-tlr.s3.amazonaws.com/2020-Q1_AWS_Shield_TLR.pdf. Aug. 2020.
[16] Michael Backes, Thorsten Holz, Christian Rossow, Teemu Rytilahti, Milivoj Simeonovski, and
Ben Stock. “On the Feasibility of TTL-Based Filtering for DRDoS Mitigation”. In: International
Symposium on Research in Attacks, Intrusions, and Defenses, RAID. Vol. 9854. Sept. 2016,
pp. 303–322. isbn: 978-3-319-45718-5. doi: 10.1007/978-3-319-45719-2_14.
[17] Vaibhav Bajpai, Steffie Jacob Eravuchira, and Jürgen Schönwälder. “Lessons learned from using
the RIPE Atlas platform for measurement research”. In: ACM SIGCOMM Computer
Communication Review 45.3 (2015), pp. 35–42. doi: https://doi.org/10.1145/2805789.2805796.
[18] Hitesh Ballani and Paul Francis. “Towards a Deployable IP Anycast Service.” In: Proceedings of
First Workshop on Real, Large Distributed Systems (WORLDS’04). USENIX, 2004. url:
https://www.usenix.org/legacy/event/worlds04/tech/full_papers/ballani/ballani.pdf.
[19] Hitesh Ballani, Paul Francis, and Sylvia Ratnasamy. “A measurement-based deployment proposal
for IP anycast”. In: Proceedings of the 6th ACM SIGCOMM conference on Internet measurement.
2006, pp. 231–244. doi: https://doi.org/10.1145/1177080.1177109.
[20] Hitesh Ballani, Paul Francis, and Xinyang Zhang. “A study of prefix hijacking and interception in
the Internet”. In: ACM SIGCOMM Computer Communication Review 37.4 (2007), pp. 265–276. doi:
https://doi.org/10.1145/1282427.1282411.
[21] C. Barna, M. Shtern, M. Smit, V. Tzerpos, and M. Litoiu. “Model-based Adaptive DoS Attack
Mitigation”. In: Proceedings of the 7th International Symposium on Software Engineering for
Adaptive and Self-Managing Systems. Zurich, Switzerland: IEEE Press, 2012, pp. 119–128. isbn:
978-1-4673-1787-0. doi: 10.1109/SEAMS.2012.6224398.
[22] Ray Bellis. Researching F-root Anycast Placement Using RIPE Atlas. ripe blog https:
//labs.ripe.net/Members/ray_bellis/researching-f-root-anycast-placement-using-ripe-atlas.
Oct. 2015. url: https://labs.ripe.net/Members/ray_bellis/researching-f-root-anycastplacement-using-ripe-atlas.
[23] Ran Ben-Basat, Gil Einziger, Roy Friedman, and Yaron Kassner. “Heavy hitters in streams and
sliding windows”. In: IEEE INFOCOM 2016 - The 35th Annual IEEE International Conference on
Computer Communications. 2016, pp. 1–9. doi: 10.1109/INFOCOM.2016.7524364.
[24] Terry Benzel, Robert Braden, Dongho Kim, Cliford Neuman, Anthony Joseph, Keith Sklower,
Ron Ostrenga, and Stephen Schwab. “Experience with DETER: a testbed for security research”. In:
2nd International Conference on Testbeds and Research Infrastructures for the Development of
Networks and Communities, 2006. TRIDENTCOM 2006. IEEE. 2006, 10–pp. doi:
10.1109/TRIDNT.2006.1649172.
[25] Leandro M. Bertholdo, João M. Ceron, Wouter B. de Vries, Ricardo de Oliveira Schmidt,
Lisandro Zambenedetti Granville, Roland van Rijswijk-Deij, and Aiko Pras. “TANGLED: A
Cooperative Anycast Testbed”. In: 2021 IFIP/IEEE International Symposium on Integrated Network
Management (IM). 2021, pp. 766–771.
[26] Bill Slater, President of Chicago ISOC. The Internet Outage and Attacks of October 2002.
https://billslater.com/writing/2002_1107__Internet_Outage_and_Attacks_in_october_2002_by_
William_Slater.pdf. 2002.
[27] Henry Birge-Lee, Liang Wang, Jennifer Rexford, and Prateek Mittal. “Sico: Surgical interception
attacks by manipulating bgp communities”. In: Proceedings of the 2019 ACM SIGSAC Conference on
Computer and Communications Security. 2019, pp. 431–448.
[28] Jesse Blazina. Stonefish—Automating DDoS Mitigation at the Edge. https:
//medium.com/@verizondigital/stonefish-automating-ddos-mitigation-at-the-edge-6a2650aeb6af.
[Online; accessed 30-May-2019]. 2019.
[29] L Bošnjak, J Sreš, and Bosnjak Brumen. “Brute-force and dictionary attack on hashed real-world
passwords”. In: 2018 41st International Convention on Information and Communication Technology,
Electronics and Microelectronics (MIPRO). IEEE. 2018, pp. 1161–1166.
[30] R. Bush and R. Austein. The Resource Public Key Infrastructure (RPKI) to Router Protocol, Version 1.
RFC 8210. RFC Editor, Sept. 2017.
[31] Matthew Caesar and Jennifer Rexford. “BGP routing policies in ISP networks”. In: IEEE network
19.6 (2005), pp. 5–11. doi: 10.1109/MNET.2005.1541715.
[32] CAIDA. AS Rank. https://asrank.caida.org/. [Online; accessed 12-Oct-2021]. 2020.
[33] CAIDA. CAIDA UCSD BGP Community Dictionary. https://www.caida.org/data/bgp-communities/.
[Online; accessed 12-Oct-2021]. 2020.
[34] Matt Calder, Ashley Flavel, Ethan Katz-Bassett, Ratul Mahajan, and Jitendra Padhye. “Analyzing
the Performance of an Anycast CDN”. In: Proceedings of the ACM Internet Measurement
Conference. Tokyo, Japan: ACM, Oct. 2015, pp. 531–537. doi:
http://dx.doi.org/10.1145/2815675.2815717.
[35] Mark D Carney, Jeffrey A Jackson, Andrew L Bates, and Dante J Pacella. Method and apparatus for
mitigating distributed denial of service attacks. US Patent 9,197,666. Nov. 2015.
[36] Sebastian Castro, Duane Wessels, Marina Fomenkov, and Kimberly Claffy. “A day at the root of
the Internet”. In: ACM SIGCOMM Computer Communication Review 38.5 (2008), pp. 41–46. doi:
https://doi.org/10.1145/1452335.1452341.
[37] National Risk Management Center. Securing 5G Infrastructure from Cybersecurity Risks.
https://www.cisa.gov/blog/2021/05/10/securing-5g-infrastructure-cybersecurity-risks.
[Online; accessed 30-June-2021]. 2021.
[38] Anirban Chakrabarti and Govindarasu Manimaran. “Internet infrastructure security: A
taxonomy”. In: IEEE network 16.6 (2002), pp. 13–21. doi: 10.1109/MNET.2002.1081761.
[39] R. Chandra, P. Traina, and T. Li. BGP Communities Attribute. Tech. rep. 1997. RFC Editor, 1996.
url: https://www.rfc-editor.org/rfc/rfc1997.txt.
[40] Rocky KC Chang and Michael Lo. “Inbound traffic engineering for multihomed ASs using AS
path prepending”. In: IEEE network 19.2 (2005), pp. 18–25. doi: 10.1109/MNET.2005.1407694.
[41] Eric Y Chen and Mistutaka Itoh. “A whitelist approach to protect SIP servers from flooding
attacks.” In: Communications Quality and Reliability (CQR), 2010 IEEE International Workshop
Technical Committee on. IEEE. 2010, pp. 1–6. doi: 10.1109/CQR.2010.5619917.
[42] Yi-Ching Chiu, Brandon Schlinker, Abhishek Balaji Radhakrishnan, Ethan Katz-Bassett, and
Ramesh Govindan. “Are We One Hop Away from a Better Internet?” In: Proceedings of the ACM
Internet Measurement Conference. Tokyo, Japan: ACM, Oct. 2015, pp. 523–529. doi:
http://dx.doi.org/10.1145/2815675.2815719..
[43] Gaurav Choudhary, Jiyoon Kim, and Vishal Sharma. “Security of 5G-mobile backhaul networks:
A survey”. In: arXiv preprint arXiv:1906.11427 (2019).
[44] Vinod Kumar Choyi, Ayman Abdel-Hamid, Yogendra Shah, Samir Ferdi, and Alec Brusilovsky.
“Network slice selection, assignment and routing within 5G networks”. In: 2016 IEEE Conference
on Standards for Communications and Networking (CSCN). IEEE. 2016, pp. 1–7. doi:
10.1109/CSCN.2016.7784887.
[45] Taejoong Chung, Emile Aben, Tim Bruijnzeels, Balakrishnan Chandrasekaran, David Choffnes,
Dave Levin, Bruce M Maggs, Alan Mislove, Roland van Rijswijk-Deij, John Rula, and
Nick Sullivan. “RPKI is coming of age: A longitudinal study of RPKI deployment and invalid route
origins”. In: Proceedings of the Internet Measurement Conference. 2019, pp. 406–419. doi:
https://doi.org/10.1145/3355369.3355596.
[46] Taejoong Chung, Roland van Rijswijk-Deij, Balakrishnan Chandrasekaran, David Choffnes,
Dave Levin, Bruce M. Maggs, Alan Mislove, and Christo Wilson. “A Longitudinal, End-to-End
View of the DNSSEC Ecosystem”. In: 26th USENIX Security Symposium (USENIX Security 17).
Vancouver, BC: USENIX Association, Aug. 2017, pp. 1307–1322. isbn: 978-1-931971-40-9. url:
https://www.usenix.org/conference/usenixsecurity17/technical-sessions/presentation/chung.
[47] Danilo Cicalese, Jordan Augé, Diana Joumblatt, Timur Friedman, and Dario Rossi.
“Characterizing IPv4 anycast adoption and deployment”. In: Proceedings of the 11th ACM
Conference on Emerging Networking Experiments and Technologies. 2015, pp. 1–13. doi:
https://doi.org/10.1145/2716281.2836101.
[48] Danilo Cicalese and Dario Rossi. “A longitudinal study of IP Anycast”. In: ACM SIGCOMM
Computer Communication Review 48.1 (2018), pp. 10–18. doi:
https://doi.org/10.1145/3211852.3211855.
[49] Cloudflare. Famous DDoS Attacks | The Largest DDoS Attacks Of All Time.
https://www.cloudflare.com/learning/ddos/famous-ddos-attacks/. [Online; accessed 12-Oct-2021].
[50] Lorenzo Colitti, Steinar H Gunderson, Erik Kline, and Tiziana Refice. “Evaluating IPv6 adoption
in the Internet”. In: International Conference on Passive and Active Network Measurement.
Springer. 2010, pp. 141–150.
[51] Xavier Costa-Perez, Andres Garcia-Saavedra, Xi Li, Thomas Deiss, Antonio De La Oliva,
Andrea Di Giglio, Paola Iovanna, and Alain Moored. “5G-crosshaul: An SDN/NFV integrated
fronthaul/backhaul transport network architecture”. In: IEEE wireless communications 24.1 (2017),
pp. 38–45. doi: 10.1109/MWC.2017.1600181WC.
[52] Cybersecurity and Infrastructure Security Agenecy. Potential Threat Vectors Ti 5G Infrastructure.
https://www.cisa.gov/publication/5g-potential-threat-vectors. [Online; accessed 30-June-2021].
2021.
[53] Jakub Czyz, Mark Allman, Jing Zhang, Scott Iekel-Johnson, Eric Osterweil, and Michael Bailey.
“Measuring IPv6 adoption”. In: Proceedings of the 2014 ACM Conference on SIGCOMM. 2014,
pp. 87–98. doi: https://doi.org/10.1145/2619239.2626295.
[54] F5 Labs David Warbuton. DDoS Attack Trends for 2020.
https://www.f5.com/labs/articles/threat-intelligence/ddos-attack-trends-for-2020. [Online;
accessed 7-Oct-2021]. 2020.
[55] Alysha M De Livera, Rob J Hyndman, and Ralph D Snyder. “Forecasting time series with complex
seasonal patterns using exponential smoothing”. In: Journal of the American Statistical Association
106.496 (2011), pp. 1513–1527.
[56] Wouter B. de Vries, Ricardo de O. Schmidt, Wes Hardaker, John Heidemann, Pieter-Tjerk de Boer,
and Aiko Pras. “Verfploeter: Broad and Load-Aware Anycast Mapping”. In: Proceedings of the
ACM Internet Measurement Conference. London, UK, 2017. doi:
https://doi.org/10.1145/3131365.3131371.
[57] Rennie Degraaf, John Aycock, and Michael Jacobson. “Improved port knocking with strong
authentication”. In: 21st Annual Computer Security Applications Conference (ACSAC’05). IEEE.
2005, 10–pp. doi: 10.1109/CSAC.2005.32.
[58] Christoph Dietzel, Anja Feldmann, and Thomas King. “Blackholing at IXPs: On the effectiveness
of DDoS mitigation in the wild”. In: International Conference on Passive and Active Network
Measurement. Springer. 2016, pp. 319–332.
[59] John Dilley, Bruce Maggs, Jay Parikh, Harald Prokop, Ramesh Sitaraman, and Bill Weihl.
“Globally Distributed Content Delivery”. In: IEEE Internet Computing 6.5 (Sept. 2002), pp. 50–58.
doi: http://dx.doi.org/10.1109/MIC.2002.1036038.
[60] Domain names - implementation and specification. RFC 1035. Nov. 1987. doi: 10.17487/RFC1035.
[61] Ramin Ali Dousti, Frank Scalzo, and Suresh Bhogavilli. Automated DDoS attack mitigation via
BGP messaging. US Patent App. 15/273,510. Mar. 2018.
[62] Xiaoyu Duan and Xianbin Wang. “Authentication handover and privacy protection in 5G hetnets
using software-defined networking”. In: IEEE Communications Magazine 53.4 (2015), pp. 28–35.
doi: 10.1109/MCOM.2015.7081072.
[63] Zhenhai Duan, Xin Yuan, and Jaideep Chandrashekar. “Controlling IP spoofing through
interdomain packet filters”. In: IEEE transactions on Dependable and Secure computing 5.1 (2008),
pp. 22–36. doi: 10.1109/TDSC.2007.70224.
[64] Chris Duckett. Chromium DNS hijacking detection accused of being around half of all root queries.
https://www.zdnet.com/article/chromium-dns-hijacking-detection-accused-of-being-aroundhalf-of-all-root-queries/. [Online; accessed 24-Jan-2022]. 2020.
[65] Wesley M Eddy. TCP SYN flooding attacks and common mitigations. RFC 4987. RFC Editor, Aug.
2007.
[66] Anne Edmundson, Roya Ensafi, Nick Feamster, and Jennifer Rexford. “A first look into
transnational routing detours”. In: Proceedings of the 2016 ACM SIGCOMM Conference. 2016,
pp. 567–568. doi: https://doi.org/10.1145/2934872.2959081.
[67] Anne Edmundson, Roya Ensafi, Nick Feamster, and Jennifer Rexford. “Nation-state hegemony in
internet routing”. In: Proceedings of the 1st ACM SIGCAS Conference on Computing and Sustainable
Societies. 2018, pp. 1–11. doi: https://doi.org/10.1145/3209811.3211887.
[68] Yoav Einav. Amazon Found Every 100ms of Latency Cost them 1% in Sales.
https://www.gigaspaces.com/blog/amazon-found-every-100ms-of-latency-cost-them-1-in-sales.
[Online; accessed 29-March-2022]. 2019.
[69] Tom Emmons. 2021: Volumetric DDoS Attacks Rising Fast.
https://www.akamai.com/blog/security/2021-volumetric-ddos-attacks-rising-fast. [Online;
accessed 10-Jan-2022]. 2021.
[70] Ericsson. 5G to account for 25 percent of mobile data traffic this year.
https://www.ericsson.com/en/reports-and-papers/mobility-report/dataforecasts/mobiletraffic-forecast. [Online; accessed 13-Dec-2023]. 2023.
[71] Ericsson. Mobile network data traffic still climbing. https://www.ericsson.com/en/reports-andpapers/mobility-report/dataforecasts/mobile-traffic-update. [Online; accessed 13-Dec-2023].
2023.
[72] Arthur Fabre. L4Drop: XDP DDoS Mitigations.
https://blog.cloudflare.com/l4drop-xdp-ebpf-based-ddos-mitigations/. [Online; accessed
01-Dec-2019]. 2018.
[73] Xun Fan and John Heidemann. “Selecting representative IP addresses for Internet topology
studies”. In: Proceedings of the 10th ACM SIGCOMM conference on Internet measurement. ACM.
2010, pp. 411–423. doi: https://doi.org/10.1145/1879141.1879195.
[74] Xun Fan, John Heidemann, and Ramesh Govindan. “Evaluating anycast in the domain name
system”. In: 2013 Proceedings IEEE INFOCOM. IEEE. 2013, pp. 1681–1689. doi:
10.1109/INFCOM.2013.6566965.
[75] Seyed K Fayaz, Yoshiaki Tobioka, Vyas Sekar, and Michael Bailey. “Bohatei: Flexible and elastic
ddos defense”. In: 24th USENIX Security Symposium. 2015, pp. 817–832.
[76] P. Ferguson and D. Senie. Network Ingress Filtering: Defeating Denial of Service Attacks which
employ IP Source Address Spoofing. RFC 2267. also BCP-38. Internet Request For Comments, May
2000. url: ftp://ftp.rfc-editor.org/in-notes/rfc2267.txt.
[77] Tim Fisher. How Are 4G and 5G Different? https://www.lifewire.com/5g-vs-4g-4156322. [Online;
accessed 18-Feb-2024]. 2023.
[78] Ashley Flavel, Pradeepkumar Mani, David Maltz, Nick Holt, Jie Liu, Yingying Chen, and
Oleg Surmachev. “Fastroute: A scalable load-aware anycast routing architecture for modern
CDNs”. In: 12th USENIX Symposium on Networked Systems Design and Implementation (NSDI 15).
2015, pp. 381–394.
[79] Kensuke Fukuda and John Heidemann. “Who knocks at the IPv6 door? detecting IPv6 scanning”.
In: Proceedings of the Internet Measurement Conference 2018. 2018, pp. 231–237. doi:
https://doi.org/10.1145/3278532.3278553.
[80] Lixin Gao. “On Inferring Autonomous System Relationships in the Internet”. In: 9.6 (Dec. 2001),
pp. 733–745. doi: http://dx.doi.org/10.1109/90.974527.
[81] Ruomei Gao, Constantinos Dovrolis, and Ellen W Zegura. “Interdomain ingress traffic
engineering through optimized AS-path prepending”. In: International Conference on Research in
Networking. Springer. 2005, pp. 647–658.
[82] Manaf Gharaibeh, Christos Papadopoulos, John Heidemann, and Craig Partridge. Delay-based
Identification of Internet Block Movement. Tech. rep. CS-20-101. Colorado State University
Computer Science Department, Apr. 2020. url:
https://www.isi.edu/%7ejohnh/PAPERS/Gharaibeh20b.html.
[83] Moinak Ghoshal, Imran Khan, Z Jonny Kong, Phuc Dinh, Jiayi Meng, Y Charlie Hu, and
Dimitrios Koutsonikolas. “Performance of Cellular Networks on the Wheels”. In: Proceedings of
the 2023 ACM on Internet Measurement Conference. 2023, pp. 678–695. doi:
https://doi.org/10.1145/3618257.3624814.
[84] David Gillman, Yin Lin, Bruce Maggs, and Ramesh K Sitaraman. “Protecting websites from attack
with secure delivery networks”. In: Computer 48.4 (2015), pp. 26–34. doi: 10.1109/MC.2015.116.
[85] Vasileios Giotsas, Georgios Smaragdakis, Christoph Dietzel, Philipp Richter, Anja Feldmann, and
Arthur Berger. “Inferring BGP blackholing activity in the Internet”. In: Proceedings of the Internet
Measurement Conference. ACM. 2017, pp. 1–14. doi: https://doi.org/10.1145/3131365.3131379.
[86] William Goddard. Where Is 5G Available? https://itchronicles.com/5g/where-is-5g-available/.
[Online; accessed 6-July-2021]. 2020.
[87] Sharon Goldberg, Michael Schapira, Peter Hummon, and Jennifer Rexford. “How secure are
secure interdomain routing protocols”. In: ACM SIGCOMM Computer Communication Review 40.4
(2010), pp. 87–98. doi: https://doi.org/10.1145/1851275.1851195.
[88] F. Gont. A Method for Generating Semantically Opaque Interface Identifiers with IPv6 Stateless
Address Autoconfiguration (SLAAC). RFC 7217. Internet Request For Comments, Apr. 2014. doi:
http://dx.doi.org/10.17487/RFC7217.
[89] F. Gont, S. Krishnan, T. Narten, and R. Draves. Temporary Address Extensions for Stateless Address
Autoconfiguration in IPv6. RFC 8981. Internet Request For Comments, Feb. 2021. doi:
http://dx.doi.org/10.17487/RFC8981.
[90] Google. Google IPv6 Statistics. https://www.google.com/intl/en/ipv6/statistics.html. [Online;
accessed 13-December-2021]. 2021.
[91] Robert Graham, Paul McMillan, and Dan Tentler. Mass Scanning the Internet. Presentation at
Defcon 22. Aug. 2014. url: https://defcon.org/images/defcon-22/dc-22-presentations/GrahamMcMillan-Tentler/DEFCON-22-Graham-McMillan-Tentler-Masscaning-the-Internet.pdf.
[92] Albert Greenberg, James R. Hamilton, Navendu Jain, Srikanth Kandula, Changhoon Kim,
Parantap Lahiri, David A. Maltz, and Parveen Pat. “VL2: A Scalable and Flexible Data Center
Network”. In: Proceedings of the ACM SIGCOMM Conference. Barcelona, Spain: ACM, Aug. 2009,
pp. 51–62. doi: https://doi.org/10.1145/1592568.1592576.
[93] Wes Hardaker. Analyzing and Mitigating Privacy with the DNS Root Service. San Diego, CA, USA,
Feb. 2018. url: http://www.isi.edu/%7ehardaker/papers/2018-02-ndss-analyzing-root-privacy.pdf.
[94] Wes Hardaker. LocalRoot: Serve Yourself. https://localroot.isi.edu/. [Online; accessed
11-Jan-2019]. 2018.
[95] T. Hardie. Distributing Authoritative Name Servers via Shared Unicast Addresses. Tech. rep. 3258.
RFC Editor, 2002. url: https://www.rfc-editor.org/rfc/rfc3258.txt.
[96] John Heidemann, Yuri Pradkin, Ramesh Govindan, Christos Papadopoulos, Genevieve Bartlett,
and Joseph Bannister. “Census and Survey of the Visible Internet”. In: Proceedings of the ACM
Internet Measurement Conference. Vouliagmeni, Greece: ACM, Oct. 2008, pp. 169–182. doi:
http://dx.doi.org/10.1145/1452520.1452542.
[97] Bob Hinden and Dr. Steve E. Deering. Internet Protocol, Version 6 (IPv6) Specification. RFC 2460.
Dec. 1998. doi: 10.17487/RFC2460.
[98] R. Hinden and S. Deering. IP Version 6 Addressing Architecture. RFC 1884. Internet Request For
Comments, Dec. 1995. url: ftp://ftp.rfc-editor.org/in-notes/rfc1884.txt.
[99] Sing Wang Ho, Thom Haddow, Jonathan Ledlie, Moez Draief, and Peter R Pietzuch.
“Deconstructing internet paths: an approach for AS-level detour route discovery.” In: USENIX
International Workshop on Peer-to-Peer Systems (IPTPS). 2009, p. 8.
[100] Lee Hahn Holloway, Srikanth N Rao, Matthew Browning Prince,
Matthieu Philippe François Tourne, Ian Gerald Pye, Ray Raymond Bejjani, and
Terry Paul Rodery Jr. Mitigating a denial-of-service attack in a cloud-based proxy service. US Patent
8,856,924. Oct. 2014.
[101] Chi-Yao Hong, Subhasree Mandal, Mohammad Al-Fares, Min Zhu, Richard Alimi,
Kondapa Naidu B., Chandan Bhagat, Sourabh Jain, Jay Kaimal, Shiyu Liang, Kirill Mendelev,
Steve Padgett, Faro Rabe, Saikat Ray, Malveeka Tewari, Matt Tierney, Monika Zahn,
Jonathan Zolla, Joon Ong, and Amin Vahdat. “B4 and After: Managing Hierarchy, Partitioning,
and Asymmetry for Availability and Scale in Google’s Software-Defined WAN”. In: Proceedings of
the ACM SIGCOMM Conference. Budapest, Hungary: ACM, Aug. 2018. doi:
https://doi.org/10.1145/3230543.3230545.
[102] Ke-Jou Hsu, James Choncholas, Ketan Bhardwaj, and Ada Gavrilovska. “DNS does not suffice for
MEC-CDN”. In: Proceedings of the 19th ACM Workshop on Hot Topics in Networks. 2020,
pp. 212–218.
[103] Geoff Huston. BGP in 2017. https://labs.apnic.net/?p=1102. Journal Article. [Online; accessed
12-Oct-2021]. Jan. 2018. url: https://labs.apnic.net/?p=1102.
[104] IBM. Edge Computing Solutions. https://www.ibm.com/edge-computing. [Online; accessed
14-Apr-2024]. 2024.
[105] ICANN. FACTSHEET: Root server attack on 6 February 2007.
https://www.icann.org/en/system/files/files/factsheet-dns-attack-08mar07-en.pdf. 2007.
[106] ICANN. Remaining IPv4 Addresses to be Redistributed to Regional Internet Registries | Address
Redistribution Signals that IPv4 is Nearing Total Exhaustion. ICANN Announcement. 20 May 2014.
url: https://www.icann.org/en/announcements/details/remaining-ipv4-addresses-to-beredistributed-to-regional-internet-registries--address-redistribution-signals-that-ipv4-
is-nearing-total-exhaustion-20-5-2014-en.
[107] Imperva. Different attack description.
https://www.imperva.com/docs/DS_Incapsula_The_Top_10_DDoS_Attack_Trends_ebook.pdf. [Online;
accessed 19-Sept-2017]. 2015.
[108] Team Cymru Inc. Secure Cisco IOS BGP Template.
https://www.team-cymru.com/secure-bgp-template.html. [Online; accessed 12-Oct-2021].
[109] Akamai InfoSec. A Look Back At The DDoS Trends Of 2018.
https://blogs.akamai.com/2019/01/a-look-back-at-the-ddos-trends-of-2018.html. [Online;
accessed 31-May-2019]. 2017.
[110] Quan Jia, Huangxin Wang, Dan Fleck, Fei Li, Angelos Stavrou, and Walter Powell. “Catch me if
you can: A cloud-enabled DDoS defense”. In: 2014 44th Annual IEEE/IFIP International Conference
on Dependable Systems and Networks. IEEE. 2014, pp. 264–275. doi: 10.1109/DSN.2014.35.
[111] Haiqing Jiang, Yaogong Wang, Kyunghan Lee, and Injong Rhee. “Tackling bufferbloat in 3G/4G
networks”. In: Proceedings of the 2012 Internet Measurement Conference. 2012, pp. 329–342. doi:
https://doi.org/10.1145/2398776.2398810.
[112] Cheng Jin, Haining Wang, and Kang G Shin. “Hop-count filtering: an effective defense against
spoofed DDoS traffic”. In: Proceedings of the 10th ACM conference on Computer and
communications security. ACM. 2003, pp. 30–41. doi: https://doi.org/10.1145/948109.948116.
[113] Mattijs Jonker, Alistair King, Johannes Krupp, Christian Rossow, Anna Sperotto, and
Alberto Dainotti. “Millions of targets under attack: a macroscopic characterization of the DoS
ecosystem”. In: Proceedings of the 2017 Internet Measurement Conference. 2017, pp. 100–113. doi:
https://doi.org/10.1145/3131365.3131383.
[114] Roger Piqueras Jover and Vuk Marojevic. “Security and protocol exploit analysis of the 5G
specifications”. In: IEEE Access 7 (2019), pp. 24956–24963.
[115] Aljosha Judmayer, Johanna Ullrich, Georg Merzdovnik, Artemios G Voyiatzis, and Edgar Weippl.
“Lightweight address hopping for defending the IPv6 IoT”. In: Proceedings of the 12th international
conference on availability, reliability and security (ARES). 2017, pp. 1–10. doi:
https://doi.org/10.1145/3098954.3098975.
[116] Sunmi Jun, Yoohwa Kang, Jaeho Kim, and Changki Kim. “Ultra-low-latency services in 5G
systems: A perspective from 3GPP standards”. In: Etri Journal 42.5 (2020), pp. 721–733.
[117] Srikanth Kandula, Dina Katabi, Matthias Jacob, and Arthur Berger. “Botz-4-sale: Surviving
organized DDoS attacks that mimic flash crowds”. In: Proceedings of the 2nd conference on
Symposium on Networked Systems Design & Implementation-Volume 2. USENIX Association. 2005,
pp. 287–300.
[118] Charlie Kaufman, Radia Perlman, and Bill Sommerfeld. “DoS protection for UDP-based
protocols”. In: Proceedings of the 10th ACM conference on Computer and communications security.
ACM. 2003, pp. 2–7. doi: https://doi.org/10.1145/948109.948113.
[119] Sami Kekki, Walter Featherstone, Yonggang Fang, Pekka Kuure, Alice Li, Anurag Ranjan,
Debashish Purkayastha, Feng Jiangping, Danny Frydman, Gianluca Verin, et al. “MEC in 5G
networks”. In: ETSI white paper 28 (2018), pp. 1–28.
[120] Thomas Koch, Ke Li, Calvin Ardi, Ethan Katz-Bassett, Matt Calder, and John Heidemann.
“Anycast in Context: A Tale of Two Systems”. In: Proceedings of the ACM SIGCOMM Conference.
Virtual: ACM, Aug. 2021. doi: https://doi.org/10.1145/3452296.3472891.
[121] Lukas Krämer, Johannes Krupp, Daisuke Makita, Tomomi Nishizoe, Takashi Koide,
Katsunari Yoshioka, and Christian Rossow. “AmpPot: Monitoring and defending against
amplification DDoS attacks”. In: International Workshop on Recent Advances in Intrusion Detection.
Springer. 2015, pp. 615–636.
[122] Brian Krebs. “KrebsOnSecurity hit with record DDoS”. In: KrebsOnSecurity, Sept 21 (2016).
[123] Martin Krzywinski. “Port Knocking: Network Authentication Across Closed Ports”. In: SysAdmin
Magazine 12.6 (June 2003), pp. 12–17. url:
http://www.portknocking.org/docs/krzywinski-portknocking-sysadmin2003.pdf.
[124] Jan Harm Kuipers. Anycast for DDoS. https://essay.utwente.nl/73795/1/Kuipers_MA_EWI.pdf.
[Online; accessed 12-Oct-2021]. 2017.
[125] W. Kumari and P. Hoffman. Running a Root Server Local to a Resolver. RFC 8806. Internet Request
For Comments, June 2020. doi: http://dx.doi.org/10.17487/RFC8806.
[126] Craig Labovitz, Abha Ahuja, Abhijit Bose, and Farnam Jahanian. “Delayed Internet routing
convergence”. In: ACM SIGCOMM Computer Communication Review 30.4 (2000), pp. 175–187.
[127] LACNIC. LACNIC 41. https://lacnic41.lacnic.net/en/programme/agenda/plenary?day=10/05/2024.
[Online; accessed 21-May-2024]. 2024.
[128] Henry CJ Lee and Vrizlynn LL Thing. “Port hopping for resilient networks”. In: IEEE 60th
Vehicular Technology Conference, 2004. VTC2004-Fall. 2004. Vol. 5. IEEE. 2004, pp. 3291–3295.
[129] J. Levine. DNS Blacklists and Whitelists. RFC 5782. Internet Request For Comments, Feb. 2010. url:
ftp://ftp.rfc-editor.org/in-notes/rfc5782.txt.
[130] Chih-Ping Li, Jing Jiang, Wanshi Chen, Tingfang Ji, and John Smee. “5G ultra-reliable and
low-latency systems design”. In: 2017 European Conference on Networks and Communications
(EuCNC). IEEE. 2017, pp. 1–5. doi: https://doi.org/10.3390/electronics8090981.
[131] Zhihao Li, Dave Levin, Neil Spring, and Bobby Bhattacharjee. “Internet Anycast: Performance,
Problems, and Potential”. In: Proceedings of the ACM SIGCOMM Conference. Budapest, Hungary:
ACM, Aug. 2018, pp. 59–73. doi: https://doi.org/10.1145/3230543.3230547.
[132] Guangyi Liu, Yuhong Huang, Zhuo Chen, Liang Liu, Qixing Wang, and Na Li. “5G deployment:
Standalone vs. non-standalone from the operator perspective”. In: IEEE Communications
Magazine 58.11 (2020), pp. 83–89. doi: 10.1109/MCOM.001.2000230.
[133] Ziqian Liu, Bradley Huffaker, Marina Fomenkov, Nevil Brownlee, and KC Claffy. “Two days in
the life of the DNS anycast root servers”. In: Passive and Active Network Measurement: 8th
International Conference, PAM 2007, Louvain-la-Neuve, Belgium, April 5-6, 2007. Proceedings 8.
Springer. 2007, pp. 125–134.
[134] Matthew Luckie, Bradley Huffaker, Amogh Dhamdhere, Vasileios Giotsas, and KC Claffy. “AS
relationships, customer cones, and validation”. In: Proceedings of the 2013 conference on Internet
measurement conference. 2013, pp. 243–256. doi: https://doi.org/10.1145/2504730.2504735.
[135] Doug Madory and Matt Prosser. Excessive BGP AS Path Prepending is a Self-Inflicted Vulnerability.
Presentation at RIPE 79. Oct. 2019. url:
https://ripe79.ripe.net/presentations/64-prepending_madory2.pdf.
[136] Marek Majkowski. Memcrashed - Major amplification attacks from UDP port 11211.
https://blog.cloudflare.com/memcrashed-major-amplification-attacks-from-port-11211/.
[Online; accessed 12-Oct-2021]. 2018.
[137] IETF MAPRG. MAPRG Presentations. https://wiki.ietf.org/en/group/maprg. [Online; accessed
21-May-2024]. 2024.
[138] Kieren McCarthy. Internet’s root servers take hit in DDoS attack.
https://www.theregister.co.uk/2015/12/08/internet_root_servers_ddos/. [Online; accessed
29-January-2019]. 2015.
[139] Tyler McDaniel, Jared M Smith, and Max Schuchard. “Flexsealing BGP against route leaks:
peerlock active measurement and analysis”. In: arXiv e-prints arXiv:2006.06576 (2020).
[140] Stephen McQuistin, Sree Priyanka Uppu, and Marcel Flores. “Taming Anycast in the Wild
Internet”. In: Proceedings of the Internet Measurement Conference. 2019, pp. 165–178. doi:
https://doi.org/10.1145/3355369.3355573.
[141] Jelena Mirkovic and Peter Reiher. “A taxonomy of DDoS attack and DDoS defense mechanisms”.
In: ACM SIGCOMM Computer Communication Review 34.2 (2004), pp. 39–53. doi:
https://doi.org/10.1145/997150.997156.
[142] LA Monroe. CenturyLink completes acquisition of Level 3.
https://news.lumen.com/2017-11-01-CenturyLink-completes-acquisition-of-Level-3. 2017.
[143] Giovane C. M. Moura, John Heidemann, Moritz Müller, Ricardo de O. Schmidt, and Marco Davids.
“When the Dike Breaks: Dissecting DNS Defenses During DDoS”. In: Proceedings of the ACM
Internet Measurement Conference. Oct. 2018. doi: https://doi.org/10.1145/3278532.3278534.
[144] Giovane C. M. Moura, Ricardo de O. Schmidt, John Heidemann, Wouter B. de Vries,
Moritz Müller, Lan Wei, and Christian Hesselman. “Anycast vs DDoS: Evaluating the November
2015 Root DNS Event”. In: Proceedings of the ACM Internet Measurement Conference. Nov. 2016.
doi: http://dx.doi.org/10.1145/2987443.2987446.
[145] Giovane CM Moura, John Heidemann, Wes Hardaker, Pithayuth Charnsethikul, Jeroen Bulten,
João M Ceron, and Cristian Hesselman. “Old but gold: prospecting TCP to engineer and live
monitor DNS anycast”. In: International Conference on Passive and Active Network Measurement.
Springer. 2022, pp. 264–292.
[146] Giovane CM Moura, John Heidemann, Ricardo de O Schmidt, and Wes Hardaker. “Cache me if
you can: Effects of DNS Time-to-Live”. In: Proceedings of the Internet Measurement Conference.
2019, pp. 101–115. doi: https://doi.org/10.1145/3355369.3355568.
[147] Tomek Mrugalski, Marcin Siodelski, Bernie Volz, Andrew Yourtchenko, Michael Richardson,
Sheng Jiang, Ted Lemon, and Timothy Winters. Dynamic Host Configuration Protocol for IPv6
(DHCPv6). RFC 8415. Nov. 2018. doi: 10.17487/RFC8415.
[148] Ayman Mukaddam, Imad Elhajj, Ayman Kayssi, and Ali Chehab. “IP Spoofing Detection Using
Modified Hop Count”. In: 2014 IEEE 28th International Conference on Advanced Information
Networking and Applications. 2014, pp. 512–516. doi: 10.1109/AINA.2014.62.
[149] Cristian Munteanu, Oliver Gasser, Ingmar Poese, Georgios Smaragdakis, and Anja Feldmann.
“Enabling Multi-hop ISP-Hypergiant Collaboration”. In: Proceedings of the Applied Networking
Research Workshop. 2023, pp. 54–59.
[150] Austin Murdock, Frank Li, Paul Bramsen, Zakir Durumeric, and Vern Paxson. “Target generation
for internet-wide IPv6 scanning”. In: Proceedings of the 2017 Internet Measurement Conference.
2017, pp. 242–253. doi: https://doi.org/10.1145/3131365.3131405.
[151] Priyadarsi Nanda and AJ Simmonds. “A scalable architecture supporting QoS guarantees using
traffic engineering and policy based routing in the Internet”. In: International Journal of
Communications, Network and System Sciences (2009).
[152] Arvind Narayanan, Eman Ramadan, Jason Carpenter, Qingxu Liu, Yu Liu, Feng Qian, and
Zhi-Li Zhang. “A first look at commercial 5G performance on smartphones”. In: Proceedings of
The Web Conference 2020. 2020, pp. 894–905. doi: https://doi.org/10.1145/3366423.3380169.
[153] Arvind Narayanan, Xumiao Zhang, Ruiyang Zhu, Ahmad Hassan, Shuowei Jin, Xiao Zhu,
Xiaoxuan Zhang, Denis Rybkin, Zhengxuan Yang, Zhuoqing Morley Mao, et al. “A variegated
look at 5G in the wild: performance, power, and QoE implications”. In: Proceedings of the 2021
ACM SIGCOMM 2021 Conference. 2021, pp. 610–625. doi:
https://doi.org/10.1145/3452296.3472923.
[154] Dr. Thomas Narten, Richard P. Draves, and Suresh Krishnan. Privacy Extensions for Stateless
Address Autoconfiguration in IPv6. RFC 4941. Sept. 2007. doi: 10.17487/RFC4941.
[155] Dr. Thomas Narten, Tatsuya Jinmei, and Dr. Susan Thomson. IPv6 Stateless Address
Autoconfiguration. RFC 4862. Sept. 2007. doi: 10.17487/RFC4862.
[156] T. Narten, G. Huston, and L. Roberts. IPv6 Address Assignment to End Sites. RFC 6177. Internet
Request For Comments, Mar. 2011. url: ftp://ftp.rfc-editor.org/in-notes/rfc6177.txt.
[157] Arbor Network. NETSCOUT Arbor’s 13th Annual Worldwide Infrastructure Security Report.
https://pages.arbornetworks.com/rs/082-KNA087/images/13th_Worldwide_Infrastructure_Security_Report.pdf. [Online; accessed 31-May-2019].
2019.
[158] University of New Hampshire. Inter Operability Laboratory Testing.
https://www.iol.unh.edu/testing. [Online; accessed 17-Dec-2023]. 2023.
[159] Lily Hay Newman. GitHub Survived The Biggest DDoS Attack Ever Recorded.
https://www.wired.com/story/github-ddos-memcached/. [Online; accessed 19-March-2018]. 2018.
[160] Mehdi Nikkhah and Roch Guérin. “Migrating the internet to IPv6: An exploration of the when
and why”. In: IEEE/ACM Transactions on Networking 24.4 (2015), pp. 2291–2304.
[161] Erik Nygren. At 21 Tbps, Reaching New Levels Of IPv6 Traffic!
https://blogs.akamai.com/2020/02/at-21-tbps-reaching-new-levels-of-ipv6-traffic.html.
[Online; accessed 15-March-2021]. 2020.
[162] Katia Obraczka and Fabio Silva. “Network latency metrics for server proximity”. In:
Globecom’00-IEEE. Global Telecommunications Conference. Conference Record (Cat. No. 00CH37137).
Vol. 1. IEEE. 2000, pp. 421–427. doi: 10.1109/GLOCOM.2000.892040.
[163] Philippe Oechslin. “Making a Faster Cryptanalytic Time-Memory Trade-Off”. In: Proceedings of
the IACR CRYPTO. Vol. 2729. International Association for Cryptologic Research, Aug. 2003,
pp. 617–630. doi: http://dx.doi.org/10.1007/978-3-540-45146-4_36.
[164] Georgios Oikonomou and Jelena Mirkovic. “Modeling human behavior for defense against
flash-crowd attacks”. In: 2009 IEEE International Conference on Communications. IEEE. 2009,
pp. 1–6. doi: 10.1109/ICC.2009.5199191.
[165] Ruxandra F Olimid and Gianfranco Nencioni. “5G network slicing: A security overview”. In: IEEE
Access 8 (2020), pp. 99999–100009.
[166] B-Root Operators. B-Root Statement of Operational Principles. Web page
https://b.root-servers.org/statements/operation.html. 2008.
[167] Root Server Operators. Events of 2015-11-30.
https://root-servers.org/media/news/events-of-20151130.txt. [Online; accessed 12-Oct-2021].
2015.
[168] Root Server Operators. Events of 2016-06-25.
https://root-servers.org/media/news/events-of-20160625.txt. [Online; accessed 12-Oct-2021].
2016.
[169] Jim Owens and Jeanna Matthews. “A study of passwords and methods used in brute-force SSH
attacks”. In: USENIX Workshop on Large-Scale Exploits and Emergent Threats (LEET). Citeseer.
2008.
[170] Ramakrishna Padmanabhan, John P Rula, Philipp Richter, Stephen D Strowes, and
Alberto Dainotti. “DynamIPs: Analyzing address assignment practices in IPv4 and IPv6”. In:
Proceedings of the 16th international conference on emerging networking experiments and
technologies. 2020, pp. 55–70. doi: https://doi.org/10.1145/3386367.3431314.
[171] Linux Manual Page. tc(8). https://man7.org/linux/man-pages/man8/tc.8.html. [Online; accessed
27-May-2024]. 2024.
[172] Jeffrey Pang, James Hendricks, Aditya Akella, Roberto De Prisco, Bruce Maggs, and
Srinivasan Seshan. “Availability, Usage, and Deployment Characteristics of the Domain Name
System”. In: Proceedings of the ACM Internet Measurement Conference. Taormina, Sicily, Italy:
ACM, Oct. 2004, pp. 123–137. doi: https://doi.org/10.1145/1028788.1028790.
[173] Craig Partridge, Trevor Mendez, and Walter Milliken. Host anycasting service. Tech. rep. 1546.
RFC Editor, 1993. url: https://www.rfc-editor.org/rfc/rfc1546.txt.
[174] Imtiaz Parvez, Ali Rahmati, Ismail Guvenc, Arif I Sarwat, and Huaiyu Dai. “A survey on low
latency towards 5G: RAN, core network and caching solutions”. In: IEEE Communications Surveys
& Tutorials 20.4 (2018), pp. 3098–3130.
[175] Tao Peng, Christopher Leckie, and Kotagiri Ramamohanarao. “Proactively detecting distributed
denial of service attacks using source IP address monitoring”. In: International Conference on
Research in Networking. Springer. 2004, pp. 771–782.
[176] Larry Peterson and Oguz Sunay. Basic Architecture. https://5g.systemsapproach.org/arch.html.
[Online; accessed 30-June-2021]. 2021.
[177] James Robert Pogge and Stephen Scott. “Enabling the Edge-A method for dynamic virtualizable
connections for 5G deployments”. In: Advances in Science, Technology and Engineering Systems
Journal 4.2 (2019), pp. 270–279.
[178] The Canadian Press. Canadian communications company VoIP.ms hit by cyber attack.
https://www.thestar.com/business/2021/09/21/canadian-communications-company-voipms-hit-bycyber-attack.html/. Sept. 2021.
[179] Preshing on Programming. Hash Collision Probabilities.
https://preshing.com/20110504/hash-collision-probabilities/. [Online; accessed
7-November-2021]. 2011.
[180] LANDER project. LANDER:B Root Anomaly-20170306.
https://ant.isi.edu/datasets/readmes/B_Root_Anomaly-20170306.README.txt. [Online; accessed
12-Oct-2021]. 2019.
[181] Tongqing Qiu, Lusheng Ji, Dan Pei, Jia Wang, Jun (Jim) Xu, and Hitesh Ballani. “Locating Prefix
Hijackers using LOCK.” In: USENIX Security Symposium. 2009, pp. 135–150.
[182] Qualcomm. Everything you need to know about 5G. https://www.qualcomm.com/5g/what-is-5g.
[Online; accessed 20-August-2023]. 2023.
[183] Bruno Quoitin, Cristel Pelsser, Olivier Bonaventure, and Steve Uhlig. “A performance evaluation
of BGP-based traffic engineering”. In: International journal of network management 15.3 (2005),
pp. 177–191.
[184] Bruno Quoitin, Cristel Pelsser, Louis Swinnen, Olivier Bonaventure, and Steve Uhlig.
“Interdomain traffic engineering with BGP”. In: IEEE Communications magazine 41.5 (2003),
pp. 122–128.
[185] Sivaramakrishnan Ramanathan, Jelena Mirkovic, and Minlan Yu. “Blag: Improving the accuracy
of blacklists”. In: NDSS. 2020.
[186] Roland van Rijswijk-Deij, Anna Sperotto, and Aiko Pras. “DNSSEC and its potential for DDoS
attacks: a comprehensive measurement study”. In: Proceedings of the 2014 Conference on Internet
Measurement Conference. ACM. 2014, pp. 449–460. doi: https://doi.org/10.1145/2663716.2663731.
[187] RIPE. Measurements. https://atlas.ripe.net/measurements/10310/. [Online; accessed
12-Oct-2021].
[188] RIPE. Root DNS Observations. Measurement ID 1009 (A-Root), 1010 (B-Root), etc. 2021.
[189] RIPE Network Coordination Centre. RIPE - Routing Information Service (RIS).
https://www.ripe.net/analyse/internet-measurements/routing-information-service-ris.
2020.
[190] A S M Rizvi, Leandro Bertholdo, João Ceron, and John Heidemann. “Anycast Agility: Network
Playbooks to Fight DDoS”. In: 31st USENIX Security Symposium (USENIX Security 22). Boston,
MA: USENIX Association, Aug. 2022, pp. 4201–4218. isbn: 978-1-939133-31-1. url:
https://www.usenix.org/conference/usenixsecurity22/presentation/rizvi.
[191] A S M Rizvi and John Heidemann. “Chhoyhopper: A Moving Target Defense with IPv6”. In:
Proceedings of the IEEE Workshop on Measurements, Attacks, and Defenses for the Web (MADWeb).
San Diego, California, USA: IEEE, Apr. 2022, to appear. doi:
https://dx.doi.org/10.14722/madweb.2022.23004.
[192] ASM Rizvi, Leandro M Bertholdo, João Ceron, and John Heidemann. “Artifacts-Anycast Agility:
Network Playbooks to Fight DDoS”. In: (2022). url: https://zenodo.org/records/6505557.
[193] ASM Rizvi and John Heidemann. Chhoyhopper: moving target defense in IPv6.
https://ant.isi.edu/software/chhoyhopper/index.html. [Online; accessed 20-May-2024]. 2022.
[194] ASM Rizvi and John Heidemann. Verfploeter/plotter: visualization of anycast catchements.
https://ant.isi.edu/software/verfploeter/plotter/index.html. [Online; accessed 20-May-2024].
2019.
[195] ASM Rizvi, John Heidemann, and Jelena Mirkovic. Dynamically Selecting Defenses to DDoS for
DNS (extended). Tech. rep. ISI-TR-736. USC/Information Sciences Institute, May 2019. url:
https://www.isi.edu/%7ejohnh/PAPERS/Rizvi19a.html.
[196] ASM Rizvi, Tingshan Huang, Rasit Esrefoglu, and John Heidemann. “Anycast Polarization in the
Wild”. In: International Conference on Passive and Active Network Measurement. Springer. 2024,
pp. 104–131.
[197] ASM Rizvi, Jelena Mirkovic, John Heidemann, Wesley Hardaker, and Robert Story. “Defending
root DNS servers against DDoS using layered defenses”. In: 2023 15th International Conference on
COMmunication Systems & NETworkS (COMSNETS). IEEE. 2023, pp. 513–521. doi:
10.1109/COMSNETS56262.2023.10041415.
[198] ASM Rizvi, Jelena Mirkovic, John Heidemann, Wesley Hardaker, and Robert Story. “Defending
Root DNS Servers against DDoS Using Layered Defenses (Extended)”. In: Ad Hoc Networks 151
(2023), p. 103259.
[199] A root. rcode-volume. https://a.root-servers.org/rssac-metrics/raw/2022/01/rcode-volume/.
[Online; accessed 24-Jan-2022]. 2022.
[200] B root. rcode-volume. https://b.root-servers.org/rssac/2022/01/rcode-volume/. [Online; accessed
24-Jan-2022]. 2022.
[201] Erik Rye, Robert Beverly, and Kimberly C Claffy. “Follow the scent: Defeating IPv6 prefix rotation
privacy”. In: Proceedings of the 21st ACM Internet Measurement Conference. 2021, pp. 739–752.
[202] Sandeep Sarat, Vasileios Pappas, and Andreas Terzis. “On the use of anycast in DNS”. In:
Proceedings of 15th International Conference on Computer Communications and Networks. IEEE.
2006, pp. 71–78.
[203] Johann Schlamp, Ralph Holz, Quentin Jacquemart, Georg Carle, and Ernst W Biersack. “HEAP:
reliable assessment of BGP hijacking attacks”. In: IEEE Journal on Selected Areas in
Communications 34.6 (2016), pp. 1849–1861.
[204] Brandon Schlinker, Todd Arnold, Italo Cunha, and Ethan Katz-Bassett. “PEERING: Virtualizing
BGP at the Edge for Research”. In: Proc. ACM CoNEXT. Orlando, FL, Dec. 2019. doi:
https://doi.org/10.1145/3359989.3365414.
[205] Brandon Schlinker, Hyojeong Kim, Timothy Cui, Ethan Katz-Bassett, Harsha V. Madhyastha,
Italo Cunha, James Quinn, Saif Hasan, Petr Lapukhov, and Hongyi Zeng. “Engineering Egress
with Edge Fabric: Steering Oceans of Content to the World”. In: Proceedings of the ACM
SIGCOMM Conference. Los Angeles, CA, USA: ACM, Aug. 2017, pp. 418–431. doi:
https://doi.org/10.1145/3098822.3098853.
[206] Ricardo de O. Schmidt, John Heidemann, and Jan Harm Kuipers. “Anycast Latency: How Many
Sites Are Enough?” In: International Conference on Passive and Active Network Measurement.
Sydney, Australia, Mar. 2017, pp. 188–200. url:
https://www.isi.edu/%7ejohnh/PAPERS/Schmidt17a.html.
[207] Bruce Schneier. Lessons From the Dyn DDoS Attack.
https://www.schneier.com/blog/archives/2016/11/lessons_from_th_5.html. [Online; accessed
21-June-2018]. 2016.
[208] Thomas Bradley Scholl. Methods and apparatus for distributed backbone internet DDOS mitigation
via transit providers. US Patent 8,949,459. Feb. 2015.
[209] Kyle Schomp, Onkar Bhardwaj, Eymen Kurdoglu, Mashooq Muhaimen, and Ramesh K Sitaraman.
“Akamai DNS: Providing Authoritative Answers to the World’s Queries”. In: Proceedings of the
Annual conference of the ACM Special Interest Group on Data Communication on the applications,
technologies, architectures, and protocols for computer communication. 2020, pp. 465–478. doi:
https://doi.org/10.1145/3387514.3405881.
[210] Pavlos Sermpezis, Vasileios Kotronis, Alberto Dainotti, and Xenofontas Dimitropoulos. “A survey
among network operators on BGP prefix hijacking”. In: ACM SIGCOMM Computer
Communication Review 48.1 (2018), pp. 64–69. doi: https://doi.org/10.1145/3211852.3211862.
[211] Anant Shah, Romain Fontugne, and Christos Papadopoulos. “Towards characterizing
international routing detours”. In: Proceedings of the 12th Asian Internet Engineering Conference.
2016, pp. 17–24. doi: https://doi.org/10.1145/3012695.3012698.
[212] A. Shaikh, R. Tewari, and M. Agrawal. “On the effectiveness of DNS-based server selection”. In:
Proceedings IEEE INFOCOM 2001. Conference on Computer Communications. Twentieth Annual
Joint Conference of the IEEE Computer and Communications Society (Cat. No.01CH37213). Vol. 3.
2001, 1801–1810 vol.3. doi: 10.1109/INFCOM.2001.916678.
[213] Pavitra Shankdhar. Popular tools for brute-force attacks [updated for 2020].
https://resources.infosecinstitute.com/topic/popular-tools-for-brute-force-attacks/.
[Online; accessed 20-Sep-2022]. 2020.
[214] AX Sharma. Phone calls disrupted by ongoing DDoS cyber attack on VOIP.ms.
https://arstechnica.com/gadgets/2021/09/canadian-voip-provider-hit-by-ddos-attack-phonecalls-disrupted/. Sept. 2021.
[215] Xingang Shi, Yang Xiang, Zhiliang Wang, Xia Yin, and Jianping Wu. “Detecting prefix hijackings
in the internet with argus”. In: Proceedings of the 2012 Internet Measurement Conference. 2012,
pp. 15–28.
[216] R. B. da Silva and E. Souza Mota. “A Survey on Approaches to Reduce BGP Interdomain Routing
Convergence Delay on the Internet”. In: IEEE Communications Surveys & Tutorials 19.4 (2017),
pp. 2949–2984. issn: 1553-877X. doi: 10.1109/COMST.2017.2722380.
[217] Daniel Smith. The Growth of DDoS-as-a-Service: Stresser Services.
https://blog.radware.com/security/2017/09/growth-of-ddos-as-a-service-stresser-services/.
[Online; accessed 12-Oct-2021]. 2017.
[218] Donald J Smith, Michael Glenn, John A Schiel, and Christopher L Garner. Network traffic data
scrubbing with services offered via anycasted addresses. US Patent 9,350,706. May 2016.
[219] Jared M. Smith and Max Schuchard. “Routing around congestion: Defeating DDoS attacks and
adverse network conditions via reactive BGP routing”. In: 2018 IEEE Symposium on Security and
Privacy (SP). IEEE. 2018, pp. 599–617. doi: 10.1109/SP.2018.00032.
[220] Job Snijders. “Practical everyday BGP filtering with AS_PATH filters: Peer Locking”. In:
NANOG-67, Chicago, June (2016).
[221] Opera Software. SKA - SSH Key Authority. https://github.com/operasoftware/ssh-key-authority.
[Online; accessed 09-July-2021]. 2021.
[222] Raffaele Sommese, Leandro Bertholdo, Gautam Akiwate, Mattijs Jonker,
Roland van Rijswijk-Deij, Alberto Dainotti, KC Claffy, and Anna Sperotto. “MAnycast2: Using
Anycast to Measure Anycast”. In: Proceedings of the ACM Internet Measurement Conference. IMC
’20. Virtual Event, USA: Association for Computing Machinery, 2020, pp. 456–463. isbn:
9781450381383. doi: 10.1145/3419394.3423646.
[223] Oliver Spatscheck, Zakaria Al-Qudah, Seunjoon Lee, Michael Rabinovich, and
Jacobus Van Der Merwe. Multi-autonomous system anycast content delivery network. US Patent
8,607,014. Dec. 2013.
[224] Vilas Sridharan and Dean Liberty. “A study of DRAM failures in the field”. In: Proceedings of the
ACM SuperComputing. Salt Lake City, Utah, USA: ACM, Nov. 2012, pp. 1–11. doi:
10.5555/2388996.2389100.
[225] RIPE NCC Staff. “RIPE Atlas: A global internet measurement network”. In: Internet Protocol
Journal 18.3 (2015).
[226] One Step. BGP Community Guides. https://onestep.net/communities/. [Online; accessed
12-Oct-2021].
[227] Minho Sung and Jun Xu. “IP traceback-based intelligent packet filtering: a novel technique for
defending against Internet DDoS attacks”. In: IEEE Transactions on Parallel and Distributed
Systems (TPDS) 14.9 (2003), pp. 861–872. doi: 10.1109/TPDS.2003.1233709.
[228] Eric Sven-Johan Swildens, Zaide Liu, and Richard David Day. Global traffic management system
using IP anycast routing and dynamic load-balancing. US Patent 7,904,541. Mar. 2011.
[229] Wee Lum Tan, Fung Lam, and Wing Cheong Lau. “An empirical study on the capacity and
performance of 3G networks”. In: IEEE Transactions on Mobile Computing 7.6 (2008), pp. 737–750.
doi: 10.1109/TMC.2007.70788.
[230] Rajat Tandon, Jelena Mirkovic, and Pithayuth Charnsethikul. “Quantifying Cloud Misbehavior”.
In: 2020 IEEE 9th International Conference on Cloud Networking (CloudNet). IEEE. 2020, pp. 1–8.
doi: 10.1109/CloudNet51028.2020.9335812.
[231] Rajat Tandon, Abhinav Palia, Jaydeep Ramani, Brandon Paulsen, Genevieve Bartlett, and
Jelena Mirkovic. “Defending Web Servers Against Flash Crowd Attacks”. In: International
Conference on Applied Cryptography and Network Security. Springer. 2021, pp. 338–361. isbn:
978-3-030-78375-4.
[232] Renata Teixeira, Steve Uhlig, and Christophe Diot. “BGP route propagation between neighboring
domains”. In: International Conference on Passive and Active Network Measurement. Springer. 2007,
pp. 11–21.
[233] The Guardian. DDoS attack that disrupted internet was largest of its kind in history, experts say.
https://www.theguardian.com/technology/2016/oct/26/ddos-attack-dyn-mirai-botnet. [Online;
accessed 24-October-2021]. 2016.
[234] Roshan Thomas, Brian Mark, Tommy Johnson, and James Croall. “NetBouncer:
client-legitimacy-based high-performance DDoS filtering”. In: DARPA Information Survivability
Conference and Exposition, 2003. Proceedings. Vol. 1. IEEE. 2003, pp. 14–25.
[235] Alethea Toh. Azure DDoS Protection—2021 Q1 and Q2 DDoS attack trends. https:
//azure.microsoft.com/en-us/blog/azure-ddos-protection-2021-q1-and-q2-ddos-attack-trends/.
[Online; accessed 23-October-2021]. 2021.
[236] Muoi Tran, Min Suk Kang, Hsu-Chun Hsiao, Wei-Hsuan Chiang, Shu-Po Tung, and Yu-Su Wang.
“On the feasibility of rerouting-based DDoS defenses”. In: 2019 IEEE Symposium on Security and
Privacy (S&P). IEEE. 2019, pp. 1169–1184. doi: 10.1109/SP.2019.00055.
[237] Krassimir Tzvetanov. DDoS Mitigation Tutorial NANOG 69.
https://www.nanog.org/sites/default/files/DDoSTutorial-NANOG69-v3.pdf. [Online; accessed
31-Jan-2018]. 2017.
[238] Johanna Ullrich, Katharina Krombholz, Heidelinde Hobel, Adrian Dabrowski, and Edgar Weippl.
“IPv6 security: attacks and countermeasures in a nutshell”. In: 8th {USENIX} Workshop on
Offensive Technologies (WOOT). 2014.
[239] University of Oregon. Route Views Project. http://www.routeviews.org/routeviews/. 2021.
[240] USC/ISI. DDoS Defense In Depth for DNS (DDIDD) Tools.
https://ant.isi.edu/software/ddidd/index.html. [Online; accessed 24-Nov-2022]. 2022.
[241] USC/ISI. USC/ISI ANT Datasets. https://ant.isi.edu/datasets/all.html. [Online; accessed
12-Oct-2021]. 2019.
[242] Pierre-Antoine Vervier, Olivier Thonnard, and Marc Dacier. “Mind Your Blocks: On the
Stealthiness of Malicious BGP Hijacks.” In: The Network and Distributed System Security (NDSS)
Symposium. 2015.
[243] Paul Vixie, Gerry Sneeringer, and Mark Schleifer. Events of 21-Oct-2002. web page
http://c.root-servers.org/october21.txt. Nov. 2002.
[244] Haining Wang, Cheng Jin, and Kang G Shin. “Defense against spoofed IP traffic using hop-count
filtering”. In: IEEE/ACM Transactions on Networking (ToN) 15.1 (2007), pp. 40–53.
[245] Jessica Wei. Why is a /48 the recommended minimum prefix size for routing? https:
//blog.apnic.net/2020/06/01/why-is-a-48-the-recommended-minimum-prefix-size-for-routing/.
[Online; accessed 15-March-2021]. 2020.
[246] Lan Wei and John Heidemann. “Does Anycast Hang Up On You?” In: 2017 Network Traffic
Measurement and Analysis Conference (TMA). Dublin, Ireland: IEEE, July 2017, pp. 1–9. doi:
https://doi.org/10.23919/TMA.2017.8002905.
[247] Fernanda Weiden and Peter Frost. “Anycast as a load balancing feature”. In: Proceedings of the
24th International Conference on Large Installation System Administration. USENIX Association.
2010, pp. 1–6.
[248] D. Wessels and M. Fomenkov. “Wow, That’s a lot of packets”. In: Passive and Active Network
Measurement Workshop (PAM). San Diego, CA: PAM, Apr. 2003.
[249] Curt Wilson. Attack of the Shuriken: Many Hands, Many Weapons.
https://www.arbornetworks.com/blog/asert/ddos-tools/. [Online; accessed 12-Oct-2021]. 2012.
[250] Florian Wohlfart, Nikolaos Chatzis, Caglar Dabanoglu, Georg Carle, and Walter Willinger.
“Leveraging interconnections for performance: the serving infrastructure of a large CDN”. In:
Proceedings of the 2018 Conference of the ACM Special Interest Group on Data Communication.
2018, pp. 206–220. doi: https://doi.org/10.1145/3230543.3230576.
[251] World Wide Web Consortium (W3C). Referrer Policy. https://www.w3.org/TR/referrer-policy/.
[Online; accessed 13-Mar-2022]. 2017.
[252] Eric Wustrow, Manish Karir, Michael Bailey, Farnam Jahanian, and Geoff Huston. “Internet
Background Radiation Revisited”. In: Proceedings of the 10th ACM Internet Measurement
Conference. Melbourne, Australia: ACM, Nov. 2010, pp. 62–73. doi:
https://doi.org/10.1145/1879141.1879149.
[253] Jun Xu, Jinliang Fan, Mostafa H Ammar, and Sue B Moon. “Prefix-preserving IP address
anonymization: Measurement-based security evaluation and a new cryptography-based scheme”.
In: 10th IEEE International Conference on Network Protocols, 2002. Proceedings. IEEE. 2002,
pp. 280–289. doi: 10.1109/ICNP.2002.1181415.
[254] Yin Xu, Zixiao Wang, Wai Kay Leong, and Ben Leong. “An end-to-end measurement study of
modern cellular data networks”. In: International Conference on Passive and Active Network
Measurement. Springer. 2014, pp. 34–45.
[255] Abraham Yaar, Adrian Perrig, and Dawn Song. “StackPi: New Packet Marking and Filtering
Mechanisms for DDoS And IP Spoofing Defense”. In: IEEE Journal on Selected Areas in
Communications 24.10 (2006), pp. 1853–1863. doi: 10.1109/JSAC.2006.877138.
[256] Omer Yoachimik. Who DDoS’d Austin? https://blog.cloudflare.com/who-ddosd-austin/. [Online;
accessed 02-Dec-2019]. 2019.
[257] MyungKeun Yoon. “Using whitelisting to mitigate DDoS attacks on critical Internet sites”. In:
IEEE Communications Magazine 48.7 (2010). doi: 10.1109/MCOM.2010.5496886.
[258] Xinjie Yuan, Mingzhou Wu, Zhi Wang, Yifei Zhu, Ming Ma, Junjian Guo, Zhi-Li Zhang, and
Wenwu Zhu. “Understanding 5G performance for real-world services: a content provider’s
perspective”. In: Proceedings of the ACM SIGCOMM 2022 Conference. 2022, pp. 101–113. doi:
https://doi.org/10.1145/3544216.3544219.
[259] ZD Net. Cloudflare says it stopped the largest DDoS attack ever reported. https:
//www.zdnet.com/article/cloudflare-says-it-stopped-the-largest-ddos-attack-ever-reported/.
[Online; accessed 7-Oct-2021]. 2020.
[260] Kim Zetter. How Cops Can Secretly Track Your Phone. https:
//theintercept.com/2020/07/31/protests-surveillance-stingrays-dirtboxes-phone-tracking/.
[Online; accessed 24-Feb-2022]. 2020.
[261] Shunliang Zhang. “An overview of network slicing for 5G”. In: IEEE Wireless Communications
26.3 (2019), pp. 111–117.
[262] Zesen Zhang, Alexander Marder, Ricky Mok, Bradley Huffaker, Matthew Luckie, KC Claffy, and
Aaron Schulman. “Inferring regional access network topologies: methods and applications”. In:
Proceedings of the 21st ACM Internet Measurement Conference. 2021, pp. 720–738.
[263] Zheng Zhang, Ying Zhang, Y Charlie Hu, Z Morley Mao, and Randy Bush. “iSPY: Detecting IP
prefix hijacking on my own”. In: Proceedings of the ACM SIGCOMM 2008 conference on Data
Communication. 2008, pp. 327–338.
[264] Changxi Zheng, Lusheng Ji, Dan Pei, Jia Wang, and Paul Francis. “A light-weight distributed
scheme for detecting IP prefix hijacks in real-time”. In: ACM SIGCOMM Computer Communication
Review 37.4 (2007), pp. 277–288.
[265] Minyuan Zhou, Xiao Zhang, Shuai Hao, Xiaowei Yang, Jiaqi Zheng, Guihai Chen, and
Wanchun Dou. “Regional IP Anycast: Deployments, Performance, and Potentials”. In: Proceedings
of the ACM SIGCOMM 2023 Conference. 2023, pp. 917–931. doi:
https://doi.org/10.1145/3603269.3604846.
[266] Liang Zhu and John Heidemann. DNSanon: extract DNS traffic from pcap to text with optional
anonymization. https://ant.isi.edu/software/dnsanon/index.html. [Online; accessed 20-Jan-2018]. 2017.
[267] Liang Zhu and John Heidemann. “LDplayer: DNS Experimentation at Scale”. In: Proceedings of the
Internet Measurement Conference 2018. ACM. 2018, pp. 119–132. doi:
https://doi.org/10.1145/3278532.3278544.
[268] Liang Zhu, Zi Hu, John Heidemann, Duane Wessels, Allison Mankin, and Nikita Somaiya.
“Connection-oriented DNS to improve privacy and security”. In: 2015 IEEE Symposium on Security
and Privacy (S&P). IEEE. 2015, pp. 171–186. doi: 10.1109/SP.2015.18.
Abstract
Service disruption is costly on today's Internet, harming enterprise profits, reputation, and user satisfaction. We define service disruption as any targeted interruption, caused by malicious parties, of regular user-to-service interactions and functionality that degrades service performance and user experience. In this thesis, we propose new measurement-driven methods that mitigate service-disrupting attacks without changing existing Internet protocols. Although our methods do not guarantee defense against every attack type, our example defense systems show that they generalize to diverse attacks. To validate our thesis, we demonstrate defense systems against three disruptive attack types. First, we mitigate Distributed Denial-of-Service (DDoS) attacks that target an online service. Second, we handle brute-force password attacks that target the users of a service. Third, we detect malicious routing detours to secure the path from users to the server. We provide the first public description of anycast- and filtering-based DDoS defenses for network operators. We then present the first moving-target defense that uses IPv6 to defeat password attacks. We also demonstrate how regular observation of latency helps cellular users, carriers, and national agencies find malicious routing detours. As a supplemental outcome, we show that measurement is effective at finding performance issues and at identifying ways to improve services using existing protocols. These examples show that our approach applies to different parts of the network, even though it may not mitigate every attack type.
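To make the moving-target idea concrete, below is a minimal sketch (in Python) of one way a server and its authorized clients could agree on a time-varying IPv6 address derived from a shared secret, so that password-guessing attackers never find a stable address to target. This is an illustration only, not the dissertation's Chhoyhopper implementation; the prefix, secret, rotation interval, and the active_address() helper are hypothetical placeholders.

import hashlib
import hmac
import ipaddress
import time

# Hypothetical parameters, for illustration only.
PREFIX = ipaddress.IPv6Network("2001:db8:1234:5678::/64")  # example service prefix
SECRET = b"example-shared-secret"                          # shared with clients out of band
INTERVAL = 60                                              # rotate the active address every 60 seconds

def active_address(now=None):
    # Compute the address for the current time slot: HMAC the slot number with
    # the shared secret and use 64 bits of the digest as the interface identifier.
    slot = int((time.time() if now is None else now) // INTERVAL)
    digest = hmac.new(SECRET, str(slot).encode(), hashlib.sha256).digest()
    iid = int.from_bytes(digest[:8], "big")
    return PREFIX[iid]  # the iid-th address within the /64 prefix

if __name__ == "__main__":
    print("current active address:", active_address())

Both endpoints compute the same address as long as their clocks are loosely synchronized; the server listens only on the current slot's address (and perhaps the previous one, to tolerate clock skew), so a scanner that learned an earlier address finds nothing listening there.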