Improving network reliability using a formal definition of the Internet core
by
Guillermo P. Baltra
A Dissertation Presented to the
FACULTY OF THE USC GRADUATE SCHOOL
UNIVERSITY OF SOUTHERN CALIFORNIA
In Partial Fulfillment of the
Requirements for the Degree
DOCTOR OF PHILOSOPHY
(COMPUTER SCIENCE)
December 2023
Copyright 2023 Guillermo P. Baltra
Dedication
To my wife Gabriela, who has been my most profound inspiration.
To my children Antonia, Guillermo, Victoria and Clemente, who have been the source of my unyielding
motivation.
Acknowledgements
This dissertation has been made possible with the guidance and encouragement of many individuals,
to whom I offer my profound gratitude. John Heidemann, for his guidance and patient support as my
advisor. His ability to see the big picture with incredible attention to detail kept me on the right track.
Ramesh Govindan and Antonio Ortega for serving on my defense committee, and Kristina Lerman and
Muhammad Naveed for serving on my thesis proposal committee, whose ideas and feedback helped me
sharpen this work. Ali Khayam, for inviting me to work together with him and his amazing team
early on in my studies, and for our lengthy discussions that allowed me to understand applications of
this research. Robert Beverly, for introducing me to the network measurements field.
I am also thankful to my colleagues and collaborators at the ANT Lab, for sharing ideas and giving
feedback about my research right from the beginning: Calvin Ardi, Asma Enayet, Hang Guo, Wes Hardaker,
Basileal Imana, Aqib Nisar, Yuri Pradkin, Abdul Qadeer, ASM Rizvi, Xiao Song, Robert Story, Lan Wei, and
Liang Zhu.
I would like to extend my sincere thanks to the Armada de Chile; Agencia Nacional de Investigación y
Desarrollo de Chile (ANID); USC Viterbi; the National Science Foundation, CISE Directorate, awards CNS-1806785, CNS-2007106 and NSF-2028279; the Department of Homeland Security (DHS) Science and Technology Directorate, Cyber Security Division (DHS S&T/CSD) via contract number 70RSAT18CB0000014;
and the Air Force Research Laboratory (AFRL) under agreement number FA8750-18-2-0280 for financially supporting this work.
Special thanks to Philipp Richter and Arthur Berger for discussions about their work, and to Philipp for
re-running his comparison with CDN data. John Wroclawski, Ramakrishna Padmanabhan, Eddie Kohler,
the Internet Architecture Board, and the Human Rights Protocol Considerations Research Group of the
Internet Engineering Task Force for their input on an early version of the partial outages paper.
Many thanks to Trinocular prober hosts Colorado State University (CSU), Keio University, Athens
University of Economics and Business, and SURFNet.
My parents, for your encouragement and help, and for always being there for me. Your example of
hard work has always kept me moving forward.
Gabriela, for your love, encouragement, and support. Antonia, Guillermo, Victoria and Clemente, for
sharing your dad’s time with this project. I will make it up to you.
Table of Contents
Dedication
Acknowledgements
List of Tables
List of Figures
Abstract

Chapter 1: Introduction
1.1 Challenges
1.2 The Internet Core
1.3 Thesis Overview
1.4 Thesis Statement
1.5 Demonstrating the Thesis Statement
1.6 Research Contributions

Chapter 2: Improving Coverage of Internet Outage Detection
2.1 Introduction
2.2 Challenges to Broad Coverage
2.2.1 Problem: Sparse Blocks
2.2.2 Problem: Lone Addresses
2.3 Improving Outage Detection
2.3.1 Full Block Scanning for Sparse Blocks
2.3.2 Lone-Address-Block Recovery
2.4 Evaluation
2.4.1 Full Block Scanning Reduces Noise
2.4.1.1 Case Study of One Block
2.4.1.2 False Outages: Does FBS Remove Noise?
2.4.1.3 True Outages: Does FBS Remove Legitimate Outages?
2.4.1.4 Random Sampling of Outage Events
2.4.2 How Often Do FBS and LABR Change Outages?
2.5 Comparing Trinocular and FBS
2.5.1 Comparing FBS Active and Passive Outages
2.5.2 FBS Effects on Temporal Precision
2.5.3 Increasing Coverage
2.6 Related Work on Improving Internet Coverage
2.7 Study Conclusions

Chapter 3: What is the Internet Core?
3.1 Introduction
3.2 How Do We Define the Internet?
3.2.1 Why Does Defining the Internet Matter?
3.2.2 The Internet: A Conceptual Definition
3.2.3 The Internet Landscape
3.2.3.1 Outages
3.2.3.2 Islands: Isolated Networks
3.2.3.3 Peninsulas: Partial Connectivity
3.3 Detecting Partial Connectivity
3.3.1 Suitable Data Sources
3.3.2 Taitao: a Peninsula Detector
3.3.3 Detecting Country-Level Peninsulas
3.3.4 Chiloe: an Island Detector
3.3.5 Applications
3.4 Validating our Approach
3.4.1 Can Taitao Detect Peninsulas?
3.4.2 Can Taitao Detect Country-Level Peninsulas?
3.4.3 Can Chiloe Detect Islands?
3.5 Quantifying Islands and Peninsulas
3.5.1 How Common are Peninsulas?
3.5.2 Additional Confirmation of the Number of Peninsulas
3.5.3 How Long Do Peninsulas Last?
3.5.4 Additional Confirmation of Peninsula Duration
3.5.5 What Is the Size of Peninsulas?
3.5.6 Additional Confirmation of Size
3.5.7 Where Do Peninsulas Occur?
3.5.8 How Common are Country-Level Peninsulas?
3.5.9 How Common Are Islands?
3.5.10 How Long Do Islands Last?
3.5.11 What Sizes Are Islands?
3.6 Applying These Tools
3.6.1 Policy Applications of the Definition
3.6.2 Can the Internet’s Core Partition?
3.6.3 Reexamining Outages Given Partial Reachability
3.6.3.1 Formally Defining Outages
3.6.3.2 Observed Outage and External Data
3.6.3.3 Are the Sites Independent?
3.6.4 Improving DNSmon Sensitivity
3.7 Related Work on Defining the Internet Core
3.8 Study Conclusions
Appendix 3.A Research Ethics on this Study

Chapter 4: Ebb and Flow: Implications of ISP Address Dynamics
4.1 Introduction
4.2 Implications of Address Dynamics
4.3 Methodology
4.3.1 AS-wide Address Accumulation
4.3.2 Diurnal ISP Detection
4.3.3 ISP Diurnal Detrending (IDD)
4.3.4 ISP Availability Sensing (IAS)
4.3.4.1 Detecting AS-wide Address Stability
4.3.4.2 Detecting Network Changes
4.4 Validation
4.4.1 Mitigating an Observer Encountering Congestion with One-Loss Repair
4.4.2 Does IAS Detect Known ISP Maintenance?
4.4.3 Validating IAS and IDD from RIPE Atlas
4.4.3.1 Atlas as Ground Truth
4.4.3.2 Validating IAS
4.4.3.3 Validating IDD
4.4.4 Does Unmonitored Space Harm IAS?
4.4.5 Choice of Spatial Granularity
4.5 Evaluation
4.5.1 Quantifying ISP Address Dynamics
4.5.2 How Often Does IAS Repair False Outages?
4.5.3 How Many ASes Are Diurnal?
4.5.4 How Much of a Diurnal AS is Diurnal?
4.5.5 Address Space Refactoring
4.6 Related Work
4.7 Study Conclusions
Appendix 4.A Research Ethics on this Study

Chapter 5: Conclusions
5.1 Future Directions
5.2 Conclusions

Bibliography
List of Tables
1.1 Thesis statement demonstration by study
2.1 Coverage comparison in /24 blocks of different measuring approaches
2.2 Confusion matrix of Trinocular down events in random blocks
2.3 Trinocular-detected disruptions in CDN logs. Dataset A28, 2017q2
2.4 IPv4 address space coverage of Trinocular and FBS
3.1 Traces from the Ark VPs before and during the event
3.2 All datasets used in this study
3.3 Trinocular and Ark agreement table. Dataset A30, 2017q4
3.4 Taitao confusion matrix. Dataset A30, 2017q4
3.5 Number of country-specific blocks on the Internet. Dataset A30, 2017q4
3.6 Trinocular U.S.-only blocks. Dataset A30, 2017q4
3.7 Country-specific peninsula detection confusion matrix. Dataset A30, 2017q4
3.8 Chiloe confusion matrix, events between 2017-01-04 and 2020-03-31
3.9 Halt location of failed traceroutes for peninsulas longer than 5 hours. Dataset A41, 2020q3
3.10 U.S.-only blocks. Dataset A30, 2017q4
3.11 Islands detected from 2017-04-01 to 2020-04-01
3.12 RIR IPv4 hosts and IPv6 /32 allocation
3.13 Similarities between sites relative to all six. Dataset: A33, 2018q3
4.1 Atlas VP address change events compared against IAS detection thresholds
4.2 Atlas VP address changes in Trinocular monitored/unmonitored address space
4.3 Active RIPE Atlas Vantage Points during 2020q4
4.4 Number of diurnal networks at different granularities. 2020q4
List of Figures
2.1 A sample sparse block over time
2.2 Comparison of outages per block with their A(E(b)), 2017q4. Dataset A30
2.3 Cumulative distribution function of the A value per block, 2017q4. Dataset A30
2.4 Sample sparse block state compared by Trinocular, FBS and LABR
2.5 Iraqi government-mandated outages, Feb 2–9, 2017
2.6 Outage events during the 7 Iraqi outages, measured by their Â_s^3FR and full-round values
2.7 Comparison of per-block down time and down events between 2FR-FBS and Trinocular
2.8 CDF of down fraction and number of down events between Trinocular and FBS
3.1 A 1-hour island from Trinocular Vantage Point E
3.2 Number of blocks down in the whole responsive Internet. Dataset: A29, 2017q3
3.3 AS-level topology during the Polish peninsula
3.4 BGP update messages sent for affected Polish blocks starting 2017-10-23t20:00Z
3.5 A block showing a 3-hour peninsula. Dataset: A30, 2017q4
3.6 Distribution of block-time fraction over sites reporting all-down, disagreement, and all-up
3.7 Block-time fraction of sites reporting all-down, disagreement and all-up
3.8 Peninsulas measured with per-site down events during 2020q3. Dataset A41
3.9 Peninsulas measured with per-site down events longer than 5 hours. Dataset A30, 2017q4
3.10 Islands detected across 3 years using six Vantage Points (VPs). Datasets A28–A39
3.12 CDF of islands detected by Chiloe for Trinocular and Atlas data
3.13 Ark traceroutes sent to targets under partial outages (2017-10-10 to -31). Dataset A30
3.14 Fraction of VPs observing islands and peninsulas for IPv4 and IPv6 during 2022q3
3.15 Atlas queries from all available VPs to 13 Root Servers for IPv4 and IPv6 on 2022-07-23
4.1 Sample /24 blocks showing users simultaneously shifted to a different block
4.2 MSTL decomposition of AS9829 during 2020q4. Dataset: A42
4.3 Diurnal blocks in AS9829 observed from Trinocular, Los Angeles, October 2020
4.4 Block 0x7b753300, where observer w sees congestion and the others do not
4.5 Block 0x7b753300, where one-loss repair corrects VP w’s congestive loss
4.6 Response rate from sample blocks
4.7 Block 0xb671d400 showing addresses with short usage periods
4.8 Block 0xb671d400 after one-loss repair, showing minimal changes to the all-VP result
4.9 Down events from six observers and ∆t from Los Angeles
4.10 Cumulative distribution of IPv4 address, prefix and AS changes per Atlas VP, 2020q4
4.11 Cumulative distribution of IPv6 address, prefix and AS changes per Atlas VP, 2020q4
4.12 CDF of number of maintenance events at different block thresholds in 2020q4
4.13 CDF of unresponsive duration of blocks before or after an IAS-detected maintenance event
4.14 CDFs of diurnal-ness of all ASes (red) and routable prefixes (blue) in 2020q4
4.15 IPv4 changes by AS, routable prefix and address for Atlas VPs with at least one change
4.16 IPv6 changes by AS, routable prefix, and address for Atlas VPs with at least one change
Abstract
After 50 years, the Internet is still defined as “a collection of interconnected networks”. Yet seamless,
universal connectivity is challenged in several ways. Political pressure threatens fragmentation due to de-peering; architectural changes such as carrier-grade NAT and the cloud make connectivity indirect; firewalls
impede connectivity; and operational problems and commercial disputes all challenge the idea of a single
set of “interconnected networks”. We propose that a new, conceptual definition of the Internet core helps
disambiguate questions in analysis of network reliability and address space usage.
We prove this statement through three studies. First, we improve coverage of outage detection by
dealing with sparse sections of the Internet, increasing coverage from a nominal 67% of responsive /24 blocks
to 96% of the responsive Internet. Second, we provide a new definition of the Internet core, and use it
to resolve partial reachability ambiguities. We show that the Internet today has peninsulas of persistent,
partial connectivity, and that some outages cause islands where the Internet at the site is up, but partitioned
from the main Internet. Finally, we use our definition to identify ISP trends, with applications to policy and
improving outage detection accuracy. We show how these studies together thoroughly prove our thesis
statement. We provide a new conceptual definition of “the Internet core” in our second study about partial
reachability. We use our definition in our first and second studies to disambiguate questions about network
reliability, and in our third study, to analyze ISP address space usage dynamics.
Chapter 1
Introduction
What is the Internet’s core? An “internetwork” was first used to describe a use of an early version of
TCP, but without definition [22]. Postel’s “a collection of interconnected networks is an internet” gives the
ARPAnet and X.25 as examples of internets [88]. The Federal Networking Council defined “Internet” in
1995 as (i) a global address space, (ii) supporting TCP/IP and its follow-ons, that (iii) provides services [46],
with later work considering DNS [63] and IPv6.
Today’s Internet is dramatically different from 1995: users at home and work access the Internet indirectly through Network Address Translation (NAT) [115]. Most access is from mobile devices, often
behind Carrier-Grade NAT (CG-NAT) [99]. Many public services are operated from the cloud, visible
through rented or imported IP addresses, but backed with complex services built on virtual networks (for
example [54]). Content is replicated in Content Delivery Networks (CDNs). Access to each is mediated
by firewalls. Today’s Internet succeeds so well with seamless, globally available services using common
protocols that technical details become background and laypeople consider the web, Facebook, or their
mobile phone as their “Internet”.
1.1 Challenges
Yet universal reachability in the Internet core today is often challenged.
Political pressure and threats of disconnection suggest national borders may balkanize the core: Russia passed a
“sovereign Internet” law in 2019 [81, 30, 95], and a national “Internet kill switch” has been debated
(including in the U.S. [53] and U.K.) and employed [29, 27, 55, 113]. These pressures prompted policy discussions about fragmentation [38, 1]. We suggest that technical methods can help inform policy discussions
and show what is at risk for the global Internet from threats such as de-peering. We will show that no
single country can unilaterally control the Internet today (Section 3.6.1). We also show that de-peering
can fragment the Internet into pieces (Section 3.6.2).
Architecturally, twenty-five years of evolution have segmented the Internet: services are gatewayed through
proprietary cloud APIs, users are increasingly relegated to second-class status as clients, often behind CG-NAT,
firewalls interrupt connectivity, and the world straddles a mix of IPv4 and IPv6. Architecture sometimes
follows politics, with China’s Great Firewall managing international communication [4, 5], and Huawei
proposing “new Internet” protocols [43]. We suggest that technical methods of detection can help us reason
about changes to Internet architecture, both to understand the implications of partial address reachability
and to evaluate the maturity of IPv6.
Operationally, ISP peering is mature, but today peering disputes cause long-term partial unreachability [66]. This unevenness has been recognized and detected experimentally [35], and in systems that detect
and bypass partial reachability [3, 64, 65]. We show several operational uses of our work. We show that accounting for partial reachability can make existing measurement systems more sensitive, applying these results
to the widely used RIPE DNSmon (Section 3.6.4). DNSmon sees persistent high query loss (5–8%) in the DNS Root
Server System [105], but most of this loss is due to measurement error or persistent partial connectivity,
factors that are 5× and 9.7× larger than the operationally important signal in IPv4 and IPv6. Our analysis
also helps bound uncertainty in multiple, independently developed outage detection systems (Section 3.6.3). All
existing outage detection systems encounter “corner cases” [109, 90, 110, 96, 56] and conflicting observations. We show those are due to partial reachability, and we show partial reachability is as common
as complete outages (Section 3.5.1). Our work also helps quantify the applicability of systems that, since
2001, route around partial reachability [3, 64, 65], and we show that clouds can improve reliability with egress
selection (for example, [108]).
1.2 The Internet Core
In this work we identify defining “the Internet” as an important open problem. We focus on the health
of the Internet’s core—the devices sharing a public address space and common protocols. We recognize that
most users and services today live in branches off this core, behind cloud load-balancers, mobile CG-NAT,
and NATs at work and home. These branches are substantial and bear the fruit we enjoy, but their success
arises from interoperation through the Internet core and its ability to foster independent innovators and
competing clouds, under sovereign states.
A definition for the Internet’s core is critical because while prior work defined what the Internet is, it
provides little guidance for what the Internet is not. A definition can help us reason about the political and
operational challenges (Section 1.1) that threaten the Internet’s ubiquity and uniformity as a means of
global communication. While countries may assert their laws within their borders, our definition shows that
no single country can unilaterally control the Internet today (Section 3.6.1), and when de-peering would
fragment the Internet into pieces (Section 3.6.2).
Our definition states the Internet’s core is the connected component of more than 50% of active, public IP
addresses that can initiate communication with each other (Section 3.2.2). Several implications distinguish
it from prior work. First, requiring bidirectional initiation captures the uniform, peer-to-peer nature of the
Internet’s core necessary for first-class services. Second, it defines one, unique Internet core by requiring
reachability of more than 50%—there can be only one since multiple majorities are impossible. Finally,
this definition is conceptual, avoiding dependence on any specific measurement system, and not requiring
history, special locations, or central authority. It defines an asymptote against which our current and future
measurements can compare, unlike prior definitions from specific systems [3, 64, 65].
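The majority requirement can be made concrete with a small sketch. Assuming a hypothetical reachability graph where an edge exists only when two addresses can each initiate communication with the other, the core is the connected component holding more than half of all active addresses; the function and variable names here are illustrative, not part of any measurement system in this work.

```python
from collections import defaultdict

def internet_core(active_addrs, can_initiate):
    """Return the Internet core under the majority definition, or an
    empty set if no component holds a majority.

    active_addrs: iterable of active, public addresses.
    can_initiate: set of (a, b) pairs meaning a can initiate
    communication with b; only symmetric pairs count, capturing the
    bidirectional-initiation requirement.
    """
    # Build an undirected graph from mutually initiating pairs.
    graph = defaultdict(set)
    for a, b in can_initiate:
        if (b, a) in can_initiate:
            graph[a].add(b)
            graph[b].add(a)

    addrs = set(active_addrs)
    seen = set()
    for start in addrs:
        if start in seen:
            continue
        # Collect one connected component with an iterative DFS.
        component, stack = set(), [start]
        while stack:
            node = stack.pop()
            if node in component:
                continue
            component.add(node)
            stack.extend(graph[node] - component)
        seen |= component
        # Strictly more than 50%: at most one such component can exist.
        if len(component) * 2 > len(addrs):
            return component
    return set()  # no majority component, hence no Internet core

addrs = ["a", "b", "c", "d", "e"]
links = {("a","b"), ("b","a"), ("b","c"), ("c","b"), ("d","e"), ("e","d")}
core = internet_core(addrs, links)  # {"a","b","c"}: 3 of 5 addresses
```

Because two disjoint majorities are impossible, the loop can return as soon as any component passes the 50% test, which mirrors the uniqueness argument in the text.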
1.3 Thesis Overview
In this work, we provide a definition of a single, global Internet core (Section 3.2.2), and use it in three
studies. First, to apply our definition we improve measurement coverage by dealing with sparse sections
of the Internet (Chapter 2). Then, we use our definition to determine who “keeps” the Internet core if
a nation secedes, and to resolve and quantify when sections of the network become reachable to only a
fraction (Chapter 3). Finally, we use our definition to resolve partial reachability and false outages due to
ISP dynamics like diurnal trends and user migration (Chapter 4).
1.4 Thesis Statement
Our thesis is that a new, conceptual definition of the Internet core helps disambiguate questions
in analysis of network reliability and address space usage.
By “new” we mean each part of the thesis has new components. Our first study broadens measurable
space by adding new unmonitored addresses. Our second study adds to the definition a new requirement that
considers reachability between addresses in the network. Our third study adds an ISP-level viewpoint to
address space usage.
By “conceptual” we mean a theoretical definition of a single, global Internet core, independent of whatever specific measurement tools are used. We use our conceptual definition to create an “operational”
definition [39], a measurable definition for which a tool can output a value. While a conceptual
definition may be intractable and cannot be fully realized, it serves as a goal against which to evaluate
operational definitions.
By “Internet core” we mean the devices sharing a public address space and common protocols (Section 3.2.2). We recognize that the many services we enjoy today are the fruits of 25 years of innovation
built on the Internet’s core vision of connecting networks [88] with common protocols and a global address
space [46].
By “disambiguate” we mean to resolve any source of ambiguity related to what “on the Internet” means.
For instance, suppose a pinger can ping only a small fraction of the Internet but not the majority. What
fraction of the Internet core does the pinger need to reach to be considered “on the Internet”? Others in the same
network, the same ISP, other ISPs, or outside the country? It clearly cannot mean reaching everywhere, as
there are always some hosts down. On the other hand, it cannot mean “reach nothing”, as two hosts on a
LAN can reach each other but are not “on the Internet”. We use this aspect of disambiguation to determine
by definition who is “on the Internet”.
A second source of ambiguity is observational disagreement. For example, two observers may disagree
on whether another host is up. This may mean that both the host and one of the observers are
behind the same outage, or that the other observer is affected by a local outage. We use this other aspect of
disambiguation to classify measurement errors and help bound uncertainty under observer disagreement.
By “network reliability” we mean tools and techniques used to measure temporary and persistent network reachability problems, their cause, size and duration, such as FBS (Section 2.3.1), Taitao (Section 3.3.2)
and Chiloe (Section 3.3.4).
By “address space usage” we mean ISPs’ public addresses and how ISPs assign users within their portion
of the network. A formal understanding of address usage helps us comprehend ISP size and ISP reconfigurations.
1.5 Demonstrating the Thesis Statement
We show each keyword in our thesis statement in three studies (Table 1.1).
                     Chapter 2            Chapter 3             Chapter 4
                     Improving coverage   Partial reachability  ISP dynamics
new                  addr. space          definition            viewpoint
conceptual                                X
operational          X                    X                     X
Internet core        X                    X                     X
disambiguate                              X                     X
network reliability  X                    X
address space usage                                             X

Table 1.1: Thesis statement demonstration by study
In Chapter 2 we improve Internet core coverage for network reliability measurement systems by adding
new unmonitored address space. We propose a new operational Full Block Scanning (FBS) algorithm to
improve coverage for active scanning, providing reliable results for sparse blocks by gathering more
information before making a decision. FBS identifies sparse blocks and takes additional time before making
decisions about their outages, thereby addressing previous concerns about false outages while preserving
strict limits on probe rates.
In Chapter 3 we provide a new conceptual definition of the Internet core, and then use it to disambiguate network reliability issues like partial reachability. Partial reachability (a previous subject [3, 65]) is the network problem where one address can reach some addresses but not others. This problem arises from routing trouble, persistent peering disputes, large-scale firewalls, and carrier-grade network address translation. We provide two new operational algorithms to identify two types of network fragmentation. First,
Taitao detects peninsulas, when a network can reach some parts of the Internet directly, but not others.
Peninsulas result from peering disputes or long-term firewalls. Second, Chiloe detects islands, networks that have internal connectivity but are sometimes cut off from the Internet as a whole.
In Chapter 4 we analyze ISP address space usage dynamics. Another obstacle to evaluating reachability across the Internet core is that network operators try to optimize address space usage by moving users between routable prefixes, or by repurposing address segments for other uses. This user movement leaves addresses temporarily empty, which causes scanners to erroneously interpret the lack of response as network problems. We propose the Internet Availability Sensing (IAS) algorithm, an operational algorithm designed to detect and disambiguate maintenance events and address reallocation from a new ISP-level viewpoint.
New elements are present in each study. We add new unmonitored address space to outage detection systems, increasing measurements from 67% to 96% of all responsive blocks (Section 2.5.3). We provide a new definition of the Internet core (Section 3.2.2), and then use it to resolve challenges that current definitions cannot fully address. We show that no single country can claim for itself “the Internet”, nor can any single country eject another by de-peering from it (Section 3.6). We supply a new viewpoint to outage detection: our AS-wide address accumulation algorithm (Section 4.3.1) produces snapshots of all active addresses within an ISP, enabling detection of ISP address dynamics like diurnal trends and maintenance events.
We provide a conceptual definition of a single, global Internet core, independent of assertions of authority (Section 3.2.2). Our definition serves as the asymptote against which operational definitions may be tested.
We develop operational algorithms to improve coverage for active scanning (Section 2.3.1), to resolve
partial reachability issues (Section 3.3.2, Section 3.3.4), and to detect ISP-level diurnal trends (Section 4.3.2)
and maintenance events (Section 4.3.4).
We deploy broad measurements across the Internet core, and use our results to disambiguate sources of
ambiguity in network reliability and address space usage analysis. First, we find that sparse blocks represent
22% of all measurable blocks (Section 2.2.1). Second, we show that partial connectivity issues are about
as common as Internet outages (Section 3.5.1). Third, we quantify how many maintenance events are
externally visible, finding that 20% of such events result in /24 IPv4 address blocks that become unused for
days or more (Section 4.5.2).
1.6 Research Contributions
Each of our three studies contributes toward proving the thesis. In addition, each work has its own contributions beneficial to the research community and industry.
In Chapter 2, we improve Internet coverage in outage detection, while retaining accuracy and limits
on probing rates (Section 2.3.1). Our approach correctly handles 1.2M blocks that would otherwise be too
sparse to correctly report (Section 2.5.2), and allows addition of 1.7M sparse blocks that were previously
excluded as unmeasurable (Section 2.5.3). Together, coverage for 2017q4 increases to 5.7M blocks. Moreover, our algorithms improve accuracy by reducing the number of false outage events seen in sparse blocks (Section 2.4.1). We confirm that they address most previously reported false outage events (Section 2.5.1).
In Chapter 3, in addition to providing a definition of the Internet core, we identify that partial reachability is a fundamental part of the Internet’s core. We define peninsulas, persistent partial connectivity (Section 3.2.3.3), and islands, when one or more computers are partitioned from the main Internet (Section 3.2.3.2). We use peninsulas and islands to address current operational questions. We bring technical light to policy choices around national networks (Section 3.6.1) and de-peering (Section 3.6.2). We improve the sensitivity of RIPE Atlas’ DNSmon [2] (Section 3.6.4), resolve corner cases in outage detection (Section 3.5.1),
and quantify opportunities for route detouring. We support these claims with rigorous measurements from
two measurement systems.
Finally, in Chapter 4, we identify two classes of address dynamics: periodic (diurnal and weekly) trends
and ISP maintenance events. We validate our algorithms (Section 4.4.2) using data from ISPs with known
maintenance patterns and data from RIPE Atlas. We then use these algorithms to quantify how many ISPs
are diurnal (Section 4.5.3), how many maintenance events occur (Section 4.5.1), and how IPv6 shows more
consistent address usage than IPv4 (Section 4.5.5). We show that 20% of maintenance events result in /24
IPv4 address blocks that become unused for days or more. While only about 4% of ASes (2,830) are diurnal,
some diurnal ASes show 20% changes each day.
Chapter 2
Improving Coverage of Internet Outage Detection
In our first study, we improve global coverage of Internet measurement systems. Our approach is to provide
reliable results for sparse blocks. We propose Full Block Scanning (FBS), an algorithm that gathers more
information before making a decision about target network reachability.
This chapter contributes to showing our thesis statement (Section 1.4). We improve Internet core coverage in network reliability analysis with an operational algorithm (Section 2.3.1) that enables adding new
unmonitored addresses to active measurements (Section 2.5.3).
This chapter was published in the Passive and Active Measurement Conference (PAM) 2018 [9]. All of
the datasets used in this study that we created are available at no cost [117]. Our work was IRB reviewed
and identified as non-human subjects research (USC IRB IIR00001648).
2.1 Introduction
Internet reliability is of concern to all Internet users, and improving reliability is the goal of industry
and governments. Yet government intervention, operational misconfiguration, natural disasters, and even regular weather all cause network outages that affect many. The challenge of measuring outages has
prompted a number of approaches, including active measurements of weather-related behavior [109], passive observation of government interference [33], active measurement of most of the IPv4 Internet [90],
passive observation from distributed probes [110], analysis of CDN traffic [96], and statistical modeling of
background radiation [56].
Broad coverage is an important goal of outage detection systems. Since outages are rare, it is important
to look everywhere. Active detection systems report coverage for more than 3M /24 blocks [90], and
passive systems using CDN data report coverage for more than 2M blocks [96]. More specialized systems
focus coverage on areas with bad weather (ThunderPing [109]), or provide broad, country-level or regional
coverage, but perhaps without /24-level granularity inside the regions (CAIDA darknet outage analysis [33]
and Chocolatine [56]). Although each of these systems provides broad coverage, each recognizes there are portions of the Internet that it cannot measure because the signal it measures is not strong enough. Systems typically detect and ignore areas where they have insufficient signal (in Trinocular, blocks with fewer than 15 addresses; in ThunderPing, events with fewer than 100 addresses in a region; in the Akamai/MIT system, blocks with fewer than 40 active addresses; in Chocolatine, blocks with fewer than 20 active IPs). Setting
thresholds too high reduces coverage, yet setting them too low risks false outages from misinterpreting a
weak signal.
The first contribution of our study is two new algorithms: Full Block Scanning (FBS), to improve coverage in outage detection with active probing while retaining accuracy and limits on probing rates (Section 2.3.1), and Lone-Address-Block Recovery (LABR), to increase coverage by providing partial results for blocks with very few active addresses (Section 2.3.2). Our insight is to recognize that sparse blocks signal
outages more weakly than other blocks, and so they require more information to make a decision. We
chose to delay decisions until all block addresses (the full block) have been observed, thus gathering more information while maintaining limits on the probing rate. (An alternative we decline is to probe more aggressively.) We evaluate FBS as an extension to Trinocular (Section 2.4.2), but the concept may apply to
other outage detection systems.
Approach                           Coverage
UCSD-NT      darknet               3.2M observed [31]
Akamai       passive/CDN           5.1M observed / 2.3M trackable [96]
ThunderPing  active/addrs          10.8M trackable US IP addresses [85]
Disco        TCP disconnections    10.5k trackable [110]
Trinocular   active/blocks         5.9M responsive / 3.4M trackable [90]

Table 2.1: Coverage comparison in /24 blocks of different measuring approaches.
Our second contribution is to show that FBS can increase coverage in two ways (Section 2.5.3). First,
it correctly handles 1.2M blocks that would otherwise be too sparse to correctly report. Second, it allows
addition of 1.7M sparse blocks that were previously excluded as unmeasurable. Together, coverage for
2017q4 can be 5.7M blocks. Moreover, FBS improves accuracy by reducing the number of false outage
events seen in sparse blocks (Section 2.4.1). We confirm that it addresses most previously reported false
outage events (Section 2.5.1).
The cost of FBS is reduced temporal precision, since it takes more time to gather more information
(assuming we hold the probe rate fixed). We show that this cost is limited (Section 2.5.2): FBS is required for about one-fifth of blocks (only sparse blocks, about 22% of all blocks). Timing for the non-sparse majority of blocks is unaffected, and 74% of recovered uptime for sparse blocks is within 22 minutes. About 40% of accepted outages in sparse blocks are reported within 33 minutes, and nearly all within 3.3 hours. (Reanalysis of old data shows the same results for non-sparse blocks and recovered uptime, but requires twice the time for accepted outages.) We examine false uptime by testing against a series of known outages that affected
Iraq in February 2017.
The final contribution of this chapter is to support our thesis statement (Section 1.4).
2.2 Challenges to Broad Coverage
Our goal is to detect Internet outages with broad coverage. Table 2.1 shows coverage of several methods
that have been published. The table shows that active probing methods like Trinocular provide results
for about 3.4M /24 blocks [90] and CDN-based passive methods provide good but somewhat less coverage
(2.3M blocks for the Akamai/MIT system [96]). Passive methods with network telescopes provide very
broad coverage (3.2M blocks [31]), but less spatial precision (for example, for entire countries, but not
individual blocks in that country). Combinations of methods can provide better coverage: Trinocular and the Akamai/MIT system overlap in 1.6M blocks, and each makes a unique contribution (1.9M and 0.7M unique blocks, respectively, from [96]). However, Akamai/MIT data is not publicly available.
Observed blocks are /24 blocks for which a passive detection system has detected at least some traffic in the near past. Similarly, responsive blocks are /24 blocks for which an active system has received at least some responses in the near past. In both cases, what makes a block trackable is having enough observations or responses to confirm an outage.
Here we examine how to improve coverage of active probing systems like Trinocular. Trinocular tracks
3.4M blocks. Another 2.5M blocks are responsive but are not considered “trackable” since they have too
few reliably responding addresses.
Our goal in this study is to expand coverage by making these previously untrackable blocks trackable.
We face two problems: sparse blocks and lone addresses, each described below. In Section 2.3 we describe
two new algorithms to make these blocks trackable: Full Block Scanning (FBS), which retains spatial precision and limited probing rates but loses some temporal precision; and Lone Address Block Recovery (LABR), an approach that allows confirmation that lone-address blocks are up, although it cannot definitively identify when they are down.
Other active probing systems that follow the Trinocular algorithms (such as the active part of IODA [18])
might benefit from solutions to these problems. We seek algorithms that can reevaluate existing years of Trinocular data, so we follow Trinocular’s use of IPv4 /24-prefix blocks and 11-minute rounds.
Figure 2.1: A sample block over time (columns). The bottom (d) shows individual addresses as rows, with colored dots when the address responds to Trinocular. Bar (c) shows Trinocular status (up, unknown, and down), bar (b) is Full Block Scanning, and the top bar (a), Lone Address Block Recovery.
2.2.1 Problem: Sparse Blocks
Sparse blocks limit coverage: active scanning requires responses, so we decline to measure blocks with
long-term sparsity, and we see a large number of false outages in blocks that are not sparse long-term, but
often are temporarily sparse.
Sparse blocks challenge accuracy because of a tension between the amount of probing and likelihood
of getting a response. To constrain traffic to each block, and to track millions of blocks, Trinocular limits
each block to 15 probes per round. Limited probing can cause false outages in two ways: First, it may fail
to reach a denitive belief and mark the block as unknown. Alternatively, if the block is usually responsive,
a few non-responses may produce a down belief.
Figure 2.2: Blocks distributed according to the number of outages versus their A(E(b)), 2017q4. Dataset A30.
As an example, Figure 2.1 shows four different levels of sparsity (each starting 2017-10-06, 2017-10-27, 2017-11-14, and 2017-12-16) as (d) individual address responses to Trinocular probes and (c) Trinocular state inferences. As the block gets denser, Trinocular’s inferences become more accurate.
Furthermore, every address in this block has responded in the past, but for the first three periods only a few are actually used, making the block temporarily sparse. For precision, we use definitions from [90]: E(b) is the set of addresses in block b that have ever responded, and A(E(b)) is the long-term probability that these addresses will respond. We also consider a short-term estimate, Â(E(b)). Problematic blocks thus have low A(E(b)) or Â(E(b)). We provide further block examples in Section 2.4.1.1.
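To make these definitions concrete, here is a minimal sketch (illustrative Python, not Trinocular’s code; the (address, responded) history format and the simple fraction-of-probes estimator are simplifications we introduce for this example):

```python
# Illustrative sketch: computing E(b) and an estimate of A(E(b)) from a
# history of (address, responded) probe observations for one /24 block.
def ever_active(history):
    """E(b): the set of addresses that have ever responded."""
    return {addr for addr, responded in history if responded}

def availability(history):
    """A(E(b)), approximated here as the fraction of positive probes
    among probes sent to ever-active addresses."""
    eb = ever_active(history)
    probes = [responded for addr, responded in history if addr in eb]
    return sum(probes) / len(probes) if probes else 0.0

# Example: four addresses probed; only .1 and .2 ever respond, and they
# answer 3 of the 6 probes sent to them.
history = [(1, True), (1, False), (2, True), (2, False),
           (1, True), (2, False), (3, False), (4, False)]
print(sorted(ever_active(history)))  # [1, 2]
print(availability(history))         # 0.5
```

A short-term estimate Â(E(b)) would apply the same computation to only a recent window of the history.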
Sparse blocks cause the majority of outage events. In Figure 2.2 we compare the number of outages in
all 4M responsive blocks with their measured A(E(b)) value during 2017q4. Blocks with a higher number
of outages tend to have a lower A(E(b)) value, particularly those closer to the lower bound. Trinocular does not track blocks with long-term A(E(b)) < 0.1; however, as a block’s sparseness changes, this value can change over the measurement period.
Figure 2.3: Cumulative distribution function of the A value per block, 2017q4. Dataset A30.
The correlation between sparse blocks and frequent outage events is clearer when we look at a cumulative distribution. Figure 2.3 shows the cumulative distribution of A for all 4M responsive blocks (light
green, the lower line), and for blocks with 10 or more down events (the red, upper line) as measured during
2017q4. These lines reflect merged observations from six Trinocular vantage points. We find that 80% of blocks with 10 or more down events have A < 0.2, at around the knee of the curve, and yet these sparse blocks represent only 22% of all blocks. The figure suggests a correlation between a high number of down events and low A(E(b)) per block, given the faster convergence of the line representing blocks with multiple down events. (It confirms the heuristic of “more than 5 events” that was used to filter sparse Trinocular blocks in the 2017 CDN comparison [96].)
Although we observe from multiple locations, merging results from different vantage points is not sufficient to deal with sparse blocks, because these multiple sites all face the same problem of sparseness
leading to inconsistent results. Addressing this problem is a goal of FBS, and it also allows us to grow
coverage.
Prior systems sought to filter out these sparse blocks, both before and after measurement. Trinocular marks very sparse blocks as untrackable, that is, when A(E(b)) < 0.10 or |E(b)| < 15. It also marks blocks as untrackable when observed A does not match predicted A [90], that is, when past A(E(b)), obtained from the previous four Internet censuses [58], is at least 0.10, but actual measured A is less than 0.10 (gone-dark blocks). The latest Trinocular versions use an adaptive estimate for A [91]. Trinocular notes that its unmeasurability test is not strict enough: indeterminate belief can occur when A(E(b)) < 0.3 and |E(b)| ≥ 15.
We consider a block sparse when its short-term availability estimate is below a threshold: Â_s(E(b)) < T_sparse, where Â_s(E(b)) is a short-term estimate of the current availability of the block, and T_sparse is a threshold, currently 0.2. Blocks have frequent outages (like Figure 2.1) when they are sparse.
2.2.2 Problem: Lone Addresses
The second challenge to coverage is blocks where only one or two addresses are active; we call this problem lone-address blocks. When a single address is active, lack of a response may be a network outage, but it may also be a reboot of a single specific computer or have other causes; the implication of non-response from a single address is ambiguous. Trinocular has avoided blocks with few addresses as untrackable
(when |E(b)| < 15). ThunderPing [109] tracks individual addresses, but recognizing the risk of decisions based on single addresses, it typically probes multiple targets per weather event [85].
An example block with a lone address is in Figure 2.1. In the second of its four phases of use, starting 2017-10-27 and lasting 18 days, only the .85 address replies. Our goal is to handle this block correctly
in both of its active states, with many addresses and with a lone address.
2.3 Improving Outage Detection
To improve Internet coverage, we next describe Full Block Scanning and Lone-Address-Block Recovery. We use both of these algorithms to address the limitations of sparse blocks (Section 2.2).
2.3.1 Full Block Scanning for Sparse Blocks
The challenge of evaluating sparse blocks is that Trinocular makes decisions on too little information, forcing a decision after 15 probes each Trinocular Round (TR, 11 minutes), even without reaching a definitive belief. We address this problem with more information: we consider a Full Round (FR), combining multiple TRs until all active addresses (all of E(b)) have been scanned. This Full Block Scanning algorithm makes decisions only on complete information, while retaining the promise of a limited scanning rate.
Formally, a Full Round ends at time t with the minimum N such that the N TRs before t cover all |E(b)| ever-active addresses of the block: ∑_{i=t−N}^{t} |TR_i| ≥ |E(b)|.
Trinocular probes all addresses in E(b) in a pseudo-random sequence that is fixed once per quarter, so we can guarantee each address is probed when we count enough addresses across sequential TRs. (Versions of Trinocular prior to 2020q1 reverse direction at the end of the sequence, so reanalysis of data before this time must sense 2|E(b)| addresses to guarantee observing each. We call this retrospective version the 2FR version of FBS, and we use 1FR FBS for new data. They differ in temporal precision; see Section 2.5.2.)
Full Block Scanning (FBS) layers over Trinocular outage detection, re-evaluating the outages it reports and reverting some decisions. If the block is currently sparse (Â_s < T_sparse) and the most recent Full Round included a positive response, then we override the outage. That is, if there are any positive responses in the last Full Round FR_t, we convert to up any outages in TR_i for all i ∈ [t − N, t].
The cost of FBS is that combining multiple TRs loses temporal precision, so we use FBS only when it is required: for blocks that are currently sparse. A block is currently sparse if the short-term running average of the response rate for the block, Â_s^3FR, computed over the last three FRs, is below the sparse threshold (Â_s^3FR < T_sparse). (We choose three FRs to smooth Â from multiple estimates.)
The reduction in temporal precision depends on how many addresses are scanned in each TR and the size of the FR (that is, |E(b)|). When FBS verifies an outage, we know the block was up at the last positive response, and we know it is down after the full round of non-responses, so an outage could have begun any time in between. We therefore select as start time the time of the last confirmed down event (the first known lit address, now down). That time has an uncertainty of the difference between the earliest possible start time and the confirmed start time. Theoretically, if all 256 addresses in a block are in use and 15 addresses are scanned each TR, an FR lasts 187 minutes. In practice, timing is often better; we show empirical results in Section 2.5.2.
2.3.2 Lone-Address-Block Recovery
The FBS algorithm repairs any block with at least one responsive address in the last FR, allowing us to extend coverage to many sparse blocks. However, when a block has only a single active address, a non-reply may indicate an outage of the network or a problem with that single host.
To avoid false down events resulting from non-outage problems with a lone address, we define Lone-Address-Block Recovery (LABR). We accept up events, but because outages are rare (much rarer than packet loss), we convert down events to “unknown” for blocks with very few recently active addresses. We define “few” as one or two active addresses and “recently” as the last three Full Rounds, so we use LABR when |Ê^3FR| < 3. We require at least three addresses to avoid making decisions on one or two addresses where packet loss could change results.
This algorithm gives an asymmetric outcome: we can confirm blocks are up, but not that they are down.
We believe that outcome is preferable to the alternatives: completely ignoring the block, or tolerating false
outages. However, we identify LABR blocks to allow researchers wanting an estimator that can be both
up and down to omit them.
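LABR’s asymmetric rule can be sketched as follows (illustrative Python; the names are ours, not the deployed implementation’s):

```python
# Illustrative sketch of Lone-Address-Block Recovery (LABR): with fewer
# than three recently active addresses, a non-response is ambiguous, so
# "down" becomes "unknown"; "up" is always accepted.
def labr_state(fbs_state, recently_active_addrs):
    """recently_active_addrs: Ê^3FR, addresses seen in the last 3 FRs."""
    if fbs_state == "down" and len(recently_active_addrs) < 3:
        return "unknown"  # cannot tell an outage from one host rebooting
    return fbs_state

print(labr_state("down", {".85"}))             # "unknown": lone address
print(labr_state("up", {".85"}))               # "up": up is always accepted
print(labr_state("down", {".1", ".2", ".3"}))  # "down": enough addresses
```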
2.4 Evaluation
We next evaluate correctness and performance of FBS and LABR to verify Internet coverage improvement.
In the following studies we use these algorithms to measure network reliability at Internet scale.
2.4.1 Full Block Scanning Reduces Noise
2.4.1.1 Case Study of One Block
To illustrate how FBS works, we start at the smallest level, a /24 block. Figure 2.1 shows a sample block
with insufficient responding addresses, which cause Trinocular to erroneously infer the block as down.
This block is in CenturyLink (AS209, a U.S. ISP), and initially has only 8 addresses responding. On
2017-10-27, there is a usage change that causes a down event with no address response for ∼13 hrs. This
event is matched in other blocks for the same AS. Then, we see a lone address responding for 18 days.
On 2017-11-14, the block starts receiving new users, and does so once again starting 2017-12-17. On 2017-11-16, it
shows a partial outage that is observed only from our Los Angeles site, not from other Trinocular sites.
Trinocular results (Figure 2.1(c), the third bar) show frequent unknown states that result in false down events, particularly when block usage is sparse in October and early November.
By contrast, Full Block Scanning (Figure 2.1(b), the second graph) resolves this uncertainty. With more information, FBS confirms the block is usually up, while recognizing the usage change and the partial outage. However, in between, there are two down events inferred from a lone address, which LABR changes to unknown (Figure 2.1(a), the top graph).
Other Block Examples: Next we provide examples of other blocks where sparsity changes, to illustrate when FBS is required.
The block in the left part of Figure 2.4 has no activity for three weeks, then sparse use for a week, then
moderate use, and back to sparse use for the last two weeks. Reverse DNS suggests this block uses DHCP,
and gradual changes in use suggest the ISP is migrating users. The block was provably reachable after the first three weeks. Before then, it may have been reachable but unused, a false outage because the block is
inactive.
Figure 2.4: Sample blocks over time (columns). The bottom (d) shows individual addresses as rows, with colored dots when the address responds to Trinocular. Bar (c) shows Trinocular status (up, unknown, and down), bar (b) is Full Block Scanning, and the top bar (a), Lone Address Block Recovery.
In the third bar from the top (bar c, left of Figure 2.4), we show that Trinocular often marks the block unknown (in red) for the week starting 2017-10-30, and again for weeks after 2017-12-12. Every address in this block has responded in the past, but for these two periods only a few are actually used, making the block temporarily sparse. Figure 2.4 (left, bar b) shows how FBS accurately fixes Trinocular’s pitfalls in such a DHCP scenario.
Figure 2.4 (right) shows a block example with a lone address. This block has three phases of use: before
2017-02-16, many addresses are in use; then for about 9 days, nothing replies; then, starting on 2017-02-25
only the .1 address replies. During the last phase, Trinocular (Figure 2.4 (right, bar c)) completely ignores
that there is one address responding, while FBS (Figure 2.4 (right, bar b)) sets block status depending on
responses of this lone address. However, LABR (Figure 2.4 (right, bar a)) changes all the FBS-detected down events to unknown, as there is not enough information to claim a down event, in contrast to what the end of phase one shows.
Figure 2.5: Iraqi Government mandated outages Feb 2-9, 2017. Whole quarter (left), and exam week (right). Dataset A27. FBS processed using 2FR.
2.4.1.2 False Outages: Does FBS Remove Noise?
From this single block example, we next consider a country’s Internet. Our goal is to see if FBS reduces
noise by examining false down events (blocks correctly recovered by FBS because they were observation
noise).
We study a series of known outages that affected Iraq in February 2017. That country had seven government-mandated Internet outages (the local mornings on February 2, and also the 4th through 9th) with the goal of preventing cheating during academic placement exams [37]. This is a particularly challenging scenario for FBS, as closely spaced short outages test the algorithm’s accuracy and precision. Furthermore, the fraction of sparse blocks is high in this country. We identified 1176 Iraqi blocks using Maxmind’s city-level database [71]; 666 of these are sparse.
Figure 2.5 shows Iraqi outages in 2017q1, grouped in 660 s timebins. We show outages without Full
Block Scanning (the purple, top line) and with it (the green line). The Iraqi exam week is highlighted in
gray on the left, and we plot that week with a larger scale on the right.
Figure 2.6: Outage events during the 7 Iraqi outages, measured by their Â_s^3FR and full-round values. Single site W. Dataset: A27 (2017q1) subsetted to the 7 outage periods.
In each of the seven large peaks during exam week, most Iraqi blocks (nearly 900, or 76%) are out: our true outages. Outside the peaks, a few blocks (the 20 to 40 on the purple line, without FBS) are often down, likely false outages.
FBS suppresses most of the background outages (85% of outage area), from a median of 26 to a median of 1; these differences can be seen by comparing the higher purple line to the lower green line. We confirm this reduction was due to noise by examining blocks that FBS recovers in 10 randomly selected time periods with 34 down events. Nearly all down events (33 events, 97% of purple) were in sparse blocks that resemble Figure 2.1; the other block was diurnal. This study confirms that FBS recovers false outages due to sparseness.
2.4.1.3 True Outages: Does FBS Remove Legitimate Outages?
We next look at how Full Block Scanning interacts with known outage events. Its goal is to remove noise
and false outages, but if FBS is too aggressive it may accidentally remove legitimate outages (a “true down
event”).
We treat the seven nationwide outages corresponding with the Iraqi exams as true down events and compare against this ground truth with and without FBS.
The seven peaks in Figure 2.5 (right) show known Iraqi outages, with purple dots at “peak outage”
without FBS, and lower, green dots with FBS. FBS removes somewhat less than half of the down events,
with peaks around 440 to 560 instead of 790 to 910 blocks.
Table 2.2: Confusion matrix of 5200 Trinocular-detected down events in 50 random blocks. Dataset A30, 2017q4.

                 true condition (manually observed)
                 UP (Trinocular false down events)   DOWN (Trinocular true down events)
FBS  UP          4133 (79%, FBS fixes)               0
FBS  DOWN       621 (12%, FBS misses)               446 (9%, FBS confirms)
To understand this reduction we looked at the duration of the Iraqi events. FBS affects only the 35% of events in the red box in the lower left corner (Figure 2.6). (Examination of just the sparse blocks confirms that they are the source of attenuation.)
It is important to note that these are worst cases for FBS: many blocks are sparse, and the events are just shorter than one full round. If the events were longer, or fewer blocks were sparse, there would be no attenuation. A lower FBS threshold (Â_s^3FR) of 0.15 trims only 15% of events. However, we choose to leave the FBS threshold at 0.2 to avoid overfitting our parameters to Iraq.
2.4.1.4 Random Sampling of Outage Events
Finally, we confirm our results with a random sample of events. We select 50 random blocks that show
some outage from the Trinocular 2017q4 dataset, then derive a best-estimate ground truth through manual
examination. Table 2.2 shows the confusion matrix after applying FBS. Of the total 5200 down events detected
by Trinocular, FBS fixes 4133 (79% are false outages), misses 621 down events (12% are not fixed, but should
have been), and confirms 446 true down events (9% are not changed). The FBS Error Rate is 0.12 (621 false
outages of 5200 events), so it is fairly successful at removing noise. Many of the remaining false outages are
due to moderately sparse blocks (0.2 < Â(E(b)) < 0.4) where FBS does not trigger.
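The error-rate arithmetic above can be reproduced directly from the confusion-matrix counts; a minimal sketch, with the counts taken from Table 2.2:

```python
# Confusion-matrix counts from Table 2.2 (Trinocular down events,
# 50 random blocks, dataset A30, 2017q4).
fixes = 4133      # FBS UP, truth UP: false outages repaired
misses = 621      # FBS DOWN, truth UP: false outages FBS leaves in place
confirms = 446    # FBS DOWN, truth DOWN: true outages preserved
total = fixes + misses + confirms  # 5200 detected down events

# FBS Error Rate: fraction of events still falsely reported down after FBS.
error_rate = misses / total
print(f"events: {total}, fix rate: {fixes/total:.0%}, "
      f"error rate: {error_rate:.2f}")
```

Running this reproduces the 79% fix rate and the 0.12 error rate reported above.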
[Scatter plots: down fraction (FBS) vs. down fraction (Trinocular), left; down events (FBS) vs. down events (Trinocular), right; color indicates number of /24 blocks.]
Figure 2.7: Comparison of per-block down time (left) and number of down events (right) between 2FR-FBS
and Trinocular during 2017q4 as seen from six sites. Dataset A30.
2.4.2 How Often Do FBS and LABR Change Outages?
We next evaluate how FBS and LABR change the overall down event duration and the number of down
events. We expect FBS to repair false down events, so it should show less downtime and fewer down
events.
We evaluate merged results from six Trinocular sites as measured during 2017q4 (dataset A30), computing
the fraction of time and the number of occurrences across the whole quarter that each block was observed
down, with and without FBS. We found similar results when we repeated this study on a different quarter
(2017q2, dataset A28).
FBS and Down Time: Figure 2.7 (left) compares the fraction of total down time without FBS (0.0130)
and with FBS (0.0027). First, the vast majority of blocks (91%) have both values less than 0.02: they have
little or no down time. Many of the remaining blocks are on the diagonal, with prior and new values within
0.005. Most of the changed blocks (9% of all blocks) appear below the diagonal, showing that FBS
usually decreases downtime.
Surprisingly, 0.5% of blocks show more downtime after FBS. We examined a sample of these blocks
and found that some sparse blocks did not transition from up-to-down in one round when 15 negative
results did not fully change belief. FBS gathers more information and retrospectively marks the block
down earlier. We believe this result better reects truth.
FBS and Down Events: We can also evaluate how FBS affects the number of down events, in addition to
down time, in Figure 2.7 (right). FBS reduces the number of down events for 6% of blocks, often considerably
(see the large number of blocks near the x-axis). In these cases FBS is repairing false outages. Again, we
see a small number of blocks (0.1%) where FBS shows more down events than without it. Examination of
these cases shows that FBS sometimes breaks longer down events into several shorter ones, interspersed
with an up event. We believe these results better reflect the true state of the block.
LABR: In 2017q4, LABR affects only a few blocks (250k, 6% of trackable), where it resets 4M down
events to unknown. Although LABR touches only these blocks, it allows them to be reported up much of
the time, increasing coverage.
2.5 Comparing Trinocular and FBS
In Section 2.4.2 we discussed how often FBS changes outages when compared to Trinocular, examining
two different metrics: total block down time and number of down events. Next we provide further
information about the distribution of these metrics.
In Figure 2.8 (left) we show the block distribution of the difference in down-time fraction between
Trinocular and FBS. The majority of blocks (91%) have little or no change. Blocks on the left side of the
figure, representing 9% of the total, have a higher down-time fraction when processed only with Trinocular
than when processed with FBS. For example, a −1 shows a block that was down for Trinocular during the
whole quarter, while FBS was able to completely recover it. This outcome occurs when a historically
high |E(b)| block has temporarily dropped to just a few stable addresses.
Figure 2.8: Cumulative distribution of the difference in down fraction (left) and in number of down events
(right) between Trinocular and FBS for 2017q4. Dataset A30.
We also see a small percentage (0.5%) where FBS has a higher down fraction than Trinocular. This
increase in outage fraction happens when Trinocular erroneously marks a block as UP. With more
information, FBS is able to correctly change the block state and more accurately reflect truth.
In Figure 2.8 (right) we look at the distribution of blocks when compared by the number of down events
observed in FBS and Trinocular. Similarly, the number of down events remains mostly unchanged for the
majority of blocks (94%). Trinocular has more down events for 6% of blocks, and FBS shows more events
for 0.1%. FBS can increase the absolute number of events in a block when breaking long events into shorter
pieces.
2.5.1 Comparing FBS Active and Passive Outages
Prior CDN-based results showed that a large number of false outages come from a few blocks [96].
To match their system, they compare the subset of 1.6M blocks from 2017q2 that are trackable in both
Trinocular and their system, keeping events that last at least 1 hour in Trinocular. We next review that
result and show that FBS solves the problem they identified.
Table 2.3 shows this comparison of CDN events to Trinocular with both filtering (discarding blocks with
more than 5 events, a short-term fix proposed for their paper at the time) and FBS. To recap prior results:
the CDN-based results confirm that 27% of outage events found by Trinocular without
Table 2.3: Trinocular-detected disruptions in CDN logs. Dataset A28, 2017q2.

                     Trinocular    filtered Trinocular    FBS
  # disruptions      380k          132k                   119k
  confirmed          103k (27%)    98k (74%)              92k (77%)
  reduced activity   49k (13%)     ~13k (10%)             16k (14%)
  no change          228k (60%)    ~21k (16%)             11k (9%)
FBS also appear in the CDN-based passive analysis. The remaining outages are either false outages in
Trinocular (likely, since 60% show no change in the CDN) or false uptimes from the CDN. Since sparse
blocks produce many events, discarding blocks with 5 or more events (the “filtered Trinocular” column)
should avoid most false outages, although it may cause false uptime. As expected, most events (74%) that
remain after this filter are confirmed by the CDN.
While the CDN data is proprietary and not available, we thank Philipp Richter for redoing this comparison
with a similar, updated subset of our data, now with FBS. The FBS column of Table 2.3 shows
analysis of Trinocular with FBS compared to the same CDN results, now filtered only by the CDN requirements
(1-hour events, and reported in the CDN system). FBS brings an even larger fraction of disruptions
in line with the CDN, with 77% of events being confirmed. Moreover, FBS is much more sensitive than the
5-event filter, applying only to the 22% of blocks that are sparse. FBS therefore preserves Trinocular’s
11-minute timing for the majority of blocks, reducing temporal precision only where necessary while
providing generally good accuracy for outage detection across all blocks.
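The confirmation rates in Table 2.3 follow directly from the counts; a small sketch verifying the arithmetic (counts in thousands, from the table):

```python
# Confirmation rates from Table 2.3: disruptions confirmed by the CDN
# under each processing variant (counts in thousands of events).
variants = {
    "Trinocular":          (103, 380),  # (confirmed, total disruptions)
    "filtered Trinocular": (98, 132),
    "FBS":                 (92, 119),
}
for name, (confirmed, total) in variants.items():
    print(f"{name}: {confirmed/total:.0%} confirmed")
```

The printed fractions match the 27%, 74%, and 77% reported in the table.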
This result suggests that FBS addresses the majority of false outages, and confirms that most false
outages are due to a small set of sparse blocks. (Addressing false outages due to ISP renumbering is ongoing
work [10].)
Finally, we note that FBS provides much larger coverage: 5.7M blocks compared to 2.3M trackable
blocks in the CDN system. We discuss coverage in detail in Section 2.5.3.
2.5.2 FBS Effects on Temporal Precision
We next examine how FBS affects the temporal precision of outages. In sparse blocks, FBS will repair down
events that are shorter than a Full Round. But the exact duration of a FR depends on how many addresses are
considered in the block (E(b)) and how active they are (Â(E(b))).
We examine 308M events that FBS repairs in a quarter and find that for about half the cases (53% of the
events), FBS repairs a single round of outage in 11 minutes. Almost all the remaining events are recovered
in 15 or fewer rounds, as expected. Only a tiny fraction (0.5%) require longer than 18 rounds, for the few
times when Trinocular is slow to detect large changes in Â because it thinks the block may be down (for
more details, see our paper [9]).
2.5.3 Increasing Coverage
Sparse blocks limit coverage. If historical information suggests they are sparse, they may be discarded
from probing as untrackable. Blocks that become sparse during measurement can create false outages and
are discarded during post-processing. We next show how FBS and LABR allow us to increase coverage by
correctly handling sparse blocks.
Correctly tracking sparse blocks: We first look at how the accuracy improvements from our
algorithms increase coverage. Three thresholds have been used to identify (and discard) sparse blocks: a low
response probability (A < 0.2, quarter average, from [90]), low up time (up time < 0.8, from [91]), and a
high number of down events (5 or more down events, from [96]).
We use these three thresholds over one quarter of Trinocular data (2017q4-A30W), reporting on coverage
with each filter in Table 2.4. Of 5.9M responsive blocks, only 4M (67%) are considered
trackable by Trinocular. Filtering removes another 0.2M to 0.9M blocks, leaving an average of 53 to 64%.
Trinocular with FBS gets larger coverage than other methods of filtering or detection. FBS repairs 1.2M
blocks, most of them sparse: of 0.9M sparse blocks, we find that FBS fixes 0.8M. The remaining 100k correspond to
Table 2.4: IPv4 address space coverage of Trinocular and FBS. (a), (b), and (c) are different methods for
filtering sparse blocks; (d) shows blocks fixed by FBS.

                                               Blocks (in M)
                         Threshold             reject  accept  %resp  %Tri
  IPv4 responsive        |E(b)| ≥ 1             8.6     5.9     100
  Trinocular trackable   |E(b)| ≥ 15 ∧ A ≥ 0.1  1.9     4.0      67    100
  a) mostly up blocks    up time > 0.8          0.2     3.8      64     95
  b) infrequently down   # down events < 5      0.3     3.7      63     93
  c) non-sparse blocks   A ≥ 0.2                0.9     3.1      53     78
  d) FBS considered      Â3FR < 0.2             2.8     1.2       -     30
     overlap with (c)                           0.6     0.8       -      -
  FBS trackable          |E(b)| ≥ 3             0.2     5.7      96    142
either good blocks that went dark due to a usage change, pushing the quarterly average of A down,
or sparse blocks with few active addresses (for example, |E(b)| < 100) where Trinocular can do
a better job inferring the correct state.
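The thresholds discussed above can be sketched as simple predicates over per-block statistics. This is an illustrative sketch, not the Trinocular implementation: the `Block` record and its field names are our own, standing in for the quantities |E(b)|, A, up time, and down-event count.

```python
# Sketch: the three sparse-block filters (a)-(c) and the trackability
# criteria from Table 2.4, applied to hypothetical per-block statistics.
from dataclasses import dataclass

@dataclass
class Block:
    n_active: int      # |E(b)|: ever-active addresses in the /24
    avail: float       # A: quarterly mean response probability
    up_time: float     # fraction of the quarter observed up
    down_events: int   # Trinocular down events in the quarter

def trinocular_trackable(b: Block) -> bool:
    # Trinocular's baseline: |E(b)| >= 15 and A >= 0.1.
    return b.n_active >= 15 and b.avail >= 0.1

def passes_filters(b: Block) -> bool:
    # Filters (a)-(c) from Table 2.4, applied together.
    return b.up_time > 0.8 and b.down_events < 5 and b.avail >= 0.2

def fbs_trackable(b: Block) -> bool:
    # FBS relaxes trackability to at least three active addresses.
    return b.n_active >= 3

sparse = Block(n_active=10, avail=0.15, up_time=0.6, down_events=12)
print(trinocular_trackable(sparse), fbs_trackable(sparse))  # False True
```

The sketch shows why FBS expands coverage: the sparse example block fails Trinocular's baseline and every filter, but remains trackable under the FBS criterion.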
Can FBS+LABR expand baseline coverage? Finally, we examine the number of blocks discarded
as untrackable from historical data, and therefore not tracked for outages. For instance, Trinocular looks at
the last 16 surveys [58] and filters all blocks with |E(b)| < 15 and A < 0.1, leaving its baseline of 4M blocks.
In a similar approach, we use the 2017-04-27 survey as our upper bound of the responsive Internet [62].
As Table 2.4 shows, we find 5.9M responsive blocks, of which 5.7M had at least three active addresses during
the measured period. That is 1.7M (43%) more blocks than the baseline that become trackable. Adding these
1.7M to the number of FBS-repaired blocks (1.2M), our effective coverage increases by 2.9M blocks.
2.6 Related Work on Improving Internet Coverage
Several groups have methods to detect outages at the Internet’s edge: ThunderPing first used active
measurements to track weather-related outages on the Internet [109, 85]. Dainotti et al. use passive
observations at a network telescope to detect disasters and government censorship [33, 32], providing the first view
into firewalled networks. Chocolatine provides the first published algorithm using passive network telescope
data [56], with a 5-minute detection delay, but it requires AS- or country-level granularity, much more
data than /24s. Trinocular uses active probes to study outages in about 4M /24 blocks [90] every 11 minutes,
the largest active coverage. Disco [110] observes connectivity from devices at home [102], providing
strong ground truth, but limited coverage. Richter et al. detect outages that last at least one hour with
CDN traffic, confirming with software at the edge [96]. They define disruptions, showing that renumbering
and frequent disagreements in a few blocks are false down events in prior work.
Finally, recent work has looked at dynamic addressing as one source of sparsity [83], showing that hosts
can be moved between blocks for short periods of time. In this work, we address the problem of sparse
blocks, making unreliable outage detection results reliable. Mirkovic et al. [75] look into address liveness
using passive observations, and provide a definition of sparsity as the limitation of a given monitor to
see a target. We follow a similar definition of sparsity, but using an active approach.
Our work builds on prior active probing systems and the Trinocular data and algorithms, and addresses
problems identified by Richter, ultimately due to sparsity and dynamics.
2.7 Study Conclusions
In this study, we defined two algorithms: Full Block Scanning (FBS) (Section 2.3.1), to address false outages
seen in active measurements of sparse blocks, and Lone Address Block Recovery (LABR) (Section 2.3.2), to
handle blocks with one or two responsive addresses. We showed that these algorithms increase coverage,
from a nominal 67% (and as low as 53% after filtering) of responsive blocks before to 5.7M blocks, 96%
of responsive blocks (Section 2.5.3). We showed these algorithms work well using multiple datasets and
natural experiments (Section 2.4.1, Section 2.4.2, Section 2.5.1); they can improve existing and future outage
datasets.
This study contributes towards showing our thesis (Section 1.4) by adding new unmonitored addresses
and improving network reliability analysis of the Internet core (Table 1.1).
In the next chapter we will define “the Internet core” (Section 3.2.2) and use the definition to determine
who “keeps” the Internet if a nation secedes (Section 3.3.5), and to resolve and quantify when sections of
the network become reachable to only a fraction (Section 3.5).
Chapter 3
What is the Internet Core?
This chapter covers our second study. In this study, we propose a new definition of the Internet core
that defines a single, global network, while helping us recognize that partial reachability is a fundamental part
of the Internet’s core. To understand reachability we define peninsulas, persistent, partial connectivity; and
islands, when one or more computers are partitioned from the main Internet. We develop new algorithms
to measure the number, size, and duration of peninsulas and islands. These algorithms follow from a
conceptual definition of the Internet’s core defined by connectivity, not special authority.
This study contributes towards showing our thesis statement (Section 1.4). We provide a new conceptual
definition of the Internet core (Section 3.2.2), and then we use the definition to disambiguate questions
about network reliability like who is “on the Internet” (Section 3.6.2), or whether a host is reachable or not
(Section 3.6.3). We implement our definition using operational algorithms (Section 3.3.2, Section 3.3.4).
We have released several versions of this study as technical reports [8, 11]. A version of Section 3.6.4 is
derived and expanded from work done primarily by Saluja Tarang [107] from the definitions we provided.
All of the data used (Section 3.3.1) and created [6] in this study is available at no cost. We review ethics
in detail in Section 3.A, but our bulk analysis of IP addresses does not associate them with individuals. Our
work was IRB-reviewed and identified as non-human-subjects research (USC IRB IIR00001648).
3.1 Introduction
What is the Internet’s core? An “internetwork” was rst used to describe a use of an early version of
TCP, but without denition [22]. Postel’s “a collection of interconnected networks is an internet” give the
ARPAnet and X.25 as examples of internets [88]. The Federal Networking Council dened “Internet” in
1995 as (i) a global address space, (ii) supporting TCP/IP and its follow-ons, that (iii) provides services [46],
with later work considering DNS [63] and IPv6.
Today’s Internet is dramatically different from that of 1995: users at home and work access the Internet
indirectly through NAT [115]. Most access is from mobile devices, often behind CG-NAT [99]. Many public
services are operated from the cloud, visible through rented or imported IP addresses, but backed with
complex services built on virtual networks (for example, [54]). Content is replicated in CDNs. Access to
each is mediated by firewalls. Today’s Internet succeeds so well with seamless, globally-available services
using common protocols that technical details become background, and laypeople consider the web,
Facebook, or their mobile phone as their “Internet”.
Yet the notion of one, globally-available Internet core today faces political, architectural, and operational
challenges. Political pressure and threats of disconnection are increasing: the 2019 “sovereign
Internet” law in Russia [81, 30, 95], and a national “Internet kill switch” has been debated (including in the
U.S. [53] and U.K.) and employed [29, 27, 55, 113]. These pressures prompted policy discussions about
fragmentation [38, 1]. Architecturally, twenty-five years of evolution have segmented the Internet: services
gatewayed through proprietary cloud APIs, users increasingly relegated to second-class status as clients,
firewalls interrupting connectivity, and a world straddling a mix of IPv4 and IPv6. Architecture sometimes
follows politics, with China’s Great Firewall managing international communication [4, 5], and Huawei
proposing “new Internet” protocols [43]. Operationally, ISP peering is mature, but today peering disputes
cause long-term partial unreachability [66]. This unevenness has been recognized and detected
experimentally [35], and in systems that detect and bypass partial reachability [3, 64, 65].
Contributions: The first contribution of this study is to recognize that partial reachability is a fundamental
part of the Internet’s core. We define peninsulas, when a network sees persistent, partial connectivity to
part of the Internet, and islands, when one or more computers are partitioned from the main Internet.
We develop algorithms to measure each (Section 3.3). Taitao detects peninsulas that often result from
peering disputes or long-term firewalls. Our second algorithm, Chiloe, detects islands. These algorithms
are operational, able to estimate the presence of peninsulas and islands in existing measurement data from
two different Internet-wide measurement systems.
A rigorous definition of peninsulas and islands requires that we identify the Internet’s core. The
Internet’s core is the connected component of more than 50% of active, public IP addresses that can initiate
communication with each other (Section 3.2.2). This definition has several unique characteristics. First,
requiring bidirectional initiation captures the uniform, peer-to-peer nature of the Internet’s core necessary
for first-class services. Second, it defines one, unique Internet core by requiring reachability of more than
50%: there can be only one since multiple majorities are impossible. Finally, unlike prior work, this conceptual
definition avoids dependence on any specific measurement system, nor does it depend on historical
precedent, special locations, or central authorities. Although our operational measurements of peninsulas
and islands may reflect observation error, the conceptual Internet core defines an asymptote against which
our current and future measurements can be compared, unlike prior definitions from specific systems [3, 64,
65].
Our second contribution is to use peninsulas and islands to address current operational questions. As
described earlier, we bring technical light to policy choices around national networks (Section 3.6.1) and
de-peering (Section 3.6.2). We improve sensitivity of RIPE Atlas’ DNSmon [2] (Section 3.6.4), resolve corner
cases in outage detection (Section 3.5.1), and quantify opportunities for route detouring.
Our nal contribution is to support these claims with rigorous measurements from two measurement
systems. We evaluate our new algorithms with publicly available, existing measurements of connectivity to
5M networks from six VPs over multiple years [90]. While a handful of locations cannot represent the entire
Internet, each observer scans most of the ping-responsive Internet from a unique geographic and network
location, providing a wide range of results over time. Our analysis shows that combinations of any three
independent VPs provide a result that is statistically indistinguishable from the asymptote (Section 3.5.1).
We show our algorithms provide consistent results, offering reproducible and useful estimates of Internet
reachability and partial connectivity. We also validate interesting events with selective traceroutes.
We also evaluate about 10k globally distributed VPs (RIPE Atlas [103]) observing connectivity to 13
anycast destinations (the Root Server System [105]). These observations from thousands of locations over
multiple years validate the occurrence of rare events like islands, and demonstrate how pervasive peninsulas
are. They confirm our results from Internet-wide scans, and allow us to tune DNSmon, as described
earlier.
3.2 How Do We Define the Internet?
While historic definitions (see Section 3.1) are helpful, today’s challenges impose two new requirements.
First, a definition should be both conceptual and operational [39]. Our conceptual definition in Section 3.2.2
articulates what we would like to observe. In Section 3.3 we operationalize it, describing how actual
measurement systems can estimate this value. The conceptual definition suggests a limit that implementations
can approach (Section 3.5.1), even if it cannot be directly implemented. Prior definitions are too vague to
operationalize.
Second, a definition must give both sufficient and necessary conditions to be part of the Internet’s core.
Prior work gave properties the core must have (sufficient conditions, like supporting TCP). Our definition
adds necessary conditions that indicate when networks leave the Internet’s core.
3.2.1 Why Does Defining the Internet Matter?
These requirements arise due to stressors on today’s Internet from its increasing political, architectural,
and operational importance. We listed these stresses previously (Chapter 1); here we describe how definitions
and measurements can help.
Political tussles around the Internet rose with the Internet’s economic value in the 1990s. Today the
topics of Internet control, data storage, and Internet sovereignty, are issues of international importance at
top levels of government.
While the intersection of national interests and the Internet is necessarily political, providing technical
definitions of what the Internet core is can clarify sovereignty. We show that no single country can
unilaterally “take” the Internet (Section 3.6.2), although any can walk away. We show the risks of political
choices such as de-peering, with a sharp technical definition of when the Internet will fragment into
pieces.
Architectural challenges to the Internet arise from the vast use of NAT, CG-NAT, and the cloud: today most
computers are not on the IPv4 Internet core, but are attached via these branches. In addition, concurrent
deployment of IPv4 and IPv6 raises the question of whether different maturity of deployments affects quality. We
hope our definition can clarify the role of the Internet core in today’s Internet and help us understand how
the architectural changes of ubiquitous NAT, cloud, and IPv6 change and do not change our assumptions.
Operationally, the Internet is quite robust. Yet independent outage-detection systems struggle with
conflicting signals of connectivity [56, 90, 96, 109]. Our definition and algorithms show that outages are
not always binary, and peninsulas of partial connectivity are common.
3.2.2 The Internet: A Conceptual Definition
We define the Internet core as the connected component of more than 50% of active, public IP addresses that
can initiate communication with each other. Computers behind NAT and in the cloud are on branches,
participating but not part of the core, typically with dynamically allocated or leased public IP addresses.
This conceptual definition gives two Internet cores, one for the IPv4 address space and one for IPv6.
This definition follows from the terms “interconnected networks”, “IP protocol”, and “global address
space” used in informal definitions: they all share the common assumption that two computers on the
Internet should be able to communicate directly with each other at the IP layer.
We formalize “an agreement of networks to interconnect” by considering reachability over public IP
addresses: addresses x and y are interconnected if traffic from x can reach y and vice versa (that is, x and
y can reach each other). Networks are groups of addresses that can reach each other.
Why More than 50%? We take as an axiom that there should be one Internet core, or a reason why a single
Internet core no longer exists. Thus we require a definition that unambiguously identifies “the” Internet core
given conflicting claims.
We require that the Internet core include more than 50% of active addresses so that the majority can
settle conflicting claims. Only one group can control a majority of addresses, while any smaller fraction
could allow two groups to tie with equally valid claims. The result is that there is always a well-defined
Internet core even if a major nation (or group of nations) chose to secede. A majority defines a unique,
unambiguous partition that keeps the Internet.
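The majority rule above can be illustrated as a small computation: given a symmetric reachability relation over active addresses, find connected components and return the one holding a strict majority, if any. This is a toy sketch of the conceptual definition, not an operational measurement system; the reachability pairs are hypothetical inputs.

```python
# Sketch of the "more than 50%" rule: the Internet core is the connected
# component holding a strict majority of active addresses, if one exists.
from collections import defaultdict

def core(addresses, reachable_pairs):
    """Return the majority connected component, or None if none exists."""
    parent = {a: a for a in addresses}
    def find(a):
        while parent[a] != a:
            parent[a] = parent[parent[a]]  # path halving
            a = parent[a]
        return a
    for x, y in reachable_pairs:           # union mutually reachable pairs
        parent[find(x)] = find(y)
    comps = defaultdict(set)
    for a in addresses:
        comps[find(a)].add(a)
    largest = max(comps.values(), key=len)
    # A strict majority makes the core unique: two components can never
    # both exceed 50% of the active addresses.
    return largest if len(largest) * 2 > len(addresses) else None

addrs = ["a", "b", "c", "d", "e"]
print(sorted(core(addrs, [("a", "b"), ("b", "c")])))  # ['a', 'b', 'c']
```

With five active addresses, the component {a, b, c} holds 3 of 5 and is the core; if the network instead split into components none of which exceeded half, `core` returns `None`, the fragmentation case discussed below.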
The Internet’s core is reachable from multiple Internet backbones of Tier-1 ISPs with default-free routing.
Our definition allows us to reason about differences between what ISPs see, particularly due to long-term
peering disputes.
This definition suggests that it is possible for the Internet to fragment: if the current Internet breaks
into three disconnected components where none has a majority of active addresses. Such a result would
end a single, global Internet.
Why all and active addresses? In each of IPv4 and IPv6 we consider all addresses equally. The
Internet is global, and was intentionally designed without a hierarchy [24]. Our definition should not
create a hierarchy or designate special addresses by age or importance, consistent with trends towards
Internet decentralization [36].
We define active addresses as those in blocks that are reachable, as defined below. Our goal is to exclude the
influence of large allocated-but-unused space. Large unused space is present in IPv4 legacy /8 allocations
and in large new IPv6 allocations.
Reachability with Protocols and Firewalls: This conceptual definition allows for different definitions
of reachability. Reachability can be tested through measurements with specific protocols, such as
ICMP echo request (pings), or TCP or UDP queries. Such a test will result in an operational realization of
our conceptual definition. Particular tests will differ in how closely each approaches the conceptual ideal.
In Section 3.5.1 we examine how well one test converges.
Our conceptual definition considers reachability, but firewalls block protocols (sometimes conditionally
or unidirectionally), complicating observation of this potential. Thus different protocols or times might give
different answers, and one could define broad reachability with any protocol in a firewall-friendly manner,
or narrowly. Measurement allows us to evaluate policy-driven unreachability in Section 3.5.8.
Our operational data uses ICMP echo requests (Section 3.3.1), following prior work that compared
alternatives [13, 90, 40] and showed ICMP provides better coverage than alternatives, and can avoid
attenuation from rate limiting [57].
Why reachability and not applications? Users care about applications, and a user-centric view
might emphasize availability of HTTP or Facebook rather than IP. We recognize this attention, but
intentionally measure reachability at the IP layer as a more fundamental concept. IP has changed only twice
since 1969, with IPv4 and IPv6, but dominant applications ebb and flow, and important applications often
extend beyond the Internet. (E-mail has been transparently relayed to UUCP and FidoNet, and the web to
pre-IP mobile devices with WAP.) Future work may look at applications, but we see IP-level reachability
as an essential starting point.
Why bidirectional reachability? Most computers today are on branches off the core, behind NAT
or in the cloud. While such computers are useful as Internet clients, they provide services to the core or
to peers only through the core. Individual computers use protocols such as STUN [106] that rendezvous
through the core, or UPnP [73] or PMP [23] that reconfigure a NAT on the core. Huge services run in the
cloud by leasing public IP addresses from the cloud operator or importing their own (BYOIP).
Similarly, services may be operated as many computers behind a single public IP address with load
balancing or IP anycast [86], perhaps with cloud-based address translation [54]. Computers with only
application-level availability are also not fully part of the Internet core.
3.2.3 The Internet Landscape
Our definition of the Internet’s core highlights its “rough edges”. Using our conceptual definition of the
Internet as the fully connected component (Section 3.2.2), we identify three specific problems: an address
a is a peninsula when it has partial connectivity to the Internet’s core, an island when it cannot reach any
of the Internet’s core, and an outage only when it is off.
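This taxonomy can be sketched as a simple classifier. The sketch is illustrative only: it reduces the definitions to two hypothetical observations (how many core vantage points can reach an address, and whether it is up locally), and is not the Taitao or Chiloe algorithm.

```python
# Sketch of the taxonomy above: classify an address from (i) how many
# core vantage points can reach it and (ii) whether it is up locally.
# These inputs are hypothetical observations, not measured data.
def classify(reachable_from_core: int, total_core_vps: int,
             up_locally: bool) -> str:
    if reachable_from_core == total_core_vps:
        return "core"        # full connectivity to the core
    if reachable_from_core > 0:
        return "peninsula"   # partial connectivity to the core
    if up_locally:
        return "island"      # up, but partitioned from the core
    return "outage"          # the machines themselves are off

print(classify(3, 6, True))   # peninsula
print(classify(0, 6, True))   # island
print(classify(0, 6, False))  # outage
```

Note how islands and outages are indistinguishable from outside (both are unreachable from every core VP); only the local observation separates them, which is why the sections below treat islands as a special case of outage.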
3.2.3.1 Outages
A number of groups have examined Internet outages [109, 90, 96, 56]. These systems observe the IPv4
Internet and identify networks that are no longer reachable: they have left the Internet. Often these systems
define outages operationally (network b is out because none of our VPs can reach it). Conceptually,
an outage is when all computers in a block are off, such as due to a power outage. When the computers
are on but cannot reach the Internet, we consider them islands, a special case of outage that we define
next.
[Strip charts per VP (W, E, C, G, N, J): last octet of target block vs. time; legend: up (truth), up (implied), down (truth), down (implied).]
Figure 3.1: A 1-hour island where a block 65.123.202.0/24 reaches itself from VP E (top) but not other VPs.
3.2.3.2 Islands: Isolated Networks
An island is a group of public IP addresses partitioned from the Internet’s core, but still able to communicate
among themselves. Operationally, outages and islands are both unreachable from an external VP, but
computers in an island can reach each other.
Islands occur when an organization that has a single connection to the Internet loses its router or link
to its ISP. A single-office business may become an island when its router’s upstream connection fails, but
computers in the office can still reach each other and in-office servers. In the smallest case, an address
island, a computer can ping only itself. Islands are a special case of outages, and we suspect that most
outages are actually temporary islands.
A Brief Island: Figure 3.1 shows an example of an island we have observed. In this graph, each strip
shows a different VP’s view of the last 156 addresses from the same IPv4 /24 block over 12 hours, starting
at 2017-06-03t23:06Z. In each strip, the darkest green dots show positive responses of that address to an
ICMP echo request (a “ping”) from that observer, and medium gray dots indicate a non-response to a ping.
[Time series: number of blocks down (0 to 450k), 2017-06-29 to 2017-10-05.]
Figure 3.2: Number of blocks down in the whole responsive Internet. Dataset: A29, 2017q3.
We show inferred state as lighter green or lighter gray until the next probe. We show 3 of the 6 VPs, with
probe intervals of about 11 minutes (for methodology, see Section 3.3.1).
The island is indicated by the red bar in the middle of the graph, where VP E continues to get positive
responses from several other addresses (the continuous green bars along the top). By contrast, the other 5
VPs show many non-responses during this period. For this whole hour, VP E and this network are part of
an island, cut o from the rest of the Internet and the other VPs.
Country-size Islands: We would typically say that a company that loses Internet access experienced
an Internet outage and not call it an island. (In fact, with many companies depending on cloud-hosted
services, loss of Internet may well stop all work in the company.)
However, we have seen country-size islands. In 2017q3 we observed 8 events when it appears that
most or all of China stopped responding to external pings. Figure 3.2 shows the number of /24 blocks that
were down over time; each spike covers more than 200k /24s and lasts two to eight hours. We found no
problem reports on network operator mailing lists, so we believe these outages were ICMP-specific and
likely did not affect web traffic. In addition, we assume the millions of computers inside China continued
to operate. We consider these cases examples of China becoming an ICMP-island.
3.2.3.3 Peninsulas: Partial Connectivity
Link and power failures create islands, but a more pernicious problem is partial connectivity, when one can
reach some destinations, but not others. We call a group of public IP addresses with partial connectivity
to the Internet’s core a peninsula. (In a geographic peninsula, the mainland may be visible over water, but
reachable only with a detour. In a network peninsula, routing between two points may require a relay
through a third location.) Peninsulas occur when some upstream providers of a multi-homed network
accept traffic but then drop it due to outages, peering disputes, or firewalls. Peninsula existence has long
been recognized, with overlay networks designed to route around them in RON [3], Hubble [64], and
LIFEGUARD [65].
Examples in IPv6: An example of a persistent peninsula is the IPv6 peering dispute between Hurricane Electric (HE) and Cogent. These ISPs decline to peer in IPv6, and neither is willing to forward its
IPv6 traffic through another party. This problem was noted in 2009 [66] and is visible as of June 2020
in DNSMon [101]. We confirm unreachability between HE and Cogent users in IPv6 with traceroutes
from looking glasses [42, 28] (HE at 2001:470:20::2 and Cogent at 2001:550:1:a::d). Neither can reach their
neighbor’s server, but both reach their own. (Their IPv4 reachability is fine.)
Other IPv6 disputes are Cogent with Google [92], and Cloudflare with Hurricane Electric [48]. Disputes
are often due to an inability to agree to settlement-free or paid peering.
An Example in IPv4: We next explore a real-world example of partial reachability to several Polish
ISPs that we found with our algorithms. On 2017-10-23, for a period of 3 hours starting at 22:02Z, five
Polish Autonomous Systems (ASes) had 1716 blocks that were unreachable from five VPs while the same
blocks remained reachable from a sixth VP.
Figure 3.3: AS level topology during the Polish peninsula.
Figure 3.3 shows the AS-level relationships at the time of the peninsula. Multimedia Polska (AS21021,
or MP) provides service to the other 4 ISPs. MP has two Tier-1 providers: Cogent (AS174) and Tata
(AS6453). Before the peninsula, our VPs see MP through Cogent.
At event start, we observe many BGP updates (20,275) announcing and withdrawing routes to the
affected blocks (see Figure 3.4). These updates correspond to Tata announcing MP’s prefixes. Perhaps MP
changed its peering to prefer Tata over Cogent, or the MP-Cogent link failed.
Initially, traffic from most VPs continued through Cogent and was lost; it did not shift to Tata. One VP
(W) could reach MP through Level3 for the entire event, proving MP was connected. After 3 hours, we see
another burst of BGP updates (23,487 this time) and confirm that traffic now flows through Tata.
To show what our VPs see, Figure 3.5 shows address reachability for a full block affected by this peninsula, similar to Figure 3.1. We see the top VP, W, has some responsive addresses the whole time (some rows
remain green during the red-band peninsula period). The other five VPs drop out at midnight on Oct. 24. We
show 2 here (others are similar), and while a few light green addresses indicate we infer they are reachable,
Figure 3.4: BGP update messages sent for aected Polish blocks starting 2017-10-23t20:00Z. Data source:
RouteViews.
Figure 3.5: A block (80.245.176.0/24) showing a 3-hour peninsula accessible only from VP W (top bar) and
not from the other 5 VPs. Dataset: A30, 2017q4.
src block: c85eb700   dst block: 50f5b000   time: 1508630032
  trace: q, 148.245.170.161, 189.209.17.197, 189.209.17.197, 38.104.245.9, 154.24.19.41,
  154.54.47.33, 154.54.28.69, 154.54.7.157, 154.54.40.105, 154.54.40.61,
  154.54.43.17, 154.54.44.161, 154.54.77.245, 154.54.38.206, 154.54.60.254,
  154.54.59.38, 149.6.71.162, 89.228.6.33, 89.228.2.32, 176.221.98.194
src block: c85eb700   dst block: 50f5b000   time: 1508802877
  trace: q, 148.245.170.161, 200.38.245.45, 148.240.221.29
Table 3.1: Traces from the same Ark VP (mty-mx) to the same destination block before and during the
event.
all addresses eventually go gray for unreachable. We know this example is a peninsula because VP W is
outside the block and can reach in (unlike Figure 3.1, where it was inside and could not see out).
We can confirm this peninsula with additional observations from CAIDA’s Ark traceroutes. During
the event we see 94 unique Ark VPs attempted 345 traceroutes to the affected blocks. Of the 94 VPs, 21 VPs
(22%) have their last responsive traceroute hop in the same AS as the target address, and 68 probes (73%)
stopped before reaching that AS. Table 3.1 shows traceroute data from a single CAIDA Ark VP before and
during the peninsula. This data confirms the block was reachable from some locations and not others.
During the event, this trace breaks at the last hop within the source AS.
Although we do not have a root cause for this peninsula from network operators, the large number of
BGP update messages suggests a routing problem. In Section 3.5.7 we show peninsulas are mostly due to
policy choices.
3.3 Detecting Partial Connectivity
We use observations from multiple, independent VPs to detect partial outages and islands (from Section 3.2)
with our new algorithms: Taitao detects peninsulas, and Chiloe, islands. (Algorithm names are from Patagonian geography.)
Dataset Name Source Start Date Duration Where Used
internet_outage_adaptive_a28w-20170403 Trinocular [117] 2017-04-03 90 days
Polish peninsula subset 2017-06-03 12 hours Section 3.2.3.2, Section 3.2.3.3
internet_outage_adaptive_a28c-20170403 Trinocular 2017-04-03 90 days
Polish peninsula subset 2017-06-03 12 hours Section 3.2.3.3
internet_outage_adaptive_a28j-20170403 Trinocular 2017-04-03 90 days
Polish peninsula subset 2017-06-03 12 hours Section 3.2.3.3
internet_outage_adaptive_a28g-20170403 Trinocular 2017-04-03 90 days
Polish peninsula subset 2017-06-03 12 hours Section 3.2.3.3
internet_outage_adaptive_a28e-20170403 Trinocular 2017-04-03 90 days
Polish peninsula subset 2017-06-03 12 hours Section 3.2.3.2, Section 3.2.3.3
internet_outage_adaptive_a28n-20170403 Trinocular 2017-04-03 90 days
Polish peninsula subset 2017-06-03 12 hours Section 3.2.3.2, Section 3.2.3.3
internet_outage_adaptive_a28all-20170403 Trinocular 2017-04-03 89 days Section 3.4.3, Section 3.5.9, Section 3.5.10
internet_outage_adaptive_a29all-20170702 Trinocular 2017-07-02 94 days Section 3.2.3.2, Section 3.4.3, Section 3.5.9,
Section 3.5.10
internet_outage_adaptive_a30w-20171006 Trinocular 2017-10-06 85 days
Site E Island 2017-10-23 36 hours Section 3.2.3.3, Section 3.2.3.3
internet_outage_adaptive_a30c-20171006 Trinocular 2017-10-06 85 days
Site E Island 2017-10-23 36 hours Section 3.2.3.3
internet_outage_adaptive_a30j-20171006 Trinocular 2017-10-06 85 days
Site E Island 2017-10-23 36 hours Section 3.2.3.3
internet_outage_adaptive_a30g-20171006 Trinocular 2017-10-06 85 days
Site E Island 2017-10-23 36 hours Section 3.2.3.3
internet_outage_adaptive_a30e-20171006 Trinocular 2017-10-06 85 days
Site E Island 2017-10-23 36 hours Section 3.2.3.3, Section 3.2.3.3
internet_outage_adaptive_a30n-20171006 Trinocular 2017-10-06 85 days
Site E Island 2017-10-23 36 hours Section 3.2.3.3, Section 3.2.3.3
internet_outage_adaptive_a30all-20171006 Trinocular 2017-10-06 85 days Section 3.4.3, Section 3.5.9, Section 3.5.10,
Section 3.6.3.3
Oct. Nov. subset 2017-10-06 40 days Section 3.4.2, Section 3.5.3, Section 3.5.5
Oct. subset 2017-10-10 21 days Section 3.4.1, Section 3.6.3.2
internet_outage_adaptive_a31all-20180101 Trinocular 2018-01-01 90 days Section 3.4.3, Section 3.5.9, Section 3.5.10
internet_outage_adaptive_a32all-20180401 Trinocular 2018-04-01 90 days Section 3.4.3, Section 3.5.9, Section 3.5.10
internet_outage_adaptive_a33all-20180701 Trinocular 2018-07-01 90 days Section 3.4.3, Section 3.5.9, Section 3.5.10
internet_outage_adaptive_a34all-20181001 Trinocular 2018-10-01 90 days Section 3.4.3, Section 3.5.9, Section 3.5.10,
Section 3.5.2
internet_outage_adaptive_a35all-20190101 Trinocular 2019-01-01 90 days Section 3.4.3, Section 3.5.9, Section 3.5.10
internet_outage_adaptive_a36all-20190401 Trinocular 2019-01-01 90 days Section 3.4.3, Section 3.5.9, Section 3.5.10
internet_outage_adaptive_a37all-20190701 Trinocular 2019-01-01 90 days Section 3.4.3, Section 3.5.9, Section 3.5.10
internet_outage_adaptive_a38all-20191001 Trinocular 2019-01-01 90 days Section 3.4.3, Section 3.5.9, Section 3.5.10
internet_outage_adaptive_a39all-20200101 Trinocular 2020-01-01 90 days Section 3.4.3, Section 3.5.9, Section 3.5.10
internet_outage_adaptive_a41all-20200701 Trinocular 2020-07-01 90 days Section 3.5.7
prefix-probing Ark [17]
Oct. 2017 subset 2017-10-10 21 days Section 3.4.1, Section 3.6.3.2
2020q3 subset 2020-07-01 90 days Section 3.5.7
probe-data Ark
Oct 2017 subset 2017-10-10 21 days Section 3.4.1, Section 3.6.3.2
2020q3 subset 2020-07-01 90 days Section 3.5.7
routeviews.org/bgpdata Routeviews [72] 2017-10-06 40 days Section 3.4.2, Section 3.2.3.3
Atlas Recurring Root Pings (id: 1001 to 1016) Atlas [79] 2021-07-01 90 days Section 3.5.1, Section 3.5.10
nro-extended-stats NRO [60, 61] 1984 41 years Section 3.6.2
Table 3.2: All datasets used in this study.
3.3.1 Suitable Data Sources
We use publicly available data from three systems: USC Trinocular [90], RIPE Atlas [103], and UCSD’s
Archipelago [19]. We list all datasets in Table 3.2.
Our algorithms use data from Trinocular [90] because it is available at no cost [118], provides data
since 2014, and covers most of the responsive IPv4 Internet [9]. Briefly, Trinocular watches about 5M out
of 5.9M responsive IPv4 /24 blocks. In each probing round of 11 minutes, it sends up to 15 ICMP echo
requests (pings), stopping early if it proves the block is reachable. It interprets the results using Bayesian
inference, and merges the results from six geographically distributed VPs. VPs are in Los Angeles (W),
Colorado (C), Tokyo (J), Athens (G), Washington, DC (E), and Amsterdam (N). In Section 3.6.3.3 we show
they are topologically independent. Extending our algorithms to other active probing data is future work.
We use RIPE Atlas [103] for islands (Section 3.3.4) and to see how partial connectivity affects monitoring (Section 3.6.4). As of 2022, it has about 12k VPs, distributed globally over 3572 different IPv4 ASes.
Atlas VPs carry out both researcher-directed measurements and periodic scans of DNS servers. We use
Atlas scans of DNS root servers in our work.
We validate our results using CAIDA’s Ark [19], and use AS numbers from Routeviews [72].
We generally use recent data, but in some cases we chose older data to avoid known problems in measurement systems. Many of our findings are demonstrated over multiple years, as we show in Section 3.5.2,
Section 3.5.4 and Section 3.5.6. We use Trinocular measurements for 2017q4 because this time period had
six active VPs, allowing us to make strong statements about how multiple perspectives help. We use 2020q3
data in Section 3.5.7 because Ark observed a very large number of loops in 2017q4. Problems with different
VPs reduced coverage for 2019 and 2020, but we verify and find quantitatively similar results in 2020 in
Section 3.5.2, Section 3.5.4 and Section 3.5.6.
3.3.2 Taitao: a Peninsula Detector
Peninsulas occur when portions of the Internet are reachable from some locations and not others. They can
be seen by two VPs disagreeing on reachability. With multiple VPs, non-unanimous observations suggest
a peninsula.
Detecting peninsulas presents three challenges. First, we do not have VPs everywhere. If all VPs are
on the same “side” of a peninsula, their reachability agrees even though other potential VPs may disagree.
Second, VP observations are not synchronized. For Trinocular, they are spread over an 11-minute interval,
so different VPs test reachability at slightly different times. When observations are made just before and
after a network change, both are true, but the disagreement is from unsynchronized measurement and not
a peninsula. Third, connectivity problems near the observer (or when an observer is an island) should not
reflect on the intended destination.
We identify peninsulas by detecting disagreements in block state, comparing valid VP observations
that occur at about the same time. Since probing rounds occur every 11 minutes, we compare measurements within an 11-minute window. This approach will see peninsulas that last at least 11 minutes, but
may miss briefer ones, or peninsulas where VPs are not on “both sides”.
Formally, O_{i,b} is the set of observers with valid observations about block b at round i. We look for
disagreements in O_{i,b}, defining O^{up}_{i,b} ⊂ O_{i,b} as the set of observers that measure block b as up at round i.
We detect a peninsula when:

    0 < |O^{up}_{i,b}| < |O_{i,b}|    (3.1)
When only one VP reaches a block, that block can be either a peninsula or an island. We require more
information to distinguish them, as we describe in Section 3.3.4.
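The per-round test in Equation 3.1 can be sketched in a few lines of Python. This is an illustrative reconstruction under our own assumed input format (a map from VP name to an up/down boolean), not the actual Taitao implementation.

```python
# Hypothetical sketch of the per-round Taitao test (Equation 3.1).
# observations maps a VP name to True (block up) or False (down);
# VPs without a valid observation for this round are simply absent.

def taitao_round(observations):
    """Classify one /24 block in one 11-minute round."""
    n_valid = len(observations)        # |O_{i,b}|
    n_up = sum(observations.values())  # |O^up_{i,b}|
    if 0 < n_up < n_valid:             # disagreement: Equation 3.1
        return "peninsula"
    return "all-up" if n_up == n_valid else "all-down"

# One VP still reaches the block while five do not: a peninsula
# (or possibly an island; Section 3.3.4 distinguishes the two).
obs = {"W": True, "C": False, "J": False, "G": False, "E": False, "N": False}
print(taitao_round(obs))  # peninsula
```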
3.3.3 Detecting Country-Level Peninsulas
Taitao detects peninsulas based on differences in observations. Long-lived peninsulas are likely intentional,
from policy choices. One policy is filtering based on national boundaries, possibly to implement legal
requirements about data sovereignty or economic boycotts.
We identify country-specific peninsulas as a special case of Taitao where a given destination block is
reachable (or unreachable) from only one country, persistently for an extended period of time. (In practice,
the ability to detect country-level peninsulas is somewhat limited because the only country with multiple
VPs in our data is the United States. However, we augment non-U.S. observers with data from other
non-U.S. sites such as Ark or RIPE Atlas.)
A country-level peninsula occurs when all available VPs from the same country as the target block
successfully reach the target block and all available VPs from different countries fail. Formally, we say
there is a country peninsula when the set of observers claiming block b is up at time i is equal to O^{c}_{i,b} ⊂ O_{i,b},
the set of all available observers with valid observations in country c:

    O^{up}_{i,b} = O^{c}_{i,b}    (3.2)
3.3.4 Chiloe: an Island Detector
According to our definition in Section 3.2.3.2, islands occur when the Internet is partitioned, and the smaller
component (that with less than half the active addresses) is the island. Typical islands are much, much
smaller.
We can find islands by looking for networks that are reachable from less than half of the Internet.
However, to classify such networks as an island and not merely a peninsula, we need to show that they are
partitioned. Without global knowledge, it is difficult to prove disconnection. In addition, if islands are
partitioned from VPs, we cannot tell an island from an outage. An island is disconnected but still active
inside, but for an outage, the computers are disconnected from the Internet’s core and from each other.
For these reasons, we must look for islands that include VPs in their partition. Because we know the
VP is active and scanning, we can determine how much of the Internet is in its partition, ruling out an
outage. We can also confirm the Internet is not reachable, to rule out a peninsula.
Formally, we say that B is the set of all blocks on the Internet responding in the last week. B^{up}_{i,o} ⊆ B
are blocks reachable from observer o at round i, while B^{dn}_{i,o} ⊆ B is its complement. We detect that observer
o is in an island when it thinks half or more of the observable Internet is down:

    0 ≤ |B^{up}_{i,o}| ≤ |B^{dn}_{i,o}|    (3.3)
This method is independent of measurement systems, but is limited to detecting islands that contain
VPs. We evaluate islands in two systems with thousands of VPs in Section 3.5.9. Finally, because observations are not instantaneous, we must avoid confusing short-lived islands with long-lived peninsulas. For
islands lasting longer than an observation period, we also require |B^{up}_{i,o}| → 0. When |B^{up}_{i,o}| = 0, we
have an address island.
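A minimal sketch of the per-round Chiloe test from a single VP's viewpoint follows; the names, input counts, and three-way labeling are illustrative assumptions, not the actual implementation.

```python
# Hypothetical sketch of the per-round Chiloe test (Equation 3.3).
# n_up and n_down count recently-responsive /24 blocks that the VP can
# and cannot reach this round (|B^up_{i,o}| and |B^dn_{i,o}|).

def chiloe_round(n_up, n_down):
    if n_up == 0:
        return "address-island"    # |B^up| = 0: the VP can ping only itself
    if n_up <= n_down:             # Equation 3.3: half or more appears down
        # A true island must also trend toward |B^up| -> 0 when the event
        # outlives an observation period; otherwise it is a peninsula.
        return "island-candidate"
    return "normal"

print(chiloe_round(n_up=0, n_down=5_000_000))     # address-island
print(chiloe_round(n_up=1000, n_down=4_999_000))  # island-candidate
```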
3.3.5 Applications
Political: Who Has the Internet? We explore this question in Section 3.6.1 and Section 3.6.2.
Architectural: Our work helps understand risk by showing reachability is not binary, but often partial. We explore this issue in Section 3.5; one key result is that users see peninsulas as often as outages
(Section 3.5.1). It helps clarify prior studies of Internet outages [109, 90, 110, 96, 56] (more detail is in
Section 3.6.3).
Operational: Cleaning Data. Problems near network observers can skew observations and must
be detected and removed, as we explore in Section 3.6.4, in [107], and in detection of COVID work-from-home [111].
3.4 Validating our Approach
We next validate detection with Taitao and Chiloe. First, we detect peninsulas using Taitao (Section 3.4.1),
and persistent country-level peninsulas (Section 3.4.2). Second, we examine Chiloe’s single-observer island
detection against external observers (Section 3.4.3).
3.4.1 Can Taitao Detect Peninsulas?
We compare Taitao detections from 6 VPs to independent observations taken from more than 100 VPs in
CAIDA’s Ark [19]. This comparison is challenging, because both Taitao and Ark are imperfect operational
systems that differ in probing frequency, targets, and method. Neither defines perfect ground truth, but
agreement suggests likely truth.
Although Ark probes targets much less frequently than Trinocular, Ark makes observations from 171
global locations, so it provides a diverse perspective. Ark traceroutes also allow us to assess where peninsulas begin. We expect to see a strong correlation between Taitao peninsulas and Ark observations. (We
considered RIPE Atlas as another external dataset, but its coverage is sparse, while Ark covers all /24s.)
Identifying comparable blocks: We study 21 days of Ark observations from 2017-10-10 to 2017-10-31. Ark
covers all networks with two strategies. With team probing, a 40-VP “team” traceroutes to all routed /24s
about once per day. For prefix probing, about 35 VPs each traceroute to the .1 addresses of all routed /24s every
day. We use both types of data: the three Ark teams and all available prex probing VPs. We group results
by /24 block of the traceroute’s target address.
Ark differs from Taitao’s Trinocular input in three ways: the target is a random address or the .1 address
in each block; it uses traceroute, not ping; and it probes blocks daily, not every 11 minutes. Sometimes
these differences cause Ark traceroutes to fail when a simple ping succeeds. First, Trinocular’s targets
respond more often because it uses a curated hitlist [45] while Ark does not. Second, Ark’s traceroutes can
terminate due to path loops or gaps in the path (in addition to succeeding or reporting the target unreachable).
We do not consider results with gaps, so problems on the path do not bias results for endpoints reachable
by direct pings.
To correct for differences in target addresses, we must avoid misinterpreting a block as unreachable
when the block is online but Ark’s target address is not. We discard traces sent to never-active addresses
(those not observed in 3 years of complete IPv4 scans), and blocks for which Ark did not get a single successful response. (Even with this filtering, dynamic addressing means Ark still sometimes sees unreachables.)
To correct for Ark’s less frequent probing, we compare long-lived Trinocular down-events (5 hours
or more). Ark measurements are infrequent (once every 24 hours) compared to Trinocular’s 11-minute
reports, so short Trinocular events are often unobserved by Ark. To confirm agreement or conflicting
reports from Ark, we require at least 3 Ark observations within the peninsula’s span of time.
We filter out blocks with frequent transient changes or signs of network-level filtering. We define the
“reliable” blocks suitable for comparison as those responsive for at least 85% of the quarter from each of
the 6 Trinocular VPs. (This threshold avoids diurnal blocks or blocks with long outages; values of 90% or
less have similar results.) We also discard flaky blocks whose responses are frequently inconsistent across
VPs. (We consider more than 10 combinations of VPs as frequently inconsistent.) For the 21 days, we find
4M unique Trinocular /24 blocks, and 11M Ark /24 blocks, making 2M blocks in both available for study.
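The filtering steps above can be condensed into one predicate. This sketch uses the thresholds stated in the text (85% responsiveness; more than 10 disagreeing VP combinations), but the function and parameter names are our own illustrative assumptions.

```python
# Hypothetical sketch of the block-filtering predicate used to select
# blocks comparable between Trinocular and Ark.

def comparable_block(uptime_by_vp, n_disagreeing_combos, ark_ever_succeeded):
    """Keep only 'reliable', non-flaky blocks for the comparison."""
    if not ark_ever_succeeded:          # discard never-successful Ark targets
        return False
    if any(u < 0.85 for u in uptime_by_vp.values()):
        return False                    # drop diurnal / long-outage blocks
    if n_disagreeing_combos > 10:
        return False                    # drop flaky, inconsistent blocks
    return True

uptime = {"W": 0.99, "C": 0.98, "J": 0.97, "G": 0.99, "E": 0.96, "N": 0.99}
print(comparable_block(uptime, n_disagreeing_combos=2,
                       ark_ever_succeeded=True))  # True
```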
Results: Table 3.3 provides details and Table 3.4 summarizes our interpretation. Here dark green
indicates true positives (TP): when (a) either both Taitao and Ark show mixed results, both indicating a
                                       Ark
Trinocular      Sites Up   Conflicting   All Down       All Up
Conflicting         1           20             6            15
                    2           13             5            11
                    3           13             1             5
                    4           26             4            19
                    5           83            13           201
Agree               0            6            97             6
                    6      491,120            90     1,485,394
Table 3.3: Trinocular and Ark agreement table. Dataset A30, 2017q4.
                                 Ark
                   Peninsula     Non-Peninsula
Taitao
  Peninsula           184        251 (strict) / 40 (loose)
  Non-Peninsula        12        1,976,701
Table 3.4: Taitao confusion matrix. Dataset A30, 2017q4.
peninsula, or when (b) Taitao indicates a peninsula (1 to 5 sites up but at least one down), Ark shows all-down during the event and up before and after. We treat Ark in case (b) as positive because the infrequency
of Ark probing (one probe per team every 24 hours) means we cannot guarantee VPs in the peninsula will
probe responsive targets in time. Since peninsulas are rare, so too are true positives, but we see 184 TPs.
We show true negatives as light green, neither bold nor italic. In almost all of these cases (1.4M),
Taitao and Ark both reach the block, agreeing. Because of dynamic addressing [83], many Ark traceroutes
end in a failure at the last hop (even after we discard never-reachable addresses). We therefore count this
second most-common result (491k cases) as a true negative. For the same reason, we include the small number (97)
of cases where Ark reports conflicting results and Taitao is all-up, assuming Ark terminates at an empty
address. We also include in this category the 90 events where Ark is all-down and Trinocular is all-up. We
attribute Ark’s failure to reach its targets to infrequent probing.
We mark false negatives as red and bold. For these few cases (only 12), all Trinocular VPs are down,
but Ark reports all or some responding. We believe these cases indicate blocks that have chosen to drop
Trinocular traffic.
Finally, yellow italics shows cases where a Taitao peninsula is a false positive, since all Ark probes
reached the target block. This case occurs when either traffic from some Trinocular VPs is filtered, or all
Ark VPs are “inside” the peninsula. Light yellow (strict) shows all 251 cases that Taitao detects. For
most of these cases (201), five Trinocular VPs respond and one does not, suggesting network problems
are near one of the Trinocular VPs (since five of six independent VPs have working paths). Discarding
these cases we get 40 (orange); still conservative, but a looser estimate.
The strict scenario sees precision 0.42, recall 0.94, and F1 score 0.58; in the loose scenario, precision
improves to 0.82 and F1 score to 0.88. We consider these results good, but with some room for
improvement.
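These metrics follow directly from the counts in Table 3.4 (184 TP, 12 FN, and 251 strict or 40 loose FP), which a quick arithmetic check confirms:

```python
# Recomputing precision, recall, and F1 from Table 3.4's counts.

def prf(tp, fp, fn):
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

strict = prf(tp=184, fp=251, fn=12)
loose = prf(tp=184, fp=40, fn=12)
print([round(v, 2) for v in strict])  # [0.42, 0.94, 0.58]
print([round(v, 2) for v in loose])   # [0.82, 0.94, 0.88]
```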
3.4.2 Can Taitao Detect Country-Level Peninsulas?
Next, we verify detection of country-level peninsulas (Section 3.3.3). We expect that legal requirements
sometimes result in long-term network unreachability. For example, blocking access from Europe is a
crude way to comply with the EU’s GDPR [116].
Identifying country-level peninsulas requires multiple VPs in the same country. Unfortunately the
source data we use only has multiple VPs for the United States. We therefore look for U.S.-specific peninsulas where only these VPs can reach the target and the non-U.S. VPs cannot, or vice versa.
In Table 3.5 we show our results. We first consider the 501 cases where Taitao reports that only U.S. VPs
can see the target, and compare to how Ark VPs respond. For Ark, we follow Section 3.4.1, except retaining
blocks with less than 85% uptime. We only consider Ark VPs that are able to reach the destination (those
that halt with “success”). We note blocks that can only be reached by Ark VPs within the same country as
domestic, and blocks that can be reached from VPs located in other countries as foreign.
In Table 3.6 we show the number of blocks that uniquely responded to all U.S. VP combinations during
the quarter. We contrast these results against Ark reachability.
Vantage Points   Country       Unique Blocks   ASes   Domestic   Foreign Ark Non-resp.   Foreign Ark Resp.
WCE              US                 580         116      105            362                    218
WC               US                   7           4        3              1                      6
WE               US                   4           4        3              3                      1
CE               US                   1           1        1              1                      0
W                US                  73          35        6             22                     51
E                US                  12           6        3              3                      9
C                US                   4           2        2              2                      2
N                Netherlands         38           5        0              4                     34
J                Japan                2           2        0              1                      1
G                Greece               3           2        0              2                      1
Table 3.5: Number of country-specific blocks on the Internet. Dataset A30, 2017q4.
Ark
U.S. VPs Domestic Only ≤ 5 Foreign > 5 Foreign Total
Trinocular
WCE 211 171 47 429
WCe 0 5 1 6
WcE 0 1 0 1
wCE 0 0 0 0
Wce 3 40 11 54
wcE 0 4 5 9
wCe 0 1 1 2
Marginal distr. 214 222 65 501
Table 3.6: Trinocular U.S.-only blocks. Dataset A30, 2017q4.
                                      Ark
                        Country Peninsula   Non-Country Peninsula
Country-Specific
  Country Peninsula            382                   47
  Non-Country Peninsula          3                   98
Table 3.7: Country-specific peninsula detection confusion matrix. Dataset A30, 2017q4.
True positives are when Taitao shows a peninsula responsive only to U.S. VPs and nearly all Ark
VPs conrm this result. We see 211 targets are U.S.-only, and another 171 are available to only a few
non-U.S. countries. The specic combinations vary: sometimes allowing access from the U.K., or Mexico
and Canada. Together these make 382 true positives, most of the 501 cases (see Table 3.7). Comparing all
positive cases, we see a very high precision of 0.99 (382 green of 385 green and red reports)—our predictions
are nearly all conrmed by Ark.
In yellow italics we show 47 cases of false positives where more than ve non-U.S. countries are allowed
access. In many cases these include many European countries. Our recall is therefore 0.89 (382 green of
429 green and yellow true country peninsulas).
In light green we show true negatives. Here we include blocks that filter one or more U.S. VPs and are
reachable from Ark VPs in multiple countries, amounting to a total of 69 blocks. There are other categories
involving non-U.S. sites, along with millions of other true negatives; however, we concentrate only on these
few.
In red and bold we show three false negatives. These three blocks seem to have strict filtering policies,
since they were reachable only from one U.S. site (W) and not the others (C and E) in the 21-day period.
3.4.3 Can Chiloe Detect Islands?
Chiloe (Section 3.3.4) detects islands when a VP within the island can reach less than half the rest of the
world. When less than 50% of the network replies, the VP is either in an island (for brief
events, or when replies drop near zero) or a peninsula (long-lived partial replies).
To validate Chiloe’s correctness, we compare when a single VP believes it is in an island against what
the rest of the world observes about that VP.
We define ground truth at block-level granularity: if VP x can reach its own block when x believes it is
in an island, while other external VPs cannot reach x’s block, then x’s island is confirmed. On the other
                              Trinocular
              Block Island   Address Island   Peninsula
Chiloe
  Island            2              19              2
  Peninsula         0               8            566
Table 3.8: Chiloe confusion matrix, events between 2017-01-04 and 2020-03-31. Datasets A28 through A39.
hand, if an external VP can reach x’s block, then x is not in an island, but in a peninsula. In Section 3.6.3.3
we show that Trinocular VPs are independent, and therefore no two VPs live within the same island. We
believe this definition is the best possible ground truth, but of course a perfect identification of islands or
peninsulas requires instant, global knowledge and so cannot be measured in practice.
We take 3 years’ worth of data from all six Trinocular VPs. Because Trinocular spreads measurements
over 11 minutes, we group results into 11-minute bins.
In Table 3.8 we show that Chiloe detects 23 islands across three years. In 2 of these events, the block is
unreachable from other VPs, confirming the island with our ground-truth methodology. Manual inspection
confirms that the remaining 19 events are islands too, but at the address level: the VP was unable to
reach anything but did not lose power, and other addresses in its block were reachable from VPs at other
locations. These observations suggest a VP-specific problem making it an island. Finally, for 2 events,
the prober’s block was reachable during the event by every site including the prober itself, which suggests
partial connectivity (a peninsula), and therefore a false positive.
In the 566 non-island events (true negatives), a single VP cannot reach more than 5% but less than 50%
of the Internet core. In each of these cases, one or more other VPs were able to reach the affected VP’s
block, showing they were not an island (although perhaps a peninsula). We omit the very frequent events
when less than 5% of the network is unavailable from the VP from the table, although they too are true
negatives.
Bold red shows 8 false negatives. These are events that last about 2 Trinocular rounds or less (22 min),
often not enough time for Trinocular to change its belief on block state.
3.5 Quantifying Islands and Peninsulas
We next apply our approach to the Internet. For peninsulas: how often do they occur (Section 3.5.1),
how long do they last (Section 3.5.3), and how big are they (Section 3.5.5)? These evaluations characterize how effective systems using overlay routing [3, 65] are. We also look at peninsula location by ISP
(Section 3.5.7). Finally, we look at island frequency (Section 3.5.9) and the implications of country-level
internet secession (Section 3.6.2).
3.5.1 How Common are Peninsulas?
We estimate how commonly peninsulas occur in the Internet core in three ways. First, we directly measure
the visibility of peninsulas in the Internet by summing the duration of peninsulas as seen from six VPs.
Second, we confirm the accuracy of this estimate by evaluating its convergence as we vary the number of
VPs: more VPs show more peninsula-time, but if the result converges we predict we are approaching the
limit. Third, we compare peninsula-time to outage-time, showing that, in the limit, observers see both for
about the same duration. Outages correspond to service downtime [120], and are a recognized problem in
academia and industry. Our results show that peninsulas are as common as outages, suggesting peninsulas
are an important new problem deserving attention.
Peninsula-time: We estimate the duration an observer can see a peninsula by considering three
types of events: all up, all down, and disagreement between six VPs. Disagreement, the last case, suggests
a peninsula, while agreement (all up or down) suggests no problem or an outage. We compute peninsula-time
by summing the time each target /24 has disagreeing observations from Trinocular VPs.
We have computed peninsula-time by evaluating Taitao over Trinocular data for 2017q4 [118]. Figure 3.6 shows the distribution of peninsulas measured as a fraction of block-time for an increasing number
of sites. We consider all possible combinations of the six sites.
58
Figure 3.6: Distribution of block-time fraction over sites reporting all down (left), disagreement (center),
and all up (right), for events longer than one hour. Dataset A30, 2017-10-06 to 2017-11-16.
First we examine the data with all 6 VPs—the rightmost point on each graph. We see that peninsulas (the middle, disagreement graph) are visible about 0.00075 of the time. This data suggests peninsulas are rare, occurring less than 0.1% of the time, but they do occur regularly.
Convergence: With more VPs we get a better view of the Internet’s overall state. As more reporting sites are added, more peninsulas are discovered. That is, previously inferred outages (all unreachable) should have been peninsulas, with partial reachability. All-down (left) decreases from an average of 0.00082 with 2 VPs to 0.00074 for 6 VPs. All-up (right) goes down a relative 47% from 0.9988 to 0.9984, while disagreements (center) increase from 0.00029 to 0.00045. Outages (left) converge after 3 sites, as shown by the fitted curve and decreasing variance. Peninsulas and all-up converge more slowly. We conclude that a few sites (3 or 4) converge on a good estimate of true islands and peninsulas, provided they are independently located.
We can support this claim by comparing all non-overlapping combinations of 3 sites. If any combination is equivalent to any other, then a fourth site would not add new information. There are 10 possible pairs of 3 sites from 6 observers, and we examine those combinations for each of 21 quarters, from 2017q2 to 2020q1. We use a one-sample Student t-test to evaluate whether the difference between each pair of combinations over those 21 quarters is greater than zero. None of the combinations are rejected at confidence level 99.75%, suggesting that any combination of three sites is statistically equivalent and confirming our claim that a few sites are sufficient for estimation.
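The test can be sketched in pure Python; the per-quarter differences below are hypothetical, and 3.15 is the approximate one-sided Student t critical value for 20 degrees of freedom at 99.75% confidence:

```python
import math

def one_sample_t(diffs, mu0=0.0):
    """One-sample Student t statistic for H0: mean(diffs) == mu0."""
    n = len(diffs)
    mean = sum(diffs) / n
    var = sum((d - mean) ** 2 for d in diffs) / (n - 1)  # sample variance
    return (mean - mu0) / math.sqrt(var / n)

# Hypothetical differences in peninsula-time fraction between two disjoint
# 3-site combinations, one value per quarter (21 quarters, as in the text).
diffs = [1e-5, -2e-5, 3e-5, -1e-5, 2e-5, 0.0, -3e-5] * 3
t = one_sample_t(diffs)
# t_{0.0025, 20} is about 3.15; |t| below it fails to reject H0, i.e. the
# two combinations are statistically equivalent.
print(abs(t) < 3.15)
```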
Relative impact: Finally, comparing outages (the left graph) with peninsulas (the middle graph), we
see both occur about the same fraction of time (around 0.00075). This comparison shows that peninsulas
are about as common as outages, suggesting they deserve more attention.
Generalizing: We confirm these results with other quarters in Section 3.5.2. While we reach a slightly different limit (in that case, peninsulas and outages appear in about 0.002 of the data), we still see good convergence after 4 VPs.
3.5.2 Additional Confirmation of the Number of Peninsulas
As in Section 3.5.1, we quantify how big the problem of peninsulas is, this time using Trinocular 2018q4 data.
In Figure 3.7 we confirm that with more VPs more peninsulas are discovered, providing a better view of the Internet’s overall state.
Outages (left) converge after 3 sites, as shown by the fitted curve and decreasing variance. Peninsulas and all-up converge more slowly.
At six VPs, we find an even higher difference between all-down and disagreements, confirming that peninsulas are a more pervasive problem than outages.
3.5.3 How Long Do Peninsulas Last?
Peninsulas have multiple root causes: some are short-lived routing misconfigurations while others may be long-term disagreements in routing policy. In this section we determine the distribution of peninsula durations to establish the prevalence of persistent peninsulas. We will show that there are millions of brief peninsulas, likely due to routing transients, but that 90% of peninsula-time is in long-lived events (5 h or more).
To characterize peninsula duration we use Taitao to detect peninsulas that occurred during 2017q4. For all peninsulas, we see 23.6M peninsulas affecting 3.8M unique blocks. If instead we look at long-lived peninsulas (at least 5 h), we see 4.5M peninsulas in 338k unique blocks. Figure 3.9a examines the duration of these peninsulas in three ways: the cumulative distribution of the number of peninsulas of each size for all events (left, solid, purple line), the cumulative distribution of the number of peninsulas of each size for VP down events longer than 5 hours (middle, solid green line), and the cumulative size of peninsulas for VP down events longer than 5 hours (right, dashed green line).
Figure 3.7: Distribution of block-time fraction over sites reporting all down (left), disagreement (center), and all up (right), for events longer than five hours. Dataset A34, 2018q4.
We see that there are many very brief peninsulas (purple line): about 33% last from 20 to 60 minutes (about 2 to 6 measurement rounds). Such events are not just one-off loss, since they last at least two observation periods. These results suggest that while the Internet is robust, there are many small connectivity glitches (7.8M events).
In addition, we see some events that are two rounds (20 minutes) or shorter. Such events could be BGP
transients or failures due to random packet loss.
The number of day-long or multi-day peninsulas is small, only 1.7M events (7%, the purple line). However, about 90% of all peninsula-time is in such longer-lived events (the right, dashed line), and 50% of time is in events lasting 10 days or more, even though longer-than-5-hour events are less numerous (compare the middle, green line to the left, purple line). Events lasting a day are long enough that they can be debugged by human network operators, and events lasting longer than a week are long enough that they may represent policy disputes. Together, these long-lived events suggest that there is benefit to identifying non-transient peninsulas and addressing the underlying routing problem.
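The contrast between counting events and weighting by time can be made concrete with a small sketch; the durations below are hypothetical, chosen so that a handful of long events dominate total time even though short events dominate the count:

```python
def cdf_at(threshold, durations, weighted=False):
    """Fraction of events (or of total event-time) at or below threshold."""
    if weighted:
        return sum(d for d in durations if d <= threshold) / sum(durations)
    return sum(1 for d in durations if d <= threshold) / len(durations)

# 90 half-hour glitches plus 10 ten-day peninsulas (durations in hours).
durations = [0.5] * 90 + [240.0] * 10

print(cdf_at(1.0, durations))                  # 0.9: most *events* are short
print(round(cdf_at(1.0, durations, True), 3))  # 0.018: little *time* is short
```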
3.5.4 Additional Confirmation of Peninsula Duration
In Section 3.5.3 we characterized peninsula duration for 2017q4, to determine peninsula root causes. To confirm our results, we repeat the analysis, but with 2020q3 data.
As Figure 3.8a shows, similar to our 2017q4 results, we see that there are many very brief peninsulas (from 20 to 60 minutes). These results suggest that while the Internet is robust, there are many small connectivity glitches.
Events shorter than two rounds (22 minutes) may represent BGP transients or failures due to random packet loss.
The number of multi-day peninsulas is small. However, these represent about 90% of all peninsula-time. Events lasting a day are long enough that they can be debugged by human network operators, and events
(a) Cumulative events (solid) and duration (dashed); (b) Number of Peninsulas; (c) Duration fraction.
Figure 3.8: Peninsulas measured with per-site down events longer than 5 hours during 2020q3. Dataset A41.
lasting longer than a week are long enough that they may represent policy disputes. Together, these long-lived events suggest that there is benefit to identifying non-transient peninsulas and addressing the underlying routing problem.
3.5.5 What Is the Size of Peninsulas?
When network issues cause connectivity problems like peninsulas, the size of those problems may vary, from country-size to AS-size, down to routable prefixes or fractions of prefixes. We next examine peninsula sizes.
We begin with Taitao peninsula detection at a /24 block level. We match peninsulas across blocks within the same prefix by start time and duration, both measured in one-hour timebins. This match implies that the Trinocular VPs observing the blocks as up are also the same.
We compare peninsulas to routable prefixes from Routeviews [72]. We perform longest prefix match between /24 blocks and prefixes.
Routable prefixes consist of many blocks, some of which may not be measurable. We therefore define the peninsula-prefix fraction for each routed prefix as the fraction of the prefix’s Trinocular-measurable blocks that are in the peninsula. To reduce noise from single-block peninsulas, we only consider peninsulas covering 2 or more blocks in a prefix.
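The matching described above can be sketched with Python’s ipaddress module; the routed prefixes and block sets here are hypothetical, not the Routeviews table:

```python
import ipaddress

def longest_prefix_match(block, prefixes):
    """Return the most-specific routed prefix containing this /24 block."""
    covering = [p for p in prefixes if block.subnet_of(p)]
    return max(covering, key=lambda p: p.prefixlen, default=None)

def prefix_fraction(peninsula_blocks, measurable_blocks, prefix):
    """Peninsula-prefix fraction: peninsula blocks over measurable blocks."""
    in_pen = sum(1 for b in peninsula_blocks if b.subnet_of(prefix))
    in_meas = sum(1 for b in measurable_blocks if b.subnet_of(prefix))
    return in_pen / in_meas if in_meas else 0.0

routed = [ipaddress.ip_network("192.0.0.0/16"),
          ipaddress.ip_network("192.0.2.0/23")]
block = ipaddress.ip_network("192.0.2.0/24")
prefix = longest_prefix_match(block, routed)
print(prefix)  # 192.0.2.0/23, the more specific of the two covering routes

measurable = [block, ipaddress.ip_network("192.0.3.0/24")]
print(prefix_fraction([block], measurable, prefix))  # 0.5
```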
Figure 3.9b shows the number of peninsulas for different prefix lengths and the fraction of the prefix affected by the peninsula as a heat-map, where we group them into bins.
We see that about 10% of peninsulas are likely due to routing problems or policies, since 40k peninsulas affect the whole routable prefix. However, a third of peninsulas (101k, at the bottom of the plot) affect only a very small fraction of the prefix. These low prefix-fraction peninsulas suggest that more than half of peninsulas happen inside an ISP and are not due to interdomain routing.
(a) Cumulative events (solid) and duration (dashed); (b) Number of Peninsulas; (c) Duration fraction.
Figure 3.9: Peninsulas measured with per-site down events longer than 5 hours. Dataset A30, 2017q4.
Finally, we show that longer-lived peninsulas are likely due to routing or policy choices. Figure 3.9c shows the same data source, but weighted by the fraction of time each peninsula contributes to the total peninsula time during 2017q4. Here the largest fraction of weight is in peninsulas covering full routable prefixes—20% of all peninsula time during the quarter (see left margin).
3.5.6 Additional Confirmation of Size
In Section 3.5.5 we discussed the size of peninsulas measured as a fraction of the affected routable prefix. That section used 2017q4 data; here we use 2020q3 to confirm our results.
Figure 3.8b shows the number of peninsulas per prefix fraction, and Figure 3.8c the duration fraction. Similarly, we find that while small prefix-fraction peninsulas are more numerous, most of the peninsula time is spent in peninsulas covering the whole prefix. This result is consistent with long-lived peninsulas being caused by policy choices.
3.5.7 Where Do Peninsulas Occur?
Firewalls, link failures, and routing problems cause peninsulas on the Internet. These can either occur
inside a given AS, or in upstream providers.
To detect where the Internet breaks into peninsulas, we look at traceroutes that failed to reach their target address, either due to a loop or an ICMP unreachable message. Then, we find where these traces halt, and note whether halting occurs at the target AS and target prefix, or before the target AS and target prefix.
For our experiment we run Taitao to detect peninsulas at target blocks over Trinocular VPs, use Ark’s traceroutes [20] to find the last IP address before the halt, and get target and halting ASNs and prefixes using RouteViews.
In Table 3.9 we show how many traces halt at or before the target network. The center, gray rows
show peninsulas (disagreement between VPs) with their total sum in bold. For all peninsulas (the bold
                Target AS             Target Prefix
Sites Up      At        Before      At        Before
0           21,765      32,489      1,775      52,479
1              587       1,197        113       1,671
2            2,981       4,199        316       6,864
3           12,709      11,802      2,454      22,057
4          117,377      62,881     31,211     149,047
5          101,516      53,649     27,298     127,867
1-5        235,170     133,728     61,392     307,506
6          967,888     812,430    238,182   1,542,136

Table 3.9: Halt location of failed traceroutes for peninsulas longer than 5 hours. Dataset A41, 2020q3.
Industry          ASes   Blocks
ISP                 23      138
Education           21      167
Communications      14       44
Healthcare           8       18
Government           7       31
Datacenter           6       11
IT Services          6        8
Finance              4        6
Other (6 types)      6       (1 per type)

Table 3.10: U.S.-only blocks. Dataset A30, 2017q4
row), more traceroutes halt at or inside the target AS (235k vs. 134k, the left columns), but they more often terminate before reaching the target prefix (308k vs. 61k, the right columns). This difference suggests policy is implemented at or inside ASes, but not at routable prefixes. By contrast, outages (agreement with 0 sites up) more often terminate before reaching the target AS. Because peninsulas are more often at or in an AS, while outages occur in many places, it suggests that peninsulas are policy choices.
3.5.8 How Common are Country-Level Peninsulas?
Country-specific filtering is a routing policy made by networks to restrict the traffic they receive. We next look into what types of organizations actively block overseas traffic. For example, government-related organizations are good candidates to restrict who can reach them for security purposes.
Figure 3.10: Islands detected across 3 years using six VPs. Datasets A28-A39.
We test for country-specific filtering (Section 3.3.3) over 2017q4 and find 429 unique U.S.-only blocks in 95 distinct ASes. We then manually verify each AS, categorized by industry in Table 3.10. It is surprising how many universities filter by country. While not common, country-specific blocks do occur.
3.5.9 How Common Are Islands?
Multiple groups have shown that there are many network outages in the Internet [109, 90, 110, 96, 56]. We have described (Section 3.2) two kinds of outages: full outages where all computers at a site are down (perhaps due to a loss of power), and islands, where the site is cut off from the Internet but computers at the site can talk among themselves. We next use Chiloe to determine how often islands occur. We study islands in two systems: with 6 VPs for 3 years, and 13k VPs for 3 months.
Trinocular: We first consider three years of Trinocular data (described in Section 3.3.1), from 2017-04-01 to 2020-04-01. We run Chiloe across each VP for this period.
Table 3.11 shows the number of islands per VP over this period. Over the 3 years, all six VPs see from 1 to 5 islands. In addition, Figure 3.10 shows the fraction of the Internet that is reachable from 6 VPs every 11 minutes, over three years, from 2017-04-01 to 2020-04-01. We see that islands do not always cause the entire Internet to be unreachable, and there are a number of cases where from 20% to 50% of the Internet is inaccessible. We believe these cases represent brief islands, since islands shorter than an
Sites   Events   /Year
W            5    1.67
C            2    0.67
J            1    0.33
G            1    0.33
E            3    1.00
N            2    0.67
All         14    4.67

Table 3.11: Islands detected from 2017-04-01 to 2020-04-01
(a) Number of islands; (b) Duration of islands; (c) Size of islands.
Figure 3.12: CDF of islands detected by Chiloe for data from Trinocular (3 years, Datasets A28-A39) and Atlas (2021q3).
11-minute complete scan will only be partially observed. We find 12 in the 20% to 50% range; all are short, and 4 are less than 11 minutes.
RIPE Atlas: For broader coverage we next consider RIPE Atlas’ 13k VPs for all of 2021q3 [79]. While
Atlas does not scan the whole Internet, they do scan most root DNS servers every 240 s. Chiloe would like
to observe the whole Internet, and while Trinocular scans 5M /24s, it does so with only 6 VPs. To use RIPE
Atlas’ 10k VPs, we approximate a full scan with probes to 12 of the DNS root server systems (G-Root was
unavailable in 2021q3). Although far fewer than 5M networks, these targets provide a very sparse sample
of usually independent destinations since each is independently operated. Thus we have complementary
datasets with sparse VPs and dense probing, and many VPs but sparse probing. In other words, to get many VP locations we relax our conceptual definition by decreasing our target list.
Figure 3.11a shows the CDF of the number of islands detected per RIPE Atlas VP during 2021q3. During
this period, 55% of VPs observed one or no islands (solid line). To compare to Trinocular, we consider
events longer than 660 s with the dashed line. In the figure, 60% of VPs saw no islands, 19% see one, and the remainder see more. The annualized island rate of just the most stable VPs (those that see 2 or fewer islands) is 1.75 islands per year (a lower bound, since we exclude less stable VPs), compared to 1.28 for Trinocular (Table 3.11). We see islands are more common in Atlas, perhaps because it includes many VPs at home.
We conclude that islands do happen, but they are rare, and at irregular times. This finding is consistent with the importance of the Internet at the locations where we run VPs.
3.5.10 How Long Do Islands Last?
Islands range from brief connectivity losses to long-standing policy changes. We next compare island duration measured across Trinocular and Atlas.
We compare the distributions of island durations observed from RIPE Atlas (the left line) and Trinocular (right) in Figure 3.11b. Atlas’ frequent polling means it detects islands lasting seconds, while Trinocular sees only islands of 660 s or longer, so we split out Atlas events lasting at least 660 s (middle line). All measurements follow a similar S-shaped curve, but for Trinocular, the curve is truncated at 660 s. With only 6 VPs, Trinocular sees far fewer events (23 in 3 years compared to 235k in a yearly quarter with Atlas), so the Trinocular data is quantized. In both cases, about 70% of islands are between 1000 and 6000 s. This graph shows that Trinocular’s curve is similar in shape to Atlas-660 s, but about 2× longer. All Trinocular observers are in datacenters, while Atlas devices are at homes, so this difference may indicate that datacenter islands are rarer, but harder to resolve.
3.5.11 What Sizes Are Islands?
In Section 3.2.3 we described different sizes of islands, from as small as an address island, to LAN- or AS-sized islands, to country-sized islands potentially capable of partitioning the Internet. Here, we evaluate the size of islands by counting the number of hops in a traceroute sent towards a target outside the island before the traceroute fails.
We use traceroutes from RIPE Atlas VPs sent to 12 root DNS servers for 2021q3 [80]. Figure 3.11c shows in green the distribution of the number of hops when traceroutes reach their target. In purple, we plot the distribution of the number of hops of traceroutes that failed to reach the target, for VPs in islands detected in Section 3.5.9.
We find that most islands are small: 70% show one hop or none (address islands). We consider very large islands (10 or more hops) as false positives.
3.6 Applying These Tools
Given partial connectivity, we now apply our approach to Internet sovereignty, partitioning, and DNSmon
sensitivity.
3.6.1 Policy Applications of the Definition
We next examine how a clear definition of the Internet’s core can inform policy tussles [25]. Our hope is that our conceptual definition can make sometimes amorphous concepts like “Internet fragmentation” more concrete, and an operational definition can quantify impacts and identify thresholds.
Secession and Sovereignty: The U.S. [104], China [4, 5], and Russia [26] have all proposed unplugging from the Internet. Egypt did in 2011 [29], and several countries have during exams [52, 34, 59, 41]. When the Internet partitions, which part is still “the Internet’s core”? Departure of an ISP or small country does not change the Internet’s core much, but what if a large country, or group of countries, leave together? Our definition resolves this question, defining the Internet’s core from reachability of the majority of the active, public IP addresses (Section 3.2.2). Requiring a majority uniquely provides an unambiguous, externally evaluable test for the Internet’s core that allows one possible answer (the partition with more
than 50%). In Section 3.6.2 we discuss the corollary: the Internet can end, turning into multiple partitions, if none retain a majority. (A plurality is insufficient.)
Sanction: An opposite of secession is expulsion. Economic sanctions are one method of asserting international influence, and events such as the 2022 war in Ukraine prompted several large ISPs to discontinue service to Russia [94]. De-peering does not affect reachability for ISPs that purchase transit, but Tier-1 ISPs that de-peer create peninsulas for their users. As described below in Section 3.6.2, no single country can eject another by de-peering with it. However, a coalition of multiple countries could de-peer and eject a country from the Internet’s core if they, together, control more than half of the address space.
Repurposing Addresses: Given full allocation of IPv4, multiple parties proposed re-purposing currently allocated or reserved IPv4 space, such as 0/8 (“this” network), 127/8 (loopback), and 240/4 (reserved) [49]. New use of these long-reserved addresses is challenged by assumptions in widely-deployed, difficult-to-change, existing software and hardware. Our definition demonstrates that an RFC re-assigning this space for public traffic cannot make it a truly effective part of the Internet core until implementations used by a majority of active addresses can route to it.
IPv4 Squat Space: IP squatting is when an organization requiring private address space beyond RFC1918 takes over allocated but currently unrouted IPv4 space [7]. Several IPv4 /8s allocated to the U.S. DoD have been used this way [99] (they were only publicly routed in 2021 [114]). By our definition, such space is not part of the Internet’s core without public routes, and if more than half of the Internet is squatting on it, reclamation may be challenging.
The IPv4/v6 Transition: We have defined two Internet cores: IPv4 and IPv6. Our definition can determine when one supersedes the other. The networks will be on par when more than half of all IPv4 hosts are dual-homed. After that point, IPv6 will supersede IPv4 when a majority of hosts on IPv6 can no longer reach IPv4. Current limits on IPv6 measurement mean evaluation here is future work. IPv6 shows the strength and limits of our definition: since IPv6 is already economically important, our definition seems
                IPv4 Addresses                    IPv6 Addresses
RIR             Active        Allocated           Allocated
AFRINIC          15M    2%      121M   3.3%         9,661    3%
APNIC           223M   33%      892M  24.0%        88,614   27.8%
  China         112M   17%      345M   9.3%        54,849   17.2%
ARIN            150M   22%    1,673M  45.2%        56,172   17.6%
  U.S.          140M   21%    1,617M  43.7%        55,026   17.3%
LACNIC           82M   12%      191M   5.2%        15,298    4.8%
RIPE NCC        206M   30%      826M  22.3%       148,881   46.7%
  Germany        40M    6%      124M   3.3%        22,075    6.9%
Total           676M  100%    3,703M 100.0%       318,626  100%

Table 3.12: RIR IPv4 hosts and IPv6 /32 allocation [60, 61]
irrelevant. However, it may provide a sharp boundary that makes the maturity of IPv6 definitive, helping motivate late-movers.
3.6.2 Can the Internet’s Core Partition?
In Section 3.6.1 we discussed secession and expulsion qualitatively. Threats to secede or sanction have been made by countries or groups of countries. If a country were to exert control over its allocated addresses, this would result in a country-level island or peninsula. We next use our reachability definition of more than 50% to quantify control of the IP address space. Our question: does any country or group have enough addresses to secede and claim to be “the Internet’s core” with a majority of addresses?
To evaluate the power of any country or RIR to control the Internet core, Table 3.12 reports the number of active IPv4 addresses as determined by Internet censuses [58] for each Regional Internet Registry (RIR) and selected countries. Although we define the Internet by active addresses, we cannot currently measure active IPv6 addresses, so we also provide allocated addresses for both v4 and v6 [60, 61]. IPv4 is fully allocated, except for special purpose addresses: loopback (127/8), local and private space (0/8, 10/8, etc. [93]), multicast, and reserved Class E addresses.
Table 3.12 shows that no individual RIR or country can secede and take the Internet’s core, because none controls the majority of IPv4 addresses. ARIN has the largest share with 1,673M allocated (45.2%). Of countries, the U.S. has the largest share of allocated IPv4 (1,617M, 43.7%). Active addresses are more evenly distributed, with APNIC (223M, 33%) and the U.S. (140M, 21%) the largest RIR and country.
This claim also applies to IPv6, where no RIR or country surpasses a 50% allocation. RIPE (an RIR) is
close with 46.7%, and China and the U.S. have large country allocations. With most of IPv6 unallocated,
these fractions may change. Distribution of active IPv4 addresses is similar to allocated IPv6 addresses,
suggesting IPv4 allocations are perhaps skewed by unused legacy addresses.
Our analysis demonstrates that no country can unilaterally claim to control the IPv4 Internet core, nor
the currently allocated IPv6 core—today’s Internet is an international collaboration.
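The majority test behind this analysis is a one-line computation; the sketch below uses the active-IPv4 counts (in millions) from Table 3.12:

```python
# Active IPv4 addresses per RIR, in millions (Table 3.12).
active_v4 = {"AFRINIC": 15, "APNIC": 223, "ARIN": 150,
             "LACNIC": 82, "RIPE NCC": 206}

def controls_core(coalition, counts):
    """True if the coalition holds a strict majority of active addresses."""
    return sum(counts[m] for m in coalition) > 0.5 * sum(counts.values())

print(controls_core({"ARIN"}, active_v4))               # False (150 of 676)
print(controls_core({"APNIC", "RIPE NCC"}, active_v4))  # True (429 of 676)
```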
3.6.3 Reexamining Outages Given Partial Reachability
We next re-evaluate reports from existing outage detection systems, considering how to resolve conflicting information in light of our new algorithms.
3.6.3.1 Formally Defining Outages
Given our understanding of partial Internet reachability, we now show a robust, theoretical definition of outages in light of peninsulas and islands. Our goal is a theoretical definition independent of a specific operational system. Prior systems have defined partial connectivity [3, 65], or outages as any-VP-reachable [90] and majority-VP-reachable [9].
To theoretically define an outage, imagine a VP at every active, public IP address; call this set U. Each VP measures reachability to everywhere, and I is the largest connected component. If |I| > 0.5|U|, I is the Internet, otherwise there is no single global network. For each address a ∈ I, there are R(a) addresses reachable from a. Address a is a peninsula if |R(a)| < |I|. An address a′ ∉ I is a block-island if it can reach other addresses (also partitioned from I), otherwise it is an address-island if it is up, and a true outage if it is down (computer failure). Thus many outages in prior work are actually small islands, but some are peninsulas.
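The island-related parts of this definition can be sketched as follows (peninsulas need per-address reachability sets R(a), which undirected components cannot express, so they are omitted here). The toy graph and up/down states are illustrative assumptions:

```python
from collections import defaultdict

def components(nodes, edges):
    """Connected components of an undirected reachability graph."""
    adj = defaultdict(set)
    for a, b in edges:
        adj[a].add(b)
        adj[b].add(a)
    seen, comps = set(), []
    for n in nodes:
        if n in seen:
            continue
        comp, stack = set(), [n]
        while stack:
            x = stack.pop()
            if x not in comp:
                comp.add(x)
                stack.extend(adj[x] - comp)
        seen |= comp
        comps.append(comp)
    return comps

def classify(U, edges, up):
    """Label each address: internet, block-island, address-island, outage."""
    comps = components(U, edges)
    I = max(comps, key=len)
    if len(I) <= 0.5 * len(U):
        I = set()  # no majority component: no single global Internet
    labels = {}
    for comp in comps:
        for a in comp:
            if a in I:
                labels[a] = "internet"
            elif not up[a]:
                labels[a] = "outage"        # down: a true outage
            elif len(comp) > 1:
                labels[a] = "block-island"  # up, reaches other cut-off addrs
            else:
                labels[a] = "address-island"
    return labels

U = {"a", "b", "c", "g", "d", "e", "f"}
edges = [("a", "b"), ("b", "c"), ("c", "a"), ("a", "g"), ("d", "e")]
up = {"a": True, "b": True, "c": True, "g": True,
      "d": True, "e": True, "f": False}
labels = classify(U, edges, up)
print(labels["a"], labels["d"], labels["f"])  # internet block-island outage
```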
Figure 3.13: Ark traceroutes sent to targets under partial outages (2017-10-10 to -31). Dataset A30.
3.6.3.2 Observed Outage and External Data
To evaluate outage classification with conflicting information, we consider Trinocular reports with the majority and any-up policies and compare to external information in traceroutes from CAIDA Ark.
Figure 3.13 compares Trinocular with 21 days of Ark topology data, from 2017-10-10 to -31 from all 3
probing teams. For each Trinocular outage we classify the Ark result as success or three types of failure:
unreachable, loop, or gap.
Trinocular’s 6-site-up case suggests a working network, and we consider this case as typical. However,
we see that about 25% of Ark traceroutes are “gap”, where several hops fail to reply. We also see about 2%
of traceroutes are unreachable (after we discard traceroutes to never reachable addresses). Ark probes a
random address in each block; many addresses are non-responsive, explaining these.
With 1 to 5 sites up, Trinocular is reporting disagreement. We see that the number of Ark success cases (the green, lower portion of each bar) falls roughly linearly with the number of successful observers. This consistency suggests that Trinocular and Ark are seeing similar behavior, and that there is partial reachability—these events with only partial Trinocular positive results are peninsulas.
We observe that 5 sites show the same results as all 6, so single-VP failures likely represent problems
local to that VP. This suggests that all-but-one is a good algorithm to determine true outages.
With only partial reachability, with 1 to 4 VPs (of 6), we see likely peninsulas. These cases conrm
that partial connectivity is common: while there are 1M traceroutes sent to outages where no VP can see
the target (the number of events is shown on the 0 bar), there are 1.6M traceroutes sent to partial outages
(bars 1 to 5), and 850k traceroutes sent to definite peninsulas (bars 1 to 4). This result is consistent with
the convergence we see in Figure 3.6.
3.6.3.3 Are the Sites Independent?
Our evaluation assumes VPs do not share common network paths. Two VPs in the same location would share the same local outages, but those in different physical locations will often use different network paths, particularly with a “flatter” Internet graph [67]. We next quantify this similarity to validate our assumption.
We next measure similarity of observations between pairs of VPs. We examine only cases where one of the pair disagrees with some other VP, since when all agree, we have no new information. If the pair agrees with each other, but not with some other VP, the pair shows similarity. If they disagree with each other, they are dissimilar. We quantify similarity SP for a pair of sites P as SP = (P1 + P0)/(P1 + P0 + D∗), where Ps indicates the pair agrees on the network having state s of up (1) or down (0) and disagrees with the others, and D∗ counts cases where the pair disagrees with each other. SP ranges from 1, where the pair always agrees, to 0, where they always disagree.
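One reading of this metric can be sketched as follows; the per-round observation matrix is hypothetical, and a round counts toward the metric only when somebody disagrees:

```python
def similarity(pair_obs, other_obs):
    """SP = (P1 + P0) / (P1 + P0 + D*), per the definition above.

    pair_obs: list of (a, b) booleans for the pair, True = up.
    other_obs: per-round lists of the remaining VPs' booleans.
    """
    p1 = p0 = d_star = 0
    for (a, b), others in zip(pair_obs, other_obs):
        if a != b:
            d_star += 1      # D*: pair disagrees with each other
        elif not all(o == a for o in others):
            p1 += a          # pair agrees "up", some other VP differs
            p0 += not a      # pair agrees "down", some other VP differs
        # else: everyone agrees -- no information, skip the round
    total = p1 + p0 + d_star
    return (p1 + p0) / total if total else 1.0

pair = [(True, True), (True, False), (False, False), (True, True)]
others = [[False, True], [True, True], [True, True], [True, True]]
print(similarity(pair, others))  # 2/3: pair agrees in 2 of 3 informative rounds
```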
Table 3.13 shows similarity values for each pair of the 6 Trinocular VPs. (We show only half of the symmetric matrix.) No two sites have a similarity of more than 0.17, and many pairs are around 0.08. This result shows that no two sites are particularly correlated, although J, G, and N (the three non-U.S. sites) seem more correlated than the others.
     C      J      G      E      N
W  0.084  0.079  0.084  0.064  0.093
C         0.154  0.128  0.073  0.075
J                0.154  0.061  0.118
G                       0.068  0.130
E                              0.073

Table 3.13: Similarities between sites relative to all six. Dataset: A33, 2018q3.
Figure 3.14: Fraction of VPs observing islands and peninsulas for IPv4 and IPv6 during 2022q3
3.6.4 Improving DNSmon Sensitivity
DNSmon [2] monitors the Root Server System [105], built over the RIPE Atlas distributed platform [103]. For years, DNSmon has often reported IPv6 loss rates of 4-10%. Since the DNS root is well provisioned and distributed, we expect minimal congestion or loss and find these values surprisingly high.
RIPE Atlas operators are aware of problems with some Atlas VPs. Some support IPv6 on their LAN, but not to the global IPv6 Internet—such VPs are IPv6 islands. They periodically tag these VPs and cull them from DNSmon. However, we studied RIPE Atlas with our algorithms to detect islands and peninsulas. Full details of our analysis are in our workshop paper [107], but it builds on the concepts pioneered here (Section 3.2 and Section 3.3). We also provide the first long-term data that shows these results persist for 4 months (Figure 3.14).
[Figure: bar chart; y-axis: fraction of query failures (0 to 0.125); bar groups: all VPs, without islands, and without islands and peninsulas, each for V4 and V6]
Figure 3.15: Atlas queries from all available VPs to 13 Root Servers for IPv4 and IPv6 on 2022-07-23.
Each group of bars in Figure 3.15 shows query loss for each of the 13 root service identifiers, as observed from all available Atlas VPs (10,082 IPv4 and 5,173 IPv6) on 2022-07-23. (We are similar to DNSmon, but it uses only about 100 well-connected “anchors”, so our analysis is wider.) The first two groups show loss rates for IPv4 (light blue, leftmost) and IPv6 (light red), showing IPv4 losses around 2%, and IPv6 from 9 to 13%.
We apply Chiloe to these VPs, detecting as islands those VPs that cannot see any of the 13 root identifiers over 24 hours. (This definition is stricter than regular Chiloe because these VPs attempt only 13 targets, and we apply it over a full day to consider only long-term trends.) The middle two groups of bars show IPv4 and IPv6 loss rates after removing VPs that are islands. Without island VPs, IPv4 loss rates drop to 0.005 from 0.01, and IPv6 to about 0.01 from 0.06. We suggest this represents a more accurate view of how most people perceive the root queries. Islands represent misconfigured VPs; they should not be used for measurement until they can route outside their LAN.
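The stricter island rule described above can be sketched directly: a VP is an island only if none of its queries to any of the 13 root identifiers succeeds over the full day. The data layout and function name here are assumptions for illustration:

```python
# Sketch of the stricter Chiloe island test for Atlas VPs: a VP is an
# island if, over a full day, none of its queries to any of the 13
# root identifiers ('a' through 'm') succeeds.

from typing import Dict, List

def is_island(results_by_root: Dict[str, List[bool]]) -> bool:
    """results_by_root maps a root identifier to the VP's per-query
    success flags over 24 hours; island means zero successes overall."""
    return not any(any(successes) for successes in results_by_root.values())

vp = {root: [False] * 96 for root in "abcdefghijklm"}  # every query fails
print(is_island(vp))  # True
vp["k"][10] = True   # a single success to one root identifier
print(is_island(vp))  # False
```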
The third bar in each red cluster of IPv6 is an outlier: that root identifier shows 13% IPv6 loss with all VPs, and 6% loss after islands are removed. This result is explained by persistent routing disputes between Cogent (the operator of C-Root) and Hurricane Electric [74]. Omitting islands (the middle bars) makes this difference much clearer.
We then apply Taitao to detect peninsulas. Peninsulas suggest persistent routing problems deserving consideration by ISPs and root operators. The darker, rightmost two groups show loss from VPs that are neither islands nor peninsulas, representing loss if routing problems were addressed. This data confirms that routing problems explain the difference for C-Root, which now shows IPv6 loss similar to other identifiers.
This example shows how understanding of partial reachability can improve the sensitivity of existing measurement systems. Filtering out islands makes it easy to identify persistent routing problems. Further removing peninsulas leaves observations that are more sensitive to transient changes, perhaps from failure, DDoS attack, or temporary routing changes—Figure 3.15 shows that the raw data (left two groups) are 5× to 9.7× larger than this remaining interesting “signal” (the right two groups). Improved sensitivity also clarifies the need to improve IPv6 provisioning, since IPv6 loss is statistically higher than IPv4 loss (compare the right blue and red groups), even after correcting for known problems.
While application of our algorithms to this system is imperfect, we suggest that it is useful. Atlas VPs do not ping the entire Internet, so our evaluation of islands over the 13 root identifiers is very rough. While we suggest islands represent misconfiguration, peninsulas show actual, persistent connectivity problems (fortunately not harming users because of the redundancy of 13 separate services). We have shared these results with several root operators and RIPE Atlas; two operators are using them in regular operation, showing their utility.
3.7 Related Work on Defining the Internet Core
A number of works have suggested definitions of the Internet [22, 88, 46, 43]. As discussed in Section 3.2, they distinguish the Internet from other networks of their time, but do not address today’s network disputes and secession threats.
Previous work has looked into the problem of partial outages. RON provides alternate-path routing around failures for a mesh of sites [3]. HUBBLE monitors, in real time, reachability problems in which a working physical path exists. LIFEGUARD proposes a route-failure remediation system that generates BGP messages to reroute traffic through a working path [65]. While both solve the problem of partial outages, neither quantifies the amount, duration, or scope of partial outages in the Internet.
Prior work studied partial reachability, showing it is a common transient occurrence during routing
convergence [16]. They reproduced partial connectivity with controlled experiments; we study it from
Internet-wide vantage points.
Internet scanners have examined bias by location [58], more recently looking for policy-based filtering [119]. We measure policies with our country-specific algorithm, and we extend those ideas to defining the Internet.
Outage detection systems have encountered partial outages. Thunderping recognizes the “hosed” state
of partial replies as something that occurs, but leaves its study to future work [109]. Trinocular discards
partial outages by reporting the target block “up” if any VP can reach it [90]. To the best of our knowledge,
prior outage detection systems have not both explained and reported partial outages as part of the Internet,
nor studied their extent.
We use the idea of majority to define the Internet in the face of secession. That idea is fundamental in many algorithms for distributed consensus [69, 68, 78], with applications, for example, to certificate authorities [15].
Recent groups have studied the policy issues around Internet fragmentation [38, 1], but do not define it. We hope our definition can fill that need.
3.8 Study Conclusions
This study provided a new definition of the Internet to reason about partial connectivity and secession. We developed the algorithm Taitao, to find peninsulas of partial connectivity, and Chiloe, to find islands. We showed that partial connectivity events are more common than simple outages, and used these definitions to clarify outages and the Internet.
This chapter contributes to showing our thesis statement (Section 1.4) by providing a “new conceptual definition” of “the Internet core” (Section 3.2.2). We use our definition to help “disambiguate network reliability” questions like who “keeps” the Internet if a nation secedes (Section 3.3.5), and to resolve and quantify when sections of the network become reachable only to a fraction (Section 3.5). We implemented our definition using operational algorithms (Section 3.3.2, Section 3.3.4).
In the next chapter we will use our denition to resolve partial connectivity due to ISP dynamics like
diurnal trends and user migration.
Appendix 3.A Research Ethics on this Study
The work in this chapter poses no ethical concerns for several reasons.
First, we collect no additional data, but instead reanalyze data from several existing sources listed in
Section 3.3.1. Our study therefore poses no additional risk in data collection.
Our analysis poses no risk to individuals because our subject is network topology and connectivity.
There is a slight risk to individuals in that we examine responsiveness of individual IP addresses. With
external information, IP addresses can sometimes be traced to individuals, particularly when combined
with external data sources like DHCP logs. We avoid this risk in three ways. First, we do not have DHCP
logs for any networks (and in fact, most are unavailable outside of specic ISPs). Second, we commit, as
research policy, to not combine IP addresses with external data sources that might de-anonymize them to
individuals. Finally, except for analysis of specic cases as part of validation, all of our analysis is done in
bulk over the whole dataset.
We do observe data about organizations such as ISPs, and about the geolocation of blocks of IP addresses. Because we do not map IP addresses to individuals, this analysis poses no individual privacy
risk.
Finally, we suggest that while our study poses minimal privacy risks to individuals, it also provides substantial benefit to the community and to individuals. For reasons given in the introduction, it is important to improve network reliability and understand how networks fail. Our study contributes to that goal.
Our study was reviewed by the Institutional Review Board at our university and because it poses no
risk to individual privacy, it was identied as non-human subjects research (USC IRB IIR00001648).
Chapter 4
Ebb and Flow: Implications of ISP Address Dynamics
This chapter covers our third study, about address dynamics. Address dynamics are changes in IP address occupation as users come and go and ISPs renumber them for privacy or for router or routing maintenance. Address dynamics affect address reputation services, IP geolocation, network measurement, and outage detection, with implications for Internet governance, e-commerce, and science. We provide new algorithms to identify two classes of address dynamics: periodic (diurnal and weekly) trends and ISP maintenance events.
This study contributes towards showing our thesis statement (Section 1.4). We use our definition of the Internet core to disambiguate questions about ISP address space usage, like diurnal events and maintenance events, using operational algorithms (Section 4.3.2, Section 4.3.4). We provide a new viewpoint on Internet outage detection and policy evaluation of address usage (Section 4.5).
A version of Section 4.4.1 is derived and expanded from work done primarily by Xiao Song [112] from the definitions we provided.
All of the data used and created in this study is available at no cost [117]. As discussed in Section 3.A,
our work poses no ethical concerns: although we use data on individual IP addresses, we have no way
of associating them with individuals. Our work was IRB reviewed and identied as non-human subjects
research (USC IRB IIR00001648).
4.1 Introduction
Millions of devices connect to the Internet every day, but some come and go. Many ISPs dynamically assign devices to public IP addresses. While some users have IP addresses that are stable for weeks, ISPs often reassign users for many reasons: to promote privacy, to prevent servers on “home” networks, and to shift users away from routers scheduled for maintenance. IP policies vary: some ISPs renumber users every day [121, 76, 87, 83, 98]. Some show large diurnal changes [91].
Understanding ISP address dynamics is important both for Internet policy and for network measurement and security. For Internet policy, ISPs need to make business-critical decisions that include purchasing carrier-grade NAT equipment versus acquiring more address space, or evaluating the costs of carefully reusing limited IPv4 space versus transitioning to IPv6. Regulators like national or Regional Internet Registries (RIRs) must consider address dynamics when crafting policies about transferring limited IPv4 address space and tracking IPv4 and IPv6 routing table sizes. For network measurement and security, dynamics affect services like IP address reputation [89, 44], IP geolocation [71], and generating IPv4 [45] and IPv6 hitlists [47, 51, 77, 14, 50, 122]. Stable addresses also simplify attack targeting and traffic fingerprinting, and have implications for privacy and anonymity.
The topic of address dynamics has been explored previously. Some have shown the stagnation of the total number of active IPv4 addresses and identified address block activity patterns [98]. Others have tracked address changes for a subset of addresses and analyzed lease durations [83, 84], studied diurnal patterns where blocks stay active during the day but remain inactive at night [91], and built a statistical model from a few ISPs to estimate address churn [76]. While important, all prior work focuses on behavior inside specific address blocks, not ISP-wide.
Address dynamics also affect the accuracy of outage detection. Diurnal changes must be considered to avoid interpreting nighttime quiet as outages [91]. CDN-based outage detection showed ISP-level user movement can result in incorrect outage reports [97]. Although this work was more robust to ISP-level events than prior work, it did not quantify how often maintenance events happen, nor suggest how to address this problem in other systems. Recent work showed that tracking changes in address use can detect changes in human behavior, such as shifts to working from home [112]. Improving each of these systems requires understanding address dynamics so we can build accurate models of address activity to drive these services.
The primary contribution of this study is to develop two new algorithms that address ISP-level address dynamics: ISP Diurnal Detrending (IDD) (Section 4.3.3) separates daily and weekly patterns from an underlying trend and residual, both of which are important in detection algorithms. ISP Availability Sensing (IAS) (Section 4.3.4) identifies maintenance events in the ISP, allowing us to recognize that apparent outages are actually users being reassigned. Our second contribution is to validate IAS (Section 4.4.2) using data from ISPs with known maintenance patterns and data from RIPE Atlas. Our final contribution is to use these algorithms to quantify how many ISPs are diurnal (Section 4.5.3), how many maintenance events occur (Section 4.5.1), and how IPv6 shows more consistent address usage than IPv4 (Section 4.5.5).
4.2 Implications of Address Dynamics
This study examines two challenges in how addresses are managed: maintenance events and diurnal networks. Both cause problems for outage detection systems because they cause individual /24 blocks to become vacant and so appear unreachable, resulting in a false outage. Such an outage report is incorrect because users are receiving service elsewhere (for maintenance events) or are sleeping (for diurnal changes).
In this study, we use the term ISP in the algorithm names (Section 4.3) instead of AS, because our goal
is to understand address dynamics at an ISP level. Further, ISP has a general meaning while AS requires
explanation. Although most large ISPs employ multiple ASes (for example, one each for Asian, American,
and European operations), in Section 4.5.5 we show that renumbering usually occurs within the same AS,
so this simplication does not change our primary results.
[Figure: two panels, each showing (i) lit-address counts, (ii) Trinocular up/unknown/down status, and (iii) per-address responses (last octet), 2020-10-01 through 2020-12-23]
(a) Atlas VP 1001049 (before), block 0x6d69aa00
(b) Atlas VP 1001049 (after), block 0x80468a00
Figure 4.1: Sample /24 blocks showing users simultaneously shifted to a different block, as observed from the Trinocular Netherlands site and a RIPE Atlas VP.
[Figure: five stacked timeseries panels, C(a), T(a), D(a), W(a), and R(a), October 2020 through January 2021]
Figure 4.2: MSTL decomposition of AS9829 during 2020q4. Dataset: A42.
ISP Maintenance Events: Figure 4.1 shows two /24 address blocks, with green dots showing ping responses (bottom gray area marked (iii)), while the three bars show outage status (ii), and the top bar (i) counts the number of replying addresses. We have a RIPE Atlas VP [103] that moved from the top block to the bottom during three weeks in December—a maintenance event that left the top block mostly idle. This event is difficult for an external outage detection system to handle: gaps on December 3 and 24 are outages in the top block, and the much lower utilization for the period is hard to track. Yet someone watching the whole AS would realize users shifted addresses temporarily. In Section 4.3.1 we show how we build an AS-level view, and in Section 4.3.4 how we can avoid false outages from these types of events, while in Section 4.5.2 we show how often these events occur.
Diurnal ASes: Other ASes have large changes in user populations over the course of a day. In these
diurnal ASes, dynamic users (possibly on mobile devices, or connecting during the workday or at night)
disconnect at night, leaving their addresses unused. Although many users in North America and Europe
have always-on home routers, many users in Asia, South America, and Africa disconnect at night.
[Figure: panels (a) count of lit addresses, (b) Trinocular up/unknown/down status, (c) per-address responses by IPv4 last octet, October 2020]
Figure 4.3: Diurnal blocks in AS9829 observed from Trinocular, Los Angeles, October 2020. Dataset A42.
The top graph in Figure 4.2 shows the number of active IP addresses in AS9829 (Bharat Sanchar Nigam Limited) over three months, based on measurements updated every 11 minutes (Section 4.3.1). With an average of 600k active addresses (the second bar), this major national Indian ISP has many active users. But the timeseries in the top shows daily changes in the number of active addresses of ±20%, more than 100k users! In addition, close examination shows activity drops for two days after every five—evidence of weekends. These patterns show that this AS has strong diurnal and weekly trends.
Such large shifts cause problems for outage detection systems, because losing 100k users every night vacates some /24 blocks. The block in Figure 4.3 from AS9829 shows how one /24’s occupancy varies over the course of a day. Green dots show active addresses, and gray non-response. This address block is 50% full every day at its peak, but empty every night. This trend can be seen in the count of active addresses (the top bar). It causes daily outage events in Trinocular, as shown in the middle bar, with the block up most of the day but down every night.
Blocks that look like this are common in this AS, and they show the need for our IDD algorithm
(Section 4.3.3). Tracking outages across ASes with this much daily change motivates diurnal AS detection
(Section 4.3.2) and detrending (Section 4.3.3). We show the importance of detrending in Section 4.4.3.
4.3 Methodology
We next describe ISP Diurnal Detrending (IDD), which separates daily and weekly variation from an underlying steady state, and ISP Availability Sensing (IAS), which detects maintenance events (Section 4.3.4). Our goal with these algorithms is to recognize AS-level events and trends, a view that allows block-level outage detection to avoid false outages.
Before describing these algorithms, we describe how we track active addresses.
4.3.1 AS-wide Address Accumulation
Since ASes move users around in their address space, we find that the number of active addresses across the AS helps characterize the current population. The first step in our AS-wide algorithms is to track active addresses.
Determining active addresses in an AS is difficult, because an AS has thousands (or even millions) of addresses, and we cannot monitor all instantly. We therefore accumulate responses from incremental address probing to approximate the current state [10, 112].
Our input is from Trinocular [117], since it is publicly available and it has years of address-specific ping responses covering much of the Internet (4M to 5M /24 blocks). Trinocular sends between 1 and 16 ICMP echo-requests to each block every 11 minutes, each to a different address. Addresses rotate in a fixed order unique to each block, so a single Trinocular site will scan all planned addresses in 48 hours or less. We accumulate individual observations to produce a snapshot of all addresses in the block. Combining results from all six Trinocular sites cuts worst-case latency to eight hours (each site scans independently with different and varying phases). We update estimates incrementally each 11-minute round, so even this worst case usually tracks diurnal changes. For efficiency, we aggregate results by the hour.
We add AS information from regional registries and combine reports for all addresses in each AS. The result is Ci(a), a timeseries counting addresses for each AS a at time i.
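The accumulation step can be sketched simply: keep the most recent observation per address, then count currently-responsive addresses per AS. The data layout and the AS-lookup callback are assumptions for illustration:

```python
# Minimal sketch of AS-wide address accumulation: remember the latest
# probe result per address, then compute C_i(a), the count of
# currently-active addresses per AS.

from collections import defaultdict

latest = {}  # address -> (round_observed, responded)

def record(addr: str, rnd: int, responded: bool) -> None:
    """A newer probe of the same address supersedes the old result."""
    latest[addr] = (rnd, responded)

def count_active(addr_to_asn) -> dict:
    """C_i(a): addresses whose most recent probe got a reply, per AS."""
    counts = defaultdict(int)
    for addr, (_, responded) in latest.items():
        if responded:
            counts[addr_to_asn(addr)] += 1
    return dict(counts)

record("192.0.2.7", 1, True)
record("192.0.2.8", 1, False)
record("192.0.2.8", 2, True)   # a later probe supersedes the miss
print(count_active(lambda a: 64496))  # {64496: 2}
```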
Mitigating probe loss: Address accumulation is very sensitive to probe loss, since a non-response to a query is interpreted as that address being inactive until it is next queried, and an observer will not retry until all other addresses have been scanned. Since queries are sent to unique addresses, spread out in time and space, queries that pass a congested link with loss rate p are reduced randomly by a factor of (1 − p). When we combine observations from N observers, one congested observer will reduce the address count by (1 − p)/N.
We apply 1-loss repair to mitigate query loss. From [58] §3.5, 1-loss repair examines queries to each address and replaces the pattern 101 (responsive, non-responsive, responsive) with 111, while ignoring 001, 110, and other patterns. This algorithm assumes that addresses are usually active for multiple probe rounds, so a better explanation for a single non-response is that the query (or response) was lost, rather than that the address was briefly unused. This algorithm holds assuming active addresses are usually active for several probing rounds and the loss rate p is small, so the probability of back-to-back query losses is very small (p²). We show this algorithm has little effect on most blocks, but correctly repairs the effects of one observer encountering loss on a congested link, in Section 4.4.1.
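The 101-to-111 rule can be sketched in a few lines; note that the check is made against the original history, so a repaired probe never enables a further repair:

```python
# Sketch of 1-loss repair: scan each address's response history and
# replace the isolated-miss pattern 1,0,1 with 1,1,1, while leaving
# 0,0,1 / 1,1,0 and longer gaps untouched.

def one_loss_repair(history):
    """history: list of 0/1 responses to one address, in probe order."""
    repaired = list(history)
    for i in range(1, len(history) - 1):
        # Compare against the original history so repairs do not cascade.
        if history[i - 1] == 1 and history[i] == 0 and history[i + 1] == 1:
            repaired[i] = 1  # single miss between replies: assume loss
    return repaired

print(one_loss_repair([1, 0, 1, 0, 0, 1, 1]))  # [1, 1, 1, 0, 0, 1, 1]
```

The two-round gap (0, 0) is preserved: back-to-back losses are unlikely (probability p²), so a longer gap is better explained by the address actually being unused.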
4.3.2 Diurnal ISP Detection
Given C(a), address counts for an AS (Section 4.3.1), our next step is to identify ASes with a strong diurnal component. Following prior block-level diurnal analysis [91], we take the FFT of this timeseries, giving a set of coefficients showing the strength and phase at all frequencies. We then label the AS as diurnal if the energy in the frequency corresponding to a 24-hour period is the largest of all non-DC components.
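The diurnal test can be sketched with a real FFT over the hourly counts; the synthetic series below is an assumption standing in for a real C(a):

```python
# Sketch of the diurnal test: take the FFT of the hourly address-count
# timeseries and label the AS diurnal when the 24-hour frequency bin
# carries the most energy of all non-DC components.

import numpy as np

def is_diurnal(counts_hourly: np.ndarray) -> bool:
    n = len(counts_hourly)
    spectrum = np.abs(np.fft.rfft(counts_hourly))
    freqs = np.fft.rfftfreq(n, d=1.0)            # cycles per hour
    diurnal_bin = np.argmin(np.abs(freqs - 1 / 24))
    nondc = spectrum[1:]                          # drop the DC component
    return bool(np.argmax(nondc) + 1 == diurnal_bin)

hours = np.arange(24 * 28)                        # four weeks, hourly
c = 600_000 + 100_000 * np.sin(2 * np.pi * hours / 24)
print(is_diurnal(c))  # True
```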
4.3.3 ISP Diurnal Detrending (IDD)
Since we know some ASes are strongly diurnal, we next decompose C(a) to extract long-term trends, cyclic components, and any residual changes. Each component is useful to identify unusual events.
We apply the standard MSTL algorithm (Multiple Seasonal-Trend decomposition using Loess [12]) to extract two seasonal components, one for diurnal (daily) behavior and the other for weekly patterns, along with trend and residual. We find some networks have both diurnal and weekly patterns, while others are only diurnal. We decompose C(a) into four components: trend (T), diurnal (D), weekly (W), and residual (R).
Figure 4.2 shows the trend decomposition of AS9829 during 2020q4 (three months). The top row shows the AS-wide timeseries C(a), ranging from 500k to 800k active addresses. The next row down, T(a), shows the long-term trend. We can see that this AS has a static user population over this quarter.
The third and fourth rows show D(a) and W(a), how much regular change there is each day and week. The strong diurnal pattern that we first identified at the 24 h frequency in the FFT (Section 4.3.2) shows up in D(a) with swings that range across 30% of responsive addresses (±100k). The weekly component W(a) shows a weekend drop of about 50k addresses. Diurnal and weekly trends are both visible in C(a), but easier to quantify after decomposition.
The residual in the final row, R(a), isolates any remaining changes, which we use when detecting address dynamics (Section 4.5.1).
4.3.4 ISP Availability Sensing (IAS)
The ISP Availability Sensing algorithm (IAS) recognizes maintenance events by comparing a global count of active users at the AS level against local changes in portions of the network. Our insight is that during maintenance events the AS-wide count of active users remains stable even though some local portions of the network lose users and others add users. This stability distinguishes users moving from outages.
4.3.4.1 Detecting AS-wide Address Stability
We first show that the AS’ active addresses are roughly stable.
We define ∆i as the relative fraction of change of active addresses across an AS at time interval i, using the residual and trend decomposition from Section 4.3.3: ∆i = Ri(a)/Ti(a).
When ∆i = 0 there is no change in the number of active users. Of course, in real networks, individual hosts come and go, and probing packets may be lost (as can be seen in R(a) in Figure 4.2). We therefore identify changes where ∆i ≥ 0.05 as large changes, while smaller changes are more typical jitter.
IAS assumes complete knowledge of each AS’ address space. However, as ASes bring new address space on-line to serve new customers, such space may not immediately appear in C(a). We evaluate how frequently we miss users due to unmonitored address space in Section 4.4.4.
4.3.4.2 Detecting network changes
IAS’ second requirement is the presence of some blocks changing. We enforce this requirement by counting the number of blocks that change state (become reachable or stop being reachable) in outage detection. We currently require δ = 4 blocks to change state, and we examine this requirement in Section 4.4.3.
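Combining the two requirements gives a simple decision rule; the function name and the use of the absolute value of ∆i (so drops and gains are treated alike) are assumptions for illustration:

```python
# Sketch of the combined IAS decision: an event looks like maintenance
# (users moving, not an outage) when the AS-wide count stays stable
# (|Δi| <= 5%) even though several blocks (δ >= 4) changed state.

def ias_classify(residual: float, trend: float, blocks_changed: int,
                 delta_thresh: float = 0.05, min_blocks: int = 4) -> str:
    delta_i = residual / trend   # Δi = R_i(a) / T_i(a) from IDD
    if blocks_changed < min_blocks:
        return "no-event"        # too few blocks changed to judge
    if abs(delta_i) <= delta_thresh:
        return "maintenance"     # population stable: users just moved
    return "outage"              # population changed a lot: real outage

print(ias_classify(residual=5_000, trend=600_000, blocks_changed=12))
# maintenance
print(ias_classify(residual=-90_000, trend=600_000, blocks_changed=12))
# outage
```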
4.4 Validation
We validate IAS against external sources, and then evaluate the effects of lossy links, unmonitored space, and spatial granularity.
4.4.1 Mitigating an Observer Encountering Congestion with One-Loss Repair
The 1-loss repair algorithm is designed to mitigate the problem of one observer making observations through a link that encounters diurnal congestion. Such congestion on the link will appear to be diurnal behavior in the target network, possibly creating a falsely diurnal block.
[Figure: per-address responses for block 0x7b753300 over May 2022, one row per source VP (W, E, C, G, N) plus all combined]
Figure 4.4: Block 0x7b753300 where observer w sees congestion and the others do not.
[Figure: the same block and rows, with 1-loss repair changes highlighted]
Figure 4.5: Block 0x7b753300, where one-loss repair corrects VP w’s congestive loss.
[Figure: response rates for 20 sample /24 blocks (0x3c166600 through 0xb671d400), each a cluster of points for all observers combined and for single observers w, n, e, c, g; triangles before repair, circles after]
Figure 4.6: Response rate from 20 sample blocks during 2022q4. Dataset: A52.
Figure 4.4 shows five observers probing one target block (block 0x7b753300). The top bar corresponds to merged results from all five observers to the same target. Observer w (second bar) encounters congestion, with a response rate of 0.38 compared to the others at 0.50. This congestion is propagated to the merged results at the top. Figure 4.5 shows the same block with 1-loss repair changes indicated in orange. Individually, the non-congested observers (those other than w) see few repairs; repairing w’s losses allows us to properly fix errors when merging results (top bar).
To evaluate 1-loss repair, we selected 20 blocks where the response rate of one observer was noticeably lower than the others. We then evaluated the response rate (A) from each VP, and from the merged data from all VPs, both with and without 1-loss repair. Figure 4.6 shows reconstruction from all observers combined (the left, black symbols) next to observations from each single observer (right, colored symbols). Each block is shown with the different response rates given as a cluster of points, ordered left-to-right from all observers combined followed by each single observer (as shown in the legend). Each column shows a triangle with the pre-repair response rate, and a circle with the rate after 1-loss repair.
We collect blocks into four groups, each a shaded region. For the first four blocks (0x3c166600, 0x7b753300, 0xafab5300, and 0xdb9ea600), we see that site w has a much lower response rate than the other observers, suggesting w’s observations passed through a congested link, while other sites took a different path. 1-loss repair makes minimal changes to non-congested observers (each of the left four colored circles is only slightly higher than its corresponding colored triangle), showing the 101 pattern is rare in normal data. However, 1-loss repair raises the A-value of w’s observations considerably. (Compare the purple triangle to the purple dot for each block—for 0x3c166600 the w-only A-value rises from 0.48 to 0.55 with 1-loss repair, with similar increases for each of the other three blocks.) Finally, comparing the combined observations with and without repair (the leftmost black triangle and circle in each group), we see that without any loss repair, the response rate is much lower than any of the non-lossy single observers, but with loss repair, it is similar (for example, for 0x3c166600, non-repaired 5-site A is 0.59, while with 1-loss repair it is 0.62).
The second group of four blocks (0x71cf6e00, 0xb65ab500, 0xdd03af00, 0xdd078b00) provides a similar result, but the c VP (Colorado, the teal color second from the right in each cluster) often sees loss. Again, combined data with 1-loss repair generally matches non-lossy VPs.
The next nine blocks (0x27478400 to 0xde480600) show cases where no VP sees much loss. Applying 1-loss repair often increases the response rate slightly (for example, 0x7bb28c00 and 0xdef3a800), but these blocks are not strongly influenced either by loss or by use of 1-loss repair.
Finally, the last block, 0xb671d400, is unusual. All observers see similar results, and 1-loss repair shifts response rates quite a bit, from 0.56 to 0.66. However, the combined result with 1-loss repair is close to the combined result without repair and to the un-repaired observations of any of the single VPs (compare Figure 4.7 and Figure 4.8). Addresses in this block are inactive for a short period of time (less than two Trinocular scans). The scan time for a single observer to see all addresses is quite long (median of 26 hours), so a single observer often sees the 101 pattern because addresses are reused before they are probed
[Figure: per-address responses for block 0xb671d400 over May 2022, one row per source VP (W, E, C, G, N) plus all combined]
Figure 4.7: Block 0xb671d400 showing addresses with short usage periods.
[Figure: the same block and rows after 1-loss repair]
Figure 4.8: Block 0xb671d400 after one-loss repair, showing minimal changes to the all-VP result.
[Figure: top, down-event counts (0 to 500) with events labeled a through w; capital letters mark events in the maintenance window and * marks events in the service log, colored as outage vs. other. Bottom, ∆i ranging from −0.6 to 0.6, 2017-10-19 through 2017-12-28]
Figure 4.9: Down events from six observers (top) and ∆i (bottom) from Los Angeles. CenturyLink AS209. Dataset: A30, 2017q4.
again. Repairing these “drops” would be incorrect. Fortunately, the additional information due to more frequent probing in the combined 5-site data means the 101 pattern becomes 1001—different observers confirm the address is temporarily unused, so the non-response is correctly-observed inactivity, not packet loss. Thus, while the lower five, single-VP rows of Figure 4.8 are overcorrected, the top, ALL row reflects what we believe to be the true state of address use. This block shows the importance of scanning addresses more frequently than they change.
4.4.2 Does IAS Detect Known ISP Maintenance?
We next evaluate IAS’ ability to detect maintenance events using ground truth from an ISP. Our first source of ground truth is ISPs that have public maintenance windows. Lumen Technologies (AS209) announces that midnight to 6 a.m. local time is a public maintenance window [70], and they report specific events [21]. We identify Lumen address blocks from 18 peers in Routeviews [72].
98
Figure 4.9 shows one quarter of data (2017q4). The bottom part shows ∆i
, changes in the number of
active addresses, based on Trinocular observer (Los Angeles). The top graph counts number of blocks that
are down over time, based on all Trinocular data, but without IAS. We nd 23 events in this merged data,
each involving 35 or more blocks. Of the 23 events, more than half (13, indicated with capital letters) are
in Lumen’s published maintenance window. Five of these (indicated with *) correspond to events in the
service log on their website. IAS identies all these events as maintenance, except for event (o). These
events are true positives.
Event (o) on 2017-11-16 (in red) is very large, and IAS classies it as an outage, not a maintenance event.
This event is unusual, in that it was much larger (aecting 20,211 blocks, more than half) and longer (8.5
hours) at this VP in Los Angeles than from other sites (where it was 348 blocks and 2 hours). We believe
this event was a local problem aecting Los Angeles. It shows that the IAS will correctly pass through
large outages (a true negative).
4.4.3 Validating IAS and IDD from RIPE Atlas
RIPE Atlas VPs live in edge networks and report their current IP address, providing ground truth for ISP
maintenance.
4.4.3.1 Atlas as Ground Truth
We take Atlas VPs’ built-in measurement 1010 [100] as our known addresses. We aggregate VP address
changes using 4-minute timebins, since new address reports are provided every 4 minutes. We omit address
changes where an Atlas VP failed to reach a root DNS server, to rule out address changes due to outages [83].
Finally, to rule out individual VP changes, we require four VPs in the same AS to move at about the same
time to declare a maintenance event. During 2020q4, we count 164 events by this criterion.
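The aggregation rule above (4-minute bins, four or more VPs in one AS) might look like the following sketch; the record format and names are our own assumptions, not the Atlas processing pipeline:

```python
from collections import defaultdict

# A maintenance event is declared when four or more VPs in the same AS
# change address in the same 4-minute timebin.
BIN = 4 * 60  # bin width in seconds

def count_events(changes):
    """changes: iterable of (timestamp_sec, vp_id, asn) change reports."""
    movers = defaultdict(set)  # (asn, timebin) -> VPs that moved in it
    for ts, vp, asn in changes:
        movers[(asn, ts // BIN)].add(vp)
    return sum(1 for vps in movers.values() if len(vps) >= 4)
```

Four VPs of one AS moving within one bin count as a single event; three or fewer are set aside as individual churn.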
                                     blocks that change state
                              δi = 0   δi = [1, 3]   δi ≥ 4    all
Address drop:
  without IDD   ∆i ≤ 5%         42         28          83      153
                ∆i > 5%          1          0          10       11
  with IDD      ∆i ≤ 5%         43         28          93      164
                ∆i > 5%          0          0           0        0
Table 4.1: Atlas VP address change events (i.e., 4 or more VPs) compared against IAS detection thresholds,
2020q4.
4.4.3.2 Validating IAS
We first validate IAS with IDD, considering its two requirements: stable AS-level addresses (∆i ≤ 5%)
and four or more blocks that move (δi ≥ 4). We show our results in Table 4.1.
We first look at the bottom two rows of Table 4.1, labeled “with IDD”. The fourth row shows 0 blocks
that change when ∆i > 5%: we found zero legitimate outages passing through. The third row shows that of
the 164 blocks where VPs move, 93 are found by IAS and occur as part of a large movement (4 or more
blocks, right, in green), while 43 move by themselves (left, gray), and 28 move with 1 to 3 others (center,
yellow). We consider the 93 to be IAS successes: all will be found and recognized as maintenance events.
The 43 in gray represent single movements that are not large maintenance events, but may be routers
at home rebooting to a new address. These are not found by IAS, but are not necessarily maintenance
events, so we set them aside.
Finally, the 28 marked yellow are likely maintenance events that IAS misses as being too small. These
are false negatives.
Not having all negative cases prevents us from computing recall and precision, but we can show a True
Positive Rate of 0.77 (93/(28 + 93)). We conclude that IAS works reasonably well, although there is room
for improvement.
FROM \ TO         active            non-trackable      inactive
active            66,892   51%       3,487    3%        5,086    4%
non-trackable      3,392    3%      30,101   22%        1,251    1%
inactive           4,915    4%       1,303    1%       14,602   11%
Table 4.2: Atlas VP address changes in Trinocular monitored/unmonitored address space.
4.4.3.3 Validating IDD
To validate the importance of IDD, we turn it off and compare the results in the first two rows of Table 4.1
with the bottom two rows.
IDD helps filter out diurnal changes, making spurious large shifts less common: compare the 10 cases with
∆i > 5% without IDD to zero cases with it. We also see that it helps IAS: the TPR is 0.75 without IDD
(83/(28 + 83)) compared to 0.77 with IDD.
We conclude that accounting for diurnal changes helps.
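The true-positive rates quoted above follow directly from the counts in Table 4.1; a minimal check:

```python
# TPR as used in the text: detected large events over all likely
# maintenance events (detected plus the small ones IAS misses).
def tpr(detected, missed):
    return detected / (detected + missed)

print(round(tpr(93, 28), 2))  # with IDD: 0.77
print(round(tpr(83, 28), 2))  # without IDD: 0.75
```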
4.4.4 Does Unmonitored Space Harm IAS?
Measurement systems do not track the complete address space: some segments are discarded due to
low response rate, as are addresses that historically have not responded [9]. Users reassigned to unmonitored
space imply that IAS may erroneously infer outages due to drops in the total active address
count (IAS false negatives). To evaluate if unmonitored space interferes with IAS, we count the number of
times known VPs move to and from our underlying measurement system’s unmonitored address space.
We expect most VPs to move within monitored addresses, since unmonitored space has been historically
unresponsive, implying low usage.
While Trinocular strives to probe as much as it can (the active addresses), it excludes addresses for
two reasons: inactive addresses used to reply to pings but have not in two years, and non-trackable blocks
have fewer than three responsive addresses.
                                     IPv4              IPv6
Total VPs                      12,855  100.0%     6,319  100.0%
Do not change IP                8,501   66.1%     4,730   74.9%
Change IP                       4,354   33.9%     1,589   25.1%
Do not change routable prefix     973    7.6%     1,182   18.7%
Change routable prefix          3,381   26.3%       407    6.4%
Do not change AS                2,411   18.8%        75    1.1%
Change AS                         970    7.5%       332    5.3%
Table 4.3: Active RIPE Atlas Vantage Points during 2020q4.
As in Section 4.4.3.1, we use RIPE Atlas VPs as ground truth, since they track their current IP addresses.
Table 4.2 counts how many addresses Atlas VPs have in each of the three Trinocular categories
(active, inactive, non-trackable).
As expected, the majority of reassignments (51%) occur within monitored addresses (the top, left, green
cell). In addition, most addresses (84%) stay in the same category (the diagonal).
A few addresses (7% in the yellow, left column) become active as they move into measurable space,
and about an equal number move out (the 7% in the red, top row). Finally, a surprisingly large 35% are
never tracked (the gray region). Since the IAS goal is to identify steady or changing addresses, never-tracked
blocks do not matter. The numbers that become and cease to be active are small (7% each) and about equal in
size, so they should not skew IAS. We therefore conclude IAS is not impeded by incomplete measurement.
4.4.5 Choice of Spatial Granularity
We next consider what spatial granularity to use when tracking address dynamics. Our goal is that IAS
can identify when users move, and to do so it must report on the region in which they move. Here we
compare address movement (a baseline) against how often a device moves within a routable prefix or an
AS.
Table 4.3 shows how often 12,855 RIPE Atlas VPs change address, routable prefix, or AS in 2020q4, for
both IPv4 and IPv6. We see that the majority of devices are stable, with 66.1–74.9% never changing address
and 7.6–18.7% staying in the same routable prefix. Of the remaining devices that move, some change only once,
Figure 4.10: Cumulative distribution of the number of IPv4 address, prefix, and AS changes per Atlas VP,
2020q4.
but many change frequently, perhaps because they are in ISPs that renumber their users regularly. We
conclude that most devices are very stable, but a few move frequently.
Surprisingly, we find that about 7.5% of IPv4 VPs (970) and 5.3% of IPv6 VPs (332) change AS. AS changes
are very rare, with a few (2%) changing only once, perhaps because a user changed their home ISP. The remaining
3% change frequently, perhaps because they are mobile and regularly move between home and work.
We conclude that AS granularity is almost always suitable to capture most movement, and so IAS’ use
of ASes is correct.
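As a concrete illustration of these per-VP counts (Table 4.3 and Figures 4.10–4.11), the counting step can be sketched as follows. The lookup helpers are hypothetical stand-ins for longest-prefix match against a Routeviews RIB snapshot, and the function names are ours:

```python
# Count how often one VP's time-ordered address reports change at each
# granularity: raw address, routable prefix, and origin AS.
def count_changes(addrs, lookup_prefix, lookup_asn):
    levels = {"address": list(addrs),
              "prefix": [lookup_prefix(a) for a in addrs],
              "asn": [lookup_asn(a) for a in addrs]}
    # a "change" is any pair of consecutive reports that differ
    return {name: sum(1 for prev, cur in zip(seq, seq[1:]) if prev != cur)
            for name, seq in levels.items()}
```

A VP that is renumbered within its prefix registers address changes but no prefix or AS changes, which is the dominant pattern in the tables above.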
4.5 Evaluation
We now study ISP address dynamics across the Internet with IDD and IAS. We evaluate addressing
efficiency, improvements to outage detection, and quantify diurnalness. We also compare IPv4 and IPv6
management practices.
Figure 4.11: Cumulative distribution of the number of IPv6 address, prefix, and AS changes per Atlas VP,
2020q4.
4.5.1 Quantifying ISP Address Dynamics
Several groups have looked at different aspects of address dynamics [91, 76, 98, 83, 97, 84, 112]. While
prior work identified ISP maintenance as a type of network disruption [98], it did not quantify how often
such events occur. We examine maintenance and diurnal events over a quarter using IAS.
We use IAS to identify maintenance events across all 63k ASes active in 2020q4. Figure 4.12 shows the
cumulative distribution of the number of maintenance events for ASes with at least one event in this period.
We compare results for different detection thresholds, that is, the number of Trinocular blocks going up
or down during the same timebin while the AS-level responsive address count remains stable (less than a
5% drop).
With a threshold of one changed block (the minimum), IAS detects at least one event in 2k ASes. The
number of ASes decreases to only 210 with our strictest threshold.
We also see that some ASes regularly move users around, seeing 100 maintenance events in these 90
days. One example of these frequent-maintenance networks is ISPs that renumber users every 24 hours.
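The IAS decision rule used throughout this section can be summarized in a short sketch. Variable names are ours, and the thresholds are the defaults from the text (four blocks, 5% AS-level stability):

```python
# In one timebin: if several blocks change state while the AS-wide
# responsive address count stays stable, the change is users being
# moved (maintenance), not a real loss of connectivity (outage).
def classify_timebin(delta_blocks, as_addr_drop_fraction,
                     block_threshold=4, stability_threshold=0.05):
    if as_addr_drop_fraction > stability_threshold:
        return "outage"        # AS-wide activity really fell
    if delta_blocks >= block_threshold:
        return "maintenance"   # blocks moved but the AS stayed full
    return "no-event"
```

Lowering block_threshold to 1 reproduces the most permissive curve in Figure 4.12; raising it to 10 gives the strictest.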
Figure 4.12: CDF of the number of maintenance events at different block thresholds (δ = 1, 2, 4, 10) in 2020q4.
The area under each curve corresponds to the number of maintenance events and diurnal events that
occurred during this quarter. For our default preferred threshold of four blocks, we see 50k events in
2020q4.
4.5.2 How Often Does IAS Repair False Outages?
Analysis of CDN traffic [97] showed that incorrect block-level outages are often due to users being assigned
to different IP addresses. By measuring from end-user devices, they show users changing addresses, while
external outage detection systems cannot distinguish the now-vacant old address block from a network
problem. After users are reassigned, their old address blocks remain empty for minutes or months, and
external outage detection systems (like Trinocular or Thunderping) incorrectly interpret this absence as
an outage.
Figure 4.13 shows the distribution of the durations a block stays unresponsive before or after
an IAS-detected address reassignment during 2020q4. All of these events are false outages that IAS repairs.
To understand root causes we identify three regions of unresponsive duration.
Figure 4.13: CDF of unresponsive duration of blocks before or after an IAS-detected maintenance event,
2020q4. (Duration on a log scale from 1 minute to 90 days, with three regions labeled (i), (ii), and (iii).)
Durations less than 11 minutes (the bottom 2% on the left, unshaded) are less than the scanning frequency
of our data source (Trinocular). Such short outages occur in blocks with large numbers of scanned
addresses where only a few are in use (blocks with large |A(b)| and small A, with terms from [90]). These
false outages of shorter-than-probing-interval events are natural given active probing at regular intervals;
IAS suggests these are measurement “noise”.
The majority of events (the center, shaded region, 88%) are blocks that are inactive between 11 minutes
and one week. These false outages are typically due to diurnal address assignment policies: customers are
regularly reassigned, but as part of that assignment blocks sometimes appear to have no activity.
IAS detects and corrects these false outages because it knows the AS-wide activity is constant.
Finally, about 10% of blocks are empty for more than a week. Trinocular already classified these blocks
as “gone dark”, inferring that a long period of no responses cannot be a transient outage, but must be ISP
renumbering. Our AS-wide analysis with IAS confirms that this policy is correct.
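The three duration regions can be expressed as a simple classifier; the boundaries come from the text (the 11-minute probing interval and one week), while the function itself is only our illustration:

```python
# Map an unresponsive duration (in minutes) to the regions of Fig. 4.13.
def duration_region(minutes):
    if minutes < 11:
        return "i: probing noise"          # shorter than the scan interval
    if minutes < 7 * 24 * 60:
        return "ii: diurnal reassignment"  # 11 minutes to one week
    return "iii: gone dark"                # a week or more: renumbering
```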
                                IPv4
ASes                      62,310   100.0%
  Diurnal                  1,730     2.8%
  Non-diurnal             60,580    97.2%
Routable prefixes        606,187   100.0%
  Diurnal                 30,029     5.0%
  Non-diurnal            576,158    95.0%
/24 blocks             5,124,967   100.0%
  Diurnal                111,908     2.2%
  Non-diurnal          5,013,059    97.8%
Table 4.4: Number of diurnal networks at different granularities, 2020q4.
4.5.3 How Many ASes Are Diurnal?
Diurnal networks are important to assess allocation policies in Internet governance and to avoid false
outages in outage detection. While /24 blocks have previously been shown to be diurnal [91, 112],
we next examine diurnalness in the larger groupings of routable prefixes and ASes.
Here we use address accumulation data for 2020q4 from Trinocular, following Section 4.3.1, grouped by
prefixes and ASes from Routeviews [82] on 2020-10-01. We assess diurnalness as described in Section 4.3.2.
In Table 4.4 we present the number of diurnal networks detected at different network granularities.
While only 2.8% of ASes are diurnal (1,730), 112k blocks are diurnal. Recent work has used these blocks to
help understand human activity [112].
4.5.4 How Much of a Diurnal AS is Diurnal?
Although we can identify networks as diurnal as described in Section 4.3.2, in many ISPs only part of
the AS is diurnal, while part is more static.
Here we examine what fraction of an AS’ address space is diurnal. Prior work has examined individual
blocks, but our decomposition allows us to examine “diurnalness” for the AS as a whole. We judge AS-level
diurnalness by the size of the daily change in addresses. From our MSTL decomposition (Section 4.3.3),
Figure 4.14: CDFs of diurnalness of all ASes (red) and routable prefixes (blue) in 2020q4. (a) Absolute
addresses (log scale); (b) fraction of responsive addresses (log scale).
that is, (P95(D(a)) − P5(D(a)))/P95(C(a)), where Pn is the n-th percentile of the given timeseries over the
quarter.
Figure 4.14 shows how diurnal ASes (red) and routable prefixes (blue) are, by number (Figure 4.14a)
and fraction (Figure 4.14b) of responsive addresses.
First, we see that most networks are not very diurnal: activity in 85% of ASes changes by 100 addresses
or fewer each day, accounting for only 20% of their address space. This stability is typical of ISPs with
customers using always-on home gateways. Stable address usage is why IAS can detect maintenance
events.
When we compare routable prefixes to ASes, we see that ASes are more often mostly diurnal (comparing
the two lines in Figure 4.14b). Although most prefixes are fairly stable (69% change by only 10% of
their active addresses), some (about 20%) have a very large daily swing (15% of addresses or more). Finally,
of course the absolute size of diurnal change in routable prefixes is smaller than in ASes (compare the lines
in Figure 4.14a), because each routable prefix must be smaller than an AS.
This trend suggests that routable prefixes are a useful size at which to study diurnalness, and it supports
the suggestion for its study in Section 4.5.3.
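The diurnalness metric above can be computed directly from the decomposition outputs; a minimal sketch, assuming the MSTL step (e.g., statsmodels’ MSTL) has already produced the daily component D(a) and the responsive-address count C(a) as arrays:

```python
import numpy as np

# AS-level diurnalness: the 5th-95th percentile swing of the daily
# (seasonal) component, normalized by the 95th percentile of the
# responsive address count, per the formula in the text.
def diurnalness(daily_component, responsive_counts):
    swing = (np.percentile(daily_component, 95)
             - np.percentile(daily_component, 5))
    return swing / np.percentile(responsive_counts, 95)
```

A stable always-on ISP yields a value near zero; an AS that parks most of its customers overnight approaches one.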
Figure 4.15: IPv4 changes by AS (top), routable prefix (center), and address (bottom) for Atlas VPs with at
least one change, 2020q4. (Heatmaps of unique ASes/prefixes/addresses vs. number of changes; cell color
gives the number of Atlas VPs.)
4.5.5 Address Space Refactoring
Address management is a business-critical decision for ISPs. Limited IPv4 addresses require careful
management and reuse, while the IPv6 transition requires updating current practices. Each of these choices incurs
costs. To provide ground truth for the kind of AS-level changes observed by our IDD and IAS algorithms, we
next examine address churn in both IPv4 and IPv6 inferred from RIPE Atlas.
We take Atlas VP IPv4 and IPv6 address changes during 2020q4, and perform longest-prefix match
for these addresses using Routeviews RIB archives to obtain routable prefixes and ASes for the quarter.
We count the number of times each VP changes address, routable prefix, and AS, and the number of times unique
addresses are assigned. Figure 4.15 and Figure 4.16 show IPv4 and IPv6 aggregates as heatmaps.
For IPv4 address changes (Figure 4.15, bottom) we observe that most VPs do not change address during
the quarter, but those that do change often are assigned a new address, seen as the dark heatmap diagonal. Prefixes
Figure 4.16: IPv6 changes by AS (top), routable prefix (center), and address (bottom) for Atlas VPs with at
least one change, 2020q4.
(middle) are mostly reused, although many address changes involve a routable prefix change, too. Finally,
addresses almost always stay in the same AS (top).
On the other hand, in IPv6 we observe that address changes generally occur within the same prefix.
Our data confirms that IPv4 exhaustion and fragmentation make address management more challenging,
while in the newer IPv6, address assignment aligns with routing, using fewer prefixes more efficiently. While
prior work has commented on these behaviors [83, 84], we look at both protocols together.
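The longest-prefix-match step used to map Atlas addresses to prefixes and ASes can be sketched as follows; the RIB here is a toy stand-in for a Routeviews archive, and all prefixes and AS numbers are illustrative:

```python
import ipaddress

# Toy RIB: prefix -> origin ASN (not real announcements).
RIB = {
    ipaddress.ip_network("192.0.2.0/24"): 64500,
    ipaddress.ip_network("192.0.0.0/16"): 64501,
}

def longest_prefix_match(addr):
    """Return (most-specific covering prefix, its origin AS), or (None, None)."""
    ip = ipaddress.ip_address(addr)
    matches = [net for net in RIB if ip in net]
    if not matches:
        return None, None
    best = max(matches, key=lambda net: net.prefixlen)
    return best, RIB[best]
```

A production version would load a full RIB into a prefix trie rather than scanning a dictionary, but the matching logic is the same.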
4.6 Related Work
Other works have looked into the problem of maintenance events in relation to outages. Richter et al. used
internal information from clients to demonstrate that address reassignment causes false outages, defining
disruptions to include both true and false outages [97]. However, this work does not show how to
differentiate between true outages and maintenance events.
Other groups have studied address changes and usage. Some have examined the duration hosts keep
the same address [58, 83, 84], estimated Internet-wide address churn [76], and address utilization [98].
However, these techniques either do not scale to the entire address space, are estimations, or use
unavailable CDN server logs. We run third-party measurements and detect renumbering events across the whole
responsive Internet.
Previous work has detected diurnal patterns using the FFT at the block level [91]; we instead quantify AS-level
diurnalness.
4.7 Study Conclusions
AS-wide diurnal changes and maintenance are part of our Internet ecosystem, yet they challenge outage
detection systems. Our new IDD and IAS algorithms can often recover from such dynamics. We showed
these algorithms are effective and can correct 51k false outages per quarter.
This study contributes towards showing our thesis statement (Section 1.4) by implementing an operational
definition grounded in our conceptual one. We use this definition to help disambiguate questions about
ISP address space usage, like diurnal events (Section 4.3.2) and maintenance events (Section 4.3.4), and to
provide a new viewpoint on Internet outage detection and policy evaluation of address usage (Section 4.5).
In the final chapter we look into future research directions based on our thesis work, and end with our
overall conclusions.
Appendix 4.A Research Ethics on this Study
This study poses no ethical concerns for several reasons.
First, we collect no additional data, but instead reanalyze data from existing sources. Our study
therefore poses no additional risk in data collection.
Our analysis poses no risk to individuals because our subject is network topology and connectivity.
There is a slight risk to individuals in that we examine responsiveness of individual IP addresses. With
external information, IP addresses can sometimes be traced to individuals, particularly when combined
with external data sources like DHCP logs. We avoid this risk in three ways. First, we do not have DHCP
logs for any networks (and in fact, most are unavailable outside of specific ISPs). Second, we commit, as
research policy, to not combine IP addresses with external data sources that might de-anonymize them to
individuals. Finally, except for analysis of specific cases as part of validation, all of our analysis is done in
bulk over the whole dataset.
We do observe data about organizations such as ISPs, and about the geolocation of blocks of IP addresses.
Because we do not map IP addresses to individuals, this analysis poses no individual privacy
risk.
Finally, we suggest that while our study poses minimal privacy risks to individuals, it also provides
substantial benefit to the community and to individuals. For the reasons given in the introduction, it is
important to improve network reliability and understand how networks fail. Our study contributes to that
goal.
Our study was reviewed by the Institutional Review Board at our university and, because it poses no
risk to individual privacy, it was identified as non-human subjects research (USC IRB IIR00001648).
Chapter 5
Conclusions
This dissertation has shown how “a new, conceptual definition of the Internet core can help disambiguate
questions in analysis of network reliability and address space usage (Section 1.4).” We proved this
statement through three studies. First, we improved coverage of outage detection by dealing with sparse blocks.
Second, we provided a new definition of the Internet core, and used it to resolve partial reachability
ambiguities. Third, we used our definition to identify ISP trends, with applications to policy and improving
outage detection accuracy.
In this final chapter, we first discuss possible future directions and remaining open challenges, then
we finish with our final conclusions.
5.1 Future Directions
This dissertation has contributed towards better understanding the Internet. However, many challenges
remain open and present opportunities to advance the field. We next discuss steps in the analysis of
Internet reliability, outage detection, partial reachability, and address dynamics.
Next steps in analysis of Internet reliability: As the Internet reliability field progresses, many efforts
have made significant improvements in detection of network problems. Yet there is still room for novel
research in this arena: peninsula mitigation and island forecasting present many opportunities to
explore. For peninsula mitigation, there is no silver bullet, but overlay networks can solve the problem.
Other strategies may involve using measurements to detect the problem, and then applying traditional
solutions like adding a backup link. For island forecasting, we have collected outage data for about a decade,
which can be consumed by different statistical algorithms to determine weak links and failure correlations.
Outage detection: Our work in Chapter 2 provided tools and techniques to improve precision and increase
coverage of outage detection systems. In Section 3.5.7 we showed that outages often occur before
reaching the target AS. A next step in outage detection and network reliability analysis is to find the
physical location of events at different geographic granularities. Network operators and the research community
would benefit from a finer-grained location detector to allow finding weaknesses and developing
mitigation techniques.
In Section 3.5.8 we look into what type of organizations actively block overseas traffic, confirming
the existence of such filtering. However, our study is limited to the U.S., since it is the only country
where we have enough VPs to deploy our algorithm. A longitudinal analysis of country-level peninsulas
including multiple countries would help identify which countries restrain overseas traffic and their root
causes for doing so.
Some steps have been taken towards mitigating network issues with overlay networks designed to
route around them, in RON [3], Hubble [64], and LIFEGUARD [65]. However, partial connectivity is a
pervasive problem, as our findings in Section 3.5.1 show, and it is a problem that needs to be addressed.
Outage forecasting has been a challenge, given the lack of enough data for predictive algorithms to
consume. Today, after several years of actively scanning the Internet and collecting outage data, such work
is enabled, and can benefit not only the research community, but network operators and their users.
Partial reachability: In Section 3.6.1 we provide some applications of our definition, like secession and
sanctions. Given the limitations on reachability that political challenges may impose on Internet users,
some digital rights of these users may be affected, like access and non-discrimination; freedom of assembly,
association, and participation; and education and literacy. We suggest that our definition can help inform
policy discussions, which can be used to develop and implement a metric that measures the risk of the
global Internet suffering fragmentation.
In Section 3.2.2 we proposed a conceptual definition of the Internet core that recognizes reachability as
fundamental. However, users care about applications, and a user-centric view might emphasize reachability
of HTTP or of Facebook rather than reachability at the IP layer. Future work could challenge our definition and
propose a new conceptual definition based on applications that is able to endure the test of time. Such a
definition would provide a user-centric perspective of the Internet.
In Chapter 3 we defined two Internet cores: IPv4 and IPv6. Our definition can determine when one
supersedes the other. The networks will be on par when more than half of all IPv4 hosts are dual-homed.
After that point, IPv6 will supersede IPv4 when a majority of hosts on IPv6 can no longer reach IPv4.
Current limits on IPv6 measurement mean evaluation here is future work. IPv6 shows the strength and
limits of our definition: since IPv6 is already economically important, our definition seems irrelevant.
However, it may provide a sharp boundary that makes the maturity of IPv6 definitive, helping motivate
late-movers. Future research may track the evolution of dual-homing to confirm when IPv6 supersedes IPv4.
Address dynamics analysis: Understanding ISP address space density and effectiveness is important
both for Internet policy and for network measurement and security. For Internet policy, ISPs need to
make business-critical decisions that include purchasing carrier-grade NAT equipment versus acquiring
more address space, or evaluating the costs of carefully reusing limited IPv4 space versus transitioning to
IPv6. Regulators like national registries or RIRs must consider address dynamics when crafting policies about
transferring limited IPv4 address space and tracking IPv4 and IPv6 routing table sizes. Using the technique we
provide in Section 4.3.1 to determine the number of active addresses within an AS enables measuring how
well ISPs are using their address space.
5.2 Conclusions
We have proven our thesis statement that a new, conceptual definition of the Internet core helps
disambiguate questions in analysis of network reliability and address space usage. First, we enabled
network reliability analysis by adding new address space for active scanning in outage detection. Second,
we proposed a new conceptual definition of the Internet core that helps us disambiguate disagreements
between observers, like whether a host is reachable or not, or who is “on the Internet”. Finally, we used our
definition of the Internet core to disambiguate evaluation of ISP address space usage events and provide a
new viewpoint on Internet outage detection and policy evaluation of address usage.
In Chapter 2, we defined two algorithms: Full Block Scanning (FBS), to address false outages seen in
active measurements of sparse blocks, and Lone Address Block Recovery (LABR), to handle blocks with
one or two responsive addresses. We showed that these algorithms increase coverage, from a nominal 67%
(and as low as 53% after filtering) of responsive blocks before, to 5.7M blocks, 96% of responsive blocks. We
showed these algorithms work well using multiple datasets and natural experiments; they can improve
existing and future outage datasets.
In Chapter 3, we provided a new definition of the Internet, and then used it to resolve partial reachability
issues. We defined two new algorithms to identify two types of network fragmentation. First, Taitao
detects peninsulas, when a network can reach some parts of the Internet directly, but not others. Second,
Chiloe detects islands, networks that have internal connectivity but are sometimes cut off from the Internet
as a whole. We applied these algorithms in rigorous measurement from two complementary measurement
systems, one observing 5M networks from a few locations, and the other a few destinations from 10k
locations. Our results showed that peninsulas (partial connectivity) are about as common as Internet outages,
quantifying this long-observed problem. Root-cause analysis showed that most peninsula events (45%) are
routing transients, but most peninsula-time (90%) is from a few long-lived events (7%). Our analysis helped
interpret DNSmon, a system monitoring the DNS root, separating measurement error and persistent
problems from underlying differences and operationally important transients. Finally, our definition confirmed
the international nature of the Internet: no single country can unilaterally claim to be “the Internet”, but
countries can choose to leave.
In Chapter 4, we provided new algorithms to identify two classes of address dynamics: periodic (diurnal
and weekly) trends and ISP maintenance events. We showed that 20% of maintenance events result in /24
IPv4 address blocks that become unused for days or more. While only about 4% of ASes (2,830) are diurnal,
some diurnal ASes show 20% changes each day. We discussed how this identification can improve Internet
outage detection and policy evaluation of address usage.
These chapters prove the thesis. We show that an unambiguous conceptual definition of the Internet
core that captures the idea of a single, global Internet independent of assertions of authority helps
resolve today’s political, architectural, and operational challenges. A conceptual definition also serves as
an asymptote against which operational definitions may be tested. Researchers, network operators, and
policy makers should consider our definition when evaluating the current state and future of the global
network.
Bibliography
[1] William J. Drake (moderator). Internet Fragmentation, Reconsidered. CITI Seminar on Global
Digital Governance at IETF 115. Oct. 2022. url:
https://www8.gsb.columbia.edu/citi/GlobalDigitalGovernance.
[2] Christopher Amin, Massimo Cándela, Daniel Karrenberg, Robert Kisteleki, and Andreas Strikos.
“Visualization and Monitoring for the Identification and Analysis of DNS Issues”. In: Proceedings
of the International Conference on Internet Monitoring and Protection. Brussels, Belgium, June 2015.
url: https://www.researchgate.net/profile/Massimo-Candela/publication/279516870_Visualization_and_Monitoring_for_the_Identification_and_Analysis_of_DNS_Issues/links/559468c808ae793d13798901/Visualization-and-Monitoring-for-the-Identification-and-Analysis-of-DNS-Issues.pdf.
[3] David G. Andersen, Hari Balakrishnan, M. Frans Kaashoek, and Robert Morris. “Resilient Overlay
Networks”. In: Proceedings of the Symposium on Operating Systems Principles. Chateau Lake
Louise, Alberta, Canada: ACM, Oct. 2001, pp. 131–145. url:
http://www-cse.ucsd.edu/sosp01/papers/andersen.pdf.
[4] Anonymous. “The collateral damage of Internet censorship by DNS injection”. In: ACM Computer
Communication Review 42.3 (July 2012), pp. 21–27. doi:
http://dx.doi.org/10.1145/2317307.2317311.
[5] Anonymous. “Towards a Comprehensive Picture of the Great Firewall’s DNS Censorship”. In:
Proceedings of the USENIX Workshop on Free and Open Communciations on the Internet. San Diego,
CA, USA: USENIX, Aug. 2014. url:
https://www.usenix.org/system/files/conference/foci14/foci14-anonymous.pdf.
[6] ANT Project. ANT IPv4 Island and Peninsula Data. https://ant.isi.edu/datasets/ipv4_partial/.
Nov. 2022. url: https://ant.isi.edu/datasets/ipv4_partial/.
[7] Cathy Aronson. To Squat Or Not To Squat? blog
https://teamarin.net/2015/11/23/to-squat-or-not-to-squat/. Nov. 2015. url:
https://teamarin.net/2015/11/23/to-squat-or-not-to-squat/.
[8] G. Baltra and J. Heidemann. “What Is The Internet? (Considering Partial Connectivity)”. In: 2021.
118
[9] Guillermo Baltra and John Heidemann. “Improving Coverage of Internet Outage Detection in
Sparse Blocks”. In: Proceedings of the Passive and Active Measurement Workshop. Eugene, Oregon,
USA: Springer, Mar. 2020. url: https://www.isi.edu/%7ejohnh/PAPERS/Baltra20a.html.
[10] Guillermo Baltra and John Heidemann. Improving the Optics of Active Outage Detection (extended).
Tech. rep. ISI-TR-733. johnh: pale: USC/Information Sciences Institute, May 2019. url:
https://www.isi.edu/%7ejohnh/PAPERS/Baltra19a.html.
[11] Guillermo Baltra and John Heidemann. What Is The Internet? Partial Connectivity of the Internet
Core. Tech. rep. arXiv:2107.11439v3. USC/Information Sciences Institute, Mar. 2023. doi:
https://doi.org/10.48550/2107.11439v3.
[12] Kasun Bandara, Rob J Hyndman, and Christoph Bergmeir. MSTL: A Seasonal-Trend Decomposition
Algorithm for Time Series with Multiple Seasonal Patterns. 2021. doi: 10.48550/ARXIV.2107.13462.
[13] Genevieve Bartlett, John Heidemann, and Christos Papadopoulos. “Understanding Passive and
Active Service Discovery”. In: Proceedings of the ACM Internet Measurement Conference. San
Diego, California, USA: ACM, Oct. 2007, pp. 57–70. doi:
http://dx.doi.org/10.1145/1298306.1298314.
[14] Robert Beverly, Ramakrishnan Durairajan, David Plonka, and Justin P. Rohrer. “In the IP of the
Beholder: Strategies for Active IPv6 Topology Discovery”. In: Proceedings of the ACM Internet
Measurement Conference. johnh: pale: ACM, Oct. 2018, pp. 308–321. doi:
https://doi.org/10.1145/3278532.3278559.
[15] Henry Birge-Lee, Yixin Sun, Anne Edmundson, Jennifer Rexford, and Prateek Mittal.
“Bamboozling certicate authorities with BGP”. In: 27th USENIX Security Symposium. Baltimore,
Maryland, USA: USENIX, 2018, pp. 833–849.
[16] Randy Bush, Olaf Maennel, Matthew Roughan, and Steve Uhlig. “Internet optometry: assessing
the broken glasses in Internet reachability”. In: Proceedings of the 9th ACM SIGCOMM conference
on Internet measurement. Chicago, Illinois, USA: ACM, Nov. 2009, pp. 242–253. url:
http://www.maennel.net/2009/imc099-bush.pdf.
[17] CAIDA. Archipelago (Ark) Measurement Infrastructure. website
https://www.caida.org/projects/ark/. 2007. url: https://www.caida.org/projects/ark/.
[18] CAIDA. IODA: Internet Outage Detection & Analysis. 2020. url: https://ioda.caida.org.
[19] CAIDA. The CAIDA UCSD IPv4 Routed /24 Topology Dataset - 2017-10-10 to -31.
https://www.caida.org/data/active/ipv4_routed_24_topology_dataset.xml. 2017.
[20] CAIDA. The CAIDA UCSD IPv4 Routed /24 Topology Dataset - 2020-09-01 to -31.
https://www.caida.org/data/active/ipv4_routed_24_topology_dataset.xml. 2020.
[21] CenturyLink. Event History. https://status.ctl.io/history. 2019.
119
[22] Vint Cerf and Robert Kahn. “A Protocol for Packet Network Interconnection”. In: IEEE
Transactions on Communications COM-22.5 (May 1974), pp. 637–648. url:
http://sysnet.ucsd.edu/classes/cse222/wi03/papers/cerf-tcp-toc74.pdf.
[23] S. Cheshire and M. Krochmal. NAT Port Mapping Protocol (NAT-PMP). RFC 6886. johnh: pales:
Internet Request For Comments, Apr. 2013. doi: http://dx.doi.org/10.17487/RFC6886.
[24] David D. Clark. “The Design Philosophy of the DARPA Internet Protocols”. In: Proceedings of the
1988 Symposium on Communications Architectures and Protocols. johnh: pale (scanned) 30-jul-02:
ACM, Aug. 1988, pp. 106–114.
[25] David D. Clark, John Wroclawski, Karen Sollins, and Robert Braden. “Tussle in Cyberspace:
Dening Tomorrow’s Internet”. In: Proceedings of the ACM SIGCOMM Conference. Pittsburgh, PA,
USA: ACM, Aug. 2002, pp. 347–356. url:
http://www.acm.org/sigcomm/sigcomm2002/papers/tussle.pdf.
[26] CNBC. Russia just brought in a law to try to disconnect its Internet from the rest of the world. https:
//www.cnbc.com/2019/11/01/russia-controversial-sovereign-internet-law-goes-into-force.html.
Nov. 2019.
[27] N. Coca. “China’s Xinjiang surveillance is the dystopian future nobody wants”. In: Engadget (Feb.
2018). url: https://www.engadget.com/2018-02-22-china-xinjiang-surveillance-tech-spread.html.
[28] Cogent. Looking Glass. https://cogentco.com/en/looking-glass. May 2021.
[29] James Cowie. Egypt Leaves the Internet. Renesys Blog
http://www.renesys.com/blog/2011/01/egypt-leaves-the-internet.shtml. Jan. 2011. url:
http://www.renesys.com/blog/2011/01/egypt-leaves-the-internet.shtml.
[30] RBC daily. Russia, tested the Runet when disconnected from the Global Network. website
https://www.rbc.ru/technology_and_media/21/07/2021/60f8134c9a79476f5de1d739. July 2021. url:
https://www.rbc.ru/technology_and_media/21/07/2021/60f8134c9a79476f5de1d739.
[31] A. Dainotti, K. Benson, A. King, B. Huaker, E. Glatz, X. Dimitropoulos, P. Richter, A. Finamore,
and A. Snoeren. “Lost in Space: Improving Inference of IPv4 Address Space Utilization”. In: IEEE
Journal on Selected Areas in Communications (JSAC) 34.6 (June 2016), pp. 1862–1876.
[32] Alberto Dainotti, Roman Amman, Emile Aben, and Kimberly C Clay. “Extracting benet from
harm: using malware pollution to analyze the impact of political and geophysical events on the
Internet”. In: ACM SIGCOMM Computer Communication Review 42.1 (2012), pp. 31–39.
[33] Alberto Dainotti, Claudio Squarcella, Emile Aben, Marco Chiesa, Kimberly C. Clay,
Michele Russo, and Antonio Pescapé. “Analysis of Country-wide Internet Outages Caused by
Censorship”. In: Proceedings of the ACM Internet Measurement Conference. Berlin, Germany: ACM,
Nov. 2011, pp. 1–18. doi: http://dx.doi.org/10.1145/2068816.2068818.
[34] Dhaka Tribune Desk. “Internet services to be suspended across the country”. In: Dhaka Tribune
(Feb. 2018). url: http://www.dhakatribune.com/regulation/2018/02/11/internet-servicessuspended-throughout-country/.
120
[35] Amogh Dhamdhere, David D. Clark, Alexander Gamero-Garrido, Matthew Luckie,
Ricky K. P. Mok, Gautam Akiwate, Kabir Gogia, Vaibhav Bajpai, Alex C. Snoeren, and kc clay.
“Inferring Persistent Interdomain Congestion”. In: Proceedings of the ACM SIGCOMM Conference.
Budapest, Hungary: ACM, Aug. 2018, pp. 1–15. doi: https://doi.org/10.1145/3230543.3230549.
[36] DINRG. Decentralized Internet Infrastructure Research Group. https://irtf.org/dinrg. May 2021.
[37] Doug Madory. Iraq Downs Internet To Combat Cheating...Again!
https://dyn.com/blog/iraq-downs-internet-to-combat-cheating-again/. Accessed: 2019-01-08.
2017.
[38] William J. Drake, Vinton G. Cerf, and Wolfgang Kleinwächter. Internet Fragmentation: An
Overview. Tech. rep. World Economic Forum, Jan. 2016. url:
https://www3.weforum.org/docs/WEF_FII_Internet_Fragmentation_An_Overview_2016.pdf.
[39] Peter K. Dunn. Scientic Research Methods. https://bookdown.org/pkaldunn/Book/. May 2021.
[40] Zakir Durumeric, Michael Bailey, and J Alex Halderman. “An Internet-wide view of Internet-wide
scanning”. In: 23rd {USENIX} Security Symposium ({USENIX} Security 14). 2014, pp. 65–78.
[41] Economist Editors. “Why some countries are turning o the internet on exam days”. In: The
Economist (July 2018). (Appeared in the Middle East and Africa print edition). url:
https://www.economist.com/middle-east-and-africa/2018/07/05/why-some-countries-are-turningoff-the-internet-on-exam-days.
[42] Hurricane Electric. Looking Glass. http://lg.he.net/. May 2021.
[43] Engadget. China, Huawei propose internet protocol with a built-in killswitch.
https://www.engadget.com/2020-03-30-china-huawei-new-ip-proposal.html. 2020.
[44] Fail2ban. https://github.com/fail2ban/fail2ban. 2023.
[45] Xun Fan and John Heidemann. “Selecting Representative IP Addresses for Internet Topology
Studies”. In: Proceedings of the ACM Internet Measurement Conference. johnh: pale: ACM, Nov.
2010, pp. 411–423. doi: http://dx.doi.org/10.1145/1879141.1879195.
[46] Federal Networking Council (FNC). Denition of “Internet”.
https://www.nitrd.gov/fnc/internet_res.pdf. 1995.
[47] Pawel Foremski, David Plonka, and Arthur Berger. “Entropy/IP: Uncovering Structure in IPv6
Addresses”. In: Proceedings of the ACM Internet Measurement Conference. johnh: pale: ACM, Nov.
2016, pp. 167–181. doi: https://doi.org/10.1145/2987443.2987445.
[48] HE forums. Cloudare Blocked on Free Tunnels now?
https://forums.he.net/index.php?topic=3805.0. Dec. 2017.
[49] V. Fuller, E. Lear, and D. Meyer. “Reclassifying 240/4 as usable unicast address space”. Work in
progress (Internet draft draft-fuller-240space-02.txt). Mar. 2008. url:
https://datatracker.ietf.org/doc/html/draft-fuller-240space-02.
121
[50] Oliver Gasser, Quirin Scheitle, Pawel Foremski, Qasim Lone, Maciej Korczynski,
Stephen D. Strowes, Luuk Hendriks, and Georg Carle. “Clusters in the Expanse: Understanding
and Unbiasing IPv6 Hitlists”. In: Proceedings of the ACM Internet Measurement Conference. johnh:
not on le: ACM, Oct. 2018, to appear. doi: https://doi.org/10.1145/3278532.3278564.
[51] Oliver Gasser, Quirin Scheitle, Sebastian Gebhard, and Georg Carle. “Scanning the IPv6 Internet:
Towards a Comprehensive Hitlist”. In: Proceedings of the IFIP International Workshop on Trac
Monitoring and Analysis. johnh: pale: IFIP, Apr. 2016. url:
http://tma.ifip.org/2016/papers/tma2016-final51.pdf.
[52] Samuel Gibbs. “Iraq shuts down the Internet to stop pupils cheating in exams”. In: The Guardian
(May 1996). url: https://www.theguardian.com/technology/2016/may/18/iraq-shuts-downinternet-to-stop-pupils-cheating-in-exams.
[53] GovTrack.us. Unplug the Internet Kill Switch Act would eliminate a 1942 law that could let the
president shut down the internet. https://govtrackinsider.com/unplug-the-internet-kill-switchact-would-eliminate-a-1942-law-that-could-let-the-president-shut-78326f0ef66c. Nov. 2020.
url: https://govtrackinsider.com/unplug-the-internet-kill-switch-act-would-eliminate-a1942-law-that-could-let-the-president-shut-78326f0ef66c.
[54] Albert Greenberg, James R. Hamilton, Navendu Jain, Srikanth Kandula, Changhoon Kim,
Parantap Lahiri, David A. Maltz, and Parveen Pat. “VL2: A Scalable and Flexible Data Center
Network”. In: Proceedings of the ACM SIGCOMM Conference. Barcelona, Spain: ACM, Aug. 2009,
pp. 51–62. url: http://ccr.sigcomm.org/online/files/p51.pdf.
[55] James Griths. “Democratic Republic of Congo internet shutdown shows how Chinese
censorship tactics are spreading”. In: CNN (Jan. 2019). url:
https://edition.cnn.com/2019/01/02/africa/congo-internet-shutdown-china-intl/index.html.
[56] Andreas Guillot, Romain Fontugne, Philipp Winter, Pascal Merindol, Alistair King,
Alberto Dainotti, and Cristel Pelsser. “Chocolatine: Outage Detection for Internet Background
Radiation”. In: Proceedings of the IFIP International Workshop on Trac Monitoring and Analysis.
Paris, France: IFIP, June 2019. url:
https://clarinet.u-strasbg.fr/~pelsser/publications/Guillot-chocolatine-TMA2019.pdf.
[57] Hang Guo and John Heidemann. “Detecting ICMP Rate Limiting in the Internet”. In: Proceedings
of the Passive and Active Measurement Workshop. johnh: pale: Springer, Mar. 2018, to appear.
url: https://www.isi.edu/%7ejohnh/PAPERS/Guo18a.html.
[58] John Heidemann, Yuri Pradkin, Ramesh Govindan, Christos Papadopoulos, Genevieve Bartlett,
and Joseph Bannister. “Census and Survey of the Visible Internet”. In: Proceedings of the ACM
Internet Measurement Conference. Vouliagmeni, Greece: ACM, Oct. 2008, pp. 169–182. doi:
http://dx.doi.org/10.1145/1452520.1452542.
[59] Jon Henley. “Algeria blocks internet to prevent students cheating during exams”. In: The
Guardian (June 2018). url: https://www.theguardian.com/world/2018/jun/21/algeria-shutsinternet-prevent-cheating-school-exams.
[60] IANA. IPv4 Address Space Registry. https://www.nro.net/about/rirs/statistics/. May 2021.
122
[61] IANA. IPv6 RIR Allocation Data. https://www.iana.org/numbers/allocations/. Jan. 2021.
[62] Internet Addresses Survey dataset, PREDICT ID: USC-LANDER/internet-addresssurvey-reprobing-it75w-20170427/.
[63] Internet Architecture Board. IAB Technical Comment on the Unique DNS Root. RFC 2826. johnh:
pales: Internet Request For Comments, May 2000. url:
https://urldefense.com/v3/__https://www.rfc-editor.org/rfc/rfc2826__;!!LIr3w8kk_Xxm!
8cP86zBFMfNadTeHqgEKR7HviRVRsnDoeNPrstzD1LRv9XLnsS3Ujgv-U_f7eg$.
[64] Ethan Katz-Bassett, Harsha V Madhyastha, John P John, Arvind Krishnamurthy, David Wetherall,
and Thomas E Anderson. “Studying Black Holes in the Internet with Hubble”. In: Proceedings of
the USENIX Conference on Networked Systems Design and Implementation. San Francisco, CA:
ACM, 2008, pp. 247–262.
[65] Ethan Katz-Bassett, Colin Scott, David R. Chones, Ítalo Cunha, Vytautas Valancius,
Nick Feamster, Harsha V. Madhyastha, Tom Anderson, and Arvind Krishnamurthy. “LIFEGUARD:
Practical Repair of Persistent Route Failures”. In: Proceedings of the ACM SIGCOMM Conference.
Helsinki, Finland: ACM, Aug. 2012, pp. 395–406. doi: https://doi.org/10.1145/2377677.2377756.
[66] DataCenter Knowledge. Peering Disputes Migrate to IPv6.
https://www.datacenterknowledge.com/archives/2009/10/22/peering-disputes-migrate-to-ipv6.
2009.
[67] Craig Labovitz, Scott Iekel-Johnson, Danny McPherson, Jon Oberheide, and Farnam Jahanian.
“Internet Inter-Domain Trac”. In: Proceedings of the ACM SIGCOMM Conference. New Delhi,
India: ACM, Aug. 2010, pp. 75–86. doi: http://doi.acm.org/10.1145/1851182.1851194.
[68] Leslie Lamport. “The Part-Time Parliament”. In: ACM Transactions on Computer Systems 16.2
(May 1998), pp. 133–169. doi: http://dx.doi.org/10.1145/279227.279229.
[69] Leslie Lamport, Robert Shostak, and Marshall Pease. “The Byzantine Generals Problem”. In: ACM
Transactions on Programming Languages and Systems 4.3 (July 1982), pp. 382–401.
[70] Lumen Technologies. LUMEN MASTER SERVICE AGREEMENT. johnh: pale, May 2023. url:
https://www.lumen.com/en-us/about/legal/business-customer-terms-conditions.html.
[71] MaxMind. GeoIP Geojlocation Products. http://www.maxmind.com/en/city. 2017.
[72] D. Meyer. University of Oregon Routeviews. http://www.routeviews.org. 2018.
[73] Brent A. Miller, Toby Nixon, Charlie Tai, and Mark D. Wood. “Home Networking with Universal
Plug and Play”. In: IEEE Communications Magazine 39.12 (Dec. 2001), pp. 104–109. doi:
10.1109/35.968819.
[74] Rich Miller. Peering Disputes Migrate to IPv6. website
https://www.datacenterknowledge.com/archives/2009/10/22/peering-disputes-migrate-to-ipv6.
Oct. 2009. url:
https://www.datacenterknowledge.com/archives/2009/10/22/peering-disputes-migrate-to-ipv6.
123
[75] Jelena Mirkovic, Genevieve Bartlett, John Heidemann, Hao Shi, and Xiyue Deng. Do You See Me
Now? Sparsity in Passive Observations of Address Liveness (extended). Tech. rep. ISI-TR-2016-710.
johnh: pale: USC/Information Sciences Institute, July 2016. url:
http://www.isi.edu/%7ejohnh/PAPERS/Mirkovic16a.html.
[76] Giovane CM Moura, Carlos Ganán, Qasim Lone, Payam Poursaied, Hadi Asghari, and
Michel van Eeten. “How dynamic is the isps address space? towards internet-wide dhcp churn
estimation”. In: 2015 IFIP Networking Conference (IFIP Networking). IEEE. 2015, pp. 1–9.
[77] Austin Murdock, Frank Li, Paul Bramsen, Zakir Durumeric, and Vern Paxson. “Target Generation
for Internet-wide IPv6 Scanning”. In: Proceedings of the ACM Internet Measurement Conference.
johnh: pale: ACM, Oct. 2017, pp. 242–253. doi: https://doi.org/10.1145/3131365.3131405.
[78] Satoshi Nakamoto. Bitcoin: A Peer-to-Peer Electronic Cash System. Released publically
http://bitcoin.org/bitcoin.pdf. Mar. 2009. url: http://bitcoin.org/bitcoin.pdf.
[79] RIPE NCC. RIPE Atlas IP echo measurements in IPv4. https://atlas.ripe.net/measurements/[1001,
1004,1005,1006,1008,1009,1010,1011,1012,1013,1014,1015,1016]/. July 2021.
[80] RIPE NCC. RIPE Atlas IP traceroute measurements in IPv4. https://atlas.ripe.net/measurements/
[5001,5004,5005,5006,5008,5009,5010,5011,5012,5013,5014,5015,5016]/. 2021.
[81] BBC News. Russia internet: Law introducing new controls comes into force. website
https://www.bbc.com/news/world-europe-50259597. Mar. 2019. url:
https://www.bbc.com/news/world-europe-50259597.
[82] University of Oregon. Route Views Archive Project.
http://archive.routeviews.org/bgpdata/2020.10/RIBS/rib.20201001.0000.bz2. Oct. 2020.
[83] Ramakrishna Padmanabhan, Amogh Dhamdhere, Emile Aben, kc clay, and Neil Spring. “Reasons
Dynamic Addresses Change”. In: Proceedings of the ACM Internet Measurement Conference. johnh:
pale: ACM, Nov. 2016, pp. 183–198. doi: https://doi.org/10.1145/2987443.2987461.
[84] Ramakrishna Padmanabhan, John P Rula, Philipp Richter, Stephen D Strowes, and
Alberto Dainotti. “DynamIPs: Analyzing address assignment practices in IPv4 and IPv6”. In:
Proceedings of the 16th International Conference on emerging Networking EXperiments and
Technologies. 2020, pp. 55–70.
[85] Ramakrishna Padmanabhan, Aaron Schulman, Dave Levin, and Neil Spring. “Residential links
under the weather”. In: Proceedings of the ACM Special Interest Group on Data Communication.
ACM. 2019, pp. 145–158.
[86] C. Partridge, T. Mendez, and W. Milliken. Host Anycasting Service. RFC 1546. Internet Request For
Comments, Nov. 1993. url: https://www.rfc-editor.org/rfc/rfc1546.txt.
[87] David Plonka and Arthur Berger. “Temporal and Spatial Classication of Active IPv6 Addresses”.
In: Proceedings of the ACM Internet Measurement Conference. johnh: pale: ACM, Oct. 2015,
pp. 509–522. doi: http://dx.doi.org/10.1145/2815675.2815678.
124
[88] Jonathan B. Postel. “Internetwork Protocol Approaches”. In: IEEE Transactions on Computers 28.4
(Apr. 1980), pp. 604–611. doi: http://dx.doi.org/10.1109/TCOM.1980.1094705.
[89] The Spamhaus Project. https://www.spamhaus.org. 2023.
[90] Lin Quan, John Heidemann, and Yuri Pradkin. “Trinocular: Understanding Internet Reliability
Through Adaptive Probing”. In: Proceedings of the ACM SIGCOMM Conference. Hong Kong,
China: ACM, Aug. 2013, pp. 255–266. doi: http://doi.acm.org/10.1145/2486001.2486017.
[91] Lin Quan, John Heidemann, and Yuri Pradkin. “When the Internet Sleeps: Correlating Diurnal
Networks With External Factors”. In: Proceedings of the ACM Internet Measurement Conference.
Vancouver, BC, Canada: ACM, Nov. 2014, pp. 87–100. doi:
http://dx.doi.org/10.1145/2663716.2663721.
[92] Dan Rayburn. Google Blocking IPv6 Adoption With Cogent, Impacting Transit Customers.
https://seekingalpha.com/article/3948876-google-blocking-ipv6-adoption-cogent-impactingtransit-customers. Mar. 2016.
[93] Y. Rekhter, B. Moskowitz, D. Karrenberg, G. J. de Groot, and E. Lear. Address Allocation for Private
Internets. RFC 1918. Internet Request For Comments, Feb. 1996. url:
ftp://ftp.rfc-editor.org/in-notes/rfc1918.txt.
[94] Reuters. website https://www.reuters.com/technology/us-firm-cogent-cutting-internet-servicerussia-2022-03-04/. July 2022. url: https://www.reuters.com/technology/us-firm-cogent-cuttinginternet-service-russia-2022-03-04/.
[95] Reuters. Russia disconnected from internet in tests as it bolsters security. website
https://www.reuters.com/technology/russia-disconnected-global-internet-tests-rbc-daily2021-07-22/. July 2021. url: https://www.reuters.com/technology/russia-disconnected-globalinternet-tests-rbc-daily-2021-07-22/.
[96] Philipp Richter, Ramakrishna Padmanabhan, Neil Spring, Arthur Berger, and David Clark.
“Advancing the Art of Internet Edge Outage Detection”. In: Proceedings of the ACM Internet
Measurement Conference. Boston, Massachusetts, USA: ACM, Oct. 2018, pp. 350–363. doi:
https://doi.org/10.1145/3278532.3278563.
[97] Philipp Richter, Ramakrishna Padmanabhan, Neil Spring, Arthur Berger, and David Clark.
“Advancing the Art of Internet Edge Outage Detection”. In: Proceedings of the ACM Internet
Measurement Conference. Boston, Massachusetts, USA: ACM, Oct. 2018, pp. 350–363. doi:
https://doi.org/10.1145/3278532.3278563.
[98] Philipp Richter, Georgios Smaragdakis, David Plonka, and Arthur Berger. “Beyond counting: new
perspectives on the active IPv4 address space”. In: Proceedings of the 2016 Internet Measurement
Conference. 2016, pp. 135–149.
125
[99] Philipp Richter, Florian Wohlfart, Narseo Vallina-Rodriguez, Mark Allman, Randy Bush,
Anja Feldmann, Christian Kreibich, Nicholas Weaver, and Vern Paxson. “A Multi-perspective
Analysis of Carrier-Grade NAT Deployment”. In: Proceedings of the ACM Internet Measurement
Conference. Santa Monica, CA, USA: ACM, Nov. 2016. doi:
http://dx.doi.org/10.1145/2987443.2987474.
[100] RIPE Atlas. Built-in measurements ID 1010. https://atlas.ripe.net/measurements/1010/. 2021.
[101] RIPE NCC. DNSMON. https://atlas.ripe.net/dnsmon. 2020.
[102] RIPE NCC. RIPE Atlas. 2020. url: https://atlas.ripe.net/.
[103] RIPE NCC Sta. “RIPE Atlas: A Global Internet Measurement Network”. In: The Internet Protocol
Journal 18.3 (Sept. 2015), pp. 2–26. url:
http://ipj.dreamhosters.com/wp-content/uploads/2015/10/ipj18.3.pdf.
[104] Sen. John D. Rockefeller. Cybersecurity Act of 2010.
https://www.congress.gov/bill/111th-congress/senate-bill/773. 2009.
[105] Root Operators. http://www.root-servers.org. Apr. 2016.
[106] J. Rosenberg, J. Weinberger, C. Huitema, and R. Mahy. STUN—Simple Traversal of User Datagram
Protocol (UDP) Through Network Address Translators (NATs). RFC 3489. Internet Request For
Comments, Dec. 2003. url: ftp://ftp.rfc-editor.org/in-notes/rfc3489.txt.
[107] Tarang Saluja, John Heidemann, and Yuri Pradkin. “Dierences in Monitoring the DNS Root Over
IPv4 and IPv6”. In: Proceedings of the National Symposium for NSF REU Research in Data Science,
Systems, and Security. Portland, OR, USA: IEEE, Dec. 2022, to appear.
[108] Brandon Schlinker, Hyojeong Kim, Timothy Cui, Ethan Katz-Bassett, Harsha V. Madhyastha,
Italo Cunha, James Quinn, Saif Hasan, Petr Lapukhov, and Hongyi Zeng. “Engineering Egress
with Edge Fabric: Steering Oceans of Content to the World”. In: Proceedings of the ACM
SIGCOMM Conference. Los Angeles, CA, USA: ACM, Aug. 2017, pp. 418–431. doi:
https://doi.org/10.1145/3098822.3098853.
[109] Aaron Schulman and Neil Spring. “Pingin’ in the Rain”. In: Proceedings of the ACM Internet
Measurement Conference. Berlin, Germany: ACM, Nov. 2011, pp. 19–25. doi:
https://doi.org/10.1145/2068816.2068819.
[110] Anant Shah, Romain Fontugne, Emile Aben, Cristel Pelsser, and Randy Bush. “Disco: Fast, Good,
and Cheap Outage Detection”. In: Proceedings of the IEEE International Conference on Trac
Monitoring and Analysis. Dublin, Ireland: Springer, June 2017, pp. 1–9. doi:
https://doi.org/10.23919/TMA.2017.8002902.
[111] Xiao Song and John Heidemann. Measuring the Internet during Covid-19 to Evaluate
Work-from-Home. Tech. rep. arXiv:2102.07433v2 [cs.NI]. USC/ISI, Feb. 2021. url:
https://www.isi.edu/%7ejohnh/PAPERS/Song21a.html.
126
[112] Xiao Song and John Heidemann. Measuring the Internet during Covid-19 to Evaluate
Work-from-Home (poster). Poster at the NSF PREPARE-VO Workshop. johnh: pale, Dec. 2020.
url: https://www.isi.edu/%7ejohnh/PAPERS/Song20a.html.
[113] Berhan Taye and Sage Cheng. Report: the state of internet shutdowns. blog
https://www.accessnow.org/the-state-of-internet-shutdowns-in-2018/. July 2019. url:
https://www.accessnow.org/the-state-of-internet-shutdowns-in-2018/.
[114] Craig Timberg and Paul Sonne. “Minutes before Trump left oce, millions of the Pentagon’s
dormant IP addresses sprang to life”. In: The Washington Post (Apr. 2021). url:
https://www.washingtonpost.com/technology/2021/04/24/pentagon-internet-address-mystery/.
[115] Paul F. Tsuchiya and Tony Eng. “Extending the IP Internet Through Address Reuse”. In: ACM
Computer Communication Review 23.1 (Jan. 1993), pp. 16–33. url:
http://www.cs.cornell.edu/People/francis/tsuchiya93extending.pdf.
[116] European Union. Regulation (EU) 2016/679 of the European Parliament and of the Council of 27
April 2016 on the protection of natural persons with regard to the processing of personal data and on
the free movement of such data. https://eur-lex.europa.eu/eli/reg/2016/679/oj. Jan. 2021.
[117] USC/ISI ANT project. https://ant.isi.edu/datasets/all.html. Accessed: 2019-01-08. 2017.
[118] USC/LANDER Project. Internet Outage Measurements. listed on web page
https://ant.isi.edu/datasets/outage/. Oct. 2014.
[119] Gerry Wan, Liz Izhikevich, David Adrian, Katsunari Yoshioka, Ralph Holz, Christian Rossow, and
Zakir Durumeric. “On the Origin of Scanning: The Impact of Location on Internet-Wide Scans”.
In: Proceedings of the ACM Internet Measurement Conference. Pittsburgh, PA, USA: ACM, Oct.
2020, pp. 662–679. doi: https://doi.org/10.1145/3419394.3424214.
[120] Samuel Woodhams and Simon Migliano. The Global Cost of Internet Shutdowns in 2020.
https://www.top10vpn.com/cost-of-internet-shutdowns/. Jan. 2021.
[121] Yinglian Xie, Fang Yu, Kannan Achan, Eliot Gillum, Moises Goldszmidt, and Ted Wobber. “How
Dynamic are IP Addresses?” In: Proceedings of the ACM SIGCOMM Conference. johnh: pale:
ACM, Aug. 2007, pp. 301–312. doi: http://doi.acm.org/10.1145/1282380.1282415.
[122] Johannes Zirngibl, Lion Steger, Patrick Sattler, Oliver Gasser, and Georg Carle. “Rusty clusters?:
dusting an IPv6 research foundation”. In: Proceedings of the 22nd ACM Internet Measurement
Conference. johnh: pale: ACM, Oct. 2022, pp. 395–409. doi:
https://doi.org/10.1145/3517745.3561440.
127
Asset Metadata
Creator: Baltra, Guillermo Pedro (author)
Core Title: Improving network reliability using a formal definition of the Internet core
School: Viterbi School of Engineering
Degree: Doctor of Philosophy
Degree Program: Computer Science
Degree Conferral Date: 2023-12
Publication Date: 10/02/2023
Defense Date: 08/16/2023
Publisher: Los Angeles, California (original); University of Southern California (original); University of Southern California. Libraries (digital)
Tag: availability and reliability, fault tolerance, Internet, measurement platforms, measurement techniques, methods and tools, network outages, network routing, OAI-PMH Harvest
Format: theses (aat)
Language: English
Contributor: Electronically uploaded by the author (provenance)
Advisor: Heidemann, John (committee chair); Govindan, Ramesh (committee member); Ortega, Antonio (committee member)
Creator Email: baltra@usc.edu, gbaltrae@hotmail.com
Permanent Link (DOI): https://doi.org/10.25549/usctheses-oUC113719168
Unique identifier: UC113719168
Identifier: etd-BaltraGuil-12409.pdf (filename)
Legacy Identifier: etd-BaltraGuil-12409
Document Type: Dissertation
Rights: Baltra, Guillermo Pedro
Internet Media Type: application/pdf
Type: texts
Source: 20231004-usctheses-batch-1100 (batch); University of Southern California (contributing entity); University of Southern California Dissertations and Theses (collection)
Access Conditions: The author retains rights to his/her dissertation, thesis or other graduate work according to U.S. copyright law. Electronic access is being provided by the USC Libraries in agreement with the author, as the original true and official version of the work, but does not grant the reader permission to use the work if the desired use is covered by copyright. It is the author, as rights holder, who must provide use permission if such use is covered by copyright.
Repository Name: University of Southern California Digital Library
Repository Location: USC Digital Library, University of Southern California, University Park Campus MC 2810, 3434 South Grand Avenue, 2nd Floor, Los Angeles, California 90089-2810, USA
Repository Email: cisadmin@lib.usc.edu