GLOBAL ANALYSIS AND MODELING ON DECENTRALIZED INTERNET
by
Xue Cai
A Dissertation Presented to the
FACULTY OF THE USC GRADUATE SCHOOL
UNIVERSITY OF SOUTHERN CALIFORNIA
In Partial Fulfillment of the
Requirements for the Degree
DOCTOR OF PHILOSOPHY
(COMPUTER SCIENCE)
May 2014
Copyright 2014 Xue Cai
Dedication
Dedicated to my beloved parents, who supported me all along the way.
Acknowledgments
Through this long Ph.D. journey, I am grateful to many people who helped, inspired,
and encouraged me.
First of all, I would like to thank my advisor, Prof. John Heidemann, for his guidance during all my Ph.D. years. His mentorship has helped me develop critical-thinking, problem-solving, presentation, and writing skills. I am especially grateful for the effort and time he put into me.
I would like to specially thank Walter Willinger and Balachander Krishnamurthy, my two mentors during and after my internship at AT&T Labs Research. They opened my eyes during my early years in research and have been providing me with help ever since. Most importantly, they taught me how to be an ethical researcher.
I want to thank Prof. Ramesh Govindan, Prof. Antonio Ortega, Prof. Kristina Lerman, and Prof. Leana Golubchik for their service on my qualifying exam and dissertation committee.
Lastly, I would like to thank the friends I made at USC and ISI who have made my research life colorful. I'd like to thank my former office mate, Genevieve Bartlett, who helped me quickly adjust to research and life in the US. I want to thank Chengjie Zhang, who started the Ph.D. journey the same year as me and has kept helping me ever since, and Lin Quan, who sat next to me and likes to share jokes. I would also like to thank Xun Fan, Zi Hu, Hao Shi, Xiyue Deng, Lihang Zhao, and many others.
Table of Contents

Dedication
Acknowledgments
List of Tables
List of Figures
Abstract

Chapter 1: Introduction
  1.1 Problem Space
  1.2 Thesis Statement
  1.3 Supporting the Thesis
  1.4 Contributions

Chapter 2: Understanding Edge Users in the Visible Internet
  2.1 Introduction
  2.2 Methodology
    2.2.1 Data Collection: Surveying the Internet
    2.2.2 Representation: Observations of Interest
    2.2.3 Block Identification
    2.2.4 Ping-Observable Block Classification
    2.2.5 Identifying Low-bitrate Blocks
  2.3 Understanding Edge Address Usage and Low-bitrate Access
    2.3.1 Block Sizes
    2.3.2 Address Utilization
    2.3.3 Intermittent and Dynamic IP Addressing
    2.3.4 Understanding Edge Bitrates
  2.4 Validation of Understanding and Consistency
    2.4.1 Validation within USC
    2.4.2 Validation in the General Internet
    2.4.3 Consistency Across Repeated Surveys
  2.5 Conclusions
  2.6 Appendix: Details about Surveys at Different Dates
  2.7 Appendix: Examining the (A, V, U*) Space
  2.8 Appendix: Training and Hostname-inferred Usage Categorization
    2.8.1 Hostname-inferred Usage Categories
    2.8.2 Relating Hostname-Inferred to Ping-Observable Categories

Chapter 3: Mapping Autonomous Systems to Organizations
  3.1 Introduction
  3.2 Methodology
    3.2.1 Automated Clustering with WHOIS Data
    3.2.2 Semi-automatic Clustering with 10-K Data
  3.3 Validation of AS-to-Organization Map
    3.3.1 Validation Datasets
    3.3.2 Validation Method
    3.3.3 Validation Results
    3.3.4 Factors that Improve Accuracy
    3.3.5 Comparison with PCH
  3.4 Prevalence and Influence of Multi-AS Usage
    3.4.1 Relevance of multi-AS organizations
    3.4.2 Causes of multi-AS usage
  3.5 Incompleteness of AS-level and Importance of Organization-level Topology
    3.5.1 Address Coverage of an AS vs. its Organization
    3.5.2 Internet Exchange Point Coverage of ASes and Organizations
  3.6 Conclusions
  3.7 Appendix: Training in Detail
    3.7.1 Training Methodology
    3.7.2 Details of Training Results
  3.8 Appendix: Validation with Broader Coverage (PCH)
    3.8.1 Validation Dataset
    3.8.2 Evaluation of PCH Dataset with Strong Ground Truth
    3.8.3 Validation of Our Results with PCH
    3.8.4 Validation Results
  3.9 Appendix: Persistence of Multi-AS Usage
    3.9.1 Evolution of multi-AS usage
    3.9.2 Case studies of multi-AS usage
    3.9.3 Ruling out Churn
    3.9.4 How persistent is multi-AS usage?
  3.10 Appendix: Revisiting AS Rank

Chapter 4: Holistically Framing the User Impact of Infrastructure Threats
  4.1 Introduction
  4.2 Related Work
  4.3 Modeling Cable Cuts
    4.3.1 Model Overview
    4.3.2 From Cable Cut to SONET Circuits
    4.3.3 From SONET Circuits to IP Links
    4.3.4 From IP Links to Transport-layer Flows
    4.3.5 From Flows to Sessions
    4.3.6 From Sessions to QoE
    4.3.7 Data Needed for the Model
  4.4 Case Studies
    4.4.1 Incident Overview
    4.4.2 Applying the Model
    4.4.3 Causes of Large Impact on Bangladesh
    4.4.4 Impact on QoE in Different What-If Scenarios
    4.4.5 Implications for Connectivity Planning
    4.4.6 General Tactics to Address Incomplete Data
    4.4.7 Applying to Other Incidents
  4.5 Guidelines to Understand and Model Threats
  4.6 Conclusions

Chapter 5: Related Work
  5.1 Understanding Internet Edge Behavior
  5.2 Building Internet Topology
  5.3 Modeling Internet Infrastructure Threats

Chapter 6: Future Work and Conclusions
  6.1 Future Work
  6.2 Conclusions

Bibliography
List of Tables

2.1 Datasets used in this paper.
2.2 Common link types and the transmission delay for 64KB and 1500KB packets, respectively.
2.3 Number of blocks of each size in IT17ws (10 days).
2.4 The distribution of /24 blocks in ping-observable categories of 10 countries.
2.5 The distribution of /24 blocks in ping-observable categories of 5 regional registries.
2.6 Evaluation of accuracy of block identification at USC against ground-truth sizes.
2.7 Evaluation of block classification accuracy at USC against ground truth.
2.8 Evaluation of block identification accuracy of random Internet blocks.
2.9 Evaluation of block classification accuracy of commercial blocks.
2.10 Evaluation of low-bitrate block classification accuracy of commercial blocks.
2.11 Number of blocks of each size in IT17ws (6 days).
2.12 Number of blocks of each size in IT16ws (6 days).
2.13 Number of blocks of each size in IT31ws (6 days).
2.14 Number of blocks of each size in IT31ws (14 days).
2.15 Number of blocks of each size in IT30ws (14 days).
2.16 Categories of hostname-derived usage.
2.17 The mapping from the 15 hostname-inferred usage categories to 4 ping-observable categories. The hostname-inferred usage category without (parentheses) is dominant.
3.1 Data availability (AS count) for four attribute types across the 5 RIRs.
3.2 Validation datasets ranked by quality, unbiasedness, and size. Daggers (†): omitted intentionally.
3.3 Validation of 10 intentionally selected organizations including a Tier-1 ISP.
3.4 Validation of randomly selected organizations from top 100 clusters.
3.5 Validation of randomly selected organizations from all clusters.
3.6 Improvement in false-negative rate when company subsidiary information is used.
3.7 Organization distribution by number of ASes in total, and ASes in routing tables.
3.8 Organization distribution by number of ASes in total, and ASes in routing tables (from the second RouteViews site).
3.9 Summary of training results. Best score: 44.5. Numbers of weight vectors examined are in parentheses.
3.10 Validation results by PCH.
4.1 Sources of sub-models. Daggers (†): sub-models used in this paper.
4.2 Data needed for the model.
4.3 Four real-world incidents we have studied. Asterisks (*): estimated; daggers (†): reported.
List of Figures

1.1 The two-dimensional problem space of the thesis.
1.2 The position of our work in the problem space.
2.1 The parts of the problem space the first study explores.
2.2 Our BlockSizeId algorithm identifies regions of different use.
2.3 Number of addresses in each block size and ping-observable category in IT17ws.
2.4 Trend of ping-observable category change in IT17ws /24 blocks.
2.5 Comparison of availability for low-bitrate (top line) and non-low-bitrate (bottom line) classifiable /24 blocks in IT17ws.
2.6 Comparison of median-up between low-bitrate and non-low-bitrate classifiable /24 blocks in IT17ws.
2.7 Number of addresses in each block size and ping-observable category in USCs.
2.8 Density plots of /24 blocks in IT17ws across each of the A/V, U/V, A/U planes.
2.9 Our investigation targets: IP addresses that ever responded in IT17wrs and have meaningful hostnames (with keywords); the middle part of this figure, with 573,494 addresses.
2.10 Numbers of hostname-inferred usage categories, with colors indicating those that also have allocation types.
2.11 CDF of address availability ($A$), volatility ($V$), and median-up duration ($U^*$) by hostname-inferred categories in IT17ws.
2.12 Relationship of ping-observed categories to hostname-inferred categories in IT17ws.
3.1 The parts of the problem space the second study explores.
3.2 Comparison between our previous and current validation results.
3.3 Historical routability of Google ASes.
3.4 Missing address/city coverage from the main-AS view compared with the organization view for routing-complex organizations.
3.5 Cumulative distribution of unweighted/weighted peering-active organizations by number of ASes used to peer.
3.6 The different IXP peering views from the whole organization's perspective and from the main AS's perspective.
3.7 Missing IXP/city coverage from the main-AS view compared with the organization view for peering-complex organizations.
3.8 Missing peer/link coverage from the main-AS view compared with the organization view for peering-complex organizations.
3.9 Converging training results with parallel hill climbing.
3.10 Parallel hill climbing with attribute set 4attr+all10K and cutting threshold 0.01. Company subsidiary information (10-K) is always shown on the y axis.
3.11 Three definitions of validation metrics.
3.12 Evaluation of the PCH dataset, compared with the Tier-1 ISP and 9 organizations.
3.13 The adjusted false-negative rate.
3.14 The number of ASes ($n_p$, where $p \in \{100, 80\}$) of Google/Comcast that announce a $p\%$ fraction of its addresses, with linear regression $\hat{n}_p$ computed over 24 months. The AS scale extends to how many ASes Google/Comcast has as of 2011-09-01.
3.15 Historical routability of ASes of Verizon.
3.16 Historical routability of ASes of Time Warner Cable.
3.17 Historical routability of ASes of China Mobile.
3.18 Historical routability of ASes of ISC.
3.19 Classification results of multi-AS usage over all multi-AS organizations, based on regression of $n_p$ starting from different years.
3.20 The number of neighbors of individual ASes vs. their organizations. Only the top 100 ASes (ranks annotated as numbers in circles) and all Verizon ASes are plotted.
4.1 The parts of the problem space the third study explores.
4.2 Global submarine cable map in 2013 [Mah13].
4.3 The problem to solve.
4.4 The general picture of the model.
4.5 SONET circuits rely on cable segments as physical medium, but have to be provisioned to transmit data.
4.6 SONET systems with a ring protection mechanism use two circuits (working and protection path) to support an IP link.
4.7 Traffic flows between two endpoints rely on the network layer to find a path composed of IP links. The path must comply with policies configured in routers.
4.8 Flows are dynamically routed based on current IP link state for robustness.
4.9 A session between a user and a service relies on one or multiple flows between the user client and server(s).
4.10 Physical topology of SeaMeWe-4 [SEA13a].
4.11 QoE: decreased play time (minutes) in the 2-D parameter space, $r_v = 350$ kbps.
4.12 Steady Demand Scenario (x = 3).
4.13 Decreased Demand Scenario (x = 3, y = 2).
4.14 Increased Demand Scenario (y ≥ 2).
4.15 Capacity planning during normal and abnormal conditions.
4.16 Estimated impact on user QoE in four incidents.
Abstract

A better understanding of the Internet infrastructure is crucial to improving the reliability, performance, and security of web services. The need for this understanding drives research in network measurement. Internet measurement explores a variety of data related to a specific topic and then develops approaches to transform that data into useful understanding about the topic. This process is not straightforward, since available data often contains only indirect information that may appear to have limited connection to the topic.

This body of work asserts that systematic approaches can overcome data limitations to improve understanding about important aspects of the Internet infrastructure. We demonstrate the validity of our thesis statement by providing three specific examples that develop novel approaches and provide novel understanding compared to prior work. In particular, we employ four systematic approaches—statistical, clustering, modeling, and what-if analysis—to understand three important aspects of the Internet: the efficiency and management of IPv4 addresses, the ownership of Autonomous Systems (ASes), and the robustness of web services facing critical facility disruption. These approaches address a variety of challenges posed by indirect, incomplete, over-fit, noisy, and unknown data; they in turn enable us to improve understanding about the Internet.

Each of our three studies explores a different area of the problem space and opens a much larger area of opportunity. The data limitations addressed by our approaches also occur in many other problems. We believe our approaches can inspire future work to solve these problems and in turn provide more useful understanding about the Internet.
Chapter 1
Introduction
Web service providers and users care about the reliability, performance, and security of these services. Reliability characterizes the dependability of a service when facing stochastic scenarios, especially hostile ones such as DDoS attacks and critical facility disruption. Performance depicts the quality of a service, and its metrics vary depending on the service type; for example, the performance of video streaming services is usually quantified by startup delay, re-buffering ratio, and video bitrate. Security is about defending against unauthorized access, maintaining service integrity, and being able to trace malicious attackers.
To improve service performance, reliability, and security, service providers and researchers need a better understanding of the Internet infrastructure. For instance, to speed up the video transfer rate and thus shorten startup delay, a Content Delivery Network (CDN) would want to know the real-time characteristics of different IP paths in order to select the best one. As a security example, some global corporations may want to know how their internal traffic is routed in the Internet backbone, to ensure that their business secrets do not traverse certain politically sensitive areas.

To improve understanding, service providers and researchers perform extensive measurements. For example, CDNs deploy hundreds of vantage points to measure real-time latency to users; based on these measurements, they can choose the closest server to serve a user. ISPs analyze their internal infrastructure to detect any single points of failure in order to be resilient to facility disruption.
Measurement studies often inspire each other, whether in topic, data, or approach. We next paint a problem space of Internet measurement (Section 1.1). We then discuss the three parts of the space we explore: understanding IPv4 address usage, mapping Autonomous Systems (ASes) to organizations, and modeling submarine cable impact on web services (Section 1.3). In our three specific studies, we have systematically built up several approaches that help us address various types of limited data—indirect, incomplete, noisy, over-fit, and unknown data—to achieve our goals. We discuss how future work could benefit from these approaches, and finally summarize our contributions in Section 1.4.
1.1 Problem Space
We characterize Internet measurements by two major attributes: the goal to quantify and the data used to achieve the goal. These two characteristics can be viewed as the two axes of a two-dimensional problem space (see Figure 1.1). A goal concerns providing qualitative and quantitative results for an important issue about the Internet; for example, a goal might be to provide a mapping of the current broadband deployment within the U.S. The data contains information that can support analysis to reach the goal; for example, if the goal is a U.S. broadband mapping, then data containing the capacity of all U.S. residents' access links can be used to achieve it.

A goal may be specific, general, or somewhere in between. For example, a web service provider may wish to know the capacity distribution of user access links so it can develop several version tiers of its website to customize user experience. Studying the distribution of all access links on the Internet is a general goal, while examining only an ISP's users' links is a more specific goal.
Figure 1.1: The two-dimensional problem space of the thesis. (Axes: Goal, from specific to general; Data, from direct to indirect. Regions: infeasible, undesirable, and feasible and desirable.)
After defining the goal, the next step is to collect data that contains related information which can later be transformed to answer the goal.

We can characterize data by how directly (or indirectly) it relates to the goal. Direct data helps researchers reach the goal, while indirect data poses more obstacles that researchers need to find ways to overcome. For example, if the goal concerns the Quality of Experience (QoE) of a video streaming service, then a survey of the service's user-satisfaction levels would directly help researchers answer the goal. In contrast, if only indirect data is available, one may need additional work to achieve the same goal. Continuing the previous example, if only video-play statistics (such as duration and re-buffering ratio) are available, researchers need to first identify and extract related information and then untangle the complicated relationship between this information and the QoE goal. This analysis must be done systematically.
Ideally, direct data would always be available; in practice, direct data is often unavailable and not sufficiently general. Direct data is typically proprietary and not open to the public due to privacy and security concerns. Even when it is not proprietary, there might not exist efficient ways to collect it (for example, as in the satisfaction survey above). In addition, because direct data is only available to the owner and those they trust, most direct data is narrowly focused. Narrow data lacks generality and is therefore insufficient to serve general goals.
We explore all important parts of the problem space in this body of work. In Figure 1.1 we divide the problem space into three parts: infeasible, undesirable, and feasible and desirable. Our work covers both ends of the feasible-and-desirable space, the area where most prior Internet measurement studies fall. As Figure 1.1 shows, indirect data is often the only option for achieving a general goal. However, indirect data poses many hard challenges, and these challenges need to be handled systematically.
To demonstrate the potential of using indirect data to achieve both general and specific goals, we next introduce our overall thesis statement (Section 1.2), which asserts that carefully designed approaches can solve various data challenges. We explore important parts of the problem space to support this thesis statement (Section 1.3). If our thesis statement is true, it suggests a vast range of opportunities for future work that aims to improve understanding about the Internet.
1.2 Thesis Statement
The thesis of this dissertation asserts that systematic approaches can overcome data limitations to improve understanding about important aspects of the Internet infrastructure. We employ four systematic approaches—statistical, clustering, modeling, and what-if analysis—to overcome five types of data limitations: indirect, incomplete, over-fit, noisy, and unknown data. Through these approaches, we have improved our understanding of the efficiency of IPv4 address utilization, the management of IPv4 addresses and Autonomous Systems (ASes), and the robustness of Internet web services facing critical facility disruption. The problem space we have explored is by no means complete, and we do not claim that we have proved the thesis statement. However, if the thesis statement is true, it suggests a wide range of possibilities for future work to improve understanding about the Internet.
1.3 Supporting the Thesis
To give a high-level picture of how our work supports the thesis, we summarize our
three studies that are carefully chosen to support the thesis in this section (see Fig-
ure 1.2). Much prior work already falls into the problem space and provides evidence
to support the thesis. Our work provides examples that cover different important parts
of the space. The sub-space explored by our work and prior related work is not com-
plete. We therefore point out the parts of the problem space that can be addressed in
similar ways as ours at the end of this section. We believe future work can re-apply our
techniques to different important parts and the problem space can be completed in the
future.
To support our thesis, we choose for our first study a problem that is both important in itself and explores a part of the problem space new to prior work (Chapter 2, "address usage" circle in Figure 1.2). Specifically, we use novel data, ICMP probe responses, to achieve a goal that has not been studied for years: understanding the utilization and management of all IPv4 addresses. ICMP data falls on the general and indirect side, so achieving the goal is not trivial. In particular, ICMP data has two limitations: it is indirect and incomplete.
Figure 1.2: The position of our work in the problem space. (Same axes and regions as Figure 1.1. Circles mark our three goals, address usage, AS ownership, and service robustness, on the general and indirect side; squares mark the more direct, specific validation datasets for address usage and AS ownership.)
It is indirect because it only contains information about address reachability, which on the surface shows no connection to our goals. It is incomplete because many addresses do not respond to our probes.
We develop two novel approaches to solve the two limitations of ICMP data, respectively. One approach uses statistics to address the indirectness: it defines three metrics to characterize response patterns and then correlates these patterns with our goals. The other approach uses clustering to handle the incompleteness: it groups addresses into blocks and uses neighbors (addresses in the same block) to represent unresponsive addresses. These two systematic approaches enable us to overcome the limitations of ICMP probe responses and in turn to improve understanding of the utilization and management of IPv4 addresses. We validate our understanding with a specific but more direct dataset from USC operators (see the "address usage" square in Figure 1.2). The validation confirms the correctness of our study, which in turn provides a compelling example supporting our thesis statement.
In the second study, we explore an area of the problem space that has been explored before, but where prior results are neither general nor carefully validated (Chapter 3, "AS ownership" circle in Figure 1.2). In particular, we aim to obtain a mapping from all Autonomous Systems (ASes) to their owner organizations by using WHOIS records. This mapping is important for understanding the Internet ecosystem. Prior work has attempted to solve this problem, but its results are incomplete and inaccurate because WHOIS data poses many challenges. First, the data is indirect: data stored in WHOIS is mainly ASes' contact information (such as phone numbers and e-mails), not ownership of ASes. Second, the data is incomplete; that is, some ASes may lack some types of information. Third, the data is over-fit; that is, ASes of the same organization contain different information. Lastly, the data contains noise: some contact information belongs to third parties and so does not indicate ownership.
To address the four limitations of WHOIS data, we systematically build a novel clustering approach. This approach transforms the indirect data toward our goal by grouping ASes via common contact information, and then relates AS clusters to organization identities. It also addresses the incompleteness and over-fitting by combining multiple types of information as clustering input, and excludes noise by employing a special clustering algorithm. We validate our results against a direct mapping obtained from a specific organization and several mappings we inferred manually (the "AS ownership" square in Figure 1.2). The thorough validation shows that our results are much more accurate than prior work and are highly accurate overall. The success of our second study demonstrates the importance of systematic approaches in mapping ASes to organizations using WHOIS data. It also suggests that we may re-study other problems that were previously considered unsolvable.
Finally, we consider how the reliability and quality of web services are affected by submarine cable cuts in general (Chapter 4, "service robustness" circle in Figure 1.2). Others have explored parts of this problem, but only emphasize cuts that are less likely to happen and harms that are abstracted from real-world services and users [MOM09, WZMS07]. We approach this problem because solving it completely can be extremely useful in the real world. The goal is difficult because direct data is rare and indirect data has been insufficient. Direct data is rare because submarine cable cuts do not happen every day, and the impact of historical incidents is often not collected and recorded. Without direct data, researchers turn to indirect data. The main feasible way to transform the indirect data toward the goal is modeling. However, the data required for modeling is vast, and obtaining all of it is almost impossible. Prior work therefore often studies only a sub-goal for which data is available, in turn making incomplete progress or providing shaky results. While our work does not solve all cases, we show that we can get solid results for several important scenarios using our approaches.
While prior work is data-driven, we instead take a new problem-driven approach that focuses on the goal's usefulness. We also perform modeling, but we model all the essential parts related to the goal rather than only the parts with available data. For parts of the model without data, we conduct what-if analysis. The what-if approach studies a range of possibilities and is able not only to speculate about specific parameters, but also to project beyond current usage to possible future scenarios. Our third study demonstrates how to use alternative strategies to reach a goal.
Our three studies are not merely three specific examples to support the thesis statement. Each of them explores a different area and opens a much larger area of opportunity.
The first study demonstrates the potential of using data collected from active probing to study the vast Internet edge in general. We have studied a specific aspect of the edge, the utilization and management of addresses, via data collected from a specific probing mechanism, ICMP echo requests. Future work could apply the ideas of our two approaches to other important aspects of the edge, such as outages, access-link capacity, and host types (smartphones, tablets, or desktops), using either the same ICMP probes or other active probing mechanisms such as TCP/UDP port scanning. In fact, the statistical idea has already been used to study outages in recent work [QHP13, SS11]. In addition, our clustering algorithm can be re-used by other work that identifies the edge via addresses, and the basic idea within it is not limited to Internet measurement. In fact, real-world surveys that sample respondents practice this clustering notion [Gal13]: they assume that people in the same geographical area think similarly on certain topics, just as we assume that addresses in the same block are used in the same way.
Our second study achieved a goal that was attempted before with the same data; in general, this suggests the potential to re-examine many previously unsolved problems with more systematic approaches. More specifically, our work suggests two strategies for future work to consider when a straightforward approach does not seem to work. First, one can combine multiple types of data when no single type is sufficient. In this way, different types of data can complement each other's shortcomings and together achieve the desired quality, as has also been shown in prior work [AKW09, SBS08]. Second, one can use clustering approaches to separate noise from useful information. This technique has already been demonstrated by our first study, as well as by a recent study that aims to locate a large content provider's servers [CFH+13]. The basic idea of their work is that server locations can be determined by user locations: through clustering, the majority of users with correct locations dominates the minority with incorrect ones, in turn providing accurate server locations.
Last, our third study suggests a possible way to study problems for which all available data is insufficient. Specifically, we demonstrate how to perform what-if analysis to work around missing data and reach a similar goal. This approach can explore a range of possible values and answer what-if questions. Future work could apply this approach when some data is not available or, more usefully, to envision possible scenarios for future development, such as predicting service response-time distributions under different network configurations [TZV+08].

In summary, our three studies provide three strong examples to support the thesis; more importantly, they also suggest opportunities to study problems in a much larger area. These opportunities can potentially lead to a better understanding of the Internet, and in turn a more robust, more secure, and faster Internet.
1.4 Contributions
Our work makes four contributions: strong and new evidence to support the thesis,
identification of important and useful goals, systematic approaches to achieve the goals,
and specific results about the goals.
Our first contribution is to support our thesis: that systematic approaches can overcome data limitations and in turn improve understanding about the Internet infrastructure. We provide three studies, each supporting the thesis by exploring a different area of the problem space (Section 1.3). Our three studies demonstrate that our thesis applies in three specific areas, and they also suggest great potential in a much larger area of the problem space for future work.
The second contribution is that we demonstrate how to identify goals that are achievable. Not all goals in the problem space are solvable, because not all data is available. Therefore, one needs to find the right balance between an ambitious goal and the available data. In our third study, we demonstrate how to limit the scope of the goal to reduce the data needed and in turn make the problem manageable. By finding the right balance, researchers are able to contribute partial knowledge toward an ambitious goal instead of simply providing a falsely complete "solution" or leaving it with no solution.
Our third contribution is the set of novel approaches we develop to achieve our goals. These approaches address a wide variety of data limitations that also occur in many other problems, and so they can inspire future work that faces similar limitations. Such future studies can in turn accomplish more goals for a better understanding of the Internet, as we discussed in Section 1.3.
Finally, our fourth contribution is that we improve understanding about important aspects of the Internet infrastructure. More specifically, in the first study about address usage, we find that 2.5 million addresses, or 61% of the probed address space, show consistent responses in blocks of 64 to 256 adjacent addresses (/26 to /24 blocks). We also find that many blocks are only lightly used (about one-fifth of /24s show less than 10% utilization). Improving utilization is increasingly important as the IPv4 address space nears full allocation. In addition, we detect and quantify the use of dynamic address assignment. We observe that nearly 40% of /24 blocks appear to be dynamically allocated, and dynamic addressing is much higher in countries most recent to the Internet (more than 80% in China, while less than 30% in the U.S.).
In the second study about network ownership, we show that multi-AS organizations matter in today's Internet: some 36% of assigned AS numbers and 29% of actively routed ASes belong to multi-AS organizations. Importantly, this third of ASes is particularly prominent, announcing nearly two-thirds of all routed addresses. We also evaluate some effects of this organization-level structure on the Internet topology. Prior analysis typically focuses on an organization's "main" or "best-known" AS. We show that this traditional view greatly underestimates the geographic footprint and IP address coverage when compared to an organization-wide view that encompasses all of an organization's routed ASes. For example, the main AS omits a significant portion (40% to 91%) of the addresses in nearly one-third of organizations.
Lastly, in our third study about the robustness of web services facing submarine cable cuts, we define and quantify two classes of vulnerability in many developing countries' Internet infrastructure: low service self-sufficiency (most services are hosted abroad) and low geographical diversity of circuits. Countries that are less self-sufficient heavily depend on international cables (which are mostly submarine cables) for online web services. However, a single cable cut can bring down 67-100% of some countries' international capacity and in turn result in intolerable user QoE. In addition to the two specific classes of vulnerability we discover, we also provide general understanding about network design and guidelines for future modeling work. We observe four principles. First, topological connectivity does not imply data reachability, on almost every layer of the Internet. Second, fault-recovery mechanisms reside on many layers, and new ones are frequently added. Third, the effects of threats on real users are strongly influenced by user behavior and network architectures. Finally, reachability is the basis, but far from enough to capture QoE for modern users.
Chapter 2
Understanding Edge Users in the Visible Internet
Our thesis is to improve understanding about the Internet by using systematic
approaches to address various types of data limitations. In this chapter, we study one
specific aspect of the Internet infrastructure—utilization and management of all IPv4
addresses. The IPv4 addressing system is a vital component of the Internet infrastruc-
ture. Nearly four billion IPv4 addresses identify all hosts on the public Internet, and
together with the routing system, enable communication between hosts. The importance of IPv4 addresses demands extensive study, especially given that the IPv4 address space is already depleted and its utilization has not been studied for years. In this work, we focus on three issues about IPv4 addresses: common management block size (/24 or smaller?), utilization efficiency (address occupation rate), and assignment pattern (dynamic or static). Understanding these topics can improve resource planning and the effectiveness of auditing systems based on IP addresses. The content
of this chapter has been published at SIGCOMM [CH10].
We develop novel approaches that utilize ICMP probe responses collected from mil-
lions of addresses to reach our goals. ICMP data has two limitations: it is indirect and
incomplete (Section 2.2.1). To address these two data limitations, we develop two statistical approaches (Section 2.2.2 and Section 2.2.5) and one clustering approach (Section 2.2.3). These approaches enable us to improve general understanding of IPv4 addresses (Section 2.3).

Figure 2.1: The parts of the problem space the first study explores. (Same axes and regions as Figure 1.1, with the "address usage" goal circle and validation square marked.)
This chapter serves as strong evidence for our thesis statement by demonstrating that systematic approaches can overcome indirect and incomplete ICMP data to improve understanding about all IPv4 addresses. Figure 2.1 visually depicts the area of the problem space this chapter covers. Our main goal falls on the general side, and we explore indirect and general ICMP data to achieve it (the "address usage" circle). In addition, we use direct operational data obtained from USC operators to validate the correctness of our approaches (the "address usage" square); this part of the work falls in the lower right corner of the problem space.

This chapter demonstrates the potential of using data collected from active probing to study the vast Internet edge in general, as we discussed in Section 1.3. The two data limitations we encounter in this study, indirectness and incompleteness, are also common in other cases [QHP13, HPG+08, SS11]. Our study demonstrates feasible ways to handle them and could inspire future work facing similar data limitations.
2.1 Introduction
Previous Internet topology studies focused on AS- and router-level topologies [FFF99, SARK02, Gao01, MMU+06, DKF+07, EBN08, SBS08]. While this work explored the core of the network, it provides little insight into the edge of the Internet and the use of the IPv4 address space. The transition to classless routing (CIDR [FLYV93]) in the mid-1990s has made the edge opaque. Only recently have researchers begun to study edge-host behavior using server logs [XYA+07], web search engines on textual addresses [TRKN08], and ICMP probing [HPG+08].
Yet the network edge has seen great change and deserves study. How is CIDR
applied? How is dynamic addressing used? How widespread are low-bitrate edge links?
In this paper we use active probing to study these properties of the edge of the Internet.
Assumptions: In this paper we begin to explore the potential of clustering of active
probes to infer network address usage. Our work makes three assumptions:
1. Many active addresses will respond to probes,
2. Contiguous addresses are often used similarly, and
3. Patterns of probe responses and response delay suggest address usage.
While there are cases where these assumptions do not hold, we believe the assump-
tions apply to a large fraction of the Internet and so active probing can provide insight
into address usage.
We examined the first assumption and previously showed that active probes detect the majority of addresses in use, as verified with tests against a university and a random sample of the general Internet [HPG+08].
While this prior work established the collection methodology and error bounds, this paper provides the first evidence for the next two assumptions and their application to understanding network usage. The second assumption is contiguous use, which follows from the traditional administrative practice of assigning blocks of consecutive addresses to minimize routing table sizes. While there is no requirement that adjacent addresses be used for the same purpose, we will show that they are often used similarly (Section 2.3.1).
Finally, we assume that repeated active probing with ICMP provides information
about how addresses are used. We take advantage of both the pattern of positive, nega-
tive, or missing response, and the round-trip time (RTT) of the response. While a single
ICMP response provides only limited information (consent of the address to reply),
repeated probing can tell much more. For example, we use response patterns to distin-
guish intermittent from continuously used addresses, and we show that RTT can identify
low-bitrate edge links.
Figure 2.2 shows an example of what can be learned from probing one block of 256 addresses with prefix p.¹ In this figure, the 256 addresses in prefix p are mapped into two dimensions following a Hilbert curve (each quadrant of the square shows one-quarter of the addresses, recursively). Different shades indicate different ping response patterns from each address (white is non-responsive; green, availability; red, volatility; metrics
we define later in Section 2.2.2). Two green areas are blocks of addresses that are almost always up: the single address p.65/32 at the top center, and the 32-address block p.128/27. The two dark areas (the lower left quarter, p.192/26, and the bottom right eighth, p.160/27) are used only infrequently, with low availability and volatility. We can often confirm these probe-based observations against other sources (Section 2.4 discusses hostnames and operator-provided ground truth). The bottom of the figure shows how we automatically identify these regions (Section 2.2.3).

¹ Recall that IPv4 addresses are 32-bit numbers, usually written in the form a.b.c.d, where each component is an 8-bit portion of the whole address. Addresses are organized in blocks (sometimes called subnetworks) that are sized to powers of two. Blocks have a common prefix, the leading p bits of the address, written a.b.c.d/p. For example, 128.125.7.0/24 indicates a /24 block with 256 addresses in it of the form 128.125.7.x. We sometimes talk about blocks as p.0/24, where p represents the anonymized prefix.
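To make the figure's layout concrete, here is a minimal sketch (our illustration, using the textbook Hilbert-curve distance-to-coordinate conversion; the thesis does not publish its plotting code) that maps the 256 addresses of a /24 onto the 16x16 grid used in Figure 2.2, so that numerically adjacent addresses land in visually adjacent cells:

```python
import ipaddress

def hilbert_d2xy(n, d):
    """Standard Hilbert-curve conversion: distance d along the curve
    to (x, y) coordinates in an n-by-n grid (n a power of two)."""
    x = y = 0
    s, t = 1, d
    while s < n:
        rx = 1 & (t // 2)
        ry = 1 & (t ^ rx)
        if ry == 0:                      # rotate this quadrant
            if rx == 1:
                x, y = s - 1 - x, s - 1 - y
            x, y = y, x
        x += s * rx
        y += s * ry
        t //= 4
        s *= 2
    return x, y

# Lay out all 256 addresses of an example /24 (from the footnote above)
# on a 16x16 Hilbert grid; iterating an IPv4Network yields every address.
block = ipaddress.ip_network("128.125.7.0/24")
layout = {addr: hilbert_d2xy(16, i) for i, addr in enumerate(block)}
print(layout[ipaddress.ip_address("128.125.7.65")])  # grid cell for p.65
```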
Approach and Validation: From these assumptions we develop new algorithms to identify blocks of addresses with consistent usage (Section 2.2). We start with Internet survey data, where each address in around 24,000 /24 address blocks is pinged every 11 minutes for around one week [HPG+08]. From this dataset we derive several metrics about address usage. We then use these statistics to automatically identify blocks of consistent responsiveness.
Before applying these algorithms, we evaluate how often our assumptions hold. Our first question is therefore: are adjacent addresses used consistently, and can we discover them reasonably accurately? Before classless IP addressing [FLYV93], allocation strategies were aligned with externally visible address allocation, but since then there has been no way to easily evaluate how addresses are used. We explore these basic questions in Sections 2.3.1 and 2.4.1.
Applications: A first application of this approach is to understand how addresses
are managed, beginning with what block sizes are typical (Section 2.3.1). We find that
2,529,216 addresses, or 61% of the probed address space, show consistent responses in
blocks of 64 to 256 adjacent addresses (/26 to /24 blocks). Also, we observe that most
addresses (around 55%) are in /24 or bigger blocks.
[Figure: seven panels plotting variance (y axis, 0 to 1) against prefix length (/24 through /32) for the candidate regions p.0/24; p.0/25 -> (p.65/32); p.128/25; p.128/26; (p.192/26); (p.128/27); (p.160/27).]
Figure 2.2: Top: a /24 block (prefix is anonymized to p) with 4 plausible regions of different use. Bottom: our BlockSizeId algorithm (threshold = 2.0) identifies these regions (Section 2.2.3), with best-fit variance in (parentheses).
Another application is understanding how effectively addresses are used (Section 2.3.2). We find that many blocks are only lightly used (about one-fifth of /24s show less than 10% utilization). Improving utilization is increasingly important as the IPv4 address space nears full allocation; the cost of improving IPv4 efficiency is one to weigh against that of IPv6 transition.
Third, we detect and quantify the use of dynamic address assignment (Section 2.3.3). Dynamic addresses are used in some spam detection algorithms [XYA+07], and identifying dynamic addresses is important to estimate the number of computers that connect to the Internet [HPG+08]. We observe that nearly 40% of /24 blocks appear to be dynamically allocated, and dynamic addressing is much higher in countries most recent to the Internet (more than 80% in China, while less than 30% in the U.S.).
Finally, we distinguish blocks connected mainly by low-bitrate edge links from those
with broadband connections, identifying blocks used by dial-up and older mobile phones
(Section 2.3.4). Study of edge bitrate can help understand trends in technology deploy-
ment, and automatic identification of users of low-bitrate networks may allow websites
to automatically match content and layout. Edge links and policies also interact with
address utilization (Section 2.3.4); we show low-bitrate links are correlated with short connect times and sparse usage.
The contribution of this paper is therefore to develop new approaches to classify
Internet address usage and to apply those approaches to answer important questions in
network management. As with other studies of the live Internet, our approach must
employ incomplete information: our surveys cover randomly selected /24 blocks (not
larger) and do not inform us about addresses that refuse to respond. However, we sug-
gest that the approach is promising and our preliminary results provide new techniques,
adding to what is currently known.
2.2 Methodology
This section introduces our methodology: collecting raw data through an Internet survey (systematic ICMP probing), transforming that data into relevant observations by statistical approaches, identifying blocks of consistent use with a new clustering algorithm, classifying blocks into ping-observable categories, and distinguishing between low-bitrate and broadband blocks.
2.2.1 Data Collection: Surveying the Internet
We would like as much data about Internet addresses or hosts as possible, but we must
balance that desire against today's security-conscious Internet culture. Our data collection builds on prior Internet ICMP surveys that ping each address of about 1% of the allocated Internet address space approximately every 11 minutes for one week or longer [HPG+08].
We use a previous selection methodology [HPG+08], selecting around 24,000 /24
blocks from those that were responsive in a prior census of all allocated addresses.
We select blocks of addresses rather than individual addresses so we can study how
addresses are allocated and used. Our choice of /24 blocks limits our ability to observe
very large allocations, but allows the identification of blocks smaller than 256 addresses
(Section 2.3.1). As with prior work, half of the selected blocks are kept consistent across
multiple surveys and half are chosen randomly, enabling longitudinal studies while pro-
viding a subset that is selected with very little potential bias. We compare two surveys
in Section 2.4.3, showing that our study of 1% of the address space represents a large
enough fraction of the space to be representative.
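As an illustration of this selection policy, here is a minimal sketch under our own assumptions (the function and variable names are hypothetical, not from the survey tooling):

```python
import random

def select_survey_blocks(census_responsive, previous_selection, n=24000):
    """Keep half of the previous survey's /24 blocks (for longitudinal
    comparison); draw the other half at random from blocks that were
    responsive in the prior census (for low selection bias)."""
    kept = previous_selection[: n // 2]
    kept_set = set(kept)
    pool = [b for b in census_responsive if b not in kept_set]
    fresh = random.sample(pool, n - len(kept))
    return kept + fresh

# Hypothetical usage, with /24 prefixes represented as strings:
# blocks = select_survey_blocks(responsive_24s, last_survey_blocks)
```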
Approximately every 11 minutes, each address is probed. Probes are dispersed over this period and sent in pseudorandom order to avoid correlations due to outages. Probes taken every 11 minutes limit our ability to detect very rapid churn of dynamic addresses; however, prior studies of dynamic addresses placed typical use durations at 75 or 81 minutes [KFSC07, HPG+08], suggesting we have reasonable precision. Responses can
20
be classified into three broad categories: positive (echo reply), negative (for example,
destination unreachable), and non-response. In this paper we ignore all non-positive
responses. Packet loss can cause measurement inaccuracy, so we use 1-loss repair
to cope with singleton packet losses [HPG
+
08] (1-repair assumes an absent response
between two consistent responses is loss and interpolates accordingly). Network out-
ages can also distort our survey. We manually examine our survey and select a period
that has no local network outages.
All surveys but IT16ws [USC07a] cover more than one week, allowing us to detect
diurnal and weekly cycles.
Of course, using ICMP for probing has significant limitations. The most serious is
that large parts of the Internet are firewalled and choose not to respond to our probes.
Some form of this bias is inherent in any study using active probing. Prior studies
of a large university and a random sample of Internet addresses suggest ICMP probing
undercounts hosts by a factor of 30–50%, and that ICMP is superior to TCP-based probing [HPG+08]. We recognize this limitation as fundamental to our methodology, but we
know of no evidence or inference to suggest that the firewalled portions of the Internet
use significantly different allocation strategies than the more open parts of the Internet.
In addition, we confirm the accuracy of our results at USC (Section 2.4.1), and we show
similar accuracy for manual inspections of blocks drawn at random from the Internet in
Section 2.4.2. However, we are exploring additional ways to verify this assumption, and
investigation of the firewalled Internet is future work.
Table 2.1 shows the datasets we use in our paper. We use two ICMP surveys taken by USC [HPG+08]: IT17ws² and IT16ws; IT17ws is the main dataset used in this paper, while we use IT16ws, IT30ws, and IT31ws for validation in Section 2.4.3.

² The name IT17ws indicates: Internet Topology, the 17th full collection, "w" collected at ISI-west in Marina del Rey, and "s" indicates a survey rather than a full census.
Name              Start Date (# days)   /24 blocks probed   /24 blocks responding   Use
IT17ws [USC07b]   2007-06-01 (10)       22,367              20,849                  all
IT17wrs           2007-06-01 (10)       17,366              16,295                  §2.8
IT17wvs           2007-06-01 (10)       100                 100                     §2.4.2
IT17wbs           2007-06-01 (10)       200                 200                     §2.4.2
IT16ws [USC07a]   2007-02-16 (6)        22,365              20,900                  §2.4.3
IT30ws [USC09]    2009-12-23 (14)       22,381              20,227                  §2.4.3
IT31ws [USC10]    2010-02-08 (14)       22,376              19,909                  §2.4.3
LTUSCs [USC07c]   2007-08-13 (9)        768                 299                     §2.4.1
ISC-DS [Int07]    2007-01               hostnames           -                       §2.4
RIR [Reg07]       2007-06-13            block allocation    -                       §2.3

Table 2.1: Datasets used in this paper.
Not all /24 blocks we picked respond to our pings; however, most of them did respond at least once from at least one IP address. We collected LTUSCs to compare our inferences with network operators as
discussed in Section 2.4.1. Finally, we use a domain name survey from ISC [Int07] to
validate our conclusions (Section 2.4).
2.2.2 Representation: Observations of Interest
Since one survey provides more than 5 billion observations, it is essential to map that
raw data into more meaningful metrics. We call this step data representation. We define
three metrics to characterize address usage: availability, the fraction of time an address
is responsive; volatility, a normalized representation of how many consecutive periods
the address is responsive; and median-up, the median duration of all up periods. We characterize edge bitrate with two metrics: median-RTT and stddev-RTT, the median and standard deviation of the RTT values of all positive responses.
Metrics characterizing address usage

To define availability, volatility and median-up, let $r_i(a)$ be the positive (1) or non-positive (0) measurement for address $a$ (for all $i \in [1..N_p]$, where $N_p$ is the number of probes). We analyze these values after 1-loss repair [HPG+08]:

$$r_i(a) = \begin{cases} 1, & r_i(a) = 1 \lor (r_{i-1}(a) = 1 \land r_{i+1}(a) = 1) \\ 0, & \text{otherwise} \end{cases}$$

If each probe is made at time $t_i$, we can define the series of up durations of an address in a survey as

$$u_j(a) = t_{e_j} - t_{b_j}, \quad \forall j \in [1..N_u], \text{ where } r_i = 1 \;\forall i \in [b_j..e_j] \text{ and } r_{b_j-1} = 0,\; r_{e_j+1} = 0$$

(each up duration is a consecutive run of positive probes from $b_j$ to $e_j$, inclusive). There are $N_u$ up durations in total, where $N_u < N_p$. We can now clarify that availability, volatility, and median-up are given as:

$$A(a) = \frac{1}{N_p} \sum_{i=1}^{N_p} r_i \qquad V(a) = N_u / \lceil N_p/2 \rceil \qquad U^*(a) = \mathrm{median}(u_j, \forall j \in [1..N_u])$$

Availability is normalized: the fraction of times a host is reachable. Volatility is normalized by $\lceil N_p/2 \rceil$, the maximum possible number of up periods (alternating value at each probe). For example, if $N_p = 16$ and the responses $r_i$ of address $a$ are $[1,1,0,0,1,1,0,1,0,0,0,1,1,1,1,1]$, then we first apply 1-loss repair to $r_7$, because $r_6$ and $r_8$ are both positive responses. After 1-loss repair, the responses $r_i$ are $[1,1,0,0,1,1,1,1,0,0,0,1,1,1,1,1]$. Now there are three up periods ($N_u = 3$) of lengths 22, 44, and 55 minutes. $A(a) = 11/16 = 0.688$, $V(a) = 3/(16/2) = 0.375$, and $U^*(a) = \mathrm{median}(22, 44, 55) = 44$ minutes. (We also sometimes use un-normalized volatility, $V^*(a) = N_u$, simply the count of up periods.) We considered normalizing median-up to measurement duration, but chose not to because such normalization distorts observations about hosts that are not nearly always present.
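To make these definitions concrete, the following minimal Python sketch (illustrative only, not the dissertation's actual tooling; the function names are ours) computes 1-loss repair, $A(a)$, $V(a)$, and $U^*(a)$ from a vector of probe responses, assuming the survey's 11-minute probe interval:

    import math
    from statistics import median

    PROBE_INTERVAL = 11  # minutes between successive probes of one address

    def one_loss_repair(r):
        """Flip a lone 0 between two 1s to 1, assuming it was packet loss."""
        r = list(r)
        for i in range(1, len(r) - 1):
            if r[i] == 0 and r[i - 1] == 1 and r[i + 1] == 1:
                r[i] = 1
        return r

    def address_metrics(r):
        """Return (A, V, U*) for one address's 0/1 response vector."""
        r = one_loss_repair(r)
        n_p = len(r)
        ups, run = [], 0                 # up durations, in minutes
        for bit in r + [0]:              # sentinel 0 closes a trailing run
            if bit:
                run += 1
            elif run:
                ups.append(run * PROBE_INTERVAL)
                run = 0
        A = sum(r) / n_p                       # availability
        V = len(ups) / math.ceil(n_p / 2)      # normalized volatility
        U = median(ups) if ups else 0          # median-up (minutes)
        return A, V, U

Run on the example response vector above, this sketch reproduces $A(a) = 11/16 \approx 0.688$, $V(a) = 0.375$, and $U^*(a) = 44$ minutes.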
While these metrics are not orthogonal, each has a purpose. Availability shows how
effectively addresses are used. High volatility indicates addresses that are intermittently
used and often dynamically allocated. Median uptime suggests how long an address is
used.
These estimates assume the $r_i$ observations are correct and represent a single host.
Because we know our data collection omits firewalled hosts (Section 2.2.1), we gener-
ally ignore addresses that never respond. More troubling are addresses used by multiple
computers at different times—such addresses actually represent multiple hosts. The
purpose of dynamically allocated addresses is exactly to share one address with multi-
ple computers, and we know dynamic assignment is common (see Section 2.3). If those
hosts are used for different purposes (servers sometimes, and clients others), usage infer-
ence will be difficult and unreliable. However, we believe that it is relatively uncommon
for a dynamic address to transition between client and server use, since servers usu-
ally require stable addresses. (There is some use of dynamic DNS to place services on
changing addresses. We believe such use is rare for most of the world but plan to explore
this issue in future work.)
Metrics characterizing edge bitrate

While address usage considers all ICMP responses (positive and negative), round-trip time estimates are only present in positive responses. To estimate bitrate, we therefore define $R(a)$ to be the set of RTT values extracted from positive responses for address $a$, that is, the set of all $R_i(a)$ where $r_i(a) = 1, \forall i \in [1..N_p]$. (So $|R(a)| \le |r(a)|$.) From this set we compute the median of $R(a)$, $R_{1/2}(a)$, and its standard deviation, $R_\sigma(a)$, when we have sufficient samples ($|R(a)| \ge 10$).

We use these metrics to identify low-bitrate edge links. Median-RTT tracks typical response delay, while stddev-RTT estimates its variance. In Section 2.2.5 we use these metrics to identify low-bitrate blocks.
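As a sketch of how these two metrics might be computed (assuming RTTs are collected in milliseconds; the helper name is ours, and we use the population standard deviation):

    from statistics import median, pstdev

    def rtt_metrics(rtts):
        """rtts: RTT samples (ms) from an address's positive responses.
        Returns (median-RTT, stddev-RTT), or None with fewer than 10 samples."""
        if len(rtts) < 10:
            return None   # too few samples to be statistically valid
        return median(rtts), pstdev(rtts)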
2.2.3 Block Identification
We next use our observations about addresses to evaluate block size using a clustering
algorithm that considers the address hierarchy.
We assume blocks are allocated in sizes that are powers of two, so block identifica-
tion is the process of finding a prefix where addresses in the block are used consistently.
We find that some blocks are not used consistently, and different addresses show very
different stability. In our analysis we will keep dividing these mixed-use blocks until
they are consistent, if necessary devolving to a single address per block. Another chal-
lenge is that many blocks have gaps where a few addresses are used differently, or are
not responsive, perhaps because they are unused or firewalled. Our algorithm weighs the choice of larger blocks with some inconsistencies against smaller but more homogeneous blocks.
We only consider /24 blocks and smaller because our data collection method gathers samples of that size. Exploration of larger blocks is an area of potential future work.
Clustering background
In clustering of address responsiveness, we want to determine blocks that appear to be
used consistently.
We therefore use partitional clustering, one of the two general approaches to clus-
tering in this well developed field [JD88]. Partitional clustering places each element
into exactly one cluster; we choose it over the alternative, hierarchical clustering, which
would place items into multiple, hierarchically nested clusters. Jain defines partitional
clustering as: "Given n patterns in a d-dimensional metric space, determine a partition
of the patterns into K groups, or clusters, such that the patterns in a cluster are more
similar to each other than to patterns in different clusters” [JD88]. We build on the
basic approaches of clustering for our method: pattern matrix, feature normalization,
and using an elbow criterion to select the best choice.
Although we follow traditional clustering theory, Internet addresses impose a unique
restriction: addresses are only grouped into blocks that are contiguous, sized in powers of two, and aligned at multiples of their size. For these reasons, we cannot directly use
traditional algorithms such as K-means, but instead use components of existing clustering approaches. The most radical difference from traditional clustering is that addresses are only clustered with some number of immediate neighbors, not with arbitrary other addresses. Our algorithm therefore finds blocks of consecutive addresses by definition; however, the size of the blocks it finds depends on how consistently the addresses are used.
A Pattern Matrix defines the features over the space being clustered. In our case, each address is defined by its three features $(A(a), V(a), U^*(a))$, and the space is a number of disjoint /24 blocks. (We also use $(R_{1/2}(a), R_\sigma(a))$ later in Section 2.2.5 to identify block connection types.) Each /24 block has a $256 \times 3$ pattern matrix $x_{ij}$, where $j$ enumerates the three features and $i$ enumerates each address in a /24 block. From our 24,000 /24 blocks we get 24,000 pattern matrices in total.
Although our definitions of $A$ and $V$ are already normalized to the range $[0,1]$, their distribution may be skewed, and $U^*$ is not normalized. We therefore employ feature normalization to give each feature equal weight. We define the normalized feature vector $x'_{ij}$, given the mean $\mu_j$ and standard deviation $\sigma_j$ of each feature $j$:

$$x'_{ij} = \frac{x_{ij} - \mu_j}{\sigma_j} \qquad (2.1)$$

We use Euclidean distance between the normalized feature vectors to measure dissimilarity between two elements $i$ and $k$ over their features:

$$d(i,k) = \sqrt{\sum_{j=1}^{3} (x'_{ij} - x'_{kj})^2} \qquad (2.2)$$
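A minimal sketch of Equations 2.1 and 2.2 in Python (assuming numpy and a 256x3 pattern matrix whose columns are not constant, since a constant column would make $\sigma_j$ zero):

    import numpy as np

    def normalize(x):
        """Eq. 2.1: z-score each feature column of the pattern matrix."""
        return (x - x.mean(axis=0)) / x.std(axis=0)

    def dissimilarity(xn, i, k):
        """Eq. 2.2: Euclidean distance between addresses i and k."""
        return float(np.sqrt(((xn[i] - xn[k]) ** 2).sum()))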
Many clustering algorithms, like K-means, require that the number of clusters be chosen
in advance. We cannot do that because clusters correspond to block size, a quantity we
wish to discover. We also cannot simply minimize variance, because variance is trivially
minimized in the degenerate case where each cluster is a singleton address.
We therefore employ an elbow criterion, a common rule of thumb to determine the
number of clusters. We split each cluster into two whenever splitting adds significant
information, and we stop when we pass the “elbow” of the curve and more clusters add
little benefit. We measure information by the sum of variance in each cluster across
the population—homogeneous clusters will have low variance; splitting them adds no
new information. Heterogeneous clusters have high variance, and splitting them into
two more self-consistent pieces reduces the sum of variance, increasing the amount of
information.
Our algorithm to identify block sizes
Our algorithm follows the basic structure from above: we define a pattern matrix of
addresses by features, normalize the features, then recursively search for clusters until
reaching the elbow. We fill in the details next.
The algorithm is a recursive function, BlockSizeId, taking a 256-row address-feature matrix $(A(a), V(a), U^*(a))$ and a given prefix length $P$. Since the blocks in our survey are disjoint, we iterate over each /24 block in our survey separately, beginning with $P = 24$.
BlockSizeId then computes the intra-block unnormalized variance, $vsum_p$, for all possible prefix lengths $p$ ($P \le p \le 32$). It then selects the smallest prefix length $p_{elbow}$ where longer prefixes show minimal change.

$$n_p = 2^{p-P}, \quad s_p = 2^{32-p}$$

$$\mu_{bj} = \frac{1}{s_p} \sum_{i=(b-1)s_p+1}^{b\,s_p} x'_{ij}$$

$$v_b = \sum_{j=1}^{3} \sum_{i=(b-1)s_p+1}^{b\,s_p} (x'_{ij} - \mu_{bj})^2, \quad 1 \le b \le n_p$$

$$vsum_p = \sum_{b=1}^{n_p} v_b, \quad P \le p \le 32$$

Here $n_p$ is the number of sub-blocks with prefix length $p$, and $s_p$ is the size of sub-blocks (number of addresses) with prefix length $p$. For example, if $P = 24$ and $p = 27$, then $n_p = 8$ and $s_p = 32$. $\mu_{bj}$ is the mean value of the $j$th feature over the addresses in the $b$th sub-block, and $v_b$ is the intra-block unnormalized variance of the $b$th sub-block; in this example, it would be the intra-block unnormalized variance of the $b$th /27 sub-block.
We define minimal change in the elbow algorithm with an empirically selected constant threshold, $\delta = 2.0$. We select $p_{elbow}$ as the smallest $p$ such that $vsum_i - vsum_{i+1} < \delta$ for all $i$, $p \le i \le 31$. If $p_{elbow} = P$, then no division of this block reduces variance significantly and we terminate our recursive algorithm, declaring $P$ the consistent block size. If this case does not hold, we have determined there are splits of the block that appear to be more consistent. We then split the block in half and recurse, calling BlockSizeId with the next longer prefix, $P+1$, on each half of the data. In principle, a block could be split repeatedly until it is composed of a single address (since singletons drive variance to zero). In Section 2.3.1 we show that, in practice, our threshold causes the majority of Internet addresses to fall into larger blocks of consistent use.
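As a concrete restatement, here is a minimal Python sketch of BlockSizeId under the definitions above. It is illustrative rather than the dissertation's implementation: the names are ours, it assumes numpy and a normalized feature matrix with one row per address of the /24, and it encodes the elbow condition exactly as reconstructed above.

    import numpy as np

    DELTA = 2.0  # the empirically selected elbow threshold

    def vsum(x, p):
        """Total intra-sub-block variance when the rows of x are cut into
        sub-blocks of prefix length p (2^(32-p) addresses each)."""
        size = 2 ** (32 - p)
        total = 0.0
        for b in range(0, len(x), size):
            sub = x[b:b + size]
            total += ((sub - sub.mean(axis=0)) ** 2).sum()
        return total

    def block_size_id(x, base, P=24):
        """Recursively partition x into consistently used blocks,
        returned as (base address offset, prefix length) pairs."""
        if P == 32:
            return [(base, 32)]
        vs = {p: vsum(x, p) for p in range(P, 33)}
        # smallest p past which every further split changes vsum < DELTA
        p_elbow = next((p for p in range(P, 32)
                        if all(vs[i] - vs[i + 1] < DELTA
                               for i in range(p, 32))),
                       32)
        if p_elbow == P:
            return [(base, P)]        # this block is already consistent
        half = len(x) // 2            # otherwise split in half and recurse
        return (block_size_id(x[:half], base, P + 1) +
                block_size_id(x[half:], base + half, P + 1))

Because splitting can only reduce the within-block variance sum, the threshold $\delta$ is what keeps the recursion from devolving to singletons.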
A block identification example
To illustrate BlockSizeId we next show analysis of an example /24 block taken from
the Internet. The top of Figure 2.2 shows the whole block, while the bottom graphs
show how the algorithm identifies four sub-blocks. As described earlier (Section 2.1),
a human identifies two bright green areas (or light grey) indicating high availabil-
ity: p.65/32 and p.128/27, and two dark areas showing low availability and volatility,
p.160/27 and p.192/26. Hostnames for this block show it is used for wireless access, and
the green areas are servers and routers, while the dark areas are dynamically assigned
by DHCP.
The first graph in the middle of the figure shows the first pass of BlockSizeId, with
$P = 24$ covering all of block p.0/24. In the graph, the $y$-axis shows variance for division of the block into each possible power-of-two smaller size. Here $p_{elbow} = 25$ and $p_{elbow} > P$, so we recurse to $P = 25$.
The second row of two graphs shows these recursive invocations, p.0/25 on the left and p.128/25 on the right. For p.0/25 with only one responsive address, the left graph shows a consistent variance regardless of subdivision, and $p_{elbow} = P = 25$, so this prefix is consistent and this recursion terminates. For p.128/25 on the right, a subdivision reduces variance and so we recurse again to $P = 26$.
The algorithm continues until either $p_{elbow} = P$ or $P = 32$. In this example, the initial /24 block is divided into p.65/32, p.128/27, p.160/27, and p.192/26.
2.2.4 Ping-Observable Block Classification
We can now take remote measurements, convert them into observations, and use them
to identify blocks of consistent neighboring addresses. We generalize our observations
on addresses into observations about a blockb by taking the median value of each obser-
vation:
(A(b);V (b);U
(b)) = median(A(a);v(a);U
(a))8a2b
We then classify these blocks into five ping-observable categories, using $(A(b), V(b), U^*(b))$. We use four thresholds: $\alpha_H = 0.95$, indicating high availability; $\alpha_L = 0.10$, indicating low availability; $\beta = 0.0016$, for low volatility ($V(b) = \beta$ is equivalent to $V^*(b) = 1$, i.e., up only once); and $\gamma = 6$ hours, corresponding to a relatively long uptime.
Always-stable: highly available and stable.

$$(A(b) \ge \alpha_H) \land (V(b) \le \beta)$$

Sometimes-stable: changing more often than always-stable, but frequently up continuously for long periods (high $U^*(b)$).

$$(U^*(b) \ge \gamma) \land (A(b) \ge \alpha_L) \land (A(b) < \alpha_H \lor V(b) > \beta)$$

Intermittent: individual addresses are up for short periods (low $U^*(b)$):

$$(U^*(b) < \gamma) \land (A(b) \ge \alpha_L) \land (A(b) < \alpha_H \lor V(b) > \beta)$$

Underutilized: although addresses are occasionally used, they show low $A(b)$ values.

$$A(b) < \alpha_L$$

Unclassifiable: we decline to classify blocks with few active responders, currently defined as any block where fewer than 20% of addresses respond.
We selected these categories to split the majority of the $(A(b), V(b), U^*(b))$ space, informed by evaluations of dozens of blocks (573K addresses in total) backed by manual probing of hosts and hostnames (for details about the categories, see Appendix 2.8; about the $(A(b), V(b), U^*(b))$ space, Appendix 2.7).

While we have defined these categories based on what we can observe, the categories are correlated with real-world address usage. Always-stable is typical of servers, routers, and always-up end hosts: manual inspection of randomly chosen reverse hostnames indicates that more than 80% of servers and routers have always-stable addresses. Sometimes-stable correlates with addresses whose hostnames indicate statically-assigned user computers, businesses (names containing "biz" or "business"), and some dynamically assigned but always-on connections (cable modems or DSL). Intermittent characterizes the majority of cable and DSL hosts and some active dial-up hosts. We find that many address blocks, often identified as dial-up by hostname, are categorized as underutilized (more than 50% of hostnames that indicate dial-up have $A(b) < \alpha_L$). Appendix 2.8 relates these ping-observable categories to several hostname-inferred usage categories that represent real-world address usage.
We examine sensitivity to our choices in Section 2.4.3.
Appendix 2.7 shows how these terms divide the space.
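A minimal sketch of this classification in Python (the threshold names follow our reconstruction above; per-address metrics are as computed earlier, with $U^*$ in minutes):

    from statistics import median

    ALPHA_H, ALPHA_L = 0.95, 0.10    # high / low availability
    BETA = 0.0016                    # low volatility (one up period)
    GAMMA = 6 * 60                   # long uptime: 6 hours, in minutes

    def classify_block(addr_metrics, responsive_frac):
        """addr_metrics: (A, V, U*) per responding address in the block;
        responsive_frac: fraction of the block's addresses that respond."""
        if responsive_frac < 0.20:
            return "unclassifiable"
        A = median(m[0] for m in addr_metrics)
        V = median(m[1] for m in addr_metrics)
        U = median(m[2] for m in addr_metrics)
        if A >= ALPHA_H and V <= BETA:
            return "always-stable"
        if A < ALPHA_L:
            return "underutilized"
        return "sometimes-stable" if U >= GAMMA else "intermittent"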
2.2.5 Identifying Low-bitrate Blocks
Block categories correlate with edge link technologies, but they are not one-to-one—
we find that dial-up and DSL appear as both intermittent and underutilized. To better
understand technology trends, we next show that variance across repeated RTT mea-
surements can identify blocks with low-bitrate edge links. We define low-bitrate as less
than 100Kb/s, such as dial-up (56Kb/s) and GPRS (57.6Kb/s). We first present an RTT model, and then apply it.
Background: components of RTT
Round-trip time has several components:

$$RTT = 2(D_{cpu} + D_{prop} + D_t + D_q), \quad \text{where } D_t = S/B \text{ and } D_q = nD_t$$

The first two components, per-hop processing delay in the routers ($D_{cpu}$) and distance-based propagation delay ($D_{prop}$), are largely independent of the edge link. Transmission delay ($D_t$), however, is based on packet size ($S$, approximated as constant for this simple model) and the bottleneck link's bitrate, $B$. Queuing delay ($D_q$) is a multiple of $D_t$ based on queue length. (All terms are for the full round-trip and do not require path symmetry; we assume the prober is well connected.)
Our goal is to distinguish addresses with low-bitrate edge links from those with broadband links. In the simplest possible case, we first assume the targets are one hop from our prober and there is no congestion, so $D_{cpu}$ and $D_{prop}$ are negligible and $D_q = 0$. Here the only difference is transmission delay, and we can easily distinguish common edge technologies (Table 2.2) since $D_t$ dominates RTT. Even a simple threshold on $R_{1/2}(a)$ would then distinguish slow edge links, since our 64B probe takes 9ms over a 56Kb/s dial-up link but much less than 1ms at broadband speeds (1Mb/s or faster).
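The transmission delays in Table 2.2 follow directly from $D_t = S/B$; a small worked sketch (the computed values closely match the table; small differences reflect packet-size assumptions):

    LINKS = {"56Kb/s dial-up": 56e3, "1Mb/s ADSL": 1e6, "1Gb/s Ethernet": 1e9}
    for name, bps in LINKS.items():
        for size_bytes in (64, 1500):
            d_t = size_bytes * 8 / bps           # transmission delay, seconds
            print(f"{name}: {size_bytes}B -> {d_t * 1e3:.4g} ms")
    # e.g., 64B at 56Kb/s -> 9.143 ms; 1500B at 56Kb/s -> 214.3 ms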
In practice, our prober is distant from most of the Internet and we encounter interfering traffic. At long distances, $D_{cpu}$ and $D_{prop}$ can dominate RTT, often approaching 200ms for communications between continents, completely obscuring the effects of the edge we wish to observe via $D_t$.

Queuing delay is another source of noise, but it also provides the means to see through distance. With queuing delay, $D_q = nD_t = n(S/B)$, where $B$ is the bitrate on the backlogged link. Queuing delay can happen at any location along the path, either in the backbone or at the edge link. We assume that most queuing occurs at the edge link: although backbones are highly multiplexed, they consist of high-bitrate, carefully managed links, and we expect queues to be short ($n$ is low) and to clear quickly ($D_t$ is only 12µs at 1Gb/s, even for a 1500B interfering packet; see Table 2.2). For slow edges, each packet in the queue ahead of a probe adds tens or hundreds of milliseconds, since a 1500B packet has $D_t$ of 12ms at 1Mb/s ADSL and 212ms at dial-up speed, and $D_q = nD_t$. If we assume slow links are likely at the edges, then queuing ($D_q$) and RTT are dominated by the effects of this edge link.
Identifying low-bitrate links from RTT
We next turn to identifying blocks with low-bitrate edge links in three steps: isolating the $D_q$ component of RTT, generalizing results to blocks, and then classifying blocks as low-bitrate.
link type          D_t (64B packet)   D_t (1500B packet)
56Kb/s dial-up     9ms                212ms
1Mb/s ADSL         0.5ms              12ms
1Gb/s Ethernet     0.5µs              12µs

Table 2.2: Common link types and the transmission delay for 64B and 1500B packets, respectively.
Any given RTT observation is made up of the four components identified previously. With one observation we cannot separate those contributions. However, a week-long survey provides hundreds of observations for most addresses. If routing is generally stable, all components of RTT are constant except for queuing delay, while $D_q$ varies depending on how backlogged the edge link is each time it is probed. We therefore look at variation in RTT to infer $D_q$, as measured by $R_\sigma(a)$, the standard deviation of the RTT. Routing techniques such as load balancing or wide geographic distribution of adjacent addresses [FVFB05] are sources of noise; we use a fairly high threshold to mitigate their effects.

Standard deviation is well defined only with multiple measurements and for positive probe responses; we ignore $R_\sigma(a)$ when $|R(a)| < 10$ as statistically invalid, and we ignore RTTs for negative responses since they may be generated by a router on either side of the edge link. Many addresses fail to reply positively to probes: in our survey, only about 41% of addresses from blocks that have any responses at all respond, and about one-twentieth of these respond fewer than 10 times. Our analysis of networks shows that most are composed of large, homogeneous blocks (we show this data in Section 2.3.1), so we extend our address-level observations to blocks by defining a block-level estimate of RTT variance as the median of all address-level standard deviations: $R_{1/2,\sigma}(b) = \mathrm{median}(R_\sigma(a)) \; \forall a \in b$.
Low-bitrate block: we therefore distinguish low-bitrate blocks from broadband by large variance:

$$R_{1/2,\sigma}(b) > \omega$$

We select $\omega = 300$ms because it is roughly $1.5\times$ the delay of a full-size packet at dial-up speeds (1500B takes 212ms at 56Kb/s), and based on evaluation of dozens of low-bitrate blocks. We examine the validity of this classification approach and the threshold in Section 2.4.
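A minimal sketch of this test, reusing the per-address RTT standard deviations from Section 2.2.2 and the threshold symbol as reconstructed above:

    from statistics import median, pstdev

    OMEGA = 300.0  # ms; the low-bitrate threshold from the text

    def is_low_bitrate(rtts_by_addr):
        """rtts_by_addr: dict mapping each address in a block to its list
        of RTT samples (ms) from positive responses. Addresses with fewer
        than 10 samples are ignored, as in the text."""
        stds = [pstdev(r) for r in rtts_by_addr.values() if len(r) >= 10]
        return bool(stds) and median(stds) > OMEGA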
2.3 Understanding Edge Address Usage and Low-bitrate Access
We next use the data to explore several questions in network management: what are
typical sizes of consistently used Internet address blocks? How effectively are they
being used? And how prominent is dynamic addressing?
To help answer these questions we compare our observations with the allocation
data from the regional Internet registries (RIRs) [Ame08]. This RIR data includes the
time and country to which each address block is assigned. Although not completely
authoritative, this data is the best public estimate for address delegation of which we are
aware. We collect data from each of the RIRs, selecting data dated June 13, 2007 to
closely match our survey data.
2.3.1 Block Sizes
We begin by considering block sizes. Figure 2.3 and Table 2.3 show our analysis of IT17ws.
Figure 2.3: Number of addresses in each block size and ping-observable categories in IT17ws. (Stacked bar chart: x-axis, block prefix length /24 through /32; y-axis, number of addresses, 0M to 4M; bars subdivided into unclassifiable, underutilized, intermittent, sometimes-stable, and always-stable.)
      addrs   always-       sometimes-                                 classifiable  unclassifiable
pfx   /blk    stable        stable        intermittent  underutilized  (100%)        [100%]          blocks    addresses
/24   256     1,603(18%)    2,517(29%)    2,673(30%)    1,994(23%)     8,787*        3,411[27%]      12,198    3,122,688
/25   128     323(23%)      523(38%)      295(21%)      237(17%)       1,378*        920[40%]        2,298     294,144
/26   64      346(21%)      617(38%)      378(23%)      274(17%)       1,615*        787[33%]        2,402     153,728
/27   32      432(20%)      855(40%)      506(23%)      361(16%)       2,154†        872[29%]        3,026     96,832
/28   16      759(20%)      1,301(34%)    993(26%)      734(19%)       3,787†        1,139[23%]      4,926     78,816
/29   8       2,077(21%)    3,190(32%)    2,355(24%)    2,227(23%)     9,849†        0               9,849     78,792
/30   4       3,312(19%)    5,656(33%)    4,679(27%)    3,707(21%)     17,354†       0               17,354    69,416
/31   2       4,195(16%)    9,867(37%)    7,864(30%)    4,566(17%)     26,492†       0               26,492    52,984
/32   1       52,646(30%)   42,847(24%)   43,266(25%)   36,707(21%)    175,466†      0               175,466   175,466
entire IT17ws dataset: (1,603,086 addrs. in non-responsive blocks) + (4,122,866 in responsive blocks)          22,367    5,725,952

Table 2.3: Number of blocks of each size in IT17ws (10 days). Unclassifiable percentages are relative to all blocks; other percentages are relative to classifiable blocks. Asterisks mark consistent block sizes; daggers mark non-consistent sizes.
This data shows that addresses in the Internet are most commonly managed in blocks
with /24 prefixes. In fact, even though there are more opportunities for small blocks, we
find more /24 blocks than blocks of size /25 through /29. Since our data collection
only probes consecutive runs of 256 addresses, this prevalence suggests we may need to
probe larger consecutive areas to understand if even larger blocks are common but not
seen in our survey.
There are a very large number of the smallest blocks, with about as many /29s as
/24s, and roughly twice as many /30s as /29s, and /31s as /30s. These results may be
artifacts of our block discovery algorithm: it is statistically easier for an address to be
consistent with a few neighbors in a small block than with 128 neighbors in a /25. We
next re-examine the second assumption underlying our work: are contiguous addresses
often used similarly? If we define consistent usage as just the largest three block sizes
(/24 through /26) that we successfully identify, we find 2,529,216 addresses are used
consistently, or 44% of the probed address space.
While clearly defined, this percentage does not accurately present how much of
the Internet is consistently used. Some of the probed address space is unclassifiable
(with consistent usage but fewer than 20% of addresses responding), or completely non-
responsive. We cannot say anything about blocks that fail to respond at all. The status of
unclassifiable blocks is uncertain, but a conservative position is to declare them inconsis-
tent. A more representative evaluation of the Internet is therefore to compare how much
is definitely used consistently (2.5M addresses in large blocks) against that is effectively
inconsistent (the 506,178 addresses in small blocks) and the possibly inconsistent (the
1,087,472 addresses in unclassifiable blocks). This computation suggests that a lower
bound of 61% of the responsive Internet is used consistently, We believe this supports
our second assumption: the majority of contiguous addresses are used consistently.
2.3.2 Address Utilization
Given block sizes, we next evaluate how efficiently addresses are used in those blocks.
Inefficient IPv4 usage represents an opportunity for improvement, but greater efficiency
comes with greater management cost. Management cost of IPv4 should be weighed
against simpler-to-manage IPv6.
Quantifying underutilization and possible causes
The underutilized ping-observable category is defined as a sequence of addresses that
are used less than 10% of the time (Section 2.2.4). Large blocks of such infrequently
used, public IP addresses generally indicate inefficient address utilization. (Such low
utilization seems to make sense only in unusual circumstances, such as a DTN satellite
only infrequently in view [Fal03].)
The underutilized column of Table 2.3 shows that these blocks are quite common, accounting for 17–23% of blocks of each size. Although not shown in the table, the mean availability of addresses in /24 underutilized blocks is only 3.2% over our 10-day observation (IT17ws). Manual examination of addresses shows the mean number of up periods is less than 5 ($V^*(b) = 4.6$), typically lasting around 1 hour ($U^*(b)$).
To understand causes of underutilized blocks we examine the address hostnames of
these /24 blocks. We find 63% of addresses provide hostnames, and many of these host-
names (34%) include keywords that suggest how the address is used. For example, dial
and dsl suggest edge link technologies, and dynamic or pool suggest dynamic address
assignment. (Full details are in Appendix 2.8.2.) Among the various usages suggested by hostnames, underutilized blocks are correlated with the pool (68%), ppp (56%), and dial (54%) hostname categories.
We hypothesize that this low utilization is tied to dial-up technology itself. Dial-up
lines are often shared with voice communication, encouraging short, intermittent use.
Yet dial-up POPs must be provisioned to handle peak loads. A secondary factor may be
trends shifting customers from dial-up to higher speed connections. Perhaps old dial-up
provisioned blocks are simply in lower demand than previously. Finally, while dial-up
utilization is low, we cannot tell how many users each dial-up address serves. Perhaps
address reuse is high enough to make these apparently under-provisioned addresses a
bargain relative to supporting the same number of users with always-on connections.
Further study to understand these trade-offs is future work.
Reversing the question, we can ask: which blocks are well utilized? Again examining hostnames, we find that blocks with the keywords static, cable, biz, res, server, and router have very few underutilized addresses. Static addresses are usually assigned to fixed-location desktops or businesses, and these computers tend to maintain their Internet connection and occupy their address for a fairly long time. In addition, static addresses are often billed at a flat rate per month, while dynamic addresses may incur a time-metered charge.
Locations and trends of underutilization
Evaluating underutilization by country may highlight policy differences by regional reg-
istries or ISPs. After merging our data with RIR data, Table 2.4 shows utilization by
country. We see that the United Kingdom and Japan have the largest fraction of under-
utilized blocks, 40–60%, suggesting potential local policy differences. We expected a
large number of underutilized blocks in the U.S. because of wide deployment of dial-up.
While the U.S. has the largest absolute number of underutilized blocks, its fraction is
relatively low.
Table 2.5 shows that the fraction of underutilized blocks is fairly consistent across
all five RIRs, suggesting differences are likely due to country, not RIR policies.
Finally, the lower right graph in Figure 2.4 shows when underutilized blocks were
allocated. The fraction of blocks by age seems fairly evenly distributed, except for
peaks in very early allocations (1984 and unknown), where more than 60% of the blocks
                   always-     sometimes-                               classifiable  unclassifiable
code  country      stable      stable       intermittent  underutilized (100%)        [100%]          blocks
US    US           673(27%)    1,106(45%)*  231(9.3%)     472(19%)      2,482         1,383[36%]      3,865
CN    China        39(4.1%)    117(12%)     615(65%)*     171(18%)      942           132[12%]        1,074
JP    Japan        383(48%)*   50(6.2%)     18(2.2%)      350(44%)*     801           288[26%]        1,089
DE    Germany      65(10%)     125(20%)     388(61%)*     62(9.7%)      640           56[8.0%]        696
KR    Korea        21(4.6%)    131(29%)     237(52%)*     68(15%)       457           142[24%]        599
FR    France       18(4.1%)    227(52%)*    167(38%)      28(6.4%)      440           58[12%]         498
GB    UK           39(13%)     37(12%)      52(17%)       179(58%)*     307           180[37%]        487
BR    Brazil       7(3.9%)     35(19%)      86(48%)*      52(29%)       180           58[24%]         238
      all others   358(14%)    689(27%)     879(35%)      612(24%)      2,538         1,114[31%]      3,652
/24 blocks in entire IT17ws dataset:                                    8,787         3,411[27%]      12,198

Table 2.4: The distribution of /24 blocks in ping-observable categories for the top 8 countries and all others. Asterisks (bold in the original) indicate categories with more than 40% of blocks, and each country's dominant category. Countries are sorted by total number of blocks.
           always-     sometimes-                               classifiable  unclassifiable
registry   stable      stable       intermittent  underutilized (100%)        [100%]          blocks
RIPE NCC   408(14%)    798(27%)     1,084(37%)*   661(22%)      2,951         990[25%]        3,941
APNIC      473(18%)    422(16%)     1,091(40%)*   716(27%)      2,702         795[23%]        3,497
ARIN       706(27%)    1,185(45%)*  258(9.7%)     512(19%)      2,661         1,481[36%]      4,142
LACNIC     13(3.2%)    94(23%)      218(53%)*     86(21%)       411           120[23%]        531
AFRINIC    3(4.9%)     18(30%)      21(34%)*      19(31%)       61            19[24%]         80
/24 blocks in entire IT17ws dataset:                            8,787         3,411[27%]      12,198

Table 2.5: The distribution of /24 blocks in ping-observable categories of the 5 regional registries. Asterisks (bold in the original) indicate each registry's dominant category. Registries are sorted by total number of blocks.
assigned are underutilized. We believe these earliest allocations were made with rela-
tively little assessment of organizational need, and large initial allocations allow contin-
ued use with minimal concern for efficiency.
2.3.3 Intermittent and Dynamic IP Addressing
Addresses are intermittently used by statically addressed hosts that are only sometimes
connected to the network, or by hosts that obtain dynamically assigned addresses from
a pool, typically with DHCP [Dro97].
Dynamic assignment of addresses allows ISPs to multiplex many users over fewer
addresses. Dynamic addressing also provides ISPs the business opportunity of offering
Figure 2.4: Trend of ping-observable category change in IT17ws /24 blocks. (Percentage of blocks, 0 to 100%, in each category (underutilized, intermittent, sometimes-stable, always-stable) by allocation year, from unknown and 1985 through 2005.)
static addresses as a higher-priced service, and potentially makes it more difficult for
users to operate servers. Dynamic addressing has been promoted to users as a secu-
rity advantage, on the theory that a compromised computer is more difficult to contact
if its IP address changes. Dynamic addressing prevents users from running services
or accepting unsolicited inbound connections (for example, for incoming SIP calls),
although applications employ work-arounds such as STUN [RWHM03].
Recent studies [XYA+07, TRKN08, HPG+08] have examined dynamic addressing
for several reasons. First, dynamic addresses complicate some network services, such
as reputation systems. They also are correlated with spam; some spam filters penal-
ize dynamic addresses because of the frequent exploitation of dynamically addressed
home computers by spammers. We next show that our approach can identify dynamic addresses and suggest causes and trends that have previously been invisible.
Quantifying dynamic addressing
We believe that the intermittent and underutilized ping-observable categories correspond to the short-term dynamically assigned addresses of interest. Although we cannot quantify what fraction of these categories actually use DHCP, our belief is supported by hostname analysis. Hostnames show that intermittent blocks commonly include the keywords cable (57%), dynamic (48%), and dsl (41%), all of which often use short- or moderate-term dynamic addressing, while underutilized blocks often include the keywords pool (68%), ppp (56%), and dial (54%).
Table 2.3 shows that 40–50% of classifiable blocks (depending on block size) appear
to be dynamic. Even with wide deployment of always-on connectivity, nearly half of
Internet addresses are used for short periods of time. For intermittent blocks, the mean
availability is just under 30%, with nine use periods over the week and a mean $U^*$ around 2.5 hours.
Locations and trends for dynamic addressing
Analysis by country can suggest how political, cultural and policy factors affect address-
ing. Table 2.4 shows that nearly two-thirds of Chinese blocks are intermittent, with Germany, Korea, and Brazil all nearly half or more. Several factors may contribute to this use.

China has a very large population and is a relative latecomer to the Internet; from the beginning of commercial deployment in China, ISPs have planned to make the best use of the relatively few IPv4 addresses per potential user. They have therefore promoted dynamic use to improve address utilization. An interesting direction for future work would be to evaluate how effective their utilization is. Unfortunately we only know address responsiveness, not the number of actual computers or users per address needed to answer this question.
Time-metered billing is another reason for intermittent use. Parts of China and Ger-
many employ metered billing, encouraging intermittent use even with broadband. Other
potential reasons for intermittent use include turning off a router to conserve energy, carrying over habits learned from dial-up use to broadband, and potentially continued use of dial-up connections shared with voice communication.
Evaluation of usage by registry (Table 2.5) shows larger differences in use. We see
that intermittent blocks are very prominent under APNIC and LACNIC (40–53%), five
times more common than for ARIN in North America (9%). We believe these differ-
ences stem largely from policies of the countries the RIRs serve, not the RIRs them-
selves. We discussed Chinese practice above; several Latin American countries have
limited choice in ISPs, with national providers adopting pricing or policies that strongly
favor dynamic address assignment even for business use (as confirmed by LACNIC per-
sonnel [LAC09]). We speculate that the large number of sometimes-stable blocks in
ARIN is because of long DHCP lease times and always-on use by home users, enabled
by relatively plentiful numbers of IPv4 addresses per user.
Finally we consider trends in dynamic addressing. The lower left of Figure 2.4 shows
that intermittent blocks are more common in new address allocations. This observation
is consistent with a recognition of eventual full allocation of the IPv4 address space
and efforts to manage addresses in countries newer to the Internet. The rise in intermit-
tent blocks matches a corresponding fall in always-stable blocks (top left, Figure 2.4).
In addition to growing demand for dynamic addressing, this trend suggests most new
addresses are added to provide intermittent service for home users. While the absolute number of always-stable businesses and servers grows, their fraction of all addresses is shrinking.
2.3.4 Understanding Edge Bitrates
To understand causes for utilization, we next look at block connectivity to the Internet.
Figure 2.5: Comparison of availability for low-bitrate (top line) and non-low-bitrate (bottom line) classifiable /24 blocks in IT17ws. (CDF of /24 blocks, 0 to 100%, by availability, 0 to 1; annotations mark the underutilized region and the always-stable/sometimes-stable/intermittent region.)
In Section 2.2.5 we suggested that RTT variance can indicate low-bitrate edge links
such as dial-up and pre-3G mobile telephones. Here we apply this analysis to provide a
new tool to understand how edge networks correlate with underutilization. Future work
includes using this analysis to evaluate deployment trends and to automatically adapt
websites to the user’s network.
To understand the usage of low-bitrate blocks, Figure 2.5 shows the availability for
blocks broken into low- and non-low-bitrate groups by RTT stability (as defined in Sec-
tion 2.2.5). From the underutilization threshold of $A(b) < 0.1$, we see that nearly 80% of
low-bitrate blocks are underutilized, compared to only 20% of non-low-bitrate blocks.
Therefore low-bitrate connections strongly correlate with sparse use.
To explain this correlation between edges and underutilization, we use hostnames
and whois to infer operational usage—roughly, how blocks are managed (dynamic or
static) and what type of edge-link they are (dial-up, PPP, DSL, etc.). Such inferences
are less than ideal, but they provide the best available ground truth about the general
Internet. Among the 200 randomly selected low-bitrate blocks, we successfully inferred
the operational usage of 46 blocks: 41 dial-up, 2 PPP, and 3 DSL blocks. Dial-up and
Figure 2.6: Comparison of median-up between low-bitrate and non-low-bitrate classifiable /24 blocks in IT17ws. (CDF of /24 blocks, 0 to 100%, by median-uptime, 0 to 6 hours.)
PPP are indicators of low-bitrate edge connections, while DSL is representative of broadband connections. While not providing definitive causes of underutilization, this suggests a correlation between low use rates, low bitrates, and dial-up edge networks.

To support this explanation, we studied the median-uptime $U^*$ for both low-bitrate and broadband blocks, shown in Figure 2.6. We found that up durations in the vast majority of low-bitrate blocks are quite brief: 85% of low-bitrate blocks have $U^*(b) < 0.5$ hours, compared to only 15% of other blocks. This observation suggests that low-bitrate, dial-up blocks are provisioned for a large number of potential users who do not use the network concurrently. We hypothesize two reasons for this: first, low-bitrate edge connections lead to long delays, which lead to a less satisfying user experience and therefore shorter connection times; second, billing for low-bitrate edge connections is usually usage-based instead of flat-rate, so it is natural for users to disconnect from the network soon after their task is done.
2.4 Validation of Understanding and Consistency
We have now shown data to support our three assumptions: addresses respond to
probes (the subject of prior work [HPG+08]), adjacent addresses have similar use (Sec-
tion 2.3.1), and probes suggest use (Section 2.3.2). These results have two limitations,
however. First, since they are based on active probing, they are only available for the
portion of the Internet that responds to probes. Evidence suggests that somewhat more than half of the publicly addressed hosts respond [HPG+08]; extension of these results
to the whole Internet is an area of continuing work. Second, our conclusions are based
on data taken from one survey (IT17ws) from the general Internet. While not biased, we
cannot compare these results to the true network configuration that is distributed across
thousands of enterprises.
We next present three additional studies to further validate these assumptions and
address the second limitation. First we evaluate data taken from USC, a smaller and
potentially biased dataset, but one where we have ground truth from the network oper-
ations staff. We then extract small random subsets of the general Internet and infer the
ground truth by manual inspection using ISC-DS hostname data [Int07] and the whois
database. Finally, we compare our Internet-wide results with additional data taken one-
half to two years later to verify that our conclusions do not reflect something unusual in
a single measurement or time.
2.4.1 Validation within USC
We first compare our methodology against ground truth obtained directly from the net-
work operators at USC. This section uses dataset LTUSCs (Table 2.1) and applies the same analysis used on our general Internet dataset.
Figure 2.7 shows block sizes and classifications from our approach.
Figure 2.7: Number of addresses in each block size and ping-observable categories in LTUSCs. (Stacked bar chart: x-axis, block prefix length /24 through /32; y-axis, number of addresses, 0 to 40,000; bars subdivided into unclassifiable, underutilized, intermittent, sometimes-stable, and always-stable.)
Block identification and classification at USC shows a similar prevalence of /24
blocks (85% of USC addresses are in /24s, compared to 61% in the Internet). However,
USC shows many fewer intermittent and underutilized blocks compared to the Internet
(only 8% among classifiable /24 blocks, Figure 2.3); we expect such variation across
enterprises. We next use this data to evaluate how our assumptions affect our ability to
accurately find block size, consistency, and usage.
Validation of block identification and sizes
To validate our estimation of block sizes, we compare our analysis with the internal
routing table from our network administrators. This data helps quantify the accuracy
of our approach, measuring the false positive rate, blocks that we detect but that do not
actually exist, and the false negative rate, blocks that exist but we fail to detect.
Table 2.6 summarizes our comparison for all /24 blocks. (Smaller blocks are not
present in our ground-truth routing table.) We find our approach correctly identifies
57% of all blocks in ground truth. Although we find the majority of blocks, we have a
significant number of false negatives, failures to detect blocks. For this dataset, these
category                                 blocks   percentage
in routing table                         243      100%
  false negative                         105      43%
    not in use                           19
    not responding                       28
    few responding                       12
    single-block multi-usage             46
      /25 to /27                         9
      /28 to /32                         37
blocks identified                        147      100%
  correctly identified                   138      57% / 94%
  false positive                         9        6.1%
    multi-block single-usage             9

Table 2.6: Evaluation of the accuracy of block identification at USC against ground-truth sizes (correctly identified blocks are 57% of ground truth and 94% of identified blocks).
false negatives show our approach is somewhat incomplete. On the other hand, if we
evaluate our algorithm by what it says, we see very few false positives, correctly identi-
fying 94% of all blocks we detect. For this dataset, almost no false positives show our
approach is quite accurate in what it asserts.
To understand accuracy, we looked at when our approach incorrectly identifies
blocks. All nine false positives are due to multiple blocks with common usage. We
examined each incorrect block and found that USC administrators had placed two logi-
cally different blocks on adjacent addresses, but these administratively different blocks
were used for similar purposes. Since our evaluation is based on external observations
of use, we believe there is no way any external observer could determine these adminis-
trative distinctions.
For false negatives, we found several sources of missed block identification. We
found that many blocks were either in the routing table but not assigned to locations or
services (19 not in use), or in the routing table and assigned, but with no ping responses
(28 not responding), or filled with only a few responders (12 few responding). In each
case, our algorithm refuses to make usage assertions on unused or sparsely used space.
category                                    blocks   percentage
classified                                  138      100%
  unclassifiable (false negative)           52       38%
  incorrectly classified (false positive)   3        2.1%
    always stable (dynamic)                 3
  correctly classified (true positive)      83       60%
    intermittent (dynamic)                  4
    sometimes stable (dynamic)              5
    intermittent (VPN)                      1
    underutilized (VPN/PPP)                 2
    always stable (lab)                     2
    sometimes stable (lab)                  2
    always stable (building)                25
    sometimes stable (building)             42

Table 2.7: Evaluation of block classification accuracy at USC against ground truth.
Non- or few-responding blocks may be due to firewalls, reflecting a limitation of our
probing method. Not-in-use blocks would be impossible for any external observer to
confirm. In principle our algorithm could identify non-responsive blocks, but it is diffi-
cult for external observation to distinguish unused from firewalled space.
Finally, other false negatives occur due to blocks that have been administratively
assigned as /24s but then are used for different purposes. Nine of these show large,
consistent patterns, possibly indicating delegation at the department level that is not
visible to university-wide network administrators. If so, these represent incompleteness
in our ground-truth data. Smaller mixed-use blocks represent violations of our assertion
that adjacent addresses are used consistently.
Validation of block classification and usage
Table 2.7 shows the accuracy of our approach for the 138 blocks we classify. We declare
38% unclassifiable (false negatives); here we have discovered the correct block size but
decline to declare a ping-observable category because the block is only sparsely respon-
sive. We correctly classify the majority of blocks, selecting ping-observable categories
that are consistent with the use of 60% of blocks. We mis-identify three blocks (a 2%
false positive rate), all reported as dynamically allocated but observed as always stable.
These blocks perhaps represent DHCP-assigned addresses with very long lease times
for computers that are always up.
Validation of edge bitrate
We also validated our edge-bitrate assessment. USC has only two low-bitrate blocks
(dial-up blocks running PPP). Experimental evaluation of LTUSCs successfully identi-
fies both as low-bitrate, and does not mis-identify any of the 136 other blocks as low-
bitrate. While this 100% accuracy is reassuring, the proximity of prober and target sug-
gests that our validation with random Internet blocks (Section 2.4.2) is a more general
result.
2.4.2 Validation in the General Internet
Our main validation results use USC because there, network operations staff can provide ground truth. We would like to evaluate how well our approach works on the general
Internet as well, since commercial use may differ from USC. We evaluate our ping-
observable classification results for 100 randomly selected /24 blocks, and enlarge the
sample size for our edge-bitrate validation in Section 2.4.2.
While we cannot get ground truth from network operations for the general Inter-
net, we can get clues about block size and usage from hostnames and the whois
database. Hostnames are often assigned in patterns that suggest common administration and access method. For example, hostnames in 4.168.174/24 follow the convention dialup-4.168.174.*.dial1.losangeles1.level3.net. Such consistent naming conventions strongly suggest a common administrator (in this case, Level 3). Second, the presence of "dial" in the name suggests dial-up usage and a low-bitrate connection. Whois information provides an alternative view. For example, hostnames in 70.204.31/24 follow the convention *.sub-70-204-31.myvzw.com. The names suggest common administration, but not how the block is used. Whois indicates this block is assigned to Cellco Partnership DBA Verizon Wireless, suggesting mobile phone usage.
Of the 100 /24 blocks, 47 are not found in the ISC dataset and 7 have too few hostnames; because we cannot obtain ground truth for these blocks, we exclude them from our validation. Of the 46 remaining /24 blocks, 37 are identified as /24 by hostnames, 7 are potential /24 blocks (hostnames do not follow exactly the same convention but share similar keywords), 1 block is inferred to be a /25, and 1 is inferred to be composed of one /30 and two /32 blocks. Because most addresses are in /24 blocks, for simplicity Table 2.8 gives statistics only for the 37 blocks identified by hostnames as /24s and the 7 potential /24 blocks. The /25 is not successfully identified due to too few responders; the /30 and two /32s are correctly identified as small server blocks.
Validation of block identification and sizes
We randomly select 100 probed /24 blocks and compare their clustering results with our best estimates of ground truth from manual analysis of hostnames and whois in Table 2.8 (37 are identified as /24 by hostnames).
category                                    blocks   percentage
/24 randomly selected                       100      100%
  decided (/24 inferred from hostname)      37       37% / 100%
    correct                                 25       68%
    wrong (false negative)                  12       32%
      few responding                        6
      single-block multi-usage              6
  undecided                                 63       63%
    no hostname                             45
    few hostnames                           7
    potential /24 inferred                  7
      correct                               7
      has sub-/24 groupings                 4

Table 2.8: Evaluation of block identification accuracy of random Internet blocks.
As shown in Table 2.8, the rate of correct identification (68%) is even higher than in the USC validation (57%). The reason is that address space in the general Internet is used at a coarser granularity than in a campus network, so blocks tend to be more consistent.

Although these results are encouraging, most of the blocks are /24s, so we look deeper to see how our method works on smaller blocks. We randomly selected 20 /24 blocks identified as compositions of smaller blocks. Four of them have too few hostnames; 12 are actually /24 blocks that break into smaller blocks due to inconsistent usage; 2 are partially responding (only half of the block responds, and the other half is probably firewalled); and 2 /25s are correctly identified. This suggests that most of the smaller blocks we find are potentially large blocks broken into pieces by inconsistent usage; however, we did find some consistent small blocks confirmed by hostnames.
Validation of block classification and usage
To validate the ping-observable classification, we look at the 25 correctly identified /24
blocks in the previous 100 random /24 blocks. To validate the low-bitrate classification,
category                                    blocks   percentage
classified                                  20       100%
  unclassifiable (false negative)           2        10%
  incorrectly classified (false positive)   1        5%
    sometimes-stable (server)               1 (1)
  correctly classified (true positive)      17       85%
    always-stable (server, biz)             3 (2, 1)
    sometimes-stable (dsl, static)          3 (2, 1)
    intermittent (dsl, cable)               4 (3, 1)
    intermittent (mobile, dynamic, dial)    4 (2, 1, 1)
    underutilized (pool-pond, dsl, client)  3 (1, 1, 1)

Table 2.9: Evaluation of block classification accuracy of commercial blocks.
because of the low percentage of low-bitrate blocks, we enlarged our random sample to
200 /24s.
For the ping-observable classification, of the 25 correctly identified /24 blocks we classified 20; the other 5 were unclassifiable because of a lack of hostname and whois information. We summarize the validation results in Table 2.9. Because we have a loose mapping between ping-observable categories and hostname-inferred usage categories, the rate of correct classification is relatively high (85%). For example, 2 DSL blocks are sometimes-stable, 3 DSL blocks are intermittent, and 1 DSL block is underutilized; we consider all three of these mappings correct. One might argue that host firewalls could lower the availability metric and thereby lead blocks to be classified as underutilized; however, most DSL blocks are still distributed among the intermittent and sometimes-stable categories, which weakens the claim that host firewalls are widespread. Because the mapping from hostname-inferred to ping-inferred categories is not one-to-one, our estimates of "ground truth" here are imprecise; we do not claim this result is definitive, but merely suggestive that our classification works well over the general Internet.
Validation of edge bitrate
Among the random 100 /24 blocks, only 10 (6 dsl and 1 cable, 1 dial and 2 mobile-phone) can be used as ground truth to validate our edge-bitrate assessment. Known low-bitrate blocks are rare in the Internet, so we want more samples. Simply adding more random blocks to the previous 100 and manually inspecting them is time-consuming, so we use an automated, though somewhat coarser, way to add samples: we randomly pick classifiable /24 blocks with a consistent hostname naming convention containing keywords (dsl, cable, dial) that indicate the edge access link type. This process can be easily automated with hostname data alone, without querying the whois database. Thus, in addition to the previously identified 10 blocks, we add 26 random hostname-inferrable edge blocks, for a total of 36 blocks as ground truth. Table 2.10 summarizes our analysis.

For the 36 blocks where we can infer edge types to evaluate accuracy, we successfully classify all low-bitrate blocks and all broadband blocks: our low-bitrate detection algorithm achieves 0% false-negative and 0% false-positive rates. Three broadband blocks had confusing hostnames containing dial, which would suggest low bitrate; however, when we confirmed with the ISP's network operations, these blocks are actually fast 3G/UMTS wireless connections. Their $R_\sigma$ values are 20ms, 41ms, and 43ms respectively, suggesting 1Mb/s links or faster.
2.4.3 Consistency Across Repeated Surveys
We next wish to understand if the parameters of our data collection or analysis have
a disproportionate effect on our conclusions about Internet-wide address usage. To do
so, we compare analysis of IT17ws with that of three new datasets, IT16ws, taken five
months earlier; IT30ws and IT31ws, taken 30 months later. These surveys allow us to
category                                             blocks   percentage
hostname-inferrable edges                            36       100%
  low-bitrate blocks (6 dial, 2 mobile)              8
    $R_{1/2,\sigma}(b) > \omega$ (true positive)     8
    $R_{1/2,\sigma}(b) \le \omega$ (false negative)  0        0%
  broadband (21 dsl, 4 cable, 3 3G)                  28
    $R_{1/2,\sigma}(b) > \omega$ (false positive)    0        0%
    $R_{1/2,\sigma}(b) \le \omega$ (true negative)   28
      clear hostname                                 25
      confusing hostname                             3

Table 2.10: Evaluation of low-bitrate block classification accuracy of commercial blocks.
consider both adjacent surveys at two different times, and longer-term trends. Half of
the /24 blocks in the survey are consistent across each survey, and half are randomly
chosen in each survey (full details of selection methodology are elsewhere [HPG+08]).
This comparison therefore observes whether network changes alter observations of the
same blocks, and whether different sets of blocks show very different behavior.
Our estimates of the block size distributions are almost identical in the four surveys.
If we define s_p as the vector of the number of blocks of prefix length p, the
correlation coefficients of the vectors for IT17ws against all other surveys are all
above 0.9989. We conclude that a random sample of 1% of the Internet is large enough
that the block size observations are hardly affected if half of the sample is changed.
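To make the computation concrete, the following is a minimal sketch (not the survey
toolchain itself) of this consistency check in Python, using the total block counts
from Tables 2.11 and 2.12 as the s_p vectors:

# Correlate per-prefix-length block counts between two surveys.
# Counts are the "blocks" totals (/24 through /32) from Tables 2.11
# (IT17ws, 6 days) and 2.12 (IT16ws, 6 days).
import numpy as np

s_it17 = np.array([12082, 2408, 2439, 3052, 4830, 9947, 18082, 29965, 179450])
s_it16 = np.array([12022, 2432, 2428, 3018, 4910, 10069, 18894, 30278, 179690])

r = np.corrcoef(s_it17, s_it16)[0, 1]  # Pearson correlation coefficient
print(f"correlation: {r:.4f}")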
Our work assumes that contiguous addresses are often used consistently. Following
Section 2.3.1, we consider blocks of size /24 through /26 as consistent, and size /27
through /32 as inconsistent. These percentages are quite consistent in adjacent surveys,
with a possible slow downward trend over time: 43% of the probed Internet in IT16ws,
44% in IT17ws, and 38% in both IT30ws and IT31ws. Results are similar if we consider
the percentage of the responsive Internet, with 60% and 61% in IT16ws and IT17ws, and
57% and 58% in the later two surveys.
Finally we consider the temporal consistency of our ping-observable classification
across the four surveys. We show that temporally adjacent surveys show consistent
classification results, while more distant surveys show greater divergence. First we
compare each adjacent pair of surveys. For IT17ws and IT16ws, initially we found the
correlation of the number of blocks in each category to be generally good but not great
across all block sizes: it ranged from 0.663 to 0.938 for blocks smaller than /29, but
the correlation for /24 blocks was only 0.349. Examination showed that around 500 blocks
were shifting between always- and sometimes-stable. This shift occurred because of a
change in volatility and our selection of the always-stable requirement that V(b) ≤ ε
with ε = 0.0016. For very stable hosts, a few outages can change V(b) significantly.
Examination showed that IT16ws and IT17ws are of different duration (6 and 10 days).
A longer duration makes it easier to distinguish between sometimes- and always-stable
blocks. When we keep the observation duration constant by considering only a 6-day
subset of IT17ws, the correlation coefficient for /24 classification rises to 0.626. We
conclude that most ping-observable classifications are good, but the separation between
the sometimes- and always-stable categories is somewhat sensitive. We plan to investigate
this sensitivity in future work by down-sampling the survey data in time. We confirm this
result by comparing the later surveys, where the full 14 days of IT30ws and IT31ws show
correlations ranging from 0.77338 to 0.987. Thus we conclude that results taken
near the same time are fairly consistent.
We next compare surveys taken two years apart: IT17ws and a 6-day subset of
IT31ws (data for IT31ws is in Appendix 2.6). There are two main differences: first,
many blocks shift from sometimes-stable to always-stable, where IT31ws has 2,459
always-stable /24 blocks compared to only 2,001 before (33% vs. 23%). The percent-
age of intermittent and underutilized blocks is similar. Second, we see a larger number
of /32 blocks in the later survey, up to 198k from 179k, with many of the new blocks
shown as /32 always-stable blocks. This change may represent additional servers on the
Internet. As described above, we know the always/sometimes-stable border is sensitive
to observation duration, so future work is required to understand whether these shifts
are meaningful.
2.5 Conclusions
We have shown how to improve understanding about IPv4 addresses by systematically
collecting and analyzing ICMP probe responses. We have confirmed that contiguous
addresses are often used similarly and managed in blocks equal to or bigger than /24.
We have also found a significant number of blocks are only lightly used and nearly 40%
of /24 blocks appear to be dynamically assigned. We have validated our claims at USC,
against randomly selected Internet blocks, and over multiple years.
This chapter provides a strong example within the problem space (Section 1.1) to
support our thesis statement that systematic approaches can overcome data limitations to
improve understanding about the Internet infrastructure (Section 1.2). In particular, we
demonstrate that statistical and clustering approaches address the indirect and
incomplete ICMP data and improve our understanding of three important issues of IPv4
addresses, a vital component of the Internet infrastructure.
In addition to being strong evidence supporting the thesis statement, the analysis
from this study indicates the usefulness of two systematic approaches outside of
understanding address usage. Statistical approaches examine statistics of data (such
as sum, mean, median, and standard deviation). They can be particularly useful to
extract more interesting information. This technique has also been used to measure
outages, edge hosts, and firewalls on the Internet [QHP13, HPG+08, SS11].
pfx  addrs  always-stable  sometimes-stable  intermittent  underutilized  classifiable (100%)  unclassifiable [100%]  blocks  addresses
/24  256  2,001 (23%)  2,328 (27%)  2,399 (28%)  1,832 (21%)  8,560*  3,522 [29%]  12,082  3,092,992
/25  128  376  553  292  222  1,443*  965 [40%]  2,408
/26  64  401  602  371  241  1,615*  824 [34%]  2,439
/27  32  550  834  485  319  2,188†  864 [28%]  3,052
/28  16  824  1,258  913  616  3,611†  1,219 [25%]  4,830
/29  8  2,506  3,033  2,279  2,129  9,947†  0  9,947
/30  4  4,052  5,492  4,881  3,657  18,082†  0  18,082
/31  2  5,933  10,431  8,757  4,844  29,965†  0  29,965
/32  1  53,433 (30%)  40,266 (22%)  47,497 (26%)  38,254 (21%)  179,450†  0  179,450
entire IT17ws dataset: (1,602,412 addrs. in non-responsive blocks) + (4,123,540 in responsive blocks)  22,367  5,725,952
Table 2.11: Number of blocks of each size in IT17ws (6 days). Unclassifiable percentages
relative to all blocks; other percentages relative to classifiable blocks. Asterisks:
consistent blocks; daggers (†): non-consistent.
Clustering approaches group similar entities into clusters so they can be studied
together. They are especially helpful in dealing with incomplete data because the
absence of data about particular entities is neutralized by the whole cluster. This
technique has also been applied outside of Internet measurement in real-world
surveys [Gal13]. Clustering approaches are not only good at handling incomplete data;
they can also be used to address over-fit and noisy data, as we demonstrate in the
next chapter.
2.6 Appendix: Details about Surveys at Different Dates
The main body of this work provided details for 10-day analysis of IT17ws in Table 2.3.
This appendix provides data for the 6-day subset of IT17ws (Table 2.11) to compare to
the 6 days of IT16ws (Table 2.12) and 6 days of IT31ws (Table 2.13). These results were
discussed in Section 2.4.3 to establish that the results are generally consistent across
adjacent surveys of the same length, and to describe the evolution over surveys years
apart.
We also provide all 14 days of IT31ws in Table 2.14 for comparison to the 6-day
subset.
pfx  addrs  always-stable  sometimes-stable  intermittent  underutilized  classifiable (100%)  unclassifiable [100%]  blocks  addresses
/24  256  2,173 (26%)  2,007 (24%)  2,409 (28%)  1,915 (23%)  8,504*  3,518 [29%]  12,022  3,077,632
/25  128  424  489  339  215  1,467*  965 [40%]  2,432
/26  64  421  610  409  218  1,658*  770 [32%]  2,428
/27  32  547  808  532  326  2,213†  805 [27%]  3,018
/28  16  899  1,246  962  634  3,741†  1,169 [24%]  4,910
/29  8  2,644  2,909  2,441  2,075  10,069†  0  10,069
/30  4  4,513  5,514  5,031  3,836  18,894†  0  18,894
/31  2  6,278  10,327  8,581  5,092  30,278†  0  30,278
/32  1  54,916 (31%)  39,830 (22%)  45,879 (26%)  39,065 (22%)  179,690†  0  179,690
entire IT16ws dataset: (1,609,610 addrs. in non-responsive blocks) + (4,115,830 in responsive blocks)  22,365  5,725,440
Table 2.12: Number of blocks of each size in IT16ws (6 days). Unclassifiable percentages
relative to all blocks; other percentages relative to classifiable blocks. Asterisks:
consistent blocks; daggers (†): non-consistent.
pfx  addrs  always-stable  sometimes-stable  intermittent  underutilized  classifiable (100%)  unclassifiable [100%]  blocks  addresses
/24  256  2,459 (33%)  1,363 (18%)  2,023 (27%)  1,605 (22%)  7,450*  4,045 [35%]  11,495  2,942,720
/25  128  449  257  179  112  997*  1,007 [50%]  2,004
/26  64  540  373  234  119  1,266*  813 [39%]  2,079
/27  32  741  606  370  201  1,918†  883 [32%]  2,801
/28  16  1,244  919  688  451  3,302†  1,339 [29%]  4,641
/29  8  3,494  2,143  1,776  1,431  8,844†  0  8,844
/30  4  4,640  4,209  3,774  2,771  15,394†  0  15,394
/31  2  5,847  9,104  7,335  3,783  26,069†  0  26,069
/32  1  79,283 (40%)  40,136 (20%)  44,116 (22%)  34,785 (18%)  198,320†  0  198,320
entire IT31ws dataset: (1,849,294 addrs. in non-responsive blocks) + (3,878,962 in responsive blocks)  22,376  5,728,256
Table 2.13: Number of blocks of each size in IT31ws (6 days). Unclassifiable percentages
relative to all blocks; other percentages relative to classifiable blocks. Asterisks:
consistent blocks; daggers (†): non-consistent.
pfx  addrs  always-stable  sometimes-stable  intermittent  underutilized  classifiable (100%)  unclassifiable [100%]  blocks  addresses
/24  256  1,735 (23%)  1,676 (22%)  2,300 (30%)  1,994 (26%)  7,705*  3,606 [32%]  11,311  2,895,616
/25  128  313  340  174  145  972*  942 [49%]  1,914
/26  64  361  487  231  181  1,260*  761 [38%]  2,021
/27  32  514  686  396  289  1,885†  869 [32%]  2,754
/28  16  878  1,236  784  587  3,485†  1,286 [27%]  4,771
/29  8  2,016  3,188  1,897  1,906  9,007†  0  9,007
/30  4  3,020  6,231  4,007  3,377  16,635†  0  16,635
/31  2  1,113  12,806  7,229  4,269  25,417†  0  25,417
/32  1  81,320 (38%)  49,609 (23%)  42,767 (20%)  38,969 (18%)  212,665†  0  212,665
entire IT31ws dataset: (1,891,745 addrs. in non-responsive blocks) + (3,836,511 in responsive blocks)  22,376  5,728,256
Table 2.14: Number of blocks of each size in IT31ws (14 days). Unclassifiable percentages
relative to all blocks; other percentages relative to classifiable blocks. Asterisks:
consistent blocks; daggers (†): non-consistent.
pfx  addrs  always-stable  sometimes-stable  intermittent  underutilized  classifiable (100%)  unclassifiable [100%]  blocks  addresses
/24  256  1,884 (25%)  1,522 (20%)  2,256 (29%)  2,008 (26%)  7,670*  3,451 [31%]  11,121  2,846,976
/25  128  358  291  161  166  976*  846 [46%]  1,822
/26  64  435  358  217  211  1,221*  594 [33%]  1,815
/27  32  547  568  346  299  1,760†  729 [29%]  2,489
/28  16  856  988  730  648  3,222†  1,091 [25%]  4,313
/29  8  1,838  2,306  1,834  2,127  8,105†  0  8,105
/30  4  2,731  4,624  3,978  3,478  14,811†  0  14,811
/31  2  2,890  8,915  7,110  4,377  23,292†  0  23,292
/32  1  98,867 (44%)  45,016 (20%)  41,529 (18%)  40,368 (18%)  225,780†  0  225,780
entire IT30ws dataset: (1,988,080 addrs. in non-responsive blocks) + (3,741,456 in responsive blocks)  22,381  5,729,536
Table 2.15: Number of blocks of each size in IT30ws (14 days). Unclassifiable percentages
relative to all blocks; other percentages relative to classifiable blocks. Asterisks:
consistent blocks; daggers (†): non-consistent.
[Figure: three density-plot panels, (A, V), (U, V), and (A, U), with A(b) ranging over
0 to 1 and V(b) and U(b) over 0 to 0.1; the color scale runs from 1 to 1000 blocks.]
Figure 2.8: Density plots of /24 blocks in IT17ws across each of the A/V, U/V, A/U
planes.
2.7 Appendix: Examining the (A,V ,U*) Space
Section 2.2.4 defined our ping-observable categories based on the (A, V, U*) values of
blocks. To develop an understanding of how these metrics help categorize the Internet,
Figure 2.8 shows density plots of the (A, V, U*) space separated into three planes. For
each plot, we create 100 bins for each of two parameters, then count the number of /24
blocks identified in IT17ws that fall into that bin with any value of the third parameter.
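A minimal sketch of this binning follows, assuming per-block metric arrays (generated
randomly here as stand-ins for the survey data):

import numpy as np

rng = np.random.default_rng(0)
A = rng.uniform(0, 1, 10000)     # hypothetical availability per /24 block
V = rng.uniform(0, 0.1, 10000)   # hypothetical volatility per /24 block

# 100 bins per axis on the (A, V) plane; U* is ignored, so each block
# is counted regardless of its third-parameter value.
counts, a_edges, v_edges = np.histogram2d(A, V, bins=100,
                                          range=[[0, 1], [0, 0.1]])
# counts[i, j] is the number of blocks falling into bin (i, j).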
All of the planes show blocks with many different values, providing no definitive
clusters. However, there are concentrations in some areas of some planes, even though
there are a few blocks in between those concentrations. The (A, V) plane shows two
concentrations. A portion of blocks gathers around (A, V, U*) = (0.975, 0.005, ·),
showing highly available and highly stable behavior; we classify most of these as
always-stable blocks. Another portion of blocks gathers around
(A, V, U*) = (0.050, 0.005, ·), exhibiting highly underutilized behavior; we classify
these as underutilized blocks. The remaining blocks are distributed between
(A, V, U*) = (0.100–0.400, 0.075–0.022, ·), with no obvious boundary to differentiate
sometimes-stable and intermittent blocks in the (A, V) plane. Instead, we inspect the
(A, U*) and (U*, V) planes to split these apart. Even there, we do not see a sharp
boundary. However, we place a line at U* = 0.026 (U* = 6 hours) to classify
sometimes-stable (U* ≥ 6 hours) and intermittent (U* < 6 hours) blocks.
2.8 Appendix: Training and Hostname-inferred Usage
Categorization
Our methodology takes data about the use of public addresses and produces five ping-
observable categories. We would like to relate those categories to terms that are more
meaningful to network operators, and to find what root causes correspond to, and
potentially cause, intermittent or underutilized blocks.
Determining the operational characteristics of a network is quite challenging, however.
In some cases we are able to discuss network policy with the operations staff to
confirm our assumptions; we used this approach to validate our conclusions against
a large campus network in Section 2.4. However, such observations may be biased by
the policies of a single institution. We would like to also draw data from the Internet
at large, but it is infeasible to contact operations for large parts of the network. While
tools such as nmap [Lyo97] can extract significant information from a network through
sophisticated active probing, their use is easily confused with hostile network activity
by many network operators.³
Hostnames are a source of data that provides some information about how public
computers are used: many hostnames contain keywords such as "www", "dynamic",
or "dsl". Wide collection of hostnames is also feasible: many Internet hosts support
reverse DNS lookup [Moc87], and reverse lookup occurs commonly as part of normal
operation, so it is unlikely to be seen as hostile. The Internet Systems Consortium has
collected full tables of reverse DNS regularly since 1994 [Int07] and makes them
available for a nominal fee.
We next describe how we map hostnames to 15 hostname-inferred usage categories,
and how this data corresponds to our five ping-observable categories.
While one might study the Internet using hostname data alone, without ping data, we
believe the two sources of information complement each other. About half of the
addresses that are used lack reverse hostnames, about 49% of hostnames lack meaningful
keywords, and reverse names may not represent a computer's true use (for hosts with
multiple names, the reverse name is often automatically assigned), so we think hostnames
alone are not sufficient.
2.8.1 Hostname-inferred Usage Categories
Although hostnames are not perfect, we believe they provide a useful dataset to compare
against our ping-observable categories. We use ISC survey 17 [Int07], taken slightly
before our primary ping survey [USC07b].
³ Widescale nmap use would place us in contact with additional operations staff, but
perhaps not on ideal terms.
[Figure: overlapping address sets: ping survey 4,445,696 (100%); ping survey w/
hostnames 2,197,373 (49.4%); ping responders 1,675,121 (37.7%); ping responders w/
hostnames 1,049,842 (23.6%); ping responders w/ hostnames w/ keywords 573,494 (12.9%).]
Figure 2.9: Our investigation targets: IP addresses that ever responded in IT17ws and
have meaningful hostnames (with keywords); these are the 573,494 addresses in the
middle of this figure.
Figure 2.9 shows the overlap of these datasets and our investigation targets.
We begin with the 4.4M IP addresses probed in the ping survey. Nearly half of these
(2.2M) have hostnames in the ISC reverse DNS survey. Of the 1.6M ping addresses
that respond, we consider the 1.0M addresses that also have hostnames. We then focus
on the 573,494 of those that have identifiable keywords in their hostname (12.9% of all
addresses in the ping survey).
We follow recommendations that were proposed as standard naming conventions
for Internet hosts [SM06] and that occur in 2,000 or more hosts in our dataset. Although
these were never approved by the IETF, and would not be mandatory even if approved,
these terms do appear in about one-quarter of reverse hostnames. From these
recommendations we define 15 hostname-inferred usage categories, as shown in
Table 2.16; a sketch of the keyword matching follows the table.
group        category  keywords                                         count
allocation   static    static                                           28,137
             dynamic   dynamic, dyn                                     105,882
             dhcp      dhcp                                             14,290
             pool      pool, pond                                       66,009
             ppp       ppp                                              44,729
access link  dial      dial, modem                                      80,090
             dsl       dsl                                              208,682
             cable     cable                                            29,761
             wireless  wireless, wifi                                   910
             ded       ded, dedicated                                   733
consumer     biz       business, biz                                    12,999
             res       res, resident                                    25,847
             client    client                                           9,994
server       server    server, srv, svr, mx, mail, smtp, www, ns, ftp   12,568
             router    router, rtr, rt, gateway, gw                     2,850
Table 2.16: Categories of hostname-derived usage.
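As referenced above, the following is a minimal sketch of this keyword matching,
using an abbreviated subset of the Table 2.16 keywords; the tokenization rule is our
assumption, not necessarily the exact parser used in this work:

import re

CATEGORIES = {  # abbreviated subset of Table 2.16
    "static":  {"static"},
    "dynamic": {"dynamic", "dyn"},
    "pool":    {"pool", "pond"},
    "dial":    {"dial", "modem"},
    "dsl":     {"dsl"},
    "cable":   {"cable"},
    "server":  {"server", "srv", "svr", "mx", "mail", "smtp", "www", "ns", "ftp"},
    "router":  {"router", "rtr", "rt", "gateway", "gw"},
}

def categorize(hostname):
    # Return every category whose keyword appears as a token of the
    # reverse hostname; categories may overlap, as noted in the text.
    tokens = set(re.split(r"[.\-\d]+", hostname.lower()))
    return {cat for cat, kws in CATEGORIES.items() if tokens & kws}

print(categorize("dsl-static-123.example.net"))  # e.g., {'dsl', 'static'}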
[Figure: bar chart of the number of hosts (0 to 250,000) per hostname-inferred usage
category (static, dynamic, dhcp, pool, ppp, dial, dsl, cable, wireless, ded, biz, res,
client, server, router), with bars subdivided by allocation type (static, dynamic,
singleton, unknown).]
Figure 2.10: Numbers of hostname-inferred usage categories, with colors indicating
those that also have allocation types.
Figure 2.10 shows the count of hostnames in each category. The sum exceeds 573k
because these categories partially overlap, and a single hostname may be in multiple
categories. For example, some providers label DSL addresses with both DSL and static
or dynamic. We see that access-link keywords (DSL, dial, etc.) are very common,
occurring in 51% of hostnames, while allocation types (static, dynamic, etc.) occur in
about 22% of the hostnames in the ping survey with hostnames.
To provide some understanding of the number of hostnames with multiple keywords,
we subdivide each category by those that also contain static, dynamic, or any other
additional keyword. Several group types often have an additional indication of
allocation type: while 10% of dsl are labeled dynamic (1.2% static), 50% of biz are
labeled static. These secondary attributes reveal some technology trends: the ratio of
static to dynamic labels for dial is around 1:17, while for DSL it is 1:8, suggesting
increased use of static addresses on always-on DSL lines. For cable the ratio is 1:1,
but the fraction of cables with an additional type is small enough that drawing
conclusions may be risky.
2.8.2 Relating Hostname-Inferred to Ping-Observable Categories
Our goal in evaluating hostnames is to use them to understand and train our ping-
observable categories. We next compare the two to see how our observations
(A, V, U*) correlate with hostname-inferred usage categories.
Figure 2.11 shows cumulative distribution functions of each observation for each
hostname-inferred type. This data will prove essential to understand the root network
causes of address underutilization and the locations of dynamic addresses; we therefore
defer a detailed discussion of this data to Sections 2.3.2 and 2.3.3.
Taken together, though, these graphs support the third assumption of our work: that
patterns of probe responses can suggest address usage. This assertion is supported
because hostname-inferred categories (our approximation of usage) show fairly distinct
distributions, particularly in availability and median-up duration (Figure 2.11). As a
specific example, the left graph of Figure 2.11 shows that the availability of more than
50% of dial addresses is smaller than 0.1, while the A of more than 80% of server
addresses is larger than 0.95.
[Figure: three CDFs of hosts (%) by hostname-inferred category, over availability (A),
volatility (V*), and median uptime (U*, in hours), with one curve per category (static,
dynamic, dhcp, pool, ppp, dial, dsl, cable, wireless, ded, biz, res, client, server,
router).]
Figure 2.11: CDF of address availability (A), volatility (V) and median-up duration
(U*) by hostname-inferred categories in IT17ws.
While the bulk of dial and server addresses are quite different, a few dial addresses
have reasonably large A (5.4% have A > 0.5), and a moderate number of servers
have poor availability (about 10% have A < 0.15). We conclude that, while ping-
observable metrics are reasonable predictors of usage, they are not exact, and any
estimates will have fairly large error bounds. Perhaps this result is consistent with
previous observations about the great variability of the Internet [FP01].
[Figure: bar chart of the number of /24 blocks (0 to 1,200) per hostname-inferred usage
category, with each bar subdivided by ping-observable category (always-stable,
sometimes-stable, intermittent, underutilized, unclassifiable).]
Figure 2.12: Relationship of ping-observed categories to hostname-inferred categories
in IT17ws.
Finally, the observations in these CDFs help define our thresholds for ping-
observable classes (Section 2.2.4). The sharp knee at A = 0.1 in the left graph of
Figure 2.11 suggests a low-availability threshold of 0.1. Based on V* in the middle
graph, we select ε = 0.0016 to separate most servers and stable uses from less stable
ones. The sharp knee at around U* = 6 hours in the right graph suggests this value as
the cutoff; it helps separate addresses which are neither always-stable nor
underutilized into two categories: sometimes-stable and intermittent.
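A minimal sketch of classification with these thresholds follows; the exact rule
ordering of Section 2.2.4 is simplified here:

def classify(A, V, U_star_hours):
    # Thresholds from the CDF knees discussed above.
    if A < 0.1:                 # low availability: underutilized
        return "underutilized"
    if V <= 0.0016:             # very stable: always-stable
        return "always-stable"
    if U_star_hours >= 6:       # long median uptime: sometimes-stable
        return "sometimes-stable"
    return "intermittent"

print(classify(0.975, 0.0005, 140))   # always-stable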
We observe one anomaly in U*: the step in the right graph of Figure 2.11 around
hour 120 is an artifact caused by hosts that have high A (nearly 1.0) but a brief outage.
This outage gives them V* = 2 and a U* of half our probing duration.
Based on these thresholds, Figure 2.12 and Table 2.17 map the 15 hostname-inferred
usage categories to the ping-observable categories (Section 2.2.4).
ping-observable category  hostname-inferred usage categories
always-stable             router, server, (static), ded, (biz), (dhcp), (res)
sometimes-stable          res, static, biz, dhcp, (server), ded, (client), (cable),
                          (dsl), (router), (ppp)
intermittent              cable, dynamic, dsl, (wireless), (dial), (ppp), (dhcp)
underutilized             pool, wireless, ppp, dial, client, (dynamic), (ded), (dsl)
Table 2.17: The mapping from the 15 hostname-inferred usage categories to 4 ping-
observable categories. Hostname-inferred usage categories without (parentheses) are
dominant.
Chapter 3
Mapping Autonomous Systems to Organizations
To support our thesis statement that systematic approaches can overcome data limita-
tions to improve understanding about the Internet, the previous chapter provided
one specific example, which studied IPv4 addresses via ICMP probe responses. We
next explore another part of the problem space, where we aim to improve the results
of prior work. Specifically, our goal in this chapter is to obtain an accurate mapping
from Autonomous Systems (ASes) to the organizations that own them. Nearly 50k ASes
compose the vital inter-domain routing system which the majority of Internet traffic
traverses. Understanding AS ownership is important to understand the AS ecosystem,
which in turn sheds light on economic relationships among Internet Service Providers
(ISPs), content providers, and Content Delivery Networks (CDNs). Parts of the content
of this chapter were published at IMC [CHKW10]; a more complete version is released
as a technical report [CHKW12b].
To infer the AS-to-organization mapping, we systematically build a clustering
approach that explores the AS contact information stored in WHOIS registries. WHOIS
data has four major limitations: it is indirect, incomplete, over-fit and contains noise
(Section 3.2.1, Section 3.3.3). Our clustering approach handles these limitations by
combining multiple types of contact information and employing a targeted clustering
algorithm (Section 3.2.1). The clustering approach enables us to improve the general
[Figure: the problem-space quadrants spanned by data (direct vs. indirect) and goal
(general vs. specific), with regions marked infeasible, undesirable, and feasible and
desirable; an "AS ownership" circle marks the indirect data explored toward the general
goal, and an "AS ownership" square marks the direct, specific validation data.]
Figure 3.1: The parts of the problem space the second study explores.
understanding about the ownership of ASes: how many and what ASes are managed by
organizations that own multiple ASes? (Section 3.4.1) and why do these organizations
own multiple ASes? (Section 3.4.2). In addition, we show that this understanding
further improves our understanding of the influence of Internet organizations
(Section 3.5).
This chapter serves as another effective piece of evidence supporting the thesis
statement. It demonstrates that systematically-built approaches can help obtain an
accurate AS-to-organization mapping, from which prior work was kept by the indirect,
incomplete, over-fit, and noisy WHOIS data. Figure 3.1 visualizes the part of the
problem space this chapter explores. Similar to the first study, the main goal of this
chapter also falls on the general side, and we explore indirect WHOIS data to reach it
(the "AS ownership" circle). The results are validated by a direct mapping obtained
from a specific organization and several mappings manually inferred by us (the "AS
ownership" square).
The study in this chapter suggests the potential to re-examine many previously unsolved
problems via more systematic approaches. In particular, our work suggests two
conceptual directions. First, one can combine multiple types of data when any single
type is insufficient; this technique has also been demonstrated by prior
work [AKW09, SBS08]. Second, clustering approaches are not only useful to address
incomplete data (as shown in our first study), but are also effective at separating
noise from valuable information, as shown in other cases [CFH+13].
3.1 Introduction
Mapping the Internet’s connectivity structure is important to study network vulner-
abilities to the various failure modes, be they technical [Und05, Und06a], politi-
cal [McP08], or business-related [Pou09, Und06b] in nature or the result of catastrophic
events [COP
+
03, GA06] or intentional attacks [The03]. However, by the Internet’s very
design, there exist many shades of connectivity from inherently physical links to differ-
ent types of virtual connections. To be of practical relevance, abstractions of Internet
topology must account for key features of this complex connectivity.
A popular structure for studying the Internet has been the Internet’s AS graph [GR97,
FFF99] where a node represents an Autonomous System (AS), commonly defined to be
one or more networks in the Internet operated under a common routing policy [HB96]
and links capture the exchange of reachability information among these ASes, each
controlled by an organization. Around 50,000 ASes in today’s Internet appear in public
BGP routing tables [RL95] and BGP-inferred AS-graphs have been extensively studied
for more than a decade. One popular application has been the evaluation of Internet
resilience, modeling threats to the network as removals of nodes or edges from the AS-
graph [AJB00, NBW06] and measuring their impact in terms of network partitioning
or increase in network diameter. However, modeling threats as graph operations can
be problematic when the graphs fail to account for lower- or higher-level real-world
details. For example, physical problems affect routers inside ASes, while ISP disputes
and de-peering can involve multiple ASes and organizations.
Moving beyond the Internet’s physical infrastructure, many issues concerning
today’s network occur at the organization level, above the AS-level, and arise from
business or political disputes. Examining these issues when organizations operate mul-
tiple ASes requires understanding the AS ecosystem [CHKW10] that aims to reflect
organization-level constraints imposed on the traditional AS-graph. To obtain such a
deeper understanding of the effects of organizations and be able to evaluate the effect of
this organization-level structure on important aspects related to the Internet’s topology,
the main contribution of this chapter is the development and evaluation of a new algorithm
that uses public information about ASes to cluster them into organizations.
To this end, we describe in Section 3.2 our new clustering algorithm. It not only adds
a new data source for mapping ASes to companies (i.e., company subsidiary informa-
tion contained in the U.S. SEC Form 10-K filings), but also applies a new approach to
hierarchical clustering with weighted attributes. As a result, this newly developed algo-
rithm provides much greater accuracy than our prior work’s much simpler clustering
without 10-K information [CHKW10]. Given the legal requirements that form the basis
of this newly used information source, its accuracy is superior to the largely volunteer-
based efforts such as Packet Clearing House [Pac10] or the Regional Internet Registries
(RIRs) [Reg09]. While mining 10-K data can be challenging due to its unstructured
nature, we combine automatic matching with some manual evaluation to make effec-
tive use of it. Our exploration of these new data sources and the unique aspects of our
problem domain are our contribution beyond basic clustering.
We evaluate our new approach to map ASes to organizations with our best effort,
using four datasets composed of more than 100 organizations and 4,000 ASes
(Section 3.3). These datasets were chosen to balance quality, unbiasedness, and size, and
combine ground truth from operators with independent information gathered from pub-
lic sources. We show that our results are accurate, with 90% of the organizations show-
ing false-positive rates of less than 10%; with respect to completeness, more than 60%
of the organizations show 10% or fewer false negatives. Our evaluation shows how
data quality, data availability, and the type of clustering algorithms affect our results. In
particular, we find that company subsidiary information from U.S. SEC Form 10-K fil-
ings significantly reduces false negatives. Our accuracy and completeness rates exceed
prior published results, and our analysis highlights the impossibility of reaching perfect
accuracy and completeness, even with multiple sources of public data.
To demonstrate the practical relevance of the Internet's organization-level structure,
we show in Section 3.4 that multi-AS organizations matter in today's Internet: some 36%
of assigned AS numbers and 29% of actively routed ASes belong to multi-AS organi-
zations. Importantly, these 36% of ASes are particularly prominent, announcing nearly
two-thirds of all routed addresses. Moreover, the phenomenon of multi-AS organiza-
tions is not transient: we identify underlying causes of multi-AS usage and use historic
routing table snapshots to illustrate that these causes are persistent.
We also evaluate some effects of this organization-level structure on the Internet
topology in Section 3.5. Prior analysis typically focuses on an organization’s best-
known AS. We show that this traditional view greatly underestimates the geographic
footprint and IP address coverage when compared to an organization-wide view that
encompasses all of an organization’s routed ASes. For example, the main AS omits a
significant portion (40% to 91%) of addresses in nearly one third of organizations. We
also demonstrate that understanding the public peering of companies is strengthened by
an organization-level view. For instance, the main-AS view fails to account for 20% to
58% or more of the IXPs and cities for about one third of multi-AS organizations.
These examples illustrate how our work can provide a deeper understanding of the
economic relationships in today’s Internet, both inside and between organizations. This
need is growing: We expect these relationships to continue to evolve as the Internet
changes, particularly as business models become more heterogeneous (as ISPs dis-
tribute content and content-providers deploy networks). Although the specific orga-
nizations that we identify are unique to the time and data we evaluate, our validation
against multiple independent data sources establishes the first baseline accuracy for AS-
to-organization mapping.
To improve research using Internet topologies and allow others to build on our work,
our AS-to-organization map was released in June 2012, along with our test data.¹
3.2 Methodology
We map ASes to organizations through a combination of two methods: automatic clus-
tering done on a structured data source (Section 3.2.1) and a semi-automatic method on
a less structured data source (Section 3.2.2).
3.2.1 Automated Clustering with WHOIS Data
Our automatic method relies on publicly available information from AS registration data
in WHOIS and consists of five separate steps. This section provides more details about
¹ USC/LANDER project. AS-to-organization mapping dataset, PREDICT ID
USC-LANDER/as_to_org_mapping-20101019/rev3007, Nov. 2011. AS-to-organization inferred
truth dataset, PREDICT ID USC-LANDER/as_to_org_mapping_inferred_truth-20110901/rev3182.
At http://www.isi.edu/ant/traces/.
RIR All OrgID Contact Phone Email
ARIN 22k (100%) 21k (95%) 20k (91%) 19k (86%) 19k (86%)
RIPE 20k (100%) 13k (65%) 19k (95%) unavail. 14k (70%)
APNIC 6k (100%) unavail. 6k (95%) 5k (83%) 5k (83%)
LACNIC 1.5k (100%) 1.5k (100%) unavail. unavail. unavail.
AfriNIC 0.6k (100%) 0.6k (87%) 0.6k (100%) 0.6k (98%) unavail.
All 50k (100%) 36k (72%) 46k (92%) 25k (50%) 38k (76%)
Table 3.1: Data availability (AS count) for four attribute types across the 5 RIRs.
(i) WHOIS data, (ii) Step 1: attribute extraction and standardization, (iii) Step 2: training
attribute weights, (iv) Step 3: similarity matrix, (v) Step 4: clustering algorithm, and (vi)
Step 5: cluster labeling and selection.
WHOIS Data
The WHOIS database stores AS-specific registration information that is provided
by each AS on a voluntary basis. WHOIS originated to assist communication
between network operators, but there are no forcing mechanisms to ensure that each
AS’s information is complete or accurate. Each of the five Regional Internet Reg-
istries [Reg09] (RIRs) provides its portion of WHOIS data, with ARIN using its own
format [ARI10] and the other RIRs relying on the Routing Policy Specification Lan-
guage (RPSL) [NCC09]. To make full use of this WHOIS data, we first merge these
different formats into a common one.
WHOIS data is composed of different types of records and each record is associated
with multiple attributes. We are interested in three general types of records in WHOIS:
Autonomous Systems (ASes), organizations (or orgs), and points-of-contact (contacts).
From specializations of these types (for example, administrative or technical contacts),
we identify a total of 66 different types that provide potentially useful information for
identifying ASes with an organization. In particular, there are some 50k ASes in all five
RIRs (see the second column annotated with "All" in Table 3.1), identified by ASHandle
records in ARIN and aut-num records in other RIRs. Some ASes' records are linked to
org records by OrgID or org attributes. Org records are often used in WHOIS for
common management of multiple resources and are potentially useful for identifying an
AS’s organization. However, RIR policies do not require a one-to-one mapping between
WHOIS-derived and real-world organizations, making simple clustering on common
organization records ineffective. An AS’s contact records identify individuals in charge
of administrative, technical, abuse or operations aspects of the AS. For example, a multi-
AS organization may use the same contact information for all of its ASes. However,
even if it uses different contacts, these contacts can often be linked based on common
telephone numbers and e-mail addresses.
We pursue two strategies with respect to the above-mentioned attributes. The first
strategy is to aggregate them into four attribute types: OrgID, contact ID, phone, and
email. The second strategy is to keep the 66 identified attributes separate so as to allow
for a differentiation between their subtle semantic meanings. By treating administra-
tive, technical, abuse, and network operation center (NOC) phone and e-mail contact
information separately, we allow these categories to be weighted differently. Different
weightings can reflect roles of outsourcing, since administrative and technical roles may
be outsourced with different frequencies. Throughout this paper, we refer to these two
sets of attributes as 4attr and 66attr, respectively.
There are several challenges in using WHOIS data to map ASes to organizations.
First, incompleteness in WHOIS data is a serious problem due to incomplete coverage as
well as stale and incorrect records. Table 3.1 shows the data availability of four different
attribute types. We see that no single type of attribute (OrgID, contact ID, phone, and
email) covers all ASes. In addition, some RIRs filter particular attributes due to privacy
concerns (for example, RIPE, one of the five RIRs, filters all phone numbers in bulk
data). These challenges require our use of clustering across multiple attributes. Possible
future work may examine exploiting unique aspects of the data from each RIR; we instead
focus on combining data from all RIRs, because use of global data is necessary to
capture large, influential ISPs with ASes in all RIRs.
Second, even when WHOIS data is present, technical outsourcing can provide mis-
leading relationships. When an organization has a third-party to handle network oper-
ations, specific fields in an AS’ data do not link to the parent organization, but instead
identify a third-party outsourcing company. Such false linkages can incorrectly join
otherwise unrelated ASes. We discuss these cases and our solutions in Section 3.3.4.
Finally, mergers and acquisitions are a primary source of mismatches between cur-
rent, real-world organizations and WHOIS information. After an acquisition, if WHOIS
records are not updated to use common points-of-contact, or if the acquisition maintains
distinct AS-level administration (for example, as Youtube and Google), then it may be
impossible to infer the single new organization from WHOIS information alone. In Sec-
tion 3.2.2 we turn to additional data to explicitly identify acquisitions.
Attribute Extraction and Standardization
In Section 3.2.1, we summarized how we selected and normalized the AS attributes used
for clustering. This section gives more details, highlighting what aspects of the process
differ from our prior work [CHKW10].
Extract Raw Attributes: We extract raw attributes from WHOIS, following chains of
AS- to org- to contact-record as necessary.
Canonicalize to Simple Attributes: Unlike OrgID and contact IDs, phone and email
attributes often contain details that make similar records appear dissimilar. For example,
telephone numbers need to be amended with country code and stripped of extensions. To
canonicalize email records, we discard the user portion and keep only the distinguishing,
right-most part of the domain address. We identify that portion with a manually-built
list of more than 6k suffixes using longest-suffix matching.
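A minimal sketch of these canonicalization rules follows; the suffix list and phone
handling here are simplified stand-ins for the manually-built list of more than 6k
suffixes and the actual cleaning rules:

import re

SUFFIXES = {"example.com", "example.co.uk", "co.uk"}  # hypothetical subset

def canonical_email_domain(email):
    # Keep only the distinguishing, right-most part of the domain,
    # using longest-suffix matching against the suffix list.
    domain = email.lower().rsplit("@", 1)[-1]
    parts = domain.split(".")
    for i in range(len(parts)):            # longest candidate first
        candidate = ".".join(parts[i:])
        if candidate in SUFFIXES:
            return candidate
    return domain

def canonical_phone(phone, default_cc="+1"):
    # Strip a trailing extension and punctuation; amend with a country
    # code when one is missing (simplified).
    phone = re.split(r"(?i)\s*(?:ext|x)\.?\s*\d+$", phone)[0]
    digits = re.sub(r"[^\d+]", "", phone)
    return digits if digits.startswith("+") else default_cc + digits

print(canonical_email_domain("noc@mail.example.co.uk"))  # example.co.uk
print(canonical_phone("(310) 555-1212 ext. 42"))         # +13105551212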
Discard Generic Attributes: A number of attribute values are generic, shared by
unrelated ASes. Examples of generic attributes are public email services like GMail
and Hotmail. Used blindly, generic attributes will link unrelated organizations into
large, incorrect clusters. In addition to public e-mail providers, we identify eight generic
OrgID attributes (for RIRs, IANA, and NICs) and 120 contact IDs (for these, plus tens
of outsourcing companies and a few ISPs that manage customer networks). In total, we
discard 179 phone numbers and 141 e-mail domains.
Training Attribute Weights
Attributes of different types may have varying relevance for detecting organizations.
We therefore assign weights to the different attributes and tune these weights based on
training data. We briefly summarize our approach below and provide more details in
Appendix 3.7.
To train attribute weights, we use parallel hill-climbing [RN03] over a training set
of about 10,000 ASes. To form the training set, we start with 715 ASes for which we
have reliable organizational identities, then add 9k additional ASes to provide “noise”.
We then adjust weights to minimize false positives as well as false negatives; that is,
assigning an AS to the wrong cluster and missing ASes in organizations we know. Our
specific objective function is the sum of the quartiles of the false-positive and false-
negative rates over known organizations. We sum both types of error to avoid optimizing
either at the expense of the other, and use quartiles to be robust to extrema.
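A minimal sketch of one hill-climbing run follows; objective(w) is a hypothetical
callable that clusters the training set under weights w and returns the sum of the
false-positive and false-negative quartiles:

import random

def hill_climb(objective, w, step=0.05, rounds=100):
    best = objective(w)
    for _ in range(rounds):
        cand = list(w)
        i = random.randrange(len(cand))          # perturb one weight
        cand[i] = max(0.0, cand[i] + random.choice([-step, step]))
        total = sum(cand) or 1.0
        cand = [x / total for x in cand]         # keep weights summing to 1
        score = objective(cand)
        if score < best:                         # keep only improving moves
            best, w = score, cand
    return w, best

# Parallel hill-climbing runs many such climbs from random starting
# points (e.g., one per core) and keeps the overall best weights.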
The size of the training set (10,000 ASes) is carefully selected based on time and
hardware constraints. Clustering with the large dataset is very time- and memory-
intensive: clustering with 50k ASes (whole population) takes about 3 days and 24 GB
of memory. Our training dataset instead requires only 20 minutes and 1 GB of memory.
In all, training considers about 15k weight vectors, requires about 200 days of compute
time, and is accomplished in about a week through parallelization.
Training finds the best clustering result with the 4attr dataset with weights
ŵ = {0.75, 0.1, 0.1, 0.05} for the attributes (OrgID, contact ID, phone, and email). These
weights emphasize the correctness of OrgID (0.75) and downplay contact ID (0.1), phone
(0.1), and email (0.05). OrgIDs are intended for common administrative management
and are thus unlikely to cause false positives. Contact IDs, phones, and emails, though,
can be registered by outsourcing third parties and thus have the potential to introduce
false positives. Although the best weighting scheme favors OrgID for clustering, input
from other attributes helps improve cases where OrgID is insufficient (Section 3.3.3).
Weighting schemes for the 66attr set are generally worse (1.5 to 2 times larger
objective function), so we rule out using 66 attribute types in practice. We examine
causes for this worse performance and find that more specific categories often break
links that would cluster ASes, resulting in higher false-negative rates. This problem
occurs because we cannot link across categories. Consider, for example, the e-mail
attribute type (similar arguments apply to phone and contact ID). With 4attr input,
administrative and technical contacts for two ASes, admin@as1.example.com and
tech@as2.example.com, are part of the same attribute and hence link the ASes.
However, with 66attr input, these attributes are considered to be different, meaning
they will fail to link the ASes.
Similarity Matrix
Next we use the weighted attributes to link ASes based on a similarity score. We com-
pute similarity scores for all AS pairs and store them in a similarity matrix, the input to
our clustering algorithm (Section 3.2.1).
We use weights and the Jaccard index to compute the similarity score between two
ASes. Let s_{x,y} denote the similarity score between ASes x and y, and let ŵ be the
weight vector with elements w_i, i ∈ [1..M], Σ_{i=1}^{M} w_i = 1, one for each
attribute type. AS x has attribute vector X̂ with elements X_i, i ∈ [1..M], one per
attribute type, and AS y has a similar vector Ŷ (recall that attribute vectors are
produced in the second step and stored either in the 4attr or 66attr set, see
Section 3.2.1). Then we have

    s_{x,y} = Σ_{i=1}^{M} w_i J(X_i, Y_i)                          (3.1)

where the Jaccard index J(X_i, Y_i) is the similarity score between two specific sets
of attributes of the same type (the i-th attribute type), defined as

    J(X_i, Y_i) = |X_i ∩ Y_i| / |X_i ∪ Y_i|                        (3.2)
This definition assumes that attribute types are orthogonal and compares only
attributes of the same type. Thus, in the case of 4attr input, all e-mail addresses are
compared with each other, while with 66attr input, administrative and technical
addresses are treated separately; similarly for OrgIDs, contact IDs, and phone numbers.
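The following is a direct transcription of Equations (3.1) and (3.2) as a sketch;
treating the Jaccard index of two empty attribute sets as 0 is our convention, since
the text does not specify that case, and the example attribute values are hypothetical:

def jaccard(a, b):
    # J(X_i, Y_i) = |X_i ∩ Y_i| / |X_i ∪ Y_i|; empty/empty treated as 0.
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)

def similarity(w, X, Y):
    # s_{x,y} = sum_i w_i * J(X_i, Y_i), one term per attribute type.
    return sum(w_i * jaccard(x_i, y_i) for w_i, x_i, y_i in zip(w, X, Y))

w = [0.75, 0.1, 0.1, 0.05]                 # trained 4attr weights
X = [{"ORG-A"}, {"JD1"}, {"+15551234"}, {"example.com"}]   # hypothetical
Y = [{"ORG-A"}, {"JD2"}, {"+15551234"}, {"example.com"}]
print(similarity(w, X, Y))                 # 0.75 + 0 + 0.1 + 0.05 = 0.9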
Clustering Algorithm
Next we use the similarity matrix to cluster ASes. Ideally, each generated cluster cor-
responds to a real-world organization and can be labeled appropriately based on clues
such as domain names or keywords.
To cluster ASes, we rely on an existing hierarchical clustering algorithm, bringing
new data sources, attributes, and weights to the problem. It starts with a set of individ-
ual ASes with their pairwise similarity matrix. Initially, each individual AS denotes a
cluster. During each round, the two clusters with the shortest distance are merged
together until only one cluster remains (we define distance in the next
paragraph). By default, the algorithm joins all elements into a binary tree of relationships,
but a user can then select a threshold of similarity that will separate the tree into a forest.
We select a similarity threshold that cuts clusters with the following automatic training
method. Using a similarity score that is proportional to the sum of weights defined in
Equation (3.1), we begin training with a fixed threshold. We then train weights and
normalize both weights and threshold so that the sum of weights is 1.
There are several ways to define the distance between two clusters during clus-
tering. Importantly, this definition can significantly affect the clustering results. The
four definitions commonly used are maximum-linkage (the maximum distance among
all pair-wise elements), single-linkage (the minimum distance), average-linkage (the
average distance), and centroid-linkage (the distance between the centroids of two clus-
ters). We reject maximum-linkage as too strict because it creates too many false nega-
tives. Single-linkage is too aggressive because it causes too many false positives. We
avoid centroid-linkage because it is computationally too intensive. We therefore choose
average-linkage and verify that it can extract relationships and cope with challenges of
the WHOIS dataset (see Section 3.3.4).
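A minimal sketch of this step using SciPy's off-the-shelf hierarchical clustering
follows, with a tiny hypothetical similarity matrix; our actual implementation and
threshold differ:

import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage
from scipy.spatial.distance import squareform

sim = np.array([[1.0, 0.9, 0.1],           # hypothetical AS similarities
                [0.9, 1.0, 0.0],
                [0.1, 0.0, 1.0]])
dist = 1.0 - sim                           # similarity -> distance
np.fill_diagonal(dist, 0.0)                # guard against float error

Z = linkage(squareform(dist), method="average")   # average-linkage tree
sim_threshold = 0.5                               # hypothetical cut value
labels = fcluster(Z, t=1.0 - sim_threshold, criterion="distance")
print(labels)                                     # e.g., [1 1 2]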
Cluster Labeling and Selection
Hierarchical, average-linkage clustering produces a set of AS clusters, but we would
like to label them with meaningful identifiers. One promising source for labeling is
e-mail domain names of AS attributes, since e-mail addresses are usually human-readable
and often match an organization's website. We also extract text names in AS and OrgID
records, including ASName or as-name, OrgName, descr (description), and owner fields.
We do not use OrgIDs since they are often obscure. We also avoid contact IDs, telephone
numbers, and information that may be specific to an individual. To improve search
speed and label quality, we break these names into keywords and rank them based on
their frequency and uniqueness. This process tends to highlight keywords that reveal an
organization’s identity.
To identify an organization’s AS cluster, one can search for related keywords or
domains and manually decide which AS cluster is the closest. Automatic identification
of all organizations’ AS clusters requires a list of keywords or domains and a function
that associates those terms with clusters and can ideally disambiguate multiple potential
matches. Unfortunately, we do not have such a list of keywords or domains for every
Internet-related organization. Thus, when comparing the accuracy of our results with
the ground truth, for the sake of simplicity, we always pick the biggest cluster in our
results to compare with the ground truth cluster. We caution that this decision favors
lower false-negative rate and higher false-positive rate.
3.2.2 Semi-automatic Clustering with 10-K Data
Mergers and acquisitions make it hard to map ASes into organizations. Such changes
often result in stale information in WHOIS, thus reducing mapping accuracy. To address
this issue, we advocate in this paper the use of a new and previously untapped source of
information for AS-to-organization mapping: company subsidiary data contained in the
U.S. SEC Form 10-K filings.
Form 10-K Data
Unlike the voluntary nature of WHOIS, all publicly-traded U.S.-based companies are
mandated by law to file 10-K forms annually. Because of this legal requirement, the
completeness and accuracy of the data is superior to the WHOIS database. Form 10-
K data therefore represents the ground truth for subsidiary relationships among these
U.S. companies for the year prior to the filings. Its disadvantages are that it does not
apply outside the U.S. and 10-K names can be imprecise as we describe below.
Form 10-K data is freely available from the U.S. Securities and Exchange Commis-
sion’s (SEC) EDGAR database [SS]. EDGAR covers each of the thousands of publicly-
traded, U.S.-based companies. Each form includes a unique company identifier
(we call it the 10-K ID), the year the form was filed, and a list of all of the company's
current subsidiaries. After extracting the subsidiary names from all these lists and normalizing
them to lower case, we produce a mapping from company identifiers to their company
and subsidiary names. In total, we extract 156,936 names mapped to 8,706 10-K IDs
from the database for fiscal year 2010.
The main weakness of 10-K data is that company names are not standardized. Thus,
comparisons between 10-K and WHOIS are imprecise. Fully normalizing names is a
difficult problem in natural language understanding, since many variations are context
dependent and some names provide very little context. For example, Network and Inc.
are two words that convey little information in general about organization identity, but
they cause noise and cannot be dropped for matching. Similarly, variations in spac-
ing, abbreviations, and level of detail can all cause errors in name matching, and some
manual data cleaning is necessary to determine that, for example, “Apple”, “Apple Com-
puter”, and “Apple, Inc.” all refer to one and the same company, while “Apple Records”
identifies a different company.
Automatic Name Linkage
Given that the Form 10-K data is an accurate source of company subsidiary information,
in theory, we can use this data to cluster ASes for all subsidiaries into one organization
simply by matching subsidiary names to ASes. Two challenges make this task difficult
in practice. First, names are often ambiguous and come with little context, so name
matching is error-prone, as described above. Second, subsidiaries may or may not retain
a distinct identity after a merger. For example, Google lists Youtube as a distinct sub-
sidiary in 10-K, but not Postini. In this case, we can link Youtube’s ASes with Google,
but not Postini’s.
To overcome these difficulties, we first describe an automated procedure for linking
AS names with names of subsidiaries and then detail a manual process for verifying and
pruning links for some 50 purposefully chosen organizations in Section 3.2.2.
Linking the same entity from different data sources, usually with different names, is
called record linkage, a well-studied problem in data mining. A number of algorithms
are available [SB88, RWB+96, CHK+07]. We employ TF-IDF [SB88], mainly because
it pays special attention to infrequent keywords in names, is fairly straightforward to
implement, and has only simple parameters to configure. To assess if two strings should
be linked, the strings are broken into terms. TF-IDF then computes a similarity score
between them by comparing their term sets, weighting infrequent terms. TF-IDF is
commonly used to query natural-text documents in a corpus by certain keywords. While
we believe it is a reasonable choice for our application, it is not optimized for matching
company names.
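A minimal sketch of TF-IDF name matching follows, here via scikit-learn rather than a
from-scratch implementation; the names and threshold are illustrative, not our exact
configuration:

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

subsidiaries = ["youtube llc",
                "google information technology services llc"]
as_names = ["youtube, inc.",
            "information technology services"]

vec = TfidfVectorizer()            # infrequent terms get higher weight
tfidf = vec.fit_transform(subsidiaries + as_names)
scores = cosine_similarity(tfidf[:2], tfidf[2:])
# scores[i, j]: similarity of subsidiary i to AS name j; candidate
# links above a chosen threshold go on to manual verification.
print(scores.round(2))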
Manual Verification and Pruning
In Section 3.2.2 we briefly discussed how we manually verified and used 10-K links for
some 50 purposefully selected organizations. In this section, we present more details
about this process.
We select 50 organizations (about 0.6% of the 8,706 organizations) intentionally
to favor those that are relevant in the real world and are important to the Internet’s
ecosystem. In particular, we select 38 large, computer-related organizations from the
2011 Fortune 500 list and add 12 large ISPs that are not included in that list. In terms
of Fortune 500 companies, we included all organizations (38 in total) in the follow-
ing six Internet-related industries: telecommunications (e.g., Verizon, Sprint), Internet
services (Amazon, Google), computers (Hewlett-Packard, Apple), software (Microsoft,
Oracle), IT services (IBM, Computer Sciences Corporation), and communication equip-
ment (Cisco). We then added 12 organizations that are not on the Fortune 500 list but are
important players in the Internet, including large Tier-1 and Tier-2 ISPs such as Level 3
and Cogent.
The complete list of these 50 organizations is:
1. Telecommunication (14 companies): AT&T, Cablevision, Charter Communica-
tions, Comcast, DirecTV, DISH Network, Liberty Global, NII Holdings, Telephone
& Data Systems, Qwest, Sprint, Time Warner Cable, Verizon, and Virgin Media.
2. Internet service (5 companies): Amazon, eBay, Google, Liberty Media, and
Yahoo.
3. Computer (5 companies): Apple Inc., Dell Inc., Hewlett-Packard, Pitney Bowes,
and Xerox.
4. Software (3 companies): Microsoft, Oracle, and Symantec.
5. IT service (5 companies): AimNet Solutions, Cognizant, Computer Sciences Cor-
poration, IBM, and SAIC Inc.
6. Communication equipment (6 companies): Avaya, Cisco, Corning Inc., Harris
Corporation, Motorola, and Qualcomm.
7. Other (12 companies): Akamai, Citigroup, Cogent, Equinix, Gannett, Internap,
Limelight, Savvis, SunGard, VeriSign, Vonage, and XO Communications.
Of the 1817 links that the automated clustering produced for these 50 organizations,
we verified and kept 1226 links, dropping 591. To verify the correctness of a link, we
manually compared the AS name with the subsidiary name, using additional information
from WHOIS and public web pages where available. For example, we verified and
kept the link between AS36561 (YouTube, Inc.) and Google’s subsidiary (YouTube,
LLC), because AS36561 registered with the same address as Google, and has contacts
with e-mail domain google.com and youtube.com. In contrast, we eliminated the link
between AS4616 (Information Technology Services) and Google’s subsidiary (Google
Information Technology Services LLC), because AS4616 actually belongs to Hong Kong
Polytechnic University according to its WHOIS record.
Enhanced AS Clustering
To make good use of the Form 10-K data, we modify the clustering algorithm described
in Section 3.2.1 by incorporating the information contained in the above-identified links
between ASes and 10-K organizations.
To this end, we add the 10-K ID as an additional attribute type to the 4attr set.
We call attributes categorized into these 5 attribute types (OrgID, contactID, phone,
email, and 10-K ID) 4attr+10K. We assign 10-K ID attributes the same weight as OrgID
attributes; we use a large weight because we have manually verified the accuracy of
these new attributes. The new weight vector ŵ′ = {0.75, 0.1, 0.1, 0.05, 0.75} is not
normalized, and we keep the same similarity threshold when cutting the clustering tree
(see Section 3.2.1). As a result, on one hand, if two ASes are linked with the same 10-K
organization their similarity score will be higher than before and thus more likely to
be clustered together later. On the other hand, if two ASes are linked only by WHOIS
attributes, their similarity score (and thus their clustering) will not change compared to
4attr-only clustering.
We re-compute the similarity matrix with the 4attr+10K set as input and feed the
result to the clustering algorithm described in Section 3.2.1. We compare the clustering
results obtained by this 4attr+10K-based method with those produced by the 4attr-
based approach in the subsequent section.
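For concreteness, the sketch below shows one way the weighted similarity and average-linkage clustering described above could be realized. It is a minimal illustration rather than our actual implementation: the attribute records are hypothetical, while the weights and cutting threshold follow the values used in this chapter.

```python
# Minimal sketch of 4attr+10K similarity scoring plus average-linkage
# clustering with a cutting threshold. Attribute records are hypothetical.
from itertools import combinations
from scipy.cluster.hierarchy import fcluster, linkage

# Weights for OrgID, contactID, phone, email, and 10-K ID (Section 3.2.1).
WEIGHTS = {"orgid": 0.75, "contactid": 0.1, "phone": 0.1,
           "email": 0.05, "tenk": 0.75}
THETA = 0.0025  # normalized cutting threshold (see Section 3.7.2)

def similarity(a, b):
    # Two ASes earn the weight of every attribute type whose values match.
    return sum(w for t, w in WEIGHTS.items()
               if a.get(t) is not None and a.get(t) == b.get(t))

def cluster_ases(records):
    max_sim = sum(WEIGHTS.values())
    # SciPy expects a condensed *distance* matrix, so invert similarities.
    dists = [max_sim - similarity(x, y) for x, y in combinations(records, 2)]
    tree = linkage(dists, method="average")          # average-linkage tree
    # Cut the tree where average similarity drops below THETA.
    return fcluster(tree, t=max_sim - THETA, criterion="distance")

ases = [
    {"orgid": "GOGL", "email": "noc@example.com"},   # hypothetical records
    {"orgid": "GOGL", "tenk": "google-10k"},
    {"email": "noc@example.com"},                    # linked by email only
    {"orgid": "OTHR"},
]
print(cluster_ases(ases))  # e.g., [1, 1, 1, 2]: the last AS stays separate
```

Because the 10-K ID carries the same large weight as OrgID, two ASes tied to the same Form 10-K organization score well above the threshold even when their WHOIS contact attributes differ.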
3.3 Validation of AS-to-Organization Map
We use four datasets to validate the accuracy of our clustering methods (Section 3.3.1),
purposefully chosen to trade off confidence among three criteria: quality, unbiasedness,
and size. We describe the validation method in Section 3.3.2 and present our results
in Section 3.3.3, focusing in particular on how specific aspects of our methodology and
properties of our datasets improve accuracy (Section 3.3.4). We also compare our results
to PCH in Section 3.3.5.
Datasets    Quality: source (definitiveness)      Unbiasedness: selection (bias)       Orgs.  ASes   Size
T_tier1     operators (definitive)                intentionally selected (potential)       1  many†  small
T_9org      many public records (very good)       intentionally selected (potential)       9  502    medium
T_randtop   good public records (good)            random from top (minimal)               50  2516   large
T_randall   sparse public records (incomplete)    random from all (none)                  50  1001   large

Table 3.2: Validation datasets ranked by quality, unbiasedness and size. †: omitted intentionally.
3.3.1 Validation Datasets
To gain a comprehensive view of the quality of our mapping results, we use four different
datasets with varying degrees of quality, unbiasedness and size for evaluation. Table 3.2
ranks the datasets by these criteria.
The first dataset, T_tier1, is provided by a Tier-1 operator, and thus represents the
highest quality in terms of completeness and accuracy (the operator must know what
ASes they operate!). However, the dataset reflects only one organization, and so it
provides a limited view of diverse AS policies and uses that may occur in the Internet.
Ground truth from additional Tier-1 ISPs is not possible to obtain because such information
about customers and business relationships is proprietary. We therefore infer three
datasets from public records. From public online documents, routing data and WHOIS
information, we believe we find most ASes of a given organization for our targets (all
public companies). We infer the AS list of a given organization in three steps. We
first manually search online to collect the string names of its divisions, subsidiaries, and
previous acquisitions and mergers. We then find any AS whose WHOIS record stores
similar string names. Finally, we disambiguate names that share similar keywords (as we described with “Apple” in Section 3.2.2) but belong to different organizations. We cau-
tion that ground truth inferred in this way is only best effort. However, to our knowledge,
these inferred datasets are the best available.
These three datasets consist of different samples of organizations for different purposes. T_9org contains nine big U.S.-based public companies with a large online presence and is thus of fairly good quality. It is hand-picked to shed light on key players in today’s Internet, including four large telecommunications companies, four content providers, and a root-DNS provider.

In contrast, T_randtop and T_randall are randomly chosen, and each consists of 50 organizations. We first consider all clusters that were produced by our clustering method that uses 4attr+10K as input. Next, we take a random sample and infer the organization identity of the sample from the AS WHOIS records. More precisely, T_randtop is a random sample of size 50 from the 100 largest organizations we find, where the size of an organization is given by the number of its ASes. From manual inspection, this dataset contains large ISPs, big research networks, media conglomerates and multinational financial companies. By comparison, T_randall is a randomly selected set of 50 organizations from all 36,463 clusters our method produces. Most of T_randall are small, private organizations, often without even a website. Both of these datasets serve a role, even though their organizations may be less complete than the organizations in T_9org because less public information is available for them. On the one hand, random selection means that T_randtop provides a less biased view of a broader range of key Internet players than T_9org. On the other hand, T_randall represents a completely unbiased sample but contains mostly small and unimportant organizations.

While there are six organizations that belong to both T_randtop and T_9org, T_randall and T_randtop are disjoint. Also, while the median organization size in T_randall is 1, T_randall contains a total of 1001 ASes, simply because a single organization (a network information center) has 944 ASes. When ignoring that organization, we end up with 49 organizations with a total of 57 ASes.
Finally, our validation data is derived from operator information, WHOIS, and public sources such as the web, and our analysis uses the same sources and SEC 10-K data. Our validation and analysis are therefore not affected by incomplete information in public BGP peerings [OPW+10, ACF+12].
3.3.2 Validation Method
To validate our results, we consider the clusters produced by our clustering method that
takes 4attr+10K as input. For each organization in our validation sets, we first select the
biggest cluster (see Section 3.2.1) that overlaps with the ground truth and compare them.
We then check how many ASes are wrongly assigned to the cluster (false positives) and how many ASes are missing from the cluster (false negatives). We define false positives (fp) and false negatives (fn) as follows. Let M_i be the i-th cluster in the ground truth (e.g., T_9org) and C the biggest cluster produced by our method that overlaps with the ground truth. Then we have

fn = 1 - |M_i ∩ C| / |M_i|,    fp = (|C| - |M_i ∩ C|) / |M_i|        (3.3)

where C ∈ R_ours is a cluster produced by our method and M_i is the cluster in the ground truth overlapping with C (e.g., M_i ∈ T_9org).
For simplicity, we also classify our validation results for each organization into good
or bad with the help of the false-positive and false-negative rates. In particular, we call
the results for an organization good if both types of error rates are below 10%; otherwise,
we call the results bad.
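The computation behind these rates is straightforward; the sketch below evaluates Equation (3.3) for one organization, with hypothetical AS sets.

```python
# Minimal sketch of Equation (3.3); the AS identifiers are hypothetical.
def error_rates(m_i, c):
    """m_i: ground-truth cluster; c: our biggest overlapping cluster."""
    overlap = len(m_i & c)
    fn = 1 - overlap / len(m_i)          # ASes of M_i missing from C
    fp = (len(c) - overlap) / len(m_i)   # ASes wrongly assigned to C
    return fp, fn

# e.g., M_i = {1..5} and C = {3,4,5,6} give fp = 0.2 and fn = 0.4,
# so this organization would be classified "bad" (fn above 10%).
print(error_rates({1, 2, 3, 4, 5}, {3, 4, 5, 6}))
```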
category               orgs   percentage
false positive
  good                    9   90%
    perfect (=0%)         6   60%
    0%-10%                3   30%
  bad                     1   10%
    10%-20%               1   10%
false negative
  good                    6   60%
    perfect (=0%)         6   60%
  bad                     4   40%
    10%-20%               2   20%
    20%-40%               2   20%

Table 3.3: Validation of 10 intentionally selected organizations including a Tier-1 ISP.
3.3.3 Validation Results
In this section, we first present the overall findings for all four validation datasets (Section 3.3.3). We then look into the underlying causes of mistakes and the obstacles we face when mapping ASes into organizations (Section 3.3.3).
Overall statistics
Table 3.3 summarizes the results for T_tier1 and T_9org, showing overall very low false-positive rates and moderately low false-negative rates (see Section 3.3.3 for causes). As we can see from the false-positive analysis shown in the top portion of the table, for nine out of ten organizations (90%), we wrongly cluster less than 10% of ASes. In terms of the false-negative analysis given in the bottom portion of the table, we find all ASes for six organizations (60%), identify more than 80% of the ASes for two other organizations, and perform poorly for the two remaining organizations. The false-positive and -negative rates for T_tier1 alone are 7% and 27%, respectively.
category               orgs   percentage
false positive
  good                   47   94%
    perfect (=0%)        31   62%
    0%-10%               16   32%
  bad                     3   6%
    10%-20%               2   4%
    20%-40%               1   2%
false negative
  good                   34   68%
    perfect (=0%)        23   46%
    0%-10%               11   22%
  bad                    16   32%
    10%-20%               6   12%
    20%-40%               8   16%
    >40%                  2   4%

Table 3.4: Validation of randomly selected organizations from top 100 clusters.
Similar results hold for T_randtop. Table 3.4 shows that 47 organizations (94%) have fewer than 10% false positives and that we found more than 90% of the ASes for 34 organizations (68%). Not only do these numbers show that our results generalize to large organizations, they also confirm that our weights are not overfitted as a result of using the ten organizations in T_tier1 and T_9org for training.

Finally, validation based on the truly unbiased dataset T_randall shows that our method performs even better for the majority of Internet-related organizations. In fact, as shown in Table 3.5, for almost all organizations (48, or 96%), the results are good with respect to false positives, and the results for almost as many (47, or 94%) are good for false negatives. The high accuracy with respect to false negatives follows from the fact that the majority of Internet-related organizations are simple and small, which makes finding all their ASes easy; indeed, 43 out of the 50 organizations in T_randall have only one AS.
category               orgs   percentage
false positive
  good                   48   96%
    perfect (=0%)        48   96%
  bad                     2   4%
    >40%                  2   4%
false negative
  good                   47   94%
    perfect (=0%)        47   94%
  bad                     3   6%
    >40%                  3   6%

Table 3.5: Validation of randomly selected organizations from all clusters.
Understanding Sources of False-Positives and False-Negatives
While the overall accuracy of our results is quite good, our approach performs poorly in
some cases. We next examine these cases to understand the limitations of our approach
and suggest possible future improvements.
The main cause of false positives is the lack of clear boundaries between organi-
zations. A typical real-world scenario involves ISPs and IT consulting companies that
often provide technical support, including the management of AS records, for their cus-
tomers. Thus, they share the same contact information with their customers which, in
turn, becomes a common reason for false-positives in our results. This scenario applies
to the three organizations with bad false-positive rates in Table 3.4 (two ISPs and one
tech-outsourcing company), and also to the two organizations in Table 3.5 with more
than 40% false-positives.
In the case of false-negatives, the main cause is missing or inaccurate company sub-
sidiary information. Although our clustering algorithm uses company subsidiary data
via the 4attr+10K input set, we encounter numerous cases where different subsidiaries
94
maintain distinct identities in WHOIS. For example, two organizations with bad false-
negative rates in Table 3.4 are Nippon Telegraph and Telephone (NTT) and Deutsche
Telekom, two large telecom organizations without U.S.-centric 10-K data. WHOIS
records that are out-of-date and do not reflect the correct subsidiary names make it
difficult to produce accurate clusters without external knowledge and result in false-
negatives.
Some ASes are not routed. To understand if WHOIS data for non-routed ASes
is stale, we examined the subset of our results that only contain the routed ASes and
compare it to the four validation datasets. We find that accuracy in the routed-only
subset is always very close to the results for the whole data, within 10% and usually differing by only a few percent. We conclude that active use as shown by routability does
not significantly affect our results.
3.3.4 Factors that Improve Accuracy
Figure 3.2 shows the significant improvements that our new method (new) is able to
achieve when compared to our earlier approach (old [CHKW10]). In terms of false-
negatives (the annotated arrows in the figure), the results for all ten organizations
improved, varying from 2% to 36%. In addition, four organizations (the Tier-1 ISP,
Verizon, Limelight and ISC) show improvements with respect to their false-positive
rates, varying from 4% to 129%. Only Akamai shows a slightly worse false-positive
rate, at 3%. Manual inspection shows that it is due to a new outsourcing arrangement.
We next evaluate what specific aspects of our new algorithm helped improve the
accuracy of our results and highlight the impact that our new data source (i.e., Form
10-K data) has on AS clustering.
[Figure 3.2: Comparison between our previous and current validation results. For each of the Tier-1 ISP, Verizon, CN Mobile, Comcast, TW Cable, Google, Yahoo, Akamai, Limelight, and ISC, paired "old" and "new" bars show the false-positive, true-positive, and false-negative rates.]
Avoiding incorrect assertions (false-positives)
A good clustering method should be able to clearly identify organizational boundaries
and put each AS into its own and relevant cluster. However, organization boundaries
are usually blurred by a number of real-world organizational relationships, includ-
ing tech-outsourcing, technical support, and joint ventures. These blurred organizational
boundaries result in false-positives when we try to associate ASes with their organiza-
tion.
To sharpen our discovered organizational boundaries and thus reduce false-positives,
we exploit hierarchical average-linkage clustering (see Section 3.2). Unlike single-linkage clustering, where any one link joins two clusters, average-linkage examines all links between any
two ASes in two clusters to judge the strength of the relationship. This weighting results
in tenuous links between otherwise well connected clusters getting severed, preventing
organizational boundaries from blurring together.
Among the organizations whose false-positive rates improved significantly, ISC and
Verizon stand out. ISC’s false-positive rate is slashed from 129% to 0% and Verizon’s
is reduced from 28% to 3%. We confirm that 70 of the 71 false-positives for ISC were
caused by a single linkage that existed because of tech-outsourcing. For Verizon, 59 of
the original 66 false-positives were caused by single linkages resulting from a combina-
tion of customer and tech-outsourcing relationships. In both cases, the use of average-
linkage results in less aggressive clustering by ignoring these weak relationships.
In a few cases, average-linkage is less accurate as it can ignore weak but legitimate
relationships between ASes. For instance, average-linkage cannot distinguish one or
two legitimate links between two ASes in an organization from the common pattern
where one link corresponds to tech-outsourcing or a joint-venture.
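A toy example, with hypothetical similarity values, makes the contrast explicit: two internally well-connected organizations joined by one weak outsourcing link are chained together by single-linkage but kept apart by average-linkage at the same cut.

```python
# Toy contrast between single- and average-linkage; similarities are
# hypothetical. A and B are two-AS organizations joined by one weak link.
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage
from scipy.spatial.distance import squareform

sim = np.array([
    # A1   A2   B1   B2
    [0.0, 0.9, 0.0, 0.0],   # A1-A2: strong intra-organization link
    [0.9, 0.0, 0.1, 0.0],   # A2-B1: one tenuous outsourcing link
    [0.0, 0.1, 0.0, 0.9],   # B1-B2: strong intra-organization link
    [0.0, 0.0, 0.9, 0.0],
])
dist = squareform(1.0 - sim - np.eye(4))  # condensed distance matrix

for method in ("single", "average"):
    labels = fcluster(linkage(dist, method=method), t=0.95,
                      criterion="distance")
    print(method, labels)
# single  -> one cluster: the lone 0.1 link chains A and B together
# average -> two clusters: averaged across all four cross pairs,
#            the tenuous link is too weak to merge the organizations
```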
Handling incomplete data (false-negatives)
Besides clearly identifying organizational boundaries, a good clustering method should
also make use of all critical information that relates an AS to its organization. For
example, contact information typically relates an AS to its direct operator. However, if
this direct operator is a subsidiary of a big company, relating it to its parent company may
require additional information. Unavailable critical information can result in missing
ASes, which in turn makes our results incomplete.
To reduce false-negatives and mitigate the problems caused by subsidiaries, our new
algorithm relies on a combination of more complete and up-to-date WHOIS data and
a novel information source in the form of Form 10-K data. In particular, we note that
two organizations (Akamai (2 ASes) and CN Mobile (1 AS)) benefit from using more
attributes than in [CHKW10]. An up-to-date WHOIS database is also very important
as can be seen in the cases of Yahoo (4 ASes), Google (3 ASes), ISC (2 ASes) and
Limelight (1 AS). We next examine in more detail how Form 10-K data helps to increase
completeness.
category                           orgs   percent
subsidiary information used          18    100%
  big improvement (>20%)              4     22%
  medium improvement (10-20%)         2     11%
  small improvement (<10%)            8     44%
  no improvement (=0%)                4     22%

Table 3.6: Improvement in false-negative rate when company subsidiary information is used.
Does company subsidiary information help?
To understand the role of 10-K data in improving accuracy, we evaluate all 50 organiza-
tions for which 10-K data applies. Most organizations in Figure 3.2 are included since
they are considered critical Internet players and have 10-K data available. Exceptions
are CN Mobile (non-U.S.) and ISC (non-public). We show that 78% of the organizations
improved after using 10-K.
To examine its benefits, we consider Form 10-K data for 50 U.S.-based public com-
panies and compare two sets of clustering results. One set is obtained by running our
new clustering algorithm with 4attr+10K as input (i.e., using company subsidiary infor-
mation). The other set is the result of running the same algorithm with 4attr as input.
We evaluate both sets of results using all four validation datasets and check for improve-
ments concerning the false-negatives.
Of the 50 companies considered, 18 are in our ground truth datasets. Table 3.6
shows that 14 out of 18 organizations (78%) see improvements. The improvements
are significant for four organizations (Limelight, Oracle, IBM and HP) because this new source of information captures most of their subsidiaries. It also helps four organizations (Cogent, VeriSign, Yahoo, Comcast), where we capture all of their ASes, obtaining zero
false negatives. In short, our approach can miss ASes of organizations which have a
complex structure or a long history of mergers and acquisitions. However, publicly
available Form 10-K data is able to alleviate this problem, often improving clustering.
3.3.5 Comparison with PCH
We next summarize a comparison of our results with PCH’s manually generated AS-to-
organization map. Full details of these experiments are in Appendix 3.8.
The PCH dataset (referred to as T_pch) is a database that relies on voluntary contributions from network operations personnel in many different organizations and is maintained by PCH to facilitate communication among the different players interested in a smooth functioning of the Internet. Compared to our results, T_pch covers many fewer organizations and ASes (960 organizations and 1,968 ASes).
To assess the quality of the PCH dataset, we compare it with T_tier1 and T_9org. The evaluation shows few false positives, but many false negatives in the PCH dataset. More specifically, while all organizations show less than 3% false positives, all organizations (except for Comcast and Time Warner Cable) miss more than 50% of the ASes; of those, six organizations miss more than 90% of their ASes.
We conclude that T_pch is in general correct but very incomplete. In contrast, while our results have a few more false positives, they have many fewer false negatives and cover many more organizations, the advantage of automatic clustering over manual contributions.
category              orgs           ASes           addresses
total                36463  100%    49262  100%
  multi-AS            4856   13%    17655   36%
  single-AS          31607   87%    31607   64%
total                36463          49262
  routed             27802  100%    34472  100%    2.5B  100%
    routing-complex   3165   11%     9835   29%    1.6B   64%
    routing-simple   24637   89%    24637   71%    0.9B   36%
  not routed          8661          14790

Table 3.7: Organization distribution by number of ASes in total, and ASes in routing tables.
category              orgs           ASes           addresses
total                36463  100%    49262  100%
  multi-AS            4856   13%    17655   36%
  single-AS          31607   87%    31607   64%
total                36463          49262
  routed             27682  100%    34260  100%    2.5B  100%
    routing-complex   3142   11%     9720   28%    1.6B   64%
    routing-simple   24540   89%    24540   72%    0.9B   36%
  not routed          8781          15002

Table 3.8: Organization distribution by number of ASes in total, and ASes in routing tables (from the second RouteViews site).
3.4 Prevalence and Influence of Multi-AS Usage
In this section we study our AS-to-organization mapping results. We first examine the
prevalence and influence of multi-AS usage (Section 3.4.1) and then investigate why
organizations use multiple ASes (Section 3.4.2).
3.4.1 Relevance of multi-AS organizations
The use of multiple ASes is not confined to large organizations but can also be found
among small organizations. Table 3.7 measures the number of ASes and organizations,
showing that 49,262 ASes map into 36,463 organizations. It illustrates that most organi-
zations (87%) are simple, using a single AS, and only 13% are multi-AS organizations.
However, many ASes are part of these multi-AS organizations: about 36% of all ASes
are assigned to multi-AS organizations. Since more than a third of ASes have other “sib-
ling” ASes in the same organization, this finding suggests that an AS-to-organization
map may be relevant to resolving AS-relationship discovery in routing [Gao01].
However, some of the allocated ASes are “moribund” and not in active use. To focus
on active ASes only, we next consider the subset of ASes that are routed. We obtain
routing tables from RouteViews [Mey13]; we see similar results when using routing
tables from a vantage point located in Japan. in Japan shown in Table 3.8.
To focus on active ASes, we define routing-complex and routing-simple organiza-
tions. We discard all non-routed ASes; organizations that route with multiple ASes are
called routing-complex, while those with only a single AS are routing-simple. We next
focus on routing-complex organizations since they actively use multiple ASes.
Not all routed ASes are equivalent since some are large and others small. Since
we cannot directly measure traffic or costs, we approximate the influence of an AS
and an organization by the number of addresses it announces (i.e., the influence of an
organization is thus the sum of the influences of its ASes).
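The aggregation itself is a simple sum; the sketch below shows it, with a hypothetical AS-to-organization map and hypothetical per-AS address counts (e.g., totals derived from announced prefixes).

```python
# Minimal sketch: organization influence as the sum of the addresses
# announced by its ASes. The map and counts below are hypothetical.
from collections import defaultdict

def org_influence(as_to_org, announced_addrs):
    influence = defaultdict(int)
    for asn, n_addrs in announced_addrs.items():
        org = as_to_org.get(asn)
        if org is not None:            # skip ASes we could not map
            influence[org] += n_addrs
    return dict(influence)

as_to_org = {15169: "Google", 36492: "Google", 64500: "ExampleNet"}
announced = {15169: 9_000_000, 36492: 65_536, 64500: 1_024}
print(org_influence(as_to_org, announced))
```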
Table 3.7 (bottom portion) shows that routing-complex organizations are very influ-
ential on the Internet, accounting for nearly two-thirds (1.6B or 64%) of all routed
addresses. Thus, even though there are few routing-complex organizations, and organi-
zations with 2 or more routed ASes constitute only 4% of all routed organizations, they
announce more than half of all routed addresses. Organizations with more than 5 ASes
(only 1%) announce about one third of all routed addresses.
This analysis shows that while there are relatively few multi-AS organizations and
even fewer routing-complex organizations, they are very influential in today’s Internet.
3.4.2 Causes of multi-AS usage
Organizations use multiple ASes for many different reasons. While some are transient,
others are persistent and unlikely to go away in time. Mergers and acquisitions are
examples of transient reasons, especially in cases where infrastructure is consolidated
following a merger. Persistent reasons are more varied, but usually result from legal
or policy pressure, either from internal or external factors. An example of an internal
policy decision is an ISP that chooses to use different ASes to implement internal routing
policies (e.g., Verizon’s use of different ASes on different continents [Ver11]). External
policy constraints include cases where legal conditions of mergers require that certain
business practices remain unchanged for an extended period. Regardless of the specifics,
we consider these constraints that last for years as “persistent”.
We next summarize our inference about both transient and persistent multi-AS usage
of six organizations. To this end, we define n_p as the number of top ASes that announce p% of an organization’s addresses. We focus on n_100 to describe how many ASes are routed in total and on n_80 to identify “core” ASes (stable, policy-based ASes).
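A minimal sketch of the n_p computation follows; the per-AS address counts are hypothetical.

```python
# Minimal sketch of n_p: the smallest number of top ASes (by announced
# addresses) that together cover p% of an organization's addresses.
def n_p(addr_counts, p):
    target = sum(addr_counts) * p / 100.0
    covered = 0
    for n, count in enumerate(sorted(addr_counts, reverse=True), start=1):
        covered += count
        if covered >= target:
            return n
    return len(addr_counts)

counts = [500_000, 250_000, 150_000, 60_000, 40_000]  # hypothetical org
print(n_p(counts, 80), n_p(counts, 100))  # -> 3 5 (three "core" ASes)
```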
To illustrate multi-AS usage, Figure 3.3 gives a historical account of the routability of all ASes that are part of Google as of 2011-09-01 (see Appendix 3.9 for additional examples). ASes are stacked by AS index (sorted by number of addresses currently announced and then by the first date routed), with horizontal bars indicating the periods when the ASes are routed (darker bars indicating membership in n_80). The first time an AS is routed is called its n_100 birth and the last time its n_100 death. Similarly, the first time of n_80 membership is called its n_80 promotion and the last time its n_80 demotion.

[Figure 3.3: Historical routability of Google ASes.]

From the graph we can see that two ASes have been announcing 80% of the
addresses for one year: Google’s main AS (AS15169, AS index: 1) and a WiFi-specific
AS (AS36492, AS index: 2), suggesting a stable routing policy.
In contrast, transient AS usage is often the result of acquisitions followed by AS
consolidation. Continuing with the Google example in Figure 3.3, in late 2006, Google
acquired Youtube (AS36561, AS Index: 16); as can be seen, the number of addresses
announced by this AS gradually decreased and the AS finally disappeared from BGP by
April 2011. This change suggests that, over time, Google consolidated this service into
its core infrastructure.
To contrast with the Google example, we also observed a case where routing policy
decisions promote AS diversification: ISC. Although only one AS announces most of
ISC’s addresses, we notice that since 2003, ISC has been using more and more ASes. Exam-
ining these new ASes, we see that each announces a single /24 address block. This
policy is consistent with the choice to associate a unique AS with each physical any-
cast location [ISC11] and ISC’s operation of the anycasted F-root DNS server. This
example illustrates how policy decisions can result in an increasing number of ASes per
organization and that this type of multi-AS usage is likely to last.
3.5 Incompleteness of AS-level and Importance of
Organization-level Topology
In this section we illustrate how accounting for multi-AS organizations impacts our
understanding of a number of Internet topology-related features.
3.5.1 Address Coverage of an AS vs. its Organization
We first show that it is necessary to consider entire organizations to get a complete
view of routed addresses and geographic coverage. Recall that routing-complex orga-
nizations use multiple ASes to operate their routed addresses. A common policy in
routing-complex organizations is to assign different ASes to specific geographic regions
to implement specific routing policies. Knowing the geographic footprint of an orga-
nization helps many applications, from understanding business strategies (for competi-
tors), to assessing the robustness and efficiency of the organization’s infrastructure (for
example, a CDN with wider geographic coverage may provide more reliable and faster
service to users). By contrast, ignoring organizations and evaluating each AS inde-
pendently produces a view of influence and geographic coverage that is necessarily
overly-narrow. We next quantify this incompleteness, both in terms of number of routed
addresses and geographic coverage.
[Figure 3.4: Missing address/city coverage from the main-AS view compared with the organization view for routing-complex organizations. (a) IP addresses: about one-third of organizations miss many (40-91%) addresses; (b) cities: about one-fifth miss many (40-90%) cities.]

To quantify how much a main-AS view underestimates address coverage, Figure 3.4(a) compares the two views. For each routing-complex organization, we compare the number of addresses announced by the AS that
announces the most addresses (i.e., main-AS view) to the number announced by all of
its ASes (i.e., org-level view). We calculate the percentage of addresses not announced
by its main AS (missed address coverage) and show the cumulative distribution of orga-
nizations based on this percentage. The higher the missed address coverage, the more
incomplete the view. As can be seen from Figure 3.4(a), the address coverage of almost
all organizations is incomplete, missing between 1% to 91% of the addresses. More
specifically, when considering an organization’s main AS only, then nearly one-third of
all organizations (933) miss a signification portion (40% to 91%) of their addresses.
To measure geographic coverage, we count the cities where these addresses are
located. We identify the city of each address using MaxMind’s CityLite geo-location
database [Max12]. This dataset provides worldwide coverage and claims that for the
U.S., 79% of the addresses are mapped with an accuracy of less than 25 miles. With
the help of this database, we identify the address locations for 2,631 routing-complex
organizations and compute the missed city coverage in a similar fashion as the missed
address coverage discussed earlier.²
Figure 3.4(b) shows that nearly half of the mapped
organizations (1,132) have an incomplete geographic coverage, missing at least one city.
Importantly, when reducing an organization to its main AS, a fifth of the organizations
(540) miss a significant portion of the cities (40% to 90%).
In summary, we observe that the traditional main-AS view of an organization
often misses a large number of relevant addresses and cities. We conclude that an
organization-level view is critical to accurately account for an organization’s geographic
extent and IP address coverage.
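The missed-coverage computation is shown below for addresses, with hypothetical per-AS counts; the city version is analogous, replacing counts with sets of city identifiers.

```python
# Minimal sketch of missed address coverage: the fraction of an
# organization's routed addresses not announced by its main AS (the AS
# announcing the most addresses). The counts are hypothetical.
def missed_addr_coverage(per_as_addrs):
    return 1.0 - max(per_as_addrs) / sum(per_as_addrs)

# A routing-complex organization whose main AS announces only half:
print(missed_addr_coverage([500_000, 250_000, 250_000]))  # -> 0.5
```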
3.5.2 Internet Exchange Point Coverage of ASes and Organizations
Internet Exchange Points (IXPs) are an important aspect of Internet topology [ACF+12].
To illustrate the value that an AS-to-organization map has on studying peering in IXPs,
we first show that the organizations that use multiple ASes to peer at IXPs (peering-
complex organizations) have a large role in the Internet. We then compare the main AS
view and corresponding organization-level view of peering-complex organizations and
discuss limitations of the former.
Importance of Peering-Complex Organizations
Large ISPs often use multiple ASes to implement routing policies for different geo-
graphical regions (Section 3.4.2). As a result, they tend to peer only with certain of
their ASes in certain locations. Similarly, large content and hosting providers also use
different ASes for, say, different continents. These ASes then peer with local access
networks at close-by IXPs to reduce transit cost and user latency.

² We omit 534 organizations (about 17%) for which we lack geolocation information.

[Figure 3.5: Cumulative distribution of unweighted (a) and address-weighted (b) peering-active organizations by number of ASes used to peer at IXPs.]

This explicit use of
multiple ASes suggests that considering only the main AS will offer only limited visibil-
ity into the true geographic reach of the corresponding organization and underestimate
its geographic coverage.
To quantify the prevalence of multi-AS peering, we apply our AS-to-Org mapping to
previously obtained AS-level IXP peering matrices [AKW09]. These inferred peering
matrices represent the current state-of-the-art but are known to be incomplete, and we
will comment below on how this incompleteness may affect our observations. The 2009
dataset lists 2,840 ASes and we map them to 2,503 organizations using our AS clus-
tering method. Nearly two-thirds of these organizations are routing-simple and do not
concern us here. In the following, we examine how the remaining 882 routing-complex
organizations affect our view of IXP peering.
[Figure 3.6: The different IXP peering views from the whole organization’s perspective and from the main AS’s perspective. Org1 comprises AS1-AS3, which peer with AS4-AS6 at IXPs A-C; the main-AS (AS1) view misses IXP C, peer AS6, and the peerings AS2-AS4 and AS3-AS6, i.e., 33% of the IXPs, 33% of the peers, and 40% of the peerings.]

Figure 3.5(a) shows the cumulative distribution of organizations as a function of how many ASes they use to peer at IXPs. As can be seen, most (715 organizations of
the 882) of these routing-complex organizations peer using one AS (they are peering-
simple). Although only about 19% (shaded area) of these routing-complex organiza-
tions are peering-complex (using multiple ASes at IXPs), Figure 3.5(b) shows that these
peering-complex organizations have a large influence on the Internet. Approximating
that influence by the number of announced addresses, Figure 3.5(b) shows that these
peering-complex organizations account for more than half (0.74B or 56%, shaded area)
of all routed addresses (that were announced by both peering-complex and peering-
simple organizations as of 2011-09-01). Since we know that our IXP data is incomplete,
these estimates are likely lower bounds on the number of peering-complex organiza-
tions.
Two examples from our 2009 data illustrate peering-complex organizations. Com-
cast uses 18 ASes to peer, the largest number we saw. These 18 ASes peer at 13 IXPs
located in 12 cities, mostly in North America and Europe. China Telecom uses 3 ASes
to peer and exports the largest number of addresses. These 3 ASes peer at 18 IXPs
located in 16 cities in Europe and North America.
Since organizations that peer at IXPs often use sophisticated routing policies, it can
be argued that an organization-level view is more important for properly evaluating their
presence at IXPs than for the general evaluation discussed in Section 3.5.1.
Implications of Peering-Complex Organizations
Next we quantify the degree of underestimation that results from reducing peering-
complex organizations to the commonly-used main-AS view. To this end, we focus
on peering-related metrics such as geographic reach (measured by number of IXPs or
number of cities with an IXP), number of peers, and number of peering links.
For each organization, we measure how much each of these metrics is underesti-
mated by comparing the size of the organization-level view and the corresponding main-
AS view in the relevant sub-graphs. The specific definition of “main-AS” here depends
on the metric; we chose the AS with the most IXPs, or peers, or links. An organiza-
tion’s subgraph is the subset of the IXP map that only contains the IXPs, peers and links
related to that organization. The subgraph for an organization’s main-AS view consists
of the IXPs, peers, or links associated with only the main AS and is always a subset of
the organization’s subgraph.
Figure 3.6 gives an example of these measures. Org1’s subgraph is shown as the
whole figure. In the case where AS1 is Org1’s main AS, the gray-shaded portion of the
figure refers to the corresponding main-AS subgraph. Org1 consists of three ASes and
peers at three IXPs with a total of five peering links. In contrast, AS1, its main AS, peers
only at two IXPs and has a total of three peering links.
[Figure 3.7: Missing IXP/city coverage from the main-AS view compared with the organization view for peering-complex organizations. (a) IXPs: 58 orgs miss 20-69% of their IXPs; (b) IXP cities: 48 orgs miss 20-58% of their cities.]
We quantify the degree of underestimation by counting the fraction of IXPs, peering
ASes and peering links, respectively, that one will miss by considering only the main-
AS’s graph. For instance, in the case of the example in Figure 3.6, we can see that the
main-AS view misses 33% of the IXPs, 33% of the peers and 40% of the peerings.
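The sketch below computes these three missed fractions on a toy subgraph whose totals are consistent with the Figure 3.6 numbers; the exact placement of links is illustrative rather than taken from the figure.

```python
# Minimal sketch of the underestimation metrics: compare the main-AS
# subgraph with the organization-level view. The toy data is hypothetical
# but reproduces the 33%/33%/40% example of Figure 3.6.
def missed_fraction(per_as_sets, main_as):
    org_view = set().union(*per_as_sets.values())   # organization subgraph
    return 1.0 - len(per_as_sets[main_as]) / len(org_view)

ixps = {"AS1": {"IXP A", "IXP B"}, "AS2": {"IXP B"}, "AS3": {"IXP C"}}
peers = {"AS1": {"AS4", "AS5"}, "AS2": {"AS4"}, "AS3": {"AS6"}}
links = {"AS1": {("AS1", "AS4", "IXP A"), ("AS1", "AS5", "IXP A"),
                 ("AS1", "AS4", "IXP B")},
         "AS2": {("AS2", "AS4", "IXP B")},
         "AS3": {("AS3", "AS6", "IXP C")}}

for name, graph in (("IXPs", ixps), ("peers", peers), ("links", links)):
    main = max(graph, key=lambda a: len(graph[a]))  # per-metric main AS
    print(name, round(missed_fraction(graph, main), 2))
# -> IXPs 0.33, peers 0.33, links 0.4
```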
Using all 167 peering-complex organizations, Figure 3.8 shows the cumulative
distributions of the degree of underestimation for the different metrics of interest.
For example, in terms of (geographic) coverage, we observe that the main-AS view
often provides a limited perspective of the (geographic) coverage of the corresponding
organization-level view: about one-third of all organizations miss at least 20% of IXPs
and cities (see Figure 3.7(a) and Figure 3.7(b)). Turning to peers, Figure 3.8(a) shows
that the majority (144) of organizations miss some peers and about one-third of organi-
zations miss more than 20% of the peers. These omissions can lead to a false inference
of an organization’s peering strategy.

[Figure 3.8: Missing peer/link coverage from the main-AS view compared with the organization view for peering-complex organizations. (a) peering ASes: 56 orgs miss 20-63% of their peers; (b) peering links: 85 orgs miss 20-66% of their links.]

Lastly, the underestimation is worst with respect
to peering links. Figure 3.8(b) shows that almost all organizations are missing some
peering links when compared to the main-AS view, with half of the organizations miss-
ing more than 20%. The underestimation of peering links can result in underestimates
of the organizations’ IXP-specific connectivity fabrics.
3.6 Conclusions
In this chapter, we described a new approach that automatically yields AS-to-
organization maps of the Internet by exploring WHOIS data. We show that the accu-
racy of our maps is superior to existing maps. The improved accuracy results from the
development of a better clustering approach for assigning ASes to organizations and our
reliance on company subsidiary data contained in the annual U.S. SEC Form 10-K fil-
ings. We validate our new AS-to-organization map against a “best-effort” ground truth.
Finally, we show that accounting for all the ASes of an organization provides a much
more accurate picture of an organization’s properties as compared to the traditional and
commonly-applied view that equates an organization with its main AS. In particular, we
illustrate that this “main-AS” view is seriously flawed with respect to properties such as
an organization’s size, geographic footprint, and IXP peerings.
This chapter serves as a second concrete example to support our thesis statement.
We show that carefully-designed clustering approaches can overcome various limita-
tions of WHOIS data in order to achieve an accurate AS-to-organization mapping. We
analyze and demonstrate that the improved accuracy compared to prior work results from
this systematic clustering approach. Hence, our study suggests the potential to re-study
many prior problems with more systematic approaches.
More importantly, the work in this chapter can inspire future work in two ways.
First, when no single type of data is sufficient to reach the goal, one can consider com-
bining multiple types of data. In this way, different types of data can complement each
other’s shortcomings and together achieve a desired quality. Prior work has already
practiced this idea [AKW09, SBS08]; our work extends its applicability by using it
in the context of AS-to-organization mapping. Second, when noise in data distracts
researchers from identifying useful information, one can consider employing the clus-
tering approach. Clustering approaches group entities together to study them in groups
in which noise can often be neutralized. We have already seen a recent study that prac-
tices this idea [CFH+13].
In the previous two chapters, we have studied two goals that we could achieve with available data. In the next chapter, we study a more ambitious goal, one that is extremely useful in the real world but lacks available data. We demonstrate how to
achieve a goal when some data is not available.
3.7 Appendix: Training in Detail
In Section 3.2.1, we briefly described how we tuned the weights for the different attribute
types. In this section, we give a more detailed description, describing our training
method in Section 3.7.1, and providing details about our training results in Section 3.7.2.
3.7.1 Training Methodology
We first list the parameters to optimize (Section 3.7.1), then set aside a training set
of about 10,000 ASes (Section 3.7.1), and finally define the objective function that we
attempt to optimize (Section 3.7.1) by relying on a parallel hill climbing algorithm (Sec-
tion 3.7.1).
Parameters
We consider several parameters that greatly affect our AS-to-organization mapping
results and list them below. Our focus here is to determine the best possible weight
vector ŵ for these parameters to improve our results.
First, how specific or general should attributes be? We group attributes into either
the 4attr or 66attr set.
Second, what data sources should we use? Besides WHOIS data, we can choose
either to use or omit the 10-K data.
Third, when should we merge clusters or leave them distinct? We must determine
a cutting threshold for hierarchical clustering; this threshold is used to decide when
a similarity score is too low to keep two clusters together. As stated in Section 3.2.1,
our similarity score is proportional to the sum of the weights. Thus, we first define two fixed cutting thresholds, a conservative one at 0.01 and a more liberal one at 0.001, and then let the training algorithm walk through different weight vectors. After the best weight vector is selected, we normalize it and adjust the cutting threshold accordingly.

We consider each of the combinations of the four attribute sets (4attr, 66attr, 4attr+all10K, 66attr+all10K) and two cutting thresholds (0.01 and 0.001) and optimize weights accordingly.
Sample Dataset
We first need to select a training sample of ASes. An ideal training set should be verifi-
able, representative and computable. Verifiable means that we have sound ground truth
to evaluate the clustering results in order to guide training. Representative means that
the sample contains an appropriate subset of all ASes, so both false positives and false
negatives can be captured. Computable means we can evaluate a training run reasonably
quickly on a commodity computer. Finding that the memory requirement for clustering
represents the bottleneck for training, we adjust the size of the training set accordingly.
However, there are two problems with obtaining the ideal training set. First, we have high-confidence ground truth only for ten organizations (the Tier-1 ISP and the nine organizations in Section 3.3.1). Thus, only results for these ten organizations are verifiable with respect to both false positives and false negatives. Second, clustering is very memory- and CPU-intensive if the dataset is large (memory ∝ N², time ∝ N³ for N ASes in the training dataset). Clustering with some 50K ASes requires about 24GB
memory and takes approximately 3 days on a large-memory computer. Due to limited
access to this hardware and timing limits imposed by the large parameter space, we train
only on a subset of the data.
To this end, we created a training sample with 9710 ASes, selecting about one fifth of
the whole AS population. This size is small enough to make the analysis tractable—one
clustering based on one weight vector requires about 1 GB of memory and takes about
20 minutes. One round of hill climbing walks through approximately 60 weight vectors
before it converges. We run 32 rounds for each of the eight combinations of attribute
sets and cutting thresholds. Thus in total, the training takes 20 minutes per case, with
60 × 32 × 8 ≈ 15k cases, or 213 days of compute time. We carry this work out in
parallel on 32 processors over about 7 days. To make the training sample verifiable, we
begin by seeding it with all ASes known to be in the ten organizations (736 ASes in
total). To make it representative, we then add about 9K additional ASes of “noise” as
described below. We then train on this dataset to choose the best clustering scheme and
parameters.
We select our parameters based on training with a purposefully-chosen subset of
ASes. We considered two approaches to choose “noise” ASes to fill out the training
set: select them randomly from all ASes, or select them preferentially in the sense of
being close to, but not in, the ten organizations. We say an AS is close to the ten
organizations if this AS is likely to be clustered with them under a general attribute
set and a very low cutting threshold. In the case of preferential selection, we select all
ASes that cluster with the ten organizations under the 4attr+all10K attribute set with
0.0001 cutting threshold. This preferential approach biases the noise to make training
more difficult, mainly because it is easy to accidentally cluster nearby ASes with known
organizations. In Section 3.7.2 we confirm that our biased noise produces more accurate
parameters compared to using a purely random selection.
Objective Function
Here we define the objective function f(ŵ) used in Section 3.7.1 to judge what value for ŵ best reflects clustering quality. A simple function would sum false positives and false negatives for the ten organizations, but we found this approach to be very sensitive to outliers. To avoid this problem, we instead sum the quartiles of false positives and false negatives. Let fp_Q1, fp_Q2, and fp_Q3 be the first, second and third quartiles of false-positive rates validated by the ten organizations, and fn_Q1, fn_Q2, and fn_Q3 be the false-negative rate quartiles (see Equation (3.3) in Section 3.3.2 for the false-positive/negative rate definition). The objective function is then defined as

f(ŵ) = Σ_{i=1..3} (fp_Qi + fn_Qi)

and leverages the false-positive and false-negative rates for comprehensiveness of our objective and also discards outliers by using representative quartiles. Based on this definition, a lower f(ŵ) means a better ŵ, and thus the goal of the hill climbing algorithm is to minimize f(ŵ).
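The sketch below evaluates this objective on hypothetical rate lists; note how the quartile sum barely moves when one organization is an extreme outlier.

```python
# Minimal sketch of f(w): the sum of the three quartiles of the
# false-positive and false-negative rates. The rate lists are hypothetical.
import numpy as np

def objective(fp_rates, fn_rates):
    qs = (25, 50, 75)   # first, second, and third quartiles
    return (sum(np.percentile(fp_rates, q) for q in qs) +
            sum(np.percentile(fn_rates, q) for q in qs))

fp = [0, 0, 0, 2, 3, 5, 7, 8, 10, 120]   # one extreme outlier (120%)
fn = [0, 0, 1, 2, 4, 9, 14, 20, 27, 36]
print(objective(fp, fn))  # the 120% outlier barely affects the score
```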
Algorithm: Parallel Hill Climbing
The algorithm we chose is parallel hill climbing. The basic hill climbing algorithm starts from a random ŵ-value and iteratively tries to find a better one by changing one element of it and judging whether the new one produces a better clustering result. The algorithm iterates until it reaches a local optimum where no improvements can be found around the final ŵ, even after an exhaustive search of nearby configurations.

Basic hill climbing is fast at finding a local optimum, but it cannot determine if that point is a global optimum. To increase the chances of finding a global optimum, we use parallel hill climbing, which starts from multiple random ŵ-values. Provided it iterates long enough, parallel hill climbing finds local optima with high probability. With enough initial positions, parallel hill climbing will find a good global value with high probability, provided the parameter space is relatively smooth. As illustrated in Figure 3.10, this assumption appears to hold in our case.

[Figure 3.9: Converging training results with parallel hill climbing: the best score per round for each of the eight attribute-set/threshold combinations.]

We run 32 parallel hill climbing processes for each of the eight combinations of attribute set and cutting threshold. Table 3.9 shows, in parentheses, the number of different ŵ-values examined in the training space for each combination. Each initial value searches a path through the space until reaching a local optimum. With 32 rounds of searching, parallel hill climbing typically explores 1 to 2k weight vectors. Although our random walks cover less than 1% of the training space, as Figure 3.9 shows, the best (lowest) scores for all eight attribute/threshold combinations converge fairly quickly, usually after 9 rounds. Thus, we argue that 32 rounds are sufficient to find a local optimum and to determine a set of “good” parameters.
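For illustration, a runnable sketch of the parallel search is given below. The score function here is a hypothetical stand-in for running the clustering and evaluating f(ŵ) on the training set, since the real evaluation takes about 20 minutes per weight vector.

```python
# Minimal sketch of parallel hill climbing over weight vectors. score()
# is a hypothetical placeholder for clustering + f(w) evaluation.
import random

def score(w):
    return sum((wi - t) ** 2 for wi, t in zip(w, (3.0, 0.4, 0.4, 0.2)))

def neighbors(w, step=0.1):
    # Perturb one element at a time, keeping weights non-negative.
    for i in range(len(w)):
        for delta in (-step, step):
            cand = list(w)
            cand[i] = max(0.0, cand[i] + delta)
            yield tuple(cand)

def hill_climb(w):
    best = score(w)
    while True:                              # walk until a local optimum
        cand = min(neighbors(w), key=score)
        if score(cand) >= best:
            return w, best
        w, best = cand, score(cand)

random.seed(0)
starts = [tuple(random.uniform(0, 4) for _ in range(4)) for _ in range(32)]
results = [hill_climb(w) for w in starts]    # 32 independent rounds
print(min(results, key=lambda r: r[1]))      # best local optimum found
```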
                     Cutting Threshold
Input                0.01          0.001
4attr                44.5 (1.5k)   45.0 (2.2k)
4attr+all10K         45.5 (1.1k)   47.0 (1.0k)
66attr               95.0 (2.5k)   75.5 (2.5k)
66attr+all10K        65.5 (2.2k)   63.0 (2.2k)

Table 3.9: Summary of training results. Best score: 44.5. Numbers of weight vectors examined are in parentheses.
3.7.2 Details of Training Results
In this section, we first present the best parameters we found and then discuss how
different parameters affect the results.
Best parameters
Table 3.9 gives a summary of our training results. The best attribute set/threshold combination is 4attr+0.01 with score 44.5; the combination 4attr+0.001 has a very similar score, indicating that the result is not very sensitive to the value of the cutting threshold. The corresponding weight vector is given by ŵ = {3, 0.4, 0.4, 0.2}, and if we normalize so the sum of the weights is 1, then ŵ = {0.75, 0.1, 0.1, 0.05}, with cutting threshold 0.0025. Either way, this solution quantifies the importance of the different attribute types. It emphasizes the importance of OrgID, and downplays contact ID, phone and email. This result confirms our expectation that while OrgID is very informative for the purpose of clustering, email is much less useful. Since OrgIDs are intended for common administrative management, they are unlikely to cause false positives; on the other hand, since contact information can be registered by outsourcing third parties, it can easily introduce false positives.
To verify that our method of training dataset selection is appropriate, we also trained using a purely random selection of 9K noise ASes. While the best score resulting from training with this input data is better than that resulting from training with preferential noise (34.5 instead of 44.5), the resulting weights (ŵ_R = {0.4, 0.28, 0.24, 0.08}) and threshold (0.0004) perform much worse when applied to the whole dataset, scoring 86.5 compared to 49.5 with preferentially-derived parameters. This result confirms that our biased, more challenging training dataset improves overall accuracy by finding more effective weights.
Parameter discussion
In this section, we discuss several factors affecting our training results and explain
why other parameters are not preferred.
Attribute generalization Attribute/threshold combinations with 66 attribute types per-
form much worse than the ones with four attribute types. This is because dividing a
general attribute type into many specific sub-types may break clustering links and thus
may lead to higher false-negative rates. With the 4attr (or 4attr+all10K) set, admin-
istrative contact email @example.com (belongs to AS1) and technical contact email
@example.com (belongs to AS2) are of the same type (both belong to email type), thus
they will be compared with each other. Since these two emails are of the same value,
AS1 and AS2 will be linked together. However, with the 66attr (or 66attr+all10K) set,
administrative email and technical email are of different types. As mentioned in Sec-
tion 3.2.1, different types of attributes are orthogonal, thus they will not be compared
and AS1 and AS2 will not be linked.
[Figure 3.10: Parallel hill climbing with attribute set 4attr+all10K and cutting threshold 0.01: score as a function of the OrgID, contactID, phone, and email weights, with the company subsidiary (10-K) weight always shown on the y axis.]

Attribute weights Although we chose a best value for the weight vector ŵ in Section 3.7.2, we see the training space is fairly flat. Figure 3.10 visualizes the training
space for attribute set 4attr+all10K. As can be seen, the score is not very sensitive to
the weight changes. Instead, it is more sensitive to the number of attribute types selected, as shown in Table 3.9.
3.8 Appendix: Validation with Broader Coverage
(PCH)
In this section, we use PCH’s manually generated AS-to-organization map to validate our work with a dataset that provides broader coverage than our carefully chosen validation datasets. We evaluate this related work and demonstrate its incompleteness (Section 3.8.2). Although incomplete, we use it to test our clustering algorithms for false negatives (Section 3.8.3), and present the results in Section 3.8.4.
3.8.1 Validation Dataset
The PCH dataset (referred to as T_pch) is a database that relies on voluntary contributions from network operations personnel in many different organizations and is maintained by PCH to facilitate communication among the different players interested in a smooth functioning of the Internet. Compared to our other validation datasets, T_pch covers many more organizations (960 in PCH, compared to a total of 110 organizations in our other datasets). In particular, T_pch is more diverse in terms of organization sampling (similar to T_randall), mainly because it covers many “small” organizations with fewer ASes (mean cluster size is only 2). We also expect it to be more unbiased than, for example, T_randtop.
The PCH data is a table with three columns: AS, shortorg and longorg. Longorg
and shortorg are full and abbreviated names of the organization to which the AS is
assigned; e.g., “Internet Systems Consortium, Inc.” and “ISC”. There is no strict format
for shortorg and longorg, and not every AS has both shortorg and longorg. Longorgs are
20 times more frequent than shortorgs; however, they are usually verbose and contain
details that make string matching hard.
Because longorgs make clustering difficult, we identify AS clusters in T_pch by shortorg. ASes with the same shortorg are clustered together and identified as belonging to one and the same organization. As Table 3.2 shows, this process results in about 2K ASes grouped into 960 clusters.
[Figure 3.11: Three definitions of validation metrics, illustrated on a five-AS ground-truth cluster: (a) biggest compares only the largest overlapping PCH cluster (3 true positives, 2 false negatives, 0 false positives); (b) all takes the union of every overlapping cluster (4 true positives, 1 false negative, 2 false positives); (c) pair counts AS-pair links rather than ASes (3 of 10 ground-truth pairs found, 7 missed, 2 wrongly asserted).]

[Figure 3.12: Evaluation of the PCH dataset, compared with the Tier-1 ISP and the 9 organizations (Verizon, CN Mobile, Comcast, TW Cable, Google, Yahoo, Akamai, Limelight, ISC), under the biggest, all, and pair metrics.]

Although PCH data covers many more organizations than our other validation sets, it covers fewer ASes for each organization, and these ASes may fall into different shortorg clusters (see Section 3.8.2). This drawback of PCH data poses challenges for our
validation efforts. In particular, we cannot validate false-positives because the ground
truth itself is incomplete.
3.8.2 Evaluation of PCH Dataset with Strong Ground Truth
Because the PCH dataset is the result of a largely voluntary effort, we expect that it
will be less complete than the datasets we have built ourselves. Therefore, we first
evaluate the completeness of the PCH dataset before using it to judge the accuracy of
our clustering algorithm.
To assess the quality of the PCH dataset, we compare it with T_tier1 and T_9org. We use three different validation metrics to achieve three different objectives: gain some intuition, obtain error bounds, and calculate correction factors.
We first use the same validation metric described in Section 3.3.2 (Figure 3.11a). This definition is the most intuitive one, and it allows us to compare PCH’s data quality with our results. For any given ground truth AS cluster C_0 in T_tier1 or T_9org, we select the cluster C_1 in T_pch that has the largest overlap with C_0, and then compare them (Figure 3.11a). However, this method only gives us an approximate assessment of PCH’s data quality. For example, it ignores other clusters that overlap with C_0, and thus misses both true-positive and false-positive ASes in these clusters. We therefore use a second definition to obtain some bounds on the errors.
definition to obtain some bounds on the errors.
The second approach considers not just the biggest cluster, but selects all overlap-
ping clusters (say, C
1
and C
2
in Figure 3.11 b). We then then take the union of all
these overlapping clusters and compare the resulting cluster with the ground truth clus-
ter. Given that this approach covers all clusters, it provides a lower bound for missing
ASes (false-negatives) and an upper bound for wrong assertions (false-positives).
Later in Section 3.8.3, we add a third definition to evaluate the PCH data. The
purpose of this evaluation is not to assess the quality of PCH data, but rather to obtain
correction factors that can be applied when we validate our results using PCH data as
ground truth. The need for this third metric arises from the incompleteness of the PCH
data.
[Figure 3.13: The adjusted false-negative rate. P_pch covers a fraction k of the ideal ground truth P_ideal, and b is the fraction of P_ideal that PCH misses but our results P_ours cover; the corrected rate is fn*_p = 1 - ((1 - fn_p)|P_pch| + b|P_ideal|) / (|P_pch| + (1 - k)|P_ideal|) ≈ 1 - [k(1 - fn_p) + b].]

Figure 3.12 shows the evaluation results for the PCH data. Using the biggest and all metrics, we observe few false-positives, but many false-negatives. Checking the results for the biggest metric, only the Tier-1 ISP has a small false-positive rate, with all other organizations, except for Comcast and Time Warner Cable, missing more than
When comparing these results with the ones in Figure 3.2, we see that our clustering approach outperforms the clustering provided by PCH.
The second definition, the all metric, produces an upper bound for false-positives and a lower bound for false-negatives. We see that except for Time Warner Cable, all organizations have at most 3% false-positives. The Time Warner Cable situation is caused by a single AS with incorrect information in PCH that introduces 15 false positives. As for the lower bound on false-negatives, 6 organizations miss at least 90% of the ASes; this confirms the incompleteness of the PCH data.
We thus conclude that T_pch is relatively correct but incomplete. For the purpose of validating our clustering results, we find that T_pch is suitable for assessing false-negatives; that is, if two ASes are in the same cluster in T_pch, then they should be in the same cluster in our results.
3.8.3 Validation of Our Results with PCH
To validate our results with T_pch, we introduce a new definition of the false-negative rate, fn_p, to address PCH's incompleteness. Note that since the PCH data is incomplete, we do not validate the false-positive rate of our results with T_pch.
The challenge that the PCH data poses is how to relate AS clusters to organizations. In the cases of T_tier1 and T_9org, each AS cluster is associated with exactly one organization. However, due to PCH's incompleteness, one organization can have multiple AS clusters. Which cluster should we choose to compare with the one in our result? Furthermore, how do we identify multiple AS clusters that belong to the same organization in the first place?
This challenge makes it hard to calculate errors for each organization as we did with T_tier1 and T_9org. Instead, we compute a single false-negative rate for all clusters. To compute this single rate, we introduce the pair metric that counts AS pair links rather than ASes (see Figure 3.11c). In contrast with the previous two metrics, only this new metric can capture the clustering results when aggregated into a single rate.
To formally define this single false-negative rate, consider all AS pairs (x, y), and set

P_pch := { AS pairs (x, y) such that org_pch(x) = org_pch(y), org_pch(x) ∈ T_pch }

and

P_ours := { AS pairs (x, y) such that org_ours(x) = org_ours(y), org_ours(x) ∈ R_ours }
where org_pch(x) and org_ours(x) are the corresponding organizations of AS x in PCH and in our results, respectively. We define the relative false-negative rate of our results compared with the PCH data as

fn_p = 1 - |P_pch ∩ P_ours| / |P_pch|
This relative rate fn_p only tells us how well we did compared with the PCH data (but not compared to the ideal ground truth). To account for this effect, we introduce two correction factors, k and b, to obtain an approximate estimation of the absolute false-negative rate. Figure 3.13 illustrates the correction process. We first obtain the relative false-negative rate (fn_p) by comparing our results with the PCH data. We then compute (i) how much of the "ideal ground truth" PCH covers (k), and (ii) how much of the ideal ground truth PCH misses but our results cover (b). Finally we calculate the corrected false-negative rate (fn*_p) using the equation shown in the graph. The corrected false-negative rate takes PCH's incompleteness into consideration by re-weighting the relative rate (multiplication by k) and amending the missing portion (addition of b).
However, since we do not have the "ideal ground truth", we approximate it using either P_tier1 or P_9org. We caution that this approximation may introduce errors.
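To make this correction concrete, the following sketch (ours, for illustration only; the function names and input formats are assumptions, not part of the dissertation's tooling) computes the relative pair-based false-negative rate and applies the correction of Figure 3.13:

    from itertools import combinations

    def same_org_pairs(org_of):
        # All unordered AS pairs (x, y) that map to the same organization.
        by_org = {}
        for asn, org in org_of.items():
            by_org.setdefault(org, []).append(asn)
        pairs = set()
        for members in by_org.values():
            pairs.update(frozenset(p) for p in combinations(sorted(members), 2))
        return pairs

    def relative_fn(org_pch, org_ours):
        # fn_p = 1 - |P_pch intersect P_ours| / |P_pch|
        p_pch = same_org_pairs(org_pch)
        p_ours = same_org_pairs(org_ours)
        return 1.0 - len(p_pch & p_ours) / len(p_pch)

    def corrected_fn(fn_p, k, b):
        # fn*_p ~= 1 - [k(1 - fn_p) + b], the final form in Figure 3.13.
        return 1.0 - (k * (1.0 - fn_p) + b)

    # Sanity check against Table 3.10: with the P_tier1 factors k=0.30 and
    # b=0.15, corrected_fn(0.29, 0.30, 0.15) returns 0.637, the ~64% reported.

As the sanity check suggests, the corrected rates in Table 3.10 follow directly from the relative rate and the two correction factors.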
3.8.4 Validation Results
Table 3.10 shows the relative and the corrected false-negative rates of our results for the PCH data. We missed 29% of the AS pairs in PCH. If corrected by either P_tier1 or P_9org, the false-negative rates increase to 64% or 44%, respectively.
P_ideal    P_tier1   P_9org
fn_p           29%
fn*_p      64%       44%
k          30%       15%
b          15%       45%
1-k'       67%       41%
Table 3.10: Validation results by PCH.
Note that 1-k' is actually the relative false-negative rate of our results compared with the ideal ground truth (see Figure 3.13). Since we use P_tier1 or P_9org to approximate the ideal ground truth, 1-k' shows how many AS pairs we missed for our 10 organizations. We see that the corrected false-negative rates are consistent with the validation results for either T_tier1 or T_9org.
In summary, we conclude that our clustering results remain consistently accurate when validated against this much broader set of ground truth.
3.9 Appendix: Persistence of Multi-AS Usage
In Section 3.4.2, we briefly examined why organizations use multiple ASes, classify-
ing causes into being either transient or persistent in nature. In this section we look
at these causes in more detail. We first develop a method to classify organizations to
understand transient and persistent AS use (Section 3.9.1). We then give two exam-
ples to illustrate our classification method (Section 3.9.2) and subsequently verify the
practicality and correctness of our method (Section 3.9.3). Lastly, we apply our classi-
fication to all organizations identified by our AS-to-org mapping effort and demonstrate
the prevalence and persistence of multi-AS usage (Section 3.9.4).
3.9.1 Evolution of multi-AS usage
While our analysis of the current Internet ecosystem shows that many ASes are part of
multi-AS organizations, we wish to understand if this finding is an artifact of today’s
Internet, or if multi-AS usage is growing or shrinking. To this end, we examine the
per-organization changes of ASes active in routing tables over time.
We evaluate the importance of an AS by counting the number of addresses it orig-
inates. A persistent AS will be the origin of prefixes for years, while a transient AS’s
announcements will eventually vanish. We measure how many important ASes an orga-
nization has by counting the number of ASes that announce 100% and 80% of the
addresses for each organization (denoted by n_p, where p ∈ {100, 80}), and examine the
trend of this number. A constant number indicates persistent usage, while a changing
number may indicate transient usage, except for possible “noise” that first needs to be
quantified. Based on observed trends in these numbers, we classify organizations into
four categories: inconsistent, constant, consolidating, diversifying. We also bound the
number of organizations that will keep using multiple ASes. The method takes the fol-
lowing steps.
First, to establish a base to compare with historical snapshots, we obtain a current
address setA for each organization. We match all IPv4 addresses to ASes based on the
current (i.e., 2011-09-01) global routing table snapshot. Then, based on our AS-to-Org
mapping results, we group ASes belonging to the same organization and their addresses
and thus obtain the address setA for each organization.
Second, for each organization with address set A, we count n_p for each historical snapshot. In particular, we obtain a snapshot of the global routing tables from Route Views every month from 2001 to 2011, and calculate n_p for all snapshots (a sketch of this counting step follows).
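A minimal sketch of this counting step (ours; interpreting n_p as the smallest number of top ASes whose announced addresses cover p% of the organization's address set A is our reading of the definition above, and the input format is an assumption):

    def n_p(addr_counts, p):
        # addr_counts: ASN -> number of addresses of A that this AS
        # originates in one monthly routing snapshot.
        total = sum(addr_counts.values())
        covered, count = 0, 0
        for addrs in sorted(addr_counts.values(), reverse=True):
            covered += addrs
            count += 1
            if covered >= total * p / 100.0:
                break
        return count

    # e.g., n_p({"A": 900, "B": 80, "C": 20}, 80) -> 1, and with p=100 -> 3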
Third, we use a simple heuristic to quantify and classify the trend of n_p. Specifically, we use simple linear regression and fit observations over a recent period with a linear model n̂_p(i) = βi + α, for each observation i in the last M months. Small slopes indicate steady n_p, implying a near constant number of ASes. A moderate or large positive β indicates growth in the number of key ASes used; that is, a diversifying organization. A moderate or large negative slope indicates reduction or consolidation in multi-AS use.
We estimate the strength of our estimation by summing the residuals:

ε = (1 / (M - 2)) ∑_{i=1}^{M} (n̂_p(i) - n_p(i))²,
where M - 2 denotes the degrees of freedom [FS69]. We consider the trend to be inconsistent if ε > ε_0, where large changes are bounded by the constant ε_0. We set ε_0 = 1 (i.e., the fluctuation is limited to 1 AS) and β_0 = 0.1 (i.e., constant usage means less than 1 AS of growth or reduction every 10 years). Replacing this simple heuristic with a more rigorous trend analysis is part of our future work.
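The heuristic is compact enough to sketch in full. In the sketch below (ours), converting the monthly index to years, so that the slope β is in ASes per year and matches the β_0 = 0.1 threshold, is an assumption:

    import statistics

    def classify(n_p_series, M=24, beta_0=0.1, eps_0=1.0):
        # Fit n_hat_p(i) = beta*i + alpha over the last M monthly
        # observations and classify the trend (needs at least 3 points).
        ys = n_p_series[-M:]
        xs = [i / 12.0 for i in range(len(ys))]   # months -> years
        xm, ym = statistics.mean(xs), statistics.mean(ys)
        beta = sum((x - xm) * (y - ym) for x, y in zip(xs, ys)) / \
               sum((x - xm) ** 2 for x in xs)
        alpha = ym - beta * xm
        # residual error with M-2 degrees of freedom (epsilon above)
        eps = sum((alpha + beta * x - y) ** 2
                  for x, y in zip(xs, ys)) / (len(ys) - 2)
        if eps > eps_0:
            return "inconsistent"
        if abs(beta) <= beta_0:
            return "constant"
        return "diversifying" if beta > 0 else "consolidating"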
3.9.2 Case studies of multi-AS usage
To illustrate how these metrics reflect real-world policies, Figure 3.14 shows the results
of our classification heuristic for two organizations: Google and Comcast.
Both organizations show a number of moribund ASes. The height of each graph is scaled to the number of ASes we discover in the WHOIS data, and the black n_100 line shows the number of ASes that are routed. The difference shows that each organization has about one-third unrouted ASes (as of 2011-09-01: Google has 8 moribund ASes, or 36%; Comcast, 13, or 27%).
We next describe our classification scheme applied to Google and Comcast. We focus on the past M = 24 months (from 2009 to 2011). The trend in multi-AS usage is visualized by the slope of the upper (lower) short light line n̂_100 (n̂_80). As the graph shows, Google exhibits a declining slope (negative β) when using all routed ASes (n_100), but a flat slope (small |β|) when using only the core ASes (n_80). This suggests that the core ASes are fairly stable, while smaller and less important ASes are being phased out over time.
Figure 3.14: The number of ASes (n_p, where p ∈ {100, 80}) of Google (22 ASes) and Comcast (48 ASes) that announce p% of their addresses, with linear regression n̂_p computed over 24 months. The AS scale extends to the number of ASes each organization has as of 2011-09-01. Annotations: Google n_100: β = -0.9, ε = 0.4 (consolidating); n_80: β = -0.07, ε = 0.04 (constant). Comcast n_100: β = 2.8, ε = 0.4 (diversifying); n_80: β = -0.4, ε = 0.5 (consolidating).
In the case of Comcast, we observe a diversifying trend with respect to all ASes and a consolidating trend with respect to its core ASes for the past two years. Although classified as consolidating, the changes for Comcast's core ASes are very small (from 6 to 4), suggesting a stable core. The difference between our classification result and the real-world AS usage by Comcast indicates that our classification method may be sensitive to threshold selection.
Figure 3.15: Historical routability of ASes of Verizon (234 ASes); bars mark n_100 births and deaths and n_80 promotions and demotions.
To evaluate the stability of these trends (consistent and inconsistent), ε measures the error against these regressions and is visualized by the deviation of the top dark line n_100 from the short light line n̂_100 (or the deviation of the bottom filled curve n_80 from the short light line n̂_80). As Figure 3.14 shows, neither Google nor Comcast has a significant deviation from the regression line, and thus both are considered as being consistent for the past M = 24 months. However, the industry is changing rapidly: if we extend our study back to 2005 (M = 72), then neither of them is consistent on n_100.
Policies in other organizations: To broaden the above examples of policies that affect
multi-AS usage, we next briefly summarize our inferences about AS policies for six
other organizations.
We see stable, policy-based ASes in the core (n_80 ASes) of many organizations. For example, Figure 3.3 shows the historical routability of all ASes that are part of Google as of 2011-09-01. ASes are stacked by AS number, with horizontal bars indicating the time periods when ASes are routed, and with darker bars indicating membership in n_80. Two ASes have been announcing 80% of Google's addresses for one year: Google's main AS (AS15169, AS index: 1) and AS36492 (AS index: 2), designated for WiFi, suggesting a stable routing policy.
Figure 3.16: Historical routability of ASes of Time Warner Cable (35 ASes).
Figure 3.17: Historical routability of ASes of China Mobile (10 ASes).
We see similar patterns for three other large organizations: Figure 3.15 shows the cores of Verizon (with two geographic ASes, one wireless AS, and one access AS), Figure 3.16 shows Time Warner Cable (with 6 geographic ASes), and Figure 3.17 shows China Mobile (with 7 geographic ASes, plus an IPv6 AS and a backup core AS).
Frequently, transient multi-AS usage is the result of acquisitions followed by AS consolidation. Continuing with the Google example, in late 2006, Google acquired YouTube (AS36561, AS index: 16 in Figure 3.3).
Figure 3.18: Historical routability of ASes of ISC (55 ASes).
The number of addresses announced by this AS has been decreasing since then. In fact, this AS was slowly demoted and then disappeared completely from BGP in April 2011. This suggests that, over time, Google consolidated this service into their core infrastructure. We see similar results for Verizon (see Figure 3.15), which consolidated ASes from MCI (AS703, AS705, AS3378), an ISP acquired in 2005.
Consolidation also happens when there is a business strategy change. For example,
we see geographic consolidation in Time Warner Cable caused by an agreement with
Comcast in late 2006 [Ehl06]. This agreement exchanged subscribers between Time
Warner Cable and Comcast to consolidate key regions; it is the likely cause for the death
of AS11707, AS13343, AS10311, AS10994, and AS8052 (see Figure 3.16). These ASes covered areas in Florida, Tennessee, and Oregon where Time Warner Cable no longer has a presence [Cab11].
Lastly, we found one case where routing policy decisions promote AS diversification: ISC. Although only one AS announces most of ISC's addresses (AS1280 in Figure 3.18), we see that ISC has been using more and more ASes since 2003. Examining these new ASes, we see that each announces a single /24 address block. This policy is consistent
with the choice to associate a unique AS with each physical anycast location [ISC11]
and with ISC’s operation of the anycasted F-root DNS server. This example illustrates
how policy can imply usage of an increasing number of ASes per organization over time,
suggesting that this type of multi-AS usage is likely to stay.
We also examined the remaining organizations in our 10 organization list and found
very similar behaviors to those shown in Figures 3.3, 3.15, 3.16, 3.17 and 3.18.
3.9.3 Ruling out Churn
While we use n_p to classify organizational use of ASes, this metric focuses only on active ASes. Such a focus could be misleading for organizations that have both growth and reduction in AS usage between two consecutive observations. The churn caused by the simultaneous addition of N ASes and removal of N ASes is not captured by the n_p metric.
To bound the error introduced by AS churn, we measure how often there is an offset in n_p. Since n_p is measured every month, the corresponding offset is given by the difference between the number of newly observed ASes and the number of removed ASes in each month. Our examination shows that offsets are rare and small and thus can be ignored. Among 4,388 multi-AS organizations, 3,948 (90%) do not have any offsets at all across all observation intervals. Among the remaining 10%, 423 organizations (9.6% of all) have offsets in less than 5% of observation intervals, and all offsets are bounded in size by 1.
3.9.4 How persistent is multi-AS usage?
Based on the understanding we gained from these case studies, we next look at all organizations identified by our AS-to-org mapping and present some overall statistics and classification results.
Figure 3.19: Classification results of multi-AS usage over all multi-AS organizations (inconsistent, diversifying, constant, consolidating), based on regression of n_p starting from different years.
Our goal is to answer the following question: Are organizations consolidating or diversifying their use of ASes over time? In other words: Are multi-AS organizations here to stay, or are they going away?
We use regression over different durations to see if organizations are consolidating their use of ASes or not. To do this analysis, we begin by selecting all multi-AS organizations. Then for each such multi-AS organization, we perform linear regression on either all of their routed ASes (n_100) or on the top ASes that announce 80% of their addresses (n_80). We then classify our fit to show AS consolidation, constant use, diversification, or inconsistent trends.
Figure 3.19 shows these classifications based on time periods starting from 2001 to 2010 and ending with 2011-09-01. Solid bars show classification results based on n_100, while dashed bars are for n_80. Solid lines (for n_100) and dashed lines (for n_80) are used to trace the classification boundaries for clarity. At the top of the figure, we show the absolute total number of organizations that exist at the start of the different regression periods, while the left y-axis shows the relative percentage of organizations classified into each category.
Our first observation is that multi-AS use is not going away: with a two-year regression based on n_100, around 93% of organizations are using the same or more ASes. This trend is the same and extends over longer periods if we use n_80.
Second, we observe that relatively few organizations are consolidating their AS use over all durations we analyze. While the case studies of our ten organizations show definitive signs of consolidation (3 of 10 are consolidating, and 3 more are inconsistent), when looking across all organizations, we see at most 6% of them consolidating (for the two-year regression based on n_100), with even fewer over all other periods. This difference shows that our selection of the ten organizations is not representative of the Internet as a whole; we chose those ten organizations because of their prominence, large size, and use of many ASes. The vast majority of the multi-AS organizations in the Internet are much smaller, with single-digit numbers of ASes. Big companies typically engage in more acquisitions and mergers and tend to expend more effort consolidating post-merger.
A third finding is that organizations are much more consistent and constant if we only focus on the top ASes (n_80). In fact, we hardly see any inconsistent organizations (top dashed bars), and the percentages of constant organizations far exceed the ones based on n_100 (compare the dashed line above the label "Diversifying" and the solid line above the label "Constant"). However, the percentage of consolidating organizations is roughly constant, irrespective of whether we consider all routed ASes or only the top or core ASes. This finding suggests that most consolidations happen in the "core" ASes, while most diversifications occur for the non-core ASes. Presumably, organizations
prefer to keep their core small and simple for the ease of management, while they rely
on non-core ASes to implement miscellaneous policies.
As the length of the regression period (the number of years to look back) decreases, the total number of organizations increases (from 1,980 in year 2001 to 4,358 in year 2010), with about 264 new organizations appearing each year. This result reflects the overall growth of the Internet in terms of the number and types of Internet-related companies. However, the relative rates of diversification and consolidation appear to have not changed much during these last 10 years. Finally, as expected, with a longer regression period, the number of inconsistent organizations increases: in a dynamic industry, policy changes are frequent.
We thus conclude that the prevalence of multi-AS usage by organizations is persis-
tent and likely to continue in the future.
3.10 Appendix: Revisiting AS Rank
The rank of an AS is commonly defined as the node degree of this AS in an inferred
AS-level graph of the Internet; that is, the number of ASes with which this AS has an
AS relationship (e.g., customer-provider, peer-peer). AS rank has been widely used as a
proxy for an organization’s influence, mainly because more informative metrics such as
traffic volume, revenues, or number of users are not publicly available and very difficult
to measure. AS rank has been used by a number of researchers to infer properties of the
Internet AS-level topology, from identifying Tier-1 Internet providers [GR97] to infer-
ring routing relationships [Gao01], and in network visualizations [CAI13]. The study of
such inferred AS-level graphs of the Internet has subsequently prompted the considera-
tion of new metrics such as an AS's "customer cone" (i.e., counting an AS's direct and indirect customers) [DKF+07]. In the following, we consider AS rank for the purpose
Figure 3.20: The number of neighbors of individual ASes vs. their organizations (x-axis: AS neighbors; y-axis: org-wide neighbors). Only the top 100 ASes (ranks annotated as numbers in circles) and all Verizon ASes (AS701, AS702, and the 232 smaller Verizon ASes) are plotted.
of examining the sensitivity of this metric with respect to the AS- vs. organization-level
view of the Internet.
Given that many ISPs control multiple ASes, ranking individual ASes and examining them separately can lead to two kinds of inaccurate conclusions. First, organizations may
not appear as prominent if they “dilute” their peering relationships across several large
ASes. Second, ASes with few direct peers will appear minor, even if they are part of a
Tier-1 ISP. For example, Verizon controls 234 ASes, among which two ASes (AS701,
AS702) are frequently studied, while more than 200 other Verizon ASes are out of the
spotlight.
To explore the effect of organizations on these questions, we next look at two metrics one could use to compute AS rank. Prior work used the number of neighbors, N(a), of each AS as the metric for ranking (if two ASes appear adjacent in any BGP path, we call them neighbors). We compare that to the organization-wide neighbors, OWN(a); that is, for each AS a, the number of unique neighbors of any AS in a's organization. Thus each AS a of an organization has the same OWN(a), reflecting the contributions of all ASes of that organization. This metric reflects the publicly visible influence of the organization, and it directly captures the number of the organization's business relationships (as measured by peering agreements).
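Given an AS-level adjacency map and our AS-to-organization mapping, OWN(a) is straightforward to compute. A sketch (ours; excluding an organization's own sibling ASes from the neighbor count is our assumption):

    def org_wide_neighbors(neighbors, org_of):
        # neighbors: ASN -> set of neighboring ASNs seen in BGP paths
        # org_of:    ASN -> organization
        members = {}
        for asn, org in org_of.items():
            members.setdefault(org, set()).add(asn)
        own = {}
        for siblings in members.values():
            peers = set()
            for asn in siblings:
                peers |= neighbors.get(asn, set())
            peers -= siblings   # do not count the organization's own ASes
            for asn in siblings:
                own[asn] = len(peers)   # every sibling shares the same OWN(a)
        return own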
To consider how well AS rankings reflect organizations' business relationships, Figure 3.20 shows AS- and organization-level peerings by plotting N(a) on the x-axis vs. OWN(a) on the y-axis. For clarity, we only plot the top 100 ASes (big circles with numbers) and Verizon's ASes (small circles). The numbers in circles indicate the top 100 AS ranks as determined by N(a). Two Verizon ASes (AS701 and AS702) are in the top 100, and thus are plotted as big circles instead of small ones. If N(a) and OWN(a) were equivalent metrics, all points would lie on the diagonal; additional organization-level peerings lift points above the diagonal.
We observe two interesting findings in Figure 3.20. First, the top 100 ASes generally preserve the same ordering with the two metrics. One exception is Verizon's AS702, ranked 57th before, which rises to 5th place when considered as part of the Verizon organization with its sibling AS701. Second, many large ISPs have many secondary ASes that are very low-ranked when considered on their own. For example, while most of Verizon's 200 secondary ASes are ranked very low (e.g., between 12,093rd and 25,638th) with the traditional AS rank N(a), their relationship with Verizon suggests that they may make policy decisions as would the 5th-placed AS701. We see similar results for the minor ASes of most large ISPs. Thus, while an organization-based view does in general not change the AS ranking for the top ASes, it can greatly affect previously low-ranked ASes that belong to large ISPs (we see similar results when using customer cone size for ranking ASes).
To summarize, while AS rank and related metrics are widely used, we recommend against their use. For one, their relevance for capturing an AS's influence is more than questionable, especially in view of increasing evidence that currently used inferred AS-level graphs of the Internet are incomplete [OPW+08, ACF+12], capturing less than half (or even fewer) of all AS-level links. Such a severe degree of incompleteness makes a reliable computation of metrics such as AS rank essentially impossible and casts doubts on published reports that rely on AS rank as a basic metric for studying or comparing ASes. Second, in addition to this incompleteness problem, we showed in this section that the use of the AS rank metric is also affected by appropriately accounting for an organization-based view of the AS-level Internet, which in turn has its own issues concerning the completeness of the available data. In view of these two facts, unless the quality of the datasets available for inferring the AS- and organization-level structure of the Internet improves dramatically in the near future, we recommend against the use of traditional metrics such as AS rank for studying questions concerning the AS-level Internet.
Chapter 4
Holistically Framing the User Impact
of Infrastructure Threats
In this chapter, we provide our third study to support the thesis statement that systematic
approaches can overcome data limitations to improve understanding about the Internet.
The previous two chapters have each achieved a goal via some data that has limitations but is available. We next demonstrate how to reach a goal when not all data is obtainable.
We explore a different part of the problem space where the goal is to understand how
submarine cable cuts affect Internet web services. Around 300 submarine cables carry
the majority of international Internet traffic, so a single cable cut can affect millions
of users. Outages are long-lasting, since repairs are expensive and time consuming.
We especially pay attention to the cut impact on web services because the reliability
and quality of these services are what users care about. The understanding gained in
this study can further aid in connectivity planning and service deployment to improve
resiliency against cable cuts.
We systematically construct a holistic model that bridges cable cuts with services to
achieve our goal (Section 4.3). This model relies on topology and traffic data on various
Internet layers as input. Obtaining all data is impossible since most data is proprietary
(Section 4.3.7). To address unknown data, we perform what-if analysis that studies a
range of possibilities (Section 4.4.4). The holistic model and what-if analysis enable
us to discover two general classes of vulnerability of developing countries' Internet infrastructure (Section 4.4.3) and have inspired countermeasures to improve resiliency (Section 4.4.5).
Figure 4.1: The parts of the problem space the third study explores (axes: data, from direct to indirect; goal, from general to specific; regions: infeasible, undesirable, and feasible-and-desirable; our target: service robustness).
This chapter provides a third piece of strong evidence to support the thesis statement, by demonstrating that systematic approaches can help to understand submarine-cable-cut impact on web services via exploring the incomplete topology and traffic data. The part of the problem space this chapter explores is shown in Figure 4.1. Because our conclusions are drawn from four specific case studies, our understanding is less general than in the previous two studies.
The work in this chapter suggests a possible way to study the class of problems for which all available data is insufficient. To solve this kind of problem, one can study a range of possible values of the unknown data. Although this approach does not provide any definitive answers about the current status, it can be extremely useful for answering what-if questions for future planning and development.
4.1 Introduction
The Internet is of great importance today, and as critical infrastructure, the impact of
various threats it faces needs to be carefully studied and understood. Threat models are
built to provide understanding about how specific threats change the Internet, how the
network reacts to such threats, and how the various threats affect end users.
In this paper, we focus on submarine cable cuts, a specific class of threats that impact
the physical infrastructure of the Internet. Understanding the impact of submarine cable
cuts is essential for three reasons. First, judging from the many reported real-world inci-
dents [BBC12a, BBC13, Mad12b, BBC12b, CPBW11, GA06], they occur rather fre-
quently and can have considerable impact. Second, the majority of international traffic
travels over fewer than 300 submarine cables around the globe (Figure 4.2 shows cables
in 2013). A single cut can profoundly affect millions of users and businesses. Third,
recovery from a cut can be slow, with typical repair times as long as several weeks.
Figure 4.3 illustrates the challenge of submarine cable cuts. In this example, four
landing stations are connected via a submarine cable system (SCS 1). Each landing sta-
tion connects to users via some terrestrial networks. Various online services are repli-
cated in facilities connected to landing stations, with differing deployments depending
on user distribution and cost. A cable cut has broken the cable segment between sta-
tions 1 and 2.
One may simply treat this problem as a path-finding problem, focusing on graph properties. Since the graph is still connected after the cut, this naive model implies that the cut has minimal consequences: all stations are still reachable, so the only harm is increased path length between stations 1 and 2.
Figure 4.2: Global Submarine Cable Map in 2013 [Mah13]
Figure 4.3: The problem to solve (landing stations 1-4 connected by submarine cable system SCS 1; legend: landing station, submarine cable segment, terrestrial cable segment).
However, this naive model does not reflect important aspects of Internet operation:
users and services, as well as the diverse mechanisms each Internet layer uses to estab-
lish, protect, and restore data channels. Users interact with application-layer services,
while the cut happens at the physical layer. To understand the impact of a cut on users
we must consider and model the cascade of interactions from the physical all the way to
the application layer.
Capturing this cascade is challenging, since each layer has diverse mechanisms that
support communications. A user’s access to a service depends not only on connectivity
of a physical medium, but also on virtual data channels that must be provisioned at
intermediate layers. The different mechanisms at each layer require separate models to
capture their unique functionalities for fault-tolerance and recovery.
Rich Internet connectivity means that while completely disconnected users may be
rare, unacceptable performance is a more common user experience. We must therefore
also evaluate user-perceived qualities as measured by Quality-of-Experience (QoE).
Network changes affect QoE in service-specific ways. For example, a user browsing web pages cares about the page-loading time, whereas a user watching videos is concerned with factors such as how long it takes to start the video, how often the video rebuffers, and what the video quality is.
Prior threat models are often influenced primarily by data availability. For example,
the naive model we mentioned earlier follows directly from knowledge of the topology,
but omits layer interactions and end-user impacts. Data-driven approaches can misplace
risk by emphasizing threats that are unlikely and defining harms that are abstract from
real-world users. For example, wide availability of data about AS topologies encourages
threat models involving node and link removals in AS graphs. Since AS graphs represent
business relationships, this graph manipulation has at best limited relationship to real-
world events [DAL+05, ALWD05].
The first contribution of our paper is to frame the modeling need as spanning real-
world threats at lower layers to end-user harm (Section 4.3). To address this challenge
we bring together a number of existing models of Internet components and show how
they can fit together to identify the essential mechanisms at intermediate layers that
change threat outcomes on end-users.
Even with carefully selected models, no single organization is likely to have all the
data needed to populate models from the physical layer to users. Our second contribu-
tion is to show how to apply what-if modeling to network threats (Section 4.4.4). The
ability to explore a range of possibilities allows one to make qualitative claims about
possible outcomes in the face of incomplete data. As one example, we use Quality-
of-Experience models (Section 4.3.6) to study a range of possible current and future
outcomes to users that might result from a submarine cable cut.
Our third contribution is to illustrate our approach by exploring four real-world inci-
dents of submarine cable cuts in 2012 and 2013 (Section 4.4). Using our models and
what-if analysis, we provide general rules that help assess what makes some countries
more vulnerable to disruption (Section 4.5): service-self-sufficiency and diversified con-
nectivity. Our models allow countries to evaluate their vulnerabilities to these risks and
explore possible mitigating strategies.
Finally, although we focus on submarine cable cuts, many parts of our model also
apply to other disruption threats.
4.2 Related Work
Four areas relate to our work: other models that either predict the impact of subma-
rine cable cuts or post-facto evaluation after cuts, models of other threats to Internet
infrastructure, and non-threat models of parts of the Internet.
Models of submarine cable cuts Omer et al. provide a model to assess the impact of submarine cable cuts [MOM09]. They construct a physical cable topology in which nodes are continents and edges are aggregations of inter-connecting submarine cables, based on the public map [Mah13]. They then hypothesize threats by removing nodes or edges and assess impact by computing the amount of traffic that could be delivered between continents after the threat. Unlike their work, we analyze real-world cuts, and we relate the impact to end-users. In addition, we pay attention to the diverse data-transmission mechanisms at each layer, which are not present in their work.
Measurements of submarine cable cuts Many researchers have measured the conse-
quences of submarine cable cuts [Mad12c, Mad12a, CPBW11]. In contrast, our model
can provide implications before a cut happens. Nevertheless, these measurements are
valuable as ground truth to validate and correct our model.
Models of other threats Because of the availability of data on AS topologies, past threat models typically build on the AS graph, modeling threats as removals of nodes or edges from it. Albert et al. [AJB00] first analyzed errors (accidental removal of nodes) and attacks (intentional removal of nodes), assessing impact as network-diameter increase and fragmentation. Dolev et al. [DJMS06] build on this model, but with the consideration of the network-layer transmission mechanism. They note that connectivity between ASes does not imply reachability: a valid AS path must be valley-free [Gao01]. Wu et al. [WZMS07] further enrich the model and assess impact as the reachability changes between all AS pairs. Different from their work, we start with threats drawn from real-world incidents and assess impact not on the network layer but on end-users.
Models of the Internet Much prior work models how different parts of the Internet work without explicitly considering threats. These models provide useful input to our work. In particular, Feamster et al. [FWR04] model how BGP selects paths for traffic flows. Mok et al. [MCC11] model how flow conditions affect video streaming quality, while Zhang et al. [ZXH+12] model video telephony. Researchers in [DSA+11, KS12, MCC11, CHHL06] model how service qualities affect user QoE. We incorporate some of the models above to build our multi-layer threat model.
Figure 4.4: The general picture of the model: a cable cut at the physical layer cascades through cable segments, SONET circuits (link layer), IP links (network layer), flows (transport layer), and sessions (application layer) to user QoE, with each stage characterized by properties such as reachability, latency or length, and capacity or throughput, and connected by the mappings M_cs, W_si, P_si, M_if(t), and M_fa(t).
4.3 Modeling Cable Cuts
To understand the impact of cable cuts on the real world, we follow the problem from
real-world threats to user-relevant harms, bridging them with our holistic model.
4.3.1 Model Overview
Our approach to model cable cuts is problem-driven, which contrasts with models that
are built around the constraints of available data. We identify the threat and harms,
then identify what role each takes in the Internet and determine how they relate. This
approach is challenging, because the relationship of network components is not always
obvious, and because components are often “black boxes” where obtaining data can be
difficult. This section presents an overview about how we follow this approach to model
cable cut impact.
Step 1: Selecting threat and harm We model submarine cable cuts because of their
frequent occurrence, traffic importance, and long repair time. We choose to model harms
as degraded QoE because it is what users care about.
Cable cuts happen to cable segments at the physical layer, while users access services at the application layer. Thus, to assess how cable cuts affect users, we must bridge these layers by modeling the intermediate layers. Before we discuss this process, we first provide background to bring out the basic idea.
Background of the Internet Logically, the Internet is structured as layers. Between
two adjacent layers, the lower layer provides a communication channel for the upper
layer and thus directly affects its communication quality.
The QoE of users is directly shaped by the quality of the communication on the
application layer.
The communication further relies on the lower transport layer to transmit its traffic and on the network layer to find a path for the traffic. The path then relies on the lower link layer to establish channels that support the links composing the path.
Eventually, the link layer needs a physical medium (such as a cable segment) to support its virtual channels. As a result, any damage made to a physical medium will cascade up the stack.
Step 2: Bridging threat to harm Our approach is to tie the threat to harms by succes-
sively modeling how changes of lower-layer communication channels affect upper-layer
communication quality, from threat layer to harm layer.
There are three benefits to this approach. First, with mostly independent models at
each layer, each component can be treated more or less in isolation, making each layer
simpler and easier to interpret. Second, we can explore and validate components sepa-
rately to increase our overall confidence in the approach. Third, the approach provides
a framework to identify the different components’ relationships so as to capture a more
holistic view of the problem.
Figure 4.4 shows the general picture of our model. The long solid red arrow repre-
sents the cascading impact from cable cuts to users. To model this impact, we break the
model into five sub-models (five short arrows) that each addresses a direct impact.
Figure 4.5: SONET circuits rely on cable segments as physical medium, but have to be provisioned to transmit data (example: cable segments C1-C4 connect landing stations 1-4 of SCS 1 at the physical layer; SONET circuits S1-S3 are provisioned at the SONET layer).
To summarize, the problems that each sub-model addresses in later sections are:
1. How does a cable cut break SONET circuits (Section 4.3.2)?
2. How does the breakage of SONET circuits break or impair upper-layer IP links (Section 4.3.3)?
3. How does the change of IP link condition affect flows that traverse the link (Section 4.3.4)?
4. How does the application system adjust its session qualities to adapt to the new flow condition (Section 4.3.5)?
5. How does the change of session qualities affect user-perceived QoE (Section 4.3.6)?
Modeling across multiple layers is a challenging task. To keep the model manageable, we avoid details that can be captured in existing layers, such as WDM, whose ring protection mechanism is captured adequately in our SONET model. We also do not model transient effects brought by mechanisms such as fast re-routing in MPLS, but instead focus on impact that lasts for at least days.
sub-model                                    source
modeling cable cuts breaking SONET circuits  this paper
modeling SONET circuits affecting IP links   this paper
modeling IP links affecting flows            this paper and [FWR04]
modeling flows affecting sessions            video streaming† [MCC11], video telephony [ZXH+12], gaming [CTCL11]
modeling sessions affecting QoE              video streaming† [DSA+11, KS12, MCC11], VoIP [CHHL06], gaming [CTCL11]
Table 4.1: Sources of sub-models. Daggers mark sub-models used in this paper.
4.3.2 From Cable Cut to SONET Circuits
We first tie physical damage to the SONET link layer.
A SONET circuit is a virtual circuit between two SONET devices. Logically, a
SONET circuit corresponds to physical and link layers in the OSI model [ISO94]. Most
submarine cable systems use SONET, connecting devices that are physically located at
landing stations near the coast.
We model SONET circuits because they are statically provisioned and do not necessarily exist between all landing station pairs. Thus the absence of a logically provisioned SONET circuit can leave a physical connection useless. For each SONET circuit, we evaluate reachability based on whether a logical circuit is provisioned. In principle, circuits have latency and capacity, but we model those as part of the IP link described later.
Reachability A cable cut breaks one or more cable segments and thus changes the physical topology of a SONET submarine cable system. Because SONET circuits are virtual circuits which need to be provisioned, we cannot simply examine reachability by finding paths in the changed topology.
Figure 4.5 gives an example. Four cable segments (C1-C4) connecting four landing stations compose the physical topology of cable system SCS 1. However, only three station pairs can communicate with each other through the provisioned SONET circuits (S1-S3). Pairs without SONET circuits in between will not be able to communicate even if they are physically connected (for example, stations 1 and 4), because SONET circuits are assigned statically and human intervention is often required to reconfigure them.
To model how a cable cut breaks circuits between station pairs, we define this static mapping between SONET circuits and cable segments as the matrix M_cs. Each matrix element a_ij = 1 (otherwise 0) if and only if SONET circuit i traverses cable segment j. In the example shown in Figure 4.5,

M_cs =
        C1  C2  C3  C4
  S1     1   0   0   0
  S2     1   1   0   0
  S3     0   0   1   0
When a cable cut happens, the SONET circuits that traverse the broken segments will be affected. The consequence is straightforward: all these circuits break. We translate this impact into the equation shown below.
r_s = M_cs ∧ r_c    (4.1)

where r_c = [r_1 r_2 ...]^T is the column vector denoting the reachability status of all cable segments (true if reachable, false if broken), while r_s represents all SONET circuits.
The operator ∧ captures the fact that a SONET circuit is reachable if and only if all cable segments it traverses are reachable. It is slightly different from matrix multiplication (instead of the sum, it computes the conjunction). It is defined by the following equation: (A ∧ B)_ij = ⋀_{k=1}^{m} a_ik b_kj, where (A ∧ B)_ij is the element in the i-th row and j-th column.
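Because each row of M_cs marks only the segments a circuit traverses, Equation 4.1 reduces to a per-circuit conjunction. A minimal sketch (ours; the data layout is an assumption):

    def circuit_reachability(M_cs, r_c):
        # A circuit is up iff every cable segment it traverses is up.
        # M_cs[i][j] = 1 if circuit i uses segment j; r_c[j] is True if
        # segment j is intact.
        return [all(r_c[j] for j, uses in enumerate(row) if uses)
                for row in M_cs]

    # Figure 4.5 example: cutting segment C2 breaks only circuit S2.
    M_cs = [[1, 0, 0, 0],   # S1 uses C1
            [1, 1, 0, 0],   # S2 uses C1 and C2
            [0, 0, 1, 0]]   # S3 uses C3
    circuit_reachability(M_cs, [True, False, True, True])
    # -> [True, False, True]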
Figure 4.6: SONET systems with the ring protection mechanism use two circuits (a working and a protection path) to support an IP link (example: working path S1 and protection path S3 of SCS 1 support IP link I1 between AS 1 and AS 2).
4.3.3 From SONET Circuits to IP links
We next model how SONET circuits affect IP links. An IP link is a virtual channel between two adjacent devices identified by IP addresses at the network layer. We use hop-by-hop IP links to model routing, and IP paths to refer to a series of IP links over an inter-network.
We model IP links explicitly to allow for SONET circuit diversity. If the SONET ring protection mechanism is used (two circuits supporting one link), the threat impact might be contained at the IP link, with no impact on users.
Two properties of an IP link are important to us: reachability and latency, because they are sufficient to capture the threat impact that will further propagate to users. Latency is the data propagation time, ignoring queuing delay, over the IP link. Normally it is fixed, but after a cable cut it may take a different value if a different SONET circuit is selected.
In principle, the cable cut could also change an IP link’s capacity if multiple link-
layer channels are supporting it via link aggregation. For simplicity, we model this
situation using multiple IP links, each supported by a single link-layer channel at one
time (working and protection SONET circuits do not work simultaneously).
Reachability A SONET circuit is supported by a series of cable segments, whereas an IP link is supported by one or two parallel SONET circuits. Due to this difference, the way the impact propagates is slightly different. The primary circuit is the working path, while the secondary one is the protection path. This protection scheme is known as the Multiplex Section-Shared Protection Ring (MS-SPRing) or just the "ring protection mechanism" [MPDPM02]. As an example, in Figure 4.6, the working path S1 and protection path S3 together support IP link I1. The protection path is optional.
Since the working and protection paths are in parallel (rather than in series, as cable segments are), an IP link is reachable as long as at least one SONET circuit is reachable. Thus, unlike with SONET circuits, any active SONET circuit supports the IP link. We model this effect using the following equation:

r_i = (W_si r_s) ∨ (P_si r_s)    (4.2)
where, analogous to r_s, r_i is the column vector denoting the reachability of all IP links. W_si and P_si are matrices mapping IP links to their working and protection paths, respectively. In the example shown in Figure 4.6,
W_si =
        S1  S2  S3  S4
  I1     1   0   0   0
  I2     0   1   0   0

P_si =
        S1  S2  S3  S4
  I1     0   0   1   0
  I2     0   0   0   1
Because SONET circuits can be sold to different ISPs, the IP links they support can
reside in different ISPs’ networks. We show ISPs as different Autonomous Systems
(ASes) in Figure 4.6.
Latency In cases where an IP link is still reachable after the cable cut, its latency may increase if the protection path has higher latency than the working path. For long-haul IP links in modern networks, propagation delay is the major component of IP link latency. We assume the capacity of both circuits is the same, as is typical in practice [MPDPM02]. We thus model the impact on latency by the following equation:
l_i = W_si l_s   if the working path functions,
      P_si l_s   otherwise    (4.3)
where l_i is the column vector denoting the latency of all IP links, while l_s denotes the latency of SONET circuits. An element in l_s is a finite number unless its corresponding circuit is broken; if broken, the value of the element equals ∞.
Note that we have ignored the queuing delay that might be induced by the cable cut. A cable cut might cause some core IP links to become congested and thus increase their queuing delay. However, we believe that in such cases, congestion-reactive traffic will back off [FF99], and routers will drop packets in the queue, as there is no benefit to keeping a queue when the link is heavily congested.
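Equations 4.2 and 4.3 combine naturally into one per-link computation. In this sketch (ours), representing W_si and P_si as per-link circuit indices rather than 0/1 matrices is a simplification:

    INF = float("inf")

    def ip_link_state(working, protection, r_s, l_s):
        # working[i]/protection[i]: SONET circuit index for IP link i
        # (protection may be None); r_s/l_s: circuit reachability/latency.
        reach, lat = [], []
        for w, p in zip(working, protection):
            up_w = r_s[w]
            up_p = p is not None and r_s[p]
            reach.append(up_w or up_p)                 # Equation 4.2
            lat.append(l_s[w] if up_w                  # Equation 4.3
                       else l_s[p] if up_p else INF)
        return reach, lat

    # Figure 4.6 example: I1 uses working path S1 and protection path S3;
    # if S1 breaks, I1 stays reachable at S3's (possibly higher) latency.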
4.3.4 From IP Links to Transport-layer Flows
IP links connect devices; we next consider flows that represent traffic over an IP path (a
series of IP links). We use the traditional network definition of a transport-layer flow:
a series of packets sent between two network endpoints identified by two IP addresses,
two port numbers, and the protocol.
We choose to model flows for three reasons: they add multi-hop communication, routing, and congestion control, bridging IP links to applications. Our goal is to capture three properties: reachability, latency, and throughput. Reachability is affected by multi-hop communication and routing that can find paths around failed links. Latency is affected by path changes that increase path length.
Figure 4.7: Traffic flows between two endpoints rely on the network layer to find a path composed of IP links. The path must comply with policies configured in routers.
Figure 4.8: Flows are dynamically routed based on current IP link state for robustness.
Finally, competing flows can trigger congestion control and change effective throughput, an important factor in application quality.
Reachability We model flow reachability over IP links in the same manner as SONET circuit reachability over cable segments (Section 4.3.2). However, unlike the statically provisioned working and protection SONET circuits, the network layer uses dynamic routing to select from several possible paths. For example, in Figure 4.8, the flow between a and g may take two different IP paths. We thus model the impact on flow reachability with:

r_f(t) = M_if(t) ∧ r_i    (4.4)
This equation is similar to Equation 4.1, but the mapping varies over time t as M_if(t), unlike the static M_cs in Equation 4.1.
In the example shown in Figure 4.7,

M_if(t) =
        I1  I2  I3  I4  I5  I6  I7  I8
  F1     0   1   0   0   1   0   0   1
  F2     1   0   0   1   0   1   0   0
Note that after a cable cut, M_if(t) is likely to change. M_if(t) is collaboratively decided by distributed routers running the Border Gateway Protocol (BGP) and Interior Gateway Protocols (IGPs) based on current IP reachability. Feamster et al. have proposed an algorithm to compute this matrix by emulating the route selection process of each ingress router for each destination prefix [FWR04]. In principle, M_if(t) can also capture load balancing and traffic engineering, but modeling these factors is future work.
Latency Flow latency is the sum of each IP link's latency on the flow's path. We capture the latency of each flow, l_f(t), as:

l_f(t) = M_if(t) l_i    (4.5)
Throughput Throughput is affected both by IP link capacity and by traffic and congestion control on each link. We assume most traffic is congestion-reactive [FF99], and therefore over medium timescales each flow will converge on a fair share at its bottleneck. We thus model flow throughput c_f(t) as:

c_f(t) = C_if(t) ⊙_min c_i    (4.6)

where c_i denotes the capacity of IP links. C_if(t) is a matrix denoting the fraction of capacity occupied by each flow on each IP link. It is derived from M_if(t) by computing the multiplicative inverse of the number of flows on each IP link. Here the operator ⊙_min computes the minimum value over the vector; that is, (A ⊙_min B)_ij = min_{k=1}^{m} a_ik b_kj, where (A ⊙_min B)_ij is the element in the i-th row and j-th column.
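A sketch of Equations 4.4 through 4.6 over one snapshot of M_if(t) (ours; representing paths as lists of link indices, and assuming each flow gets an equal share at every link it crosses, as the fair-share argument above suggests):

    def flow_metrics(paths, l_i, c_i):
        # paths[f]: list of IP-link indices on flow f's current path.
        # l_i/c_i: per-link latency (inf if down, per Eq. 4.3) and capacity.
        share = {}                       # flows sharing each link
        for path in paths:
            for link in path:
                share[link] = share.get(link, 0) + 1
        result = []
        for path in paths:
            up = all(l_i[link] != float("inf") for link in path)    # Eq. 4.4
            latency = sum(l_i[link] for link in path)               # Eq. 4.5
            thru = (min(c_i[link] / share[link] for link in path)   # Eq. 4.6
                    if up else 0.0)
            result.append((up, latency, thru))
        return result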
4.3.5 From Flows to Sessions
Applications often use one or more flows to realize complex network services; we call this exchange of information a session. (Our sessions are somewhat more general
than the OSI session layer, and are implemented in applications and libraries.)
We model sessions as a bridge between flows and application QoE. This bridge
allows us to identify metrics that are application-specific but lower-level than users
might care about. These metrics are useful because they are common to several differ-
ent models of QoE, and because they identify measurable things in the network that we
can verify. We expect each application to require distinct session information. We draw
on prior work in modeling multiple applications, focusing on video streaming using a
model developed by Mok et al. [MCC11]. We focus on one generic session property,
reachability, and three application-specific properties of video streaming, which later fit
into the session-QoE model. Other applications that could be used within this frame-
work include video telephony, VoIP, gaming, and newly emerged cloud applications.
Reachability We consider the reachability of a session to equal that of its transmission flow. A video streaming session may initiate one or a series of transmission flows to transfer video segments to the user [HHH+12]. In cases where multiple flows are used, we consider them as one flow but with changing endpoints (servers). There are also other flows involved in a streaming session, such as DNS queries that map the service to servers; because these flows have much less influence over the session quality, we do not model them.
Figure 4.9: A session between a user and a service relies on one or multiple flows between the user client and server(s).
Prior work shows that there is typically only one TCP flow at a time [HHH+12]; we therefore consider the reachability of a session to equal that of the single flow at a given time.
Most video streaming services employ Content Delivery Networks (CDNs) that distribute content around the Internet in caches to reduce latency to the user and bandwidth costs to the provider. Different CDN caches provide redundancy to the session, just as backup SONET circuits do to the IP link. Reachability to any cache allows the service to proceed.
For example, in Figure 4.9, the session between the user and the video streaming service YouTube depends on the underlying flow F1. But if anything goes wrong with F1, the user can still access the service via F2, which goes to another cache providing the same content.
We model session reachability, r^a(t), as the following equation (we use superscript a to denote "application"; in principle, we would use s to denote "session", but s is already used for "SONET circuit"):

r^a(t) = M_fa(t) r_f(t)    (4.7)
where M_fa(t) is the dynamic mapping between sessions and flows (or services and servers). Note that unlike Equation 4.2, where redundancy is expressed by disjunction, redundancy here is expressed by dynamics: M_fa(t) will automatically change after a failure to restore the session reachability.
Application-specific properties of video streaming We also draw on several properties specific to video streaming: video bitrate, startup delay, and rebuffering ratio. These properties have been developed in models specific to video evaluation [MCC11, KS12, DSA+11]; we adapt them to our model and flow rate c_f.
Video bitrate $r_v$ is often congestion adaptive, with modern streaming players picking among fixed bitrate tiers [HHH+12]; we model this mechanism by picking the largest $r_v$ just less than $c_f$. Video startup delay ($d_s$) is a function of video buffer size $t$, video bitrate, and flow rate (from Mok et al. [MCC11]):

$$d_s = t\, \frac{r_v}{c_f} \qquad (4.8)$$
Rebuffering ratio is also important. We model it as

$$r_b = \frac{r_v}{c_f} - 1 \qquad (4.9)$$

aggregated from two separate models in [MCC11] (they define separate rebuffering time and frequency).
We define these network-level metrics of video performance in our session model
because they serve as input to several different QoE models (described in the next sec-
tion), and because they help identify how specific network phenomena cause problems
to the user experience.
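As a concrete sketch of these session metrics, the following Python code implements the tier selection and Equations 4.8 and 4.9; the tier values and buffer size are illustrative numbers within the ranges of Table 4.2, and clamping $r_b$ at zero is our simplification:

```python
def pick_bitrate(c_f, tiers=(0.35, 0.75, 1.5, 3.8)):
    """Congestion-adaptive tier selection: the largest r_v below c_f
    (tiers in Mb/s, illustrative values)."""
    feasible = [r for r in tiers if r < c_f]
    return max(feasible) if feasible else min(tiers)

def startup_delay(r_v, c_f, buffer_min=0.5):
    """Equation 4.8: d_s = t * r_v / c_f with buffer size t in minutes;
    returns seconds."""
    return buffer_min * 60.0 * r_v / c_f

def rebuffering_ratio(r_v, c_f):
    """Equation 4.9: r_b = r_v / c_f - 1, clamped at 0 when the flow
    keeps up (our simplification)."""
    return max(0.0, r_v / c_f - 1.0)

c_f = 0.4                                # flow rate in Mb/s
r_v = pick_bitrate(c_f)                  # -> 0.35 Mb/s
print(startup_delay(r_v, c_f))           # -> ~26.25 seconds
print(rebuffering_ratio(0.35, 0.3))      # -> ~0.167 after a capacity drop
```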
Other applications Video telephony is also an important application on the Internet. Well-known systems include Skype video calls, Google Hangouts, and iChat. Zhang et al. [ZXH+12] proposed models to predict three session qualities of Skype: sending rate, video rate, and frame rate.

Compared with video streaming, video telephony is more sensitive to real-time conditions such as latency. However, as the models in [ZXH+12] show, throughput is still the most important factor.
4.3.6 From Sessions to QoE
We can now bring our model to the user by modeling their Quality of Experience in specific applications. We can reflect loss of reachability as completely unacceptable QoE, but QoE is more interesting when it reflects more subtle differences. As a concrete application where we can estimate QoE, we continue to focus on video streaming.
Application-specific QoE of video streaming We survey four QoE models of video streaming developed in three prior papers [DSA+11, KS12, MCC11]. These models are derived by analyzing data sets of varying sizes. The model in [MCC11] is based on lab experiments including 270 views from 10 viewers, while the two in [KS12] are based on a much larger set from Akamai (23 million views from 6.7 million viewers). The model in [DSA+11] is drawn from the largest and most diverse dataset, including 300 million views from 100 million viewers in a week, from various content providers. We thus first choose the model in [DSA+11]. In addition, we also incorporate one model from [KS12] as another branch to complete our model. These two models focus on two different aspects of QoE (play time and abandonment rate), and we think they are both useful.
The model in [DSA+11] uses decreased video play time ($\mathrm{QoE}_P$) to indicate user QoE (less play time indicates worse experience), and studies how it is shaped by rebuffering ratio ($r_b$). The model can be formalized by the following equation:

$$\mathrm{QoE}_P = \begin{cases} -1\ \text{minute}/\% \cdot r_b & \text{video on demand} \\ -3\ \text{minutes}/\% \cdot r_b & \text{live video} \end{cases} \qquad (4.10)$$

which means users watch 1 or 3 minutes less for every 1% more rebuffering, for the two types of video. Note that this model has a range where it is applicable: it only applies to rebuffering ratios less than 10%. Beyond that, QoE is too poor for the model to apply.
Focusing on another aspect of user experience, the model in [KS12] uses negative video abandonment rate ($\mathrm{QoE}_A$) to indicate user QoE and studies the causal relationship between it and the startup delay ($d_s$):

$$\mathrm{QoE}_A = -(d_s - 2) \cdot 5.8\% \qquad (4.11)$$

This equation means that users start to abandon videos after 2 seconds of startup delay, and the abandonment rate rises by 5.8% with every additional second of delay. This model also has its application range. The authors did not discuss it directly, but by examining their regression graph (Figure 10 in [KS12]), we conclude that this model only applies to startup delays of less than 10 seconds.
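The two QoE branches can be sketched directly from Equations 4.10 and 4.11; this minimal Python version (ours) also encodes the stated applicability ranges:

```python
def qoe_play_time(r_b_percent, live=False):
    """Equation 4.10: decreased play time in minutes; the model only
    applies for rebuffering ratios below 10%."""
    if r_b_percent >= 10.0:
        return None                      # QoE too bad for model to apply
    slope = 3.0 if live else 1.0         # minutes lost per 1% rebuffering
    return -slope * r_b_percent

def qoe_abandonment(d_s_seconds):
    """Equation 4.11: negative abandonment rate (percent); valid up to
    roughly 10 s of startup delay."""
    if d_s_seconds > 10.0:
        return None
    return -max(0.0, d_s_seconds - 2.0) * 5.8

print(qoe_play_time(4.0))                # -> -4.0 (video on demand)
print(qoe_abandonment(5.0))              # -> ~-17.4
```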
Other applications Although we focus on video streaming, QoE models exist for other
applications such as Internet telephony. Chen et al. [CHHL06] proposed a model to pre-
dict Skype voice call QoE from session quality. Their model shows that Skype sending
rate and the jitter of sending rate are the two most important factors that ensure a good
quality of experience for users.
| layer | notation | meaning | value/source | proprietary | in equation |
|---|---|---|---|---|---|
| PHY | $r_c$ | reachability status of cable segments | cable owners | y | 4.1 |
| PHY to SONET | $M_{cs}$ | mapping from SONET circuits to cable segments | cable owners | y | 4.1 |
| SONET | $l_s$ | latency of SONET circuits | cable owners | y | 4.3 |
| SONET to IP | $W_{si}$ | mapping from IP links to SONET working paths | ISPs | y | 4.2, 4.3 |
| SONET to IP | $P_{si}$ | mapping from IP links to SONET protection paths | ISPs | y | 4.2, 4.3 |
| IP | $M_{if}(t)$ | mapping from flows to IP links at time $t$ | ISPs | y | 4.4, 4.5 |
| IP | $C_{if}(t)$ | mapping of IP link bandwidth to flows at time $t$ | ISPs | y | 4.6 |
| IP | $c_i$ | capacity of IP links | ISPs | y | 4.6 |
| IP to APP | $M_{fa}(t)$ | mapping from sessions to flows at time $t$ | app providers | y | 4.7 |
| APP | $t$ | video buffer size | 0.5-5 min [ABD11] | n | 4.8 |
| APP | $r_v$ | video bitrate | 0.35-3.8 Mbps [ABD11] | n | 4.8, 4.9 |

Table 4.2: Data needed for the model.
Online gaming is another application area where QoE can be modeled. Chang et al. [CTCL11] have proposed a QoE model for online gaming, showing that both display frame rate and frame distortion are critical to user experience. Such models could fit in our framework.
4.3.7 Data needed for the model
So far, we have completed our model by incorporating prior models and developing new ones where needed.
This model helps one predict which users will be affected by a cable cut and how the cut affects their video streaming experience. However, to make any useful prediction, one needs to collect real-world data as input and parameters to the model. Table 4.2 lists the data needed. As the table shows, almost all of the data are proprietary and thus hard to obtain. However, we can still gather some through measurement and online documents (see citations in the table).
Some data are not only proprietary, but also dynamic (notations with $(t)$), which means they may be hard to obtain even for service providers. For example, the mapping from flows to IP links ($M_{if}(t)$) is dynamic. It is governed by complex routing protocols that run on distributed routers according to the current IP link state. Often, even internal operators find it hard to predict routes for flows.
We see two approaches to obtain dynamic data. First, one can log the information for a period of time long enough to predict future behavior. Most network traffic has strong diurnal and weekly periodicity that allows trend identification. Alternatively, one can build models that infer dynamic behavior from slower-changing information, such as prior work in routing [FWR04] and traffic matrix estimation [MTS+02, ZRLD03].
In some cases, data may be unavailable, either to researchers or operators. Although missing data makes specific outcomes difficult to predict, modeling makes it relatively easy to quickly study a range of parameters. Such a study can suggest whether negative outcomes are likely or unlikely across different possibilities.
4.4 Case Studies
After constructing the model in Section 4.3, we next apply it to understand real-world incidents (see Table 4.3). Specifically, we characterize the aspects of networks that make countries more or less vulnerable to threats. In addition, we explore how one can apply the model when facing incomplete data. We find that service self-sufficiency (hosting services near users) and geographic diversity of circuits both help insulate a country from outages.

In this section we focus on Bangladesh and its 2012 cut (the first incident in Table 4.3). We explored this incident concurrently with developing our model. Subsequently, we applied our model to the three other cases listed in Table 4.3, discussed in Section 4.4.7. Although each scenario requires new parameters, our model is effective at evaluating these additional cases, suggesting it generalizes and is not overfit to a single occurrence.

| incident | victim country | cables (total) | cables (cut) | self-sufficiency | geo-diversity | geo-weakness | capacity drop |
|---|---|---|---|---|---|---|---|
| SeaMeWe-4 '12 [BBC12a] | Bangladesh | 1 | 1 | 4% | low | eastbound to Singapore | 67%* |
| SeaMeWe-4 '13 [Mad13] | Pakistan | 4 | 2 | 0% | medium | westbound to Europe | 60%† |
| IMEWE '12 [Mad12b] | Lebanon | 1 | 1 | 8% | low | westbound to France | 100%† |
| TEAMS '12 [BBC12b] | Kenya | 3 | 1 | 4% | medium | | 20%† |

Table 4.3: Four real-world incidents we have studied. Asterisks (*): estimated; daggers (†): reported.
The SeaMeWe-4 cable cut that happened in 2012 (Section 4.4.1) had a significant impact on Bangladesh; to understand its cause, we first apply our model to find an explanation (Section 4.4.3). Beyond the explanation, we would also like to quantify the impact, especially on user QoE, beyond what has been reported in public news (Section 4.4.4). Going one step further, countries that suffer from cable cuts would also like to know how to mitigate the impact. We therefore present a method to help countries address this issue (Section 4.4.5).
The process of applying our model has been discussed in Section 4.3 and shown in Figure 4.4. However, to address incomplete data, we slightly modify that process. We briefly describe this modified process (Section 4.4.2) and apply it to one of our examples for illustration. Finally, we share what we have learned about addressing incomplete data in a generic scenario (Section 4.4.6).
4.4.1 Incident Overview
The SeaMeWe-4 submarine cable system connects 17 landing stations from South East Asia via the Middle East to Western Europe (Figure 4.10), in a bus-like topology [SEA13b]. While submarine cables are often rings, geography forces a linear topology here. The system is managed by a consortium of 16 telecommunication companies and spans about 20,000 km, supporting communication at 1.28 Tb/s [Sub13].

Figure 4.10: Physical topology of SeaMeWe-4 [SEA13a].

We analyze the 6 June 2012 cable cut, which occurred 60 km outside Singapore and disconnected it from the other stations [BBC12a].
A naive model of reachability might suggest that this cut would affect Internet users in Singapore. However, public reports suggest that Singapore users were barely affected, while Bangladeshi users experienced significant problems [BBC12a]. Press reports suggest about eight million Bangladeshi netizens suffered very slow connections after the cut. We found no news reports for Thailand or Malaysia, which likely implies that these two countries were also barely affected.
4.4.2 Applying the Model
We next briefly describe the process of applying the model. The ideal process is shown as the solid red line in Figure 4.4; however, to address data incompleteness, we instead follow two branches of it, visualized by dashed purple lines. (The dashed parts of the two branches are where data is unavailable; to address this problem, we perform what-if analysis on these parts.) The service-reachability branch reverses the bottom-up order of the ideal process: it starts from session reachability and works down the stack to cable segment reachability. The user-QoE branch keeps the bottom-up order: it starts from IP link capacity and ends at video play time, on the prerequisite that session reachability is satisfied. We next describe the two branches in detail.
Service reachability branch integrates Equations 4.7, 4.4, 4.2, and 4.1 to obtain the cross-layer impact on service reachability by following a top-down direction. The data needed in this branch are obtainable through measurement or educated inference; this path therefore allows us to quantify service reachability.
To analyze service reachability, we first apply Equation 4.7 to map service reachability to flow reachability. We obtain the data needed in this step ($M_{fa}(t)$) from distributed DNS queries and BGP routing tables. Distributed DNS queries can discover more servers than a single vantage point, and thus decrease the possibility of underestimating service reachability. We therefore query server addresses from all major Singaporean and Bangladeshi ISPs. We limit our scope to Singaporean and Bangladeshi users only to make the problem more manageable. More specifically, we only consider sessions between users and top services within these two countries respectively. We identify top services as the top 25 websites for each country provided by Alexa [Ale13], and users by IP addresses announced by major Singaporean and Bangladeshi ISPs.
One layer down, we then map flow reachability to IP link reachability by applying Equation 4.4. We obtain the data needed here ($M_{if}(t)$) from traceroutes targeted either to servers or users, supplemented by AS path inference.
Down to the bottom of the stack, we analyze cut effects on IP link reachability by applying Equations 4.2 and 4.1. The data needed in these two steps ($M_{cs}$, $W_{si}$, $P_{si}$) are proprietary, and also hidden under the IP layer within each network's boundaries. Only the ISPs have this data and can measure it, so instead we make educated estimates based on ISPs' public documents. We use published IP [Sin10] and physical topologies [Sin13]; some ISPs make this data available to attract customers. We can compare these two topologies to infer which IP links depend on which cable segments.
One can follow the above service-reachability branch to analyze cut effects on service reachability, which provides a binary answer (either accessible or not). QoE depends on service reachability (QoE only makes sense when services are reachable), but contains much richer information about user satisfaction. We next describe the other branch, which examines user QoE under the assumption that services are reachable.
User QoE branch integrates Equations 4.6, 4.9, and 4.10. This analysis requires proprietary information including IP link capacity and the flow traffic matrix ($c_i$ and $C_{if}(t)$ in Equation 4.6). Even for ISPs where this information is known, the exact values change over time. We therefore study a range of values in the parameter space to understand the range of conditions where the network is robust or fragile. Thus our approach can both answer what-if questions, where one provides or speculates about specific parameters, and project beyond current usage to possible future scenarios.
We study link capacity ($c_i$) and user traffic ($C_{if}(t)$). To simplify representation, we replace the capacity of individual links with aggregate international capacity, and dynamic traffic with maximum flow count. This approximation is appropriate when most services are international (as we show they are for Bangladesh) and flows are congestion-reactive and will therefore converge on a fair share at the bottleneck international links. Figure 4.11 shows an example of the parameter space, with capacity shown against flow count.
4.4.3 Causes of Large Impact on Bangladesh
We next discuss the two weaknesses of Bangladesh’s infrastructure that our analysis
reveals: low service self-sufficiency and low geographic diversity of international cir-
cuits. These weaknesses result in harm to Bangladeshis.
We define self-sufficiency as a metric to quantify the degree to which a country depends on the outside world for Internet services. Specifically, it equals the fraction of the top 25 websites that are hosted by any domestic servers. For Bangladesh, self-sufficiency is very low (4%), meaning only one website is hosted within its borders. Among the 24 popular websites hosted abroad, 16 are foreign or global services (such as Google and Facebook) and eight are local but hosted abroad (such as BanglaNews24 and BDJobs), suggesting an opportunity for new hosting services inside Bangladesh to improve self-sufficiency.
In contrast, Singapore is much more self-sufficient (52%). When considering only
the top five websites, its self-sufficiency rises to 80%.
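To make the metric concrete, a minimal sketch follows (ours; the site list and server locations are hypothetical, and a real measurement would resolve each site from in-country vantage points and geolocate the resulting server addresses):

```python
def self_sufficiency(top_sites, server_countries, country):
    """Fraction of top sites hosted by at least one domestic server."""
    domestic = sum(1 for site in top_sites
                   if country in server_countries.get(site, set()))
    return domestic / len(top_sites)

# Hypothetical data: one of two popular local sites is hosted abroad.
server_countries = {"news.example.bd": {"SG"},
                    "jobs.example.bd": {"BD"}}
top = ["news.example.bd", "jobs.example.bd"]
print(self_sufficiency(top, server_countries, "BD"))   # -> 0.5
```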
The low service self-sufficiency suggests that Bangladesh heavily depends on the
outside world, and thus it is very important for Bangladesh to diversify its international
outlets to cope with physical threats that disrupt regional connectivity.
However, at the time of the cut (mid-2012), SeaMeWe-4 was Bangladesh's only high-capacity international cable. We confirmed this statement by searching through the complete list of submarine cables [Sub13], and concluded that Bangladesh had no terrestrial connectivity at that time because a later December 2012 terrestrial connection (via the ITC cable) appeared as major news [bdn12]. Satellite or dialup links can provide service for some, but both are slow and do not support general traffic.
The low cable diversity is further intensified by low circuit diversity. Most of Bangladesh's international circuits are provisioned to the east, connecting with Singapore, and so were cut during the incident. The sudden disruption of eastbound circuits led to approximately a 60-70% drop in Bangladesh's total international capacity. If Bangladesh had provisioned more backup circuits to the west, connecting with global ISPs in the Middle East or Europe, user traffic could have shifted west and the threat impact would have been much smaller.
In summary, the low geographic diversity of circuits, together with the low service self-sufficiency, has made Bangladesh vulnerable to cable cuts. Self-sufficiency will improve by either encouraging popular foreign services to deploy servers in-country, or if the popularity of domestic services grows. Geographic diversity of international connectivity is improved by adding circuits or cables to new destinations, as Bangladesh did in December 2012 [bdn12].
4.4.4 Impact on QoE in Different What-If Scenarios
This section studies the threat impact on user QoE for different possible scenarios. These different possibilities allow us to explore potential future situations, an approach that applies not only to cable cuts but also to other cases where international capacity supply or traffic demand changes (such as planned maintenance and flash crowds).
We explore a two-dimensional parameter space to study these what-if scenarios. We
have briefly described the parameter space in Section 4.4.2; its two dimensions are international capacity and flow count.

Figure 4.11: QoE (decreased play time, in minutes) in the 2-D parameter space of international capacity (Gbps, 0-50) versus international flow count (K, 0-200), with $r_v$ = 350 kbps; regions show good experience, degraded experience, and conditions too bad to apply the QoE model, with before and after states marked.

Figure 4.11 shows how user QoE (represented by play time) varies in this space, as measured by:

$$\mathrm{QoE}_P = 100\left(1 - \frac{r_v\, y}{x}\right)\ \text{minutes} \qquad (4.12)$$
integrated from Equation 4.6, 4.9 and 4.10.
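For clarity, we add the one-line derivation here: with the aggregate fair share $c_f = x/y$ (Equation 4.6), Equation 4.9 gives $r_b = r_v/c_f - 1 = r_v y/x - 1$; applying the video-on-demand case of Equation 4.10 ($-1$ minute per 1% of rebuffering) yields

$$\mathrm{QoE}_P = -100\left(\frac{r_v\,y}{x} - 1\right) = 100\left(1 - \frac{r_v\,y}{x}\right)\ \text{minutes}.$$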
QoE models and Figure 4.11 simplify several aspects of Internet video. Rather than model adaptive video, we approximate $r_v$ by fixing it at the basic bitrate of many services today ($r_v$ = 350 kb/s). In addition, the QoE curve may vary by content type (for example, some animation can be encoded more efficiently); these differences are not reflected in current QoE models. Finally, we allocate bandwidth equally over all users in a country. In practice, their needs will vary. Future work could capture these effects by refining our model, perhaps with a multi-tier QoE equation.
As shown in the figure, there are three regions corresponding to different QoE in the parameter space. The bottom-right green triangular region corresponds to good experience with zero decreased play time (0 on the color scale). The narrow middle strip (shown as shades of blue) just on top of it corresponds to degraded experience. Users in this region react by watching less, from several seconds to 10 minutes less (-1 to -10). Note that this region is very narrow, indicating fairly little adaptivity at these scales. The upper-left white region is where the QoE model does not apply: the capacity is so small and the flows so many that users in this region are unlikely to use the service at all.
To answer what-if questions, we consider performance before and after a network change. As shown in the figure, this approach first picks a specific position in the space (the "before" dot) representing the state of a given country before the cut happened; it then moves the initial state to another position (the "after" dot) representing the state after the cut. The approach finally assesses the cut impact by comparing the QoE values at these two positions. In the example shown in Figure 4.11, the cut degrades QoE from good to an intolerable level.
Theoretically, the before and after states can be at any positions in the parameter space. In practice, however, the before state typically appears in the good-experience region, and its movement follows only a finite set of patterns. We next summarize the major possible scenarios.
Steady demand (with decreased capacity) may be caused by a cable cut or simply regular maintenance. In this scenario, users maintain their regular online activities despite the potential change in service quality. Because user traffic is unchanged, the direction the before state moves is fixed: it only moves horizontally to the left. The distance it moves reflects the capacity taken out of service by cable cuts or maintenance, which in turn depends on the diversity of the country's connectivity.
The Bangladesh incident may match this scenario. We next conjecture about the
possible impact on Bangladeshi users.

Figure 4.12: Steady demand scenario (capacity drops to x/3); regions of before states with no, moderate, and significant impact, annotated with 2x extra capacity and 2x extra flows.

Figure 4.13: Decreased demand scenario (capacity drops to x/3, flow count drops to y/2); regions of before states with no, moderate, and significant impact, with before and after states marked.

Since the cable cut results in a 60-70% drop of Bangladesh's international capacity, the after state shifts to about 30% of the initial
capacity (moving directly left). However, since the initial capacity (the before state)
is unknown, we cannot exactly position the after state. Nevertheless, what-if analysis
allows us to bound regions of before states that produce different outcomes. Figure 4.12
shows regions of before states where Bangladeshi users would feel (i) no impact (i.e., the after state is still within the good-experience boundary), by the inequality $100(1 - \frac{3 r_v y}{x}) \geq 0$; (ii) moderate impact (i.e., the after state falls within the degraded-experience boundary), by the inequality $-10 \leq 100(1 - \frac{3 r_v y}{x}) < 0$; and (iii) significant impact (i.e., the after state is outside the QoE model's application range), by the inequality $100(1 - \frac{3 r_v y}{x}) < -10$. (Observe that these three inequalities are obtained by substituting $x$ with $x/3$ in Equation 4.12 to reflect the capacity drop.)

Figure 4.14: Increased demand scenario (flow count doubles to 2y); regions of before states with no, moderate, and significant impact.
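These zone boundaries are easy to evaluate programmatically; the following Python sketch (ours, with illustrative capacity and flow values) classifies a before state under a capacity drop to $x/3$:

```python
def classify_before_state(x_gbps, y_flows, r_v_gbps=0.35e-3, drop=3.0):
    """Classify a before state (x, y) after capacity drops to x/drop,
    using QoE_P = 100 * (1 - drop * r_v * y / x) from Equation 4.12."""
    qoe_after = 100.0 * (1.0 - drop * r_v_gbps * y_flows / x_gbps)
    if qoe_after >= 0.0:
        return "no impact"
    if qoe_after >= -10.0:
        return "moderate impact"
    return "significant impact"

# r_v = 350 kb/s = 0.35e-3 Gb/s, so r_v * y is aggregate demand in Gb/s.
print(classify_before_state(30.0, 20_000))   # -> 'no impact'
print(classify_before_state(30.0, 50_000))   # -> 'significant impact'
```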
Figure 4.12 also reveals interesting implications for connectivity planning: the trade-off between resilience and resources. The no-impact zone provides high resilience at the cost of over-provisioning capacity by a factor of 2. In contrast, the significant-impact zone, although vulnerable to cable cuts, can accommodate twice as many flows in normal situations. How much extra capacity to provide, and how to plan for it, is a crucial problem for many countries. Our models can guide countries in assessing such trade-offs, as we expand upon in Section 4.4.5.
Decreased demand (with decreased capacity) could be caused by the same reasons as the first scenario, but reflects user defection (giving up) that often results from degraded service quality. With decreased demand, the after state is both to the left of and below the before state, reflecting a decrease in the number of flows as users defect, as shown in Figure 4.13.
Suppose Bangladesh falls into this scenario and half of the flows were withdrawn. We can again bound regions of before states by the different outcomes they produce, as shown in Figure 4.13. Defection extends the no-impact zone (compared with Figure 4.12) for those users who remain, although the lower demand represents a different cost of the cut.
This scenario matches the intuition that when capacity decreases significantly, users wait (perhaps to come back later), or ISPs limit normal traffic so that prioritized flows (such as control and business flows) can maintain their throughput. Note, however, that the price of protecting prioritized flows is a potentially large number of significantly impacted normal flows. Hence, the fundamental solution is still overprovisioning with spare capacity.
Increased demand (with unchanged capacity) is typically not caused by cable cuts, since cable cuts usually reduce capacity. Instead, this scenario is often caused by flash crowds, such as global events that cause short-term traffic surges. The result is that the after state shifts up relative to the before state (see Figure 4.14).
This scenario is not relevant to our cable cut for Bangladesh, but it shows the generality of our analytic approach. As in the previous scenarios, we can bound regions of before states leading to different outcomes (Figure 4.14 shows the situation for doubled traffic). As the figure shows, to accommodate an unexpected traffic surge (analogous to containing cable cut impact), over-provisioning of capacity is needed.
4.4.5 Implications for Connectivity Planning
In this section, we present a method to help a country plan its international connectivity.
A problem many countries, especially developing ones, face is: in order to let $a$% of the population enjoy decent online experiences, how much international capacity should be provisioned? More specifically, which submarine cable consortium should the country participate in? Which circuits should be provisioned? And how much capacity for each circuit?
In addition to normal conditions, countries are also interested in whether this goal will still be met during abnormal situations (such as a sudden capacity drop caused by cable cuts). This concern leads to more questions: how much extra capacity needs to be provisioned? How should the extra capacity be distributed among circuits?
We next present a method to aid countries in answering these questions, based on the what-if study described in Section 4.4.4. The core idea is to diversify connectivity to limit the capacity changes brought by common threats, and therefore to achieve maximal resilience using a minimal amount of resources. Figure 4.15 illustrates this idea. For a given traffic demand, the minimal capacity needed to achieve good QoE under normal conditions resides on the boundary of the "good experience" zone. However, in order to also be resilient to threats, the country needs extra capacity to extend itself into the "no impact" zone. The shape of the good-experience zone is independent of a country's connectivity, so the minimal capacity is fixed for a given traffic demand. In contrast, the shape of the no-impact zone can be changed via careful connectivity planning, and the bigger the zone, the less extra capacity is required. This observation has an important implication for countries: good connectivity planning can save millions of dollars in infrastructure resources.
Figure 4.15: Capacity planning during normal and abnormal conditions, showing the good-experience and no-impact zones in the capacity-flow space, with the minimal capacity and extra capacity marked for a given traffic demand.
The no-impact zone is bounded by how much capacity and traffic vary during abnormal conditions. The larger the capacity drop, or the larger the traffic increase, the smaller the no-impact zone. Hence, to extend the no-impact zone and in turn reduce the extra capacity needed, the country needs to shrink the range of capacity drops and traffic increases. We focus on how to shrink the capacity drop in this paper and leave confining traffic increases for future work.
To shrink the capacity drop, the country needs to diversify its connectivity so that a single threat only affects a limited number of circuits with a limited amount of capacity. A helpful concept here is the Shared Risk Group (SRG), which identifies resources (such as circuits) that are likely to be brought down by a common threat (for example, all of Bangladesh's eastbound circuits to Singapore are in the same SRG). The total capacity of all resources within an SRG hence represents the capacity drop caused by the corresponding threat. The maximum capacity over all SRGs ultimately determines the boundaries of the no-impact zone.
There is much previous work on how to identify SRGs [SYHG01, DG+02]. For cable cuts specifically, the common SRGs are circuits going through the same cable conduits, the same straits, and the same landing stations. Countries therefore need to pay special attention to ensure their circuits are diversified over different cables, straits, and destinations.
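A minimal sketch of this SRG computation follows (ours; the circuit inventory is hypothetical):

```python
from collections import defaultdict

def max_srg_drop(circuits, risk_key):
    """Worst-case capacity drop if every circuit sharing one value of
    risk_key (cable, strait, or landing station) fails together."""
    srg_capacity = defaultdict(float)
    for c in circuits:
        srg_capacity[c[risk_key]] += c["gbps"]
    return max(srg_capacity.values())

circuits = [
    {"cable": "SeaMeWe-4", "landing": "Singapore", "gbps": 40.0},
    {"cable": "SeaMeWe-4", "landing": "Singapore", "gbps": 20.0},
    {"cable": "new-west",  "landing": "Marseille", "gbps": 30.0},
]
for key in ("cable", "landing"):
    print(key, max_srg_drop(circuits, key))   # both 60.0: one SRG dominates
```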
4.4.6 General Tactics to Address Incomplete Data
Our experience suggests two general tactics for handling incomplete data: focusing on a subset of the problem to reduce the data needed, and studying a range of possibilities. We next elaborate on each of these two tactics.
The first tactic generally applies to problems that require large amounts of hard-to-obtain data. To work around missing data, we refocus on a subset of the problem that requires data that is more easily or readily available. In our specific case, the problem is to analyze the impact of a cable cut on all users in all countries. Solving this problem would need all related data about all users, which is impossible to obtain. We instead focus on a subset of all users. In the case at hand, this subset consists of users in Bangladesh, because those users were reportedly severely impacted by the cable cut. With this new focus, we only need data for the network components that concern Bangladesh and its users. Before obtaining such data, we first identify the relevant components. To this end, we start with the users of interest, who are represented in our model at the application layer, and work our way down the stack, identifying the components of interest at each layer. In short, we end up following the top-down service-reachability branch and not the ideal bottom-up process depicted in Figure 4.4.
The second tactic addresses missing data by studying a range of possible values of the data. In our specific case, the required data to study user QoE are link capacity and user traffic. However, both are proprietary and thus not available to us. To continue our analysis, we instead study a range of possible values, as we demonstrated in Section 4.4.4. By performing the what-if analysis, we can not only study current usage, but also predict possible future scenarios.

We have developed these tactics in the context of Bangladesh. We next show that they also apply to several other incidents, in Pakistan, Lebanon, and Kenya.
4.4.7 Applying to Other Incidents
In addition to Bangladesh, we also apply our model to the three other cases shown in Table 4.3. We examine whether our model can explain other incidents and whether low self-sufficiency and low geo-diversity are still the major causes (Section 4.4.7). We then discuss the different geographical regions each country is vulnerable to and provide recommendations (Section 4.4.7). Finally, we hypothesize and compare the impact on user QoE in all incidents (Section 4.4.7).
Generalizing Causes of Large Impact
We have concluded that low service self-sufficiency and low geographic diversity are the two major causes that make a country vulnerable to submarine cable cuts. We believe this conclusion also applies to many other incidents that impact different countries with different connectivity.
We examine the three additional incidents shown in Table 4.3 to support our conclusion. Our model provides a consistent explanation for all incidents. Except for Pakistan, all significantly impacted countries (capacity drop > 50%) show low self-sufficiency (ranging from 0% to 8%) and low geo-diversity. The reason Pakistan (with a medium level of geo-diversity) also experienced large impact is that it had two cables out of service during the same period. The challenge Pakistan faced was much bigger than those facing Bangladesh and Lebanon; even so, it performed better than the other two countries (a 60% capacity drop compared with 67% and 100%).
The four incidents cover a wide geographical range (South Asia, the Middle East, and Africa) and diverse connectivity; we therefore believe our model is generic enough to apply to many submarine cable cut incidents.

Although all incidents show that low geo-diversity is a major vulnerability, countries differ in the locations to which they are vulnerable. We next discuss the weakness of each country.
Geographic Diversity of Different Countries
We next discuss the geographical regions each country is vulnerable to and how they
could improve geo-diversity. Table 4.3 summarizes these geographical regions for each
country.
In Section 4.4.3 we learned that Bangladesh heavily relies on its eastbound circuits to Singapore for Internet access. Thus, the region between Bangladesh and Singapore that SeaMeWe-4 traverses is Bangladesh's weakness: earthquakes and ship anchors in this region pose threats to Bangladesh. As we also mentioned in Section 4.4.3, Bangladesh could improve its geographic diversity by adding circuits or cables to new destinations, such as India, the Middle East, and Europe.
Unlike Bangladesh, Pakistan's geographical weakness is westbound to Europe. We infer that westbound circuits through SeaMeWe-4 and all circuits through IMEWE (the other cable out of service) represent 60% of Pakistan's international capacity, and they were either broken or dysfunctional during the incident. The heavy circuit provisioning in one direction makes Pakistan vulnerable to threats along the westbound routes to Europe, almost all of which go through the Suez Canal, the biggest single point of failure of Pakistan's Internet access.
To improve geographic diversity, Pakistan could pursue two approaches. The first is to provision more eastbound circuits to Asia. The second is to establish circuits via different routes, such as through South Africa rather than through Egypt to reach Europe.
Lebanon has the lowest geographic diversity among all four countries. We infer that it only provisions westbound circuits to France for Internet access. Thus, a threat at any point along the cable route between Lebanon and France can bring the whole country down, which is what happened in the incident [Ami12].
Lebanon could improve its geographic diversity by establishing Internet circuits to
some countries other than France, and ideally in other directions. In this way, even if
the westbound circuits are broken, Lebanon can still rely on eastbound circuits to avoid
a complete Internet blackout.
Compared with the previous three countries, we believe Kenya has better geo-diversity. Similar to Pakistan, Kenya also provisions circuits that go through the Suez Canal to reach Europe. However, Kenya connects to at least two more destinations via two more routes: northbound to the United Arab Emirates and eastbound to Asia.
By examining four countries, we have shown that geographical weakness is often caused by provisioning circuits heavily in one direction, to one destination, or via one route. This weakness can lead to a significant drop in a country's international capacity. We next look at how this capacity drop affects user QoE.
Figure 4.16: Estimated impact on user QoE in the four incidents, placing Bangladesh, Pakistan, Lebanon, and Kenya in the parameter space of international capacity (Gbps) versus international flow count (K) before and after each cut.
Impact on User QoE
In this section, we analyze QoE (see Section 4.4.4) for all four incidents. We compare the potential impact on user QoE in the different countries, and see how good connectivity planning could insulate users from cable cut impact. Figure 4.16 shows our estimate of the position of each country before and after the cable cuts. We gather the approximate international Internet capacity of each country from public web pages [Inu11, Int12, Sta12, Com10]. These capacity numbers place each country on the capacity axis. We then position each country on the flow axis based on its number of Internet users. Finally, we determine how far the before state moves using the estimated or reported capacity drops listed in Table 4.3.
From Figure 4.16, we can see that Kenya has done a good job of insulating its users from submarine cable cuts. It has not only provisioned abundant extra capacity, but also has good geographic diversity (one cable cut resulted in only a 20% capacity drop). The other countries are much closer to the edge of acceptable capacity: all exit the "good experience" region after a single cable cut. These countries do not have adequate extra capacity in international connectivity, visualized by the distance between the before state and the good-experience zone boundary. Neither do they have good diversity of international capacity, as shown by the relative length of the arrows (the shorter, the higher the diversity).
From these additional cases, we conclude that good connectivity planning can insulate users from the impact of submarine cable cuts.
4.5 Guidelines to Understand and Model Threats
We next summarize the lessons we learned in framing, modeling, and analyzing network
threats, and what they say about network design.
First, on almost every layer of the Internet, topological connectivity does not imply data reachability. The topology could be the well-known AS and router topology on the network layer, the physical cable topology, or even the application-layer client-server or peer-to-peer topology. Prior work [Gao01, DJMS06, WZMS07] has mainly focused on the connectivity/reachability issue of the AS topology. In this paper, we also showed that cable connectivity is not enough: a SONET circuit needs to be established to transmit data (Sections 4.3.2 and 4.4.3). In fact, data transmission on every layer needs to follow the layer's control protocol (such as SONET, Ethernet, MPLS, OSPF, BGP, TCP, RTCP, and SIP), and it is these protocols that govern data reachability, given the prerequisite that the two ends are topologically connected. Thus, to model data reachability, one must consider the behavior of these control protocols. We stress this principle because it can easily be forgotten and thereby cause modeling errors.
Second, fault-recovery mechanisms reside on many layers, and new ones are frequently added. Network routing is the best-known recovery mechanism (Section 4.3.4). In this paper, we also identified ring protection on the SONET layer (Section 4.3.3) and server redundancy on the application layer (Section 4.3.5). The use of multiple servers to enhance reliability and performance is a relatively new mechanism used by content delivery networks (CDNs). Sometimes, when all of the lower layers fail to contain the threat impact, CDNs can mitigate it by directing users to appropriate servers (in Section 4.4.3, we show that Singaporean users are mainly served by local servers and thus were barely affected by the cut). Therefore, to correctly assess threat impacts, one must consider the fault-recovery mechanisms on all related layers, paying special attention to newly introduced ones.
Third, the effects of threats on real users are strongly influenced by user behavior and network architectures. This observation has implications for both modeling and network deployment. To model threats, one must consider local users' preferences for services. One must also understand where modern CDNs deploy servers, since CDN nodes can make a "foreign" service local. To understand the impact of threats, one must identify common traffic sources and destinations instead of just picking arbitrary endpoints, as we have demonstrated (Section 4.4.2).
This observation also suggests that countries that wish to improve their network
resilience can do so by improving self-sufficiency (encouraging use of local services and
local replicas of global services), as well as by diversifying network connectivity. All of
these “best practices” can reduce the impact of disruptions as shown by the examples in
Section 4.4.
Lastly, reachability is the basis, but by far not enough, to capture QoE for modern users. Modern users' expectations have risen sharply along with the rapid development of the Internet. Years ago, being able to fetch a webpage was satisfying enough for many users; nowadays, users abandon a webpage in seconds and regard a video that buffers frequently as intolerable (see our discussion in Section 4.3.6). Therefore, to draw real-world attention, we need to shift the focus from reachability to QoE.
4.6 Conclusions
We have developed a holistic model that is the first to relate low-layer physical threats to high-layer Quality-of-Experience for end users. Since no single organization has data spanning all these layers, we applied what-if analysis to understand possible outcomes in the face of gaps in specific data. We have applied our model to four incidents and identified low service self-sufficiency and low geographic diversity as two major vulnerabilities of developing countries. What-if analysis and our model can predict possible outcomes of future events, and the effects of mitigation strategies.
This chapter provides a third specific example supporting the thesis statement. We demonstrate how to use modeling approaches to assess potential cable-cut impact based on indirect topology and traffic data. We also show how to perform what-if analysis when some data is unavailable. By employing these approaches, we have bridged cable cuts with what billions of users care about: the reliability and quality of online web services.
This study also indicates the usefulness of combining modeling with what-if analysis to speculate about scenarios beyond cable-cut impact. What-if analysis studies a range of possibilities and therefore can help in understanding potential outcomes under abnormal conditions or in future scenarios. This general technique applies to a wide range of problems, such as predicting service response-time distributions under different network configurations [TZV+08].
Chapter 5
Related Work
In this chapter, we discuss and compare other work related to our thesis that also aims to improve understanding of the Internet by developing systematic approaches to address various data limitations. We present three areas of research related to our three specific studies: understanding the Internet edge (Section 5.1), building Internet topology (Section 5.2), and modeling Internet threats (Section 5.3).
5.1 Understanding Internet Edge Behavior
Internet end users drive Internet demand and shape Internet traffic. Thus, it is important to understand how they use and access the Internet. However, only recently have researchers begun exploring edge host behavior. End users access the Internet through IP addresses, which are also a major way to identify them. Our work thus investigates four important address usage issues, with the related work for each listed below.
Are contiguous addresses consistent and what are the typical block sizes? Huston's report analyzed the common prefix lengths in the BGP routing table [Hus09], but it cannot examine usage at granularities smaller than BGP prefixes. Our approach is able to look at these smaller block sizes through active probing.
Are allocated addresses efficiently utilized? Given the IPv4 address exhaustion that occurred on 15 April 2011, efficient utilization of IP addresses is crucial. Prior researchers inferred address utilization by detecting allocated but not advertised prefixes in the BGP routing table [MXZ+05], but what is routed may differ from what is actively used. Our work tries to track active use, and our study of individual addresses can reveal changes that happen to blocks inside an organization (smaller than are typically routed).
How many addresses are dynamically assigned? Xie et al. [XYA+07] have begun to explore this question with the goal of identifying dynamic blocks to assist spam prevention. Their work is based on passive collection of Hotmail web server logs, while our method takes a completely different approach based on active probing, and so can extend and corroborate their findings. Our prior work provides another perspective based on active probing with ICMP [HPG+08]. While that work focused on censuses (occasional but complete probing) and establishing the methodology, here we study survey data (frequent probing of a sample of the Internet) and add significant new analysis to identify block sizes and low-rate edges.
How can edge-link bitrates be identified? Sundaresan et al. [SdDF+11] studied edge access link performance, investigating how various factors (modem type, ISP traffic shaping policies) affect performance. Our work differs from theirs in coverage and methodology: their work studies only 4,000 edge gateway devices, while we study millions of addresses; their work measures directly from home gateway devices, while we explore the use of variance as a new approach to estimate edge-link bitrates.
5.2 Building Internet Topology
Internet topology has been heavily studied in the past ten years because of its importance for network diagnosis and prediction. These studies typically build an Internet map either comprising ASes as nodes and their logical relationships as edges (AS-level) [FFF99, Gao01, OPW+08, AKW09], or, with finer granularity, using lower-level routers as nodes and their physical links as edges (router-level) [LAWD04, SBS08].
However, many important activities (e.g., business disputes and political interference) happen above the AS level, at the organization granularity; thus an organization-level topology is needed. To integrate an organization-level map with the current Internet topology, we need a mapping from organizations to ASes. To the best of our knowledge, only two prior works have attempted to relate ASes to organizations. Hyun et al. [HBC03] examined the incongruities between AS paths derived from traceroute and BGP routing. As a side effort, they infer AS ownership based on the ID and name of registered owners (organizations) in a subset of the American Registry for Internet Numbers (ARIN) WHOIS database, with substantial manual workload. We use a more complete dataset consisting of registry information from all five continents, and an automatic method that uses much more information for clustering. PCH [Pac10] maintains a manually generated AS/organization directory for network operators to contact each other. We evaluate this dataset in our technical report [CHKW12a] and demonstrate its significant incompleteness.
5.3 Modeling Internet Infrastructure Threats
Four areas relate to our work: models that predict the impact of submarine cable cuts, post-facto measurements of cuts, models of other threats to Internet infrastructure, and non-threat models of parts of the Internet.
Models of submarine cable cuts Omer et al. provide a model to assess the impact of submarine cable cuts [MOM09]. They construct a physical cable topology in which nodes are continents and edges are aggregations of inter-connecting submarine cables, based on the public map [Mah13]. They then hypothesize threats by removing nodes or edges and assess impact by computing the amount of traffic that could be delivered between continents after the threat. Unlike their work, we analyze real-world cuts, and we relate the impact to end users. In addition, we pay attention to the diverse data-transmission mechanisms on different layers, which are not present in their work.
Measurements of submarine cable cuts Many researchers have measured the consequences of submarine cable cuts [Mad12c, Mad12a, CPBW11]. In contrast, our model can provide implications before a cut happens. Nevertheless, these measurements are valuable as ground truth to validate and correct our model.
Models of other threats Because of the availability of AS topology data, past threat models typically build on the AS graph, modeling threats as removals of nodes or edges from it. Albert et al. [AJB00] first analyzed errors (accidental removal of nodes) and attacks (intentional removal of nodes), assessing impact as network-diameter increase and fragmentation. Dolev et al. [DJMS06] build on this model, but with consideration of the network-layer transmission mechanism: they note that connectivity between ASes does not imply reachability, since a valid AS path must be valley-free [Gao01]. Wu et al. [WZMS07] further enrich the model and assess impact as the reachability changes between all AS pairs. Different from their work, we start with threats drawn from real-world incidents and assess impact not on the network layer but on end users.
Models of the Internet Much prior work models how different parts of the Internet work without explicitly considering threats. These models provide useful input to our work. In particular, Feamster et al. [FWR04] model how BGP selects paths for traffic flows. Mok et al. [MCC11] model how flow conditions affect video streaming quality, while Zhang et al. [ZXH+12] model video telephony. Researchers in [DSA+11, KS12, MCC11, CHHL06] model how service quality affects user QoE. We incorporate some of the models above to build our multi-layer threat model.
Chapter 6
Future Work and Conclusions
In this chapter, we point out several directions for future work and conclude our thesis.
6.1 Future Work
There are areas of immediate future work that could strengthen the validity of our claims.
We next discuss the future work for our three studies respectively.
In Chapter 2 we study the utilization efficiency, management size, and assignment patterns of IPv4 addresses. There are three future directions to strengthen this work. First, we have assumed that addresses reassigned between different types of hosts (sometimes servers, other times clients) are rare, and therefore that usage patterns are generally consistent over months or even years. Future work could verify this assumption by studying address usage over time. Second, we have speculated that slow dial-up connections are a cause of address under-utilization. Future work could provide more quantitative results to support this claim. Finally, a third direction for future work is to test the hypothesis that improving address utilization is a major reason for developing countries to adopt dynamic addressing, which is mentioned but not verified in our work.
In Chapter 3 we map ASes to the organizations that own them by exploring the contact information stored in five WHOIS registries. In our clustering approach, we combine data from all registries without differentiation; one direction for future work is to examine unique aspects of the data from each registry. Second, we have only explored company subsidiary information for U.S. public companies. Future work could mine other sources of data for non-U.S. companies and private companies to improve the completeness of the AS-to-organization mapping. Third, although ownership changes of ASes are infrequent, they do happen following acquisitions and mergers. It would be beneficial for future work to automate the manual verification process and thereby fully automate the mapping process. This fully automated process could then be used to update the AS-to-organization mapping on a regular basis.
Lastly, in Chapter 4 we model the impact of submarine cable cuts on web services based on Internet topology and traffic data. We draw our conclusions about the causes of large impact by applying the model to four real-world incidents. To make our conclusions more solid, one direction for immediate future work is to apply the model to more incidents and validate it against more recorded facts. Because recorded facts are rare, another beneficial direction is to develop tools that can automatically measure and record service reachability and quality during future incidents. Data gathered by these tools can serve as valuable ground truth to validate and correct the model, and in turn shed light on ways to improve resiliency.
Besides the above immediate future work, our thesis suggests a much larger area of opportunity outside the scope of our three specific studies. The thesis asserts that systematic approaches can overcome data limitations to improve understanding of the Internet. We next suggest two topics that would benefit from our approaches.
First, future work could consider studying the capacity distribution of all access links on the Internet using the clustering approach developed in Chapter 2. Being able to detect access link capacity is extremely helpful for content providers and CDNs, since they can then customize the content delivered to achieve the best user experience. However, studying the capacity of each access link one by one is often infeasible, since data is usually available for only a subset of links. The basic idea of our clustering approach might help address this issue. Specifically, in our work we show that one can cluster addresses into blocks and use addresses within the same block to represent addresses without data. To apply this idea to this specific problem, one can cluster links into groups and use links within the same group to represent links without data. Results obtained in this way can cover a much larger set of access links and thus be more useful to content providers and CDNs.
Second, one can study the impact of other types of threats following the general guidelines of the modeling approach developed in Chapter 4. Our modeling approach stresses the importance of being problem-driven; in other words, studying harm that has real-world victims and threats that do happen in the real world. Future modeling work could benefit from this general principle and in turn develop threat models that are useful. More specifically, we believe threats such as power outages (point disruption) and hurricanes (range disruption) are all worth studying. In terms of harm, we believe that besides degraded QoE for normal users, violation of Service Level Agreements (SLAs) also has real-world victims (service providers), and is therefore also an interesting way to represent harm.
6.2 Conclusions
This thesis demonstrates, through three specific examples, how we use systematic approaches to overcome various data limitations in order to improve understanding of the Internet infrastructure. Our first work demonstrates how to use statistical and clustering approaches to address indirect and incomplete ICMP probe responses for a better understanding of the utilization and management of IPv4 addresses. Our second study systematically builds up a clustering approach to handle indirect, incomplete, noisy, and over-fit WHOIS data in order to achieve an accurate AS-to-organization map. Last, our third work shows how to combine modeling approaches and what-if analysis to assess the impact of submarine cable cuts on web services based on indirect and sometimes unknown topology and traffic data. All three of our studies have improved understanding of important aspects of the Internet. Together as a whole, they support our thesis statement and suggest a much larger area of problems that can be solved in similar ways.
Bibliography
[ABD11] Saamer Akhshabi, Ali C. Begen, and Constantine Dovrolis. An experimental evaluation of rate-adaptation algorithms in adaptive streaming over HTTP. In Proceedings of the Second Annual ACM Conference on Multimedia Systems, pages 157-168. ACM, 2011.

[ACF+12] Bernhard Ager, Nikolaos Chatzis, Anja Feldmann, Nadi Sarrar, Steve Uhlig, and Walter Willinger. Anatomy of a large European IXP. In Proceedings of the ACM SIGCOMM 2012 Conference on Applications, Technologies, Architectures, and Protocols for Computer Communication, SIGCOMM '12, pages 163-174, New York, NY, USA, 2012. ACM.

[AJB00] Réka Albert, Hawoong Jeong, and Albert-László Barabási. Error and attack tolerance in complex networks. Nature, 406:378-382, July 27 2000.

[AKW09] Brice Augustin, Balachander Krishnamurthy, and Walter Willinger. IXPs: mapped? In Proceedings of the ACM Internet Measurement Conference, pages 336-349, New York, NY, USA, 2009. ACM.

[Ale13] Alexa. The top 500 sites in each country or territory. http://www.alexa.com/topsites/countries, June 2013.

[ALWD05] David Alderson, Lun Li, Walter Willinger, and John C. Doyle. Understanding Internet topology: principles, models, and validation. ACM/IEEE Transactions on Networking, 13(6):1205-1218, 2005.

[Ame08] American Registry for Internet Numbers. RIR statistics exchange format. Web page ftp://ftp.arin.net/pub/stats/arin/README, September 2008.

[Ami12] Mohamad El Amin. Lebanon experiences nationwide Internet blackout. The Daily Star website http://www.dailystar.com.lb/Business/Lebanon/2012/Jul-02/179079-lebanon-experiences-nationwide-internet-blackout.ashx, July 2012.

[ARI10] ARIN. Introduction to ARIN's database. https://www.arin.net/knowledge/database.html, March 2010.

[BBC12a] BBC. Bangladesh suffers Internet disruption after cut cable. http://www.bbc.co.uk/news/technology-18366007, June 2012.

[BBC12b] BBC. Ship's anchor slows down East African web connection. http://www.bbc.co.uk/news/world-africa-17179544, February 2012.

[BBC13] BBC. Egypt arrests as undersea Internet cable cut off Alexandria. http://www.bbc.co.uk/news/world-middle-east-21963100, March 2013.

[bdn12] bdnews24. Bangladesh connected with terrestrial cable. http://biz-bd.bdnews24.com/details.php?id=237802&cid=4, December 2012.

[Cab11] Time Warner Cable. Locations. http://www.timewarnercable.com/corporate/about/careers/locations.html, November 2011.

[CAI13] CAIDA. AS Rank: AS ranking. http://as-rank.caida.org/, August 2013.

[CFH+13] Matt Calder, Xun Fan, Zi Hu, Ethan Katz-Bassett, John Heidemann, and Ramesh Govindan. Mapping the expansion of Google's serving infrastructure. In Proceedings of the 2013 ACM Internet Measurement Conference. ACM, 2013.

[CH10] Xue Cai and John Heidemann. Understanding block-level address usage in the visible Internet. In Proceedings of the ACM SIGCOMM Conference, pages 99-110, New Delhi, India, August 2010. ACM.

[CHHL06] Kuan-Ta Chen, Chun-Ying Huang, Polly Huang, and Chin-Laung Lei. Quantifying Skype user satisfaction. In Proceedings of the ACM SIGCOMM Conference, pages 399-410, Pisa, Italy, 2006. ACM.

[CHK+07] Amit Chandel, Oktie Hassanzadeh, Nick Koudas, Mohammad Sadoghi, and Divesh Srivastava. Benchmarking declarative approximate selection predicates. In Proceedings of the 2007 ACM SIGMOD International Conference on Management of Data, pages 353-364, Beijing, China, 2007. ACM.

[CHKW10] Xue Cai, John Heidemann, Balachander Krishnamurthy, and Walter Willinger. Towards an AS-to-organization map. In Proceedings of the ACM Internet Measurement Conference, pages 199-205, Melbourne, Australia, November 2010. ACM.

[CHKW12a] Xue Cai, John Heidemann, Balachander Krishnamurthy, and Walter Willinger. Accurate AS-to-organization mapping and its implications (extended). Technical Report ISI-TR-679, USC/Information Sciences Institute, 2012. ftp://ftp.isi.edu/isi-pubs/tr-679.pdf.

[CHKW12b] Xue Cai, John Heidemann, Balachander Krishnamurthy, and Walter Willinger. An organization-level view of the Internet and its implications (extended). Technical Report ISI-TR-2009-679, USC/Information Sciences Institute, June 2012.

[Com10] Communications Commission of Kenya (CCK). Quarterly sector statistics report. http://www.cck.go.ke/resc/statistics/SECTOR_STATISTICS_REPORT_Q2_2010-11.pdf,
November 2010.
[COP
+
03] James H. Cowie, Andy T. Ogielski, BJ Premore, Eric A. Smith, and
Todd Underwood. Impact of the 2003 Blackouts on Internet Com-
munications. Renesys Website https://www:renesys:com/tech/
presentations/pdf/Renesys BlackoutReport:pdf, Novem-
ber 2003.
[CPBW11] Kenjiro Cho, Cristel Pelsser, Randy Bush, and Youngjoon Won. The Japan
earthquake: the impact on traffic and routing observed by a local ISP. In
Proceedings of the Special Workshop on Internet and Disasters, pages 2:1–
2:8. ACM, 2011.
[CTCL11] Yu-Chun Chang, Po-Han Tseng, Kuan-Ta Chen, and Chin-Laung Lei.
Understanding the performance of thin-client gaming. In IEEE International
Workshop Technical Committee on Communications Quality and Reliability
(CQR), pages 1–6. IEEE, 2011.
[DAL
+
05] John C Doyle, David L Alderson, Lun Li, Steven Low, Matthew Roughan,
Stanislav Shalunov, Reiko Tanaka, and Walter Willinger. The ”robust yet
fragile” nature of the Internet. Proceedings of the National Academy of Sci-
ences (PNAS), 102(41):14497–14502, 2005.
[DG
+
02] John Doucette, Wayne D Grover, et al. Capacity design studies of span-
restorable mesh transport networks with shared-risk link group (SRLG)
effects. In SPIE Opticomm, pages 25–38, 2002.
196
[DJMS06] Danny Dolev, Sugih Jamin, Osnat Mokryn, and Yuval Shavitt. Internet
resiliency to attacks and failures under BGP policy routing. Computer Net-
works, 50(16):3183–3196, 2006.
[DKF
+
07] Xenofontas Dimitropoulos, Dmitri Krioukov, Marina Fomenkov, Bradley
Huffaker, Young Hyun, kc claffy, and George Riley. AS relationships: Infer-
ence and validation. ACM Computer Communication Review, 37(1):29–40,
January 2007.
[Dro97] R. Droms. Dynamic host configuration protocol. RFC 2131, Internet
Request For Comments, March 1997.
[DSA
+
11] Florin Dobrian, Vyas Sekar, Asad Awan, Ion Stoica, Dilip Joseph, Aditya
Ganjam, Jibin Zhan, and Hui Zhang. Understanding the impact of video
quality on user engagement. In Proceedings of the ACM SIGCOMM 2011
conference, SIGCOMM ’11, pages 362–373, New York, NY , USA, 2011.
ACM.
[EBN08] Brian Eriksson, Paul Barford, and Robert Nowak. Network discovery from
passive measurements. In Proceedings of the ACM SIGCOMM Conference,
pages 291–302, Seattle, Washigton, USA, August 2008. ACM.
[Ehl06] Jeff Ehling. Time-Warner Cable leaving Houston. http:
//abclocal:go:com/ktrk/story?section=news/local&id=
4423119, August 2006.
[Fal03] Kevin Fall. A delay-tolerant network architecture for challenged internets.
In Proceedings of the ACM SIGCOMM Conference, pages 27–34, Karlsruhe,
Germany, August 2003. ACM.
[FF99] Sally Floyd and Kevin Fall. Promoting the use of end-to-end congestion
control in the Internet. ACM/IEEE Transactions on Networking, 7(4):458–
473, August 1999.
[FFF99] Michalis Faloutsos, Petros Faloutsos, and Christos Faloutsos. On power-
law relationships of the Internet topology. In Proceedings of the ACM
SIGCOMM Conference, pages 251–262, Cambridge, MA, USA, September
1999. ACM.
[FLYV93] V . Fuller, T. Li, J. Yu, and K. Varadhan. Classless inter-domain routing
(CIDR): an address assignment and aggregation strategy. RFC 1519, Internet
Request For Comments, September 1993.
[FP01] Sally Floyd and Vern Paxson. Difficulties in simulating the Internet.
ACM/IEEE Transactions on Networking, 9(4):392–403, August 2001.
197
[FS69] Ivan Fellegi and Alan Sunter. A theory for record linkage. Journal of the
American Statistical Association, 64:1183–1210, December 1969.
[FVFB05] Michael Freedman, Mythili Vutukuru, Nick Feamster, and Hari Balakrish-
nan. Geographic Locality of IP Prefixes. In ACM Internet Measurement
Conference, pages 13–13, Berkeley, CA, October 2005. ACM.
[FWR04] Nick Feamster, Jared Winick, and Jennifer Rexford. A model of BGP routing
for network engineering. In SIGMETRICS, pages 331–342, New York, NY ,
USA, 2004. ACM.
[GA06] Donald Greenlees and Wayne Arnold. Asia scrambles to restore
communications after quake. http://www:nytimes:com/
2006/12/28/business/worldbusiness/28iht-
connect:4042439:html? r=1, December 2006.
[Gal13] Gallup. Gallup world poll. http://www:gallup:com/
strategicconsulting/en-us/worldpoll:aspx, November
2013.
[Gao01] Lixin Gao. On inferring autonomous system relationships in the Internet.
ACM/IEEE Transactions on Networking, 9(6):733–745, December 2001.
[GR97] Ramesh Govindan and Anoop Reddy. An Analysis of Internet Inter-Domain
Topology and Route Stability. In Proceedings of the IEEE Infocom, pages
850–857, Kobe, Japan, 1997.
[HB96] J. Hawkinson and T. Bates. Guidelines for creation, selection, and regis-
tration of an autonomous system (AS). RFC 1930, Internet Request For
Comments, March 1996.
[HBC03] Young Hyun, Andre Broido, and K. C. Claffy. Traceroute and BGP AS
path incongruities. Technical report, UCSD CAIDA, 2003. Published
as web page http://www:caida:org/publications/papers/
2003/ASP/.
[HHH
+
12] Te-Yuan Huang, Nikhil Handigol, Brandon Heller, Nick McKeown, and
Ramesh Johari. Confused, timid, and unstable: picking a video streaming
rate is hard. In Proceedings of the ACM Internet Measurement Conference,
pages 225–238, Boston, MA, USA, 2012. ACM.
[HPG
+
08] John Heidemann, Yuri Pradkin, Ramesh Govindan, Christos Papadopoulos,
Genevieve Bartlett, and Joseph Bannister. Census and survey of the visi-
ble internet. In Proceedings of the ACM Internet Measurement Conference,
pages 169–182, V ouliagmeni, Greece, October 2008. ACM.
198
[Hus09] Geoff Huston. IPv4 reports. Web Page http://bgp:potaroo:net/
index-ale:html, April 2009.
[Int07] Internet Software Consortium. Internet domain survey. http://
www:isc:org/solutions/survey, January 2007.
[Int12] Internet Service Providers Association of Pakistan (ISPAK). Internet Facts.
http://www:ispak:pk, April 2012.
[Inu11] Hasanul Haq Inu. IPv6 deployment in bangladesh. http:
//meetings:apnic:net/ data/assets/pdf file/0003/
23664/BD-IPv6Bangladesh:pdf, 2011.
[ISC11] ISC. Origin ASN for anycasted services. http://www:isc:org/
community/blog/201109/origin-asn-anycasted-
services, September 2011.
[ISO94] ISO/IEC. ISO/IEC standard 7498-1. http://
standards:iso:org/ittf/PubliclyAvailableStandards/
s020269 ISO IEC 7498-1 1994(E):zip, 1994.
[JD88] A.K. Jain and R.C. Dubes. Algorithms for clustering data. Prentice Hall,
Englewood Cliffs, NJ, 1988.
[KFSC07] Manas Khadilkar, Nick Feamster, Matt Sanders, and Russ Clark. Usage-
based DHCP lease time optimization. In Proceedings of the 7th ACM Inter-
net Measurement Conference, pages 71–76, October 2007.
[KS12] S. Shunmuga Krishnan and Ramesh K. Sitaraman. Video stream qual-
ity impacts viewer behavior: inferring causality using quasi-experimental
designs. In Proceedings of the 2012 ACM conference on Internet measure-
ment conference, IMC ’12, pages 211–224, New York, NY , USA, 2012.
ACM.
[LAC09] LACNIC. Lacnic policy manual (v1.2 - 11/03/2009). http://
www:lacnic:net/en/politicas/manual3:html, April 2009.
[LAWD04] Lun Li, David Alderson, Walter Willinger, and John Doyle. A first-
principles approach to understanding the Internet’s router-level topology. In
Proceedings of the ACM SIGCOMM Conference, pages 3–14, Portland, Ore-
gon, USA, August 2004.
[Lyo97] Gordon Lyon. nmap. computer software at http://insecure:org/
nmap/, September 1997.
199
[Mad12a] Doug Madory. East African Internet Resilience. Renseys Blog http://
www:renesys:com/2012/02/east-african-cable-breaks/,
February 2012.
[Mad12b] Doug Madory. Lebanon Loses Lone Link. Renseys Blog
http://www:renesys:com/blog/2012/07/large-outage-
in-lebanon:shtml, July 2012.
[Mad12c] Doug Madory. SMW4 Cut Shakes Up South Asia. Renseys
Blog http://www:renesys:com/blog/2012/06/smw4-break-
on-south-asia:shtml, June 2012.
[Mad13] Doug Madory. Intrigue Surrounds SMW4 Cut. Renseys Blog
http://www:renesys:com/2013/03/intrigue-surrounds-
smw4-cut/, March 2013.
[Mah13] Greg Mahlknecht. Greg’s cable map. http://www:cablemap:info/,
2013.
[Max12] MaxMind. Geolite city. http://www:maxmind:com/app/
geolitecity, March 2012.
[MCC11] Ricky KP Mok, Edmond WW Chan, and Rocky KC Chang. Measuring the
quality of experience of HTTP video streaming. In Integrated Network Man-
agement (IM), 2011 IFIP/IEEE International Symposium on, pages 485–
492. IEEE, 2011.
[McP08] Danny McPherson. Internet routing insecurity: Pakistan nukes YouTube?
http://ddos:arbornetworks:com/2008/02/internet-
routing-insecuritypakistan-nukes-youtube/, February
2008.
[Mey13] David Meyer. University of Oregon Route Views project. http://
www:routeviews:org, 2013.
[MMU
+
06] Wolfgang M¨ uhlbauer, Olaf Maennel, Steve Uhlig, Anja Feldmann, and
Matthew Roughan. Building an AS-topology model that captures route
diversity. In Proceedings of the ACM SIGCOMM Conference, pages 195–
204, September 2006.
[Moc87] P. Mockapetris. Domain names—concepts and facilities. RFC 1034, Internet
Request For Comments, November 1987.
200
[MOM09] R. Nilchiani M. Omer and A. Mostashari. Measuring the resilience of the
global Internet infrastructure system. In Proc. of the IEEE International
Systems Conference, pages 156–162, Vancouver, Canada, 2009.
[MPDPM02] Guido Maier, Achille Pattavina, Simone De Patre, and Mario Martinelli.
Optical network survivability: protection techniques in the WDM layer. Pho-
tonic Network Communications, 4(3-4):251–269, 2002.
[MTS
+
02] A. Medina, N. Taft, K. Salamatian, S. Bhattacharyya, and C. Diot. Traffic
matrix estimation: existing techniques and new directions. In Proceedings
of the ACM SIGCOMM Conference, pages 161–174, Pittsburgh, PA, USA,
2002. ACM.
[MXZ
+
05] Xiaoqiao Meng, Zhiguo Xu, Beichuan Zhang, Geoff Huston, Songwu Lu,
and Lixia Zhang. IPv4 address allocation and the BGP routing table evolu-
tion. ACM Computer Communication Review, 35(1):71–80, January 2005.
[NBW06] Mark Newman, Albert-Laszlo Barabasi, and Duncan J. Watts. The Structure
and Dynamics of Networks: (Princeton Studies in Complexity). Princeton
University Press, Princeton, NJ, USA, 2006.
[NCC09] RIPE NCC. Ripe database query reference manual. http://
www:ripe:net/db/support/query-reference-manual:pdf,
November 2009.
[OPW
+
08] Ricardo V . Oliveira, Dan Pei, Walter Willinger, Beichuan Zhang, and Lixia
Zhang. In search of the elusive ground truth: the internet’s AS-level connec-
tivity structure. In Proceedings of the ACM SIGMETRICS, pages 217–228,
Annapolis, MD, USA, 2008. ACM.
[OPW
+
10] Ricardo Oliveira, Dan Pei, Walter Willinger, Beichuan Zhang, and Lixia
Zhang. The (In)Completeness of the Observed Internet AS-level Structure.
ACM/IEEE Transactions on Networking, 18(1):109–122, February 2010.
[Pac10] Packet Clearing House. PCH INOC-DBA. https://www:pch:net/
inoc-dba/, April 2010.
[Pou09] Kevin Poulsen. Oops! AT&T Blackhole Was 4Chan’s Fault. Wired Website
http://www:wired:com/threatlevel/2009/07/4chan/, July
2009.
[QHP13] Lin Quan, John Heidemann, and Yuri Pradkin. Trinocular: Understanding
Internet Reliability Through Adaptive Probing. In Proceedings of the ACM
SIGCOMM Conference, pages 255–266, Hong Kong, China, August 2013.
ACM.
201
[Reg07] Regional Internet Registry. Resource ranges and geographical data.
ftp://ftp:afrinic:net/pub/stats/afrinic/, ftp:
//ftp:apnic:net/pub/stats/apnic/, ftp://ftp:arin:net/
pub/stats/arin/, ftp://ftp:lacnic:net/pub/stats/
lacnic/,ftp://ftp:ripe:net/ripe/stats/, June 2007.
[Reg09] Regional Internet Registry. http://www:afrinic:net/,
http://www:apnic:net/, http://www:arin:net/, http:
//www:lacnic:net/,http://www:ripe:net/, November 2009.
[RL95] Y . Rekhter and T. Li. A border gateway protocol 4 (BGP-4). RFC 1771,
Internet Request For Comments, March 1995.
[RN03] Stuart J. Russell and Peter Norvig. Artificial Intelligence: A Modern
Approach. Prentice Hall, 2003.
[RWB
+
96] Stephen E Robertson, Steve Walker, MM Beaulieu, Mike Gatford, and Ali-
son Payne. Okapi at TREC-4. In Proceedings of the fourth Text REtrieval
Conference (TREC), pages 73–97. NIST Special Publication, 1996.
[RWHM03] J. Rosenberg, J. Weinberger, C. Huitema, and R. Mahy. STUN—simple
traversal of user datagram protocol (UDP) through network address transla-
tors (NATs). RFC 3489, Internet Request For Comments, December 2003.
[SARK02] Lakshminarayanan Subramanian, Sharad Agarwal, Jennifer Rexford, and
Randy H. Katz. Characterizing the Internet hierarchy from multiple vantage
points. In Proceedings of the IEEE Infocom, pages 618–627, June 2002.
[SB88] Gerard Salton and Christopher Buckley. Term-weighting approaches in auto-
matic text retrieval. In Information Processing and Management, pages 513–
523, 1988.
[SBS08] Rob Sherwood, Adam Bender, and Neil Spring. DisCarte: A Disjunctive
Internet Cartographer. In Proceedings of the ACM SIGCOMM Conference,
pages 303–314, Seattle, Washigton, USA, August 2008. ACM.
[SdDF
+
11] Srikanth Sundaresan, Walter de Donato, Nick Feamster, Renata Teixeira,
Sam Crawford, and Antonio Pescap` e. Broadband internet performance: a
view from the gateway. In Proceedings of the ACM SIGCOMM Conference,
pages 134–145, New York, NY , USA, 2011. ACM.
[SEA13a] SEA-ME-WE 4. Cable system configuration. http://
www:seamewe4:com/inpages/cable system:asp, February
2013.
202
[SEA13b] SEA-ME-WE 4. Sea-me-we 4 potential customer material.
http://www:seamewe4:com/pdfs/home/Customer event/
SMW4 Customer event with Backhaul slide:pdf, February
2013.
[Sin10] SingTel. Coverage map. http://business:singtel:com/
upload hub/mnc/STiX Factsheet 2010:pdf, 2010.
[Sin13] SingTel. Our coverage. http://info:singtel:com/large-
enterprise/products/global-connectivity/our-
coverage, February 2013.
[SM06] Matthew Sullivan and Luis Munoz. Suggested generic DNS naming schemes
for large networks and unassigned hosts. Work in progress (Internet
draft draft-msullivan-dnsop- generic-naming-schemes-
00:txt), April 2006.
[SS] U.S. Securities and Exchange (SEC). Researching public companies through
EDGAR: A guide for investors. http://www:sec:gov/investor/
pubs/edgarguide:htm.
[SS11] Aaron Schulman and Neil Spring. Pingin’ in the rain. In Proceedings of
the 2011 ACM SIGCOMM conference on Internet measurement conference,
IMC ’11, pages 19–28, New York, NY , USA, 2011. ACM.
[Sta12] The Daily Star. Internet speed increases in one year. The Daily Star
Website http://www:dailystar:com:lb/Business/Lebanon/
2012/Nov-10/194587-internet-speed-increases-in-
one-year:ashx, November 2012.
[Sub13] Submarine Telecoms Forum. Submarine Cable Almanac. http:
//www:subtelforum:com/articles/submarine-cable-
almanac/, February 2013.
[SYHG01] Panagiotis Sebos, Jennifer Yates, Gisli Hjalmtysson, and Albert Greenberg.
Auto-discovery of shared risk link groups. In Optical Fiber Communication
Conference and Exhibit, 2001. OFC 2001, volume 3, page WDD3. IEEE,
2001.
[The03] The National Academic Press. The Internet under crisis conditions:
Learning from the impact of September 11. http://books:nap:edu/
openbook:php?isbn=0309087023, 2003.
203
[TRKN08] Ionut Trestian, Supranamaya Ranjan, Aleksandar Kuzmanovic, and Anto-
nio Nucci. Unconstrained endpoint profiling (Googling the Internet). In
Proceedings of the ACM SIGCOMM Conference, pages 279–290, Seattle,
Washigton, USA, August 2008. ACM.
[TZV
+
08] Mukarram Tariq, Amgad Zeitoun, Vytautas Valancius, Nick Feamster, and
Mostafa Ammar. Answering what-if deployment and configuration ques-
tions with wise. In Proceedings of the ACM SIGCOMM Conference, pages
99–110, New York, NY , USA, 2008. ACM.
[Und05] Todd Underwood. Internet-Wide Catastrophe–Last Year.
Renseys Blog http://www:renesys:com/blog/2005/12/
internetwide nearcatastrophela:shtml, December 2005.
[Und06a] Todd Underwood. Con-Ed Steals the ’Net. Renseys
Blog http://www:renesys:com/blog/2006/01/
coned steals the net:shtml, January 2006.
[Und06b] Todd Underwood. Sprint and Cogent Peer. Renseys Blog
http://www:renesys:com/blog/2006/11/sprint-and-
cogent-peer:shtml, November 2006.
[USC07a] USC/LANDER project. Internet Addresses Survey dataset, PREDICT ID
USC-LANDER/internet address survey reprobing it16w-
20070216. http://www:isi:edu/ant/lander, February 2007.
[USC07b] USC/LANDER project. Internet Addresses Survey dataset, PREDICT ID
USC-LANDER/internet address survey reprobing it17w-
20070601. http://www:isi:edu/ant/lander, June 2007.
[USC07c] USC/LANDER project. Internet Addresses Survey dataset, PREDICT
IDUSC-LANDER/survey validation usc-20070813. http://
www:isi:edu/ant/lander, August 2007.
[USC09] USC/LANDER project. Internet Addresses Survey dataset, PREDICT ID
USC-LANDER/internet address survey reprobing it30w-
20091223, December 2009.
[USC10] USC/LANDER project. Internet Addresses Survey dataset, PREDICT ID
USC-LANDER/internet address survey reprobing it31w-
20100208, February 2010.
[Ver11] Verizon. Verizon business policy for settlement-free interconnection with
internet networks. http://www:verizonbusiness:com/terms/
peering/, September 2011.
204
[WZMS07] Jian Wu, Ying Zhang, Z. Morley Mao, and Kang G. Shin. Internet routing
resilience to failures: analysis and implications. In ACM CoNEXT, pages
25:1–25:12, New York, NY , USA, 2007. ACM.
[XYA
+
07] Yinglian Xie, Fang Yu, Kannan Achan, Eliot Gillum, Moises Goldszmidt,
and Ted Wobber. How dynamic are IP addresses? In Proceedings of the
ACM SIGCOMM Conference, pages 301–312, Kyoto, Japan, August 2007.
ACM.
[ZRLD03] Yin Zhang, Matthew Roughan, Carsten Lund, and David Donoho. An
information-theoretic approach to traffic matrix estimation. In Proceedings
of the ACM SIGCOMM Conference, pages 301–312, Karlsruhe, Germany,
2003. ACM.
[ZXH
+
12] Xinggong Zhang, Yang Xu, Hao Hu, Yong Liu, Zongming Guo, and Yao
Wang. Profiling Skype video calls: Rate control and video quality. In Pro-
ceedings of the IEEE Infocom, pages 621–629, Orlando, FL, USA, 2012.
205