LEARNING ABOUT THE INTERNET THROUGH EFFICIENT SAMPLING AND
AGGREGATION
by
Lin Quan
A Dissertation Presented to the
FACULTY OF THE USC GRADUATE SCHOOL
UNIVERSITY OF SOUTHERN CALIFORNIA
In Partial Fulfillment of the
Requirements for the Degree
DOCTOR OF PHILOSOPHY
(COMPUTER SCIENCE)
May 2014
Copyright 2014 Lin Quan
Dedication
To my beloved parents.
Acknowledgments
My graduate study at USC and ISI has been the most memorable period of my life. I am
grateful to the many people who helped me during this endeavor.
First, I want to thank my advisor, John Heidemann, for his guidance, encouragement,
and patience during my entire PhD study. I have learned a great deal about problem solving,
research, and presentation skills, and I believe these skills will serve me well in my career
going forward.
I would also like to thank the many colleagues and friends whom I have had the pleasure
of knowing. Thanks to Yuri Pradkin for co-authoring several papers and sharing research
ideas; whenever I have problems with coding or our computing environment, he
seems to know it all. Thanks to the many fellow students here with whom I have shared the
joys and pains and other feelings: Andrew Goodney, Xue Cai, Zi Hu, Calvin Ardi, Xun
Fan, Liang Zhu, Chengjie Zhang, Hao Shi, Xiyue Deng, Lihang Zhao, Weiwei Chen,
and many others. Thanks to Joe Kemp, Alba Regalado and Matt Binkley for booking
my conference travel and helping with other administrative tasks.
Table of Contents
Dedication ii
Acknowledgments iii
List of Tables viii
List of Figures x
Abstract xiv
Chapter 1: Introduction 1
1.1 Overview . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1
1.2 Problem Space . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3
1.2.1 Sampling and Aggregation Techniques . . . . . . . . . . . . . 4
1.2.2 Resource Constraints . . . . . . . . . . . . . . . . . . . . . . . 5
1.2.3 Knowledge Discovery . . . . . . . . . . . . . . . . . . . . . . 8
1.3 Thesis Statement . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9
1.4 Supporting the Thesis Statement . . . . . . . . . . . . . . . . . . . . . 10
1.5 Research Contributions . . . . . . . . . . . . . . . . . . . . . . . . . . 12
Chapter 2: Characteristics and Reasons of Long-lived Internet Flows 14
2.1 Motivation for Studying Long-lived Internet Flows . . . . . . . . . . . 14
2.1.1 Contributions . . . . . . . . . . . . . . . . . . . . . . . . . . . 16
2.1.2 Relation to Thesis . . . . . . . . . . . . . . . . . . . . . . . . 16
2.2 Data Collection and Analysis . . . . . . . . . . . . . . . . . . . . . . . 16
2.2.1 Collection and Anonymization . . . . . . . . . . . . . . . . . . 17
2.2.2 Multi-time-scale IP Flow Analysis . . . . . . . . . . . . . . . . 18
2.2.3 Managing Outages . . . . . . . . . . . . . . . . . . . . . . . . 22
2.2.4 Understanding the Methodology . . . . . . . . . . . . . . . . . 22
2.3 Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 24
2.3.1 Characteristics of Long Flows . . . . . . . . . . . . . . . . . . 24
2.3.2 Causes of Long-lived Flows . . . . . . . . . . . . . . . . . . . 28
2.4 Conclusions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 31
Chapter 3: Detecting Internet Outages with Precise Active Probing 33
3.1 Motivation for Detecting Internet Outages . . . . . . . . . . . . . . . . 33
3.1.1 Contributions . . . . . . . . . . . . . . . . . . . . . . . . . . . 34
3.1.2 Relation to Thesis . . . . . . . . . . . . . . . . . . . . . . . . 36
3.2 Methodology . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 37
3.2.1 Active Probing of Address Blocks . . . . . . . . . . . . . . . . 37
3.2.2 Probes to Outages . . . . . . . . . . . . . . . . . . . . . . . . 39
3.2.3 Visualizing Outages . . . . . . . . . . . . . . . . . . . . . . . 42
3.2.4 Outages to Correlated Events . . . . . . . . . . . . . . . . . . . 43
3.2.5 Parameter Discussion . . . . . . . . . . . . . . . . . . . . . . . 45
3.3 Building an Operational System . . . . . . . . . . . . . . . . . . . . . 46
3.3.1 Bounding Probing Traffic . . . . . . . . . . . . . . . . . . . . . 46
3.3.2 Sampling Addresses in Blocks . . . . . . . . . . . . . . . . . . 47
3.3.3 Reducing the Number of Target Blocks . . . . . . . . . . . . . 48
3.3.4 Our Prototype System . . . . . . . . . . . . . . . . . . . . . . 48
3.4 Validating Our Approach . . . . . . . . . . . . . . . . . . . . . . . . . 51
3.4.1 Validating Data Sources and Methodology . . . . . . . . . . . . 51
3.4.2 Network Event Case Studies . . . . . . . . . . . . . . . . . . . 54
3.4.3 Validation of Randomly Selected Events . . . . . . . . . . . . . 58
3.4.4 Validation of Controlled Outages . . . . . . . . . . . . . . . . . 59
3.4.5 Stability over Locations, Dates and Blocks . . . . . . . . . . . 61
3.4.6 Comparing Accuracy with Other Approaches . . . . . . . . . . 65
3.5 Evaluating Internet Outages . . . . . . . . . . . . . . . . . . . . . . . . 68
3.5.1 Evaluation over the Analyzable Internet . . . . . . . . . . . . . 68
3.5.2 Durations and Sizes of Internet Outages and Events . . . . . . . 69
3.5.3 Internet-wide View of Outages . . . . . . . . . . . . . . . . . . 72
3.6 Conclusions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 74
Chapter 4: Visualizing Sparse Internet Events: Internet Outages and Route
Changes 76
4.1 Motivation for Visualizing Sparse Internet Events . . . . . . . . . . . . 76
4.1.1 Contributions . . . . . . . . . . . . . . . . . . . . . . . . . . . 77
4.1.2 Relation to Thesis . . . . . . . . . . . . . . . . . . . . . . . . 78
4.2 Visualizing Correlated Events . . . . . . . . . . . . . . . . . . . . . . 78
4.2.1 Clustering Visualization of Network Data . . . . . . . . . . . . 79
4.2.2 Choice of the Closeness Threshold . . . . . . . . . . . . . . . . 82
4.2.3 Marginal Distributions . . . . . . . . . . . . . . . . . . . . . . 82
4.2.4 Handling Large Images . . . . . . . . . . . . . . . . . . . . . . 84
4.3 Visualizing Network Outages . . . . . . . . . . . . . . . . . . . . . . . 84
4.3.1 Data Sources: Detecting Outages . . . . . . . . . . . . . . . . 84
4.3.2 Learning from Outage Visualization . . . . . . . . . . . . . . . 85
4.4 Visualizing Routing Changes . . . . . . . . . . . . . . . . . . . . . . . 90
4.4.1 Data Sources: Detecting BGP Path Changes . . . . . . . . . . . 90
4.4.2 Learning from Visualizing Route Changes . . . . . . . . . . . . 90
4.5 Conclusions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 93
Chapter 5: Trinocular: Understanding Internet Reliability through Adaptive
Probing 94
5.1 Motivation for Trinocular . . . . . . . . . . . . . . . . . . . . . . . . . 94
5.1.1 Contributions . . . . . . . . . . . . . . . . . . . . . . . . . . . 95
5.1.2 Relation to Thesis . . . . . . . . . . . . . . . . . . . . . . . . 96
5.2 Problem Statement . . . . . . . . . . . . . . . . . . . . . . . . . . . . 97
5.3 Principled Low-rate Probing . . . . . . . . . . . . . . . . . . . . . . . 99
5.3.1 An Outage-Centric Model of the Internet . . . . . . . . . . . . 99
5.3.2 Changing State: Learning From Probes . . . . . . . . . . . . . 100
5.3.3 Gathering Information: When to Probe . . . . . . . . . . . . . 102
5.3.4 Parameterizing the Model: Long-term Observation . . . . . . . 104
5.3.5 Outage Scope From Multiple Locations . . . . . . . . . . . . . 106
5.3.6 Operational Issues . . . . . . . . . . . . . . . . . . . . . . . . 107
5.4 Validating Our Approach . . . . . . . . . . . . . . . . . . . . . . . . . 108
5.4.1 Correctness of Outage Detection . . . . . . . . . . . . . . . . . 108
5.4.2 Precision of event timing . . . . . . . . . . . . . . . . . . . . . 109
5.4.3 Probing rate . . . . . . . . . . . . . . . . . . . . . . . . . . . . 111
5.5 Effects of Design Choices . . . . . . . . . . . . . . . . . . . . . . . . . 114
5.5.1 How Many Addresses to Probe . . . . . . . . . . . . . . . . . . 114
5.5.2 What Granularity of Blocks . . . . . . . . . . . . . . . . . . . 117
5.6 Studying the Internet . . . . . . . . . . . . . . . . . . . . . . . . . . . 120
5.6.1 Days in the Life of the Internet . . . . . . . . . . . . . . . . . . 121
5.6.2 Re-analyzing Internet Survey Data . . . . . . . . . . . . . . . . 122
5.6.3 Case Studies of Internet Outages . . . . . . . . . . . . . . . . . 123
5.6.4 Longitudinal Re-analysis of Existing Data . . . . . . . . . . . . 127
5.7 Conclusions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 129
Chapter 6: Evaluating Policy Effects on Internet Usage 131
6.1 Motivation for Evaluating Policy Effects . . . . . . . . . . . . . . . . . 131
6.1.1 Contributions . . . . . . . . . . . . . . . . . . . . . . . . . . . 132
6.1.2 Relation to Thesis . . . . . . . . . . . . . . . . . . . . . . . . 133
6.2 Methodology . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 133
6.2.1 Dynamic Tracking of Block Availability . . . . . . . . . . . . . 133
6.2.2 Diurnal Detection Algorithm . . . . . . . . . . . . . . . . . . . 139
6.2.3 Other Network Factors: Geolocation, Organizations, and Link
Types . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 142
6.2.4 Factorial Analysis with ANOVA . . . . . . . . . . . . . . . . . 144
6.3 Validation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 145
6.3.1 Validating Block Availability Tracking . . . . . . . . . . . . . 145
6.3.2 Validating Diurnal Blocks . . . . . . . . . . . . . . . . . . . . 148
6.4 Directly Observed Results . . . . . . . . . . . . . . . . . . . . . . . . 159
6.4.1 Diurnal Blocks . . . . . . . . . . . . . . . . . . . . . . . . . . 159
6.4.2 A Month of Internet Outages . . . . . . . . . . . . . . . . . . . 162
6.5 Indirectly Observed Results . . . . . . . . . . . . . . . . . . . . . . . . 166
6.5.1 Location of diurnal blocks . . . . . . . . . . . . . . . . . . . . 166
6.5.2 Correlating Diurnal Blocks with Internet Entry Time . . . . . . 169
6.5.3 Effects of Economic Conditions . . . . . . . . . . . . . . . . . 171
6.5.4 Effects of Access-Link Technology . . . . . . . . . . . . . . . 177
6.5.5 Effects of Organizations on Reliability . . . . . . . . . . . . . . 180
6.6 Conclusions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 184
Chapter 7: Related Work 186
7.1 Understanding Internet Flows . . . . . . . . . . . . . . . . . . . . . . . 186
7.2 Internet Visualization . . . . . . . . . . . . . . . . . . . . . . . . . . . 188
7.3 Internet Outage Detection . . . . . . . . . . . . . . . . . . . . . . . . . 188
7.4 Adaptive Sampling . . . . . . . . . . . . . . . . . . . . . . . . . . . . 192
7.5 ISP Comparison . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 192
Chapter 8: Future Work and Conclusions 193
8.1 Future Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 193
8.2 Conclusions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 194
Bibliography 196
List of Tables
3.1 Subsetting blocks that are probed and analyzable for Survey S30w . . . . . 43
3.2 Internet surveys used, with dates and durations. . . . . . . . . . . . . . 53
3.3 Validation of event detection algorithm. . . . . . . . . . . . . . . . . . 58
3.4 Outage percentage statistics of four quarters, from S29w to S40w . . . . . . 63
3.5 Comparing accuracy of different outage estimation strategies. . . . . . . 67
4.1 Internet surveys with dates and durations. . . . . . . . . . . . . . . . . 80
5.1 Bayesian inference from current block state U and a new probe. . . . . 100
5.2 Comparing precision and recall of different probing targets. Dataset: S50j. . 116
5.3 Comparing coverage by granularity. Dataset: A20addr. . . . . . . . . . . 118
5.4 Outages observed at three sites over two days. Dataset: A7. . . . . . . . 121
6.1 Vary maximum phase, showing the effect of phase on diurnal block
detection. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 157
6.2 Validation of diurnal blocks in Survey S51w. . . . . . . . . . . . . . . . 159
6.3 Fraction of diurnal blocks, top 20 countries and United States. . . . . . 169
6.4 Fraction of diurnal blocks grouped by regions. . . . . . . . . . . . . . . 170
6.5 ANOVA analysis of correlations between diurnal and individual factors. 173
6.6 Mean outage fraction grouped by countries and regions. . . . . . . . . . 175
6.7 ANOVA analysis of correlations between outages and individual factors. 176
6.8 Mean outage fraction of 9 access keywords, observed from all three
probers. Dataset: A12all. . . . . . . . . . . . . . . . . . . . . . . . . . . 179
6.9 Mean outage fraction of top Internet service providers (ISPs) in the
United States, plus Google, Yahoo, Facebook and LinkedIn. . . . . . . 184
List of Figures
2.1 The structure of multi-level flow records. . . . . . . . . . . . . . . . . . 19
2.2 Durations of flows observed at 8 different timescale levels. . . . . . . . 23
2.3 Number of flows at different timescales. . . . . . . . . . . . . . . . . . 23
2.4 Density plot of flow duration vs. size, for all and sampled flows. . . . . 26
2.5 Density plot for flow duration vs. rate. . . . . . . . . . . . . . . . . . . 26
2.6 Density plot of flow duration vs. size. . . . . . . . . . . . . . . . . . . 28
2.7 Cumulative distribution of flow sizes (in bytes). . . . . . . . . . . . . . 28
2.8 Source and destination port usage as a function of timescale. . . . . . . 29
2.9 Density plot of flow duration vs. burstiness. . . . . . . . . . . . . . . . 29
2.10 Protocol usage as a function of timescale. . . . . . . . . . . . . . . . . 31
3.1 Probe responses and per round block coverage and outage thresholds,
for one /24 block. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 40
3.2 Performance of one prober instance as number of targets grows: 1-core
CPU (left scale) and bandwidth (right). . . . . . . . . . . . . . . . . . . 51
3.3 Evaluating detection of controlled outages. . . . . . . . . . . . . . . . . 59
3.4 Comparing results of emulated outages for 5 CSU blocks (top) and 10
random blocks (bottom). . . . . . . . . . . . . . . . . . . . . . . . . . 60
3.5 Evaluation of 35 2-week surveys and analyzable Internet run. . . . . . . 62
3.6 Downtime percentage over time, for 4 different quarters of our dataset,
from S29w to S40w. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 64
3.7 Precision, recall and accuracy as a function of samples per target block. 66
3.8 Selected slices of outages in the analyzable Internet study. . . . . . . . 70
3.9 Cumulative distributions of outage and event durations, and marginal
distributions by round and block. . . . . . . . . . . . . . . . . . . . . . 71
3.10 Network event sizes. . . . . . . . . . . . . . . . . . . . . . . . . . . . 73
4.1 The 500 largest outages in S38c. . . . . . . . . . . . . . . . . . . . . . . 80
4.2 The 900 largest outages in S30w. . . . . . . . . . . . . . . . . . . . . . . 83
4.3 The 500 largest outages in S39c. . . . . . . . . . . . . . . . . . . . . . . 87
4.4 The 500 largest outages in S39w. . . . . . . . . . . . . . . . . . . . . . . 89
4.5 Sample cluster showing correlated BGP changes for China prefixes. . . 92
5.1 Median number of probes needed to reach a definitive belief after block
state change. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 103
5.2 Fraction of detected outages and duration in rounds, for controlled exper-
iments. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 109
5.3 Observed outage duration vs. ground truth. . . . . . . . . . . . . . . . . 110
5.4 Distribution of probes to each target block. . . . . . . . . . . . . . . . . 112
5.5 Median number of probes for aggregate and state transitions. . . . . . . 113
5.6 Distribution of availability values for different approaches to selecting
targets. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 115
5.7 Six days of the 600 largest outages in the March 2011 Tōhoku earthquake. . 124
5.8 Six days of the 300 largest outages in U.S. networks in Hurricane Sandy. 125
5.9 Median number of outages per day, broken down to states. . . . . . . . 126
5.10 Evaluation of single-site outages of surveys over three years. . . . . . . 128
6.1 Checking stationarity of all blocks in Survey S51w. . . . . . . . . . . . . 141
6.2 Block 1.9.21/24 (0x010915/24), |E(b)| = 42, A = 0.735. . . . . . . . . . 146
6.3 Block 23.46.151/24 (0x172e97/24), |E(b)| = 249, A = 0.991. . . . . . . 147
6.4 Block 93.208.233/24 (0x5dd0e9/24), |E(b)| = 245, A = 0.191. . . . . . 148
6.5 Block 81.80.129/24 (0x515081/24), |E(b)| = 43, A = 0.964. . . . . . . . 149
6.6 Block 27.186.9/24 (0x1bba09/24), |E(b)| = 256, A = 0.598. . . . . . . . 150
6.7 Block 2.134.216/24 (0x0286d8/24), |E(b)| = 256, A = 0.408. . . . . . . 151
6.8 Correlation graph showing actual availability A and estimated availability Âs. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 152
6.9 Correlation graph showing actual availability A and operational availability Âo. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 153
6.10 (Old algorithm). Correlation graph showing actual availability A and
estimated availability Âs. . . . . . . . . . . . . . . . . . . . . . . . . . 154
6.11 (Old algorithm). Correlation graph showing actual availability A and
operational availability Âo. . . . . . . . . . . . . . . . . . . . . . . . . 155
6.12 Accuracy of diurnal block detection, varying number of diurnal behavior
addresses (nd). . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 156
6.13 Accuracy of diurnal block detection, varying maximum phase. . . . . . 156
6.14 Accuracy of diurnal block detection, varying standard deviation of uptime
duration (σd). . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 158
6.15 FFT components of block 27.186.9/24 (0x1bba09/24), in 14-day survey
S51w. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 160
6.16 FFT components of block 27.186.9/24 (0x1bba09/24), in 35-day A12w. . 160
6.17 FFT components and auto-correlation of block 1.9.21/24 (0x010915/24),
in 14-day survey S51w. . . . . . . . . . . . . . . . . . . . . . . . . . . . 161
6.18 Cumulative distribution of the highest frequency in 35-day A12w. . . . . 162
6.19 Overall diurnal rate for Internet surveys over time. . . . . . . . . . . . . 163
6.20 Overall outage fraction in A12all. Data is the intersection of outages from
three vantage points. . . . . . . . . . . . . . . . . . . . . . . . . . . . 164
6.21 May 2013 Syria outages observed from all three sites, showing two com-
plete shutdowns. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 165
6.22 Number of total blocks in A12w. Gray-scale shows 0 to 10k. . . . . . . . 167
6.23 Number of diurnal blocks in A12w. Gray-scale shows 0 to 1k. . . . . . . 168
6.24 Fraction of diurnal blocks in A12w. Gray-scale shows 0% to 100%. . . . 168
6.25 Percentage of diurnal blocks as a function of allocation date. . . . . . . 170
6.26 Scatter plot of diurnalness and per-capita GDP for all countries. . . . . 172
6.27 Scatter plot of outage fraction and per-capita GDP for all countries. . . . 174
6.28 Bar chart of mean outage fraction for 23 countries. . . . . . . . . . . . 176
6.29 Mean outage fraction and diurnal fraction for 9 access technologies. . . 178
6.30 Cumulative distribution of fraction of outages for each block, by access
keyword. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 178
6.31 Availability As and outage periods of 4 DHCP blocks. . . . . . . . . . 181
6.32 Mean outage fraction and diurnal fraction for top United States ISPs. . . 183
6.33 Cumulative distribution of block outage fraction for top US ISPs and
organizations. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 183
Abstract
The Internet is important for nearly all aspects of our society, affecting ordinary people,
businesses, and social activities. Because of its importance and wide-spread applications,
we want good knowledge of the Internet's operation, reliability, and performance,
through various kinds of measurements. However, despite this wide usage, we
have only limited knowledge of its overall performance and reliability. The first reason
for this limited knowledge is that there is no central governance of the Internet, making
both active and passive measurements hard. The second reason is the huge scale of the
Internet, which makes brute-force analysis hard because of practical computing resource
limits such as CPU, memory, and probe rate.
This thesis states that sampling and aggregation are necessary to overcome
resource constraints in time and space to gain better knowledge of the Internet.
Many other Internet measurement studies also use sampling and aggregation
techniques to discover properties of the Internet. We distinguish our work by exploring
novel mechanisms and new knowledge in several specific areas. First, we aggregate
short-time-scale observations and use an efficient multi-time-scale query scheme to discover
the properties of and reasons for long-lived Internet flows. Second, we sample and
probe /24 blocks in the IPv4 address space, and use greedy clustering algorithms to efficiently
characterize Internet outages. Third, we show an efficient and effective aggregation
technique based on visualization and clustering. This technique makes both manual
inspection and automated characterization easier. Last, we develop an adaptive probing
system to study Internet reliability at global scale. It samples and adapts the probe rate within
each /24 block to maintain accurate beliefs. By aggregation and correlation with other domains,
we are also able to study broader policy effects on Internet use, such as political causes,
economic conditions, and access technologies.
This thesis provides several examples of Internet knowledge discovery with new
mechanisms for sampling and aggregation. We believe our new sampling and aggregation
mechanisms can be used by, and will inspire new approaches in, future Internet
measurement systems to overcome resource constraints such as large amounts of
dispersed data.
Chapter 1
Introduction
1.1 Overview
The Internet has become increasingly important for nearly all aspects of our society.
Ordinary people use it on a day-to-day basis for information (such as looking
up flight schedules), entertainment (such as video websites), or social needs (online
social networks); businesses use the Internet as a platform for sales (online shopping),
internal administration, or advertising. As a result, many cloud-based services
have emerged [Ade09]. In recent years, even politicians have started to put
serious energy into the Internet, as we see a clear trend of many political activities on
the Web. For example, US President Barack Obama used Google hangouts in early
2012 [Pos12]. Because of its importance and wide-spread applications, we want to have
as much knowledge about the Internet as possible, through various kinds of measurements.
However, despite the wide usage of the Internet today, we have only limited knowledge
of its overall state. The first reason for this limited knowledge is that there is no central
governance of the Internet. As a fundamental goal, the Internet was designed to
connect many networks, each operated and managed separately [Cla88]. These individual
networks are physically distributed over the globe and are managed by different
organizations under the jurisdiction of different governments. This decentralized management
makes both active and passive measurements hard. Active measurement can
suffer from different policies or firewalling [GS09, BMRU09]; passive measurement,
such as BGP analysis, can suffer from delayed convergence [LMJ97, LABJ00] or
oscillations [BOR+02]. Bartlett et al. find that neither active nor passive measurements can
provide a complete view [BHP07]. The second reason for our limited knowledge about
the Internet is its huge scale, as it grows to encompass much of the world's population [All05].
For example, the Internet generates a huge amount of traffic each day: as
of July 2009, the Internet's inter-domain traffic was estimated at 39.8 Tbps [LIJM+10].
Besides traffic, the large number of networks and end hosts [Rob00] also makes analysis
hard. As of Nov. 2013, there were more than 2.65 billion addresses announced in
the Internet [Wol13]. Such a huge scale makes brute-force analysis of the Internet hard.
This thesis states that sampling and aggregation provide efficient ways to gain
new knowledge about the Internet. In social sciences, sampling and aggregation serve as
useful means to study social behaviors. As an example of sampling, polls are conducted
to study people's voting behavior and preferences: by asking well-designed questions
of only a small sample of voters, campaign managers can draw useful conclusions about
all people, and guide their campaigns. As an example of aggregation,
a census of all people lets us aggregate data from the individual level to the district,
state, or ethnic-group level. By aggregating such data, we get a better picture of
the income of districts or ethnic groups, and can make government actions better reflect the
needs and choices of the population.
Similarly, sampling and aggregation are used in measurements of the Internet. For
example, researchers use sampling techniques to understand packet-level [Mkc00],
flow-level [Bkc02, EKMV04, DLT03, ZBPS02] and aggregate-level [TMW97] traf-
fic characteristics. To understand routing behavior of the Internet, researchers con-
duct measurements over a sample of computers (a mesh) and report influential results
reflecting the whole Internet, such as the prevalence of asymmetric paths [Pax96].
To understand the Internet address space usage, researchers conduct both sampling
and aggregation [HPG+08, CH10a], finding significant results with simple measurement
tools. To understand network reliability, many researchers use sampling and
aggregation techniques to study networks in the BGP table and derive reliability
metrics [KBMJ+08a, MIP+06a, FABK03]. On top of these studies, we provide new
approaches to sampling and aggregation, and discover new knowledge about the Internet.
In general, sampling and aggregation are useful because they help solve problems
efficiently while still providing highly accurate conclusions. The research goal of this
thesis is to develop new mechanisms of sampling and aggregation to overcome resource
constraints, then use these approaches to learn important new knowledge about the Internet.
Our work on the properties and causes of long-lived Internet flows and on the characterization
of Internet outages fosters new understanding of the Internet. With our findings,
researchers can gain new insights into the Internet and compare their results with ours;
operators can inspect the traffic behavior and reliability of their
networks; and end users can compare service providers.
1.2 Problem Space
In this thesis, we explore how different measurement mechanisms help to efficiently
discover new knowledge of the Internet. To understand how we explore this space,
we discuss three challenges in measurement studies: sampling and aggregation tech-
niques (Section 1.2.1), resource constraints (Section 1.2.2), and knowledge discovery
(Section 1.2.3).
1.2.1 Sampling and Aggregation Techniques
The first challenge we explore is sampling and aggregation techniques. In order to
efficiently understand Internet properties while maintaining good accuracy, researchers
use various kinds of sampling and aggregation techniques in their measurement mechanisms.
In Internet measurement studies, sampling is the process of selecting a subset of
data to estimate characteristics of a bigger phenomenon or population. Sampling usu-
ally subsets data in the dimension of time or space or both. For example, Internet address
surveys study a sample of the IPv4 address space [HPG+08]: they sample block status
every 11 minutes (time); and they use a mixed sample of both stable and random blocks
in the Internet IPv4 address space (space). Similarly, we use space sampling of Inter-
net address blocks in our outage detection work (Chapter 3), and both time and space
sampling in our adaptive sampling work for confidence of conclusions (Chapter 5).
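To make the two dimensions concrete, the survey pattern described above can be sketched in a few lines of Python. This is a minimal sketch, not the actual survey tooling: the block identifiers, the `probe` callback, and the number of random blocks are hypothetical placeholders; only the 11-minute round length comes from the text.

```python
import random

ROUND_SECONDS = 11 * 60  # one survey round, matching the 11-minute rounds above

def choose_blocks(stable_blocks, all_blocks, n_random, rng=random):
    """Space sampling: a mixed sample of known-stable and randomly drawn blocks."""
    pool = [b for b in all_blocks if b not in stable_blocks]
    return list(stable_blocks) + rng.sample(pool, n_random)

def survey(blocks, probe, rounds):
    """Time sampling: observe every chosen block once per round."""
    history = {b: [] for b in blocks}
    for r in range(rounds):
        for b in blocks:
            history[b].append(probe(b, r))  # e.g. did any address in b respond?
    return history
```

A real prober would also pace each pass to fill the full round; the timing is elided here so the sampling logic stays visible.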
To maximize accuracy, the sampling process often selects representative targets for
study. For example, Internet hit lists select representatives for all /24 blocks, serving
as a basis for Internet topology studies [FH10]. To study Internet delay, jitter and path
failures, many researchers utilize a mesh-based network of nodes to represent real computer
networks [ABKM01, Pax96, FABK03, KYGS07]. Such a mesh usually
consists of tens or hundreds of nodes, clearly not representative of the entire
Internet. However, the routes within the mesh are representative because they include
a “non-negligible fraction” of ASes of the Internet [Pax96]. We pay attention to this
aspect of sampling in our work. We use mixed samples of stable and random blocks
to be representative in our outage study (Chapter 3). In a further study, we study all
responsive IPv4 blocks, and use two levels of sampling. The first level of sampling is a
per-block model: for each /24 block, we only sample the addresses that ever responded.
The next level of sampling is done per round, where we sample a few (up to 15) of these
responding addresses in every 11-minute round (Chapter 5).
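The two levels of sampling just described can be sketched as follows. The cap of 15 addresses per round comes from the text; the data structures and function names are illustrative only, not the system's actual interfaces.

```python
import random

MAX_PER_ROUND = 15  # at most 15 responsive addresses probed per 11-minute round

def block_candidates(response_history):
    """Level 1, per block: keep only the addresses that ever responded."""
    return [addr for addr, ever_responded in response_history.items() if ever_responded]

def round_targets(candidates, rng=random):
    """Level 2, per round: sample a few of those responsive candidates."""
    k = min(MAX_PER_ROUND, len(candidates))
    return rng.sample(candidates, k)
```

Restricting each round to a small sample of ever-responsive addresses is what keeps the per-block probing cost low while still covering the whole block over time.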
Aggregation is necessary in Internet knowledge discovery because it gathers the
otherwise scattered or incomplete information to form a bigger picture for new knowl-
edge. It can be done with brute force, usually for complete understanding, such as our
work in outage detection showing the overall rate and marginal distribution of outages
across the entire Internet edge (Chapter 3). Aggregation can also be done in a more elegant
manner in cases when only the essential information needs to be kept. For example, in
our study of the properties of long-lived flows, we keep only flows within a certain duration
range at each aggregation level and discard many shorter flows for efficiency. As
a result, instead of keeping 4967 M flows in a two-week dataset, we only need to keep
and study 846 k flows, without loss of accuracy; in other words, our
aggregation mechanism reduces the amount of input data by a factor of 6 k (Chapter 2).
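A minimal sketch of this style of aggregation follows; the duration thresholds and flow records here are hypothetical, not the exact scheme of Chapter 2, but they show how each level keeps only flows in its duration range so that short flows are discarded early.

```python
# Illustrative duration thresholds (seconds) separating aggregation levels.
# A flow is kept at the coarsest level whose threshold it meets; flows
# shorter than the finest threshold are discarded entirely.
LEVELS = [60, 600, 3600, 86400]

def assign_level(duration_s):
    """Index of the coarsest qualifying level, or None for short flows."""
    level = None
    for i, threshold in enumerate(LEVELS):
        if duration_s >= threshold:
            level = i
    return level

def aggregate(flows):
    """Bucket (flow_id, duration) pairs by level, dropping short flows."""
    kept = {i: [] for i in range(len(LEVELS))}
    for flow_id, duration_s in flows:
        level = assign_level(duration_s)
        if level is not None:
            kept[level].append(flow_id)
    return kept
```

Because most flows are short, dropping everything below the finest threshold is what shrinks billions of raw flow records down to the far smaller set of long-lived flows worth keeping.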
Other researchers also use aggregation techniques for knowledge discovery. For exam-
ple, studies of Internet hitlists [FH10] aggregate in both time (data over four years) and
space (responses from all addresses) to find representative IPs (hitlist) for /24 address
blocks.
Learning from these experiences, from both our work and that of other researchers,
future measurement studies should carefully choose which dimension(s) in time and
space to sample in or aggregate on, and should keep in mind to study only the essential
properties, for efficiency.
1.2.2 Resource Constraints
Another challenge that affects measurement studies is resource limitations. We often
encounter resource constraints in measurement studies, usually because of the need to
analyze large amounts of data in certain problems, because of politeness, or because of
practical limits of measurement mechanisms.
First we discuss CPU and memory constraints due to the need to process large
amounts of data. In our network outage study, in order to probe all 3.6 M responsive /24
blocks, we are primarily CPU bound (see Chapters 3 and 5). In operation, we therefore have to
parallelize our probers and use 4-way probing, each on a separate core, to handle the
large workload. CPU constraints also limit our ability to interpret data, where we need
to cluster and visualize all blocks to find correlations and causes. Unfortunately, clustering
cannot easily be parallelized because of potential data dependencies between any two data
points (in our case, blocks). More generally, any algorithm at large enough scale that
must run on a single core can be constrained by CPU capacity. Memory
can also be a constraint. In our study of long-lived Internet flows, because of the large
amount of traffic in our campus network and an internal flow table we need to maintain,
we could easily run out of memory (with a 60 s flow timeout parameter). We therefore employ
an efficient multi-time-scale scheme and a high-performance cluster to process and
analyze flow data over 2 weeks (Chapter 2).
Another resource constraint is the probe rate, limited for politeness toward target networks
and by bandwidth at the prober site. Politeness constraints arise when studies affect
people's regular use or operation of their networks. In active measurement
studies, researchers often send probes to a sample of networks and collect data to
gain insight into the characteristics of those networks. We need to be polite in such studies,
probing at the lowest rate that is sufficient to draw conclusions without
disturbing people. In our study of Internet outages with Trinocular (Chapter 5), we actively
probe 3.5 M /24 blocks, and thus sometimes draw complaints from operators. To
be polite, we adaptively send probes guided by Bayesian inference and
probe "just enough". We also compare our probe rate with IPv4 background radiation
[WKB+10] and find that our probes are less than 1% of the background noise. A
secondary issue with probe rate is bitrate limitations out of vantage points. With millions
of targets to probe, network capacity may become the bottleneck at the vantage
point. Even if the network has enough capacity, we need to consider possible complaints
from the first-hop ISP and communicate with them before doing large-scale
measurements.
The next category of resource constraints is the practical resource limits of measurement
tools, which usually affect the completeness of a study: for example, lack of data or too
few vantage points. We faced such constraints in the long-flow analysis, where
we had data only from our own campus network (Chapter 2). To be more representative,
one would want additional data from other campuses or commercial networks, but
practical limits make this not always possible. Another example is
that we have only three vantage points on two continents in our outage study: ISI, Colorado
State University, and Keio University in Japan (Chapters 3 and 5). Getting more vantage
points is hard because our probers are quite demanding in network capacity and sometimes
draw complaints. Other researchers meet similar resource constraints. For
example, in order to get a complete view of the Internet, the Netalyzr team developed Java
applets that run in popular browsers [KWNP10], encouraging users to run the tool
and report network statistics (end users serve as vantage points). This approach is clever
and makes a distinct contribution: even with the largest and most successful
Internet testbed, PlanetLab [PR06], they would have had only a limited view of their research
targets: HTTP caches, DNS manipulations, NAT behavior, etc.
In this thesis, we develop several new measurement mechanisms to overcome various
resource constraints. A broader contribution we expect is that future measurement
studies can learn from our approaches to managing resource constraints and
more clearly analyze trade-offs between their research goals and resource limits.
For example, we expect existing and future active measurement systems (such as
Thunderping [SS11a]) to learn from our adaptive sampling method (guided by Bayesian
inference) to save traffic on both the target network and at the vantage points.
1.2.3 Knowledge Discovery
The last challenge of the problem space we explore is the goal of measurement studies:
new knowledge about the Internet. The Internet measurement community is
large and active in many areas, from traditional path and flow properties [ZZP+04,
ZD01, FABK03, EKMV04], to data center networks [CFH+13], to Internet economics
[VLF+11], and many other areas.
By no means do our studies fully cover the space of all possible measurement studies.
Rather, we look at several specific areas, contributing new and efficient
mechanisms of measurement or analysis. We first propose an efficient multi-time-scale
analysis approach to overcome memory constraints and study long-lived Internet
flows (Chapter 2). We next use simple but effective clustering algorithms to aggregate
/24 blocks into larger correlations, to study properties and causes of network outages
(Chapter 3). We then propose an efficient and effective aggregation technique based on
visualization and clustering (Chapter 4). Our last contribution is to use novel adaptive
sampling techniques and statistical correlations to reveal policy effects on how people
use their networks (Chapters 5 and 6).
Besides our work, other researchers also propose efficient measurement mechanisms
to gain new knowledge of the Internet. Work on Internet edge network classification
examines response rate, jitter, and RTT information, and clusters them to find different
usage patterns of network blocks [CH10a]. Work on IPv4 background radiation uses
clever techniques to set up an unused /8 block [WKB+10], and Czyz et al. do a similar
study on the IPv6 network [CLM+13].
We expect this thesis to be helpful to future measurement studies as they choose
research topics or measurement strategies. We hope our approaches to measuring
different Internet phenomena will be useful to other researchers, inspiring analyses of
different problems. Future studies can also re-run our studies with
different settings (new environments or new vantage points) to verify our results or
discover new knowledge.
1.3 Thesis Statement
This thesis states that sampling and aggregation are necessary to overcome resource
constraints in time and space and to learn new knowledge about the Internet.
Many other Internet measurement studies also use sampling and aggregation techniques
to discover properties of the Internet. We distinguish our work by exploring
novel mechanisms and new knowledge in several specific areas. First, we demonstrate
the effectiveness of aggregating short-time-scale observations and an efficient multi-time-scale
query scheme to discover the properties and causes of long-lived Internet
flows. Second, we demonstrate the effectiveness of sampling small sub-IPv4-space (/24
block) observations and aggregating them with greedy clustering algorithms to characterize
properties of outages across the whole Internet edge. Third, we show that aggregation
by visualization and clustering provides an efficient and effective method for analyzing
sparse Internet events. Last, we develop an adaptive probing system to study global-scale
Internet reliability. It samples and adapts the probe rate within each /24 block to maintain
accurate beliefs. By aggregation and correlation with other domains, we are also able to study
broader policy effects on Internet use, such as political causes, economic conditions,
and access technologies. This thesis provides several examples of Internet knowledge
discovery with new sampling and aggregation mechanisms. We believe
our contributions of new sampling and aggregation mechanisms can be used by future
Internet measurement systems to overcome resource constraints such as large amounts
of dispersed data.
1.4 Supporting the Thesis Statement
In this section, we substantiate the thesis statement through four specific studies, each
gaining new insights and knowledge about the Internet with particular sampling and
aggregation techniques. We also argue that our claims apply to future large-scale
Internet measurement systems.
In our first work [QH10b] (Chapter 2), we show that aggregation in time, varying
from smaller durations to larger durations, can provide a powerful means to study the
properties of long-lived Internet IP flows. Our first study supports the thesis statement
as follows. Our goal is to understand the properties and causes of long-lived Internet
flows. However, such long-lived flows usually carry smaller numbers of bytes,
and are buried in large numbers of short-duration flows. We therefore develop a
new mechanism of multi-time-scale flow analysis, which allows efficient evaluation of
network traffic at time scales from minutes to weeks. This mechanism supports efficient
queries and is useful for understanding the properties and causes of long-lived flows.
In our second work [QHP12a, QHP12c, QHP13b] (Chapter 3), we show that aggregation
in the Internet IPv4 address space, from smaller blocks to larger blocks, can
provide a clear view of the fundamental service of the Internet: any-to-any reachability.
This study also supports the thesis statement as a specific example. Our goal is to
characterize outages across the whole Internet edge. For this goal, we develop an approach that
first samples a wide range of blocks (home users, server farms, universities, and some
businesses), then analyzes outages in all sampled blocks and aggregates them with greedy
clustering algorithms. With this approach, we enable the correlation of otherwise scattered
outage information, providing an efficient way to study Internet reachability
as a whole.
In our third work [QHP13b], we show that visualization and clustering provide an
efficient and effective way of aggregation and analysis. Our visualization techniques
help analyze sparse Internet events, enabling both manual inspection and automated
findings. We show that we can more easily correlate many small events with large Internet
events manually, which overcomes the constraint of analyzing a very large number of small
events and enables further studies. We can also automatically find events by checking
variations in the marginal distributions.
Our last work also supports the thesis statement as an example [QHP13c] (Chapters 5
and 6). Our research goal is to characterize reliability across the whole Internet, all the
time, for all geographical regions and ISPs. To achieve this goal, we need to cover many
edge networks while decreasing the probing rate to avoid firewalling or rate-limiting. We thus
use an adaptive sampling scheme that achieves high accuracy with minimal probes. In
this scheme, to decrease the probing rate, we probe only a minimal number of addresses
in each /24 block when we see positive responses. In the case of non-response, we
adaptively send extra probes up to an upper bound determined by Bayesian inference,
which ensures accuracy. With this approach, we can efficiently track all geographical
regions and ISPs and report in near real time.
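To make the adaptive scheme concrete, the core Bayesian update can be sketched as follows. This is an illustrative sketch, not Trinocular's actual model: the prior, the per-probe response probabilities, the decision threshold, and the probe cap are all assumed values.

```python
def update_belief(belief_up, responded, p_resp_if_up=0.6, p_resp_if_down=0.01):
    """One Bayesian update of the belief that a /24 block is up, given a
    single probe outcome.  Probabilities here are illustrative only."""
    if responded:
        num = p_resp_if_up * belief_up
        den = num + p_resp_if_down * (1.0 - belief_up)
    else:
        num = (1.0 - p_resp_if_up) * belief_up
        den = num + (1.0 - p_resp_if_down) * (1.0 - belief_up)
    return num / den

def probes_needed(belief_up, threshold=0.1, max_probes=15):
    """How many consecutive non-responses it takes before the belief that
    the block is up drops below the threshold (the adaptive upper bound)."""
    n = 0
    while belief_up > threshold and n < max_probes:
        belief_up = update_belief(belief_up, responded=False)
        n += 1
    return n
```

A single response sharply raises the belief that the block is up, so probing can stop early; only repeated non-responses drive the belief down far enough to conclude an outage, which is what bounds the extra probes sent.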
The above studies show the effectiveness of sampling and aggregation in four
particular measurement systems. These studies serve as encouraging examples supporting
our thesis statement. They suggest that future large-scale Internet measurement
systems, especially those processing large amounts of dispersed data, can also benefit
from sampling and aggregation. We believe researchers and engineers often need to
consider accuracy-efficiency trade-offs in their systems, usually for two reasons: first,
the volumes of data to analyze are large; second, current Internet data and applications
trend toward large-scale distribution (such as cloud computing and content distribution
networks). For example, a potential new measurement study of end-user-perceived
cloud storage fetch delays would need to consider sampling representative edge users
and paths to datacenters, and would also need to consider trade-offs between the number of
users (and datacenters) in the study and the accuracy of its metrics.
1.5 Research Contributions
The four specific studies above each support our thesis statement, and suggest that future
Internet measurement systems can also benefit from sampling and aggregation, especially
when processing large amounts of dispersed data. We thus conclude that our first
research contribution is demonstrating our thesis statement. Beyond this contribution, we
also make broader contributions to the research community, summarized as follows.
In our long-lived Internet flows study, we hypothesize that at some point in the
future, computer-to-computer traffic will eclipse human-driven traffic, just as data traffic
has eclipsed voice. We describe new mechanisms for multi-time-scale flow analysis
that allow efficient evaluation of network traffic at time scales from minutes to weeks.
We have observed the presence of long-lived flows, showing that 21% of Internet traffic (by
bytes) is carried by flows longer than 10 minutes, and nearly 2% by flows of 100
minutes or longer. Finally, we evaluate the causes of such traffic, showing that
long-lived IP flows are mostly due to computer-to-computer traffic running in the background
for protocol needs (Chapter 2).
In our Internet outage study, we provide a new method that can systematically
find outages, unreachable blocks of adjacent network addresses, for all of the analyz-
able IPv4 Internet—a method that provides better accuracy and coverage than existing
approaches, particularly for small events. In addition, we describe clustering algorithms
to visualize correlated outages (Chapter 4). Second, we carefully validate our approach
by comparing the onset and duration of outages to root causes, both for widely publicized events
such as the Jan. 2011 Egypt outage, and for randomly sampled small outages. And
finally, we provide a statistical characterization of Internet outages as a whole and for
specific blocks, extending prior evaluation using meshes to cover the entire network
edge (Chapter 3). We show the Internet has “2.5 nines” of availability. We believe our
statistics help establish a baseline of current Internet reliability and are a step to allow
quantitative comparisons between ISPs or regions on Internet reliability.
In our Internet reliability study, we first develop a system that tracks all responsive
Internet edge networks on a 24×7 basis. This system samples and adapts the probe rate
within each /24 block, and uses Bayesian inference to guide probes for accurate
reachability results (Chapter 5). A further contribution is the systematic and quantitative
evaluation of policy effects on Internet usage in different parts of the world. By
aggregation and correlation with other domains, we are able to study broader policy effects, such
as political causes, economic conditions, and access technologies (Chapter 6).
Chapter 2
Characteristics and Reasons of
Long-lived Internet Flows
In this chapter, we describe how we use a multi-time-scale aggregation scheme to study
the properties and causes of long-lived Internet flows, showing that we can efficiently
study Internet traffic with aggregation in the time dimension, from smaller to larger
timescales.
Part of this chapter was published in IMC 2010 [QH10b].
2.1 Motivation for Studying Long-lived Internet Flows
Traffic in the Internet is a complex mix of effects from protocols, routing, traffic
engineering, and user behaviors. Understanding traffic is essential to modeling
and simulation, traffic engineering and planning [BTI+02], router design [AKM04],
and better understanding of the Internet [LTWW94]. There has been a great deal
of study of traffic at the protocol level [PFTK98, Pax97], at timescales of seconds
to hours [ZBPS02, LTWW94, Bkc02, Bro05, cLH06], and over longer terms for
planning [TMW97, Mkc00]. Yet prior work studies either protocol effects at small
timescales (seconds to hours) or aggregate effects at large timescales (hours to weeks),
with little attempt to bridge this division and understand protocol effects on long-lived
traffic.
This chapter explores how users and protocols affect long-lived network traffic.
Unlike prior protocol studies, we explore traffic that lasts for multiple hours to days.
Unlike prior long-term traffic studies, we explore the causes of traffic patterns at the
flow level at multiple timescales, instead of only trends in aggregate traffic. We use the
standard flow definition: the 5-tuple of source and destination IP address and port, plus
the protocol number, ended by a timeout [TMW97].
There are several reasons why an understanding of long-lived flows is increasingly
important. First, understanding long-lived flows is important for network management.
While capacity planning can be done on measures of aggregate traffic, several kinds of
on-line traffic control have been proposed: protocol trunking [KW99], optical trunking
[HDL+98], lambda switching [AR01], and low-buffer operation [AKM04]. Understanding
the feasibility and impact of these approaches requires flow-level traffic
characterization.
Second, a scientific understanding of the Internet must investigate the patterns and
causes of long-lived traffic. What are the first-order statistical properties of long-lived
flows, and how do they differ from short ones? In addition, short-term studies
of network packet data have shown self-similar behavior at timescales of seconds to
hours [LTWW94, CB97], but most such analysis stops as diurnal effects dominate.
Finally, we wish to understand the causes of long-term flows. Protocol effects dominate
sub-second timescales, and human behavior governs diurnal and weekend effects.
Some human-centric traffic is no longer bound by human patience, such as "patient"
peer-to-peer file sharing [GDS+03], and unattended streaming media, perhaps streaming
Internet audio in a store, or automated, Internet-based, Tivo-like devices such as the
Slingbox [Sli10]. Computer-to-computer traffic is growing due to automated control and
sensing, on-line backup, and distributed processing in the cloud and across distributed
data centers. We hypothesize that at some point in the future, computer-to-computer
traffic will eclipse human-driven traffic, just as data traffic has eclipsed voice.
2.1.1 Contributions
The first contribution of this chapter is that we describe new mechanisms for multi-time-scale
flow analysis that allow efficient evaluation of network traffic at timescales from
minutes to weeks (Section 2.2). We have operated this system for more than six months,
taking data from a regional network. Second, we document the presence of long-lived
flows, showing that 21% of Internet traffic (by bytes) is carried by flows longer than 10
minutes, and nearly 2% by flows of 100 minutes or longer (Section 2.3.1).
Finally, in Section 2.3.2 we begin to evaluate the causes of such traffic, exploring how
the protocol mix changes as a function of timescale.
2.1.2 Relation to Thesis
This case study supports our thesis that aggregation in the time dimension, varying from
smaller durations to exponentially longer durations, helps efficiently reveal the properties
and causes of long-lived Internet flows. Because long flows are usually buried in the
vast majority of short-duration flows, keeping track of all flows, such as with a flow
table, is not practical. Our new measurement mechanism overcomes this resource
constraint by keeping only flows within a certain range of durations at each "level", discarding
many short flows for efficiency.
2.2 Data Collection and Analysis
Network packet trace collection is well understood, but sequential processing becomes
challenging as datasets stretch from minutes to months. In this section we review our
approach to long-term collection of network flows and multi-time-scale analysis of that
data.
2.2.1 Collection and Anonymization
We first review our packet collection, processing, and anonymization framework [HBP+05].
Source data is from packet taps at USC's connection to Los Nettos, their upstream
Internet provider and a regional network for the Los Angeles area. Packet capture rates
are 100–300 k packets/second, or 400–1000 Mbit/s, but only packet headers are captured.
We use the LANDER system [HBP+05] to process data at USC's high-performance
computing facility, a cluster of more than 2500 multi-core PCs [USC]. LANDER
anonymizes packet headers, removes all user data, and coordinates our data analysis.
It dynamically schedules tasks on the computing cluster and buffers pending processes,
using up to 10 concurrent compute tasks and queueing thousands of trace segments if
necessary. This accommodates changing traffic bitrates and CPU availability. LANDER
produces fixed-length, 512 MB files in Endace ERF format.
The default LANDER policy anonymizes traffic with keys that rotate at regular intervals.
Such a scheme is useful because it ensures that any accidental information disclosure
in one period does not assist unanonymization in other periods. However, key
rotation impedes analysis of flows longer than the rotation period. LANDER therefore
re-anonymizes all flows with a common, long-term key. We reduce this greater risk
through stricter policy controls: we control access to the long-term data and prohibit
those with access from attempting unanonymization.
Although our work builds on packet-header traces, a potential direction for future work
is to start with NetFlow records as a data source. Another interesting direction is to
compare characteristics of long flows at different places in the Internet.
2.2.2 Multi-time-scale IP Flow Analysis
Given our goal of observing long-duration flows, we have three problems: what flows
are and what to record for each flow; how to manage streaming data and incremental
analysis; and how to support analysis at very different timescales, from seconds to weeks or
more. We consider each of these next.
We use the standard 5-tuple definition of flows: source and destination port and IP
address, plus protocol. We convert LANDER’s packet headers into flow records using
a slightly modified Argus toolkit [Qos10]. Argus flow records provide: the 5-tuple flow
identifier (given above), flow start and finish time, number of packets, and number of
bytes in the flow. Flows begin with the first packet with a unique 5-tuple, and continue
until a timeout (currently set to 60 seconds).
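As an illustration of this timeout-based flow model, the following sketch assembles 5-tuple flow records from a sorted packet stream. It is a simplification of what the modified Argus toolkit does; the record fields and the 60 s timeout mirror the text, while the function and field names are our own.

```python
TIMEOUT = 60  # seconds; a flow ends after a gap longer than this

def assemble_flows(packets):
    """packets: iterable of (ts, src, dst, sport, dport, proto, length)
    tuples, sorted by timestamp.  Yields finished flow records as dicts."""
    active = {}  # 5-tuple -> in-progress flow record
    for ts, src, dst, sport, dport, proto, length in packets:
        key = (src, dst, sport, dport, proto)
        rec = active.get(key)
        if rec is not None and ts - rec["last"] > TIMEOUT:
            yield rec        # gap exceeded the timeout: the flow ended
            rec = None
        if rec is None:
            rec = {"key": key, "start": ts, "last": ts, "pkts": 0, "bytes": 0}
            active[key] = rec
        rec["last"] = ts
        rec["pkts"] += 1
        rec["bytes"] += length
    yield from active.values()   # flush flows still open at end of trace
```

The in-memory `active` table is exactly the structure that grows without bound on a busy link, which motivates the segment-based design described next.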
We extend Argus to also capture information about flow burstiness, defined as the
variance of bytes over a fixed time period T. We record the number of time periods
observed, and the average and square sum of bytes over the time periods. Our base time
period for variance is T = 10 minutes, the same as our base segment length as described
below. This data allows us to compute the standard deviation of bytes over T afterward.
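Storing raw aggregates rather than a computed variance is what later lets burstiness be merged across segments. A small sketch (the triple layout is assumed; it is equivalent to the count, average, and square sum described above):

```python
import math

def merge_stats(a, b):
    """Each stats triple is (n, sum, sum_of_squares) of bytes per period T.
    Triples from adjacent segments merge by plain addition."""
    return (a[0] + b[0], a[1] + b[1], a[2] + b[2])

def stddev(stats):
    """Standard deviation recovered from a (possibly merged) triple."""
    n, s, sq = stats
    var = sq / n - (s / n) ** 2
    return math.sqrt(max(var, 0.0))  # clamp tiny negatives from rounding
```

A computed variance alone cannot be combined this way, which is why the summation form is what each flow record carries.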
Because we expect to run data collection indefinitely, it is essential that we collect
data concurrently with analysis, and that we store data in a manner that supports efficient
queries. An easy algorithm would use an in-memory flow table (indexed by the 5-tuple),
updating the corresponding flow record upon seeing each packet. However, this algorithm can
easily run out of memory due to a large number of concurrent flows, particularly with
long timeouts. We therefore divide flow records into segments for efficient analysis. LANDER
uses fixed-size segments (each 512 MB of packet headers, or 1–2 minutes at our current
capture rates); these traces arrive asynchronously, invoking our segment-processing
engine as each arrives.
Figure 2.1: The structure of multi-level flow records: each level has primarily flows with
exponentially longer durations, plus a “tail” to permit merging.
We convert these variable-duration segments to hierarchical, fixed-duration segments
to support efficient analysis and queries that span different timescales. We call
the initial fixed-duration segments level-0 flow segments, currently each with a duration of
T = 10 minutes. When we determine that all packet-header traces needed to cover a flow
segment are present, we process them to create the corresponding level-0 flow segment.
Care must be taken because each flow segment typically requires several packet-header
traces, and the packet-header trace at the start or end of a flow segment typically spans
two flow segments. When a trace spans multiple segments, we place the packets
corresponding to each segment in separate flow records in each segment. These records
will later be merged into a common flow record in the hierarchical merging described next.
The left-most column of Figure 2.1 shows packet headers (dark gray) being converted
to level-0 flow segments.
Each level-0 flow segment contains 10 minutes of flow records, but long flows will
span multiple, possibly hundreds or thousands, of segments. Since we cannot sequentially
process terabytes of data to answer queries at different durations, we assemble
level-0 flow segments into higher-level segments. We assemble segments hierarchically
in powers of two, so two adjacent level-0 segments are processed to produce one level-1
segment, and so on, with two level-i segments producing a level-(i+1) segment.
To avoid segments growing in size indefinitely and to allow efficient queries at large
timescales, we prune the flow contents at each level according to the following rule:

The pruning rule: A level-i segment starting at time t must preserve all flows of duration
longer than T·2^(i−2) (the duration rule), and all flows that are active in the timeout
period, the last τ seconds of the trace (the tail rule).

The presence corollary: A level-i segment starting at time t guarantees to contain all
flows of durations between T·2^(i−2) and T·2^(i−1) that start in the time [t, t + T·2^(i−1)]. It
may also contain some shorter flows at the end, and some longer flows (up to T·2^i)
which are not complete yet.

(When i = 0, the durations start at zero.)
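The pruning rule can be written down directly. This sketch assumes seconds-based timestamps and uses the 60 s flow timeout as the tail window; the function and parameter names are ours, not LANDER's:

```python
T = 600        # base segment duration in seconds (10 minutes)
TIMEOUT = 60   # flow timeout: the "tail" window at the end of a segment

def keep_in_segment(flow_start, flow_end, level, seg_start):
    """Pruning rule for a level-i segment starting at seg_start:
    keep a flow if it is longer than T*2^(i-2) (the duration rule), or
    if it is still active in the last TIMEOUT seconds (the tail rule)."""
    seg_end = seg_start + T * 2 ** level
    long_enough = (flow_end - flow_start) > T * 2 ** (level - 2)
    in_tail = flow_end >= seg_end - TIMEOUT
    return long_enough or in_tail
```

For a level-2 segment (40 minutes starting at time 0), a 700 s flow passes the duration rule, a 100 s flow in the middle is discarded, and a short flow touching the final 60 s is kept so it can merge with the next segment.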
The duration part of the pruning rule keeps each level file small, because each targets
a specific time duration and guarantees coverage for that duration. All short flows that
are not active at the end of the segment may be discarded. We can prove the presence
corollary, because we guarantee coverage for flows that start in the first half of the
segment and last for between a quarter and a half of the segment, since by definition
those flows must terminate in the segment and are too long to be discarded. We do not
guarantee all shorter flows are present, since they will be discarded to keep segment
sizes manageable. We cannot guarantee that longer flows are complete since they may
stretch into subsequent segments. We show the results of our multi-level storage below
in Section 2.2.4.
The tail part of the pruning rule allows adjacent segments to be merged without
loss of information. Only flows that are active in the last τ (timeout) seconds of the segment are
candidates to merge with the next segment, since by our definition of flows they will
time out if the gap is longer. By keeping all flows active in this window at the end of the
trace, we therefore guarantee that no information about mergeable flows will be discarded,
so we do not accidentally truncate the head of a new long-duration flow. Finally, the
rule keeps flows that are active in the last τ seconds, not merely flows that started in the
last τ seconds: a flow may start anywhere in the segment, and long-running flows will
typically span most or all of the segment.
Several details of segment organization support merging and processing. When
merging two adjacent level-i segments to create a level-(i+1) segment, we combine and
reorder flow records. We keep flow records sorted by flow start time, so if the level-i
files are numbered n and n+1, the merge must scan all of file n but only the head of n+1.
Variance can be combined across segments because we preserve the sum of observations
and their squares, not just the computed variance.
Finally, all segment processing is done in parallel on a workstation cluster. Segments
are processed and committed atomically (using filesystem rename as the commit
method). Concurrent processing of the same file is discouraged by tagging in-process
files with a flag, and we recover from crashed processing jobs by timing out flags. We
periodically scan the segment tree to catch and correct any missed merges due to races.
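The rename-as-commit step can be sketched as follows (a minimal illustration of the idea; the actual LANDER code and file naming differ):

```python
import os
import tempfile

def commit_segment(data: bytes, final_path: str):
    """Write a segment to a temporary file, then rename it into place.
    rename() is atomic within one filesystem, so readers see either the
    old state or the complete new segment, never a partial write."""
    dirname = os.path.dirname(final_path) or "."
    fd, tmp = tempfile.mkstemp(dir=dirname, suffix=".inprogress")
    try:
        with os.fdopen(fd, "wb") as f:
            f.write(data)
            f.flush()
            os.fsync(f.fileno())     # make data durable before committing
        os.rename(tmp, final_path)   # the atomic commit step
    except BaseException:
        os.unlink(tmp)               # failed: discard the partial file
        raise
```

Because the temporary file lives in the same directory (and thus the same filesystem) as the destination, the final rename cannot degrade into a non-atomic copy.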
2.2.3 Managing Outages
Very few network tasks can run uninterrupted forever without error; with power outages
and scheduled maintenance, continuous operation of more than a few months is good.
While we tolerate several types of outages, we have experienced multiple gaps, primarily
due to software errors in our experimental system. Since May 2009 we have taken 8
traces to date, with durations of 8, 9, 15, 23, 40, 65, and 99 days. In the future we plan to
bridge brief outages by computing both optimistic and pessimistic flow records around
a gap.
2.2.4 Understanding the Methodology
To illustrate how flows at different timescales are stored in different levels of our system,
Figure 2.2 shows the cumulative distribution of flow durations for different levels of
segments on a linear-log scale. Each line shows a different segment level, starting
with level-1 at 20 minutes and doubling at each subsequent level.

Each level shows a range of flow durations. Because of the tail rule, all segments
have some very short flows. Because there are relatively few very long flows, the size of
high-level segments is dominated by shorter flows. Although each segment at level i
contains flows from zero to T·2^i in duration (some of which may not be complete yet),
many short flows have been pruned away for a clearer view of the longer flows.
[Figure: CDF of flow durations (log10 of minutes), one curve per level, from level 1 (20 min) through level 8 (2560 min).]
Figure 2.2: Durations of flows observed at 8 different timescale levels (from 2 days of
dataset D1, flows less than 10 minutes truncated).
timescale       all   median  presence
     0       4967 M      2 M     835 k
     1        652 M    570 k     7.1 k
     2        214 M    276 k     3.1 k
     3        105 M    271 k     1.5 k
     4         53 M    274 k       949
     5         27 M    268 k       598
     6         14 M    265 k       586
     7          7 M    300 k       301
     8          4 M    308 k       139
     9        2.6 M    265 k       119
    10          1 M    243 k       148
    11        846 k    846 k        71

Figure 2.3: Number of flows at different timescales: all flows, median per segment, and
presence flows in one segment (14 days from dataset D8).
In addition, each segment has a large number of flows near the segment duration
limit. For example, 70% of level-1 flows are about 20 minutes long, and 57% of level-2
flows are 40 minutes long. These durations indicate flows that last the entire segment
and are parts of flows that span multiple segments. Their correct duration can only be
identified at higher levels.
To show the advantage of our multi-time-scale storage, Figure 2.3 shows the number
of flows across all files at each level, the median for each level, and how many are valid
by the presence rule. We see the number of valid, presence flows (bottom line) per
segment drops quickly; the true number of long flows is small. The median number of
flows plateaus around 300 k per segment because segment size is limited by the tail rule,
which keeps all flows active in the last τ seconds. Finally, the storage requirements (top line)
drop exponentially, although they too are limited by the tail rule. We conclude that
multi-scale storage is important to study long-duration flows.
2.3 Results
We next describe the results of our measurement system: how do long flows differ from
short flows in their characteristics and causes? Since May 2009 we have collected 8
traces. For this work, we focus on D1, a 2-day subset of a 15-day capture starting 27 May
2009, and D8, a 14-day subset of a 65-day capture starting 19 Feb 2010.
2.3.1 Characteristics of Long Flows
We first compare flow characteristics: rate, size in bytes, and burstiness as a function
of flow duration. Our goal is to understand what long flows are like, and how they
differ from short flows. We therefore graph density plots, with darker shades indicating
more flows. To quantify distributions at each timescale, we overlay box plots for each
timescale, showing quartiles, minimum, and maximum.

Most graphs in this section are generated with time-scale sampling: we take one
level-i segment for each level (i ∈ [1, 11], omitting level 0), getting a representative
sample from a fraction of the data (Section 2.2.4). We then select the subset of that segment for which we can guarantee full capture (flows with duration in [T·2^(i−2), T·2^(i−1)]) and plot only those flows, discarding the rest. This approach implies that one can compare the frequency of some characteristic within a given timescale (for a fixed x value). However, across different timescales (varying x), the absolute number of shorter-duration flows is underrepresented relative to longer-duration flows.
Figure 2.4 shows this difference: the left graph uses both level-0 segments and one level-1 segment (all flows), while the right uses only one of each level (sampled), so the left has higher absolute densities, indicating more flows. Although time-scale sampling under-estimates the total number of flows in the sampled case, it correctly reports the overall trend of flow sizes. More importantly, it allows study of the long tail of long-lived flows, while reducing computation spent on the already-well-studied shorter flows (computation that would otherwise overwhelm analysis). In summary, sampling allows efficient observation of the correct trends, but not absolute density scales across durations.
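The per-level duration filter described above can be sketched in a few lines. This is an illustration of the idea, not the system's actual implementation; the base timescale T and the tuple layout are assumptions.

```python
# Sketch of the time-scale sampling filter: from one sampled level-i
# segment, keep only flows whose duration lies in [T*2^(i-2), T*2^(i-1)),
# the range that level can guarantee to capture fully. The base timescale
# T = 10 minutes is an assumption for illustration.
T = 10 * 60  # assumed base segment timescale, in seconds

def keep_flow(duration_secs, level):
    """True if a flow from a level-`level` segment falls in its kept range."""
    lo = T * 2 ** (level - 2)
    hi = T * 2 ** (level - 1)
    return lo <= duration_secs < hi

# Flows outside a level's range are discarded there; they are counted at
# whichever level fully captures them instead.
flows = [(300, 1), (900, 2), (5000, 2)]  # (duration_secs, level)
sampled = [f for f in flows if keep_flow(*f)]
```

Each flow duration is thus counted at exactly one level, which is what makes per-timescale comparisons valid.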
Flow Rates: We first look at flow rate vs. duration in Figure 2.5. We see that short-duration flows can be quite fast, spanning six orders of magnitude in speed. By contrast, long flows are typically much slower. Quartiles show median rates around 50 bytes/s for flows shorter than 40 minutes, with a broad distribution, while flows of 100 minutes or longer have medians closer to 10 bytes/s.
The slower rate of long flows may be helpful for traffic engineering, allowing longer time to react to long-lived but slow-moving flows. Although we see very different rates at different durations, rate alone does not show which flows contribute to traffic. To evaluate if "slow and steady wins the race", we next look at flow sizes across all time.
Figure 2.4: Density plot comparing all (left) and sampled flows (right), duration vs. size
(from D8).
Figure 2.5: Density plot (log-scale) with quartile boxes of flow duration vs. rate (sampled from D8).
Flow Sizes: Prior studies show that "slow but steady" tortoise flows can account for significant traffic [Bkc02, cLH06]. Having just shown that long flows are slower than short flows, we next consider whether their persistence makes up the difference.
Figure 2.6 shows the flow sizes (in bytes) of D8. We see a strong correlation between flow duration and total number of bytes, at a slower-than-linear rate on the log-log plot. Linear regression of the median shows an exponential increase at a rate of 0.77, with a 0.958 confidence coefficient.
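The regression just cited can be reproduced mechanically; the sketch below fits a least-squares line to (log10 duration, log10 median size) pairs. The data points here are invented for illustration, not taken from D8.

```python
# A minimal sketch of the regression behind Figure 2.6: fit a least-squares
# line to (log10 duration, log10 median size). The slope estimates the
# growth rate (0.77 in the text) and r is the confidence (correlation)
# coefficient. The data points below are illustrative only.
import math

def loglog_fit(durations, medians):
    """Slope and correlation of log10(medians) regressed on log10(durations)."""
    xs = [math.log10(d) for d in durations]
    ys = [math.log10(m) for m in medians]
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / sxx, sxy / math.sqrt(sxx * syy)

durations = [10, 100, 1000, 10000]   # minutes (illustrative)
medians = [1e4, 6e4, 4e5, 2.5e6]     # bytes (illustrative)
slope, r = loglog_fit(durations, medians)
```

A slope below 1 on the log-log plot is exactly the "slower-than-linear" growth described above.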
Although each long-duration flow sends many bytes, there are many more brief flows, so in the aggregate short flows may still dominate traffic by bytes. Figure 2.7 shows the cumulative number of bytes sent by all flows over a two-day period in D1. (Unlike the density plots, this CDF considers all flow segments of all timescales sent over the entire period.) This graph confirms that there are not enough long-duration flows to dominate Internet traffic. From the figure we observe that, although short flows dominate Internet traffic (in terms of bytes), 21.4% of the traffic is carried by flows longer than 10 minutes, 12.6% by flows longer than 20 minutes, and nearly 2% by flows longer than 100 minutes. Even though short flows make up the majority of the traffic, optimizations to long flows can still have a significant effect. Internet Service Providers may also be interested in this observation, since the contribution of long-running but slow flows supports the need to meter service by bytes, not by peak speeds.
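A minimal sketch of the aggregate computation behind these percentages follows: the fraction of total bytes carried by flows longer than a duration threshold. The flow list is illustrative, not the D1 data.

```python
# Sketch of the aggregate-bytes breakdown behind Figure 2.7: the fraction
# of total bytes carried by flows longer than a duration threshold. The
# flow list is made up for illustration.
def bytes_fraction_longer_than(flows, min_minutes):
    """flows: list of (duration_minutes, size_bytes) tuples."""
    total = sum(size for _, size in flows)
    long_bytes = sum(size for dur, size in flows if dur > min_minutes)
    return long_bytes / total

flows = [(1, 500), (5, 300), (15, 150), (120, 50)]
frac = bytes_fraction_longer_than(flows, 10)   # bytes in flows over 10 min
```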
Flow Burstiness: Burstiness measures the uniformity of traffic rate over time. From Figure 2.9, we observe that long flows are generally less bursty than short flows (linear regression of the median shows an exponential decrease at rate −0.296, with a 0.830 confidence coefficient). Our explanation, confirmed when we consider causes in Section 2.3.2, is that long flows are mostly computer-to-computer communications, and so are naturally less bursty. One implication of this observation is for very low-buffer routers, which assume input traffic is smooth [SRS99, AKM04]. Low burstiness could be provided with high probability by segregating long-duration flows.
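Figure 2.9's y-axis is a variance in bytes; one plausible reading, sketched here under that assumption, is the variance of a flow's per-interval byte counts, which is low for steady machine-driven flows and high for bursty interactive ones.

```python
# One plausible burstiness measure consistent with the Figure 2.9 caption
# (an assumption for illustration, not the system's exact definition):
# the population variance of a flow's per-interval byte counts.
def burstiness(byte_counts):
    """Population variance of per-interval byte counts; low means smooth."""
    n = len(byte_counts)
    mean = sum(byte_counts) / n
    return sum((b - mean) ** 2 for b in byte_counts) / n

smooth = [100, 100, 100, 100]   # steady flow: variance 0
bursty = [0, 400, 0, 0]         # same total bytes, sent in one burst
```

Two flows with identical totals and rates can thus differ sharply in this measure, which is why burstiness is reported separately from rate and size.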
Figure 2.6: Density plot (log-scale) of flow duration vs. size (bytes) (sampled from D8).
Figure 2.7: Cumulative distribution of flow sizes (in bytes) of all flows of two days from
D1.
2.3.2 Causes of Long-lived Flows
While we observe that long-duration flows behave differently, we would like to know their causes. Although imperfect, port-based classification is our best tool, and Figure 2.8 shows the fraction of flows by port usage from minutes to weeks. We treat ICMP and Protocol Independent Multicast (PIM) as special "ports".
Figure 2.8: Source (left) and destination (right) port usage, plus PIM and ICMP, as a function of timescale (sampled from D8). Well-known ports are labeled with protocols, and protocol colors differ.
Figure 2.9: Density plot (log-scale) of flow duration vs. burstiness (log-scale variance, bytes) (sampled from D8).
The result supports our hypothesis: the traffic mix switches from interactive, to background, to computer-to-computer as the timescale increases. Hour timescales are dominated by web (HTTP, HTTPS, ports 80 and 443) destinations. Web is a frequent source port as well. Although at first it may seem surprising to think of port 80 as a source of traffic, this observation follows because our analysis treats each side of a bidirectional connection independently, so a port-80 source is the reply side of a web request. Day-long flows contain "background" traffic, with computer-driven but human-initiated protocols like chat and messaging (msnp, aol). We believe these represent regular application-level keep-alives and presence reports. Finally, week-long flows are almost all computer-to-computer protocols that run without human involvement, such as time synchronization (ntp) and multicast control (sd, pim, sapv1). This trend also shows in a very strong shift away from TCP in the protocol mix at longer timescales: as shown in Figure 2.10, TCP is 66% through 10 hours, but falls to 16% at two weeks, where 30% is PIM and 43% UDP. However, there do exist some very long HTTP connections. For example, we have seen week-long HTTP flows from a Texas-based infrastructure provider (http://www.softlayer.com/, 208.43.202.*) to some USC IPs (128.125.179.*, 128.125.230.*, 128.125.128.*, 128.125.169.*) in our observations. Since we scramble the last 8 bits of the IP addresses, we can only know the subnet (/24) of the observed flows. Interestingly, although these long connections are HTTP flows, they are computer-to-computer communications in nature (which supports our basic assumption).
Another interesting result is that ports 1024 through 1026 are very common sources for long-lived flows. These are the first non-reserved ports, and we believe they indicate long-running, started-at-boot daemons.
Although we have identified the question of causes for long-lived Internet flows, we have only preliminary answers. Port-based classification schemes are well known to be inaccurate because many protocols today intentionally use random ports, so applying other techniques to identify applications is one direction (potentially those of Kim et al. [KCF+08]). Also, carrying out similar experiments in other locations, and more thorough evaluation of the causes of long-running flows (protocols or applications), are both important future directions.
Figure 2.10: Protocol usage as a function of timescale, sampled from D8.
2.4 Conclusions
In this chapter we propose an efficient multi-time-scale IP flow analysis methodology, targeting long-lived flows. We study the characteristics of flows at different timescales, with flow durations ranging from minutes to weeks. Our results show that long-lived flows are generally slow-running and non-bursty components of Internet traffic, which is useful for traffic engineering purposes. We also study the causes of long-lived flows, and find that, unlike short flows with much human traffic, they are mostly computer-to-computer traffic for specific application purposes.
This chapter shows that aggregation in the time dimension provides an efficient way to gain new knowledge about long-lived Internet flows. We develop a new mechanism of multi-time-scale flow analysis, which allows efficient queries to evaluate network traffic at timescales of minutes to weeks. With this new mechanism, we greatly simplify the examination of simple statistics (bytes, rates, burstiness) and cause analysis (from protocol and port). We thus conclude that this chapter supports part of the thesis statement, using aggregation in the time dimension to efficiently find new knowledge of long-lived flows. We look only at aggregation in the time dimension in this chapter. In the next chapter, we present another study to support our thesis; it looks at both sampling and aggregation in the space dimension, to characterize Internet outages.
Chapter 3
Detecting Internet Outages with
Precise Active Probing
In this chapter, we present our approach to sampling in the space dimension to study
characteristics and reasons for Internet outages. We study the Internet IPv4 address
space with a granularity of /24 blocks, which is (usually) the smallest unit announced
in BGP tables. We sample in this space, selecting those blocks with enough responses
in history, to reduce the number of target blocks. We also develop a simple clustering
algorithm to aggregate our observations for more complete views of outages.
Part of this chapter was released as a technical report [QHP12a].
3.1 Motivation for Detecting Internet Outages
End-to-end reachability is a fundamental service of the Internet. Network outages break protocols based on point-to-point communication and often harm the user's experience of Internet applications. Replication and content delivery networks strive to mask outages, but in spite of decades of research on network reliability, Internet outages are still pervasive, ranging from minutes to hours and days. Outages are triggered by system, link, or router breakdowns [TLSS10, Mal11]. Causes of these failures include natural disasters [Tim11a, Mal11], human error [MWA02], and political upheavals [Tim11c, Cow11a, Cow11b, Cow11c]. On occasion, routing changes can cause user-visible problems as the network reconfigures [LMJ97, LABJ00, Tim11b].
3.1.1 Contributions
The contributions of our work are, first, to provide a new method that can systematically find outages, unreachable blocks of adjacent network addresses, for all of the analyzable IPv4 Internet—a method that provides better accuracy and coverage than existing approaches, particularly for small events. Second, we carefully validate our approach, comparing the onset and duration of outages to root causes, both for widely publicized events such as the Jan. 2011 Egypt outage and for randomly sampled small outages. Finally, we provide a statistical characterization of Internet outages, both as a whole and for specific blocks, extending prior evaluations that used meshes to cover the entire network edge.
Our first contribution is our new approach to active probing, showing that a single computer can track outages over the entire analyzable IPv4 Internet (the 2.5M /24 blocks that are suitable for our analysis; see Section 3.3.4). Like prior work [HPG+08], we send active ICMP probes to addresses of each /24 block every 11 minutes. Unlike it, we develop a new approach of precise probing that carefully selects a subset of blocks and addresses per block to reduce probing traffic by a factor of 75, while retaining more than 90% accuracy for outage detection (Section 3.3). We develop a new method to distill this data into block-level outage reports (Section 3.2.2), defining an outage as a sharp change in block responsiveness relative to recent behavior. We interpret the observations of our system by correlating block-level outages to discover network-wide events with two new clustering algorithms. The first groups outages in two dimensions, time and space, to provide a general understanding of network behavior, associate outages with countries, and provide the first visualization of outages (Section 4.2.1). The second, more general algorithm finds network-wide events from the start- and end-times of block-level outages (Section 3.2.4).
Several prior systems study network outages; the new contribution of our approach is significantly greater accuracy than the best current active methods, and operation from a single computer with about the same probing traffic. Unlike control-plane studies [LABJ00, MIB+04], we detect outages that are not seen in the routing system (Section 3.4.3), expanding the result observed by Bush et al. [BMRU09]. Unlike previous data-plane studies using active probing, including DIMES [SS05], iPlane [MIP+06a], Hubble [KBMJ+08a], and SCORE [KYGS05, KYGS07], our block-level measurements are considerably more accurate at detecting core-to-edge outages. Comparisons to Hubble show that our approach reduces the number of false conclusions by 31% compared to approaches probing a single representative per /24 block, with about the same traffic (Section 3.4.6). Unlike network tomography, which focuses on localizing outages [KYGS05, KYGS07, CTFD09, DTDD07, HFT08], we instead focus on tracking core-to-edge reachability; our work could serve as a trigger for such localization methods. We remove outages near our vantage points to correct for correlated error (Section 3.5.3). Of course, our approach shares the limitation of all those based on active probing: it can only report on the visible Internet, those willing-to-respond blocks; currently we can monitor about 2.5M /24 blocks, about one-seventh more coverage than Hubble. Recent work has combined backscatter with routing information to characterize large outages [DSA+11, DAAC12]; we show that active probing complements this work and is critical to detect small outages and provide Internet-wide statistics. We cover related work more generally in Chapter 7.
The second contribution of our work is to validate the accuracy of active probing for outage detection (Section 3.4). Even though we probe from a single location, we draw on data sources taken from three vantage points in California, Colorado, and Japan to show our results are largely insensitive to location (Section 3.4.5). We study more than 30 observations taken over more than two years, using two-week surveys of all addresses in a 1% sample of Internet blocks [HPG+08], and a 24-hour measurement taken across all suitable /24 blocks in the Internet in Sep. 2011, to show that our results are stable over time. We validate our approach with BGP archives and news sources, for selected large events and a random sample of 50 observed events. We confirm 5 of 6 large events (83%, Section 3.4.2), including the Jan. 2011 Egyptian outage, the Mar. 2011 Japanese earthquake, and equally large but less newsworthy events. Our random sample of all events confirms prior work by Bush et al. [BMRU09] showing that small outages often do not appear in control-plane messages, since partial control-plane information shows only 38% of the small outages we observe (Section 3.4.3). We emulate outages of controlled length to investigate false availability. We miss very short outages, but detect 100% of full-block outages that last at least twice our probing interval (Section 3.4.4).
Our final contribution is to evaluate Internet stability as a whole (Section 3.5). We show that, on average, about 0.3% of the Internet is inaccessible at any given time. Internet blocks have around 99.7–99.8% availability, only about 2.5 "nines", compared to the "five nines" of the telephone industry. While prior work has studied paths between meshes of hundreds of computers and thousands of links, and anecdotes about the Internet as a whole abound, we provide much broader coverage, with quantitative data about all 2.5M responsive edge /24 blocks. We believe these statistics can establish a baseline of Internet reliability, allowing future comparisons of Internet reliability across ISPs or geography.
3.1.2 Relation to Thesis
We studied time-dimension aggregation for properties of long-lived Internet flows in the prior chapter. This chapter supports our thesis by sampling and aggregation in the space dimension, and in a different problem domain: Internet outages. We explore two types of sampling and probing. The first type adopts the same idea as previous Internet survey work [HPG+08]: using a mixed sample of both stable and random /24 blocks as representatives of the Internet edge networks (inter-block sampling), but with only 20k blocks. In the second type of sampling, we probe a sample of the top k responsive addresses in each block (intra-block sampling), and cover a much larger fraction of the Internet: 2.5M blocks. We find that, with top-k sampling (k = 20), we achieve 93% accuracy with only 8% of the traffic of probing all addresses. With the top-k strategy, we can efficiently and accurately detect Internet outages.
We probe continuously to find outages and use simple clustering aggregation to study the characteristics of Internet outages. We conclude that we develop new sampling and aggregation mechanisms that help us overcome the large scale of the Internet, by selecting representative blocks and sampling within each block.
3.2 Methodology
Our method for outage detection begins with active probing, followed by outage identification in individual blocks, visualization, and correlation into events.
3.2.1 Active Probing of Address Blocks
We collect data with active probing, building on our approach developed to study the Internet address space [HPG+08]. A brief review of this collection method and data normalization follows. In Section 3.3 we extend raw collection into a system optimized for outage detection.
Reviewing Address-Space Probing: Our approach begins with active probing of some or all addresses in some or all analyzable /24 address blocks in the IPv4 address space. We probe each block with ICMP pings (echo requests) at 11-minute intervals for one to 14 days. Responses are classified into four broad categories: positive (echo reply), negative indicating the network is unreachable (for example, destination unreachable), other negative replies (we interpret these as a reachable network), and non-response. We have two probing configurations: Internet address surveys probe all addresses in about 22,000 /24 blocks (data available [HPG+08] and reviewed in Section 3.4.1), while the operational outage observation system probes 20 addresses in 2.5M /24 blocks (Section 3.3).
Our probing rate is high compared to some prior probing systems. When we probe all addresses in a /24, incoming probe traffic to each /24 arrives at a rate of one packet every 2.5 s. In operation, we get about three inquiries about probing per month, either directly to the ISP or through information on a web server on the probers. Most requesters are satisfied once they understand our research, but any can be added to a do-not-probe blacklist on request. Our operational system (Section 3.3) probes many more blocks, but at a rate of one packet every 32 s, actually drawing fewer complaints.
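The per-block rates quoted above follow from spreading a round's probes over its 11 minutes; a quick check of the arithmetic:

```python
# Probes to one block are spread across the 11-minute (660 s) round, so the
# inter-arrival time at the block is 660 s / (number of probed addresses).
ROUND_SECS = 11 * 60

survey_gap = ROUND_SECS / 256       # full /24 survey: ~2.6 s between packets
operational_gap = ROUND_SECS / 20   # 20 sampled addresses: 33 s between packets
```

These match the roughly 2.5 s and 32 s figures quoted in the text.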
Our outage detection applies only to blocks where at least 10% of addresses respond (Section 3.2.2). Based on Internet-wide censuses, about 17% of /24 blocks meet this criterion [HPG+08]. Our results therefore exclude sparsely populated blocks, but do reflect a diverse set of Internet users whose firewalls admit ICMP, including home users, server farms, universities, and some businesses. Although we provide no information about the non-responsive Internet, this limitation is shared by other forms of active probing, and our coverage is actually 14% better than Hubble's (Section 3.4.6).
Normalizing survey data: In the raw data, probes are spread out in time and responses return with varying delays. In this chapter we simplify the survey data by mapping probe records into rounds, where each round is 11 minutes long. We identify rounds by index i, with N_r total rounds in a dataset (thus i ∈ [1..N_r]).
We correct two errors that occur in mapping observations to rounds: sometimes a round is missing an observation, and occasionally we see duplicate responses in a round. Our collection software is not perfectly synchronized to 11-minute rounds, but takes on average 11 minutes and 3 seconds. (We intentionally chose to correct for minor drift rather than guarantee perfect synchronization over days of continuous operation.) Because this interval is not exactly 11 minutes, for each individual IP address, about one round in 220 has no observation. We detect such holes and fill them by extrapolating from the previous observation. In addition, we sometimes get multiple observations per round for a single target. About 3% of our observations have duplicate results, usually a timeout (non-response) followed by a negative response (an error code). These duplicates are rare and somewhat non-uniformly distributed (for example, about 6% of blocks have over 100 addresses each reporting duplicates, but most blocks have no duplicates). When we get duplicate responses, we keep the most recent observation; thus the negative response usually overrides the timeout.
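The normalization steps above can be sketched as follows; the record format and function names are ours, for illustration, not those of the collection software.

```python
# Sketch of round normalization (Section 3.2.1): map raw probe records to
# 11-minute rounds, fill the ~1-in-220 missing rounds by carrying the
# previous observation forward, and resolve duplicates by keeping the most
# recent observation in the round.
ROUND_SECS = 11 * 60

def normalize(records, n_rounds):
    """records: list of (timestamp_secs, response) for one target address."""
    rounds = [None] * n_rounds
    for ts, resp in sorted(records):       # sorted => later record wins on dup
        i = int(ts // ROUND_SECS)
        if i < n_rounds:
            rounds[i] = resp
    for i in range(1, n_rounds):           # fill holes by extrapolation
        if rounds[i] is None:
            rounds[i] = rounds[i - 1]
    return rounds

recs = [(10, "echo-reply"), (700, "timeout"), (720, "unreachable")]
obs = normalize(recs, 4)   # round 1 keeps "unreachable"; rounds 2-3 are filled
```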
Finally, we observe that the process of associating the IP address of an ICMP reply
with its request is not perfect. Multi-homed machines sometimes reply with an address
of an interface other than the one which was targeted, this is known as IP address aliasing
in topology discovery (as described in early work [GT00] and recent surveys [Key10]).
Since we know all the addresses we probe, we discard responses from unprobed targets
(about 1.4% of replies).
3.2.2 Probes to Outages
From a series of probe records organized into rounds, we next identify potential outages when we see a sharp drop and later increase in the overall responsiveness of a block.
Figure 3.1: Top: probe responses for one /24 block. Green: positive response; black: no response; blue: not probed (after round 1825). Bottom: block coverage and outage thresholds per round. Dataset: Survey S_30w.
Our system begins with observations of individual addresses. Let r_j(i) be 1 if there is a reply for address j in the block at round i, and 0 if there is no reply or the negative response is network or host unreachable:

    r_j(i) = 1 if responsive; 0 otherwise.
Figure 3.1 shows a graphical representation of r_j(i): each green dot indicates a positive response, while black dots are non-responsive (the blue area on the right is after the survey ends). In this block many addresses are responsive or non-responsive for long periods, as shown by long horizontal green or black lines, but there is some churn as machines come and go.
The coverage of a block, at round i, is defined as:

    C(i) = (1/N_s) Σ_{j=1}^{N_s} r_j(i)
(Where N_s is the number of IP addresses probed in the block: either 256, or 20 with sampling, Section 3.3.2.) C(i) is a timeseries (i ∈ [1..N_r]) for block responsiveness across the entire observation period.
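A direct transcription of these definitions (a sketch with our own data layout, assumed for illustration):

```python
# Coverage per the definitions above: r_j(i) is 1 for a positive reply and
# C(i) averages r_j(i) over the N_s probed addresses at each round.
def coverage(replies):
    """replies: per-address lists of per-round 0/1 responsiveness, r_j(i)."""
    n_s = len(replies)
    n_r = len(replies[0])
    return [sum(r[i] for r in replies) / n_s for i in range(n_r)]

replies = [
    [1, 1, 0, 1],   # address 1: down only in round 3
    [1, 0, 0, 1],   # address 2
    [0, 0, 0, 0],   # address 3: never responds
    [1, 1, 0, 1],   # address 4
]
C = coverage(replies)   # round 3 (index 2) shows the outage: C = 0
```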
A severe drop and later increase in C(i) indicates an outage for the block. The bottom
of Figure 3.1 shows C(i) drops to zero for rounds 1640 to 1654, an outage that shows as
a black, vertical band in the top panel.
Algorithm 1 formalizes our definition of "a severe drop": we keep a running average of coverage over a window w (default: 2 rounds, or 22 minutes) and watch for changes of C(i) by more than a threshold ρ (default: 0.9). In a few cases C(i) changes gradually rather than suddenly, or a sudden change is blurred because our observations are spread over 11 minutes. Therefore, for robustness, we compare C(i) against both the current running average and the previous round's running average. The result of this algorithm is a list of outages and a binary-valued timeseries Ω(·), indicating when the block is down (Ω(i) = 1) or up (0). For succinctness, we do not show other special cases in Algorithm 1 (such as consecutive downs/ups, where we mark the earliest as down and the latest as up), but we handle such cases properly in our implementation. Also, we report an outage as long as C(i) is 90% lower than in previous rounds, even if C(i) > 0 in some cases.
Because this algorithm detects changes in C(·), it only works for blocks where a moderate number of addresses respond. We typically require around α = 0.1 of all addresses (10%, or 25 addresses per /24) in a block to respond, averaged over the entire survey (C̄ = (1/N_r) Σ_i C(i) ≥ 0.1); otherwise we ignore the block as being too sparse. In
Section 4.2.2 we review values of α and conclude that α = 0.1 is reasonable. Table 3.1 shows how many blocks are analyzable for Survey S_30w (the 30th survey, taken on the U.S. west coast). In our operational system (Section 3.3), we pre-screen blocks, discarding sparse blocks (fewer than 25 responders) and probing only the 20 addresses most likely to respond; we therefore omit the α check in this case.

Algorithm 1 Outage detection for a block
Input: C(i): timeseries of coverage; N_r: number of rounds
Output: L: list of outage (start, end) round tuples; Ω(i): binary timeseries of block down/up state
Parameters: w: number of rounds to look back; ρ: drop/increase fraction that triggers an outage start/end

  L = ∅; Ĉ = 0
  Ω(i) = 0 for all i ∈ [1..N_r]
  for all i ∈ [w+1..N_r] do
    Ĉ′ = Ĉ                                  // previous running average
    Ĉ = (1/w) Σ_{j=i−w}^{i−1} C(j)          // current running average
    if C(i) < (1−ρ)Ĉ or C(i) < (1−ρ)Ĉ′ then
      // severe drop ⇒ outage start
      last_outage_start ← i
    else if Ĉ < (1−ρ)C(i) or Ĉ′ < (1−ρ)C(i) then
      // severe increase ⇒ outage end
      L = L ∪ {(last_outage_start, i)}
      for all j ∈ [last_outage_start..i] do
        Ω(j) = 1
      end for
    end if
  end for
  return L, Ω
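A compact Python rendering of Algorithm 1 follows. This is a sketch: the variable names are ours, and the full system handles more special cases (such as consecutive starts and ends) than shown here. Defaults follow the text: w = 2 rounds, rho = 0.9.

```python
# Running-average outage detection, following Algorithm 1. Returns the list
# of (start, end) outage round tuples and the per-round down/up timeseries.
def detect_outages(C, w=2, rho=0.9):
    outages, omega = [], [0] * len(C)
    c_hat = c_hat_prev = 0.0
    start = None
    for i in range(w, len(C)):
        c_hat_prev, c_hat = c_hat, sum(C[i - w:i]) / w
        if C[i] < (1 - rho) * c_hat or C[i] < (1 - rho) * c_hat_prev:
            if start is None:               # severe drop => outage start
                start = i
        elif start is not None and (c_hat < (1 - rho) * C[i]
                                    or c_hat_prev < (1 - rho) * C[i]):
            outages.append((start, i))      # severe increase => outage end
            for j in range(start, i):       # mark rounds up to the recovery
                omega[j] = 1
            start = None
    return outages, omega

C = [0.8, 0.8, 0.8, 0.0, 0.0, 0.8, 0.8]    # coverage with a 2-round outage
outs, omega = detect_outages(C)
```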
3.2.3 Visualizing Outages
After finding block-level outages, we next apply the clustering algorithm of the next chapter (Chapter 4) to group block-level outages in two dimensions: time and space.
Figure 4.1 shows the result of visualization clustering for Survey S_38c. The x-axis is time; each row shows the Ω_j downtime for a different /24 block j. Due to space, we plot only the 500 blocks with the most outages. Color is keyed to the country to which each block is allocated.

category                             blocks       percentage
all IPv4 addresses                   16,777,216
non-allocated                         1,709,312
special (multicast, private, etc.)    2,293,760
allocated, public, unicast           12,774,144   100%
non-responsive                       10,490,902    82%
responsive                            2,283,242    18%   100%
probed                                   22,381           1%
too sparse, C̄ < α                        11,752           0.5%
analyzable, C̄ ≥ α                        10,629           0.5%

Table 3.1: Subsetting for blocks that are probed and analyzable (C̄ ≥ 0.1), for Survey S_30w. Measurements are in numbers of /24 blocks. Percentages are shown on a per-column basis (e.g., responsive blocks are 18% of allocated, public, unicast blocks).
There are two clusters of blocks that have near-identical outage end times. The cluster labeled (a) covers 19 /24s that are down for the first third of the survey; it corresponds to the Feb. 2011 Egyptian Internet shutdown. The cluster labeled (b) covers 21 /24 blocks for a slightly longer duration; it is an outage in Australia concurrent with flooding on the eastern coast.
3.2.4 Outages to Correlated Events
Next we use block outage information to discover network events; we use these events
later in Section 3.4 to relate the outages we see to ground truth based on routing and
news. While visualization is helpful, Algorithm 3 over-constrains clustering since each
block can be adjacent to only two others.
We therefore develop a second clustering algorithm that relaxes this constraint: instead of grouping blocks, we group individual block-level outages into network-wide events. We identify events from similar start- and end-times of outages. Given two outages o and p, each having a start round s(·) and end round e(·), we measure their distance d_e:

    d_e(o, p) = |s(o) − s(p)| + |e(o) − e(p)|

Outages that occur at exactly the same time have d_e(o, p) = 0. Clusters can be formed by grouping all outages that occur at similar times. Since routing events often require some time to propagate [LABJ00], and outages may occur right on a round edge, we consider outages with small distance (less than a parameter δ) to be part of the same event. This approach may fail if there are two unrelated events with similar timing, but we believe that timing alone is often sufficient to correlate larger events in today's Internet, provided we use a conservative δ. Currently we set δ = 2 rounds (22 minutes). We have also studied a much larger δ = 10 (110 minutes), showing similar results, although less strict matching aggregates many more small events (see Section 3.5.2). This is formalized in Algorithm 4.
Discussion: For simplicity and efficiency, we use greedy O(n²) clustering algorithms (Algorithms 3 and 4). We considered other standard clustering algorithms, including k-means and hierarchical agglomerative clustering. The k-means algorithm is not suited to our problem, because k must be pre-selected as the number of clusters, which is not known beforehand. We do not choose hierarchical agglomerative clustering for efficiency reasons, because it has a time complexity of O(n³) and is not suitable for large n (especially for the operational system in Section 3.3).
Algorithm 2 Finding correlated events
Input: O: the set of all outages in a survey
Output: E: the set of network outage events, each containing one or more outages
Parameters: δ: the threshold to decide if two outages belong to the same event
while O ≠ ∅ do
  find first occurring outage o ∈ O
  e = {p : ∀p ∈ O, s.t. d_e(o, p) ≤ δ}
  O = O \ e
  E = E ∪ {e}
end while
return E
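The greedy clustering above can be sketched in Python as follows. This is a simplified illustration rather than the dissertation's implementation: outages are (start, end) round tuples and delta is the distance threshold in rounds.

```python
def event_distance(o, p):
    """d_e(o, p) = |s(o) - s(p)| + |e(o) - e(p)|, measured in rounds."""
    return abs(o[0] - p[0]) + abs(o[1] - p[1])

def cluster_outages(outages, delta=2):
    """Greedily group block-level outages into network-wide events.

    Repeatedly take the earliest remaining outage and pull every
    outage within distance delta of it into one event (O(n^2) overall).
    """
    remaining = sorted(outages)          # earliest-starting outage first
    events = []
    while remaining:
        seed = remaining[0]
        event = [p for p in remaining if event_distance(seed, p) <= delta]
        remaining = [p for p in remaining if event_distance(seed, p) > delta]
        events.append(event)
    return events

# Three blocks fail around rounds 10-20 (one a round late); an unrelated
# outage at rounds 100-105 forms its own event.
events = cluster_outages([(10, 20), (11, 20), (10, 21), (100, 105)], delta=2)
```

With delta = 2 the first three outages collapse into a single event, while the distant one stays separate, matching the intent of the distance metric.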
3.2.5 Parameter Discussion
We next discuss the parameters of our approach to evaluate how sensitive the results are
to their values.
We use a window w (default: 2 rounds) to determine the edges of outages. A large w is not feasible because most outages are short (Section 3.5.2). We studied different w values from 1 to 5 rounds, and found that the numbers of up/down decisions differed by only 0.3%, confirming our choice of w = 2 is reasonable.

The dark-fraction threshold (default: 0.9) is the fraction of addresses that must go dark to indicate an outage. To evaluate its effect, we consider an extreme strategy, any, as ground truth, where we probe all addresses but consider the block up if any single address responds. We choose a threshold less than 1.0 because requiring a "perfect" outage allows a single router to indicate a block is up even if all hosts are down. However, the difference in accuracy is less than 0.1% (details in [QH10a]). For threshold values of 0.5 to 0.9, outage estimates are all accurate (more than 99.8%), differing by less than 0.1%. We select 0.9 as a balance between accuracy and conservativeness when declaring an outage.

We define an outage as a property of an entire /24 block, implying the entire /24 is used and routed consistently. About 76% of addresses are in consistently used /24s [CH10a]; study of sub-/24 outages is future work.
We use a coverage threshold (default: 0.1) to identify blocks as too sparse to classify because of few responding addresses. A very small threshold is not possible, because the implied number of responsive addresses in a block must be more than 1 (Section 3.2.2), and large enough to be robust to packet loss. A large threshold would disqualify many blocks (Section 3.3.3). A threshold of 0.1, meaning that on average 25 hosts in a fully probed block are responsive, is a good balance between probing rate (Section 3.3) and accuracy (Section 3.4.6).
The choice of an 11-minute probing interval limits the precision of our estimates of outage times. We selected this probe frequency to match that used in public datasets [HPG+08], as it provides a reasonable tradeoff between precision and traffic, and because our analysis is greatly simplified by a fixed probing interval. Our choice limits the probing rate at the target networks to no more than one probe every 2.5 s, when all addresses are probed.
3.3 Building an Operational System
Much of our analysis uses complete probing: survey data probes all addresses in each /24 block every 11 minutes. This traffic is modest at the targets (each /24 receives one probe every 2.5 s), and for 22k blocks the prober sends around 6k probes/s. However, covering the entire public IPv4 space would be expensive: 4.8M probes/s at the source, requiring more than 250 cores and 2.5 Gb/s of traffic. We next describe our operational probing system. We identify plausible probing rates for targets and prober, and develop optimizations to reduce the traffic at the target and the load on the prober.
3.3.1 Bounding Probing Traffic
Probing rates trade traffic against accuracy, so we first identify reasonable rates for the prober and target.
At the target, probing all addresses in each /24 block every 11 minutes implies 0.39 probes/s per block. To put this traffic in perspective, a typical /24 block receives 0.56 to 0.91 probes/s as "background radiation" (22 to 35 billion probes per week per /8 block [WKB+10], ignoring unusual targets like 1.2.3.4). Full block probing therefore imposes a noticeable burden on the target, adding about 50% to its background traffic. We therefore probe only 8% of the addresses in each block, cutting per-block incoming traffic to only 4–7% of background.
At the prober, outgoing probes are constrained by bandwidth, CPU, and memory, as we track probes awaiting responses. Of these, CPU is the largest constraint, since we require 75k probes/s, about 40 Mb/s of outgoing traffic, and each open request requires only 104 bytes of memory. We show below that we can reduce the number of target blocks to allow a single, modest 4-core server to probe all of the analyzable IPv4 Internet.
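As a sanity check, the rates above can be reproduced directly from the probing parameters. This is a rough sketch using values from the text; the background-radiation range is approximate.

```python
ROUND_S = 11 * 60                      # one probing round: 11 minutes

# A fully probed /24: 256 probes spread evenly across one round.
full_block_rate = 256 / ROUND_S        # ~0.39 probes/s per block
background_lo, background_hi = 0.56, 0.91   # typical /24 "background radiation"

# Sampling 20 of 256 addresses (~8%) cuts the per-block burden.
sampled_rate = 20 / ROUND_S            # ~0.03 probes/s per block
share = sampled_rate / background_lo   # worst case: a few percent of background
```

The sampled rate is a small fraction of even the low end of background radiation, consistent with the 4–7% figure quoted above.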
3.3.2 Sampling Addresses in Blocks
While probing all addresses gives a perfect view of the block, much of that traffic is redundant if one assumes outages affect all or none of the block. (Prior work suggests that about 76% of addresses are managed as /24-size blocks or larger [CH10a], so this assumption usually holds.) Some redundancy is important to avoid interpreting individual failures as an outage for the entire block, so we next evaluate sampling k of these addresses (k ≤ 256).
Sampling reduces the probing rate, but also accuracy. To maximize the benefit of probing we wish to probe the addresses in each block that are most likely to respond, since only they can indicate when the block is out. This problem is a generalization of hitlist detection, which selects a single representative address for each block [FH10]. Instead we want the k most likely to respond. We use public datasets derived for hitlist generation, consuming two years of full IPv4 census data to find the k sample addresses customized for each /24 block. We evaluate the effects of sampling in Section 3.4.6, showing that k = 20 provides good accuracy.
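Selecting the k most-responsive addresses per block can be sketched as follows. Here `block_history` is a hypothetical mapping from an address's last octet to an estimated response probability, standing in for the IPv4 Response History dataset; it is an illustration, not the actual Hadoop pipeline.

```python
def k_sample(block_history, k=20):
    """Pick the k addresses in a /24 most likely to respond.

    block_history maps the last octet (0-255) to an estimated
    response probability derived from past censuses.
    """
    ranked = sorted(block_history, key=block_history.get, reverse=True)
    return ranked[:k]

# Hypothetical block: addresses 0-29 respond often, the rest rarely.
history = {i: (0.9 if i < 30 else 0.05) for i in range(256)}
sample = k_sample(history, k=20)
# all 20 picks come from the highly responsive addresses 0-29
```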
3.3.3 Reducing the Number of Target Blocks
To detect outages successfully, several addresses in a block must reply to probes. Many blocks in the IPv4 address space do not respond to pings at all [HPG+08]; they are firewalled, not routed on the public Internet, or not occupied. Many more blocks have a few addresses that respond, but not enough for our coverage threshold (0.1, Section 3.2.2). We therefore discard such non-analyzable blocks.

To evaluate how many /24 blocks respond and meet our criteria of analyzable, we looked at a census of all IPv4 addresses [HPG+08] taken at the same time as S_40w. There were 14.4M /24 blocks allocated, but only 4.0M (28%) had any responses, and only 2.5M (17%) are analyzable, meeting our threshold of 25 (⌊0.1 × 256⌋) or more responders.

In summary, we can cut our aggregate probe rate by a factor of about 75 by avoiding non-analyzable blocks and downsampling in each block.
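The factor-of-75 reduction follows from the two optimizations together; a quick check with the numbers above:

```python
allocated_blocks = 14.4e6          # /24 blocks allocated
analyzable_blocks = 2.5e6          # blocks meeting the coverage threshold
full_per_round = allocated_blocks * 256      # probes per round, complete probing
sampled_per_round = analyzable_blocks * 20   # probes per round, k = 20 sampling

reduction = full_per_round / sampled_per_round   # ~74x, "a factor of about 75"
probe_rate = sampled_per_round / (11 * 60)       # ~75k probes/s at the prober
```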
3.3.4 Our Prototype System
Our prototype probing system employs both of these optimizations, probing 20 samples in each of about 2.5M /24 blocks (75k probes/s) to observe the entire analyzable IPv4 Internet. This target population requires about 40 Mb/s of outgoing network traffic and sees about 27 Mb/s of return traffic. Our core prober is prior work (from [HPG+08]), but the preparation, analysis, and optimizations that let one host cover the Internet are new.

Probe Preparation: Before beginning a probing run, we must generate the list of target blocks and sampled addresses. (In regular use, we would redo this list for each new census.) The input for this process is the most recent IPv4 Response History dataset, containing estimates of how likely each IPv4 address is to respond [FH10]
based on approximately two years of rolling IPv4 censuses [HPG+08]. We extract the k sample addresses for each /24 block using a Hadoop-based Map/Reduce job. The output of this step is a list of IP addresses for each viable block. The address list is written in a pseudo-random order, so probes to each block are spread out over each round (to avoid ICMP rate limiting at target blocks), and the probe order is different for each block. (We use the probing order described previously [HPG+08], similar to that of Leonard & Loguinov [LL10].)
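Spreading each block's probes across a round in a block-specific order might look like the following sketch. The actual prober uses the ordering of [HPG+08]; this illustration just derives a deterministic per-block seed and spaces probes evenly.

```python
import random
import zlib

def probe_schedule(block_prefix, addresses, round_s=660):
    """Spread one block's probes evenly across a round, in a
    block-specific pseudo-random order (to avoid ICMP rate limiting)."""
    seed = zlib.crc32(block_prefix.encode())   # deterministic per block
    order = list(addresses)
    random.Random(seed).shuffle(order)
    gap = round_s / len(order)                 # e.g. 33 s between probes for k = 20
    return [(i * gap, addr) for i, addr in enumerate(order)]

schedule = probe_schedule("192.0.2.0/24", range(20))
```

Seeding from the prefix means every block gets a different order, yet the same order is reproduced on every run.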
Active Probing: We use a custom high-performance prober to pace probes across the 11-minute round duration, send many probes without waiting, and track their progress until they reply or time out after 3 s. It associates replies with requests based on the reply address (80.6% of the time) or the contents of the reflected header (0.5%), or it logs the apparently erroneous reply (18.9%). We run four instances of the prober in parallel on a single computer, each processing one quarter of the targets.
Response Analysis: We analyze the responses when collection completes, or periodically for on-going collection. We process the data with three Map/Reduce jobs: first we convert raw responses from each address into discrete records by 11-minute rounds; then we group these records by common prefix for each /24 block; finally we compute outages for each block (Algorithm 1). We also cluster and plot outages for further analysis.

Starting 2011-09-28 T22:36 +0000, we took a 24-hour probe of sampled addresses for the entire analyzable IPv4 Internet. That observation of 2.5M blocks includes about 6.5 billion records and 56 GB of compressed data. By comparison, a two-week survey of 22,000 blocks consists of about ten billion records and 70 GB of compressed data. While we have not tried to optimize our analysis code, we can turn observations into clustered events in about 80 minutes for a survey on our cluster.
Performance: In operation we run four parallel probers (4-way parallelism), each a separate process on a CPU core, probing a separate part of the address space. We find each core can sustain 19k probes/s and conclude that a single, modest 4-core server can probe all of the analyzable IPv4 Internet.

To show our system can probe the entire analyzable Internet, we evaluated raw prober performance. For this experiment we used a 4-core Opteron system with 8 GB of memory to probe sets of IP addresses ranging in number from 1M to about 50M, taken from our optimized set of sampled addresses and target blocks.

Assuming a good Internet connection, we are primarily CPU constrained, as the prober manages data structures to match responses with requests to confirm the probed addresses.
Figure 3.2 shows single-core CPU load and network traffic for one instance of our prober as we increase the number of target addresses per round. Each observation shows the mean, with a very small standard deviation, over 18 measurements taken every minute, starting 13 minutes into a probing run to avoid startup transients. Memory use is nearly fixed at roughly 333 MB/core, growing linearly from 325 MB to 346 MB over this range of probe rates.

Fortunately, probing parallelizes easily; in operation we run four parallel probers, each a separate process (on a different CPU core), probing a separate part of the address space. There is minimal interference between concurrent jobs, and in fact the data in Figure 3.2 reflects 4-way parallelism. Our 4-way probing therefore meets our target of 75k probes/s to cover the sampled Internet at k = 20 per block.
Data Availability: Our input data and results are available on request at http://www.isi.edu/ant/traces/index.html.
Figure 3.2: Performance of one prober instance as number of targets grows: 1-core CPU
(left scale) and bandwidth (right).
3.4 Validating Our Approach
We next validate our approach, starting with case studies, then considering unbiased random cases and stability over time and location. Finally, we compare our accuracy to prior approaches.
3.4.1 Validating Data Sources and Methodology
While our current operational system probes the analyzable Internet, to validate our approach we turn to survey data collected over the last two years. We use survey data here to provide an upper bound on what full probing can determine, and because our optimized system was only completed in Sep. 2011; we show in Section 3.5.1 that our optimized system is consistent with complete probing. Our goal is to confirm our observations by verifying against real-world events: public archives of BGP routing information and, for large events, public news sources. We next summarize our datasets, how we use BGP, and how we associate an event with specific Autonomous Systems.
Datasets: We use 35 public Internet survey datasets collected from Nov. 2009 to Dec. 2011 [USC12] (S_29w through S_40w). Table 3.2 lists all the datasets we study, and what fraction of each dataset is analyzable. All datasets are available at no cost from the authors and through the PREDICT program, http://www.predict.org. In PREDICT, each dataset has a PREDICT id: PREDICT/USC-LANDER/internet_address_survey_reprobing_it29w-20091102, or the equivalent for different survey numbers and dates.
Each dataset represents two weeks of probing; data is taken from three locations (Marina del Rey, California; Ft. Collins, Colorado; and Keio University, Tokyo, Japan). Each dataset probes all addresses in about 22,370 /24 blocks, where three-quarters of the blocks are chosen randomly from responsive blocks, while one quarter is selected based on block-level statistics [HPG+08]. Since some blocks are selected non-randomly, Section 3.4.5 evaluates bias, finding we slightly underestimate outage rates.

We find that 45–52% of blocks in these datasets provide enough coverage to support analysis (C̄ ≥ 0.1). Of these datasets, most validation uses S_30w (started 2009-12-23), with additional case studies drawn from S_38w (2011-01-12), S_38c (2011-01-27), S_39w (2011-02-20) and S_39c (2011-03-08).
We gather BGP route updates from RouteViews [oO], and BGP feeds at our probing sites using BGPmon [YOB+09].
Relating events and routing updates in time: To find routing updates relevant to a network event, we search BGP archives near the event's start and end times for messages concerning destination prefixes that become unreachable. We search within
Survey                Start Date   Duration (days)   Blocks (Analyzable)
S_29w                 2009-11-02   14                22371 (46%)
S_29c                 2009-11-17   14                22371 (45%)
S_30w                 2009-12-23   14                22381 (47%)
S_30c                 2010-01-06   14                22381 (48%)
S_31w                 2010-02-08   14                22376 (48%)
S_31c                 2010-02-26   14                22376 (49%)
S_32w                 2010-03-29   14                22377 (48%)
S_32c                 2010-04-13   14                22377 (48%)
S_33w                 2010-05-14   14                22377 (48%)
S_33c                 2010-06-01   14                22377 (48%)
S_34w                 2010-07-07   14                22376 (47%)
S_34c                 2010-07-28   14                22376 (47%)
S_35w                 2010-08-18   14                22376 (47%)
S_35c                 2010-09-02   14                22375 (47%)
S_36w                 2010-10-05   14                22375 (48%)
S_36c                 2010-10-19   14                22375 (48%)
S_37w                 2010-11-24   14                22374 (48%)
S_37c                 2010-12-09   14                22373 (48%)
S_38w                 2011-01-12   14                22375 (47%)
S_38c                 2011-01-27   14                22373 (47%)
S_39w                 2011-02-20   16                22375 (52%)
S_39c                 2011-03-08   14                22375 (49%)
S_39w2                2011-03-22   14                22374 (49%)
S_40w                 2011-04-06   14                22922 (47%)
S_40c                 2011-04-20   14                22921 (47%)
S_41w                 2011-05-20   14                40645 (57%)
S_41c                 2011-06-06   14                40639 (57%)
S_42w                 2011-07-26   14                40565 (52%)
S_42c                 2011-08-09   14                40566 (56%)
S_43w                 2011-09-13   14                40598 (53%)
S_43c                 2011-09-27   14                40597 (56%)
S_AnalyzableInternet  2011-09-28    1                2.5M (100%)
S_43j                 2011-10-12   14                40594 (54%)
S_44w                 2011-11-02   14                40634 (57%)
S_44c                 2011-11-16   14                40632 (57%)
S_44j                 2011-12-05   14                40631 (56%)

Table 3.2: Internet surveys used in Chapter 3, with dates and durations. Survey numbers are sequential, with a letter indicating collection location (w: ISI-west in Marina del Rey, CA; c: Colorado State U. in Ft. Collins, CO; j: Keio University, Tokyo, Japan). Blocks are analyzable if C̄ ≥ 0.1.
±120 minutes of these times, a loose bound since our outage detection precision is only ±11 minutes and routing changes can take minutes to converge. We expect to see relevant withdraw messages before event e and announce messages after e. If we find both, we claim that e is fully validated; with just one, we claim partial validation.
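The matching logic just described can be sketched as follows. This is an illustration of the classification, not the actual archive-search tooling; times are in minutes, and BGP messages are simplified to (time, kind, prefix) tuples.

```python
def validate_event(event_start, event_end, bgp_msgs, window=120):
    """Classify an outage event against nearby BGP messages.

    bgp_msgs: (minute, kind, prefix) tuples, kind "withdraw" or "announce".
    Full validation needs a withdraw within +/-window minutes of the
    event's start AND an announce within +/-window of its end; finding
    only one of the two gives partial validation.
    """
    withdraw = any(k == "withdraw" and abs(t - event_start) <= window
                   for t, k, _ in bgp_msgs)
    announce = any(k == "announce" and abs(t - event_end) <= window
                   for t, k, _ in bgp_msgs)
    if withdraw and announce:
        return "full"
    if withdraw or announce:
        return "partial"
    return "none"

msgs = [(95, "withdraw", "192.0.2.0/24"), (300, "announce", "192.0.2.0/24")]
result = validate_event(100, 280, msgs)   # withdraw and announce both nearby
```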
Relating events and routing updates in space: Although the above approach detects outages that happen at the destination, we find many outages occur in the middle of the Internet. Narrowing our search to just destination prefixes therefore overly constrains our search. When our temporal search fails to identify a routing problem, we broaden our search to all ASes on the path, as done by Chang et al. [CGH03] and Feldmann et al. [FMM+04]. We generate an AS path for the destination prefix by searching in RouteViews BGP snapshots. We then search for BGP withdraw and announce messages around the same time as the start and end of our network event. Often the destination search found an announce message; in that case we look here for withdraw messages for an intermediate AS.

Searching intermediate ASes has two disadvantages. First, the search space is much larger than just considering the destination prefixes. Second, RouteViews BGP snapshots are taken every two hours, so we must widen our search to two hours.
3.4.2 Network Event Case Studies
We begin by considering three cases where the root cause made global news, then outages near our collection points, and finally three smaller events. These events are larger than the median-size outages we detect. We make no claims that these events are representative of the Internet in general, only that they demonstrate how events found by our tools relate to external observations. In the next section we validate a random sample of events to complement these anecdotes.
Jan. 2011 Egyptian Internet Outage: Beginning 2011-01-25, the Egyptian people began a series of protests that resulted in the resignation of the Mubarak government by 2011-02-11. In the middle of this period, the government shut down Egypt's external Internet connections.

Our S_38c began 2011-01-27 T23:07 +0000, just missing the beginning of the Egyptian network shutdown, and observed the restoration of network service around 2011-02-02 T09:28 +0000. Our survey covered 19 responsive /24 blocks in the Egyptian Internet, marked (a) in Figure 4.1. We can confirm our observations with widespread news coverage in the popular press [Tim11c], and network details in more technical discussions [Cow11a, Cow11b]. Analysis of BGP data shows withdraws before and announces after the event, consistent with our timing. All Egyptian ASes we probed were out, including AS8452, AS24835, and AS24863. We conclude that our approach correctly observed the Egyptian outage.
Feb. 2011 Libyan Outage: We also examined the Libyan outages of 2011-02-18 to -22 [Cow11c]. This period was covered by S_38c, but our survey contains only one Libyan block, and coverage for that block was too low (about 4 addresses) for us to track outages. Our requirement for blocks with moderate coverage, combined with measuring only a sample of the Internet and Libya's small Internet footprint (only 1168 /24 blocks as of Mar. 2011 [Web11]), means that we sometimes miss outages.
Feb. 2011 Australian Outage: We also observe a significant Australian outage in S_38c. Marked (b) in Figure 4.1, by our observations this outage involved about as many blocks as the Egyptian outage. We can partially validate our outage with BGP, but its root cause is somewhat unclear. We are able to locate these blocks on the east coast of Australia, including Sydney and Brisbane. Private communications [Arm11] and the AusNOG mailing list [Aus11] suggest this outage may be related to mid-January flooding in eastern Australia. However, our survey begins on 2011-01-27, so we only know the outage's end date. The recovery of the network seems consistent with news reports about telecommunications repairs [Tim11a]. Our observations suggest that this Australian outage was about as large and long-lasting as the Egyptian outage, yet the Egyptian Internet outage made global news while the Australian outage got little discussion. The Egyptian outage was more newsworthy both because of its political significance, and because it represented nearly all Egyptian traffic. Australia, by comparison, has eight times more allocated IPv4 addresses than Egypt, so though the Australian outage may be as large as the Egyptian one, it does not have the same country-wide impact. We believe this example shows the importance of our methodology in quantifying the size and duration of network outages.
March 2011 Japanese Earthquake: In survey S_39c, we observe a Japanese Internet outage, as shown in Figure 5.7 marked (f). This event is confirmed as an undersea cable outage caused by the Tōhoku earthquake of 2011-03-11 [Mal11]. Unlike most other outages we observe, both the start of and recovery from this outage vary in time. For most blocks, the outage begins at the exact time of the earthquake, but for some it occurs two hours later. Recovery for most blocks occurs within ten hours, but a few remain down for several days.
Local Outages: In addition to outages in the Internet, outages also happen near our monitors. (We watch for such outages in our data, and confirm them with local network operations.) Survey S_39w shows two such events. In Figure 4.4, event (h) was planned maintenance in our server room; the blue color indicates absence of data. Event (i) was a second planned power outage that took down a router near our survey machines, although probes continued running. Both of these events span all probed blocks, although Figure 4.4 shows only 500 of the blocks. Finally, event (g) is due to temporary firewalling of our probes by our university due to a mis-communication.
These examples show that our methods have some ability to distinguish local from distant outages. They also revealed an interaction of our probing with Linux iptables. In event (i), the number of active connections tracked by iptables overflowed. Such overflow produces random ICMP network-unreachable error replies at the probing host. We filter these errors from our prior data, and have now disabled ICMP connection tracking.
Smaller Events: Finally, we explore three small events in survey S_30w as examples of "typical" network outages. These events are shown in Figure 4.2. Although we find no evidence in the NANOG mailing list, BGP messages do confirm two of them.
Verizon outage 2010-01-05 T11:03 +0000: In Figure 4.2, event (c) is a short outage (about 22 minutes) affecting about 331 /24 blocks. Many of these destinations belong to AS19262, a Verizon AS. Examination of RouteViews BGP archives confirms this event. Examination of the AS-paths of affected blocks suggests that the outage occurred because of a problem at AS701, another Verizon AS, present in the path of all but 0.6% of destinations. It also confirms the duration, with a BGP withdraw-to-announce time of about 20 minutes.
AT&T/Comcast 2010-01-05 T07:34 +0000: In Figure 4.2, event (e) is a 165-minute outage affecting 12 blocks. Again, we confirmed this outage in RouteViews BGP archives. The affected destinations were in AS7132 (AT&T) and AS7922 (Comcast). Routing archives confirm withdraws and returns of these routes, and AS-paths suggest the root cause was in AS7018 (AT&T WorldNet), likely upstream of the destinations.
Mexico outage 2010-12-29 T18:36 +0000: The event labeled (d) in Figure 4.2 corresponds to a large number of destinations in AS8151, a Mexican ISP (Uninet S.A. de C.V.). The event is fairly large and long: 105 blocks for 120 minutes. We were unsuccessful in identifying the root cause of this outage in RouteViews data. This survey pre-dates our local BGP feed, and all RouteViews BGP archives are several ASes from our probing site, suggesting the outage may have been visible to us but not seen at
valid.    with.   ann.   count       outage sizes
no        —       —      31 (62%)    1 to 57, median 4
partial   Yes     —       1 (2%)     24
partial   —       Yes    10 (20%)    1 to 27, median 15
yes       Yes     Yes     8 (16%)    1 to 697, median 21
                         50 (100%)

Table 3.3: Validation of the algorithm, with counts of missing (—) or found (Yes) withdraw and announce messages, for randomly selected events from Survey S_40w. Counts in events; sizes in blocks.
the RouteViews monitors, or that some of these blocks may be using default routing as
described by Bush et al. [BMRU09].
3.4.3 Validation of Randomly Selected Events
Our outage case studies in the prior section were selected because of their importance and so are biased towards larger events. To provide a more careful study of the validity of our approach, we randomly pick 50 events from a total of 1295 events in Survey S_40w and attempt to confirm each using BGP information (Section 3.4.1).

Table 3.3 summarizes our results. We are able to fully or partially confirm 38% of the cases by finding either corresponding BGP withdrawal or announcement messages. Randomly selected events are often small (as confirmed in Section 3.5.2), and it is easier to verify large events. One possible reason smaller events do not appear in the control plane is that smaller networks more often use default routing. Bush et al. describe how default routing can result in "reachability without visibility", as addresses may be reachable without visibility to the BGP control plane [BMRU09]. Our results are consistent with a corollary, "outages without visibility", since outages in default-routed blocks do not appear in BGP. We therefore claim that the 38% figure reflects the incompleteness of BGP and not our detection algorithm; we next use controlled outages to support this hypothesis.
Figure 3.3: Evaluation of controlled outages on detection (bar height) and estimated
duration (color).
3.4.4 Validation of Controlled Outages
Evaluation of random events shows that what we detect is true, but it is silent about what we miss. We next show our system can detect all outages of sufficient duration.

To provide a controlled experiment, we extract probes sent from California to five known /24 blocks in Colorado from our analyzable Internet experiment (Section 3.5.1). Network operators confirm these blocks had no outages on that day. We use real probing data to capture the noise of packet loss and machine reboots; about 2.1% of our individual probes are negative.

We emulate an outage in each target block by replacing positive responses with negative ones for one known period. Our emulated outage starts at a random time between 1 hour after collection start and 2 hours before its end; the start time is thus independent of outage rounds. We vary outage duration from 1 to 40 minutes in steps of 1 minute, with 100 random times for each step.
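The emulation step can be sketched as follows: force every response inside a chosen window negative, leaving the trace's real noise intact elsewhere. The trace here is synthetic (one probe every 33 s with 2% loss), standing in for the real probing data.

```python
import random

def inject_outage(responses, start_s, duration_s):
    """Emulate an outage: force responses negative inside the window.

    responses: list of (time_s, positive) probe records for one block.
    """
    end_s = start_s + duration_s
    return [(t, ok and not (start_s <= t < end_s)) for t, ok in responses]

# Synthetic 24-hour trace: one probe every 33 s, ~2% base loss.
rng = random.Random(42)
trace = [(t, rng.random() > 0.02) for t in range(0, 86400, 33)]
outage = inject_outage(trace, start_s=40000, duration_s=1800)
```

Records outside the window are unchanged, so the injected outage rides on top of realistic packet loss, as in the controlled experiment.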
Figure 3.3 shows the percentage of outages we detect for one block as a function of outage duration. We see that we miss nearly all outages shorter than our probing interval; we space probing out over 11 minutes to be gentle on the target network, creating a low-pass filter over outage observations. As a result, a 5.5-minute outage affecting all addresses appears identical to an 11-minute outage affecting half. We detect all outages
Figure 3.4: Comparing results of emulated outages for 5 CSU blocks (top) and 10 random blocks (bottom).
longer than 21 minutes, and the majority of outages of 15 minutes or longer. Different parameters (Section 4.2.2) could adjust sensitivity, but for full-block outages longer than about twice the probe interval, our approach does not falsely declare a block available.

Colors in Figure 3.3 show our estimates of outage duration. Due to filtering, we consistently underestimate the duration of each outage by half the probe interval.

To confirm that the block we report in Figure 3.3 is representative, we selected four additional blocks at CSU, along with ten randomly chosen blocks from around the Internet. In our datasets, each of these blocks is evaluated as always available across our observation period, although each has a different number of responsive hosts and random packet loss. We then repeat our experiment, artificially injecting outages and evaluating what our algorithms observe.
Figure 3.4a shows the result of controlled outages in all five CSU blocks. To make it easier to compare different blocks, we connect the percentage of estimates for each block with a line rather than plotting 5 separate bars. We see that the trends in outage detection are within a few percent across all blocks, suggesting that the results shown in Figure 3.3 are representative.

To further validate whether our results are stable, we randomly picked 10 /24 blocks that were judged always up and ran the same controlled outage experiment. Figure 3.4b shows this experiment. Here almost all blocks show results similar to Figure 3.4a. One block, 186.102.171/24, has lower outage estimates than the others. Based on examination of a 2-week survey, we believe this block uses dynamically assigned addresses, only about 15% of which are occupied. Therefore we see few responses in our sample (typically only 3 of 20), and variation as addresses are reassigned affects our conclusions. Improving our results for dynamically assigned blocks is ongoing work. We conclude that for responsive blocks our results are quite consistent, while variation in our estimates is greater in sparse and dynamic blocks.
3.4.5 Stability over Locations, Dates and Blocks
We next consider the stability of our results, showing they are independent of prober location and date, and only slightly affected by the survey block selection method.

Probing location can affect evaluation results. Should the probing site's first-hop ISP be unreliable, we would underestimate overall network reliability. Our probing takes place regularly from three different sites, ISI west (marked "w"), CSU (marked "c") and Keio University (marked "j"), each with several upstream networks.

Figure 3.5 indicates ISI surveys with open symbols, CSU with filled symbols, Keio University with asterisks, and the analyzable Internet run (Section 3.5.1) with an inverse open triangle, and it calls out survey location at the top. Visually, it suggests the results
Figure 3.5: Evaluation over 35 different 2-week surveys, plus our analyzable Internet run. Top shows availability; bottom shows Internet events, outages and outage percentage over time. Local outages are shown with dotted lines and are often omitted for scale.
are similar regardless of probing site and for many different random samples of targets. Numerically, variation is low: the mean outage level is 0.33%, with a standard deviation of only 0.1% after local outages are removed. To strengthen this comparison we carried out Student's t-test to evaluate the hypothesis that our estimates of events, outages, and
Quarter                        Mean   Min    Max    q1     q2     q3
1 (stable)                     0.21   0.09   0.34   0.14   0.18   0.28
2 (stable but random)          0.28   0.11   0.53   0.22   0.27   0.31
3 (random, odd third octet)    0.31   0.18   0.61   0.27   0.29   0.36
4 (random, even third octet)   0.29   0.20   0.43   0.24   0.26   0.33

Table 3.4: Outage percentage statistics for the four quarters, from S_29w to S_40w.
Ω̄ for our sites are equal. The test was unable to reject the hypothesis at 95% confidence,
suggesting the sites are statistically similar.
In addition to location, Figure 3.5 suggests fairly stable results over time, with several
exceptions. For example, surveys S29c and S39w each had extended local outages,
for about 41 and 4 hours, respectively, shown as dashed lines affecting outage count and
Ω̄ (they do not change the event estimate because each outage is mapped to a single
network event). After removing local outages, the corrected versions are roughly the
same as the others.
Only three-quarters of blocks in surveys are selected randomly; one quarter is
selected to cover a range of network conditions. To evaluate block selection effects,
we separate each survey's data into quarters and compare the selected quarter against
each of the three randomly chosen quarters. We find that the mean outage rate of the
selected quarter is 0.2% (standard deviation 0.078%), while the other three are 0.29%
(standard deviation 0.09%). Overall outage estimates from surveys appear slightly more
stable (about 0.06% less downtime) than would analysis of a completely random sample.
We next present the analysis to quantify this statement.
Target blocks in each survey can be grouped into four quarters: stable and selected
to represent different characteristics; stable but randomly selected; randomly chosen
each survey, with an odd third octet; and randomly chosen each survey, with an even
third octet. To check whether our results are skewed by block selection, we plot the
Figure 3.6: Downtime percentage over time, for 4 different quarters of our dataset,
from S29w to S40w.
outage percentage of the four quarters over time (Figure 3.6). We also plot the outage
percentage quartiles of all four quarters in the right part of Figure 3.6 (for raw data,
see Table 3.4), showing we are slightly under-estimating the Internet's outages,
as Quarter 1 (stable fixed blocks) has a lower overall outage rate (0.2%, with standard
deviation 0.078%), while the other three quarters' outage rates are around 0.29% (standard
deviation 0.08%).
While this comparison shows a slight bias for the stable-selected blocks, the bias is
small and affects only one quarter of all observed blocks, so our overall conclusions are
only slightly more stable than a random sample would be.
3.4.6 Comparing Accuracy with Other Approaches
We probe multiple or all addresses in a block to evaluate outages. Prior work such as
Hubble has probed a single address in each block, possibly multiple times [KBMJ+08a].
Probing more addresses requires more traffic, but is more robust to probe loss and single-
address outages. We next evaluate the effect of sampling k addresses per block, and the
choice of address for k = 1.
To compare alternatives, we evaluate methods A and B in pairs, treating A as a trial
and B as truth. Analogous to Type-I and Type-II errors, we define a false availability
(fa) as the estimate predicting a reachable block when it should be out, while for a
false outage (fo), the estimate predicts out and the truth is reachable. Similarly, we
define true availability (ta) and true outage (to). We then compute standard information
retrieval terms: precision (ta/(ta + fa)), recall (ta/(ta + fo)), and accuracy ((ta + to)/(ta +
to + fa + fo)).
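These definitions translate directly into code. The sketch below is our own illustration (not the thesis analysis software); it tallies a trial reachability series against a truth series, where True means the block is judged reachable:

```python
def confusion(trial, truth):
    """Count (ta, fa, fo, to) over paired per-round reachability estimates."""
    ta = fa = fo = to = 0
    for a, b in zip(trial, truth):
        if a and b:
            ta += 1      # true availability: both say reachable
        elif a and not b:
            fa += 1      # false availability: trial up, truth out
        elif not a and b:
            fo += 1      # false outage: trial out, truth up
        else:
            to += 1      # true outage: both say out
    return ta, fa, fo, to

def scores(trial, truth):
    ta, fa, fo, to = confusion(trial, truth)
    precision = ta / (ta + fa)
    recall = ta / (ta + fo)
    accuracy = (ta + to) / (ta + fa + fo + to)
    return precision, recall, accuracy

# Toy example: the trial misses one up-round (a single false outage).
truth = [True, True, True, False, True]
trial = [True, True, False, False, True]
print(scores(trial, truth))  # (1.0, 0.75, 0.8)
```

With a false outage but no false availability, precision stays perfect while recall and accuracy drop, mirroring the pattern we report for small k.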
General Sampling
To evaluate the accuracy of a k-sample, we consider the full probing (k = 256) observation
of block availability as ground truth (B), then compare our k-sample approximation
as an estimate (A).
We evaluate k-samples by downsampling our full data; Figure 3.7 shows the precision,
recall and accuracy for this experiment. Precision is always quite good (over
99.6%), showing it is rare to falsely predict the block as reachable, even when sampling
only a few addresses. However, we show below that sampling a single address is less
robust than even a few. The best tradeoff of recall and accuracy is for k from 20 to 40,
where accuracy is off by only 7% (or 4%), but traffic is cut by 92% (or 84%). The errors
are mostly due to false outages, claiming the target is down when a more complete
measurement would show it as reachable.
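The downsampling idea can be sketched in a toy simulation. This uses synthetic blocks and a simplified rule (a block is judged up if any sampled address responds, rather than the responsiveness thresholds used in the thesis), so the numbers are illustrative only:

```python
import random

random.seed(42)

def block_reachable(responding, sample):
    """Simplified rule: judge a block up if any sampled address responds."""
    return any(addr in responding for addr in sample)

# Synthetic ground truth: 1000 up-blocks, each with 1-64 responsive addresses.
blocks = [set(random.sample(range(256), random.randint(1, 64)))
          for _ in range(1000)]

recall = {}
for k in (1, 4, 20):
    hits = sum(block_reachable(resp, random.sample(range(256), k))
               for resp in blocks)
    recall[k] = hits / len(blocks)
    print(f"k={k:2d}: {recall[k]:.0%} of up blocks still judged up")
```

Even in this simplified setting, recall climbs steeply with k while the probing cost grows only linearly, the same tradeoff that makes k around 20 attractive in Figure 3.7.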
Figure 3.7: Precision, recall and accuracy as a function of samples per target block.
(Neither y-axis starts at 0.)
Single Address Per Block and Hubble
Hubble, iPlane and most other prior work in outage detection has used a single target
to represent an entire /24 block. We next compare our system to these prior systems,
quantifying differences in accuracy and coverage. This comparison is difficult because
there are several differences in methodology: how many targets to probe per block;
which address or addresses to probe; and whether or not to use retries. We examine
bounds on each of these factors in Table 3.5, and compare specifically Hubble's approach
(single .1 address, with retries) with our approach (top 20 addresses, no retries). Note
that we use strategy all as ground truth, which probes all addresses without retries. We
have shown all is both complete and accurate in a previous technical report [QH10a].
We discussed the effect of the number of targets in Section 3.4.6, identifying a best
tradeoff of sampling and accuracy (Figure 3.7). Prior work on IP hitlists examined the effects
strategy          single   hitlist  us       Hubble
samples per /24   1        1        20       1
which addresses   .1       top      top      .1
retries           no       no       no       yes
precision         99.97%   99.97%   99.71%   99.98%
recall            56.3%    79.1%    91.3%    61.0%
accuracy          56.4%    79.1%    92.3%    61.1%
Table 3.5: Comparing accuracy of different strategies used to estimate outages. Datasets:
S30w and S46c.
of which addresses should be probed [FH10]. Here we add a comparison of fixed (single)
vs. top (hitlist), and we show that top is 22.7% more accurate than probing only a
fixed .1 address (79.1% vs. 56.4% in accuracy, Table 3.5). This shows that probing only
the .1 address is not accurate enough for outage detection, and that careful selection of
which address to probe can improve accuracy significantly.
Using retries should help with singleton packet loss, therefore single (.1, no retries) values
are underestimates of Hubble. However, retries do not help with medium-term
host failure, such as when the single target is taken down for maintenance. To more
accurately evaluate the effect of retries, we run a specific experiment to reproduce Hubble
(probing .1 with retries at 2-minute intervals [KBMJ+08a]), side-by-side with a
recent survey S46c, which we sample to generate our operational system, using
complete data as ground truth. We find that retrying the same address multiple times is
slightly better than no retries (Hubble vs. single, 4.7% better).
Our side-by-side experiment pulls these factors together, comparing exactly Hubble's
configuration (single, .1, with retries) with ours.
We see a 31% improvement in accuracy (us vs. Hubble), consistent with our above
bounds.
Probing Rate and Coverage: We have shown that we improve accuracy; in addition,
we provide better coverage at about the same aggregate probe rate. Hubble coordinates
probes to each block from 30 vantage points, sending 0.5–3 probes/minute on
average [KBMJ+08a] (ignoring retries). We probe more addresses, but from only one
site, thus only 1.8 probes/minute (20 addresses, 1 site, 11-minute cycles).
Finally, our requirement of 20 responsive addresses per block is much stricter than
Hubble's requirement of one; however, our hitlist selection is much more flexible. We
evaluated coverage using a full census from Jan. 2012 (C45w, scaled for outages), finding
that Hubble's .1 covers 2.2M /24s, while our top-20 covers 2.5M, 14% more blocks.
(We see similar results in observations from another site, and two months earlier, with
C45c and C44w.)
3.5 Evaluating Internet Outages
We next apply our approach to measure Internet outages. We look at this data in two
ways, first exploring event and outage durations, then examining network-wide stability
by exploring marginal distributions (Ω̄_B and Ω̄_I) across Internet space and time.
After correcting for local outages, we believe the observations in this section reflect
Internet-wide stability, within the limits of measurement error. Since our vantage points
are well connected and we remove local outages, our estimates approximate the Internet-
core-to-edge reliability. We make this claim because we know our observations are
stable across location and time (Section 3.4.5) and across all surveys in this section.
3.5.1 Evaluation over the Analyzable Internet
On 2011-09-28 we probed the entire analyzable Internet, targeting 20 samples in 2.5M
blocks as described in Section 3.3.4. Somewhat surprisingly, this experiment drew no
complaints, perhaps because it was shorter than our 2-week surveys. Data processing
took 4 hours, both to visualize the results (as an image of 2.5M × 134 pixels, broken into
20 tiles) and to detect the 946 routing events we observe. The overall outage rate is
consistent with our survey data (Section 3.5.3): 0.3% outage area, or 99.7% availability.
Figure 3.8 shows selected portions of outages in this survey, as they are well correlated
and affect many blocks. (We omit most of the plot; a complete plot at 600 dots-
per-inch would be more than 375 pages long.)
Figure 3.8a shows an outage in a Brazilian AS (AS26615) from 2011-09-02 T03:34
+0000 for 25 rounds (about 4.5 hours), affecting more than 350 /24 blocks. We are able
to partially verify this outage with BGP control-plane messages.
The other three parts of this figure show outages affecting more than 800 /24 blocks
in southern China (Figures 3.8b and 3.8c), including 35 /24 blocks in a mass-transit
Internet (as part of Figure 3.8d). We did not observe evidence for these outages in BGP,
but did correlate their timing and location with news reports confirmed in international
media.
3.5.2 Durations and Sizes of Internet Outages and Events
We first consider the durations and sizes of block-level outages and network-wide events
(Figure 3.9, left two plots).
Beginning with outages (Figure 3.9a), we see that half to three-quarters of outages
last only a single round. Our current analysis limits precision to one round (11 minutes),
but possible future work could examine individual probes to provide more precise timing.
All surveys but Survey S39w have the same trend; Survey S39w diverges due to its
local outages (dotted line S39w), but joins the crowd when they are removed. We also see
that 80% of outages last less than two hours. While there is no sharp knee in this distribution,
we believe this time period is consistent with human timescales where operators
detect and resolve problems.
(a) Portion 1 (b) Portion 2 (c) Portion 3 (d) Portion 4
Figure 3.8: Selected slices of outages in the analyzable Internet study. Colored regions
show 4.5–7.3 hours (25–45 rounds) of the 24-hour measurement (133 rounds). Each
x axis spans 24 hours. Subgraphs on the y axis show the marginal distribution
(green line) and overall block responsiveness (red dots).
Network events group individual outages by time, presumably due to a common
cause. Figure 3.9b shows event durations, computed as the mean duration of each
event's component outages. This figure shows that many single-round outages cluster
into single-round events, since about 40% of events last one round instead of 50–75%
Figure 3.9: Cumulative distributions of outage and event durations (left two): (a) block
outage durations, (b) network event durations. Marginal distributions of outages, by
round and block (right two): (c) marginal distribution by rounds, Ω̄_I, (d) marginal
distribution by blocks, Ω̄_B. The CDFs of (a) show only a portion of the graph; the
CDFs of (c) start at 80%. Datasets: Surveys S30w, S38c, S38w, S39c, S39w. The dotted
lines are Survey S39w without removing local outages.
of outages. With less strict clustering (δ = 10 rounds instead of δ = 2), this trend grows,
with only 20% of events lasting one round.
About 60% of events are less than an hour long, but there is a fairly long tail out to
the limits of our observation (2 weeks, or 20,000 minutes). This long tail is similar to
distributions of event durations of Feamster et al. [FABK03] and Hubble [KBMJ+08a].
Feamster et al.'s very frequent probes (1–2 seconds between probes) in a mesh of computers
allow them to find 100% of events more than 10 s long, but the very high probing
rate is only acceptable within a mesh of friendly computers. We cannot detect such
short events, but we see the same long tail and our approach can scale to the whole Internet.
Hubble favors large events, claiming to find 85% of events longer than 20 minutes
and 95% of events longer than 1 hour. Our system captures all events longer than about
20 minutes (twice our probing interval), and about half of events from 10–20 minutes
(Figure 3.3); more accurate than Hubble, particularly for shorter events.
Because local outages correspond to a single event, Survey S39w resembles the other
surveys both with and without removal of local outages, and Survey S39w is indistinguishable
from S39w'.
Finally, we examined event sizes. Extending Figure 3.9, Figure 3.10 shows the
distribution of network event sizes. Almost all events are very small: 62% of events
affect only a single block, and 95% are 4 blocks or smaller. Nevertheless, a few large
outage events do occur, as discussed in Section 3.4.2.
3.5.3 Internet-wide View of Outages
We next shift our attention to the Internet as a whole. How often is a typical block
down, and how much of the Internet is inaccessible at any given time? To understand
these questions, Figure 3.9 (right two plots) shows the marginal distributions of outages
by round and block.
First we consider the distribution by rounds in Figure 3.9c. As expected, we see the
vast majority of the blocks in our survey are always up: from 92 to 95% of blocks have
no outages over each two-week observation. The exception is Survey S39w, where two
local outages partitioned the probers from the Internet for about two hours. When we
remove local outages, this survey becomes consistent with the others. About 2% of blocks are
Figure 3.10: Cumulative distributions of network event sizes, from Surveys S30w, S38c,
S38w, S39c, S39w.
out once (the step at 11 minutes, one round), and the remaining tail follows the distribution
of Figure 3.9c.
Turning to space, Figure 3.9d shows marginal distributions of Ω̄_B. Survey S39w is
again an outlier due to large local outages, but it resembles the others when local outages
are removed.
Considering Figure 3.9d as a whole, we see that almost always, some part of the
Internet is inaccessible. At any time, typically 20 to 40 blocks are unreachable in our
survey. This result is consistent with our observations from Figure 3.5 that show 0.33%
of the Internet is out, averaged over entire surveys, with a standard deviation of 0.1%.
Our outage estimate is much lower than Paxson's (up to 3.3% outages), suggesting much
greater stability than in 1995. It confirms the mesh study in RON [ABKM01] with a much
larger number of edge networks. Finally, we see a set of unusually large outages in
Survey S38c, where the 50%ile outage is around 38 blocks, but the 80%ile is at 63 blocks.
We discuss the root causes for these outages in Section 3.4.2 and Figure 4.1.
Highly reliable networks are often evaluated in terms of availability, and compared
to the "five nines" goal of telephone networks. We plot availability in the top panel of
Figure 3.5, seeing that overall, the Internet is up about 99.7% of the time, for about 2.5
nines of availability, suggesting some room for improvement.
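"Nines" of availability are the negative base-10 logarithm of unavailability; a quick check of the 2.5-nines figure, assuming 99.7% availability:

```python
import math

availability = 0.997
# "Nines" = -log10(unavailability): 99.9% -> 3 nines, 99.99% -> 4 nines, etc.
nines = -math.log10(1 - availability)
print(round(nines, 1))  # 2.5
```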
The above analysis is based on surveys of 1% of the responsive Internet. We can
confirm this result with our operational system scanning the entire analyzable Internet
(Section 3.5.1), where we observed that 0.3% of analyzable IPv4 space was out on average.
Our analysis of Internet-wide outages is preliminary, but it illustrates the utility of
automated methods for detecting and quantifying outages in the data plane.
3.6 Conclusions
Researchers have studied Internet outages with control- and data-plane observations for
many years. We show that active probing of a sample of addresses in responsive /24
blocks provides a powerful new method to characterize network outages. We validate
this approach with both case studies and random samples, verifying that our results are
stable and more accurate than prior work. With our system, a single PC can observe outages
to destinations across the entire analyzable IPv4 Internet, providing a new approach to
study Internet-wide reliability and typical outage size and duration.
This chapter also supports the thesis statement as another example of sampling
and aggregation in space. Specifically, we show that sampling and aggregation in the
Internet IPv4 address space provides an efficient method to characterize outages in the
Internet edge. Our bottom-up approach first samples a wide range of blocks (home
users, server farms, universities and some businesses), and samples inside each block to
reduce the probing rate. It then analyzes outages in each block and aggregates them with
greedy clustering algorithms for visualization and characterization. With our approach, we
enable the correlation of otherwise scattered outage information, providing an efficient
way to study edge Internet reachability as a whole. With these sampling and aggregation
techniques, we are able to study a large fraction of the Internet edge, consisting of 2.5M
blocks. We find new knowledge of the Internet by showing the characteristics of typical
outages, and report overall availability measures.
The previous chapter supports our thesis statement with aggregation in the time dimension.
This chapter supports it with another example in the space dimension. In the next chapter,
we will show another useful aggregation mechanism to support our thesis: aggregation
by visualization.
Chapter 4
Visualizing Sparse Internet Events:
Internet Outages and Route Changes
In this chapter, we present a visualization technique that is useful in the analysis of sparse
Internet events. This chapter serves as a form of aggregation and supports the thesis as
an example, because aggregation by visualization makes finding sparse but correlated
events more efficient and effective. It uses simple clustering and effective visualization
for both manual inspection and automated analysis. We use the visualization techniques
introduced here in other chapters of the thesis too (Chapters 3 and 5).
Part of this chapter was published in WIV 2012 [QHP12c].
4.1 Motivation for Visualizing Sparse Internet Events
Researchers and network operators must interpret large amounts of network data each
day to understand and manage their networks. Many different types of events happen
in the Internet: networks become accessible and inaccessible [Tim11c, Mal11,
DSA+11, MIP+06a, KBMJ+08a, SS11a], congestion occurs and subsides, routes
change [LMJ97, LABJ00], spam campaigns and denial-of-service attacks wax and
wane [GHW+10, LYL08]. In many cases, these events have common root causes: a
shared router that fails [TLSS10], a common change to software or a configuration
[MWA02], or a distributed botnet with common external control.
One can infer common root causes of network events by studying their occurrence
over space and time. By space, we mean the Internet topology or address space, since
root causes are often associated with particular routers, Autonomous Systems (ASes),
or address blocks. By time, we mean events can be reduced to discrete times or ranges,
like network outages or bursts of spam messages. Events that are continuous (such as
degree of congestion) are not our primary focus, but could be studied by looking for
threshold changes.
Our goal is to detect repeated correlations in time and space. We assume that events
that are repeated in time suggest a common root cause. We wish to know how widely
correlated they are in space to understand the potential extent of that cause. While
correlation does not guarantee a cause, and the step from correlation to causation is
necessarily specific to each event, repeated correlation can help network operators narrow
the search space of problems. We intentionally make few assumptions about network
topology, considering each observation independently and not considering topological
hints offered by AS paths or address structure. We choose this "hands-off" approach
since it can reveal hidden correlations: cases where networks with different addresses
and AS paths share potentially related failure causes.
We propose a clustering algorithm that groups events that happen at similar times in
network address space, supporting a two-dimensional visualization that reveals patterns
(Section 5.3).
4.1.1 Contributions
The contribution of this chapter is to show that simple clustering is helpful in determining
correlated network events. We support this claim with examples of such events
for network outages (Section 4.3) and for routing changes (Section 4.4). We find that
clustering reveals the size and dynamics of network outages, and our approach to clustering
routing changes provides evidence of correlations suggesting possible shared root
causes that are not visible in AS-path data alone.
4.1.2 Relation to Thesis
This work serves as another useful aggregation mechanism in our thesis. The visualization
techniques in this chapter help with the aggregation and analysis of sparse Internet
events: we enable both manual inspection and automated findings. Aggregation by
visualization makes both manual analysis (identifying large events against a noisy background)
and automated analysis (marginal distributions) more efficient. As a general research
contribution, our visualization technique provides an interface for timeseries data analysis.
We use simple clustering in this chapter to feed the visualization framework, and
the technique here is used in several other chapters (Chapters 3 and 5).
4.2 Visualizing Correlated Events
We begin by describing how we identify clusters in a timeseries ω_b(i) for an array of
blocks b and discrete times (or rounds) i. Each element of ω_b(i) takes a binary value
indicating the presence or absence of an event. We later apply this method to two
datasets: network outages (Section 4.3), where blocks are /24 address prefixes with 11-
minute rounds; and network route changes (Section 4.4), where blocks are variable-size
address prefixes from BGP with 2-hour rounds.
4.2.1 Clustering Visualization of Network Data
Our simple clustering algorithm groups the network event timeseries (ω(·)) in two
dimensions: time and space (Algorithm 3). We order blocks based on Hamming distance.
For blocks m and n, with binary-valued timeseries ω_m(i) and ω_n(i), we define
the distance:

d_h(m, n) = Σ_i |ω_m(i) − ω_n(i)|

Perfect temporal correlation occurs if d_h(m, n) = 0.
Algorithm 3 Clustering of blocks for visualization
Input: A: the set of blocks, with ω() event timeseries
Output: B: reordered list of blocks, by distance
  start with block m ∈ A with smallest Σ_i ω_m(i) (number of events)
  A = A \ {m}
  B.append(m)
  while A ≠ ∅ do
    for all n s.t. d_h(m, n) ≤ δ do
      A = A \ {n}
      B.append(n)
    end for
    // pick the next most similar block:
    find m′ s.t. d_h(m, m′) ≤ d_h(m, n) ∀n ∈ A
    A = A \ {m′}
    B.append(m′)
    m = m′
  end while
  return B
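A minimal Python rendering of this greedy ordering might look like the sketch below (our own illustration, not the thesis implementation; `omega` maps block ids to binary timeseries, `delta` stands in for the closeness threshold, and ties are broken by sorting ids to keep the result deterministic):

```python
def d_h(wm, wn):
    """Hamming distance between two binary event timeseries."""
    return sum(abs(a - b) for a, b in zip(wm, wn))

def cluster_order(omega, delta=2):
    """Greedily reorder blocks so temporally-similar blocks are adjacent.

    omega: dict mapping block id -> binary event timeseries (list of 0/1).
    Starts from the quietest block, pulls in every block within delta of
    the current one, then hops to the nearest remaining block.
    """
    remaining = set(omega)
    order = []
    m = min(sorted(remaining), key=lambda b: sum(omega[b]))  # fewest events
    remaining.discard(m)
    order.append(m)
    while remaining:
        close = [n for n in sorted(remaining)
                 if d_h(omega[m], omega[n]) <= delta]
        for n in close:
            remaining.discard(n)
            order.append(n)
        if not remaining:
            break
        m2 = min(sorted(remaining), key=lambda n: d_h(omega[m], omega[n]))
        remaining.discard(m2)
        order.append(m2)
        m = m2
    return order

# Blocks A and C share an outage at rounds 2-3; B has an unrelated event.
omega = {"A": [0, 0, 1, 1, 0], "B": [0, 1, 0, 0, 0], "C": [0, 0, 1, 1, 0]}
print(cluster_order(omega, delta=0))  # ['B', 'A', 'C']
```

Note that A and C end up adjacent in the output, so their shared outage would appear as one visual band in the rendered image.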
Since network events often require some time to propagate [LABJ00], we consider
blocks with a small variance of event times (less than a parameter δ) to be close. This
approach may fail if there are two unrelated events with similar timing, but we believe
that timing alone is often sufficient to correlate larger events in today's Internet, provided
we use a conservative δ.
Figure 4.1: The 500 largest outages of Survey S38c; x axis: time, y axis: address space
(blocks). Colors represent countries. Subgraphs on the x and y axes show marginal
distributions (green line) and overall block responsiveness (red dots). The full interactive
graph is available as Supplement 1 in [QHP12b].
Survey                 Start Date   Duration (days)   Blocks   Analyzable
S30w                   2009-12-23   14                22381    10629
S38c                   2011-01-27   14                22373    10553
S39w                   2011-02-20   16                22375    11585
S39c                   2011-03-08   14                22375    10955
whole Internet study   2011-09-28   1                 —        2.5M
Table 4.1: Internet surveys used in this chapter, with dates and durations. Survey numbers
are sequential, with a letter indicating collection location (w: ISI-west in Marina del
Rey, CA; c: Colorado State U. in Ft. Collins, CO; j: Keio University, Fujisawa, Japan).
Table 4.1 lists the survey datasets we use in this chapter. Figure 4.1 shows the result
of visualization clustering for network outages in Survey S38c. This dataset uses a study
of routing outages in about 10,600 blocks, as described later in Section 4.3.1. The x-axis
is time; each row shows the ω_b downtime timeseries for a different /24 block b. Due to
space constraints, we plot only the 500 blocks with the most outages; we provide the full
graph as Supplement 1 [QHP12b]. Color is keyed to the country to which each block is
allocated.
As an example of what clustering shows, we see clusters of blocks that have
near-identical outage end times. The cluster labeled (a) covers 19 /24s that are down for
the first third of the survey; it corresponds to the Feb. 2011 Egyptian Internet shutdown.
The cluster labeled (b) covers 21 /24 blocks for a slightly longer duration; it is an outage
in Australia concurrent with flooding on the eastern coast. Since all large countries
have disjoint address space, our clustering algorithm is essential to identify these large
events; they are dispersed and therefore invisible if one views the data sorted by address
or similar factors. Beyond this one example, Section 4.3.2 and Section 4.4.2 provide a
more detailed discussion of insights from visualization.
Performance and Alternatives: Our clustering algorithm is a simple greedy algorithm;
its performance is O(b²) for b blocks. Although we would prefer a faster algorithm,
the largest possible b for IPv4 is 2^24, a value within the reach of current computers.
We have done clustering for 2.5M blocks; processing time is about 35 minutes on a
single core of a 2.2GHz Intel Xeon CPU.
We considered other standard clustering algorithms, including k-means and hierarchical
agglomerative clustering, but found neither suitable for our problem. The k-means
algorithm cannot be used because we do not know beforehand how many clusters k exist.
Hierarchical agglomerative clustering has runtime O(b³), pushing its performance out of
reach for our larger datasets.
Algorithm 4 Finding correlated events
Input: O: the set of all outages in a survey
Output: E: the set of network outage events, each containing one or more outages
Parameters: δ: the threshold to decide if two outages belong to the same event
  while O ≠ ∅ do
    find the first occurring outage o ∈ O
    e = {p : ∀p ∈ O, s.t. d_e(o, p) ≤ δ}
    O = O \ e
    E = E ∪ {e}
  end while
  return E
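Algorithm 4 can be sketched in a few lines of Python. For illustration we interpret d_e as the gap between outage start rounds; this is an assumption of the sketch, as the thesis defines its own outage distance:

```python
def find_events(outages, delta=2):
    """Group outages into events: each event seeds from the earliest
    remaining outage and absorbs every outage starting within delta rounds.

    outages: list of (block_id, start_round) tuples.
    """
    remaining = sorted(outages, key=lambda o: o[1])  # earliest first
    events = []
    while remaining:
        seed = remaining[0]
        event = [p for p in remaining if abs(p[1] - seed[1]) <= delta]
        remaining = [p for p in remaining if p not in event]
        events.append(event)
    return events

# Two near-simultaneous outages form one event; a later one stands alone.
outages = [("a", 10), ("b", 11), ("c", 50)]
events = find_events(outages, delta=2)
print(len(events))  # 2
```

Each event thus collects outages that plausibly share a root cause by timing alone, which is what the visualization then lays out as adjacent rows.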
4.2.2 Choice of the Closeness Threshold
Our clustering algorithm depends on δ, the threshold determining when blocks are close
enough to cluster. We want to use a conservative δ, to be safe when declaring two blocks
close in temporal behavior. We use δ = 2 rounds (22 minutes for outages, 4 hours
for route changes) in clustering events. We have also studied a much larger δ = 10 (110
minutes for outages), finding that the main difference is a tendency to group more single-round
events together [QHP12a] (omitted due to space).
4.2.3 Marginal Distributions
In addition to basic clustering, we find the marginal distributions are important to
characterize the size of an event relative to the whole network. To evaluate events over the
Internet as a whole, we next define statistical measures of how many blocks are
experiencing events, and for how long.
Figure 4.2, with data from S30w, shows an example of marginal distributions (full
figure as Supplement 5 in [QHP12b]). We see outages that affect many blocks for a
short period (event (c) here, about 20 minutes), while others like (d) and (e) affect fewer
blocks but for longer periods of time (here 2 to 3 hours).
Given N_b blocks and N_i times in an observation, the marginal distributions are the
time- and space-specific sums:

Ω̄_I(i) = Σ_{b=1}^{N_b} ω_b(i)        Ω̄_B(b) = Σ_{i=1}^{N_i} ω_b(i)
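In code, with ω stored as a list of per-block binary timeseries, both marginals are simple column and row sums (a sketch of the sums above, not the thesis software):

```python
def marginals(omega):
    """omega: list of per-block binary timeseries, all of the same length.
    Returns (omega_bar_I, omega_bar_B): per-round and per-block event sums."""
    n_i = len(omega[0])
    omega_bar_I = [sum(row[i] for row in omega) for i in range(n_i)]  # by round
    omega_bar_B = [sum(row) for row in omega]                         # by block
    return omega_bar_I, omega_bar_B

# Three blocks over four rounds; 1 marks an event (e.g., block out).
omega = [[0, 1, 1, 0],
         [0, 1, 0, 0],
         [0, 0, 0, 0]]
bar_I, bar_B = marginals(omega)
print(bar_I)  # [0, 2, 1, 0]  blocks with events at each round
print(bar_B)  # [2, 1, 0]     rounds with events for each block
```

Dividing bar_I by the number of blocks and bar_B by the number of rounds gives the normalized values shown in the plot subgraphs.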
Figure 4.2: The 900 largest outages of Survey S30w; x axis: time, y axis: address space
(blocks). Colors represent countries. Subgraphs on the x and y axes show marginal
distributions (green line) and overall block responsiveness (red dots). The full interactive
graph is available as Supplement 5 in [QHP12b].
We normalize Ω̄_I(i) by N_b and Ω̄_B(b) by N_i in the subgraphs of some plots (such as
Figure 4.1).
We show Ω̄_I(i) along the x-axis. In our S30w data (Supplement 5 [QHP12b]), we see
bumps in Ω̄_I(i) late in the day on 2009-12-29, and midday on 2010-01-05. We show
Ω̄_B(b) as the solid green line to the left of the y-axis. Because we sort by degree of
outages, the largest Ω̄_B(b) appear at the bottom of the graph. (For network outages, we
also show the responsiveness of the target block in the margin, since sparse blocks can
give misleading outage values. They appear as a speckle of red dots.)
When blocks are different sizes, as with route changes, we can compute the marginal
distribution Ω̄_I(i) in terms of numbers of prefixes, or weighted by numbers of addresses.
(Currently we show unweighted values.)
The above metrics are useful to characterize the degree of network events in today's
Internet. We consider long-term trends and validation of these metrics as part of outage
computation [QHP12a].
4.2.4 Handling Large Images
The IPv4 address space covers millions of blocks, and events can span large parts of
that space. Even at one pixel per block and round, visualizations are large: 20k×1k
for a 2-week outage survey; 2.5M×130 for a 24-hour, whole-Internet outage study (Sup-
plement 6 in [QHP12b]); and 130k×360 for a 1-month routing change study. These
large images quickly become difficult to view in traditional tools, where viewers typ-
ically assume the image fits in memory, and more robust tools (like Photoshop) are
encumbered with editing. We have therefore implemented a custom, Google-maps-style
web-based browser to make them more accessible. We use OpenLayers and the "slippy
map" from Open Street Map [Lay12]. Interactive examples of our visualizations are on
the web in our supplemental data [QHP12b].
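To give a rough sense of scale, a sketch (our illustration, assuming the 256-pixel tile size conventional for Open Street Map slippy maps) of how many tiles and zoom levels such an image needs:

```python
import math

# Sketch: sizing a "slippy map" tile pyramid for a large event image.
# 256x256-pixel tiles are assumed (the usual OSM convention).
TILE = 256

def pyramid_levels(width, height):
    """Zoom levels needed so the coarsest level fits in one tile."""
    return max(1, math.ceil(math.log2(max(width, height) / TILE)) + 1)

def tiles_at_full_res(width, height):
    """Tile count at the most-detailed zoom level."""
    return math.ceil(width / TILE) * math.ceil(height / TILE)

# The 24-hour whole-Internet outage image is roughly 2.5M x 130 pixels:
levels = pyramid_levels(2_500_000, 130)
ntiles = tiles_at_full_res(2_500_000, 130)
```

Pre-cutting the image into such a pyramid is what lets the browser fetch only the visible tiles rather than load the whole image into memory.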
4.3 Visualizing Network Outages
Our first application is to visualize network outages. We use existing outage information
derived from active probing [QHP12a]; we describe this source briefly below. Visual-
ization is useful to characterize both large and small outages, and evaluate outage onset
and recovery.
4.3.1 Data Sources: Detecting Outages
We visualize outage data covering from 10k to 2.5M /24 blocks from two data sources.
Outage Detection Through Probing: We draw on probing data from two sources.
First, we consider Internet address surveys: active probing of all addresses in a sample
of about 22k /24 address blocks in the IPv4 address space for two weeks [HPG+08].
Second, we probe 20 addresses in each /24 block for the entire measurable IPv4 address
space (blocks with addresses that will respond), 2.5M blocks, for 24 hours [QHP12a].
In both cases, addresses are probed in a pseudo-random order, spreading probes out
over an 11-minute round. We are careful to consider clock drift when mapping probes
to rounds. We observe similar results from data taken from locations in California,
Colorado, and Japan.
Detecting block-level outages: The overall responsiveness of block b at round i
is defined as the fraction of responding addresses. We watch the change of the overall
responsiveness at round i against recent behavior, and conclude an outage starts if we see
a threshold-determined, dramatic drop in responsiveness (likewise for outage ends, with
a dramatic increase of responsiveness). The output of this step is the binary timeseries
ω_b(i).
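A minimal sketch of this style of detection (our illustration; the `window` and `drop` thresholds here are placeholders, not the calibrated values from [QHP12a]):

```python
# Sketch: mark an outage start when responsiveness drops sharply
# against recent behavior, and an end on a comparable rise.

def outage_timeseries(resp, window=4, drop=0.5):
    """resp: fraction of addresses responding per round.
    Returns omega_b: 1 while the block is judged down, else 0."""
    omega, down = [], False
    for i, r in enumerate(resp):
        recent = resp[max(0, i - window):i]
        baseline = sum(recent) / len(recent) if recent else r
        if not down and baseline > 0 and r < drop * baseline:
            down = True            # dramatic drop: outage starts
        elif down and r >= drop * baseline:
            down = False           # dramatic rise: outage ends
        omega.append(1 if down else 0)
    return omega

resp = [0.6, 0.62, 0.58, 0.0, 0.0, 0.61, 0.6]
omega = outage_timeseries(resp)
```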
Details of data collection and validation of the method are outside the scope of this
chapter on visualization, but are available elsewhere [QHP12a]. In principle, our visu-
alization methods can be applied to other sources of outage input, such as that from
iPlane [MIP+06a], Hubble [KBMJ+08a], or localized methods [SS11a]; such integra-
tion is future work.
4.3.2 Learning from Outage Visualization
Visualization has been important in developing our outage detection methods. We use
visualization to quickly identify both large and typical events and to gain intuition about
the dynamics of outages. Visualization is also important to diagnose problems in data
collection and analysis. We describe each of these cases below.
Identifying Large Events From Background: The most important benefit of visu-
alization is to provide an intuitive view of a large amount of data: quickly scanning a
two-week survey allows one to “eye-ball” about 20M data points, yet large events and
trends jump out from noisy background data.
As one example, the region marked (a) in Figure 4.1 corresponds to the Egyptian
Internet outage in response to protests that resulted in the resignation of the Mubarak
government by 2011-02-11 [Tim11c]. This survey began 2011-01-27 T23:07 +0000,
just missing the beginning of the Egyptian network shutdown, and observed the restora-
tion of network service around 2011-02-02 T09:28 +0000. Our survey covered 19
responsive /24 blocks in the Egyptian Internet (about 1% of responsive networks in
Egypt). The use of country-specific coloring helps this outage stand out in our visual-
ization.
A second interesting example is a significant Australian outage in the same dataset,
marked (b) in Figure 4.1. We are able to locate these blocks in the east coast of Australia,
including Sydney and Brisbane. Private communications [Arm11] and the AusNOG
mailing list [Aus11] suggest this outage may be related to mid-January flooding in east-
ern Australia. The recovery of the network seems consistent with news reports about
telecommunications repairs [Tim11a]. Visualization is important in this case to demon-
strate that the outage was about as large and long-lasting as the Egyptian outage, yet the
Egyptian Internet outage made global news while the Australian outage got little discussion.
The Egyptian outage was more newsworthy both because of its political significance,
and because it represented nearly all Egyptian traffic. Australia, by comparison,
has eight times more allocated IPv4 addresses than Egypt, so though the Australian outage
may be as large as the Egyptian one, it does not have the same country-wide impact.
Visualization was the primary reason this outage came to our attention, as there was
minimal news coverage, and nearly none that emphasized network problems.
Figure 4.3: The 500 largest outages in Survey S39c, x axis: time, y axis: address space
(blocks). Colors represent countries. Subgraphs on X and Y axis show marginal distri-
butions (green line) and overall block responsiveness (red dots). Full interactive graph
available as Supplement 2 in [QHP12b].
Although our visualization methods help large events to stand out, and can provide
an overview of small events, ultimately automated tools are necessary to consistently
find and characterize events. We believe the large events described here show the use-
fulness of our visualization, but we refer interested readers to our technical report for
discussion of tools to quantify large and small outages, and for analysis of a random
sample of events [QHP12a].
Understanding Dynamics of Outages: In addition to presence of outages, visu-
alization is ideal to reveal the dynamics of complicated outages. For example, in Survey
S39c, we see Japanese network outages related to the Tōhoku Japanese earthquake of
2011-03-11 [Mal11], marked (f) in Figure 5.7. Unlike most other outages we observe,
both the start and recovery from this outage vary in time. For many blocks, the out-
age begins at the exact time of the earthquake, but for some it occurs two hours later.
Recovery for most blocks occurs within ten hours, but a few remain down for several
days. Analysis by IIJ provides details about Japanese network outages and recoveries
from the perspective of a network operator [CPBW11], confirming our observations.
The contribution of our visualization method is to provide some information about the
impact of this natural disaster on the network, but observable externally by non-experts
on Japanese network topology.
Quantifying Typical Outages: In addition to major events, visualization also helps
characterize typical network behavior. We can see this partly in the scattering of outages
in Figure 4.1. We can also see that outages come in different "shapes": many are small
and short, some are of long duration, like (a) and (b) in this figure, and others are "taller"
(affecting more networks), like (g) in Figure 4.4.
We have found ground truth for several regional outages of different shapes. Survey
S30w, omitted here for space but available as Supplement 5 in [QHP12b] and as
Figure 3 in [QHP12a], shows three regional network outages: (c) a Verizon outage
affecting many (331) /24 blocks for a short time (22 minutes); (d) an AT&T/Comcast
outage affecting fewer (12) blocks for a longer time (165 minutes); and (e) a large Mexico
outage affecting 105 blocks. These three events show how visualization quickly char-
acterizes the different shapes of typical events, from tall events like (c) that affect many
networks and multiple countries, to fatter, longer outages like (d) and (e). It also shows
the value of marginal distributions; both of these significant regional events show up as
bumps in Ī running along the x-axis.
Detecting Problems in Local Networks and Data Collection: Finally, we have
found visualization vital to gaining confidence in our measurement approach by reveal-
ing problems in our network and software. Figure 4.4 (from Survey S39w) shows
examples of three problems we detected with visualization and then corrected.
First, event (g) shows a long-term outage of over 200 networks, all at USC. In diag-
nosing this problem, we discovered that USC’s network operators chose to block our
probes, in spite of pre-authorization. Visualization helped us detect and repair this mis-
understanding.
Figure 4.4: The 500 largest outages in S39w, x axis: time, y axis: address space (blocks).
Colors represent countries. Subgraphs on X and Y axis show marginal distributions
(green line) and overall block responsiveness (red dots). Full interactive graph available
as Supplement 3 in [QHP12b].
Events (h) and (i) both show outages across all monitored blocks (although this
visualization shows only 500 of them), corresponding to planned maintenance activi-
ties. Event (h) was planned maintenance in our server room; the blue color indicates
absence of data because we temporarily suspended data collection. Event (i) was a sec-
ond planned power outage that took down a router near our survey machines, although
probes continued running. More subtly, the smattering of white "up" blocks in event
(i) shows a data collection problem. The combination of actively probing 20k blocks,
and all probes timing out due to a powered-off second-hop router, results in the over-
flow of connection tracking (iptables) at our probing machine. Such overflow produces
random ICMP network unreachable error replies at the probing host, which are then
interpreted incorrectly as a response from the remote network. We have since corrected
this problem by disabling ICMP connection tracking, but it was detected only because
of differences in visualization.
4.4 Visualizing Routing Changes
We have also used our clustering algorithm to visualize BGP path changes. We find
that visualization is helpful in finding correlated BGP path changes, sometimes showing
common failure sources that are not obvious from simple inspection of AS paths. Visu-
alization of path changes also shows that our methodology generalizes to other kinds of
network timeseries.
4.4.1 Data Sources: Detecting BGP Path Changes
We collect BGP snapshots with BGPmon [YOB+09], at three sites in California, Col-
orado, and Japan, taking snapshots every two hours. To visualize this data, we consider
each announced, routable IPv4 prefix as a block, and the two-hour collection interval as
round duration. (Finer time resolution is future work.)
To generate our ω_b(i) timeseries, we compare the AS path at round i with round i−1
for each prefix (block) b, setting ω_b(i) = 1 if the paths differ. We apply our clustering
visualization algorithm to ω_b(i), to find groups of BGP prefixes that show similar timing
across repeated route changes.
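This step can be sketched as follows (a toy illustration of ours; `paths_by_round` is a hypothetical input holding one AS path per two-hour round for one prefix):

```python
# Sketch: deriving the per-prefix change timeseries omega_b(i) by
# comparing the AS path at round i with round i-1.

def change_timeseries(paths_by_round):
    """omega_b(i) = 1 if the AS path at round i differs from round i-1."""
    return [1 if paths_by_round[i] != paths_by_round[i - 1] else 0
            for i in range(1, len(paths_by_round))]

paths = [
    "2500 2914 4837 17816",
    "2500 2914 3356 4837 17816",   # changed: now via AS3356
    "2500 2914 3356 4837 17816",   # unchanged
    "2500 2914 4837 17816",        # changed back
]
omega_b = change_timeseries(paths)
```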
4.4.2 Learning from Visualizing Route Changes
Just as with network outages (Section 4.3.2), we find that visualization of route changes
can be used to quickly identify large events, to understand dynamics, and to quantify
these changes with marginal distributions.
Comparing Route Changes to Outages: Comparing route changes to outages,
visualization makes it clear that route changes affect many more prefixes and are more
frequent but shorter than network outages. For example, route changes in Figure 4.5
affect hundreds of blocks, each of which is often much larger than a /24, while outages
(such as Figure 5.7) typically affect only a few /24s, but often for much longer. Such
comparisons must be made carefully, though, since the time and spatial scales of our
visualizations are different. In both cases, though, repeated temporal correlations are
useful to suggest blocks that behave similarly.
Finding Hidden Prefix Correlations: Prior approaches have studied AS paths to
detect common points of failure [CGH03]. Our approach, however, uses only timing to
cluster blocks into events, thus it can reveal correlations that are not apparent from AS
paths alone. We describe several examples below using data from our Japanese routing
vantage point taken in June 2012.
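A minimal sketch of such timing-only grouping (our illustration; the dissertation's clustering tolerates near-matches, while exact matching of change timeseries keeps this simple):

```python
# Sketch: group prefixes whose change timeseries are identical,
# regardless of what their AS paths show.
from collections import defaultdict

def cluster_by_timing(series_by_prefix):
    clusters = defaultdict(list)
    for prefix, omega in series_by_prefix.items():
        clusters[tuple(omega)].append(prefix)
    # keep only groups of 2+ prefixes changing at the same rounds
    return [sorted(v) for v in clusters.values() if len(v) > 1]

series = {
    "1.2.3.0/24": [0, 1, 0, 1, 0],
    "5.6.7.0/24": [0, 1, 0, 1, 0],   # same timing, different AS paths
    "9.8.7.0/24": [1, 0, 0, 0, 1],
}
groups = cluster_by_timing(series)
```

Because grouping ignores the AS paths themselves, it can place prefixes in the same cluster even when their paths suggest unrelated failure points.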
An Example of Correlated Routing Problems: To illustrate a correlation seen in
our data, we observe several correlated route changes affecting 242 routing prefixes on
3 days, as shown in Figure 4.5.
While correlated routing changes are expected, it is surprising when correlated changes
appear to be caused at different points in the AS path. Of the cluster, 101 prefixes
switch between AS paths 2500 2914 4837 17816 and 2500 2914 3356 4837 17816, sug-
gesting that AS2914 (NTT) is selecting between AS4837 (China Unicom) and AS3356
(Level3), while 131 prefixes switch between 2500 2914 4837 17816 and 2500 2914
4837 4837 17816, apparently due to a problem in AS4837. These differences would
not be seen from AS-path analysis alone, since the AS paths of the first block suggest
the problem is between AS2914 and AS4837, while in the second path it appears to be
between AS4837 and AS17816.
A similar problem affects 116 prefixes in Australia, where 115 prefixes oscillate
between paths 2500 2914 4713 2516 4637 1221 38285 10113 and 2500 2914 4713
2516 4637 1221 38285 38285 38285 10113, suggesting changes in AS38285; while
another oscillates between two very different paths: 2500 2914 4713 2516 4637 1221
38285 and 2500 2914 3257 7473 7474 38285.
Figure 4.5: Sample cluster: correlated BGP changes for China prefixes, in June 2012.
X-axis: time at 2 hour rounds. Y-axis: prefixes. Full interactive graph available as
Supplement 4 in [QHP12b].
Frequency of Hidden Correlations: Hidden correlations are not uncommon. We
examined all 8716 clusters in our June 2012 data and found that about 11% showed
such hidden prefix correlations. Analyzing the root causes of such correlations is future
work, outside the scope of this visualization work, but we believe it demonstrates that
visualization and our simple clustering algorithm can reveal hidden correlations.
4.5 Conclusions
The visualization method presented in this chapter is another example in support of the
thesis statement, because it shows that aggregation by visualization is efficient in finding
large and correlated events. Visual aggregation provides an interface for timeseries data
analysis: an efficient way to analyze and characterize sparse network event data, such
as block-level outages found by active probing, and BGP path changes. We show that
we can more easily correlate many small events to large Internet events manually, which
overcomes the difficulty of analyzing a very large number of small events and enables
further studies. We can also automatically find events by checking variations in the
marginal distributions. With this chapter on visual aggregation, we have thus presented
another example supporting our thesis statement.
Up to this chapter, we have shown several examples with sampling and aggregation
techniques in either time or space dimension to support the thesis statement. In the next
chapter, we will present a system that uses adaptive sampling in both time and space
dimensions to track millions of Internet blocks for outage analysis.
Chapter 5
Trinocular: Understanding Internet
Reliability through Adaptive Probing
In this chapter, we present how we use sampling in both the time and space dimensions
to adaptively probe all of the responsive Internet edge. We explore a new sampling
mechanism that is guided by Bayesian inference and sends "just enough" probes in each
round to confirm Internet block states. As in previous chapters, we also aggregate in
the space dimension for more complete knowledge, and we add another kind of aggre-
gation, combining data from multiple vantage points to eliminate measurement bias.
Part of this chapter was published in SIGCOMM 2013 [QHP13c].
5.1 Motivation for Trinocular
Although rare, network outages are a serious concern since users depend on con-
nectivity, and operators strive for multiple “nines” of reliability. Replicated services
and content delivery networks may conceal outages, but not eliminate them, and the
size of the Internet means outages are always occurring somewhere. As we saw
in Chapter 3, outages are triggered by natural disasters [Mal11, Wik12], political
upheavals [Tim11c, Cow11a, Cow11b, Cow11c], and human error [MWA02].
Prior work has generally focused on outages from the perspective of rout-
ing. Groups today directly monitor routing [Cow11a], track routable prefixes with
control- and data-plane methods [MIP+06b, KBSC+12], and study traffic to unoccu-
pied addresses [DSA+11]. While these approaches are useful to detect and some-
times mitigate large outages related to routing, most of the Internet uses default rout-
ing [BMRU09], and we show that most outages are smaller than routable prefixes.
While some systems target probing to detect specific kinds of smaller outages [SS11b],
to our knowledge, no service today actively tracks outages in all Internet edge networks.
In this chapter, we are interested in the same problem domain as in Chapter 3: Inter-
net outages. Here we replace the measurement mechanism of Chapter 3 with a much
more efficient approach. Instead of probing all addresses in a sample of 20k blocks, or
probing the top k addresses in each of all responsive blocks, we explore a more principled
and parsimonious sampling process. We build a simple outage-centric model of the
Internet, and our sampling process is guided by Bayesian inference to provide guar-
anteed precision of outage results (see Section 5.3). A second contribution is in the
aggregation process: we combine and examine results from multiple vantage points
and carefully evaluate measurement bias due to the local view. This much more efficient
approach allows Trinocular to run 24×7.
5.1.1 Contributions
The contribution of this chapter is to address this gap, providing unbiased, accurate
measurements of Internet reliability to all analyzable edge networks. First, we describe
Trinocular¹, an adaptive probing system to detect outages in edge networks. Our system
is principled, deriving a simple model of the Internet that captures the information per-
tinent to outages, parameterizing the model with long-term observations, and learning
current network state with probing driven by Bayesian inference.
¹We call our system Trinocular after the three states a block may take: up, down, or uncertain.
Second, using experiments, analysis, and simulation, we validate that these prin-
ciples result in a system that is predictable and precise: we detect 100% of outages
longer than our periodic probing interval, with known precision in timing and duration.
It is also parsimonious, requiring minimal probing traffic. On average, each Trinocular
instance increases traffic to covered networks by no more than 0.7% of the Internet's
"background radiation". This low rate allows a single computer to monitor the entire
analyzable Internet, and multiple concurrent instances to identify outage scope.
Finally, we use Trinocular to observe two days of Internet outages from three sites.
We also adapt our model to re-analyze existing data, developing three years of trends
from measurements of samples of the Internet. This re-analysis includes observations of
outages due to Hurricane Sandy in 2012, the Japanese Earthquake in March 2011, and
the Egyptian Revolution in January 2011.
5.1.2 Relation to Thesis
Following the work in Chapter 3, this chapter serves as another example supporting
the thesis statement. In this chapter, we show that principled low-rate probing helps
precisely track outages in the Internet edge. Our mean probe rate is 19.2 probes per hour,
a big improvement over Chapter 3, which probes at 109 probes per hour (to the top 20
addresses). Our more efficient mechanism reduces the probe rate by more than 80%
compared to Chapter 3, yet obtains almost perfect accuracy (Section 5.5). We sample inside each
/24 block, in both time and space. In each 11-minute round of probing, we contact up
to 15 sample addresses at a very precise 3 s interval, until we reach a threshold of belief
decided by Bayesian inference. Space-wise, we maintain a per block ever-responsive
address list, and eventually walk through this list over multiple rounds. The resource
constraint we face in this chapter is similar to the previous chapter's: the large scale of
the Internet, with millions of responsive blocks. Our aggregation technique is similar to
previous work (Chapter 3), with the exception of aggregating data from multiple vantage
5.2 Problem Statement
Our goal is to provide a principled, predictable, precise, and parsimonious record of
network outages at the Internet edge.
By principled, we mean we build a simple model of network blocks and track
their status through learning and active probes (Section 5.3). Our simple model is, of
course, incomplete and unsuitable to model all aspects of the Internet, but we show it
is well suited to track outages. We use multi-year network observations to inform our
model, establishing the expected behavior of each block (a /24 network prefix). We
use Bayesian inference to provide a strong theoretical basis to learn the status of each
block, and to decide how many probes to send to improve our belief when it is uncer-
tain. We use periodic probes at fixed-interval, multi-minute rounds to detect network
outages with a known degree of precision. We use adaptive probing at timescales of
seconds to quickly resolve inconsistent information and distinguish transient or non-
network behavior (such as packet loss or edge system failure) from outages at the target
network. Our default measurements employ three years of quarterly observations at long
timescales, rounds of 11 minutes at medium timescales (following [HPG+08, SS11b]),
and 3-second intervals for adaptive probes, although these values can be adapted to trade
precision for traffic.
By predictable, we mean our conclusions about analyzable network blocks provide
guaranteed precision and positive statements about block status (Section 5.4). Our peri-
odic probing bounds the precision of detecting block transitions, and we show that error
in estimates of outage duration is uniformly distributed within one-half round (330 s). As
with all active probing mechanisms, our approach cannot determine the status of networks
that decline to participate, such as those that use firewalls that block probes, nor
networks that are too sparse for our techniques. We find 3.4M /24 blocks to be
analyzable by our method, and we identify non-analyzable blocks. This coverage is 30%
greater than current approaches, if one holds accuracy constant.
By parsimonious, we mean that we use the minimum number of probes required to
establish our belief in edge network state. Long-term history informs our model, and
Bayesian reasoning justifies each probe we make, avoiding unnecessary probes. Min-
imizing probing traffic is critical for a service that operates across the entire Internet.
While money can solve the problem of outgoing network capacity at the prober, recipi-
ents of probing traffic are very sensitive. Even modest traffic can draw complaints (for
which we maintain an opt-out list). We evaluate the impact of our traffic on target net-
works by comparing it to the amount of background radiation that all public networks
observe [WKB+10]. We show that at steady state, each Trinocular instance increases
background traffic by less than 0.7%, allowing us to run multiple instances to under-
stand outage scope.
Finally, our target is all edge networks. We are interested in edge networks because
prior work has shown that many networks employ default routing [BMRU09], and out-
ages occur inside ISPs [SS11b]. We show that probing all /24s detects many more
outages than considering only ASes or routed prefixes (Section 5.5). We combine data
from three sites to study outage scope, separating outages adjacent to the prober from
partial and global outages affecting some or all of the Internet.
These four characteristics distinguish our work from prior work, which often
employs ad hoc mechanisms, does not provide guarantees about outage precision,
requires excessive probing, or monitors routable prefixes instead of considering smaller
outages in edge networks. They also allow us to provide a unique view of Internet relia-
bility, both as a whole, and of specific events (Section 5.6).
5.3 Principled Low-rate Probing
Trinocular carries out principled probing: we define a simple model of the Internet to
capture elements essential to outage detection. Trinocular establishes belief B(U) that
each block is available, and uses Bayesian inference to learn the current status of the
network. We drive probing using this model and belief, sending at regular intervals
to guarantee freshness, and more quickly when necessary to resolve uncertainty about
network state.
5.3.1 An Outage-Centric Model of the Internet
Trinocular’s model of the Internet tracks block-level outages, measured with probes to
active addresses, and reasons about them using belief changed by Bayesian inference.
We study /24 address blocks (designated b) as the smallest unit of spatial coverage.
Larger blocks, such as prefixes that appear in global routes, may capture outages due to
routing changes, but they hide smaller outages. Prior work shows that default routing
is widely used [BMRU09] and that outages occur inside ISPs [SS11b]; we show that
outages often occur at sizes smaller than routable prefixes (Section 5.5.2).
Trinocular sends only ICMP echo requests as probes, each with a 4-byte payload.
We chose end-to-end, data-plane probing to detect outages unrelated to routing. We use
ICMP because it is innocuous and, compared to other options, less likely to be blocked
or interpreted as malicious [HPG+08].
In each block, we model which addresses are active, the ever active addresses, E(b),
a set of up to 256 elements. To interpret the meaning of probe responses, we model
probe    prior state
result   U*       P(probe | U*)       reason
n        U        1 − A(E(b))         inactive addr.
p        U        A(E(b))             active addr.
n        Ū        1 − (1 − ℓ)/|b|     non-response to block
p        Ū        (1 − ℓ)/|b|         lone router?

Table 5.1: Bayesian inference from current block state U* and a new probe.
the expected response rate of E(b) as availability, A(E(b)), a value from 0 to 1, from
never to always responding. These dimensions are independent, so a block where
|E(b)| = 64 and A(E(b)) = 0.5 has one-quarter of addresses that each respond (on average)
half the time. We discard very sparse and very unresponsive blocks as non-analyzable
(Section 5.3.4).
For blocks where A(E(b)) < 1, a negative probe response is ambiguous: it can result
from probing a temporarily unoccupied address, or from the block being down. Our model
evaluates the likelihood of these events. We show that this model provides more infor-
mation per probe than current approaches, allowing lower probe rates (Section 5.5.1).
Finally, we judge blocks as either up (reachable), down (unreachable), or uncertain,
and denote these states as U, Ū, or U?. Belief B(U) ranges from 0 to 1, with low to
high values corresponding to the degree of certainty that the block is down or up. Probes
influence this belief as described next.
5.3.2 Changing State: Learning From Probes
Trinocular uses Bayesian inference to weigh each probe’s information into our under-
standing of block status.
Probe responses are either positive, p, or negative or non-responses, n, and they
affect belief according to conditional probabilities from Table 5.1. This table reflects the
block size, |b|, the combined rate of probe and reply loss, ℓ, and the long-term probability
that those addresses reply, A(E(b)).
The first two lines of the table represent how belief changes when the block is cur-
rently up. They reflect the probability of hitting an active address (A(E(b))), or an inac-
tive address (1 − A(E(b))). In this study we treat A(E(b)) as a static parameter and derive
this value from analysis of long-term observations, so it reflects both transient address
usage and possible loss of probes or replies. Since outages are very rare, they have
negligible influence on A(E(b)).
The last two lines characterize what we learn when the block is down. The final line
is a positive reply to a block that is down. We consider this case to represent the unusual
situation where a single router is up, but all addresses "behind" the router are down.
This low-probability event will almost always draw subsequent probes that clarify the
block's status. This term uses ℓ, representing the probability of packet loss of the probe
or reply. On-line estimation of packet loss is future work; we currently use ℓ = 0.01,
a reasonable but arbitrary value; our results are not sensitive to small changes to this
value. The third line is the complement of that probability.
A new probe observation results in a new belief B′ based on our old belief B as
influenced by this table. After a positive response:

    B′(Ū) = P(p|Ū) B(Ū) / [ P(p|Ū) B(Ū) + P(p|U) B(U) ]

After a negative- or non-response:

    B′(Ū) = P(n|Ū) B(Ū) / [ P(n|Ū) B(Ū) + P(n|U) B(U) ]

with analogous values for B′(U), and B(Ū) = 1 − B(U).
These equations break down, failing to consider alternatives, if conditional proba-
bilities (P(probe|U*)) go to 0 or 1. We avoid this case by capping A(E(b)) at 0.99 for
stable blocks, avoiding very intermittent blocks (A < 0.1) as unsuitable for analysis,
and also capping belief at at most 0.99 (and at least 0.01).
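A sketch of one update step, combining Table 5.1 with the equations above (the function and its structure are our illustration, not Trinocular's code; parameters follow the chapter: |b| = 256, ℓ = 0.01, beliefs capped to [0.01, 0.99]):

```python
# Sketch: one Bayesian belief update. B_up is B(U), the belief that
# the block is up; probe is 'p' (positive) or 'n' (negative/none).

def update_belief(B_up, probe, A, block_size=256, loss=0.01):
    if probe == 'p':
        p_up = A                              # hit an active address
        p_down = (1 - loss) / block_size      # "lone router?" case
    else:
        p_up = 1 - A                          # hit an inactive address
        p_down = 1 - (1 - loss) / block_size  # non-response to the block
    B_down = 1 - B_up
    B_new = (p_up * B_up) / (p_up * B_up + p_down * B_down)
    return min(0.99, max(0.01, B_new))        # cap belief as in the text

B = update_belief(0.5, 'p', A=0.5)   # one positive probe from uncertainty
```

A single positive response drives an uncertain block straight to the 0.99 cap, reflecting that positive replies are strong evidence; negative responses move belief much more gradually.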
5.3.3 Gathering Information: When to Probe
Trinocular probes each block with periodic probing at medium-timescales coupled with
adaptive probes sent quickly when we suspect block status may have changed, and
recovery probes to account for sparse blocks. We probe addresses from E(b) in a pseudo-
random order, both to gather information from many addresses and to spread the reply
burden.
Periodic probing: We probe each analyzable block at a fixed interval so we can
bound the precision of our measurements of network outages. Like prior work [HPG+08,
SS11b], we use a fixed 11-minute interval for basic probing. The precision in outage
measurements follows from this period (see Section 5.4.2); we choose it to trade desired
precision against trac. Periodic probing and target rotation are design choices that
make Trinocular as lightweight on the target network as possible.
Adaptive probing: We classify a block as down when B(U) < 0.1, and up when
B(U) > 0.9. When a periodic probe causes our belief to become uncertain, or to shift
towards uncertainty, we carry out additional, adaptive, short-timescale probes to resolve
this uncertainty. For adaptive probing, we send new additional probes as soon as each
prior probe is resolved until we reach a conclusive belief of the block status. Most
probes are resolved by 3 s timeout, so adaptive probes typically occur every 3 s.
Usually a few adaptive probes will quickly resolve uncertainty in our belief; we
study this value in Section 5.4.3. As address usage becomes sparser, the number of
probes to converge grows geometrically (Figure 5.1). To bound probing, we send at
most 15 total probes per round (1 periodic and up to 14 additional adaptive). We cease
probing when belief is definitive and not shifting; if we cannot reach definitive belief
Figure 5.1: Median number of probes needed to reach a definitive belief after a change
in block state (down-to-up, up-to-down, and still-down cases); x-axis: availability
A(E(b)), y-axis: probes required. Boxes show quartiles, whiskers 5 and 95%ile; both
equal median for outages. Data: analysis and simulation; details: Section 5.4.3.
in 15 probes we mark the block as uncertain. Uncertainty is similar to the “hosed”
state in prior work [SS11b]. We speculate that Bayesian analysis could resolve some
intermediate states in their work, but detailed comparison is future work.
Recovery probing: There is an asymmetry when blocks transition from down-to-up
for intermittently active blocks (low A(E(b))). While positive responses are strong
evidence the block is up, interpretation of negative responses becomes increasingly
ambiguous as A falls. When an intermittent block comes back up, we may still see
several negative responses if probes chance upon temporarily unoccupied addresses.
To account for this asymmetry, we do additional recovery probes for blocks that
are down. From A(E(b)), the probability that we get k consecutive misses due to k
vacant addresses is (1 − A)^k, resulting in a "false negative" belief that an up block is
down. We select k to reach a 20% false-negative rate as a function of A (k is the "still
down" line in Figure 5.1), performing up to k = 15 total probes when A = 0.1. With
recovery probes, false negatives cause outages in sparse blocks that are one-third of a
round too long, on average.
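The value of k follows directly from this bound: choose the smallest k with (1 − A)^k below the target false-negative rate, capped at the per-round budget of 15 probes. A sketch of that calculation (the function name is ours):

```python
import math

def recovery_probes(A, fn_rate=0.2, max_probes=15):
    """Consecutive negative replies needed before accepting 'still down',
    chosen so P(k misses | block actually up) = (1 - A)**k <= fn_rate,
    capped at the per-round probe budget."""
    if A >= 1.0:
        return 1  # every active address answers; one miss is conclusive
    k = math.ceil(math.log(fn_rate) / math.log(1.0 - A))
    return min(max(k, 1), max_probes)
```

For A = 0.1 this yields the 15-probe cap, matching the "still down" line in Figure 5.1; for highly responsive blocks (A near 1) one or two probes suffice.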
Traffic: For long-term operation across the Internet, Trinocular must have minimal
impact on target networks. Our benchmark is Internet background radiation, the
unsolicited traffic every public IP address receives as part of being on the public
network. It thus provides a reasonable baseline of unsolicited traffic against which to
balance our measurement. A typical unused but routable /8 block receives 22 to 35
billion packets per week [WKB+10], so each /24 block sees 2000 to 3300 packets/hour.
Our goal is to increase this rate by no more than 1%, on timescales of 10 minutes.
In the best case, we send only 5.4 probes/hour per /24 block in steady state, and if
all addresses in a block are active, we probe each address only every other day. This
best case is only a 0.25% increase in traffic. With adaptive and recovery probing, our
worst-case probing rate adds 15 probes per 11-minute round, an average probe rate of
82 probes/hour per /24 block, about 5% of the rate of background radiation. Since
this worst case will occur only for low-A blocks that change state, we expect typical
performance to be very close to best case, not worst case. In Section 5.4.3 we show
experimentally that median traffic is at 0.4% to 0.7% of our benchmark, and that our
5% worst case occurs less than 2% of the time.
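The budget arithmetic above can be reproduced approximately (constants are from the text; small rounding differences against the quoted percentages are expected):

```python
# Background radiation per /24 block, from 22-35 billion packets/week per /8.
BLOCKS_PER_SLASH8 = 2 ** 16          # 65,536 /24 blocks in a /8
HOURS_PER_WEEK = 7 * 24

def radiation_per_block(pkts_per_week_slash8):
    """Unsolicited packets/hour seen by one /24 block."""
    return pkts_per_week_slash8 / BLOCKS_PER_SLASH8 / HOURS_PER_WEEK

low, high = radiation_per_block(22e9), radiation_per_block(35e9)  # ~2000, ~3200

best_case = 5.4                      # probes/hour per /24, steady state
worst_case = 15 * 60 / 11            # 15 probes per 11-minute round, ~82/hour

best_fraction = best_case / low      # roughly 0.25% of background radiation
worst_fraction = worst_case / low    # roughly 4-5% of background radiation
```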
5.3.4 Parameterizing the Model: Long-term Observation
We determine parameters E(b) and A(E(b)) for each block to weigh the information in
each probe.
Initialization: We use long-term, multi-year, Internet censuses to initialize these
parameters for each block. Prior work generates regular IP history datasets that provide
the information we need [FH10]. These datasets include the responsiveness of each
public, unicast IP address in IPv4 measured 16 times over approximately the last 3
years. We use the full history (16 measurements) to identify E(b). To use recent data, we
consider only the 4 most recent censuses to compute A(E(b)). We update E(b) every 2-3
months as new history datasets become available, bringing in newly active blocks and
retiring gone-dark blocks. Current Internet censuses are specific to IPv4. Our approach
applies to IPv6 if E(b) can be determined, but methods to enumerate all or part of IPv6
are an area of active research.
It is very traffic-intensive to track intermittent and sparse blocks with reasonable
accuracy (see Figure 5.1). We therefore discard blocks where addresses respond very
infrequently (A(E(b)) < 0.1). We also discard blocks that are too sparse, where
|E(b)| < 15, so that we are not making decisions based on very few computers. Because
A(E(b)) is based on only recent censuses, discarding low-A(E(b)) blocks removes "gone
dark" blocks [FH10].
Of the 16.8M unicast blocks as of July 2012, we find 14.5M are routed, 8.6M are
non-responsive, 0.7M have |E(b)| < 15, and 1.5M have A(E(b)) < 0.1, leaving 3.4M
blocks that are analyzable: 24% of the routed space (and 40% of responsive blocks).
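These criteria amount to a simple filter over the long-term history data; a minimal sketch (names are illustrative, not from the system):

```python
def analyzable(routed, n_ever_active, availability):
    """Decide whether to track a /24 block, per the criteria in the text:
    the block must be routed, have |E(b)| >= 15 ever-active addresses, and
    recent availability A(E(b)) >= 0.1."""
    return routed and n_ever_active >= 15 and availability >= 0.1
```

Applied to the July 2012 data above, this filter keeps 3.4M of the 16.8M unicast blocks.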
Since most of the Internet is always up, we set belief to indicate all blocks are up on
startup.
Evolution: As networks change, model parameters may no longer match the current
network. We update our target list and A-estimates every two months as new long-term
data is made available. At shorter timescales, we must handle or adapt when parameter
estimates diverge from reality.
Underestimating E(b) misses an opportunity to spread traffic over more addresses.
Underestimating A(E(b)) gives each probe less weight. In both cases, these errors have
a slight effect on performance, but none on correctness.
When E(b) is too large because it includes non-responsive addresses, the effect is
equivalent to overestimating A(E(b)). When A(E(b)) exceeds the actual A, negative
probes are given too much weight and we infer outages incorrectly. Ideally A(E(b))
will evolve as a side-effect of probing to avoid false outages when it diverges from the
long-term average. Our current system does not track A dynamically (although work is
underway), so we detect divergence in post-processing, and identify and discard
inaccurate blocks. The result is greater traffic, but few false outages.
Traffic: We do not count long-term observations against Trinocular's traffic budget
since they are an ongoing effort, independent of our outage detection. However, even if
we take responsibility for all traffic needed to build the history we use, it adds only
0.18 probes per hour per /24 block since collection is spread over 2 months.
5.3.5 Outage Scope From Multiple Locations
A single site provides only one view of the Internet, and prior work has shown that about
two-thirds of outages are partial [KBMJ+08b]. We use two approaches to judge outage
scope: we detect and eliminate outages where probers are effectively off the network,
and we merge views from multiple observers to distinguish between partial and global
outages. In Section 5.6.1 we report on how frequently these occur in the Internet.
Prober-local outages: Router failures immediately upstream of a prober unenlighteningly
suggest that nearly the entire Internet is out. We detect and account for outages
that affect more than half the probed blocks.
Partial and global outages: We detect outage scope by merging observations from
multiple vantage points. Because each site operates independently and observations
tend to occur at multiples of a round, direct comparison of results from different sites
will show different timing by up to one round. We correct these differences by taking
the earlier of two changes that occur, since periodic probes always delay detection
of a change in block status. We therefore correct disagreements in the merged results
only when (a) both sites agree before and after the disagreement, (b) the disagreement
lasts less than 1.1 rounds, and (c) the network changes state before and after the
disagreement. Rules (a) and (b) detect transient disagreement that is likely caused by
phase differences. Rule (c) avoids incorrectly changing very short outages local to one
vantage point. Merging results thus improves precision. After correction, any remaining
disagreement represents partial outages.
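Rules (a)-(c) can be written as a small predicate over a disagreement interval between two sites (a simplified sketch; the names and the interval summary are ours, not the system's actual representation):

```python
ROUND_S = 11 * 60  # one round, in seconds

def should_merge(agree_before, agree_after, disagree_seconds, state_changed):
    """True if a transient disagreement between two vantage points should
    be corrected (taking the earlier observed change) rather than kept as
    a partial outage: (a) the sites agree before and after, (b) it lasts
    less than 1.1 rounds, and (c) the block changes state across it."""
    return (agree_before and agree_after           # rule (a)
            and disagree_seconds < 1.1 * ROUND_S   # rule (b)
            and state_changed)                     # rule (c)
```

Rule (b) alone would merge away a genuine short outage seen at only one site; rule (c) prevents that, since such an outage leaves the block in the same state on both sides of the disagreement.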
5.3.6 Operational Issues
Our system implementation considers operational issues to ensure it cannot harm the
Internet.
Probing rate: In addition to per-block limits, we rate-limit all outgoing probes to
20k probes/s using a simple token bucket. Rate limiting at the prober ensures that we
do not overwhelm our first-hop router, and it provides a fail-safe mechanism so that,
even if all else goes wrong, our prober cannot flood the Internet incessantly. In practice,
we have never reached this limit. (This limit is at the prober, spread across all targets;
Figure 5.4 shows that only a tiny fraction of this traffic is seen at each target block.)
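A token bucket of this kind is standard; a minimal version follows (the 20k probes/s rate is from the text, while the burst size is our illustrative choice):

```python
class TokenBucket:
    """Cap outgoing probes at `rate` per second: a probe is sent only
    when a token is available, and tokens refill with elapsed time."""

    def __init__(self, rate, burst):
        self.rate, self.burst = rate, burst
        self.tokens, self.last = float(burst), 0.0

    def allow(self, now):
        # Refill in proportion to elapsed time, capped at the burst size.
        self.tokens = min(self.burst,
                          self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False

limiter = TokenBucket(rate=20000, burst=20000)  # prober-wide 20k probes/s cap
```

Because the check is a single comparison per probe, the limiter adds negligible cost even at tens of thousands of probes per second.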
We expect our monitor to run indefinitely, so we have implemented a simple
checkpoint/restart system that saves current belief about the network. This mechanism
accommodates service on the probing machine. We restart our probers every 5.5 h as a
simple form of garbage collection.
We have run Trinocular for several multi-day periods, and we expect to run
Trinocular continuously when adaptive computation of A is added.
Implementation: We use a high-performance ICMP probing engine that can handle
thousands of concurrent probes. We use memory-optimized data structures to keep state
for each block, leaving the CPU cost of matching probe replies with the relevant block
as the primary bottleneck. We find a single prober can sustain 19k probes/s on one core
of our 4-core Opteron. Fortunately, probing parallelizes easily, and with four concurrent
probers, a single modest computer can track all outages on the analyzable IPv4 Internet.
5.4 Validating Our Approach
We validate correctness with controlled experiments, and probe rate by simulation and
Internet experiments.
5.4.1 Correctness of Outage Detection
We first explore the correctness of our approach: if an outage occurs, do we always
see it? For a controlled evaluation of this question, we run Trinocular and probe 4 /24
blocks at our university from 3 sites: our site in Los Angeles, and universities 1600 km
and 8800 km distant in Colorado and Japan. We control these blocks and configure
them in a two-hour cycle where the network is up for 30 minutes, goes down at some
random time in the next 20 minutes, stays down for a random duration between 0 and
40 minutes, then comes back up. This cycle guarantees Trinocular will reset between
controlled outages. We studied these blocks for 122 cycles, yielding 488 observations
as dataset A_controlled, combining data for 4 controlled blocks from datasets A_1w
(2013-01-19, 4 days), A_3w (2013-01-24, 1 day), A_4w (2013-01-25, 2 days), and A_7w
(2013-02-12, 2 days).²
Figure 5.2 shows these experiments, with colored areas showing observed outage
duration rounded to integer numbers of rounds. We group true outage duration on the
x-axis into rounds with dotted black lines. Since periodic probing guarantees we test each
network every round, we expect to find all outages that last at least one round or longer.
We also see that we miss outages shorter than a round roughly in proportion to outage
²We name datasets like A_7w for Trinocular scans of the analyzable Internet (A_20addr
uses a variant methodology), H_49w for Internet histories [FHG11], and S_50j for Internet
surveys [HPG+08]. The subscript includes a sequence number and a code for the site
(w: Los Angeles, c: Colorado, j: Japan).
Figure 5.2: Fraction of detected outages (bar height) and observed duration in rounds
(color: missed, or observed as 1, 2, or 3 rounds), as a function of the duration of the
controlled outage in minutes. Asterisks mark cases discussed in the text. Dataset:
A_controlled.
duration (the white region of durations less than 11 minutes). While these experiments
are specific to blocks where addresses always respond (A(E(b)) = 1), they generalize to
blocks with A ≥ 0.3, since we later show that we take enough probes to reach a definitive
conclusion for these blocks (Figure 5.1).
These results confirm what we expect based on our sampling schedule: if we probe
a block with A ≥ 0.3, we always detect outages longer than one round.
5.4.2 Precision of event timing
Figure 5.2 shows we do detect outages. We next evaluate the precision of our observed
outage durations.
We continue with dataset A_controlled in Figure 5.3, comparing ground-truth outage
duration against observed outage duration at second-level precision. Our system
measures block transition events with second-level precision, but when we examine outage
durations, we see they group into horizontal bands around multiples of the round
duration, not the diagonal line that represents perfect measurement. We also see that
the error in each case is uniformly distributed within plus or minus one-half round.
As expected,
Figure 5.3: Observed outage duration vs. ground truth, in seconds; the diagonal
represents perfect prediction. Dataset: A_controlled (same as Figure 5.2).
we miss some outages that are shorter than a round; we show these as red circles at
duration 0. Finally, we also see a few observations outside bands, both here and marked
with an asterisk in Figure 5.2. These are cases where checkpoint/restart stretched the
time between two periodic probes.
These results are consistent with measurement at a fixed probing interval sampling a
random process with uniform timing. Comparing observed and real event start- and
end-times confirms this result, with each transition late by a uniform distribution
between 0 and 1 round.
These experiments use blocks where addresses are always responsive (A(E(b)) = 1).
We carried out experiments varying A from 0.125 to 1 and can confirm that we see no
missed outages longer than one round, and similar precision, as long as Trinocular can
reach a conclusion (A > 0.3). When 0.3 < A < 1, additional adaptive probes add at most
45 s to detection time (15 probes at 3 s per adaptive probe). For blocks with A < 0.3,
precision will deteriorate and block status may be left uncertain.
We conclude that periodic probing provides a predictable and guaranteed level of
precision, detecting state transitions in just more than a round (705 s, one round plus 15
adaptive probes) for blocks where A > 0.3. Greater precision is possible by reducing
the round duration, at the cost of more traffic.
5.4.3 Probing rate
Our goal is good precision with low traffic, so we next validate the traffic rate. We use
simulation to explore the range of expected probing rates, then confirm these based on
our Internet observations.
Parameter Exploration: We first use simulation to explore how many probes are
needed to detect a state change, measuring the number of probes needed to reach
conclusive belief in the new state. Our simulation models a complete block (|E(b)| = 256)
that transitions from up-to-down or down-to-up. When up, all addresses respond with
probability A(E(b)). When down, we assume a single address continues to reply positively
(the worst-case outage for detection).
Figure 5.1 shows the up-to-down and down-to-up costs. Down-to-up transitions
have high variance and therefore have boxes that show quartiles and whiskers that show
the 5th and 95th percentile values. Up-to-down transitions typically require several
probes because Trinocular must confirm a negative response is not an empty address or
packet loss, but they have no variance in these simulations. Trinocular reaches a
definitive belief and a correct result in 15 probes for all blocks with A > 0.3.
For down-to-up transitions, 15 probes are sufficient to resolve all blocks in 50% of
transitions when A > 0.15, and in 95% of transitions when A > 0.3. Variance is high
Figure 5.4: Distribution of probes to each target block (median: 13.2 probes/hour;
mean: 19.2 probes/hour; maximum allowed: 15 probes/round; 0.1% of blocks see 85
or more probes/hour). Dataset: A_7w-5.5h.
because, when A is small, one will probe many unused addresses before finding an active
one. This variance motivates recovery probing (the black “still down” line).
Experimentation: To validate these simulations, Figure 5.4 shows probe rates from
A_7w, a 48-hour run of Trinocular on 3.4M Internet-wide, analyzable blocks starting
2013-02-12 T14:25 UTC. Here we examine the subset A_7w-5.5h from this data: the
first 5.5 hours (30 rounds) from one of the four probers, with 1M blocks; other subsets
are similar.
As one would expect, in most rounds, most blocks finish with just a few probes:
about 73% use 4 or fewer per round. This distribution is skewed, with a median of 13.2
probes/hour, but a mean of 19.2 probes/hour, because a few blocks (around 0.18%) reach
our probing limit per round. Finally, we report that 0.15% of blocks actually show more
than expected traffic (the rightmost peak on the graph). We find that a small number of
networks generate multiple replies in response to a single probe, either due to probing a
broadcast address or a misconfiguration. We plan to detect and blacklist these blocks.
Figure 5.5: Median number of probes, with quartiles, for aggregate and state transitions
(down-to-up, up-to-down, still up, still down), as a function of availability A(E(b)).
Dataset: A_7w-5.5h.
This experiment shows we meet our goals of generating only minimal traffic, with
probing at 0.4% (median) to 0.7% (mean) of background radiation, and of bounding
traffic to each block.
Probe rate as a function of A(E(b)): The above experiment shows most blocks
require few probes, and our simulations show probe rate at transition depends strongly
on address responsiveness. To verify this relationship, Figure 5.5 breaks down probes
required by transition type and each block’s A(E(b)).
The dotted line and thick quartile bars show aggregate performance across all states.
We track blocks with A > 0.3 with fewer than 4 probes per round, with relatively low
variance. Intermittent blocks (A < 0.3) become costly to track, and would often exceed
our threshold (15 probes).
Figure 5.5 identifies each state transition from Figure 5.4 separately. We see that the
shape of recovery and outages match simulations (Figure 5.1), although outage detection
has larger variance because of imperfect estimation of A(E(b)).
Overall this result confirms that Trinocular does a good job of keeping probe rate
low, and of adapting the probe rate to meet the requirements of the block.
5.5 Effects of Design Choices
We next explore two design choices that differ from prior work and contribute to
Trinocular's accuracy.
5.5.1 How Many Addresses to Probe
Trinocular sends probes to E(b), the known active addresses in each block. Most prior
systems probe a single address, sometimes a single specific address such as that ending
in .1 [KBMJ+08b, MIP+06b, KBSC+12], or they probe all addresses [QHP13b]. We
show that alternatives either cover fewer blocks or gather less information per probe,
and may miss outages.
Information Per Probe and Coverage: To evaluate the amount of information
each probe may provide, we examine IP history data for each alternative. We begin with
history dataset "it49w" [FHG11] (identified here as H_49w), summarizing IPv4 censuses
from 2006 to 2012. Figure 5.6 shows the distribution of the availability value A for
different approaches to selecting probing targets per block: probing .1, a hitlist's single
most responsive address [FH10], all responsive addresses (E(b)), and all addresses. This
A value correlates with the information a single probe provides about the block, since
probing an inactive address does not help determine if the block is up or down.
Figure 5.6: Probability of a positive response (availability) of one address of each block
(dot-one, hitlist, E(b), and all), or of any of 15 addresses (15-try-E(b)), vs. cumulative
blocks (in millions) with at least that availability. Dataset: H_49w.
We first compare probing active addresses to all addresses (E(b) vs. all). The E(b)
line has greater availability for all blocks, so A(E(b)) > A(b) and each Trinocular probe
provides more information than does probing all addresses.
Dot-one and hitlist do better than E(b) for many blocks (about 40% of all, from
3.5M to the right), but poorer for about 50% of all blocks (from 0.7M to 3.5M). In
many cases (for dot-one, about 2M blocks from 0.7M to 2.5M, about 30% of all blocks),
a single address provides no coverage where E(b) shows some addresses would respond.
Thus, while a single target may work well for 40% of blocks, particularly when probing
includes retries, it provides poor or no coverage for even more blocks; probing E(b) can
cover about two-thirds more blocks than .1.
While A characterizes the information provided by a single probe, Trinocular sends
an adaptive number of probes, allowing low-A blocks to get good coverage. To show an
strategy           single    hitlist    Trinocular    all
samples per /24    1         1          |E(b)|        256
which addresses    .1        top        ever resp.    all
precision          99.97%    99.98%     100%          (100%)
recall             58.6%     66.6%      96.6%         (100%)

Table 5.2: Comparing precision and recall of different probing targets. Dataset: S_50j.
upper bound on Trinocular's ability to find an active address, the curve labeled
"15-try-E(b)" shows the probability that any of 15 probes will respond, suggesting that
Trinocular can use multiple probes to provide reasonable results even for blocks with
very intermittently responding addresses.
While other systems use secondary methods to improve reliability (perhaps
verification with traceroute), or use fewer but larger blocks (Section 5.5.2), we show that
E(b) provides about 30% broader coverage than depending on a single address.
Effect on Outage Detection: To evaluate the impact of probing choice on outages,
we next examine a single dataset with three choices of probe target. We use Internet
survey "it50j" [HPG+08], a 2-week ICMP probing of all addresses in 40k /24 blocks
starting 2012-10-27 (here called S_50j). We define any response in probes to all
addresses as ground truth, since it is the most complete. We define a false outage (fo)
as a prediction of down when the block is really up under all-address probing, with
analogous definitions of false availability (fa), true availability (ta), and true outages
(to). We then compute precision, ta/(ta + fa), and recall, ta/(ta + fo).
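These definitions can be checked with a few lines (the counts below are hypothetical, not from the dataset):

```python
def precision_recall(ta, fa, fo):
    """Precision and recall as defined in the text, treating availability
    ('up') as the positive class: precision = ta/(ta+fa) penalizes false
    availability; recall = ta/(ta+fo) penalizes false outages."""
    return ta / (ta + fa), ta / (ta + fo)

# Hypothetical counts: a scheme that rarely claims "up" wrongly (small fa)
# but misses availability through false outages (large fo) shows the
# pattern in Table 5.2: high precision, low recall.
p, r = precision_recall(ta=600, fa=1, fo=400)
```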
Table 5.2 compares these design choices. Here we focus on the effect of the number
of targets on precision and recall. While precision is uniformly good (an inference of
"up" is nearly always correct), recall suffers because there are many false outages. We
conclude that probing one target (the single and hitlist cases) has lower recall. The
problem is that in some blocks the target address is never active, and in others with
transient usage it is only sometimes active. Probing multiple addresses handles both
of these cases.
Other systems use ICMP as a triggering mechanism for secondary methods that
verify outages; for example, traceroutes may recover from a false trigger. However,
these systems raise other questions (is the target for traceroute up?), and even when
self-correcting, incur additional traffic. We show that probing E(b) provides 30–40%
better recall than probing a single address, even without secondary verification.
5.5.2 What Granularity of Blocks
Most previous active probing systems track reachability per routable prefix [KBMJ+08b,
MIP+06b] (Hubble operation probes at most 1 target per BGP prefix [KB12]). However,
reachability is not correlated with BGP prefixes [BMRU09]; we see smaller units.
We next compare block-based schemes, which directly measure each /24 block, with
prefix-based schemes, where measurement of a single representative address determines
outages for a routable prefix of one or many blocks. Prefix-based schemes require
little traffic and get broad coverage. However, their trade-off is that they are imprecise,
because the single representative may not detect outages that occur in parts of the prefix
that are not directly measured. Block-based schemes, on the other hand, require more
traffic and cannot cover blocks where no addresses respond, so they have lower
coverage. But because block-based schemes directly measure each block, they provide very
good precision.
We first compare how precision and coverage trade off in block-based and
prefix-based measurement schemes, then how this difference in precision affects the accuracy
of outage detection.
Methodology: We first must define when block-based or prefix-based methods can
cover a prefix. Block-based measurement systems can track outages in blocks that have
active addresses that respond. Here we require 20 active addresses with A > 0.1 (a
slightly stricter requirement than Trinocular). Prefix-based systems expect an active .1
                    in prefixes           in blocks
block-direct       184,996  (44%)     2,438,680  (24%)
prefix-direct      240,178  (57%)       219,294   (2%)
prefix-inferred         —             8,115,581  (81%)
overlap            152,295  (36%)     2,410,952  (24%)
neither            145,268  (35%)     1,908,122  (19%)
total              418,147 (100%)    10,051,431 (100%)

Table 5.3: Comparing coverage by granularity. Dataset: A_20addr.
address (the target address). To be generous, we consider all .1 addresses in any /24 of
a prefix, not just the first.
However, prefix-based systems only directly measure the target address, and from
that infer outages for the rest of the prefix. Prefix-based systems require less probing
traffic, but we have shown that Trinocular's probe rate is acceptable.
We evaluate the effects of coverage by re-analyzing an Internet-wide survey taken
2012-08-03 [QHP13a], labeled A_20addr. As with S_50j, this dataset consists of ICMP
probes sent to addresses every 11 minutes. But it covers only 20 addresses in each
of 2.5M /24 blocks, and only for 24 hours on 2012-08-03. We compare this probe data
with default-free BGP routing tables from the same site on the same day.
Precision: We compare precision of coverage in Table 5.3. In the left column we
consider, for each routable prefix, if any of its address blocks are covered by block-based
measurements, prefix-based, both, or neither. We see 418k prefixes in the BGP routing
table. Of these, prefix measurements directly observe 240k prefixes (prefix-direct, 57%),
while block-based measurements include data for only 185k prefixes (44%). Block-
based coverage misses some prefixes where all blocks include fewer than 20 addresses;
prefix-based coverage misses some prefixes where no .1 addresses respond. Overall,
prefix-based probing covers 13% more prefixes, although block-based picks up 8% that
prefix-based misses (block-direct minus overlap).
The block-level view (right column) presents a different picture. Prefix-based has
much larger coverage (81%) when one considers inferred blocks. This large coverage
is due to large prefixes that are sparsely occupied, like MIT's 18/8, where most blocks
do not respond but a prefix-based scheme allows 18.0.0.1 to represent reachability to
them all. Block-based coverage is also lower because it requires more than one address
per block. However, direct measurements in these prefixes are quite few: we observe
10 times more blocks than prefix-direct, but inference allows prefix-based to suggest
answers for 3 times more blocks. We next consider how direct and indirect
measurements affect accuracy.
The above comparison is rough, because Hubble's lower coverage can be compensated
for by indirectly extending outage decisions to other blocks within the same BGP
prefix. To make a fairer comparison, we compared to Hubble's decisions, both direct
and indirect. We iterate over all /24 blocks in each BGP prefix, and all rounds on the
day of our probing (2012-08-03). We find that 57% of the time, although prefix-direct
does not have coverage, prefix-indirect, like our more complete probing, correctly
decides the block is up. However, 25% of the time, prefix-indirect falsely concludes a
block is down, while our more complete probing reveals the block is actually up. We
omit the remaining small-percentage cases here.
Accuracy: We next examine how the different granularities of prefix- and
block-based measurements affect the accuracy of outage detection in A_20addr.
For prefix-based measurements, we observe the status of one address as representing
the entire prefix. This approach directly observes outages in the block of the
representative address, and infers the status of other blocks of the prefix. It works
perfectly when an outage is prefix-complete: all parts of the prefix go up or down
together, perhaps due to a common change in routing. It is incorrect when the outage is
prefix-partial, and can either over- or under-count the size of the outage when some
blocks in the prefix are up while others are down.
To compare accuracy, we simulate a prefix-based scheme by observing outages in
the prefix's target. We compare to a re-analysis of A_20addr with a Trinocular-like
scheme, following Section 5.6.2, but with all 20 addresses in each block as E(b).
With prefix-based schemes, a prefix will often be declared down while the data shows
that other blocks in the prefix remain up. We find that 25% of all block-rounds inferred
to be down are incorrect, so prefix inference often overstates outages. It can also
understate outages when small outages do not occur at the directly measured block of
the prefix; 37% of block-round outages seen by us are missed by a prefix-based scheme.
A fundamental limitation of prefix-based measurement is that outages usually do
not affect entire prefixes. To quantify this claim, we examine each routable prefix with
any block-level outages in A_20addr. For each prefix, we evaluate whether the outage
is prefix-complete or prefix-partial. Any prefix-based measurement scheme will always
be incorrect for prefix-partial outages, over- or under-reporting depending on the status
of the directly measured block. We find that only 22% of all prefix-rounds (that have
any outage) are prefix-complete, while 78% are prefix-partial, showing that most
outages are partial.
5.6 Studying the Internet
We next examine what Trinocular says about the Internet.
                          sites (% block-time)
             (1 vantage point)    (2 vantage points)      (3)
status        w      c      j      wc     wj     cj      wcj
all down      0.79   0.92   0.74   0.24   0.22   0.26    0.15
all up       99.21  99.08  99.26  98.53  98.62  98.53   98.01
disagree        —      —      —    1.23   1.16   1.21    1.84

Table 5.4: Outages observed at three sites over two days. Dataset: A_7.
5.6.1 Days in the Life of the Internet
We begin by measuring Internet-wide outages to evaluate the proportion of local and
global outages, and to demonstrate Trinocular operation. We collected data tracking
outages on 3.4M blocks over two days, starting at 2013-02-12 14:25 UTC, from three
universities, labeled w, c, and j, in Los Angeles, Colorado, and Japan. This experiment
produces three datasets: A_7w, A_7c, and A_7j. For analysis, we then identify blocks
where A is inconsistent and remove them, leaving 863k, 865k, and 863k blocks.
When rendered to an image with one pixel per round and block, the data is
overwhelming (omitted here for space, but available at [QHP13a]), forming an image
270 pixels wide and 3.4M tall.
Data from three vantage points lets us begin to evaluate how widespread are the
outages we observe. Table 5.4 shows the level of agreement between the three sites
for the period when all three had overlapping coverage. We measure agreement as
percentage of block-time, that is, the sum of the duration of outages for each block
for a given status, for each combination of the three vantage points.
Comparing columns w, c, and j, we see slight variations between the three sites (from
0.74% to 0.92%). We have seen this magnitude of variation in most measurements, with
no strong trend favoring any site. We see a similar variation in these whole Internet
measurements (here) as in measurements of a sample of the Internet in Figure 5.10,
suggesting that our samples are unbiased.
We can evaluate the degree of local and global outages by comparing each site with
the others. The 2-vantage-point columns (wc, etc.) show that many outages are local,
seen at one site but not another. The overlap of all three vantage points (column wcj)
shows only about 0.15% of the Internet is down, suggesting only 16–20% of outages
seen by a single site are global. We believe we are converging on reporting only global
outages with three independent vantage points, but future work should explore
additional vantage points to demonstrate a plateau.
Here we considered two days of the Internet. We are currently running Trinocular
actively, and in future work plan to compare long-term observations and compare to
other public information about network outages.
5.6.2 Re-analyzing Internet Survey Data
While Section 5.6.1 uses Trinocular to study the global Internet, it provides only a brief
snapshot. To get a longer-term perspective we next re-examine existing datasets using
the principles behind Trinocular.
We draw on Internet survey data collected over the last three years from Los Angeles, Colorado, and Japan [HPG+08]. Surveys start with a random sample of 20k or 40k /24 blocks (about 1–2% of the responsive Internet), then probe all addresses in each block every 11 minutes for two weeks.
Survey data is quite different from Trinocular. All addresses in b are probed, not just E(b), so the survey traffic rate is 100× greater than Trinocular's. To adapt Trinocular to this bigger but less-tuned data, we track belief of the state of each block and use all probes as input. Since we probe all addresses, here E(b) = b (Table 5.1). Since we cannot control probing, we have neither adaptive nor recovery probing, but periodic probing occurs every 2.6 s, slightly more frequent than Trinocular's adaptive probing. This change is both good and bad: frequent periodic probing can improve precision in detection of outage start and end, but many probes are sent uselessly to non-responsive addresses that Trinocular would avoid.
Our reanalysis computes A(b) from the survey itself. This "perfect" value differs from Trinocular operation, where A is computed from possibly outdated IP history. Adapting A from probes is work-in-progress.
5.6.3 Case Studies of Internet Outages
We next examine several cases where Internet outages made global news. We see that systematic measurement of outages can provide information about the scope of problems and the speed of recovery. Where possible, we visualize outages by clustering blocks by similarity in outage timing [QHP12c], and coloring blocks based on their geolocation.
Political Outages: Egypt and Libya
Two major 2011 outages were caused by political events: most Egyptian routes were
withdrawn on 2011-01-27 by the government during what became the 2011 Egyptian
revolution, and all Libyan routable prefixes were withdrawn 2011-02-18 during the
Libyan revolution. In both cases, we re-examined surveys covering these events (S_38c began 2011-01-27, just after Egypt's outage, and ran for 3 weeks to cover Libya). We have strong evidence of the Egyptian outage, with 19 /24 blocks of Egypt's 22k in the survey (visualization omitted due to space). The end of the observed outage is confirmed with news reports and analysis of BGP data.
Libya’s Internet footprint is much smaller than Egypt’s: only 1168 /24 blocks as of
March 2011. Only one of those blocks was in the dataset, and that block is too sparse
(only 4 active addresses) to apply Trinocular. However, Trinocular’s lightweight probing
means that it could have covered the whole analyzable Internet. Had it been active at the
time, we would have tracked 36% of Libya’s 1168 blocks and likely seen this outage.
Figure 5.7: Six days of the 600 largest outages in March 2011 showing results of the Tōhoku earthquake. Dataset: S_39c. Colors are keyed to countries.
March 2011 Japanese Earthquake
In survey S_39c, we observe a Japanese Internet outage in Figure 5.7, mid-day (UTC) on 2011-03-11. This event is confirmed as an undersea cable outage caused by the Tōhoku Japanese earthquake [Mal11]. We mark a vertical line 30 minutes before the
earthquake so as to not obscure transition times; individual blocks do not cluster well
because recovery times vary, but the outage is visible as a large uptick in the marginal
distribution. Unlike most human-caused outages, both the start and recovery from this
outage vary in time. For most blocks, the outage begins at the exact time of the earth-
quake, as shown by the sudden large jump in marginal distribution less than 6 hours into
Figure 5.8: Six days of the 300 largest outages in U.S.-based networks showing Hurricane Sandy. Dataset: S_50j.
2011-03-11, but for some it occurs two hours later. Recovery for most blocks occurs
within ten hours, but a few remain down for several days.
This dataset also shows strong evidence of diurnal outages in Asia as the green and
white banding seen in the low 300 blocks. These diurnal outages make Trinocular’s
outage rate slightly higher than our previous approach [QHP12c]. We show that these
blocks come and go, meeting our definition of outage. Future work may distinguish cases where networks intentionally go down (such as turning off a laboratory at night) from unexpected outages.
October 2012: Hurricane Sandy
We observed a noticeable increase in network outages following Hurricane Sandy. The
Hurricane made landfall in the U.S. at about 2012-10-30 T00:00 UTC. When we focus
on known U.S. networks, we see about triple the number of network outages for the day
following landfall, and above-baseline outages for the four days following landfall.
[Figure: outages per day (fraction of block-rounds, 0 to 0.018) versus date (UTC), 2012-10-27 through 2012-11-08, with hurricane U.S. landfall marked; bars broken down by state (CA, CT, NJ, NY, PA) plus unattributed U.S.]
Figure 5.9: Median number of outages per day, broken down by state, weighted by outage size and duration, with jittered individual readings (dots). Dataset: S_50j.
Visualizing outages: Figure 5.8 visualizes the 400 blocks in the U.S. with the largest degree of outages, and label (a) shows a strong cluster of outages at 2012-10-30 (UTC) corresponding with hurricane landfall. Hurricane-related outages tend to be long, lasting one or more days. We believe these outages correspond to residential power outages.
Quantifying outages: We know that some part of the Internet is always down, so to place these outages in perspective, Figure 5.9 plots the exact number of /24 blocks that are down in each round (this value is the marginal distribution of Figure 5.8). We plot each round as small red points (with small jitter to make consecutive readings more distinct), and we show 24-hour median values with the dark line.
Figure 5.9 shows U.S. networks had an outage rate of about 0.36% before landfall. (This rate seems somewhat less than the global average.) This rate jumps to 1.27%, about triple the prior U.S. baseline, for the 24 hours following landfall. The outage level drops over the next four days, finally returning to the baseline on 2012-11-03.
Locating outages: To confirm the correlation between the hurricane and these out-
ages, we look at the weighted blocks by state. The bars in Figure 5.9 identify outages
by state. The top “US” portion represents outages that are geolocated in the U.S., but
not to a specific state.
This figure shows that there are large increases in the amount of outages in New York
and New Jersey (the lighter colored bars in the middle of the graph) after hurricane land-
fall on 2012-10-30, about three times the prior baseline. These problems are generally
resolved over the following four days. (Because of our more sensitive methodology, we
see more outages here than in our prior analysis [HQP12], but our qualitative results are
similar.)
While re-analysis of S_50j provides insight into Sandy-related problems and recovery, survey collection places significant traffic on the targets. Trinocular can cover 3.4M blocks, about 80× more than the 40k in a survey, at about 1% the traffic to each target block.
5.6.4 Longitudinal Re-analysis of Existing Data
Finally, we re-analyze three years of surveys. This data lets us compare the stability of our results over time and across different locations.
Probing location can affect evaluation results. Should the probing site's first-hop ISP be unreliable, we would underestimate overall network reliability. We re-analyze surveys collected from three sites (see Section 5.6.2), each with several upstream networks. In Figure 5.10, locations generally alternate; each location is plotted with a different symbol (W: empty symbols, C: filled, J: asterisks), and survey number and location letter are shown at the graph top. Visually, this graph suggests the results are similar regardless of probing site and for many different random samples of targets. Numerically, variation is low: mean outage rate (area) is 0.64% with standard deviation
[Figure: top panel shows availability (%) from 99 to 100; bottom panel shows outages (down events ×1000) and outage rate (%) over time from 2009-10-01 through 2013-04-01; surveys 29 through 53 at sites w, c, and j are marked along the top, with "target list doubled" annotated.]
Figure 5.10: Evaluation of single-site outages in 2-week surveys over three years. Top shows availability, bottom shows Internet events, outages and outage percentage over time. (Dataset varies by time, as shown in the figure.)
of only 0.1%. To strengthen this comparison we carried out Student’s t-test to evaluate
the hypothesis that our estimates of events, outages, and outage rates for our sites are
equal. The test was unable to reject the hypothesis at 95% confidence, suggesting the
sites are statistically similar.
Besides location, Figure 5.10 suggests fairly stable results over time. We see more
variation after 2011, when the size of the target list doubled to about 40k blocks.
These observations are each from a single vantage point, thus they include both global and local outages. Surveys are taken for non-overlapping, two-week periods because each places a significant burden on the subject networks. Trinocular's much lower traffic rate to targeted blocks (1% that of a survey) allows outage detection to overcome both of these limitations. As demonstrated in Section 5.6.1, it can operate concurrently from three sites. We plan to carry out continuous monitoring as Trinocular matures.
5.7 Conclusions
Trinocular is a significant advance in the ability to observe outages in the network edge.
Our approach is principled, using a simple, outage-centric model of the Internet, pop-
ulated from long-term observations, that learns the current status of the Internet with
probes driven by Bayesian inference. We have shown that it is parsimonious, with each
instance increasing the burden on target networks by less than 0.7%. It is also predictable
and precise, detecting all outages lasting at least 11 minutes with durations within 330 s.
It has been used to study 3.4M blocks for two days, and to re-analyze three years of
existing data, providing a new approach and understanding of Internet reliability.
In this chapter, we showed that more sophisticated sampling can bring down the probe rate without much harm to accuracy, by exploring principled low-rate sampling, which adaptively probes addresses within each block to obtain accurate outage information. Guided by the Bayesian inference framework, we reduce the mean probe rate by more than 80% as compared to Chapter 3. With both time- and space-wise sampling, we adaptively probe inside each of 3.5M /24 blocks continuously and precisely. Aggregating data from multiple vantage points, we also distinguish outages visible from only a local view from global outages.
With the techniques introduced in Trinocular, in the next chapter we will conduct a large-scale analysis of Internet outages and diurnal usage, and also explore broader policy effects.
Chapter 6
Evaluating Policy Effects on Internet Usage
In this chapter, we present how we use adaptive sampling to dynamically track Internet block availability and then discover diurnal usage and outages. We use a different approach to aggregation in this chapter: we first correlate Internet blocks to other domains, and then aggregate on each domain for different views of Internet management. With these techniques, we are able to start to evaluate various policy effects on how the Internet is managed.
6.1 Motivation for Evaluating Policy Effects
Nowadays, a reliable Internet connection is crucial for almost everyone, because many applications, both on regular personal computers and smart-phones, heavily rely on it. For example, with mobile technologies (3G/4G networks), "anytime, anywhere" access is now a must for many people. As another example, more and more airlines are responding to customer demand by equipping their aircraft with on-board Internet connections.
Besides reliability, many other policies are also important, because they affect how the population actually uses the Internet. Tiny delays in startup times affect search engines like Google and their revenues. Similarly, diurnal control of network blocks affects the ability of people to depend on their Internet. Lack of always-on networks affects latency, and dynamic addressing prevents people from deploying home servers.
Prior studies are very limited in identifying these factors, much less in comparing their effects across states and countries. Our goal is to quantitatively understand how policies affect Internet usage in different parts of the world. For example, we know that some outages are caused by political decisions [QHP12a, QHP13c]; some countries use network blocks on a diurnal basis because of economic constraints (saving power bills); and different access technologies (and organizations) lead to different outage rates. Beyond these anecdotal understandings, we would like to systematically understand which factors contribute most to a given network metric of interest.
6.1.1 Contributions
To study policy effects on Internet use, we make several contributions in this chapter.
First, we develop an adaptive approach to accurately estimate block availability over time, and use spectral analysis to identify diurnally used blocks (Section 6.2). This approach also enables us to do outage detection on 3.7M responsive blocks.
Second, we use our tool to provide the first long-term, wide-scale analysis of diurnal network behavior and outages (Section 6.4). We provide thorough analysis of the characteristics of diurnally used blocks and outages.
Third, we correlate diurnal usage and outages to other domains, such as countries, organizations and link types. With our 35-day reachability dataset and the CIA's world factbook [CIA], we also quantitatively correlate to economic factors (GDP and electricity consumption, Section 6.5). We use a rigorous statistical tool, ANOVA, to evaluate some interesting research questions: How do per-capita GDP and IP resources affect diurnal usage of IP blocks? How do electricity consumption and political decisions affect outage rate?
6.1.2 Relation to Thesis
This study supports our thesis that adaptive sampling in time can be used to accurately track block availability, which serves as the basis for analyzing diurnal blocks. We overcome resource constraints similar to those in the prior chapter: the large number of 3.7M Internet blocks. We adopt new ideas of aggregation in this study. Instead of directly aggregating over Internet blocks, we first map the IP address space to other, more useful spaces, such as geolocation and link types. We then aggregate over those new domains to find correlations, uncovering indirect and subtle knowledge about diurnal usage and outages.
6.2 Methodology
In this section, we first introduce how we dynamically measure block availability with
minimal probes, and integrate with outage detection in a previous system, Trinocular.
We then introduce how we apply spectral analysis (FFT) to detect diurnal blocks in the
world. To enable evaluation of reliability of more useful entities, we also map IPv4 /24
blocks to other spaces: Autonomous Systems (ASes), geolocations, and organizations
(ISPs). Finally, we introduce how we use statistical factorial analysis to correlate diurnal
and outage data to policy factors.
6.2.1 Dynamic Tracking of Block Availability
Our goal is to dynamically and accurately estimate availability (the A value) of each block, while doing adaptive probing to track Internet outages (no extra probes). Specifically, our goal is to find an operational Â_o value, given a series of observations.
Problem Statement
A constraint in estimating A is that we must minimize the number of observations. While
it would be easy to estimate A by probing all addresses in the block (what we do for
ground truth in our experiments), in operation we wish to estimate A with no additional
probes beyond those required for outage detection.
There are two challenges that make A estimation difficult in this case. First, sampling in outage detection is biased. The goal of outage detection is to make the minimum number of probes to reduce stress on the target, so a few or even one positive response is usually sufficient to terminate probing. Probes are thus biased in favor of positive responses.
The second challenge is that the operational Â_o estimate must not be too high. When Â_o > A, a few negative probes are an indication that the block is down and Trinocular will produce false outages.
Two additional challenges are that observations are often quantized, and that our initial A estimate may be quite inaccurate. Because we take from 1 to 15 observations per round, precision of a new A value is at most ±0.07 and could be as coarse as 0.5 or 1. Short-term estimates therefore show significant jitter. Our initial estimates for A are based on historical data over several years. They may be off significantly if block usage has changed.
Approach
Overall, our approach uses an exponentially weighted moving average (EWMA) on our observations. From observations we form both short-term (Â_s) and long-term estimates (Â_l, d̂_l) and use them to derive our operational value (Â_o).
For each block, in each round of adaptive probing, we observe p positive responses and t total responses. We first model short-term EWMAs of positive and total responses to quickly adapt to real values:

p̂_s = α_s p + (1 − α_s) p̂_s
t̂_s = α_s t + (1 − α_s) t̂_s

We can then calculate the short-term availability Â_s:

Â_s = p̂_s / t̂_s
For this short-term estimate, we set a gain of α_s = 0.1, to quickly adapt to changes.
We estimate Â_s to show what data adaptive probing provides about the block, plotting it as a cyan line (in Figures 6.2 through 6.7). We can then compare it to the true A taken from complete survey data. We see that Â_s is quite noisy (see Figure 6.2). By considering short-term trends it shows diurnal effects (see Section 6.4.1 and Figure 6.6). We do not use this value in outage detection, but use it to understand block behavior.
We next model long-term estimates of availability (Â_l) to derive a conservative operational value Â_o. We model the long-term versions of positive and total responses, and availability Â_l, with α_l = 0.01:

p̂_l = α_l p + (1 − α_l) p̂_l
t̂_l = α_l t + (1 − α_l) t̂_l
Â_l = p̂_l / t̂_l
Our goal in a conservative Â_o is that it should not exceed the true value. To reflect this goal, we evaluate the absolute deviation of each sample from the estimate and then intentionally underestimate the operational value by this amount. We model a deviation term d̂_l, and use it to compute a more conservative Â_o (note that we limit Â_o to a minimum of 0.1):

d̂_l = α_l |Â_l − p/t| + (1 − α_l) d̂_l
Â_o = max(Â_l − ½ d̂_l, 0.1)
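The update rules above can be collected into a small estimator; a minimal sketch under the stated gains (α_s = 0.1, α_l = 0.01), with class and variable names of our own choosing:

```python
class AEstimator:
    """Track short-term, long-term, and conservative operational
    estimates of block availability from per-round probe counts."""

    def __init__(self, a_init, alpha_s=0.1, alpha_l=0.01):
        # Seed the accumulators so the initial ratios equal a_init.
        self.p_s, self.t_s = a_init, 1.0
        self.p_l, self.t_l = a_init, 1.0
        self.d_l = 0.0
        self.alpha_s, self.alpha_l = alpha_s, alpha_l

    def update(self, p, t):
        """Fold in one round: p positive responses out of t probes."""
        a_obs = p / t
        self.p_s = self.alpha_s * p + (1 - self.alpha_s) * self.p_s
        self.t_s = self.alpha_s * t + (1 - self.alpha_s) * self.t_s
        self.p_l = self.alpha_l * p + (1 - self.alpha_l) * self.p_l
        self.t_l = self.alpha_l * t + (1 - self.alpha_l) * self.t_l
        # Deviation of this sample from the long-term availability.
        self.d_l = (self.alpha_l * abs(self.p_l / self.t_l - a_obs)
                    + (1 - self.alpha_l) * self.d_l)

    @property
    def a_s(self):
        return self.p_s / self.t_s      # short-term availability

    @property
    def a_o(self):
        # Conservative operational value: underestimate the long-term
        # availability by half the tracked deviation, floored at 0.1.
        return max(self.p_l / self.t_l - 0.5 * self.d_l, 0.1)
```

With a steady block answering 3 of 4 probes per round, the short-term estimate converges quickly to 0.75 while the operational estimate stays at or below it, as intended.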
Old Approach
Our 35-day dataset A_12all (described in Section 6.2.1) uses an older approach to approximate Â_o. We show in Section 6.3.1 that this approach does not over-estimate Â_o and therefore it does not change our results significantly. However, we plan to switch to our new approach for new data collection. The (old) short-term estimate noticeably over-estimates Â_s; we have re-analyzed A_12all to use the new Â_s for our diurnal analysis.
For each block, in each round of adaptive probing, we observe o_p positive and o_n negative probes, giving o_t = o_p + o_n tries per round. Each round then provides a single observation of A:

o_A = o_p / o_t
Our assumption is that blocks are up or down, and our goal is to estimate A when the block is up. If outage detection determines the block is down, we usually discard the observation; however, we do adjust estimates if any probes are positive.
Similar to RTT estimation in TCP [Jac88], we then use EWMA to compute Â_s, a short-term estimate of A:

Â′_s = α_s o_A + (1 − α_s) Â_s
For this short-term observation, we set a gain of α_s = 0.1 (so an estimate has a "weight" of about 0.35 after 10 rounds pass). When we begin, we start with Â_s = Â_h, the estimate computed from census history.
Outage detection depends on finding a good operational estimate Â_o. For this, we compute a long-term estimate adjusted for the bias in Â_s. Our operational estimate begins with a long-running estimate of the mean:

m̂′ = α_l o_A + (1 − α_l) m̂

where α_l = 0.01, a long visible window. (We initialize this value from history with m̂ = 0.5 Â_h.)
To account for noise and bias, we compute the prediction error (average absolute deviation) of each observation: d_A := |m̂ − o_A|. We then track the EWMA of this error:

d̂′ = α_d d_A + (1 − α_d) d̂

We use the same long-duration gain, with α_d = α_l = 0.01, and initialize d̂ with 0.
Finally, we compute the operational estimate Â_o:

Â_o = max(m̂ − z d̂, 0.1)

where z is a constant that determines how much to weigh deviation, and we cap Â_o at a lower limit (0.1).
To select z, we draw inspiration from statistics. If the process were stationary, the error were Gaussian, and we estimated standard deviation, selecting z = 1.96 would place 95% of cases (of the true A value) above Â_o. However, after checking the underlying process with a random sample of 65 blocks where we have full data, we find that z = 1 is a more reasonable parameter (satisfying Â_o ≤ A), placing 90% of cases below the true A value. We thus use z = 1 in our system.
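For comparison, the old approach can be sketched end-to-end; a minimal illustration with z = 1 and hypothetical function and variable names:

```python
def old_a_estimator(observations, a_h, alpha_s=0.1, alpha_l=0.01, z=1.0):
    """observations: list of (positive, total) probe counts per round;
    a_h: historical availability estimate. Returns (a_s, a_o)."""
    a_s = a_h            # short-term EWMA of per-round o_A
    m = 0.5 * a_h        # long-running mean, initialized from history
    d = 0.0              # EWMA of absolute prediction error
    for o_p, o_t in observations:
        o_a = o_p / o_t
        a_s = alpha_s * o_a + (1 - alpha_s) * a_s
        m = alpha_l * o_a + (1 - alpha_l) * m
        d = alpha_l * abs(m - o_a) + (1 - alpha_l) * d
    a_o = max(m - z * d, 0.1)    # conservative operational estimate
    return a_s, a_o
```

Because m is seeded at half the historical value and a_o subtracts the deviation term, the operational estimate stays below the observed availability for a stable block.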
Recap of Trinocular: Adaptive Probing to Study Internet Reliability
In order to understand reliability globally, we need an efficient and accurate active probing tool to track all of the Internet, all the time. To achieve this goal, passive systems are not suitable because of coverage, and other current active systems are either less complete or less accurate [QHP13c]. So we have developed a principled active probing system, Trinocular [QHP13c]. Trinocular works with minimal traffic to establish belief of the network state of each of 3.7M analyzable IPv4 /24 blocks.
However, this previous system assumed accurate initial estimates of block availability (A). In practice, this is not always the case, and we had to filter out those blocks with inaccurate A estimates. We improve Trinocular with accurate A estimation in Section 6.2.1, coupled with regular probes (we don't require extra probes just for establishing an accurate A).
We collected data tracking outages on 3.7M Internet edge network blocks over 35 days, starting at 2013-04-24 17:18 UTC, from three universities, labeled w, c, and j, in Los Angeles, Colorado, and Japan. This experiment produces dataset A_12all, which consists of three sub-datasets: A_12w [USC13c], A_12c [USC13a], and A_12j [USC13b]. We merge the sub-datasets (A_12w, A_12c, A_12j) into A_12all by calculating overlaps for global outages and disagreements for partial outages. Details are described in our Trinocular paper [QHP13c].
6.2.2 Diurnal Detection Algorithm
Block availability indicates the number of current users of the block. From that value we next wish to understand if there are periodic changes to that value. We know that use of blocks changes over the course of a day as the people who use computers come and go. Other blocks change as ISPs automatically reassign dynamic IP addresses.
Periodic behavior will appear in frequency analysis of address usage. We consider the block availability A as a timeseries with an estimated value Â_s sampled every 11 minutes.
Spectral analysis of timeseries data requires regular input aligned with the sampling interval (in our case 11 minutes). However, because our probing software is not perfectly aligned with 11-minute rounds, we sometimes see missing or duplicate observations in a round. Like previous work [QHP12a], we correct this minor drift in analysis, by extrapolation (for missing observations) or trusting the most recent observation (for duplicates). Future work can improve this process with interpolation of Â_s.
Another potential source of inaccuracy is that some (adaptive) probes could cross round boundaries. To suppress this, we currently shorten the round by 45 s (the duration of 15 probes) for datasets earlier than A_15. This shortened round behavior could lead to double entries per round, which we deal with in post-facto analysis just like duplicates. For probes that still cross round boundaries, we terminate them, usually leaving the old round in the unknown state. This is a very rare case, happening in a small fraction (1.16%) of our target blocks. Because we use a fixed ordering of blocks (after an initial randomization), this rare case often happens to the same blocks. However, omission of these blocks does not bias our results because target order (and therefore which blocks are affected) is randomized at start-up. Overall, we believe this doesn't affect our overall results.
We use Fourier analysis to extract periodic behavior from this timeseries. We use a discrete FFT: given a timeseries a_m (m ∈ [0, n−1]) of n input samples, we compute its frequency-domain components as:

A_k = Σ_{m=0}^{n−1} a_m exp{−2πi mk/n},  k = 0, …, n−1
We then examine the magnitude of A_k to find the strength of the kth frequency component (sub-signal) of a_m. Since our sampling period is 11 minutes, A_k shows the frequency at k F_s / n, where F_s = 1/660 Hz, our sampling period of 11 minutes (660 s). Because we know our probing period is N_d days (14 for surveys, and 35 for A_12all), and each day is 86,400 seconds, we can compute the total number of samples n = N_d · 86400 · F_s. So A_k represents frequency k F_s / (86400 N_d F_s) = k / (86400 N_d) Hz, or k / N_d cycles per day.
Detect diurnal blocks: We believe that daily fluctuations in availability correspond to diurnal blocks. If there is diurnal behavior in a block, we should expect periodicity at 24-hour intervals (and multiples of 24 hours). We compute sub-signals based on our probing duration and sample frequency, and examine the frequency corresponding to diurnal behavior at 1 cycle per day. In Section 6.4.1, we list several examples of our frequency-domain analysis. From these examples, we establish our diurnal block detection algorithm: if the highest amplitude of non-zero frequency occurs at 1 cycle per day (corresponding to k = N_d) and is at least twice as strong as the next strongest frequency, then this timeseries suggests a very strong diurnal pattern.
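The detection rule can be sketched with a discrete FFT; a minimal illustration using numpy on synthetic series (in operation the input would be the Â_s timeseries):

```python
import numpy as np

def is_diurnal(a_s, n_days):
    """True if the strongest non-zero frequency bin is at 1 cycle/day
    (bin k = n_days for an n_days-long record) and at least twice as
    strong as the next strongest bin."""
    mags = np.abs(np.fft.rfft(a_s))
    mags[0] = 0.0                        # ignore the DC (mean) component
    k = int(np.argmax(mags))
    if k != n_days:                      # bin n_days <=> 1 cycle per day
        return False
    rest = np.delete(mags, k)
    return bool(mags[k] >= 2 * rest.max())

# Two weeks sampled every 11 minutes: 86400 // 660 = 130 samples per day.
n_days, spd = 14, 86400 // 660
t = np.arange(n_days * spd)
diurnal = 0.5 + 0.3 * np.sin(2 * np.pi * t / spd)  # one cycle per day
flat = np.full(len(t), 0.5)
```

The synthetic diurnal series concentrates its energy in bin k = N_d = 14, while the flat series has no non-zero frequency content, so only the former is flagged.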
Stationarity: FFT over data that is non-stationary or too short can distort analysis
of periodic behavior. Our datasets are two or more weeks long, so they capture many
diurnal periods and should not be distorted by weekly periodicity. We verified our data
[Figure: CDF of slope (address changes / day), slopes ranging from −40 to 40.]
Figure 6.1: Checking stationarity of all blocks in Survey S_51w, showing the slope of linear regression on true availability A.
is roughly stationary by doing a linear fit over the duration and confirming slopes are near zero.
To further validate the stationarity of the blocks we study, we check all 29,001 blocks in Survey S_51w. We do linear regression on the true availability A, getting a slope per block. We plot the cumulative distribution of slopes in Figure 6.1. We observe that 80.3% of these blocks are stationary, with a slope equivalent to less than one address change per day. We randomly selected 10 extreme cases with large slopes, and find that they are mostly due to a period of no data, or a sudden change of availability which is most likely a repurposing of the block. Because blocks are randomly selected, this experiment means that most Internet blocks behave in a stationary manner, thus our FFT analysis is not distorted by non-stationarity.
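The stationarity check can be sketched as a per-block least-squares fit; a minimal illustration with numpy on synthetic series, assuming availability A is expressed as a fraction of the /24's 256 addresses:

```python
import numpy as np

def slope_addresses_per_day(a, samples_per_day):
    """Least-squares slope of an availability timeseries A (fraction of
    addresses responsive), converted to address changes per day."""
    days = np.arange(len(a)) / samples_per_day
    slope = np.polyfit(days, a, 1)[0]    # change in A per day
    return slope * 256                   # a /24 has 256 addresses

spd = 86400 // 660                       # 130 samples/day at 11 minutes
stationary = np.full(14 * spd, 0.5)
drifting = 0.5 + 0.05 * np.arange(14 * spd) / spd   # A grows 0.05/day
```

A stationary block yields a near-zero slope, while a block gaining 0.05 in availability per day maps to 12.8 address changes per day, well past the one-address-per-day threshold.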
6.2.3 Other Network Factors: Geolocation, Organizations, and Link Types
Although we analyze network behavior at the block level, address proximity is not the only factor that correlates with network reliability: physical location, logical location (the organization operating the address), and last-mile link technology are all factors that may affect network operation. We therefore map IP addresses to these factors, building largely on existing mechanisms as described below.
Geolocation: Different countries and regions have different policies and economics for networking. To understand these effects we use IP geolocation to get city-level physical locations. While there have been many studies of geolocation [MPD00, oI, PS01, MV09, SZ10, GZCF06, KBJK+06, Max13, HH12], we use MaxMind's city-level database [Inc13] as a free and widely used source. Although this source is not the most accurate (claimed accuracy is 40 km), it is sufficient to demonstrate at least country-level correlation.
We map each /24 block to a location. Although sometimes blocks span multiple locations, such instances are fairly rare. Our analysis of existing per-IP data [HH12] shows about 93% of /24 blocks have homogeneous locations, and our MaxMind dataset specifies location with at most /24 precision.
Organization: Different organizations have different policies in how they use IP addresses. We therefore map IP addresses to Autonomous System numbers, and ASes to organizations.
We use data from Team Cymru for IP to AS number mapping [Cym]. We map each /24 to an AS, since subdividing /24s across ASes is rare and does not appear in Team Cymru's data (confirmed by comparing .0 and .128 with Cymru's data). We use the first address (.0) to represent each block. We find mappings for 99.41% of /24 blocks; AS names are also provided for convenience. This step serves as the base for further analysis.
We map ASes to organizations using prior work that uses WHOIS and string-based clustering [CHKW10]. For a given organization or ISP P (for example, Time Warner Cable), we first use keyword matching (e.g., "Time Warner") to find relevant clusters, then find all ASes within the same cluster(s). Finally, for all ASes within P, we join with the IP-AS mapping and find all relevant IP blocks for P.
The above method assumes good accuracy of the IP-AS and AS-organization mappings. In future work, we wish to look at how the accuracy of the above mappings affects our IP-to-organization mapping. Another interesting direction is to compare performance of different ASes within the same organization.
Link type: We define the link type as the technology connecting the final hop to the destination. Identification of link types for blocks is difficult because such information is not readily available, and because in principle different parts of a /24 block could be connected differently (for example, a router at .1 and the rest of the addresses with dial-up). Previous work has shown one can often infer link types from reverse domain names [CH10b, SS11b, Moc87, Int07]. While existing data was too sparse or old to reuse, we expanded this idea as follows.
First, we look up the reverse domain name of each address of each analyzable block. We then use string matching to non-exclusively classify each address to one or more types of links (see Section 6.5.4 for more details of link types). For example, we classify an address as static if its domain name has the word "static", while dhcp-dialup-001.example.com would be marked both dhcp and dial-up. We then aggregate the data within each /24 block, getting a vector of link types (features) for each block. We suppress minor features in each block by filtering out features that are less than 1/15th of the most frequent feature. Finally, we label the block with all remaining features that have non-zero counts.
Our idea of classifying blocks by domain names is inspired by Thunderping [SS11b]. However, Thunderping's link classification was done manually and only for the United States; we wish to study millions of international blocks.
We consider 16 keywords (sta, dyn, srv, rtr*, gw*, dhcp, ppp, dsl, dial, cable, ded*,
res, client*, sql*, wireless*, wifi*). Of these, we discard the seven marked with asterisks
because they are dominant in less than 1000 blocks.
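A minimal sketch of this classification and aggregation, assuming a subset of the retained keywords and the 1/15 suppression threshold described above (the helper names are illustrative):

```python
from collections import Counter

# Subset of the retained keywords from the text (illustrative).
KEYWORDS = ["sta", "dyn", "srv", "dhcp", "ppp", "dsl", "dial", "cable", "res"]

def classify(hostname):
    """Non-exclusively tag one reverse name with all matching link types."""
    return {k for k in KEYWORDS if k in hostname.lower()}

def block_features(hostnames, ratio=15):
    """Aggregate per-address tags for one /24 block, suppressing minor
    features whose count is below 1/ratio of the most frequent feature."""
    counts = Counter()
    for h in hostnames:
        counts.update(classify(h))
    if not counts:
        return set()
    top = max(counts.values())
    return {k for k, c in counts.items() if c * ratio >= top}

# Toy /24: 150 static names plus one dhcp/dial-up name, which gets
# suppressed as a minor feature.
names = (["static-%d.example.com" % i for i in range(150)]
         + ["dhcp-dialup-001.example.com"])
print(block_features(names))
```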
For the 3.7M blocks in dataset A_12all, we find that 46.3% of these blocks have some feature (4 more than Thunderping), and 11.4% have multiple features.
6.2.4 Factorial Analysis with ANOVA
To quantify the correlation of policies, places, and technologies with Internet usage, we need to weigh the factors that contribute to network outcomes such as diurnal address usage and network reliability. While anecdotal and personal evidence suggests that individuals in some countries turn off computers at night to save power, we wish to measure these effects.
To discover correlations between a range of possible factors and network outcomes we use analysis of variance, or ANOVA, a form of statistical hypothesis testing. It tests the probability (p-value) of a factor or factors (separately or in combination) being correlated with an observation. A high p-value indicates that the factor has little correlation with the outcome, while a low p-value indicates that the correlation of the factor and outcome is unlikely to have arisen by chance alone. In practice, p-values less than a threshold (0.05 or 0.01) are typically considered statistically significant, because a low p-value rejects (with high probability) the null hypothesis: that the input variable(s) and output variable are unrelated.
We refer interested readers to more comprehensive and mathematically rigorous explanations of this subject in related work [Mil97, Wik, Wol]. We use the open source R software for our ANOVA analysis [GNU].
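The dissertation performs this analysis in R. As an illustration of what the test computes, here is a dependency-free one-way ANOVA F statistic in Python; the p-value would then come from the upper tail of the F(k-1, N-k) distribution at this statistic (e.g., via an F-distribution table or a statistics library).

```python
def one_way_anova_F(groups):
    """F statistic for a one-way ANOVA: mean between-group variance
    divided by mean within-group variance. A large F corresponds to a
    small p-value, rejecting the null hypothesis that the groups share
    a common mean."""
    k = len(groups)                                  # number of groups
    n = sum(len(g) for g in groups)                  # total observations
    grand = sum(sum(g) for g in groups) / n
    ss_between = sum(len(g) * (sum(g) / len(g) - grand) ** 2
                     for g in groups)
    ss_within = sum((x - sum(g) / len(g)) ** 2
                    for g in groups for x in g)
    ms_between = ss_between / (k - 1)
    ms_within = ss_within / (n - k)
    return ms_between / ms_within

# Two toy "countries" with clearly different outcomes give a large F
# (hence a small p-value).
print(one_way_anova_F([[1, 2, 3], [5, 6, 7]]))  # -> 24.0
```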
6.3 Validation
6.3.1 Validating Block Availability Tracking
We introduced how we dynamically track block availability with EWMA in Section 6.2.1. To confirm that we do this accurately, we first list several examples of different kinds of blocks, varying availability and sparseness (see Figures 6.2 through 6.7). We observe that our system provides reasonable tracking for a wide range of A values: from A near 1 with both dense (Figure 6.3) and sparse (Figure 6.5) blocks, to medium values of A = 0.75 in Figure 6.2, and to low values of A = 0.2 in Figure 6.4. Our system also correctly handles blocks that have large diurnal variations, as in Figures 6.6 and 6.7.
We also want to make sure we are insensitive to a bad initial estimate. Although our system converges relatively slowly, it does converge after about three days, even from a bad initial value (Figures 6.3 and 6.5). In rare cases a network block could be re-purposed, changing its availability dramatically (just like the case of a bad initial estimate). Currently we handle this case slowly, over a few days, due to the limits of the EWMA algorithm. In future work, we will explore faster ways to respond to this case.
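A minimal sketch of EWMA-based availability tracking is below. The exact update rule and parameters used by the system are defined in Section 6.2.1; the simple form and the gain below are simplifying assumptions.

```python
def ewma_track(observations, a_hat=0.5, alpha=0.1):
    """Minimal EWMA availability tracker (assumed form): each round's
    observed fraction of responsive addresses nudges the running
    estimate a_hat by a factor alpha."""
    history = []
    for obs in observations:
        a_hat = (1 - alpha) * a_hat + alpha * obs
        history.append(a_hat)
    return history

# From a bad initial estimate (0.5), tracking a block whose observed
# availability is steadily 0.9 converges geometrically: the initial
# error shrinks by (1 - alpha) each round.
est = ewma_track([0.9] * 400, a_hat=0.5, alpha=0.1)
print(round(est[-1], 3))
```

This also shows why convergence from a bad initial value is slow: with a small alpha, roughly 1/alpha rounds are needed to forget most of the initial estimate.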
Accuracy: The above examples are useful for manual inspection. However, for a more complete study, we turn to another data source with many more blocks: Survey S_51w, with 29k /24 blocks. We run our availability tracking algorithm on Survey S_51w, and get a timeseries of estimated availability Â (both Â_s and Â_o) for each block, over the
Figure 6.2: Block 1.9.21/24 (0x010915/24), |E(b)| = 42, A = 0.735. Sparse but high availability block, with an outage. Black: ground truth A. Cyan: estimated A (Â_s). Blue: operational A (Â_o). Red dots: number of probes used in each round (shown on the right axis). Magenta: outages. [Plot of A vs. time in rounds; α_s = 0.1, α_l = 0.01, α_d = 0.01, probe overhead 0.0%.]
2-week survey period. To validate the accuracy of our estimates of block availability, we compute the correlation coefficient between true block availability A and our estimated values. We find good correlation between the true A value and the estimated Â_s, with an overall correlation coefficient of 0.95685. To examine this correlation more quantitatively, we plot A and Â_s from all 29k blocks and all rounds over 2 weeks in a density scatter plot in Figure 6.8 (a similar plot for the old algorithm appears in Figure 6.10). We normalize the density by the product of the number of blocks and the number of rounds. The dense cluster near the "perfect correlation" line y = x shows this correlation visually. Similarly, we plot the correlation between A and operational Â_o in Figure 6.9 (old algorithm in Figure 6.11). Our operational Â_o values are almost always (94% of the time) under
Figure 6.3: Block 23.46.151/24 (0x172e97/24), |E(b)| = 249, A = 0.991. Dense and high availability block, with an outage. Note that the slow convergence is due to a low initial value. [Plot of A vs. time in rounds; α_s = 0.1, α_l = 0.01, α_d = 0.01, probe overhead -0.11%.]
true A, which is desirable for our probing purposes. Since we do not probe very sparse blocks, we omit cases where Â_o < 0.1.
False outages: Besides the accuracy of A, an end-to-end result we care about is false outages: cases when we decide a block is down (using Â_o), but in fact there are responding addresses according to the oracle A. To see how we handle such cases, we evaluate all blocks and rounds in Survey S_51w, checking the false outage rate of our approach (with estimated Â_o) against an oracle approach (with true A). We find that the false outage rate of our approach is only 0.11%, and conclude that our approach rarely misreports outages.
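The false-outage comparison can be sketched as follows, on a toy trace; the real evaluation iterates over all blocks and rounds of Survey S_51w.

```python
def false_outage_rate(est_down, oracle_responsive):
    """Fraction of (block, round) observations where we declare an
    outage but the oracle still shows responsive addresses."""
    false_outages = sum(1 for down, alive in zip(est_down, oracle_responsive)
                        if down and alive)
    return false_outages / len(est_down)

# Toy trace: 1000 observations, one false outage -> rate 0.1%.
est = [False] * 1000       # do we declare the block down this round?
oracle = [True] * 1000     # does the oracle see responsive addresses?
est[10] = True             # one round where we say down but oracle disagrees
print(false_outage_rate(est, oracle))
```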
Figure 6.4: Block 93.208.233/24 (0x5dd0e9/24), |E(b)| = 245, A = 0.191. Dense but low availability block. [Plot of A vs. time in rounds; α_s = 0.1, α_l = 0.01, α_d = 0.01, probe overhead 0.0%.]
6.3.2 Validating Diurnal Blocks
We observe strong daily patterns in Figures 6.6 and 6.7, where all addresses go online at the same time each day, resulting in abrupt variations of availability. These examples show that we can observe diurnal behavior in network blocks; they motivate our use of spectral analysis to detect it (Section 6.2.2). We next examine the accuracy of our detection method with spectral analysis, using two sources of data for validation: simulation data and survey data.
Figure 6.5: Block 81.80.129/24 (0x515081/24), |E(b)| = 43, A = 0.964. Medium density but high availability block. Note that the slow convergence is due to a low initial value. [Plot of A vs. time in rounds; α_s = 0.1, α_l = 0.01, α_d = 0.01, probe overhead 0.0%.]
Validation with Simulated Diurnal Blocks
We first simulate a diurnal response so that we can study detection in the face of controlled levels of noise. Since we control the simulated block, we know the ground truth against which we test our detection algorithm.
We simulate one /24 block (256 addresses), evaluating responses over 11-minute rounds for a duration of 4 weeks. In that block, 50 addresses are stable and always responding, and n_d = 100 addresses are diurnal, with the remaining addresses not active. Diurnal addresses are responsive for 8 hours and down for 16 hours each day. Each diurnal address i turns on at a certain time during the day, the phase φ_i.
We evaluate the effects of three kinds of variation in this model. First, we vary phase φ, selecting φ_i for each address (once, at simulation start) uniformly from the range [0, φ].
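A sketch of this simulation is below, using the parameters from the text (50 stable addresses, n_d = 100 diurnal addresses, 8 hours up per day) and assuming about 131 rounds of roughly 11 minutes per day; the simulator details beyond the text are assumptions.

```python
import random

def simulate_block(rounds_per_day=131, days=28, n_stable=50, n_diurnal=100,
                   max_phase_hours=0, seed=1):
    """Per-round count of responsive addresses in one simulated /24:
    stable addresses always answer; each diurnal address is up 8 hours
    a day starting at its own phase, drawn once uniformly from
    [0, max_phase_hours]."""
    rng = random.Random(seed)
    up = rounds_per_day * 8 // 24                # 8-hour up period, in rounds
    phases = [rng.uniform(0, max_phase_hours / 24 * rounds_per_day)
              for _ in range(n_diurnal)]
    series = []
    for t in range(rounds_per_day * days):
        tod = t % rounds_per_day                 # time of day, in rounds
        n_up = sum(1 for p in phases if (tod - p) % rounds_per_day < up)
        series.append(n_stable + n_up)
    return series

# With zero phase spread, all 100 diurnal addresses rise and fall together:
s = simulate_block()
print(min(s), max(s))
```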
Figure 6.6: Block 27.186.9/24 (0x1bba09/24), |E(b)| = 256, A = 0.598. Diurnal block. [Plot of A vs. time in rounds; α_s = 0.1, α_l = 0.01, α_d = 0.01, probe overhead 0.55%.]
We also sometimes add normally-distributed noise, drawn each day for each address, to the start time of up periods with variance σ_s, and to the duration of up periods with variance σ_d. We select the variation of phase (φ_i) once for each address i per experiment.
We vary each of these parameters to study its effects. In each simulation below, we run 10 batches of experiments, each batch with 100 experiments, and report accuracy: the percentage of correct detections of diurnal blocks within the 100 experiments. The error bars show medians and quartiles across the 10 batches.
We use our adaptive availability tracking algorithm (Section 6.2.1) to estimate the block's availability (Â_s) over the 4-week period, and then apply our diurnal detection algorithm (Section 6.2.2). We find that we detect 100% of these simulated diurnal blocks in the simplest case, when there is no noise (φ = σ_s = σ_d = 0).
Figure 6.7: Block 2.134.216/24 (0x0286d8/24), |E(b)| = 256, A = 0.408. Another diurnal block. [Plot of A vs. time in rounds; α_s = 0.1, α_l = 0.01, α_d = 0.01, probe overhead 0.0%.]
We next vary the above parameters to understand sensitivity, allowing more varied usage, including variations in the number of diurnal addresses n_d (Figure 6.12), up and down phases φ (Figure 6.13), and uptime variations σ_d (Figure 6.14).
We first explore the effect of how many addresses are diurnal. In Figure 6.12, we vary n_d, the number of diurnal addresses, from 1 to 100 (or 2% to 67% of responsive addresses). Our detection accuracy improves quickly: beyond 10 diurnal addresses (or 17% of responsive addresses), accuracy exceeds 85%. Examination of several cases shows the misses with n_d < 10 occur because n_d is only a small fraction of the 50 always-on addresses, and our probing will usually pick an address among the stable ones and stop.
Figure 6.8: Correlation graph showing actual availability A and estimated availability Â_s, as a density scatter plot. Density is normalized by the product of the number of blocks and rounds. Quartiles show Â_s aggregated into (0.1 increment) A bins.
We next consider varying phase: we return to n_d = 100 but now select φ_i randomly, picking a value linearly distributed between 0 and φ to decide when each address starts its up period during the day. We observe a sharp drop when the maximum φ reaches 14 hours. At this threshold, the variations in phase cause the signals from different addresses to smooth together, making diurnal effects impossible to extract from the interference. We have plotted and confirmed this effect varying the maximum φ from 0 to 24 hours. In Table 6.1, we observe that at φ = 14 hours, the combined signal starts to smooth out, without strong diurnal patterns. Note that human phase variation is usually within a few hours, in which case we are able to detect diurnal blocks correctly.
Figure 6.9: Correlation graph showing actual availability A and operational availability Â_o, as a density scatter plot. Density is normalized by the product of the number of blocks and rounds. Quartiles show Â_o aggregated into (0.1 increment) A bins.
Finally we consider varying the duration, changing σ_d in Figure 6.14. We vary σ_d from 0 to 24 hours to evaluate our sensitivity to variations in up-time. We find that this variance only slightly affects accuracy, and only for large σ_d (>10 hours). This is because we synchronize the up periods daily, similar to real-world settings where people synchronize clocks and go to work. With a normal distribution, variations in up-time for the sub-signals cancel each other out over time. Considering that ordinary people's schedules vary by only a few hours, this shows that our algorithm works well for a wide range of up-times.
Figure 6.10: (Old algorithm.) Correlation graph showing actual availability A and estimated availability Â_s. Density is normalized by the product of the number of blocks and rounds. Quartiles show Â_s aggregated into (0.1 increment) A bins.
After studying the sensitivity of n_d, φ, and σ_d, we are confident in our availability tracking and diurnal detection algorithms, because we can track blocks with small n_d (less than 20% of stable addresses), large φ (typical human phase variation is less than 4 hours), and large σ_d (typical up time is 6 to 10 hours). We did not study σ_s because its effect is similar to that of φ.
Figure 6.11: (Old algorithm.) Correlation graph showing actual availability A and operational availability Â_o. Density is normalized by the product of the number of blocks and rounds. Quartiles show Â_o aggregated into (0.1 increment) A bins.
Validation with Survey Blocks
While simulations are useful to systematically study different kinds of variation, we next validate diurnal detection with our 2-week survey dataset S_51w, containing 29k blocks, to evaluate accuracy with real-world data. From survey data, we can obtain ground-truth availability (A). We first apply spectral analysis on A to detect the best-available ground-truth diurnal blocks. (We cannot contact the operators of thousands of blocks.) We use the same rule for identifying diurnal blocks as in Section 6.4.1: a block is diurnal if the strongest non-zero signal is once per day, and is at least twice as strong as the next
Figure 6.12: Accuracy of diurnal block detection, varying the number of diurnal addresses (n_d) in each block, so that n_d is 2% to 67% of responsive addresses (including always-on and diurnal addresses). Here φ = σ_s = σ_d = 0.
Figure 6.13: Accuracy of diurnal block detection, varying the maximum phase φ of diurnal addresses in each block, from 0 to 24 hours. Here n_d = 100, σ_s = σ_d = 0.
[Table 6.1 plots omitted: raw Â_s timeseries and its FFT for each maximum phase φ from 0 to 24 hours.]
Table 6.1: Varying maximum phase φ, showing the effect of phase on diurnal block detection.
signal, but here we know that the A value uses all possible external (public) information.
Figure 6.14: Accuracy of diurnal block detection, varying the standard deviation of uptime duration (σ_d) from 0 to 24 hours. Here n_d = 100, φ = σ_s = 0.
We then apply our availability tracking algorithm to get a timeseries of estimated availability (Â_s). We apply spectral analysis on Â_s to detect diurnal blocks with sampled measurements.
Treating full information (A) as ground truth, we compare our sampled result (Â_s) using standard measures from information retrieval in Table 6.2. We see that our method has good precision (82.48%), meaning we rarely falsely label a block as diurnal. We also provide good overall accuracy (90.99%), which represents the ability to correctly track both positives and negatives. Our measurement is conservative in detecting diurnalness, with a fairly high false negative rate. For the comparisons we make in Section 6.5, this bias seems preferable to false positives.
                             full (A)
                   d (diurnal)          n (non-diurnal)
sampled   d̂      2890 (9.97%) tp        614 (2.12%) fp
(Â_s)     n̂      1999 (6.89%) fn      23497 (81.02%) tn
          precision: 82.48%; accuracy: 90.99%
Table 6.2: Validation of diurnal blocks in Survey S_51w (29k blocks), using true availability (A) to compute ground truth, and Trinocular estimated availability (Â_s) to predict.
6.4 Directly Observed Results
In this section, we show directly observable results from our long-term study with A_12all: diurnal blocks and outages. We compare these observations with other factors to identify correlations in the next section (Section 6.5).
6.4.1 Diurnal Blocks
We are interested in blocks with diurnal usage, because they help reveal how different parts of the world use the Internet. We use frequency analysis to find such blocks and provide an overall analysis. In Section 6.5 we will correlate diurnal activity with location and other factors.
Sample Blocks
We first consider two sample blocks to illustrate what a diurnal block looks like. We apply the spectral analysis introduced in Section 6.2.2 to analyze two diurnal blocks. Of these two blocks, Figure 6.15 (raw data in Figure 6.6) is from our 14-day survey S_51w. It is diurnal because the highest non-zero amplitude occurs at a frequency of one cycle per day, the peak in the frequency graph. Another example is the same block, but from our 35-day dataset A_12all, shown in Figure 6.16. We see similar peaks at the
Figure 6.15: FFT components of block 27.186.9/24 (0x1bba09/24), in the 14-day survey S_51w (N_d = 14), showing a strong diurnal pattern because the strongest signal appears at 1 cycle per day.
Figure 6.16: FFT components of block 27.186.9/24 (0x1bba09/24), in the 35-day A_12w (N_d = 35), showing a strong diurnal pattern because the strongest signal appears at 1 cycle per day.
diurnal frequency; the only difference is k = 35, because we run much longer than an Internet survey.
Besides these two diurnal examples, we show another block that is not diurnal, in Figure 6.17.
We believe that the FFT serves as a powerful tool to extract diurnal signals from usually noisy observations. We find that diurnally managed networks are not uncommon: of the 3.7M blocks we study, we classify 412k (or 11%) as diurnal blocks. The rate of diurnal block usage is not uniform across the world; we study the distribution and other characteristics of these diurnal blocks in Section 6.5.
Figure 6.17: FFT components and auto-correlation of block 1.9.21/24 (0x010915/24), in the 14-day survey S_51w (N_d = 14). This is a non-diurnal block.
Daily or other periodicity?
Our test for diurnalness looks strictly for regularity at 24 hours, requiring that to be the strongest periodicity in the block. We focus on daily patterns because they are likely to be related to human use of the Internet. However, another possible source of periodicity is DHCP lease times: if dynamic addresses are allocated for some period p, and given out sequentially across a region that spans multiple /24 blocks, then those blocks will see usage that changes with period p.
To understand different periodicities (24 hours and others), we next examine the strongest amplitudes in the FFT. Figure 6.18 shows the strongest frequencies for all 3.7M blocks in A_12w. As expected, we observe a strong peak at 1 cycle per day (24 hours), covering about 25% of all the blocks. Note that we only declare 11% to be diurnal because we require the strongest signal to be at least twice as strong as the next signal. The second group (3%) is at about 4.3 cycles per day. This periodicity arises because we restart our probing software every 5.5 hours (4.3 times per day) to recover from possible prober failures. A prober restart takes 6.5 minutes on average, so our availability estimates of some blocks are "frozen" and thus show up in the frequency graph.
Figure 6.18: Cumulative distribution of the strongest frequency in the 35-day A_12w.
Long-term Status
Besides sample blocks and the diurnal fraction in a specific dataset, we next look at the long-term diurnal usage trend. In Figure 6.19, we apply our spectral diurnal detection algorithm to more than 3 years of Internet surveys. We observe that the percentage of diurnal blocks is rather stable, mostly in the 10% to 15% range. This result is consistent with that of the 35-day dataset A_12w, reported in Section 6.4.1.
6.4.2 A Month of Internet Outages
We next turn to Internet outages. We have previously worked on characterizing outages [QHP12a] and on adaptive probing with Trinocular [QHP13c]. In this section, we provide a long-term, wide-scale analysis of the Internet's reliability, with 3.7M /24 blocks and 35 days of continuous probing with Trinocular (A_12all).
Figure 6.19: Overall diurnal rate for Internet surveys over time. Each survey is taken at a single site, one of our three probing sites: ISI (w), CSU (c), Japan (j).
We next consider a long-term, continuous, wide-area study of outages. We have previously studied outages in several ways [QHP13c]: detailed, but short and covering 1% of the Internet (11-minute observations over a 2-week duration, for 20k /24 blocks, from one site at a time); detailed, but shorter and broader (11-minute observations over 2 days, but for 3.4M /24 blocks, from three sites); and sparse but longitudinal (about 62 samples, each summarizing 2 weeks, taken over 3 years, each sample from a partially overlapping set of 20k /24 blocks, from one site at a time). We next present the first detailed, medium-term study of all the measurable Internet (11-minute observations for 35 days for 3.4M /24 blocks).
Figure 6.20 shows the outage fraction over all of dataset A_12all for our probing period (35 days). This data combines readings from three sites, so it avoids outages near any single vantage point. We observe that the Internet generally has good reliability, with an outage fraction less than 0.2%. The mean overall outage fraction is 0.164%, with a
Figure 6.20: Overall outage fraction in A_12all. Data is the intersection of outages from three vantage points.
median of 0.159% and quartiles of 0.117% and 0.204%. This result is similar to our previous 3-site analysis [QHP13c], and much lower than single-site observations, showing the importance of multiple perspectives to avoid observations being influenced by the observer.
The main difference from our prior analysis is the significant variation in the outage fraction over the day. This variation results from two factors. First, our current approach with Â_o estimation is more sensitive than our prior results, allowing analysis of millions of blocks at 11-minute intervals. Second, it stems from our definition of outage: a block is out when it stops responding. These outages are largely due to diurnal behavior, where people's networks have few devices on at night. These blocks are "out" by our definition: although they do not have network problems, they have no responsive devices.
Figure 6.21: May 2013 Syria outages observed from all three sites, showing two complete shutdowns.
Case Study: Syrian Outages in May 2013
The Internet is a big place, with all kinds of events affecting its operation and reliability. Here we give an example of how we use our system to find Internet connection problems: the Syrian network shutdowns in May 2013.
Since March 2011, Syria has been in turmoil, and was consequently dragged into a civil war lasting more than two years. During this unstable time, the Syrian government turned off the country's Internet several times. In our data, we observe the effects of such political maneuvers.
Figure 6.21 shows these effects on 1700 blocks geolocated to Syria and covered by our system. We see two complete shutdowns: the first is from 2013-05-07 to -08, for about a day; the second is shorter, about 8 hours, on 2013-05-15. Related work by Renesys also reports that Syria turned off its Internet connection on 2013-05-15, for 8 hours and 25 minutes [Blo13]. Our observation of the second (shorter) outage is thus strongly confirmed by Renesys. However, we are unable to find relevant reports in the Renesys portal regarding the first shutdown that we observe.
We believe that our system can be useful to companies like Renesys, and also to the general public. We cover more cases of politically caused outages, such as the Egypt and Libya upheavals, in our previous work [QHP13c].
Figure 6.21 also shows the background of diurnal outages, as we see in global data (Section 6.4.2). Again, we believe these are diurnal effects of network use.
Next, we explore reasons for both diurnal management and outages, with statistical analysis.
6.5 Indirectly Observed Results
Examination of diurnal behavior and outages by themselves is interesting, but it shows little about why these network behaviors happen. While establishing causality is difficult, we next examine correlations between these observations and longevity on the Internet (Section 6.5.2), economic conditions (diurnal blocks and outages, Section 6.5.3), technologies (Section 6.5.4), and organizations (ISPs, Section 6.5.5).
6.5.1 Location of diurnal blocks
An obvious first step is to find the geographical distribution of diurnal blocks, to gain some insight into which countries utilize their blocks diurnally.
To get an overall look at where the Internet is, we first geolocate all of the 3.7M blocks we probe (using MaxMind's city-level database [Inc13]) and plot them on a world map in Figure 6.22, grouped into a 2-by-2 degree grid. There are large numbers of addresses in North America and Europe, as well as concentrations in Japan, China, and several other countries. (Because geolocation with MaxMind data is only approximate, many addresses are located only to a country and not a specific city in that country. These
Figure 6.22: Number of total blocks in A_12w. Gray-scale shows 0 to 10k.
addresses show up as dark blocks in the physical middle of the country, accounting for
the dark blocks in the largely unpopulated geographic centers of Brazil, Russia, and
Australia.)
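The gridding used for these maps can be sketched as simple floor binning; the 2-degree cell convention here is an assumption about how the figures were produced.

```python
from collections import Counter

def grid_cell(lat, lon, deg=2):
    """Map a geolocated block to the (lat, lon) corner of its
    deg-by-deg grid cell, by flooring each coordinate."""
    return (int(lat // deg) * deg, int(lon // deg) * deg)

def grid_counts(blocks, deg=2):
    """Count blocks per grid cell, as plotted in the world maps."""
    return Counter(grid_cell(lat, lon, deg) for lat, lon in blocks)

# Toy input: three blocks, two falling in the same 2-degree cell.
blocks = [(34.05, -118.24), (34.9, -118.9), (51.5, -0.1)]
print(grid_counts(blocks))
```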
The distribution of diurnal blocks differs markedly from that of total blocks. For example, there are very few diurnal blocks in the United States, although it has the most allocated blocks. To show this disagreement more clearly and more quantitatively, we plot the number of diurnal blocks in Figure 6.23. We see that there is a strong bias towards the eastern globe (particularly Asia) and eastern Europe. For example, China has a high fraction of diurnal blocks.
To understand where each of the 412,208 diurnal blocks is located, we show the percentage of blocks in each grid cell, based on our diurnal detection method, in Figure 6.24. Several countries with large numbers of blocks show nearly no diurnal behavior, including the U.S., western Europe, and Japan. Other areas show a large amount of diurnal behavior: large areas in India, China, Russia and the former Soviet Union, and much of South America, particularly Peru.
Figure 6.23: Number of diurnal blocks in A_12w. Gray-scale shows 0 to 1k.
Figure 6.24: Fraction of diurnal blocks in A_12w. Gray-scale shows 0% to 100%.
To quantify what Figure 6.24 shows graphically, Table 6.3 shows the fraction of diurnal blocks for each country. Networks within a country usually have similar policies and culture about management, so we aggregate diurnal blocks by country and list the top 20 countries, plus the United States, in Table 6.3. Table 6.4 shows statistics grouped by global region. We correlate diurnal blocks with economic conditions in Section 6.5.3.
country region # blocks frac. diurnal per-capita
code GDP (US$)
AM Western Asia 1075 0.630 5900
GE Western Asia 1395 0.546 6000
BY Eastern Europe 1748 0.512 15900
CN Eastern Asia 394244 0.498 9300
PE South America 4600 0.401 10900
KZ Central Asia 3832 0.400 14100
RS Southern Europe 4429 0.393 10600
AR South America 20382 0.339 18400
TH South-Eastern Asia 10986 0.336 10300
SV Central America 1145 0.311 7600
UA Eastern Europe 16575 0.289 7500
CO South America 9379 0.261 11000
MY South-Eastern Asia 9747 0.247 17200
PH South-Eastern Asia 5721 0.239 4500
IN Southern Asia 36470 0.225 3900
MA Northern Africa 2115 0.185 5400
BR South America 79095 0.185 12100
VN South-Eastern Asia 8197 0.183 3600
ID South-Eastern Asia 7617 0.166 5100
RU Eastern Europe 53048 0.159 18000
. . . . . . . . . . . .
US Northern America 672104 0.002 50700
Table 6.3: Fraction of diurnal blocks, top 20 countries (with at least 1000 blocks in our study), and the United States. Diurnal analysis data is from A_12all; geolocation data is from MaxMind [Max13]; GDP data is from the CIA World Factbook [CIA].
6.5.2 Correlating Diurnal Blocks with Internet Entry Time
Policies for IP address allocation have evolved since IPv4 was introduced, becoming much stricter in the run-up to full allocation in May 2012 [Int12]. Policies now encourage dynamic addressing for IPv4, and previous work has documented this trend [CH10a].
To see how allocation time affects address use, we correlate the amount of diurnal behavior with the date when each /8 block was allocated to a regional registry by ICANN [Int13]. Figure 6.25 shows this correlation as a box plot of diurnalness over
region # blocks frac. diurnal
Northern America 721716 0.002
Southern Africa 11255 0.0108
Western Europe 275224 0.0109
Northern Europe 133911 0.0131
Caribbean 2174 0.016
Oceania 27206 0.0349
Western Asia 25570 0.0765
Northern Africa 9984 0.0992
Southern Europe 134933 0.124
Central America 44644 0.133
Eastern Europe 146552 0.135
Southern Asia 44524 0.200
South America 133493 0.208
South-Eastern Asia 48885 0.219
Eastern Asia 757352 0.279
Central Asia 3832 0.401
Table 6.4: Fraction of diurnal blocks grouped by region. Diurnal analysis data is from A_12all; geolocation data is from MaxMind [Max13].
Figure 6.25: Percentage of diurnal blocks in each month, based on block allocation date (diurnalness measured in 2013 with dataset A_12all; block allocation data from IANA).
allocation time. We see a clear upward trend towards more diurnal usage over time, corresponding with more careful policies in address use. (Linear regression shows a positive slope of 0.08% per month, with a correlation coefficient of 0.61.)
6.5.3 Effects of Economic Conditions
We suspect that economic factors may correlate with diurnal use, since dynamic address assignment and turning devices off at night are more efficient than static or always-on dynamic assignment and always-on devices. To evaluate this hypothesis, we next look at economic conditions.
Correlating Diurnal Blocks with Economics
Why do some countries have much higher diurnal rates? It is well known that countries in Asia, such as China and India, have cultures with high saving rates, and have low disposable incomes. They thus choose to use their networks diurnally to reduce electricity bills.
Although we cannot directly model cultural differences, we can observe correlations between disposable income (measured as per-capita GDP) and diurnal network usage. We correlate diurnal behavior by country with per-capita GDP drawn from the CIA World Factbook [CIA].
Figure 6.26 plots this correlation, and Table 6.3 shows specific values. We also plot the weak linear fit in Figure 6.26, with a negative slope and a confidence coefficient of -0.52625. The top 20 countries with diurnal policies generally have a per-capita GDP below USD $15,000, less than one third that of the United States. There are a few exceptions, mostly countries where abundant natural resources lift GDP (such as Russia and Malaysia).
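The strength of this relationship can be summarized by a Pearson correlation coefficient. A self-contained sketch with made-up country rows; the GDP and diurnal values below are illustrative, not the measured data:

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

# Illustrative country rows: (per-capita GDP in USD, diurnal fraction).
rows = [(3800, 0.40), (7500, 0.28), (15000, 0.21), (30000, 0.10), (52000, 0.05)]
gdp = [r[0] for r in rows]
diurnal = [r[1] for r in rows]
r = pearson(gdp, diurnal)
print(r < 0)  # negative, matching the weak negative fit in the text
```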
[Figure: scatter plot of diurnal fraction (0–0.7) versus per-capita GDP (0–70,000 USD).]
Figure 6.26: Scatter plot of diurnalness and per-capita GDP for all countries, for Table 6.3. Black dashed line shows a weak linear fit.
Other Factors
While per-capita GDP is one factor, there are many intertwined factors. To understand whether other factors are better correlated, we consider five factors: per-capita GDP, Internet users per host, electricity consumption per capita, and age of first (and mean) block allocation. We compare all of these effects using ANOVA factorial analysis (Section 6.2.4).
Table 6.5 shows this analysis. We find three factor combinations with p-value less than 0.05, the practical threshold for relevance. The first is per-capita GDP, with a p-value of 6.61e-08. The second is the interaction between per-capita electricity consumption and mean age of allocation, with a p-value of 0.001476. The third is mean age of allocation, with a p-value of 0.031354.

                              per-capita  per-capita         Internet users  age of        mean age
                              gdp         elec. consumption  per host        first alloc.  of alloc.
per-capita gdp                6.61e-08    0.306230           0.822111        0.789226      0.994624
per-capita elec. consumption              0.703536           0.609148        —             0.001476
Internet users per host                                      0.036558        0.958814      0.848766
age of first alloc.                                                          0.829530      —
mean age of alloc.                                                                         0.031354

Table 6.5: ANOVA analysis of correlations between diurnalness and individual factors (the diagonal) and pairwise combinations of factors (off the diagonal). Bold indicates combinations that are statistically significant. Dataset: A^12all.

The result from ANOVA analysis supports our previous hypothesis that per-capita GDP is the main factor for diurnalness. It also reveals several other important factors that influence diurnal usage, including per-capita electricity consumption and mean age of block allocation. Electricity consumption is often used to measure the development phase of a country, and it directly constrains "off" behavior: how much of the day computers are put offline. Mean age of block allocation corresponds to a country's time of entry to the Internet and influences diurnal usage (Section 6.5.2).
Although ANOVA analysis is powerful, we must be careful that some factors can be related or intermingled. For example, a country with low GDP is also likely to have low power consumption. Additional work is needed to show a causal relationship between these (or other) factors and network use. However, our work suggests the potential for such relationships and is the first to suggest this area as a direction for study.
Correlating Outages with Economics
We first analyze per-country outage fractions. Similar to Section 6.5.3, we organize Internet outage fractions by country for ease of understanding, with a scatter plot in Figure 6.27.
We collect outage fractions from dataset A^12all, and use Maxmind's geolocation to map blocks and aggregate them to countries (Section 6.2.3). Table 6.6 shows the 20 countries with the lowest overall outage fractions, plus the United States (US), Korea (KR), and China (CN). We see that filtering of diurnal
[Figure: scatter plot of outage fraction (0–0.06) versus per-capita GDP (0–70,000 USD).]
Figure 6.27: Scatter plot of outage fraction and per-capita GDP for all countries.
blocks has little influence on the overall per-country outage fractions, probably due to the small fraction of diurnal blocks in these countries.
We next analyze the effects of economic conditions on another metric: Internet outage fraction. The research question we are interested in is: with which factors do outage fractions correlate?
We apply a similar ANOVA analysis to find the contributing factors for outage fractions. We analyze the same five factors (and the interactions between them): per-capita GDP, Internet users per host, per-capita electricity consumption, and first and mean allocation age, shown in Table 6.7. We find two combinations of factors with the strongest influence. The first combination is Internet users per host and per-capita electricity consumption (with a p-value of 9.34e-09); the second combination is per-capita GDP and Internet users per host (with a p-value of 1.63e-06).
From the above analysis, we believe that electricity consumption is a key factor for outage fractions, as it appears in both combinations. High outage fraction is correlated
country code   # blocks   mean outage fraction   # blocks (no diurnal)   mean outage fraction (no diurnal)
AE 1743 0.000316 1732 0.000315
EE 2141 0.000346 2106 0.000337
DO 1081 0.000367 1057 0.000367
AT 12007 0.000452 11798 0.000457
FI 9047 0.000586 8935 0.000590
DK 18210 0.000600 18156 0.000602
FR 74324 0.000731 73809 0.000732
SI 2509 0.000766 2462 0.000780
IE 4458 0.000769 4427 0.000772
SA 3409 0.000800 3328 0.000818
PA 2311 0.000924 2152 0.000928
HK 10185 0.000970 9463 0.001009
LT 4011 0.001003 3595 0.001094
RS 4429 0.001028 2688 0.001428
NO 11334 0.001042 11259 0.001005
CZ 12818 0.001108 12521 0.001111
MK 1480 0.001208 1425 0.001220
AU 22522 0.001210 21646 0.001196
TW 37381 0.001257 32916 0.001322
LV 3511 0.001325 3376 0.001359
. . . . . . . . . . . . . . .
KR 193028 0.0021159 185266 0.002156
. . . . . . . . . . . . . . .
US 672104 0.0031322 670762 0.003137
. . . . . . . . . . . . . . .
CN 394244 0.0056575 197735 0.007474
Table 6.6: Mean outage fraction grouped by country (for countries with at least 1000 blocks in our study), plus KR, US, and CN, seen from all three sites.
to low electricity consumption. We believe this correlation arises because electricity is key infrastructure: countries without a sufficient power supply are prone to network outages, since frequent power outages bring down networks. This result is consistent with outages even in countries with high per-capita electrical consumption. The U.S. had a large number
[Figure: bar chart of mean outage fraction (all and without diurnal blocks) and diurnal fraction for the 23 countries of Table 6.6.]
Figure 6.28: Bar chart of mean outage fraction for 23 countries, for Table 6.6.
                              per-capita  per-capita         Internet users  age of        mean age
                              gdp         elec. consumption  per host        first alloc.  of alloc.
per-capita gdp                0.11309     0.24673            1.63e-06        0.81449       0.48936
per-capita elec. consumption              0.55230            9.34e-09        0.85537       0.71949
Internet users per host                                      0.53831         0.85741       0.72229
age of first alloc.                                                          0.43515       0.06788
mean age of alloc.                                                                         0.96789

Table 6.7: ANOVA analysis of correlations between outages and individual factors (the diagonal) and pairwise combinations of factors (off the diagonal). Bold indicates combinations that are statistically significant. Dataset: A^12all.
of network outages for end-users after Hurricane Sandy [QHP13c]. Although network infrastructure and backbones survived the hurricane fairly well [MCM13], we believe our large outage rates correspond with electrical outages in the NY/NJ area [HQP12].
6.5.4 Effects of Access-Link Technology
We next consider the effect of access-link technology. Unlike economic conditions and policies (which are national issues and difficult for individuals or companies to influence), users and ISPs can often choose between different ISPs and different "last-mile" methods to connect to the Internet.
IP addresses do not automatically identify access-link technology, so we must infer it from public information. We map each IP address we probe to a set of one or more access-link features using reverse domain names, then identify features common to a /24 block (as described in Section 6.2.3). With this approach, we are able to classify 22.4% of all IP blocks we probe, a reasonable sample for our study of access technologies.
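This classification step can be sketched as keyword matching with a per-block threshold. The keyword list matches Table 6.8, but the hostname patterns and the 50% share rule below are illustrative assumptions, not the exact rules of Section 6.2.3:

```python
from collections import Counter

KEYWORDS = ("static", "dynamic", "dhcp", "dsl", "dial", "ppp", "cable", "srv", "res")

def block_features(rdns_names, min_share=0.5):
    """Return features shared by at least min_share of the named addresses
    in a /24 block, inferred from reverse-DNS keyword matches."""
    counts = Counter()
    for name in rdns_names:
        for kw in KEYWORDS:
            if kw in name.lower():
                counts[kw] += 1
    n = len(rdns_names)
    return {kw for kw, c in counts.items() if c / n >= min_share}

# Hypothetical reverse names for addresses in one /24 block.
names = ["dsl-12-34-56-1.example.net",
         "dsl-12-34-56-2.example.net",
         "static-12-34-56-3.example.net"]
print(block_features(names))  # → {'dsl'} (2 of 3 names match)
```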
Diurnalness We first consider how often different access links are used diurnally. Table 6.8 shows the percentage of blocks of each type that are identified as diurnal.
As expected, dynamic addresses are strongly correlated with diurnal behavior, with dynamic at 19%. Somewhat surprisingly, dialup is not strongly diurnal (< 3%), while dsl is more diurnal (11%). These results suggest the importance of measuring network behavior rather than assuming it.
Access technology and reliability We next consider how different technologies compare in terms of reliability, measured by the overall fraction of outages (measured in block-rounds). Table 6.8 shows the overall fraction of outages for each technology, while Figure 6.30 shows the cumulative distribution of outages for each block, both taken over 35 days in A^12all. (We plotted a similar graph to Figure 6.30 without diurnal blocks, showing very similar trends.)
[Figure: bar chart of mean outage fraction (all and without diurnal blocks) and diurnal fraction for 9 access keywords: static, dynamic, dhcp, dsl, dial, ppp, cable, srv, res.]
Figure 6.29: Mean outage fraction and diurnal fraction for 9 access technologies, observed from all three probers, in our 35-day dataset A^12all (for Table 6.8).
[Figure: CDF of per-block outage fraction (log scale), one curve per access keyword, with reference marks at one hour and one day.]
Figure 6.30: Cumulative distribution of fraction of outages for each block, by access keyword. (Note y-axis starts at 0.75.) Outages are the intersection of all three sites, from A^12all for 35 days.
keyword   # blocks   diurnal frac.   outage frac. (all)   outage frac. (no diurnal)
static 193506 0.0384 0.00153 0.00146
dynamic 194830 0.185 0.00360 0.00405
dhcp 35692 0.0334 0.00190 0.00190
dsl 231355 0.109 0.00321 0.00341
dial 79753 0.0283 0.00223 0.00224
ppp 31261 0.0912 0.00351 0.00370
cable 67341 0.0167 0.00170 0.00170
srv 518270 0.0853 0.00373 0.00333
res 53015 0.00569 0.00891 0.00896
Table 6.8: Mean outage fraction of 9 access keywords, observed from all three probers. Dataset: A^12all.
First, it is important to observe that all technologies are quite reliable. Most blocks show no outage at all, as shown by the CDF curves, which start at about 80% to 90%. Furthermore, if we consider outages of less than an hour unimportant (fraction: 0.001), then 85% to 95% of blocks are down less than an hour in our 35-day dataset.
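The one-hour cutoff maps to an outage fraction of about 0.001 over the observation window; a quick check:

```python
# An outage fraction is time down divided by time observed. Over the
# 35-day A^12all window, a one-hour outage is a fraction of roughly 0.001.
window_hours = 35 * 24              # 840 hours of observation
one_hour_fraction = 1 / window_hours
print(round(one_hour_fraction, 4))  # → 0.0012
```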
However, we do see that dynamic access technologies, such as dhcp and dial-up, are less reliable than static technologies, such as cable (over traditional TV cables) and DSL (over telephone lines). We believe this is due to the dynamic allocation and lower utilization of IP addresses in such technologies, because it means higher churn of addresses and the potential that blocks sometimes have no active addresses. Cable and DSL are usually used for always-on devices. Another possible factor is under-utilization of once-pervasive technologies, or alignment of dynamic address allocation policies with diurnal cycles.
We are interested in comparing wireless (wifi) to wired access technologies. Unfortunately, our current sample of wireless-identified blocks is too small (only 691 blocks) to provide a statistically strong comparison.
To disentangle some of these factors and get at the relative reliability of the underlying technology, we reanalyze this data excluding diurnal blocks. Our goal is to separate human behavior (turning off devices at night) from other factors. This factor is particularly important for DSL, usually used for always-on devices but with a large diurnal component. We see that the diurnal factor usually does not affect outage fraction very much, with the exception of EarthLink, where diurnal usage dominates, leaving fewer non-diurnal blocks.
To verify this analysis (especially for low-reliability blocks), we randomly select 100 "bad" DHCP blocks: those with outage fractions larger than 1%. We then plot the A_s value over time, and manually check it against outage reports obtained from Trinocular. Figure 6.31 shows four examples to illustrate this. Because ground truth is hard to get, we first check raw log files against the outage periods. We observe that there is no response during the outage periods, and our prober has covered the ever-responding address list of each block, E(b), many times. For the examples in Figure 6.31, during their main outage periods we see no positive replies, even after probing all active addresses multiple times (in block 130.34.73/24, |E(b)| = 17, probed 154 times; 130.41.96/24, |E(b)| = 24, probed 581 times; 67.235.190/24, |E(b)| = 141, probed 5 times; 67.209.245/24, |E(b)| = 29, probed 54 times). This evaluation gives us confidence that Trinocular's reported outage periods are actual outages in these probed blocks. The strong correlation between the A_s curve and examination of outages in raw log files thus shows the effectiveness of our algorithm on low-reliability blocks.
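This manual check can be mechanized. A hedged sketch (the log format and function name are ours, not Trinocular's): confirm that during a reported outage window, every ever-responding address in E(b) was probed at least once and no probe drew a positive reply.

```python
def outage_confirmed(probes, ever_responding, start, end):
    """Check that during [start, end) every ever-responding address in the
    block was probed at least once and no probe drew a positive reply."""
    probed = set()
    for addr, t, responded in probes:
        if start <= t < end and addr in ever_responding:
            if responded:
                return False          # a positive reply refutes the outage
            probed.add(addr)
    return probed == ever_responding  # full coverage of E(b), no replies

# Hypothetical block with E(b) = {".1", ".5"}; both probed, neither answers.
E_b = {".1", ".5"}
log = [(".1", 10, False), (".5", 12, False), (".1", 15, False)]
print(outage_confirmed(log, E_b, 0, 60))  # → True
```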
6.5.5 Effects of Organizations on Reliability
End-users often have choices of Internet service provider (ISP). We next compare the reliability we observe across major U.S. ISPs. We focus on American ISPs because our method uses string matching, but it should also work for ISPs
(a) DHCP block 130.34.73/24 (0x822249). (b) DHCP block 130.41.96/24 (0x829960).
(c) DHCP block 67.235.190/24 (0x43ebbe). (d) DHCP block 67.209.245/24 (0x43d1f5).
Figure 6.31: Availability A_s and outage periods of 4 DHCP blocks.
in other countries. (Currently, identification of ISP blocks is partially manual, so we focus on U.S. blocks where such information is easily available.) When choosing an Internet service for home or business, people today usually consider speed. We suggest that reliability may be a second metric to consider.
To compare the reliability of ISP networks, we consider the top 20 US ISPs [eCo11]. We identify addresses for each organization as described in Section 6.2.3. We then exclude ISPs for which we have fewer than 750 blocks (CenturyLink, Optimum, NetZero, Juno, AOL, MSN, Basic ISP, ISP.com), and add four large content providers for comparison (Google, Yahoo, Facebook, LinkedIn). Table 6.9 shows our coverage. We see that our method covers many blocks in most big ISPs (such as Comcast, AT&T and
Time Warner). However, because our search for ISPs is based on string matching over data from previous work [CHKW10], we can end up with little or no data for some ISPs (for example, Optimum and NetZero). There are three reasons for this. The first is that the search keyword we use may be outdated relative to the dataset we search. For example, we originally used the keyword "CenturyLink" to search for ASes related to CenturyLink but found no data; many ASes and blocks show up when we add "Qwest" to the search (CenturyLink and Qwest merged in 2011). In another case, Juno and NetZero merged to form United Online in 2001, but we could not find records for any of the three. The second reason is that there are IP blocks associated with an ISP, but we did not probe them or probed only a small fraction (such as Optimum and AOL). The third reason is that the source AS-to-organization mapping data we rely on is not perfect [CHKW10]. In future work, we will continue to improve on these factors and maximize our coverage.
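The outdated-keyword pitfall suggests maintaining alias lists for merged or renamed organizations. A small sketch with hypothetical AS-to-organization records; the AS numbers, names, and aliases below are illustrative, not drawn from [CHKW10]:

```python
# Hypothetical AS-to-organization records: (AS number, organization name).
AS_ORGS = [
    (7922, "Comcast Cable Communications"),
    (209, "Qwest Communications"),
    (22561, "CenturyTel Internet Holdings"),
]

# Aliases catch renamed or merged organizations (e.g., CenturyLink
# acquired Qwest in 2011) that a single current keyword would miss.
ALIASES = {"centurylink": ["centurylink", "qwest", "centurytel"]}

def ases_for_isp(isp, records):
    """Return AS numbers whose organization name matches the ISP or an alias."""
    keywords = ALIASES.get(isp.lower(), [isp.lower()])
    return [asn for asn, org in records
            if any(kw in org.lower() for kw in keywords)]

print(ases_for_isp("CenturyLink", AS_ORGS))  # → [209, 22561]
```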
Table 6.9 shows the fraction of outages we observe for each ISP. To be statistically significant, we only consider the ISPs with more than 100 probed blocks (marked in bold in the number of blocks probed, Table 6.9). Figure 6.33 shows the cumulative distribution of outages for each block in each ISP. We also plotted the same graph with diurnal blocks filtered out, showing very similar results (except for EarthLink, which we will look into in future work).
As with access links, the first observation is that all ISPs are very reliable. The highest overall fraction of outages is about 0.5%; most ISPs have less than 0.3% outages. When we look at individual blocks in Figure 6.33, 85% of all blocks show no outages over 5 weeks of observation (exceptions are Mediacom and Cable One). If we assume outages less than 1 hour are not significant for most individuals (as with access links, Section 6.5.4), then 92% of blocks are acceptably reliable.
[Figure: bar chart of mean outage fraction (all and without diurnal blocks) and diurnal fraction per organization, for top U.S. ISPs plus Google, Yahoo, Facebook, and LinkedIn.]
Figure 6.32: Mean outage fraction and diurnal fraction for top United States ISPs, observed from all three probers, in our 35-day dataset A^12all (for Table 6.9).
[Figure: CDF of per-block outage fraction (log scale), one curve per ISP or organization, with reference marks at one hour and one day.]
Figure 6.33: Cumulative distribution of block outage fraction for top US ISPs and organizations (with at least 100 blocks in our study), showing ISP reliability, observed from all three probers.
ISP   # blocks probed   # total blocks   diurnal frac.   outage frac. (all)   outage frac. (no diurnal)
comcast 153183 276224 0.000344 0.00239 0.00239
att 62279 403307 0.00301 0.000810 0.000812
timewarner 70452 106381 0.000871 0.00521 0.00521
CenturyLink 30908 78432 0.000647 0.00239 0.00802
charter 13742 24581 0.00102 0.00300 0.00300
verizon 43594 625519 0.00469 0.000443 0.000444
cox 22603 46715 0.000457 0.00313 0.00313
Optimum — 528 — — —
frontier 9897 18631 0.00335 0.00103 0.00103
suddenlink 2381 6405 0.000771 0.00160 0.00160
earthlink 933 977 0.821 0.00408 0.0280
windstream 11197 13863 0.000246 0.00112 0.00112
cableone 939 3757 0.0 0.000343 0.000343
NetZero — — — — —
Juno 1 4 — — —
aol 73 26009 0.0782 0.00128 0.00128
msn 291 4284 0.0 0.000542 0.000542
mediacom 2327 4768 0.00236 0.00325 0.00325
Basic ISP — — — — —
ISP.com — — — — —
google 312 1055 0.00255 0.00219 0.00219
yahoo 2461 4850 0.0 0.00109 0.00109
facebook 73 212 0.0 0.00414 0.00414
linkedin 7 8 0.0 0.00145 0.00145
Table 6.9: Mean outage fraction of top Internet service providers (ISPs) in the United States, plus Google, Yahoo, Facebook and LinkedIn. Bold indicates that enough blocks were probed to be statistically significant. Total number of blocks is inferred from previous work on AS-to-organization mapping [CHKW10].
6.6 Conclusions
In this chapter, we develop an accurate adaptive approach to estimate network block availability over time. Based on this, we perform the first long-term, wide-scale analysis of diurnal network behavior and outages. We also begin to evaluate how policies, economic conditions and technologies affect Internet use in the world. With statistical ANOVA analysis, we correlate diurnal usage and outages to economic factors such as GDP and power consumption.
This chapter also supports the thesis. It shows that sampling adaptively within Internet blocks provides an accurate way to track block availability over time. With this approach, we analyze diurnal block usage and outage properties for all 3.7M responsive Internet blocks. For knowledge beyond the IP address space, we aggregate our results based on mappings from IP to other domains for broader views: over countries, to analyze the distribution of and reasons for diurnal blocks; and over link types and ISPs, to expand our knowledge of outage properties from previous chapters.
Chapter 7
Related Work
In this chapter, we discuss studies related to our thesis, with various sampling and aggregation techniques, and compare them to our three specific studies. We present several related areas of research: long-lived Internet flows (Section 7.1), visualization (Section 7.2), Internet outages (Section 7.3), adaptive sampling (Section 7.4), and ISP comparison (Section 7.5).
7.1 Understanding Internet Flows
An important aspect of understanding Internet traffic is packet sizes and protocols. Thompson et al. studied the packet size distribution and protocol mix over a one-day period, and diurnal patterns of aggregate traffic over a 7-day period [TMW97]. CAIDA collected several 90-second traces each day over a period of ten months, and studied trends in packet lengths and protocol mixes [Mkc00]. We use the common 5-tuple flow definition [TMW97], but we are more interested in flow characteristics and traffic mixes across different time scales.
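The 5-tuple flow definition groups packets that share source and destination IP, source and destination port, and protocol. A minimal sketch (the packet records are illustrative):

```python
from collections import defaultdict

def aggregate_flows(packets):
    """Group packets by the 5-tuple (src IP, dst IP, src port, dst port,
    protocol) and sum bytes per flow."""
    flows = defaultdict(int)
    for src, dst, sport, dport, proto, nbytes in packets:
        flows[(src, dst, sport, dport, proto)] += nbytes
    return dict(flows)

# Hypothetical packet headers: two packets of one flow, one of another.
pkts = [("10.0.0.1", "10.0.0.2", 5000, 80, "tcp", 1500),
        ("10.0.0.1", "10.0.0.2", 5000, 80, "tcp", 500),
        ("10.0.0.3", "10.0.0.2", 5001, 53, "udp", 100)]
flows = aggregate_flows(pkts)
print(len(flows))  # → 2
```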
Characteristics of Internet flows have also been studied extensively. Brownlee et al. studied lifetimes of streams (bi-directional flows) [Bkc02, Bro05]. They found that at least 45% of streams are dragonflies (lasting less than 2 seconds), 98% of streams last less than 15 minutes, and the remaining 2% are tortoises. Similarly, we find that most Internet bytes are carried by the vast majority of short flows, but long flows also account for a considerable fraction of bytes (see Section 2.3.1). Later work studied flow characteristics systematically, showing the correlations between flow size, duration, rate and burstiness [cLH06]. We adopt similar ideas from this work, but compare flow behavior as a function of duration.
Because of the large volume of traffic, careful sampling techniques have been used to achieve better processing rates. Researchers from AT&T estimated flow statistics by sampling packet streams and exploiting protocol details [DLT03]. Researchers at UCSD used adaptive sampling algorithms to increase the robustness of the Cisco NetFlow system without compromising accuracy in the case of large traffic volumes [EKMV04]. Zhang et al. studied the distributions and causes of different flow rates [ZBPS02]. They collected sampled traces from a backbone ISP covering 1 to 24 hours, and unsampled traces ranging from 30 to 120 minutes. They also studied the correlations of flow rates with size and duration, and gave a careful analysis of the causes of different flow rates (such as congestion-limited, sender/receiver-window-limited, and application-limited). Our work builds on theirs: we continuously collect unsampled IP packet headers, and systematically study the relations between flow durations and other characteristics. We also provide the ability to investigate multi-time-scale flows for efficient analysis and give a preliminary analysis of the causes of long-lived flows.
Several other groups have exploited flow characteristics for traffic engineering purposes. Shaikh et al. studied load-sensitive routing, but they adopted a conservative, 10-or-more-packet definition of long flows. We study several longer time scales and find interesting implications of the long flows. Trunking (with TCP or optical networks [KW99, HDL+98, AR01]) gathers together groups of flows to achieve throughput benefits. Our work identifies long-duration flows that could be used by trunking. Recent work in low-buffer routing has shown the possibility of using very few router buffers (two orders of magnitude fewer than current commercial practice), provided that traffic is "sufficiently" smooth [AKM04]. We show that long-duration flows are smoother and could be a good candidate for such optimization.
7.2 Internet Visualization
While there is significant prior work on routing outages and route changes, there has been much less work visualizing network phenomena.
Several prior papers have visualized numbers of outages over time, including Markopolou et al. [MIB+04], Turner et al. [TLSS10], and Gill et al. [GJN11]. While showing timeseries, none of these attempt to cluster the blocks to bring out correlations. Our clustering would make the correlations they discuss more obvious.
Data clustering is a well-established field with many generic clustering algorithms (see [XW05, Lia05] for detailed surveys); we employ a simple greedy algorithm most suitable for our problem (Section 4.2.1).
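The greedy approach can be sketched as a single O(n²) pass: each row joins the first existing cluster whose representative is similar enough, else it starts a new cluster. The similarity function and threshold below are illustrative, not the exact ones of Section 4.2.1:

```python
def greedy_cluster(rows, similarity, threshold=0.9):
    """Greedily assign each row to the first cluster whose representative
    is similar enough, else start a new cluster (O(n^2) comparisons)."""
    clusters = []  # each cluster is a list of rows; rows[0] is representative
    for row in rows:
        for cluster in clusters:
            if similarity(cluster[0], row) >= threshold:
                cluster.append(row)
                break
        else:
            clusters.append([row])
    return clusters

# Illustrative: binary up/down timeseries per block; similarity is the
# fraction of time steps on which two rows agree.
def agree(a, b):
    return sum(x == y for x, y in zip(a, b)) / len(a)

rows = [(1, 1, 0, 0), (1, 1, 0, 0), (0, 0, 1, 1)]
print(len(greedy_cluster(rows, agree)))  # → 2
```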
Somewhat related to our visualization work are CAIDA's AS dispersion graphs, which cluster and visualize based on the common AS paths from one point to many destinations [CAI12]. This intuitive visualization shows the AS-level connectivity seen at a vantage point. We complement their work by clustering the changes in AS paths, instead of static analysis.
7.3 Internet Outage Detection
To understand Internet outages, many studies utilize control-plane messages [MIB+04, TR04, LABJ00, MWA02, BMRU09, HFLX07, CGH03, FMM+04], direct data-plane measurements [FABK03, CBG10, CTFD09, DTDD07, HFT08], or passive data analysis [DSA+11, DAAC12, TLSS10]. See below and our papers for more details [QHP12a, QHP12c, QHP13b].
Very close to our second study, the Hubble system uses continuous probes to one sample address (the .1) of each routed /24 block to find potential Internet outages [KBMJ+08a]. We instead probe multiple or all addresses in each /24 block. We study the tradeoff between sampling and accuracy and show that our use of multiple representatives per block greatly reduces the number of false conclusions about network outages. We also describe new algorithms for clustering outages for visualization and into network-wide events.
Cunha et al. run multiple probes to confirm a link failure and its location. They analyze the benefits of different numbers of probes, and improve accuracy with minimal probing overhead [CTFD09]. We also study the tradeoff of probe volume against accuracy, but focus on end-system outage detection rather than specific link failures.
Bush et al. study the reachability of Internet address space using traceroute to detect incorrect filtering [BHM+07], and find biases in reachability experiments [BMRU09]. We provide additional evidence supporting their observation that default routes are widely used and that control-plane measurements underestimate outages.
Several prior groups have used meshes of measurement computers [Pax96, ABKM01, FABK03, KYGS07]. Such experiments can provide strong results for the behavior of the networks between their n vantage points (typically fewer than 50), and link coverage grows as O(n^2) for small n, but edge coverage is only O(n). Without probing outside the mesh, however, these approaches ultimately study only a small fraction of the entire Internet. Our work aims to provide complete coverage.
In early work, Paxson reports routing failures in about 1.5%–3.3% of trials [Pax96]. In more recent work, the RON system reports 21 "path-hours" of complete or partial outages out of a total of 6825 path-hours, a 0.31% outage rate [ABKM01]. Feamster et al. measure Internet path failures with n = 31, and correlate them with BGP messages for causes [FABK03]. They find that most failures are short (under 15 minutes) and discuss the relationship between path failures and BGP messages. As with their work, we validate our findings using control-plane data.
The instrumentation in these systems can often isolate the locations of problems, such as SCORE (Kompella et al. [KYGS07]); such work complements ours.
Rather than a mesh, PlanetSeer studies traffic from 7–12k end-users to a network of 120 nodes to track path outages [ZZP+04]. They report that their larger population identifies more anomalies than prior work; we expect our edge coverage of 2.5M blocks will be broader still. In addition, their measurements occur only at clients, so they miss outages from already-disconnected clients.
Client support in these studies allows better fault diagnosis than our work. Our work complements theirs by providing much larger coverage (2.5M /24 blocks, a large fraction of the Internet edge), rather than "only" hundreds or thousands, and by supporting regular, centrally driven measurement, rather than client-driven measurements that undercount outages.
Greenberg et al. study failures in datacenters [GHJ+09]. They collect failure logs from 8 production datacenters containing hundreds of thousands of servers, over more than a year. They find that most failures are small in size (fewer than 20 devices) and that large correlated failures are rare. They also investigate failure durations and get results similar to ours: 95% of failures resolve within 10 minutes, 98% in less than an hour, and 99.6% in less than a day. They find that the main reasons for failures in datacenter networks are misconfiguration and firmware bugs.
Passive Data Analysis: Recent studies by Dainotti et al. perform an in-depth analysis of Internet outages caused by political censorship [DSA+11, DAAC12]. Their main focus is the Egypt and Libya outages in 2011, using a novel approach that combines observations from both control-plane (BGP logs) and data-plane sources (backscatter traffic at the UCSD network telescope and active probing data from Ark). They focus on the use of multiple passive data sources; they find their source of active probes is of limited use because it probes each /24 only every three days. We instead show that a single PC can actively probe all visible and responsive /24 blocks every 11 minutes, suggesting active probing can complement their passive approach.
Above the network layer, other systems have looked at system- and user-level logs to determine outages. For example, UCSD researchers have done careful studies of "low-quality" data sources (including router configurations, e-mail and syslog messages) to discover the characteristics and causes of failures in the CENIC network [TLSS10]. Such log analysis requires collaboration with the monitored networks, and so their study focuses on a single ISP. In contrast, we use active probing that can be done independent of the target.
Origins of Routing Instability: BGP's centralization of otherwise distributed routing information makes it an attractive source of data for outage analysis. Prior work has used the AS path to study where outages originate. Chang et al. cluster BGP path changes into events, both temporally and topologically [CGH03]. They also provide insights on how to infer where network events happen. Feldmann et al. identify ASes responsible for Internet routing instabilities using time, views and prefixes [FMM+04]. They report that most routing instabilities are caused by a single AS or a session between two ASes. (Chang et al. draw similar conclusions [CGH03].) They also propose useful insights on hazards in identifying instability originators. We develop conceptually similar clustering methods, but based on data-plane observations. Our active probing approach finds many large Internet outages that cut across multiple ASes, and also detects outages in edge networks that use default routing.
Network tomography uses coordinated end-to-end probes to detect the specific location of network failures [CTFD09, DTDD07, HFT08]. We also identify outages near our vantage points to correct for errors. However, our work is in a different domain, as our focus is to analyze the end-to-end reachability of the whole Internet.
7.4 Adaptive Sampling
Several prior studies have used adaptive probing techniques to find network faults [BRM01, BRM02, RBM+05, NS07], and use simulations for validation. Like them, we propose adaptive sampling inside blocks to achieve high accuracy with low probing cost. Our research covers a much larger fraction of the Internet, and aims to report the reachability of the entire responsive Internet.
7.5 ISP Comparison
Several third-party systems exist to compare the performance of ISPs, including Keynote [Key], Top Ten Reviews [Top], and the popular speed test tool offered by Ookla [OOK]. They share a common drawback: lack of scalability. Perhaps closest to our work is Netdiff [MZPP08], which measures the performance of hundreds of paths for 18 backbone ISPs. They find that ISP performance depends on geographic properties of traffic and the popularity of destinations. Our work is more focused on nodes instead of paths, and thus complements Netdiff.
Chapter 8
Future Work and Conclusions
In this chapter, we list several directions for future work. We then summarize and con-
clude our thesis.
8.1 Future Work
Although we have presented several studies supporting our thesis, there are areas of immedi-
ate future work that would help substantiate our claims. We next discuss this short-term
future work for each of our three studies.
In Chapter 2 we study properties of long-lived flows and find that they are mostly
computer-to-computer in nature. We could take several future directions to strengthen this
work. First, we could extend the study period to cover flows spanning
months and seasons. Next, we currently use port-based analysis to identify the causes of long-lived
flows. Although effective, this method has shortcomings. For example, we
know that some protocols switch ports or use randomized ports in their operations (such
as BitTorrent). Future work could apply more complex and accurate measures
(such as a signature-based study) to investigate causes. Finally, we could repeat our
analysis at other vantage points or campuses, to confirm that our results are not local effects.
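A minimal sketch of the port-based approach and its limitation: flows are labeled by well-known ports, so protocols that randomize ports fall into an "unknown" bucket. The port table here is an illustrative assumption, not the full mapping used in the study.

```python
# Illustrative subset of a well-known-port table (an assumption,
# not the complete mapping used in our analysis).
PORT_APPS = {22: "ssh", 25: "smtp", 80: "http", 443: "https", 6881: "bittorrent"}

def classify_flow(src_port, dst_port):
    """Label a flow by the first well-known port on either end.

    Flows on randomized ports (as BitTorrent often uses) come back
    as "unknown", which is exactly the shortcoming noted above.
    """
    for port in (dst_port, src_port):
        if port in PORT_APPS:
            return PORT_APPS[port]
    return "unknown"
```

A signature-based study would replace this table lookup with payload or behavioral signatures, recovering flows that port-based analysis misclassifies.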
In Chapters 3 and 4 we study the characterization of Internet outages and visualization
with aggregation. One future direction is to add more probing sites and carry out a more
careful study of local-view effects on global outage detection. Although we worked hard to
validate outages with public routing information, many events are not related to
routing at all, such as outages in default-routed blocks (Chapter 3). For such events, we would likely
need to go to network operators for more ground truth. Finally, we could explore visualization
clustering on other types of data, beyond our current examples of Internet outages and
path changes. Our current visualization algorithm uses O(n²) simple clustering, which
has scalability problems for large datasets. Future work could look at parallelizing
the clustering, or at other clustering algorithms with lower complexity.
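To make the scalability concern concrete, the following sketch shows the kind of O(n²) greedy clustering we mean: each event row joins the first cluster whose representative is within a distance threshold, so in the worst case every row is compared against every existing cluster. The binary-string row encoding and the threshold value are illustrative assumptions, not our exact implementation.

```python
def greedy_cluster(rows, threshold):
    """Greedy O(n^2) clustering of equal-length event rows.

    Each row joins the first cluster whose representative row is
    within `threshold` Hamming distance; otherwise it seeds a new
    cluster. Worst case compares every row against every cluster.
    """
    clusters = []  # list of (representative_row, member_rows)
    for row in rows:
        for rep, members in clusters:
            if sum(a != b for a, b in zip(rep, row)) <= threshold:
                members.append(row)
                break
        else:  # no cluster is close enough: start a new one
            clusters.append((row, [row]))
    return clusters
```

Parallelizing this loop, or replacing it with a lower-complexity method such as locality-sensitive hashing, is the kind of improvement the future work above envisions.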
In Chapters 5 and 6, we develop Trinocular and use it to study policy effects on
diurnal blocks and outages. There are several interesting future directions in this work,
too. First, we currently use a fixed probe loss model with a fixed loss rate of ℓ = 0.01.
Future work could refine this model with an online estimate of ℓ. Another possible
direction is to improve the adaptive sampling process. Our current adaptive probing
iterates through E(b), the ever-responded list of addresses, and treats every address in
E(b) equally. It would be interesting to use a weighted version of this sampling process,
giving top responders more weight, similar to the top-k algorithm in
Chapter 3. Finally, an improvement to Chapter 6 could look at
differences in reliability and diurnalness among ASes within the same organization or ISP.
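One way the weighted variant might look, as a sketch under the assumption that we keep a per-address historical response rate (the function and parameter names below are hypothetical): draw addresses from E(b) without replacement, with probability proportional to past responsiveness, so top responders tend to be probed first.

```python
import random

def weighted_probe_order(ever_responded, response_rate, k):
    """Pick k addresses from E(b) without replacement, weighting by
    each address's historical response rate so that top responders
    tend to be probed first (cf. the top-k idea in Chapter 3)."""
    addrs = list(ever_responded)
    # Floor at a tiny weight so never-seen addresses keep some chance.
    weights = [max(response_rate.get(a, 0.0), 1e-6) for a in addrs]
    order = []
    while addrs and len(order) < k:
        i = random.choices(range(len(addrs)), weights=weights)[0]
        order.append(addrs.pop(i))
        weights.pop(i)
    return order
```

Because highly responsive addresses answer with high probability, front-loading them should shorten the expected number of probes needed to reach a confident belief about the block.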
8.2 Conclusions
The Internet is important for nearly all aspects of our society. Despite years of research,
we still have limited knowledge of its overall state. This thesis shows how we obtain
important new knowledge about the Internet, with four specific examples, each using
different sampling and aggregation techniques. More specifically, our first work uses
aggregation in the time dimension to study the properties and causes of long-lived Inter-
net flows. Our second work uses sampling and aggregation in the IPv4 address space to
track outages at the Internet edge. Our third work uses aggregation by visualization to
make both manual and automated analysis more efficient and effective. Last, our fourth
work uses adaptive sampling to understand policy effects on Internet usage: diur-
nal blocks and reliability.
Bibliography
[ABKM01] David Andersen, Hari Balakrishnan, Frans Kaashoek, and Robert Morris.
Resilient overlay networks. In Proceedings of the eighteenth ACM sym-
posium on Operating systems principles, SOSP ’01, pages 131–145, New
York, NY , USA, 2001. ACM.
[Ade09] Jacob Adelman. Los Angeles approves $7.2 million plan to tap Google
for e-mail and other web services. Los Angeles Times, 27 Oct. 2009.
[AKM04] Guido Appenzeller, Isaac Keslassy, and Nick McKeown. Sizing router
buffers. In Proc. of ACM SIGCOMM, pages 281–292, Portland, Oregon,
USA, August 2004. ACM.
[All05] MM Allen. The ever-shifting internet population: A new look at intranet
access and the digital divide, 2005.
[AR01] Daniel Awduche and Yakov Rekhter. Multiprotocol Lambda Switching:
Combining MPLS Traffic Engineering Control with Optical Crosscon-
nects. IEEE Communications Magazine, 39(3):111–116, March 2001.
[Arm11] Grenville Armitage. Private communications, Jul. 2011.
[Aus11] AusNOG. Discussions about Australia flooding, Jan. 2011. http://
lists.ausnog.net/pipermail/ausnog/2011-January.
[BHM+07] Randy Bush, James Hiebert, Olaf Maennel, Matthew Roughan, and Steve
Uhlig. Testing the reachability of (new) address space. In Proc. of
ACM Workshop on Internet Network Management, pages 236–241, Kyoto,
Japan, August 2007. ACM.
[BHP07] Genevieve Bartlett, John Heidemann, and Christos Papadopoulos. Under-
standing passive and active service discovery. In Proc. of ACM IMC,
pages 57–70, San Diego, California, USA, October 2007. ACM.
[Bkc02] Nevil Brownlee and kc claffy. Understanding Internet traffic streams:
Dragonflies and tortoises. IEEE Communications Magazine, 40:110–117,
2002.
[Blo13] Renesys Blog. Syrian Internet Fragility, 2013. http://www.renesys.
com/2013/05/syrian-internet-fragility/.
[BMRU09] Randy Bush, Olaf Maennel, Matthew Roughan, and Steve Uhlig. Inter-
net optometry: assessing the broken glasses in internet reachability. In
Proc. of ACM IMC, pages 242–253, New York, NY , USA, 2009. ACM.
[BOR+02] Anindya Basu, Chih-Hao Luke Ong, April Rasala, F. Bruce Shepherd,
and Gordon Wilfong. Route oscillations in I-BGP with route reflection. In
Proc. of ACM SIGCOMM, SIGCOMM ’02, pages 235–247, New York,
NY , USA, 2002. ACM.
[BRM01] Mark Brodie, Irina Rish, and Sheng Ma. Optimizing probe selection for
fault localization. 2001.
[BRM02] M. Brodie, I. Rish, and S. Ma. Intelligent probing: A cost-effective
approach to fault diagnosis in computer networks. IBM Systems Jour-
nal, 41(3):372 –385, 2002.
[Bro05] Nevil Brownlee. Some Observations of Internet Stream lifetimes. In
Passive and Active Measurement Workshop, pages 265–277, 2005.
[BTI+02] Chadi Barakat, Patrick Thiran, Gianluca Iannaccone, Christophe Diot,
and Philippe Owezarski. A flow-based model for internet backbone traf-
fic. In Proc. of ACM SIGCOMM Internet Measurement Workshop, pages
35–47, Marseille, France, October 2002. ACM.
[CAI12] CAIDA. AS dispersion graphs, 2012. http://www.caida.org/
projects/ark/statistics/san-us/as_dispersion_by_as.
html.
[CB97] Mark E. Crovella and Azer Bestavros. Self-similarity in world wide web
traffic: evidence and possible causes. ACM/IEEE Transactions on Net-
working, 5(6):835–846, December 1997.
[CBG10] David R. Choffnes, Fabián E. Bustamante, and Zihui Ge. Crowdsourcing
service-level network event monitoring. In Proc. of ACM SIGCOMM,
SIGCOMM ’10, pages 387–398, New York, NY , USA, 2010. ACM.
[CFH+13] Matt Calder, Xun Fan, Zi Hu, Ethan Katz-Bassett, John Heidemann, and
Ramesh Govindan. Mapping the expansion of Google’s serving infras-
tructure. In Proceedings of the ACM Internet Measurement Conference,
page to appear, Barcelona, Spain, October 2013. ACM.
[CGH03] Di-Fa Chang, Ramesh Govindan, and John Heidemann. The Temporal
and Topological Characteristics of BGP Path Changes. In Proc. of IEEE
International Conference on Network Protocols, pages 190–199, Atlanta,
Georga, USA, November 2003. IEEE.
[CH10a] Xue Cai and John Heidemann. Understanding Block-level Address Usage
in the Visible Internet. In Proc. of ACM SIGCOMM, pages 99–110, New
York, NY , USA, 2010. ACM.
[CH10b] Xue Cai and John Heidemann. Understanding block-level address usage
in the visible Internet (extended). Technical Report ISI-TR-2009-665,
USC/Information Sciences Institute, June 2010. This technical report
extends the SIGCOMM 2010 paper with three appendices with support-
ing details.
[CHKW10] Xue Cai, John Heidemann, Balachander Krishnamurthy, and Walter Will-
inger. Towards an AS-to-Organization map. In Proceedings of the ACM
Internet Measurement Conference, pages 199–205, Melbourne, Aus-
tralia, November 2010. ACM.
[CIA] CIA World Factbook. List of countries by GDP (PPP) per capita. https:
//www.cia.gov/library/publications/the-world-factbook.
[Cla88] D. Clark. The design philosophy of the DARPA internet protocols. In
Symposium proceedings on Communications architectures and protocols,
SIGCOMM ’88, pages 106–114, New York, NY , USA, 1988. ACM.
[cLH06] Kun chan Lan and John Heidemann. A measurement study of correla-
tion of Internet flow characteristics. Computer Networks, 50(1):46–62,
January 2006.
[CLM+13] Jakub Czyz, Kyle Lady, Sam G. Miller, Michael Bailey, Michael Kallitsis,
and Manish Karir. Understanding IPv6 internet background radiation. In
Proceedings of the 2013 conference on Internet measurement conference,
IMC ’13, pages 105–118, New York, NY , USA, 2013. ACM.
[Cow11a] James Cowie. Egypt leaves the Internet. Renesys Blog http://
www.renesys.com/blog/2011/01/egypt-leaves-the-internet.
shtml, January 2011.
[Cow11b] James Cowie. Egypt returns to the Internet. Renesys Bloghttp://www.
renesys.com/blog/2011/02/egypt-returns-to-the-internet.
shtml, February 2011.
[Cow11c] James Cowie. Libyan disconnect. Renesys Bloghttp://www.renesys.
com/blog/2011/02/libyan-disconnect-1.shtml, February 2011.
[CPBW11] Kenjiro Cho, Cristel Pelsser, Randy Bush, and Youngjoon Won. The
Japan earthquake: the impact on traffic and routing observed by a local
ISP. In Proc. of, pages 2:1–2:8, Tokyo, Japan, December 2011. ACM.
[CTFD09] Ítalo Cunha, Renata Teixeira, Nick Feamster, and Christophe Diot. Mea-
surement methods for fast and accurate blackhole identification with
binary tomography. In Proc. of ACM IMC, IMC ’09, pages 254–266,
New York, NY , USA, 2009. ACM.
[Cym] Team Cymru. IP to ASN Mapping. http://www.team-cymru.org/
Services/ip-to-asn.html.
[DAAC12] A. Dainotti, R. Amman, E. Aben, and K. Claffy. Extracting benefit from
harm: using malware pollution to analyze the impact of political and geo-
physical events on the Internet. ACM Computer Communication Review,
(1):31–39, Jan 2012.
[DLT03] Nick Duffield, Carsten Lund, and Mikkel Thorup. Estimating Flow Dis-
tributions from Sampled Flow Statistics. In Proc. of ACM SIGCOMM,
pages 325–337, Karlsruhe, Germany, August 2003. ACM.
[DSA+11] Alberto Dainotti, Claudio Squarcella, Emile Aben, Kimberly C. Claffy,
Marco Chiesa, Michele Russo, and Antonio Pescapé. Analysis of
country-wide internet outages caused by censorship. In ACM IMC, IMC
’11, pages 1–18, New York, NY , USA, 2011. ACM.
[DTDD07] Amogh Dhamdhere, Renata Teixeira, Constantine Dovrolis, and
Christophe Diot. NetDiagnoser: troubleshooting network unreachabili-
ties using end-to-end probes and routing data. In Proceedings of the 2007
ACM CoNEXT conference, CoNEXT ’07, pages 18:1–18:12, New York,
NY , USA, 2007. ACM.
[eCo11] Practical eCommerce. 20 Top Internet Service Providers,
2011. http://www.practicalecommerce.com/articles/
3225-20-Top-Internet-Service-Providers.
[EKMV04] Cristian Estan, Ken Keys, David Moore, and George Varghese. Building
a better NetFlow. In Proc. of ACM SIGCOMM, pages 245–256, Portland,
Oregon, USA, August 2004. ACM.
[FABK03] Nick Feamster, David G. Andersen, Hari Balakrishnan, and Frans
Kaashoek. Measuring the Effects of Internet Path Faults on Reactive
Routing. In ACM Sigmetrics - Performance 2003, San Diego, CA, Jun.
2003.
[FH10] Xun Fan and John Heidemann. Selecting Representative IP Addresses
for Internet Topology Studies. In Proc. of ACM IMC, pages 411–423,
Melbourne, Australia, Nov. 2010. ACM.
[FHG11] Xun Fan, John Heidemann, and Ramesh Govindan. LANDER IP history
datasets. http://www.isi.edu/ant/traces/ipv4_history, 2011.
[FMM+04] Anja Feldmann, Olaf Maennel, Z. Morley Mao, Arthur Berger, and Bruce
Maggs. Locating Internet Routing Instabilities. In Proc. of ACM SIG-
COMM, pages 205–218, New York, NY , USA, 2004. ACM.
[GDS+03] Krishna P. Gummadi, Richard J. Dunn, Stefan Saroiu, Steven D. Gribble,
Henry M. Levy, and John Zahorjan. Measurement, modelling, and anal-
ysis of a peer-to-peer file-sharing workload. In Proc. of 19th Symposium
on Operating Systems Principles, pages 314–329, Bolton Landing, NY ,
USA, October 2003. ACM.
[GHJ+09] Albert Greenberg, James R. Hamilton, Navendu Jain, Srikanth Kandula,
Changhoon Kim, Parantap Lahiri, David A. Maltz, Parveen Patel, and
Sudipta Sengupta. VL2: a scalable and flexible data center network. In
Proceedings of the ACM SIGCOMM 2009 conference on Data communi-
cation, SIGCOMM ’09, pages 51–62, New York, NY , USA, 2009. ACM.
[GHW+10] Hongyu Gao, Jun Hu, Christo Wilson, Zhichun Li, Yan Chen, and Ben Y.
Zhao. Detecting and characterizing social spam campaigns. In ACM IMC,
pages 35–47, 2010.
[GJN11] Phillipa Gill, Navendu Jain, and Nachiappan Nagappan. Understanding
network failures in data centers: measurement, analysis, and implica-
tions. In Proceedings of the ACM SIGCOMM 2011 conference, ACM
SIGCOMM, pages 350–361, 2011.
[GNU] GNU. The R Project for Statistical Computing. http://www.
r-project.org/.
[GS09] Mehmet H. Gunes and Kamil Sarac. Analyzing router responsiveness
to active measurement probes. In Proceedings of the 10th Interna-
tional Conference on Passive and Active Network Measurement, PAM
’09, pages 23–32, Berlin, Heidelberg, 2009. Springer-Verlag.
[GT00] Ramesh Govindan and Hongsuda Tangmunarunkit. Heuristics for Inter-
net Map Discovery. In Proc. of IEEE Infocom, pages 1371–1380, Tel
Aviv, Israel, March 2000. IEEE.
[GZCF06] Bamba Gueye, Artur Ziviani, Mark Crovella, and Serge Fdida.
Constraint-based geolocation of Internet hosts. ACM/IEEE Transactions
on Networking, 14(6):1219–1232, December 2006.
[HBP+05] Alefiya Hussain, Genevieve Bartlett, Yuri Pryadkin, John Heidemann,
Christos Papadopoulos, and Joseph Bannister. Experiences with a con-
tinuous network tracing infrastructure. In Proceedings of the ACM SIG-
COMM MineNet Workshop, pages 185–190, Philadelphia, PA, USA,
August 2005. ACM.
[HDL+98] K.-P. Ho, H. Dal, C. Lin, S.-K. Liaw, H. Gysel, and M. Ramachandran.
Hybrid wavelength-division-multiplexing systems for high-capacity dig-
ital and analog video trunking applications. IEEE Photonics Technology
Letters, 10:297–299, February 1998.
[HFLX07] Yiyi Huang, Nick Feamster, Anukool Lakhina, and Jim (Jun) Xu. Diag-
nosing network disruptions with network-wide analysis. In Proceedings
of the 2007 ACM SIGMETRICS international conference on Measure-
ment and modeling of computer systems, SIGMETRICS’07, pages 61–
72, New York, NY , USA, 2007. ACM.
[HFT08] Yiyi Huang, Nick Feamster, and Renata Teixeira. Practical issues with
using network tomography for fault diagnosis. SIGCOMM Comput. Com-
mun. Rev., 38:53–58, September 2008.
[HH12] Zi Hu and John Heidemann. Towards geolocation of millions of IP
addresses. In Proceedings of the ACM Internet Measurement Conference,
pages 123–130, Boston, MA, USA, 2012. ACM.
[HPG+08] John Heidemann, Yuri Pradkin, Ramesh Govindan, Christos Papadopou-
los, Genevieve Bartlett, and Joseph Bannister. Census and Survey of the
Visible Internet. In Proc. of ACM IMC, pages 169–182, Vouliagmeni,
Greece, Oct. 2008. ACM.
[HQP12] John Heidemann, Lin Quan, and Yuri Pradkin. A preliminary analysis
of network outages during Hurricane Sandy. Technical Report ISI-TR-
2012-685, USC/Information Sciences Institute, November 2012.
[Inc13] MaxMind Inc. MaxMind GeoIP Geolocation Products, 2013. http:
//www.maxmind.com/en/city.
[Int07] Internet Software Consortium. Internet Domain Survey. web pagehttp:
//www.isc.org/solutions/survey, January 2007.
[Int12] Internet Corporation for Assigned Names and Numbers. Global policy
for post exhaustion ipv4 allocation mechanisms by the iana, May 2012.
[Int13] Internet Assigned Numbers Authority. IANA IPv4 address space
registry. web page http://www.iana.org/assignments/
ipv4-address-space/ipv4-address-space.txt, May 2013.
[Jac88] V. Jacobson. Congestion avoidance and control. In Symposium proceed-
ings on Communications architectures and protocols, SIGCOMM ’88,
pages 314–329, New York, NY , USA, 1988. ACM.
[KB12] Ethan Katz-Bassett. Private communications, May 2012.
[KBJK+06] Ethan Katz-Bassett, John P. John, Arvind Krishnamurthy, David Wether-
all, Thomas Anderson, and Yatin Chawathe. Towards IP geolocation
using delay and topology measurements. In Proc. of ACM IMC, pages
71–84, Rio de Janeiro, Brazil, October 2006. ACM.
[KBMJ+08a] Ethan Katz-Bassett, Harsha V. Madhyastha, John P. John, Arvind Krish-
namurthy, David Wetherall, and Thomas Anderson. Studying black holes
in the internet with Hubble. In USENIX NSDI, NSDI’08, pages 247–262,
Berkeley, CA, USA, 2008. USENIX Association.
[KBMJ+08b] Ethan Katz-Bassett, Harsha V. Madhyastha, John P. John, Arvind Krish-
namurthy, David Wetherall, and Thomas Anderson. Studying black holes
in the Internet with Hubble. In Proc. of 5th USENIX NSDI, pages 247–
262. USENIX, April 2008.
[KBSC+12] Ethan Katz-Bassett, Colin Scott, David R. Choffnes, Ítalo Cunha, Vytau-
tas Valancius, Nick Feamster, Harsha V. Madhyastha, Tom Anderson, and
Arvind Krishnamurthy. LIFEGUARD: Practical repair of persistent route
failures. In Proc. of ACM SIGCOMM, pages 395–406, Helsinki, Finland,
August 2012. ACM.
[KCF+08] Hyunchul Kim, kc claffy, Marina Fomenkov, Dhiman Barman, Michalis
Faloutsos, and KiYoung Lee. Internet traffic classification demystified:
myths, caveats, and the best practices. In Proceedings of the 2008 ACM
CoNEXT Conference, pages 1–12, New York, NY , USA, 2008. ACM.
[Key] Keynote. Internet Health Report. http://www.internetpulse.net.
[Key10] Ken Keys. Internet-scale IP alias resolution techniques. ACM Computer
Communication Review, 40(1):50–55, January 2010.
[KW99] H. T. Kung and S. Y. Wang. TCP Trunking: Design, Implementation and
Performance. In IEEE International Conference on Network Protocols,
page 222, Washington, DC, USA, 1999. IEEE Computer Society.
[KWNP10] Christian Kreibich, Nicholas Weaver, Boris Nechaev, and Vern Paxson.
Netalyzr: illuminating the edge network. In Proceedings of the 10th ACM
SIGCOMM conference on Internet measurement, IMC ’10, pages 246–
259, New York, NY , USA, 2010. ACM.
[KYGS05] Ramana Rao Kompella, Jennifer Yates, Albert Greenberg, and Alex C.
Snoeren. IP fault localization via risk modeling. In USENIX NSDI,
NSDI’05, pages 57–70, Berkeley, CA, USA, 2005. USENIX Associa-
tion.
[KYGS07] Ramana Rao Kompella, Jennifer Yates, Albert Greenberg, and Alex C.
Snoeren. Detection and Localization of Network Black Holes. In Proc. of
IEEE Infocom, 2007.
[LABJ00] Craig Labovitz, Abha Ahuja, Abhijit Bose, and Farnam Jahanian.
Delayed Internet routing convergence. In Proc. of ACM SIGCOMM,
pages 175–187, New York, NY , USA, 2000. ACM.
[Lay12] Open Layers. Openlayers: Free maps for the web. web site http://
openlayers.org, 2012.
[Lia05] T. Warren Liao. Clustering of time series data—a survey. Pattern Recog-
nition, 38(11):1857 – 1874, 2005.
[LIJM+10] Craig Labovitz, Scott Iekel-Johnson, Danny McPherson, Jon Oberheide,
and Farnam Jahanian. Internet inter-domain traffic. In Proc. of ACM
SIGCOMM, SIGCOMM ’10, pages 75–86, New York, NY, USA, 2010.
SIGCOMM, SIGCOMM ’10, pages 75–86, New York, NY , USA, 2010.
ACM.
[LL10] Derek Leonard and Dmitri Loguinov. Demystifying service discovery:
Implementing an internet-wide scanner. In Proc. of ACM IMC, pages
109–123, Melbourne, Victoria, Australia, November 2010. ACM.
[LMJ97] Craig Labovitz, G. Robert Malan, and Farnam Jahanian. Internet routing
instability. In Proc. of ACM SIGCOMM, pages 115–126, New York, NY ,
USA, 1997. ACM.
[LTWW94] W.E. Leland, M.S. Taqqu, W. Willinger, and D.V. Wilson. On the self-
similar nature of Ethernet traffic (extended version). ACM/IEEE Transac-
tions on Networking, 2(1):1–15, February 1994.
[LYL08] Xin Liu, Xiaowei Yang, and Yanbin Lu. To filter or to authorize: network-
layer DoS defense against multimillion-node botnets. In ACM SIG-
COMM, pages 195–206, 2008.
[Mal11] Om Malik. In Japan, many undersea cables are dam-
aged. GigaOM blog, http://gigaom.com/broadband/
in-japan-many-under-sea-cables-are-damaged/, Mar. 14
2011.
[Max13] MaxMind Inc. MaxMind GeoIP Geolocation Products, 2013. http:
//www.maxmind.com/app/geolocation.
[MCM13] Doug Madory, Chris Cook, and Kevin Miao. Superstorm Sandy: Impacts
on global connectivity. presentation at NANOG 57, February 2013.
[MIB+04] Athina Markopoulou, Gianluca Iannaccone, Supratik Bhattacharyya,
Chen nee Chuah, and Christophe Diot. Characterization of Failures in
an IP Backbone. In Proc. of IEEE Infocom, 2004.
[Mil97] Rupert G. Miller. Beyond ANOVA: Basics of Applied Statistics (Texts in
Statistical Science Series). Chapman & Hall/CRC, January 1997.
[MIP+06a] Harsha V. Madhyastha, Tomas Isdal, Michael Piatek, Colin Dixon,
Thomas Anderson, Arvind Krishnamurthy, and Arun Venkataramani.
iPlane: an information plane for distributed services. In USENIX OSDI,
OSDI ’06, pages 367–380, Berkeley, CA, USA, 2006. USENIX Associa-
tion.
[MIP+06b] Harsha V. Madhyastha, Tomas Isdal, Michael Piatek, Colin Dixon,
Thomas Anderson, Arvind Krishnamurthy, and Arun Venkataramani.
iPlane: An information plane for distributed services. In Proc. of 7th
USENIX OSDI, pages 367–380, Seattle, WA, USA, November 2006.
USENIX.
[Mkc00] Sean McCreary and kc claffy. Trends in Wide Area IP Traffic Patterns -
A View from Ames Internet Exchange. CAIDA ITC Specialist Seminar,
2000.
[Moc87] P. Mockapetris. Domain names—concepts and facilities. RFC 1034,
Internet Request For Comments, November 1987.
[MPD00] David Moore, Ram Periakaruppan, and Jim Donohoe. Where in the world
is netgeo.caida.org?, July 2000.
[MV09] James A. Muir and Paul C. Van Oorschot. Internet geolocation: Evasion
and counterevasion. ACM Computing Surveys, 42(1), December 2009.
[MWA02] Ratul Mahajan, David Wetherall, and Tom Anderson. Understanding
BGP misconfiguration. In Proc. of ACM SIGCOMM, pages 3–16, New
York, NY , USA, 2002. ACM.
[MZPP08] Ratul Mahajan, Ming Zhang, Lindsey Poole, and Vivek Pai. Uncovering
performance dierences among backbone isps with netdi. In Proceed-
ings of the 5th USENIX Symposium on Networked Systems Design and
Implementation, NSDI’08, pages 205–218, Berkeley, CA, USA, 2008.
USENIX Association.
[NS07] Maitreya Natu and Adarshpal Sethi. Probabilistic fault diagnosis using
adaptive probing. In Alexander Clemm, Lisandro Granville, and Rolf
Stadler, editors, Managing Virtualization of Networks and Services, vol-
ume 4785 of Lecture Notes in Computer Science, pages 38–49. Springer
Berlin / Heidelberg, 2007.
[oI] University of Illinois. IP to latitude/longitude server. web page http:
//cello.cs.uiuc.edu/cgi-bin/slamm/ip2ll.
[oO] University of Oregon. The Route Views Project. http://www.
routeviews.org/.
[OOK] OOKLA. Speedtest. http://www.speedtest.net.
[Pax96] Vern Paxson. End-to-end routing behavior in the internet. In Proc. of
ACM SIGCOMM, SIGCOMM ’96, pages 25–38, New York, NY , USA,
1996. ACM.
[Pax97] Vern Paxson. Automated Packet Trace Analysis of TCP Implementations.
In Proc. of ACM SIGCOMM, pages 167–179, Cannes, France, September
1997. ACM.
[PFTK98] J. Padhye, V . Firoiu, D. Towsley, and J. Kurose. Modelling TCP Through-
put: A Simple Model and its Empirical Validation. In Proc. of ACM SIG-
COMM, pages 303–314, Vancouver, Canada, September 1998. ACM.
[Pos12] Huffington Post. Obama google plus hangout, Jan. 2012. http:
//www.huffingtonpost.com/2012/01/30/obama-google-plus_
n_1242816.html.
[PR06] Larry Peterson and Timothy Roscoe. The design principles of planetlab.
SIGOPS Oper. Syst. Rev., 40(1):11–16, January 2006.
[PS01] Venkata N. Padmanabhan and Lakshminarayanan Subramanian. An
investigation of geographic mapping techniques for Internet hosts. In
Proc. of ACM SIGCOMM, pages 173–185, San Diego, California, USA,
August 2001. ACM.
[QH10a] Lin Quan and John Heidemann. Detecting internet outages with active
probing. Technical Report ISI-TR-2011-672, USC/Information Sciences
Institute, May 2010.
[QH10b] Lin Quan and John Heidemann. On the characteristics and reasons of
long-lived internet flows. In Proc. of ACM IMC, pages 444–450, Mel-
bourne, Australia, November 2010. ACM.
[QHP12a] Lin Quan, John Heidemann, and Yuri Pradkin. Detecting internet out-
ages with precise active probing (extended). Technical Report ISI-TR-
2012-678b, USC/Information Sciences Institute, February 2012. (This
TR supersedes ISI-TR-2011-672.)
[QHP12b] Lin Quan, John Heidemann, and Yuri Pradkin. Supplemental data and
visualizations, September 2012. http://www.isi.edu/ant/pubs/
network-vis/.
[QHP12c] Lin Quan, John Heidemann, and Yuri Pradkin. Visualizing sparse internet
events: Network outages and route changes. In Proceedings of the First
ACM Workshop on Internet Visualization, page to appear, Boston, Mass.,
USA, November 2012. Springer.
[QHP13a] Lin Quan, John Heidemann, and Yuri Pradkin. LANDER Internet outage
datasets. http://www.isi.edu/ant/traces/internet_outages,
2013.
[QHP13b] Lin Quan, John Heidemann, and Yuri Pradkin. Poster abstract: Towards
active measurements of edge network outages. In Proceedings of the
Passive and Active Measurement Workshop, pages 276–279, Hong Kong,
China, March 2013. Springer.
[QHP13c] Lin Quan, John Heidemann, and Yuri Pradkin. Trinocular: Understanding
internet reliability through adaptive probing. In Proceedings of the ACM
SIGCOMM Conference, pages 255–266, Hong Kong, China, August
2013. ACM.
[Qos10] Qosient Inc. Audit Record Generation and Usage System (ARGUS),
2010. http://www.qosient.com/argus/.
[RBM+05] I. Rish, M. Brodie, Sheng Ma, N. Odintsova, A. Beygelzimer,
G. Grabarnik, and K. Hernandez. Adaptive diagnosis in distributed sys-
tems. IEEE Transactions on Neural Networks, 16(5):1088–1109, September
2005.
[Rob00] L.G. Roberts. Beyond Moore's law: Internet growth trends. Computer,
33(1):117–119, 2000.
[Sli10] Sling Media Inc. Slingbox: a tv streaming device, 2010. http://en.
wikipedia.org/wiki/Slingbox.
[SRS99] Anees Shaikh, Jennifer Rexford, and Kang G. Shin. Load-sensitive rout-
ing of long-lived IP flows. In Proc. of ACM SIGCOMM, pages 215–226,
Cambridge, MA, USA, September 1999. ACM.
[SS05] Yuval Shavitt and Eran Shir. DIMES: let the internet measure itself. ACM
Computer Communication Review, 35:71–74, Oct. 2005.
[SS11a] Aaron Schulman and Neil Spring. Pingin’ in the rain. In Proc. of ACM
IMC, pages 19–28, 2011.
[SS11b] Aaron Schulman and Neil Spring. Pingin’ in the rain. In Proc. of ACM
IMC, pages 19–25, Berlin, Germany, November 2011. ACM.
[SZ10] Yuval Shavitt and Noa Zilberman. A study of geolocation databases.
CoRR abs/1005.5674, 2010.
[Tim11a] International Business Times. Optus, Telstra see ser-
vice outages after Cyclone Yasi, Feb. 3 2011. http:
//hken.ibtimes.com/articles/108249/20110203/
optus-telstra-see-service-outages-after-cyclone-yasi.
htm.
[Tim11b] Los Angeles Times. Amazon apologizes for temporary
server outage, 2011. http://www.latimes.com/business/
la-fi-amazon-apology-20110430,0,4604776.story.
[Tim11c] New York Times. Egypt cuts off most internet and cell ser-
vice, 2011. http://www.nytimes.com/2011/01/29/technology/
internet/29cutoff.html.
[TLSS10] Daniel Turner, Kirill Levchenko, Alex C. Snoeren, and Stefan Savage.
California fault lines: understanding the causes and impact of network
failures. In Proc. of ACM SIGCOMM, pages 315–326, New York, NY ,
USA, 2010. ACM.
[TMW97] Kevin Thompson, Gregory J. Miller, and Rick Wilder. Wide-Area Internet
Traffic Patterns and Characteristics (extended version). IEEE Network
Magazine, 11(6):10–23, Nov/Dec 1997.
[Top] TopTenReviews. Internet Service Provider Review. http://
isp-review.toptenreviews.com.
[TR04] Renata Teixeira and Jennifer Rexford. A measurement framework for
pin-pointing routing changes. In Proc. of the ACM SIGCOMM workshop
on Network troubleshooting, NetT ’04, pages 313–318, New York, NY ,
USA, 2004. ACM.
[USC] USC. High-Performance Computing and Communications. http://
www.usc.edu/hpcc/.
[USC12] USC/LANDER Project Traces, September 2012. http://www.isi.
edu/ant/lander/traces.
[USC13a] USC/LANDER project. Internet address survey dataset, predict id
usc-lander/internet_address_survey_adaptive_reprobing_
a12c. web pagehttp://www.isi.edu/ant/lander, April 2013.
[USC13b] USC/LANDER project. Internet address survey dataset, predict id
usc-lander/internet_address_survey_adaptive_reprobing_
a12j. web pagehttp://www.isi.edu/ant/lander, April 2013.
[USC13c] USC/LANDER project. Internet address survey dataset, predict id
usc-lander/internet_address_survey_adaptive_reprobing_
a12w. web pagehttp://www.isi.edu/ant/lander, April 2013.
[VLF+11] Vytautas Valancius, Cristian Lumezanu, Nick Feamster, Ramesh Johari,
and Vijay V. Vazirani. How many tiers?: pricing in the internet transit
market. In Proceedings of the ACM SIGCOMM 2011 conference, SIG-
COMM ’11, pages 194–205, New York, NY , USA, 2011. ACM.
[Web11] Webnet77. IpToCountry database, March 2011. http://software77.
net/geo-ip/.
[Wik] Wikipedia. Analysis of variance. http://en.wikipedia.org/wiki/
Analysis_of_variance.
[Wik12] Wikipedia. Hurricane Sandy. http://en.wikipedia.org/wiki/
Hurricane_sandy, 2012. Retrieved 2012-11-24.
[WKB+10] Eric Wustrow, Manish Karir, Michael Bailey, Farnam Jahanian, and Geoff
Huston. Internet background radiation revisited. In Proc. of ACM IMC,
IMC ’10, pages 62–74, New York, NY , USA, 2010. ACM.
[Wol] Wolfram MathWorld. ANOVA. http://mathworld.wolfram.com/
ANOVA.html.
[Wol13] Wolfram MathWorld. ANOVA, November 2013. http://mailman.
apnic.net/mailing-lists/bgp-stats/.
[XW05] Rui Xu and D. Wunsch II. Survey of clustering algorithms. IEEE Trans-
actions on Neural Networks, 16(3):645 –678, May 2005.
[YOB+09] He Yan, Ricardo Oliveira, Kevin Burnett, Dave Matthews, Lixia Zhang,
and Dan Massey. BGPmon: A real-time, scalable, extensible monitoring
system. In Proc. of IEEE Cybersecurity Applications and Technologies
Conference for Homeland Security (CATCH), pages 212–223, Washing-
ton, DC, USA, March 2009. IEEE.
[ZBPS02] Yin Zhang, Lee Breslau, Vern Paxson, and Scott Shenker. On the charac-
teristics and origins of internet flow rates. In Proc. of ACM SIGCOMM,
pages 309–322, New York, NY , USA, 2002. ACM.
[ZD01] Yin Zhang and Nick Duffield. On the Constancy of Internet Path Prop-
erties. In Proceedings of the 1st ACM SIGCOMM Workshop on Internet
Measurement, IMW ’01, pages 197–211, New York, NY , USA, 2001.
ACM.
[ZZP+04] Ming Zhang, Chi Zhang, Vivek Pai, Larry Peterson, and Randy Wang.
PlanetSeer: Internet Path Failure Monitoring and Characterization in
Wide-area Services. In USENIX OSDI, pages 12–12, Berkeley, CA, USA,
2004. USENIX Association.
Abstract
The Internet is important for nearly all aspects of our society, affecting ordinary people, businesses, and social activities. Because of its importance and wide‐spread applications, we want to have good knowledge about Internet's operation, reliability and performance, through various kinds of measurements. However, despite the wide usage, we only have limited knowledge of its overall performance and reliability. The first reason of this limited knowledge is that there is no central governance of the Internet, making both active and passive measurements hard. The second reason is the huge scale of the Internet. This makes brute‐force analysis hard because of practical computing resource limits such as CPU, memory and probe rate. ❧ This thesis states that sampling and aggregation are necessary to overcome resource constraints in time and space to learn about better knowledge of the Internet. Many other Internet measurement studies also utilize sampling and aggregation techniques to discover properties of the Internet. We distinguish our work by exploring novel mechanisms and new knowledge in several specific areas. First, we aggregate short‐time‐scale observations and use an efficient multi‐time‐scale query scheme to discover the properties and reasons of long‐lived Internet flows. Second, we sample and probe /24 blocks in the IPv4 address space, and use greedy clustering algorithms to efficiently characterize Internet outages. Third, we show an efficient and effective aggregation technique by visualization and clustering. This technique makes both manual inspection and automated characterization easier. Last, we develop an adaptive probing system to study global scale Internet reliability. It samples and adapts probe rate within each /24 block for accurate beliefs. By aggregation and correlation to other domains, we are also able to study broader policy effects on Internet use, such as political causes, economic conditions, and access technologies. 
This thesis provides several examples of Internet knowledge discovery using new sampling and aggregation mechanisms. We believe these approaches can be used by, and will inspire new ways for, future Internet measurement systems to overcome resource constraints such as large volumes of widely dispersed data.
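The adaptive probing idea summarized above can be sketched as follows: probe addresses in a /24 block one at a time, maintain a Bayesian belief that the block is reachable, and stop early once that belief is confident in either direction. This is a minimal, hypothetical illustration; the likelihood values, the `responds` callback, and the stopping threshold are assumptions chosen for clarity, not the dissertation's actual parameters.

```python
import random

def adaptive_probe_block(responds, max_probes=15, confidence=0.95):
    """Probe addresses in a /24 block one at a time, stopping early once
    the belief that the block is up (or down) crosses `confidence`.

    `responds` is a callable: responds(addr) -> bool (True if the probe
    got a reply).  Returns (belief_block_up, probes_sent).
    """
    p_up = 0.5                      # uniform prior: block equally likely up or down
    # Assumed likelihoods (illustrative only): a live block answers a
    # probe 90% of the time, a dead block answers 1% of the time.
    p_ans_up, p_ans_down = 0.9, 0.01

    addrs = list(range(256))
    random.shuffle(addrs)           # sample addresses in the block in random order

    sent = 0
    for addr in addrs[:max_probes]:
        sent += 1
        if responds(addr):
            num = p_ans_up * p_up
            den = num + p_ans_down * (1 - p_up)
        else:
            num = (1 - p_ans_up) * p_up
            den = num + (1 - p_ans_down) * (1 - p_up)
        p_up = num / den            # Bayesian update of the block-up belief
        if p_up >= confidence or p_up <= 1 - confidence:
            break                    # confident enough either way: stop probing
    return p_up, sent
```

With these assumed likelihoods, a block that always answers is classified as up after a single probe, while a silent block needs only two probes before the belief drops below the down threshold, showing how adapting the probe rate to accumulated evidence saves probes over exhaustively scanning all 256 addresses.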
Conceptually similar
Enabling efficient service enumeration through smart selection of measurements
Global analysis and modeling on decentralized Internet
Measuring the impact of CDN design decisions
Improving user experience on today's internet via innovation in internet routing
Improving network reliability using a formal definition of the Internet core
Detecting and mitigating root causes for slow Web transfers
Aggregation and modeling using computational intelligence techniques
Learning the geometric structure of high dimensional data using the Tensor Voting Graph
Block-based image steganalysis: algorithm and performance evaluation
Sampling theory for graph signals with applications to semi-supervised learning
Scalable sampling and reconstruction for graph signals
Novel and efficient schemes for security and privacy issues in smart grids
Contributions to structural and functional retinal imaging via Fourier domain optical coherence tomography
Transmission tomography for high contrast media based on sparse data
Detecting and characterizing network devices using signatures of traffic about end-points
Efficient graph learning: theory and performance evaluation
Cloud-enabled mobile sensing systems
Machine learning for efficient network management
Resource underutilization exploitation for power efficient and reliable throughput processor
Do humans play dice: choice making with randomization
Asset Metadata
Creator: Quan, Lin (author)
Core Title: Learning about the Internet through efficient sampling and aggregation
School: Viterbi School of Engineering
Degree: Doctor of Philosophy
Degree Program: Computer Science
Publication Date: 02/26/2014
Defense Date: 11/22/2013
Publisher: University of Southern California (original), University of Southern California. Libraries (digital)
Tag: Internet measurement, knowledge discovery, OAI-PMH Harvest, sampling and aggregation
Language: English
Contributor: Electronically uploaded by the author (provenance)
Advisor: Heidemann, John (committee chair), Huang, Ming-Deh (committee member), Katz-Bassett, Ethan (committee member), Ortega, Antonio K. (committee member)
Creator Email: linquan@usc.edu, quanlin.thu@gmail.com
Permanent Link (DOI): https://doi.org/10.25549/usctheses-c3-366017
Unique identifier: UC11287981
Identifier: etd-QuanLin-2271.pdf (filename), usctheses-c3-366017 (legacy record id)
Legacy Identifier: etd-QuanLin-2271.pdf
Dmrecord: 366017
Document Type: Dissertation
Rights: Quan, Lin
Type: texts
Source: University of Southern California (contributing entity), University of Southern California Dissertations and Theses (collection)
Access Conditions: The author retains rights to his/her dissertation, thesis or other graduate work according to U.S. copyright law. Electronic access is being provided by the USC Libraries in agreement with the a...
Repository Name: University of Southern California Digital Library
Repository Location: USC Digital Library, University of Southern California, University Park Campus MC 2810, 3434 South Grand Avenue, 2nd Floor, Los Angeles, California 90089-2810, USA