Scaling-out Traffic Management in the Cloud
by
Rui Miao
A Dissertation Presented to the
FACULTY OF THE USC GRADUATE SCHOOL
UNIVERSITY OF SOUTHERN CALIFORNIA
In Partial Fulfillment of the
Requirements for the Degree
DOCTOR OF PHILOSOPHY
(COMPUTER SCIENCE)
August 2018
Copyright 2018 Rui Miao
Acknowledgements
I wish to express my deepest appreciation to my advisor Prof. Minlan Yu for her mentorship and
for giving me the opportunity and the freedom to work on exciting problems. The work presented in
this dissertation would not have been possible without her insightful guidance. Additionally, I
have enjoyed working closely with Changhoon Kim from Barefoot Networks, Navendu Jain from
Microsoft Research, James Hongyi Zeng from Facebook, and Prof. Ethan Katz-Bassett from
Columbia University. Their collaboration and mentorship have been an integral part of the work
outlined in this thesis. I would also like to thank the Department of Computer Science at the
University of Southern California.
I would like to thank Prof. Ramesh Govindan and Prof. Konstantinos Psounis for serving on my
dissertation committee and for their comments and suggestions on improving the quality of this
dissertation.
I am grateful for the collaboration, help, support, and mentorship of the past and present members
of the USC Networked Systems Lab and Minlan's group at Harvard University.
It has also been my honor to collaborate with many prestigious researchers and to receive help
from many others. I wish to express sincere appreciation to Jeongkeun Lee,
Jitendra Padhye, Patrick Bosshart, Prof. Nick McKeown, Kyriakos Zarifis, Matt Calder, and
Dileep Rao.
Finally, I would like to thank my family and friends for their support during graduate school,
especially my wife Ting Xiao for her dedicated support over the past six years since we came to
America together.
Table of Contents
Acknowledgements ii
List Of Tables vii
List Of Figures viii
Abstract xii
Chapter 1: Introduction 1
1.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1
1.2 Traditional Traffic Management . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2
1.3 Scaling-out Traffic Management . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4
1.4 Summary of Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6
1.5 Dissertation Plan . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8
Chapter 2: SilkRoad: Making Stateful Layer-4 Load Balancing Fast and Cheap
Using Switching ASICs 9
2.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10
2.2 Background on load balancing . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13
2.2.1 Layer 4 load balancing function . . . . . . . . . . . . . . . . . . . . . . . . . 13
2.2.2 Limitations of software load balancers . . . . . . . . . . . . . . . . . . . . . 15
2.2.3 Duet: Storing VIPTable in ASICs . . . . . . . . . . . . . . . . . . . . . . . 16
2.3 Challenges of Frequent DIP Pool Updates . . . . . . . . . . . . . . . . . . . . . . . 17
2.3.1 Frequent DIP pool updates . . . . . . . . . . . . . . . . . . . . . . . . . . . 17
2.3.2 Problems of storing ConnTable in SLBs . . . . . . . . . . . . . . . . . . . . 19
2.4 SilkRoad Design . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 22
2.4.1 Features in commodity switching ASICs . . . . . . . . . . . . . . . . . . . . 23
2.4.2 Scaling to millions of connections . . . . . . . . . . . . . . . . . . . . . . . . 25
2.4.3 Ensuring per-connection consistency . . . . . . . . . . . . . . . . . . . . . . 29
2.5 Implementation and Deployment . . . . . . . . . . . . . . . . . . . . . . . . . . . . 33
2.5.1 Prototype implementation on high-speed commodity ASICs . . . . . . . . . 33
2.5.2 Prototype performance and overhead . . . . . . . . . . . . . . . . . . . . . . 34
2.5.3 Network-wide deployment . . . . . . . . . . . . . . . . . . . . . . . . . . . . 35
2.6 Evaluation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 37
2.6.1 Scalability . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 38
2.6.2 Ensuring PCC . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 41
2.7 Discussion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 42
2.8 Related Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 44
2.9 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 45
Chapter 3: The Dark Menace: Characterizing Network-based Attacks in the
Cloud 46
3.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 47
3.2 Datasets and Methodology . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 49
3.2.1 Cloud provider overview . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 49
3.2.2 Dataset and attack detection methodology . . . . . . . . . . . . . . . . . . . 50
3.3 Attack Overview and Validation . . . . . . . . . . . . . . . . . . . . . . . . . . . . 55
3.3.1 Attack Overview . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 55
3.3.2 Validation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 58
3.4 Analysis of attacks by VIP . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 62
3.4.1 Attack frequency per VIP . . . . . . . . . . . . . . . . . . . . . . . . . . . . 62
3.4.2 Attacks on the same VIP . . . . . . . . . . . . . . . . . . . . . . . . . . . . 66
3.4.3 Attacks on multiple VIPs . . . . . . . . . . . . . . . . . . . . . . . . . . . . 67
3.4.4 Cloud services under inbound attack . . . . . . . . . . . . . . . . . . . . . . 68
3.5 Attack Characterization . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 70
3.5.1 Attack throughput . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 70
3.5.2 Attack duration and inter-arrival time . . . . . . . . . . . . . . . . . . . . . 74
3.6 Internet AS Analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 76
3.6.1 Inbound attacks . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 76
3.6.2 Outbound attacks . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 80
3.7 Nimbus: Attack Detection as a Cloud Service . . . . . . . . . . . . . . . . . . . . . 84
3.7.1 Customized attack detection and traffic selection . . . . . . . . . . . . . . . 84
3.7.2 Balance detection accuracy and cost . . . . . . . . . . . . . . . . . . . . . . 86
3.7.3 Adjust sampling rate on a fine time scale . . . . . . . . . . . . . . . . . . . 86
3.7.4 Evaluating cost-effectiveness . . . . . . . . . . . . . . . . . . . . . . . . . . . 87
3.8 Existing security practices . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 88
3.9 Related Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 90
3.10 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 91
Chapter 4: DIBS: Just-in-time Congestion Mitigation for Data Centers 92
4.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 93
4.2 DIBS overview . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 96
4.3 Requirements . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 99
4.4 Design Considerations and Implications . . . . . . . . . . . . . . . . . . . . . . . . 102
4.5 Evaluation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 104
4.5.1 NetFPGA implementation and evaluation: . . . . . . . . . . . . . . . . . . . 105
4.5.2 Click implementation and evaluation . . . . . . . . . . . . . . . . . . . . . . 105
4.5.3 Simulations setup . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 107
4.5.4 Performance under different traffic conditions . . . . . . . . . . . . . . . . . 109
4.5.4.1 Impact of background traffic . . . . . . . . . . . . . . . . . . . . . 109
4.5.4.2 Impact of query arrival rate . . . . . . . . . . . . . . . . . . . . . . 111
4.5.4.3 Impact of query response size . . . . . . . . . . . . . . . . . . . . . 111
4.5.4.4 Impact of incast degree . . . . . . . . . . . . . . . . . . . . . . . . 112
4.5.5 Performance of different network configurations . . . . . . . . . . . . . . . . 114
4.5.5.1 Impact of buffer size . . . . . . . . . . . . . . . . . . . . . . . . . . 114
4.5.5.2 Impact of shared buffers . . . . . . . . . . . . . . . . . . . . . . . . 115
4.5.5.3 Impact of TTL . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 115
4.5.5.4 Impact of oversubscription . . . . . . . . . . . . . . . . . . . . . . 116
4.5.6 Fairness . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 117
4.5.7 When does DIBS break? . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 117
4.5.8 Comparison to pFabric . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 118
4.5.9 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 119
4.6 Discussion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 120
4.7 Related Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 122
4.8 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 124
Chapter 5: Conclusions and Future Directions 125
5.1 Extending scaling-out Traffic management . . . . . . . . . . . . . . . . . . . . . . . 126
5.2 Scaling-out Cloud Services using Programmable Data Plane . . . . . . . . . . . . . 127
Bibliography 129
List Of Tables
1.1 Summary of works presented in this thesis . . . . . . . . . . . . . . . . . . . . . . . . . 6
2.1 Trend of SRAM size and switching capacity in ASICs. (SRAM does not include packet
buffer; estimated based on table sizes claimed in whitepapers.) . . . . . . . . . . . . . 23
2.2 Additional H/W resources used by SilkRoad with 1M connection entries, normalized by
the usage of the baseline switch.p4. . . . . . . . . . . . . . . . . . . . . . . . . . . . . 35
3.1 Summary of the network-based attacks in the cloud we studied. . . . . . . . . . . . 51
3.2 Detected inbound alerts and outbound incident reports (“-” means that the alerts
or reports do not support the attack type). . . . . . . . . . . . . . . . . . . . . . . 59
3.3 The percentage of total victim VIPs hosting different services involved with different
inbound attacks; all numbers are in %. . . . . . . . . . . . . . . . . . . . . . . . . . 69
4.1 Default DIBS settings following common practice in data centers (e.g., [49]) unless
otherwise specified. In Table 4.2, we indicate how we explore varying buffer sizes and other
traffic and network parameters. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 105
4.2 Simulation parameter ranges we explore. Bold values indicate default settings. The top
half of the table captures traffic intensity, including background traffic (top row, with
lower inter-arrival time indicating more traffic) and query intensity (next three rows). The
bottom half of the table captures network and switch configuration, including buffer sizes,
initial packet TTLs, and link oversubscription. We revisit some aspects in additional
sections on shared switch buffers (§4.5.5.2) and extreme congestion (§4.5.7). . . . . . . . 109
List Of Figures
2.1 ConnTable and VIPTable in load balancers. . . . . . . . . . . . . . . . . . . . . . . . . 14
2.2 Frequent DIP pool updates (Y% of clusters have more than X updates per minute in the
median or 99th percentile minute in a month.) . . . . . . . . . . . . . . . . . . . . . . 16
2.3 Distribution of root causes for DIP additions and removals (in a month). . . . . . . . . . 16
2.4 Distribution of downtime duration with various root causes. (Provisioning does not cause
downtime) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 16
2.5 The dilemma between SLB loads and PCC violations. . . . . . . . . . . . . . . . . . . 21
(a) SLB loads. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 21
(b) PCC violations. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 21
2.6 Num. of active connections per ToR switch across clusters. . . . . . . . . . . . . . . . . 26
2.7 Using digest and version in DIP selection. . . . . . . . . . . . . . . . . . . . . . . . . . 28
2.8 Number of new connections per VIP in one minute. . . . . . . . . . . . . . . . . . . . . 29
2.9 TransitTable to ensure PCC during DIP pool updates (c_a-c_f are connection digests). 31
(a) Timeline of connections (c_c and c_d are pending connections at t_exec(u)). . . . . 31
(b) Step 1: between t_req(u) and t_exec(u) . . . . . . . . . . . . . . . . . . . . . . . 31
(c) Step 2: between t_exec(u) and t_finish(u) . . . . . . . . . . . . . . . . . . . . . 31
2.10 System architecture. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 32
2.11 Network-wide VIP assignment to different layers. . . . . . . . . . . . . . . . . . . . . . 36
2.12 SRAM usage in SilkRoad deployed on ToR switches across clusters. . . . . . . . . . . . 37
2.13 The ratio of #SLBs and #SilkRoad to support load balancing across clusters. . . . . . . 37
2.14 Memory saving for SilkRoad deployed on ToR switches across clusters . . . . . . . . . . 37
2.15 Benefit of version reuse. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 38
2.16 Effectiveness of ensuring PCC with various update frequencies. . . . . . . . . . . . . . . 39
2.17 Impact of new connection arrival rates (10 updates per minute, SilkRoad TransitTable=256B). 39
2.18 TransitTable size (10 updates per minute). . . . . . . . . . . . . . . . . . . . . . . . . 39
3.1 The distribution of inactive time for each attack type; the x-axis is on log-scale. . 54
(a) Inbound . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 54
(b) Outbound . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 54
3.2 Percentage of total inbound and outbound attacks. . . . . . . . . . . . . . . . . . 55
3.3 Attack characterization for VIPs with inbound and outbound attacks; the x-axis is
on log-scale in the top figure. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 63
(a) Number of attacks per (VIP, day). . . . . . . . . . . . . . . . . . . . . . . . . 63
(b) Inbound attacks for VIPs with occasional/frequent attacks. . . . . . . . . . . 63
(c) Outbound attacks for VIPs with occasional/frequent attacks. . . . . . . . . . 63
3.4 CDF of the percentage of VIP active time in attack. . . . . . . . . . . . . . . . . . 65
3.5 Inbound and outbound attacks on the same VIP. We estimate the UDP throughput
and the upper bound of the number of RDP connections based on the 1 in 4096
sampling rate. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 67
3.6 The 99th percentile and the peak number of VIPs simultaneously involved in the
same type of attacks. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 68
3.7 Median and maximum aggregate throughput by attack type; the y-axis is on log-scale. 70
3.8 Median and maximum attack throughput across VIPs; the y-axis is on log-scale. . 72
3.9 Median and 99th percentile of attack duration by attack type; the y-axis is on
log-scale. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 73
3.10 Median and 99th percentile of attack inter-arrival time by attack type; the y-axis is
on log-scale. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 74
3.11 Different types of ASes generating inbound attacks. . . . . . . . . . . . . . . . . . . 77
(a) Percentage of inbound attacks in each AS type. . . . . . . . . . . . . . . . . . 77
(b) Average of percentage of inbound attacks per AS in each AS type. . . . . . . 77
3.12 Percentage of inbound attacks from big clouds and mobile ASes in each attack type. 78
3.13 Different types of ASes generating inbound DNS and SPAM attacks. . . . . . . . 79
(a) Percentage of inbound DNS or spam attacks in each AS type. . . . . . . . . . 79
(b) Average percentage of inbound DNS or spam attacks per AS in each AS type. 79
3.14 Attack geolocation distribution. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 81
(a) Geolocation distribution of inbound attack sources. . . . . . . . . . . . . . . . 81
(b) Geolocation distribution of outbound attack targets. . . . . . . . . . . . . . . 81
3.15 Different types of ASes targeted by outbound attacks. . . . . . . . . . . . . . . . . 82
(a) Percentage of outbound attacks in each AS type. . . . . . . . . . . . . . . . . 82
(b) Average percentage of outbound attacks per AS in each AS type. . . . . . . . 82
3.16 Top Internet applications under outbound attacks. . . . . . . . . . . . . . . . . . . 83
3.17 NIMBUS architecture . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 85
3.18 NIMBUS achieves better cost-effectiveness than a fixed sampling-rate + autoscaling approach
under different parameter settings. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 87
4.1 Example path of a packet detoured 15 times in a K=8 fat-tree topology. For simplicity, we
only show 1 edge switch and 1 aggregation switch in the sender’s pod, and abstract the 16
core switches into a single node. The numbers and the arc thicknesses indicate how often the
packet traversed that arc. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 94
4.2 Detours and buffer occupancy of switches in a congested pod. During heavy congestion,
multiple switches are detouring. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 96
(a) Detours per switch over time. (Each dot denotes the decision of a switch to detour
a packet.) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 96
(b) Buffer occupancy at times t1, t2, t3 in a pod. Each switch is represented by 8 bars.
Each bar is an outgoing port connecting to a node in the layer below or above. The
size of the bar represents the port’s output queue length: (green: packets in buffer;
yellow: buffer buildup; red: buffer overflow). . . . . . . . . . . . . . . . . . . . . 96
4.3 Sparsity of hotspots in four workloads . . . . . . . . . . . . . . . . . . . . . . . . . . . 99
4.4 Hotlinks. Baseline workload is 300 qps, high is 2000 qps and extreme is 10,000 qps. See
Table 4.2. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 101
4.5 Neighboring buffer size . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 102
4.6 Click implementation of DIBS achieves near optimal Query Completion Times because no
flow experiences timeouts . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 106
(a) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 106
(b) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 106
4.7 Variable background traffic. Collateral damage is consistently low. (incast degree: 40;
response size: 20KB; query arrival rate: 300 qps). Although depicted on one graph, QCT
and background FCT are separate metrics on different traffic and cannot be compared to
each other. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 110
4.8 Variable query arrival rate. Collateral damage is low. Query traffic rate has little impact
on collateral damage, and, at high query rate, DIBS improves performance of background
traffic. (Background inter-arrival time: 120ms; incast degree: 40; response size: 20KB) . 110
4.9 Variable response size. Collateral damage is low, but depends on response size. Improve-
ment in QCT decreases with increasing response size. (Background inter-arrival time:
120ms; incast degree: 40; query arrival rate: 300 qps) . . . . . . . . . . . . . . . . . . . 110
4.10 Variable incast degree. Collateral damage is low, but depends on incast degree. (Back-
ground inter-arrival time: 120ms; query arrival rate: 300 qps; response size: 20KB) . . . 112
4.11 Variable buffer size. There is no collateral damage and DIBS performs best with medium
buffer size. (Background inter-arrival rate: 10ms; incast degree: 40; response size: 20KB;
query arrival rate: 300 qps) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 112
(a) Background traffic . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 112
(b) Query traffic . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 112
4.12 Variable max TTL. Limiting TTL does not have significant impact on background traffic.
(Background inter-arrival rate: 10ms; incast degree: 40; response size: 20KB; query arrival
rate: 300 qps) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 114
4.13 Extreme query intensity (Background inter-arrival time: 120ms; incast degree: 40) . . . . 114
4.14 Large query response sizes (Background inter-arrival time: 120ms; incast degree: 40) . . 116
4.15 DIBS vs pFabric: Mixed traffic: Variable query arrival rate. (Background inter-arrival
time: 120ms; incast degree: 40; response size: 20KB) . . . . . . . . . . . . . . . . . . . 116
(a) Background traffic . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 116
(b) Query traffic . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 116
Abstract
Managing cloud traffic is challenging due to its large and constantly growing scale and its
traffic anomalies. Network infrastructure and traffic management need to scale their capacity to
such traffic growth and anomalies; otherwise, application performance will suffer. Existing
traffic management functions have so far relied on proprietary hardware appliances and software
servers. However, with limited capacity and fixed functionality per box, those solutions incur
high cost, low performance, and high management complexity.
In this thesis, we argue that traffic management functions should be scaled out to support
increasing traffic scale and anomalies. By scaling out, we mean that those traffic management
functions should support the full throughput of datacenter networks. The key idea of this thesis
is to leverage hardware switches, with their line-rate packet processing and emerging
programmability, to directly build advanced functionalities. We have identified three major
traffic management functions: load balancing, attack mitigation, and congestion control. We
present SilkRoad, a load balancer built directly in hardware switching ASICs, which provides the
load balancing function at line rate while keeping track of millions of connections. We then
present a large-scale measurement study of the characteristics of network attacks both coming
towards the cloud and originating from it. After systematically analyzing nine types of attacks
and quantifying their prevalence, intensity, and patterns, we advocate a cloud attack detection
service called Nimbus that leverages hardware switches to select and filter traffic for accurate
attack detection in virtual machines. Last, we present DIBS, a mechanism that allows switches to
share their buffers to absorb traffic bursts. We demonstrate that it reduces the 99th percentile
of delay-sensitive query completion time by up to 85%, with very little impact on other traffic.
Chapter 1
Introduction
1.1 Introduction
Today, large cloud providers host tens of thousands of different services, and their market is
expected to reach $411 billion by 2020 [88]. These hosted services run in geographically
distributed datacenters, each with tens of thousands of servers and thousands of switches. The
datacenter network needs to serve external traffic from Internet users and internal traffic among
services, and hence must provide high performance and high uptime across a wide diversity of
traffic.
Managing cloud traffic is challenging. First, cloud traffic is large in scale and keeps growing
constantly: doubling every year in Facebook [26] and growing by 50 times in six years in Google
[147]. For a typical mid-size datacenter, the external traffic from Internet users can reach up to
hundreds of Gbps, and the internal traffic can be one or two orders of magnitude larger [133].
Network infrastructure and management services need to scale their capacity to such traffic
growth; otherwise, the performance of hosted applications and services will suffer.
Second, there are traffic anomalies in datacenter networks, such as traffic bursts and attacks.
1) Traffic bursts arise from many datacenter applications: for example, in web search, many
workers respond nearly simultaneously to search queries [49], and in MapReduce, intermediate
key-value pairs from many Mappers are transferred to the same Reducers during the shuffle
stage [78]. These traffic bursts can be as large as over half of the full-duplex bisection bandwidth
of the network [103]. Traffic bursts cause network congestion and excessive packet drops; as a
result, application jobs experience long delays to finish. Moreover, traffic bursts also cause
packet loss and delayed transmission for other flows traversing the same congested switches or
paths.
2) Network attacks also happen frequently in the cloud. Today, the cloud is becoming an attractive
target for attackers seeking to disrupt services, steal data, and compromise resources to launch
further attacks. Network attacks in datacenters have become increasingly severe in scale,
distribution, and sophistication over the years. For example, an 800 Gbps DDoS attack was observed
in 2016 [22]. Attacks can not only make cloud services unavailable to intended users, but also
affect cloud infrastructure such as load balancers, BGP sessions, and monitoring, which in turn
affects other services sharing the cloud. In 2013, an attack against Xbox Live caused more than 50
Windows Azure services to experience connection difficulties or go offline [20].
In addition to their severe impact, these traffic anomalies happen very frequently in the cloud.
According to measurements of buffer occupancy at Facebook, in each one-second interval over a
24-hour period there is at least one instance where over two-thirds of the available shared switch
buffer is utilized, even though link utilization is on the order of 1% most of the time [143]. A
recent survey of datacenter operators indicates that 98% of them experienced DDoS attacks, with
34% experiencing frequent attacks [22]. Datacenter networks need to handle traffic anomalies and
prevent them from degrading the performance of hosted applications.
1.2 Traditional Traffic Management
In order to handle both large-scale traffic and traffic anomalies, network operators typically
employ several network management functions: 1) load balancing: splitting user requests for a
service among a group of servers; 2) attack mitigation: protecting cloud services against network
attacks; 3) congestion control: preventing packet drops and ensuring on-time delivery.
When traffic arrives at a datacenter, the attack mitigation function first filters out attack
traffic, and the load balancing function splits user traffic among a group of servers. The traffic
is then forwarded to the selected server along a path across multiple hops of switches, where the
congestion control function is employed to prevent packet drops and ensure on-time delivery.
Internal traffic goes through the same three functions. Because of their importance in serving all
datacenter traffic along the path, we view load balancing, attack mitigation, and congestion
control as the three major traffic management functions in the cloud. Their performance has a
critical impact on hosted cloud applications.
Today, traffic management functions are instantiated in hardware appliances or server software.
Cloud operators usually buy specialized hardware appliances called "middleboxes" from third-party
companies [21, 129]. However, those specialized hardware appliances have limited capacity and high
cost, and they typically provide only 1+1 redundancy. Cloud operators have started to develop
software solutions for traffic management, leveraging commodity servers and the elastic computing
resources in the cloud. However, both hardware appliances and software-based solutions fail to
address the scale and anomaly challenges in datacenter networks. Current solutions have the
following problems:
High Capital and Operating Costs. Load balancing and attack mitigation functions deployed in
hardware appliances or servers provide limited capacity per device (about 10 Gbps or 1M packets
per second). In order to support the large and growing cloud-scale traffic, the network operator
has to deploy more and more devices for those traffic management functions, which incurs a high
cost. Those servers could otherwise be used to run revenue-generating user applications.
Low Performance and Poor Performance Isolation. Attack mitigation appliances provide fixed
functionality with respect to the types of attacks they can handle. As attack sophistication and
impact evolve rapidly, those appliances usually fail to protect the network against advanced
attacks. Moreover, processing traffic in server software for load balancing and attack mitigation
incurs high latency and jitter, and poor performance isolation. In addition, software-based
congestion control usually responds too late to large traffic bursts, and thus extensive packet
drops and application latency spikes inevitably happen.
High Management Complexity. Hardware appliances and servers for load balancing and attack
mitigation are usually placed at fixed locations (e.g., central or aggregation points), and the
cloud operator has to explicitly reroute traffic through them, which introduces additional latency
and requires complex routing hacks. It is also hard to handle traffic anomalies in a cost-effective
way: the deployed devices can be either over-provisioned to cover peak traffic, at a high cost, or
under-provisioned, with the risk of being overwhelmed by traffic spikes such as flash crowds or
DDoS attacks.
1.3 Scaling-out Traffic Management
Today, L2/L3 network functionality has already been scaled out by treating the datacenter fabric
as a big virtual switch, with work ranging from multi-rooted tree topology design [91] to
datacenter transport [49]. Building on top of those L2/L3 functionalities, can we provide
scaled-out traffic management functions that support the full throughput of a data center? This is
the main question we address in this thesis. We argue that the key to answering this question is
to leverage another type of network device: hardware switches.
Given the limitations of software and specialized hardware, we can build those three traffic
management functions by leveraging switch programmability to address both the scale and anomaly
challenges at line rate. Because switches process all network traffic at line rate, they naturally
scale with the network size. Today, switches are becoming more programmable through OpenFlow, P4,
and primitives such as hashing, match-action units (TCAM/SRAM), header parsing, and stateful
counters/meters. In addition, we leverage virtual machines to provide flexible processing with
elasticity for on-demand deployment. We propose a set of scaled-out designs for different traffic
management functions by leveraging 1) the full line-rate packet processing in hardware switches
and 2) the flexible programmability of virtual machines. The key idea is to let hardware switches
process most of the traffic with high performance, with virtual machines as a backup that processes
only a small subset of traffic with fine-grained control. First, we implement the traffic
management functions directly on network switches, using fine-grained control APIs to enable
high-performance in-network packet processing. For instance, load balancing is built into network
switches by leveraging hash-based traffic splitting and tunneling, and switches' on-board buffers
are used to directly absorb short traffic bursts. Second, we place fine-grained traffic processing
and analysis in virtual machines with flexible programmability. However, hardware switches and
virtual machines have specific resource restrictions that bring the following challenges:
Limited memory and buffer in switches. An individual switch has limited memory to store all
ongoing connections and limited buffer to absorb large traffic bursts. In response, we propose
resource sharing among switches, viewing the whole data center network as a big virtual switch.
By treating the whole network as one virtual load balancer, each switch only needs to handle the
connections traversing it. Similarly, each switch can detour excess packets to its neighboring
switches to avoid packet drops, effectively creating a virtual buffer shared among all network
switches.
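As an illustration of the detour idea, the following Python sketch models a switch that, instead
of dropping a packet when its output queue is full, forwards it out of a randomly chosen other
port. The function and variable names are hypothetical; this is only a behavioral model, not the
DIBS data plane presented in Chapter 4.

    import random

    def forward_or_detour(packet, dest_port, queues, queue_capacity, ports):
        # Normal case: enqueue on the port toward the destination.
        if len(queues[dest_port]) < queue_capacity:
            queues[dest_port].append(packet)
            return dest_port
        # Destination queue is full: borrow buffer space from a neighbor by
        # detouring out of a randomly chosen other port instead of dropping.
        candidates = [p for p in ports
                      if p != dest_port and len(queues[p]) < queue_capacity]
        if not candidates:
            return None      # every local queue is full; the packet is dropped
        detour_port = random.choice(candidates)
        queues[detour_port].append(packet)
        return detour_port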
Limited programmability in switches. Hardware switches support only limited programmability and
are optimized for packet forwarding. Typical primitives are match-action units and counters: a
match-action unit matches specific header fields of a packet and performs pre-defined actions,
while counters maintain traffic state across packets. In order to support complex traffic
management functions, we need to build advanced functionality on top of this limited
programmability. For example, we maintain connection states in the match-action units for load
balancing. However, switches take non-constant time to install a connection state, and in the
meantime any network change can trigger state inconsistency. To address this, we leverage the
counting primitive to build a Bloom filter that differentiates traffic states across network
changes. As another example, we leverage sampling groups to implement flexible sampling for
different sets of traffic.
Limited capacity of individual VMs. To address the limited capacity of an individual VM, we
propose to auto-scale the VMs in the face of traffic changes and to use flexible sampling to select
only the required traffic so that it fits the VM capacity.
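The following sketch illustrates, under assumed parameters, how sampling and auto-scaling can
interact: the sampling rate is chosen to fit the per-VM processing budget, and the number of VMs
tracks the sampled load. The function name and the numbers are hypothetical, not Nimbus's actual
policy.

    import math

    def plan_detection(traffic_pps, vm_capacity_pps, max_vms, min_rate=1e-4):
        # Highest sampling rate one VM could sustain for this traffic volume,
        # floored so that some visibility into the traffic always remains.
        rate = max(min(1.0, vm_capacity_pps / traffic_pps), min_rate)
        sampled_pps = traffic_pps * rate
        # Scale the number of detection VMs with the sampled load.
        vms = min(max_vms, max(1, math.ceil(sampled_pps / vm_capacity_pps)))
        return rate, vms

    # Example: 100 Mpps of traffic, 1 Mpps of processing capacity per VM.
    rate, vms = plan_detection(traffic_pps=100e6, vm_capacity_pps=1e6, max_vms=8)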
Table 1.1 summarizes the works in this thesis. For each traffic management function, the table
shows the real-world measurement study we conducted with large cloud service providers to
understand the problems and requirements, the scale-out solution that addresses those problems,
the consequent resource limitations, and the system design that addresses them.
Function | Real-world measurement study | Scale-out solution | Resource limitations | System design | Ch.
Load balancing | Large-scale traffic and intensive updates (Facebook) | Load balancing built on switching ASICs | Switch memory | Compact data structure that fits switch memory | Ch. 2
Attack mitigation | Large-scale and diverse attacks (Microsoft Azure) | Auto-scaling attack detection service | VM capacity | VM auto-scaling with flexible sampling | Ch. 3
Congestion control | Incast traffic (DCTCP w/ Microsoft) | Detour-Induced Buffer Sharing | Switch buffer | Absorbing bursts with detour | Ch. 4
Table 1.1: Summary of works presented in this thesis
1.4 Summary of Results
This thesis presents three novel systems that demonstrate the feasibility of scaling out to the
full throughput of datacenter traffic and highlight the benefits of leveraging emerging
programmable hardware switches for traffic management functions.
SilkRoad is a load balancer built directly on a state-of-the-art switching ASIC. The challenge is
that the load balancer must keep track of millions of connections simultaneously. Until recently,
it was not possible to implement a load balancer with connection tracking in a merchant switching
ASIC, because high-performance switching ASICs typically cannot maintain per-connection states
with consistency guarantees. We explore how to use switching ASICs to build much faster load
balancers than have been built before, and we show SilkRoad can load-balance ten million
connections simultaneously at line rate. SilkRoad achieves a 40%-95% SRAM reduction and thus can
fit into switch SRAM for all the clusters we studied. Using the available SRAM in switches, one
SilkRoad can replace up to hundreds of software load balancers. Our design always keeps track of
connections correctly, using a Bloom filter of only 256 bytes, even under the most frequent service
changes observed in the network.
Nimbus is a system to safeguard cloud-hosted tenants against network-based attacks in the
cloud. Using three months of NetFlow data from a large cloud provider, we present the first
large-scale characterization of inbound attacks towards the cloud and outbound attacks from the
cloud. We investigate nine types of attacks ranging from network-level attacks such as DDoS to
application-level attacks such as SQL injection and spam. Our analysis covers the complexity,
intensity, duration, and distribution of these attacks, highlighting the key challenges in defending
against attacks in the cloud. By characterizing the diversity of cloud attacks, we propose to
coordinate flexible packet sampling and filtering at commodity switches, customizable out-of-band
attack detection at VMs, and in-band mitigation at switches. Our simulations and prototype
evaluations using cloud attack traces show that Nimbus can quickly detect attacks at near line
speed with high accuracy and resource efficiency, and it is robust against rapidly changing attack
traffic.
DIBS is a mechanism that achieves a near-lossless network without requiring additional buffers at
individual switches. We argue that switches should share buffer capacity to effectively handle such
spot congestion without the monetary hit of deploying large buffers at individual switches. Using
DIBS, a congested switch detours packets randomly to neighboring switches to avoid dropping them.
We implemented DIBS in hardware, on software routers in a testbed, and in simulation, and we
demonstrate that it reduces the 99th percentile of delay-sensitive query completion time by up to
85%, with very little impact on other traffic.
1.5 Dissertation Plan
This thesis proceeds as follows. In Chapter 2 we present SilkRoad, a load balancer built directly
on a switching ASIC that can keep track of millions of connections. In Chapter 3 we perform a
comprehensive measurement study of the characteristics of both inbound and outbound attacks in the
cloud, and then discuss a system called Nimbus that addresses those attacks by coordinating
flexible packet sampling and filtering at commodity switches with flexible, elastic attack
detection at VMs. In Chapter 4 we present DIBS, a mechanism that allows switches to share their
buffers to absorb traffic bursts. In Chapter 5 we discuss the future of traffic management in the
cloud and conclude.
Chapter 2
SilkRoad: Making Stateful Layer-4 Load Balancing Fast
and Cheap Using Switching ASICs
In the previous chapter, we discussed how current traffic management solutions suffer from high
cost, low performance, and high management complexity. In this chapter, we take a first step
toward scaling out load balancing by building the load balancing function directly on switching
ASICs. By doing so, we show that up to hundreds of software load balancer (SLB) servers can be
replaced by a single modern switching ASIC, potentially reducing the cost of load balancing by over
two orders of magnitude. Today, large data centers typically employ hundreds or thousands of
servers to load-balance incoming traffic over application servers. These software load balancers
(SLBs) map packets destined to a service (with a virtual IP address, or VIP) to a pool of servers
tasked with providing the service (with multiple direct IP addresses, or DIPs). An SLB is stateful:
it must always map a connection to the same server, even if the pool of servers changes and/or the
load is spread differently across the pool. This property is called per-connection consistency (PCC).
The challenge is that the load balancer must keep track of millions of connections simultaneously.
Until recently, it was not possible to implement a load balancer with PCC in a merchant switching
ASIC, because high-performance switching ASICs typically cannot maintain per-connection states
with PCC. Newer switching ASICs provide resources and primitives to enable PCC at a large scale.
In this paper, we explore how to use switching ASICs to build much faster load balancers than have
been built before. Our system, called SilkRoad, is defined in a 400-line P4 program, and when
compiled to a state-of-the-art switching ASIC, we show it can load-balance ten million connections
simultaneously at line rate.
2.1 Introduction
Stateful layer-4 (L4) load balancers scale out services hosted in cloud datacenters by mapping
packets destined to a service with a virtual IP address (VIP) to a pool of servers with multiple
direct IP addresses (DIPs or DIP pool). L4 load balancing is a critical function for inbound traffic
to the cloud and traffic across tenants. A previous study [133] reports that an average of 44% of
cloud traffic is VIP traffic and thus needs the load balancing function. Building cloud-scale L4 load
balancing faces two major challenges:
Support full bisection traffic with low latency: Data centers have rapid growth in traffic:
doubling every year in Facebook [26] and growing by 50 times in six years in Google [147]. While
the community has made efforts to scale out L2/L3 virtual switching to match full bisection
bandwidth for intra-datacenter traffic (or full gateway capacity for inbound traffic) [51, 109], one
missing piece is scaling L4 load balancers to match the full bisection bandwidth of the underlying
physical network. Load balancing is also a critical segment for the end-to-end performance of
delay-sensitive applications [87] and for low latency data centers (e.g., 2-5 μs RTT with RDMA
[160]).
Ensure per connection consistency (PCC) during frequent DIP pool changes: Data
center networks are constantly changing to handle failures, deploy new services, upgrade existing
services, and react to traffic increases [90]. Each operational change can result in many DIP
pool changes. For example, when we upgrade a service, we need to bring down DIPs and upgrade
them one by one to avoid affecting the service capacity. Such frequent DIP pool updates are
observed from a large web service provider with about a hundred data center clusters (§2.3.1).
During a DIP pool change, it is critical to ensure per connection consistency (PCC), which
means all the packets of a connection should be delivered to the same DIP. Sending packets of an
ongoing connection to a different DIP breaks the connection. It often takes subseconds to seconds
for applications to recover from a broken connection (e.g., one second in Wget), which significantly
affects user experience.
Today, L4 load balancing is often implemented in software servers [133, 82]. The software load
balancer (SLB) can easily support DIP pool updates and ensure PCC, but cannot provide full
bisection bandwidth with low latency and low cost. This is because processing packets in software
incurs high compute overhead (requiring thousands of servers or around 3.75% of the data center
size [82]), high latency and jitter (50 μs to 1 ms) [86, 82], and poor performance isolation (§2.2.2).
In contrast, if we run load balancing in switches, we can process the same amount of traffic with
about two orders of magnitude saving in power and capital cost [25, 18].
To improve throughput and latency, offloading load balancing to hardware is an appealing
option, similar to offloading packet checksumming, segmenting and rate limiting to NICs [106, 116].
One approach to run load balancing at switches is to leverage ECMP hashing to map a VIP to a
DIP pool without maintaining the connection state at switches [86]. However, in this approach,
during a DIP pool update, switches need to redirect all the related traffic to SLBs to create the
connection state there and ensure PCC. The issue here is that there is no clean way to decide
when to migrate the VIP traffic back to switches (§2.3.2). If we migrate too early, many ongoing
connections that should match the old DIP pool may match the new DIP pool at switches and
violate PCC. If we migrate too late, the SLBs process most of the traffic (the ‘slow-path’ problem),
losing the throughput/latency benefits of using switches. If we migrate VIPs back to switches
periodically as used in Duet [86], it leads to either around 1% of connections broken or up to 70%
of traffic handled in SLBs (§2.3.2).
Instead, we propose SilkRoad, which uses simple hardware primitives available in today’s
switching ASICs for stateful load balancing. SilkRoad aims to provide a direct path between
application traffic and application servers by eliminating the need for another software layer (SLBs)
in-between. SilkRoad maintains the connection state at the switch, thus ensuring PCC for all
the connections. In addition, every packet of a VIP connection is forwarded by ASIC, and hence
SilkRoad inherits all the benefits of high-speed commodity ASICs such as high throughput, low
latency and jitter, and better performance isolation. The key challenge for SilkRoad is to maintain
connection states in the ASIC using hardware primitives while scaling up to millions of connections
as well as ensuring PCC upon frequent DIP pool updates. SilkRoad addresses this challenge via
the following contributions.
Fitting millions of connections in SRAM: Switching ASICs have continuously increased
memory size (growing by five times over the past four years, as shown in Table 2.1 in §2.4.1)
and have just reached a stage where storing all the connection states and running load balancing
at switches becomes possible. However, with a naive approach, storing the states of ten million
connections in a match-action table takes a few hundred MB of SRAM (in the case of an IPv6
connection, a connection entry takes 37 bytes to store the 5-tuple as the match key, 18 bytes to
store the new destination address plus port number as action data, and a couple of bytes of
packing overhead). This is far more than the
50-100 MB SRAM size available in the latest generation of switching ASICs. To reduce the SRAM
usage, we looked at what constitutes each connection entry: match field and action data. We
propose to store a small hash of a connection rather than the 5-tuple to reduce the match field
size, while our design mitigates the impact of false positives introduced by the hash. To reduce
the action data bits of a connection entry, we store a DIP pool version rather than the actual
DIPs and intelligently reuse version numbers across a series of DIP pool updates.
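The memory arithmetic behind this design choice can be sketched as follows; the full-entry sizes
are the IPv6 figures quoted above, while the digest and version sizes below are assumptions chosen
only for illustration.

    connections = 10_000_000

    # Naive ConnTable entry: full IPv6 5-tuple key + DIP action + packing (bytes).
    naive_entry_bytes = 37 + 18 + 2
    naive_mb = connections * naive_entry_bytes / 2**20      # ~540 MB

    # Compact entry: small connection digest + DIP-pool version (sizes assumed).
    compact_entry_bytes = 2 + 1                              # 16-bit digest, 8-bit version
    compact_mb = connections * compact_entry_bytes / 2**20   # ~30 MB, within SRAM budgets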
Ensuring PCC during frequent DIP pool updates: Ensuring PCC is challenging because
switches often use the slow software (running on a switch management CPU) to insert new
connection entries into the table. Hence, a new connection entry may not be ready for the
subsequent packets in a timely fashion. We call those connections that arrive before time t but do
not have a connection entry installed in the table at time t as pending connections. To ensure PCC
during DIP pool updates, SilkRoad remembers pending connections in an on-chip bloom filter,
built on commonly available transactional memory (as counters/meters) in ASICs. We minimize
the size of the bloom filter using a 3-step update process.
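A minimal sketch of the idea is shown below, using a simple Bloom filter keyed on connection
identifiers. The hash functions and parameters are illustrative assumptions rather than the exact
SilkRoad design; a 2048-bit (256-byte) filter is used here only to give a sense of the scale.

    import hashlib

    class PendingFilter:
        """Bloom filter over connection identifiers (sizes and hashes are illustrative)."""
        def __init__(self, bits=2048, hashes=3):      # 2048 bits = 256 bytes
            self.bits, self.hashes = bits, hashes
            self.array = bytearray(bits // 8)

        def _positions(self, conn):
            for i in range(self.hashes):
                h = hashlib.sha256(f"{i}:{conn}".encode()).digest()
                yield int.from_bytes(h[:4], "big") % self.bits

        def add(self, conn):                           # mark a connection as pending
            for pos in self._positions(conn):
                self.array[pos // 8] |= 1 << (pos % 8)

        def maybe_pending(self, conn):                 # may return false positives
            return all((self.array[pos // 8] >> (pos % 8)) & 1
                       for pos in self._positions(conn))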
Since SilkRoad uses existing features on ASICs today, it can be implemented by either modifying
the logic of existing fixed function ASICs or using programmable ASICs (e.g., Barefoot Tofino [9]).
In fact, we built the SilkRoad prototype on a programmable switching ASIC and confirmed that fitting
10M connections is feasible.
We performed an extensive simulation with a variety of production traffic and update traces
from a large web service provider. The results show that SilkRoad achieves 40%-95% SRAM
reduction and thus can fit into switch SRAM for all the clusters we studied. Using the available
SRAM in switches, one SilkRoad can replace up to hundreds of SLBs. Our design always ensures
PCC with a Bloom filter of only 256 bytes, even under the most frequent DIP pool updates observed
in the network.
2.2 Background on load balancing
In this section, we first give background on the layer 4 load balancing function. Next, we discuss
two existing solutions for load balancing: 1) software load balancers running on x86 servers [133, 82]
and 2) Duet [86] which stores VIP-to-DIP mappings (not per-connection states) in switching
ASICs.
2.2.1 Layer 4 load balancing function
In this paper, we use the term load balancers to refer to layer-4 load balancers, as opposed to
layer-7 ones (e.g., Nginx [32]). A load balancer maintains two tables (Figure 2.1):
VIPTable: VIPTable maintains the mapping from a VIP (e.g., a service IP:port tuple 20.0.0.1:80)
to a DIP pool (e.g., a server pool {10.0.0.1:20, 10.0.0.2:20}). When the first packet of a connection
c_1 arrives, the load balancer identifies the DIP pool for its VIP in VIPTable and runs a hash
on the packet header fields (e.g., 5-tuple) to select a DIP (e.g., 10.0.0.2:20). Since the hash is
performed based on the same packet header fields, all subsequent packets of the same connection
pick the same DIP as long as the DIP pool remains static.
Figure 2.1: ConnTable and VIPTable in load balancers.
When the DIP pool changes for server addition or removal, the packets of the same connection
may be hashed to a different DIP, breaking per-connection consistency (PCC). Formally, we define
a load balancing (LB) function as a mapping function from packet p^i_j that belongs to connection
c_i to DIP d_k. We define PCC as: for a given connection c_i, ∀ p^i_j ∈ c_i, LB(p^i_j) = LB(p^i_0) = d_k.
ConnTable: To ensure PCC during DIP pool updates, a load balancer maintains a table
that stores per-connection states. We call it ConnTable. ConnTable maps each connection (e.g.,
5-tuple) to a DIP that is selected by the hash result of VIPTable for the first packet of the
connection. In Figure 2.1, when the first packet of c
1
arrives, a match-action rule that maps c
1
to the DIP is inserted into ConnTable (e.g., [1.2.3.4:1234, 20.0.0.1:80, TCP]→ 10.0.0.2:20). All
the subsequent packets of this connection match the rule in ConnTable and get forwarded to
10.0.0.2:20 consistently.
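To make the two tables concrete, the following toy Python model shows how a lookup proceeds. It is
an illustration only, not the switch data plane; the hash choice and table layout are assumptions.

    import zlib

    vip_table  = {("20.0.0.1", 80): [("10.0.0.1", 20), ("10.0.0.2", 20)]}
    conn_table = {}

    def load_balance(five_tuple):
        # five_tuple = (src_ip, src_port, dst_ip, dst_port, proto)
        if five_tuple in conn_table:                    # ConnTable hit
            return conn_table[five_tuple]
        pool = vip_table[(five_tuple[2], five_tuple[3])]
        h = zlib.crc32(repr(five_tuple).encode())       # hash over the 5-tuple
        dip = pool[h % len(pool)]
        conn_table[five_tuple] = dip                    # install per-connection state
        return dip

Without the installed ConnTable entry, a later DIP pool change would alter the pool size and could
re-hash an ongoing connection to a different DIP, which is exactly the PCC violation defined above.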
PCC challenge: Now suppose a new DIP 10.0.0.3:20 is added to the DIP pool for this given
VIP, which requires an update of the DIP pool members in VIPTable. The challenge here is that, to
guarantee PCC, the VIPTable update must be atomic and synchronous with connection insertions
in ConnTable. In software implementations, the load balancer locks VIPTable and holds new
incoming connections in a buffer to prevent them from being processed by VIPTable. Then, the
SLB updates VIPTable to the new DIP pool and then releases the new connections from the buffer.
This way the SLB ensures PCC for connections arrived both before and after the update, but at
the cost of slow-path packet processing by CPU and buffering delay. Later we will also show that
this synchronous update between VIPTable and ConnTable is difficult to achieve in switching
ASICs (§2.4.3).
2.2.2 Limitations of software load balancers
Most cloud data centers today run software load balancers (SLBs) [133, 82] by implementing both
ConnTable and VIPTable in software. As mentioned in the Duet paper [86], using software has
the following drawbacks:
High cost of server resources: Today’s SLBs typically occupy a significant number of servers.
For example, a typical 40K-server data center has 15 Tbps traffic for load balancing [86] and
requires over 15 Tbps / 10 Gbps=1500 SLB servers or 3.75% of total servers even assuming full
NIC line rate processing. In addition to the NIC speed limit for bit-per-second throughput, the
state-of-the-art SLBs using 8 CPU cores can only achieve up to 12 Mpps (packet-per-second)
throughput [82], three orders of magnitude slower than modern switching ASICs that easily process
billions of packets per second.
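For concreteness, the arithmetic above works out as follows; this simply restates the figures
already cited and introduces no new measurements.

    lb_traffic_bps   = 15e12        # 15 Tbps of traffic needing load balancing
    slb_capacity_bps = 10e9         # one SLB bounded by its 10 Gbps NIC
    servers_total    = 40_000

    slbs_needed  = lb_traffic_bps / slb_capacity_bps    # = 1500 SLB servers
    slb_fraction = slbs_needed / servers_total          # = 0.0375, i.e., 3.75%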
Cloud traffic volume is rapidly growing (doubling every year in Facebook clusters [26] or growing
by 50 times from 2008 to 2014 observed in Google data centers [147]) and lots of the cloud traffic
needs load balancing (44% of traffic as reported in [133]). Thus, we need even more servers for load
balancing, wasting the resources that could otherwise be used for revenue-generating applications.
By contrast, if we run load balancing on switches, we process the same amount of traffic with
about two orders of magnitude saving in power and capital cost [25, 18].
High latency and jitter: SLBs add a high latency of 50 μs to 1 ms for processing packets in
batches [82, 86], which is comparable to the end-to-end RTT in data centers (median 250 μs in
[94]). New techniques such as RDMA enable even lower RTT of 2-5 μs [160]. Therefore, SLBs
become a severe bottleneck for many delay-sensitive applications [160, 87]. Moreover, fulfilling
a service request may trigger a chain of requests for its provider services or third-party services
(e.g., storage, data analytics, etc), traversing SLBs multiple times. The accumulated latencies with
multiple SLB layers in-between hurt tail latency performance experienced by the application.
Figure 2.2: Frequent DIP pool updates (Y% of clusters have more than X updates per minute in the median or 99th percentile minute in a month).
Figure 2.3: Distribution of root causes for DIP additions and removals (in a month).
Figure 2.4: Distribution of downtime duration with various root causes. (Provisioning does not cause downtime.)
Poor performance isolation: When one VIP is under DDoS attacks or experiences a flash
crowd, the other VIP traffic served by the same SLB server instance also experiences increased
delays or even packet drops because of poor resource/performance isolation in x86-based systems.
One may employ rate-limiting at SLBs, but software rate-limiting tools (e.g., Linux TC queueing
discipline (qdisc)) incur high CPU overhead. For example, metering 6.5 Gbps traffic may require a
dedicated 8-core CPU [138]. As a result, we cannot expect fine-grained and efficient performance
isolation on SLBs.
2.2.3 Duet: Storing VIPTable in ASICs
To address the limitations of SLBs, one natural idea is to leverage the high-speed low-cost ASICs
that are already available in data center switches. Duet [86] leverages two features in the ASIC: (1)
Generic hash units: which already exist in ASICs for functions like ECMP, Link Aggregation
Group (LAG), checksum verifier, etc. (2) Match-action tables: which match on selected packet
header fields for various actions and are used for ECMP tables, forwarding tables, etc. Duet uses
ECMP hashing and fixed match-action table to implement VIPTable at switches. Due to the
limited ECMP table size, Duet only uses switches to handle VIPs with high-volume traffic and
employs a few SLBs (with both ConnTable and VIPTable) to handle the other VIPs.
Duet can get around the performance limitations of SLBs. To handle 10 Tbps traffic, the Duet
paper claims it forwards most traffic in switches while needing only 230 SLBs to handle around 5%
of traffic in software. Duet can achieve a median latency of 474 μs [86]. Duet can also achieve
better performance isolation using rate-limiters (meters) at switches. However, as we will show in
the next section, Duet cannot handle frequent DIP pool updates, which is an important scenario
in data centers today.
2.3 Challenges of Frequent DIP Pool Updates
From our study of a large web service provider, we observe a major challenge for load balancing
functions: frequent DIP pool updates. If we store ConnTable only in SLBs, as proposed in recent
works [86], during DIP pool updates, we may incur either a high SLB overhead or many broken
connections, which degrade application performance.
2.3.1 Frequent DIP pool updates
We study about a hundred clusters from a large web service provider. There are three types of
clusters: PoPs (points of presence) where user requests first arrive, Frontends which serve
requests from PoPs, and Backends that run backend services. All clusters need L4 load balancing
functions.
Frequent DIP pool updates in a cluster: We collect the number of DIP pool update events
per minute from the network operation logs in a month. For each cluster, we identify the median
and 99th percentile minute in the month. We then draw the distribution across clusters with
different update frequency in Figure 2.2. For example, overall, there are 32% of clusters with more
than 10 updates per minute in the 99th percentile minute (which means more than 10 updates per
minute for 432 minutes in a month) and 3% of clusters have more than 50 updates. Some clusters
experience 10 updates per minute even in the median minute (i.e., half of the time).
Backends have more frequent DIP pool updates than PoPs/Frontends. For example, half of the
Backends have more than 16 updates in the 99th percentile minute. This indicates a continuous
evolution of backend services. But some PoPs/Frontends have more than 100 updates in the 99th
percentile minute because, in those clusters, a DIP is often shared by most of the VIPs (similar to [82]), and
thus one DIP going down or coming up incurs a burst of updates across all the VIPs.
To understand the burstiness of DIP pool updates across minutes, we also measure the number
of DIP pool updates every ten minutes (not shown in the figure). In the 99th percentile ten-minute window,
42% of clusters have more than 100 updates and 2% of clusters have more than 500 updates. The
latter indicates an average of 50 updates per minute during a 10-minute period.
Why are there frequent DIP pool updates? To understand the sources of frequent DIP
pool updates, we analyze service management logs for all clusters in a month. The logs include
the transition of operational stages (e.g., reserving the machine, setting up container environment,
announcing at the load balancer, etc.) and DIP downtimes. We only select events related to DIP
additions and removals.
Figure 2.3 categorizes the distribution of DIP additions and removals for different root causes.
82.7% of the DIP additions/removals are from VIP service upgrades in Backends, where the service
owner issues a command to upgrade all the DIPs of the service to use the latest version of the
service package and associated configurations.
All the other sources of DIP additions/removals account for less than 13% of the total updates
because they affect only a handful of DIPs at a time. For example, testing is a special case of
service upgrade and applied only in Backends, where the service owner restarts a subset of its DIPs
to run the test version of the service package. Failure (e.g., lost control, application crash, etc.) or
preempting (e.g., maintenance, resource contention, etc.) only triggers the restart or migration of
the specific DIP (or a few DIPs if the physical machine is failed or preempted). Provisioning or
18
removing is to add or delete a specific DIP in the DIP pool to adjust the service capacity according
to traffic changes.
Why cannot we reduce the frequency of DIP pool updates? As we will discuss in §2.3.2,
frequent DIP pool updates add new challenges to load balancing. So one question is whether operators
can reduce the number of DIP pool updates.
One way is to limit the update rate by delaying the execution of some updates. This smoothing
approach may be feasible for some planned upgrades, but delaying updates can badly hurt
application performance and reliability [90], especially for updates that swap out a faulty DIP,
install critical security patches, or roll back a defective service version.
Another way is to reduce the number of updates by merging multiple DIP updates of the same
VIP into a batch. This is not a good choice for either DIP removals or DIP additions. For DIP
removals in service upgrades and testing, we need to ensure enough DIPs are up to provide enough
service capacity at any time. Thus, the cluster scheduler typically uses the rolling reboot strategy,
which reboots a fixed number of DIPs in every certain period (e.g., two DIPs every five minutes).
For DIP additions, it takes different times for a DIP to come back alive (either finish the reboot
or migrate to a new server). For example, for upgrades, the DIP downtime (from reboot to back
alive) is 3 minutes in the median but 100 minutes in the 99th percentile as in Figure 2.4.
For the updates caused by failures, preemption, or DIP removal, we can remove all the related
DIPs in a batch to prevent new connections from reaching these DIPs. However, since it takes
varying amounts of time for these DIPs to come back (Figure 2.4), we have to handle DIP additions
separately.
2.3.2 Problems of storing ConnTable in SLBs
To ensure PCC during DIP pool updates, we rely on ConnTable to remember which DIP a
connection is mapped to. The SLB stores ConnTable in server software but has performance
limitations (§2.2.2). Another option as used in Duet [86] is to maintain VIPTable at switches but
still leave ConnTable at SLBs. Thus, to update the DIP pool of a VIP, we need to redirect all the
traffic of that VIP to SLBs to build up a ConnTable there to perform the DIP pool update. (Before
the update, the SLB has to first wait long enough to ensure it sees at least one packet from each ongoing
connection so that ConnTable has an entry for it; this is another problem of redirecting traffic between
switches and SLBs.)
The main question is when to migrate the VIP from SLBs back to switches: If we migrate
immediately, many remaining old connections (mapping to the old DIP pool) can get hashed to
different DIPs under the new DIP pool at switches and violate PCC. If we migrate later, we need
to handle more traffic in SLBs consuming more software resources. In addition, it is hard to find
the right time to migrate all the VIPs with different DIP pool update timings. Hence, we can
either periodically migrate VIPs to the switches as used in Duet [86], or wait until all the old
connections have finished.
To illustrate the dilemma between PCC violations and SLB loads, we run a flow-level simulation
using one-hour traffic traces collected from one PoP cluster with 149 VIPs. The cluster has an
average of 18.7K new connections per minute per VIP and an average rate of 19.6 Mbps per VIP
per top-of-rack (ToR) switch. We simulate different DIP pool update frequencies with an average
of 1 to 50 updates per minute (as indicated in Figure 2.2). We simulate Hadoop traffic with a
median flow duration of 10 seconds as in [143].
We evaluate three settings of storing ConnTable at SLBs: (1) Migrate-10min: periodically
migrating VIPs back in every ten minutes as used in Duet; (2) Migrate-1min: migrating back in
every minute; and (3) Migrate-PCC, where we wait until all the old connections have terminated
before migrating the VIP to switches to ensure PCC.
The default Migrate-10min has a high SLB load. Figure 2.5a shows that, under 50 updates
per minute, Migrate-10min handles 74.3% of the total traffic volume in SLBs. Note that, even
coordinating with the service upgrade system does not help to decide when to migrate VIPs back,
because it has to wait for the old connections to finish. As a result, this load in SLBs affects not
only during the minute of bursty updates but also up to ten minutes until next VIP migration
event.
Figure 2.5: The dilemma between SLB loads and PCC violations. (a) SLB loads. (b) PCC violations.
Worse still, operators have to provision a large number of SLBs all the time to handle the
burst of updates because it is hard to instantiate SLBs and announce BGP paths fast enough to
react to the burst. Migrate-10min also has a high SLB load for VIPs under rolling reboot. For
example, a large VIP with hundreds of DIPs may take a couple of hours to upgrade all of its DIPs.
Besides, the large VIP often has a large volume of traffic which leads to a high SLB load.
To reduce SLB overhead, one may use Migrate-1min. In fact, Migrate-1min reduces the portion
of traffic handled in SLBs down to 13.2% for 50 updates per minute. However, Migrate-1min
causes more PCC violations than Migrate-10min. Figure 2.5b shows that Migrate-1min has 1.4%
of connections broken for 50 updates per minute. Even Migrate-10min has 0.3% of connections
with PCC violations.
PCC violations significantly degrade the tail latency for cloud services. Unlike packet drops,
which TCP recovers from within a sub-millisecond timescale, a broken connection often takes
sub-seconds to seconds for the application to re-establish (e.g., one second in Wget).
Broken connections also violate the service level agreements for users and affect the cloud revenue.
Migrate-PCC avoids PCC violations, but it causes 93.8% traffic handled in SLBs for 50 updates
per minute (Figure 2.5a).
To be conservative, our experiment uses Hadoop traffic with a short flow duration (a median
of 10 seconds). For other traffic with longer flow durations, the number of PCC violations is
much larger because there are more old connections when the SLBs migrate VIPs to switches. For
example, we also simulate the cache traffic in [143] with a median flow duration of 4.5 minutes.
Migrate-10min has 53.5% of total connections with PCC violation for 50 updates per minute.
In summary, the fundamental problem of storing ConnTable only in SLBs is that during DIP
pool updates, the connections have to transfer between switches and SLBs. Instead, if we store
ConnTable in the switches, we can avoid both the SLB overhead and the broken connections.
2.4 SilkRoad Design
In this section, we present the SilkRoad design which implements both ConnTable and VIPTable
on switching ASICs. Given the recent advances on ASICs with larger SRAMs (§2.4.1), it is now the
right time to make the design choice of storing ConnTable in switches. Unlike the recent approach
of maintaining VIPTable at switches [86], which still handles some packets of a connection at
SLBs during DIP pool updates, SilkRoad ensures that all the packets of a connection are always
handled at switches. In this way, we always gain the benefits of high-speed low-cost ASICs such as
high throughput, low latency and jitter, and good performance isolation, while ensuring PCC.
We address two major challenges in SilkRoad: (1) To store millions of connections in ConnTable
with tens of MB SRAM, we propose to store a hash digest of a connection rather than the actual
5-tuple to reduce the match field size, while eliminating the impact of false positives introduced by
the digests. We also store a DIP pool version rather than the actual DIPs and allow version reuse
to reduce the action field size (§2.4.2). (2) To ensure PCC during frequent DIP pool updates, we must
handle the limitation of slow ConnTable insertion by the switch CPU. We use a small bloom filter
to remember the new connections that arrive during DIP pool updates, thus providing consistent DIP
mappings for those connections in the hardware ASIC (§2.4.3).
ASIC generation | Year | SRAM (MB)
<1.6 Tbps [24, 17] | 2012 | 10-20
3.2 Tbps [13, 18] | 2014 | 30-60
6.4+ Tbps [29, 12, 9] | 2016 | 50-100
Table 2.1: Trend of SRAM size and switching capacity in ASICs. (SRAM does not include packet buffer; estimated based on table sizes claimed in whitepapers.)
In this section, for simplicity of presentation, we assume SilkRoad is deployed only at the
top-of-rack (ToR) switches. We will discuss more flexible deployment scenarios in §2.5.
2.4.1 Features in commodity switching ASICs
Modern switching ASICs bring a significant growth in processing speed (up to 6.5 Tbps [9]) and
also provide resources and primitives that enable us to implement PCC at large scale, via a careful
co-design of switch data plane and control plane. Here we describe the notable characteristics of
modern switching ASICs, which serve as enablers and also pose challenges to our design.
Increasing SRAM sizes: The SRAM size in ASICs has grown by five times over the past
four years and reached 50-100 MB (Table 2.1), to meet the growing requirement to store a large
number of L2/L3 forwarding and ACL entries. We later discuss this trend in more detail (§2.7).
Existing fixed function ASICs often assign dedicated SRAM (or TCAM) blocks to each function.
Emerging programmable ASICs [9, 13] allow network operators to flexibly assign memory blocks
(from multiple physical stages) to user-defined match-action tables, which gives enough room to fit
many connection states into on-chip SRAM with careful engineering.
Connection learning and insertion: The key-value mapping semantics of ConnTable requires
an exact matching table, which is typically implemented as a hash table on SRAM. Hash table
implementations differ in their ways of handling hash collisions [132, 125]. Modern switching
ASICs often take an approach known as cuckoo hashing that rearranges existing entries over a
sequence of moves to resolve a collision [132, 67]. The cuckoo hash table provides a high packing
ratio and memory efficiency but at the cost of running a complex search algorithm (breadth-first
graph traversal) to find an empty slot. The time/space complexity of the algorithm is too high to
run on switching ASICs at line rate. Hence, the entry insertion/deletion is the job of the software
running on a switch management CPU.
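To illustrate the kind of work this software does, the following is a small Python sketch of cuckoo-style insertion with entry displacement. It is illustrative only: the class names and parameters are hypothetical, and the displacement here is a simple random-walk stand-in for the breadth-first search of moves that the switch driver actually runs over the hardware table.

```python
# Illustrative sketch (not the switch driver): a 2-way cuckoo hash table in which
# software resolves collisions by relocating existing entries, as switching ASICs
# expect the management CPU to do for exact-match tables.
import random

class CuckooTable:
    def __init__(self, slots_per_way=1024, max_moves=32):
        self.ways = [dict(), dict()]   # each way: index -> (key, value)
        self.slots = slots_per_way
        self.max_moves = max_moves

    def _index(self, way, key):
        # A different hash (salt) per way, mimicking per-stage hash functions.
        return hash((way, key)) % self.slots

    def insert(self, key, value):
        for _ in range(self.max_moves):
            for way in (0, 1):
                idx = self._index(way, key)
                if idx not in self.ways[way]:
                    self.ways[way][idx] = (key, value)
                    return True
            # Both candidate slots are full: evict one victim, place the new entry,
            # and try to re-insert the victim (simplified random-walk displacement).
            way = random.choice((0, 1))
            idx = self._index(way, key)
            victim_key, victim_val = self.ways[way][idx]
            self.ways[way][idx] = (key, value)
            key, value = victim_key, victim_val
        return False  # no empty slot found within the move budget

    def lookup(self, key):
        for way in (0, 1):
            entry = self.ways[way].get(self._index(way, key))
            if entry and entry[0] == key:
                return entry[1]
        return None
```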
Unlike routing table entries whose insertions are triggered by software routing protocols, the
entry insertion into ConnTable is triggered by a hardware event: the first packet of each
connection hitting the ASIC. For this, we leverage the learning filter available in switching ASICs
for L2 MAC learning. The learning filter usually batches a sequence of new events paired with
additional metadata (e.g., mac-address-to-port mapping) and removes duplicate events. The CPU
reads the arrival events from the filter and runs the cuckoo algorithm to insert new entries into
the hardware table.
The slow connection learning and insertion time via CPU is not a problem for MAC learning
because the frequency of new MAC addresses or server/VM migrations is relatively low. However,
L4 load balancing needs to insert a new entry for every new L4 connection. This brings a new
challenge for ensuring PCC which we will address in §2.4.3.
Transactional memory: Switching ASICs maintain an array of thousands of counters and
meters [67] to collect various statistics and limit traffic rates. The array provides packet transactional
semantics [67, 114]: the update on a counter by a previous packet can be immediately seen and
modified by the right next packet, i.e., read-check-modify-write is done in one clock cycle time. P4
[66] exposes the generalized idea of transactional stateful processing as register arrays. By using
the register primitive, we can implement a simple bloom filter and ensure PCC during DIP pool
updates by remembering pending connections there.
2.4.2 Scaling to millions of connections
Motivation: millions of active connections: To understand the total number of active
connections that we need to store in ConnTable, we take snapshots of ConnTable in all the SLBs
every minute for each cluster. Figure 2.6 calculates the median and 99th percentile numbers of
active connections normalized by the number of ToR switches in each cluster and draws the CDF
across clusters. The 99th percentile number shows the worst-case ConnTable size we need to
provision in each ToR switch if we deploy SilkRoad in the cluster. Among the PoPs and Backends,
the most loaded clusters have around 10M connections. Frontends have fewer connections than
PoPs because PoPs merge many user-facing TCP connections to a few persistent connections to
Frontends.
It is challenging to store 10M connection states in a switch. For IPv6 connections, we need
to store a 37-byte connection key (5-tuple) and an 18-byte action data (DIP and port) in each
connection entry. This means we need at least 550 MB memory for ConnTable, which is far more
than 50-100 MB SRAM available in switches today. Therefore, we use compact data structures to
reduce the memory usage for both the match key and action data fields of each connection entry.
Compact connection match keys by hash digests: To reduce the size of the match field
of each connection entry, we store a hash digest instead of the 5-tuple as proposed in [83]. For
example, rather than storing 37 bytes of a 5-tuple of IPv6 connection, we only need a hash digest
with 16 bits.
Using hash digests introduces a false positive if two connections hash to the same digest in the
same hash location. When a new connection falsely hits in ConnTable, it uses the DIP assigned
to the colliding existing connection. Thus, it cannot get its own DIP for the correct VIP because
it bypasses VIPTable. The good news is that the chance of false positives is
low (0.01% of total connections using a 16-bit digest as shown in §2.6) and we can resolve them
with a marginal software overhead.
Figure 2.6: Number of active connections per ToR switch across clusters.
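To make the digest scheme concrete, the following Python sketch derives a 16-bit digest from a 5-tuple and gives a back-of-envelope estimate of the false-hit probability. It is illustrative only: the hash function and bucket geometry are assumptions, not necessarily what the ASIC computes, and the estimate ignores the multi-stage layout.

```python
import zlib

def digest16(five_tuple):
    # Any uniform hash works; CRC32 truncated to 16 bits is used here purely for
    # illustration, not necessarily what the ASIC implements.
    key = ",".join(map(str, five_tuple)).encode()
    return zlib.crc32(key) & 0xFFFF

def false_hit_estimate(entries_per_bucket=4, digest_bits=16):
    # Rough estimate: a new connection falsely hits only if its digest equals one
    # of the (up to) entries_per_bucket digests stored in the bucket it hashes to.
    return entries_per_bucket / float(2 ** digest_bits)

conn = ("2001:db8::1", 45678, "2001:db8::100", 443, 6)
print(hex(digest16(conn)))
print(false_hit_estimate())   # ~6e-5, the same order as the ~0.01% measured in §2.6
```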
Since a TCP SYN is the first packet of a connection, a SYN matching an existing entry is a good
indication of a false positive. We redirect such a TCP SYN packet that matches an existing entry
in ConnTable to the switch CPU. The switch software has complete 5-tuple information for each
entry in ConnTable and thus can identify the existing entry that causes the false hit for the new
connection.
The software resolves the hash collision by leveraging the multi-stage architecture of a match-
action pipeline and relocating the existing colliding entry to another stage. In switching ASICs, a
large exact-match table like ConnTable is instantiated on multiple physical stages, each storing a
portion of ConnTable in the SRAM blocks of the stage [67, 100]. We can use a different set of
hash functions for each stage. Hence, when a TCP SYN packet collides with an existing entry, the
switch software migrates the existing entry to another stage. At that stage with a different hash
function, these two connections are hashed to separate entry locations, resolving the hash collision.
After relocating the existing entry and inserting a new entry for the SYN packet, the software
sends the SYN packet back to the switch, which then hits on the right entry. This adds a few
milliseconds of delay to the redirected TCP SYN packet. This marginal overhead is well justified by
the high compression ratio (e.g., 2 B vs. 37 B for IPv6 connections) and the low collision chance. Note that
the SYN packet still triggers normal connection learning and is sent out immediately by the ASIC if
it does not falsely hit in ConnTable.
Compact action data with DIP pool versioning: To reduce the size of action data part of
each connection entry, we map each connection to a version of DIP pool instead of actual DIP.
When a DIP pool is updated, we actually create a new DIP pool by applying the change to a copy
of the original DIP pool. We then assign a new version to the new pool and program VIPTable to
map new incoming connections to the newest DIP pool version. Once a DIP pool is created and
has active connections that still use it, the DIP pool never changes to provide consistent hashing
to the active connections. A connection ‘uses’ a DIP pool if the connection arrives when the pool
was the newest, and thus VIPTable maps the connection to the pool.
A DIP pool is destroyed when the connections that use it are timed-out and deleted from
ConnTable. When the pool is destroyed, the version of the pool is also released and returned to
a ring buffer so it can be reassigned to a newly created DIP pool. The switch software tracks
the connection-to-pool mappings and manages DIP pool creation/deletion as well as the
ring buffer that stores available version numbers. From the large web service provider data, we
observed that a 6-bit version number is large enough to handle many DIP pool update scenarios. Since
most inbound connections are short-lived, each DIP pool and its version do not need to last for
long. Using a 6-bit version field reduces the action data size to 1/24 in case the DIPs are IPv6
(16B IP + 2B port).
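The following Python sketch captures the version life cycle described above for a single VIP: versions come from a small free list, each DIP pool update gets the next free version, and a version is recycled once no connection in ConnTable references its pool. The class and method names are hypothetical and the sketch is illustrative only, not the control-plane code.

```python
from collections import deque

class VipVersions:
    """Track DIP pool versions for one VIP (6-bit version space by default)."""
    def __init__(self, version_bits=6):
        self.free_versions = deque(range(2 ** version_bits))
        self.pools = {}      # version -> frozenset of DIPs
        self.refcount = {}   # version -> number of connections mapped to it
        self.current = None  # newest version, used for new connections

    def update_dip_pool(self, new_dips):
        # Create a new (immutable) pool and make it the newest version.
        if not self.free_versions:
            raise RuntimeError("out of versions; fall back to conn-to-DIP mapping")
        old = self.current
        ver = self.free_versions.popleft()
        self.pools[ver] = frozenset(new_dips)
        self.refcount[ver] = 0
        self.current = ver
        if old is not None:
            self._maybe_destroy(old)
        return ver

    def new_connection(self):
        # VIPTable maps a new connection to the newest pool version.
        self.refcount[self.current] += 1
        return self.current

    def connection_expired(self, ver):
        self.refcount[ver] -= 1
        self._maybe_destroy(ver)

    def _maybe_destroy(self, ver):
        # Destroy a pool and recycle its version once nothing references it.
        if ver != self.current and self.refcount.get(ver) == 0:
            del self.pools[ver]
            del self.refcount[ver]
            self.free_versions.append(ver)
```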
Since we introduce another level of indirection (a pool version between connections and DIPs),
we maintain the version-to-pool mappings in a new table called DIPPoolTable. (DIPPoolTable is
similar to an ECMP table that maps an ECMP group ID to a set of ECMP members, i.e., routing
next hops.) Figure 2.7 shows an example of our DIP selection design with DIPPoolTable.
Figure 2.7: Using digest and version in DIP selection.
DIPPoolTable incurs extra memory consumption to maintain a set of multiple (active) DIP pools
for each VIP. The additional overhead
is easily amortized by the savings from ConnTable when the number of connections mapped to
each DIP pool version is large and they are short-lived, as observed from the web service provider
data in §2.6. If the number of active connections is small and they are long-lived, we fall back to
the design that maps each connection to the actual DIP instead of version.
There are a few ways to further reduce the number of active versions, thus decreasing the number
of version bits. One is to modify an existing DIP pool and reuse it as the current/newest pool
when possible, instead of blindly creating a new DIP pool and assigning a new version. This is
possible when a new DIP is added to substitute for a previously removed DIP in the pool, which is
usually the case for the rolling reboots discussed in §2.3.1. For example, in Figure 2.7, VIP 20.0.0.1:80
has two DIPs in version V1. When DIP 10.0.0.2:20 has failed, we remove it from the VIP and
create another DIP pool version V2. Existing connections to 10.0.0.1:20 still use V1 to ensure
PCC and new connections use V2 to ensure no more new connections towards the failed DIP
10.0.0.2:20. When we add a new DIP 10.0.0.4:20 to the VIP, instead of creating a new DIP pool
version, we reuse the old V1 and replace DIP 10.0.0.2:20 with 10.0.0.4:20. Now new connections
use V1 to select the new DIP.
(In a very rare case, we may run out of versions because of a few long-lasting connections.
We can move those long-lived connections to a small table and fall back to the connection-to-DIP mapping.)
Figure 2.8: Number of new connections per VIP in one minute.
2.4.3 Ensuring per-connection consistency
Motivation: many new connections during DIP pool update: In SilkRoad, for the first
packet of each new connection, the ASIC selects a DIP and sends out the packet immediately. In
the meantime, the ASIC notifies the software for entry insertion into ConnTable. Given a short RTT
in data centers (250 μs in the median and 2-5 μs with RDMA), a new connection can have many
packets arrive before the software completes entry insertion into ConnTable. When this overlaps
with a DIP pool update, the subsequent packets of the connection can be mapped to a different
DIP, violating PCC.
To better understand the PCC problem, we define pending connections at time t as those
connections that arrive before t but have not been inserted in ConnTable. We cannot apply a DIP
pool update when there are pending connections because the first few packets of these connections
already match the old DIP pool, and if ConnTable is not ready, the follow-up packets would match
the new DIP pool. Therefore, we can safely apply a DIP pool update only when there are no
pending connections. However, this may never happen. Suppose at time t1 we have a set of
pending connections C_t1, and the switch software inserts entries for these connections into
ConnTable at time t2. Between t1 and t2, other new connections can arrive, which become new
pending connections C_t2. This can go on forever if there are continuous new connection arrivals.
The number of pending connections during a DIP pool update depends on two factors: the new
connections arrival rate and the time the switch takes to learn and insert new connection entries
in ConnTable. To quantify the arrival rate of new connections, we measured the number of new
connections per minute per VIP for all clusters. Figure 2.8 shows that a VIP can have more than
50M new connection arrivals in a minute.
The connection learning and insertion time depends on the ASIC design. As discussed in
§2.4.1, ASICs often batch new connection events in a learning filter to avoid frequent interruptions
to the switch CPU. The filter also removes duplicate events (from multiple packets of the same
connection). The learning filter can store up to thousands of requests and notifies the switch
software when the learning filter is full or after a timeout. The timeout value can be set by the
network operator, and we expect anywhere between 500 μs to 5 ms. The switch software then
reads the new connection events in batches, runs the cuckoo hashing algorithm to select empty
slots, and inserts the entries in ConnTable.
Suppose there is a constant arrival of 1M new connections every minute for a VIP at one
switch. Whenever we want to update the DIP pool for the VIP, there are always around 8 new
connections in the learning filter with a 500 μs timeout setting, leaving no time window to perform
the DIP pool update with PCC.
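A back-of-envelope calculation (in Python, for concreteness) of the example above shows why the window never opens:

```python
# Steady arrival of 1M new connections per minute to one VIP at one switch.
new_conns_per_minute = 1_000_000
arrival_rate = new_conns_per_minute / 60.0            # ~16,700 new connections/s

for timeout_us in (500, 1000, 5000):
    pending = arrival_rate * timeout_us * 1e-6
    print(f"timeout {timeout_us} us -> ~{pending:.1f} connections pending")
# With a 500 us learning-filter timeout there are always ~8 connections awaiting
# insertion, so there is never an empty window in which to apply the update.
```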
3-step PCC update with TransitTable: To be able to update a DIP pool without violating
PCC, we need to ensure the switching ASIC can handle the pending connections correctly. We
introduce a TransitTable that remembers the set of pending connections that should be mapped
to an old DIP pool version when VIPTable updates its DIP pool version. In this way, no pending
connection is left out: it is either pinned in ConnTable or marked in TransitTable.
We use the transactional memory primitive in switching ASIC (§2.4.1) to implement Transit-
Table as a bloom filter. Bloom filter indices are addressed by a number of hashes, and unlike a
cuckoo-based exact matching table, hash collisions between different connections are allowed.
Hence, the bloom filter does not need the CPU to run a complex cuckoo algorithm and can do read-
check-modify-write in one clock cycle, providing the packet transactional semantics. Collisions in
all hash indices lead to a false positive, which remains negligible as long as the filter is large
enough for the number of connections. Thus, the main challenge is the size of the bloom filter
(TransitTable), which can grow as large as ConnTable if designed poorly.
A naive design is to always store every new connection sent to each VIP upon its arrival in
TransitTable and keep a record of both the connection and its selected DIP pool version. This
allows immediate execution of DIP pool updates for any VIP but requires TransitTable to be large
enough to remember all new connection states.
To reduce the memory usage, we only consider the connections to the VIP currently under DIP
pool update and only store the pending connections that are mapped to the old DIP pool. The
key insight is that there are just two versions of DIP pool involved during an update for a given
VIP: the old version before the update and the new version after the update. Thus, we reduce
a key-value store problem (the connection to DIP version mapping) to a simple membership set
problem. TransitTable only needs to remember the set of connections that are mapped to the old
version, which a binary Bloom filter can do in a memory-efficient way.
Figure 2.9: TransitTable to ensure PCC during DIP pool updates (c_a-c_f are connection digests). (a) Timeline of connections; c_c and c_d are pending connections at t_exec(u). (b) Step 1: between t_req(u) and t_exec(u). (c) Step 2: between t_exec(u) and t_finish(u).
Figure 2.10: System architecture. (ConnTable: digest → version; VIPTable: VIP → version; DIPPoolTable: VIP, version → DIP; TransitTable caches pending connections for a VIP under update; a LearnTable and the learning filter trigger connection learning and insertion by the switch software.)
We take the following 3-step update process to ensure PCC, as shown in Figure 2.9a, where t_start(c)
and t_inst(c) indicate the time of connection arrival and the time of insertion into ConnTable,
respectively. Step 1: In Figure 2.9b, when the switch receives a request for a DIP pool update
(t_req(u)), we start to remember all the new connections in the TransitTable bloom filter. Step 2:
When all the connections that arrived before t_req(u) have been inserted into ConnTable, we stop updating
the bloom filter and execute the update (t_exec(u)) on VIPTable. After the update, all the packets
that miss ConnTable retrieve both the old and new versions from VIPTable and are then checked against
TransitTable to see if they hit the bloom filter. If they hit, they use the old version; if they miss, they
use the new version. Note that the bloom filter is read-only in this step while it was write-only
in the first step (Figure 2.9c). Step 3: Once all the connections in TransitTable have been inserted into
ConnTable, we clear TransitTable and finish the process (t_finish(u)).
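To make the 3-step process concrete, the following is a minimal flow-level sketch in Python. It is illustrative only: the filter size, hash functions, and class names are hypothetical and not taken from the prototype.

```python
import hashlib

class TransitTable:
    """A plain bloom filter standing in for the register-based TransitTable."""
    def __init__(self, bits=2048, hashes=2):
        self.bits = [0] * bits
        self.hashes = hashes

    def _indices(self, conn):
        for i in range(self.hashes):
            h = hashlib.blake2b(repr((i, conn)).encode(), digest_size=4).digest()
            yield int.from_bytes(h, "big") % len(self.bits)

    def add(self, conn):
        for idx in self._indices(conn):
            self.bits[idx] = 1

    def maybe_contains(self, conn):
        return all(self.bits[idx] for idx in self._indices(conn))

    def clear(self):
        self.bits = [0] * len(self.bits)

class VipUpdate:
    """Step 1: record new connections. Step 2: VIPTable switches to the new
    version; ConnTable misses consult the filter (hit -> old version).
    Step 3: once all recorded connections are in ConnTable, clear the filter."""
    def __init__(self, old_ver, new_ver):
        self.old_ver, self.new_ver = old_ver, new_ver
        self.transit = TransitTable()
        self.step = 1

    def on_new_connection(self, conn):
        if self.step == 1:
            self.transit.add(conn)       # pending connections pinned to old version
            return self.old_ver
        return self.new_ver

    def version_for_conntable_miss(self, conn):
        if self.step == 2 and self.transit.maybe_contains(conn):
            return self.old_ver
        return self.new_ver

    def all_pre_update_conns_inserted(self):
        self.step = 2                    # execute the update on VIPTable

    def all_transit_conns_inserted(self):
        self.transit.clear()
        self.step = 3                    # update finished
```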
Note that using a Bloom filter for TransitTable can cause false positives: a new connection that
arrives between t_exec(u) and t_finish(u) (i.e., Step 2 in Figure 2.9c) may falsely match TransitTable
and take the old version. The chance of false positives is low (with a 256-byte bloom filter, no false
positive was observed in one hour under the most frequent updates, as shown in §2.6). To handle it, the
ASIC redirects any TCP SYN packet matching TransitTable in Step 2 to the switch software,
similar to the solution to the digest collision problem in ConnTable.
Figure 2.10 shows an overall architecture of SilkRoad, depicting the control flow between the
various tables. A simple table (LearnTable) is added to trigger new connection learning events to the
switch software.
2.5 Implementation and Deployment
In this section, we talk about the details of implementing the SilkRoad prototype on a programmable
ASIC. We then discuss the prototype performance and evaluate its implementation overhead in
terms of hardware resource consumption and software overhead. We also discuss using SilkRoad
in a network-wide setting.
2.5.1 Prototype implementation on high-speed commodity ASICs
We built a P4 prototype of SilkRoad on top of a baseline switch.p4 and compiled on a programmable
switching ASIC [9]. The baseline switch.p4 implements various networking features needed for
typical cloud data centers (L2/L3/ACL/QoS/...) in about 5000 lines of P4 code. A simplified
version of the baseline switch.p4 is open-sourced at [36]. We added ~400 lines of P4 code that
implements all the tables and metadata needed for SilkRoad (Figure 2.10). More details of our
prototype are demonstrated in [115].
We implement all the tables as exact-match tables, except for TransitTable as a bloom filter
on transactional memory. ASICs often support word packing which allows efficiently matching
against multiple words in the SRAM block at a time [67]. We carefully design word packing to
maximize the memory efficiency while minimizing false positives [100].
We also implement a control plane in switch software that handles new connection events from
learning filter and connection expiration events from ConnTable. The software runs the cuckoo
hash algorithm to insert or delete connection entries in ConnTable. Besides, the control plane
performs 3-step PCC update for DIP pool updates. The event and update handler is written in
about 1000 lines of C code while entry insertion/deletion is part of switch driver software.
2.5.2 Prototype performance and overhead
Performance: Our prototype shows that SilkRoad achieves full line-rate load balancing with
sub-microsecond processing latency. Note that in a pipeline architecture of most switching ASICs
today, adding any new logic into the pipeline does not really change the bit/packet processing
throughput of a switch as long as the logic fits into the pipeline resource constraints. Switch
pipeline latency may slightly increase by up to tens of nanoseconds, which is negligible in end-to-end
datacenter latency and three to five orders of magnitude smaller than SLB processing latency
[82, 86].
In addition, SilkRoad achieves tighter performance isolation than that in SLBs because it
handles all traffic completely in hardware. To throttle a VIP under DDoS attacks or flash crowds,
SilkRoad associates a meter (rate-limiter) with a VIP to detect and drop excessive traffic. A meter
marks packets with one of three colors defined by two rate thresholds [14]. To measure metering
accuracy, we generated 10 Gbps traffic to a VIP and measured color marking accuracy with various
rate thresholds and burst size settings, and observed less than 1% average error. Creating 40K
meter instances consumes 1% of the entire SRAM in the ASIC, providing performance isolations
for many VIPs.
ASIC resource consumption: We evaluate the additional resources that SilkRoad needs on
top of the baseline switch.p4 mentioned before. Table 2.2 shows the additional hardware resources
used by SilkRoad while storing 1M connections (with 16-bit digest and 6-bit version) compared to
the switch.p4. We see that the additional resource usage is less than 50% for all types of hardware
resources. The exact-match tables together increase the usage of SRAM and match crossbars,
while ConnTable is the major consumer. The VLIW (very long instruction word) actions are used
for packet modifications [67]. The hash operations in the exact matching tables and the multi-way
hash addressing of bloom filter consume additional hash bits. The bloom filter implementation uses
stateful ALUs to perform transactional read-check-update, as meters/counters do in the baseline
switch.p4.
Resource | Additional usage
Match Crossbar | 37.53%
SRAM | 27.92%
TCAM | 0%
VLIW Actions | 18.89%
Hash Bits | 34.17%
Stateful ALUs | 44.44%
Packet Header Vector | 0.98%
Table 2.2: Additional H/W resources used by SilkRoad with 1M connection entries, normalized by the usage of the baseline switch.p4.
Our P4 prototype defines a few metadata fields to carry DIP pool version and other
information between the tables (Figure 2.10). The metadata fields consume a negligible amount of
PHV (Packet Header Vector) bits [67]. We have also evaluated that up to 10M connections can fit
in the on-chip SRAM in our SilkRoad prototype.
Software overhead: The switch employs a standard embedded x86 CPU that connects to the
switching ASIC via PCI-E interface. For each new connection, the switch software sends a sequence
of moves to the ASIC to make an empty slot for the new entry in ConnTable. The switching ASIC
makes sure the execution of these moves does not affect the ongoing traffic matching ConnTable.
We measure the connection insertion rate to understand the switch software overhead in
managing SilkRoad. In our software prototype using a single core, we found the bottleneck is on
the CPU, not the PCI-E interface. Hash computations for cuckoo hashing and connection digest
take most of the CPU time, while the cuckoo search algorithm takes the second-largest but relatively
small share. The CPU overhead increases as ConnTable utilization approaches 100%. We
expect SilkRoad can achieve ConnTable insertion throughput of 200K connections per second
by employing (1) a better software library for hash computation and (2) multiple cores to handle
insertions into different physical pipes.
2.5.3 Network-wide deployment
A simple deployment scenario is to deploy SilkRoad at all the ToR switches and core switches.
Each switch announces routes for all the VIPs with itself as the next hop. In this way, all inbound
and intra-DC traffic gets the load balancing function at its first hop switch into the data center
network. Intra-DC traffic reaches the load balancing function at the ToR switch where the traffic
is originated. Inbound internet traffic gets split to multiple core switches via ECMP and gets load
balanced there. In this design, we can easily scale the load balancing function with the size of the
data centers.
However, this design is unable to efficiently handle network-wide load imbalance. Moreover, the
network operator may want to limit the SRAM budget for load balancing function at specific
switches. To address this, rather than blindly serving a VIP traffic at the first hop switch in the
network, we can decide which layer (e.g., ToR, aggregation, and core) to handle a specific VIP and
thus split traffic across multiple switches.
Figure 2.11: Network-wide VIP assignment to different layers.
Figure 2.11 shows a simple example. If inbound traffic to VIP1 cannot meet the SRAM budget at
core switch C2, we then migrate VIP1 from all core switches (only C2 is shown here) to ToR switches.
The traffic is balanced via ECMP to the switches at the ToR layer, which together have
enough memory to handle a large number of connections. Similarly, if a ToR switch T3 experiences
a burst of intra-DC traffic for VIP2 from its rack servers, SilkRoad can migrate VIP2 to the
multiple core switches.
The adaptive VIP assignment problem can be formulated as a bin-packing problem. The
input includes the network topology, the list of VIPs, and the traffic for each VIP. The traffic consists
of the traffic volume and the number of active connections.
Figure 2.12: SRAM usage in SilkRoad deployed on ToR switches across clusters.
Figure 2.13: The ratio of #SLBs and #SilkRoad to support load balancing across clusters.
Figure 2.14: Memory saving for SilkRoad deployed on ToR switches across clusters.
The objective is to find the VIP-to-layer
assignment that minimizes the maximum SRAM utilization across switches while not exceeding the
forwarding capacity and SRAM budget at each switch. This can also preserve SRAM headroom
for operators to expand service capacity or handle failures.
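As one possible concretization of this formulation, the following Python sketch greedily assigns each VIP to the layer that keeps the maximum per-switch SRAM utilization lowest. It is illustrative only; the function name, inputs, and thresholds are hypothetical, and a real deployment would solve the bin-packing problem more carefully.

```python
def assign_vips(vips, layers):
    """Greedy VIP-to-layer assignment.
    vips: list of dicts like {"name": "vip1", "conns": 2e6, "gbps": 40}.
    layers: dict layer -> {"switches": n, "sram_budget": connections_per_switch,
                           "capacity": gbps_per_switch}."""
    usage = {l: {"conns": 0.0, "gbps": 0.0} for l in layers}
    assignment = {}
    # Place the most memory-hungry VIPs first.
    for vip in sorted(vips, key=lambda v: v["conns"], reverse=True):
        best, best_util = None, None
        for l, spec in layers.items():
            # Traffic and state are ECMP-split evenly across the layer's switches.
            conns = (usage[l]["conns"] + vip["conns"]) / spec["switches"]
            gbps = (usage[l]["gbps"] + vip["gbps"]) / spec["switches"]
            if conns > spec["sram_budget"] or gbps > spec["capacity"]:
                continue
            util = conns / spec["sram_budget"]
            if best is None or util < best_util:
                best, best_util = l, util
        if best is None:
            raise RuntimeError("no layer can host " + vip["name"])
        assignment[vip["name"]] = best
        usage[best]["conns"] += vip["conns"]
        usage[best]["gbps"] += vip["gbps"]
    return assignment
```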
We can extend our design to support incremental deployment, where the operator may deploy
SilkRoad on a subset of switches or add some new SilkRoad-enabled switches to the network. Our
bin-packing algorithm still works to assign VIPs to different layers so as to fit in all switches'
memory. The only difference is that the traffic for a VIP is then split across only the SilkRoad-enabled
switches in the assigned layer instead of all switches in that layer.
2.6 Evaluation
We build a flow-level simulator to evaluate the memory usage and PCC guarantee of SilkRoad
using real traffic traces from the large web service provider we studied. We simulate SilkRoad
on all ToR switches in each cluster. We show that our design achieves 40%-95% SRAM reduction
and thus can fit into switch SRAM to support traffic for all clusters. Using the available SRAM in
switches, one SilkRoad can replace up to hundreds of SLBs. We show that processing the same
amount of traffic in SilkRoad has about two orders of magnitude saving in power and capital cost
compared with SLBs. Our design ensures PCC with only a 256-byte bloom filter even under most
frequent DIP pool updates observed from the network.
Figure 2.15: Benefit of version reuse.
2.6.1 Scalability
We first evaluate how SilkRoad can scale to cloud-scale traffic using the traffic traces from each
of PoPs, Frontends, and Backends during its peak hour of a day. We replay the traffic traces to
measure the memory usage on each ToR switch. Most Backends use IPv6 addresses while most
PoPs and Frontends use IPv4 addresses. Throughout the simulations, we consider the SRAM
word of 112 bits as used in [67]. We configure ConnTable entry as 28 bits with a 16-bit digest, a
6-bit version number, and a 6-bit overhead (the overhead bits include an instruction address and a
next table address). In this way, we exactly pack four ConnTable entries in each SRAM word.
SilkRoad can fit millions of connections into the switch SRAM: Figure 2.12 shows the
switch memory usage of SilkRoad for each cluster. In PoPs, SilkRoad uses 14 MB in the median
cluster and 32 MB in the peak cluster. SilkRoad in Backends has a median of 15 MB and a
peak of 58 MB memory usage. Backends have larger SRAM consumption in the peak cluster than
PoPs because the peak Backend cluster has more connections (up to 15M) than the peak PoP (up to 11M).
SilkRoad in Frontends consumes less than 2 MB of SRAM because Frontends have a small number
of connections (see Figure 2.6). Therefore, SilkRoad can fit into ASIC SRAM with 50-100 MB
(Table 2.1).
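As a quick sanity check of these numbers, the following back-of-envelope Python snippet (illustrative only) computes the ConnTable memory under the 28-bit-entry, four-entries-per-112-bit-word packing described above:

```python
def conntable_mbytes(num_connections, word_bits=112, entries_per_word=4):
    words = num_connections / entries_per_word
    return words * word_bits / 8 / 1e6      # decimal MB

for conns in (10_000_000, 15_000_000):
    print(f"{conns:,} connections -> ~{conntable_mbytes(conns):.1f} MB")
# ~35 MB for 10M and ~52.5 MB for 15M connections; the latter is in line with
# the 58 MB peak Backend cluster, where ConnTable accounts for ~92% of the total.
```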
Figure 2.16: Effectiveness of ensuring PCC with various update frequencies.
Figure 2.17: Impact of new connection arrival rates (10 updates per minute, SilkRoad TransitTable = 256 B).
Figure 2.18: TransitTable size (10 updates per minute).
We investigate the breakdown of ConnTable and DIPPoolTable usage. Take the peak Backend
cluster as an example. ConnTable consumes 91.7% of the total 58 MB memory usage to store up to
15M connections. The DIPPoolTable takes the rest to host 64 versions of 4187 IPv6 DIPs.
SilkRoad can significantly reduce the number of SLBs: The key benefit of SilkRoad is
that we can reduce the number of SLBs by providing high-throughput and low-latency load balancing.
SilkRoad requires memory resources to store connections at switches while SLBs require CPU
resources to process packets at hosts. To understand the tradeoff, we take the peak throughput
and the peak number of connections of a day in each cluster and estimate the number of SLBs and
SilkRoad switches we need to support load balancing function. We assume that each SilkRoad can
handle 10M connections. For SLBs, we use the state-of-the-art performance of 12 Mpps for 52-byte
packets using 8 cores reported by Google [82]. The results are shown in Figure 2.13. For PoPs
where most traffic is short user-facing connections, we need 2-3 times more SLBs compared to
SilkRoad. Frontends can replace 11 SLBs with one SilkRoad in the median because Frontends
receive a small number of persistent connections with large volume from PoPs. In Backends, one
SilkRoad can replace 3 SLBs in the median cluster and 277 SLBs in the peak cluster. The need
for a large number of SLBs in some peak Backends is because connections there are typically
volume-centric traffic across services (e.g., storage) and the prevalent use of persistent connections
for low latency. Generally, it is more suitable to deploy SilkRoad in those clusters with more
volume-centric traffic.
SilkRoad saves the cost and power for running load balancing as well. To support the state-
of-the-art performance of 12 Mpps for 52-byte packets, a typical SLB with Intel Xeon Processor
E5-2660 consumes around 200 Watts and costs around 3K USD [82, 25]. By contrast, SilkRoad with a 6.4 Tbps ASIC
can achieve about 10 Gpps with 52-byte packets, consuming around 300 Watt and 10K USD [18].
So processing the same amount of traffic in ASIC consumes about 1/500 of the power and 1/250
of the capital cost compared to SLBs.
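These ratios follow directly from the per-packet power and capital cost of the two platforms; a quick check in Python:

```python
# Back-of-envelope check of the power and cost comparison above.
slb_pps, slb_watts, slb_usd = 12e6, 200, 3_000        # one SLB server [82, 25]
asic_pps, asic_watts, asic_usd = 10e9, 300, 10_000    # one 6.4 Tbps ASIC [18]

power_ratio = (slb_watts / slb_pps) / (asic_watts / asic_pps)
cost_ratio = (slb_usd / slb_pps) / (asic_usd / asic_pps)
print(f"power per packet:   ASIC is ~{power_ratio:.0f}x cheaper")   # ~556x, i.e., ~1/500
print(f"capital per packet: ASIC is ~{cost_ratio:.0f}x cheaper")    # ~250x, i.e., ~1/250
```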
Using digests and versions saves memory: SilkRoad reduces the memory usage of ConnTable
by using digest for the key field and DIP pool versions for the value field. Figure 2.14 quantifies
the memory savings of this design. All the clusters have more than 40% of memory reduction
through using digest with or without version. PoPs have a consistent memory reduction by around
85% from using both digest and version. Frontends have around 50% memory saving from only
using digest. Backends receive 60%-95% of memory saving.
Version reuse: To quantify the benefit of reusing the versions (see §2.4.2), we consider all the
VIPs in Backends. For each ten-minute time window, we count the number of DIP pool versions
before and after the version reuse mechanism. Here, we choose ten minutes as the time window to cover
the lifetime for most of the connections [143]. Figure 2.15 shows that a VIP can have up to 330
DIP pool updates in ten minutes and thus need 330 versions and 9 version bits. With version reuse,
we only need to use 6 version bits to handle up to 51 DIP pool versions. For a cluster with
10M connections and 4K DIPs, reducing the number of version bits reduces ConnTable by
7.5 MB and DIPPoolTable by 4.5 MB, a total memory reduction of 74.6%.
Tradeoffs of digest sizes and false positives: To understand the false positives, we evaluate
the memory and false positives for one PoP cluster with 2.77M new connections per minute per
ToR switch. If we use 32 MB SRAM with 16-bit digest, there are an average of 270 (0.01%) false
positives per minute. If we use 42.8 MB SRAM with 24-bit digest, we have 1.1 (0.00004%) false
positives per minute. All false positives are resolved via switch software with no PCC violation
(§2.4.2).
2.6.2 Ensuring PCC
Now we evaluate the effectiveness and overhead to ensure PCC across different solutions. We
conduct experiments on the following scenarios: (1) Duet: the Duet design, which migrates VIPs back to
switches every ten minutes; (2) SilkRoad without TransitTable: SilkRoad without using TransitTable
to ensure PCC; (3) SilkRoad: SilkRoad with the 3-step PCC update process using TransitTable.
We use a one-hour traffic trace from one PoP cluster as introduced in §2.3.2. The trace consists
of 149 VIPs and has a peak of 2.77M new connections per minute per ToR switch. We generate
traffic and updates independently because we intend to evaluate the range of changes, instead of a
specific combination in our dataset. By default, we use a learning filter size of 2K insertions with a 1
ms timeout and a TransitTable size of 256 bytes. The software insertion rate is set to 200K entries
per second as we discussed in §2.5.2.
SilkRoad ensures PCC for various DIP pool update frequencies: Figure 2.16 shows
SilkRoad always ensures PCC with 256-byte TransitTable even under the most frequent updates.
For 10 updates per minute, Duet incurs PCC violations in 0.08% of total connections and SilkRoad
without TransitTable breaks 0.00005% of total connections. SilkRoad without TransitTable has
about three orders of magnitude fewer PCC violations than Duet because the DIP pool update
in SilkRoad affects only new connections during their insertion period (a few milliseconds). In
contrast, Duet affects existing connections (running for seconds to minutes) when it migrates VIPs
back to switches.
SilkRoad ensures PCC for various new connection arrival rates: We now vary the
arrival rate of new connections by scaling the traffic of 2.77M new connections per minute by a
factor of 0.1 to 2. Figure 2.17 shows the average number of connections with PCC violations per
minute across traffic intensities. SilkRoad with a 256-byte TransitTable has no PCC violation. As
the connection arrival rate increases, SilkRoad without TransitTable has more PCC violations
because there are an increasing number of pending connections in the learning filter. Duet also has
increasing PCC violations because as more new connections arrive, there are more old connections
at SLBs when we migrate the VIP back to switches.
SilkRoad ensures PCC using a small TransitTable: Figure 2.18 shows that SilkRoad requires only a small
TransitTable to ensure PCC during a DIP pool update. For example, during the simulation period
of one hour, TransitTable with only 8 bytes prevents PCC violation for learning filter timeout
within 1 ms. With a larger timeout of 5 ms, there are 20 connections with PCC violations with just
an 8-byte TransitTable and no violations with a 256-byte TransitTable. This should be easily supported
because today’s ASICs already have the transactional memory for thousands of counters.
2.7 Discussion
How much memory in ASICs can we use for load balancing? We expect in the future the
SRAM size in switching ASICs will continue to grow. This is because there is a strong requirement
for large memory for building various functions for diverse markets, such as storing 100Ks of
Internet IP prefixes in ISP edge routers, maintaining a large MAC table and a large access control
list (ACL) in enterprise switches, and storing MPLS labels in backbone switches. However, the
memory requirements for data center switches are relatively small compared to other markets [86].
This is because data centers use simple ECMP-based routing and push access control rules to
host hypervisors. We expect a fair amount of memory in switching ASICs to be available for the load
balancing function or for offloading other connection-tracking functionalities hosted in traditional
middleboxes or hypervisor.
In addition, traditional fixed function ASICs often waste memory space because they have
dedicated tables for each function. Emerging programmable switching ASICs [9, 13] allow network
operators to use memory flexibly, which leaves more room for load balancing functions. Moreover,
we can use the feature of multiple stages in the pipeline to further optimize the tradeoffs between
memory usage and false positives. For example, we can use different digest sizes in different stages
to reduce the overall false positives. When there is a small number of connections, we insert
new connections to stages with larger digest sizes (i.e., low false positives). When the number of
connections increases, we use stages with smaller digest sizes to scale up.
Handle DIP failures: To detect DIP failures, each SilkRoad can perform the health check on
DIPs. Today's switches already have health checks for BGP sessions. Many switches today offer the
ability to offload BFD (Bidirectional Forwarding Detection) [10]. This mechanism can be leveraged
to perform fast health checks. To perform the health check for 10K DIPs every 10 seconds
with 100-byte packets [28], switches only need around 800 Kbps of bandwidth.
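The bandwidth estimate follows from simple arithmetic:

```python
# Back-of-envelope check of the health-check bandwidth estimate above.
num_dips = 10_000
interval_s = 10
probe_bytes = 100

bandwidth_bps = num_dips / interval_s * probe_bytes * 8
print(f"{bandwidth_bps / 1e3:.0f} Kbps")   # 800 Kbps
```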
After we find a DIP failed or unreachable, the SilkRoad switch quickly removes the DIP from the
DIP pool. To reduce the number of DIP pool versions, we can continue to use the same DIP pool
version and use resilient hashing [11] to maintain existing connections to other DIPs. This is an
alternative way for version reuse.
Handle switch failures: If a SilkRoad switch fails, the existing connections on this switch get
redirected to other switches via ECMP and get load balanced there because all the switches use
the same latest VIPTable. Thus if a connection was using the latest version of VIPTable at the
failed switch, it would get the same VIPTable at the new switch and thus ensure PCC. However,
since we lose the ConnTable at the failed switch, those connections that used an old DIP pool
version may break PCC. This is the same issue with an SLB failure in the software load balancing
case.
Combine with SLB solutions: We propose SilkRoad as a new primitive to implement load
balancing in switches for better performance. In practice, operators can choose to use SilkRoad
only or combine it with SLBs to best meet their traffic scenarios. For example, when ConnTable
in SilkRoad is full, SilkRoad can redirect extra connections to either the switch software or SLBs
(basically treating SilkRoad ConnTable as a cache of connections). Or we can use SilkRoad to
handle VIPs with high traffic volume and use SLBs to handle those VIPs with a large number
of connections. We can enable this hybrid setting by withdrawing those VIPs from switches and
announcing them from SLBs via BGP. Different from Duet [86], we do not need to migrate VIPs
during DIP pool updates and always ensure PCC.
2.8 Related Work
Load balancing: Beyond SLBs [133, 82, 109, 35] and Duet [86], there are other works [95, 153,
104] which use OpenFlow switches to implement load balancing. They either leverage the controller
to install flow rules based on incoming packets, which is too slow due to the slow switch-controller
channel, or pre-install wildcard rules that are hard to change during traffic dynamics. Instead,
SilkRoad supports line-rate packet processing during traffic dynamics and DIP pool updates.
Consistent updates: The paper [141] introduced per-packet and per-flow consistency ab-
stractions for updates on a network of switches. These inconsistency problems are caused by
different rule update rates across switches. The recent load balancer paper [82] leverages consistent
hashing to ensure that all SLBs select DIPs in the same way when DIP pool changes and ECMP
rehashing happen at the same time. Instead, SilkRoad focuses on per-connection consistency
(PCC) for updating VIPTable and ConnTable inside a single switch. The PCC problem is caused by
the slow insertion time of switch software. Thus, we introduce a TransitTable that stores pending
connections to ensure PCC.
Programmable ASICs: Recently, due to more control requirements of switch internals from
major cloud providers (e.g., [147, 26]), switch vendors (e.g., Cavium [13], Barefoot [9], Intel [24])
start to expose low-level hardware primitives of high-speed low-cost ASICs to customers. Recent
research works such as reconfigurable match-action tables (RMT) [67] and Protocol-Independent
Switch Architecture (PISA) [66, 100] are built on those primitives. SilkRoad focuses on the load
balancing function, which is critical for data centers, and leverages existing features of ASICs.
Thus SilkRoad can either be implemented with small modifications of fixed-function ASICs
or built directly on top of programmable ASICs in the market.
2.9 Conclusion
L4 load balancing is a critical function for data centers but becomes increasingly challenging to build
with the growth of traffic and constant changes in data centers. To address these challenges,
SilkRoad leverages the increasing SRAM sizes in today’s ASICs and stores per-connection states
at ASICs. In this way, SilkRoad inherits all the benefits of high-speed low-cost ASICs such as high
throughput, low latency and jitter, and better performance isolation, while ensuring per-connection
consistency during DIP pool changes, as demonstrated by our extensive simulations and a P4
prototype on a programmable ASIC.
Chapter 3
The Dark Menace: Characterizing Network-based Attacks
in the Cloud
In the previous chapter, we presented SilkRoad, a system that scales out the load balancing function for
cloud data centers. In this chapter, we present the characteristics of network attacks in the cloud
and how to protect the cloud against them. As the cloud computing market continues to grow,
the cloud platform is becoming an attractive target for attackers to disrupt services and steal data,
and to compromise resources to launch attacks. Using three months of NetFlow data in 2013 from
a large cloud provider, we present the first large-scale characterization of inbound attacks towards
the cloud and outbound attacks from the cloud. We investigate nine types of attacks ranging
from network-level attacks such as DDoS to application-level attacks such as SQL injection and
spam. Our analysis covers the complexity, intensity, duration, and distribution of these attacks,
highlighting the key challenges in defending against attacks in the cloud. By characterizing the
diversity of cloud attacks, we aim to motivate the research community towards developing future
security solutions for cloud systems.
3.1 Introduction
The cloud computing market reached $40 billion in 2014 with a rapid growth of 23%-27% per
year [1]. Hosting tens of thousands of online services, the cloud platform is increasingly becoming
both the target and source of attacks. A recent survey of data center operators indicates that
half of them experienced DDoS attacks, with 94% of those experiencing regular attacks [57].
Moreover, attackers can abuse hosted services or compromise VMs [92] in the cloud to target
external sites by deploying botnets [89], sending spam [105, 136], selling VMs in the underground
economy [72, 150], or launching DDoS attacks [64]. In April 2011, an attack on the Sony Playstation
network compromising more than 100 million customer accounts was carried out by a malicious
service hosted on Amazon EC2 [63]. While there have been some reports of individual attacks
on enterprise and cloud networks [43, 89], to the best of our knowledge, there have not been any
systematic measurement studies of attacks on and off the cloud which can guide the design of
attack detection and mitigation systems. In fact, little has been published about the prevalence,
diversity, and characteristics of these cloud-based attacks.
In this paper we investigate over 200 TB of NetFlow records collected from dozens of edge
routers spread across multiple geographically distributed data centers of a major cloud provider.
We group traffic based on public virtual IPs (VIPs) assigned to each cloud hosted service. We
identify network-based attacks from the NetFlow data using four key features as also used in
prior work [61, 99, 121, 145]: (1) significant traffic volume (e.g., packets per second), (2) abnormal
fan-in or fan-out (e.g., number of unique clients or number of connections), (3) abnormal packet
header signatures (e.g., TCP flags), and (4) communication with Internet malicious hosts [117].
Using these features, we identified nine types of attacks, ranging from various DDoS attacks to
application-level attacks such as SQL injection and spam.
Due to sampling in the NetFlow data used in our study and the fact that NetFlow lacks
application-level information, we do not aim at identifying all the attacks in the cloud. Instead, our
goal is to understand the characteristics of attacks using low overhead network-level information
typically collected in many data center networks. Thus, we take a conservative approach of setting
the attack detection thresholds to ensure that most of the attacks we detect are real attacks. (These attacks may also include some traffic anomalies caused by flash crowds or misconfigurations; we do not distinguish them because they all impact cloud services and it is an open problem to accurately distinguish them from benign traffic.)
We validate the detected attacks against alerts from deployed security appliances and incident
reports written by operators. Our detected attacks cover 78.5% of the inbound attack alerts from
DDoS protection appliances and 83.7% of the incident reports on outbound attacks; the remaining
cases are missed mainly due to the NetFlow sampling used in our study and our conservative approach. Note that the cloud provider
we studied deploys a combination of software and hardware appliances to protect the infrastructure
against such attacks.
Our broader goal is to (a) understand the key characteristics of these attacks to evaluate the
effectiveness of existing DDoS mitigation approaches and (b) analyze their implications on building
cloud-scale attack detection and mitigation solutions.
Although there have been many studies on Internet attacks, this paper presents one of the first
analyses of the key characteristics of attacks to and from the cloud based on a three-month dataset.
We make the following main observations:
• We identify nine types of attacks and quantify their frequencies for inbound and outbound
attacks (Section 3.3).
• We find that most VIPs experiencing attacks only incur one attack incident in a day. There is
a very small fraction of VIPs that experience or generate many attacks (Section 3.4).
• We find multi-vector attacks and combinations of inbound and outbound attacks on the same
VIP. While most attacks target only one VIP, there are a few cases of multiple attacks that
target 20-60 VIPs simultaneously (Section 3.4).
• We observe high variations in attack throughput across time and VIPs, requiring cloud security
solutions to have dynamic resource allocation over time and multiplexing of resources across
VIPs. Attacks often have short durations (within 10 minutes), which require fast attack detection
and mitigation (Section 3.5).
• We investigate the origins and targets of inbound and outbound attacks and identify the
major types of Internet ASes that are involved in cloud-related attacks (Section 3.6).
Scope and Limitations. Our study analyzed traffic data from a single cloud provider and
thus it may not generalize to other providers. However, the scale and diversity of our dataset,
and our conversations with security operators (who have a broader industry view, and some of whom
have worked at other cloud networks) suggest that similar security challenges are likely faced by other
providers. We collected NetFlow records from data center edge routers before they are filtered by
the security appliances. Thus, the attacks we detected should not be interpreted as impacting the
cloud infrastructure or services. Finally, since the traffic measurement is at one-minute granularity,
it is likely to smooth the effect of short-lived attack spikes. Overall, our study highlights the need
for developing programmable (to handle attack diversity), scalable (to handle varying intensity),
and flexible approaches (for individual tenants) to protect against attacks.
3.2 Datasets and Methodology
We first present the basic setup in a major cloud provider we studied, and then describe the
datasets we collected and the methodology for characterizing attacks.
3.2.1 Cloud provider overview
The cloud network we study comprises 10+ geographically distributed data centers across America,
Europe, Asia, and Oceania, which are connected to each other and to the Internet via edge routers.
Each data center hosts tens to hundreds of thousands of servers. The cloud provider hosts more
than 10,000 services including web services, mobile application services, database and storage
services, and data analytics. Each service is assigned a public virtual IP (VIP). The traffic to the
VIP is load balanced across a group of virtual machines hosting the service; sometimes these VMs
are located across multiple data centers.
Such scale of services makes the cloud an attractive target for inbound attacks. Incoming traffic
to different services first traverses the edge routers and then the commercial security appliances
(e.g., Arbor [57]). These security appliances, typically designed for enterprise-scale deployments,
analyze inbound traffic to protect against a variety of well-known attacks such as TCP SYN flood,
UDP flood, ICMP flood, and TCP NULL attacks; these appliances use NetFlow records for traffic
monitoring. However, the detection logic is often limited to known high-volume attacks. Thus
they risk missing other low-volume attack types, such as stealth port scans and application-level
attacks (e.g., spam, SQL injection), which aim to probe vulnerabilities but do not impact the cloud
infrastructure. To reduce false positives (noisy alerts), traffic thresholds for alerting can be
set either on a per-tenant basis or across tenant groups on these devices.
Attackers can also abuse the cloud resources to launch outbound attacks. For instance, they
can first launch brute-force attacks (e.g., password guessing) to compromise vulnerable VMs
in the cloud. These compromised VMs may then be used for YouTube click fraud, BitTorrent
hosting, Bitcoin mining, spamming, malware propagation, or launching DDoS attacks. To mitigate
outbound attacks, the cloud provider we studied enforces several security mechanisms including
limiting the outbound bandwidth per VM, preventing IP spoofing of egress traffic, shutting down
the misbehaving VMs and isolating anomalous traffic. To our knowledge, no prior work has
characterized the prevalence of outbound attacks from the cloud.
3.2.2 Dataset and attack detection methodology
We obtained more than 200TB NetFlow logs from a major cloud provider over three months (May,
Nov, and Dec 2013). The NetFlow logs collected for our study had a 1 in 4096 packet sampling
rate for both inbound and outbound traffic at the edge routers of the data centers, and aggregated
over one-minute windows. (All the traffic volume numbers we show in the paper are estimated volumes calculated based on the numbers in the NetFlow data and the sampling rate.) Since the edge routers (where we collect the logs) are located upstream of the security appliances, the attacks we detect are likely mitigated before they reach VMs hosting services in the cloud. We analyze the NetFlow data on Cosmos, a large scalable data storage system, using SCOPE [76], a programming framework similar to Map-Reduce. Our SCOPE scripts use C# and SQL-like queries to perform the analysis described below.

Attacks | Description | Net/App | Target | Network features | Detection method | Inactive timeout
TCP SYN flood | Send many TCP SYN, UDP, or ICMP packets to random or fixed ports on a server | Net | Server resources | #pkts/min | Volume | 1 min
UDP flood | (same as above) | Net | Network bandwidth | #pkts/min | Volume | 1 min
ICMP flood | (same as above) | Net | Server resources | #pkts/min | Volume | 120 min
DNS reflection | A large number of DNS responses sent to a target from DNS servers (triggered by DNS requests sent by attackers with spoofed source addresses) | App | Network bandwidth | #pkts/min | Volume | 60 min
Spam | Launch email spam to multiple SMTP servers | App | Users | fan-in/out ratio | Spread | 60 min
Brute-force | Scan weak passwords or administrative control (using RDP, SSH, VNC) | App | Server vulnerability | fan-in/out ratio, #conn/min | Spread | 60 min
SQL injection | Send different SQL queries to exploit software vulnerabilities | App | SQL server vulnerability | #conn/min | Spread | 30 min
Port scan | Scan for open ports (using NULL, Xmas packets) | Net | Server vulnerability | #conn/min | Signature, Spread | 60 min
Malicious web activity (TDS) | Communicate with hosts on malicious web infrastructure | App | Users | src IP/dst IP | Comm. pattern | 120 min
Table 3.1: Summary of the network-based attacks in the cloud we studied.
We aggregate the NetFlow data by VIP in each one-minute window, and study the traffic to
a VIP (inbound traffic) and from the same VIP (outbound traffic). For each VIP in each time
window, we first filter the traffic based on the protocol number (e.g., UDP), TCP flags (e.g., TCP
SYN), or port numbers (e.g., SQL traffic is filtered by TCP traffic with destination port 1433 or
3306). We then identify nine types of attacks listed in Table 3.1. Our attack detection is based on
the following four network-level features:
Volume-based: Many volume-based attacks try to exhaust server or infrastructure resources
(e.g., memory, bandwidth) by sending a large volume of traffic via a specific protocol such as TCP
SYN and UDP floods, and DNS reflection attacks. We capture volume-based attacks by identifying
traffic with large relative spikes. We use sequential change point detection [61, 99] by comparing
the traffic volume at the current time window with the Exponentially Weighted Moving Average
(EWMA) of the past 10 time windows. We then compare the difference with a change threshold of
100 packets per minute in NetFlow (1:4096 sampling rate), corresponding to an estimated value
of about 7K pps in the original traffic. The threshold is suggested by the cloud security team
based on the network capacity and prior attack incidents. As shown in Section 3.3.2, using such
threshold settings, we can verify many of the attacks reported in the attack alerts and the incident
reports.
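To make this concrete, the following is a minimal sketch (Python) of the volume-based change detection described above; the record layout, the EWMA weight, and the function name are illustrative assumptions, while the 100 sampled packets-per-minute threshold is the value quoted above (100 x 4096 / 60 s is roughly 7K pps in the original traffic).

    # Sketch of the sequential change-point check on per-minute packet counts.
    # `counts` holds sampled packet counts for one VIP, one count per minute.
    CHANGE_THRESHOLD = 100   # sampled pkts/min (about 7K pps at 1:4096 sampling)
    ALPHA = 0.2              # EWMA weight; an assumed value, not from the text
    HISTORY = 10             # number of past windows summarized by the EWMA

    def volume_attack_windows(counts):
        attack_windows = []
        ewma = None
        for t, pkts in enumerate(counts):
            if ewma is not None and t >= HISTORY and pkts - ewma > CHANGE_THRESHOLD:
                attack_windows.append(t)   # large relative spike: flag this minute
            ewma = pkts if ewma is None else ALPHA * pkts + (1 - ALPHA) * ewma
        return attack_windows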
Spread-based: For many services (e.g., mail, SQL, SSH), a single VIP typically connects to only
a few Internet hosts in normal operation. Thus, if a VIP communicates with a large number of
Internet hosts, it is likely an anomaly. To identify such anomalies, we use the NetFlow data to
compute the spread of a VIP (i.e., the number of distinct Internet IPs communicating with a VIP
during a time window) for inbound and outbound traffic. We then capture the relative spikes of
the spread using sequential change point detection. Such spread-based detection of brute-force
attacks has also been used in prior work [96]. We choose 10 and 20 Internet IPs as the threshold
for brute-force and spam, respectively, and 30 connections for SQL in the sampled NetFlow, as
recommended by the cloud security team.
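As a rough illustration, the spread of a VIP can be computed from the sampled flow records as sketched below; the record layout is an assumption, while the thresholds are the ones quoted above (10 distinct IPs for brute-force, 20 for spam, and 30 connections for SQL). The same sequential change-point check used for volumes is then applied to these per-window spread values.

    from collections import defaultdict

    # Each sampled record is assumed to be (minute, vip, remote_ip, direction).
    SPREAD_THRESHOLDS = {"brute-force": 10, "spam": 20}   # distinct Internet IPs (from the text)
    SQL_CONN_THRESHOLD = 30                               # connections per minute (from the text)

    def spread_per_window(records):
        peers = defaultdict(set)
        for minute, vip, remote_ip, direction in records:
            peers[(vip, direction, minute)].add(remote_ip)
        # spread = number of distinct Internet IPs seen for a VIP in a window
        return {key: len(ips) for key, ips in peers.items()}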
Signature-based: Although packet payloads are not logged in our NetFlow data, we can still
detect some attacks by examining the TCP flag signatures. Port scanning and stack fingerprinting
tools use TCP flag settings that violate protocol specifications (and as such, they are not used
by normal traffic) [19, 126]. For instance, the TCP NULL port scan sends TCP packets without
any TCP flags, and the TCP Xmas port scan sends TCP packets with FIN, PSH, and URG flags
(Table 3.1). If a VIP receives a packet with an illegal TCP flag setting during a time window,
we mark the time window as under an attack. Since the NetFlow data is sampled, even a single
logged packet may represent a significant number of packets with illegal TCP flag settings in the
original traffic.
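A minimal sketch of this signature check is shown below, assuming each sampled record exposes the raw TCP flags byte; the constants are the standard TCP flag bit values. Because of the 1:4096 sampling, a single matching record in a time window is enough to mark that window as under attack.

    # Standard TCP flag bits.
    FIN, SYN, RST, PSH, ACK, URG = 0x01, 0x02, 0x04, 0x08, 0x10, 0x20

    def is_illegal_tcp_flags(flags):
        # TCP NULL scan: a TCP packet with no flags set at all.
        if flags == 0:
            return True
        # TCP Xmas scan: FIN, PSH, and URG set (and no SYN/ACK/RST).
        if (flags & (FIN | PSH | URG)) == (FIN | PSH | URG) and not (flags & (SYN | ACK | RST)):
            return True
        return False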
Communication pattern based: Previous security studies have identified blacklists of IPs in
the Internet. We can identify attacks by filtering VIP traffic communicating with such blacklisted
IPs. For example, the Traffic Distribution System (TDS) [117] includes a list of dedicated hosts
that deliver malicious web content on the Internet. Since these hosts are hardly reachable via
web links from legitimate sources, it is likely that cloud VIPs communicating with these hosts
are involved in malicious web activities. In particular, these VIPs are either a victim of inbound
attacks (e.g., spam, malicious advertising) or have been compromised to launch outbound
attacks (e.g., drive-by downloads, scams, and phishing). Note that it is not always possible to infer
the direction of an attack involving TDS nodes because some SYN packets may not get sampled
in the NetFlow data. Thus, we distinguish the inbound from outbound communication pattern
based attacks based on the destination IP in the flow records.
Counting the number of unique attacks. Given the attacks in each one minute time window,
we identify the attack incidents that last multiple time windows for the same VIP. Due to NetFlow’s
low sampling rate, we may not be able to detect an attack over its entire duration. Therefore,
similar to previous work [121, 126, 145], we group multiple attack windows as a single attack where
the last attack interval is followed by T inactive windows (i.e., no attacks).
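The grouping step can be sketched as follows; `attack_minutes` is assumed to be the list of minutes flagged for one (VIP, attack type) pair, and T is the type-specific inactive timeout from Table 3.1.

    def group_into_incidents(attack_minutes, T):
        """Merge flagged minutes whose gap does not exceed T (the type-specific
        inactive timeout); a larger gap starts a new incident."""
        incidents = []
        for m in sorted(attack_minutes):
            if incidents and m - incidents[-1][-1] <= T:
                incidents[-1].append(m)    # continues the current incident
            else:
                incidents.append([m])      # gap exceeded T: new incident
        return incidents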
Instead of selecting a fixed T, we choose to select different T for different attacks based on
analyzing the distributions of inactive times between two consecutive attack minutes of each
type for both inbound and outbound attacks, as shown in Figure 3.1. We select the T value
by generating a linear regression line between each point and the 99 percentile of each attack
distribution curve and checking that the average R-squared [81] value for regression models of
inbound and outbound curves is above 85%. We summarize the inactive timeout values we use for
different attacks in Table 3.1.
[Figure 3.1: The distribution of inactive time between two consecutive attack intervals for each attack type, shown as a CDF for (a) inbound and (b) outbound attacks; the x-axis is on log-scale.]
[Figure 3.2: Percentage of total inbound and outbound attacks, by attack type.]
3.3 Attack Overview and Validation
In this section we first give an overview of each type of inbound and outbound attack observed in
our study. For validation, we compare these detected attacks with inbound attack alerts from
DDoS security appliances and the attack incident reports.
3.3.1 Attack Overview
Figure 3.2 shows the distribution of inbound and outbound attacks; absolute counts are omitted due
to confidentiality and privacy concerns.
Flood attacks: Flood attacks (TCP, UDP, ICMP floods) in the Internet domain have been widely
studied [124], and they can be launched in both inbound and outbound directions. Our analysis
identified a significant increase of inbound flood attacks during Nov and Dec compared to May
(breakdown by month not shown), possibly to disrupt the e-commerce sites hosted in the cloud
during the busy holiday shopping season. UDP floods often target media services hosted in
the cloud and HTTP ports. We also observe that there are about 5 times more outbound TCP
SYN and about 2 times more UDP floods than inbound. This is likely because it is easier for
attackers to leverage cloud resources to attack Internet users, while it is harder to attack the cloud
where operators have a high level of security expertise and many flood attacks can be filtered by
commercial security appliances.
DNS reflection attacks: The DNS reflection attack is one of the most common amplification
attacks in the Internet. It has received increasing attention from DDoS protection services [43, 57].
In DNS reflection attacks, attackers send DNS requests toward multiple open DNS servers with
spoofed source address of the target, which results in a large number of DNS responses to the
target from DNS servers. Since the cloud has its own DNS servers to answer DNS queries from
hosted tenants, there should not be any DNS responses from the Internet to the cloud. Therefore,
any activity of inbound DNS responses may signify a potential DNS reflection attack. Inbound
DNS reflection attacks often come from up to 6K distinct sources (with 1500 byte full-size packets).
We only observed outbound DNS responses from a single VIP hosting a DNS server at 5666 packets
per second for a couple of days repeatedly.
Spam: Email services often communicate with a stable number of clients at any given time. If
we see a large deviation in the number of email flows, they are likely to be spam. For instance,
we observed an outbound spam eruption on a single day, which accounted for 40% of the total
outbound spam instances in May. The spam traffic came from hundreds of VIPs towards thousands
of external mail servers from email providers such as Yahoo and Lycos, enterprises like CenturyLink,
and small clouds like SoftLayer. We observed prevalent on-off traffic pattern from the spamming
VIPs. Specifically, each VIP generated slow rate spam traffic with a median of 2266 packets per
second over a median of one hour period. It then subsided completely over a median of 5 hours,
and launched new attacks again. We investigated these VIPs with the security team and found
that most of these VIPs are free trial accounts which were quickly shut down. About 98% of these
VIPs were new with no spam traffic recorded before, and the remaining ones were slow spammers
lasting up to a month.
Brute-force attacks: Remote connection protocols like SSH, RDP (Remote Desktop Protocol),
and VNC (Virtual Network Computing) often have just a few connections to a single VIP. If we
observe many connections in the sampled NetFlow, they are likely caused by malicious behaviors
such as password guessing (i.e., brute-force attacks). We observed that inbound brute-force attacks
have a median of 24 distinct hosts communicating with a single VIP just in the sampled NetFlow
data (i.e., there are likely other Internet hosts communicating with the VIP that are not in our
sampled data). At the tail, a VIP can receive SSH traffic from up to 10K distinct hosts from the
sampled NetFlow data. This can be caused by an attacker controlling multiple Internet hosts to
try out different passwords in parallel. Outbound brute-force attacks have a median of 60 distinct
hosts targeted by the same VIP in the sampled NetFlow data. This may be because the VIP is
scanning a set of Internet servers with the same set of passwords. There are about 4 times more
outbound brute-force attacks than inbound and more SSH-based brute-force attacks compared to
the RDP ones, likely because the servers running in the cloud often use random ports (e.g., for
RDP), and thus they are less likely to experience brute-force attacks compared to Internet hosts.
SQL injection attacks: Some attackers send a variety of malicious SQL queries to exploit the
vulnerability of SQL servers [56, 59]. Although these attacks are in the application layer, we can
still observe such attacks in the network layer when there is a large number of connections issued
towards SQL database servers. It is likely that they are exploiting all possible malformed user
inputs to gain unauthorized access [56]. There are about 5 times more outbound SQL attacks
compared to the inbound attacks.
Port scan: We observe many inbound port scan attacks such as TCP NULL and Xmas attacks.
For example, we observed inbound traffic of 125k TCP NULL packets per second lasting for 4
minutes. Attackers usually leverage these packets to sneak through firewalls and packet filters that
only handles normal TCP packets [19]. Moreover, there is a significant number of inbound TCP
RST packets, which are likely caused by Internet hosts spoofing the IPs in the cloud, causing
TCP RST signals to be directed to VIPs inside the cloud. There are much fewer outbound port
scans.
Malicious web activities (TDS): There are 0.039% of VIPs involved in communicating with
TDS hosts in the Internet. These TDS hosts often use source ports uniformly distributed between
1024 and 5000. There is one attack incident with 89 unique Internet TDS IPs communicating with
a single VIP with 31K packets per second lasting for 98 minutes.
Summary: There are more outbound attacks than inbound attacks (64.9% vs 35.1%). This
implies that it is relatively easier for attackers to abuse cloud resources to launch outbound attacks
than to attack the cloud-hosted tenants due to improved security over the years. At the same time,
new security mechanisms need to be developed to reduce the outbound attacks. The inbound
attacks are dominated by flood attacks, brute-force attacks, port scan, and TDS attacks, while
the outbound attacks are dominated by flood attacks, brute-force attacks, SQL attacks, and TDS
attacks. While our study focuses on characterizing the diversity of cloud attacks, comparing across
attack categories (e.g., by impact, traffic thresholds) may also reveal interesting insights. However,
defining a universal metric to compare attack types is difficult because it requires normalizing
the attack data across diverse metrics e.g., quantifying the impact of an attack in terms of the
service downtime, privacy compromise, and the number of users impacted. Further, some of these
measures may not be known till long after the attack happened. In Section 3.4, we study one
aspect of how VIPs are affected by different attacks and leave the broader analysis to future work.
3.3.2 Validation
In a large heterogeneous network, it is difficult to verify whether all the detected attacks are real
because it requires a snapshot of the actual traffic and the runtime application and system state
before and during the attack. This problem becomes even harder given the coarsely sampled
NetFlow data available for our analysis. We collect the security records including alerts from the
DDoS protection hardware appliances for inbound attacks and the incident reports for outbound
attacks. We compare our detected attacks with the alerts and incident reports to identify the
attacks we miss. Note that the cloud provider deploys software and hardware security appliances
to safeguard against these attacks, so they should not be interpreted as impacting the infrastructure
or tenants.

Attack | Inbound (#detected/#alerts) | Outbound (#detected/#reports)
TCP SYN flood | 98/197 | 8/8
UDP flood | 403/442 | 4/4
ICMP flood | 0/0 | 0/0
DNS reflection | - | 10/10
Spam | - | 55/55
Brute-force | - | 27/34
SQL injection | - | 4/4
Port scan | 3/3 | 0/0
TDS | - | -
Others (malware hosting/phishing) | - | 0/14
Total | 504/642 = 78.5% | 108/129 = 83.7%
Table 3.2: Detected inbound alerts and outbound incident reports (“-” means that the alerts or reports do not support the attack type).
Inbound attacks: The cloud provider detects and mitigates inbound attacks using a combination
of software and hardware DDoS protection appliances. These appliances generate alerts about
TCP SYN floods, UDP floods, ICMP floods, and TCP NULL scan. Note that on hardware
security appliances, the traffic thresholds are typically set to handle only the high-volume attacks
(low-volume attacks don’t cause any impact to the cloud infrastructure due to high network
capacity) over a large time window, and these appliances aggregate multiple incidents together
that occur close in time. Therefore, to do a side-by-side comparison with alerts from these devices,
we also first group the attacks we detected based on the VIP, attack type and time window. We
found that 73.2% of these attack instances were correlated. This is due to the fact that we set the
traffic thresholds to (a) cover a broad range of inbound attacks, and (b) detect these attacks in
their early stages. To check the latter hypothesis, we measured the detection latency of hardware
security appliances by randomly sampling a few attack instances over a week and observed that
these appliances detected them after delays on the order of tens of seconds on average. In comparison,
our detection approach (in offline mode) signaled the attack based on the NetFlow data for these
instances within a minute.
Table 3.2 shows the number of alerts in each type and those alerts that we also detected.
Overall, we successfully identified 78.5% of the alerts from hardware security appliances in our
detected attacks. The remaining alerts are missed by our detection approach because of the low
sampling in the NetFlow data and the false positives of these alerts. For other types of attacks,
the cloud relies on individual tenants to detect and report them. However, we did not have the
ground truth data to validate them.
Outbound attacks. The cloud security team detects every potential outbound attack, but they
do not necessarily log all of them as incident reports to avoid false positives. Specifically, only
the cases of anomalous activity reported by external sites are logged as incidents. Similar to
inbound attacks, the cloud provider uses security appliances to mitigate the outbound attacks.
The cloud security team may receive a complaint from an ISP when they notice malicious traffic
originating from the cloud provider. Given such a complaint, the security team checks the logs of
security appliances and investigates the corresponding tenant profile and payment information,
and generates an incident report. Moreover, when the security team receives a complaint, the team
may do traffic analysis for deeper investigation; they may also perform the VHD (Virtual Hard
Disk) forensic analysis on behalf of the customer if the customer (who owns the VHD) requested
it. Based on these investigations, the security team creates incident reports logging their findings
such as the “attack” or “no attack found” label. We use NetSieve [137] to extract the attack
information from these incident reports, and then compare it with our detected outbound attacks.
Since these incident reports come from Internet users’ complaints, there is a large number of
short-term transient attacks that are not covered by these reports. Therefore, we focus on the
false negatives of those attacks that we missed in our approach when we compare with incident
reports, rather than those attacks that we detect but are missed in these reports. For those attacks
that are not covered by the incident reports, we randomly picked a few attacks for each type
and investigated them. The attacks for which the packet traces got logged were verified as being
mitigated by the security team.
Table 3.2 shows the number of incident reports (labeled as “attacks”) and those that are also
detected by our approach. We detect most of the attacks reported in the incident reports (83.7%).
There are only two exceptions: (1) There are some incident reports about application-level attacks
such as phishing and malware hosting that we cannot detect with network signatures. (2) We
only investigate brute-force attacks on three remote communication protocols (SSH, RDP, VNC).
Therefore we miss brute-force attacks on other protocols such as FTP. (Some incident reports do not describe the protocols that are involved in the brute-force attacks.)
There are four incident
reports labeled as “no attacks” that are also covered by our detected attacks. We investigated
these attacks manually and confirmed with the security team that they are real attacks (on TCP
SYN floods, SSH and RDP brute-force attacks) but mislabeled. Our analysis has been leveraged
by the cloud security team towards improving attack detection, reducing time to detect, and
identifying correlations across attack types.
Limitations of NetFlow. Due to coarse-grained sampling in the NetFlow data collected for our
study and the fact that NetFlow lacks application-level information, we do not aim at identifying
all the attacks in the cloud. We may miss application-level attacks without network-level signatures
and those attacks that do not appear in sampled NetFlow (e.g., HTTP Slowloris [73]). Instead,
our goal is to understand what we can learn about cloud attacks with low overhead network-level
information.
Although we just detect a subset of attacks due to the conservative approach, it is still
useful to understand the key characteristics of these attacks to shed light on the effectiveness of
commercial attack-protection appliances, and the implications on designing future attack detection
and protection solutions. Studies [48, 68, 107] have shown that sampled NetFlow does not affect
the detection accuracy of flood attacks but it may underestimate the number of flows. Therefore,
the number of flows we report should be viewed as a lower bound on the number in the original
traffic.
3.4 Analysis of attacks by VIP
In our three-month trace data, there are on average 0.08% of VIPs per day under inbound attacks
and 0.11% of VIPs per day generating outbound attacks. In this section we investigate these VIPs
to understand the attack frequency per VIP, multi-vector attacks on the same VIP, inbound and
outbound attacks on the same VIP, and attacks that involve multiple VIPs.
3.4.1 Attack frequency per VIP
Attack frequency per VIP: We count the number of attacks per VIP per day (Figure 3.3a).
Most VIPs experiencing attacks only incur one attack incident during a day. Out of the 13K (VIP,
day) pairs experiencing inbound attacks, 53% of pairs experience only one attack in a day. Out
of 18K (VIP, day) pairs experiencing outbound attacks, 44% generate only one attack in a day
because the misbehaving instances are aggressively shut down by the cloud security team.
At the tail, a VIP can get 39 inbound attacks in a day. This is a VIP hosting Media and HTTP
services receiving frequent flood attacks (i.e., SYN, UDP, ICMP) with a median duration of 6
minutes and a median inter-arrival time of 64 minutes. For outbound attacks, there are 0.05% of
outbound (VIP, day) pairs generating more than 100 attacks. We observed one VIP that generated
more than 144 outbound TCP SYN flood attacks in a day to many web servers in the Internet
with a median duration of 1 minute and a median inter-arrival time of 10 minutes. This VIP did
not receive any inbound traffic during a whole month in the NetFlow data indicating that this VIP
does not likely host any legitimate cloud service but it is only being used for malicious behavior.
VIPs with frequent and occasional attacks: We observed that there are only a few VIPs
getting more than 10 attacks (2% of the inbound pairs and 5% of the outbound pairs). Therefore,
we classify the VIPs into two classes: those VIPs with no more than 10 attacks per day and those
with more than 10 attacks per day. Understanding the VIPs under frequent attacks is important
for operators to extract the right attack signatures (e.g., popular attack sources) to protect these
VIPs from future attacks.
[Figure 3.3: Attack characterization for VIPs with inbound and outbound attacks: (a) number of attacks per (VIP, day); (b) inbound and (c) outbound attacks for VIPs with occasional and frequent attacks; the x-axis is on log-scale in the top figure.]
Figure 3.3 shows that for inbound attacks, there are more TDS, port scan, and brute-force
attacks for those VIPs with occasional attacks than those with frequent attacks (26.6% vs. 0%
for TDS, 20.1% vs. 1.84% for port scan, and 15.7% vs. 0.359% for brute-force). It is natural for
port scan and brute-force attacks to target VIPs with occasional attacks because these attacks
search widely for vulnerabilities (e.g., open ports, weak passwords). TDS attacks also interact
more with VIPs with occasional attacks, which makes TDS attacks harder to detect. Our further
investigation shows that these occasional attacks mainly target applications running protocols like
HTTP, HTTPS, DNS, SMTP, and SSH.
VIPs under frequent attacks often experience relatively more TCP SYN flood attacks than those
VIPs under occasional attacks (5.3% vs. 1.4%). Our investigation shows that these frequent flood
attacks often target several popular cloud services on these VIPs including streaming applications,
HTTP, HTTPS, and SSH.
Similarly, for outbound attacks, the VIPs with occasional attacks experience more brute-force,
TDS, and spam attacks than the VIPs with frequent attacks (19.4% vs. 1.97% for brute-force,
12.8% vs. 0% for TDS, and 4.7% vs. 0.119% for spam). While attackers may try to use free-trials or
create fake accounts to launch them, the attack activity is only short-lived because the anomalous
VMs are aggressively shut down by the cloud operators. It is challenging to detect these attacks
because they come from multiple VIPs in the case of occasional attacks (e.g., spam) and they
typically last only a short time. In contrast, those VIPs with frequent attacks are often the sources
for TCP SYN and UDP flood attacks. For a few cases, we manually verified that these VIPs have
compromised VMs, which may be sold in the underground economy [72, 150].
64
0
0.2
0.4
0.6
0.8
1
0.1 0.2 0.5 1 2 5 10 20 50 100
CDF of VIPs with attacks
Percentage of VIP active time in attack (%)
Inbound
Outbound
Figure 3.4: CDF of the percentage of VIP active time in attack.
Fraction of VIP’s lifetime involved in inbound attacks: We investigated the fraction of
the time a VIP is under inbound attacks or generating outbound attacks compared to its total
active time (i.e., the time that the VIP has active traffic). Figure 3.4 shows that 50% of VIPs
experience inbound attacks for 0.2% of their active times. These are occasional attacks that do
not likely affect much of their service. However, 3% of the VIPs receive inbound attacks for more than
50% of their operating time. Further investigation reveals that these VIPs run media, web, mail,
and database services. Cloud operators need to effectively block these attacks to eliminate their
impact on cloud services.
Compromised VIPs vs. malicious VIPs for outbound attacks: We also study the fraction
of time a VIP generates outbound attacks compared to its active time. Note that most of these
compromised VMs had weak passwords highlighting the need to enforce security best practices such
as configuring cryptographically strong passwords. Figure 3.4 shows that 50% of VIPs generate
outbound attacks for 1.2% of their active times. These VIPs are likely legitimate tenants that may
have been compromised by attackers to generate outbound attacks occasionally (see Section 3.4.2
for one such example). In contrast, 8% of attack VIPs generate outbound attack for more than
50% of their active times. These VIPs are likely to be recruited mainly for attacks e.g., attackers
may buy compromised VMs in the cloud or leverage free trial accounts.
3.4.2 Attacks on the same VIP
Multi-vector attacks: We observe multiple types of attacks targeting the same VIP or coming
from the same VIP at the same time. This is likely because a single malicious program tries
to launch multiple types of attacks to exploit the vulnerabilities of targets or to exhaust target
resources in different ways. We identify these attacks if their start times to/from the same VIP
differ less than five minutes. We find that 106 VIPs experience more than one type of inbound
attacks simultaneously, which accounts for 6.1% of the total inbound attacks. There are 74 VIPs
that experience more than one type of outbound attacks simultaneously, which accounts for 0.83%
of the total outbound attacks. Among these VIPs, 46 VIPs are targets of multi-vector volume-based
attacks (i.e., TCP SYN, UDP, ICMP floods, and DNS reflection). There are 11 VIPs that launch
multi-vector outbound volume-based attacks.
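The five-minute grouping rule above can be sketched as follows; the incident representation is an assumption made only for illustration.

    from collections import defaultdict

    def multi_vector_vips(incidents, window=5):
        """Return VIPs with incidents of different types starting within
        `window` minutes of each other; `incidents` is assumed to be a list
        of (vip, attack_type, start_minute) tuples."""
        by_vip = defaultdict(list)
        for vip, attack_type, start in incidents:
            by_vip[vip].append((start, attack_type))
        flagged = set()
        for vip, starts in by_vip.items():
            starts.sort()
            for (s1, t1), (s2, t2) in zip(starts, starts[1:]):
                if t1 != t2 and s2 - s1 <= window:
                    flagged.add(vip)
        return flagged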
A new observation we make about outbound attacks is that there are 35 VIPs which launch
brute-force attacks together with TCP SYN and ICMP flood attacks (which account for 22.3%
of the outbound multi-vector attacks). This is likely a new attack pattern that attackers find
effective in breaking Internet hosts.
Inbound and outbound attacks on the same VIP. There are also several cases of simultaneous
inbound and outbound attacks. Figure 3.5 shows one such case. A VIP from a partner subscription
was inactive (i.e., no traffic) for a long time. Starting the second day, the VIP started to receive
inbound RDP brute-force attacks for more than a week. These brute-force attacks originated from
85 Internet hosts, where 70.3% of attack packets are from three IP addresses within a single
resident AS in Asia. These brute-force attacks had an estimated peak of around 70K flows per
minute, with only a few packets sampled in each flow. On the eighth day, the VIP started to generate
outbound UDP floods against 491 external sites. The outbound UDP attack had a peak volume
of 23 Kpps and lasted for more than two days. Detecting such attacks requires first jointly analyzing
the inbound and outbound traffic to identify the attack patterns of compromised VIPs, and then
blocking their outbound traffic.
[Figure 3.5: Inbound and outbound attacks on the same VIP: estimated number of inbound brute-force RDP connections and estimated outbound UDP flood throughput over time. We estimate the UDP throughput and the upper bound of the number of RDP connections based on the 1 in 4096 sampling rate.]
3.4.3 Attacks on multiple VIPs
If attacks of the same type start on multiple VIPs simultaneously, it is likely that these attacks
are controlled by the same attacker. We identify attacks on multiple VIPs if the difference of their
start times on different VIPs is less than five minutes. (We choose five minutes because the ramp-up
time is 1-3 minutes for flood attacks and the inactive time T_I (defined in Section 3.2) for other
attacks is larger than 10 minutes.)
[Figure 3.6: The 99th percentile (bar) and the peak (line) number of VIPs simultaneously involved in the same type of attacks, for inbound and outbound attacks.]
Figure 3.6 shows that most types of attacks involve fewer than 10 VIPs simultaneously at the 99th
percentile. We also observe that most types of attacks target only one VIP at the median
(not shown in the figure). Inbound brute-force attacks have simultaneous attacks on 53 VIPs in
the 99th percentile and 66 VIPs in the peak. We investigated the attacks at the tail and find that
there are two Internet hosts from small cloud providers that start attacking 66 VIPs at the same
time, and move to other VIPs. During a single day, these two Internet hosts attack more than 500
VIPs. These VIPs are located in five data centers in the cloud, and they belong to 8 IP subnets
with different sizes (/17 to /21). The attacker scans through the entire IP subnet with up to 114.5
Kpps attack traffic per VIP. To prevent such attacks, we need to correlate the traffic to different
VIPs and coordinate their attack detection and mitigation.
For outbound attacks, UDP flood, spam, brute-force, and SQL attacks involve around 20 VIPs
simultaneously in the 99th percentile. In the peak, UDP flood and brute-force attacks involve
more than 40 VIPs simultaneously.
3.4.4 Cloud services under inbound attack
We now investigate the major types of cloud services under inbound attacks. We capture the
NetFlow records for VIPs receiving inbound attacks. We then filter all the attack traffic from the
traffic on the VIPs, and the remaining traffic on the VIPs is mostly legitimate traffic. We use the
destination port of inbound traffic to infer what type of applications and services are hosted on
the VIPs. We count the application type if the traffic on the application port exceeds ten percent
of its total traffic. Table 3.3 shows the percentage of VIPs with different types of cloud services
that experience different types of inbound attacks.

Service (port) | Total | SYN | UDP | ICMP | DNS | SPAM | Brute-force | SQL | Portscan | TDS
RDP (3389) | 35.06 | 0.11 | 0.21 | 0.54 | 0.11 | 0.07 | 33.88 | 0.11 | 0.32 | 0
HTTP (80, 8080) | 33.20 | 3.40 | 1.50 | 1.97 | 0.79 | 0.32 | 9.34 | 0.11 | 13.63 | 6.94
HTTPS (443) | 13.27 | 1.22 | 0.29 | 1.40 | 0.21 | 0.07 | 4.44 | 0.04 | 8.05 | 0.14
SSH (22) | 8.69 | 0 | 0.11 | 0 | 0.04 | 0 | 8.52 | 0 | 0.18 | 0
IP Encap (0) | 6.55 | 0.54 | 1.57 | 0.79 | 1.07 | 0.04 | 0.29 | 0 | 0.39 | 0.04
SQL (1433, 3306) | 3.11 | 0 | 0 | 0.07 | 0 | 0.04 | 1.29 | 1.79 | 0.11 | 0
SMTP (25) | 2.75 | 0.04 | 0.04 | 0.04 | 0 | 0.86 | 0.04 | 0 | 0.04 | 1.75
Table 3.3: The percentage of total victim VIPs hosting different services involved with different inbound attacks; all numbers are in %.
Web services (HTTP/HTTPS) are major services in the cloud with 99% of the total traffic.
VIPs hosting these web services receive a wide range of attacks such as SYN floods, ICMP floods,
brute-force attacks, port scan, and TDS attacks. Web services receive the largest number of SYN
attacks which aim to consume all the available connections of application servers. 1.2% of the
SYN floods use source port 1024 and 3072, which are likely caused by a bug from an old SYN
flood tool juno [42]. Blacklisting or rate-limiting these ports can help mitigate SYN floods.
We also observe other non-flood attacks targeting specific types of services. For instance,
there are 35.06% VIPs hosting RDP servers with standard RDP port. The attackers often detect
active RDP ports and then generate brute-force attacks against the server. TDS attacks mostly
target VIPs running web services and mail services for spam, malware spreading, and malicious
advertising. There are 6.94% of VIPs under attack running web services and 1.75% of VIPs
running mail services.
[Figure 3.7: Median (bar) and peak (line) aggregate throughput (packets per second) by attack type and overall, for inbound and outbound attacks; the y-axis is on log-scale.]
3.5 Attack Characterization
We next investigate the characteristics of attacks to derive implications for the design of attack
detection and prevention systems. First, to quantify the cloud bandwidth capacity needed to defend
against attacks, we study the throughput of different types of attacks. Second, to understand how
fast the attack detection system needs to react to attacks, we study the duration, ramp-up rate,
and inter-arrival times of different types of attacks.
3.5.1 Attack throughput
Throughput by attack type: Figure 3.7 shows the median and peak aggregate throughput
over the entire cloud for each type of attack and all the attacks overall. We measure the attack
throughput using packets per second (pps) because the resources (CPU and memory) used to
prevent these attacks are often correlated to the traffic rate. The overall inbound attack throughput
has a median of 595 Kpps and a peak of 9.4 Mpps. The overall outbound attack throughput is
lower with a median of 662 Kpps and a peak of 2.25 Mpps. Compared to the average throughput
of legitimate traffic (54.3 Mpps for inbound and 49.7 Mpps for outbound), the median attack
throughput is about 1% of the total traffic. These attacks can have a significant impact on both
the cloud infrastructure (firewalls, load balancers) and services hosted in the cloud if we do not
filter them at the network edge.
We now study the peak volumes of individual attacks to understand the resources we need
to protect against them. TCP SYN floods have a peak throughput of 1.7 Mpps for inbound and
184 Kpps for outbound. It is important to mitigate these attacks in a timely manner (e.g., using SYN cookies)
before they exhaust many resources in the cloud infrastructure such as load balancers and firewalls.
The peak throughput for inbound UDP floods is 9.2 Mpps while that of outbound UDP floods
is 1.6 Mpps. While these flood attacks aim to consume the cloud bandwidth or cause congestion
to degrade service performance, the cloud networks are provisioned with a high network capacity
to defend against them [30, 54]. Given that a software load balancer (SLB) can handle 300 Kpps
per core [134] for simple Layer 4 packet processing, in the worst case handling inbound UDP
floods may waste 31 extra cores in the data center infrastructure. If we fail to do in-network
filtering of these UDP floods, they would cost even more resources per packet in the VMs which
have more complex packet processing. However, for detecting some application-level attacks (e.g.,
brute-force, spam), endpoint-based approaches can leverage application semantics to handle
them better than in-network defense approaches.
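For reference, the core-count estimate above follows directly from the two quoted numbers (a 9.2 Mpps peak inbound UDP flood and roughly 300 Kpps of simple Layer-4 processing per SLB core):

    # Rough check of the core-count estimate above, using numbers from the text.
    peak_udp_pps = 9.2e6        # peak inbound UDP flood (packets/sec)
    slb_core_pps = 300e3        # SLB Layer-4 processing capacity per core
    print(peak_udp_pps / slb_core_pps)   # ~30.7, i.e., about 31 extra cores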
We also observe large variations in attack volumes over time. For inbound attacks, port scan
has 1000x difference between the peak and median volumes. This implies that it would incur high
costs and waste resources if we overprovision attack detection and mitigation solutions in hardware
boxes. In comparison, elastic approaches that dynamically adjust resource usage over time may
be a more cost-effective and efficient solution. The outbound attack throughput variations are
relatively smaller, but for TCP SYN floods and TDS attacks, we still see a 20x-30x difference
between the peak and median volumes.
[Figure 3.8: Median (bar) and maximum (line) peak attack throughput (packets per second) across VIPs, by attack type, for inbound and outbound attacks; the y-axis is on log-scale.]
Throughput per VIP: Today, cloud operators mostly focus on preventing large-volume attacks
that may affect the cloud infrastructure, but they rely on tenants to secure their own services [142].
However, many of the attacks we investigated are smaller attacks targeting individual VIPs.
Therefore, we study the peak attack throughput for individual VIPs and characterize the throughput
differences across VIPs (Figure 3.8) to understand the resources individual VIPs need to defend
against attacks.
We observe that some VIPs experience a high peak volume of attacks at a certain time (ranging
from 100 pps to 8.7 Mpps). At times, the per-VIP peak volume is even higher than the median
throughput for the entire cloud. For example, a single VIP can experience up to 8.7 Mpps inbound
UDP floods. The per VIP inbound TCP SYN flood has a peak of 1.7 Mpps. We found one
inbound TCP SYN flood that caused a CPU spike at the software load balancer (SLB) appliance
and resulted in a restart of that appliance. However, the traffic from that device was quickly and
automatically shifted to other SLBs.
[Figure 3.9: Median (bar) and 99th percentile (line) of attack duration by attack type, for inbound and outbound attacks; the y-axis is on log-scale.]
There are large differences in the throughput volumes across VIPs. For example, for inbound
brute-force attacks, the VIP having the peak throughput has 361 times larger volume than the
VIP with the median value; for outbound brute-force attacks, this ratio is 75. Therefore, it may
become too expensive to over-provision hardware security appliances for individual VIPs based on
their maximum attack volumes. In comparison, resource management mechanisms that multiplex
the attack protection resources across VIPs are likely to be more cost-effective.
Finally, we observe that for volume-based attacks (TCP SYN, UDP, ICMP, and DNS reflections),
the peak volume of inbound attacks is 13 to 238 times higher than that of outbound. This is caused
by the differences in attack resources and targets between the Internet and the cloud. For inbound
attacks, there are more resources to leverage in the Internet (e.g., botnets, easily compromised
personal machines) to launch high-volume attacks. These attacks also need to have high volumes
to break the VIPs in the cloud, which have plenty of bandwidth and CPU resources. In contrast,
outbound attacks can only leverage a few VMs in the cloud, because it is hard to compromise a
large number of VMs or create multiple fake accounts to get many free VIPs.
[Figure 3.10: Median and 99th percentile of attack inter-arrival time by attack type, for inbound and outbound attacks; the y-axis is on log-scale.]
3.5.2 Attack duration and inter-arrival time
Attack duration: Figure 3.9 shows that both inbound and outbound attacks have short duration
with a median value within 10 minutes. This is consistent with other studies of Internet attacks [121,
126]. Interestingly, port scan attacks have a median duration within one minute for both inbound
and outbound (but they can last for 100 minutes at the 99th percentile). There are several reasons
that an attacker may launch short-duration attacks towards or from the cloud: (1) Shorter attacks
are harder for cloud operators to detect; (2) an attacker may attack one target for a short
time and if not successful move quickly to another target. As a result, it is important to detect
such short duration attacks in a timely fashion.
At the 99th percentile, most attacks have a duration longer than 80 minutes or even for days.
For example, the TCP SYN flood attack lasts for 85 minutes at the 99th percentile. This is shorter
than the previous study of Internet attacks [126], which observed that 2.4% of the attacks take
more than five hours. This could be due to better security support in the cloud.
DNS reflection attacks last longer than others in both inbound and outbound directions. These
attacks can sustain for a long time before being detected because they leverage many DNS resolvers
simultaneously and each resolver receives relatively low query rate. Thus, it is hard to detect these
attacks.
Ramp-up time: For volume-based attacks, we calculate the ramp-up time of an attack from its
start time to the time when the packet rate grows to 90% of its peak. We observe a median ramp
up time of 2-3 minutes for inbound attacks and 1 minute for outbound attacks. Today’s flood
detection solutions take about 5 minutes to detect the attacks [57, 145], and thus they are not fast
enough to fully eliminate the flood attacks before they ramp up to affect the target with their
peak strengths.
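A minimal sketch of the ramp-up computation, assuming `series` is a list of (minute, estimated packet rate) samples for one volume-based attack, ordered by time:

    def ramp_up_minutes(series):
        """Minutes from the attack start until the rate first reaches 90% of its peak."""
        peak = max(rate for _, rate in series)
        start_minute = series[0][0]
        for minute, rate in series:
            if rate >= 0.9 * peak:
                return minute - start_minute
        return None   # unreachable, since the peak itself is in `series`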
Inter-arrival time: We measure the inter-arrival time as the interval between the start times
of two consecutive attacks to/from the same VIP. Figure 3.10 shows that most types of attacks
have a median inter-arrival time of hundreds of minutes. The inter-arrival times of outbound TCP SYN
and UDP flood attacks are shorter than those of inbound attacks (about 25 minutes vs. 100 minutes). This indicates that
malicious VIPs launch periodic attacks frequently. Attack protection systems can leverage such
repeated attacks to identify and tune the right signatures for filtering attack traffic.
We identify two types of UDP flood attacks based on the correlations of inter-arrival time
and peak attack size. 81% of the attacks have a median peak size of 8 Kpps but a large
inter-arrival time (a median of 226 min). The remaining 19% of the attacks have a median peak size of
457 Kpps but a short inter-arrival time (a median of 95 min). The first type of small-scale,
occasional attacks are relatively hard to distinguish from normal traffic and thus they are hard
to mitigate without significant collateral damage. In contrast, the large-scale, frequent attacks
require the cloud security operators to provision more resources to detect their traffic signatures
and mitigate them.
3.6 Internet AS Analysis
In this section we investigate the types of Internet ASes that are commonly involved in attacks to
the cloud and that are under attacks from the cloud.
3.6.1 Inbound attacks
Are source IPs spoofed? We investigate whether the Internet IPs of inbound attacks are
spoofed to understand the effectiveness of blacklisting on preventing different inbound attacks.
Similar to prior work [126], we leverage the Anderson-Darling test (A2) [135] to determine if the
IP addresses of an attack are uniformly distributed (i.e., an attack has spoofed IPs if the A2 value is
above 0.05). We observe that 67.1% of the TCP SYN floods have spoofed IPs. This is contrary to
the study in 2006 [121] which observed that most flood attacks are not spoofed.
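For reference, the A2 statistic for uniformity of source addresses can be computed as sketched below (addresses are mapped into [0, 1] by dividing by 2^32; the clamping constant is an implementation detail); the decision rule applied to the resulting value is the one quoted above.

    import math

    def anderson_darling_uniform(ips):
        """A^2 statistic of 32-bit source addresses against a uniform distribution."""
        u = sorted(min(max(ip / 2**32, 1e-12), 1.0 - 1e-12) for ip in ips)
        n = len(u)
        s = sum((2 * i + 1) * (math.log(u[i]) + math.log(1.0 - u[n - 1 - i]))
                for i in range(n))
        return -n - s / n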
AS classification: We first remove those spoofed IPs and then map the IP addresses of inbound
attack sources and outbound attack targets to AS numbers using Quova [44]. We use AS taxonomy
repository [79] to identify AS types, which include large ISPs, small ISPs, customer networks,
universities (EDU), Internet exchange points (IXP), and network information centers (NIC). We
further classify big cloud (i.e., Google, Microsoft, Amazon), small cloud (i.e., web hosting services),
and mobile ASes based on the AS descriptions. We count the number of attack incidents of
different types for each AS class if any of its IPs is involved in the attack.
Figure 3.11a shows the distribution of attacks across different types of ISPs. We observe that
small ISPs and customer networks generate 25.4% and 15.9% of the attacks, respectively. For
instance, an ISP in Asia contributed to 3.53% of the total attack packets. This is probably because
these local or regional ISPs have relatively less security expertise and weak defense systems, and
thus they are more likely to be compromised and leveraged by attackers.
[Figure 3.11: Different types of ASes generating inbound attacks: (a) percentage of inbound attacks in each AS type; (b) average percentage of inbound attacks per AS in each AS type.]
When we calculate the average percentage of attacks per AS (Figure 3.11b), we observe
that there are more attacks per AS from big clouds and IXPs. Individual ASes in small ISPs and
customer networks do not generate many attacks on average.
Attacks from big clouds: Figure 3.12 shows the distribution of attack types that originated
from big clouds. UDP floods, SQL injections, and TDS attacks are the majority types.

Figure 3.12: Percentage of inbound attacks from big clouds and mobile ASes in each attack type.

This is
probably due to the availability of a large set of resources in big clouds to generate a high traffic
volume and a large number of connections. In fact, big clouds contribute to 35% of TDS attacks
with just 0.21% of TDS IPs.
Attacks from mobile and wireless ASes: With the growth of mobile devices, attackers can
try to compromise and exploit their resources for malicious activities. Given the relatively weaker
software security model of mobile devices compared to desktop PCs and the wide deployment of third-party
apps on them, they are more likely to be compromised by malware for launching attacks. Users
may also jailbreak the security restrictions and install tools (e.g., AnDOSid or mobile LOIC) to
participate in botnet activities [43]. In fact, 2.1% of the inbound attack traffic comes from
mobile networks.
Figure 3.12 shows that mobile networks mainly generate UDP floods, DNS reflections, and
brute-force attacks. These attacks are harder to mitigate because simple source-based blacklisting
does not work well for mobile devices, most of which are located behind a NAT. While NAT may
become less common with IPv6 adoption, addresses will become more ephemeral.
(a) Percentage of inbound DNS or spam attacks in each AS type.
(b) Average percentage of inbound DNS or spam attacks per AS in each AS type.
Figure 3.13: Different types of ASes generating inbound DNS and SPAM attacks.
Origins of DNS attacks: Figure 3.13a shows that the cloud we studied received a similar
number of DNS attacks from all types of ASes. Figure 3.13b shows that if we count per AS attacks,
there are more DNS attacks from IXPs. Our further investigation shows that each DNS attack
involved a median value of only 17 unique DNS resolvers in the NetFlow records.
Origins of spam: Figure 3.13a shows that spam attacks are mainly from large cloud, small
ISPs, and customer networks. For example, 81.0% of the spam packets are from Amazon Web
Services (AWS) [4] in Singapore (we did not validate these spam attacks with AWS). However, each
individual small ISP or customer network does not generate many attacks, as indicated by the number
of per-AS attacks shown in Figure 3.13b.
This indicates that it is easier for attackers to leverage the free trial accounts in large clouds and the
end hosts in small ISPs and customer networks to generate spam. A prior study of spam in the
Internet [140] shows that much spam comes from network information centers (NICs), but we observed
only a single attack from NICs in our data.
Geolocation distribution of inbound attacks: Figure 3.14a shows the geographical distribu-
tion of inbound attack sources. The inbound attack sources are spread mainly across places in
Europe, Eastern Asia, and North America. Specifically, there is one AS in Spain involved with
more than 35% of the total inbound attacks. It mainly generated UDP flood, TDS and SQL
attacks. There are ASes from the west coast of North America that are involved with more than
20% of the total inbound attacks.
3.6.2 Outbound attacks
Are outbound attacks clustered? Unlike Internet floods, which often target a single host [145],
we observe that outbound UDP floods target a median of 8 hosts, while TCP SYN floods target a
median of 25 Internet hosts, just in our sampled NetFlow data. This means attackers often use
cloud resources to attack a group of hosts instead of individual IPs. We count the number of
unique victim IPs of outbound attacks in each AS to understand if the victims are clustered on
particular ASes. We find that 80% of the attacks target hosts in a single AS.
(a) Geolocation distribution of inbound attack sources.
(b) Geolocation distribution of outbound attack targets.
Figure 3.14: Attack geolocation distribution.
While prior work has shown that a small number of ASes are involved in a significant fraction
of attacks in ISP networks [121] and distributed intrusion detection systems [156], we show that
cloud-related attack incidents are widely spread across many ASes. The top 10 ASes are targets of
8.9% of the attacks; the top 100 ASes are targets of 16.3% of the attacks. However, a small
portion of the attacks is responsible for most of the attack traffic. For instance, 40% of the outbound
attack packets were directed from three VIPs towards a small cloud AS in Romania, which offers
web hosting, online gaming, and VPN services.

(a) Percentage of outbound attacks in each AS type.
(b) Average percentage of outbound attacks per AS in each AS type.
Figure 3.15: Different types of ASes targeted by outbound attacks.
AS classes of outbound attacks: Figure 3.15 shows that 42% of outbound attacks are against
services in big clouds. Most of these attacks are SQL injection and TDS attacks. Small ISPs and
customer networks face 25% and 13% of the outbound attacks (Figure 3.15a), but individual ASes
do not receive many attacks (less than 0.01% of the total outbound attacks per AS) (Figure 3.15b).
Figure 3.16: Top Internet applications under outbound attacks (number of VIPs per targeted service: HTTP, HTTPS, SQL, SMTP, SSH, RDP, VNC).
Small ISPs and customer networks are the major target for brute-force attacks and spam. This is
probably because these networks often lack strong security support. For example, 23.6% of the
outbound DNS reflection attack packets are sent to an ISP in France. It is therefore important to
coordinate security measures across the cloud infrastructure and these networks to protect against
these attacks.
There are only a few brute-force attacks against mobile networks (1.4%). This may be because
mobile devices are often behind a NAT that blocks unsolicited inbound connections, making it
harder for attacks to get through.
Internet applications under attack: To understand the Internet applications under attacks,
we investigate the destination ports of outbound traffic coming from the VIPs that generate
outbound attacks (Figure 3.16). We find that most of the outbound attacks target web services
(HTTP and HTTPS together account for 64.5% of the VIPs involved in outbound attacks). For
example, 69% of the outbound UDP floods use port 80 as the destination port targeting HTTP
services. The other popular target services are SQL, SMTP, and SSH.
Geolocation distribution of outbound attacks: Outbound attack targets are mainly spread
across places in Europe and North America (Figure 3.14b). There are fewer outbound attack targets
than inbound attack sources in Eastern Asia. The same AS in Spain we discussed for inbound
attacks also receives more than 35% of outbound attacks (mainly with brute-force, TDS, and SQL
attacks).
3.7 Nimbus: Attack Detection as a Cloud Service
To handle tenant-targeted attacks, we propose NIMBUS, a system that detects and mitigates attacks
by coordinating commodity switches and VMs, achieving high accuracy at low cost. As
shown in Figure 3.17, the switches select and mirror flow groups with flexible sampling rates to
redirect just enough traffic for accurate attack detection. The flow groups can be defined by packet
header fields (e.g., IP prefixes and/or port numbers) and can cover partial traffic from a specific
tenant or aggregated traffic across tenants. The VMs run customized attack detection modules,
one for each flow group, to detect a variety of attacks. Finally, we install fine-grained rules to
mitigate attacks at switches with high performance and low cost.
NIMBUS dynamically adapts the number of VMs and the traffic sampling rates to achieve the best
tradeoff between cost and detection accuracy for the observed traffic. It also adjusts the sampling
rate at switches on a fine time scale to quickly detect short-lived attacks.
3.7.1 Customized attack detection and traffic selection
Customized attack detection at VMs: NIMBUS provides APIs for tenants and network
operators to configure flow groups and subscribe to the detection modules they need. For example,
a tenant can subscribe its port 80 traffic to a DDoS detection module; an operator can subscribe
the SSH traffic of all tenants to a brute-force attack detection module. The detection modules
can run proprietary detection algorithms from security vendors or open-source systems such as
Bro [69] or Snort [149]. The cloud provider can charge tenants based on the amount of traffic and
the type of detection modules they subscribe to.

Figure 3.17: NIMBUS architecture.
Selection of tenant traffic at switches: NIMBUS configures switches to select and mirror
traffic to NIMBUS VMs in three ways: (1) Filtering: When the attack detection logic at NIMBUS
VMs only cares about traffic belonging to a specific protocol or port, we can filter packets based on
those packet header fields; (2) Chunking: If the detection module only performs analysis on packet
header fields, we can choose to send only the packet headers to NIMBUS VMs; (3) Sampling: For
detection modules that can work on sampled packets, the switches can provide a fixed sampling
rate for each flow group. Previous studies [68] have shown that many attack detection solutions
(e.g., volume-based attack detection and entropy estimation for anomaly detection) can achieve high
detection accuracy with packet sampling.
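To make the interface concrete, a flow group and its traffic-selection options might be expressed roughly as follows; the field and module names here are hypothetical, since this chapter does not fix a concrete API:

from dataclasses import dataclass
from typing import Optional

@dataclass
class FlowGroup:
    """One flow group mirrored from the switches to a NIMBUS VM (illustrative only)."""
    dst_prefix: str           # traffic covered, e.g., a tenant VIP prefix
    protocol: Optional[str]   # filtering: keep only this protocol (None = all)
    dst_port: Optional[int]   # filtering: keep only this destination port (None = all)
    headers_only: bool        # chunking: mirror packet headers only
    sampling_rate: float      # sampling: fraction of matching packets to mirror
    detector: str             # subscribed detection module, e.g., "ddos" or "brute_force"

# A tenant subscribes its web traffic to a DDoS detector at a 1% sampling rate;
# an operator subscribes all SSH traffic to a brute-force detector.
web_ddos = FlowGroup("203.0.113.0/24", "tcp", 80, True, 0.01, "ddos")
ssh_bruteforce = FlowGroup("0.0.0.0/0", "tcp", 22, False, 0.10, "brute_force")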
3.7.2 Balance detection accuracy and cost
There is a tradeoff between the resource usage (bandwidth and VMs) and accuracy in choosing
the right sampling rate: a high sampling rate leads to more accurate attack detection, but it also
consumes more VM resources. Similarly, deploying a small number of VMs would risk under-
provisioning, while allocating a large number to handle peak rates would risk high costs (e.g., up to
1253 VMs are needed to handle 100Gbps of DDoS traffic [84]). The problem becomes even harder
when there are multiple concurrent attacks across attack types and tenants. Moreover, different tenants have
different types of traffic and attacks, and they may experience large traffic variance over time,
especially during attacks.
We propose to dynamically adapt the sampling rates for individual tenants and to scale the VMs
in and out. We model the detection accuracy for each attack and the cost of running VMs for
processing traffic, and formulate a joint optimization problem over detection accuracy and VM
cost. Before NIMBUS reaches the VM capacity limit (e.g., when attacks happen), the controller
running optimization can scale-out NIMBUS by adding the optimal number of VMs into the
system, and vice-versa when attacks subside.
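This chapter does not spell out the exact formulation, but one plausible shape of the optimization, with per-flow-group sampling rates $r_g$, flow-group traffic rates $\lambda_g$, per-VM processing capacity $\mu$, VM count $V$, and a modeled detection-accuracy function $A_g(\cdot)$, is

\[
\max_{r,\,V}\;\; \sum_{g} A_g(r_g)\;-\;\beta\, V
\qquad \text{s.t.}\qquad \sum_{g} r_g \lambda_g \;\le\; V\mu,\qquad 0 \le r_g \le 1,
\]

where $\beta$ converts VM cost into the same units as detection accuracy. Solving such a problem periodically tells the controller both how many VMs to run and how aggressively to sample each flow group.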
Switches often have a limited number of entries for supporting different sampling rates. Thus, we only provide
a fixed sampling rate for each flow group, which maps to a detection module. If a flow matches
multiple flow groups and thus multiple detection modules, the switch can pick the highest sampling
rate for this flow.
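As a minimal sketch of that resolution rule (reusing the hypothetical FlowGroup sketched in Section 3.7.1), the effective sampling rate of a flow is simply the maximum rate over the groups it matches:

def effective_sampling_rate(flow, groups, matches):
    """matches(flow, group) is assumed to implement the header-field match
    between a flow and a flow group."""
    rates = [g.sampling_rate for g in groups if matches(flow, g)]
    return max(rates, default=0.0)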
3.7.3 Adjust sampling rate on a fine time scale
Attackers can launch low-rate attacks consisting of short, repetitive bursts whose average
throughput is low. To handle these attack traffic bursts, NIMBUS needs to quickly scale
out VMs.
Figure 3.18: NIMBUS achieves better cost-effectiveness than the fixed sampling-rate + autoscaling approach under different parameter settings.
However, in the cloud, it often takes up to a few minutes to start a new VM from suspended
state [146, 120] (specialized solutions exist to boot VMs quickly, e.g., ClickOS [122], but they are
not available or integrated with the networking services running in commercial cloud providers).
During the VM startup time, attack traffic may increase dramatically and
overload NIMBUS VMs, leading to packet drops at these VMs. As a result of these packet drops,
the detection modules no longer have an accurate estimate of the attack and legitimate traffic
and thus risk low detection accuracy. Moreover, adversaries may rapidly change their traffic (e.g.,
on-off traffic bursts [110]), causing NIMBUS to waste resources by unnecessarily scaling the VMs
out and in. Therefore, NIMBUS adjusts sampling rates at fine-grained timescales to
handle attack bursts during VM (de)provisioning windows.
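A minimal sketch of such a control step, assuming the controller can read the mirrored packet rate and knows each VM's processing capacity (this is not the exact control law used by NIMBUS):

def adjust_sampling_rate(current_rate, mirrored_pps, vm_count, per_vm_capacity_pps,
                         min_rate=0.001):
    """Fine-time-scale rate control during VM (de)provisioning.
    Shrink the sampling rate when the mirrored load exceeds what the running
    VMs can absorb, so that detection VMs do not drop packets; keep the
    optimizer's rate whenever there is headroom."""
    capacity = vm_count * per_vm_capacity_pps
    if mirrored_pps <= capacity:
        return current_rate
    return max(min_rate, current_rate * capacity / mirrored_pps)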
3.7.4 Evaluating cost-effectiveness
To understand the tradeoff between detection accuracy and VM cost, we generate 1000 prevalent
attacks in each 10-second epoch. For comparison, we also run fixed sampling-rate + autoscaling with
different sampling rates. For NIMBUS, we logged the sampling rate adaptively set by NIMBUS
during its execution. For each experiment, we calculate the average sampling-rate for the overall
traffic and the average attack detection accuracy. The result is shown in Figure 3.18.
NIMBUS achieves better cost-effectiveness than the fixed sampling-rate + autoscaling
approach. At a 1% average sampling-rate, fixed sampling-rate + autoscaling has 42.6% of the successful
attack duration, while NIMBUS reduces it to 19.0% because NIMBUS uses flexible sampling to
better utilize VM resources for attack detection. The fixed sampling approach has 22.0% of
the successful attack duration at 50% sampling-rate while NIMBUS achieves a similar (19.9%)
successful attack duration but with three orders of magnitude less traffic sampling-rate (0.048%).
As a result, NIMBUS would need to provision a significantly smaller number of VMs to detect
attacks in sampled traffic and thus achieving higher cost-effectiveness.
3.8 Existing security practices
In this section we discuss how existing cloud security solutions handle the types of attacks observed
in our study.
Inbound TCP SYN, UDP, ICMP floods, DNS reflection attacks: Cloud VMs are only
accessible through virtual IPs (VIPs). Traffic towards VIPs is routed through a load-balancing
infrastructure [134]. At the infrastructure, the cloud can monitor, detect and mitigate flood attacks
(e.g., using SYN cookies, rate limiting, and connection limits) to help ensure that such attacks
do not impact customer VM instances [30]. The tenants can also leverage scale-out (e.g., adding
more VMs on demand) or scale-up (e.g., deploying resource-rich VMs) techniques to raise the bar
for attacks [7].
There are multiple in-built mechanisms in cloud systems to safeguard tenants. For example,
the hypervisor that hosts guest VMs in the cloud is neither directly addressable internally by other
tenants nor externally addressable. Additional filters are put in place to block broadcast
and multicast traffic, with the exception of what is needed to maintain DHCP leases. Inside the
VMs, tenants can further enable web server add-ons that protect against certain DoS attacks [128].
For example, for TCP SYN flooding attacks, a security rule can be specified to track significant
deviations from the norm in terms of the number of half-open TCP sessions, and then drop any
further TCP SYN packets from specified sources [8].
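A rough illustration of such a rule follows; the threshold and function names are made up for this sketch and are not taken from [8]:

from collections import defaultdict

HALF_OPEN_LIMIT = 200                   # hypothetical per-source threshold
half_open = defaultdict(int)            # source IP -> current half-open session count
blocked = set()

def on_syn(src_ip):
    """Called for each incoming SYN; returns False if the packet should be dropped."""
    if src_ip in blocked:
        return False
    half_open[src_ip] += 1
    if half_open[src_ip] > HALF_OPEN_LIMIT:
        blocked.add(src_ip)             # drop any further SYNs from this source
        return False
    return True

def on_session_established_or_reset(src_ip):
    """Called when a handshake completes or a half-open session is torn down."""
    half_open[src_ip] = max(0, half_open[src_ip] - 1)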
Port scan: Unauthorized port scans are often viewed as a violation of cloud use policy and they
are blocked by the cloud provider [5]. Port scans are likely to have limited effectiveness because,
by default, all the inbound ports on VMs are closed except the ones opened by tenants; a tenant
can author a service definition file that contains the internal endpoints that should be opened
in service VMs and what roles can communicate with them [41]. Tenants can also use security
groups or firewalls to further block unauthorized traffic [128].
Other inbound attacks: Some cloud providers choose not to actively block network traffic
affecting individual tenants because the infrastructure does not interpret the expected behavior of
customer applications. Instead, these cloud systems allow tenants to use firewall proxy devices such
as Web Application Firewalls (WAFs) that terminate and forward traffic to endpoints running in a
VM. Tenants can also use network ACLs or VLANs to prevent packets from certain IP addresses
from reaching VMs [30].
Cloud providers can also leverage high-level signals emitted from workloads running on the
cloud from customer accounts. One such signal is the number of open ports. Legitimate customers
aim to minimize the susceptibility of their applications running in the cloud to any external attacks,
and hence deploy services usually with a limited number of open ports (e.g., HTTP port 80).
However, compromised accounts or VMs may perform a variety of anomalous activities such as
running a Botnet controller or torrent services on multiple open ports. The cloud security team
monitors such activities and aggressively shuts down any misbehaving tenant VMs.
Outbound attacks: To mitigate outbound attacks, the most important step is to identify
fraudulent VMs when a tenant sets up a subscription. In the cloud, several anti-fraud techniques
are used such as credit card validation, computing the estimated geographical distance from the
IP address used for login to the billing address and ensuring it is within reasonable bounds, and
determining whether the email address of the purchaser is from a free email provider.
Note that attackers can also exploit vulnerabilities in VMs of legitimate customers. If an attack
is successful, the compromised VMs can then be used by attackers for malicious activity e.g., to
send large amounts of unsolicited mail (spam). The cloud provider can enforce limits on how many
emails a VM can send as well as prevent SMTP open relay, which can be used to spread spam [8].
3.9 Related Work
This paper presented one of the first large-scale studies to investigate the prevalence of network-
based attacks in the cloud. We compare our work with related work on detecting and understanding
Internet-based attacks.
Attack detection methods: Previous works have used NetFlow logs to understand DDoS
attacks and traffic anomalies [121, 68, 112, 111] in ISP networks. Our work takes a similar
approach to understand a broader set of attacks in the cloud. Most previous studies on application-
level attacks leverage analysis of the application content (e.g., spam [55, 75], SQL injection [59],
SSH [99]). Our work shows that it is possible to detect some of these attacks by leveraging
network-level signatures such as volumes, spread, TCP flags, and communication patterns. In fact,
previous works have also shown that application-level attacks (e.g., spam) have strong network-level
signatures [140]. We validate our network-based detection by comparing the detected attacks
against the security appliance alerts and incident reports. Although network-based detection may
not capture all types of application-level attacks (e.g., malware), they are more pragmatic to
implement in today’s cloud monitoring infrastructure.
Attack characterization: There is a large body of work on characterizing Internet-based
attacks. Most prior efforts (e.g., [43, 52, 57, 126, 140, 157]) focus on one or a few types of attacks
in the Internet. Given the importance of cloud services in today’s Internet, understanding the
attacks from/to the cloud is critical. Our study investigates a wide diversity of inbound and
outbound attacks in the cloud. We differentiate DDoS attacks based on their protocols (TCP
SYN, UDP, DNS reflection), and show that other types of attacks (e.g., brute-force, port scans,
and TDS) also need cloud operator’s attention. Moreover, we show the detailed characteristics of
attacks in the cloud such as the cloud services affected by the attacks, the Internet origins and
targets, and the intensity and frequency of these attacks. These results can provide guidelines for
future design of attack detection and mitigation systems for the cloud.
3.10 Conclusion
We investigated the prevalence of network-based attacks both on and off the cloud. We performed
the first measurement study of the characteristics of a wide range of cloud attacks that vary in
complexity, intensity, duration and distribution. Our study shows strong evidence of the increasing
scale, volume, and sophistication of these attacks. Our results have been leveraged by the
cloud security team towards identifying correlations and improving mitigations for different attack
types. We hope that this study motivates future research towards designing attack detection and
mitigation systems for the cloud. In future work, we plan to extend our measurement study to
analyze application-level attacks, compare across attack categories, and leverage packet traces for
deeper analysis.
Chapter 4
DIBS: Just-in-time Congestion Mitigation for Data Centers
In the previous two chapters, we discussed how to scale out load balancing and attack mitigation
for cloud-scale traffic. In this chapter, we present DIBS, a mechanism that scales out the congestion
control function across switches in the entire network. Data centers must support a range of workloads
with differing demands. Although existing approaches handle routine traffic smoothly, intense
hotspots–even if ephemeral–cause excessive packet loss and severely degrade performance. This
loss occurs even though congestion is typically highly localized, with spare buffer capacity at
nearby switches. In this paper, we argue that switches should share buffer capacity to effectively
handle this spot congestion without the monetary hit of deploying large buffers at individual
switches. Specifically, we present detour-induced buffer sharing (DIBS), a mechanism that achieves
a near lossless network without requiring additional buffers at individual switches. Using DIBS, a
congested switch detours packets randomly to neighboring switches to avoid dropping the packets.
We implement DIBS in hardware, on software routers in a testbed, and in simulation, and we
demonstrate that it reduces the 99th percentile of delay-sensitive query completion time by up to
85%, with very little impact on other traffic.
4.1 Introduction
Modern data center networks (DCNs) must support a range of concurrent applications with varying
traffic patterns, from long-running flows that demand throughput over time to client-facing Web
applications that must quickly compile results from a collection of servers. To reduce cost and
queuing delay, DCN switches typically offer very shallow buffers (the Arista 7050QX-32, for example,
has just 12MB of buffer shared by as many as 104 ports (96x10Gbps + 8x40Gbps); the high-speed
memory that forms the buffer must support NxC read-write bandwidth, where N is the number of
ports and C is the nominal link speed, and its cost increases as N and C increase), leading to packet losses–and
therefore slow transfers–under bursty traffic conditions [77, 151].
Researchers have proposed a number of solutions to tackle this problem [154, 158, 152, 47, 50,
62, 155, 51], including DCTCP [49]. DCTCP uses ECN marking to signal congestion early to
senders–before buffer overflows. This approach effectively slows long-lasting flows. However, no
transport-layer congestion control scheme can reliably prevent packet loss when switch buffers are
shallow and traffic bursts are severe and short-lived. One extreme example is a large number of
senders each sending only a few packets in a burst to a single receiver. Since congestion control–and
senders in general–cannot react in time, switches must attempt to absorb the burst. But since
deep-buffered switches are expensive to build, generally switches have to drop packets–even though
data center congestion is typically localized [102], meaning that the network as a whole has capacity
to spare.
This extreme scenario highlights a key assumption that pervades much of the DCN work: when
a switch needs to buffer a packet, it must use only its own available buffer. If its buffer is full, it
must drop that packet. This assumption is so taken for granted that no one states it explicitly.
In this paper, we challenge this assumption. We advocate that, faced with extreme congestion,
the switches should pool their buffers into a large virtual buffer to absorb the traffic burst, instead
of dropping packets.
Figure 4.1: Example path of a packet detoured 15 times in a K=8 fat-tree topology. For simplicity, we
only show 1 edge switch and 1 aggregation switch in the sender's pod, and abstract the 16 core switches
into a single node. The numbers and the arc thicknesses indicate how often the packet traversed each arc.
To share buffers among switches, we propose that a switch detour excess packets to other
switches–instead of dropping them–thereby temporarily claiming space available in the other
switches’ buffers. We call this approach detour-induced buffer sharing (DIBS).
DIBS is easiest to explain if we assume output-buffered switches, although it can work with
any switch type. When a packet arrives at a switch input port, the switch checks to see if the
buffer for the destination port is full. If so, instead of dropping the packet, the switch selects one
of its other ports to forward the packet on (we describe port selection later; specifically, we avoid
ports whose buffers are full and ports that are connected to end hosts). Other switches will buffer and forward the packet,
and it will make its way to its destination, possibly coming back through the switch that originally
detoured it.
Figure 4.1 shows an example. It depicts the path of a single packet (from one of our simulations)
that was detoured multiple times, before reaching destination R. The weight of an arc indicates
how often the packet traversed that specific arc. Dashed arcs indicate detours. While the actual
order of the hops cannot be inferred, we can see that the packet bounced 8 times back to a core
switch because the aggregation switch was congested. The packet bounced several times within
the receiver pod prior to reaching the receiver.
The idea of detouring–and not dropping–excess traffic seems like an invitation to congestion
collapse [77]. Our key insight is to separate the slowing of senders to avoid congestion collapse from
the handling of excess traffic necessary before congestion control kicks in. Our results show that
DIBS works well as long as a higher-layer congestion control scheme such as DCTCP suppresses
persistent congestion and the network has some spare capacity for DIBS to absorb transient
congestion. We will formalize these requirements later in the paper and show that they are easily
satisfied in a modern DCN.
In fact, DIBS is particularly suited for deployment in DCNs. Many popular DCN topologies
offer multiple paths [46], which detouring can effectively leverage. The link bandwidths in DCNs
are high and the link delays are small. Thus, the additional delay of a detour is low. DCNs are
under a single administrative control, so we do not need to provide incentives to other switches to
share their buffer.
DIBS has two other key benefits. First, DIBS does not come into play until there is extreme
congestion–it has no impact whatsoever when things are “normal”. Second, the random detouring
strategy we propose in this paper has no parameters to tune, which makes implementation very
easy.
In the rest of the paper, we will describe the DIBS idea in detail, and evaluate DIBS extensively
using a NetFPGA [31] implementation, a Click [108] implementation, and simulations using NS-3 [33].
Our results show that DIBS significantly improves query completion time when faced with heavy
congestion. In cases of heavy congestion, DIBS can reduce the 99th percentile of query completion
times by as much as 85%. Furthermore, this improvement in the performance of query traffic is
achieved with little or no collateral damage to background traffic for typical data center traffic
patterns. We also investigate how extreme the traffic must be before DIBS "breaks", and find that
DIBS handles traffic loads of up to 10,000 queries per second in our setting. Finally, we compare the
performance of DIBS to pFabric [51] (a state-of-the-art datacenter transport design intended
to achieve near-optimal flow completion times). We find that during heavy congestion, DIBS
performs as well (if not slightly better) in query completion time while having less impact on the
background traffic.
4.2 DIBS overview
DIBS is not a congestion control scheme; indeed, it must be paired with congestion control (§4.3).
Nor is DIBS a completely new routing scheme, relying on the underlying Ethernet routing (§4.3).
Instead, DIBS is a small change to that normal Ethernet (L2) routing. In today’s data centers,
when a switch receives more traffic than it can forward on a port, it queues packets at the buffer
for that port.
3
If the buffer is full, the switch drops packets. With DIBS, the switch instead
forwards excess traffic via other ports.
When a detoured packet reaches the neighboring switch, it may forward the packet towards its
destination normally, or, if it too is congested, it may detour the packet again to one of its own
neighbors. The detoured packets could return to the original switch before being forwarded to its
destination, or it could reach the destination using a different path.
Single packet example. Figure 4.1 depicts the observed path of a single packet that switches
detoured 15 times on its way to its destination R. This example came from a trace in our
simulation of a K=8 fat-tree topology with a mixture of long flows and short bursts, modeled on
actual production data center workloads [49]. We discuss our simulations more in §4.5.3. Here, we
just illustrate how DIBS moves excess traffic through the network.
(a) Detours per switch over time. (Each dot denotes the decision of a switch to detour a packet.)
(b) Buffer occupancy at times t1, t2, t3 in a pod. Each switch is represented by 8 bars. Each bar is an outgoing port connecting to a node in the layer below or above. The size of the bar represents the port's output queue length (green: packets in buffer; yellow: buffer buildup; red: buffer overflow).
Figure 4.2: Detours and buffer occupancy of switches in a congested pod. During heavy congestion,
multiple switches are detouring.
When the packet first reached an aggregation switch in R’s pod, the switch’s buffer on its
port toward R was congested, so the aggregation switch detoured the packet. In fact, most of the
detouring occurred at this switch. To avoid dropping the packet, the switch detoured the packet
to other edge switches four times and back to core switches eight times, each time receiving it
back. After receiving the packet back the twelfth time, the switch had buffer capacity to enqueue
the packet for delivery to R’s edge switch. However, the link from that switch to R was congested,
and so the edge switch detoured the packet back to the aggregation layer three times. After the
packet returned from the final detour, the edge switch finally had the buffer capacity to deliver
the packet to R. The path of this packet illustrates how DIBS effectively leverages neighboring
switches to buffer the packet, keeping it close unless congestion subsides, rather than dropping it.
Network-wide example. In the next example, a large number of senders send to a single
destination, causing congestion at all the aggregate switches in a pod, as well as at the destination’s
edge switch. Figure 4.2a illustrates how the switches respond to the congestion over time, with
each horizontal slice representing a single switch and each vertical slice representing a moment
in time. Each marker on the graph represents a switch detouring a single packet. From time t1
until just after t2, four aggregation switches have to detour a number of packets, and the edge
switch leading to the destination has to detour over a longer time period. Even with this period of
congestion, DIBS absorbs and delivers the bursts within 10ms, without packet losses or timeouts.
Figure 4.2b shows the buffer occupancy of the eight switches in the destination’s pod over the
first few milliseconds of the bursts. The bursts begin at time t1, with the aggregation switches
buffering some packets to deliver to the destination’s edge switch and the edge switch buffering
a larger number of packets to deliver to the destination. Soon after, at time t2, all five of those
buffers are full, and the five switches have to detour packets. The figure depicts the destination’s
edge switch detouring packets randomly back to each of the aggregation switches, and
the aggregation switches detouring packets to the other edge switches and back to the
core. By time t3, most of the congestion has abated, with only the edge switch of the receiver
needing to continue to detour packets.
These examples illustrate how DIBS shares buffers among switches to absorb large traffic bursts.
They also highlight the four key decisions that DIBS needs to make: (i) when to start detouring;
(ii) which packets to detour; (iii) where to detour them to; and (iv) when to stop detouring. By
answering these questions in different ways, we can come up with a variety of detouring policies.
In this paper, we focus on the simplest policy. When a switch receives a packet, if the buffer
towards that packet’s destination is full, the switch detours the packet via a random port that
is connected to another switch (we do not detour packets to end hosts, because end hosts do not
forward packets not meant for them) and whose buffer is not full. In §4.8, we briefly consider other
detouring policies.
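A minimal sketch of this policy, assuming an output-queued switch model with one queue per port (the hardware and Click implementations in §4.5 realize the same logic in their own frameworks):

import random

def forward_or_detour(pkt, out_port, queues, switch_ports, capacity):
    """Enqueue pkt on its normal output port, or detour it to a random non-full
    port that faces another switch. Returns the chosen port, or None on a drop."""
    if len(queues[out_port]) < capacity:
        queues[out_port].append(pkt)          # normal forwarding
        return out_port
    candidates = [p for p in switch_ports     # switch-facing ports only (no end hosts)
                  if p != out_port and len(queues[p]) < capacity]
    if not candidates:
        return None                           # every neighbor is also full: drop (rare)
    detour_port = random.choice(candidates)
    queues[detour_port].append(pkt)
    return detour_port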
We conclude by noting two salient features of DIBS.
DIBS has no impact on normal operations. It comes into play only in case of severe
congestion. Otherwise, it has no impact on normal switch functioning.
DIBS has no tunable parameters. The random detouring strategy does not have any
parameters that need tuning, which makes it easy to implement and deploy. It also does not require
any coordination between switches, and thus detouring decisions can be made at line rate.
Figure 4.3: Sparsity of hotspots in four workloads (IndexSrv, 3Cars, Neon, Cosmos).
4.3 Requirements
DIBS can improve performance only if certain conditions are met. We now describe these
requirements, and show that they are easily satisfied in modern DCNs.
No persistent congestion: DIBS is based on the premise that even though a single switch may
be experiencing congestion and running out of buffer space, other switches in the network have
spare buffer capacity. We also assume that the network has sufficient bandwidth to carry detoured
packets to the switches with buffer capacity. In other words, DIBS requires that congestion in
a data center environment is rare, localized, and transient. Several studies of traffic patterns in
data centers satisfy this requirement [101, 102]. In fact, the Flyways system also relies on this
observation [102].
Figure 4.3 reproduces a graph from the Flyways paper [102]. It shows measurements from four
real-world data center workloads. The data sets represent different traffic patterns, including map-
reduce, partition-aggregate, and high performance computing. The graph shows the distribution
(over time) of the fraction of links that are running “hot,” with utilization at least half that of the
most loaded link. At least 60% of the time in every dataset, fewer than 10% of links are running
hot.
We see similar results when we use scaled versions of the workload from [49] in our simulations
(§4.5). Figure 4.4 shows the fraction of links with utilization of 90% or more for three different
levels of traffic (using default settings in Table 4.2 except qps). We use 90% as a threshold, since
it is more representative of the extreme congestion that DIBS is designed to tackle (with the
threshold set to 50%, the graph looks similar to Figure 4.3). The takeaway
remains the same – at any time, only a handful of the links in the network are "hot". Figure 4.5
shows that for both the baseline and heavy cases, there is plenty of spare buffer in the neighborhood
of a hot link. Although the heavy workload induces 3X more hot links than the baseline does, the
fraction of available buffers in nearby switches is only slightly reduced. In fact, we see that nearly
80% of the buffers on switches near a congested switch are empty in all cases except the extreme
scenario where DIBS fails (§4.5).
Congestion control: DIBS is meant for relieving short-term congestion by sharing buffers
between switches. It is not a replacement for congestion control. To work effectively, DIBS must
be paired with a congestion control scheme.
The need for a separate congestion control scheme stems from the fact that DIBS does nothing
to slow down senders. Unless the senders slow down, detoured packets will eventually build large
queues everywhere in the network, leading to congestion collapse. To prevent this, some other
mechanism must signal the onset of congestion to the senders.
Because DIBS is trying to avoid packet losses, the congestion control mechanism used with it
cannot rely on packet losses to infer congestion. For example, we cannot use TCP NewReno [40]
as a congestion control mechanism along with DIBS. Since NewReno slows down only when faced
with packet losses, the senders may not slow down until all buffers in the network are filled, and
DIBS is forced to drop a packet. This not only defeats the original purpose of DIBS, but also
results in unnecessary and excessive queue buildup.
Figure 4.4: Hot links. The baseline workload is 300 qps, heavy is 2000 qps, and extreme is 10,000 qps. See
Table 4.2.
In this paper, we couple DIBS with DCTCP [49], which uses ECN marking [16] instead of
packet losses to signal congestion.
No spanning-tree routing: Enterprise Ethernet networks use spanning tree routing. Detouring
packets on such a network would interfere with the routing protocol itself. On the other hand, data
center networks typically offer multiple paths and hence do not rely on spanning trees. Following
recent data center architecture designs [91, 130, 46], we assume that switches forward packets
based on forwarding tables (also known as FIBs). A centralized controller may compute the FIBs,
or individual switches may compute them using distributed routing protocols like OSPF or ISIS.
When there are multiple shortest paths available, a switch uses flow-level equal-cost multi-path
(ECMP) routing to pick the outgoing port for each flow [98]. We assume that switches do not
invalidate entries in their FIB if they happen to detect loops.
Figure 4.5: Neighboring buffer size: fraction of buffers available in 1-hop and 2-hop neighboring switches under the baseline, heavy, and extreme workloads.
4.4 Design Considerations and Implications
We now discuss several issues that stem from the DIBS design and requirements.
Interactions with TCP: By detouring some packets along longer paths, DIBS can affect
TCP in three ways. First, DIBS can reorder packets within a flow. Such reordering is rare,
since DIBS detours packets only in cases of extreme congestion. Reordering can be dealt with
either by disabling fast retransmissions entirely [51] or by increasing the number of duplicate
acknowledgments required to trigger them [159]. For all the experiments presented in the paper that
used DIBS, we disabled fast retransmissions. We have also experimented with simply increasing
the dupack threshold, and found that a dup-ack threshold larger than 10 packets is usually
sufficient to deal with reordering caused by DIBS. Other than this, no changes are required on the
hosts.
Second, packets traveling along longer paths can inflate TCP’s estimate of round-trip time
(RTT) and RTT variance, which TCP uses to calculate a retransmission timeout (RTO) value to
detect loss. However, packet losses are exceedingly rare with DIBS, so the value of the timeout is
not important. Indeed, we do not even need to set MinRTO to a very low value, as advocated by
some prior proposals [77]. Instead, we use a default MinRTO value of 1ms, which is commonly
used in data center variants of TCP [49].
Third, excessive detouring delays may trigger a spurious retransmission by the sender. Since
we do not need to set an aggressive MinRTO, spurious retransmissions are rare.
Loops and multiple detours: Detour paths may take packets along loops on their way to
the destination. For example, in Figure 4.1, the packet loops between the receiver’s edge switch
and an aggregation switch twice, among other loops. Loops can complicate network management
and troubleshooting, and so operators and routing protocols generally try to avoid them. We
believe that the loops induced by DIBS represent a net win, because DIBS reduces loss and flow
completion times, and the loops are transient and only occur during periods of extreme congestion.
In §4.5, we show that DIBS only requires a limited number of detours and only causes loops in a
transient time period.
Collateral damage: DIBS may cause “collateral damage.” For instance, if the example in
Figure 4.2 had additional flows, the detoured packets may have gotten in their way, increasing their
queuing delay and, possibly, causing switches to detour packets from the other flows. However, we
will show in §4.5 that such collateral damage is uncommon and limited for today’s data center
traffic. This is because DIBS detours packets only in rare cases of extreme congestion. Also, in
the absence of DIBS, some of these flows may have performed worse, due to packet drops caused
by buffer overflows at the site of congestion. We will see examples of such traffic patterns in §4.5.
Switch buffer management: So far, our description of DIBS has assumed that the switch has
dedicated buffers per output port. Many switches in data centers have a single, shallow packet
buffer, shared by all ports. The operator can either statically partition the buffer among output
ports, or configure the switch to dynamically partition it (with a minimum buffer on each port to
avoid deadlocks). DIBS can be easily implemented on both types of switches. With DIBS, if the
switch reaches the queuing threshold for one of the output ports, it will detour the packets to a
randomly selected port. §4.5.5.2 evaluates DIBS with dynamic buffer allocations.
Switches can also have a combined Input/Output queue (CIOQ) [158]. Since these switches
have dedicated egress queues, we can easily implement DIBS in this architecture. When a packet
arrives at an input port, the forwarding engine determines its output port. If the desired output
queue is full, the forwarding engine can detour the packet to another output port.
Comparison to ECMP and Ethernet flow control: Detouring is closely related to Ethernet
flow control [85] and ECMP [98]. We provide a more detailed comparison in §4.7.
4.5 Evaluation
We have implemented DIBS in a NetFPGA switch, in a Click modular router [108] and in the
NS3 [34] simulator. We use these implementations to evaluate DIBS in increasingly sophisticated
ways. Our NetFPGA implementation validates that DIBS can be implemented at line rate in
today’s switches (§4.5.1). We use our Click implementation for a small-scaled testbed evaluation
(§4.5.2).
We conduct the bulk of our evaluation using NS-3 simulations, driven by production data
center traffic traces reported in [49]. Unless otherwise mentioned, we couple DIBS with DCTCP
and use the random detouring strategy (§4.2). The simulations demonstrate that, across a wide
range of traffic (§4.5.4) and network/switch configurations (§4.5.5), DIBS speeds the completion of
latency-sensitive traffic without unduly impacting background flows, while fairly sharing bandwidth
between flows (§4.5.6). Of course, DIBS only works if the congestion control scheme it is coupled
with is able to maintain buffer capacity in the network; we demonstrate that extremely high rates
of small flows can overwhelm the combination of DCTCP and DIBS (§4.5.7). Finally, we show
that DIBS can even outperform state-of-the-art datacenter transport designs intended to achieve
near-optimal flow completion times [51] (§4.5.8).
DC settings      Value              TCP settings       Value
Link rate        1 Gbps             minRTO             10 ms
Switch buffer    100 pkt per port   Init. cong. win.   10
MTU              1500 Bytes         Fast retransmit    disabled
Table 4.1: Default DIBS settings following common practice in data centers (e.g., [49]) unless otherwise
specified. In Table 4.2, we indicate how we explore varying buffer sizes and other traffic and network
parameters.
4.5.1 NetFPGA implementation and evaluation:
To demonstrate that we can build a DIBS switch in hardware, we implemented one on a 1G
NetFPGA [31] platform. We found that it adds negligible processing overhead.
We followed the reference Ethernet switch design in NetFPGA, which implements the main
stages of packet processing as pipelined modules. This design allows new features to be added
with relatively small effort. We implemented DIBS in the Output Port Lookup module, where
the switch decides which output queue to forward a packet to. We provide the destination-based
Lookup module with a bitmap of available output ports whose queues are not full. It performs a
bitwise AND of this bitmap and the bitmap of the desired output ports in a forwarding entry. If
the queue for the desired output port is not full, it stores the packet in that queue. Otherwise, it
detours the packet to an available port using the bitmap.
Our DIBS implementation requires about 50 lines of code and adds little additional logic (2
Slices, 10 Flip-Flops and 3 input Look-Up Tables). Given the response from the lookup table,
DIBS decides to forward or detour within the same system clock cycle. That means DIBS does not
add processing delay. Our followup tests verified that our DIBS switch can forward and detour a
stream of back-to-back 64-byte packets at line-rate. Implementing DIBS requires very little change
in existing hardware.
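In software terms, the bitmap logic amounts to the following (a Python rendering of the behavior described above, not the Verilog itself; the choice of which available bit to take when detouring is an assumption):

def select_output_port(desired_bitmap, available_bitmap):
    """desired_bitmap: output ports from the forwarding entry; available_bitmap:
    ports whose output queues are not full. Bit i corresponds to port i."""
    usable = desired_bitmap & available_bitmap
    if usable:
        return lowest_set_bit(usable)             # desired port has room: forward normally
    if available_bitmap:
        return lowest_set_bit(available_bitmap)   # otherwise detour to some non-full port
    return None                                   # no queue has room

def lowest_set_bit(bitmap):
    return (bitmap & -bitmap).bit_length() - 1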
4.5.2 Click implementation and evaluation
We implemented the detour element in the Click modular router with just 50 additional lines of
code. When a packet arrives at a switch, Click first performs a forwarding table lookup. Before
enqueuing the packet to the appropriate output buffer, our detour element checks if the queue is
full. If it is, the detour element picks a different output queue at random.

(a) CDF of query completion time. (b) CDF of the durations of individual flows.
Figure 4.6: Click implementation of DIBS achieves near-optimal query completion times because no
flow experiences timeouts.
We evaluated our Click-based prototype on Emulab [15]. Our testbed was a small FatTree
topology with two aggregator switches, three edge switches, and two servers per rack. We set
parameters based on common data center environments as shown in Table 4.1. To avoid having
out-of-order detoured packets cause unnecessary retransmissions, when using DIBS, we disabled
fast retransmission.
We evaluated DIBS with the partition-aggregate traffic pattern known to cause
incast [77]. This traffic pattern arises when a host queries multiple hosts in the network in parallel,
and they all respond at the same time.
Detouring is a natural solution for incast, because it automatically and immediately increases
the usable buffer capacity when the incast traffic overwhelms a switch’s own buffer. We will show
that detouring can achieve near optimal performance for incast traffic patterns.
In our test, the first five servers each sent ten simultaneous flows of 32KB each to the last
server (to ensure that the servers generated flows simultaneously, we modified iperf [27] to
pre-establish TCP connections). The primary metric of interest is query completion time (QCT), which is the time needed
for the receiver to successfully receive all flows. We ran the experiment 50 times with these settings.
We also tried several different settings for flow sizes, numbers of flows, and queue sizes. All incast
scenarios caused qualitatively similar results.
We consider three settings. In the first setting, the queue on each switch port is allowed to
grow infinitely. This setting allows the switches to absorb incast burst of any size, without packet
loss so it represents a useful baseline. In the second setting, we limit the queue size to 100 packets,
and enforce droptail queuing discipline. The third setting also uses 100 packets buffers, but enables
DIBS.
The performance of the three settings is shown in Figure 4.6a. With infinite buffer, all queries
complete in 25ms. DIBS also provides very similar results, with all queries completing in 27ms.
QCT is much higher with the smaller buffer with droptail queuing, ranging from 26ms to 51ms.
The reason for the high QCT with droptail queuing, and why DIBS provides such an
improvement over it, becomes apparent by observing the completion times of all the individual query
flows. In Figure 4.6b, in the droptail setting we notice that a small number of responses (about
9%) have durations between 25-50ms. The reason is that due to the bursty nature of the traffic,
those responses suffer from packet loss, which forces them to take a timeout. In our experiment,
all queries had at least one such delayed response. Since a query is not completed until all the
responses have arrived, those delayed responses determine the QCT. DIBS eliminates packet drops
and consequent retransmission timeouts, guaranteeing that all responses have arrived within 25ms,
which is close to the optimal case. This explains the significant improvement in QCT.
4.5.3 Simulations setup
Now we use NS3 simulations to study the performance of DIBS on larger networks with realistic
workloads. We will primarily focus on comparing the performance of DCTCP with and without
DIBS, although we will also consider pFabric [51] in §4.5.8.
DCTCP: DCTCP is the state-of-the-art scheme for congestion control in data center environments [49],
and is currently being deployed in several large data centers supporting a large search
engine. DCTCP is a combination of a RED-like [38] AQM scheme with some changes to end-host
TCP stacks. The switch marks packets by setting ECN codepoints [16], once the queue exceeds a
certain threshold. If the queue overflows, packets are dropped. The receiver returns these marks
to the sender. The TCP sender reduces its rate in proportion to the received marks.
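Concretely, the DCTCP sender [49] keeps a running estimate $\alpha$ of the fraction of its packets that were marked and cuts its congestion window in proportion to it:

\[
\alpha \leftarrow (1-g)\,\alpha + g\,F, \qquad
cwnd \leftarrow cwnd \times \Bigl(1 - \frac{\alpha}{2}\Bigr),
\]

where $F$ is the fraction of packets marked in the most recent window of data and $g$ is a fixed estimation weight.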
DCTCP augmented with DIBS: Instead of dropping packets when the queue overflows, we
detour them to nearby switches as described in §4.2. The detoured packets are also marked. In
addition, we also disable fast retransmit on TCP senders. For brevity, we refer to this scheme
simply as DIBS in this section.
To ensure fair comparison, we use settings similar to those used to evaluate DCTCP [49].
Network and traffic settings: Table 4.1 shows network settings. The switches use Equal-Cost
Multi-Path (ECMP) routing to split flows among shortest paths. In our default configuration, we
use a fixed 100 packet FIFO buffer per output port, and set the marking threshold to 20 packets.
We simulate a network consisting of 128 servers, connected using a fat-tree topology (K = 8) and
1Gbps links.
We use the traffic distribution data from a production data center [49] to drive the simulations.
We vary various parameters of these distributions, as well as switch and network configurations, to
evaluate a wide range of scenarios (i.e. a “parameter sweep”). Table 4.2 depicts the default value
of each parameter in bold (corresponding to the normal case in an earlier study [49]), the range
we vary each parameter over, and the sections in which we vary each parameter.
The traffic includes both query traffic (a.k.a. partition/aggregate or incast) [49] and
background traffic. In the background traffic, 80% of flows are smaller than 100KB. We vary the
intensity of background traffic by varying the interarrival time. In the query traffic, each query
consists of a single incast target that receives flows from a set of responding nodes, all selected
at random. We vary three parameters of the query traffic to change its intensity: the degree of
incast (i.e., the number of responding nodes), the size of each response, and the arrival rate of
the queries. The degree of incast and the response size determine how intense a given burst will
Setting                  Min    Max      Section
BG inter-arrival (ms)    10     120      4.5.4.1
QPS                      300    15000    4.5.4.2, 4.5.7
Response size (KB)       20     160      4.5.4.3, 4.5.7
Incast degree            40     100      4.5.4.4, 4.5.5.2
Buffer (packets)         20     100      4.5.5.1, 4.5.5.2
TTL                      12     255      4.5.5.3
Oversubscription         1      16       4.5.5.4
Table 4.2: Simulation parameter ranges we explore. Bold values indicate default settings. The top half
of the table captures traffic intensity, including background traffic (top row, with lower inter-arrival time
indicating more traffic) and query intensity (next three rows). The bottom half of the table captures network
and switch configuration, including buffer sizes, initial packet TTLs, and link oversubscription. We revisit
some aspects in additional sections on shared switch buffers (§4.5.5.2) and extreme congestion (§4.5.7).
be, while the arrival rate determines how many nodes will be the target of incast in a given time
period. We also vary the size of each output buffer at the switches.
In general, our intention is to start from the traffic patterns explored in [49], and consider more
intense traffic by varying interarrival time of background traffic and intensity of query traffic.
Metric: The metric of interest for query traffic is 99th percentile of query completion time
(QCT) [49]. This is the standard metric for measuring performance of incast-like traffic. For
background traffic, the metric is flow completion time (FCT) of short flows (1-10KB). Recall that
the initial window size for TCP connections in a data center setting is 10 packets (Table 4.1).
Since these flows can complete in one window, the FCT of these flows stems almost entirely
from the queueing delay experienced by these packets, as well as the delay of any loss recovery
induced by congestion. We focus on the 99th percentile of FCT for these flows, to highlight any
collateral damage. In contrast, DIBS has little impact on the performance of long flows, since
their throughput is primarily determined by network bandwidth.
4.5.4 Performance under different traffic conditions
4.5.4.1 Impact of background traffic
In our first experiment, we hold the query traffic constant at the default values from Table 4.2
and vary the inter-arrival time of the background traffic from 10ms to 120ms. The amount of
background traffic decreases as inter-arrival time increases. Figure 4.7 shows the result. Although
depicted on one graph, QCT and background FCT are separate metrics on different traffic and
cannot be compared to each other.

Figure 4.7: Variable background traffic; 99th-percentile completion time (ms) vs. average inter-arrival time (ms). Collateral damage is consistently low. (Incast degree: 40; response size: 20KB; query arrival rate: 300 qps.) Although depicted on one graph, QCT and background FCT are separate metrics on different traffic and cannot be compared to each other.

Figure 4.8: Variable query arrival rate; 99th-percentile completion time (ms) vs. query arrival rate (qps). Collateral damage is low. Query traffic rate has little impact on collateral damage, and, at high query rate, DIBS improves performance of background traffic. (Background inter-arrival time: 120ms; incast degree: 40; response size: 20KB.)

Figure 4.9: Variable response size; 99th-percentile completion time (ms) vs. query response size (KB). Collateral damage is low, but depends on response size. Improvement in QCT decreases with increasing response size. (Background inter-arrival time: 120ms; incast degree: 40; query arrival rate: 300 qps.)
DIBS significantly reduces the response time of query traffic. The 99th percentile of QCT is
reduced by as much as 20ms. Furthermore, the improvement in QCT comes at very little cost to
background traffic. The 99th percentile of background FCT increases by less than 2ms, while most
of the background flows are barely affected. In other words, there is very little collateral damage
in this scenario. Furthermore, we see that the collateral damage does not depend on the intensity
of the background traffic.
There are two reasons for this lack of collateral damage. First, in all the experiments, on
average, DIBS detours less than 20% of the packets. Over 90% of these detoured packets belong
to query traffic, since DIBS detours only in cases of extreme congestion, and only query traffic
causes such congestion. DIBS detours only 1% of the packets from background flows. These
background packets come from flows that happen to traverse a switch where query traffic is causing
a temporary buffer overflow. Without DIBS, the switch would have dropped these packets. DIBS
does not suffer any packet drops in this scenario, whereas standard DCTCP does. Second, most
background flows are short, and the congestion caused by an incast burst is ephemeral. Thus,
most background flows never encounter an incast burst, and so it is rare for a background packet
to end up queued behind a detoured packet.
4.5.4.2 Impact of query arrival rate
We now assess how DIBS affects background traffic as we vary the rate of queries (incasts). Keeping
all other parameters at their default values (Table 4.2), we vary the query arrival rate from 300
per second to 2000 per second. This scenario corresponds to a search engine receiving queries at a
high rate. The results are shown in Figure 4.8.
We see that DIBS consistently improves performance of query traffic, without significantly
hurting the performance of background traffic. The 99th percentile of QCT improves by 20ms.
The 99th percentile FCT shows a small increase (1-2ms). However, at the highest query arrival
rate (2000 qps), DIBS improves the 99th percentile FCT for short background flows. At such high
query volume, DCTCP fails to limit queue buildup. As a result, without DIBS, some background
flows suffer packet losses, slowing their completions. With DIBS, background traffic does not suffer
packet losses, as the packets will instead detour to other switches.
Collateral damage remains low because, even with 2000 qps, over 80% of packets are not
detoured. More than 99% of the detoured packets belong to query traffic in all cases. DIBS does
not suffer any packet drops in this scenario.
4.5.4.3 Impact of query response size
We now vary the intensity of query traffic by varying the response size from 20KB to 50KB, while
keeping all other parameters at their default values (Table 4.2). Larger responses correspond to
collecting richer results to send to a client. The results are shown in Figure 4.9.
We see that DIBS improves 99th percentile of QCT, but, as the query size grows, DIBS is less
effective. At a response size of 20KB, the difference in 99th percentile QCT is 21ms, but, at a
response size of 50KB, it is only 6ms. This behavior is expected. As the response size grows, the
size of the burst becomes larger. While there are no packet drops, as more packets suffer detours,
the possibility that a flow times out before the detoured packets are delivered goes up. In other
words, with DIBS, TCP flows involved in query traffic suffer from spurious timeouts.

Figure 4.10: Variable incast degree; 99th-percentile completion time (ms) vs. incast degree. Collateral damage is low, but depends on incast degree. (Background inter-arrival time: 120ms; query arrival rate: 300 qps; response size: 20KB.)

Figure 4.11: Variable buffer size; (a) 99th-percentile FCT of background traffic and (b) 99th-percentile QCT of query traffic vs. buffer size (packets). There is no collateral damage and DIBS performs best with medium buffer sizes. (Background inter-arrival rate: 10ms; incast degree: 40; response size: 20KB; query arrival rate: 300 qps.)
Needless to say, with DCTCP alone, there are plenty of packet drops, and both background
and query traffic TCP flows routinely suffer from (not spurious) timeouts. In contrast, DIBS has
no packet drops.
As before, the impact on background traffic is small, but it increases slightly with the increase
in response size. The difference in the 99th percentile of FCT of short flows is 1.2ms at 20KB,
while it increases to 4.4ms at 50KB. This is because more packets are detoured for a given incast
burst.
4.5.4.4 Impact of incast degree
We now vary the intensity of incast by varying the incast degree; i.e., the number of responders
involved, from 40 to 100, while keeping all other parameters at their default values (Table 4.2).
More responders correspond to collecting more results to send to a client. Figure 4.10 presents the
results.
We see that DIBS improves 99th percentile of QCT, and the improvement grows with increasing
incast degree. For example, the difference between the two schemes is 22ms when the degree of
incast is 40, but grows to 33ms when the incast degree is 100. The impact on background traffic is
small, but it increases with increasing incast degree.
It is interesting to compare Figures 4.9 and 4.10, especially at the extreme end. For the extreme
points in the two figures, the total amount of data transmitted in response to a query is the
same (2MB). Yet, both DCTCP and DIBS perform worse when the large responses are due to a
high incast degree (many senders) than when they are due to large response sizes. The drop in
DCTCP’s performance is far worse. With large responses, the 99th percentile QCT of DCTCP
was 44ms. With many senders, it is 79ms. For DIBS, the corresponding figures are 37ms and
46ms, respectively.
The reason is that the traffic with high incast degree is far more bursty. When we explored
large responses, 40 flows transferred 50KB each. In the first RTT, each flow sends 10 packets
(initial congestion window size). Thus, the size of the initial burst is 400KB. In contrast, with
extremely high incast degree, 100 flows transfer 20KB each. Thus, the size of the first burst is
1MB.
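The burst sizes above follow from the initial congestion window; the sketch below reproduces the arithmetic, assuming roughly 1KB packets (our assumption, consistent with the 400KB and 1MB figures).

    #include <cstdio>

    int main() {
        const int init_cwnd = 10;    // initial window in packets (Table 4.1)
        const int pkt_bytes = 1000;  // assumed bytes per packet (~1KB)

        // First-RTT burst size = number of flows x initial window x packet size.
        int burst_40x50KB  = 40  * init_cwnd * pkt_bytes;
        int burst_100x20KB = 100 * init_cwnd * pkt_bytes;
        std::printf("40 flows of 50KB:  first burst ~%d KB\n", burst_40x50KB / 1000);
        std::printf("100 flows of 20KB: first burst ~%d KB\n", burst_100x20KB / 1000);
        // Same 2MB total response, but the many-sender case arrives in a first-RTT
        // burst that is 2.5x larger.
        return 0;
    }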
The bursty traffic affects DCTCP far more than it impacts DIBS. With DCTCP alone, as the
degree of incast grows, more and more flows belonging to the incast suffer from packet drops and
timeouts.
DIBS is able to spread the traffic burst to nearby switches and ensure no packet loss. However,
DIBS is not completely immune to burstiness. Packets suffer many more detours in this setting
than they do in the previous setting, for a comparable total response size. For example, when the
incast degree is 100, 1% of the packets are detoured 40 times or more. In contrast, in the previous
setting, when the burst size was 50KB, the worst 1% of the packets suffered only about 10 detours.
Figure 4.12: Variable max TTL; 99th-percentile completion time (ms) vs. TTL. Limiting TTL does not have significant impact on background traffic. (Background inter-arrival rate: 10ms; incast degree: 40; response size: 20KB; query arrival rate: 300 qps.)

Figure 4.13: Extreme query intensity; 99th-percentile completion time (ms) vs. queries per second. (Background inter-arrival time: 120ms; incast degree: 40.)
4.5.5 Performance of different network configurations
4.5.5.1 Impact of buffer size
All previous experiments were carried out with buffer size of 100 packets per port. We now consider
the impact of switch buffer size on DIBS performance. We vary the buffer size at the switch from 1
packet to 200 packets, while keeping all other parameters at their default values (Table 4.2). The
results are shown in Figure 4.11. We show the query and background traffic results separately for
better visualization.
We see that DIBS improves 99th percentile of QCT significantly at lower buffer sizes. The
performance boost is more obvious at lower buffer sizes, where DCTCP suffers from more packet
drops while DIBS is able to absorb the burst by spreading it between switches. This result also
shows that our choice of buffer size of 100 for all past experiments is a conservative one in order to
compare with the default DCTCP setting [49]. For smaller buffer sizes the performance boost of
DIBS becomes more obvious.
(The fluctuation in the figure is caused by timeouts experienced by the queries at the 99th percentile.)
4.5.5.2 Impact of shared buffers
The switches deployed in production data centers usually use Dynamic Buffer Allocation (DBA)
to buffer the extra packets in shared packet memory upon congestion. In our simulation, we model
a DBA-capable switch with 8 × 1GbE ports and 1.7MB of shared memory, based on the Arista
7050QX-32 switch [6].
We use the default settings from Table 4.2 but vary the incast degree and compare the result
with Figure 4.10. By enabling DBA, DCTCP has zero packet loss and DIBS is not triggered.
However, when we further increase the incast degree beyond 150 by using multiple connections on a
single server, we find that DCTCP with DBA experiences packet loss and an increased 99th-percentile
QCT. In contrast, when DIBS is enabled, we observe no packet loss even under extreme congestion
that overflows the whole shared buffer, and the 99th-percentile QCT decreases by 75.4%.
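As background, one common way such shared-memory switches decide how much of the pool a single port may occupy is a dynamic threshold proportional to the remaining free memory; the sketch below shows that scheme in simplified form (the actual allocation policy of the modeled switch may differ, and the parameters here are illustrative).

    #include <cstdio>

    // Simplified dynamic buffer allocation: a queue may keep growing as long as
    // it stays below alpha * (free shared memory).
    struct SharedBuffer {
        int    total_cells;   // total shared memory, in packet-sized cells
        int    used_cells;    // cells currently occupied across all queues
        double alpha;         // how aggressively one queue may use the pool

        bool admit(int queue_len) const {
            int free_cells = total_cells - used_cells;
            return queue_len < alpha * free_cells;
        }
    };

    int main() {
        SharedBuffer buf{1700, 1200, 1.0};   // e.g. ~1.7MB of 1KB cells, mostly used
        std::printf("queue of 300 admits another packet: %d\n", buf.admit(300));
        std::printf("queue of 600 admits another packet: %d\n", buf.admit(600));
        return 0;
    }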
4.5.5.3 Impact of TTL
So far, we have seen that the impact of DIBS on background flows was small in all but the most
extreme cases. Still, it is worth considering whether we could reduce the impact further. In some
cases, packets are detoured as many as 20 times. One possible way to limit the impact of detoured
packets on background flows is to limit the number of detours a packet can take. This limits the
amount of “interference” it can cause. The simplest way to limit detours is to restrict the TTL on
the packet.
We now carry out an experiment where we vary the max TTL limit from 12 to 48, while
keeping all other parameters at their default values (Table 4.2). Recall that the diameter of our
test network is 6 hops. Each backwards detour reduces the TTL by two before the packet is
delivered to the destination (one decrement when the packet is detoured, and one when it re-traces that
hop). Thus when the max TTL is 12, the packet can be detoured backwards 3 times. When the
max TTL is 48, the packet can be detoured backwards 21 times. "Forward" detours change the
accounting somewhat.

Figure 4.14: Large query response sizes; 99th-percentile completion time (ms) vs. query response size (KB). (Background inter-arrival time: 120ms; incast degree: 40.)

Figure 4.15: DIBS vs. pFabric on mixed traffic with variable query arrival rate; 99th-percentile completion time (ms) vs. query arrival rate (qps) for (a) background traffic and (b) query traffic. (Background inter-arrival time: 120ms; incast degree: 40; response size: 20KB.)
Figure 4.12 shows the results. Note that the last point on the X axis represents maximum
TTL, which is 255. We have also shown performance of DCTCP for reference. TTL values have no
impact on DCTCP’s performance. For DIBS, the 99th percentile QCT improves with increasing
TTL. The reason is that at lower TTL, DIBS is forced to drop packets due to TTL expiration. We
also see that TTL has little impact on 99th percentile FCT of background flows.
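The detour budget implied by a given TTL limit follows directly from the accounting above; the short sketch below (our own illustration) reproduces it for the TTL values in Figure 4.12, assuming the 6-hop diameter and counting only backward detours.

    #include <cstdio>
    #include <initializer_list>

    // A backward detour costs 2 TTL: one decrement when the packet is sent away
    // from the destination, and one when it re-traces that hop.
    int max_backward_detours(int max_ttl, int path_hops) {
        return (max_ttl - path_hops) / 2;
    }

    int main() {
        const int diameter = 6;  // diameter of our fat-tree test network
        for (int ttl : {12, 24, 36, 48})
            std::printf("max TTL %d -> at most %d backward detours\n",
                        ttl, max_backward_detours(ttl, diameter));
        return 0;
    }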
We note one other interesting observation: DIBS performs slightly better with a TTL
of 12 than with a TTL of 24. We believe that with a TTL of 24, packets stay in the network for far
too long, only to get dropped. In other words, we waste resources. Thus, it may be best not to
drop a packet that’s been detoured many times! We are investigating this anomaly in more detail.
We have carried out this experiment in other settings (e.g., a higher incast degree) and have
found the results to be qualitatively similar.
4.5.5.4 Impact of oversubscription
In order to study the effect of oversubscribed links on the performance of DIBS, we repeated the
previous experiments with different oversubscription parameters for the fat-tree topology. This was
done by lowering the capacity of the links between switches by a factor of 2, 3 and 4 (providing
oversubscription of 1:4, 1:9 and 1:16 respectively). Our experiments showed that DIBS consistently
lowers the 99th percentile QCTs by 20ms, for every oversubscription setting, without affecting the
background FCTs. This is because an increasing number of packets are buffered in the upstream
path when oversubscription factor increases, but the last hop of the downstream path is still
the bottleneck for query traffic. So at the last hop, DIBS can avoid packet loss and fully utilize
bottleneck bandwidth.
4.5.6 Fairness
We now show that DIBS ensures fairness for long-lived background flows. Recall that our simulated
network has 128 hosts, with bisection bandwidth of 1Gbps. We split the 128 hosts into 64 node-
disjoint pairs. Between each pair we start N long-lived flows in both directions. If the network
is stable, and DIBS does not induce unfairness, then we would expect each flow to get roughly
1/N Gbps of bandwidth, and Jain's fairness index [37] would be close to 1. We carry out this
experiment for N ranging from 1 to 16. Note that when N is 16, there are 128 * 2 * 16 = 4096
long-lived TCP flows in the network. Our results show that Jain’s fairness index is over 0.9 for all
values of N.
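For reference, Jain's fairness index over per-flow throughputs x_1, ..., x_n is (sum x_i)^2 / (n * sum x_i^2), which equals 1 when all flows receive identical bandwidth; the sketch below is our own illustration of the metric from [37].

    #include <cstdio>
    #include <vector>

    // Jain's fairness index: (sum x_i)^2 / (n * sum x_i^2).
    double jain_index(const std::vector<double>& throughput) {
        double sum = 0.0, sum_sq = 0.0;
        for (double x : throughput) { sum += x; sum_sq += x * x; }
        return (sum * sum) / (throughput.size() * sum_sq);
    }

    int main() {
        // Example: 4 long-lived flows with nearly equal shares of a 1Gbps link.
        std::vector<double> mbps = {251.0, 248.0, 252.0, 249.0};
        std::printf("Jain index = %.4f\n", jain_index(mbps));
        return 0;
    }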
4.5.7 When does DIBS break?
While DIBS performs well under a variety of traffic and network conditions, it is important to note
that, beyond a certain level of congestion, detouring packets (instead of dropping them) can actually hurt
performance. However, for this to happen, the congestion has to be so extreme that the entire
available buffer (across all switches) in the network gets used. In such a scenario, it is better to
drop packets than to detour them, since there is no spare buffer capacity anywhere in the network.
To understand where this breaking point is, we push the workload, especially the query traffic, to
an extreme. The goal is to show that such a breaking point, where DIBS can cause congestion to
spread, does indeed exist, but that for the tested scenario it is unrealistically high.
We start by pushing QPS to ever higher values. Figure 4.13 shows that, for the specific topology
we tested against, DIBS breaks when we generate more than 10K queries per second. In this case,
the queries arrive so fast that detoured packets do not get a chance to leave the network before
new packets arrive. Thus, large queues build up in the entire network, which hurts performance of
both query and background traffic.
Note that the size of the query response is small, and it takes the sender just two RTTs to
send it to the target receiver. Thus, DCTCP's congestion control mechanism (ECN marking) is not
effective, since it requires several RTTs to take effect.
The exact tipping point depends on the specifics of the topology, the link speed, etc. Our goal
is not to find it for all environments, but rather to show that such a tipping point exists, and that,
for a reasonable setup, it is extremely high.
Next, we tried to increase query response sizes to see if we could obtain a similar tipping point.
Query rate was held constant at 2000 queries per second. However, we found that DIBS does not
“break” in this scenario. The reason is that the large query response size requires several RTTs to
transmit, which gives DCTCP enough time to throttle the senders (Figure 4.14).
4.5.8 Comparison to pFabric
A number of recent proposals have shown that network performance can be significantly improved
by assigning priorities to flows and taking these priorities into account when making scheduling
decisions [154, 51]. The latest proposal in this area is pFabric [51], which provides near-optimal
performance for high-priority traffic by maintaining switch queues by priority order, instead of
FIFO.
We compare to pFabric as it is the state of the art, performing better than similar approaches
like PDQ [97]. pFabric calls for very shallow buffers per port (24 packets) to minimize the overhead
involved in maintaining a sorted packet queue. If a high-priority packet arrives at a queue that
is full, a switch simply drops the lowest priority packet in the queue to make room for the new
arrival. It relies on the end host to retransmit the dropped packet expeditiously. To this end,
pFabric requires end hosts to run a modified version of TCP with a fixed timeout set to a very
low value of 40 μs. However, the 40μs timeout that pFabric demands is difficult to implement in
servers: most modern kernels have 1ms clock ticks. Additionally, packets need to be tagged with
priorities, and switches must be able to drop or forward random packets in their queues selectively.
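The sketch below is a toy model of the priority-dropping behavior just described (drop the lowest-priority packet to admit a higher-priority arrival); it is our simplification, not pFabric's implementation.

    #include <algorithm>
    #include <cstdio>
    #include <vector>

    struct Pkt { int priority; };   // smaller value = more important (illustrative)

    // Shallow queue with priority-based eviction: if the queue is full, evict the
    // lowest-priority packet to admit a more important arrival; otherwise drop
    // the arrival. Returns true if the arriving packet is admitted.
    bool enqueue(std::vector<Pkt>& q, Pkt arriving, size_t cap = 24) {
        if (q.size() < cap) { q.push_back(arriving); return true; }
        auto lowest = std::max_element(q.begin(), q.end(),
            [](const Pkt& a, const Pkt& b) { return a.priority < b.priority; });
        if (arriving.priority < lowest->priority) { *lowest = arriving; return true; }
        return false;
    }

    int main() {
        std::vector<Pkt> q(24, Pkt{5});          // queue already full of priority-5 packets
        std::printf("high-priority arrival admitted: %d\n", enqueue(q, Pkt{1}));
        std::printf("low-priority arrival admitted:  %d\n", enqueue(q, Pkt{9}));
        return 0;
    }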
We now compare the performance between DIBS (i.e., DCTCP + DIBS) and pFabric. We use
the same k=8 FatTree topology as earlier and we generate the same mixed workloads in both
cases, as described in §4.5.3. For pFabric, the buffer size is set to 24 packets and the minRTO is
adjusted to 350μs since 1Gbps links are used.
Figure 4.15a shows that pFabric can hurt performance of large background flows, since it gives
higher priority to shorter flows. Thus, when query traffic (short flows) is high, long flows get
starved. DIBS does not prioritize traffic, and hence does not suffer from this problem. (It is possible
that not prioritizing short flows could hurt DIBS: for example, if a queue is already filled with packets
from background flows right before an incast wave, then DIBS will detour packets from the short
query flows, while pFabric will give them priority. However, such scenarios are less common and were
not observed in our simulations.) In fact,
Figure 4.15b shows that at high query arrival rate, DIBS even slightly improves 99th percentile of
the QCT for query traffic. This is because at high query arrival rates, pFabric drops too many
packets, and ends up doing excessive retransmissions. DIBS reduces packet losses via detouring,
and thus has slightly better performance.
4.5.9 Summary
In summary, these results show that DIBS performs well over a wide range of traffic. It improves
the performance of query traffic while causing little or no collateral damage to the background
flows. The results also show that under heavy load, DIBS is stable and long-lived flows treat each
other fairly, although DIBS can lead to poor performance under extremely heavy query arrival
rates.
4.6 Discussion
Network topology and detouring: In this paper, we focused on the FatTree topology. We
now consider the effectiveness of DIBS in other topologies with different levels of path diversity. A
switch gains more detouring options if it has more neighbors (higher degree), which are better for
flow completion if they offer alternate paths to the destination. DIBS suffers if detouring options
result in packets traversing long paths with lengthy queues, leading to timeouts or drops.
(To merely function correctly, DIBS does not need multiple disjoint paths between a sender and a receiver; in theory, DIBS would work even on a linear topology, where it can either detour packets back on the reverse path or, in the worst case, drop them.)
Two recent topologies, HyperX [45] and JellyFish [148], seem to have properties well-suited for
detouring. HyperX networks have many paths of different lengths between pairs of hosts. One
can imagine using the short paths under normal conditions, but using detouring to exploit the
larger path diversity when conditions warranted. Jellyfish connects fixed-degree switches together
randomly to provide higher bandwidth than equivalently-sized FatTree topologies. To achieve
these bandwidth gains, Jellyfish uses a fixed number of paths between pairs of hosts, some of
which may be longer than the shortest ones. DIBS can detour packets along all these paths even if they
are of different lengths. Moreover, since Jellyfish has more switches that are close to a destination
than FatTree, DIBS has more neighboring buffers to share.
Network admission control: Congestion mitigation is always coupled with network admission
control. DCTCP controls the sending rate of long flows, thus admitting more short flows into the
network. With DIBS, we admit even more short flows by sharing more buffers across switches.
However, we still need admission control at the hosts to prevent applications from sending too
many intensive short flows (e.g., due to misconfigurations, application bugs, or malicious users).
Other detouring policies: In this paper, we focus on simple random detouring without any
parameter tuning. However, DIBS can provide highly flexible detouring policies by making different
design decisions on (i) when to start detouring; (ii) which packets to detour; (iii) where to detour
them to; and (iv) when to stop detouring. We discuss example detouring policies that may be
useful in different settings, leaving detailed design and evaluation for future work.
Load-aware detouring: Random detouring works well in a FatTree topology, because ECMP is
effective in splitting traffic equally among shortest paths. However, topologies such as Jellyfish
and HyperX have paths with different lengths and have varying numbers of flows on these paths.
Load-aware detouring detours packets to neighboring switches based on load. For example, when
the destination port’s buffer is full, a switch sends the packet via its output port that has the
lowest current buffer usage.
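A minimal sketch of this load-aware policy is shown below; the port structure and selection code are our own illustration, not the data-plane implementation.

    #include <cstdio>
    #include <vector>

    // Hypothetical per-port state: current queue length and buffer capacity.
    struct Port { int queue_len; int capacity; };

    // If the shortest-path output port (dst_port) has room, use it; otherwise
    // detour via the output port with the lowest current buffer usage.
    int choose_output(const std::vector<Port>& ports, int dst_port) {
        if (ports[dst_port].queue_len < ports[dst_port].capacity)
            return dst_port;                                    // no congestion
        int best = -1;
        for (int p = 0; p < (int)ports.size(); ++p) {
            if (p == dst_port) continue;
            if (best < 0 || ports[p].queue_len < ports[best].queue_len)
                best = p;                                       // least-loaded neighbor
        }
        return best;
    }

    int main() {
        std::vector<Port> ports = {{100, 100}, {60, 100}, {15, 100}, {90, 100}};
        std::printf("packet for port 0 detoured to port %d\n", choose_output(ports, 0));
        return 0;
    }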
Flow-based detouring: Our basic mechanism makes detouring decisions at the packet level, meaning
that packets from the same flow can traverse different paths. Instead, switches could detour at the
flow granularity, similar to how ECMP is usually deployed. Some flows would be detoured more
often than others, and detoured packets from the same flow would follow a consistent path. An
operator could even encode policy in the configurations for flow-based decisions in order to, for
example, favor detouring of long flows, short flows, or flows from certain users.
Probabilistic detouring: Detouring can be used to provide different delays to different priorities of
traffic. A switch can detour packets with different probabilities based on current buffer occupancy
and packet priority. When the buffer is lightly loaded, the switch may only detour some of the
lowest priority traffic to reserve room for higher-priority packets. As the buffer fills, the switch
detours more classes of traffic with higher probability. By detouring different traffic with different
probabilities, we essentially use a group of FIFO queues at different switches to approximate a
priority queue at a single switch.
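One way to realize such a policy is sketched below; the per-class thresholds and the shape of the probability ramp are placeholders rather than tuned values.

    #include <cstdio>
    #include <random>

    // Detour with a probability that grows with buffer occupancy; lower-priority
    // classes (larger numbers) start detouring at lower occupancy.
    bool should_detour(int queue_len, int capacity, int priority, std::mt19937& rng) {
        double occupancy = double(queue_len) / capacity;
        double start = 0.9 - 0.1 * priority;          // illustrative per-class threshold
        if (start < 0.0) start = 0.0;
        if (occupancy < start) return false;
        double prob = (occupancy - start) / (1.0 - start);   // ramps up to 1 when full
        return std::bernoulli_distribution(prob)(rng);
    }

    int main() {
        std::mt19937 rng(1);
        // At 95% occupancy, priority-2 traffic is detoured more often than priority-0.
        int hi = 0, lo = 0;
        for (int i = 0; i < 1000; ++i) {
            hi += should_detour(95, 100, 0, rng);
            lo += should_detour(95, 100, 2, rng);
        }
        std::printf("detours out of 1000: priority 0 = %d, priority 2 = %d\n", hi, lo);
        return 0;
    }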
4.7 Related Work
Data center networking is an active area of research. Here, we only discuss recent ideas, closely
related to DIBS.
Hot potato routing: DIBS' main ancestor is hot potato (or deflection) routing [60], which
established the basic premise of forwarding a packet to another node if it cannot be buffered while in
transit. DIBS is the first to explore how this idea is uniquely suited to data center environments
and workloads. There are several theoretical frameworks [70, 123, 131] designed to evaluate hot
potato routing that can also be used as the basis for a theoretical analysis of DIBS.
Ethernet flow control: Hop-by-hop Ethernet flow control is designed for building lossless
Ethernet networks. (Infiniband [23] uses similar ideas, although it is a different L2 technology.)
When the buffer of a switch gets full, it pauses its upstream switch, and the
pause message eventually cascades to the sender. Priority flow control (PFC) [85, 3] expands
on this basic idea to provide flow control for individual classes of service. PFC is leveraged by
protocols like RoCE [39] and DeTail [158].
Like DIBS, Ethernet flow control may be viewed as a mechanism to implicitly allow switches
to share buffers: when buffer usage at a switch exceeds a certain threshold, the pause message
causes packets to queue up at the upstream switch.
However, DIBS does not guarantee a lossless network; it only minimizes losses in case of bursty
traffic. Specialized settings like high-performance compute clusters and storage area networks may
need a lossless L2 layer. However, typical data center
applications and transport protocols (e.g., TCP) are designed to tolerate occasional packet loss.
Ethernet flow control can be difficult to tune. To avoid buffer overflow, a switch must send a
pause message before its buffer actually fills up, since message propagation and processing takes
time. Calculation of this threshold must account for cable lengths and switch architecture [2, 158].
To avoid buffer underflow, the pause duration must also be calculated carefully. In contrast, DIBS,
with its random detouring strategy, has no parameters and thus requires no tuning.
DIBS also offers more flexibility than Ethernet flow control. With Ethernet flow control, buffer
sharing happens only between a switch and its upstream neighbors. DIBS can redirect packets to
any neighbor, including downstream ones.
DIBS is also free of problems such as deadlock [58], as we do not require any host or switch to
stop transmitting.
Equal-cost multi-path (ECMP): ECMP spreads packets between a source and a destination
across multiple routes, which is essential to data center topologies such as VL2 [91] and FatTree [46].
ECMP may be seen as a form of implicit buffer sharing among switches, since it splits traffic along
multiple paths. However, ECMP and DIBS differ in several respects.
First, ECMP typically operates at flow level, while DIBS operates at packet level, achieving
finer-grained buffer sharing at the expense of some packet reordering. While packet level ECMP
has been proposed [80], it is not widely used. Second, DIBS spreads packets based on network
load, not path length. While load aware ECMP has been proposed, it often requires complex
centralized route management and hence is not practical. Third, ECMP, as the name implies, is
limited to using equal cost paths. DIBS has no such restrictions. Most importantly, ECMP cannot
provide succor in some traffic scenarios, such as incast. When multiple flows converge on a single
receiver and the edge switch becomes a bottleneck, even packet-level, load-aware routing [74] will
not help, while DIBS can.
Using ECMP does not rule out using DIBS. ECMP would do coarse-granularity load-spreading,
while DIBS helps out on a shorter timescale. Indeed, in all the experiments shown in §4.5, we
used DIBS with flow-level ECMP.
Multipath TCP (MPTCP): MPTCP [139] is a transport protocol that works with ECMP to
ensure better load spreading. DIBS can co-exist with MPTCP.
Centralized traffic management systems: Centralized traffic management systems [47, 62,
152, 154, 158] collect traffic information in data centers and coordinate the hosts or switches to
optimize flow scheduling. Because collecting traffic information and disseminating decisions takes
time, the centralized controller can only manage coarse-grained traffic
at large timescales. DIBS can complement such systems by mitigating packet losses arising from
short-term behavior of flows that the centralized schemes cannot fully control.
Other transport protocols: In this paper, we coupled DIBS with DCTCP. It is possible to
combine DIBS with other DCN transport protocols [155] as well, as long as the requirements in §4.3
are met.
Detouring in other settings: The concept of detouring has been explored in scenarios as diverse
as on-chip networks [127], optical networks [71], overlay networks [53, 93, 65, 144] and fast failure
recovery schemes [113, 119, 118]. Like DIBS, flyways [102] rely on the fact that data center
networks usually have sufficient capacity but experience localized congestion. Flyways provide
spot relief using one-hop wireless detours.
4.8 Conclusion
In this paper, we proposed detour-induced buffer sharing, which uses available buffer capacity in
the network to handle sudden flashes of congestion. With DIBS, when a switch’s buffer towards a
destination is full, instead of dropping packets, it detours them to neighboring switches, achieving
a near lossless network. In effect, DIBS provides a temporary virtual infinite buffer to deal with
sudden congestion.
We implemented and evaluated DIBS using NetFPGA, Click and NS-3, under a wide range of
conditions. Our evaluation shows that DIBS can handle bursty traffic without interfering with
the transport protocols' ability to regulate routine traffic. This simple scheme is just the starting point for what we
believe we can realize using more sophisticated detouring schemes.
Chapter 5
Conclusions and Future Directions
In this thesis, we have argued that by leveraging the emerging programmability in hardware
switches, we can fully scale out cloud traffic management functions. The key principle is to build
traffic management functions directly in hardware switches. In the face of limited resources and
limited programmability in switches, we design algorithms and data structures, and combine
software and hardware, to work around those limitations. By doing so, we achieve both high
performance and high cost-effectiveness. Specifically, we have scaled out three major traffic
management functions. In Chapter 2, we present how to leverage emerging programmability in
switching ASICs to build a fast and cheap layer-4 load balancer. We show that maintaining the
connection state for millions of active connections in hardware switches is feasible. We ensure
that each connection always maps to the same server even with frequent service changes, at the cost
of only 256 bytes of counters. In Chapter 3, we present the world's first comprehensive measurement
study of the characteristics of network attacks both towards and from the cloud. We have identified
nine types of common attacks in the cloud and quantified their prevalence, complexity, intensity, and
characteristics. We then leverage these observations to propose a novel attack mitigation system
called Nimbus. Nimbus builds flexible sampling and filtering from primitives in hardware switches
and leverages VMs to perform customized attack detection. Our evaluation shows that Nimbus can
accurately detect various attacks in a cost-effective manner. In Chapter 4, we present a mechanism
called DIBS that allows switches to share their buffers by detouring excess packets to neighboring
switches. Our evaluation shows that DIBS avoids packet drops even under large traffic bursts such
as incast.
Next, we present future directions for scaling out other services and applications in
cloud datacenters.
5.1 Extending Scaling-out Traffic Management
Placing computing functionality beyond packet processing onto network devices can achieve up to
three orders of magnitude better cost-efficiency in price and power, and this approach can potentially
be extended to a wide range of functions across the stack. There are many other traffic management
functions, such as deep packet inspection, firewalls, and traffic shaping. To support those functions,
network operators today instantiate a combination of physical and virtual appliances, which typically
have low capacity and high cost. In the face of growing traffic volumes and traffic anomalies, those
functions will need to scale out in the near future.
In the meanwhile, there are emerging network devices with both high performance and increasing
programmability, such as network processing units, programmable NICs, FPGAs, and programmable
switching ASICs. Those devices provide one to three orders of magnitude higher capacity
than traditional proprietary appliances and CPUs. If we can leverage those programmable devices
to rebuild traffic management functions, we can achieve significantly better cost-efficiency.
In order to leverage those programmable devices, we need to analyze each function's components
closely and enable the right division of labor between software and hardware, so as to achieve both
high-performance processing in hardware and flexible management operations in software. The
need for improving traffic management will in turn push device and chip vendors to
enhance those programmable devices for better application support.
5.2 Scaling-out Cloud Services using Programmable Data Plane
Managing cloud services is challenging due to their large and constantly growing traffic. Network
infrastructure and management services need to scale their capacity with such traffic growth,
or performance will suffer. Thus, today's cloud operators deploy many service management
functions, ranging from network-level load balancing, which splits user requests for a service among a
group of servers, to application-level distributed storage, which provides a rich set of data access
abilities for many other applications. Existing research on scaling out service management functions
has so far focused on distributed software, using commodity servers, software stacks, and application-
level protocols to provide high scalability and fault tolerance. However, the
complexity of distributed software, coupled with highly dynamic network environments, limits
the efficiency of these approaches.
Based on our experience in traffic management, we take a different approach, building a pro-
grammable data plane to tackle this question: can we provide scale-out service man-
agement functions that support the full throughput of a data center? We leverage the emerging
technical trend of programmable switches, including line-rate packet processing to address the
scaling challenge, and a rich set of new data-plane primitives (e.g., hashing, key-value stores,
stateful operations, and in-band telemetry) to address the function diversity challenge.
In this thesis, we have studied scaling layer-4 load balancing as an example. L4 load balancing is
a critical function for cloud services, and it raises the challenge of how to scale to the full bisection
bandwidth while ensuring per-connection consistency under high dynamism. Existing solutions using
software load balancers cannot scale to full bisection traffic, and simply splitting traffic using ECMP at
switches cannot ensure per-connection consistency at low cost. We propose to use new primitives
available in today's fast and cheap switching ASICs for load balancing. Our evaluation with
production traffic traces shows that this approach can scale to millions of connections with existing SRAM
sizes and inherit all the benefits of line-rate processing, while ensuring per-connection consistency for a
variety of traffic and update settings.
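The per-connection consistency property can be summarized as a connection table consulted before any pool lookup; the sketch below is a software illustration of that idea (not the switch pipeline), with hypothetical names throughout.

    #include <cstdint>
    #include <cstdio>
    #include <string>
    #include <unordered_map>
    #include <vector>

    // The first packet of a connection picks a backend from the current pool;
    // every later packet hits the connection table, so pool updates never remap
    // existing connections (per-connection consistency).
    struct LoadBalancer {
        std::vector<std::string> pool;                          // current backends (DIPs)
        std::unordered_map<uint64_t, std::string> conn_table;   // 5-tuple hash -> DIP

        const std::string& lookup(uint64_t conn_hash) {
            auto it = conn_table.find(conn_hash);
            if (it == conn_table.end())
                it = conn_table.emplace(conn_hash, pool[conn_hash % pool.size()]).first;
            return it->second;                                  // same backend ever after
        }
    };

    int main() {
        LoadBalancer lb{{"10.0.0.1", "10.0.0.2"}, {}};
        std::printf("conn 42 -> %s\n", lb.lookup(42).c_str());
        lb.pool.push_back("10.0.0.3");                          // pool update
        std::printf("conn 42 -> %s (unchanged)\n", lb.lookup(42).c_str());
        return 0;
    }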
In the future, we can scale other cloud services using the programmable data plane. More specifically,
we will work on scaling transactional operations in distributed systems. Transactional operations
are essential for many distributed systems that require strong consistency and fault tolerance. A critical
bottleneck for transactional operations is distributed concurrency control, due to highly dynamic
and unreliable network environments. The programmable data plane can improve performance
but raises many new challenges, including enabling the right division of labor between servers and
switches to achieve both high-performance packet processing in hardware and flexible management
operations in software, and localizing problems in complex systems coupled with
domain-specific protocols over highly contended network resources. We can propose a new switch
abstraction for programming network functions to run coordination and stateful operations at
line rate, a switch-centric concurrency control protocol to achieve both strong consistency and
high performance, and a fault isolation framework that enables state replication in switches for
fast and transparent failover.
Bibliography
[1] http://www.everestgrp.com/2015-04-40-billion-global-cloud-services-market-expected-to-
grow-27-percent-per-annum-for-next-3-years-press-release-17218.html.
[2] http://www.cisco.com/en/US/netsol/ns669/networking_solutions_solution_segment_
home.html.
[3] 802.11p. http://en.wikipedia.org/wiki/IEEE_802.11p.
[4] Amazon web services. http://aws.amazon.com/.
[5] Amazon Web Services: Overview of Security Processes. https://d0.awsstatic.com/
whitepapers/Security/AWS%20Security%20Whitepaper.pdf.
[6] Arista 7050qx-32. http://www.aristanetworks.com/media/system/pdf/Datasheets/7050QX-
32_Datasheet.pdf.
[7] AWS Best Practices for DDoS Resiliency. https://d0.awsstatic.com/whitepapers/DDoS_
White_Paper_June2015.pdf.
[8] AWS Security Best Practices. https://d0.awsstatic.com/whitepapers/aws-security-best-
practices.pdf.
[9] Barefoot tofino: programmable switch series up to 6.5tbps.
[10] Bidirectional forwarding detection (bfd).
[11] Broadcom smart-hash technology.
[12] The broadcom strataxgs bcm56970 tomahawk ii switch series.
[13] Cavium xpliant™ ethernet switch product family.
[14] A differentiated service two-rate, three-color marker with efficient handling of in-profile
traffic.
[15] The emulab project. http://emulab.net/.
[16] Explicit congestion notification. http://tools.ietf.org/rfc/rfc3168.txt.
[17] High capacity strataxgs®trident ii ethernet switch series.
[18] High-density 25/100 gigabit ethernet strataxgs tomahawk ethernet switch series.
[19] http://nmap.org/book/man-port-scanning-techniques.html.
[20] https://goo.gl/TJd4ch.
[21] https://www.arbornetworks.com/ddos-protection-products.
[22] https://www.arbornetworks.com/report/.
[23] Infiniband. http://en.wikipedia.org/wiki/Infiniband.
[24] Intel flexpipe.
[25] Intel product specifications.
[26] Introducing data center fabric, the next-generation facebook data center network.
[27] iperf. http://iperf.fr.
[28] Load-balancer-as-a-service configuration options.
[29] Mellanox spectrum™ ethernet switch.
[30] Microsoft Azure Network Security Whitepaper. http://blogs.msdn.com/b/azuresecurity/
archive/2015/03/03/microsoft-azure-network-security-whitepaper-version-3-is-now-
available.aspx.
[31] The netfpga project. http://netfpga.org/.
[32] Nginx.
[33] The ns-3 project. http://www.nsnam.org.
[34] ns-3 simulator. www.nsnam.org.
[35] Nsx distributed load balancing.
[36] Open-source p4 implementation of features typical of an advanced l2/l3 switch.
[37] A quantitative measure of fairness and discrimination for resource allocation in shared
computer systems. In ICNP.
[38] Random early drop. http://tools.ietf.org/rfc/rfc2481.txt.
[39] Rdma over converged ethernet. http://en.wikipedia.org/wiki/RDMA_over_Converged_
Ethernet.
[40] Tcp new reno. http://tools.ietf.org/html/rfc6582.
[41] https://msdn.microsoft.com/en-us/library/azure/ee758711.aspx.
[42] juno.c. http://goo.gl/i1Qodc, 2013.
[43] Q4 2013 global ddos attack report. http://goo.gl/lIyRmK, 2013.
[44] Quova. http://www.quova.com, 2013.
[45] J. H. Ahn, N. Binkert, A. Davis, M. McLaren, and R. S. Schreiber. HyperX: Topology,
routing, and packaging of efficient large-scale networks. In SC, 2009.
[46] Mohammad Al-Fares, Alexander Loukissas, and Amin Vahdat. A scalable, commodity data
center network architecture. In SIGCOMM, 2008.
[47] Mohammad Al-Fares, Sivasankar Radhakrishnan, Barath Raghavan, Nelson Huang, and
Amin Vahdat. Hedera: Dynamic flow scheduling for data center networks. In NSDI, pages
281–296, 2010.
[48] Sardar Ali, Irfan Ul Haq, Sajjad Rizvi, Naurin Rasheed, Unum Sarfraz, Syed Ali Khayam,
and Fauzan Mirza. On Mitigating Sampling-induced Accuracy Loss in Traffic Anomaly
Detection Systems. ACM SIGCOMM Computer Communication Review 2010.
[49] Mohammad Alizadeh, Albert Greenberg, David A Maltz, Jitendra Padhye, Parveen Patel,
Balaji Prabhakar, Sudipta Sengupta, and Murari Sridharan. Data center tcp (dctcp). In
ACM SIGCOMM 2010.
[50] Mohammad Alizadeh, Abdul Kabbani, Tom Edsall, Balaji Prabhakar, Amin Vahdat, and
Masato Yasuda. Less is more: Trading a little bandwidth for ultra-low latency in the
data center. In Proceedings of the USENIX Symposium on Networked Systems Design and
Implementation, NSDI ’12, San Jose, CA, USA, 2012. USENIX.
[51] Mohammad Alizadeh, Shuang Yang, Milad Sharif, Sachin Katti, Nick McKeown, Balaji
Prabhakar, and Scott Shenker. pfabric: Minimal near-optimal datacenter transport. In ACM
SIGCOMM ’13, 2013.
[52] Mark Allman, Vern Paxson, and Jeff Terrell. A Brief History of Scanning. In IMC, 2007.
[53] David Andersen, Hari Balakrishnan, Frans Kaashoek, and Robert Morris. Resilient overlay
networks. In SOSP, 2001.
[54] Andrew Marshall, Michael Howard, Grant Bugher, Brian Harden. Security Best Practices For
Developing Windows Azure Applications. http://download.microsoft.com/documents/uk/
enterprise/88_security_best_practices_for_developing_windows_azure_applicat.pdf.
[55] Ion Androutsopoulos, John Koutsias, Konstantinos V Chandrinos, George Paliouras, and
Constantine D Spyropoulos. An Evaluation of Naive Bayesian Anti-Spam Filtering. Pro-
ceedings of the workshop on Machine Learning in the New Information Age, 11th European
Conference on Machine Learning, 2000.
[56] C. Anley. Advanced SQL Injection in SQL Server Applications. In Next Generation Security
Software Ltd, 2002.
[57] Arbor Networks. Insight into the global threat landscape. http://goo.gl/15oOx3, February
2013.
[58] Sven-Arne Reinemo and Tor Skeie. Ethernet as a lossless deadlock free system area network.
In in Parallel and Distributed Processing and Applications: Third International Symposium,
ISPA 2005, pages 2–5. Springer Berlin/Heidelberg.
[59] Sruthi Bandhakavi, Prithvi Bisht, P. Madhusudan, and V. N. Venkatakrishnan. CANDID:
Preventing SQL injection attacks using dynamic candidate evaluations. ACM CCS, 2007.
[60] Paul Baran. On distributed communications networks. Communications Systems, IEEE
Transactions on, 12(1):1–9, 1964.
[61] M. Basseville and I.V. Nikiforov. Detection of Abrupt Changes: Theory and Application.
Prentice Hall Englewood Cliffs, 1993.
[62] Theophilus Benson, Ashok Anand, Aditya Akella, and Ming Zhang. Microte: fine grained
traffic engineering for data centers. In CoNEXT, page 8, 2011.
[63] Bloomberg. Sony Network Breach Shows Amazon Clould’s Appeal for Hackers. http:
//goo.gl/3WiAaj, 2011.
[64] Gehana Booth, Andrew Soknacki, and Anil Somayaji. Cloud Security: Attacks and Current
Defenses. In ASIA, 2013.
[65] Claudson Bornstein, Tim Canfield, and Gary Miller. Akarouting: A better way to go. In
MIT OpenCourseWare 18.996, 2002.
[66] Pat Bosshart, Dan Daly, Glen Gibb, Martin Izzard, Nick McKeown, Jennifer Rexford,
Cole Schlesinger, Dan Talayco, Amin Vahdat, George Varghese, et al. P4: Programming
protocol-independent packet processors. ACM SIGCOMM Computer Communication Review,
2014.
[67] Pat Bosshart, Glen Gibb, Hun-Seok Kim, George Varghese, Nick McKeown, Martin Izzard,
Fernando Mujica, and Mark Horowitz. Forwarding metamorphosis: Fast programmable
match-action processing in hardware for sdn. In ACM SIGCOMM Computer Communication
Review, 2013.
[68] Daniela Brauckhoff, Bernhard Tellenbach, Arno Wagner, Martin May, and Anukool Lakhina.
Impact of packet sampling on anomaly detection metrics. In IMC, 2006.
[69] bro. https://www.bro.org/.
[70] Costas Busch, Maurice Herlihy, and Roger Wattenhofer. Routing without flow control. In
SPAA, pages 11–20, 2001.
[71] A. Busic, M. Ben Mamoun, and J.-M. Fourneau. Modeling fiber delay loops in an all optical
switch. In international Conference on the Quantitative Evaluation of Systems, 2006.
[72] Juan Caballero, Chris Grier, Christian Kreibich, and Vern Paxson. Measuring pay-per-install:
The commoditization of malware distribution. In USENIX Conference on Security, 2011.
[73] Enrico Cambiaso, Gianluca Papaleo, and Maurizio Aiello. Taxonomy of Slow DoS Attacks
to Web Applications. In TCNDSS. Springer, 2012.
[74] Jiaxin Cao, Rui Xia, Pengkun Yang, Chuanxiong Guo, Guohan Lu, Lihua Yuan, Yixin
Zheng, Haitao Wu, Yongqiang Xiong, and Dave Maltz. Per-packet load-balanced, low-latency
routing for clos-based data center networks. CoNEXT ’13. ACM, 2013.
[75] Xavier Carreras and Lluis Marquez. Boosting Trees for Anti-Spam Email Filtering. Proceed-
ings of RANLP, 2001.
[76] Ronnie Chaiken, Bob Jenkins, Per-Ake Larson, Bill Ramsey, Darren Shakib, Simon Weaver,
and Jingren Zhou. SCOPE: Easy and Efficient Parallel Processing of Massive Data Sets.
VLDB’08.
[77] Yanpei Chen, Rean Griffith, Junda Liu, Randy H. Katz, and Anthony D. Joseph. Under-
standing tcp incast throughput collapse in datacenter networks. In WREN, pages 73–82,
2009.
[78] Jeffrey Dean and Sanjay Ghemawat. Mapreduce: simplified data processing on large clusters.
Communications of the ACM, 2008.
[79] X. Dimitropoulos, D. Krioukov, G. Riley, and k. claffy. Revealing the Autonomous System
Taxonomy: The Machine Learning Approach. In Passive and Active Network Measurement
Workshop (PAM), 2006.
[80] Advait Dixit, Pawan Prakash, and Ramana Rao Kompella. On the efficacy of fine-grained
traffic splitting protocols in data center networks. In SIGCOMM poster, 2011.
[81] Norman Richard Draper, Harry Smith, and Elizabeth Pownell. Applied regression analysis.
Wiley New York, 1966.
[82] Daniel E. Eisenbud, Cheng Yi, Carlo Contavalli, Cody Smith, Roman Kononov, Eric Mann-
Hielscher, Ardas Cilingiroglu, Bin Cheyney, Wentao Shang, and Jinnah Dylan Hosein. Maglev:
A fast and reliable software network load balancer. In NSDI, 2016.
[83] Bin Fan, Dave G Andersen, Michael Kaminsky, and Michael D Mitzenmacher. Cuckoo filter:
Practically better than bloom. In ACM CoNEXT 2014.
[84] Seyed K Fayaz, Yoshiaki Tobioka, Vyas Sekar, and Michael Bailey. Bohatei: flexible and
elastic ddos defense. In USENIX Security, 2015.
[85] Oliver Feuser and Andre Wenzel. On the effects of the ieee 802.3x flow control in full-duplex
ethernet lans. In LCN, pages 160–, 1999.
[86] Rohan Gandhi, Hongqiang Harry Liu, Y Charlie Hu, Guohan Lu, Jitendra Padhye, Lihua
Yuan, and Ming Zhang. Duet: Cloud scale load balancing with hardware and software. In
Proceedings of the 2014 ACM conference on SIGCOMM.
[87] Peter X. Gao, Akshay Narayan, Sagar Karandikar, Joao Carreira, Sangjin Han, Rachit Agar-
wal, Sylvia Ratnasamy, and Scott Shenker. Network requirements for resource disaggregation.
In USENIX OSDI, 2016.
[88] Gartner. https://www.gartner.com/newsroom/id/3815165.
[89] Google. Malware Distribution by Autonomous System. http://goo.gl/mZQeG4, 2013.
[90] Ramesh Govindan, Ina Minei, Mahesh Kallahalla, Bikash Koley, and Amin Vahdat. Evolve
or die: High-availability design principles drawn from googles network infrastructure. In
ACM SIGCOMM 2016.
[91] Albert Greenberg, James R Hamilton, Navendu Jain, Srikanth Kandula, Changhoon Kim,
Parantap Lahiri, David A Maltz, Parveen Patel, and Sudipta Sengupta. Vl2: a scalable and
flexible data center network. In ACM SIGCOMM computer communication review, 2009.
[92] Chris Grier, Lucas Ballard, et al. Manufacturing Compromise: The Emergence of Exploit-
As-A-Service. In CCS, 2012.
[93] Krishna P. Gummadi, Harsha V. Madhyastha, Steven D. Gribble, Henry M. Levy, and David
Wetherall. Improving the reliability of Internet paths with one-hop source routing. In OSDI,
2004.
[94] Chuanxiong Guo, Lihua Yuan, Dong Xiang, Yingnong Dang, Ray Huang, Dave Maltz, Zhaoyi
Liu, Vin Wang, Bin Pang, Hua Chen, et al. Pingmesh: A large-scale system for data center
network latency measurement and analysis. ACM SIGCOMM Computer Communication
Review, 2015.
[95] Nikhil Handigol, Srinivasan Seetharaman, Mario Flajslik, Nick McKeown, and Ramesh
Johari. Plug-n-serve: Load-balancing web traffic using openflow. ACM SIGCOMM Demo,
2009.
[96] Laurens Hellemons. Flow-based detection of ssh intrusion attempts. Scanning, 2012.
[97] Chi-Yao Hong, Matthew Caesar, and Brighten Godfrey. Finishing flows quickly with
preemptive scheduling. In SIGCOMM, pages 127–138, 2012.
[98] C Hopps. Analysis of an equal-cost multi-path algorithm, 2000.
[99] Mobin Javed and Vern Paxson. Detecting Stealthy, Distributed SSH Brute-forcing. CCS,
2013.
[100] Lavanya Jose, Lisa Yan, George Varghese, and Nick McKeown. Compiling packet programs
to reconfigurable switches. In NSDI 15.
[101] Srikanth Kandula and Ratul Mahajan. Sampling biases in network path measurements and
what to do about it. In IMC, pages 156–169, 2009.
[102] Srikanth Kandula, Jitendra Padhye, and Paramvir Bahl. Flyways to de-congest data center
networks. In HotNets, 2009.
[103] Srikanth Kandula, Sudipta Sengupta, Albert Greenberg, Parveen Patel, and Ronnie Chaiken.
The nature of data center traffic: measurements & analysis. In Proceedings of the 9th ACM
SIGCOMM conference on Internet measurement conference, 2009.
[104] Nanxi Kang, Monia Ghobadi, John Reumann, Alexander Shraer, and Jennifer Rexford.
Efficient traffic splitting on commodity switches. In ACM CoNEXT, 2015.
[105] Chris Kanich, Nicholas Weaver, Damon McCoy, Tristan Halvorson, Christian Kreibich, Kirill
Levchenko, Vern Paxson, Geoffrey M Voelker, and Stefan Savage. Show Me the Money:
Characterizing Spam-Advertised Revenue. In USENIX SEC, 2011.
[106] Antoine Kaufmann, SImon Peter, Naveen Kr Sharma, Thomas Anderson, and Arvind
Krishnamurthy. High performance packet processing with flexnic. In ACM SIGPLAN
Notices. ACM, 2016.
[107] R. Kawahara, K. Ishibashi, T. Mori, N. Kamiyama, S. Harada, and S. Asano. Detection
accuracy of network anomalies using sampled flow statistics. In GLOBECOM ’07. IEEE.
[108] Eddie Kohler, Robert Morris, Benjie Chen, John Jannotti, and M. Frans Kaashoek. The
Click modular router. ACM Transactions on Computer Systems, August 2000.
[109] Teemu Koponen, Keith Amidon, Peter Balland, Martín Casado, Anupam Chanda, Bryan Ful-
ton, Igor Ganichev, Jesse Gross, Paul Ingram, Ethan J Jackson, et al. Network virtualization
in multi-tenant datacenters. In NSDI, 2014.
[110] Aleksandar Kuzmanovic and Edward W Knightly. Low-Rate TCP-targeted Denial of Service
Attacks: The Shrew vs. The Mice and Elephants. In ACM ATAPCC, 2003.
[111] Anukool Lakhina, Mark Crovella, and Christophe Diot. Diagnosing network-wide traffic
anomalies. In SIGCOMM, 2004.
[112] Anukool Lakhina, Mark Crovella, and Christophe Diot. Mining anomalies using traffic
feature distributions. In SIGCOMM, 2005.
[113] Karthik Lakshminarayanan, Matthew Caesar, Murali Rangan, Thomas Anderson, Scott
Shenker, and Ion Stoica. Achieving convergence-free routing using failure-carrying packets.
In SIGCOMM, 2007.
[114] Leslie Lamport. Interprocess communication. Technical report, DTIC Document, 1985.
[115] Jeongkeun Lee, Rui Miao, Changhoon Kim, Minlan Yu, and Hongyi Zeng. Stateful layer-4
load balancing in switching asics. In ACM SIGCOMM demo, 2017.
[116] Bojie Li, Kun Tan, Layong Larry Luo, Yanqing Peng, Renqian Luo, Ningyi Xu, Yongqiang
Xiong, and Peng Cheng. Clicknp: Highly flexible and high-performance network processing
with reconfigurable hardware. In Proceedings of the 2016 conference on ACM SIGCOMM
2016 Conference. ACM, 2016.
[117] Zhou Li, Sumayah Alrwais, Yinglian Xie, Fang Yu, and XiaoFeng Wang. Finding the
Linchpins of the Dark Web: A Study on Topologically Dedicated Hosts on Malicious Web
Infrastructures. In Security and Privacy (SP), IEEE Symposium on, 2013.
[118] Junda Liu, Aurojit Panda, Ankit Singla, Brighten Godfrey, Michael Schapira, and Scott
Shenker. Ensuring connectivity via data plane mechanisms. In NSDI, 2013.
[119] Suksant Sae Lor, Raul Landa, and Miguel Rio. Packet re-cycling: Eliminating packet losses
due to network failures. In HotNets, 2010.
[120] Ming Mao and Marty Humphrey. A performance study on the vm startup time in the cloud.
In IEEE International Conference on Cloud Computing, 2012.
[121] Z. Morley Mao, Vyas Sekar, Oliver Spatscheck, Jacobus van der Merwe, and Rangarajan
Vasudevan. Analyzing large DDoS attacks using multiple data sources. In SIGCOMM
Workshop on Large-scale Attack Defense, 2006.
[122] Joao Martins, Mohamed Ahmed, Costin Raiciu, Vladimir Olteanu, Michio Honda, Roberto
Bifulco, and Felipe Huici. Clickos and the art of network function virtualization. In USENIX
NSDI, 2014.
[123] Nicholas F. Maxemchuk. Comparison of deflection and store-and-forward techniques in the
manhattan street and shuffle-exchange networks. In INFOCOM, pages 800–809, 1989.
[124] Jelena Mirkovic and Peter Reiher. A Taxonomy of DDoS Attack and DDoS Defense
Mechanisms. SIGCOMM CCR, 2004.
[125] Michael Mitzenmacher. The power of two choices in randomized load balancing. IEEE
Transactions on Parallel and Distributed Systems, 2001.
[126] David Moore, Colleen Shannon, Douglas J Brown, Geoffrey M Voelker, and Stefan Savage.
Inferring Internet Denial-Of-Service Activity. ACM Transactions on Computer Systems,
2006.
[127] Thomas Moscibroda and Onur Mutlu. A case for bufferless routing in on-chip networks. In
ISCA, 2009.
[128] Craig Nelson. Best practices to protect your azure deployment against “cloud drive-
by” attacks. http://blogs.msdn.com/b/azuresecurity/archive/2015/07/05/best-practices-to-
protect-your-azure-deployment-against-cloud-drive-by-attacks.aspx.
[129] F5 Networks. Big-ip. https://f5.com/products/big-ip.
[130] Radhika Niranjan Mysore, Andreas Pamboris, Nathan Farrington, Nelson Huang, Pardis Miri,
Sivasankar Radhakrishnan, Vikram Subramanya, and Amin Vahdat. Portland: a scalable
fault-tolerant layer 2 data center network fabric. In Proceedings of the ACM SIGCOMM
2009 conference on Data communication, SIGCOMM ’09, pages 39–50, New York, NY, USA,
2009. ACM.
[131] George Nychis, Chris Fallin, Thomas Moscibroda, Onur Mutlu, and Srinivasan Seshan.
On-chip networks from a networking perspective: congestion and scalability in many-core
interconnects. In SIGCOMM, pages 407–418, 2012.
[132] Rasmus Pagh and Flemming Friche Rodler. Cuckoo hashing. Journal of Algorithms, 2004.
[133] Parveen Patel, Deepak Bansal, Lihua Yuan, Ashwin Murthy, Albert Greenberg, David A
Maltz, Randy Kern, Hemant Kumar, Marios Zikos, and Hongyu Wu. Ananta: Cloud scale
load balancing. In ACM SIGCOMM Computer Communication Review, 2013.
[134] Parveen Patel, Deepak Bansal, Lihua Yuan, Ashwin Murthy, Albert Greenberg, David A
Maltz, Randy Kern, Hemant Kumar, Marios Zikos, Hongyu Wu, et al. Ananta: Cloud Scale
Load Balancing. In ACM SIGCOMM, 2013.
[135] V. Paxson, G. Almes, J. Mahdavi, and M. Mathis. Framework for IP performance metrics.
In RFC 2330, 1998.
[136] Andreas Pitsillidis, Chris Kanich, et al. Taster’s Choice: A Comparative Analysis of Spam
Feeds. In IMC, 2012.
[137] Rahul Potharaju, Navendu Jain, and Cristina Nita-Rotaru. Juggling the jigsaw: Towards
automated problem inference from network trouble tickets. In NSDI’13.
[138] Sivasankar Radhakrishnan, Yilong Geng, Vimalkumar Jeyakumar, Abdul Kabbani, George
Porter, and Amin Vahdat. Senic: scalable nic for end-host rate limiting. In 11th USENIX
Symposium on Networked Systems Design and Implementation (NSDI 14), 2014.
[139] Costin Raiciu, Sébastien Barré, Christopher Pluntke, Adam Greenhalgh, Damon Wischik,
and Mark Handley. Improving datacenter performance and robustness with multipath tcp.
In SIGCOMM, pages 266–277, 2011.
[140] Anirudh Ramachandran and Nick Feamster. Understanding the network-level behavior of
spammers. In SIGCOMM, 2006.
[141] Mark Reitblatt, Nate Foster, Jennifer Rexford, Cole Schlesinger, and David Walker. Ab-
stractions for network update. SIGCOMM ’12. ACM.
[142] Ben Ridgway. Security best practices for windows azure solutions. Azure Manual, 2014.
[143] Arjun Roy, Hongyi Zeng, Jasmeet Bagga, George Porter, and Alex C. Snoeren. Inside the
social network’s (datacenter) network. SIGCOMM ’15. ACM.
[144] Stefan Savage, Andy Collins, Eric Hoffman, John Snell, and Thomas Anderson. The
end-to-end effects of Internet path selection. In SIGCOMM, 1999.
[145] Vyas Sekar, Nick G Duffield, Oliver Spatscheck, Jacobus E van der Merwe, and Hui Zhang.
LADS: Large-scale automated DDoS detection system. In USENIX ATC, 2006.
[146] Amazon Web Services.
[147] Arjun Singh, Joon Ong, Amit Agarwal, Glen Anderson, Ashby Armistead, Roy Bannon,
Seb Boving, Gaurav Desai, Bob Felderman, Paulie Germano, Anand Kanagala, Jeff Provost,
Jason Simmons, Eiichi Tanda, Jim Wanderer, Urs Hölzle, Stephen Stuart, and Amin Vahdat.
Jupiter rising: A decade of clos topologies and centralized control in google’s datacenter
network. In ACM SIGCOMM ’15, 2015.
[148] Ankit Singla, Chi yao Hong, Lucian Popa, and P. Brighten Godfrey. Jellyfish: Networking
data centers, randomly.
[149] Snort. http://www.snort.org/.
136
[150] Brett Stone-Gross, Thorsten Holz, Gianluca Stringhini, and Giovanni Vigna. The un-
derground economy of spam: A botmaster’s perspective of coordinating large-scale spam
campaigns. USENIX Workshop on Large-Scale Exploits and Emergent Threats (LEET),
2011.
[151] Vijay Vasudevan, Amar Phanishayee, Hiral Shah, Elie Krevat, David G. Andersen, Gre-
gory R. Ganger, Garth A. Gibson, and Brian Mueller. Safe and effective fine-grained tcp
retransmissions for datacenter communication. In SIGCOMM, pages 303–314, 2009.
[152] Bhanu Chandra Vattikonda, George Porter, Amin Vahdat, and Alex C. Snoeren. Practical
tdma for datacenter ethernet. In EuroSys, pages 225–238, 2012.
[153] Richard Wang, Dana Butnariu, and Jennifer Rexford. Openflow-based server load balancing
gone wild. Hot-ICE, 2011.
[154] Christo Wilson, Hitesh Ballani, Thomas Karagiannis, and Ant Rowtron. Better never than
late: meeting deadlines in datacenter networks. In Proceedings of the ACM SIGCOMM 2011
conference, SIGCOMM ’11, pages 50–61, New York, NY, USA, 2011. ACM.
[155] HaitaoWu, ZhenqianFeng, ChuanxiongGuo, andYongguangZhang. Ictcp: Incastcongestion
control for tcp in data center networks. In CoNEXT, page 13, 2010.
[156] Vinod Yegneswaran, Paul Barford, and Somesh Jha. Global intrusion detection in the domino
overlay system. In NDSS, 2004.
[157] Vinod Yegneswaran, Paul Barford, and Johannes Ullrich. Internet intrusions: Global
characteristics and prevalence. In ACM SIGMETRICS 2003.
[158] David Zats, Tathagata Das, Prashanth Mohan, Dhruba Borthakur, and Randy H. Katz.
Detail: reducing the flow completion time tail in datacenter networks. In SIGCOMM, pages
139–150, 2012.
[159] Ming Zhang, Brad Karp, Sally Floyd, and Larry L. Peterson. Rr-tcp: A reordering-robust
tcp with dsack. In ICNP, pages 95–106, 2003.
[160] Yibo Zhu, Haggai Eran, Daniel Firestone, Chuanxiong Guo, Marina Lipshteyn, Yehonatan
Liron, Jitendra Padhye, Shachar Raindel, Mohamad Haj Yahia, and Ming Zhang. Congestion
control for large-scale rdma deployments. In ACM SIGCOMM Computer Communication
Review, 2015.
137