DETECTING AND MITIGATING ROOT CAUSES FOR SLOW WEB TRANSFERS

by

Tobias Flach

A Dissertation Presented to the
FACULTY OF THE USC GRADUATE SCHOOL
UNIVERSITY OF SOUTHERN CALIFORNIA
In Partial Fulfillment of the Requirements for the Degree
DOCTOR OF PHILOSOPHY
(COMPUTER SCIENCE)

December 2016

Copyright 2016 Tobias Flach

Acknowledgements

First and foremost, I would like to express my deepest gratitude to my two Ph.D. advisors Ramesh Govindan and Ethan Katz-Bassett. Over the past six years they supported and mentored me on my path towards becoming a researcher and an expert in networking protocols and measurements. I am also thankful to John Heidemann, Yan Liu, and Konstantinos Psounis who volunteered to serve on my qualification exam and defense committees and provided valuable feedback to improve this dissertation.

A large part of my work is tied to studies conducted while I was interning at Google. I am particularly grateful for the support I received from my two internship hosts Nandita Dukkipati and Pavlos Papageorge. They were tremendously helpful in shaping my research trajectory. I am certain that the work discussed in Chapters 2 and 5 would not have been possible without them. In addition, I benefited from a lot of collaboration with other people at Google and peers within my lab. In particular, Yuchung Cheng, Luis Pedrosa, Andreas Terzis, and Kyriakos Zarifis contributed greatly to parts of this work. I am thankful to all of them.

Finally, I would like to thank my friends and family. I am especially thankful to my mom for her continuous and unconditional support, and I dedicate this dissertation to her.

Table of Contents

Acknowledgements
List of Tables
List of Figures
Abstract

Chapter 1: Introduction
  1.1 Protocol Limitations
  1.2 Structural Limitations
  1.3 Third-party Interference

Chapter 2: Reducing Web Latency: the Virtue of Gentle Aggression
  2.1 Introduction
  2.2 The Case for Faster Recovery
  2.3 Towards 1-RTT Recoveries
  2.4 Reactive
  2.5 Corrective
  2.6 Proactive
  2.7 The Role of Middleboxes
  2.8 Evaluation
    2.8.1 Experimental Setup
    2.8.2 End-to-end Evaluation
    2.8.3 Reactive
    2.8.4 Corrective
      2.8.4.1 Isolated Flows
      2.8.4.2 Web Page Replay
    2.8.5 Proactive
  2.9 Discussion
  2.10 Conclusion
Chapter 3: Breaking Down TCP Latency Worldwide
  3.1 Introduction
  3.2 Methodology
    3.2.1 Delay Components
  3.3 Dataset
  3.4 Results
    3.4.1 Tail Delays Across the Globe
    3.4.2 Mapping Latency to Delay Components
    3.4.3 Regional Differences
    3.4.4 Delay Evolution
    3.4.5 Impact on Data Delivery Rate
    3.4.6 Tail Performance for Short Flows
    3.4.7 Timer Inflation
  3.5 Conclusion

Chapter 4: Diagnosing Path Inflation of Mobile Client Traffic
  4.1 Introduction
  4.2 Background
  4.3 Dataset
  4.4 A Taxonomy of Inflated Routes
  4.5 Results
  4.6 Path Inflation Today
  4.7 Conclusions

Chapter 5: An Internet-Wide Analysis of Traffic Policing
  5.1 Introduction
  5.2 Detecting & Analyzing Policing at Scale
    5.2.1 Detecting Policing
    5.2.2 Analyzing Flow Behavior At Scale
  5.3 Validation
    5.3.1 Lab Validation
    5.3.2 Consistency of Policing Rates
  5.4 Policing in the Wild
    5.4.1 The Prevalence of Policing
    5.4.2 Enforced Policing Rates
    5.4.3 Impact of Policing on the Network
    5.4.4 Impact on Playback Quality
    5.4.5 Interaction Between Policers and TCP
    5.4.6 Policing Pathologies
  5.5 Mitigating Policer Impact
    5.5.1 Solutions for ISPs
      5.5.1.1 Optimizing Policing Configurations
      5.5.1.2 Shaping Instead of Policing
    5.5.2 Solutions for Content Providers
      5.5.2.1 Limiting the Server's Sending Rate
      5.5.2.2 Avoiding Bursty Transmissions
    5.5.3 Summary of Recommendations
  5.6 Conclusion

Chapter 6: Literature Review
  6.1 Analyzing and Reducing Web Latency
  6.2 Path Inflation and Mobile Performance
  6.3 Traffic Policing

Chapter 7: Conclusions
  7.1 Future Directions

Bibliography

List of Tables

2.1 Recovery behavior with Reactive packets for different tail loss scenarios.
2.2 Round-trip comparison between Linux baseline and Proactive + Reactive.
2.3 Response time comparison of baseline Linux and Reactive.
2.4 Retransmission statistics in Linux and the corresponding delta in the Reactive experiment.
2.5 Latency reduction with Corrective for random and correlated loss patterns.
2.6 Round-trip comparison for Linux baseline and Proactive.
3.1 Metrics for connections with a non-zero loss trigger delay.
3.2 Delays by component for tail performers in short flows.
4.1 Fraction of traceroutes from major US carriers with metro-level inflation.
4.2 Observed path inflation for two carriers in Q4 2011.
4.3 Observed peering locations between carriers and Google.
5.1 Overview of policing and shaping.
5.2 PD classification accuracy for several controlled scenarios.
5.3 Prevalence of policing and loss rates for policed and unpoliced segments.
5.4 Top-7 policing ASes using traces after May 2015 only.
5.5 Overview of 5 highly policed ISPs in the Google dataset.
5.6 Impact of token bucket capacity on rebuffering time of the same 30-second video playback.
5.7 Average join/rebuffer times for first 30 s of a video with the downlink throttled to 0.5 Mbps by either a policer or shaper.
5.8 Loss rates for policed connections served by two selected CDN servers, with mitigation techniques in place.

List of Figures

1.1 Overview for performance-limiting factors investigated in this work.
2.1 Mean TCP latency to transfer an HTTP response from Web server to a client.
2.2 CDF of the measured RTOs normalized by RTT.
2.3 Relative probabilities for packet losses based on a packet's position in a burst.
2.4 Probability of bursts with one or two losses.
2.5 Deployment options for our redundancy techniques.
2.6 Design space for transport mechanisms with different aggressiveness levels.
2.7 Timeline of a connection using Corrective.
2.8 Timeline of a connection using Proactive.
2.9 Latency improvements of HTTP responses with Reactive vs. baseline Linux.
2.10 Completion and render start times for Web site downloads when using baseline Linux vs. Corrective.
2.11 Reduction in response time achieved by Proactive.
3.1 Sample flows and delay breakdown.
3.2 Overall distributions of the worst per-connection ACK delay.
3.3 Delay components broken down for each connection's tail performer.
3.4 Observed timeout inflation.
3.5 Distribution of individual delay components in selected ASes.
3.6 Overall delay CDF for selected countries over time.
3.7 Buffer requirement to compensate for tail performer's ACK delay.
3.8 Required buffer size normalized by the bytes already acknowledged.
3.9 Estimated timer values worldwide after 1 MB of data was sent per connection.
4.1 Optimal routing for mobile clients.
4.2 Different ways a client can be directed to a server.
4.3 Root cause analysis for metro-level inflation.
4.4 Server selection flapping due to coarse client-server mapping.
4.5 Observed ingress points for major US carriers.
5.1 TCP sequence graph for a policed flow.
5.2 Policing Detector.
5.3 Number of rate clusters required to cover at least 75% of the rate samples per AS.
5.4 Prevalence of policing in the M-Lab NDT data-set across client continents.
5.5 Top-10 countries with the most policing in the NDT dataset.
5.6 Prevalence of policing over time.
5.7 Prevalence of policing over time (grouped by continent).
5.8 Prevalence of policing over time (grouped by country).
5.9 Observed policing rates per segment.
5.10 Distribution of policing rates in ASes with highest policing prevalence.
5.11 Distribution of policing rates observed in the top-7 policing ASes in 2015.
5.12 Distribution of policing rates observed in AS 6697, in two different time frames.
5.13 Loss rates observed on policed and unpoliced segments in different regions of the world.
5.14 Per-segment loss rates.
5.15 Ratio between the median burst throughput and the policing rate per segment.
5.16 Loss rates seen in the whole NDT dataset.
5.17 Loss rates seen in the NDT dataset (grouped by continent).
5.18 Loss rates seen in the NDT dataset (grouped by country).
5.19 Loss rates seen in the NDT dataset (grouped by country).
5.20 Rebuffer to watch time ratios for video playbacks.
5.21 Wait times for all HD segments and those policed below 2.5 Mbps.
5.22 Common traffic patterns when a traffic policer enforces throughput rates.
5.23 Policing rates in policed segments for selected ISPs.
5.24 Loss rates in policed segments for selected ISPs.
5.25 Per-segment ratio between 90th and 10th percentile latencies for shaped segments and all video segments globally.
5.26 TCP sequence graphs for three flows with different mitigation techniques in place, passing through a policer.

Abstract

One of the key goals for Web service providers is the quick delivery of their content to customers. Minimizing the latency between a user's service request and the delivery of the corresponding content is of paramount importance for Web services like search, shopping, or video streaming. The importance is motivated by the fact that users have a low tolerance for delays. Past studies verified a link between increasing latency for content delivery and corresponding reductions in user engagement and provider revenue. As a result, content providers go to great lengths to minimize latency by improving their infrastructure, communication protocols, and proximity to the users. However, end-to-end latency can still suffer from other network limitations, some of which are outside the control of the content provider.

In this thesis we strive to get a better understanding of the performance-limiting factors that affect Web transfers and explore techniques to mitigate these factors. To achieve this goal we conduct multiple measurement studies dissecting Web transfers to find and analyze the root causes for poor performance. Different parties are involved in transporting content from providers to customers, and we look into limitations at different points of the transport process affecting latency: the sender's transport protocol, the network infrastructure carrying the data, and finally interference by networks through which traffic passes on its way to the client.

First, we present two measurement studies investigating how the Transmission Control Protocol (TCP) can introduce delays that adversely affect Web transfers. We use large-scale measurements that we obtained at Google frontends across the globe for this task. In addition, we leverage a secondary set of measurements from the widely distributed M-Lab platform. We start by evaluating how packet loss affects Google's content delivery and show that especially short-lived connections suffer when packet loss happens in the network. We then discuss the design, deployment, and evaluation of algorithms tailored to reduce the frequency and impact of the costly losses. As a follow-up, we present a methodology to break down the delay incurred by a packet into components attributable to propagation delay and cross-traffic, loss recovery, and queuing. Moreover, we investigate the degree to which queuing delays slow TCP's loss recovery.
We find that many of the flows see packet delivery times of one second or more, with large regional differences, and with queuing being a key cause of delay.

Second, we take a look at network infrastructure limitations affecting Web latency. Specifically, we analyze the impact of path inflation in mobile carriers, where traffic between content providers and mobile customers takes geographically circuitous routes. We attribute these pathologies to root causes like a lack of ingress points between a carrier's network and the wider Internet as well as limited peering arrangements with content providers. Based on longitudinal data we show that performance in some carriers improved over time, with other regions continuing to suffer from path inflation.

Third, we look at a particular type of third-party interference as a contributor to delay. We analyze the prevalence and impact of traffic policing, a traffic management technique used to enforce pre-configured throughput limits on connections by dropping excess packets. Based on global-scale measurements taken at Google frontends we show that a substantial number of connections with packet loss are affected by policing. Moreover, we demonstrate that policing negatively impacts user quality of experience. We conclude by designing and testing solutions for content providers and the policing ISPs to mitigate the negative effects of policers.

Chapter 1
Introduction

Delivering content to customers quickly is one of the key goals for Web service providers. Especially latency-sensitive services like search or shopping strive to minimize the delay between a customer's request and the corresponding response. Past studies confirm how sensitive users are to latency. For example, Amazon estimates that every 100ms increase in latency cuts profits by 1% [77].

Timely content delivery is important to other services with a higher latency tolerance, like video streaming, as well. Web players are designed to keep content buffered to ensure that a video can be played back without interruption even when it is only partially loaded. However, high latency can result in the player's buffer running empty, forcing it to pause the video until more data has been rebuffered. These rebuffering events can be costly. A recent study of Conviva data showed that every 1% of watch time that a player spends on rebuffering content potentially reduces user engagement by over 3 minutes [37].

Since high latency can negatively affect a content provider's revenue, providers go to great lengths to minimize latency. This includes backbone and point-of-presence (PoP) expansions to achieve proximity to their clients, as well as careful re-engineering of routing and DNS redirection [6,27]. However, end-to-end latency can still suffer from other network limitations. These limitations are challenging to find for multiple reasons.

First, the end-to-end path between communicating hosts is controlled by multiple parties, including at least the content provider and the client's Internet Service Provider (ISP). As a result, a connection can perform badly, for example, if traffic passes through a congested network, even if all other parties optimized their infrastructure.

Second, due to the size and heterogeneity of the Web, identifying common causes of bad performance requires large-scale measurements from many vantage points. Similarly, new algorithms designed to mitigate known performance problems require widespread deployment and evaluation.
Only a small number of providers have the means to address this scalability challenge.

Figure 1.1: Overview for performance-limiting factors investigated in this work. The content provider aims to deliver an HTTP response from its server to the client as fast as possible. However, latency can inflate due to protocol limitations (e.g. TCP delays the transmission of packets), structural limitations (e.g. a lack of connectivity between routers to keep paths short), or third-party interference (e.g. a traffic policer drops packets).

Thesis statement. To understand and mitigate the performance-limiting factors affecting Web transfers we need measurements at scale from real users around the world. In addition, by taking advantage of the infrastructure of content providers we can develop and deploy solutions that can quickly have a global impact and reduce Web latency for billions of users.

In this work, we investigate performance problems appearing at different points in the process of transporting content from a provider to a customer. This includes latency introduced by an inefficiently running transport protocol, suboptimal transport paths due to limited network infrastructure, and finally interference by networks through which traffic passes on its way to the client. Figure 1.1 illustrates the performance-limiting factors that we investigated. To demonstrate that these problems have a global impact, we leverage measurements taken at hundreds of frontend nodes owned by Google, one of the largest content providers in the world [115], and longitudinal measurements taken through the Network Diagnostics Toolkit (NDT) [82]. Together, these data points provide insights into the performance problems observed by a population of over one billion users [145].

The individual pieces of this dissertation were motivated by anecdotal evidence for particular pathologies affecting content delivery. For example, manual investigations conducted by developers at Google before we started our work revealed that packet loss can lead to high latency experienced by users. However, for each of the pathologies we analyzed here, it was unknown to what extent they affected customers on a global scale. In addition, no techniques existed to deal with these problems. We overcome these shortcomings with this work. In the following chapters we discuss the studies that we conducted in detail, each of which resulted in a quantitative and qualitative analysis of the performance degradation observed across the Web. In addition, most chapters conclude with a set of recommendations to mitigate the root problems to reduce latency, some of which were deployed and tested in production networks, resulting in better performance for users around the world.

1.1 Protocol Limitations

In Chapters 2 and 3 we discuss two measurement studies investigating how the Transmission Control Protocol (TCP) can introduce delays that adversely affect Web transfers. TCP is the primary transport protocol used across the Web, guaranteeing reliable, ordered, and error-free content delivery between connected pairs of endpoints. In addition, TCP is designed to provide fast communication by using available network capacity without producing excessive congestion in the network.

In Chapter 2 we analyze how packet loss can cause TCP to introduce delays when delivering Web content.
Since TCP guarantees that a data stream is delivered reliably, it has to ensure that any lost data is retransmitted until the client successfully receives it. Based on large-scale measurements that we obtained at Google frontends across the globe, we evaluated how packet loss affects transfers between Google services, including video and search, and their customers across the world. We found that especially short-lived connections suffer from packet loss in the network, since the protocol needs time to detect and recover from loss. We then designed, deployed, and evaluated algorithms to reduce the frequency and impact of the costly losses. Overall, we showed that enabling our redundancy techniques at Google's frontends can reduce Google's average Web latency by 23%.

While we found that packet loss can induce long recovery delays, we did not investigate the factors that influence the magnitude of these delays. In addition, we did not look for other potential pathologies affecting TCP latency. To overcome this limitation, we conducted a follow-up study of the root causes for TCP delays, discussed in Chapter 3. We developed a methodology to break down the delay incurred by a packet into components attributable to propagation delay and cross-traffic, loss recovery, and queuing. Moreover, we investigated the degree to which queuing delays slow TCP's loss recovery. We analyzed packet traces of 10-second flows captured across the globe using data from the publicly available NDT dataset. We found that many of the flows see packet delivery times of one second or more, with large regional differences, and with queuing being a key cause of delay. The findings presented in the chapter point towards a need for a continuous, measurement-driven approach to optimizing transport protocols, since variations across regions and time mean that an optimization can have a large impact in one setting but not others.

1.2 Structural Limitations

Even if the network protocol used for Web transfers is working efficiently, the performance it can provide to the higher layers (e.g. the application) depends on how well the underlying network is working. As a result, content providers try to optimize the network infrastructure used to get data from their servers to their customers, e.g. by maximizing the proximity between the two. However, content providers still depend on Internet Service Providers (ISPs) to provide connectivity between their infrastructure and clients.

In Chapter 4 we evaluate how structural limitations in mobile networks can affect transport latency. More specifically, we investigate the role of path inflation, where traffic takes a "detour" due to limitations in a carrier's network topology, thereby introducing additional propagation delays before reaching the receiver. Common causes for path inflation are a lack of ingress points between a carrier's network and the wider Internet as well as limited peering arrangements with content providers. Based on longitudinal data, we observed that the evolution of some carrier networks improves performance in some regions. However, we also observe many clients, even in major metropolitan areas, that continue to take geographically circuitous routes to content providers, due to limitations in the current topologies.

1.3 Third-party Interference

Finally, even if the infrastructure and protocols responsible for delivering content are optimized, performance can suffer when third parties interfere with transmissions.
In Chapter 5 we analyzed a particular form of traffic management, called traffic policing, that is used to enforce pre-configured throughput limits on connections by dropping excess packets. Through measurements taken at Google frontends around the world we were able to quantify the impact of policing on connection performance and the user's quality of experience. We also used longitudinal measurements from M-Lab's Network Diagnostics Toolkit as a secondary data source to confirm our findings. In addition, we explored a multitude of options for both content providers and the policing ISPs to mitigate the negative effects of policers. Techniques like rate limiting and pacing can be used by content providers to minimize the risk of bursty losses induced by a policer or even avoid packet loss overall. To help the ISPs that currently deploy policers, we explored ways to optimize the configurations of policers or the usage of traffic shaping as an alternative, to minimize their impact on end-to-end connections.

Chapter 2
Reducing Web Latency: the Virtue of Gentle Aggression

To serve users quickly, Web service providers build infrastructure closer to clients and use multi-stage transport connections. Although these changes reduce client-perceived round-trip times, TCP's current mechanisms fundamentally limit latency improvements. We performed a measurement study of a large Web service provider and found that, while connections with no loss complete close to the ideal latency of one round-trip time, TCP's timeout-driven recovery causes transfers with loss to take five times longer on average.

In this chapter, we present the design of novel loss recovery mechanisms for TCP that judiciously use redundant transmissions to minimize timeout-driven recovery. Proactive, Reactive, and Corrective are three qualitatively-different, easily-deployable mechanisms that (1) proactively recover from losses, (2) recover from them as quickly as possible, and (3) reconstruct packets to mask loss. Crucially, the mechanisms are compatible both with middleboxes and with TCP's existing congestion control and loss recovery. Our large-scale experiments on Google's production network that serves billions of flows demonstrate a 23% decrease in the mean and 47% in 99th percentile latency over today's TCP.

2.1 Introduction

Over the past few years, and especially with the mobile revolution, much economic and social activity has moved online. As such, user-perceived Web performance is now the primary metric for modern network services. Since bandwidth remains relatively cheap, Web latency is now the main impediment to improving user-perceived performance. Moreover, it is well known that Web latency inversely correlates with revenue and profit. For instance, Amazon estimates that every 100ms increase in latency cuts profits by 1% [77].

In response to these factors, some large Web service providers have made major structural changes to their service delivery infrastructure. These changes include a) expanding their backbones and PoPs to achieve proximity to their clients and b) careful re-engineering of routing and DNS redirection. As such, these service providers are able to ensure that clients quickly reach the nearest ingress point, thereby minimizing the extent to which the client traffic traverses the public Internet, over which providers have little control. To improve latency, providers engineer the capacity of and traffic over their internal backbones.
As a final latency optimization, providers use multi-stage TCP connections to isolate internal access latency from the vagaries of the public Internet. Client TCP connections are usually terminated at a frontend server at the ingress to the provider's infrastructure. Separate backend TCP connections between frontend and backend servers complete Web transactions. Using persistent connections and request pipelining on both of these types of connections amortizes TCP connection setup and thereby reduces latency.

Despite the gains such changes have yielded, improvements through structural re-engineering have reached the point of diminishing returns [72], and the latency due to TCP's design now limits further improvement. Increasing deployment of broadband access—the average connection bandwidth globally was 2.8Mbps in late 2012, more than 41% of clients had a bandwidth above 4Mbps, and 11% had more than 10Mbps [7]—has significantly reduced transmission latency. Now, round-trip time (RTT) and the number of round trips required between clients and servers largely determine the overall latency of most Web transfers.

TCP's existing loss recovery mechanisms add RTTs, resulting in a highly-skewed client Web access latency distribution. In a measurement of billions of TCP connections from clients to Google services, we found that nearly 10% of them incur at least one packet loss, and flows with loss take on average five times longer to complete than those without any loss (Section 2.2). Furthermore, 77% of these losses are repaired through expensive retransmission timeouts (RTOs), often because packets at the tail of a burst were lost, preventing fast recovery. Finally, about 35% of these losses were single packet losses in the tail. Taken together, these measurements suggest that loss recovery dominates Web latency.

In this chapter, we explore faster loss recovery methods that are informed by our measurements and that leverage the trend towards multi-stage Web service access. Given the immediate benefits that these solutions can provide, we focus on deployable, minimal enhancements to TCP rather than a clean-slate design. Our mechanisms are motivated by the following design ideal: to ensure that every loss is recovered within 1-RTT. While we do not achieve this ideal, we conduct a principled exploration of three qualitatively-different, deployable TCP mechanisms that progressively take us closer to this ideal. The first mechanism, Reactive, retransmits the last packet in a window, enabling TCP to trigger fast recovery when it otherwise might have had to incur an RTO. Corrective additionally transmits a coded packet that enables recovery without retransmission in cases where a single packet is lost and Reactive might have triggered fast recovery. Proactive redundantly transmits each packet twice, avoiding retransmissions for most packets in a flow.

Along other dimensions, too, these approaches are qualitatively different. They each involve increasing levels of aggression: Reactive transmits one additional packet per window for a small fraction of flows, Corrective transmits one additional packet per window for all flows, while Proactive duplicates the window for a small portion of flows.
Finally, each design leverages the multi-stage architecture in a qualitatively different way: Reactive requires only sender side changes and can be deployed on frontend servers, Corrective requires both sender and receiver side changes, while Proactive is designed to allow service providers to selectively apply redundancy for Web flows, which often are a minuscule fraction of the traffic relative to video on a backbone network. Despite the differences, these approaches face common design challenges: avoiding interference with TCP's fast retransmit mechanism, ensuring accurate congestion window adjustments, and co-existing with middleboxes.

Together with our collaborators at Google we have implemented all three mechanisms in the Linux kernel. We deployed Reactive on frontend servers for production traffic at Google and have used it on hundreds of billions of flows, and we have experimented with Proactive for backend Web connections in a setting of interest for a month. In addition, we measured them extensively. Our evaluations of Reactive and Proactive use traces of several million flows and traffic to a wide variety of clients including mobile devices. We base the evaluation of Corrective on realistic loss emulation, as it requires both client and server changes and cannot be unilaterally deployed. Our large-scale experiment in production with Proactive at the backend and Reactive at the frontend yielded a 23% improvement in the mean and 47% in 99th percentile latency over today's TCP. Our emulation experiment with Corrective yielded 29% improvement in 99th percentile latency for short flows with correlated losses. The penalty for these benefits is the increase in traffic per connection by 0.5% for Reactive, 100% for Proactive, and 10% for Corrective on average. Our experience with these mechanisms indicates that they can yield immediate benefits in a range of settings, and provide stepping stones towards the 1-RTT recovery ideal.

Figure 2.1: Mean TCP latency to transfer an HTTP response from Web server to a client. Measurements are bucketed by packet round-trip time between the frontend and the client.

2.2 The Case for Faster Recovery

In this section, we present measurements from Google's frontend infrastructure that indicate a pressing need to improve TCP recovery behavior. Web latency is dominated by TCP's startup phase (the initial handshake and slow start) and by time spent detecting and recovering from packet losses; measurements show about 90% of the connections on Web servers finish within the slow start phase, while the remaining experience long recovery latencies [125]. Recent work has proposed to speed up the connection startup by enabling data exchange during handshake [104] and by increasing TCP's initial congestion window [43]. However, mechanisms for faster loss recovery remain largely unexplored for short flows.

Data Collection. We examine the efficacy of TCP's loss recovery mechanisms through measurements of billions of Web transactions from Google services excluding videos. We measure the types of retransmissions in a large data center which primarily serves users from the U.S. East coast and South America. We selected this data center because it has a mix of RTTs, user bandwidths, and loss rates. In addition, about 30% of the traffic it served is for cellular users.
For ease of comparison, we also use the same data center to experiment with our own changes to loss recovery described in later sections. We collected Linux TCP SNMP statistics from Web servers and measured TCP latency to clients for one week in December 2012 and January 2013. Observations described here are consistent across several such sample sizes taken in different weeks and months. In addition, we also study packet loss patterns in transactions from two days of server-side TCP traces in 2012, of billions of clients accessing latency-sensitive services such as Web search from five frontend servers in two of our data centers. These measurements of actual client traffic allow us to understand the TCP-layer characteristics causing poor performance and to design our solutions to address them.

Loss makes Web latency 5 times slower. In our traces, 6.1% of HTTP replies saw loss, and 10% of TCP connections saw at least one loss (a TCP connection can be reused to transmit multiple responses). The average (server) retransmission rate was 2.5%. Figure 2.1 depicts the TCP latency in the traces (the time between the first byte the server sent and its receipt of the last ACK), separating the transfers that experienced loss from those that did not experience loss. The figure buckets the transfers by measured RTT and depicts the mean transfer latency for each bucket. For comparison, the figure also depicts the ideal transfer latency of one RTT. As seen in the figure, transfers without loss generally take little more than the ideal duration. However, transfers that experience loss take much longer to complete—5 times longer on average.

Figure 2.2: CDF of the measured RTOs normalized by RTT.

Finding: Flows without loss complete in essentially optimal time, but flows with loss are much slower. Design implication: TCP requires improved loss recovery behavior.

77% of losses are recovered by timeout, not fast recovery. As suggested by the tail transfer latency in our traces, the time to recover from loss can dwarf the time to complete a lossless transfer. In our traces, frontend servers recover about 23% of losses via fast retransmission—the other 77% require RTOs. This is because Web responses are small and tail drops are common. As a result, there are not enough duplicate ACKs to trigger fast recovery (Linux implements early retransmit, which requires only one duplicate ACK to perform fast recovery).

Even worse, many timeouts are overly conservative compared to the actual network RTT. The sender bases the length of its RTO upon its estimate of the RTT and the variation in the RTT. In practice, this estimate can be quite large, meaning that the sender will not recover from loss quickly. In our traces we found that the median RTO is six times larger than the RTT, and the 99th percentile RTO is a whopping 200 times larger than the actual RTT, as shown in Figure 2.2. These high timeout values likely result from high variance in RTT, caused by factors such as insufficient RTT samples early in a flow and varying queuing delays in routers with large buffers [127]. In such cases, an RTO causes a severe performance hit for the client. Note that simply reducing the length of the RTO does not address the latency problem for two reasons. First, it increases the chances of spurious retransmissions. Based on TCP DSACK [20], our traces report that about 40% of timeouts are spurious. More importantly, a spurious RTO reduces the congestion window to one and forces a slow start, unnecessarily slowing the transfer of remaining data.
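As background for the RTO values discussed above (not a result from these traces): the retransmission timeout is derived from a smoothed RTT estimate (SRTT) and an RTT variance estimate (RTTVAR), both exponentially weighted averages of RTT samples, so high RTT variance directly inflates the timeout. A minimal sketch of the standard computation from RFC 6298, which Linux follows closely, is shown below; the function name and choice of units are illustrative only.

    /* Sketch of the standard RTO computation (RFC 6298). srtt and rttvar are the
       smoothed RTT and RTT-variance estimates, g is the clock granularity, and
       rto_min is the configured lower bound (1 s in the RFC, 200 ms in Linux).
       All values in microseconds. */
    static unsigned long rfc6298_rto(unsigned long srtt, unsigned long rttvar,
                                     unsigned long g, unsigned long rto_min)
    {
        unsigned long var_term = (4 * rttvar > g) ? 4 * rttvar : g;
        unsigned long rto = srtt + var_term;
        return (rto > rto_min) ? rto : rto_min;
    }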
Finding: Servers currently recover from most losses using slow RTOs. Design implication: RTOs should be converted into fast retransmissions or, even better, TCP should recover from loss without requiring retransmission.

(Single) packet tail drop is very common. The duplicate acknowledgments triggered by packets received after a loss can trigger fast retransmission to recover the missing packet(s). The prevalence of RTOs in our traces suggests that loss mostly occurs towards the end of bursts. Figure 2.3 shows how likely a packet is to be lost, based on its position in the burst. We define a burst as a sequence of packets where the server sends each packet at most 500ms after the previous one. The figure shows that, with few exceptions, the later a packet occurs in a burst, the more likely it is to be lost. The correlation between position in a burst and the probability of loss may be due to the bursts themselves triggering congestive losses, with the later packets dropped by tail-drop buffers. Figure 2.4 indicates, for flows experiencing loss, the probability of having at most two packet losses. For bursts of at most 10 packets, 35% experienced exactly one loss, and an additional 10% experienced exactly two losses.

Finding: Many flows lose only one or two consecutive packets, commonly at the tail of a burst. Design implication: Minimizing the impact of small amounts of tail loss can significantly improve TCP performance.

Figure 2.3: Relative probability of the x-th packet (in a burst) being lost compared to the probability of the first packet being lost (same burst). A line depicts the ratio for a fixed burst length derived from HTTP frontend-client traces.

Figure 2.4: Probability of one (bottom line) or at most two (top line) packet losses in a burst for lossy bursts, derived from HTTP frontend-client traces.

These findings confirm not only that tail losses are commonplace in modern networks, but that they can cause poor end-to-end latency. Next, we build upon these findings to develop mechanisms that improve loss recovery performance.

2.3 Towards 1-RTT Recoveries

In this chapter, we explore three qualitatively different TCP mechanisms, working towards the ideal of 1-RTT loss recovery. The first two were designed by collaborators at Google. Reactive re-sends the last packet in a window, enabling TCP to trigger fast recovery when it otherwise might have had to incur an RTO. Corrective transmits a coded packet that enables recovery without retransmission when a single packet in the coded block is lost. Finally, Proactive is 100% redundant: it transmits each data packet twice, avoiding retransmissions for almost all packets in a flow.

Figure 2.5: Proactive is applied selectively on certain transactions in the backend; Reactive can be deployed on the client-facing side of frontends to speed Web responses; Corrective can apply equally to both client and backend connections.
Our measurements from Section 2.2 motivate not just a focus on the 1-RTT recovery ideal, but have also informed these mechanisms. Proactive attempts to avoid loss recovery completely. Motivated by the finding that RTOs dominate loss recovery, Reactive effectively converts RTOs into fast retransmissions. Corrective is designed for the common case of a single packet loss. They were also designed as a progression towards the 1-RTT ideal: from fast recovery through more frequent fast retransmits in Reactive, to packet correction in Corrective, to recovery avoidance in Proactive. This progression reflects an increase in the level of aggression from Reactive to Proactive and the fact that each design is subsumed by the next: Corrective implicitly converts RTOs to fast retransmissions, and Proactive corrects more losses than Corrective.

Finally, these mechanisms were designed to be immediately deployable in a multi-stage Web service architecture, like that shown in Figure 2.5. Each of them makes relatively small changes to TCP, but different designs apply to different stages, with each stage having distinct constraints. Reactive requires sender side changes and can be deployed in the client-facing side of frontends to speed Web responses. Proactive requires both sender and receiver side changes and can be selectively applied on backends. Prompt loss recovery is relevant for backend connections because frontends deployed in remote, network-constrained locations can experience considerable loss: the average retransmission rate across all our backend connections on a particular day was 0.6% (max=16.3%). While Proactive adds high overhead, Web service traffic is a small fraction of overall traffic, so Proactive's aggression adds negligible overhead (in our measurements, latency-critical Web traffic is less than 1% of the overall video-dominated traffic). Finally, Corrective requires sender and receiver side changes, and can apply equally to client or backend connections. These designs embed assumptions about the characteristics of losses observed today and about the structure of multi-stage architectures. Section 2.9 discusses the implications of these assumptions.

Despite the differences between these approaches, they face several common challenges that arise in adding redundancy to TCP. First, a redundantly transmitted data packet might trigger additional ACKs and consequently fast retransmissions. Second, when a redundantly transmitted data packet masks a loss, the congestion control algorithms must react to the loss. Finally, any changes to TCP must co-exist with middleboxes [55]. In subsequent sections, we present the design of each of the mechanisms and describe how they address these challenges.

In the broader context (Figure 2.6) of other schemes that have attempted to be more or less aggressive than TCP, our designs occupy a unique niche: leveraging gentle aggression for loss recovery. As our results show, this degree of aggression is sufficient to achieve latency reduction without introducing network instability (e.g., by increasing loss rates).

Figure 2.6: The design space of transport mechanisms that are of a different aggressiveness than the baseline.
2.4 Reactive

In this section we discuss the Reactive algorithm, a technique to mitigate retransmission timeouts (RTOs) that occur due to tail losses, which was designed by collaborators at Google [41]. Reactive sends probe segments to trigger duplicate ACKs to attempt to spur fast recovery more quickly than an RTO at the end of a transaction. Reactive requires only sender-side changes and does not require any TCP options.

The design of Reactive presents two main challenges: a) how to trigger an unmodified client to respond to the server with appropriate information so as to help plug tail losses using fast recovery, and b) how to avoid circumventing TCP's congestion control. After we describe the basic Reactive mechanism, we then outline the algorithm to detect the cases in which Reactive plugs a hole. We will show that the algorithm makes the sender aware that a loss had occurred so it performs the appropriate congestion window reduction. We then discuss how Reactive enables a TCP sender to recover any degree of tail losses via fast recovery.

Reactive algorithm. The Reactive algorithm allows a sender to quickly detect tail losses without waiting for an RTO. (In the rest of the chapter, we use the term "tail loss" to refer generally to either drops at the tail end of transactions or a loss of an entire window of data or acknowledgments.) The risk of a sender incurring a timeout is high when the sender has not received any acknowledgments for some time but is unable to transmit any further data, either because it is application-limited (out of new data to send), receiver window-limited (rwnd), or congestion window-limited (cwnd). In these circumstances, Reactive transmits probe segments to elicit additional ACKs from the receiver. Reactive is applicable only when the sender has thus far received in-sequence ACKs and is not already in any state of loss recovery. Further, it is designed for senders with Selective Acknowledgment (SACK) enabled, because the SACK feedback of the last packet allows senders to infer whether any tail segments were lost [21, 87].

The Reactive algorithm triggers on a newly defined probe timeout (PTO), which is a timer event indicating that an ACK is overdue on a connection. The sender sets the PTO value to approximately twice the smoothed RTT and adjusts it to account for a delayed ACK when there is only one outstanding segment. The basic version of the Reactive algorithm transmits one probe segment after a PTO if the connection has outstanding unacknowledged data but is otherwise idle, i.e. it is not receiving any ACKs or is cwnd/rwnd/application-limited. The transmitted segment—the loss probe—can be either a new segment if available and the receive window permits, or a retransmission of the most recently sent segment (i.e., the segment with the highest sequence number). In the case of tail loss, the ACK for the probe triggers fast recovery. In the absence of loss, there is no change in the congestion control or loss recovery state of the connection, apart from any state related to Reactive itself.
Algorithm 1: Reactive.

  % Called after transmission of new data in Open state.
  Function schedule_pto():
      if FlightSize > 1 then PTO ← 2 * RTT
      else if FlightSize == 1 then PTO ← 1.5 * RTT + WDT
      PTO ← min(PTO, RTO)
      Conditions:
        (a) Connection is in open state
        (b) Connection is cwnd- and/or application-limited
        (c) Number of consecutive PTOs ≤ 2
        (d) Connection is SACK-enabled
      if all conditions hold then Arm timer with PTO
      else Rearm timer with RTO

  Function handle_pto():
      if previously unsent segment exists then
          Transmit new segment
          FlightSize ← FlightSize + segment size
      else Retransmit last segment
      schedule_pto()

  Function handle_ack():
      Cancel existing PTO
      schedule_pto()

Pseudocode and Example. Algorithm 1 gives pseudocode for the basic Reactive algorithm. FlightSize is the amount of in-network outstanding data and WDT is the worst-case delayed ACK timer. The key part of the algorithm is the transmission of a probe packet in Function handle_pto() to elicit an ACK without waiting for an RTO. It retransmits the last segment (or a new one if available), such that its ACK will carry SACK blocks and trigger either SACK-based [21] or Forward Acknowledgment (FACK)-based fast recovery [87] in the event of a tail loss.

Next we provide an example of how Reactive operates. Suppose a sender transmits ten segments, 1 through 10, after which there is no more new data to transmit. A probe timeout is scheduled to fire two RTTs after the transmission of the tenth segment, handled by schedule_pto() in Algorithm 1. Now assume that ACKs for segments one through five arrive, but segments six through ten at the tail are lost and no ACKs are received. Note that the sender (re)schedules its probe timer relative to the last received ACK (Function handle_ack()), which is for segment five in this case. When the probe timer fires, the sender retransmits segment ten (Function handle_pto())—this is the key part of the algorithm. After an RTT, the sender receives an acknowledgement for this packet that carries SACK information indicating the missing segments. The sender marks the missing segments as lost (here segments six through nine) and triggers FACK-based recovery. Finally, the connection enters fast recovery and retransmits the remaining lost segments.

Detecting recovered losses. If the only loss was the last segment, there is the risk that the loss probe itself might repair the loss, effectively masking it from congestion control. Reactive includes a loss detection mechanism that detects, by examining ACKs, when the retransmission probe might have masked a loss; Reactive then enforces a congestion window reduction, thus complying with the mandatory congestion control. (Since we observed from our measurements that a significant fraction of the hosts that support SACK do not support DSACK [20], the Reactive algorithm for detecting such lost segments relies only on the support of basic SACK.)

The basic idea of Reactive loss detection is as follows. Consider a Reactive retransmission "episode" where a sender retransmits N consecutive Reactive packets, all for the same tail packet in a flight. Suppose that an episode ends when the sender receives an acknowledgment above the SND.NXT at the time of the episode. We want to make sure that before the episode ends the sender receives N "Reactive dupacks", indicating that all N Reactive probe segments were unnecessary, so there was no hole that needed plugging. If the sender gets fewer than N "Reactive dupacks" before the end of the episode, it is likely that the first Reactive packet to arrive at the receiver plugged a hole, and only the remaining Reactive packets that arrived at the receiver generated dupacks.
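To make the episode bookkeeping concrete, the following sketch illustrates the accounting just described. It is an illustration only, under assumed names (reactive_episode, on_ack_during_episode, reduce_congestion_window); it is not the loss-detection code from the dissertation's Linux implementation.

    /* Illustrative sketch of Reactive loss detection: count the "Reactive dupacks"
       received during an episode and, if fewer than the number of probes sent
       arrive before the episode ends, treat a probe as having masked a loss. */
    #include <stdbool.h>
    #include <stdint.h>

    struct reactive_episode {
        uint32_t probes_sent;    /* N: probe retransmissions in this episode      */
        uint32_t probe_dupacks;  /* "Reactive dupacks" observed for those probes  */
        uint32_t end_seq;        /* SND.NXT recorded when the episode started     */
        bool     active;
    };

    extern void reduce_congestion_window(void);  /* stand-in for the cwnd reduction */

    /* Called for each incoming ACK while an episode is active. */
    static void on_ack_during_episode(struct reactive_episode *ep,
                                      uint32_t ack_seq, bool is_probe_dupack)
    {
        if (!ep->active)
            return;
        if (is_probe_dupack)
            ep->probe_dupacks++;
        /* The episode ends once the cumulative ACK advances beyond the data that
           was outstanding when the probes were sent. */
        if (ack_seq > ep->end_seq) {
            if (ep->probe_dupacks < ep->probes_sent)
                reduce_congestion_window();  /* a probe likely plugged a hole */
            ep->active = false;
        }
    }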
Note that delayed ACKs complicate the picture, since a delayed ACK implies that the sender will receive fewer ACKs than would normally be expected. To mitigate this complication, before sending a loss probe retransmission, the sender should attempt to wait long enough that the receiver has sent any delayed ACKs that it is withholding. Our sender implementation features such a delay. If there is ACK loss or a delayed ACK, then this algorithm is conservative, because the sender will reduce cwnd when in fact there was no packet loss. In practice this is acceptable, and potentially even desirable: if there is reverse path congestion then reducing cwnd is prudent.

Implementation. We implemented Reactive in Linux kernels 2.6 and 3.3. In line with our overarching goal of keeping our mechanisms simple, the basic Reactive algorithm is 110 lines of code and the loss detection algorithm is 55 (0.7% of Linux TCP code).

Initially Reactive was designed to send a zero window probe (ZWP) with one byte of new or old data. The acknowledgment from the ZWP would provide an additional opportunity for a SACK block to detect loss without an RTO. Additional losses can be detected subsequently and repaired with SACK-based fast recovery. However, in practice sending a single byte of data turned out to be problematic to implement in Linux TCP. Instead the developers opted to send a full segment to probe, at the expense of the slight complexity required to detect the probe itself masking losses.

The Reactive algorithm allows the source to transmit one or two PTOs. However, one of the design choices the developers made in their implementation is to not use consecutive probe timeouts, based on the observation that over 90% of the latency gains by Reactive are achieved with a single probe packet. Finally, the worst-case delayed ACK timer is set to 200ms. This is the delayed ACK timer used in most of the Windows clients served from our Web server. Reactive is also described in the IETF draft [41] and is on by default in mainline Linux kernels [40].

Pattern   Reactive scoreboard   Mechanism
AAAL      AAAA                  Reactive loss detection
AALL      AALS                  Early retransmit
ALLL      ALLS                  Early retransmit
LLLL      LLLS                  FACK fast recovery
>= 5 L    ..LS                  FACK fast recovery

Table 2.1: Recovery behavior with Reactive packets for different tail loss scenarios (A = ACKed segment, L = lost segment, S = SACKed segment). The TCP sender maintains the received SACK block information in a data structure called the scoreboard. The Reactive scoreboard column shows the state for each segment after the Reactive packet was ACKed.

Recovery of any N-degree tail loss. Reactive remedies a discontinuity in today's loss recovery algorithms wherein a single segment loss in the middle of a packet train can be recovered via fast recovery while a loss at the end of the train causes a retransmission timeout. With Reactive, a segment loss in the middle of a train as well as at the tail triggers the same fast recovery mechanisms.
When combined with a variant of the early retransmit mechanism [9], Reactive enables fast recovery instead of an RTO for any degree of N-segment tail loss as shown in Table 2.1. 5 2.5 Corrective Reactive recovers from tail loss without incurring (slow) RTOs, and it does so without requiring client-side changes, but it does not eliminate the need for recovery. Instead, it still requires the sender to recognize packet loss and retransmit. Proactive achieves 0-RTT loss recovery, but it has limited applicability, since it doubles bandwidth usage. Further, this level of redundancy may 5 The variant proposed here is to allow an early retransmit in the case where there are three outstanding segments that have not been cumulatively acknowledged and one segment that has been fully SACKed. 23 be overkill in many settings–our measurements in Section 2.2 found that many bursts lose only a single packet. In this section, we explore a middle way–a mechanism to achieve 0-RTT recovery in common loss scenarios. Our approach, Corrective, requires both sender and receiver changes (like Proac- tive, unlike Reactive) but has low overhead (like Reactive, unlike Proactive). Instead of complete redundancy, we employ forward error correction (FEC) within TCP. The sender transmits extra FEC packets so that the receiver can repair a small number of losses. While the use of FEC for transport has been explored in the past, for example in [16,126,131], to our knowledge we are the first to place FEC within TCP in a way that is incrementally deploy- able across today’s networks. Our goal is to achieve an immediate decrease in Web latency, and thus enhancing native TCP with FEC is important. However, this brings up significant challenges that we now discuss. Corrective encoding. The sender and receiver negotiate whether to use Corrective during TCP’s initial handshake. If both hosts support it, every packet in the flow will include a new TCP option, the Corrective option. We then group sequences of packets and place the XOR of their payloads into a single Corrective checksum packet. Checksums have low CPU overhead relative to other coding schemes like Reed-Solomon codes [108]; while such algorithms provide higher recovery rates than checksums in general, our measurements indicated that many bursts experience only a single loss, and so a checksum can recover many losses. Corrective groups together all packets seen within a time window, up to a maximum of sixteen MSS bytes of packets. It aligns the packets along MSS bytes boundaries to XOR them into a single Corrective payload. Because no regular packet carries a payload of more than MSS bytes, this encoding guarantees that the receiver can recover any single packet loss. Corrective delays 24 Figure 2.7: Timeline of a connection using Corrective. The flow shows regular (solid) and Corrective packets (dashed), sequence/ACK numbers, and Corrective option values (terms in brackets). transmitting the encoded packet by RTT 4 since our measurements indicate that this minimizes the probability of losing both, a regular packet and the XOR packet that encodes it. Incorporating loss correction into TCP adds a key challenge. TCP uses a single sequence number space to provide an ordered and reliable byte stream. Blocking on reliable delivery of Corrective packets is counter to our goal of reducing latency. For this reason, a Corrective packet uses the same sequence number as the first packet it encodes. 
This prevents reliability for Corrective packets and avoids the overhead of encoding the index of the first encoded byte in a separate header field. The Corrective packet sets a special ENC flag in its Corrective option signaling that the payload is encoded which allows the receiver to distinguish a Corrective packet from a regular retransmission (since they both have the same sequence number). The option also includes the number of bytes that the payload encodes. 25 Corrective recovery. To guarantee that the receiver can recover any lost packet, the Corrective module keeps the last 15 ACKed MSS blocks buffered, even if the application layer has already consumed these blocks. 6 Since a Corrective packet encodes at most 16 MSS blocks, the receiver can then recover any single lost packet by computing the XOR of the Corrective payload and the buffered blocks in the encoding range. To obtain the encoding range, the receiver combines the sequence number of the Corrective packet (which is set to be the same as the sequence number of the first encoded byte) and the number of bytes encoded (which is part of the Corrective TCP option). Corrective reception works as follows. Once the receiver establishes that the payload is en- coded (by checking the ENC flag in the Corrective option), it checks for holes in the encoded range. If it received the whole sequence, the receiver drops the Corrective packet. Otherwise, if it is missing at most MSS continuous bytes, the receiver uses the Corrective packet to recover the subsequence and forward it to the regular reception routine, allowing 0-RTT recovery. If too much data is missing for the Corrective packet to recover, the receiver sends an explicit dupli- cate ACK. This ACK informs the sender that a recovery failed and denotes which byte ranges were lost 7 via an R FAIL flag and value in the Corrective option. The sender marks the byte ranges as lost and triggers a fast retransmit. Thus, even when immediate recovery is not possible, Corrective provides the same benefit as Reactive. If the receiver were to simply ACK a recovered packet, it would mask the loss and circum- vent congestion control during a known loss episode. Since TCP connections may be reused for 6 Packets received out-of-order are already buffered by default. 7 We can say that the packets were lost with confidence since the Corrective packet transmissions are delayed (as described earlier). 26 multiple HTTP transactions, masking losses can hurt subsequent transfers. To prevent this be- havior, we devised a mechanism similar to explicit congestion notification (ECN) [107]. Upon successful Corrective recovery, the receiver enables an R SUCC flag in the Corrective option in each outgoing ACK, signaling a successful recovery. Once the sender sees this flag, it triggers a cwnd reduction. In addition, it sets an R ACK flag in the Corrective option of the next packet sent to the receiver. Once the receiver observes R ACK in an incoming packet, indicating that the sender reduced the congestion window, it disablesR SUCC for future packets. Figure 2.7 shows a sample packet with a successful Corrective recovery. Implementation. We implemented our prototype in Linux kernel versions 2.6 and 3.2 in 1674 lines of code (7.3% of the Linux TCP codebase). Our implementation is modularized and makes minimal changes to the existing kernel. This separation has made it easy, for example, to port Corrective to the Linux stack for Android devices. We plan to make our implementation publicly available. 
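To make the encoding and recovery steps concrete, the following sketch shows how XOR-based correction over MSS-aligned blocks can work. It is an illustrative userspace model, not the kernel module described above; the function names and the fixed 16-block group size are assumptions drawn from the text, and length bookkeeping via the Corrective option is omitted.

MSS = 1460  # assumed maximum segment size in bytes

def xor_encode(payloads):
    # XOR up to 16 MSS-aligned payloads into one Corrective payload.
    assert len(payloads) <= 16
    out = bytearray(MSS)
    for p in payloads:
        padded = p.ljust(MSS, b"\x00")  # align every packet to MSS bytes
        for i in range(MSS):
            out[i] ^= padded[i]
    return bytes(out)

def xor_recover(corrective_payload, received_payloads):
    # Recover a single missing block: XOR of the Corrective payload and
    # all blocks in the encoded range that did arrive.
    out = bytearray(corrective_payload)
    for p in received_payloads:
        padded = p.ljust(MSS, b"\x00")
        for i in range(MSS):
            out[i] ^= padded[i]
    return bytes(out)

# Example: packet 2 of 4 is lost; the receiver rebuilds it from the rest.
pkts = [bytes([i]) * 1000 for i in range(4)]
enc = xor_encode(pkts)
recovered = xor_recover(enc, [pkts[0], pkts[2], pkts[3]])
assert recovered[:1000] == pkts[1]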
2.6 Proactive Proactive takes the aggressive stance of proactively transmitting copies of each TCP segment. If at least one copy of the segment arrives at the receiver then the connection proceeds with no delay. The receiver can discard redundant segments. While sending duplicate segments can potentially increase congestion and consequently decrease goodput and increase response time, Proactive is designed only for latency-sensitive services on networks where these services occupy a small percentage of the total traffic. While repeating packets is less efficient than sophisticated error correction coding schemes, Proactive is designed to keep the additional complexity of TCP 27 implementation at a minimum while achieving significant latency improvements. Proactive was designed by collaborators at Google. While intuitively simple, the implementation of Proactive has some subtleties. A naive ap- proach would be to send one copy of every segment, or two instances of every segment. 8 If the destination receives both data segments it will send two ACKs, since the reception of an out-of- order packet triggers an immediate ACK [12]. The second ACK will be a duplicate ACK (i.e., the value of the ACK field will be the same for both segments). Since modern Linux TCP stacks use duplicate SACKs (DSACK) to signal sequence ranges which were received more than once, the second ACK will also contain a (D)SACK block. This duplicate ACK does not falsely trigger fast recovery because it only acknowledges old data and does not indicate a hole (i.e., missing segment). However, many modern network interface controllers (NICs) use TCP Segmentation Offload- ing (TSO) [55]. This mechanism allows TCP to process segments which are larger than MSS, with the NIC taking care of breaking them into MTU-sized frames. 9 For example, if the sender- side NIC splits a segment into K on-the-wire frames, the receiver will send back 2K ACKs . If K >dupthresh and SACKs are disabled or some segments are lost, the sender will treat the du- plicate ACKs as a sign of congestion and enter a recovery mode. This is clearly undesirable since it slows down the sender and offsets Proactive’s potential latency benefits. Figure 2.8 illustrates one such a spurious retransmission. To avoid spurious retransmissions TSO is disabled for the flows that use Proactive and the receivers are enlisted to identify original/copied segments reordered by or lost in the network. 8 We use the term copies to differentiate them from duplicate segments that TCP sends during retransmission. 9 For now assume that each of these on-the-wire packets generates an ACK and that the network does not lose or reorder any messages. 28 Figure 2.8: Timeline of aProactive connection with TSO enabled that loses a segment. While theProactive copy recovers the loss, the sender retransmits the segment due to three dupli- cate ACKs. Specifically, the sender marks the copied segments by setting a flag in the header using one of the reserved but unused bits. Then, a receiver processes incoming packets as follows. If the flag is set, the packet is only processed if it was not received before (otherwise it is dropped). In this case an ACK is generated. If the flag is not set, the packet will be processed if it was not received before or if the previous packet carrying the same sequence did not have the flag set either. These rules will prevent the generation of duplicate ACKs due to copied segments while allowing duplicate ACKs that are due to retransmitted segments. 
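The receiver-side rule just described can be summarized in a few lines. The sketch below only illustrates the decision logic (the flag name and the state tracking are ours); the real change lives in the kernel's receive path.

class ProactiveReceiver:
    # seen maps a sequence number to the copy flag of the latest packet
    # observed for that sequence (the flag is the reserved header bit set
    # on copied segments).
    def __init__(self):
        self.seen = {}

    def should_process(self, seq, copy_flag):
        prev = self.seen.get(seq)      # None if this sequence is new
        self.seen[seq] = copy_flag     # remember the flag of this arrival
        if copy_flag:
            # A flagged copy is processed (and ACKed) only if the sequence
            # was not received before; otherwise it is dropped silently.
            return prev is None
        # An unflagged packet is processed if the sequence is new, or if the
        # previous packet for this sequence was unflagged too (a genuine
        # retransmission that should elicit a duplicate ACK).
        return prev is None or prev is False

rx = ProactiveReceiver()
rx.should_process(1000, copy_flag=False)  # True: original processed and ACKed
rx.should_process(1000, copy_flag=True)   # False: redundant copy dropped, no dupack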
In addition to copying data segments, Proactive can be configured to copy SYN and pure ACK segments for an added level of resiliency. We implemented Proactive in Linux kernels 2.6 and 3.3 with the new module comprising 358 lines of code, or1.6% of the Linux TCP codebase. 29 2.7 The Role of Middleboxes We aim to make our modules usable for most connections in today’s Internet, despite on-path middleboxes [55]. Reactive is fully compatible with middleboxes since all Reactive packets are either retransmissions of previously sent packets or the next in-order segment. Proactive uses reserved bits in the TCP header for copied segments which can trigger middleboxes that dis- card packets with non-compliant TCP flags. However, in our experience, possibly due to the widespread use of reserved bits and the position of frontend servers relative to middleboxes, we did not observe this effect in practice. Corrective introduces substantial changes to TCP which could lead to compatibility issues with middlebox implementations that are unaware of the Corrective functionality. Our goal is to ensure a graceful fallback to standard TCP in situations where Corrective is not usable. Even if hosts negotiate Corrective during the initial handshake, it is possible for a middlebox to strip the option from a later packet. To be robust to this, if either host receives a packet without the option, it discards the packet and stops using Corrective for the remainder of the connection, so hosts don’t confuse Corrective packets with regular data packets. Some middleboxes translate sequence numbers to a different range [55], and so Corrective uses only relative sequence numbers to convey metadata (such as the encoding range) between endpoints. We have also designed, but have not yet fully implemented, solutions for other middlebox issues. Some devices rewrite the ACK number for a recovered sequence since they have not seen this sequence before. To solve this problem, the sender would retransmit the recovered sequence, even though it is not needed by the other endpoint anymore, to plug this “sequence hole” in the state of the middlebox. Solutions to other issues include Corrective checksums to detect if a middlebox rewrites payloads 30 for previously seen sequences, as well as introducing additional identifier information to the Corrective option to cope with packet coalescing or splitting. We could have avoided middlebox issues by implementing Corrective above or below the transport layer. Integrating it into TCP made it easier to leverage TCP’s option negotiation (so connections can selectively use Corrective) and its RTT estimates (so that the Corrective packet transmission can be timed correctly). It also eased buffer management, since Corrective can leverage TCP’s socket buffers; this is especially important since buffer space is at a premium in production Web servers. Ideally, middlebox implementations would be extended to be aware of our modules. For Proactive, adding support for the TCP flag used is sufficient. Corrective on the other hand re- quires new logic to distinguish between regular payloads and Corrective-encoded payloads based on the Corrective option and flags used. In particular, stateful middleboxes need this functionality to properly update the state kept for Corrective-enabled connections. 2.8 Evaluation Next we evaluate the performance gains achieved by Reactive, Corrective, and Proactive in our experiments. 
We begin with results from a combined experiment running Proactive for backend connections and Reactive for client connections. We then describe detailed experiments for each of the mechanisms in order of increasing aggressiveness. First, we describe our experimental setup. 31 2.8.1 Experimental Setup We performed all of our Web server experiments with Reactive and Proactive in a production data center that serves live user traffic for a diverse set of Web applications. The Web servers run Linux 2.6 using default settings, except that ECN is disabled. The servers terminate user TCP connections and are load balanced by steering new connections to randomly selected Web servers based on the server and client IP addresses and ports. Calibration measurements over 24-hour periods show that SNMP and HTTP latency statistics agree within 0.5% between individual servers. This property permits us to run N-way experiments concurrently by changing TCP configurations on groups of servers. A typical A/B experiment runs on four or six servers with half of them running the experimental algorithm while the rest serve as the baseline. Note that multiple simultaneous connections opened by a single client are likely to be served by Web servers with different A/B configurations. These experiments were performed over several months. The primary latency metric that we measure is response time (RT) which is the interval be- tween the Web server receiving a client’s request to the server receiving an ACK for the last packet of the response. We are also interested in retransmission statistics and the overhead from each scheme. Linux is a fair baseline comparison because it implements the latest loss recovery techniques in TCP literature and IETF RFCs. 10 10 This includes SACK, F-RTO, RFC 3517, limited-transmit, dynamic duplicate ACK threshold, reordering detec- tion, early retransmit algorithm, proportional rate reduction, FACK based threshold recovery, and ECN. 32 2.8.2 End-to-end Evaluation Since our overarching goal is to reduce latency for Web transfers in real networks, we first present our findings in experiments using both Reactive and Proactive in an end-to-end setting. The end-to-end experiment involves multi-stage TCP flows as illustrated in Figure 2.5. The backend server resides in the same data center described in Section 2.2, but user requests are directed to nearby frontend servers that then forward them to the backend server. The connections between the backend and frontend servers use Proactive, while the connections between the end users and the frontend nodes use Reactive. 11 The baseline servers used standard TCP for both backend and client connections. We measure RT, which includes the communication between the frontend and backend servers. Table 2.2 shows that, over a two-day period, the experiment yielded a 14% reduction in average latency and a substantial 37% improvement in the 99th percentile, We noticed that the baseline retransmission rate over the backend connections was 5.5% on Day 1 of the experiment and 0.25% on Day 2. The redundancy added by Proactive effectively reduced the retransmission rate to 0.02% for both days. Correspondingly, the mean response time reduction on Day 1 was 21% (48% for the 99th percentile) and 4% on Day 2 (9% for the 99th percentile). Results from an- other 15-day experiment between a different frontend-backend server pair demonstrated a 23.1% decrease in mean response time (46.7% for the 99th percentile). 
The sample sizes for the sec- ond experiment were2.6 million queries while the retransmission rates for the baseline and Proactive were 0.99% and 0.09%, respectively. 11 For practical reasons, we did not include Corrective in this experiment as it requires changes to client devices that we did not control. 33 Quantile Linux Proactive + Reactive 25 362 -5 -1% 50 487 -11 -2% 90 940 -173 -18% 99 5608 -2058 -37% Mean 700 -99 -14% Sample size 186K 243K Table 2.2: RT comparison (in ms) for Linux baseline andProactive combined withReactive. The two rightmost columns show the relative latency w.r.t the baseline. This experiment was enabled only for short Web transfers, due to its increased overhead. Taken in perspective, such a latency reduction is significant: consider that an increase in TCP’s initial congestion window to ten segments—a change of much larger scope—improved the average latency by 10% [43]. We did not measure the impact of 23% response latency reduction on end-user experience. Emulations with Corrective in section 2.8.4.2, show the browser’s render start time metric. Ultimately, user latency depends not just on TCP but also on how browsers use the data – including the order that clients issue requests for style sheets, scripts and images, image scaling and compression level, browser caching, DNS lookups and so on. TCP’s job is to deliver the bits as fast as possible to the browser. To understand where the improvements come from, we elaborate on the performance of each of the schemes in the following subsections. 2.8.3 Reactive Using our production experimental setup, we measured Reactive’s performance relative to the baseline in Web server experiments spanning over half a year. The results reported below repre- sent a week-long snapshot. Both the experiment and baseline used the same kernels, which had an option to selectively enable Reactive. Our experiments included the two flavors of Reactive 34 Google Web Search Images Google Maps Quantile Linux Reactive Linux Reactive Linux Reactive 25 344 -2 -1% 74 0 59 0 50 503 -5 -1% 193 -2 -1% 155 0 90 1467 -43 -3% 878 -65 -7% 487 -18 -3% 99 14725 -760 -5% 5008 -508 -10% 2882 -290 -10% Mean 1145 -32 -3% 471 -29 -6% 305 -14 -4% Sample size 5.7M 5.7M 14.8M 14.8M 1.64M 1.64M Table 2.3: Response time comparison (in ms) of baseline Linux vs. Reactive. The Reactive columns shows relative latency w.r.t. the baseline. 0 2 4 6 8 10 Search Images Video (T) Latency improvement (%) Verizon Movistar Figure 2.9: Average latency improvement (in %) of HTTP responses withReactive vs. base- line Linux for two mobile carriers. Carriers and Web applications are chosen because of their large sample size. discussed above, with and without loss detection support. The results reported here include the combined algorithm with loss detection. All other algorithms such as early retransmit and FACK based recovery are present in both the experiment and baseline. Table 2.3 shows the percentiles and average latency improvement of key Web applications, including responses without losses. The varied improvements are due to different response-size distributions and traffic patterns. For example, Reactive helps the most for Images, as these are served by multiple concurrent TCP connections which increase the chances of tail segment losses. 12 There are two takeaways: the average response time improved up to 6% and the 99th 12 It is common for browsers to use four to six connections per domain, and for Web sites to use multiple subdomains for certain Web applications. 
35 Retransmission type Linux Reactive Total # of Retransmission 107.5M -7.3M -7% Fast Recovery events 5.5M +2.7M +49% Fast Retransmissions 24.7M +8.2M +33% Timeout-based Retransmissions 69.3M -9.4M -14% Timeout On Open 32.4M -8.3M -26% Slow Start Retransmissions 13.5M -6.2M -46% cwnd undo events 6.1M -3.7M -61% Table 2.4: Retransmission statistics in Linux and the corresponding delta in the Reactive experiment. Reactive results in 14% fewer timeouts and converts them to fast recovery. percentile improved by 10%. Also, nearly all of the improvement for Reactive is in the latency tail (post-90th percentile). Figure 2.9 shows the data for mobile clients, with an average improvement of 7.2% for Web search and 7.6% for images transferred over Verizon. The reason for Reactive’s latency improvement becomes apparent when looking at the differ- ence in retransmission statistics shown in Table 2.4—Reactive reduced the number of timeouts by 14%. The largest reduction in timeouts is when the sender is in the Open state in which it receives only in-sequence ACKs and no duplicate ACKs, likely because of tail losses. Cor- respondingly, RTO-triggered retransmissions occurring in the slow start phase reduced by 46% relative to baseline. Reactive probes converted timeouts to fast recoveries, resulting in a 49% increase in fast recovery events. Also notable is a significant decrease in the number of spuri- ous timeouts, which explains why the experiment had 61% fewer cwnd undo events. The Linux TCP sender [117] uses either DSACK or timestamps to determine if retransmissions are spuri- ous and employs techniques for undoing cwnd reductions. We also note that the total number of retransmissions decreased 7% with Reactive because of the decrease in spurious retransmissions. 36 We also quantified the overhead of sending probe packets. The probes accounted for 0.48% of all outgoing segments. This is a reasonable overhead even when contrasted with the overall retransmission rate of 3.2%. 10% of the probes are new segments and the rest are retransmissions, which is unsurprising given that short Web responses often do not have new data to send [43]. We also found that, in about 33% of the cases, the probes themselves plugged the only hole, and the loss detection algorithm reduced the congestion window. 37% of the probes were not necessary and resulted in a duplicate acknowledgment. A natural question that arises is a comparison of Reactive with a shorter RTO such as 2 RTT. We did not shorten the RTO on live user tests because it induces too many spurious retransmissions that impact user experience. Tuning the RTO algorithm is extensively studied in literature and is complementary to Reactive. Our own measurements show very little room exists in fine-tuning the RTO estimation algorithm. The limitations are: 1) packet delay is becoming hard to model as the Internet is moving towards wireless infrastructure, and 2) short flows often do not have enough samples for models to work well. 2.8.4 Corrective In contrast to Reactive and Proactive, we have not yet deployed Corrective in our production servers since it requires both server and client support. We evaluate Corrective in a lab environ- ment. 2.8.4.1 Isolated Flows Experimental setup. We directly connected two hosts that we configured to use the Corrective module. We used thenetem module to emulate a 200 ms RTT between them and emulated both 37 fixed loss rates and correlated loss. 
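For reference, the correlated loss pattern used in these emulations corresponds to a simple two-state ("burst") model; the parameters below match the description in the next paragraph. The sketch is ours and purely illustrative: the actual experiments configured this behavior in netem rather than in application code.

import random

P_LOSS = 0.01        # drop probability when the previous packet got through
P_BURST = 0.5        # drop probability when the previous packet was lost

def drop_sequence(num_packets, seed=0):
    rng = random.Random(seed)
    dropped, prev_lost = [], False
    for _ in range(num_packets):
        p = P_BURST if prev_lost else P_LOSS
        lost = rng.random() < p
        dropped.append(lost)
        prev_lost = lost
    return dropped

# Roughly 1% of packets start a burst and each burst has an expected
# length of two packets, giving an overall loss rate of about 2%.
losses = drop_sequence(100000)
print(sum(losses) / len(losses))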
In correlated loss scenarios, each packet initially had a drop probability of 0.01, and we raised the loss probability to 0.5 if the previous packet in the burst was lost. We chose these parameters to approximate the loss patterns observed in the data collection described earlier (see Section 2.2). We used netperf to evaluate the impact of Corrective on various types of connections. We ran each experiment 10,000 times with Corrective disabled (baseline) and 10,000 times with it enabled. All percentiles shown in tables for this evaluation have margins of error < 2% with 95% confidence.

Corrective substantially reduces the latency for short bursts in lossy environments. In Table 2.5a we show results for queries using 40 byte request and 5000 byte response messages, similar to search engine queries. These queries are isolated, which means that the hosts initiate a TCP connection, then the client sends a request, and the server responds, after which the connection closes. Table 2.5b gives results for pre-established TCP connections; here we measure latency from when the client sends the request. Both tables show the relative latency improvement when using Corrective for correlated losses and for a fixed loss rate of 2%.

  (a) Short transmission with connection establishment (initial handshake, 40 byte request, 5000 byte response)
      Quantile    Random    Correlated
      50             0%         0%
      90           -28%       -24%
      99             0%       -15%
      Mean           -8%        -4%

  (b) Short transmission without connection establishment (40 byte request, 5000 byte response)
      Quantile    Random    Correlated
      50             0%         0%
      90           -37%         0%
      99           -52%       -29%
      Mean          -13%        -7%

  (c) Long transmission without connection establishment (40 byte request, 50000 byte response)
      Quantile    Random    Correlated
      50           -13%         0%
      90           -10%         0%
      99            -5%        -9%
      Mean          -10%        -1%

Table 2.5: Latency reduction with Corrective for random and correlated loss patterns under varying connection properties.

When we include handshakes, Corrective reduces average latency by 4–10%, depending on the loss scenario, and reduces 90th percentile latency by 18–28%. Because hosts negotiate Corrective as part of the TCP handshake, SYN losses are not corrected, which can lead to slow queries if the SYN is lost. If we pre-establish connections, as would happen when a server sends multiple responses over a single connection, Corrective packets cover the entire flow, and in general Corrective provides high latency reductions in the 99th percentile as well.

Existing work demonstrates that transmitting all SYN packets twice can reduce latency in cases of SYN loss [135]. 13 For our correlated loss setting, on queries that included handshakes, we found that adding redundant SYN transmissions to standard TCP reduces the 99th percentile latency by 8%. If we use redundant SYN transmission with Corrective, the combined reduction reaches 17%, since the two mechanisms are complementary.

13 These redundant transmissions are similar to Proactive applied only to the SYN packet.

Corrective provides less benefit over longer connections. Next, using established connections, we evaluate Corrective's performance when transferring larger responses. While still reducing 90th percentile latency by 7% to 10% (Table 2.5c), Corrective provides less benefit in the tail on these large responses than it did for small responses. The benefits diminish as the minimum number of RTTs necessary to complete the transaction increases (due to the message size). As a result, the recovery of losses in the tail no longer dominates the overall transmission time.
Corrective is better suited for small transfers common to today's Web [43].

2.8.4.2 Web Page Replay

Experimental setup. In addition to the synthetic workloads, we used the Web-page-replay tool [2] and dummynet [30] to replay resource transfers for actual Web page downloads through controlled, emulated network conditions. We ran separate tests for Web pages tailored for desktop and mobile clients. The tests for desktop clients emulated a cable connection with 5 Mbit/s downlink and 1 Mbit/s uplink bandwidth and an RTT of 28 ms. The tests for mobile clients emulated a 3G mobile connection with 2 Mbit/s downlink and 1 Mbit/s uplink bandwidth and an RTT of 150 ms. 14 In all tests, we simulated correlated losses as described earlier.

14 The connection parameters are similar to the ones used by http://www.webpagetest.org.

We tested a variety of popular Web sites, and Corrective substantially reduced the latency distribution in all cases. We limit our discussion to two representative desktop Web sites, a simple page with few resources (Craigslist, 5 resources across 5 connections, 147 KB total) and a content-rich page requiring many resources from a variety of providers (New York Times, 167 resources across 47 connections, 1387 KB total).

Figure 2.10 shows the cumulative latency distributions for these websites requested in a desktop environment. The graphs confirm that Corrective can improve latency significantly in the last quartile. For example, the New York Times website takes 15% less time until the first objects are rendered on the screen in the 90th percentile. Corrective significantly improves performance in a lossy mobile environment as well. For example, fetching the mobile version of the New York Times website takes 2793 ms instead of 3644 ms (-23%) in the median, and 3819 ms instead of 4813 ms (-21%) in the 90th percentile.

Figure 2.10: CDFs (y ≥ 0.7) for Web site downloads on a desktop client with a cable connection and a correlated loss pattern, showing (a) completion time for Craigslist, (b) completion time for New York Times, and (c) render start time for New York Times, each for baseline Linux and Linux with Corrective. The first label on each x-axis describes the ideal latency observed in a no-loss scenario.

  Quantile       Linux    Proactive
  25               372     -9    (-2%)
  50               468    -19    (-4%)
  90               702    -76   (-11%)
  99              1611   -737   (-46%)
  Mean             520    -65   (-13%)
  Sample size     260K    262K

Table 2.6: Round-trip comparison (in ms) for Linux baseline and Proactive. The Proactive column shows the relative latency vs. the baseline. Proactive was enabled only for short Web transfers, due to its increased overhead.

2.8.5 Proactive

While Section 2.8.2 presented results when Reactive in the client connection is used in conjunction with Proactive in the backend connection, in this section we report results using only Proactive in the backend connections. We conducted the experiments in production datacenters serving live user traffic, as described in Section 2.8.1. Table 2.6 presents the reduction in response time that Proactive achieves for short Web transfers by masking many of the TCP losses on the connection between the CDN node and the backend server. Specifically, while the average retransmission rate for the baseline was 0.91%, the retransmission rate for Proactive was only 0.13%.
Even though the difference in retransmission rates may not seem significant, especially since the baseline rate is already small, Proactive reduces tail latency (99-th percentile) by 46%. What is not obvious from Table 2.6 is the sensitivity of response time to losses and conse- quently the benefit that Proactive brings by masking these losses. The performance difference between the two days of the experiment in Section 2.8.2 hinted at this sensitivity. Here we report results across one week, allowing a more systematic evaluation of the relationship between base- line retransmission rate and response time reduction. Figure 2.11 plots the reduction in response time as a function of the baseline retransmission rate. Even though the baseline retransmission 42 0 5 10 15 20 0.6 0.7 0.8 0.9 1 1.1 1.2 RT Reduction (%) Baseline Retransmission rate (%) Figure 2.11: Reduction in response time achieved by Proactive as a function of baseline retransmission rate. rate increases only modestly across the graph, Proactive’s reduction of the average response time grows from 6% to 20%. 2.9 Discussion The 1-RTT Recovery Ideal. Even if the mechanisms described in this chapter do not achieve the ideal, they make significant progress towards the 1-RTT recovery ideal articulated in Section 2.1. We did not set out to conquer that ideal; over the years, many loss recovery mechanisms have been developed for TCP, and yet, as our measurements show, there was still significant room for improvement. An open question is: is it possible to introduce enough redundancy in TCP (or a clean-slate design) to achieve 1-RTT recovery without adding instability, in order to effectively scale the recovery mechanisms with growing bandwidths? When should Gentle Aggression be used? A transport’s job is to provide a fast pipe to the applications without exposing complexity. In that vein, the level of aggression that makes use of fine grained information like RTT or loss is best decided by TCP – examples are Reactive and the fraction of extra Corrective packets. At a higher level, the application decides whether to enable Proactive or Corrective, based on its knowledge of the traffic mix in the network. 43 Multi-stage connections. Our designs leverage multi-stage Web access to provide different levels of redundancy on client and backend connections. Some of our designs, like Proactive (but not Corrective or Reactive), may become obsolete if Web accesses were engineered differently in the future. We see this as an unlikely event: we believe the relationship between user perceived latency and revenue is fundamental and becoming increasingly important with the use of mobile devices, and so the large, popular Web service providers will always have incentive to build out backbones in order to engineer low latency. Loss patterns. We based the designs of our TCP enhancements on loss patterns observed in today’s Internet. How likely is it that these loss patterns will persist in the future? First, we note that at least one early study pointed out that a significant number (56%) of recoveries incurred RTOs [14], so at least one of our findings appears to have existed over a decade and a half ago. Second, networks that use network-based congestion management, flow isolation, Explicit Congestion Notification, and/or QoS can avoid most or all loss for latency critical traffic. Such networks exist but are rare in the public Internet. In such environments, tail losses may be less common, making the mechanisms in this chapter less useful. 
In these settings, Reactive is not detrimental since it responds only on an impending timeout, and Corrective can also be adapted to have this property. So long as there is loss, these techniques help trim the tail of the latency distribution, and Proactive could still be used in targeted environments. Moreover, while such AQM deployments have been proposed over the decades, history suggests that we are still many years away from a loss-free Internet. Coexistence with legacy TCP. In our large scale experiments with Reactive and Proactive, clients were served with a mix of experiment and baseline traffic. We monitored the baseline with and without the experiment and observed no measurable difference between the two. This is 44 not surprising: even though Proactive doubles the traffic, it does so for a small fraction of traffic without creating instabilities. Likewise, the fraction of traffic increased by Reactive is smaller than 0.5% – comparable to connection management control traffic. Both Reactive and Corrective, which we recommend using over the public Internet, are well-behaved in that they appropriately reduce the congestion window upon a loss event even if the lost packet is recovered. Corrective increases traffic by an additional 10%, similar to adding a new flow(s) on the link; since emulation is unlikely to give an accurate assessment of the impact of this overhead on legacy TCP, we hope that future efforts lead to a large-scale deployment of Corrective. 2.10 Conclusion Ideally packet loss recovery would take no more than one RTT. We are far from this ideal with today’s TCP loss recovery. Reactive, Corrective and Proactive are simple, practical, easily de- ployable, and immediately useful mechanisms that progressively move us closer to this ideal by judiciously adding redundancy. In some cases, they can reduce 99th percentile latency by 35%. Reactive is enabled by default in mainline Linux. Our plan is to also integrate the remain- ing mechanisms in mainline operating systems such as Linux, with the aim of making the Web faster. 45 Chapter 3 Breaking Down TCP Latency Worldwide Delivering content to consumers quickly is a key objective for content providers. Most providers deliver data using TCP, which needs to support efficient and low-latency transport. In this study, based on packet captures collected worldwide over a 6-year period, we investigate situations in which TCP incurs high latency. We develop a methodology to break down the delay incurred by a packet into components attributable to propagation delay and cross-traffic, loss recovery, and queuing. We also investigate the degree to which queuing delays slow TCP’s loss recovery. We analyzed packet traces of 10-second flows captured across the globe, and find that many of them see packet delivery times of one second or more, with large regional differences, and with queuing being a key cause of delay. These findings support the need for continuous measurement-driven transport optimizations, since variations across regions and time mean that an optimization can have a large impact in one setting but not others. 46 3.1 Introduction Delivering a Web page or a video stream to a consumer quickly is an important objective for content providers. This requires a transport fabric and protocol that achieves end-to-end commu- nication without excessive delays. While TCP often meets this demand, the previous chapter and past work demonstrated that occasionally connections can observe high delays [48, 149]. 
High delays can stem from different reasons. Two well known factors that affect latency are packet loss and standing queues (“bufferbloat”) [48]. Packet loss requires loss detection and recovery when a reliable transport protocol like TCP is used. Bufferbloat inflates the end-to-end delay due to the additional time that packets remain in queues. The previous chapter and earlier studies proposed solutions to better deal with loss (e.g., us- ing redundancy for faster loss detection and recovery) and bufferbloat (e.g., using delay-based congestion control [24]), but did not try to fully understand how and why the performance de- teriorates when observing a specific pathology. As a result, these solutions might be incomplete and leave a much larger potential for improvement. For example, we argued earlier that delays increase when a loss causes retransmission timeouts, but did not look into why loss delays vary widely. As we will show, packet loss is only part of the equation. Thus, before we can design solutions to improve a transport protocol’s performance we need to better understand: where a transmission is ”wasting time”; what the sources of delays are (instead of handling symptoms only); and how multiple sources interact (if at all). For loss, we need to understand how connections end up in states where recovery takes hundreds of RTTs. In contrast to past studies, instead of investigating packet loss in isolation, we examine the degree to which other contributing factors impact performance. 47 In addition to getting a better understanding about pathologies of individual connections, we also want to determine the magnitude of the impact of pathologies across space and time, to enable network operators to prioritize problems with the biggest performance impact in different regions of the world, and to assess how pathologies have evolved. Overall this study will answer the following questions: (a) What are the contributors to packet latency across the Internet? (b) Why does packet loss result in a wide distribution of recovery delays? (c) What is the impact of queuing on packet delay? (d) How do delays evolve over time? (e) Are there regional differences? Central to this chapter is a methodology for breaking down the latency experienced by a packet into several components, accounting for delays due to propagation, queuing, and loss recovery. The latter we break down further into slowdowns arising from delayed retransmission triggers and inflated round-trip time estimates affecting timers. Using this methodology, together with a global dataset of TCP packet captures collected over seven years, we find that queuing accounts for 33% of tail delay in the median of all connections, in addition to drastically slowing down loss detection and recovery whenever retransmissions are necessary. That said, we observe large regional differences, with queuing delays almost non- existent in some autonomous systems (ASes) while dominating in others. Each of the delay components sometimes introduces latencies that are up to two orders of magnitude larger than a connection’s minimum round-trip time. Finally, in some regions, delays have decreased over time, whereas others see the reverse trend. Our findings suggest that TCP enhancements moti- vated by measurements within a region (say North America) may not perform well globally at all times, suggesting a continuous measurement-driven approach for TCP enhancements. 
Figure 3.1: Sample flows and delay breakdown (vertical intervals and number bullets). Each flow is represented by data packets (solid lines) and ACKs (dashed lines). The three sample flows (a – c) are affected by queuing delay. In addition, (b) requires a fast retransmission, and (c) requires a timeout-based recovery.

3.2 Methodology

Instead of breaking down delays for every single data packet, we focus our analysis on the tail performer in each connection, that is, the packet 1 with the highest delay between the initial transmission and its acknowledgment (ACK).

This focus lets us reason about the time frame in which a connection is experiencing a severe performance degradation. For example, video streaming services rely on long-living flows like the ones we analyze and are usually able to deal with some goodput fluctuation. However, excessive delays can cause buffers to drain, degrading quality of experience for the user [37]. And even for shorter transfers, like HTTP objects, tail delays can have a large impact. For example, if HTTP pipelining is used, latency-induced head-of-line blocking can delay the delivery of objects.

Since our dataset comprises only 10-second transfers, latency characteristics might differ from the ones shorter flows see. To capture short-flow behavior, we also examine the tail performer among the first 100 KB of data sent. 2

1 More precisely, we mean the sequence of bytes carried by the original packet as well as all its retransmissions.
2 A middle ground between a Web page's median HTML transfer size (30 KB) and median JS transfer size (275 KB) [56].

3.2.1 Delay Components

Once the sender's transport protocol transmits a packet, delay can arise at two places:

1. The network introduces a propagation delay. In addition, intermediate routers or switches may introduce queuing delay before forwarding the packet.
2. Packet loss results in delays introduced by the sender's transport protocol to detect and recover from the loss.

At the application layer, or from a packet capture, we do not have visibility into the transport layer's or the network's behavior. This makes it challenging to figure out where delays are introduced. In addition, the location where we see delay is not necessarily the source. For example, even though the network can introduce queuing delay, the sender could have caused the queue to fill up by sending packets faster than the bottleneck rate. To improve the performance of a sender's transport protocol, we want to find these delay sources. To achieve this, we break down the delay of the tail performer into five categories. Below we define these categories and describe how we attribute fractions of the overall delay to each of the components. Figure 3.1 illustrates example delay breakdowns.

Base delay. This delay solely depends on the propagation delay, as well as sustained interference from other flows. We group these two types of delay together since we are primarily interested in delays that a sender's transport protocol can influence. We use the minimum RTT observed in the whole connection as an estimate for the base delay.

Loss recovery delay.
If a packet is dropped by the network, TCP waits for a duplicate acknowl- edgement (dupack) or a timeout to detect the loss, before sending a retransmission. To avoid 50 congestion collapse, TCP only retransmits a packet once we are sure that it is no longer travers- ing the network [61]. Overall, we use the delay between the initial transmission and the final retransmission as the estimate of the recovery delay. However, we exclude the delay that comes from late triggers, as described below. Late trigger delay. To ensure reliability, TCP retransmits packets once they are marked as lost. Marking requires a trigger event, which can be an incoming packet or timeout. In the ideal case, the time it takes to observe these triggers is only dependent on the base delay (e.g., the RTT for an out-of-order packet). This is captured by the loss recovery component described above. However, triggers can be late due to queuing delays seen earlier in the connection, not necessarily for the packet we are analyzing. The computation of this delay depends on the trigger being a packet or a timeout. We now describe these cases. Delays from late packet triggers. For fast retransmissions (as well as forward and early retrans- mission [10, 86]), the trigger is a duplicate acknowledgment (dupack). Since the dupack itself is caused by the out-of-order reception of a data packet, loss recovery is delayed if that data packet was delayed as well. Thus, the late trigger delay is the queuing delay of the data packet triggering the dupack. Delays from late timeouts. With retransmission timeouts (RTOs) and tail loss probes (TLPs) [39], 3 the trigger is a timeout that can be delayed if the timeout value is inflated, or if the timer was armed late (compared to the ideal setting). The timeout value is affected by queuing delay since it is based on the RTT samples from earlier packets. To compute the fraction of the timeout delay attributable to earlier queuing, we subtract the estimated queuing delay (described below) from 3 TLPs were introduced to send an extra packet early to trigger fast recovery instead of waiting for the much longer RTO. 51 each RTT sample and recompute a new “queue-free” RTO value. Note that the new RTT esti- mates can still be inflated compared to the base delay, for example, due to additional cross traffic or delayed ACKs. The difference between the observed and queue-free RTO value is added as trigger delay. For the second case, the timer is armed late if the latest incoming ACK itself was delayed due to the corresponding data packet experiencing queuing delay. The trigger delay is therefore the queuing delay of that data packet. In the case of multiple losses (e.g., a fast retrans- mission is lost and causes an RTO) we compute the sum of all trigger delays affecting the delay of the final (successful) transmission. Queuing delay. This is the fraction of time that a packet spends in a queue due to the network load caused by this flow. To estimate it, we first compute the correlation between the number of bytes in flight and the RTT observed when sending a packet. If the correlation is sufficiently high, we can approximate the relationship between bytes in flight and RTT through a linear fit. After subtracting the base delay, we get a function which maps an observation of the number of bytes in flight to a queuing delay estimate. Other delay. 
This includes any remaining delay that is not accounted for by the previously described components (e.g., temporary queuing delays introduced by cross-traffic, or latency due to the delayed ACK mechanism [22]). 3.3 Dataset The Network Diagnostics Toolkit (NDT) is a measurement platform that collects performance metrics, including packet traces, between clients and many M-Lab servers spread across the world [82]. Since 2010, NDT collected millions of packet traces and metadata. Each trace is 52 the result of a 10-second file transfer manually triggered between a user’s machine and an M-Lab server. To derive insights about the state of the Web, we compute results based on all NDT traces recorded in March 2016 capturing over 1.7 million connections. For longitudinal trends, we used all traces between March 10 and 19 from 2011 to 2016, totalling 6.9 million connections. We only keep samples for which we can derive queuing delays through linear fitting (described earlier) with a correlation factor of at least 0.8, which applies to roughly half of all samples. 4 Limitations. Despite collecting data from clients on a global scale, the M-Lab dataset does have some limitations. NDT is an active measurement toolkit and as such requires the user to trigger the collection of data. While it is integrated into BitTorrent clients like mtorrent and Vuze to reach a larger user base, it potentially biases the data collection towards clients with connectivity problems who are more likely to run speed tests. As such, our results may over-estimate the incidence of pathologies. Nevertheless we believe that the analysis presented in the following sections provides valuable insights about sources of high latency. In the following sections we often break down results by continent, country, or AS. Our per- continent analysis includes data from all continents (except Antarctica). For the per-country and per-AS breakdowns we selected a subset of countries and ASes to illustrate the diversity of tail performer behavior. However, we made the results for all countries and the top-100 ASes (based on sample size), as well as our analysis source code available online [92]. Finally, we report median and other percentiles for many of our results denoted by m and p x (for the xth percentile) for brevity. 4 We opted for a high correlation threshold to enable noise-free estimates of queuing delays instead of minimizing the number of samples excluded from our analysis. 53 3.4 Results We derived the following key results: 1. Globally, many connections suffer tail delays of one second or more, with large regional differences (Section 3.4.1). 2. By breaking down tail delays, we find that queuing, in the median, accounts for 33% of a tail performer’s delay, compared to 12 - 14% for the remaining components. For over 5% of all flows, loss recovery, late triggers, and queuing each introduce delays that are at least a magnitude larger than the base delay (Section 3.4.2). 3. Delay compositions are non-uniform. Some networks suffer mostly from queuing whereas others are primarily affected by loss recovery delays (Section 3.4.3). Based on longitu- dinal data, delays have significantly improved in some countries and worsened in others (Section 3.4.4). 4. Tail performers can stall connections for a long time, requiring applications, like video streaming, to keep large amounts of data buffered if they want to maintain a previously achieved delivery rate (Section 3.4.5). 5. Short flows see less severe delays, and many do not experience loss. 
Yet delay distributions continue to have a long tail (Section 3.4.6).

6. Timeout-based recovery mechanisms see high delay inflation due to queuing-inflated RTT samples affecting timeout values (Section 3.4.7).

Figure 3.2: Overall distributions of the worst per-connection ACK delay, (a) per continent (Africa, Asia, Europe, North America, South America, Oceania), (b) per country (Armenia, Germany, India, Kazakhstan, Russia, United States), and (c) per AS (AS701, AS852, AS6697, AS16345, AS24608, AS198471).

3.4.1 Tail Delays Across the Globe

Figure 3.2 shows the distribution of the tail performer's ACK delays across connections at different levels of granularity. Many of the delay distributions see long tails, often representing ACK delays of multiple seconds. Broken down by continent, connections in Asia see the worst tail delays (m = 820 ms, p95 = 3,750 ms), whereas traces from North America have much lower tail latencies (m = 165 ms, p95 = 1,323 ms). We observe dissimilar distributions at the other breakdown levels as well. For example, AS701 sees low tail delays for most connections (m = 92 ms, p95 = 1,119 ms), whereas AS16345 has a median tail delay that is more than 10x worse (m = 1,290 ms, p95 = 3,450 ms).

Figure 3.3: Delay components broken down for each connection's tail performer, without a regional breakdown: (a) normalized delays (Σ = 1) and (b) relative delays (base = 1), for the base, loss recovery, late trigger, and queuing components.

Figure 3.4: Observed timeout inflation, based on the quotient of the observed timeout and the recomputed "queue-free" timeout.

Takeaway. Even at the coarse continent granularity, we see very different delay distributions, suggesting that users' Internet experiences vary widely as well. Thus, the potential for reducing latency may vary a lot across regions.

3.4.2 Mapping Latency to Delay Components

To tackle the problem of high latency, we first have to determine what is causing the delays. In this section we start by presenting our results for mapping latency to delay components from a global perspective. In the next section we look into regional differences, specifically per-AS distributions, that help explain how some ASes see a much higher tail latency than others.

  Observation                 #      Delay caused (m)             Delay caused (p95)
                                     Abs.      Norm.   Rel.       Abs.       Norm.   Rel.
  Late ACK triggered         489k     61 ms    0.27     2.1        648 ms    0.51    16.0
  Late ACK armed timer         9k     78 ms    0.12     2.7        913 ms    0.33    20.4
  Inflated timeout             9k     94 ms    0.13     3.0      1,404 ms    0.39    33.8
  Multiple late triggers       4k    284 ms    0.35     7.6      2,421 ms    0.91    59.5

Table 3.1: Metrics for connections with a non-zero loss trigger delay.
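Table 3.1 and Figure 3.4 rely on the queuing estimate and the "queue-free" timeout recomputation described in Section 3.2.1. As a rough illustration of that step, the sketch below estimates per-packet queuing delay with a linear fit of RTT against bytes in flight and then recomputes an RFC 6298-style RTO over the de-queued samples. The constants and function names are ours, and the published analysis code [92] differs in detail.

from statistics import correlation, linear_regression  # Python 3.10+

MIN_RTO = 0.2  # seconds; the Linux lower bound mentioned in Section 3.4.7

def queue_free_rto(samples, corr_threshold=0.8):
    # samples: list of (bytes_in_flight, rtt_seconds) pairs for one connection.
    inflight = [s[0] for s in samples]
    rtts = [s[1] for s in samples]
    if correlation(inflight, rtts) < corr_threshold:
        return None  # cannot attribute queuing reliably; skip this connection
    slope, intercept = linear_regression(inflight, rtts)
    base = min(rtts)
    # Per-sample queuing estimate: fitted RTT at this in-flight level minus base delay.
    dequeued = [max(base, r - max(0.0, slope * f + intercept - base))
                for f, r in zip(inflight, rtts)]
    # Standard RTO computation (RFC 6298) over the queue-free samples.
    srtt, rttvar = dequeued[0], dequeued[0] / 2
    for r in dequeued[1:]:
        rttvar = 0.75 * rttvar + 0.25 * abs(srtt - r)
        srtt = 0.875 * srtt + 0.125 * r
    return max(MIN_RTO, srtt + 4 * rttvar)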
Each component’s curve has a long tail though, indicating that a significant number of flows is primarily affected by these other factors as well. The 95th percentiles are: 45% caused by the base delay, 80% by loss recovery, 49% by late triggers, and 85% by queuing. Recall though that late triggers are the result of queuing as well. In the case of packet loss, a full bottleneck queue can result in latency inflation at multiple points when recovering from a single loss. While the lost packets technically do not experience queuing delay, all subsequent packets that serve as loss indicators have to pass through a full queue (assuming drop-tail at routers). Once the retransmission is triggered, it easily can experience the same level of conges- tion at the bottleneck and get delayed by a full queue again. Detecting a loss might even require a retransmission timeout (RTO). Since the timeout value is based on recent RTT measurements, it can be heavily inflated due to queuing as well (Figure 3.4). At the 95th percentile, we observed timeout values that were three times higher compared to a timeout based on queue-free RTT samples. 57 Table 3.1 breaks down the late trigger delays. Note that timeout-based triggers are less com- mon in this dataset compared to the one used in the previous chapter. This is not surprising since large data transfers rarely experience the loss of a complete window of data transmitted, therefore enabling timeout-free loss recovery. However, when connections do experience timeouts or even multiple late triggers (e.g., a fast retransmission followed by a timeout) they can introduce very large delays. In the 95th percentile of cases with multiple late triggers, they accounted for 91% of the overall tail performer delay. To understand how much the performance of a connection degraded compared to an ideal case where each packet incurs only the base delay, we calculated the delays relative to the base delay, as shown in Figure 3.3b. Each delay component sees long tails with latency inflated by up to two orders of magnitude. In the 95th percentile, the delays incurred by loss recovery, late triggers, and queuing are respectively the 20-, 13-, and 19-fold of the connection’s base delay. Takeaway. Many connections observe high latencies due to self-inflicted queuing that not only delay packet delivery, but also slow down TCP’s mechanism to deal with loss. This compounding impact amplifies the importance of minimizing queuing in the network to reduce latency, via mechanisms like delay-dependent congestion control [24] or active queue management schemes like RED or CoDel [46, 95]. 3.4.3 Regional Differences Earlier we showed that tail delays can vary significantly depending on a client’s AS (Figure 3.2c). Figure 3.5 confirms that we also see regional differences in the normalized values for the different delay components. AS852 and AS16345 are good examples to highlight the differences. In the median, queuing accounts for 8% of the tail delay in AS852, compared to 61% in AS16345. In 58 0 0.2 0.4 0.6 0.8 1 0 0.2 0.4 0.6 0.8 1 CDF Normalized delay AS701 AS852 AS6697 AS16345 AS24608 AS198471 (a) Loss recovery delay 0 0.2 0.4 0.6 0.8 1 0 0.2 0.4 0.6 0.8 1 CDF Normalized delay AS701 AS852 AS6697 AS16345 AS24608 AS198471 (b) Late trigger delay 0 0.2 0.4 0.6 0.8 1 0 0.2 0.4 0.6 0.8 1 CDF Normalized delay AS701 AS852 AS6697 AS16345 AS24608 AS198471 (c) Queuing delay Figure 3.5: Distribution of individual delay components in selected ASes. 
In contrast, loss recovery contributes 39% to the median tail delay in AS852, compared to no loss recovery even being required in AS16345. Interestingly, tail delays in AS16345 are much higher than in AS852, showing that excessive queuing in deep buffers can hurt latency more than the need for packet recovery when a shallower queue is overflowing.

Takeaway. Sources of delay for high latency can be very dissimilar across ASes. As such, addressing one source on a global scale can have a large impact on the performance in one AS, compared to no effect in another.

3.4.4 Delay Evolution

In addition to observing regional differences in recent measurements, we also see different trends when analyzing tail performance over time. Figure 3.6 shows how tail delays evolved in three selected countries between 2011 and 2016. Each curve is based on all measurements recorded from March 10–19 in the given year (we have no data points for 2015 due to a bug in NDT's collection pipeline, which resulted in no packet captures for a 10-month period). In India and the United States, tail delays generally shrank over time. For Germany we see the opposite trend, where delays increased. One reason for this is queuing. In the 95th percentile, queuing accounted for 416 ms of a tail performer's delay in 2011, compared to 918 ms in 2016 (figure not shown).

Figure 3.6: Overall delay CDF for selected countries over time, shown for (a) Germany, (b) India, and (c) the United States.

Takeaway. Since network conditions change over time, performance in some regions might benefit from occasionally optimizing the transport layer to account for these changes. To detect regional trends, we need continuous regional measurement studies.

Figure 3.7: Buffer requirement (in kilobytes) to compensate for the tail performer's ACK delay, shown per AS.

Figure 3.8: Required buffer size normalized by the bytes already acknowledged, shown per AS.

Figure 3.9: Estimated timer values worldwide after 1 MB of data was sent per connection, for TLP and RTO timers computed with observed RTT samples and with "queue-free" samples.

                        Delay (in ms)                          Normalized delay
Component       p50    p90    p95    p98    p99        p50    p90    p95    p98    p99
Base            36     150    204    312    391        0.46   0.72   0.82   0.89   0.94
Loss recovery   0      0      0      130    416        0      0      0      0.35   0.63
Late trigger    0      0      0      16     101        0      0      0      0.03   0.23
Queuing         11     92     185    375    664        0.17   0.53   0.67   0.80   0.85
Table 3.2: Delays (in milliseconds) by component for tail performers in short flows (max. 100 KB).

3.4.5 Impact on Data Delivery Rate

To evaluate the impact of a tail performer on the actual data delivery rate to the receiver's application we look at two metrics. First, we compute how much data a receiver would need to keep buffered to compensate for the delay introduced by the tail performer while consuming data at the goodput rate achieved up to this point.
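A minimal sketch of this buffer computation (the argument names are hypothetical stand-ins for the per-connection values extracted from our traces):

def buffer_to_mask_tail_delay(bytes_delivered, elapsed_seconds, tail_delay_seconds):
    # Bytes the receiver must have buffered to keep consuming at the goodput
    # rate achieved so far while the tail performer is delayed.
    goodput = bytes_delivered / elapsed_seconds      # bytes per second up to this point
    absolute = goodput * tail_delay_seconds          # metric shown in Figure 3.7
    relative = absolute / bytes_delivered            # metric shown in Figure 3.8
    return absolute, relative

# Example: 500 KB delivered in 2 s (250 KB/s goodput); masking a 1.5 s tail
# delay requires 375 KB of buffer, i.e., 75% of the bytes delivered so far.
print(buffer_to_mask_tail_delay(500_000, 2.0, 1.5))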
For our earlier selection of ASes, a receiver needs to buffer between 20 and 178 KB in the median, and between 452 and 2,665 KB in the 95th percentile, to compensate for tail delay (Figure 3.7). Second, since goodput rates can vary widely across connections, we also compute the buffer size relative to the number of bytes that were already delivered to the receiver. In the 95th percentile, between 40 and 125% of the delivered bytes would have to be buffered in our earlier selection of ASes (Figure 3.8). The latter number means that more than the amount of already delivered data needs to be buffered.

Takeaway. A high delay for a single packet can drastically reduce the goodput of a connection. As a result, even applications that do not immediately consume all available data from a TCP connection, e.g., a video stream that maintains a buffer, can be affected by high latency.

3.4.6 Tail Performance for Short Flows

Many latency-sensitive transfers are short-lived. Since the measurement system that our dataset is based on does not test short-flow performance, we use a surrogate: we look at the performance involving only packets carrying the first 100 KB of application data. For these packets we extract and analyze the tail performer as before. On average these "early" tail performers have much lower delays than the worst packet in a long flow (Table 3.2). In particular, losses are less prevalent. However, in the tail, each of the delay sources incurs latency of hundreds of milliseconds as well.

Takeaway. Since tail latencies are high for both short- and long-lived connections, reducing them benefits a wide variety of applications, like Web transfers and video streams.

3.4.7 Timer Inflation

Our dataset consists of long flows with a relatively small number of timeouts (Section 3.4.2), but timeouts are frequently a cause for high latency in short flows (see Chapter 2). To quantify the impact of a timeout if it had happened, we recomputed timer values for RTOs and tail loss probes (TLPs) [39] after 1 MB of data was sent per connection, using the same timer logic as Linux TCP. To assess the impact of queuing, we also removed the delay we attributed to queuing and recalculated the timers. As shown in Figure 3.9, TLP timers are heavily affected by queuing delays, up to a point where they are almost as large as the RTO values and no longer achieve their goal of recovering from loss faster than an RTO would. Since the RTO has to be at least 200 ms in Linux, we only see significant RTO timer inflation in the tail.

Takeaway. Many flows see queue-induced timer inflation, which causes extra delay if a timeout occurs and prevents some connections from achieving rapid loss recovery through TLPs.

3.5 Conclusion

In this chapter, we leverage longitudinal measurement data to investigate the magnitude of and the sources for the longest packet ACK delay seen per connection. By breaking down this tail latency into delay components, we find that in the median 33% of the delay is the direct result of the corresponding packet being queued in the network. Other causes, like delayed retransmission triggers and recovering from loss, contribute delays to some flows that are more than an order of magnitude larger than the connection's minimum RTT. Finally, delays and their sources can vary wildly depending on the country or the AS that a client is located in.
In conclusion, any changes to the transport protocol or the network should be preceded by a localized root cause analysis to ensure that the changes address the actual underlying performance problem. For example, while mechanisms that can reduce network queue occupancy, like ECN, are likely of benefit for areas that suffer from excessive queuing delays, they might elicit little change elsewhere. As a final note, datasets produced by tools like NDT are a valuable source for performance analysis, but have shortcomings. NDT focuses on identifying causes for low throughput, which might not be related to situations where, for example, a user is experiencing high latency when accessing a Web site. We can overcome this limitation by extending our measurement tools to exercise a wider variety of traffic patterns that a client might observe when accessing Web resources. For NDT, this would mean recording traces not only for long-lived transfers, but also for complex, multi-source accesses of smaller objects.

Chapter 4
Diagnosing Path Inflation of Mobile Client Traffic

As mobile Internet becomes more popular, carriers and content providers must engineer their topologies, routing configurations, and server deployments to maintain good performance for users of mobile devices. Understanding the impact of Internet topology and routing on mobile users requires broad, longitudinal network measurements conducted from mobile devices. In this work, we are the first to use such a view to quantify and understand the causes of geographically circuitous routes from mobile clients, using 1.5 years of measurements from devices on four US carriers. We identify the key elements that can affect the Internet routes taken by traffic from mobile users (client location, server locations, carrier topology, carrier/content-provider peering). We then develop a methodology to diagnose the specific cause for inflated routes. Although we observe that the evolution of some carrier networks improves performance in some regions, we also observe many clients, even in major metropolitan areas, that continue to take geographically circuitous routes to content providers, due to limitations in the current topologies.

4.1 Introduction

As mobile Internet becomes more popular, carriers and content providers must engineer their topologies, routing configurations, and server deployments to maintain good performance for users of mobile devices. A key challenge is that performance changes over space and time, as users move with their devices and providers evolve their topologies. Thus, understanding the impact of Internet topology and routing on mobile users requires broad, longitudinal network measurements from mobile devices. In this work, we are the first to identify and quantify the performance impact of several causes for inflated Internet routes taken by mobile clients, based on a dataset of 901,000 measurements gathered from mobile devices during 18 months. In particular, we isolate cases in which the distance traveled along a network path is significantly longer than the direct geodesic distance between endpoints. Our analysis focuses on performance with respect to Google, a large, popular content provider that peers widely with ISPs and hosts servers in many locations worldwide. This rich connectivity allows us to expose the topology of carrier networks as well as inefficiencies in current routing. We constrain our analysis to devices located in the US, where our dataset is densest. Our key results are as follows.
First, we find that path inflation is endemic: in the last quarter of 2011 (Q4 2011), we observe substantial path inflation in at least 47% of measurements from devices, covering three out of four major US carriers. While the average fraction of samples experiencing path inflation dropped over the subsequent year, we find that one fifth of our samples continue to exhibit inflation. Second, we classify root causes for path inflation and develop an algorithm for identifying them. Specifically, we identify whether the root cause is due to the mobile carrier's topology, the peering between the carrier and Google, and/or the mapping of mobile clients to Google servers. Third, we characterize the impact of this path inflation on network latencies, which are important for interactive workloads typical in the mobile environment. We show that the impact on end-to-end latency varies significantly depending on the carrier and device location, and that it changes over time as topologies evolve. We estimate that the additional propagation delay can range from at least 5–50 ms, which is significant for service providers [72]. We show that addressing the source of inflation can reduce download times by hundreds of milliseconds. We argue that it will become increasingly important to optimize routing as last-mile delays in mobile networks improve and the relative impact of inflation becomes larger.

4.2 Background

As Internet-connected mobile devices proliferate, we need to understand factors affecting Internet service performance from mobile devices. In this chapter, we focus on two factors: the carrier topology, and the routing choices and peering arrangements that mobile carriers and service providers use to provide access to the Internet. The device's carrier network can have multiple Internet ingress points — locations where the carrier's access network connects to the Internet. The carrier's network may also connect with a Web service provider at a peering point — a location where these two networks exchange traffic and routes. The Domain Name System (DNS) resolvers from (generally) the carrier and the service provider combine to direct the client to a server for the service by resolving the name of the service to a server IP address.

Figure 4.1: Optimal routing for mobile clients.

Idealized Operation. This chapter focuses on Google as the service provider. To understand how mobile devices access Google's services, we make the following assumptions about how Google maps clients to servers to minimize latency. First, Google has globally distributed servers, forming a network that peers with Internet service provider networks widely and densely [50, 73]. Second, Google uses DNS to direct clients (in our case, mobile devices) to topologically nearby servers. Last, Google can accurately map mobile clients to their DNS resolvers [84]. Since its network's rich infrastructure aims at reducing client latency, Google is an excellent case study to understand how carrier topology and routing choices align with Google's efforts to improve client performance. We use Figure 4.1 to illustrate the ideal case of a mobile device connecting to a Google server. A mobile device uses DNS to look up www.google.com. Google's resolver returns an optimal Google destination based on a resolver-server mapping. Traffic from the device traverses the carrier's access network, entering the Internet through an ingress point.
Ideally, this ingress point is near the mobile device's location. The traffic enters Google's network through a nearby peering point and is routed to the server. In this chapter, we identify significant deviations from this idealized behavior. Specifically, we are interested in metro-level path inflation [128], where traffic from a mobile client to a Google server exits the metropolitan (henceforth metro) area even though Google has a presence there. This metro-level inflation impacts performance by increasing latency.

Example Inflation. Carrier topology determines where traffic from mobile hosts enters the carrier network. Prior work has suggested that mobile carriers have relatively few ingress points [140]. Therefore, traffic from a client in the Los Angeles area may enter the Internet in San Francisco because the carrier does not have an ingress in Los Angeles. If the destination service has a server in Los Angeles, the topology can add significant latency compared to having an ingress in LA. Routing configurations and peering arrangements can also cause path inflation. As providers move services to servers located closer to clients, the location where carriers peer with a provider's network may significantly affect performance. For instance, if a carrier has ingress points in Seattle and San Francisco, but peers with a provider only in San Francisco, it may route Seattle traffic to San Francisco even if the provider has a presence in Seattle.

4.3 Dataset

Data Collected. Our data consists of network measurements (ping, traceroute, HTTP GET, UDP bursts, and DNS lookups) issued from Speedometer, an internal Android app developed by Google and deployed on thousands of volunteer devices. Speedometer conducts approximately 20–25 measurements every five minutes, as long as the device has sufficient remaining battery life (80%) and is connected to a cellular network. 1 Our analysis focuses on measurements toward Google servers, including 310K traceroutes, 300K pings, and 350K DNS lookups issued in three three-month periods (2011 Q4, 2012 Q2, and 2012 Q4). We focus on measurements issued by devices in the US, where the majority of users are located, with a particular density of measurements in areas with large Google offices. All users running the app have consented to sharing collected data in an anonymized form. 2 Some fields are stripped (e.g., device IP addresses, IDs), others are replaced by hash values (e.g., HTTP URLs). Location data is anonymized to the center of a region that contains at least 1000 users and is larger than 1 km².

1 The app source is available at: https://github.com/Mobiperf/Speedometer
2 Google's privacy and legal teams reviewed and approved data anonymization and release.

The above measurements are part of a dataset that we published to a Google Cloud Storage bucket and released under the Creative Commons Zero license. 3 We also provide Mobile Performance Maps, a visualization tool to navigate parts of the dataset, understand network performance, and supplement the analysis in this chapter: http://mpm.cs.usc.edu.

3 http://commondatastorage.googleapis.com/speedometer/README.txt

Finding Ingress Points. In order to identify locations of ingress points, for each carrier, we graphed the topology of routes from mobile devices to Google, as revealed by the traceroutes in our dataset. We observe that traceroutes from clients in the same regions tend to follow similar paths. We used the DNS names of routers in those paths to identify the location of hops at which they enter the public Internet.
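A simplified sketch of this hostname-based location inference, assuming a hand-curated mapping from airport-code hints to metro areas (the actual mapping and the carriers' router naming conventions are more involved and are not reproduced here):

import re

# Hypothetical location hints; real router hostnames vary by carrier.
AIRPORT_TO_METRO = {"sfo": "San Francisco", "lax": "Los Angeles",
                    "sea": "Seattle", "chi": "Chicago", "lga": "New York"}

def metro_from_hostname(hostname):
    # Return the metro area hinted at by a router's DNS name, if any.
    for label in re.split(r"[.\-]", hostname.lower()):
        for code, metro in AIRPORT_TO_METRO.items():
            if label.startswith(code):
                return metro
    return None

def ingress_metro(resolved_hop_names):
    # Metro of the first resolvable public hop on a traceroute, which we
    # treat as the carrier's ingress point.
    for name in resolved_hop_names:
        metro = metro_from_hostname(name)
        if metro:
            return metro
    return None

# Example with a made-up hostname carrying an "sfo" hint:
print(metro_from_hostname("xe-0-1-0.sfo21.example-carrier.net"))  # -> San Francisco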
In general, the traceroutes form well-defined structures, starting with private or unresolvable addresses, where all measurements from a given region reach the Internet in a single, resolvable location, generally a point of presence of the carrier's backbone network. We define this location as the ingress point.

Finding Peering Points. To infer peering locations between the carriers and Google, we identified for each path the last hop before entering Google's network, and the first hop inside it (identified by an IP address from Google's blocks). Using location hints in the hostnames of those hop pairs, we infer peering locations for each carrier [124]. In cases where the carrier does not peer with Google (i.e., sends traffic through a transit AS), we use the ingress to Google's network as the inferred peering location.

           AT&T    Sprint   T-Mobile   Verizon
Q4 2011    0.98    0.10     0.65       0.47
Q2 2012    0.98    0.21     0.25       0.15
Q4 2012    0.00    0.21     0.20       0.38
Table 4.1: Fraction of traceroutes from major US carriers with metro-level inflation.

4.4 A Taxonomy of Inflated Routes

Types of Path Inflation. Table 4.1 shows, for traceroutes in our dataset from the four largest mobile carriers in the US, the fraction of routes that incurred a metro-level path inflation. For three of the four carriers, more than half of all traceroutes to Google experienced a metro-level deviation in Q4 2011. Further, nearly all measurements from AT&T customers traversed inflated paths to Google. Note that these results are biased toward the locations of users in our dataset and are not intended to be generalized. Nevertheless, at a high level, this table shows that metro-level deviations occur in routes from the four major carriers, even though Google deploys servers around the world to serve nearby clients [72]. However, we also observe that the fraction of paths experiencing metro-level inflation decreases significantly over the subsequent 12 months. As we will show, we can directly link some of these improvements to the topological expansion of carriers. In the rest of the chapter, we examine path inflation to understand its causes and to explore what measures carriers have adopted to reduce or eliminate it.

We begin by characterizing the different types of metro-level inflations we see in our dataset. We split the end-to-end path into three logical parts: client to carrier ingress point (Carrier Access), carrier ingress point to service provider ingress point (Interdomain), and service provider ingress point to destination server (Provider Backbone). Then we define the following observed traffic patterns of inflated routes:

Carrier Access Inflation. Traffic from a client in metro area L (Local) enters the Internet in metro area R (Remote), and is directed to a Google server in R.

Interdomain Inflation. Traffic from a client in area L enters the carrier's backbone in L, then enters Google's network in area R and is directed to a Google server there.

Carrier Access-Interdomain Inflation. Traffic from a client in metro area L enters the carrier's backbone in metro area R, then enters Google's network back in area L and is directed to a Google server there.

Provider Backbone Inflation. Traffic from a client in area L enters the carrier's backbone and Google's network in area L, but is directed to a Google server in a different area R.

In all cases, Google servers are known to exist in both metro areas L and R.
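A compact way to read this taxonomy is as a comparison of metro labels along the path. The sketch below is an illustrative simplification (it assumes each hop of interest has already been geolocated to a metro area, and that Google has a server in the client's metro):

def classify_inflation(client, carrier_ingress, google_ingress, server):
    # Label the inflated segment(s) of a client-to-server path, given metro
    # labels for the client, the carrier's Internet ingress, the ingress into
    # Google's network, and the selected Google server.
    inflated = []
    if carrier_ingress != client:
        inflated.append("Carrier Access")
    if google_ingress != carrier_ingress:
        inflated.append("Interdomain")
    if server != google_ingress:
        inflated.append("Provider Backbone")
    return (" + ".join(inflated) + " Inflation") if inflated else "No metro-level inflation"

# Examples mirroring the four patterns defined above (L = "LA", R = "SF"):
print(classify_inflation("LA", "SF", "SF", "SF"))  # Carrier Access Inflation
print(classify_inflation("LA", "LA", "SF", "SF"))  # Interdomain Inflation
print(classify_inflation("LA", "SF", "LA", "LA"))  # Carrier Access + Interdomain Inflation
print(classify_inflation("LA", "LA", "LA", "SF"))  # Provider Backbone Inflation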
Possible Causes of Path Inflation. If a carrier lacks sufficient ingress points from its cellular network to the Internet, it can cause Carrier Access Inflation. For example, if a carrier has no Internet ingress points in metro area L, it must send the traffic from L to another area R (Figure 4.2, user B). If a carrier's access network ingresses into the Internet in metro area L, a lack of peering between the mobile carrier and Google in metro area L causes traffic to leave the metro area, resulting in Interdomain Inflation (Figure 4.2, user C). If a carrier has too few ingresses and lacks peering near its ingresses, we may observe Carrier Access-Interdomain Inflation. In this case a carrier, lacking ingress in area L, hauls traffic to a remote area R, where it lacks peering with Google. A peering point exists in area L, so traffic returns there to enter Google's network. Though a provider like Google has servers in most major metropolitan areas, it can still experience Provider Backbone Inflation if either Google or the mobile carrier groups together clients in diverse regions when making routing decisions. In this case, Google directs at least some of the clients to distant servers. Google may also route a fraction of traffic long distances across its backbone for measurement or other purposes.

Figure 4.2: Different ways a client can be directed to a server. User A is the ideal case, where the traffic never leaves a geographical area. User B's and C's traffic suffers path inflation, due to the lack of an ingress point and of a peering point, respectively.

Identifying root causes. We run one or more of the following checks, depending on the inflated part(s) of the path, to perform root cause analysis (illustrated in Figure 4.3; a code sketch of these checks follows below).

Examining Carrier Access Inflation. For inflated carrier access paths, we determine whether the problem is the lack of an available nearby ingress point. To do so, we examine the first public IP addresses for other traceroutes issued by clients of the same carrier in the same area. If none of those addresses are in the client's metro area, we conclude there is a lack of available local ingress.

Figure 4.3: Root cause analysis for metro-level inflation. For each inflated part of the end-to-end path (carrier access, interdomain, provider backbone), the analysis asks, respectively: are there any traces with a first hop in this area? are there any traces served by a local target without exiting the area? are all traces directed to exactly one destination at any given time? The answers map to the diagnoses lack of local ingress point, lack of local peering point, and inefficient client clustering.

Examining Interdomain Inflation. For paths inflated between the carrier ingress point and the ingress to Google's network, we determine whether it is due to a lack of peering near the carrier's ingress point. We check whether any traceroutes from the same carrier enter Google's network in that metro area, implying that a local peering exists. If no such traceroutes exist, we infer a lack of local peering.

Examining Provider Backbone Inflation. For paths inflated inside Google's network, we check for inefficient mappings of clients to servers. We look for groups of clients from different metro areas all getting directed to servers at either one or the other area for some period, possibly flapping between the two areas over time. If we observe that behavior, we infer inefficient client/resolver clustering.
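The three checks can be combined into a small decision procedure. The sketch below is a hedged illustration of that logic; the Trace fields and helper names are hypothetical, and the real analysis operates on the full traceroute dataset:

from collections import namedtuple

# Hypothetical per-traceroute summary for one carrier and metro area.
Trace = namedtuple("Trace", ["first_public_hop_is_local",  # ingress inside client metro?
                             "enters_google_locally",      # enters Google's network inside metro?
                             "time_bucket", "server_metro"])

def clients_share_one_destination(traces):
    # True if, in every time bucket, all traces were directed to exactly one
    # server metro (a sign that distant clients are clustered together).
    by_time = {}
    for t in traces:
        by_time.setdefault(t.time_bucket, set()).add(t.server_metro)
    return bool(by_time) and all(len(m) == 1 for m in by_time.values())

def diagnose(inflated_parts, regional_traces):
    # inflated_parts: subset of {"access", "interdomain", "backbone"}.
    # regional_traces: other traces from the same carrier and metro area.
    causes = []
    if "access" in inflated_parts and \
            not any(t.first_public_hop_is_local for t in regional_traces):
        causes.append("lack of local ingress point")
    if "interdomain" in inflated_parts and \
            not any(t.enters_google_locally for t in regional_traces):
        causes.append("lack of local peering point")
    if "backbone" in inflated_parts and clients_share_one_destination(regional_traces):
        causes.append("inefficient client/resolver clustering")
    return causes or ["unclassified (e.g., load balancing, stale mapping, outage)"]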
A small number of traceroutes (< 2%) experienced inflated paths but did not fit any of the above root causes. These could be explained by load balancing, persistent incorrect mapping of a client to a resolver/server, or a response to network outages.

4.5 Results

We first present examples of the three dominant root causes for metro-level inflation. We then show aggregate results from our inflation analysis, its potential impact on latency, and the evolution of causes of path inflation over time.

Case studies. For each root cause, we now present one example. For each example, we describe what the traceroutes show, what the diagnosis was, and note the estimated performance hit, ranging from 7–72% extra propagation delay. We constrain our analysis to the period between late 2011 and mid 2012, where the dataset is sufficiently dense.

Lack of ingress point. We observe that all traceroutes to Google from AT&T clients in the NYC area enter the public Internet via an ingress point in Chicago. Thus, Google directs these New York clients to a server in the Chicago area, even though it is not the server geographically closest to the clients. These Chicago servers are approximately 1074 km farther from the clients than the New York servers are, leading to an expected minimum additional round-trip latency of 16 ms (7% overhead) [67].

Lack of peering. We observe AT&T peering with Google near San Francisco (SF), but not near Los Angeles (LA) or Seattle (for the granularity of our analysis, we treat all locations in the Bay Area as equivalent). Therefore, Google directs clients in those two areas to servers in SF rather than in their local metros. While our data in these regions become sparse after mid 2012, we verified that this inflation persists for clients from LA in Q2 2013. The observed median RTT for Seattle users served by servers in SF is 90 ms. Since those servers are 1089 km farther away than the servers nearest to the Seattle users, they experience a delay inflation of at least 16 ms (21%). As a result, loading even a simple website like the Google homepage requires an additional 160 ms.

Figure 4.4: Server selection flapping due to coarse client-server mapping, shown as the number of measurements over time for (a) SF clients and (b) Seattle clients. Dashed areas denote measurements where the client was directed to a remote server.

Coarse client-server mapping granularity or inefficient client/resolver clustering. We observe a behavior for Verizon clients that suggests that Google is jointly directing clients in Seattle and SF. At any given time, traffic from both areas was directed towards the same Google servers, either in the Seattle or in the SF area, therefore exhibiting suboptimal performance for some distant clients. Figure 4.4 illustrates this behavior over a 2-month period. Normally, users served by servers in their metro area observe a median RTT of 22 ms and 45 ms for SF and Seattle respectively.
However, when users in one area are served by servers in the other area (indicated by the filled pattern in the figure), the additional 1089 km one-way distance adds an extra 16 ms delay (an overhead of 72% and 35% for SF and Seattle users, respectively).

Inflation Breakdown by Root Cause. In this section, we show aggregated statistics of some of the observed anomalies that cause performance degradation. We focus on Q4 2011 and on AT&T and Verizon Wireless, the period and carriers for which the dataset is the densest. We also focus on three large metropolitan areas that were populated enough to generate significant data (SF, New York, and Seattle). Google servers exist in all three areas. For all measurements issued from those areas, we quantify the fraction of metro-level inflations and determine the root cause. We believe that the path inflation observed in those areas implies probable inflation in less-populated regions.

          Closest Server   Count   Fraction Inflated   I   P   D   Extra Dst. (km)   Extra RTT (ms)   Extra PLT (ms)
AT&T      SF               7759    1.00                x   x       4200              31.5             315
          Seattle          303     1.00                    x       2106              15.8             158
          NYC              2720    1.00                x           2148              16.1             161
Verizon   SF               20528   0.30                        x   2178              16.3             163
          Seattle          2435    0.33                        x   1974              14.8             148
          NYC              7029    0.98                            694               5.2              52
Table 4.2: Overall results for two carriers in Q4 2011. The table shows what fraction of all traceroutes from clients in three different locations presented a deviation, the cause of the deviation (I = Ingress, P = Peering, D = DNS/clustering), the extra distance traveled (round-trip), the extra round-trip time (RTT), and the extra page load time (PLT) when accessing the Google homepage.

Table 4.2 shows aggregate results for the three regions. For each case, it includes the extra round-trip distance traveled as well as a loose lower bound on the additional delay incurred by traveling that distance, based on the speed of data through fiber [67]. We observed inflated routes from all regions for both carriers. Most of the traceroutes from Verizon clients in the NYC area went to servers near Washington, D.C., but we were unable to discern the exact cause. This represents a small geographic detour and may not impact performance in practice. Verizon clients from the Seattle and SF metros were routed together, possibly as a result of using the same DNS resolvers, as described in our case study above. For all traces from AT&T clients in the NYC area, the first public AT&T hop is in Chicago, indicating a lack of a closer ingress point. AT&T clients from the SF area were all served by a nearby Google server. However, traffic went from SF to Seattle before returning to the server in SF. In the traceroutes, the first public IP address was always from an AT&T router in Seattle, suggesting a lack of an ingress point near SF, and increasing the RTT by at least 31 ms for all traffic. This behavior progressively disappeared in early 2012, with the observed appearance of an AT&T ingress point in the SF area. An informal discussion with the carrier confirms initial deployment of this ingress in 2011. Note that traceroutes from clients in Seattle were also routed to Google targets in the SF area. Though Seattle traffic reached a local ingress, AT&T routed it to SF before handing it to Google's network, indicating a lack of peering in Seattle and explaining why traffic from SF clients returned to SF after detouring to Seattle.
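The "Extra RTT" column is derived purely from the extra distance. Below is a small sketch of that conversion; the constants (a 1.5x path stretch over the geodesic distance and roughly 200 km per millisecond for light in fiber) are assumptions of this sketch, although they reproduce the values in Table 4.2:

def min_extra_rtt_ms(extra_geodesic_roundtrip_km, path_stretch=1.5, fiber_km_per_ms=200.0):
    # Loose lower bound on added round-trip latency from extra distance alone;
    # queuing, processing, and further routing detours only add to this.
    return extra_geodesic_roundtrip_km * path_stretch / fiber_km_per_ms

# AT&T clients near SF: 4200 km of extra round-trip distance -> ~31.5 ms extra RTT.
print(min_extra_rtt_ms(4200))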
Evolution of Root Causes. As suggested above, carriers' topologies have evolved over time. Since our dataset is skewed towards some regions, we cannot enumerate the complete evolution of carrier topology and routing configuration, but we can provide insight into why we see fewer path inflation instances over time for some carriers.

Ingress Points. Figure 4.5 maps the observed ingress points at the end of 2011. While our dataset is limited, we can see indications of improvement. An earlier study [140] found 4–6 ingress points per carrier, whereas our results indicate that some carriers doubled this figure. This expansion opens up the possibility of much more direct routes from clients to services. Additionally, we noticed the appearance of AT&T ingresses in SF and LA, and of at least one Sprint ingress point in LA during the measurement period.

Figure 4.5: Observed ingress points for major US carriers (AT&T, Sprint, T-Mobile, Verizon). Locations are labeled with airport codes belonging to the ingress metro area.

Peering points. Table 4.3 summarizes the peering points that we observe. In 2011, most traceroutes from Sprint users in LA are directed to Google servers in Texas or SF. In measurements from Q2 2012, we observed an additional peering point between Sprint and Google near LA. Around the same time, we observe that Google started directing Sprint's LA clients to LA servers.

Carrier     Peering locations (2011 Q4)                              Added (2012 Q2)   Added (2012 Q4)
AT&T        CHI, DFW, HOU, MSP, PDX, SAT, SFO                        + ATL, CMH        + DEN
Sprint      ASH, ATL, CHI, DFW, LGA, SEA, SFO                        + LAX
T-Mobile    DCA, DFW, LAX, LGA, MSP, SEA, SFO                        + MIL             + MIA
Verizon     ATL, CHI, DAL, DCA, DFW, HOU, LAX, SCL, SEA, SFO         + ASH, MIA
Table 4.3: Observed peering locations between carriers and Google. Locations are identified by airport codes belonging to the metro area.

4.6 Path Inflation Today

Our measurements show that many instances of path inflation in the US disappeared over time. However, in addition to the persistent lack of AT&T peering in the LA area mentioned earlier, we see evidence for inflated paths in other regions of the world (from Q3 2013 measurement data). For example, clients of Nawras in Oman are directed to servers in Paris, France instead of closer servers in New Delhi, India. This increases the round-trip distance by over 7000 km, and may be related to a lack of high-speed paths to the servers in India. We also see instances of path inflation in regions with well-developed infrastructure. E-Plus clients in southern Germany are delegated to Paris or Hamburg servers instead of a close-by server in Munich, and Movistar clients in Spain are directed to servers in London instead of local servers in Madrid. These instances suggest that path inflation is likely to be a persistent problem in many parts of the globe, and motivate the design of a continuous measurement infrastructure for identifying instances of path inflation and diagnosing their root causes.

4.7 Conclusions

This chapter took a first look into diagnosing path inflation for mobile client traffic, using a large collection of longitudinal measurements gathered by smartphones located in diverse regions and carrier networks. We provided a taxonomy of causes for path inflation, identified the reasons behind observed cases, and quantified their impact. We found that a lack of carrier ingress points or provider peering points can cause lengthy detours, but, in general, routes improve as carrier and provider topologies evolve.
Chapter 5
An Internet-Wide Analysis of Traffic Policing

Large flows like video streams consume significant bandwidth. Some ISPs actively manage these high-volume flows with techniques like policing, which enforces a flow rate by dropping excess traffic. While the existence of policing is well known, our contribution is an Internet-wide study quantifying its prevalence and impact on transport-level and video-quality metrics. We developed a heuristic to identify policing from server-side traces and built a pipeline to process traces at scale, collected from hundreds of Google servers worldwide. Using a dataset of 270 billion packets served to 28,400 client ASes, we find that, depending on region, up to 7% of connections are identified as policed. Loss rates are on average 6x higher when a trace is policed, and policing impacts video playback quality. We verified most of these findings using a second dataset consisting of packet captures collected by M-Lab's Network Diagnostics Toolkit over the last six years. This gave us the additional benefit of detecting longitudinal trends. In particular, we found that policing became less prevalent over time. Finally, we show that alternatives to policing, like pacing and shaping, can achieve traffic management goals while avoiding the deleterious effects of policing.

Policing: enforces a rate by dropping excess packets immediately.
  – Can result in high loss rates
  + Does not require a memory buffer
  + No RTT inflation
Shaping: enforces a rate by queueing excess packets.
  + Only drops packets when the buffer is full
  – Requires memory to buffer packets
  – Can inflate RTTs due to high queueing delay
Table 5.1: Overview of policing and shaping.

5.1 Introduction

Internet traffic has increased fivefold in five years [35], much of it from the explosion of streaming video. YouTube and Netflix together contribute nearly half of the traffic to North American Internet users [93, 115, 145]. Content providers want to maximize user quality of experience. They spend considerable effort optimizing their infrastructure to deliver data as fast as possible [28, 47, 58]. In contrast, an ISP needs to accommodate traffic from a multitude of services and users, often through different service agreements such as tiered data plans. High-volume services like streaming video and bulk downloads that require high goodput must coexist with smaller-volume services like search that require low latency. To achieve coexistence and enforce data plans, an ISP might apply different rules to its traffic. For example, it might rate-limit high-volume flows to avoid network congestion, while leaving low-volume flows that have little impact on the congestion level untouched. Similarly, to enforce data plans, an ISP can throttle throughput on a per-client basis.

The most common mechanisms to enforce these policies are traffic shaping – in which traffic above a preconfigured rate is buffered – and traffic policing – in which traffic above the rate is dropped [34]. Table 5.1 compares both techniques. To enforce rate limits on large flows only, networks often configure their shapers and policers (the routers or middleboxes enforcing rates) to accommodate bursts that temporarily exceed the rate. In this chapter, we focus on policing and briefly discuss shaping (Section 5.5.1.2).
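To make the contrast in Table 5.1 concrete, here is a minimal token-bucket model of both mechanisms (a simplified sketch; real deployments, as discussed in Section 5.2.1, differ in details such as burst sizing and buffer management):

class TokenBucket:
    # Tokens (bytes) accumulate at `rate` bytes/s up to a `burst` allowance.
    def __init__(self, rate, burst):
        self.rate, self.burst = rate, burst
        self.tokens, self.last = float(burst), 0.0

    def refill(self, now):
        self.tokens = min(self.burst, self.tokens + (now - self.last) * self.rate)
        self.last = now

def police(bucket, now, pkt_bytes):
    # Policer: forward if enough tokens are available, otherwise drop immediately.
    bucket.refill(now)
    if bucket.tokens >= pkt_bytes:
        bucket.tokens -= pkt_bytes
        return "forward"
    return "drop"                 # no queue, no added delay, but the bytes are lost

def shape(bucket, now, pkt_bytes):
    # Shaper: queue instead of dropping; return the delay the packet incurs.
    bucket.refill(now)
    if bucket.tokens >= pkt_bytes:
        bucket.tokens -= pkt_bytes
        return 0.0                # sent immediately
    wait = (pkt_bytes - bucket.tokens) / bucket.rate
    bucket.tokens, bucket.last = 0.0, now + wait   # queued; the wait inflates the RTT
    return wait                   # (a real shaper drops once its buffer fills up)

# Example: a 1.5 Mbps limit (187,500 bytes/s) with a 100 kB burst allowance.
print(police(TokenBucket(187_500, 100_000), now=0.0, pkt_bytes=1500))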
The Impact of Policing. Policing is effective at enforcing a configured rate but can have negative side effects for all parties. While operators have anecdotally suggested this problem in the past [34, 139], we quantify the impact on content providers, ISPs, and clients at a global scale by analyzing client-facing traffic collected at most of Google's CDN servers, serving clients around the world.

Policing impacts content providers: it introduces excess load on servers forced to retransmit dropped traffic. Globally, the average loss rates on policed flows are over 20%, compared to about 2% for all other flows!

Policing impacts ISPs: they transport that traffic across the Internet from the content provider to the client, only for it to be dropped. With 20% loss, a fifth of the bandwidth used by affected flows is wasted — the content provider and ISPs incur costs transmitting it, but it never reaches the client. This traffic contributes to congestion and to transit costs.

Policing impacts clients: ISP-enacted policing can interact badly with TCP-based applications, leading to degraded video quality of experience (QoE) in our measurements. Bad QoE contributes to user dissatisfaction, hurting content providers and ISPs.

Figure 5.1 shows the time-sequence plot of a policed flow collected in a lab experiment (see Section 5.3). Because the policer is configured to not throttle short flows, the flow ramps up to over 15 Mbps without any loss (bubble 1), until the policer starts to throttle the connection to a rate of 1.5 Mbps. Since packets are transmitted at a rate that exceeds the policed rate by an order of magnitude, most of them are dropped by the policer and retransmitted over a 5-second period (2). Following the delivery of the first 2 MB, the sender remains idle until more application data becomes available (3). Since the flow does not exhaust its allotted bandwidth in this time frame, the policer briefly allows the sender to resume transmitting faster than the policing rate (4), before throttling the flow again (5). Overall, the flow suffers 30% loss.

Figure 5.1: TCP sequence graph for a policed flow: (1 and 4) high throughput until the token bucket empties, (2 and 5) multiple rounds of retransmissions to adjust to the policing rate, (3) idle period between chunks pushed by the application.

Understanding Policing. Little is known about how traffic policing is deployed in practice. Thus, we aim to answer the following questions at a global scale: (1) How prevalent is traffic policing on the Internet? (2) How does it impact application delivery and user quality of experience? (3) How can content providers mitigate adverse effects of traffic policing, and what alternatives can ISPs deploy?

The question of user experience is especially important, yet ISPs lack mechanisms to understand the impact of traffic management configurations on their users. They lack visibility into transport-layer dynamics or application-layer behavior of the traffic passing through their networks. Further, policing means that content providers lack full control over the performance experienced by their clients, since they are subject to ISP-enacted policies that may have unintended interactions with applications or TCP. To answer these questions, we need to overcome two hurdles. First, traffic management practices and configurations likely vary widely across ISPs, and Internet conditions vary regionally, so we need a global view to get definitive answers.
Second, it is logistically difficult, if not im- possible, to access policer configurations from within ISPs on a global scale, so we need to infer them by observing their impact on traffic and applications. We address these hurdles and answer these three questions by analyzing captured traffic between Google servers and its users. Contributions. We make the following contributions: 1. We design and validate an algorithm to detect traffic policing from server-side traces at scale (Section 5.2, Section 5.3). 2. We analyze policing across the Internet based on global measurements (Section 5.4). We collected over 270 billion packets captured at Google servers over a 7-day span, labelled as the Google dataset. This dataset gives us insight to traffic delivered to clients all over the world, spread across over 28,400 different autonomous systems (ASes). In addition, we conducted an exhaustive study of policing observed in traces collected by M-Lab’s Network Diagnostics Toolkit, labelled as the NDT dataset 1 . This data was collected over a six-year span, enabling us to detect longitudinal trends. We sampled 7.5 million traces from this dataset where each trace represents a single chunk of data transferred between an M-Lab server and clients located all over the world. 1 http://measurementlab.net/tools/ndt 85 3. We describe solutions for ISPs and content providers to mitigate adverse effects of traffic management (Section 5.5). In our Google dataset, we find that between 2% and 7% of lossy transmissions (depending on the region) have been policed. While we detected policing in only 1% of samples overall in our dataset, connections with packet loss perform much worse than their loss-free counterparts [149]. Thus, understanding and improving the performance for lossy transmissions can have a large impact on average performance (see Chapter 2). We find that policing induces high packet loss overall: on average, a policed connection sees over 20% packet loss vs. at most 4.1% when no policing is involved. Traces in our NDT dataset show similar trends in terms of prevalence of policing and the impact of policing on loss rates. Finally, policing can degrade video playback quality. Our measurements reveal many cases in which policed clients spend 15% or more of their time rebuffering, much more than non-policed connections with similar goodput. With every 1% increase in rebuffering potentially reducing user engagement by over 3 minutes [37], these results would be troubling for any content provider. While this study primarily highlights the negative side effects of policing, our point is not that all traffic management is bad. ISPs need tools to handle high traffic volumes while accommo- dating diverse service agreements. Our goal is to spur the development of best practices which allow ISPs to achieve management needs and better utilize networks, while also enabling con- tent providers to provide a high-quality experience for all customers. As a starting point, we discuss and evaluate how ISPs and content providers can mitigate the adverse effects of traffic management (Section 5.5). 
Stepping back, this chapter presents an unprecedented view of the Internet: a week of (sam- pled) traffic from most of Google’s CDN servers, delivering YouTube, one of the largest volume 86 services in the world serving 12-32% of traffic worldwide [115]; a global view of aspects of TCP including loss rates seen along routes to networks hosting YouTube’s huge user base; mea- surements of policing done by the middleboxes deployed in these networks; and statistics on client-side quality of experience metrics capturing how this policing impacts users. The analysis pipeline built for this chapter enabled this scale of measurement, whereas previous studies, even those by large content providers like Google, were limited to packet captures from fewer vantage points [4, 44, 49, 74, 103, 149]. 5.2 Detecting & Analyzing Policing at Scale In this section, we present an algorithm for detecting whether a (portion of a) flow is policed or not from a server-side trace. We added this algorithm to a collection and analysis framework for traffic at the scale of Google’s CDN. 5.2.1 Detecting Policing Challenges. Inferring the presence of policing from a server-side packet trace is challenging for two reasons. First, many entities can affect traffic exchanged between two endpoints, including routers, switches, middleboxes, and cross traffic. Together they can trigger a wide variety of net- work anomalies with different manifestations in the impacted packet captures. This complexity requires that our algorithm be able to rule out other possible root causes, including congestion at routers. 2 The second challenge is to keep the complexity of policing detection low to scale the detection algorithm to large content providers. 2 Whether tail-drop or those using some form of active queue management, such as Random Early Drop (RED) or CoDel [46, 95]. 87 Definition. Traffic policing refers to the enforcement of a rate limit by dropping any packets that exceed the rate (with some allowance for bursts). Usually, traffic policing is achieved by using a token bucket of capacity N, initially filled with m tokens. Tokens are added (maximum N tokens in the bucket) at the preconfigured policing rate r. When a packet of length p arrives, if there are p tokens available, the policer forwards the packet and consumes p tokens. Otherwise it drops the packet. Goal. The input to our algorithm is an annotated packet flow. Our analysis framework (Sec- tion 5.2.2) annotates each packet to specify, among other things: the packet acknowledgement latency, as well as packet loss and retransmission indicators. Our goal is to detect when traffic is policed, i.e., when a traffic policer drops packets that ex- ceed the configured rate. Our approach uses loss events to detect policing (as described below). 3 If a flow requires fewer than m tokens, policing will not kick in and drop packets, and we do not attempt to detect the inactive presence of such a policer. The output of the algorithm is (a) a single bit that specifies whether the flow was policed or not, and (b) an estimate of the policing rate r. Detection. Figure 5.2 outlines our policing detector (PD, for short). PD starts by generating the estimate for the token refresh rate r, as follows. We know that a policer drops packets when its token bucket is empty. Assuming losses are policer-induced, we know there were not enough tokens when the first loss (p f irst loss ) and last loss (p last loss ) happened within a flow. 
All successfully delivered packets in between must have consumed tokens produced after the first loss. Thus, PD uses the goodput between the first and last loss to compute the token production rate (line 1). 4 Our algorithm is robust even if some losses have other root causes, such as congestion, so long as most are due to policing.

3 Since we rely on loss signals, we only detect policing when a flow experiences loss. To be robust against noise, we only run the algorithm on flows with 15 losses or more. We derived this threshold from a parameter sweep, which found that lower thresholds often produced false positives. On average, flows marked as policed in our production environment carried about 600 data packets, out of which 100 or more were lost.
4 If the first and/or last loss are not triggered by policing, we potentially miscalculate the policing rate. To add robustness against this case, we always run the algorithm a second time where we cut off the first and last two losses and reestimate the policing rate.

Variables: r (estimated policing rate); p_first_loss, p_last_loss (first/last lost packet); t_u, t_p, t_a (used/produced/available tokens); l_loss, l_pass (lists of the number of tokens available when packets were lost/passed); n_loss, n_pass (fraction of lost/passed packets allowed to not match policing constraints).

 1  r <- rate(p_first_loss, p_last_loss)
 2  t_u <- 0
 3  for p_current <- p_first_loss to p_last_loss do
 4      t_p <- r * (time(p_current) - time(p_first_loss))
 5      t_a <- t_p - t_u
 6      if p_current is lost then
 7          add t_a to l_loss
 8      else
 9          add t_a to l_pass
10          t_u <- t_u + bytes(p_current)
11  if average(t_a in l_loss) < average(t_a in l_pass)
12     and median(t_a in l_loss) < median(t_a in l_pass)
13     and |{t_a in l_loss : t_a ~ 0}| >= (1 - n_loss) * |l_loss|
14     and |{t_a in l_pass : t_a >~ 0}| >= (1 - n_pass) * |l_pass|
15     and RTT did not increase before p_first_loss then
16         add traffic policing tag to flow
Figure 5.2: Policing Detector.

Next, PD determines if the loss patterns are consistent with a policer enforcing rate r. To do so, it estimates the bucket fill level as each packet arrives at the policer and verifies if drops are consistent with expectation. For this estimation, it computes the following values for each packet between the first and the last loss (lines 3–10):

The number of produced tokens t_p, i.e., the overall (maximum) number of bytes that a policer would let through up to this point (line 4), based on the goodput estimate and the elapsed time since the first loss (t_elapsed = time(p_current) - time(p_first_loss)).

The number of used tokens t_u, i.e., the number of bytes that passed through the policer already (line 10).

The number of available tokens t_a, i.e., the number of bytes that a policer currently would let through, based on the number of produced and already used tokens (line 5).

If the number of available tokens is roughly zero, i.e., the token bucket is (almost) empty, we expect a packet to be dropped by the policer. Conversely, if the token count is larger than the size of the packet, i.e., the token bucket accumulated tokens, we expect the packet to pass through. The exact thresholds depend on the goodput and the median RTT of the connection to account for the varying offsets between the transmission timestamp of packets that we record and the arrival times at the policer.

Based on this intuition, PD infers traffic policing if all of the following conditions hold (lines 11–15).
First, the bucket should have more available tokens when packets pass through than when packets are lost. Second, we expect the token bucket to be roughly empty, i.e., t a 0 in the case of a lost packet. This check ensures that losses do not happen when the token bucket is supposed to have sufficient tokens to let a packet pass (t a 0), or when the token bucket was supposed to be empty and have dropped packets earlier (t a < 0). We allow a fraction of the samples (at most n loss ) to fail this condition for robustness against noisy measurements and sporadic losses with other root causes. A similar condition applies to the token counts observed when packets pass through, where we expect that the number of available tokens is almost always be positive. We allow fewer outliers here (at most n pass < n loss ) since the policer always drops packets when the token bucket is empty. We derived the noise thresholds n loss and n pass from a parameter sweep in a laboratory setting (Section 5.3.1) with a preference for keeping the number of false positives low. For our analysis, we used n loss = 0:1 and n pass = 0:03. Finally, PD excludes cases where packet losses were preceded by RTT inflation that could not be explained by out-of-order delivery or delayed ACKs. This check is another safeguard against false positives from congestion, often indicated by increasing buffering times and RTTs before packets are dropped due to queue overflow. By simulating the state of a policer’s token bucket and having tight restrictions on the in- stances where we expect packets to be dropped vs. passed through, we reduce the risk of attribut- ing losses with other root causes to interference by a policer. Other causes, like congestion, tran- sient losses, or faulty router behavior, will, over time, demonstrate different connection behaviors than policing. For example, while a policed connection can temporarily achieve a goodput above the policing rate whenever the bucket accumulates tokens, a connection with congestion cannot do the same by temporarily maintaining a goodput above the bottleneck rate. Thus, over time 91 the progress on connections affected by congestion will deviate from progress seen on policed connections. 5.2.2 Analyzing Flow Behavior At Scale We have developed, together with other collaborators within Google, a pipeline for analyzing flows at scale. The first step of this pipeline is a sampler that efficiently samples a small fraction of all flows based on 5-tuple hashes, capturing all the headers and discarding the payload after the TCP header. The sampler is deployed at most of Google’s CDN servers and periodically transfers collected traces to an analyzer backend in a datacenter. By running the analysis online in a datacenter, we minimize the processing overhead introduced on the CDN servers. As the traces arrive at the analyzer backend, an annotator analyzes each flow. We designed the annotator to be broadly applicable beyond detecting policing; for example, in Section 5.5.1.2, we use it to detect traffic shaping. For each trace, the annotator derives annotations at the individual packet level (e.g., the RTT for the packet, or whether the packet was lost and/or a retransmission), and at the flow level (e.g., the loss rate and average throughput experienced by the flow). It can also identify application-level frames within a flow, such as segments (or chunks) in a video flow. The annotator also captures more complex annotations, such as whether a connection experienced bufferbloat [48]. 
PD is just one component of the annotator: it annotates whether a segment was policed and, if so, at what rate. Developing these annotations was challenging. The annotation algorithms had to be fast since a single trace might need several hundred annotations and we have many traces. The more complex annotations also required significant domain knowledge and frequent discussions with 92 experienced network engineers looking at raw packet traces and identifying higher-level struc- tures and interactions. Also complicating the effort were the complexity of the TCP specifica- tion, implementation artifacts, and application and network element behavior that led to a very large variety in observed packet traces. Our annotator is a significant step in packet analysis at scale beyond existing tools [26, 29, 88, 101, 109, 130, 138, 142]. Our analysis framework helped us explore policing in the wild and was also helpful in iterating over different designs of complex annotations. The framework can detect CDN-wide anomalies in near real-time (e.g., when traffic from an ISP experiences significant loss). 5.3 Validation We validate our algorithm in two ways. First, we evaluate the accuracy of PD by generating a large set of packet traces in a controlled lab setting with ground truth about the underlying root causes for packet loss (Section 5.3.1). Second, we show that the policing rates in the wild are consistent within an AS, meaning the AS’s traces marked as policed have goodput rates that cluster around only a few values, whereas the remaining traces see goodput rates that are dispersed (Section 5.3.2). 5.3.1 Lab Validation Our lab experiments are designed to stress-test our algorithm. We generated a large number of packet traces while using different settings that cover common reasons for dropped packets, focusing on the ones that could elicit traffic patterns similar to a policed connection. 93 Policing. We use a carrier-grade network device from a major router vendor to enforce traffic policing. We configured the device in much the same way an ISP would to throttle their users, and we confirmed with the router vendor that our configurations are consistent with ISP practice. Across multiple trials, we set the policing rates to 0.5, 1.5, 3, and 10 Mbps, and burst sizes to 8kB, 100kB, 1MB, and 2MB. Congestion. We emulate a bottleneck link which gets congested by one or multiple flows. We evaluated drop-tail queueing and three active queue management (AQM) schemes: CoDel [95], RED [46], and PIE [100]. We varied bottleneck link rates and queue sizes across trials using the same values as for the policing scenario. Random loss. We used a network emulator to randomly drop 1% and 2% of packets to simulate the potential behavior of a faulty connection. We simulated traffic resembling the delivery of data chunks for a video download, similar to the type of traffic we target in our analysis in Section 5.4. Overall, we analyzed 14,195 chunks and expected our algorithm to mark a chunk as policed if and only if the trace sees packet loss and was recorded in the Policing setting. Table 5.2 summarizes the results, with a detailed breakdown of all trials available online [1]. Policed traces. PD was able to detect policing 93% of the time for most policing configurations (A). 
The tool can miss detecting policing when it only triggers a single large burst of losses (given only a single burst of losses, we cannot estimate a policing rate since all losses happened at roughly the same time), or when the token bucket is so small that it allows almost no burstiness and is therefore similar in behavior to a low-capacity bottleneck with a small queue. We aggregated these cases as special cases (B). PD is conservative in order to avoid false positives for non-policed traces (D–H).

Scenario | Accuracy
A: Policing (except (B) and (C) below) | 93.1%
B: Policing (special cases) | 48.0%
C: Policing (multiple flows) | 12.3%
D: Congestion (all AQM schemes) | 100.0%
E: Congestion (drop-tail, single flow, except (G)) | 100.0%
F: Random loss | 99.7%
G: Congestion (drop-tail, single flow, min. queue) | 93.2%
H: Congestion (drop-tail, multiple flows) | 96.9%
Table 5.2: PD classification accuracy for several controlled scenarios.

Consequently, we likely underestimate global policing levels by failing to recognize some of the special cases (B).

We also analyzed the scenario where multiple flows towards the same client are policed together (C). For our in-the-wild study (Section 5.4), PD typically does not have visibility into all flows towards a single client, as the CDN servers in the study independently select which flows to capture. To emulate this setting in our validation, we also only supply PD with a single flow, and so it can only account for some of the tokens that are consumed at the policer. Therefore, its inference algorithm is unable to establish a single pattern that is consistent with policing at any given rate. Since we are interested in a macroscopic view of policing around the globe, we can tolerate a reduced detection accuracy for cases where clients occasionally receive content through multiple connections at the same time (for video transfers in our setting, most of the time only one connection is actively transmitting a video chunk from the server to the client, even though multiple connections are established between them). We leave improving the algorithm's accuracy for this scenario to future work, which would also require the deployment of a different sampling method.

Non-policed traces. In our experiments, PD correctly classifies as non-policed almost all segments suffering from other common network effects, including network bottlenecks such as a congested link with packets dropped due to AQM (D) or drop-tail policy (E, G, H), and random packet loss (F). PD is able to rule out policing because it checks for consistent policing behavior across many RTTs (more precisely, the policing algorithm only considers traces where the connection observed packet loss in at least three different round-trip periods), and other network effects rarely induce loss patterns that consistently mimic the policing signature over time. For example, when congestion overflows a queue, it drops packets similarly to a policer that has exhausted its tokens. However, over time congestion will not always happen at exactly the same moment as a policer enforcing the rate limit for a specific flow.

A closer look at the single-flow congestion cases shows that only trials using the minimum configurable queue size (8 kB) cause misclassifications (G). This is because a bottleneck with almost no available queue to temporarily accommodate bursts results in the same packet loss patterns as traffic passing through a policer. However, in the wild (Section 5.4), 90% of the traces tagged as policed temporarily sustain larger bursts of 30 kB or more and therefore cannot fall in this category of false positives.
In addition, a few cases of congestion from background traffic (H) induced loss patterns that were misclassified as policing. These cases have inferred bottleneck rates that vary widely, whereas we show in Section 5.3.2 that, in the wild, traces we classified as policed cluster around only a handful of goodput rates per AS. Note that a flow in the wild might experience more complex congestion dynamics, e.g., when contending with hundreds of other flows at a router. However, these dynamics are unlikely to result in a per-chunk traffic pattern consistent with a policer enforcing a rate (e.g., where losses always happen when exceeding a certain throughput rate), and, even if there are cases where chunks are misclassified as policed, we do not expect this to happen consistently for a large number of chunks within an AS.

Finally, we validated our algorithm against traces generated by Kakhki et al. [64]. These traces were also generated with carrier-grade equipment, configured to perform traffic shaping only. As such, none of the traces should be labeled as policed by our tool. The 1,104 traces we analyzed contained 205,652 data chunks, of which only 37 chunks were falsely marked as policed by PD. This represents an accuracy of 99.98% for this dataset.

Figure 5.3: Number of rate clusters required to cover at least 75% of the rate samples per AS.

5.3.2 Consistency of Policing Rates

Our policing rate analysis (Section 5.4.2) and our case studies (Section 5.4.6) suggest that policing rates are often tied to advertised data plan rates. Thus we conjectured that, because most ASes have few plan rates, we should observe few policing rates per client AS. To validate this conjecture, we computed the number of prevalent policing rates seen per client AS, based on traces from most of Google's CDN servers (see Section 5.4). We derived the minimum number of rate clusters required to cover at least 75% of the policed traces per AS. We define a rate cluster with center value v as all rates falling into the range [0.95v, 1.05v]. For example, the 1-Mbps cluster incorporates all rates between 0.95 Mbps and 1.05 Mbps. To find a solution, we use the greedy algorithm for the partial set cover problem, which produces a good approximation of the optimal solution [78].

We looked at the distribution of goodput rates for segments marked as policed in ASes with at least 3% of their traffic being policed. Rates in the majority of ASes can be accounted for by 10 clusters or less (Figure 5.3). By visiting ISP homepages, we observe that many offer a range of data rates, some with reduced rates for data overages. Further, many ISPs continue to support legacy rates. In Section 5.4.2 we discuss the range of rates that we observed for different ISPs in more detail. Thus it is not surprising that we see more than just a couple of policing rates for most ASes. In contrast, goodput rates in ASes with no policing do not display clustering around a small number of rates and see a much wider spread. Since the false positives in our lab validation see a wide spread as well, this result provides us confidence that the traces we marked as policed in our production dataset are mostly true positives.
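To illustrate the clustering step, the sketch below implements the greedy partial-set-cover heuristic on a list of per-trace policing rates (in Mbps). The ±5% cluster width and the 75% coverage target follow the definitions above; the function name and the example rates are hypothetical.

```python
def clusters_for_coverage(rates, coverage=0.75, width=0.05):
    """Greedily pick cluster centers until `coverage` of the rate samples
    fall within +/- `width` of some chosen center (partial set cover)."""
    remaining = list(rates)
    target = coverage * len(rates)
    covered = 0
    centers = []
    while covered < target and remaining:
        # Pick the sample whose cluster covers the most uncovered samples.
        best_center, best_members = None, []
        for candidate in remaining:
            members = [r for r in remaining
                       if (1 - width) * candidate <= r <= (1 + width) * candidate]
            if len(members) > len(best_members):
                best_center, best_members = candidate, members
        centers.append(best_center)
        covered += len(best_members)
        remaining = [r for r in remaining if r not in best_members]
    return centers

# Example: an AS policing mostly at 1 and 2 Mbps needs only two clusters.
samples = [0.98, 1.0, 1.02, 1.9, 2.0, 2.05, 7.3, 0.5]
print(len(clusters_for_coverage(samples)))   # prints 2
```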
5.4 Policing in the Wild

In this section, we characterize the prevalence and impact of policing in the Internet.

The Google dataset. We analyze sampled data collected from most of Google's CDN servers during a 7-day period in September 2015. The dataset consists of over 277 billion TCP packets, carrying 270 TB of data, associated with more than 800 million HTTP queries requested by clients in over 28,400 ASes. The TCP flows carried different types of content, including video segments associated with 146 million video playbacks. The dataset is a sampled subset (based on flow ID hashing) of Google's content delivery traffic. To tie TCP performance to application performance, we analyze the data at a flow segment granularity. A segment consists of the packets carrying an application request and its response (including ACKs).

The NDT dataset. As a secondary dataset we used traces collected by M-Lab's Network Diagnostics Toolkit (NDT). The NDT dataset comprises millions of packet traces and metadata collected over the past seven years. Each trace is the result of a diagnostic task manually triggered between a client machine and one of many M-Lab vantage points. While the dataset should not be seen as a good representation of all Web traffic, it serves as a valuable addition to the Google dataset. It lets us validate the findings from the analysis of the Google dataset, and, in addition, the timespan over which the measurements were recorded allows us to observe longitudinal trends. Since the NDT dataset contains over 79 TB of data, we focused our work on a sample by only looking at the data from the first day of each month. In addition, we filtered out traces with fewer than 100 packets for relevance.

Despite collecting data from clients on a global scale, the NDT dataset does have some limitations. NDT is an active measurement toolkit and as such requires the user to trigger the collection of data. While it is integrated into BitTorrent clients like µTorrent and Vuze to reach a larger user base, it potentially biases the data collection towards clients with connectivity problems who are more likely to run speed tests. All results tied to this particular dataset should therefore be taken with a grain of salt.

Overview of Results. In the following sub-sections, we present our key findings:

Internet-wide, up to 7% of data transfers between servers and clients are affected by traffic policers (Section 5.4.1).

Especially in Africa, a sizable amount of throttled traffic is limited to a rate of 2 Mbps or less, often inhibiting the delivery of HD quality content (Section 5.4.2).

Policing can result in excessive loss (Section 5.4.3).

The user quality of experience suffers with policing, as measured by more time spent rebuffering (Section 5.4.4).

Region | Policed segments (among lossy) | Policed segments (overall) | Loss rate (policed) | Loss rate (non-pol.)
India | 6.8% | 1.4% | 28.2% | 3.9%
Africa | 6.2% | 1.3% | 27.5% | 4.1%
Asia (w/o India) | 6.5% | 1.2% | 22.8% | 2.3%
South America | 4.1% | 0.7% | 22.8% | 2.3%
Europe | 5.0% | 0.7% | 20.4% | 1.3%
Australia | 2.0% | 0.4% | 21.0% | 1.8%
North America | 2.6% | 0.2% | 22.5% | 1.0%
Table 5.3: Percentage of segments policed among lossy segments (≥ 15 losses, the threshold to trigger the policing detector), and overall average loss rates for policed and unpoliced segments.

Policing can induce patterns of traffic and loss that interact poorly with TCP dynamics (Section 5.4.5).

Through ISP case studies, we reveal interesting policing behavior and its impact, including losses on long-distance connections.
We also confirm that policing is often used to enforce data plans (Section 5.4.6). 5.4.1 The Prevalence of Policing A macroscopic analysis of the Google data (Table 5.3) shows that, depending on geographic region, between 2% and 6.8% of lossy segments were impacted by policing. 8 Overall, between 0.2% and 1.4% of the segments were affected. Our analysis of the NDT dataset yields similar results. Overall, our algorithm marked 162,080 traces (2.2%), out of a total of 7.4 million relevant traces in the NDT dataset, as policed. As shown in Figure 5.4, the policing frequency varies widely across continents. While less than 0.4% of traces in North America are marked as policed, 3.4% of the traces from Asia are tagged. We see 8 The video traffic we examine is delivered in segments (or chunks), thus we analyze the dataset on a per-segment granularity. Many video content providers stream video in segments, permitting dynamic adaptation of delivery to network changes. 100 NA SA -- OC AF EU AS 0 0.5 1 1.5 2 2.5 3 3.5 Continent % of traces policed Figure 5.4: Prevalence of policing in the M-Lab NDT data-set across client conti- nents. “–” represents clients that could not be geolocated with the MaxMind Ge- oLite2 Country database RU BY ME DZ IN NP AM KZ UZ KG 0 5 10 15 20 25 Country % of traces policed 10 2 10 3 10 4 10 5 10 6 10 7 total traces analyzed Figure 5.5: Top-10 countries with the most policing in the M-Lab NDT dataset over the whole measurement period. Countries with fewer than 1000 traces were excluded. 0 1 2 3 4 5 2010 2011 2012 2013 2014 2015 2016 % of traces policed Year (January marked) Smoothed Raw Figure 5.6: Prevalence of policing over time. 0 0.5 1 1.5 2 2.5 3 3.5 4 4.5 5 2010 2011 2012 2013 2014 2015 2016 % of traces policed Year (January marked) AF AS EU NA OC SA Figure 5.7: Prevalence of policing over time (grouped by continent). a similar disparity when clustering traces by the client’s country as well (Figure 5.5). The bar diagram displays the policing frequency and the sample size for the countries with the largest fraction of their traces policed. Almost 25% of the traces matched to clients in Kyrgyzstan are policed. While most of the countries in the top-10 list contribute a small number of samples to the overall dataset, there are exceptions like India and Russia for which we have many data points and high policing frequencies (10 and 8%, respectively). 101 0 5 10 15 20 2010 2011 2012 2013 2014 2015 2016 % of traces policed Year (January marked) Smoothed Raw (a) India 0 5 10 15 20 2010 2011 2012 2013 2014 2015 2016 % of traces policed Year (January marked) Smoothed Raw (b) Russia 0 5 10 15 20 2010 2011 2012 2013 2014 2015 2016 % of traces policed Year (January marked) Smoothed Raw (c) Algeria Figure 5.8: Prevalence of policing over time (grouped by country). Since our NDT dataset incorporates samples from a 6-year time frame (2010 to 2015), we also analyzed longitudinal trends. Figure 5.6 shows the global policing frequency seen in individual samples (the first day of every month) as well as the long-term trend over the past six years. 9 In the oldest samples we analyzed (from early 2010), we detected policing in about 3.5% of the recorded traces. Over time, policing became less prevalent, with less than 1% of the traces policed in our latest samples. Again we broke down our dataset based on the client’s continent (Figure 5.7) and country (Figure 5.8). 
For traces from Asia and Europe we see policing frequencies decline over time, whereas measurements from the remaining continents do not show a clear trend. For the per-country breakdown we selected three of the most policed countries with a substantial number of traces per sample to allow us to analyze long-term trends. For India, roughly 14% of the traces were policed in 2010, compared to less than 4% in late 2015. We observe a similar trend for traces tied to clients in Russia. However, this trend does not apply to all countries. For example, in Algeria we see the opposite trend with policing being more prevalent in the newer samples. 9 In this figure and all other figure showing longitudinal data, the period of inactivity between July 2014 and May 2015 represents a period without traces due to a bug in the NDT software. 102 0 0.2 0.4 0.6 0.8 1 1 10 100 CDF Policing Rate (in Mbps) Africa India South America Asia (w/o India) Europe North America Australia Figure 5.9: Observed policing rates per segment. 5.4.2 Enforced Policing Rates Figure 5.9 shows the rates enforced by policers based on data from the Google dataset. In Africa and India, over 30% of the policed segments are throttled to rates of 2 Mbps or less. The most frequent policing rates in these two regions are 1, 2, and 10 Mbps, as is evident from the pro- nounced inflections in the CDF. In Section 5.4.6 we examine some ISPs to demonstrate that this step-wise pattern of policing rates that emerge in the data reflects the available data plans within each ISP. The distributions in other regions of the world show no dominant rates, with many seg- ments being permitted to transmit at rates exceeding 10 Mbps. This is due to aggregation effects: these regions have many ISPs with a wide variety of data plans. That said, even in these regions, at least 20% of segments stream at less than 5 Mbps. We did a more thorough analysis of observed policing rates based on the publicly available NDT dataset and discuss our findings for this dataset next. Figure 5.10 shows the policing rate distributions broken down for the top ASes based on the fraction of their traces being policed. Rostelecom is a large Russian ISP which now owns six of the top-10 most policed ASes. To get a 103 0 0.2 0.4 0.6 0.8 1 0 1 2 3 4 5 CDF Policing Rate (in Mbps) 8193 17908 1547 28840 42575 47165 8728 (a) Excluding Rostelecom 0 0.2 0.4 0.6 0.8 1 0 1 2 3 4 5 CDF Policing Rate (in Mbps) 8443 28812 25436 2878 34449 34267 25008 (b) Rostelecom only Figure 5.10: Distribution of policing rates observed in the top-7 ASes (by prevalence of policing) excluding Rostelecom (left) and ASes that are now incorporated into Rostelecom (right), over the whole measurement period. Six of the top-10 ASes belong now to Rostele- com. more representative view of policing rates used by different ISPs we plot the ASes now managed by Rostelecom separately. The distribution for each of the ASes shows a clear staircase pattern with few policing rates dominating per AS. For example, Uzbektelekom (ASN 8193) configures their policers primarily for throughput rates of 0.25 and 0.5 Mbps, with a few traces seeing rates of 0.125 and 1 Mbps. To get a more scoped view we narrow down our measurement window to traces from 2015 only. The results are graphed in Figure 5.11. Note that a different set of ASes are the top policers in this time frame. This is not surprising. As we mentioned earlier, policing became less prevalent over time. 
It is possible that some ASes decided to abandon policing as their method for traffic engineering over time. Another possibility is that ASes increased their enforced policing rates which reduces the probability that connections with low throughput requirements are affected by policers starting to drop packets. Since we can only detect policing once a policer starts to drop packets, we would not detect the presence of policers in this situation. 104 0 0.2 0.4 0.6 0.8 1 0 1 2 3 4 5 CDF Policing Rate (in Mbps) 8193 28840 36947 6697 9829 6849 9198 Figure 5.11: Distribution of policing rates observed in the top-7 policing ASes in 2015. 0 0.2 0.4 0.6 0.8 1 0 1 2 3 4 5 CDF Policing Rate (in Mbps) 2010 - 2014 2015 only Figure 5.12: Distribution of policing rates observed in AS 6697, in two differ- ent time frames. ASN Name Country (TLD) Matched to plan rates (Mbps) Unmatched 8193 Uzbektelekom Uzbekistan (UZ) 0.125, 0.25, 0.5, 1 None 28840 Tattelecom Russia (RU) 0.5, 1 1.5 36947 Algerie Telecom Algeria (DZ) 1, 2, 4 0.5 6697 Beltelecom Belarus (BY) 1, 2, 3, 4 None 9829 BSNL India (IN) 0.5, 1 None 6849 Ukrtelecom Ukraine (UA) 0.5, 2 None 9198 Kazakhtelecom Kazakhstan (KZ) None 0.5, 1, 2 Table 5.4: Top-7 policing ASes using traces after May 2015 only. For some ASes shown in Figure 5.10 we see a large spread of common policing rates for two reasons. First, we aggregate data from the whole six-year measurement period during which the configured policing rates can change. Figure 5.12 examplifies this for AS 6697. We generated the distribution for samples from 2010 to 2014, and for the samples from 2015 separately. While we observe rates of 1, 2, and 3 Mbps in both distributions (albeit in different quantities), the rates of 0.25, 0.5, 4, and 5 Mbps are only seen in one of the time frames. Second, even if an ISP currently offers a relatively small number of data plans, legacy data plans remain intact resulting in a variety of enforced bandwidths. 105 0 0.2 0.4 0.6 0.8 1 0 0.1 0.2 0.3 0.4 0.5 CDF Loss Rate India (N) India (P) Asia w/o India (N) Asia w/o India (P) (a) Asia 0 0.2 0.4 0.6 0.8 1 0 0.1 0.2 0.3 0.4 0.5 CDF Loss Rate Africa (N) Africa (P) Australia (N) Australia (P) (b) Africa and Australia 0 0.2 0.4 0.6 0.8 1 0 0.1 0.2 0.3 0.4 0.5 CDF Loss Rate Europe (N) Europe (P) Americas (N) Americas (P) (c) Europe and Americas Figure 5.13: Distribution of loss rates observed on unpoliced (N) and policed (P) segments in different regions of the world. Finally, we note that the observed policing rates cluster around round numbers or fractions thereof that are commonly tied to data plans. For the top ASes based on the fraction of traffic po- liced in 2015, we looked up the data plans and bandwidth rates these ISPs offer to their customers and tried to find matches for the observed policing rates. The results are shown in Table 5.4. Gen- erally, the policing rates do align with data plan rates with a few exceptions. For example, for Kazakhtelecom we could not find data plans that match any of the policing rates of 0.25, 0.5, and 1 Mbps that we see in our dataset. The current data plans that this ISP offers start at 4 Mbps. It is possible that the policing rates are tied to legacy plans. It is also possible that they reflect rates enforced for oversubscribers, i.e. when a customer exceeds a data limit. Chapter 5.4.6 discusses a case of this observation. 
106 0.5 0.6 0.7 0.8 0.9 1 0 0.1 0.2 0.3 0.4 0.5 CDF Retransmitted packets / all packets All Segments 0.5Mbps Goodput 5Mbps Goodput Per ASN Figure 5.14: Loss rate CDF per segment, for segments with an average goodput of 0.5 or 5 Mbps, and per ASN. 0 0.2 0.4 0.6 0.8 1 1 10 100 CDF Ratio between Burst Throughput and Policing Rate Australia Americas Europe Asia (w/o India) Africa India Figure 5.15: Ratio between the median burst throughput and the policing rate per segment. 5.4.3 Impact of Policing on the Network Policing noticeably increases the packet loss rate, which can in turn affect TCP performance [149] and user satisfaction (also see Chapter 2). 10 We start with an analysis of the Google dataset. In the Google dataset, we observed an average packet loss rate of 22% per segment for policed flows (Table 5.3). Figure 5.13 plots the loss rate CDF for policed and non-policed segments observed in different regions. Policed flows in Africa and Asia see a median loss rate of at least 10%, whereas the median for unpoliced flows is 0%. Other regions witness lower loss rates, yet a sizable fraction of segments in each experiences rates of 20% or more. The 99 th percentile in all regions is at least 40%, i.e., almost every other packet is a retransmission. In Section 5.4.5 we analyze common traffic patterns that can trigger such excessive loss rates. The loss rate distributions shown in Figure 5.14 see a wide variability with long tails: the overall loss rate distribution (All Segments) has a median of 0% and a 99 th percentile of over 25%. 10 To ensure that packet loss is caused by policing instead of only being correlated with it (e.g., in the case where policing would be employed as a remedy to excessive congestion in a network), we compared the performance of policed and unpoliced flows within an AS (for a few dozen of the most policed ASes). We verified that policed connections observed low throughput yet high loss rates. Conversely unpoliced connections achieved high throughput at low loss rates. In addition, we did not observe any diurnal patterns – loss rates and the fraction of traffic impacted by policing are not affected by the presence of peak times. Section 5.3 provides additional evidence that policers are the root cause for losses and not the other way round. 107 0 0.2 0.4 0.6 0.8 1 0 5 10 15 20 25 CDF Packet loss (%) non-policed policed Figure 5.16: Distribution of packet loss rates seen for policed and unpoliced traces across the whole dataset. The figure also shows the distribution for two segment subsets: one including the 20 million requests with an average goodput of 0.5 Mbps (50 kbps), and the other with the 7 million requests achieving 5 Mbps (50 kbps). Though there is some correlation between goodput and loss rates, there are many cases where high loss did not result in bad performance. For example, about 4% of the segments achieving a goodput of 5 Mbps also observe a loss rate of 10% or more. Policers are one cause for the uncommon high loss, high goodput behavior, as we show in Section 5.4.5. Our analysis of the NDT dataset showed slightly different loss distributions. Figure 5.16 compares the distribution of loss rates seen in policed vs. unpoliced traces in the NDT dataset. In the median, we see a loss rate of 7.4% when traces are marked as policed vs. 0.14% for non-policed traces. In the 90 th percentile loss rates increase to 17.6% for policed traces vs. 3.9% for unpoliced traces. 
A potential reason for the lower loss rates in this dataset is the difference in traffic composition between the two datasets. Since the NDT dataset is solely composed of continuous 10-second transfers compared to shorter transfers in the Google dataset, connections 108 0 5 10 15 20 25 -- AF AS EU NA OC SA loss [%] continent non-policed policed Figure 5.17: Packet loss rates seen for po- liced and unpoliced traces per continent (client location). Each bar shows the me- dian, as well as the 10 th and 90 th percentile using error bars. 0 5 10 15 20 25 KG UZ KZ AM NP IN DZ loss [%] country non-policed policed Figure 5.18: Packet loss rates seen for policed and unpoliced traces in the top- 7 countries (based on the percentage of traces policed per country). Each bar shows the median, as well as the 10 th and 90 th percentile using error bars. have more time to adjust to the enforced policing rates which reduces loss rates over time. We discuss this phenomenon later in this chapter. Many of the locations with high policing rates also have high loss rates. To rule out the client location as a confounding factor we break down the results by regions with different granularities. We start with a breakdown by continent, as shown in Figure 5.17. For each continent, we compare the loss rates seen with policed vs. unpoliced traces. Clearly policed traces see much higher loss rates, often at least a magnitude larger compared to unpoliced traces from the same continent. For example, the median loss rate in Africa is 14% in policed traces vs. 0.5% in unpoliced traces. Next, we look at the loss rates when clustering by the client’s country. Figure 5.18 shows results for the top-7 countries, based on the percentage of traces policed per country. Again, policed traces see much higher loss rates compared to unpoliced traces from the same country. Finally, we break down results based on the client’s AS (Figure 5.19). A large number of the top ASes, based on the percentage of traces policed per AS, now belong to a large Russian provider (Rostelecom). We therefore provide two plots, one for the top-7 ASes within Rostelecom 109 0 5 10 15 20 8443 28812 25436 2878 34449 34267 25008 loss [%] ASN non-policed policed (a) Without Rostelecom 0 5 10 15 20 25 8193 17908 1547 28840 42575 47165 8728 loss [%] ASN non-policed policed (b) Rostelecom only Figure 5.19: Packet loss rates seen for policed and unpoliced traces in the top-7 ASes (sepa- rating Rostelecom and non-Rostelecom ASes; based on the percentage of traces policed per AS). Each bar shows the median, as well as the 10 th and 90 th percentile using error bars. and one for all other ASes. In comparison to the per-continent or per-country figures, the disparity between loss rates for policed and unpoliced traces is smaller. For most ASes the median loss rate is twice as high for policed traces. ASes 8443 and 42575 are an exception in this regard. Loss rates are particularly high for these ASes, even when policing is not detected, suggesting that non-policer induced loss is overshadowing the the effects of policing here. Why can policing result in high loss rates? Next, we take another look at the Google dataset in an effort to determine why loss rates are much higher when connections are affected by policing. One situation that can trigger high loss is when there is a wide gap between the rate sustained by a flow’s bottleneck link and the rate enforced by the policer. 
We estimate the bottleneck capacity (or the burst throughput) by evaluating the interarrival time of ACKs for a burst of packets [31,57,68]. We found that in many cases the bottleneck capacity, and sometimes even the goodput rate achieved before the policer starts dropping packets is 1-2 orders of magnitude higher than the policing rate. Figure 5.15 compares the achieved burst throughput and policing rates we observed. The gap is particularly wide in Africa and India. With such large gaps, when the policer 110 starts to drop packets, the sender may already be transmitting at several times the policing rate. Since the sender’s congestion control mechanism usually only halves the transmission rate each round trip, it needs multiple round trips to sufficiently reduce the rate to prevent further policer packet drops. We investigate this and other interactions with TCP in Section 5.4.5. When policers drop large bursts of packets, the sender can end up retransmitting the same packets multiple times. Overshooting the policing rate by a large factor means that retransmis- sions as part of Fast Recovery or FACK Recovery [86] are more likely to also be lost, since the transmission rate does not decrease quickly enough. The same applies to cases where policing re- sults in a retransmission timeout (RTO) followed by Slow Start. In this situation, the token bucket accumulated tokens before the RTO fired, leading to a few rounds of successful retransmissions before the exponential slow start growth results in overshooting the policing rate again, requiring retransmissions of retransmissions. Multiple rounds of this behavior can be seen in Figure 5.1. These loss pathologies can be detrimental to both ISPs and content providers. Policing- induced drops force the content provider to transmit, and ISPs to carry, significant retransmission traffic. This motivates our exploration of more benign rate-limiting approaches in the Section 5.5. 5.4.4 Impact on Playback Quality In addition to the large overheads caused by excessive packet loss, policing has a measurable im- pact on the user’s quality of experience. Figure 5.20 shows, for a selection of playbacks delivered at different goodput rates, the distribution of the ratio of time spent rebuffering to time spend watching. This ratio is an established metric for playback quality and previous studies found a high correlation between this metric and user engagement [37]. Each of the selected playbacks had at least one of their video segments delivered at a goodput rate of either 300 kbps or 1.5 Mbps 111 0.99 0.9 0.5 0 0 0.05 0.1 0.15 0.2 0.25 0.3 0.35 0.4 CDF Rebuffer Time to Watch Time Ratio All other playbacks Policed playbacks (a) 300 kbps 0.99 0.9 0.5 0 0 0.05 0.1 0.15 0.2 0.25 0.3 0.35 0.4 CDF Rebuffer Time to Watch Time Ratio All other playbacks Policed playbacks (b) 1.5 Mbps Figure 5.20: Rebuffer to watch time ratios for video playbacks. Each had at least one chunk with a goodput of 300 kbps or1.5 Mbps(15%). (15%). 300 kbps is the minimum rate required to play videos of the lowest rendering quality, leaving little opportunity to bridge delayed transmissions by consuming already buffered data. For each selected rate, between 50% and 90% of the playbacks do not see any rebuffer events. For the rest, policed playbacks perform up to 200% worse than the unpoliced ones. For example, in the 90 th percentile, playbacks policed at a rate of 300 kbps spend over 15% of their time rebuffering, vs. 5% when not policed. 
Prior work found that a 1% increase in the rebuffering ratio can reduce user engagement by 3 minutes [37]. This result substantiates our claim that policing can have a measurable negative impact on user experience. Another way to assess playback quality is to explore the impact of observing a high-goodput short burst at the beginning of the flow, before policing starts. This can happen when the policer’s token bucket starts out with a sizable amount of tokens. As such, a flow might temporarily sustain a rate that is good enough for HD video delivery, while the policing rate enforced later prevents this, i.e., the rate is below the target of 2.5 Mbps. To quantify the impact of this behavior on the application, we evaluate the wait time. This is the delay between a segment request and the time 112 0 0.2 0.4 0.6 0.8 1 0 0.5 1 1.5 2 2.5 3 3.5 4 CDF HD Wait Time (s) All Policed < 2.5Mbps Figure 5.21: Wait time CDF for all HD segments (red solid line) and those policed below 2.5 Mbps (blue dotted line). when its playback can commence without incurring additional re-buffering events later. We can compute wait time from our traces since we can observe the complete segment behavior. Figure 5.21 shows that delivering even a single HD segment over a slow connection results in larger wait times. In the median, a client has to wait over 1 second for a policed segment, whereas the median for unpoliced ones is only 10 ms. 5.4.5 Interaction Between Policers and TCP Enabling traffic policing itself does not automatically result in high loss. Thus, before we can design solutions to avoid the negative side effects of policing, we need to have a better under- standing about when and why configurations trigger heavy losses. We found that high loss is only observed when the policer and TCP congestion control interact poorly in specific settings. To depict these interactions, we use the diagrams in Figure 5.22 that show specific patterns of connection progress. 113 1 2 3 Time Data Progress Policing Rate (a) Congestion avoidance pattern 1 Time Data Progress Policing Rate 2 3 4 (b) Staircase pattern 1 2 3 Time Data Progress Policing Rate (c) Doubling window pattern Figure 5.22: Common traffic patterns when a traffic policer enforces throughput rates. The plots show progress over time (blue solid line) with a steeper slope representing a higher goodput, the transmitted but lost sequences (red dotted lines), and the estimated policing rate (black dashed line). Packets which would put the progress line above the policing rate line are dropped while other packets pass through successfully. Congestion Avoidance Pattern. In the most benign interaction we have seen, the policer induces few losses over long time periods. The congestion window grows slowly, never overshooting the policing rate by much. This results in short loss periods, as shown in Figure 5.22a. In this pattern, the sender slowly increases the congestion window while a small number of excess tokens accumulate in the bucket (1). Towards the end of this phase, the progress curve has a slightly steeper slope than the policing rate curve. Consequently, we exceed the policing rate at 114 some point (the black dashed line) resulting in packet loss (2). The congestion window is reduced during the fast recovery, followed by another congestion avoidance phase (3). Staircase Pattern. 
A particularly destructive interaction between TCP and policers is a "staircase" pattern, which emerges when flow rates before the policer drops packets are multiple times the policed rate (Figure 5.22b). This results in short periods of progress followed by long periods of stagnation, with the sequence graph resembling a staircase. Initially the sender pushes data successfully at a high rate (bubble 1 in the figure). Eventually, the policer runs out of tokens and starts dropping. Since the token refresh rate is much lower than the transmission rate, (almost) all packets are lost (2). This results in a high probability of the last packet in a burst being lost, so TCP needs to fall back to timeout-based loss detection, since there are no subsequent packets to trigger duplicate ACKs. Consequently, the sender idles for a long time (3). This is problematic on low-RTT connections, since the loss detection mechanism accounts for possibly delayed ACKs, usually requiring a timeout of 200 ms or more [22], which may be much higher than the RTT. Once packets are marked as lost and retransmitted, the sender accelerates quickly (4), as the policer accumulated a large number of tokens during the idle time. In Section 5.5.1.1 and Section 5.5.2.2 we discuss how we can avoid this pattern by optimizing a policer's configuration and reducing bursty transmissions.

ISP | Region | Samples | RTT | Mobile
A | Azerbaijan | 64K | Medium |
B | USA | 31K | Medium | X
C | India | 137K | Very low |
D | India | 17K | Low |
E | Algeria | 112K | Medium |
Table 5.5: Overview of 5 highly policed ISPs. The RTT estimates apply only when content is fetched from the local cache. With cache misses, content needs to be fetched from a data center which is potentially located much farther away, resulting in higher RTTs.

Doubling Window Pattern. For clients near the server, the very low RTTs can enable connections to sustain high throughput rates even when the congestion window (cwnd) allows the sender to have only one packet carrying MSS bytes in flight at a time, where MSS is the maximum segment size allowed by the network. Excluding loss events, the throughput rate equals cwnd/RTT. The policing rate lies between the throughputs achieved when using a congestion window of 1 MSS and a window of 2 MSS (see Figure 5.22c). Note that the window will grow linearly on a byte granularity, thus observing values between 1 and 2 MSS. However, Nagle's algorithm in TCP delays transmissions until the window allows the transmission of a full MSS-sized packet [91].

The pattern starts with the sender pushing data while using a congestion window of 1 MSS. In congestion avoidance mode, the window increases by 1 MSS every RTT. Thus, even though the window is supposed to grow slowly, it doubles in this extreme case (1). Next, the higher transmission rate makes the policer drop packets (2). The sender backs off, setting the congestion window back to 1 MSS. Timeout-based recovery is not necessary since the low amount of in-flight data enables "early retransmit" upon the reception of a single duplicate ACK (3).

Even though the connection makes continuous progress without excessive loss periods, valuable bandwidth is wasted. To avoid this pattern, the sender would need to send packets that carry fewer bytes than the MSS allows in order to match the policing rate. Since the protocol is not configured to do this, using a window of 1 MSS is the only setting enabling permanent stability. This is not supported by TCP's congestion control mechanism, since "congestion avoidance" will increase the window by 1 MSS every RTT.
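A back-of-the-envelope calculation with assumed (but representative) numbers shows how a policing rate can fall between the throughput of a 1-MSS and a 2-MSS window on a low-RTT path, which is exactly the regime that produces this oscillation:

```python
MSS_BITS = 1460 * 8   # bits per full-sized segment (1460-byte payload; assumed)
RTT = 0.005           # seconds; a client close to a cache node (assumed)

def window_throughput_mbps(cwnd_segments):
    # Excluding losses, throughput is cwnd / RTT.
    return cwnd_segments * MSS_BITS / RTT / 1e6

one_mss = window_throughput_mbps(1)   # ~2.3 Mbps
two_mss = window_throughput_mbps(2)   # ~4.7 Mbps
policing_rate_mbps = 3.0              # assumed plan rate

# A 3 Mbps policer sits between the two: a 1-MSS window undershoots it,
# while growing to 2 MSS overshoots it and triggers drops.
print(one_mss < policing_rate_mbps < two_mss)   # True
```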
116 0 0.2 0.4 0.6 0.8 1 0 1 2 3 4 5 CDF Policing Rate (Mbps) ISP A ISP B ISP C ISP D ISP E Figure 5.23: Policing rates in policed seg- ments for selected ISPs. 0 0.2 0.4 0.6 0.8 1 0 0.1 0.2 0.3 0.4 0.5 CDF Retransmitted packets / all packets (per segment) ISP A ISP B ISP C ISP D ISP E Figure 5.24: Loss rates in policed seg- ments for selected ISPs. 5.4.6 Policing Pathologies We now focus on the analysis of traces from the Google dataset for a small set of ISPs to highlight different characteristics of policed traffic. Table 5.5 gives an overview of five ISPs where policing was prevalent, selected to illustrate interesting pathologies arising from policing. Figures 5.23 and 5.24 show the policing and loss rates seen when delivering video to clients in each ISP. As we have observed with data from the NDT dataset in Chapter 5.4.2, we can clearly distinguish the small set of policing rates used within each ISP. The most popular choices are 1 and 2 Mbps, both of which are below the 2.5 Mbps needed for HD quality videos. For all ISPs except ISP B, we found the advertised bandwidth of their data plans on their websites, and, in each case, the plan rates matched the observed policing rates. For ISP C, we recently observed a drastic change in the rate distribution. In our earlier analysis from 2014, most traces were policed at 4 Mbps, at that point a plan offered by the ISP. Now we see 10 Mbps as the most prominent rate, which is consistent with a change of data plans advertised. We do observe two smaller bumps at roughly 3 Mbps and 4 Mbps. These rates do not correspond to a base bandwidth of any of their plans, but instead reflect the bandwidth given to customers once they exceed their monthly data cap. 117 Losses on long-distance connections. Traffic policing causes frequent loss, but losses can be particularly costly when the packets propagate over long distances just to be dropped close to the client. For example, for ISP A, a local cache node in Azerbaijan serves half the video requests, whereas the other half is served from more than 2,000 kilometers away. We confirmed that the policer operates regardless of content source. So the high drop rates result in a significant fraction of bandwidth wasted along the paths carrying the content. The same applies to many other ISPs (including C, D, and E) where content is sometimes fetched from servers located thousands of kilometers away from the client. Policing in wireless environments. We observe policing in many areas across the globe, even in developed regions. ISP B provides mobile access across the United States while heavily policing some of its users to enforce a data cap. While we understand that it is necessary to regulate access by heavy users, we find that there are many cases where the bandwidth used by throttled connections is actually higher than the bandwidth used by unthrottled ones carrying HD content, since the latter do not incur costly retransmissions. Large token buckets. ISP C sees heavy loss, with 90% of segments seeing 10% loss or more. Yet, flows achieve goodputs that match the policing rates (10 Mbps or more in this case). There are three reasons for this. First, median bottleneck capacity is 50 Mbps on affected connections. Second, most connections see a very small RTT. Finally, the policer is configured to accommo- date fairly large bursts, i.e., buckets can accumulate a large number of tokens. 
This allows the connection to "catch up" after heavy loss periods, where progress stalls, by briefly sustaining a goodput rate exceeding the policing rate by an order of magnitude. When plotting the progress over time, this looks like a staircase pattern which was discussed in more detail in Section 5.4.5. While goodputs are not adversely affected, application performance can still degrade. For example, a video player needs to maintain a large buffer of data to bridge the time period where progress is stalled, otherwise playback would pause until the "catch up" phase.

Small token buckets. ISP D is at the other end of the spectrum, accommodating no bursts by using a very small token bucket. The small bucket combined with the low RTT results in the doubling window pattern discussed earlier (Section 5.4.5). The small capacity also prevents a connection from "catching up." After spending considerable time recovering from a loss, the policer immediately throttles transmission rates again since there are no tokens available that could be used to briefly exceed the policing rate. As such, the overall goodput rate is highly influenced by delays introduced when recovering from packet loss.

Repressing video streaming. Finally, we note that we observed configurations where a video flow is throttled to a rate that is too small to sustain even the lowest quality. The small number of requests coming from affected users suggests that they stop watching videos altogether.

5.5 Mitigating Policer Impact

We now explore several solutions to mitigate the impact of policing. Unless otherwise specified, we use the same setup as for the PD validation (see Section 5.3).

5.5.1 Solutions for ISPs

5.5.1.1 Optimizing Policing Configurations

The selection of configuration parameters for a policer can determine its impact. The policed rate usually depends on objectives such as matching the goodput to the bandwidth advertised in a user's data plan and therefore may be inflexible. However, an ISP can play with other knobs to improve compatibility between policers and the transport layer, while maintaining the same policing rate.

Capacity (kB) | 8 | 16 | 32 | 64 | 128 | 256 | 512 | 1K | 2K
Rebuffering time (s) | 3.5 | 2.0 | 1.5 | 1.6 | 1.6 | 1.6 | 2.4 | 3.1 | 3.1
Table 5.6: Impact of token bucket capacity on rebuffering time of the same 30-second video playback. Policing rate is set to 500 kbps.

For example, we showed earlier that the staircase pattern can arise in the presence of large token buckets. To prevent the associated long bursty loss periods, two options come to mind. First, the enforcing ISP could configure policers with smaller burst sizes. This would prevent TCP's congestion window from growing too far beyond the policing rate. For this, we again measured the performance of video playbacks when traffic is passed through a policer. We limited the rate to 500 kbps and varied the burst size between 8 kB (the smallest configurable size) and 8 MB, using powers of two as increments. In this setting, a fairly small buffer size of 32 kB results in the lowest rebuffering delays (Table 5.6). Smaller buffers prevent the policer from absorbing any bursty traffic. Larger buffers allow connections to temporarily achieve throughput rates that are much larger than the policing rates, which can result in long rebuffering events if a quality level can no longer be sustained (i.e., the player has to adjust to a lower bandwidth once traffic is policed) or if loss recovery is delayed (i.e., we observe a staircase pattern).
A more thorough sensitivity analysis is left to future work. Second, policing can be combined with shaping, as discussed below. 120 5.5.1.2 Shaping Instead of Policing In contrast to a policer dropping packets, a traffic shaper enforces a rate r by buffering packets: if the shaper does not have enough tokens available to forward a packet immediately, it queues the packet until sufficient additional tokens accumulate. The traces of segments that pass through a shaper resemble those of segments limited by a bottleneck. Shaping can provide better per- formance than policing. It minimizes the loss of valuable bandwidth by buffering packets that exceed the throttling rate instead of dropping them immediately. However, buffering packets re- quires more memory. As with policers, shapers can be configured in different ways. A shaper can even be combined with a policer. In that case, the shaper spreads packet bursts out evenly before they reach the policer, allowing tokens to generate and preventing bursty losses. One key configuration for a shaper is whether to make it burst-tolerant by enabling a “burst” phase. When enabled, the shaper temporarily allows a goodput exceeding the configured shaping rate, similar to Comcast’s Powerboost feature [18, 19]. Burst-tolerant Shapers. We developed a detection algorithm for burst-tolerant shaping which determines whether a given segment has been subjected to this type of shaper, and estimates the shaping rate. It relies on the observation that a connection achieves a steady throughput rate after an initial burst phase with higher throughput. We have omitted the details of this algorithm for brevity. We found burst-tolerant shaping in 1.5% of the segments in our dataset. Given its prevalence, we ask: can burst-tolerant shaping mitigate the adverse impact of polic- ing? While shaping avoids the losses that policing induces, latency can increase as shapers buffer packets. To measure this effect, for each video chunk we compare the 10 th percentile latency, usually observed in the burst phase, with the 90 th (Figure 5.25). In the median, shaped segments 121 0 0.2 0.4 0.6 0.8 1 1 10 100 CDF 90th / 10th percentile RTT Ratio Shaped Global Figure 5.25: Per-segment ratio between 90 th and 10 th percentile latencies for shaped seg- ments (red solid line) and all video segments globally (blue dashed line). observe a 90 th percentile latency that is 4 larger than the 10 th percentile. About 20% of seg- ments see a latency bloat of at least an order of magnitude due to traffic shaping, whereas, among non-shaped segments, only 1% see such disparity. Latency-aware congestion control (e.g., TCP Vegas [24]) or network scheduling algorithms (e.g., CoDel [95]) can reduce this latency bloat. Burst-tolerant shaping can also induce unnecessary rebuffering delays at the client. When shaping forces a video server to switch from the burst rate to the actual shaping rate, the content provider may reduce the quality delivered to the client based on the new bandwidth constraint. Now, the older high quality chunk takes too long to be delivered to the client, whereas the new low quality chunk does not reach before the client-side application buffer has already drained. Shapers without burst tolerance. The alternative is shapers that enforce the shaping rate from the start. In theory, such shaping should not induce significant delays (unlike their burst-tolerant counterparts), nor drop packets like policers. 
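The difference can be made concrete with a toy token-bucket model: both devices consume tokens per forwarded byte, but a policer drops a packet that arrives when tokens are insufficient, while a shaper delays it until enough tokens have accumulated. This is an illustrative per-packet simulation (the FIFO queueing of a real shaper is omitted), not a model of any particular vendor's implementation.

```python
class TokenBucket:
    def __init__(self, rate_bps, burst_bytes):
        self.rate = rate_bps / 8.0        # token refill rate in bytes/second
        self.capacity = burst_bytes
        self.tokens = float(burst_bytes)  # start with a full bucket
        self.last = 0.0

    def refill(self, now):
        self.tokens = min(self.capacity,
                          self.tokens + (now - self.last) * self.rate)
        self.last = now

def police(bucket, pkt_bytes, now):
    """Policer: forward the packet if tokens suffice, otherwise drop it."""
    bucket.refill(now)
    if bucket.tokens >= pkt_bytes:
        bucket.tokens -= pkt_bytes
        return "forward"
    return "drop"

def shape(bucket, pkt_bytes, now):
    """Shaper: never drop; return the (possibly delayed) departure time."""
    bucket.refill(now)
    if bucket.tokens >= pkt_bytes:
        bucket.tokens -= pkt_bytes
        return now
    wait = (pkt_bytes - bucket.tokens) / bucket.rate
    bucket.tokens = 0.0
    bucket.last = now + wait      # tokens are spent up to the departure time
    return now + wait
```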
Our dataset almost certainly includes flows shaped in this way, but detecting them is hard: connections affected by shaping produce the same traffic patterns as when a TCP flow hits a bottleneck at the same rate. Significant cross traffic sharing such a bottleneck may cause throughput variance. It may be possible to identify (burst-intolerant) shapers by looking for low variance. However, since our passive measurements cannot detect when cross traffic is present, we cannot infer these shapers with any reasonable accuracy.

We evaluate the efficacy of shapers in the lab by fetching the same video playback repeatedly from YouTube and passing it through a policer or shaper (experimenting with different bandwidth limits and queue sizes) before the traffic reaches the client. Then, we calculated quality metrics using YouTube's QoE API [144]. Table 5.7 summarizes the impact on QoE, averaged over 50 trials per configuration (we show the results for a single throttling rate here, with other rates yielding similar trends).

Capacity | Join time, policed (s) | Join time, shaped (s) | Diff. | Rebuffer time, policed (s) | Rebuffer time, shaped (s) | Diff.
8 kB | 14.0 | 12.0 | –16% | 2.8 | 1.7 | –39%
100 kB | 11.1 | 13.3 | +20% | 1.6 | 1.3 | –19%
2 MB | 0.3 | 12.6 | +4200% | 4.2 | 1.5 | –64%
Table 5.7: Average join/rebuffer times for first 30 s of a video with the downlink throttled to 0.5 Mbps by either a policer or shaper. Capacity is the token bucket size (for policer) and the queue size (for shaper).

Join times are generally lower when policing is used, since data can initially pass through the policer without any rate limit if enough tokens are buffered. With sufficiently large token buckets (e.g., the 2 MB configuration in Table 5.7) a video playback can start almost immediately. However, this comes at the cost of much higher rebuffering times. The sudden enforcement of a policing rate causes the video player buffer to drain, causing high rebuffering rates. Shaping on the other hand enforces a rate at all times without allowing bursts. This reduces rebuffering by up to 64% compared to the policed counterparts. Since prior work found that a low rebuffering time increases user engagement [37], reducing rebuffering time might be more beneficial than optimizing join times. Interestingly, shaping performs well even when the buffer size is kept at a minimum (here, at 8 kB) which only allows the absorption of small bursts.

Configuring shapers. While policer configurations should strive to minimize burst losses, there is no straightforward solution for shapers. Shaping comes at a higher memory cost than policing due to the buffer required to store packets. However, it also introduces queuing latency which can negatively affect latency-sensitive services [77]. Thus, ISPs that employ shaping have to trade off between minimizing loss rates through larger buffers that introduce higher memory costs, and minimizing latency through small buffers. In comparison to the cheaper policing option, a small buffer might still be affordable, and the additional hardware cost might be lower than the cost resulting from a policer that drops large amounts of traffic (e.g., additional transit cost).

5.5.2 Solutions for Content Providers

5.5.2.1 Limiting the Server's Sending Rate

A sender can potentially mitigate the impact of a policer by rate-limiting its transmissions, to avoid pushing the policer into a state where it starts to drop packets. Optimally, the sender limits outgoing packets to the same rate enforced by the policer.
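As a simple illustration of the idea, a server application that knows (or has estimated) the policing rate could meter its own socket writes. This sketch is our own simplification, not the mechanism used on the production servers; the chunk size is arbitrary.

```python
import time

def send_rate_limited(sock, data, rate_bps, chunk=16 * 1024):
    """Write `data` to `sock` at no more than `rate_bps` by spacing out
    application-level writes (illustrative only)."""
    interval = chunk * 8.0 / rate_bps     # seconds per chunk at the target rate
    next_send = time.monotonic()
    for offset in range(0, len(data), chunk):
        delay = next_send - time.monotonic()
        if delay > 0:
            time.sleep(delay)
        sock.sendall(data[offset:offset + chunk])
        next_send += interval
```

For example, limiting to 95% of a 1.5 Mbps policing rate would use rate_bps=1.425e6, matching the lab setting described next.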
We experimentally verified the benefits of sender-side rate limiting in a lab environment. We also confirmed the result in the wild, by temporarily configuring one of Google’s CDN server to rate limit to a known carrier-enforced policing rate, then connecting to that server via the public Internet from one of our mobile devices that we know to be subject to that carrier’s policing rate. In both experiments, loss rates dropped from 8% or more to0%. Additionally, if the policer uses a small bucket, rate limiting at the sender side can even im- prove goodput. We verified this by configuring a policer to a rate of 1.5 Mbps with a capacity of only 8 KB. In one trial we transmit traffic unthrottled, and in a second trial we limit outgoing packets to a rate of 1.425 Mbps (95% of the policing rate). Figures 5.26a and 5.26b show the 124 0.00 0.02 0.04 0.06 0.08 0.10 0.12 0.14 0.0 0.5 1.0 1.5 2.0 Sequence number (in M) Time (s) Data (First Attempt) Data Retransmits Acked Data (a) No modifications. 0.00 0.02 0.04 0.06 0.08 0.10 0.12 0.14 0.0 0.5 1.0 1.5 2.0 Sequence number (in M) Time (s) Data (First Attempt) Data Retransmits Acked Data (b) With sender-side rate limit. 0.00 0.02 0.04 0.06 0.08 0.10 0.12 0.14 0.0 0.5 1.0 1.5 2.0 Sequence number (in M) Time (s) Data (First Attempt) Data Retransmits Acked Data (c) With TCP pacing. Figure 5.26: TCP sequence graphs for three flows passing through a policer with a token refresh rate of 1.5 Mbps and a bucket capacity of 8KB. The rate limit in (b) is set to 95% of the policing rate (i.e., 1.425 Mbps). sequence graphs for the first few seconds in both trials. The rate-limited flow clearly performs better in comparison, achieving a goodput of 1.38 Mbps compared to 452 kbps. The flow with- out rate limiting at the server side only gets a fraction of the goodput that the policer actually allows. The reason is that the small token bucket drops packets from larger bursts, resulting in low goodput. Finally, we measured the benefits of rate limiting video through lab trials. We fetched videos from a YouTube Web server, with traffic passing through our lab policier. For some trials, we inserted a shaper between the server and the policer, to rate limit the transfer. Non-rate-limited 125 playbacks observed an average rebuffering ratio of 1%, whereas the rate-limited flows did not see a single rebuffering event. 5.5.2.2 Avoiding Bursty Transmissions Rate limiting in practice may be difficult, as the sender needs to estimate the throttling rate in near real-time at scale. We explored two viable alternatives to decrease loss by reducing the burstiness of transmissions, giving the policer an opportunity to generate tokens between packets. We start by trying TCP Pacing [5]. Whereas a traditional TCP sender relies solely on ACK clocking to determine when to transmit new data, pacing spreads new packets across an RTT and avoids bursty traffic. Figure 5.26c shows the effect of pacing in the lab setup used in Sec- tion 5.5.2.1, but with a pacer in place of the shaper. Overall, the flow achieves a goodput of 1.23 Mbps which is worse than rate-limiting (1.38 Mbps) but a significant improvement over the unmodified flow (452 kbps). Packet loss is reduced from 5.2% to 1.3%. In addition, we confirmed the benefits of pacing by turning it on/off on multiple CDN servers serving real clients. Enabling pacing consistently caused loss rates to drop by 10 – 20%. Table 5.8 shows the results for two of the CDN servers (“base” and “paced” columns). 
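On Linux, per-socket pacing can also be requested from the kernel via the SO_MAX_PACING_RATE socket option (honored by the fq qdisc and, on newer kernels, by TCP's internal pacing). The sketch below is an assumption-laden illustration rather than the configuration used on the CDN servers; in particular, the fallback constant 47 is the common Linux value for this option, and availability depends on the kernel and Python version.

```python
import socket

# Python may not expose the constant by name; fall back to the common
# Linux value (47). The option value is in bytes per second.
SO_MAX_PACING_RATE = getattr(socket, "SO_MAX_PACING_RATE", 47)

def set_pacing_rate(sock, rate_bps):
    """Ask the kernel to pace this socket's transmissions at `rate_bps`."""
    sock.setsockopt(socket.SOL_SOCKET, SO_MAX_PACING_RATE, int(rate_bps // 8))

sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
set_pacing_rate(sock, 1.425e6)   # e.g., 95% of a 1.5 Mbps policing rate
```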
Even when transmissions are not bursty, heavy losses can still occur when the sender consistently sends at a rate larger than the policing rate. In Linux, loss recovery can trigger periods of slow start [42], in which the server sends two packets for every ACKed packet. This results in sending at twice the policed rate during recovery, and hence 50% of the retransmissions are dropped by the policer. To avoid this behavior, we modified the loss recovery to use packet conservation (for every ACKed packet, only one new packet is sent) initially and only use slow start if the retransmissions are delivered. Keeping slow start enables us to quickly recover from multiple losses within a window in a non-policed connection. Otherwise it will take N round trips to recover N packet losses. As with pacing, we experimentally deployed this change which caused loss rates to drop by 10 to 20% as well ("base" and "rec. fixed" columns in Table 5.8). After testing, we also upstreamed the recovery patch to the Linux 4.2 kernel [32].

Server | Median loss (base / paced / rec. fixed) | 95th pct. loss (base / paced / rec. fixed)
US | 7.5% / 6.7% / 6.4% | 34.8% / 26.7% / 32.2%
India | 9.9% / 7.8% / 8.4% | 52.1% / 35.8% / 34.6%
Table 5.8: Observed median and 95th percentile loss rates on policed connections served by two selected CDN servers.

5.5.3 Summary of Recommendations

While extensive additional experimentation is necessary, we make the following initial suggestions to mitigate the adverse effects of policers:

1. ISPs can configure policers with smaller burst sizes. This prevents TCP's congestion window from growing too far beyond the policing rate when the token bucket fills up, thereby resulting in fewer bursty losses.

2. ISPs can deploy shapers with small buffers instead of policers. Shaping avoids the heavy losses usually seen when employing policing, while using only small buffers prevents excessive queueing delays.

3. Content providers can rate-limit their traffic, especially when streaming large content. This can reduce the gap between the sending rate and the policing rate, resulting in fewer bursty losses.

4. Content providers can employ TCP pacing on their servers to reduce the burstiness of their traffic.

Our initial results show that these strategies can minimize or eliminate packet loss, and improve playback quality.

5.6 Conclusion

Policing high-volume content such as video and cloud storage can be detrimental for content providers, ISPs, and end-users alike. Using traces from Google, we found a non-trivial prevalence of traffic policing in almost every part of the globe: between 2% and 7% of lossy video traffic worldwide is subject to policing, often at throughput rates below what would be necessary for HD video delivery. Policers drop packets, and this results in policer-induced packet loss rates of 21% on average, 6× that of non-policed traffic. An analysis of data from a secondary dataset collected through M-Lab's Network Diagnostics Toolkit yielded similar results. As a result of these loss rates, the playback quality of policed traffic is distributionally worse than that of non-policed traffic, a significant issue from a content provider perspective since it can affect user engagement in the content. We have identified benign traffic management alternatives that avoid adverse impacts of policing while still permitting ISPs to exercise their right to control their infrastructure: content providers can pace traffic, and ISPs can shape traffic using small buffers.
Chapter 6: Literature Review

6.1 Analyzing and Reducing Web Latency

A number of studies analyzed latency across the Web. Recent work based on packet traces captured by a large Chinese content provider investigated instances where connections stalled and attributed these delays to loss, packet delay, or receiver-side limitations [149]. Nikravesh et al. recently dissected the performance of mobile networks [96] and proposed Mobilyzer, a platform for controllable mobile network measurements [97]. A separate set of studies pointed out that large network queues can massively inflate round-trip times and proposed solutions [13, 48, 62, 100]. Complementary to this, other work emphasized the importance of achieving low tail latency, and explored ways to improve Web latency and throughput and, more generally, content delivery [25, 28, 47, 58, 60, 71, 79, 81, 83, 106]. In her thesis, Li discusses ways to optimize the transport protocol to reduce the latency of short flows by reducing a connection's establishment and transmission time [75]. Wang et al. present techniques to speed up Web page loads by restructuring the page load process [137]. In his Master's thesis, Netravali proposes a Web measurement toolkit that he also uses to motivate a caching approach that can speed up Web page load times [94].

The study of TCP loss recovery in real networks is not new [14, 76, 110, 125]. Measurements from 1995 showed that 85% of timeouts were due to insufficient duplicate ACKs to trigger Fast Retransmit [76], and 75% of retransmissions happened during timeout recovery. A study of the 1996 Olympic Web servers estimated that SACK might only eliminate 4% of timeouts [14]. The authors invented limited transmit, which was standardized [11] and widely deployed. An analysis of the Coral CDN service identified loss recovery as one of the major performance bottlenecks [125]. Similarly, improving loss recovery is a perennial goal, and such improvements fall into several broad categories: better strategies for managing the window during recovery [14, 54, 87], detecting and compensating for spurious retransmissions triggered by reordering [20, 80], disambiguating loss and reordering at the end of a stream [119], and improving the retransmit timer estimation. Mittal et al. go even further by leveraging different quality of service (QoS) levels, resulting in a congestion control algorithm that prioritizes packets needed for loss recovery while using excess capacity to forward lower-priority data [89].

TCP's slow RTO recovery is known to be a bottleneck. For example, Griwodz and Halvorsen showed that repeated long RTOs are the main cause of game unresponsiveness [51]. Petlund et al. [102] propose to use a linear RTO, which has been incorporated in the Linux kernel as a non-default socket option for "thin" streams. This approach still relies on receiving duplicate ACKs and does not address RTOs resulting from tail losses. Mondal and Kuzmanovic further argue that exponential RTO backoff should be removed because it is not necessary for the stability of the Internet [90]. In contrast, Reactive does not change the RTO timer calculation or exponential backoff and instead leaves the RTO conservative for stability, but sends a few probes before concluding the network is badly congested. F-RTO reduces the number of spurious timeout retransmissions [116]. It is enabled by default in Linux, and we used it in all our experiments.
F-RTO has close to zero latency impact in our end-user benchmarks because it is rarely triggered. It relies on the availability of new data to send on timeout, but typically tail losses happen at the end of an HTTP or RPC-type response. Reactive does not require new data and hence does not have this limitation. Early Retransmit [9] reduces timeouts when a connection has received a certain number of duplicate ACKs. F-RTO and Early Retransmit are both complementary to Reactive.

In line with our approach, Vulimiri et al. [135] make a case for the use of redundancy in the context of the wide-area Internet as an effective way to convert a small amount of extra capacity into reduced latency. Vulimiri's thesis expands on this discussion [134]. RPT introduces redundancy-based loss protection with low traffic overhead in content-aware networks [53]. Opstad et al. describe how piggybacking unacknowledged segments with new data can preempt the experience of loss [98]. Shah et al. investigate when and how redundant service requests can improve latency [121]. Studies targeting low-latency datacenters aim to reduce the long tail of flow completion times by reducing packet drops [8, 147]. However, their design assumptions preclude their deployment in the Internet.

Applying FEC to transport (at nearly every layer) is an old idea. Sundararajan et al. [126] suggested placing network coding in TCP, and Kim et al. [70] extended this work by implementing a variant over UDP while presenting TCP-like capabilities to applications (mainly for high-loss wireless environments). Among others, Baldantoni et al. [16] and Tickoo et al. [131] explored extending TCP to incorporate FEC. None of these, to our knowledge, addresses the issues faced when building a real kernel implementation with today's TCP stack, nor do they address middleboxes tampering with packets. Finally, Maelstrom is an FEC variant for long-range communication between data centers that leverages the benefits of combining and encoding data from multiple sources into a single stream [15].

6.2 Path Inflation and Mobile Performance

Research showed 10 years ago that interdomain routes suffer from path inflation, particularly due to infrastructure limitations such as peering points existing only at select locations, but also due to routing policies [123]. In contrast, Chiu et al. have shown recently that many content providers have direct connections to the majority of networks hosting their users [33]. Other researchers investigated reasons for suboptimal performance of clients of Google's CDN, showing that clients in the same geographical area can experience different latencies to Google's servers [72, 150]. Rula and Bustamante investigate how poor localization of clients by DNS can result in inflated latency [112].

Cellular networks present new challenges and opportunities for studying path inflation. One study demonstrates differences in metro-area mobile performance but does not investigate the root causes [122]. Other work shows that routing over suboptimal paths due to a lack of nearby ingress points causes a 45% increase in RTT latency because of the additional distance traveled, compared to idealized routing [38].

There are many other studies that look at performance limitations in mobile networks besides path inflation. Xu et al. analyze how transparent Web proxies in cellular networks affect performance [141]. The Flywheel study measures the impact of content size on performance and proposes a compression proxy to improve it [3].
Finally, some studies recently looked at the performance of mobile virtual network operators (MVNOs) that operate on top of existing cellular infrastructure [120, 146].

6.3 Traffic Policing

To our knowledge, no prior work has explored the prevalence and impact of policers at a global scale. Others explored policing for differentiated services [113], fair bandwidth allocation [69], or throughput guarantees [45, 143]. One study explored the relationship between TCP performance and token bucket policers in a lab setting and proposed a TCP-friendly version achieving per-flow goodputs close to the policed rate regardless of the policer configuration [133]. Finally, a concurrently published study investigated the impact of traffic policing applied by T-Mobile to content delivery. This behavior was recently introduced as part of the carrier's "BingeOn" program, where traffic can be zero-rated (i.e., it results in no charges to customers) while at the same time being policed to a rate of 1.5 Mbps [63].

Our work is inspired by and builds upon the large number of existing TCP trace analysis tools [26, 29, 88, 99, 101, 118, 138, 142]. On top of these tools, we are able to annotate higher-level properties of packets and flows that simplify analysis of packet captures at the scale of a large content provider.

A few threads of work are complementary to ours. One is the rather large body of work that has explored ways to understand and improve Web transfer performance (e.g., latency, throughput) and, more generally, content delivery, especially at the tail [25, 28, 47, 58, 60, 71, 79, 81, 83, 149]. None of these has considered the deleterious effects of policers.

Prior work has also explored the relationship between playback quality and user engagement [37]. Our work explores the relationship between network effects (pathological losses due to policers) and playback quality, and, using results from this prior work, we are able to establish that policing can adversely affect user satisfaction.

A line of research explores methods to detect service differentiation [17, 36, 65, 129, 148]. They all exploit differences in flow performance characteristics, like goodput or loss rate, to identify differentiated traffic classes. However, they do not attempt to understand the underlying mechanisms (policing or shaping) used to achieve traffic discrimination. Prior work has explored detecting traffic shaping using active methods [66]; in contrast, we detect burst-tolerant shaping purely passively.

Finally, some network operators were already aware of policing's disadvantages, presenting anecdotal evidence of bad performance [34, 132, 139].

Chapter 7: Conclusions

In this dissertation we analyzed three types of performance-limiting factors that affect Web transfers: protocol limitations, structural limitations, and third-party interference. In addition, we explored different approaches to mitigate some of the performance limiters we measured.

In Chapters 2 and 3 we used large-scale measurements from Google as well as longitudinal data collected by M-Lab's Network Diagnostics Toolkit (NDT) to find the sources of latency introduced by TCP, the most widely used transport protocol for delivering Web content. We found that queuing at a bottleneck router as well as packet loss can significantly slow down a TCP-based transfer. To partially address these limitations we designed and deployed algorithms to minimize the frequency of losses as well as reduce loss recovery times.
In Chapter 4 we focused on delays introduced by topological limitations in mobile carriers. Based on end-to-end path information collected on mobile phones, we showed how traffic between content providers and mobile clients can take geographically circuitous routes. Thus, even when content providers optimize their network infrastructure for low latency, Web transfers can be exposed to high delays due to the dependency on the mobile carrier's infrastructure for delivery.

Finally, in Chapter 5 we analyzed the global prevalence and impact of traffic policing, a traffic engineering technique that enforces rate limits by immediately dropping packets once the limits are exceeded. Again, we were able to collect large-scale measurements by capturing client-facing TCP traffic at almost all Google frontends. We found that, in the presence of traffic policing, Web traffic can suffer from heavy packet loss. When we tied the affected transfers to the corresponding video playbacks, we also found that policing can cause a lower quality of experience and higher user dissatisfaction. These findings motivated our search for best practices that policing ISPs can leverage to avoid or at least minimize the negative effects of policers, as well as the design and testing of techniques that content providers can use to better deal with policers affecting their traffic.

7.1 Future Directions

In this dissertation we presented multiple measurement studies and developed several analysis tools to derive insights about performance-limiting factors in Web transfers. This enabled us to reason about different root causes for suboptimal performance, especially their prevalence and impact on user quality of experience on a global scale.

The measurements discussed in the previous chapters point towards some individual high-impact problems that we should address. However, our work is only a step towards fully understanding the large space of possible root causes for bad Web performance. To get a complete picture we need to overcome a number of challenges and limitations, some of which affected us first-hand during our measurements.

We need a better infrastructure to collect Internet-scale measurements. Obtaining meaningful measurements that let us reason about Web performance at a global scale is hard for multiple reasons. Web content delivery involves several autonomous parties, including content and Internet service providers. Since they typically do not have access to each other's infrastructure, end-to-end monitoring of Web traffic is generally not possible. Heuristics sometimes help to deal with limitations introduced by a lack of data. For example, in this work we used knowledge about TCP internals to infer packet loss without having access to the routers that actually discarded the packets. Heuristics have their limits, though. Anomalies like a sudden latency inflation are hard to diagnose with certainty since the information collected at endpoints lacks the necessary signals (e.g., where in the network packets get delayed). Conversely, an ISP responsible for an anomaly can pinpoint the exact root cause but lacks the end-to-end information, like the effect on a user's quality of experience, to ascertain a problem's severity. Setting up shared measurement infrastructures that enable us to collect data from all participants involved in content delivery would be of tremendous value to overcome these limitations.

Another challenge arises from the size of the Internet.
When we look at Web-related measurements, we often expect that they are conducted at a global scale to be representative. But setting up a single measurement system with global coverage is infeasible for most researchers and limited to a few large providers with the necessary resources. To avoid a continued dependency on large-scale studies coming only from these few providers, we need to develop a stronger drive towards establishing a multitude of smaller measurement systems across the world that enable us to reason about performance in diverse environments as well. Existing systems like NDT, RIPE Atlas, or SamKnows are a great start [82, 111, 114]. A key requirement for this is a standard that ensures that results collected in different systems are comparable and can be easily merged together to form a single data corpus used for analysis.

We need to collect signals tied to realistic traffic patterns. Besides establishing a measurement infrastructure, it is imperative to carefully design the measurement tasks themselves. Our goal is to obtain signals that let us reason about the performance that actual users experience when accessing Web content. The heterogeneity of Web services and the traffic patterns they produce make this difficult. Existing tools already collect useful data, but too often they are restricted to a narrow set of traffic patterns. For example, NDT captures packet headers for transfers between a fixed set of measurement servers and clients all over the world, but the captures are based solely on a single traffic pattern, the transmission of a continuous data stream for ten seconds. This is helpful to reason about the performance problems of some applications like file transfers but unsuitable for others. Web search, for example, only requires small file transfers, whereas video streams are typically delivered in larger chunks, with connections often being idle for longer time frames in between. As a result, they might experience different performance limiters, and we can only analyze them if we model their traffic patterns in our measurements.

We need better tools to combine data from different sources. Over the years the community has evolved tools to gain insights about different aspects of network communication, including ping to measure the connectivity and latency between two machines, tcpdump to get transport-level packet captures, and traceroute to discover the paths via which packets are routed through the Internet. Joined together, the data from all these sources can help us paint a complete picture of Web performance, but most of the time we consult only a small number of tools to keep the complexity of an analysis system at a minimum. If we take on the challenge of designing tools that combine data from different sources, for example to annotate Web service calls with knowledge about the transport protocol and network performance, we will likely be able to answer questions about Web performance limitations much more precisely.

We need continuous measurements. Our networks are constantly evolving. Tomorrow's networks might experience different performance-limiting factors, causing one-off measurement studies to become outdated quickly. In addition, many studies are motivated by anecdotal evidence for a problem (this includes most work discussed in previous chapters). This means that we often start measurement campaigns only after indicators of high-impact problems surface.
This prevents us from finding problems before they become a big issue, and it makes A/B comparisons harder (i.e., measuring the performance with and without a problem being present). To overcome this challenge, we need to design our measurement campaigns such that they can be run continuously, which lets us monitor the lifetimes of problems that limit the performance of Web content delivery.

Finally, I want to point towards a few promising areas of research for better network performance. As I have demonstrated in this work, low latency is a key performance goal for Web services. As such, any techniques that can reduce delays introduced by the network and transport layers are valuable.

Revisiting delay-based congestion control. One way to significantly improve the performance of a transport protocol is to take another look at delay-based congestion control mechanisms, like TCP Vegas [24]. Studies have shown that these algorithms achieve much lower latencies than their loss-based counterparts, like TCP CUBIC [52]. In the past, the deployment of delay-based mechanisms was blocked by the fact that they achieve poor performance when connections that use the delay-based approach compete with connections that use a loss-based approach. However, their superior performance with respect to latency warrants another look.

Taking advantage of network signals. Signaling mechanisms like ECN [107] have proven useful to aid transport protocols in minimizing congestion in the network. They act as an early indicator of network oversubscription, whereas the traditional congestion control approach relies purely on the loss signal induced by overflowing buffers. However, ECN is only a one-bit signal: if a packet gets marked, the sender should slow down. By enhancing this signal to provide more fine-grained feedback to a sender, for example by indicating to which degree the network is currently congested, we can tune transport protocols to use available capacity more efficiently while keeping network latency at a minimum.

Bibliography

[1] Policing Detection (Supplemental Material). https://usc-nsl.github.io/policing-detection/.

[2] Web Page Replay. http://code.google.com/p/web-page-replay/.

[3] Victor Agababov, Michael Buettner, Victor Chudnovsky, Mark Cogan, Ben Greenstein, Shane McDaniel, Michael Piatek, Colin Scott, Matt Welsh, and Bolian Yin. Flywheel: Google's Data Compression Proxy for the Mobile Web. In Proc. of the Symposium on Networked Systems Design and Implementation (NSDI '15), 2015.

[4] Bernhard Ager, Nikolaos Chatzis, Anja Feldmann, Nadi Sarrar, Steve Uhlig, and Walter Willinger. Anatomy of a Large European IXP. In Proceedings of the ACM SIGCOMM 2012 Conference on Applications, Technologies, Architectures, and Protocols for Computer Communication (SIGCOMM '12), pages 163–174, New York, NY, USA, 2012. ACM.

[5] Amit Aggarwal, Stefan Savage, and Thomas E. Anderson. Understanding the Performance of TCP Pacing. In Proc. of the IEEE Int. Conf. on Computer Communications (INFOCOM '00), 2000.

[6] Akamai. Cloud Computing Infrastructure. http://www.akamai.com/html/resources/cloud-computing-infrastructure.html.

[7] Akamai. The State of the Internet (3rd Quarter 2012), 2012. http://www.akamai.com/stateoftheinternet/.

[8] Mohammad Alizadeh, Albert Greenberg, David A. Maltz, Jitendra Padhye, Parveen Patel, Balaji Prabhakar, Sudipta Sengupta, and Murari Sridharan. Data center TCP (DCTCP). In Proc. of SIGCOMM, 2010.

[9] M. Allman, K. Avrachenkov, U. Ayesta, J. Blanton, and P.
Hurtig. RFC 5827: Early Retransmit for TCP and Stream Control Transmission Protocol, 2010. [10] M. Allman, K. Avrachenkov, U. Ayesta, J. Blanton, and P. Hurtig. RFC 5827: Early Retransmit for TCP and Stream Control Transmission Protocol, 2010. [11] M. Allman, H. Balakrishnan, and S. Floyd. Enhancing TCP’s Loss Recovery Using Lim- ited Transmit, January 2001. RFC 3042. [12] M. Allman, V . Paxson, and E. Blanton. TCP congestion control, September 2009. RFC 5681. 141 [13] Mark Allman. Comments on Bufferbloat. ACM SIGCOMM Computer Communication Review, 43(1):30–37, 2013. [14] Hari Balakrishnan, Venkata N. Padmanabhan, Srinivasan Seshan, Mark Stemm, and Randy H. Katz. TCP Behavior of a Busy Internet Server: Analysis and Improvements. In Proc. of INFOCOM, 1998. [15] Mahesh Balakrishnan, Tudor Marian, Kenneth P. Birman, Hakim Weatherspoon, and Lak- shmi Ganesh. Maelstrom: transparent error correction for communication between data centers. IEEE/ACM Trans. Netw., 19(3), June 2011. [16] L. Baldantoni, H. Lundqvist, and G. Karlsson. Adaptive end-to-end FEC for improving TCP performance over wireless links. In Proc. of Conf. on Commun., June 2004. [17] Vitali Bashko, Nikolay Melnikov, Anuj Sehgal, and Jurgen Schonwalder. BonaFide: A traffic shaping detection tool for mobile networks. In Proc. of IFIP/IEEE Int. Symp. on Integrated Network Management (IM ’13), 2013. [18] C. Bastian, T. Klieber, J. Livingood, J. Mills, and R. Woundy. RFC 6057: Comcast’s Protocol-Agnostic Congestion Management System, 2010. [19] Steven Bauer, David Clark, and William Lehr. PowerBoost. In Proc. of the ACM Workshop on Home Networks (HomeNets ’11), 2011. [20] E. Blanton and M. Allman. Using TCP DSACKs and SCTP duplicate TSNs to detect spurious retransmissions, February 2004. RFC 3708. [21] E. Blanton, M. Allman, L. Wang, I. Jarvinen, M. Kojo, and Y . Nishida. A Conservative Loss Recovery Algorithm Based on Selective Acknowledgment (SACK) for TCP, 2012. RFC 6675. [22] R. Braden. RFC 1122: Requirements for Internet Hosts - Communication Layers, 1989. [23] L. Brakmo, S. O’Malley, and L. Peterson. TCP Vegas: End to End Congestion Avoidance on a Global Internet. ACM Comput. Commun. Rev., August 1996. [24] Lawrence S. Brakmo, Sean W. O’Malley, and Larry L. Peterson. TCP Vegas: New Tech- niques for Congestion Detection and Avoidance. In Proc. of the ACM Conference of the Special Interest Group on Data Communication (SIGCOMM ’94), 1994. [25] Bob Briscoe, Anna Brunstrom, Andreas Petlund, David Hayes, David Ros, Jyh Tsang, Stein Gjessing, Gorry Fairhurst, Carsten Griwodz, and Michael Welzl. Reducing Internet Latency: a Survey of Techniques and Their Merits. IEEE Communications Surveys & Tutorials, 2014. [26] Kevin Burns. TCP/IP Analysis & Troubleshooting Toolkit. John Wiley & Sons, 2003. [27] Matt Calder, Xun Fan, Zi Hu, Ethan Katz-Bassett, John Heidemann, and Ramesh Govin- dan. Mapping the Expansion of Google’s Serving Infrastructure. In Proc. of the Internet Measurement Conference (IMC ’13), 2013. 142 [28] Matt Calder, Xun Fan, Zi Hu, Ethan Katz-Bassett, John Heidemann, and Ramesh Govin- dan. Mapping the Expansion of Google’s Serving Infrastructure. In Proc. of the ACM Internet Measurement Conference (IMC ’13), 2013. [29] CAPTCP. http://research.protocollabs.com/captcp/. [30] Marta Carbone and Luigi Rizzo. Dummynet revisited. ACM Comput. Commun. Rev., 40(2), 2010. [31] Robert L. Carter and Mark Crovella. Measuring Bottleneck Link Speed in Packet-Switched Networks. Performance Evaluation, 27/28(4), 1996. [32] Yuchung Cheng. 
tcp: reducing lost retransmits in recovery (Linux kernel patches). http: //comments.gmane.org/gmane.linux.network/368957, 2015. [33] Yi-Ching Chiu, Brandon Schlinker, Abhishek Balaji Radhakrishnan, Ethan Katz-Bassett, and Ramesh Govindan. Are We One Hop Away from a Better Internet? In Proc. of the Internet Measurement Conference (IMC ’15), 2015. [34] Cisco. Comparing Traffic Policing and Traffic Shaping for Bandwidth Limiting. http://www.cisco.com/c/en/us/support/docs/quality-of-service-qos/ qos-policing/19645-policevsshape.html#traffic. [35] Cisco. The Zettabyte Era – Trends and Analysis. White Paper, 2014. [36] Marcel Dischinger, Massimiliano Marcon, Saikat Guha, P. Krishna Gummadi, Ratul Ma- hajan, and Stefan Saroiu. Glasnost: Enabling End Users to Detect Traffic Differentiation. In Proc. of the USENIX Symposium on Networked Systems Design and Implementation (NSDI ’10), 2010. [37] Florin Dobrian, Vyas Sekar, Asad Awan, Ion Stoica, Dilip Antony Joseph, Aditya Ganjam, Jibin Zhan, and Hui Zhang. Understanding the Impact of Video Quality on User Engage- ment. In Proc. of the ACM Conference of the Special Interest Gr. on Data Communication (SIGCOMM ’11), 2011. [38] Wei Dong, Zihui Ge, and Seungjoon Lee. 3G Meets the Internet: Understanding the Performance of Hierarchical Routing in 3G Networks. In ITC, 2011. [39] N. Dukkipati, N. Cardwell, Y . Cheng, and M. Mathis. Tail Loss Probe (TLP): An Algorithm for Fast Recovery of Tail Losses, 2013. [40] Nandita Dukkipati. tcp: Tail Loss Probe (TLP). http://lwn.net/Articles/542642/. [41] Nandita Dukkipati, Neal Cardwell, Yuchung Cheng, and Matt Mathis. Tail Loss Probe (TLP): An Algorithm for Fast Recovery of Tail Losses, Feburary 2013. draft-dukkipati- tcpm-tcp-loss-probe-01. [42] Nandita Dukkipati, Matt Mathis, Yuchung Cheng, and Monia Ghobadi. Proportional Rate Reduction for TCP. In Proc. of the ACM Internet Measurement Conference (IMC ’11), 2011. 143 [43] Nandita Dukkipati, Tiziana Refice, Yuchung Cheng, Jerry Chu, Tom Herbert, Amit Agar- wal, Arvind Jain, and Natalia Sutin. An Argument for Increasing TCP’s Initial Congestion Window. ACM Comput. Commun. Rev., 40, 2010. [44] Nandita Dukkipati, Tiziana Refice, Yuchung Cheng, Jerry Chu, Tom Herbert, Amit Agar- wal, Arvind Jain, and Natalia Sutin. An Argument for Increasing TCP’s Initial Congestion Window. ACM SIGCOMM Computer Communications Review, 40:27–33, 2010. [45] W. Feng, Dilip D. Kandlur, Debanjan Saha, and Kang G. Shin. Understanding and Im- proving TCP Performance Over Networks With Minimum Rate Guarantees. IEEE/ACM Transactions on Networking, 7(2), 1999. [46] Sally Floyd and Van Jacobson. Random Early Detection Gateways for Congestion Avoid- ance. IEEE/ACM Transactions on Networking, 1(4), 1993. [47] Aditya Ganjam, Faisal Siddiqui, Jibin Zhan, Xi Liu, Ion Stoica, Junchen Jiang, Vyas Sekar, and Hui Zhang. C3: Internet-Scale Control Plane for Video Quality Optimization. In Proc. of the USENIX Symposium on Networked Systems Design and Implementation (NSDI ’15), 2015. [48] Jim Gettys and Kathleen Nichols. Bufferbloat: Dark Buffers in the Internet. Queue, 9(11):40, 2011. [49] Monia Ghobadi, Yuchung Cheng, Ankur Jain, and Matt Mathis. Trickle: Rate Limiting YouTube Video Streaming. In Proc. of the USENIX Annual Technical Conference (ATC ’12), pages 191–196. USENIX, 2012. [50] Phillipa Gill, Martin F. Arlitt, Zongpeng Li, and Anirban Mahanti. The Flattening Internet Topology: Natural Evolution, Unsightly Barnacles or Contrived Collapse? In PAM, 2008. [51] Carsten Griwodz and P˚ al Halvorsen. 
The fun of using TCP for an MMORPG. In Proc. of NOSSDAV, 2006. [52] Sangtae Ha, Injong Rhee, and Lisong Xu. CUBIC: a new TCP-friendly high-speed TCP variant. SIGOPS Oper. Syst. Rev., 42(5), July 2008. [53] Dongsu Han, Ashok Anand, Aditya Akella, and Srinivasan Seshan. RPT: Re-architecting Loss Protection for Content-Aware Networks. In Proc. of NSDI, 2012. [54] J. Hoe. Improving the start-up behavior of a congestion control scheme for TCP. ACM Comput. Commun. Rev., August 1996. [55] Michio Honda, Yoshifumi Nishida, Costin Raiciu, Adam Greenhalgh, Mark Handley, and Hideyuki Tokuda. Is it still possible to extend TCP? In Proc. of IMC, 2011. [56] HTTP Archive. Chrome Page Load Metrics (04/2016). https://bigquery.cloud. google.com/table/httparchive:har.2016_04_15_chrome_pages. [57] Ningning Hu and Peter Steenkiste. Evaluation and Characterization of Available Band- width Probing Techniques. IEEE Journal on Selected Areas in Communications, 21(6), 2003. 144 [58] Te-Yuan Huang, Ramesh Johari, Nick McKeown, Matthew Trunnell, and Mark Watson. A Buffer-Based Approach to Rate Adaptation: Evidence from a Large Video Streaming Service. In Proc. of the ACM Conference of the Special Interest Group on Data Commu- nication (SIGCOMM ’14), 2014. [59] Amy Hughes, Joe Touch, and John Heidemann. Issues in TCP Slow-Start Restart after Idle, December 2001. draft-hughes-restart-00. [60] Jie Hui, Kevin Lau, Ankur Jain, Andreas Terzis, and Jeff Smith. YouTube performance is improved in T-Mobile network. http://velocityconf.com/velocity2014/public/ schedule/detail/35350. [61] Van Jacobson. Congestion avoidance and control. In ACM SIGCOMM Computer Commu- nication Review, volume 18, 1988. [62] Haiqing Jiang, Yaogong Wang, Kyunghan Lee, and Injong Rhee. Tackling Bufferbloat in 3G/4G Networks. In Proc. of the ACM Internet Measurement Conference (IMC ’12), 2012. [63] Arash Molavi Kakhki, Fangfan Li, David Choffnes, Alan Mislove, and Ethan Katz-Bassett. BingeOn Under the Microscope: Understanding T-Mobile’s Zero-Rating Implementation. In Proc. of Internet-QoE Workshop, 2016. [64] Arash Molavi Kakhki, Abbas Razaghpanah, Hyungjoon Koo, Anke Li, Rajeshkumar Golani, David Choffnes, Phillipa Gill, and Alan Mislove. Identifying Traffic Differentia- tion in Mobile Networks. In Proceedings of the 15th ACM/USENIX Internet Measurement Conference (IMC’15), Tokyo, Japan, October 2015. [65] Partha Kanuparthy and Constantine Dovrolis. DiffProbe: Detecting ISP Service Discrim- ination. In Proc. of the IEEE International Conference on Computer Communications (INFOCOM ’10), 2010. [66] Partha Kanuparthy and Constantine Dovrolis. ShaperProbe: End-to-end Detection of ISP Traffic Shaping Using Active Methods. In Proc. of the ACM Internet Measurement Con- ference (IMC ’11), 2011. [67] Ethan Katz-Bassett, John P. John, Arvind Krishnamurthy, David Wetherall, Thomas An- derson, and Yatin Chawathe. Towards IP geolocation using delay and topology measure- ments. In IMC, 2006. [68] Srinivasan Keshav. A Control-theoretic Approach to Flow Control. In Proc. of the ACM Conference of the Special Interest Group on Data Communication (SIGCOMM ’91), 1991. [69] J. Kidambi, D. Ghosal, and B. Mukherjee. Dynamic Token Bucket (DTB): A Fair Band- width Allocation Algorithm for High-Speed Networks. Journal of High-Speed Networks, 2001. [70] MinJi Kim, Jason Cloud, Ali ParandehGheibi, Leonardo Urbina, Kerim Fouli, Douglas Leith, and Muriel Medard. Network Coded TCP (CTCP). arXiv:1212.2291. 145 [71] Christian Kreibich, Nicholas Weaver, Boris Nechaev, and Vern Paxson. 
Netalyzr: Illumi- nating the Edge Network. In Proc. of the ACM Internet Measurement Conference (IMC ’10), 2010. [72] Rupa Krishnan, Harsha V . Madhyastha, Sushant Jain, Sridhar Srinivasan, Arvind Krishna- murthy, Thomas Anderson, and Jie Gao. Moving Beyond End-to-End Path Information to Optimize CDN Performance. In Proc. of IMC, 2009. [73] Craig Labovitz, Scott Iekel-Johnson, Danny McPherson, Jon Oberheide, and Farnam Ja- hanian. Internet inter-domain traffic. In SIGCOMM, 2010. [74] Craig Labovitz, Scott Iekel-Johnson, Danny McPherson, Jon Oberheide, and Farnam Ja- hanian. Internet inter-domain traffic. SIGCOMM Comput. Commun. Rev., 41(4):–, August 2010. [75] Qingxi Li. Reducing Short Flow’s Latency in the Internet. PhD thesis, University of Illinois at Urbana-Champaign, 2016. [76] Dong Lin and H.T. Kung. TCP fast recovery strategies: Analysis and improvements. In Proc. of INFOCOM, 1998. [77] Greg Linden. Make Data Useful. http://sites.google.com/site/glinden/Home/ StanfordDataMining.2006-11-28.ppt, 2006. [78] L´ aszl´ o Lov´ asz. On the Ratio of Optimal Integral And Fractional Covers. Discrete Mathe- matics, 13(4):383–390, 1975. [79] M. Luckie, A. Dhamdhere, D. Clark, B. Huffaker, and K. Claffy. Challenges in Inferring Internet Interdomain Congestion. In Proc. of the ACM Internet Measurement Conference (IMC ’14), 2014. [80] R. Ludwig and R. H. Katz. The Eifel Algorithm: Making TCP Robust Against Spurious Retransmissions. (ACM) Comp. Commun. Rev., 30(1), January 2000. [81] M-Lab. ISP Interconnection and its Impact on Consumer Internet Perfor- mance. http://www.measurementlab.net/static/observatory/M-Lab_ Interconnection_Study_US.pdf. [82] M-Lab. Network Diagnostics Toolkit.http://www.measurementlab.org/tools/ndt. [83] Ratul Mahajan, Ming Zhang, Lindsey Poole, and Vivek S. Pai. Uncovering Performance Differences Among Backbone ISPs with Netdiff. In USENIX Symp. on Networked Systems Design & Implementation (NSDI ’08), 2008. [84] Zhuoqing Morley Mao, Charles D. Cranor, Fred Douglis, Michael Rabinovich, Oliver Spatscheck, and Jia Wang. A Precise and Efficient Evaluation of the Proximity Between Web Clients and Their Local DNS Servers. In USENIX ATC, 2002. [85] M. Mathis. Relentless Congestion Control, March 2009. draft-mathis-iccrg-relentless-tcp- 00.txt. 146 [86] Matthew Mathis and Jamshid Mahdavi. Forward Acknowledgement: Refining TCP Con- gestion Control. In Proc. of the ACM Conference of the Special Interest Group on Data Communication (SIGCOMM ’96), 1996. [87] Matthew Mathis and Jamshid Mahdavi. Forward acknowledgment: refining TCP conges- tion control. ACM Comput. Commun. Rev., 26(4), August 1996. [88] Microsoft Research TCP Analyzer. http://research.microsoft.com/en-us/ projects/tcpanalyzer/. [89] Radhika Mittal, Justine Sherry, Sylvia Ratnasamy, and Scott Shenker. Recursively Cau- tious Congestion Control. In Proc. of the Symposium on Networked Systems Design and Implementation (NSDI ’14), 2014. [90] Amit Mondal and Aleksandar Kuzmanovic. Removing exponential backoff from TCP. ACM Comput. Commun. Rev., 38(5), September 2008. [91] John Nagle. RFC 896: Congestion Control in IP/TCP Internetworks, 1984. [92] NDT analysis source code and results. https://github.com/USC-NSL/ ndt-analysis. [93] Netflix. Letter to Shareholders (Q4 2014). http://ir.netflix.com/results.cfm. [94] Ravi Arun Netravali. Understanding and Improving Web Page Load Times on Modern Networks. Master’s thesis, Massachusetts Institute of Technology, 2014. [95] Kathleen Nichols and Van Jacobson. Controlling Queue Delay. 
Queue, 10(5), May 2012. [96] Ashkan Nikravesh, David R. Choffnes, Ethan Katz-Bassett, Z. Morley Mao, and Matt Welsh. Mobile Network Performance from User Devices: A Longitudinal, Multidimen- sional Analysis. In Proc. of the Passive Active Measurement Conference (PAM ’14), 2014. [97] Ashkan Nikravesh, Hongyi Yao, Shichang Xu, David R. Choffnes, and Z. Morley Mao. Mobilyzer: An Open Platform for Controllable Mobile Network Measurements. In Proc. of the Conference on Mobile Systems (MobiSys ’15), 2015. [98] Bendik R Opstad, Jonas Markussen, Iffat Ahmed, Andreas Petlund, Carsten Griwodz, and Pal Halvorsen. Latency and fairness trade-off for thin streams using redundant data bundling in TCP. In Proc. of the Conference on Local Computer Networks (LCN ’15), 2015. [99] Jitendra Pahdye and Sally Floyd. On Inferring TCP Behavior. In Proc. of the ACM Con- ference of the Special Interest Group on Data Communication (SIGCOMM ’01), 2001. [100] Rong Pan, Preethi Natarajan, Chiara Piglione, Mythili Suryanarayana Prabhu, Vijay Sub- ramanian, Fred Baker, and Bill VerSteeg. PIE: A Lightweight Control Scheme to Address the Bufferbloat Problem. In Proc. of the IEEE Conference on High Performance Switching and Routing (HPSR ’13), 2013. 147 [101] V . Paxson. Automated packet trace analysis of TCP implementations. In Proc. of ACM SIGCOMM. ACM, 1997. [102] Andreas Petlund, Kristian Evensen, Carsten Griwodz, and P˚ al Halvorsen. TCP enhance- ments for interactive thin-stream applications. In Proc. of NOSSDAV, 2008. [103] Feng Qian, Alexandre Gerber, Zhuoqing Morley Mao, Subhabrata Sen, Oliver Spatscheck, and Walter Willinger. Tcp revisited: A fresh look at tcp in the wild. In Proceedings of the 9th ACM SIGCOMM Conference on Internet Measurement Conference, IMC ’09, pages 76–89, New York, NY , USA, 2009. ACM. [104] S. Radhakrishnan, Y . Cheng, J. Chu, A. Jain, and B. Raghavan. TCP Fast Open. In Proc. of CoNEXT, 2011. [105] Barath Raghavan and Alex Snoeren. Decongestion Control. In Proc. of HotNets, 2006. [106] Mohammad Rajiullah, Per Hurtig, Anna Brunstrom, Andreas Petlund, and Michael Welzl. An Evaluation of Tail Loss Recovery Mechanisms For TCP. ACM SIGCOMM Computer Communication Review, 45(1), 2015. [107] K. Ramakrishnan, S. Floyd, and D. Black. The Addition of Explicit Congestion Notifica- tion (ECN) to IP, September 2001. RFC 3042. [108] Irving Reed and Gustave Solomon. Polynomial Codes over Certain Finite Fields. Journ. of the Soc. for Industr. and Appl. Math., 8(2), jun 1960. [109] Sushant Rewaskar, Jasleen Kaur, and F Donelson Smith. A Passive State-Machine Ap- proach for Accurate Analysis of TCP Out-of-Sequence Segments. ACM SIGCOMM Com- puter Communication Review, 36(3):51–64, 2006. [110] Sushant Rewaskar, Jasleen Kaur, and F. Donelson Smith. A performance study of loss detection/recovery in real-world TCP implementations. Proc. of ICNP, 2007. [111] RIPE NCC. RIPE Atlas. http://atlas.ripe.net. [112] John P. Rula and Fabian E. Bustamante. Behind the Curtain - Cellular DNS and Content Replica Selection. In Proc. of the Internet Measurement Conference (IMC ’14), 2014. [113] Sambit Sahu, Philippe Nain, Christophe Diot, Victor Firoiu, and Donald F. Towsley. On Achievable Service Differentiation With Token Bucket Marking For TCP. In Proc. of the ACM SIGMETRICS Conf., 2000. [114] SamKnows. http://www.samknows.com. [115] Sandvine. Global Internet Phenomena Report 2H 2014. 2014. [116] P. Sarolahti, M. Kojo, K. Yamamoto, and M. Hata. 
Forward RTO-Recovery (F-RTO): An Algorithm for Detecting Spurious Retransmission Timeouts with TCP, September 2009. RFC 5682. 148 [117] Pasi Sarolahti and Alexey Kuznetsov. Congestion Control in Linux TCP. In Proc. of USENIX, 2002. [118] Stefan Savage. Sting: A TCP-based Network Measurement Tool. In USENIX Symposium on Internet Technologies and Systems (USITS ’99), 1999. [119] R. Scheffenegger. Improving SACK-based loss recovery for TCP, November 2010. draft- scheffenegger-tcpm-sack-loss-recovery-00.txt. [120] Paul Schmitt, Morgan Vigil, and Elizabeth Belding. A Study of MVNO Data Paths and Performance. In Proc. of the Passive and Active Measurement Conference (PAM ’16), 2016. [121] Nihar B. Shah, Kangwook Lee, and Kannan Ramchandran. When do redundant requests reduce latency? In Proc. of the Allerton Conference (Allerton ’13), 2013. [122] Joel Sommers and Paul Barford. Cell vs. WiFi: on the performance of metro area mobile connections. In IMC, 2012. [123] Neil T. Spring, Ratul Mahajan, and Thomas E. Anderson. The causes of path inflation. In SIGCOMM, 2003. [124] Neil T. Spring, Ratul Mahajan, David Wetherall, and Thomas E. Anderson. Measuring ISP topologies with Rocketfuel. IEEE/ACM Trans. Netw., 12(1), 2004. [125] Peng Sun, Minlan Yu, Michael J. Freedman, and Jennifer Rexford. Identifying Perfor- mance Bottlenecks in CDNs through TCP-Level Monitoring. In SIGCOMM Workshop on Meas. Up the Stack, August 2011. [126] J.K. Sundararajan, D. Shah, M. Medard, S. Jakubczak, M. Mitzenmacher, and J. Barros. Network Coding Meets TCP: Theory and Implementation. Proc. of the IEEE, 99(3), March 2011. [127] Srikanth Sundaresan, Walter de Donato, Nick Feamster, Renata Teixeira, Sam Crawford, and Antonio Pescap` e. Broadband Internet Performance: A View from the Gateway. ACM Comput. Commun. Rev., 41(4), 2011. [128] Hongsuda Tangmunarunkit, Ramesh Govindan, Scott Shenker, and Deborah Estrin. The Impact of Routing Policy on Internet Paths. In INFOCOM, 2001. [129] Mukarram Bin Tariq, Murtaza Motiwala, Nick Feamster, and Mostafa Ammar. Detect- ing Network Neutrality Violations with Causal Inference. In Proc. of the ACM Conf. on Emerging Networking Experiments and Technologies (CoNEXT ’09), 2009. [130] tcptrace. http://www.tcptrace.org. [131] Omesh Tickoo, Vijaynarayanan Subramanian, Shivkumar Kalyanaraman, and K. Ramakr- ishnan. LT-TCP: End-to-End Framework to improve TCP Performance over Networks with Lossy Channels. In Proc. of IWQoS, 2005. 149 [132] Iljitsch van Beijnum. BGP: Building Reliable Networks with the Border Gateway Protocol. O’Reilly Media, 2002. [133] Ronald van Haalen and Richa Malhotra. Improving TCP performance with bufferless token bucket policing: A TCP friendly policer. In Proc. of the IEEE Workshop on Local and Metropolitan Area Networks (LANMAN ’07), 2007. [134] Ashish Vulimiri. Latency-Bandwidth Tradeoffs in Internet Applications. PhD thesis, Uni- versity of Illinois at Urbana-Champaign, 2015. [135] Ashish Vulimiri, Oliver Michel, P. Brighten Godfrey, and Scott Shenker. More is less: reducing latency via redundancy. In Proc. of HotNets, 2012. [136] Michael Walfish, Mythili Vutukuru, Hari Balakrishnan, David Karger, and Scott Shenker. DDoS defense by offense. In Proc. of SIGCOMM, 2006. [137] Xiao Sophia Wang, Arvind Krishnamurthy, and David Wetherall. Speeding up Web Page Loads with Shandian. In Proc. of the Symposium on Networked Systems Design and Im- plementation (NSDI ’16), 2016. [138] Wireshark. http://www.wireshark.org. [139] Cathy Wittbrodt. 
CAR Talk: Configuration Considerations for Cisco’s Committed Access Rate. https://www.nanog.org/meetings/abstract?id=1290, 1998. [140] Qiang Xu, Junxian Huang, Zhaoguang Wang, Feng Qian, Alexandre Gerber, and Zhuo- qing Morley Mao. Cellular data network infrastructure characterization and implication on mobile content placement. In SIGMETRICS, 2011. [141] Xing Xu, Yurong Jiang, Tobias Flach, Ethan Katz-Bassett, David Choffnes, and Ramesh Govindan. Investigating Transparent Web Proxies in Cellular Networks. In Proc. of the Passive and Active Measurement Conference (PAM ’15), 2015. [142] yconalyzer. http://yconalyzer.sourceforge.net/. [143] Ikjun Yeom and A. L. Narasimha Reddy. Realizing Throughput Guarantees in a Differen- tiated Services Network. In Proc. of the IEEE Int. Conf. on Multimedia Computing and Systems (ICMCS ’99), 1999. [144] YouTube JavaScript Player API Reference. https://developers.google.com/ youtube/js_api_reference. [145] YouTube Statistics. http://www.youtube.com/yt/press/statistics.html. [146] Fatima Zarinni, Ayon Chakraborty, Vyas Sekar, Samir R. Das, and Phillipa Gill. A First Look at Performance in Mobile Virtual Network Operators. In Proc. of the Internet Mea- surement Conference (IMC ’14), 2014. [147] David Zats, Tathagata Das, Prashanth Mohan, Dhruba Borthakur, and Randy Katz. DeTail: reducing the flow completion time tail in datacenter networks. In Proc. of SIGCOMM, 2012. 150 [148] Ying Zhang, Zhuoqing Morley Mao, and Ming Zhang. Detecting traffic differentiation in backbone ISPs with NetPolice. In Anja Feldmann and Laurent Mathy, editors, Proc. of the ACM Internet Measurement Conference (IMC ’09), 2009. [149] Jianer Zhou, Qinghua Wu, Zhenyu Li, Steve Uhlig, Peter Steenkiste, Jian Chen, and Gao- gang Xie. Demystifying and Mitigating TCP Stalls at the Server Side. In Proc. of the ACM Conf. on Emerging Networking Experiments and Technologies (CoNEXT ’15), 2015. [150] Yaping Zhu, Benjamin Helsley, Jennifer Rexford, Aspi Siganporia, and Sridhar Srinivasan. LatLong: Diagnosing Wide-Area Latency Changes for CDNs. IEEE TNSM, 9(3), 2012. 151