Close
About
FAQ
Home
Collections
Login
USC Login
Register
0
Selected
Invert selection
Deselect all
Deselect all
Click here to refresh results
Click here to refresh results
USC
/
Digital Library
/
Computer Science Technical Report Archive
/
USC Computer Science Technical Reports, no. 935 (2013)
(USC DC Other)
USC Computer Science Technical Reports, no. 935 (2013)
PDF
Download
Share
Open document
Flip pages
Contact Us
Contact Us
Copy asset link
Request this asset
Transcript (if available)
Content
MappingtheExpansionofGoogle’sServingInfrastructure Technical Report 13-935b y , University of Southern California, Department of Computer Science Matt Calder University of Southern California Xun Fan USC/ISI Zi Hu USC/ISI Ethan Katz-Bassett University of Southern California John Heidemann USC/ISI Ramesh Govindan University of Southern California ABSTRACT Modern content-distribution networks both provide bulk con- tent and act as \serving infrastructure" for web services in order to reduce user-perceived latency. Serving infrastruc- tures such as Google's are now critical to the online economy, making it imperative to understand their size, geographic distribution, and growth strategies. To this end, we develop techniques that enumerate IP addresses of servers in these infrastructures, nd their geographic location, and identify the association between clients and clusters of servers. While general techniques for server enumeration and geolocation can exhibit large error, our techniques exploit the design and mechanisms of serving infrastructure to improve accuracy. We use the EDNS-client-subnet DNS extension to measure which clients a service maps to which of its serving sites. We devise a novel technique that uses this mapping to geolocate servers by combining noisy information about client loca- tions with speed-of-light constraints. We demonstrate that this technique substantially improves geolocation accuracy relative to existing approaches. We also cluster server IP ad- dresses into physical sites by measuring RTTs and adapting the cluster thresholds dynamically. Google's serving infras- tructure has grown dramatically in the ten months, and we use our methods to chart its growth and understand its con- tent serving strategy. We nd that the number of Google serving sites has increased more than sevenfold, and most of the growth has occurred by placing servers in large and small ISPs across the world, not by expanding Google's backbone. The U.S. Government is authorized to reproduce and dis- tribute reprints for Governmental purposes notwithstanding any copyright notation thereon. Any opinions, ndings and conclusions or recommendations expressed in this material are those of the author(s) and do not necessarily re ect the views of SSC-Pacic. y This technical report was originally released June 2013 and was updated with additional data and editing in August 2013. The updated technical report will appear in ACM Internet Measurements Conference 2013, Barcelona, Spain, and is Copyright 2013 by ACM. Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from permissions@acm.org. Copyright 20XX ACM X-XXXXX-XX-X/XX/XX ...$15.00. 1. INTRODUCTION Internet trac has changed considerably in recent years, as access to content is increasingly governed by web serving infrastructures. These consist of decentralized serving sites that contain one or more front-end servers. 
Clients of these infrastructures are directed to nearby front-ends, which ei- ther directly serve static content (e.g., video or images from a content distribution network like Akamai), or use split TCP connections to relay web acccess requests to back-end datacenters (e.g., Google's search infrastructure) [7,11,28]. Web service providers employ serving infrastructures to optimize user-perceived latency [31]. They invest heavily in building out these infrastructures and develop sophisticated mapping algorithms to direct clients to nearby front-ends. In recent months, as we discuss later, Google's serving infras- tructure has increased sevenfold in size. Given the increas- ing economic importance of these serving infrastructures, we believe it is imperative to understand the content serving strategies adopted by large web service providers, especially Google. Specically, we are interested in the geographic and topological scope of serving infrastructures, their expansion, and how client populations impact build-out of the serving infrastructure. Several prior studies have explored static snapshots of content-distribution networks [2, 14, 26], often focusing on bulk content delivery infrastructures [14], new mapping met- hodology [2], or new DNS selection methods [26]. In con- trast, our work focuses on web serving infrastructures, devel- ops more accurate methods to enumerate and locate front- ends and serving sites, and explores how one infrastructure, Google's, grows over ten months of active buildout. The rst contribution of this paper is a suite of meth- ods to enumerate the IP addresses of front-ends, geolocate them, and cluster them into serving sites. Our methods ex- ploit mechanisms used by serving infrastructures to optimize client-perceived latency. To enumerate the IP addresses, we use the EDNS-client-subnet prex extension [9] that some serving infrastructures, including Google, use to more ac- curately direct clients to nearby front-ends. A front-end IP address may sit in front of many physical server ma- chines. In this work, we focus on mapping out the front- end IP addresses, but we do not attempt to determine the number of physical servers. We develop a novel geoloca- tion technique and show that it is substantially more accu- rate than previously proposed approaches. Our technique, client-centric geolocation (CCG), exploits the sophisticated strategies providers use to map customers to their nearest serving sites. CCG geolocates a server from the geographic mean of the (possibly noisy) locations for clients associated with that server, after using speed-of-light constraints to dis- card misinformation. While EDNS-client-subnet has been examined before [23,26], we are the rst to use EDNS-client- subnet to (1) completely enumerate a large content delivery infrastructure; (2) demonstrate its benet over existing enu- meration techniques; and (3) geolocate the infrastructure. We also cluster the front-end IP addresses into serving sites, adding dynamic thresholding and RTT-based ngerprinting to current methods. These changes provide enough resolu- tion to distinguish dierent sites in the same city. These sites represent unique network locations, a view that IP ad- dresses, prexes, or ASes can obscure. Our second major contribution is a detailed study of Goo- gle's web serving infrastructure and its recent expansion over the last ten months. To our knowledge, we are the rst to observe rapid growth of the serving infrastructure of a major content provider. 
We nd that Google's serving in- frastructure has grown sevenfold in the number of front-end sites, with serving sites deployed in over 100 countries and in 768 new ASes. Its recent growth strategy has been to move away from serving clients from front-ends deployed on its own backbone and towards serving from front-ends de- ployed in lower tiers of the AS hierarchy; the number of /24 prexes served o Google's network more than quadrupled during the expansion. Furthermore, these new serving sites, predictably, have narrow customer cones, serving only the customers of the AS the site is deployed in. Finally, we nd that the expansion has noticeably shifted the distribution of geographic distances from the client to its nearest front- end server, and that this shift can also reduce the error in geolocating front-ends using client locations alone, but not enough to obviate the need for CCG's ltering techniques. An explicit non-goal of this work is to estimate the in- crease in Google's serving capacity: in placing front-ends in ISPs around the world, Google's expansion presumably focused on improving the latency of Web accesses through split-TCP connections [7,11,28], so proximity of front-ends to clients (this paper), and good path performance between clients and front-ends (future work) were more important than capacity increases. 2. BACKGROUND CDNs and Serving Infrastructures. Adding even a few hundreds of milliseconds to a webpage load time can cost service providers users and business [19,33], so providers seek to optimize their web serving infrastructure to deliver content quickly to clients. Whereas once a website might have been served from a single location to clients around the world, today's major services rely on much more com- plicated and distributed infrastructure. Providers replicate their services at serving sites around the world and try to serve a client from the closest one [17]. Content delivery networks (CDNs) initially sped delivery by caching static content and some forms of dynamic content within or near client networks. Today, providers use this type of distributed infrastruc- ture to speed the delivery of dynamic personalized content and responses to queries. To do so, providers direct clients to serving sites in or near the clients' networks. A client's TCP connection terminates at a front-end server in the serv- ing site, but the front-end proxies the request back to one of the provider's large datacenters [28]. This arrangement has a number of potential advantages compared to direct- ing the client directly to the datacenter. For example, the client's latency to the front-end is less than the client's la- tency to the datacenter, allowing TCP to recover faster after loss, the primary cause of suboptimal performance. More- over, the front-end can multiplex many clients into a high throughput connection to the datacenter. In these types of serving infrastructures, dierent classes of serving sites may serve dierent clients. Of course, the provider will still serve clients near a datacenter directly from that datacenter. But clients in networks that host a serving site are served locally. Front-ends deployed in clients' ISPs usually serve only clients of that ISP (or the ISP's customers), not clients in the ISP's providers or peers. DNS-based Redirection. Serving infrastructures use the Domain Name System (DNS) to direct clients to appropriate serving sites and front-end servers. 
When a client queries DNS to resolve a name associated with a service, the service returns an IP address for a front-end it believes is near the client. Traditionally, at resolution time, however, the service only knows the IP address of the client's resolver and not of the client itself, leading to two main complications. The resolver may be far from the clients it serves, and so the server closest to the resolver may not be a good choice for the client. Existing techniques can allow many services to discover which clients use a particular resolver [22], enabling services to direct a resolver based on the clients that use it. However, some resolvers serve clients with diverse locations; for these cases no server will be well-positioned for all clients of that resolver. To overcome this hurdle and provide quality DNS redirec- tions for clients, a number of Internet providers and CDNs proposed EDNS-client-subnet [9]. EDNS-client-subnet is an experimental extension DNS (using its EDNS extension mechanism) allowing clients to include a portion of their IP address in their request. This information passes through possible recursive resolvers and is provided to the authorita- tive DNS server, allowing a service to select content servers based on the client location, rather resolver location or in- ferred client location. 3. GOALANDAPPROACH Our goal is to understand content serving strategies for large IPv4-based serving infrastructures, especially that of Google. Serving strategies are dened by how many serv- ing sites and front-end servers a serving infrastructure has, where the serving sites are located geographically and topo- logically (i.e., within which ISP), and which clients access which serving sites. Furthermore, services continuously evolve serving strategies, so we are also interested in measuring the evolution of serving infrastructures. Of these, Google's serv- ing infrastructure is arguably one of the most important, so we devote signicant attention to this infrastructure. To this end, we develop novel measurement methods to enumerate front-end servers, geolocate serving sites, and cluster front-end servers into serving sites. The challenge in devising these measurement methods is that serving in- frastructures are large, distributed entities, with thousands of front-end servers at hundreds of serving sites spread across dozens of countries. A brute force approach to enumerating serving sites would require perspectives from a very large number of topological locations in the Internet, much larger than the geographic distribution provided by research mea- surement infrastructures like PlanetLab. Moreover, existing geolocation methods that rely on DNS naming or geoloca- tion databases do not work well on these serving infrastruc- tures where location-based DNS naming conventions are not consistently employed. While our measurement methods use these research in- frastructures for some of their steps, the key insight in the design of the methods is to leverage mechanisms used by serving infrastructures to serve content. Because we design them for serving infrastructures, these mechanisms can enu- merate and geolocate serving sites more accurately than ex- isting approaches, as we discuss below. Our method to enumerate all front-end server IP addresses within the serving infrastructure uses the EDNS-client-subnet extension. 
As discussed in Section 2, Google (and some other serving infrastructures) use this extension to address the problem of geographically distributed clients using a re- solver that prevents the serving infrastructure from opti- mally directing clients to front-ends. We use this extension to enumerate front-end IP addresses of a serving infrastruc- ture from a single location: this extension can emulate DNS requests coming from every active prex in the IP address space, eectively providing a very large set of vantage points for enumerating front-end IP addresses. To geolocate front-end servers and serving centers, we leverage another mechanism that serving infrastructures have long deployed, namely sophisticated mapping algorithms that maintain performance maps to clients with the goal of di- recting clients to the nearest available server. These algo- rithms have the property that clients that are directed to the server are likely to be topologically, and probably geo- graphically, close to the server. We exploit this property to geolocate front-end servers: essentially, we approximate the location of a server by the geographical mean of client loca- tions, a technique we call client-centric geolocation or CCG. We base our technique on this intuition, but we compensate for incorrect client locations and varying density of server deployments. Finally, we leverage existing measurement infrastructure (PlanetLab) to cluster front-ends into serving sites. We model the relative location of a front-end server as a vector of round-trip-times to many vantage points in the measure- ment infrastructure, then employ standard clustering algo- rithms in this high-dimensional space. Using these measurement methods over a ten month pe- riod, we are able to study Google's serving infrastructure and its evolution. Coincidentally, Google's infrastructure has increased sevenfold over this period, and we explore salient properties of this expansion: where (geographically or topologically) most of the expansion has taken place, and how it has impacted clients. There are interesting aspects of Google's deployment that we currently lack means to measure. In particular, we do not know the query volume from dierent clients, and we do not know the latency from clients to servers (which may or may not correlate closely with the geographic distance that we measure). We have left exploration of these to future work. We do possess information about client anity to front-end servers, and how this anity evolves over time (this evolution is a function of improvements in mapping algorithms as well as infrastructure rollout): we have left a study of this to future work. 4. METHODOLOGY In this section, we discuss the details of our measurement methods for enumerating front-ends, geolocating them, and clustering them into serving sites. 4.1 EnumeratingFront-Ends Our rst goal is to enumerate the IP addresses of all front- ends within a serving infrastructure. We do not attempt to identify when multiple IP addresses belong to one computer, or when one address fronts for multiple physical computers. An IP addresses can front hardware from a small satellite proxy to a huge datacenter, so careful accounting of public IP addresses is not particularly meaningful. Since most serving infrastructures use mapping algorithms and DNS redirection, one way to enumerate front-ends is to issue DNS requests from multiple vantage points. Each re- quest returns a front-end near the querying vantage point. 
The completeness of this approach is a function of the num- ber of vantage points. We emulate access to vantage points around the world using the proposed client-subnet DNS extension using the EDNS extension mechanism (we call this approach EDNS- client-subnet). As of May 2013, EDNS-client-subnet is sup- ported by Google, CacheFly, EdgeCast, ChinaCache and CDN 77. We use a patch to dig 1 that adds support for EDNS-client-subnet, allowing the query to specify the client prex. In our measurements of Google, we issue the queries through Google Public DNS's public recursive nameservers, which passes them on to the service we are mapping. The serving infrastructure then returns a set of front-ends it be- lieves is best suited for clients within the client prex. EDNS-client-subnet allows our single measurement site to solicit the recommended front-end for each specied client prex. Using EDNS-client-subnet, we eectively get a large number of vantage points We query using client prexes drawn from 10 million routable /24 prexes obtained Route- Views BGP. Queries against Google using this approach take about a day to enumerate. 4.2 Client-centricFront-EndGeolocation Current geolocation approaches are designed for general- ity, making few or no assumptions about the target. Unfor- tunately, this generality results in poor performance when geolocating serving infrastructure. For example, MaxMind's free database [24] places all Google front-ends in Mountain View, the company's headquarters. (MaxMind may have more accurate locations for IPs belonging to eyeball ISPs, but IPs belonging to transit ISPs will have poor geolocation results.) General approaches such as CBG [12] work best when vantage points are near the target [16], but front-ends in serving infrastructures are sometimes in remote locations, far from public geolocation vantage points. Techniques that use location hints in DNS names of front-ends or routers near front-ends can be incomplete [14]. Our approach combines elements of prior work, adding the observation that today's serving infrastructures use priv- ileged data and advanced measurement techniques to try to direct clients to nearby front-ends [35]. While we bor- 1 http://wilmer.gaa.st/edns-client-subnet/ row many previously proposed techniques, our approach is unique and yields better results. We base our geolocation technique on two main assump- tions. First, a serving infrastructure tries to direct clients to a nearby front-end, although some clients may be directed to distant front-ends, either through errors or a lack of deploy- ment density. Second, geolocation databases have accurate locations for many clients, at least at country or city granu- larity, but also have poor granularity or erroneous locations for some clients. Combining these two assumptions, our basic approach to geolocation, called client-centric geolocation (CCG), is to (1) enumerate the set of clients directed to a front-end, (2) query a geolocation database for the locations of those clients, and (3) assume the front-ends are located geographically close to most of the clients. To be accurate, CCG must overcome challenges inherent in each of these three steps of our basic approach: 1. We do not know how many requests dierent prexes send to a serving infrastructure. If a particular prex does not generate much trac, the serving infrastructure may not have the measurements necessary to direct it to a nearby front-end, and so may direct it to a distant front-end. 2. 
Geolocation databases are known to have problems in- cluding erroneous locations for some clients and poor lo- cation granularity for other clients. 3. Some clients are not near the front-end that serve them, for a variety of reasons. For example, some front-ends may serve only clients within certain networks, and some clients may have lower latency paths to front-ends other than the nearest ones. In other cases, a serving infrastruc- ture may direct clients to a distant front-end to balance load or may mistakenly believe that the front-end is near the client. Or, a serving infrastructure may not have any front-ends near a particular client. We now describe how CCG addresses these challenges. Selecting client prexes to geolocate a front-end. To enumerate front-ends, CCG queries EDNS using all routable /24 prexes. However, this approach may not be accurate for geolocating front-ends, for the following reason. Al- though we do not know the details of how a serving in- frastructure chooses which front-end to send a client to, we assume that it attempts to send a client to a nearby front- end and that the approach is more likely to be accurate for prexes hosting clients who query the service a lot than for prexes that do not query the service, such as IP addresses used for routers. To identify which client prexes can provide more accu- rate geolocation, CCG uses traceroutes and logs of users of a popular BitTorrent extension, Ono [8]. From the user logs we obtain a list of 2.6 million client prexes observed to participate in BitTorrent swarms with users. We assume that a serving infrastructure is likely to also observe requests from these prexes. We emphasize that we use Ono-derived traceroutes to obtain IP prexes for use with EDNS-client- subnet; other methods for obtaining such prexes would be equally applicable to our setting, and Ono itself is not nec- essary for CCG in the sense that we do not make use of the actual platform. Overcoming problems with geolocation databases. CCG uses two main approaches to overcome errors and lim- itations of geolocation databases. First, we exclude locations that are clearly inaccurate, based on approaches described in the next paragraph. Second, we combine a large set of client locations to locate each front-end and assume that the majority of clients have correct locations that will dominate the minority of clients with incorrect locations. To generate an initial set of client locations to use, CCG uses a BGP ta- ble snapshot from RouteViews [25] to nd the set of prexes currently announced, and breaks these routable prexes up into 10 million /24 prexes. 2 It then queries MaxMind's Ge- oLiteCity database to nd locations for each /24 prex. We chose MaxMind because it is freely available and is widely used in research. CCG prunes three types of prex geolocations as untrust- worthy. First, it excludes prexes for which MaxMind in- dicates it has less than city-level accuracy. This heuristic excludes 1,966,081 of the 10 million prexes (216,430 of the 2.6 million BitTorrent client prexes). Second, it uses a dataset that provides coarse-grained measurement-based ge- olocations for every IP address to exclude prexes that in- clude addresses in multiple locations [13]. Third, it issues ping measurements from all PlanetLab 3 locations to ve re- sponsive addresses per prex, and excludes any prexes for which the MaxMind location would force one of these ping measurements to violate the speed of light. 
Combined, these exclude 8,396 of the 10 million prexes (2,336 of the 2.6 mil- lion BitTorrent client prexes). With these problematic locations removed, and with sets of prexes likely to include clients, CCG assumes that both MaxMind and the serving infrastructure we are mapping likely have good geolocations for most of the remaining pre- xes, and that the large number of accurate client geoloca- tions should overwhelm any remaining incorrect locations. Dealing with clients directed to distant front-ends. Even after ltering bad geolocations, a client may be geo- graphically distant from the front-end it is mapped to, for two reasons: the serving infrastructure may direct clients to distant front-ends for load-balancing, and in some geograph- ical regions, the serving infrastructure deployment may be sparse so that the front-end nearest to a client may still be geographically distant. To prune these clients, CCG rst uses speed-of-light con- straints, as follows. It issues pings to the front-end from all PlanetLab nodes and use the speed of light to estab- lish loose constraints on where the front-end could possibly be [12]. When geolocating the front-end, CCG excludes any clients outside of this region. This excludes 4 million out of 10 million prexes (1.1 million out of 2.6 million BitTorrent client prexes). It then estimates the preliminary location for the front-end as the weighted average of the locations of the remaining client prexes, then renes this estimate by calculating the mean distance from the front-end to the re- maining prexes, and nds the standard deviation from the mean of the client-to-front-end distances. Our nal lter excludes clients that are more than a standard deviation be- yond the mean distance to the front-end, excluding 392,668 out of 10 million prexes (214,097 out of 2.6 million BitTor- rent client prexes). 2 In Section 5.1, we verify that /24 is often the correct prex length to use. 3 As we show later, we have found that PlanetLab contains a sucient number of vantage points for speed-of-light l- tering to give accurate geolocation. 10M prexes 2.6M prexes No city-level accuracy -1.9M (19.5%) -216K (8.1%) Multiple locations and client location speed-of- light violations -8K (.08%) -2K (.08%) Front-End location speed-of-light violations -4M (40%) -1.1M (41%) Outside one standard deviation -392K (3.9%) -214K (8%) Remaining 3.7M (37%) 1M (39%) Table 1: Summary of the number of client prexes ex- cluded from CCG by ltering. 10M is the 10 million client prex set and 2.6M is the 2.6 million BitTorrent client prex set. Putting it all together. In summary, CCG works as follows. It rst lists the set of prexes directed to a front- end, then lters out all prexes except those observed to host BitTorrent clients. Then, it uses MaxMind to geolo- cate those remaining client prexes, but excludes: prexes without city-level MaxMind granularity; prexes that in- clude addresses in multiple locations; prexes for which the MaxMind location is not in the feasible actual location based on speed-of-light measurements from PlanetLab and M-Lab; and prexes outside the feasible location for the front-end. (Table 1 shows the number of prexes ltered at each step.) Its preliminary estimate for the front-end location is the ge- ographic mean of the remaining clients that it serves. Cal- culating the distances from remaining clients to this prelim- inary location, CCG further exclude any clients more than a standard deviation beyond the mean distance in order to rene our location estimate. 
Finally, it locates the front-end as being at the geographic mean of the remaining clients that it serves. 4.3 Clusteringfront-ends As we discuss later, CCG is accurate to within 10s of kilo- meters. In large metro areas, some serving infrastructures may have multiple serving sites, so we develop a methodol- ogy to determine physically distinct serving sites. We cluster by embedding each front-end in a higher dimensional met- ric space, then clustering the front-end in that metric space. Such an approach has been proposed elsewhere [21, 27, 38] and our approach diers from prior work in using better clustering techniques and more carefully ltering outliers. In our technique, we map each front-end to a point in high dimensional space, where the coordinates are RTTs from landmarks (in our case, 250 PlanetLab nodes at dierent geographical sites). The intuition underlying our approach is that two front-ends at the same physical location should have a small distance in the high-dimensional space. Each coordinate is the smallest but one RTT of 8 consec- utive pings (a large enough sample size to obtain a robust estimate of propagation latency), and we use the Manhattan distance between two points for clustering (an exploration of other distance norms is left to future work). In computing this Manhattan distance, we (a) omit coordinates for which we received fewer than 6 responses to pings and (b) omit the highest 20% of coordinate distances to account for out- liers caused by routing failures, or by RTT measurements in ated by congestion. Finally, we normalize this Manhat- IPs /24s ASes Countries Open resolver 23939 1207 753 134 EDNS-client-subnet 28793 1445 869 139 Benet +20% +20% +15% +4% Table 2: Comparison of Google front-ends found by EDNS and open resolver. EDNS providers signicant benet over the existing technique. tan distance. Despite these heuristic choices, our clustering method works well, as shown in Section 5.3. The nal step is to cluster front-ends by their pairwise normalized Manhattan distance. We use the OPTICS al- gorithm [3] for this. OPTICS is designed for spatial data, and, instead of explicitly clustering points, it outputs an or- dering of the points that captures the density of points in the dataset. As such, OPTICS is appropriate for spatial data where there may be no a priori information about ei- ther the number of clusters or their size, as is the case for our setting. In the output ordering, each point is annotated with a reachability distance: when successive points have signicantly dierent reachability distances, that is usually an indication of a cluster boundary. As we show in Sec- tion 5 this technique, which dynamically determines cluster boundaries, is essential to achieving good accuracy. 5. VALIDATION In this section, we validate front-end enumeration, geolo- cation, and clustering. 5.1 CoverageofFront-EndEnumeration Using EDNS-client-subnet can improve coverage over pre- vious methods that have relied on using fewer vantage points. We rst quantify the coverage benets of EDNS-client-subnet. We then explore the sensitivity of our results to the choice of prex length for EDNS-client-subnet, since this choice can also aect front-end enumeration. Open Resolver vs EDNS-client-subnet Coverage. An existing technique to enumerate front-ends for a serving in- frastructure is to issue DNS queries to the infrastructure from a range of vantage points. Following previous work [14], we do so using open recursive DNS (rDNS) resolvers. 
We use a list of about 200,000 open resolvers 4 ; each resolver is eectively a distinct vantage point. These resolvers are in 217 counties, 14,538 ASes, and 118,527 unique /24 prexes. Enumeration of Google via rDNS takes about 40 minutes. This dataset forms our comparison point to evaluate the cov- erage of the EDNS-client-subnet approach we take in this paper. Table 2 shows the added benet over rDNS of enumerat- ing Google front-ends using EDNS-client-subnet. Our ap- proach uncovers at least 15-20% more Google front-end IP addresses, prexes, and ASes than were visible using open resolvers. By using EDNS-client-subnet to query Google on behalf of every client prex, we obtain a view from loca- tions that lack open recursive resolvers. In Section 6.1, we demonstrate the benet over time as Google evolves, and in Section 7 we describe how we might be able to use our 4 Used with permission from Duane Wessels, Packet Pushers Inc. Google results to calibrate how much we would miss using rDNS to enumerate a (possibly much larger or smaller than Google) serving infrastructure that does not support EDNS- client-subnet. Completeness and EDNS-client-subnet Prex Length. The choice of prex length for EDNS-client-subnet queries can aect enumeration completeness. Prex lengths shorter than /24 in BGP announcements can be too coarse for enu- meration. We nd cases of neighboring /24s within shorter BGP announcement prexes that are directed to dierent serving infrastructure. For instance we observed an ISP an- nouncing a /18 with one of its /24 subprexes getting di- rected to Singapore while its neighboring prex is directed to Hong Kong. Our evaluations query using one IP address in each /24 block. If serving infrastructures are doing redirections at ner granularity, we might not observe some front-end IP addresses or serving sites. The reply to the EDNS-client- subnet query returns a scope, the prex length covering the response. Thus, if a query for an IP address in a /24 block returns a scope of, say /25, it means that the corresponding redirection holds for all IP addresses in the /25 covering the query address, but not the other half of the /24. For almost 75% of our /24 queries, the returned scope was also for a /24 subnet, likely because it is the longest globally routable prex. For most of the rest, we saw a /32 prex length scope in the response, indicating that Google's serving infrastruc- ture might be doing very ne-grained redirection. We refer the reader to related work for a study of the relationship between the announced BGP prex length and the returned scope [34]. For our purposes, we use the returned scope as a basis to evaluate the completeness of our enumeration. We took half of the IPv4 address space and issued a series of queries such that their returned scopes covered all addresses in that space. For example, if a query for 1:1:1:0=24 returned a scope of /32, we would next query for 1:1:1:1=32. These brute force measurements did not uncover any new front-end IP addresses not seen by our /24 queries, suggesting that the /24 prex approach in our paper likely provides complete coverage of Google's entire front-end serving infrastructure. Enumeration Over Time. Front-ends often disappear and reappear from one day to the next across daily enu- meration. Some remain active but are not returned in any EDNS-client-subnet requests, others become temporarily in- active, and some may be permanently decommissioned. 
To account for this variation and obtain an accurate and com- plete enumeration, we accumulate observations over time, but also test which servers still serve Google search on a daily basis. We check liveness by issuing daily, rate-limited, HTTP HEAD requests to the set of cumulative front-end IP addresses we observe. The Daily row in Table 3 shows a snapshot of the num- ber of IPs, /24s, and ASes that are observed on 2013-8-8. The Cumulative row shows the additional infrastructure ob- served earlier in our measurement period but not on that day, and the Inactive row indicates how many of those were not serving Google search on 2013-8-8. This shows that the front-ends that are made available through DNS on a given day is only a subset of what may be active on a given day. For example, for several consecutive days in the rst week of August 2013, all our queries returned IP addresses from IPs /24s ASes Daily 22959 1213 771 Cumulative +5546 +219 +93 {Inactive -538 -24 -8 Active 27967 (+22%) 1408 (+16%) 856 (+11%) Table 3: A snapshot from 2013-8-8 showing the dier- ences in number of IPs, /24s, and ASes observed cumu- latively across time versus what can be observed within a day. Some front-end IP addresses may not be visible in a daily snapshot. However, IP addresses may be tem- porarily drained or become permanently inactive or be reassigned. Acquiring an accurate and complete snapshot of active serving infrastructure requires accumulating ob- servations over time and testing which remain active. Google's network, suggesting a service drain of the front- ends in other networks. Our liveness probes conrmed that the majority of front-ends in other networks still actively served Google search when queried, even though no DNS queries directed to them. In the future, we will examine whether we can use our approach to infer Google mainte- nance periods and redirections away from outages, as well as assess whether these shifts impact performance. 5.2 AccuracyofClient-CentricGeolocation Client-centric geolocation using EDNS-client-subnet shows substantial improvement over traditional ping based tech- niques [12], undns [32], and geolocation databases [24]. Dataset. To validate our approach, we use the subset of Google front-ends with hostnames that contain airport codes hinting at their locations. Although the airport code does not represent a precise location, we believe that it is reason- able to assume that the actual front-end is within a few 10s of kilometers of the corresponding airport. Airport codes are commonly used by network operators as a way to debug network and routing issues so having accurate airport codes is an important diagnostic tool. Previous work has show that only 0.5% of hostnames in a large ISP had misleading names [39], and so we expect that misnamed Google front- ends only minimally distort our results. A limitation of our validation is that we cannot validate against Google hosted IPs that do not have airport codes because popular geoloca- tion databases such as MaxMind place these IPs in Mountain View, CA. Using all 550 front-ends with airport codes, we measure the error of our technique as the distance between our estimated location and the airport location from data collected on April 17, 2013. Accuracy. Figure 1 shows the distribution of error for CCG, as well as for three traditional techniques. 
We com- pare to constraint-based geolocation (CBG), which uses lat- ency-based constraints from a range of vantage points [12], a technique that issues traceroutes to front-ends and locates the front-ends based on geographic hints in names of nearby routers [14], and the MaxMind GeoLite Free database [24]. We oer substantial improvement over existing approaches. For example, the worst case error for CCG is 409km, whereas CBG, the traceroute-based technique, and MaxMind have errors of over 500km for 17%, 24%, and 94% of front-ends, respectively. CBG performs well when vantage points are close to the front-end [16], but it incurs large errors for the half of the front-ends in more remote regions. The traceroute-based technique is unable to provide any location for 20% of the front-ends because there were no hops with geographic hints in their hostnames near to the front-end. The MaxMind database performs poorly because it places most front-ends belonging to Google in Mountain View, CA. 0 0.2 0.4 0.6 0.8 1 0 500 1000 1500 2000 CDF of estimated location Error (km) client-centric geolocation (CCG) CBG undns Maxmind Figure 1: Comparison of our client-centric geolocation against traditional techniques, using Google front-ends with known locations as ground truth. 0 0.2 0.4 0.6 0.8 1 0 500 1000 1500 2000 CDF of estimated location Error (km) client-centric geolocation (CCG) CCG only sol CCG only std CCG only eyeballs CCG no filtering Figure 2: Impact of our various techniques to lter client locations when performing client-centric geolocation on Google front-ends with known locations. Importance of Filtering. Figure 2 demonstrates the need for the lters we apply in CCG. The CCG no ltering line shows our basic technique without any lters, yielding a median error of 556km. Only considering client eyeball prexes we observed in the BitTorrent dataset reduces the median error to 484km and increases the percentage of front- ends located with error less than 1000km from 61% to 74%. Applying our standard deviation ltering improves the me- dian to 305km and error less than 1000km to 86%. When using speed-of-light constraints measured from PlanetLab and MLab to exclude client locations outside the feasible location for a front-end and to exclude clients with infeasi- ble MaxMind locations, we obtain a median error of 26km, and only 10% of front-end geolocations have an error greater than 1000km. However, we obtain our best results by simul- taneously applying all three lters. Case Studies of Poor Geolocation. CCG's accuracy depends upon its ability to draw tight speed-of-light con- straints, which in turn depends (in our current implemen- tation), on Planetlab and M-Lab deployment density. We found one instance where sparse vantage point deployments aected CCG's accuracy. In this instance, we observe a set of front-ends in Stockholm, Sweden, with the arn airport code, serving a large group of client locations throughout Northern Europe. However, our technique locates the front- ends as being 409km southeast of Stockholm, pulled down by the large number of clients in Oslo, Copenhagen and northern Germany. Our speed of light ltering usually ef- fectively eliminates clients far from the actual front-end. In this case, we would expect Planetlab sites in Sweden to l- ter out clients in Norway, Denmark and Germany. However, these sites measure latencies to the Google front-ends in the 24ms range, yielding a feasible radius of 2400km. This loose constraint results in poor geolocation for this set of front- ends. 
It is well-known that Google has a large datacenter in The Dalles, Oregon, and our map (Fig. 7) does not show any sites in Oregon. In fact, we place this site 240km north, just south of Seattle, Washington. A disadvantage of our geolocation technique is that large datacenters are often hosted in re- mote locations, and our technique will pull them towards large population centers that they serve. In this way, the estimated location ends up giving a sort of \logical" serv- ing center of the server, which is not always the geographic location. We also found that there are instances where we are un- able to place a front-end. In particular, we observed that occasionally when new front-ends were rst observed during the expansion, there would be very few /24 client networks directed to them. These networks may not have city-level geolocation information available in MaxMind so we were unable to locate the corresponding front-ends. 5.3 AccuracyofFront-EndClustering To validate the accuracy of our clustering method, we run clustering on three groups of nodes for which we have ground truth: 72 PlanetLab servers from 23 dierent sites around world; 27 servers from 6 sites all in California, USA, some of which are very close (within 10 miles) to each other; and nally, 75 Google front-end IP addresses, that have airport codes in their reverse DNS names, out of 550 (14%) having airport codes and of 8,430 (0.9%) total Google IP addresses as of Apr 16th, 2013. These three sets are of dierent size and geographic scope, and the last set is a subset of our tar- get so we expect it to be most representative. In the absence of complete ground truth, we have had to rely on more ap- proximate validation techniques: using PlanetLab, selecting a subset of front-ends with known locations, and using air- port codes (we have justied this choice in Section 5.2). We also discuss below some internal consistency checks on our clustering algorithm that gives us greater condence in our results. The metric we use for the accuracy of clustering is the Rand Index [29]. The index is measured as the ratio of the sum of true positives and negatives to the ratio of the sum of these quantities and false positives and negatives. A Rand index equal to 1 means there are no false positives or false negatives. Table 4 shows the Rand index for the 3 node sets for which we have ground truth. We see that in each case, the Rand index is upwards of 97%. This accuracy arises from two components of the design of our clustering method: eliminating outliers which result in more accurate distance measures, and dynamically selecting the cluster boundary using our OPTICS algorithm. Experiment Rand Index False negative False positive PlanetLab 0.99 1% 0 CA 0.97 0 3% Google 0.99 1% 0 Table 4: Rand index for our nodesets. Our clustering al- gorithm achieves over 97% across all nodesets, indicating very few false positives or negatives. 0 1 2 3 4 5 0 10 20 30 40 50 60 70 Reachability distance Google front-ends (OPTICS output order) mrs muc mil sof eze sin syd bom del Figure 3: Distance plot of Google servers with airport codes. Servers in the same cluster have low reachability distance to each other thus are output in sequence as neighbors. Cluster boundaries are demarcated by large impulses in the reachability plot. Our method does have a small number of false positives and false negatives. 
In the California nodeset, the method fails to set apart USC/ISI nodes from nodes on the USC campus ( 10 miles away, but with the same upstream con- nectivity) which leads to 3% false positive. In the Planet lab nodeset, some clusters have low reachability distance that confuses our boundary detection method, resulting in some clusters being split into two. The Google nodeset reveals one false negative which we actually believe to be correct: the algorithm correctly identies two distinct serving sites in mrs, as discussed below. To better understand the performance of our method, Fig- ure 3 shows the output of the OPTICS algorithm on the Google nodeset. The x-axis in this gure represents the or- dered output of the OPTICS algorithm, and the y-axis the reachability distance associated with each node. Impulses in the reachability distance depict cluster boundaries, and we have veried that the nodes within the cluster all belong to the same airport code. In fact, as the gure shows, the al- gorithm is correctly able to identify all 9 Google sites. More interesting, it shows that, within a single airport code mrs, there are likely two physically distinct serving sites. We be- lieve this to be correct, from an analysis of the DNS names associated with those front-ends: all front-ends in one serv- ing site have a prex mrs02s04, and all front-ends in the other serving site have a prex mrs02s05. In addition, Figure 4 shows the OPTICS output when us- ing reverse-TTL (as proposed in [21]) instead of RTT for the metric embedding. This uses the same set of Google servers as in our evaluation using RTT for metric embedding. We could see that reverse-TTL based embedding performs rea- sonably well but results in the OPTICS algorithm being un- able to distinguish between serving sites in bom and del. RTT-based clustering is able to dierentiate these serving sites. Moreover, although reverse-TTL suggests the possibil- ity of two sites in mrs, it mis-identies which servers belong to which of these sites (based on reverse DNS names). 0 0.2 0.4 0.6 0.8 1 1.2 1.4 0 10 20 30 40 50 60 70 TTL Reachability distance Google front-ends (OPTICS output order) mrs muc mil sof eze syd sin bom del Figure 4: The output of the OPTICS clustering algo- rithm when reverse-TTL is used for the metric embed- ding. When using this metric, the clustering algorithm cannot distinguish serving sites at Bombay (bom) and Delhi (del) in India, while RTT-based clustering can. 0 200 400 600 800 1000 1200 1400 1600 2012-10-01 2012-11-01 2012-12-01 2013-01-01 2013-02-01 2013-03-01 2013-04-01 2013-05-01 2013-06-01 2013-07-01 2013-08-01 Cumulative clusters observed by EDNS Date Figure 6: Growth in the number of points of presence hosting Google serving infrastructure over time. Finally, we also perform some additional consistency checks. We run our clustering algorithm against all Google front-end IPs that have airport codes (6.5%, 550 out of 8430). We nd that except for the kind of false negative we mentioned above (multiple serving site within same airport code), the false positive rate of our clustering is 0, which means we never merge two dierent airport codes together. Furthermore, when our algorithm splits one airport code into separate clusters, the resulting clusters exhibit naming consistency | our algorithm always keeps IPs that have same hostname pattern <airport code><two digit>s<two digit>, such as mrs02s05, in the same cluster. In summary, our clustering method exhibits over 97% ac- curacy on three dierent test datasets. 
On the Google IPs that have airport codes, our clustering show one kind of false negative that we believe to be correct and no false positive at all. 6. MAPPINGGOOGLE’SEXPANSION We present a longitudinal study of Google's serving in- frastructure. Our initial dataset is from late October to early November of 2012 and our second dataset covers March through August of 2013. We are able to capture a substan- tial expansion of Google infrastructure. 6.1 Growthovertime For each snapshot that we capture, we use EDNS-client- subnet to enumerate all IP addresses returned forwww.google. com. Figure 5(a) depicts the number of server IP addresses 0 5000 10000 15000 20000 25000 30000 2012-10-01 2012-11-01 2012-12-01 2013-01-01 2013-02-01 2013-03-01 2013-04-01 2013-05-01 2013-06-01 2013-07-01 2013-08-01 Cumulative IPs Observed Date EDNS Open resolver 0 200 400 600 800 1000 1200 1400 1600 2012-10-01 2012-11-01 2012-12-01 2013-01-01 2013-02-01 2013-03-01 2013-04-01 2013-05-01 2013-06-01 2013-07-01 2013-08-01 Cumulative /24s observed Date EDNS Open resolver 0 100 200 300 400 500 600 700 800 900 2012-10-01 2012-11-01 2012-12-01 2013-01-01 2013-02-01 2013-03-01 2013-04-01 2013-05-01 2013-06-01 2013-07-01 2013-08-01 Cumulative ASes observed Date EDNS Open resolver Figure 5: Growth in the number of IP addresses (a), /24 prexes (b), and ASes (c) observed to be serving Google's homepage over time. During our study, Google expanded rapidly at each of these granularities. 180 ° W 135 ° W 90 ° W 45 ° W 0 ° 45 ° E 90 ° E 135 ° E 180 ° E 90 ° S 45 ° S 45 ° N 90 ° N Google AS Other AS 2012−10−28 Other AS 2013−8−14 Figure 7: A world wide view of the expansion in Google's infrastructure. Note that some of the locations that appear oating in the Ocean are on small islands. These include Guam, Maldives, Seychelles, Cape Verde and Funchal. seen in these snapshots over time. 5 The graph shows slow growth in the cumulative number of Google IP addresses observed between November 2012 and March 2013, then a major increase in mid-March in which we saw approximately 3,000 new serving IP addresses come online. This was fol- lowed by another large jump of 3,000 in mid-May. Over the month of June, we observed 11,000 new IPs followed by an increase of 4,000 across July. By the end of our study, the number of serving IP ad- dresses increased sevenfold. Figure 5(b) shows this same trend in the growth of the number of /24s seen to serve Google's homepage. In Figure 5(c), we see 8X growth in the number of ASes originating these prexes, indicating that this large growth is not just Google adding new capacity to existing serving locations. Figure 6 shows the growth in the number of distinct serving sites within those ASes. Figure 7 shows the geographic locations of Google's serv- ing infrastructure at the beginning of our measurements and in our most recent snapshot. We observe two types of expan- sion. First, we see new serving locations in remote regions of countries that already hosted servers, such as Australia and Brazil. Second, we observe Google turning up serving infras- tructure in countries that previously did not appear to serve Google's homepage, such as Vietnam and Thailand. Of new front-end IP addresses that appeared during the course of 5 It is not necessarily the case that each IP address maps to a distinct front-end. 
0 20 40 60 80 100 120 140 2012-10-01 2012-11-01 2012-12-01 2013-01-01 2013-02-01 2013-03-01 2013-04-01 2013-05-01 2013-06-01 2013-07-01 2013-08-01 Cumulative countries observed Date EDNS Open resolver Figure 8: Number of countries hosting Google serving infrastructure over time. our study, 95% are in ASes other than Google. Of those addresses, 13% are in the United States and 26% are in Eu- rope, places that would appear to be well-served directly from Google's network. 6 In the future, we plan to investi- gate the performance impact of these front-ends. In addi- tion, 21% are in Asia, 13% are in North America (outside the US), 11% are in South America, 8% are in Africa, and 8% are in Oceania. A link to an animation of the worldwide expansion is available at http://mappinggoogle.cs.usc.edu. 6 In contrast, when we submitted the paper in May, only 13% were in the US or Europe. We added in the new expansion in those regions in preparing the nal version of the paper. 0 0.2 0.4 0.6 0.8 1 0 5 10 15 20 CDF of ISPs Number of serving sites All Stub ISP Tiny ISP Small ISP Large ISP Tier-1 ISP Figure 9: CDF of number of sites in dierent types of ISP. Figure 8 depicts this growth in the number of countries hosting serving infrastructure, from 58 or 60 at the begin- ning of our study to 139 in recent measurements. 7 We in- tend to continue to run these measurements indenitely to continue to map this growth. 6.2 CharacterizingtheExpansion To better understand the nature of Google's expansion, we examine the types of networks where the expansion is occurring and how many clients they serve. Table 5 classies the number of ASes of various classes in which we observe serving infrastructure, both at the beginning and at the end of our study. It also depicts the number of /24 client prexes (of 10 million total) served by infrastructure in each class of AS. We use AS classications from the June 28, 2012 dataset from UCLA's Internet Topology Collection [37], 8 except that we only classify as stubs ASes with 0 customers, and we introduce a Tiny ISP class for ASes with 1-4 customers. As seen in the table, the rapid growth in ASes that host infrastructure has mainly been occurring lower in the AS hierarchy. Although Google still directs the vast majority of client prexes to servers in its own ASes, it has begun di- recting an additional 8% of them to servers o its network, representing a 393% increase in the number served from out- side the network. By installing servers inside client ISPs, Google allows clients in these ISPs to terminate their TCP connections locally (likely at a satellite server that proxies requests to a datacenter [7,11,28], as it is extremely unlikely that Google has sucient computation in these locations to provide its services). We perform reverse DNS lookups on the IP addresses of all front-ends we located outside of Google's network. More than 20% of them have hostnames that include either ggc or google.cache. These results sug- gest that Google is reusing infrastructure from the Google Global Cache (GGC), Google's content distribution network built primarily to cache YouTube videos near users. 9 It is possible that the servers were already in use as video caches; if so, this existing physical deployment could have enabled the rapid growth in front-ends we observed. Figure 9 depicts a slightly dierent view of the Google ex- pansion. 
It charts the cumulative distribution of the number 7 We base our locations on our CCG approach, which may distort locations of front-ends that are far from their clients. 8 UCLA's data processing has been broken since 2012, but we do not expect the AS topology to change rapidly. 9 GGC documentation mentions that the servers may be used to proxy Google Search and other services. of serving sites by ISP type. Overall, nearly 70% of the ISPs host only one serving site. Generally speaking, smaller ISPs host fewer serving sites than larger ISPs. The biggest ex- ceptions are a Tiny ISP in Mexico hosting 23 serving sites consisting of hundreds of front-end IPs, and a Stub national mobile carrier with 21 sites. Betting their role in the In- ternet, most Large and Tier 1 ISPs host multiple sites. For example, a Large ISP in Brazil serves from 23 sites. Whereas Google would be willing to serve any client from a server located within the Google network, an ISP hosting a server would likely only serve its own customers. Serving its provider's other customers, for example, would require the ISP to pay its provider for the service! We check this intuition by comparing the location in the AS hierarchy of clients and the servers to which Google directs them. Of clients directed to servers outside of Google's network, 93% are located within the server's AS's customer cone (the AS itself, its customers, their customers, and so on) [20]. Since correctly inferring AS business relationship is known to be a hard problem [10], it is unclear whether the remaining 7% of clients are actually served by ISPs of which they are not customers, or (perhaps more likely) whether they rep- resent limitations of the analysis. In fact, given that 40% of the non-customer cases stem from just 7 serving ASes, a small number of incorrect relationship or IP-to-AS infer- ences could explain the counter-intuitive observations. Google's expansion of infrastructure implies that, over time, many clients should be directed to servers that are closer to them than where Google directed them at the be- ginning of the study. Figure 10(a) shows the distribution of the distance from a client to our estimate of the location of the server serving it. We restrict the clients to those in our BitTorrent eyeball dataset (2.6 million client prexes) and geolocate all client locations using MaxMind. Some of the very large distances shown in both curves could be ac- curacy limitations of the MaxMind GeoLite Free database, especially in regions outside of the United States. Overall, results show that in mid-August 2013, many clients are sub- stantially closer to the set of servers they are directed to than in October of 2012. For example, the fraction of client pre- xes within 500km of their front-ends increases from 39% to 64%, and the fraction within 1000km increases from 54% to 78%. Figure 10(b) shows the distribution of distances only for the set of client prexes that were directed to front-ends outside of Google's network on 2013-8-14. The top curve shows the distances between the clients and front-ends on 2013-8-14 while the bottom curve shows the distances be- tween this same set of clients and the front-ends that they were served by on 2012-10-29. The gure shows that the set of clients that have moved o of Google's network are now much closer to their front-ends in August of 2013 than in October of 2012. 
          November 2012      May 2013                     August 2013
          ASes  Clients      ASes        Clients          ASes          Clients
Google    2     9856K        2  (+0%)    9658K (-2%)      2    (+0%)    9067K (-8%)
Tier 1    2     481          2  (+0%)    201   (-58%)     4    (+100%)  35K   (+7278%)
Large     30    111K         46 (+53%)   237K  (+114%)    123  (+310%)  410K  (+270%)
Small     35    37K          64 (+83%)   63K   (+71%)     319  (+811%)  359K  (+870%)
Tiny      23    31K          41 (+78%)   57K   (+84%)     206  (+796%)  101K  (+228%)
Stub      13    21K          36 (+177%)  38K   (+81%)     201  (+1446%) 79K   (+281%)

Table 5: Classification of ASes hosting Google serving infrastructure at the beginning, middle, and end of our study. We count both the number of distinct ASes and the number of client /24 prefixes served. Growth numbers for May and August are relative to November. Google still directs 90% of the prefixes to servers within its own network, but it is evolving toward serving fewer clients from its own network and more clients from smaller ASes around the world.

Figure 10: (a) Distances from all BitTorrent client prefixes to the estimated locations of the front-ends to which Google directs them, on 2013-8-14 and 2012-10-29. (b) For the set of clients served by front-ends outside of Google's network on 2013-8-14, distances to their estimated front-end locations on 2013-8-14 and on 2012-10-29.

Figure 11: As Google expands, clients become closer to their front-ends, improving the accuracy of filter-less client-based geolocation (CDF of geolocation error in km for CCG with all filters on 2013-4-14, and for CCG without filtering on 2013-4-14, 2013-3-20, and 2012-10-29).

6.3 Impact on Geolocation Accuracy

A side effect of Google directing more clients to nearby front-ends is that our geolocation technique should become more accurate over time, since we base it on the assumption that front-ends are near their clients. To verify that assumption, we apply our basic geolocation approach, without any of the filters that increase its accuracy, to the datasets from three points in time. We chose dates that coincide with the large jumps in the number of Google servers observed in Figure 5. Using the airport-code-based ground truth dataset from Section 5.2, Figure 11 shows the distribution of geolocation error using these three datasets and, for comparison, using the most recent dataset with all our filters. We see a steady reduction in error over time, with median error decreasing from 817km in October 2012, to 610km in March 2013, and 475km in April 2013. However, our filters still provide substantial benefit, yielding a median error of only 22km.
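To illustrate what a basic, filter-free client-based estimate looks like, the sketch below places a front-end at the spherical centroid of the (noisy) locations of the client prefixes mapped to it. The client coordinates are placeholders, and the full CCG approach described earlier additionally applies speed-of-light and client-filtering constraints before aggregating.

import math

def centroid(latlons):
    """Spherical centroid of (lat, lon) points in degrees: a crude,
    filter-free estimate of a front-end's location from its clients."""
    x = y = z = 0.0
    for lat, lon in latlons:
        la, lo = math.radians(lat), math.radians(lon)
        x += math.cos(la) * math.cos(lo)
        y += math.cos(la) * math.sin(lo)
        z += math.sin(la)
    n = len(latlons)
    x, y, z = x / n, y / n, z / n
    lat = math.degrees(math.atan2(z, math.hypot(x, y)))
    lon = math.degrees(math.atan2(y, x))
    return lat, lon

# Hypothetical client prefix locations mapped to one front-end.
clients = [(40.7, -74.0), (39.9, -75.2), (42.4, -71.1)]
print(centroid(clients))  # unfiltered location estimate for the front-end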
7. USING OUR MAPPING

In addition to supporting our evaluation of Google's serving infrastructure, our mapping is useful to the research community, for what it says about clients and for what it can predict about other serving infrastructures. Our data is publicly available at http://mappinggoogle.cs.usc.edu.

The Need for Longitudinal Research Data. Our results show the limitations of one-off measurement studies: a snapshot of Google's serving infrastructure in October would have missed the rapid growth of the infrastructure and potentially misrepresented Google's strategy. We believe the research community needs long-term measurements, and we intend to refresh our maps regularly. We will make our ongoing data available to the research community, and we plan to expand coverage from Google to other providers' serving infrastructures.

Sharing the Wealth: From Our Data to Related Data. Our mapping techniques assume the target serving infrastructure is pervasive and carefully and correctly engineered. We assume that (a) Google directs most clients to nearby front-ends; (b) Google's redirection is carefully engineered for "eyeball" prefixes that host end-users; and (c) Google will only direct a client to a satellite front-end if the client is a customer of the front-end's AS. Google has economic incentives to ensure these assumptions hold. In practice, they are generally but not always true, and our design and evaluation carefully deal with exceptions (such as clients occasionally being directed to distant front-ends).

If we accept these assumptions, our maps allow us to exploit Google's understanding of network topology and user placement to improve other datasets. Prior work has used Akamai to choose detour routes [35]; we believe our mapping can improve geolocation, peer selection, and AS classification.

Geolocation is a much-studied problem [12, 13, 16], and the availability of ground truth can greatly improve results. With clients accessing Google from mobile devices and computers around the world, Google has ample data and measurement opportunity to gather very accurate client locations. An interesting future direction is to infer prefix locations from our EDNS-client-subnet observations and to use that coarse data to re-evaluate prefixes that existing datasets (such as MaxMind) place in very different locations. The end result would be either higher-accuracy geolocation or, at least, identification of prefixes with uncertain locations.

Researchers designed a BitTorrent plugin that directs a client to peer with other users the plugin deems to be nearby, based on the potential peer receiving CDN redirections similar to the client's [8]. However, with the existing plugin, a client can only assess the similarity of other users of the plugin who send their CDN front-end mappings. Just as we used EDNS-client-subnet to obtain mappings for arbitrary prefixes around the world, we could design a modified version of the plugin that would allow a client to assess the nearness of an arbitrary potential peer, regardless of whether the peer uses the plugin. By removing this barrier, the modified plugin would be much more widely applicable, which could enhance the adoption of such plugins.

Finally, in Section 6.2, we showed that 90% of prefixes served from ASes other than Google are within the customer cone of their serving AS. The remaining 10% of prefixes likely represent problems with either our IP-to-AS mapping [15] or the customer cone dataset we used [20]. From talking to the researchers behind that work and sharing our results with them, it may be necessary to move to prefix-level cones to accommodate the complex relationships between ASes in the Internet. The client-to-front-end data we generate could help resolve ambiguities in AS relationships and lead to better inference in the future.
Mapping Other Providers. While our techniques will apply directly to some providers, we will need to adapt them for others; we describe the challenges and potential approaches here. Our studies of Google combine observations using EDNS-client-subnet and open recursive resolvers. EDNS-client-subnet support is increasing, but some networks, such as Akamai, do not support it, and we are restricted to using open resolvers for them. In Section 5.1, we demonstrated that even using hundreds of thousands of open DNS resolvers would miss much of Google's infrastructure. Table 2 showed that EDNS-client-subnet found 20% more front-end IPs than open resolvers, but we cannot assume that this ratio holds for other infrastructures. We would expect open resolvers to suffice to uncover all of a ten-front-end infrastructure, for example, but we would expect an even bigger gap on Akamai than on Google, since Akamai serves from many more locations.

We may be able to use our results from Google to project results for providers that support only open resolvers. We select one open recursive resolver from each /24 in which we know one (there are 110,000 such prefixes). Then, we select one of these /24s at a time and resolve www.google.com both from the open resolver in the prefix and via an EDNS query for that prefix. Figure 12 depicts the growth in the number of Google front-end IP addresses discovered by the two approaches as we issue additional measurements (1000 trials). Using resolvers in a set of prefixes yields very similar results to issuing EDNS queries for that same set of prefixes, so the benefit of EDNS is primarily that we can issue queries for many more prefixes than we have access to resolvers in.

Figure 12: The relation between the number of Google IP addresses discovered and the number of vantage points, using one open resolver per /24 block and one EDNS query per /24 block (minimum, maximum, and mean over 1000 trials for each approach).

We extrapolate these growth curves to understand the impact of having more resolvers. To test this approach, we fit power-law curves to the open resolver lines (R = 0.97 in all cases). We project that access to resolvers in all 10M routable /24 prefixes would discover 6990-8687 IP addresses of Google front-end servers (as of May 4th, 2013). Using EDNS-client-subnet queries for these 10M prefixes, we found 8563 IP addresses, within the projected range, so the extrapolation approach may be reasonable. In the future, we plan to apply it to predict the size of Akamai and other infrastructures that do not yet support EDNS-client-subnet. We can also use our Google results to characterize the regions in which our set of open resolvers has good coverage, in order to flag portions of other infrastructures as more or less complete.
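As an example of the per-prefix EDNS measurement, the following sketch issues one EDNS-client-subnet query for a /24 using the dnspython library (the prefix and resolver address are placeholders, and a dnspython version that provides dns.edns.ECSOption is assumed):

import dns.edns
import dns.message
import dns.query
import dns.rdatatype

# Hypothetical client /24 and EDNS-client-subnet-aware resolver; in our
# measurements we iterate over ~10M routable /24 prefixes.
client_prefix = "192.0.2.0"   # /24 to embed in the EDNS-client-subnet option
resolver = "8.8.8.8"

ecs = dns.edns.ECSOption(client_prefix, 24)          # source prefix length /24
query = dns.message.make_query("www.google.com", "A",
                               use_edns=0, options=[ecs])
response = dns.query.udp(query, resolver, timeout=5)

# Collect the A records returned for this prefix: candidate front-end IPs.
front_ends = {rr.address for rrset in response.answer
              for rr in rrset if rr.rdtype == dns.rdatatype.A}
print(sorted(front_ends))

Repeating such queries across prefixes, and the analogous lookups through open resolvers, yields the discovery curves compared in Figure 12.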
Google lends itself to client-centric geolocation because our EDNS-client-subnet measurements uncover the client-to-front-end mappings, and Google's deployment is dense enough that most clients are near front-ends. We will have to adapt the strategy for deployments where one or both of those properties do not hold. Akamai uses a resolver's geolocation in its mapping decisions [5], so it may be possible to geolocate Akamai servers based on the locations of the open resolvers they serve, even though we cannot directly measure which clients they serve. We will verify the soundness of this approach by geolocating Google front-ends using resolver locations. If the approach is generally accurate, we can also use it to flag suspicious resolver locations and use only the remainder when geolocating Akamai (or other) servers.

CDNs such as Edgecast support EDNS queries to discover client-to-front-end mappings, but they lack the server density of Akamai and Google and so necessarily direct some clients to distant servers. Since our geolocation approach assumes front-ends are near clients, it may not be sound to assume that the front-end is at the geographic center of its clients. Edgecast publishes its geographic points of presence on its website, so we can use its deployment as ground truth to evaluate approaches for mapping other providers that do not publish this information. We will investigate whether our aggressive pruning of distant clients suffices for Edgecast. If not, straightforward alternative approaches may work well for these sparse deployments. For example, these small deployments tend to be at well-connected Internet exchange points, where we likely have a vantage point close enough to use delay-based geolocation accurately [12, 16].

8. RELATED WORK

Closest to our work is prior research on mapping CDN infrastructures [1, 2, 14, 36]. Huang et al. [14] map two popular content delivery networks, Akamai and Limelight, by enumerating their front-ends using a quarter of a million open rDNS resolvers. They geolocate and cluster front-ends using a geolocation database and also use the locations of the penultimate hops of traceroutes to front-ends. Ager et al. [2] chart web hosting structures as a whole. They start by probing several sets of domain names from dozens of vantage points to collect service IP addresses, rely entirely on MaxMind [24] for geolocation, and use feature-based clustering whose goal is to separate front-ends belonging to different hosting infrastructures. Torres et al. [36] use a small number of vantage points in the US and Europe and constraint-based geolocation to approximately geolocate serving sites in the YouTube CDN, with the aim of understanding video server selection strategies. Finally, Adhikari et al. [1] use open resolvers to enumerate YouTube servers and geolocation databases to geolocate them, with the aim of reverse-engineering the caching hierarchy and logical organization of the YouTube infrastructure from its DNS namespaces. In contrast to these pieces of work, our enumeration effectively uses many more vantage points to achieve complete coverage, our geolocation technique leverages client locations for accuracy instead of relying on geolocation databases, and our clustering technique relies on a metric embedding in high-dimensional space to differentiate between nearby sites.

In addition to this prior work, simultaneous work appearing at the same conference as this paper also used EDNS-client-subnet to expose CDN infrastructure [34]. While our work focuses on characterizing Google's rapid expansion, including geolocating and clustering front-ends, that work addresses complementary issues, including measuring other EDNS-client-subnet-enabled infrastructures.
Our results differ from that work: we expose 30% more /24 prefixes and 12% more ASes hosting front-ends that are actively serving Google search. We believe our additional coverage results from our more frequent mapping and our accumulation of servers over time, since a single snapshot may miss some infrastructure (see Table 3). Some in the operations community also independently recognized how to use EDNS-client-subnet to enumerate a CDN's servers, although those earlier measurements presented only a small-scale enumeration without further investigation [23].

Several other pieces of work are tangentially related to ours. Previous work exploits the observation that two clients directed to the same or nearby front-ends are likely to be geographically close [8, 35]; our work uses this observation to geolocate front-ends. Mao et al. [22] quantify the proximity of clients to their local DNS resolvers and find that clients in different geographic locations may use the same resolver. The EDNS-client-subnet extension we use was designed to permit serving infrastructures to direct clients to serving sites more accurately in such cases. Otto et al. [26] examine the end-to-end impact that different DNS services have on CDN performance. Theirs is the first work to study the potential of EDNS-client-subnet to address the client-to-CDN mapping problem, using the extension as intended, but it does not repurpose EDNS to map infrastructure, as we do.

Finally, several strands of research have explored complementary problems associated with serving infrastructures, including characterizing and diagnosing provider latency [7, 11, 17, 18, 40]; geolocating ASes using client locations [30]; verifying data replication strategies for cloud providers [4]; and analyzing content usage in large CDNs [6]. Some of this research describes how providers use distributed front-ends as proxies to improve client performance [7, 11, 28]. Our work demonstrates the rapid expansion of Google's infrastructure, and its new strategy of placing front-ends in other networks, to deliver on this approach.

9. CONCLUSIONS

As the role of interactive web applications continues to grow in our lives, and the mobile web penetrates remote regions of the world more than wired networks ever had, the Internet needs to deliver fast performance to everyone, everywhere, at all times. To serve clients around the world quickly, service providers deploy globally distributed serving infrastructure, and we must understand these infrastructures to understand how providers deliver content today. Toward that goal, we developed approaches specific to mapping these serving infrastructures. By basing our techniques on how providers architect their infrastructures and guarding them against noisy data, we accurately map geographically distributed serving sites.

We apply our techniques to map Google's serving infrastructure and track its rapid expansion over the period of our measurement study. During that time, the number of serving sites grew more than sevenfold, and we see Google deploying satellite front-ends around the world, in many cases far from any known Google datacenter. By continuing to map Google's and others' serving infrastructures, we will watch the evolution of these key enablers of today's Internet, and we expect these accurate maps to enable future work by us and others to understand and improve content delivery on a global scale.
Acknowledgments

We thank our shepherd, Aditya Akella, and the anonymous IMC reviewers for their valuable feedback. We also gratefully acknowledge Bernhard Ager, Georgios Smaragdakis, Florian Streibelt, and Nikolaos Chatzis for their feedback on earlier versions of this paper.

Xun Fan, Zi Hu, and John Heidemann are partially supported by the U.S. Department of Homeland Security Science and Technology Directorate, Cyber Security Division, via SPAWAR Systems Center Pacific under Contract No. N66001-13-C-3001. John Heidemann is also partially supported by DHS BAA 11-01-RIKA and Air Force Research Laboratory, Information Directorate under agreement number FA8750-12-2-0344. Matt Calder and Ramesh Govindan were partially supported by the U.S. National Science Foundation grant number CNS-905596.

10. REFERENCES

[1] Vijay Kumar Adhikari, Sourabh Jain, Yingying Chen, and Zhi-Li Zhang. Vivisecting YouTube: An active measurement study. In INFOCOM, 2012.
[2] Bernhard Ager, Wolfgang Mühlbauer, Georgios Smaragdakis, and Steve Uhlig. Web content cartography. In IMC, 2011.
[3] Mihael Ankerst, Markus M. Breunig, Hans-Peter Kriegel, and Jörg Sander. OPTICS: Ordering points to identify the clustering structure. In SIGMOD, 1999.
[4] Karyn Benson, Rafael Dowsley, and Hovav Shacham. Do you know where your cloud files are? In Cloud Computing Security Workshop, 2011.
[5] Arthur Berger, Nicholas Weaver, Robert Beverly, and Larry Campbell. Internet nameserver IPv4 and IPv6 address relationships. In IMC, 2013.
[6] Meeyoung Cha, Haewoon Kwak, Pablo Rodriguez, Yong-Yeol Ahn, and Sue Moon. I Tube, You Tube, Everybody Tubes: Analyzing the world's largest user generated content video system. In IMC, 2007.
[7] Yingying Chen, Sourabh Jain, Vijay Kumar Adhikari, and Zhi-Li Zhang. Characterizing roles of front-end servers in end-to-end performance of dynamic content distribution. In IMC, 2011.
[8] David Choffnes and Fabián E. Bustamante. Taming the torrent: A practical approach to reducing cross-ISP traffic in peer-to-peer systems. In SIGCOMM, 2008.
[9] C. Contavalli, W. van der Gaast, S. Leach, and E. Lewis. Client subnet in DNS requests, April 2012. Work in progress (Internet draft draft-vandergaast-edns-client-subnet-01).
[10] Xenofontas Dimitropoulos, Dmitri Krioukov, Marina Fomenkov, Bradley Huffaker, Young Hyun, k. c. claffy, and George Riley. AS relationships: Inference and validation. ACM CCR, 37(1):29-40, January 2007.
[11] Tobias Flach, Nandita Dukkipati, Andreas Terzis, Barath Raghavan, Neal Cardwell, Yuchung Cheng, Ankur Jain, Shuai Hao, Ethan Katz-Bassett, and Ramesh Govindan. Reducing web latency: The virtue of gentle aggression. In SIGCOMM, 2013.
[12] Bamba Gueye, Artur Ziviani, Mark Crovella, and Serge Fdida. Constraint-based geolocation of Internet hosts. IEEE/ACM TON, 14(6):1219-1232, December 2006.
[13] Zi Hu and John Heidemann. Towards geolocation of millions of IP addresses. In IMC, 2012.
[14] Cheng Huang, Angela Wang, Jin Li, and Keith W. Ross. Measuring and evaluating large-scale CDNs. Technical Report MSR-TR-2008-106, Microsoft Research, October 2008.
[15] iPlane. http://iplane.cs.washington.edu.
[16] Ethan Katz-Bassett, John P. John, Arvind Krishnamurthy, David Wetherall, Thomas Anderson, and Yatin Chawathe. Towards IP geolocation using delay and topology measurements. In IMC, 2006.
[17] Rupa Krishnan, Harsha V. Madhyastha, Sridhar Srinivasan, Sushant Jain, Arvind Krishnamurthy, Thomas Anderson, and Jie Gao. Moving beyond end-to-end path information to optimize CDN performance. In IMC, pages 190-201, 2009.
[18] Ang Li, Xiaowei Yang, Srikanth Kandula, and Ming Zhang. CloudCmp: Comparing public cloud providers. In IMC, 2010.
[19] Greg Linden. Make data useful. http://sites.google.com/site/glinden/Home/StanfordDataMining.2006-11-28.ppt, 2006.
[20] M. Luckie, B. Huffaker, A. Dhamdhere, V. Giotsas, and k. claffy. AS relationships, customer cones, and validation. In IMC, 2013.
[21] Harsha V. Madhyastha, Tomas Isdal, Michael Piatek, Colin Dixon, Thomas Anderson, Arvind Krishnamurthy, and Arun Venkataramani. iPlane: An information plane for distributed services. In OSDI, 2006.
[22] Z. M. Mao, C. D. Cranor, F. Douglis, M. Rabinovich, O. Spatscheck, and J. Wang. A precise and efficient evaluation of the proximity between web clients and their local DNS servers. In USENIX Annual Technical Conference, 2002.
[23] Mapping CDN domains. http://b4ldr.wordpress.com/2012/02/13/mapping-cdn-domains/.
[24] MaxMind. http://www.maxmind.com/app/ip-location/.
[25] David Meyer. RouteViews. http://www.routeviews.org.
[26] John S. Otto, Mario A. Sánchez, John P. Rula, and Fabián E. Bustamante. Content delivery and the natural evolution of DNS. In IMC, 2012.
[27] Venkata N. Padmanabhan and Lakshminarayanan Subramanian. An investigation of geographic mapping techniques for Internet hosts. In SIGCOMM, 2001.
[28] Abhinav Pathak, Y. Angela Wang, Cheng Huang, Albert Greenberg, Y. Charlie Hu, Randy Kern, Jin Li, and Keith W. Ross. Measuring and evaluating TCP splitting for cloud services. In PAM, 2010.
[29] William M. Rand. Objective criteria for the evaluation of clustering methods. Journal of the American Statistical Association, 66(336):846-850, 1971.
[30] Amir H. Rasti, Nazanin Magharei, Reza Rejaie, and Walter Willinger. Eyeball ASes: From geography to connectivity. In IMC, 2010.
[31] Steve Souders. High-performance web sites. Communications of the ACM, 51(12):36-41, December 2008.
[32] Neil Spring, Ratul Mahajan, and David Wetherall. Measuring ISP topologies with Rocketfuel. ACM CCR, 32(4):133-145, 2002.
[33] Stoyan Stefanov. Yslow 2.0. In CSDN SD2C, 2008.
[34] Florian Streibelt, Jan Böttger, Nikolaos Chatzis, Georgios Smaragdakis, and Anja Feldmann. Exploring EDNS-client-subnet adopters in your free time. In IMC, 2013.
[35] Ao-Jan Su, David R. Choffnes, Aleksandar Kuzmanovic, and Fabián E. Bustamante. Drafting behind Akamai (Travelocity-based detouring). In SIGCOMM, 2006.
[36] Ruben Torres, Alessandro Finamore, Jin Ryong Kim, Marco Mellia, Maurizio M. Munafò, and Sanjay Rao. Dissecting video server selection strategies in the YouTube CDN. In ICDCS, 2011.
[37] UCLA Internet topology collection. http://irl.cs.ucla.edu/topology/.
[38] Qiang Xu and Jaspal Subhlok. Automatic clustering of grid nodes. In Proc. of the 6th IEEE International Workshop on Grid Computing, 2005.
[39] Ming Zhang, Yaoping Ruan, Vivek S. Pai, and Jennifer Rexford. How DNS misnaming distorts Internet topology mapping. In USENIX Annual Technical Conference, 2006.
[40] Yaping Zhu, Benjamin Helsley, Jennifer Rexford, Aspi Siganporia, and Sridhar Srinivasan. LatLong: Diagnosing wide-area latency changes for CDNs. IEEE Transactions on Network and Service Management, 9(1), September 2012.