On Efficient Data Transfers Across Geographically Dispersed Datacenters

by Mohammad Noormohammadpour

A Dissertation Presented to the
FACULTY OF THE USC GRADUATE SCHOOL
UNIVERSITY OF SOUTHERN CALIFORNIA
In Partial Fulfillment of the Requirements for the Degree
DOCTOR OF PHILOSOPHY (ELECTRICAL ENGINEERING)

December 2019

Copyright 2019 Mohammad Noormohammadpour

Acknowledgements

I want to thank my Ph.D. advisor, Prof. Cauligi Raghavendra, who provided inordinate help with every step in the preparation and making of this dissertation. I want to thank our collaborators Dr. Sriram Rao from Facebook, Dr. Srikanth Kandula from Microsoft, and Dr. Ajitesh Srivastava from the Ming Hsieh Department of Electrical Engineering, University of Southern California. I would also like to thank Prof. Neal Young from the University of California, Riverside, for the helpful comments on Stack Exchange concerning the NP-Hardness proof of the Best Worst-case Routing presented in Appendix A. I finally would like to thank Long Luo from the University of Electronic Science and Technology of China for helpful discussion and collaboration.

I would also like to thank the following researchers and engineers who provided helpful advice and support throughout the Ph.D. program as part of classes and internships: Prof. Minlan Yu, now at Harvard; my internship team from Cisco that worked on Non-Volatile Memory for Distributed Storage, especially David Oran, Josh Gahm, Atif Fahim, Praveen Kumar, Marton Sipos, and Spyridon Mastorakis; and my internship team at Google NetInfra working on Inter-Datacenter Traffic Engineering, especially Jeffrey Liang, Kirill Mendelev, Brad Morrey, Gilad Avidov, and Warren Chen.

Contents

Acknowledgements
List of Figures
List of Tables
Abstract
1 Introduction
  1.1 User Experience
  1.2 Inter-Datacenter Networks
  1.3 Inter-DC Transfers
    1.3.1 Point to Point (P2P) Transfers
    1.3.2 Point to Multipoint (P2MP) Transfers
    1.3.3 Inter-DC Transfers with Deadlines
  1.4 Overview of the Dissertation
2 Inter-DC Network Traffic Engineering
  2.1 Central Inter-DC Traffic Management Architecture
    2.1.1 Functions of Centralized Traffic Management
  2.2 Performance Metrics
  2.3 General Inter-DC Optimization Formulation
3 Adaptive Routing of Transfers over Inter-Datacenter Networks
  3.1 Background and Related Work
    3.1.1 A Novel Metric for Adaptive Routing over WAN
  3.2 Evaluation of Different Cost Metrics
  3.3 Discussion and Analysis
  3.4 Best Worst-case Routing (BWR)
    3.4.1 System Model
    3.4.2 Definition of Best Worst-case Routing
    3.4.3 BWR Heuristic (BWRH)
    3.4.4 Application to Real Network Scenarios
    3.4.5 Evaluations
  3.5 A Faster BWR Heuristic (BWRHF)
    3.5.1 Evaluations
  3.6 Conclusions
4 Fast Deadline-based Admission Control for Inter-DC Transfers
  4.1 Background and Related Work
  4.2 Fast Admission Control on a Network Path
    4.2.1 System Model
    4.2.2 Currently Used Approach
    4.2.3 As Late As Possible (ALAP) Scheduling
    4.2.4 Simulation Results
  4.3 Application of ALAP over General Network Topologies
    4.3.1 System Model
    4.3.2 Network-wide ALAP Scheduling
    4.3.3 Load-based Dynamic Routing
    4.3.4 DCRoute Algorithm
    4.3.5 Simulation Results
  4.4 Admission Control with Multipath ALAP Scheduling
    4.4.1 Multipath Routing
    4.4.2 Simulation Results
  4.5 Conclusions
5 Efficient Point to Multipoint Transfers over Inter-DC Networks
  5.1 Background and Related Work
  5.2 Adaptive Forwarding Tree Selection for P2MP Transfers
    5.2.1 System Model
    5.2.2 Selection of Forwarding Trees
    5.2.3 Scheduling Policy
    5.2.4 DCCast Algorithms
    5.2.5 Evaluation
  5.3 Fast Admission Control for Point to Multipoint Transfers with Deadlines
    5.3.1 System Model
    5.3.2 Point to Multipoint Transfers with Deadlines
    5.3.3 Deadline-aware DCCast (DDCCast)
    5.3.4 Evaluation
  5.4 Conclusions
6 Speeding up P2MP Transfers using Receiver Set Partitioning
  6.1 Background and Related Work
  6.2 System Model
  6.3 Optimizing Receiver Completion Times with Minimum Bandwidth Usage
    6.3.1 Forwarding Tree Selection
    6.3.2 Receiver Set Partitioning
    6.3.3 Rate Allocation
  6.4 Evaluation
    6.4.1 Weight Assignment Techniques for Tree Selection
    6.4.2 Receiver Set Partitioning
    6.4.3 Effect of Rate Allocation Policies
    6.4.4 Running Time
    6.4.5 Forwarding Plane Resource Usage
  6.5 Conclusions
7 Mixed Completion Time Objectives for P2MP Transfers over Inter-DC Networks
  7.1 System Model
    7.1.1 Online Greedy Optimization Model
  7.2 Partitioning of Receivers on a Relaxed Topology
    7.2.1 Our Partitioning Approach
    7.2.2 Incorporating Objective Vectors
  7.3 Iris
    7.3.1 Choosing Forwarding Trees
    7.3.2 Estimating Minimum Completion Times
    7.3.3 Assigning Ranks to Receivers
    7.3.4 The Iris Algorithm
  7.4 Evaluation
    7.4.1 Computing a Lower Bound
    7.4.2 Simulations
    7.4.3 Mininet Emulations
    7.4.4 Practical Concerns
  7.5 Conclusions
8 Speeding up P2MP Transfers using Parallel Steiner Trees
  8.1 Motivating Example
  8.2 System Model
  8.3 Application of Parallel Forwarding Trees
    8.3.1 Adaptive Edge-disjoint Parallel Forwarding Tree Selection
    8.3.2 Scheduling Policies
  8.4 Evaluation
    8.4.1 Effect of Number of Parallel Trees
    8.4.2 Effect of Number of Copies
    8.4.3 Effect of Transfer Size Distribution
    8.4.4 Effect of Topology
    8.4.5 Effect of Scheduling Policies
    8.4.6 Effect of Network Load
  8.5 Conclusions
9 Summary and Future Directions
  9.1 Future Directions
    9.1.1 Adaptive Routing over Inter-DC Networks
    9.1.2 Deadline-aware Point to Multipoint Transfers
    9.1.3 Receiver Completion Times of Point to Multipoint Transfers
    9.1.4 Large-scale Implementation and Evaluation of Algorithms for Fast and Efficient Point to Multipoint Transfers
A NP-Hardness Proof for Best Worst-case Routing
B SDN Switches that Support Group Table ALL
Bibliography

List of Figures

1.1 A typical datacenter cluster
1.2 Netflix cache locations as of 2016; green dots are ISP locations and orange circles are Internet Exchange Points (IXPs) where different network providers connect their networks
1.3 Google, a major cloud services provider, with 19 functional regions and 4 more in progress as of 2019
1.4 Google's inter-DC network, also known as B4
1.5 Microsoft Azure's inter-DC network
1.6 Traffic growth across Facebook's Express Backbone
2.1 Central traffic management architecture
2.2 Steps in processing of a new inter-DC transfer
2.3 Rate-allocation per link per timeslot
2.4 Several end-point rate-limiting techniques
2.5 Some penalty functions
3.1 Performance of various cost metrics for path selection over the Cogent WAN [1]
3.2 Example of routing a new flow F4
3.3 BWRH's optimality gap computed for 1000 flow arrivals
3.4 Online routing techniques by flow scheduling policy over various topologies
3.5 Worst-case routing scenario
3.6 Mean and tail flow completion times for three implementations of BWR over the GScale [2], AGIS [3], and ANS [4] topologies
3.7 Mean and tail flow completion times for three implementations of BWR over the two large topologies of AT&T [5] and Cogent [1]
3.8 Mean and tail flow completion times for three implementations of BWR over the GScale [2], AGIS [3], and ANS [4] topologies, under a different parameter setting
3.9 Mean and tail flow completion times for three implementations of BWR over the AT&T [5] and Cogent [1] topologies, under a different parameter setting
3.10 Online routing techniques by flow scheduling policy over the AT&T and Cogent topologies
3.11 Online routing techniques by flow scheduling policy over the GScale and ANS topologies
3.12 Online routing techniques by flow scheduling policy over the AGIS topology
3.13 BWRHF's optimality gap computed for 1000 flow arrivals
4.1 A traffic allocation used in the proof of Theorem 1
4.2 An example of ALAP allocation
4.3 Comparison between Amoeba and ALAP scheduling
4.4 An example of improving utilization (the PullBack phase) while keeping the final allocation ALAP (the PushForward phase)
4.5 An example of assigning paths to transfers and their total network capacity use
4.6 Total percentage of rejected traffic and relative request processing time for the GScale network with 12 nodes and 19 links
4.7 Total percentage of rejected traffic and relative request processing time for networks of different sizes
4.8 An example of multipath ALAP scheduling: traffic is allocated on edge-disjoint paths from the deadline backward in parallel
4.9 Multipath ALAP scheduling over the GScale [2] topology
4.10 Multipath ALAP scheduling over the ANS [4] topology
4.11 Multipath ALAP scheduling over the Cogent [1] topology
5.1 Applications that generate transfers potentially with multiple destinations
5.2 Inter-DC multicasting can reduce total bandwidth consumption as well as completion times of transfers
5.3 Tree selection (GScale topology)
5.4 Tree selection (random topology with 50 nodes)
5.5 Various scheduling policies and the effect of batching
5.6 DCCast vs. point-to-point (P2P-SRPT-LP)
5.7 Performance of 3-shortest-paths (P2P) vs. DCCast as the network grows
5.8 Performance of 3-shortest-paths (P2P) vs. DCCast as incoming network load increases
5.9 Computational overhead of DCCast as network size grows
5.10 DDCCast (Deadline-Aware DCCast, Section 5.2) architecture
5.11 Capacity consumption and total admitted traffic by number of receivers per transfer
5.12 Capacity consumption and total admitted traffic (three receivers per transfer)
6.1 Using multiple smaller multicast trees can improve the completion times of several receivers while marginally increasing total network capacity consumption
6.2 Evaluation of various weights for tree selection
6.3 Various schemes for bulk multicast transfers
6.4 Mean receiver completion time speedup over a single load-aware Steiner tree (Algorithm 5) by receiver rank
6.5 Performance of QuickCast as a function of the partitioning factor
6.6 Average throughput of bulk multicast transfers under different scheduling policies
7.1 A relaxed topology with infinite core capacity
7.2 Various partitioning solutions for a scenario with four receivers
7.3 A worst-case scenario for the proposed partitioning approach
7.4 Example of a partitioning hierarchy for a transfer with 10 receivers
7.5 Pipeline of Iris
7.6 The physical topology and the aggregate topology used to compute a lower bound on receiver completion times
7.7 Comparison of various techniques by number of multicast receivers
7.8 Mean completion time speedup of receivers over the no-partitioning (load-aware single tree) case by receiver rank
7.9 CDF of receiver completion times
7.10 Gain by rank for different numbers of receivers per transfer and four different objective vectors
7.11 Mininet emulation results
8.1 Using parallel forwarding trees can increase the overall network throughput to all receivers
8.2 The effect of the number of parallel trees on total bandwidth consumption and transfer completion times (TCTs)
8.3 The effect of the number of receivers on total bandwidth consumption ratio and TCT gain
8.4 The effect of the transfer size distribution on total bandwidth consumption ratio and TCT gain
8.5 The effect of topology on total bandwidth consumption ratio and TCT gain
8.6 The effect of the traffic scheduling policy on total bandwidth consumption ratio and TCT gain
8.7 The effect of the transfer arrival rate (network load) on total bandwidth consumption ratio and TCT gain
9.1 Example scenario used in Section 9.1.1
A.1 Network used in Problem 1

List of Tables

2.1 Definition of performance metrics
2.2 Definition of variables
4.1 Variables used in Chapter 4 in addition to those in Table 2.2
5.1 Various services that perform data replication
5.2 Variables used in Chapter 5 in addition to those in Table 2.2
5.3 Schemes used for comparison
6.1 Definition of variables used in Chapter 6 besides those defined in Table 2.2
6.2 Various topologies used in evaluation
6.3 Transfer size distributions
6.4 Various weights for tree selection for an incoming request
7.1 Behavior of several objective vectors
7.2 Definition of variables used in Chapter 7 besides those defined in Table 2.2
7.3 Various topologies and traffic patterns used in evaluation
B.1 SDN products with support for OFPGT_ALL

Abstract

As applications become more distributed to improve user experience and offer higher availability, businesses rely more than ever on geographically dispersed datacenters that host such applications. Dedicated inter-datacenter networks have been built that provide high visibility into the network status and flexible control over traffic forwarding to offer quality communication across the instances of applications hosted on many datacenters. These networks are relatively small, with tens to hundreds of nodes, and are managed by the same organization that operates the datacenters, which makes centralized traffic engineering feasible. Using coordinated data transmission from the services and routing over the inter-datacenter network, one can optimize the network performance according to a variety of utility functions that take into account data transfer deadlines, network capacity consumption, and transfer completion times. Such optimization is especially relevant for long-running data transfers that occur across datacenters due to the replication of configuration data, multimedia content, and machine learning models.

In this dissertation, we study techniques and algorithms for fast and efficient data transfers across geographically dispersed datacenters over inter-datacenter networks. We discuss different forms and properties of inter-datacenter transfers and present a generalized optimization framework to maximize an operator-selected utility function.
Next, in the chapters that follow, we study in detail the problems of admission control for transfers with deadlines and inter-datacenter multicast transfers. We present a variety of heuristic approaches while carefully considering their running time. For the admission control problem, our solutions offer a significant speedup in the admission control process while offering almost identical performance in the total traffic admitted into the network. For the bulk multicasting problem, our techniques enable significant performance gains in receiver completion times with low computational complexity, which makes them highly applicable to inter-datacenter networks. In the end, we summarize our contributions and discuss possible future directions for researchers.

Chapter 1
Introduction

Datacenters provide an infrastructure for many online services, including services managed by small companies and individuals who do not want to deal with the complexities and difficulties of maintaining physical computers [7, 8]. Examples of these online services are on-demand video delivery, storage and file sharing, cloud computing, financial services, multimedia recommendation systems, online gaming, and interactive online tools that millions of users depend on [9-11]. Besides, massively distributed services such as web search, social networks, and scientific analytics that require storage and processing of substantial scientific data take advantage of the computing and storage resources of datacenters [2, 12, 13].

Datacenter services may consist of a variety of applications with instances running on one or more datacenters. They may dynamically scale across a datacenter or across multiple datacenters according to end-user demands, which enables cost savings for service managers. Moreover, considering some degree of statistical multiplexing, better resource utilization can be achieved by allowing many services and applications to share datacenter infrastructure.

To reduce the costs of building and maintaining datacenters, numerous businesses rely on infrastructure provided by large cloud infrastructure providers such as Google Cloud, Microsoft Azure, and Amazon Web Services [14-16], whose datacenters consist of hundreds of thousands of servers. This provides the resources needed to run thousands of distributed applications that span hundreds of servers and scale out dynamically as needed to handle additional user load.

A datacenter is typically home to multiple server clusters with thousands of machines per cluster that are connected using high-capacity networks. Figure 1.1 shows the structure of a typical datacenter cluster network with many racks. A cluster is usually made up of up to hundreds of racks [17-19]. A rack is essentially a group of machines which can communicate at high speed with minimal latency. All the machines in a rack are connected to a Top of Rack (ToR) switch which provides non-blocking connectivity among them. Rack size is typically limited by the maximum number of ports that ToR switches provide and the ratio of downlink to uplink bandwidth. There are usually on the order of tens of machines per rack [17-19]. ToR switches are then connected via a large interconnection that allows machines to communicate across racks. An ideal network should act as one huge non-blocking switch to which all servers are directly connected, allowing them to simultaneously communicate at maximum rate.

Figure 1.1: A typical datacenter cluster.
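To make the two-tier structure just described concrete, the following sketch builds such a cluster as a graph: hosts attach to their rack's ToR switch, and every ToR attaches to a node standing in for the cluster interconnection. The rack count, hosts per rack, and link capacities are hypothetical placeholders, not values taken from this dissertation.

```python
# Illustrative sketch only: the two-tier cluster of Figure 1.1 as a graph, with hosts
# attached to their rack's ToR switch and every ToR attached to a node standing in
# for the cluster interconnection. Sizes and capacities are made up.
import networkx as nx

def build_cluster(num_racks=4, hosts_per_rack=16, host_gbps=10, uplink_gbps=40):
    g = nx.Graph()
    g.add_node("interconnect")                                    # interconnection fabric
    for r in range(num_racks):
        tor = f"tor-{r}"
        g.add_edge(tor, "interconnect", capacity=uplink_gbps)     # ToR uplink
        for h in range(hosts_per_rack):
            g.add_edge(f"host-{r}-{h}", tor, capacity=host_gbps)  # host downlink
    return g

cluster = build_cluster()
# The ToR downlink-to-uplink capacity ratio is one way the oversubscription mentioned
# in the text shows up (16 hosts at 10 Gbps behind a single 40 Gbps uplink here).
print(cluster.number_of_nodes(), cluster.number_of_edges(), (16 * 10) / 40)
```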
Datacenter network topology plays a significant role in determining the level of failure resiliency, ease of incremental expansion, and communication bandwidth and latency. The aim is to build a robust network that provides low latency, typically up to hundreds of microseconds [20-22], and high bandwidth across servers. Many network designs have been proposed for datacenters [18, 23-29]. These networks often come with a large degree of path redundancy, which allows for increased fault tolerance. Also, to reduce deployment costs, some topologies scale into large networks by connecting many inexpensive switches to achieve the desired aggregate capacity and number of machines [17, 30], and the majority of these topologies are symmetrical.

Many services may need to span multiple racks to access the required volume of storage and compute resources. This increases the overall volume of traffic across racks. A high-capacity datacenter network allows for flexible operation and placement of applications across clusters and improves overall resource utilization and on-demand scale-out for applications [17, 18, 23, 28]. This allows the resources of any machine to be used by any application, which is essential for hyper-scale cloud providers [14-16]. However, designing networks that run at very high capacity is costly and unnecessary for smaller companies or enterprises. As a result, many datacenters may not offer full capacity across racks, with the underlying assumption that services run mostly within a single rack. To maximize resource utilization across a datacenter, accommodate more services, and allow for better scalability, large cloud providers usually build their networks at maximum capacity.

There is growing demand for datacenter network bandwidth. This increase is driven by faster storage devices, rising volumes of user and application data, the reduced cost of cloud services, and ease of access to cloud services. Google reports a 100% increase in their datacenter networking demands every 12 to 15 months [17]. Cisco forecasts a 400% increase in global datacenter IP traffic and a 2.6x growth in global datacenter workloads from 2015 to 2020 [31]. This growth in traffic has made network traffic management a necessity for datacenter operators to ensure that services can access the network capacity with minimal interference from other services.

1.1 User Experience

User experience is the cornerstone of online services, which have become ubiquitous and are presented to users through a variety of platforms including websites and mobile applications [32]. Several factors determine the quality of experience perceived by users while accessing such services, the most important of which are latency and availability. It is crucial that users can always access the resources, and the faster, the better. For example, a website's load time can affect whether users will explore the website further. As another example, while watching a video clip on YouTube, users would like the video to start quickly and play smoothly without interruptions or degradation in quality [33].

To maximize users' quality of experience while interacting with a specific service, operators keep multiple instances of such services up and running at all times and place them close to local users across regions, countries, and continents [34, 35]. This deployment minimizes users' latency while interacting with services and allows for a smooth and responsive experience.
Moreover, if an instance is interrupted due to failures or disasters, users will have the option of switching to other running instances of the same service in another datacenter. Doing so also requires services to copy the data on which they operate across the datacenters on which they run.

An example of such distributed applications is content distribution platforms like Netflix [36]. These services copy multimedia content to many locations close to local users for low-latency and high-speed access. Figure 1.2 shows Netflix's cache locations where multimedia content is stored for regional user access [37]. Depending on how users are distributed, services can decide how to place copies of data. For example, multimedia content can be distributed to locations where many users are expected to access it. Besides, such copying can be done both proactively and reactively. In the former case, services copy the content to a location before it is accessed by users, allowing all users to have fast access to the content. In the latter case, services copy the content to a location when a user near that location accesses the content, which might lead to the first users experiencing less than ideal quality of experience. Although the proactive approach offers a better user experience, it can be more costly for operators.

Figure 1.2: Netflix cache locations as of 2016. Green dots are ISP locations and orange circles are Internet Exchange Points (IXPs) where different network providers connect their networks. (Figure source: https://media.netflix.com/en/company-blog/how-netflix-works-with-isps-around-the-globe-to-deliver-a-great-viewing-experience)

Another example of distributed services is web search, such as Google and Bing [38, 39]. These services crawl billions of web pages and generate significant volumes of search index updates which are distributed across many datacenters for low-latency access by local and regional users [2, 40]. Search index updates are generated at different frequencies according to how fresh the related results need to be, which usually leads to smaller updates at a high frequency and larger updates at a low frequency that are pushed from the datacenter that generates them to all other datacenters.

1.2 Inter-Datacenter Networks

There is benefit in providing services using multiple datacenters that are geographically distributed, so that required services and data can be brought close to users for low-latency and high-speed access. Accordingly, Google Cloud, Amazon Web Services, and Microsoft Azure operate and maintain multiple geographically distributed datacenters. Google operates across 19 regions, as shown in Figure 1.3, with plans to expand to 4 additional regions; Microsoft operates across 54 geographical regions; Amazon runs more than two dozen availability zones, each consisting of one or more discrete datacenters; and Facebook employs 7 datacenters in North America and Europe.

Figure 1.3: Google, a major cloud services provider, with 19 functional regions and 4 currently in progress as of 2019. (Figure source: https://cloud.google.com/about/locations/)

There is a significant volume of traffic exchanged between datacenters. This traffic is due to frequent copying of large quantities of data and content from one datacenter to one or more datacenters. For this purpose, high-bandwidth networks connecting datacenters can be leased or purchased for fast and efficient data transfers [2, 41-43]. These high-speed wide area networks with dedicated capacity are referred to as inter-datacenter (inter-DC) networks. The resources of these networks may be used by the services that run on the datacenters that they connect. Datacenter operators own the capacity of the inter-DC network and can manage it as needed to maximize the performance of services. For example, Google B4, shown in Figure 1.4, is an inter-DC network that connects Google's datacenters globally (the topology shown is from 2013 and has been expanded considerably since then). It hosts the traffic for not only Google but also all the businesses that rely on Google Cloud, including thousands of websites and mobile and desktop applications. Another dedicated inter-DC WAN is Microsoft Azure's global backbone [42, 44], shown in Figure 1.5. There are also a variety of third-party companies that offer tools and equipment for medium and small businesses to build their own inter-DC networks with dedicated capacity for high performance.

Figure 1.4: Google's inter-DC network, also known as B4. (Figure source: [45])

Figure 1.5: Microsoft Azure's inter-DC network. (Figure source: https://azure.microsoft.com/en-us/blog/how-microsoft-builds-its-fast-and-reliable-global-network/)

Given that inter-DC networks connect a limited number of locations, usually about tens to hundreds of datacenters, management of their capacity for efficient usage through coordinated resource scheduling is feasible and has been shown to improve utilization and reduce deployment costs [2, 44, 46, 47]. Besides, inter-DC networks offer a high level of visibility into network status and control over network behavior such as routing and forwarding of traffic. These features streamline capacity management, which is also the central concept around which this dissertation is shaped.

1.3 Inter-DC Transfers

Datacenter services determine the traffic characteristics and the communication patterns among servers within a datacenter and between different datacenters. Many datacenters, especially cloud providers, run a variety of services that results in a spectrum of workloads. Some popular services include cache followers, file stores, key-value stores, data mining, search indexing, and web search. Some services generate lots of traffic among the application instances of the service, which is referred to as internal traffic. The reason this traffic is called internal is that it starts and ends between the instances of the same service without any direct interaction with the users. Examples of communication patterns that generate lots of internal traffic are scatter-gather (also known as partition-aggregate) [48-51] and batch computing tasks [52, 53].

Inter-DC transfers occur as a result of geographically distributed services, with instances running across various regions and datacenters generating lots of internal traffic among them. For example, multiple instances of a service running on different datacenters may need to synchronize by sending periodic or on-demand updates. Besides, in the case of distributed data stores like key-value stores and relational databases, it may be necessary to offer consistency guarantees across multiple instances, which requires the constant transmission of replicated data.

The volume of internal data transfers across datacenters is growing fast. For instance, Figure 1.6 shows the growth of inter-DC bandwidth across Facebook's datacenters. As can be seen, the amount of internal traffic is a significant portion of the traffic carried by the inter-DC network and is growing much faster than user traffic. To support this growing internal traffic, inter-DC network operators such as Facebook need to invest in expanding the network capacity, which can be expensive. Therefore, efficient utilization of network bandwidth is critical to maximize the support for internal traffic. In this dissertation, we focus on developing efficient algorithms for optimizing internal inter-DC transfers. We consider multiple research problems around inter-DC networks with a focus on performance, offer several solutions, and perform comprehensive evaluations.

Figure 1.6: Traffic growth across Facebook's Express Backbone. (Figure source: https://code.fb.com/networking-traffic/building-express-backbone-facebook-s-new-long-haul-network/)

Inter-DC transfers can be classified according to their number of destinations and whether they have completion time requirements. We briefly discuss the different types of inter-DC transfers in the following.

1.3.1 Point to Point (P2P) Transfers

Transfers could be generated as a result of data delivery from one datacenter to another datacenter, which we refer to as point to point (P2P) transfers [2, 12, 46, 54-57]. Many backup services allow for one geographically distant copy of data in a different region for increased reliability in case of natural disasters or datacenter failures. For example, if a datacenter region on the east coast goes completely off the grid due to a storm, data copied to a datacenter on the west coast can be used to handle user queries. Also, data warehousing services require delivery of data from all datacenters to a datacenter warehouse [58].

1.3.2 Point to Multipoint (P2MP) Transfers

There are also transfers that deliver an object from one datacenter to multiple datacenters, which we refer to as point to multipoint (P2MP) transfers. For example, content delivery networks (CDNs) may push significant video content to regional cache locations [12, 56, 59-61], cloud storage services may replicate data objects across multiple sites for increased reliability [62, 63], and search engines push substantial updates to their geographically distributed search databases on a regular basis [2]. Data transfer among datacenters for replication of objects from one datacenter to multiple datacenters is referred to as geo-replication [2, 45, 54, 56, 57, 64-70] and can form a large portion of inter-DC traffic [43].

1.3.3 Inter-DC Transfers with Deadlines

Inter-DC transfers deliver content that may need to become available to applications before specific deadlines. Such deadlines may represent the importance of transfers [46, 55]. For example, a transfer with a later deadline can be delayed in favor of another transfer with a close deadline. Deadlines are usually due to consumer requirements; for example, the results of some data processing may need to be ready by a specific time. A deadline may also be an internally assigned metric for more efficient scheduling of network transfers. For example, if a data processing task requires two inputs to generate an output, and one of them becomes available at some time in the future, it will not help to deliver the other input any earlier than that time. Assigning a deadline that is in the future allows the network operators to first deliver the data that is needed sooner.
1.4 Overview of the Dissertation

In this dissertation, we develop algorithms and techniques for efficient P2P and P2MP transfers among geographically dispersed datacenters. In Chapter 2, we first discuss how a modern inter-DC network manages traffic flow and formally present the traffic management problems of interest, specifically the online arrival of inter-DC traffic with its requirements. We then discuss performance metrics, such as mean and tail completion times, and finally give a general optimization formulation for the types of problems we consider in the rest of the dissertation.

For P2P traffic, path selection for traffic routing is a well-known problem with various existing solutions. However, using a centralized network architecture and given a dedicated inter-DC network, it is possible to develop routing algorithms that are adaptive to network conditions and therefore more efficient. In Chapter 3, we develop a new routing approach referred to as Best Worst-case Routing (BWR), which is capable of considerably reducing inter-DC transfer completion times regardless of the scheduling policy used for transmission of data across the network. We evaluate various heuristics that implement BWR and use them to quickly compute a new path for a newly arriving inter-DC transfer.

In Chapter 4, we develop fast admission control algorithms for inter-DC transfers with deadlines. We focus on Point to Point (P2P) transfers to maximize the number of transfers completed before their deadlines. We present a new scheduling policy referred to as As Late As Possible (ALAP) scheduling and combine it with a load-aware path selection mechanism to perform quick feasibility checks and decide on the admission of new inter-DC transfers. We also perform evaluations across different topologies and under varying network load, and show that our approach is scalable and can speed up admission control by more than two orders of magnitude compared to traditional techniques.

In Chapter 5, we study efficient P2MP transfers, where data transfer is needed from one source datacenter to multiple destination datacenters. Although this can be performed as multiple P2P transfers, there is opportunity to do significantly better, as all the receiving ends are known a priori and the network traffic forwarding can be centrally controlled. We introduce the concept of load-aware forwarding trees and compute them as weighted Steiner trees. We consider the objectives of minimizing the completion time of the slowest transfer and the total bandwidth use of all transfers. We perform extensive evaluations using random and deterministic topologies and show that our tree selection approach can considerably reduce transfer completion times compared to tree selection using other weight assignment techniques. We show that our approach can reduce the completion times of the slowest transfers by about 50% compared to performing P2MP using multiple P2P transfers. We also consider deadlines for P2MP transfers and present an admission control solution to maximize the number of P2MP transfers completed before their deadlines. Our approach uses load-aware forwarding trees combined with the ALAP scheduling policy to perform fast admission control for P2MP transfers with deadlines. We also perform extensive evaluations and show that, compared to state-of-the-art inter-DC admission control solutions, our approach admits up to 25% more traffic into the network while saving at least 22% network bandwidth.
For a P2MP transfer, it is in general not required that all receivers get a copy of the data at the same time. In Chapter 6, we focus on selectively speeding up some datacenters using receiver set partitioning, that is, grouping the receivers of P2MP transfers into multiple partitions and attaching each partition using an independent forwarding tree. We do this because a single multicast tree, although it offers the highest bandwidth savings, can slow all receivers down to the slowest receiver. We apply our P2MP load-aware tree selection approach per partition to distribute load across the network as well. We also explore different ways of finding the right number of partitions as well as the receivers that are grouped per partition. Using extensive evaluations, we show that our approach can speed up P2MP receivers by up to 35x when network links have highly varying capacities.

In Chapter 7, we develop a framework to optimize for mixed completion time objectives for P2MP transfers over inter-DC networks. That is, we recognize that, in general, different applications that distribute copies of objects to many locations may have different completion time objectives. For example, many applications require one copy of an object to be made quickly while the rest of the replicas can be made slowly. Knowing this requirement, we can select the receiver partitions accordingly: we save bandwidth by grouping all the slower receivers into one partition and satisfy the speed requirement by attaching the fastest receiver using an independent path. We present a solution that uses application-specific objectives to optimize the partitioning and tree selection for P2MP transfers. Through simulations and emulations, we show that our approach reduces average receiver completion times by 2x while meeting the requirements specified by applications on completion times.

In Chapter 8, we aim to speed up P2MP transfers using parallel load-aware forwarding trees that are selected as weighted Steiner trees. (A Steiner tree is a tree subgraph of the inter-DC network that connects the sender and all the receivers; its weight is the sum of the weights of its edges. Selecting a minimum-weight Steiner tree over a general graph is NP-Hard [71], but fast heuristics exist that offer close-to-optimal solutions on average [72].) We attach each partition of receivers using potentially multiple forwarding trees that deliver data to all of its receivers in parallel, hence increasing their throughput and reducing their completion times. We focus on the selection of edge-disjoint trees to eliminate direct bandwidth contention across the partitions of the same transfer. We perform comprehensive simulations and show that using up to two parallel edge-disjoint trees offers almost all of the benefit over various topologies, and that by using parallel trees we can speed up P2MP transfers by up to 40%.

Finally, in Chapter 9, we provide a summary and set forth several future directions to expand on our work.

Chapter 2
Inter-DC Network Traffic Engineering

Inter-DC networks consist of high-capacity links that connect tens to hundreds of datacenters across cities, countries, and continents with dedicated bandwidth [2, 43-47, 55, 57]. They can be modeled as a graph with datacenters as nodes and inter-DC links as edges, where every edge is associated with the properties of the inter-DC link it represents, such as capacity and bandwidth utilization.
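As a concrete, purely illustrative rendering of this graph model, the sketch below builds a small inter-DC topology with networkx and annotates each edge with capacity and utilization attributes; the site names and numbers are made up. It also calls networkx's approximate Steiner tree routine, one example of the kind of fast heuristic for minimum-weight Steiner trees noted above.

```python
# Illustrative only: a toy inter-DC graph with capacity/utilization edge attributes,
# plus networkx's approximate Steiner tree heuristic over an arbitrary "weight" attribute.
# Site names, capacities, and weights are hypothetical.
import networkx as nx
from networkx.algorithms.approximation import steiner_tree

inter_dc = nx.Graph()
links = [
    ("dc-east", "dc-central", 100),   # capacity in Gbps (made up)
    ("dc-central", "dc-west", 80),
    ("dc-east", "dc-south", 40),
    ("dc-south", "dc-west", 40),
]
for a, b, cap in links:
    # "weight" here could encode load, in the spirit of the load-aware trees above.
    inter_dc.add_edge(a, b, capacity=cap, utilization=0.0, weight=1.0)

# A tree subgraph connecting a sender and its receivers (an approximate Steiner tree).
tree = steiner_tree(inter_dc, ["dc-east", "dc-west", "dc-south"], weight="weight")
print(sorted(tree.edges()))
```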
Given that datacenter operators also manage inter-DC networks, coordination between traffic generation at the datacenters and routing of traffic within the inter-DC network can be used to optimize network utilization and maximize overall utility [46, 47, 55, 73].

The context we consider is data transfers that move bulk data across geographically dispersed datacenters over inter-DC networks. Bulk data transfers move the lion's share of data across datacenters [12], which makes it highly practical and valuable to optimize their transmission over inter-DC networks. Besides, inter-DC networks are relatively small in terms of the number of edges and nodes, which makes it feasible to formulate and solve optimization scenarios to maximize their performance [2, 42, 43]. Finally, inter-DC networks are operated by the same organization that manages the datacenters they connect, which makes it possible to control them in a logically centralized fashion as well as to apply novel traffic scheduling and routing techniques that cannot be used over the internet.

We consider a centralized traffic management scheme where a logically centralized Traffic Engineering Server (TES) receives traffic requirements from the senders and decides how traffic should be transmitted from the senders and how it should be routed within the inter-DC network across the datacenters. It also communicates with the senders and the network elements to coordinate them. Several inter-DC networks have been built using this principle, and related work has shown that this form of management allows for substantial performance gains [2, 43, 44, 46, 55, 57].

Central traffic allocation offers a variety of benefits. First, it allows for improved performance by minimizing congestion, proactively reserving bandwidth while collectively considering the interplay of the many transfers initiated from different datacenters. Second, it offers a highly configurable platform that allows maximizing performance according to various utility functions. Such utility functions can be selected according to an organization's business model. The coordinated routing and scheduling of traffic for maximization of network utility can be formulated as an optimization problem with different constraints, as we will show later in this chapter.

The traffic engineering problem we consider is the following. We are given an inter-DC network topology, including the connectivity and link capacities across datacenters, with the end-points that generate network traffic located within the datacenters. Data transfers arrive at the network in an online manner at different datacenters, i.e., we assume no prior knowledge of when a future transfer will arrive or what properties it will have. End-points can control the rate at which they transmit traffic. Upon the arrival of a new transfer, the sender communicates to the TES the properties of this transfer and any potential requirements on its transmission. The problem is for the TES to compute the best route(s) on which the traffic for this new transfer is forwarded, as well as the rate at which the new transfer and all the other existing transfers should transmit their traffic.

The transmission rates need to be updated as new transfers arrive, existing transfers finish, links fail or their capacity changes, or transfers are terminated. To efficiently handle this highly dynamic situation, we assume a slotted timeline and periodically compute end-point transmission rates at the beginning of every timeslot.
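The sketch below illustrates this slotted-timeline bookkeeping in miniature: transfers arrive online, and at the start of each timeslot the server recomputes a rate for every active transfer. It assumes a single bottleneck link and a simple equal-share rule, which stand in for the real topology and the utility-driven allocation formulated later in this chapter; none of the names or numbers come from the dissertation.

```python
# Minimal sketch of slotted-timeline rate recomputation, assuming one bottleneck link
# and an equal-share placeholder policy. Illustrative only.
from dataclasses import dataclass, field

@dataclass
class Transfer:
    src: str
    dst: str
    remaining: float          # data units left to deliver
    rate: float = 0.0         # rate assigned for the current timeslot

@dataclass
class TrafficEngineeringServer:
    link_capacity: float      # single bottleneck link, for illustration only
    active: list = field(default_factory=list)

    def submit(self, transfer):
        # Transfers arrive online; the next timeslot boundary picks them up.
        self.active.append(transfer)

    def start_timeslot(self, slot_length):
        # Recompute rates for all active transfers at the start of the timeslot.
        self.active = [t for t in self.active if t.remaining > 0]
        if self.active:
            share = self.link_capacity / len(self.active)
            for t in self.active:
                t.rate = min(share, t.remaining / slot_length)
        # Senders then transmit at the assigned rate for the whole timeslot.
        for t in self.active:
            t.remaining -= t.rate * slot_length

tes = TrafficEngineeringServer(link_capacity=10.0)
tes.submit(Transfer("dc-a", "dc-b", remaining=30.0))
tes.start_timeslot(slot_length=1.0)
```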
It is possible to schedule recomputation of rates upon highly critical events in addition to having it run periodically. In this dissertation, we assume only periodic execution of rate calculation, for simplicity. Also, the transmission of any new transfer begins as soon as the rates are updated.

We assume that the TES makes its optimization decisions given knowledge of the transfers that have already arrived. That is because we do not have deterministic information about transfers that may be created in the future. In general, it may be possible to predict future transfer arrivals and perform further optimizations accordingly, which is out of the scope of this dissertation.

2.1 Central Inter-DC Traffic Management Architecture

Central network traffic management has two major elements: rate-limiting at the senders and routing/forwarding in the network. Figure 2.1 shows the overall setup for this purpose, adopted by several existing inter-DC networks [44, 74]. In this setup, the TES calculates transmission rates and routes for submitted transfers as they arrive at the network. Rates are then dispatched to agents located at the datacenters, called site brokers, which are proxies that keep track of local transfers, i.e., transfers initiated within the same datacenter. When the TES calculates new routes, they are dispatched to the network by installing proper forwarding rules on the network's switching elements. The part of a switching element that holds these rules is referred to as the Forwarding Information Base (FIB).

Figure 2.1: Central traffic management architecture (datacenters, each with a local site broker, connected by an inter-datacenter network and coordinated by a global Traffic Engineering Server that handles routing via forwarding rules and scheduling via transmission rates).

Figure 2.2 shows the steps taken by the TES in processing a new inter-DC transfer. When a sender wants to initiate a transfer, it first communicates with the site broker in its local datacenter, which records the request and forwards it to the TES. When the TES responds with the transmission rates, the site broker records them and forwards them to the sender. The sender then applies rate-limiting at the rate specified by the TES. In some setups, the sender should also attach the proper forwarding label to its packets so that they are forwarded correctly (like a VLAN tag). Such labeling may also be applied transparently to the sender at a different network entity (hypervisor, border gateway, etc.). This function could also be implemented at the datacenter network edge based on end-point addresses and using real-time packet header modification predicates.

Figure 2.2: Steps in processing of a new inter-DC transfer.

Figure 2.3: Rate-allocation per link per timeslot.

In order to flexibly allocate traffic with varying rates over time, we break the timeline into small timeslots, similar to several current solutions [2, 44, 46, 55, 57]. Figure 2.3 shows how this is done for a single link e. For a network, capacity is allocated over the whole network per timeslot. We do not assume an exact length for these timeslots, as there are trade-offs involved. Having smaller timeslots can lead to inaccurate rate-limiting (it takes a short amount of time for senders to converge to new rates [75]) and adds the overhead of having to calculate rates for a larger number of timeslots, while having larger timeslots results in a less flexible allocation because the transmission rate is considered constant over a timeslot. Finally, the timeslot length depends on transfer sizes; in general, we could select a value based on the minimum or average transfer size.
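To make the per-link, per-timeslot allocation of Figure 2.3 concrete, here is a minimal sketch of the bookkeeping a central server could keep for one link: a calendar of reserved rate per future timeslot, against which new reservations are checked. The capacity, slot indices, and reservation amounts are illustrative, and the structure is a simplification rather than the dissertation's actual implementation.

```python
# Minimal sketch of per-link, per-timeslot capacity bookkeeping in the spirit of
# Figure 2.3. Capacities, slots, and reservations are made-up numbers.
from collections import defaultdict

class LinkCalendar:
    """Tracks how much of a link's capacity is reserved in each future timeslot."""
    def __init__(self, capacity):
        self.capacity = capacity
        self.reserved = defaultdict(float)      # timeslot index -> reserved rate

    def free(self, slot):
        return self.capacity - self.reserved[slot]

    def reserve(self, slot, rate):
        if rate > self.free(slot) + 1e-9:
            raise ValueError("not enough capacity in this timeslot")
        self.reserved[slot] += rate

link = LinkCalendar(capacity=10.0)
link.reserve(slot=0, rate=6.0)
link.reserve(slot=0, rate=4.0)
print(link.free(0), link.free(1))   # -> 0.0 10.0
```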
Current solutions have used a timeslot duration of 5 minutes, which is long enough to reduce the overhead of rate-computations and short enough to allow the network to adapt to changes in traffic demand [55,57].

By adding one level of indirection between senders and TES, the site broker serves several purposes. First, it reduces the request-response overhead for TES by maintaining a persistent connection with the server and possibly aggregating many sender requests into a smaller number of messages before sending them off to the server. Second, it allows for the application of hierarchical bandwidth allocation by locally grouping many transfers and presenting them to TES as one (this may reduce the accuracy of traffic engineering but makes it significantly more scalable in case there is a considerable number of transfers [74]). Finally, the site broker can update TES's response according to varying local network conditions, allow senders to switch to a backup TES in case TES goes offline, or even revert to a distributed mode.

Centralized traffic management can be realized using Software Defined Networking (SDN) [76]. SDN offers many highly configurable features, among which is the ability to manage traffic forwarding state centrally and programmatically by installing, updating, or removing forwarding rules in real-time. With a global view of network status and server demands, it is possible to offer globally optimal solutions. WANs operated using SDN have been adopted by an increasing number of companies and organizations over the past few years, examples of which include Google [2], Microsoft [44], and Facebook [43]. Of course, there are challenges in such centralized and real-time management of the network; for example, routing update inconsistencies and the latency from when forwarding rules are dispatched to when they take effect are two significant issues. Ongoing SDN-related research has been addressing these and several other problems [77,78]. In this dissertation, we consider the usage of SDN for controlling dedicated inter-DC networks. We develop algorithms that can be used by TES to compute routes on a per-transfer basis as transfers arrive.

2.1.1 Functions of Centralized Traffic Management

Traffic Rate-limiting: Figure 2.4 shows how rate-limiting can be applied at the servers before data is transmitted on the wire. The most straightforward approach is for service instances to communicate their demand to the local broker, which in turn makes contact with TES, and to only hand off to the transport layer (i.e., the socket) as much as specified by TES. This technique requires no changes to the end-points' protocol stack and hardware but requires modifications at the application layer. Another approach is to use the methods supported by the operating system for per-flow rate control. For example, later versions of Linux allow users to use a socket option along with the Fair Queuing algorithm to specify a pacing rate. Next, it is possible to apply rate limiting in hardware using precise timers. This approach is much more accurate compared to software approaches but requires more sophisticated equipment. There are also hybrid approaches that use a combination of operating system support and hardware rate limiters to apply accurate per-transfer rate limiting for a large number of transfers [75].

Figure 2.4: Several end-point rate-limiting techniques. Applications can limit their transmission rate by monitoring the volume handed off to transport, which makes per-flow rate limits easy to apply. The operating system can apply rate limits using a variety of methods; Linux Traffic Control (TC) can do this per socket using Fair Queuing and the SO_MAX_PACING_RATE socket flag. NICs can also apply rate limits using hardware timers, which is more precise, although programmable/advanced NICs are usually needed for per-flow rate limiting.

Traffic Routing: Inter-DC networks are strong candidates for custom routing techniques. Effective routing should take into account the overall load scheduled on links to better use all available capacity while shifting traffic across a variety of paths. Besides, routing should consider the properties of new transfers while assigning routes to them. Conventional routing schemes are incapable of taking such parameters into account to optimize routing with regard to operator-specified utility functions.

2.2 Performance Metrics

A variety of metrics can be used for performance evaluation over inter-DC networks, including transfer completion times, total network capacity consumption, and transfers completed before their deadlines. Depending on the services running over inter-DC networks, operators may choose to focus primarily on optimizing one metric or a utility function that generates an aggregate utility value according to all of these metrics. Table 2.1 offers an overview of these metrics.

Table 2.1: Definition of performance metrics.
- Tail completion times: Completion time of the slowest transfer over the evaluation period. In some cases, the 99th or 95th percentile may be used instead.
- Median completion times: The completion time of the transfer that is slower than 50% of transfers and faster than the other 50% over the evaluation period.
- Mean completion times: Average of the completion times of all transfers over the evaluation period.
- Total bandwidth/capacity consumed: Sum of the volume of traffic that was sent on all network edges, per edge, over the evaluation period.
- Ratio of deadline transfers completed: Fraction of transfers the network was able to finish before their deadlines, in case a deadline was specified. The network may apply admission control to only accept transfers that it can complete by their deadlines; in this case, we take the fraction of admitted transfers.
- Ratio of deadline traffic completed: Ratio of the total volume of transfers the network was able to finish before their deadlines, in case a deadline was specified, to the total volume of transfers. The network may apply admission control to only accept transfers that it can complete by their deadlines; in this case, we take the ratio of admitted traffic to total submitted traffic.
- Running time (network algorithms): The time to process transfer information and compute transmission rates and forwarding routes.

In general, some of these metrics may be at odds with others, and therefore it may not be possible to optimize all parameters at the same time. The relationship between these metrics could also depend on the operating point of the system. For example, under light traffic load, using more bandwidth usually allows us to reduce the completion times of transfers, while under heavy traffic load, using more bandwidth potentially leads to resource contention and increased completion times. One can consider two scenarios of transfers with and without deadlines. In the former case, we consider the performance metrics that evaluate the volume of traffic and the total fraction of transfers completed before their deadlines.
In the latter case, we pay attention to minimizing tail, median, or mean completion times; depending on the services running over the inter-DC networks, we may more strongly consider one of the three (it is also possible to consider other aggregate metrics given the circumstances). For example, in computing tasks that take multiple inputs from different datacenters, the processing start time depends on when all the inputs become available, which increases the importance of reducing tail completion times.

The various data transfer problems considered in this dissertation are all traffic engineering problems over inter-DC networks aiming at optimizing one or more of the metrics stated above. To find efficient solutions to such problems, we can formulate optimization problems using the network and transfer parameters and consider appropriate performance metrics to optimize. We will develop a general optimization framework in the next section.

2.3 General Inter-DC Optimization Formulation

The inter-DC optimization problem can be formulated in a variety of ways by considering different objective functions and constraints. In each problem, bulk inter-DC transfers are initiated from one sender to one or more receiving datacenters. In the following, we discuss different types of constraints and objectives that can be combined to form the ultimate framework.

Definition of Variables: Table 2.2 shows the list of variables used in this section. Data could be transmitted over paths or multicast trees to the receivers. Also, in general, data can be transmitted over multiple parallel paths or multicast trees towards the receivers. The notation we define captures these properties.

Table 2.2: Definition of variables.
- t and t_{now}: Some timeslot and the current timeslot
- \omega: Width of a timeslot in seconds
- e: A directed edge
- C_e: Capacity of e in bytes per second
- B_e: Current available bandwidth on edge e
- G: A directed graph representing an inter-DC network
- E_G: Set of edges of the directed graph G
- \Gamma: A directed subgraph over which traffic is forwarded to the receivers; could be a path or a multicast tree (E_\Gamma \subseteq E_G)
- R: Set of all requests (past, current, future)
- R_i: A transfer request, where R_i \in R and i \in I = \{1,\dots,I\}
- S_{R_i}: Source datacenter of R_i
- A_{R_i}: Arrival time of R_i
- \tau_{R_i}: Completion time of R_i
- t^d_{R_i}: Deadline of R_i
- \rho_{R_i}: Total network capacity consumed by R_i for its completion
- V_{R_i}: Original volume of R_i in bytes
- D_{R_i}: Set of destinations of R_i
- \tau^i_d: Completion time of receiver d \in D_{R_i}
- \Gamma^i_d: Directed subgraphs attached to receiver d \in D_{R_i} from S_{R_i}
- f^i_\Gamma(t): Transmission rate of R_i on subgraph \Gamma at timeslot t
- x^\Gamma_e: Whether edge e \in E_G is on subgraph \Gamma (binary variable)
- U: A network utility function set by network operators

Formal Definition of Completion Times: We define a receiver's completion time as the last timeslot with non-zero traffic arriving at that receiver for a specific transfer:

\tau^i_d \triangleq \max \{ t \mid f^i_\Gamma(t) > 0, \ \exists \Gamma \in \Gamma^i_d \}, \quad \forall d \in D_{R_i}, \ \forall i \in \{1,\dots,I\} \quad (2.1)

For a transfer, the completion time is the time at which all receivers of that transfer complete:

\tau_{R_i} = \max_{d \in D_{R_i}} \tau^i_d, \quad \forall i \in \{1,\dots,I\} \quad (2.2)

Optimization Objective: A variety of metrics can be considered as part of the optimization objective, including transfer completion times (i.e., median, average, tail), total network capacity use, and the number of deadlines missed (or, alternatively, the number of transfers that could not be admitted to meet their deadlines).
In general, a utility function can be defined over these metrics which the optimization problem aims to maximize. This function should be representative of how much profit the business can obtain while using the network.

\max \ U\big( \{\tau_{R_i}\}, \ \sum_i \rho_{R_i}, \ |\{ i \mid \tau_{R_i} > t^d_{R_i} \}| \big), \quad i \in \{1,\dots,I\} \quad (2.3)

Examples of objective functions include minimizing the mean (i.e., average) transfer completion times, i.e., \min \sum_{i \in \{1,\dots,I\}} \tau_{R_i}; minimizing the total network capacity consumption, i.e., \min \sum_{i \in \{1,\dots,I\}} \rho_{R_i}; minimizing the number of deadline-missing transfers, i.e., \min |\{ i \in \{1,\dots,I\} \mid \tau_{R_i} > t^d_{R_i} \}|; or a combination of these. For example, we can minimize a weighted sum of completion times and total network capacity consumption, i.e., \min ( \sum_{i} \tau_{R_i} + \epsilon \sum_{i} \rho_{R_i} ), where 0 < \epsilon \ll 1 is a coefficient used to prioritize minimizing completion times. In all of these cases, U is defined as a negative multiple of these functions; in other words, the network operator profits if these parameters are minimized.

Demand Constraints: The total data transmitted towards a receiver across all the paths or multicast trees connected to it has to be equal to the total volume of the transfer:

\sum_t \sum_{\Gamma \in \Gamma^i_d} \omega f^i_\Gamma(t) = V_{R_i}, \quad \forall d \in D_{R_i}, \ \forall i \in \{1,\dots,I\} \quad (2.4)

Capacity Constraints: The total transmission rate of all paths and multicast trees sharing an edge must be at most equal to the link's available bandwidth B_e \le C_e:

\sum_{i,\ \Gamma \mid e \in E_\Gamma,\ \Gamma \in \bigcup_{d \in D_{R_i}} \Gamma^i_d} f^i_\Gamma(t) \le B_e, \quad \forall t, \ \forall e \in E_G \quad (2.5)

The available bandwidth on an edge is determined by the volume of traffic used up by short flows (e.g., user-facing, high-priority traffic). There is usually a good estimate of how much such traffic is generated, as the rate of growth for user traffic is far less than that of business-internal inter-DC transfers [43].

Routing Constraints: To forward traffic from the source to each receiver per transfer, we can use one or more paths or trees. To make sure that each receiver obtains a full copy of the data, if any two receivers are connected using the same tree, any tree connected to one of them should also be connected to the other one. In other words, for some request R_i, the receivers D_{R_i} can be separated into multiple groups D^j_{R_i}, j \le |D_{R_i}|, each connected using at least one path (i.e., |D^j_{R_i}| = 1) or tree (i.e., |D^j_{R_i}| > 1).

In general, it is possible to formulate the selection of such paths and trees as part of the optimization framework and create a joint routing and rate computation framework. This, however, leads to an exponential number of constraints and the addition of a large number of binary variables to the formulation, which in general could take a long time to solve. Another approach would be to compute the paths and trees using some heuristic and plug them into the optimization framework, which reduces the complexity of the problem, allowing it to focus only on the computation of the rates.

For the sake of completeness, we briefly discuss how a joint optimization can be formulated by adding constraints to the framework. This can be done by enumerating all possible paths (or trees) from the source to each group of receivers and considering fraction variables that determine how much of the traffic will end up on each path (tree). Also, since we do not know how to group receivers, we need to consider all possibilities and define binary variables that determine which grouping maximizes the utility of the network.
More formally, let us define binary variables b_k indicating whether we have selected grouping k \in \{1,\dots,K\}, where K is the total number of ways to partition D_{R_i} into disjoint sets whose union is equal to D_{R_i}. Also, let us define the groups in the k-th partitioning as D^{j,k}_{R_i}, j \in \{1,\dots,J\}. We can write the following constraints:

\sum_{k \in \{1,\dots,K\}} b_k = 1 \quad (2.6)

\bigcup_{j \in \{1,\dots,J\}} D^{j,k}_{R_i} = D_{R_i}, \quad \forall k \in \{1,\dots,K\} \quad (2.7)

Let us define \Gamma^{j,k}_i as the set of all paths (trees) that connect S_{R_i} to D^{j,k}_{R_i} over the inter-DC graph G. For every receiver d \in D_{R_i}, we can then define the following constraint to find \Gamma^i_d:

\Gamma^i_d = \bigcup_{k \in \{1,\dots,K\} \mid b_k = 1} \ \bigcup_{j \in \{1,\dots,J\} \mid d \in D^{j,k}_{R_i}} \Gamma^{j,k}_i \quad (2.8)

The demand constraint of Eq. 2.4 will then automatically take into account the distribution of traffic across all the paths (trees) that connect to any group of receivers.

Hard Deadline Constraint: A transfer R_i with a hard deadline must complete before its deadline. We can formulate this as an equality of demand over the timeslots prior to the transfer's deadline:

\sum_{t \le t^d_{R_i}} \sum_{\Gamma \in \Gamma^i_d} \omega f^i_\Gamma(t) = V_{R_i}, \quad \forall d \in D_{R_i} \quad (2.9)

The optimization problem with this constraint may become infeasible, which means the current parameters make it impossible to meet the given deadline. Accepting a transfer only after verifying such feasibility is referred to as admission control. In general, fast heuristics exist that allow quick infeasibility checks; however, if a problem is not deemed infeasible by such heuristics, that does not guarantee feasibility.

Soft Deadlines: A soft deadline can be formulated as part of the objective function. Although soft deadlines are not the focus of this dissertation, we provide a short overview of how they can be modeled here. In general, we can use a penalty function P(t) that determines the benefit obtained from completing the transfer at time t. In case the transfer is finished far too late, its value could be zero (or even negative, as it wastes bandwidth). Here, we define two different penalty functions, as shown in Figure 2.5. These functions are specified according to how the system should handle a deadline miss. A step function, for example, determines that we highly value meeting the deadline, but as soon as a deadline is missed, it does not matter how late we complete the transfer.

Figure 2.5: Some penalty functions, plotted as P(t) against time t relative to the deadline t^d_{R_i}.

We define a variable that determines how much traffic is delivered per timeslot for a transfer to a specific receiver:

\eta^i_t \triangleq \sum_{\Gamma \in \Gamma^i_d} \omega f^i_\Gamma(t), \quad \forall t, \ \forall i \in \{1,\dots,I\} \quad (2.10)

Using this new variable, we can define a system-wide penalty function that can be combined with the objective function in the optimization formulation:

\mathcal{P} \triangleq \sum_i \sum_t \eta^i_t P(t) \quad (2.11)

And the new objective function can be formulated as follows:

\max \ (U - \mathcal{P}) \quad (2.12)

Other Constraints: There are many basic constraints, such as the valid range of values for the variables. In this case, we have the following two:

x^\Gamma_e \in \{0, 1\}, \quad \forall \Gamma, \ \forall e \in E_G \quad (2.13)

0 \le f^i_\Gamma(t) \le \min_{e \in E_G \mid x^\Gamma_e = 1} B_e, \quad \forall \Gamma \quad (2.14)

Depending on the transfer arrival rate and patterns, this optimization model can become complex, with many variables (that is, due to the presence of binary or integer variables and non-linear constraints and objectives). Solving this optimization framework may be computationally expensive and slow given that it needs to be solved as new transfers arrive. In case transfers have hard deadlines, it may be necessary to admit new transfers only when their deadlines can be met, which essentially requires performing feasibility checks before finding an optimal solution.
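As an illustration of such a feasibility check, the sketch below (assuming the PuLP linear-programming library, a single shared link, one-second timeslots, and hypothetical transfer volumes and deadlines; it is a toy instance of the constraints above, not the formulation solved in this dissertation) tests whether two deadline transfers can both be completed in time:

import pulp

# Toy feasibility check: two transfers share one link of capacity 10 units/s,
# timeslots are one second wide, and each transfer must finish before its deadline
# (given as the first timeslot index it may no longer use).
slots = range(8)
transfers = {"A": {"volume": 30, "deadline": 5},
             "B": {"volume": 40, "deadline": 8}}
link_capacity = 10.0

prob = pulp.LpProblem("admission_check", pulp.LpMinimize)
rate = {(i, t): pulp.LpVariable(f"f_{i}_{t}", lowBound=0)
        for i in transfers for t in slots}
prob += pulp.lpSum(rate.values())          # objective is immaterial; we only test feasibility

for i, spec in transfers.items():
    # Demand must be fully delivered before the deadline (cf. Eq. 2.9).
    prob += pulp.lpSum(rate[i, t] for t in slots if t < spec["deadline"]) == spec["volume"]
    for t in slots:
        if t >= spec["deadline"]:
            prob += rate[i, t] == 0
for t in slots:
    # Shared-link capacity per timeslot (cf. Eq. 2.5).
    prob += pulp.lpSum(rate[i, t] for i in transfers) <= link_capacity

prob.solve(pulp.PULP_CBC_CMD(msg=False))
print(pulp.LpStatus[prob.status])          # "Optimal" here means both transfers are admissible

An infeasible status would indicate that the new transfer cannot be admitted without violating an existing deadline.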
To address the issue of complexity, throughout this dissertation we present, implement, and evaluate heuristics that help find quick solutions to different versions of this optimization framework.

Chapter 3

Adaptive Routing of Transfers over Inter-Datacenter Networks

Inter-DC networks carry traffic flows with highly variable sizes and different priority classes: long throughput-oriented flows and short latency-sensitive flows. While latency-sensitive flows are almost always scheduled on shortest paths to minimize end-to-end latency, long flows can be assigned to paths according to usage to maximize average network throughput. Long flows contribute huge volumes of traffic over the inter-DC WAN. The Flow Completion Time (FCT) is a vital network performance metric that affects the running time of distributed applications and users' quality of experience. Adaptive flow routing can improve the efficiency and performance of networks by assigning paths to new long flows according to network status and flow properties. We focus on single-path routing while aiming at minimizing completion times and bandwidth usage of internal flows.

In this chapter, we first discuss a popular adaptive approach widely used for traffic engineering that is based on the current bandwidth utilization of links. We propose an alternative that reduces bandwidth usage by up to at least 50% and flow completion times by up to at least 40% across various scheduling policies and flow size distributions. Next, we propose a routing approach that uses the remaining sizes and paths of all ongoing flows to minimize the worst-case completion time of incoming flows, assuming no knowledge of future flow arrivals. Our approach can be formulated as an NP-Hard graph optimization problem. We propose BWRH, a heuristic to quickly generate an approximate solution. We evaluate BWRH against several real WAN topologies and two different traffic patterns. We see that BWRH provides solutions with an average optimality gap of less than 0.25%. Furthermore, we show that compared to other popular routing heuristics, BWRH reduces the mean and tail FCT by up to 3.5x and 2x, respectively. We then present and evaluate an even faster heuristic called BWRHF, which is based on Dijkstra's shortest path algorithm. We perform extensive evaluations to compare BWRH and BWRHF and show that they offer relatively similar performance over multiple topologies, scheduling policies, and flow size distributions, despite BWRHF being considerably faster and more straightforward.

3.1 Background and Related Work

Although adaptive path selection can be formulated as an online optimization problem, such problems cannot be solved optimally due to the lack of knowledge about future flow arrivals. Alternatively, heuristic schemes can be used by considering a cost (distance) metric and selecting the minimum cost (shortest) path. A variety of metrics have been used for path selection over WAN, including static metrics such as hop count and interface bandwidth, and dynamic metrics such as end-to-end latency, which is a function of propagation and queuing latency, and current link bandwidth utilization [79,80]. In particular, bandwidth utilization has been extensively used by prior work over inter-DC networks [46,81,82]. Our understanding is that while these metrics are effective for routing of short flows, they are insufficient for improving the completion times of long flows, as we will demonstrate.
Over the inter-DC WAN, where end-points are managed by the organization that also controls the routing [2,42,43], one can use routing techniques that differentiate long flows from short flows and use flow properties obtained from applications, including flow size information, to reduce the completion times of long flows.

3.1.1 A Novel Metric for Adaptive Routing over WAN

We argue that while assigning paths to new flows, instead of focusing on current bandwidth utilization, one should consider utilization temporally and into the future, i.e., by counting the total outstanding bytes to be sent per link according to the paths assigned to flows and the total outstanding bytes per flow. We refer to this total number of remaining bytes per link as its load and use it as the cost metric. Compared to utilization, load offers more information about future usage of a link's bandwidth, which can help us perform more effective load balancing. Every time a flow is assigned to a path, the load variables associated with all edges of that path increase by its demand. Also, a link's load variable decreases continuously as flows on that link make progress. In addition, we evaluate two heuristics of selecting the path with the minimum value of maximum link cost and the minimum value of the sum of link costs, which we refer to as MINMAX() and MINSUM(), respectively. Although the former is frequently used in the literature [46,81,82], we find that the latter offers considerably better performance for the majority of traffic patterns and scheduling policies.

Figure 3.1: Performance of various cost metrics for path selection over Cogent WAN [1], with uniform capacity of 1 and a flow arrival rate of 1.0 (F, S, and M represent the FCFS, SRPT, and MMF scheduling policies, respectively); the simulation was repeated many times and the average was computed. The minimum was computed per column and per metric across all schemes in the column. MFCT and TFCT represent the mean and tail flow completion times, respectively.

3.2 Evaluation of Different Cost Metrics

We considered a large WAN called Cogent [1] with 197 nodes and 243 links, four flow demand distributions of light-tailed (Exponential distribution), heavy-tailed (Pareto distribution), Cache-Follower [12] and Hadoop [12] (the last two happen across Facebook datacenters), and a uniform capacity of 1.0 for all links.
A Poisson distribution with rate was used for ow arrivals. For all ow demand distributions, we assumed an average of 20 units and a maximum of 500 units. For heavy-tailed, we used a minimum demand of 2 units. We considered scheduling policies of First Come First Serve (FCFS), Shortest Remaining Processing Time (SRPT) and Fair Sharing using Max-Min Fairness (MMF). We considered three dierent cost metrics of \utilization", \load", and \load+demand" per link where demand represents the new ow's size in bytes. To measure a path's cost, we considered two cost functions of maximum which assigns any path the cost of its highest cost link 27 (used by MINMAX() heuristic), and sum which computes a path's cost by summing up costs of its links (used by MINSUM() heuristic). Combining these path cost functions with the three link cost metrics mentioned above, we obtain six dierent path selection schemes that select the path with minimum cost for a newly arriving ow. We also considered MinHop which selects a path with minimum hops per ow to compute lower bound of bandwidth usage. For minimum cost path selection, we used Dijkstra's algorithm in JGraphT library. We measured Mean and Tail Flow Completion Times (MFCT/TFCT) and total bandwidth as shown in Figure 3.1. Flow Completion Times (FCT): MINSUM(load) and MINSUM(load+demand) perform almost identically in completion times. The rest of schemes oer highly varying perfor- mance dictated by scheduling policy or trac pattern. Schemes based on utilization are at least 40% above the minimum for the majority of scenarios. Also, MINMAX(load) and MINMAX(load+demand) are more than 50% above the minimum in mean completion times for multiple scenarios. Overall, it can be seen that schemes based on \load" as link cost oer much better tail completion times (less than 10% away from minimum for majority of cases). Also, MINSUM(load+demand) oers the best mean completion times considering all scenarios. Total Bandwidth Usage: MINSUM(load+demand) oers the minimum extra bandwidth usage compared to MinHop which is below 20% at all times. Schemes based on MINMAX() consume at least 40% extra bandwidth. MINSUM(load) and MINSUM(utilization) use at least 10% more bandwidth at all times compared to MINSUM(load+demand) and at least 20% more bandwidth for the majority of scenarios. 3.3 Discussion and Analysis We see that MINSUM(load+demand) stays within 20% of minimum for all completion times and within 10% of minimum in the majority of cases. It oers the minimum bandwidth usage across all adaptive approaches (MinHop is static). With this cost metric, larger ows are most likely assigned shorter paths which allows for higher bandwidth savings (due to presence of \demand" as part of link cost) while shorter ows are assigned to paths with smaller total load which reduces completion times via load balancing. We believe MINSUM(load+demand) performs better than techniques based on MINMAX() since it considers total number of bytes that will eventually be scheduled on a path taking into account all edges and not just the highest loaded/utilized link. Our experiments have shown that MINSUM(load+demand) is 28 also an eective metric for selection of multicast forwarding trees that reduce completion times via load balancing [83,84]. It is also interesting to note that MINMAX(utilization), which is frequently used in trac engineering research, is far from the best solution for the majority of evaluated scenarios. 
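To illustrate how MINSUM(load+demand) path selection can be realized, the following Python sketch (using the networkx library and a hypothetical three-node topology; the evaluation above used Dijkstra's algorithm in JGraphT) assigns each link a cost equal to its outstanding load plus the new flow's demand and picks the path with the minimum total cost:

import networkx as nx

def pick_path(graph, src, dst, demand_bytes):
    # MINSUM(load+demand): each link costs its outstanding load plus the new
    # flow's size, and we pick the path minimizing the sum of link costs.
    def cost(u, v, attrs):
        return attrs.get("load", 0.0) + demand_bytes
    return nx.shortest_path(graph, src, dst, weight=cost)

def assign(graph, path, demand_bytes):
    # Once a path is chosen, the flow's demand is added to the load of its links;
    # these load variables would be decremented as the flow makes progress.
    for u, v in zip(path, path[1:]):
        graph[u][v]["load"] = graph[u][v].get("load", 0.0) + demand_bytes

g = nx.Graph()
g.add_edge("A", "B", load=5e9)
g.add_edge("B", "C", load=0.0)
g.add_edge("A", "C", load=8e9)
path = pick_path(g, "A", "C", demand_bytes=1e9)   # ['A', 'B', 'C'] for this example
assign(g, path, 1e9)

Because the new flow's demand is added once per link, longer paths are penalized in proportion to their hop count, which is what biases larger flows toward shorter paths while smaller flows are spread for load balancing.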
Centralized frameworks, such as SDN [76], are good candidates for realization of this scheme since they oer access to global view of network status and ow demands. 3.4 Best Worst-case Routing (BWR) Given the results of the experiments we performed above, it is obvious that current routing heuristics can be far from the optimal over dierent evaluation scenarios and for various performance metrics. Therefore, we revisit the well-known ow routing problem over inter- DC networks. As mentioned earlier, we focus on long ows which carry tremendous volumes of data over inter-DC networks [2,46]. They are usually generated as a result of replicating large objects such as search index les, virtual machine migration, and multimedia content. For instance, over Facebook's Express Backbone, about 80% of ows for cache applications take at least 10 seconds to complete [12]. Besides, the volume of inter-DC trac for repli- cation of content and data, which generates many long ows, has been growing at a fast pace [43]. In general, ows are generated by dierent applications at unknown times to move data across the datacenters. Therefore, we assume that ows can arrive at the inter-DC network at any time and no knowledge of future ow arrivals. Every ow is specied with a source, a destination, an arrival time, and its total volume of data. The Flow Completion Time (FCT) of a ow is the time from its arrival until its completion. We focus on minimizing the completion times of long ows which is a critical perfor- mance metric as it can signicantly aect the overall application performance or consid- erably improve users' quality of experience. For example, in cloud applications such as Hadoop, moving data faster across datacenters can reduce the overall data processing time. As another example, moving popular multimedia content quickly to a regional datacenter via replication allows improved user experience for many local users. To attain this goal, routing and scheduling need to be considered together which can lead to a complex discrete optimization problem. Here, we only address the routing problem, that is, choosing a xed path for an incoming ow given the network topology and the currently ongoing ows while making no assumptions on the trac scheduling policy. We focus on single path routing which mitigates the undesirable eects of packet reordering. 29 Assuming no knowledge of future ow arrivals and no constraints on the network trac scheduling policy, we propose to minimize the worst-case completion time of every incoming ow given the network topology, the currently ongoing ows' paths, and their remaining number of data units. For any given scheduling policy, we route the ows to minimize the worst-case ow completion time. We refer to this routing approach as the Best Worst-case Routing (BWR). 3.4.1 System Model We consider a general network topology with bidirectional links and equal capacity of one for all edges and assume an online scenario where ows arrive at unknown times in the future and are assigned a xed path as they arrive. Each ow is divided into many equal size pieces (e.g., IP datagrams) which we refer to as data units. We also assume knowledge of the ow size (i.e., number of a ow's data units) for the new ow and the remaining ow size for all ongoing ows. Given an index i, every ow F i is dened with a source s i , a destination t i , an arrival time i , and a total volume of dataV i . 
In addition, each flow is associated with a path P_i, a finish time (the timeslot in which its last data unit is delivered), and a completion time c_i equal to its finish time minus its arrival time. Finally, at any moment, the total number of remaining data units of F_i is V^r_i \le V_i.

Similar to multiple existing inter-DC networks [2,42,43], we assume the availability of logically centralized control over the network routing. A controller can maintain information on the currently ongoing long flows with their remaining data units and perform routing decisions for an incoming long flow upon arrival. We employ a slotted timeline model where at each timeslot a single data unit can traverse any path in the network. In other words, we assume zero propagation and queuing latency, which we justify by focusing only on long flows. Given this model, if multiple flows have a shared edge, only one of them can transmit during a timeslot. We say two data units are competing if they belong to flows that share a common edge. Depending on the scheduling policy that is used, these data units may be sent in different orders but never at the same time. Also, if two flows with pending data units use non-overlapping paths, they can transmit their data units at the same time if no other flow with a common edge with either one of these flows is transmitting at the same timeslot.

3.4.2 Definition of Best Worst-case Routing

We aim to reduce long flows' completion times with no assumption on the scheduling policy for transmission of data units. To achieve this goal, we propose the following routing technique, referred to as Best Worst-case Routing (BWR):

Problem 1. Given a network topology G(V,E) and the set of ongoing flows F = \{F_i, 1 \le i \le N\}, we want to assign a path P_{N+1} to the new flow F_{N+1} so that the worst-case completion time of F_{N+1}, i.e., \max(c_{N+1}), is minimized.

Assuming no knowledge of future flows and given the described network model, since only a single data unit can get through any edge per timeslot, the worst-case completion time of a flow happens when the data units of all the flows that share at least one edge with the new flow's path go sequentially and before the last data unit of the new flow is transmitted. Therefore, Problem 1 can be reduced to the following graph optimization problem, which aims to minimize the number of competing data units with F_{N+1}.

Problem 2. Given a network topology G(V,E) where every edge e \in E is associated with a set of flows F_e (that is, e \in P_i, \forall F_i \in F_e), the set of ongoing flows F = \{F_i, 1 \le i \le N\}, and an incoming flow F_{N+1}, we want to find a minimum weight path P_{N+1}, where the weight of any path P from s_{N+1} to t_{N+1} is computed as follows:

W_P = \sum_{\{ 1 \le i \le N \ \mid \ F_i \in \bigcup_{e \in P} F_e \}} V^r_i \quad (3.1)

Proposition 1. Assuming no knowledge of future flow arrivals, the path P_{N+1} selected by solving Problem 2 minimizes the worst-case completion time of F_{N+1} regardless of the scheduling policy used for transmission of data units.

Proof. P_{N+1} is chosen to minimize the maximum number of data units ahead of F_{N+1} given the knowledge of ongoing flows' remaining data units, which minimizes the worst-case finish time of F_{N+1}, that is, the maximum number of timeslots the last data unit of F_{N+1} has to wait before it can be sent. Since the arrival time of F_{N+1} is fixed, this minimizes \max(c_{N+1}).

Example: Consider the scenario shown in Figure 3.2. A new flow F_4 with 3 data units has arrived and has two options: sharing an edge with F_1, which has 4 remaining data units (path 1), or sharing edges with \{F_2, F_3\}, which have a total of 6 remaining data units (path 2). Our approach tries to minimize the worst-case completion time of F_4 given the ongoing flows. If path 1 is chosen, the worst-case completion time of F_4 will be 7, while with path 2 it will be 9; therefore, the logically centralized network controller will select path 1 for F_4. The worst-case completion times are not affected by the scheduling policy and are independent of it. Also, the fact that F_2 has three common edges with path 2 and F_3 has two common edges with path 2 does not affect the worst-case completion time of F_4 on path 2.

Figure 3.2: Example of routing a new flow F_4.

3.4.3 BWR Heuristic (BWRH)

The path weight assignment used in Problem 2 is not edge-decomposable. Finding a minimum weight path for F_{N+1} is NP-Hard and requires examining all paths from s_{N+1} to t_{N+1} (please see Appendix A for the proof). We propose a fast heuristic here, called BWRH, that finds an approximate solution to Problem 2.

Algorithm 1 shows our proposed approach to finding a path P_{N+1} for F_{N+1}. At every iteration, the algorithm finds the minimum weight path from s_{N+1} to t_{N+1} with at most K hops by computing the weight of every such path according to Eq. 3.1. The algorithm starts by searching all the minimum-hop paths from s_{N+1} to t_{N+1} and finding the weight of the minimum weight path among such paths. It then increases the maximum number of hops allowed (i.e., K) by one, extending the search space to more paths. This process continues until the weight of the minimum weight path with at most K hops is the same as with K - 1 hops, i.e., there is no gain from increasing the number of hops.

Algorithm 1: BWRH
Input: F_{N+1}, G(V,E), and P_i, V^r_i for 1 \le i \le N
Output: P_{N+1}
1: K <- number of hops on the minimum-hop path from s_{N+1} to t_{N+1}
2: W^K_{min} <- weight of the minimum weight path from s_{N+1} to t_{N+1} with at most K hops, found by examining all such paths
3: repeat
4:   K <- K + 1
5:   compute W^K_{min}
6: until W^K_{min} >= W^{K-1}_{min}
7: P_{N+1} <- the minimum weight path from s_{N+1} to t_{N+1} with at most K - 1 hops (if multiple minimum weight paths exist, choose the one with the fewest hops)

The termination condition used in BWRH may prevent us from searching long paths. Therefore, if the optimal path is considerably longer than the minimum-hop path, it is possible that the algorithm terminates before it reaches the optimal path. Let us call the optimal path P_o and the path selected by our heuristic P_h. The optimality gap, defined as (W_{P_h} - W_{P_o}) / W_{P_o}, is highly dependent on the number of remaining data units of ongoing flows. We find that the worst-case optimality gap can, in general, be unbounded. However, it is highly unlikely, in general, for the optimal path to be long, as having more edges increases the likelihood of sharing edges with more ongoing flows, which increases the weight of the path. We will later confirm this intuition through empirical evaluations and show that BWRH provides solutions with an average optimality gap of less than a quarter of a percent.

3.4.4 Application to Real Network Scenarios

We discuss how BWRH can be used to find a path for an incoming flow on a real network assuming a uniform link capacity. We can use the same topology as the actual topology as input to BWRH. Since we focus on long flows, for which the transmission time is significantly larger than both the propagation and queuing latency along existing paths, it is reasonable to ignore their effect in routing (hence the assumption that these values are zero in §3.4.1).
Next, assuming that all data units are of the same size, we can use the total number of remaining bytes per ongoing flow in place of the number of remaining data units, as it does not affect the selected path. In practice, some data units may be smaller than the underlying network's MTU, which, for long flows with many data units, has minimal effect on the selected path. Once BWRH selects a path, the network's forwarding state is updated accordingly to route the new flow's traffic, for example, using SDN [2,46].

In general, network traffic is a mix of short and long flows. Since our dissertation targets the long flows, routing of short flows will not be affected and could be done considering the propagation and queuing latency. Incoming long flows can be routed according to the knowledge of current long flows while ignoring the effect of short flows.

3.4.5 Evaluations

We considered two flow size distributions of light-tailed (Exponential) and heavy-tailed (Pareto) and considered Poisson flow arrivals with a configurable arrival rate. We also assumed a configurable average flow size in data units, with a maximum of 500 data units, along with a minimum size of 2 data units for the heavy-tailed distribution. We considered the scheduling policies of First Come First Serve (FCFS), Shortest Remaining Processing Time (SRPT), and Fair Sharing based on max-min fairness [85].

Topologies: We used GScale [2] with 12 nodes and 19 edges, AGIS [3] with 25 nodes and 30 edges, ANS [4] with 18 nodes and 25 edges, AT&T North America [5] with 25 nodes and 56 edges, and Cogent [1] with 197 nodes and 243 edges. We assumed bidirectional edges with a uniform capacity of 1 data unit per time unit for all of these topologies.

Schemes: We considered three schemes besides BWRH. The Shortest Path (Min-Hop) approach simply selects a fixed shortest-hop path from the source to the destination per flow. The Min-Max Utilization approach selects a path that has the minimum value of maximum utilization across all paths going from the source to the destination; this approach has been extensively used in the traffic engineering literature [46,79]. The Shortest Path (Random-Uniform) approach selects a path randomly, with equal probability, across all existing paths that are at most one hop longer than the shortest-hop path.

BWRH's Optimality Gap: In Figure 3.3, we compute the optimality gap of the solutions found by BWRH over three different topologies and under two traffic patterns. The optimal solution was computed by taking into account all existing paths and finding the minimum weight path on the topologies of GScale, AGIS, and ANS. We also implemented a custom branch and bound approach, which would require less computation time with a small number of ongoing flows (i.e., fewer than 20 in our setting) and an intractable amount of time for a large number of ongoing flows (i.e., more than 30 in our setting). According to the results, the average gap is less than 0.25% over all experiments. We could not perform this experiment on larger topologies as computing the optimal solution would take an intractable amount of time.

Figure 3.3: BWRH's optimality gap for an arrival rate of 10 and a mean flow size of 50 data units, computed over 1000 flow arrivals.

Effect of Scheduling Policies: In Figure 3.4, we fixed the flow arrival rate to 1 and the mean flow size to 50 and tried various scheduling policies under the four topologies of AT&T North America, Cogent, GScale, and ANS. All simulations were repeated 20 times, and the standard deviation for each instance has been reported. Each group of bars is normalized by its minimum value. We see that BWRH is consistently better than the other schemes regardless of the scheduling policy used. We can also see that, compared to each other, the performance of the other schemes varies considerably with the scheduling policy applied. To quantify, BWRH provides up to 3.5x and 2x better mean and tail completion times than the other schemes across all scenarios on average, respectively.

Figure 3.4: Online routing techniques by flow scheduling policy for an arrival rate of 1, a mean flow size of 50 data units, and various topologies, namely AT&T [5], Cogent [1], GScale [2], and ANS [4], over 500 time units. All simulations were repeated 20 times and the average results have been reported along with standard deviations.

Running Time: We implemented Algorithm 1 in Java using the JGraphT library. To exhaustively find all paths with at most K hops, we used the class AllDirectedPaths in JGraphT. We performed simulations while varying the arrival rate from 1 to 10 and the mean flow size from 5 to 50 over 1000 flow arrivals per experiment, which covers both lightly and heavily loaded regimes. We also experimented with all four topologies pointed to earlier, both traffic patterns of light-tailed and heavy-tailed, and all three scheduling policies of FCFS, SRPT, and Fair Sharing. The maximum running time of Algorithm 1 was 222.24 milliseconds, and the average of the maximum running times across all experiments was 27 milliseconds. This latency can be considered negligible given the time needed to complete long flows once they are routed.

3.5 A Faster BWR Heuristic (BWRHF)

In the previous section, we showed that even for large topologies, BWRH is a fast heuristic. Even so, the tail latency associated with finding a path can be hundreds of milliseconds. To be able to apply BWR to shorter flows, we propose a heuristic called BWRHF that runs much faster than BWRH, with the caveat that its solutions are on average farther from the optimal.
BWRHF is based on Dijkstra's algorithm and works by simply assigning weights to the edges of the inter-DC graph and selecting a minimum weight path. Despite its simplicity, empirical evaluations show its significant and consistent gains. Algorithm 2 shows our proposed approach to finding a path P_{N+1} for F_{N+1}. The coefficient \epsilon allows us to select the shortest-hop path in case there are multiple paths with the same weight.

Algorithm 2: BWRHF
Input: F_{N+1}, G(V,E), P_i and V^r_i for 1 \le i \le N, and 0 < \epsilon \ll 1
Output: P_{N+1}
1: Assign edge weights W_e = (\sum_{F_i \in F_e} V^r_i) + \epsilon, \forall e \in E
2: P_{N+1} <- a minimum weight (shortest) path from the source to the destination of F_{N+1}

We will find the worst-case optimality gap for BWRHF based on the number of data units of the flows already in the system. Without loss of generality, let us assume that the flows F_i, 1 \le i \le N, have been sorted by their remaining data units from the smallest (F_1) to the largest (F_N). Let us call the optimal path P_o and the path selected by our heuristic P_h.

Theorem 1. W_{P_h} / W_{P_o} \le (\sum_{1 \le i \le N} V^r_i) / V^r_1.

Proof. In case there exists a path with a weight of zero from s_{N+1} to t_{N+1}, Algorithm 2 and the optimal solution will both choose a path with a weight of zero. In case the weight of the optimal path is greater than zero, the quality of the paths selected by Algorithm 2 is highly correlated with the existing flows, their remaining data units and paths, and the network topology. We construct a simple example, as shown in Figure 3.5, that attains the worst-case optimality gap. There are two possible paths, P_1 and P_2, for F_{N+1}. Let us choose the number of intermediate nodes M on P_2 so that M > (\sum_{1 \le i \le N} V^r_i) / V^r_1. Clearly, from S to T, the optimal solution for Problem 2 is P_2, with a total weight of V^r_1. However, Algorithm 2 will choose P_1, with a total weight of \sum_{1 \le i \le N} V^r_i. This represents the worst case, as the weight of the optimal path is the minimum and the weight of the chosen path is the maximum.

Figure 3.5: Worst-case routing scenario.

The worst-case optimality gap is highly dependent on the remaining flow data units and can potentially be large. However, the worst-case scenario is highly specific. We will show, through experiments, that Algorithm 2 offers close-to-optimal solutions under different traffic patterns and network loads.

3.5.1 Evaluations

We performed extensive simulations to compare the two heuristics, BWRH and BWRHF, and an exact implementation of BWR using exhaustive search, i.e., by finding and evaluating all existing paths between the source and destination of every incoming flow. We used the same simulation parameters and topologies discussed in §3.4.5. We compared the earlier schemes with respect to network load and scheduling policies.

BWRHF's Performance by Network Load: In Figures 3.6 and 3.7, we explore the effect of load on the mean and tail completion times of the various schemes considering the fair sharing policy. We consider multiple topologies with different numbers of nodes and multiple degrees of connectivity. We see that regardless of the incoming load (i.e., for different values of the arrival rate), all schemes offer close performance values. The performance gap is affected by both topology and load. We see a negligible difference in performance under both the GScale and AT&T topologies. For the topologies of Cogent, AGIS, and ANS, we observe that performance differs by up to 35% across the schemes in a couple of cases. We also observe that, although more straightforward, BWRHF offers better completion times in almost all instances.
Knowing that BWR itself is a greedy online approach, this can be explained by noticing that making sub-optimal decisions for new ows as they arrive (i.e., the case for BWRHF), can help future ows perform better in many cases. Since we evaluate the performance by looking at system-wide metrics (i.e., mean and tail ow completion times), it is reasonable to make sub-optimal decisions for routing of a new ow 38 upon its arrival if that potentially helps the future ows, which we are unaware of, perform better and hence give us a better system-wide performance. For example, while the exact BWR implementation might choose a long path with minimum outstanding data units for a new ow, doing so might consume considerable network capacity due to many edges. Selecting a shorter path with marginally more data units can save more network bandwidth over extended periods and allow future ows to complete faster. Besides, it should be noted that the approach we took in Eq 3.1 for computing the worst-case completion time of a new ow may overshoot, that is, the worst-case may be larger than necessary. This could happen as edge-disjoint ows that intersect with a path for the new ow may be able to transmit their data units in parallel. Computing tighter bounds on the worst-case, however, requires taking into account the dependencies of current ows and so can be computationally intensive in general. BWRHF's Performance by Scheduling Policy: In Figures 3.8 and 3.9, we explore the eect of scheduling policies of SRPT and FCFS on the mean and tail completion times of various schemes. 2 Again, we observe that the straightforward heuristic of BWRHF performs well compared to BWRH and the exact BWR implementation. We also see that under the heavy-tailed distribution of ow sizes, the eect of scheduling policies is more obvious. We see little dierence in the performance of dierent schemes over all the topologies given dierent scheduling policies. In most cases, we see that BWRHF performs little better (i.e., up to 10%) thanBWRH. For a few scenarios, BWRHF performs little worse (i.e., up to 15%). The same two arguments discussed in the eect of network load above also applies to why this may be the case. In Figures 3.10, 3.11 and 3.12, we compare BWRH and BWRHF with two other schemes of path selection that we earlier used inx3.4.5. We observe that for multiple scheduling policies, ow size distributions, and topologies, the two heuristics of BWRH and BWRHF perform almost equally well and better than the other schemes, i.e., up to 2:6 and 2:1 better in mean and tail completion times, respectively. BWRHF's Optimality Gap: In Figure 3.13, we compute the optimality gap of solutions found byBWRHF over three dierent topologies and under two trac patterns. The optimal solution was computed by taking into account all existing paths and nding the minimum weight path on topologies of GScale, AGIS, and ANS. We also implemented a custom branch and bound approach which would require less computation time with a small number of ongoing ows (i.e.,< 20 in our setting) and an intractable amount of time for a large number of ongoing ows (i.e., > 30 in our setting). According to the results, while the optimality 2 The eect of the fair sharing policy was already discussed in Figures 3.6 and 3.7. 
39 = 0.1 = 1 0 0.5 1 Mean FCT (Normalized) GScale = 0.1 = 1 0 0.5 1 AGIS = 0.1 = 1 0 0.5 1 ANS = 0.1 = 1 0 0.5 1 Tail FCT (Normalized) GScale = 0.1 = 1 0 0.5 1 AGIS = 0.1 = 1 0 0.5 1 ANS BWRHF BWRH BWR (Exhaustive Search) (a) Light-tailed Trac = 0.1 = 1 0 0.5 1 Mean FCT (Normalized) GScale = 0.1 = 1 0 0.5 1 AGIS = 0.1 = 1 0 0.5 1 ANS = 0.1 = 1 0 0.5 1 Tail FCT (Normalized) GScale = 0.1 = 1 0 0.5 1 AGIS = 0.1 = 1 0 0.5 1 ANS (b) Heavy-tailed Trac Figure 3.6: Comparison of mean and tail ow completion times for the three implemen- tations of BWR for the three topologies of GScale [2], AGIS [3] and ANS [4]. Exhaustive search nds all possible paths between the end-points and then nds a minimum weight path. We considered = 50 data units and performed the simulation over 500 time units. All simulations were repeated 20 times and the average results have been reported. We applied the Fair Sharing policy based on max min fairness which is most widely used. 40 = 0.1 = 1 0 0.5 1 Mean FCT (Normalized) AT&T = 0.1 = 1 0 0.5 1 COGENT = 0.1 = 1 0 0.5 1 Tail FCT (Normalized) AT&T = 0.1 = 1 0 0.5 1 COGENT BWRHF BWRH (a) Light-tailed Trac = 0.1 = 1 0 0.5 1 Mean FCT (Normalized) AT&T = 0.1 = 1 0 0.5 1 COGENT = 0.1 = 1 0 0.5 1 Tail FCT (Normalized) AT&T = 0.1 = 1 0 0.5 1 COGENT (b) Heavy-tailed Trac Figure 3.7: Comparison of mean and tail ow completion times for the three implementa- tions of BWR over two large topologies of AT&T [5] and Cogent [1]. We excluded exhaustive search as it would take intractable amount of time for the topologies considered here. We considered = 50 data units and performed the simulation over 500 time units. All simu- lations were repeated 20 times and the average results have been reported. We also applied the Fair Sharing scheduling policy based on max min fairness which is most widely used. 41 FCFS SRPT 0 0.5 1 Mean FCT (Normalized) GScale FCFS SRPT 0 0.5 1 AGIS FCFS SRPT 0 0.5 1 ANS FCFS SRPT 0 0.5 1 Tail FCT (Normalized) GScale FCFS SRPT 0 0.5 1 AGIS FCFS SRPT 0 0.5 1 ANS BWRHF BWRH BWR (Exhaustive Search) (a) Light-tailed Trac FCFS SRPT 0 0.5 1 Mean FCT (Normalized) GScale FCFS SRPT 0 0.5 1 AGIS FCFS SRPT 0 0.5 1 ANS FCFS SRPT 0 0.5 1 Tail FCT (Normalized) GScale FCFS SRPT 0 0.5 1 AGIS FCFS SRPT 0 0.5 1 ANS (b) Heavy-tailed Trac Figure 3.8: Comparison of mean and tail ow completion times for the three implemen- tations of BWR for the three topologies of GScale [2], AGIS [3] and ANS [4]. Exhaustive search nds all possible paths between the end-points and then nds a minimum weight path. We considered = 1 and = 50 data units and performed the simulation over 500 time units. All simulations were repeated 20 times and the average results are reported. 42 FCFS SRPT 0 0.5 1 Mean FCT (Normalized) AT&T FCFS SRPT 0 0.5 1 COGENT FCFS SRPT 0 0.5 1 Tail FCT (Normalized) AT&T FCFS SRPT 0 0.5 1 COGENT BWRHF BWRH (a) Light-tailed Trac FCFS SRPT 0 0.5 1 Mean FCT (Normalized) AT&T FCFS SRPT 0 0.5 1 COGENT FCFS SRPT 0 0.5 1 Tail FCT (Normalized) AT&T FCFS SRPT 0 0.5 1 COGENT (b) Heavy-tailed Trac Figure 3.9: Comparison of mean and tail ow completion times for the three implementa- tions of BWR over two large topologies of AT&T [5] and Cogent [1]. We excluded exhaustive search as it would take intractable amount of time for the topologies considered here. We considered = 1 and = 50 data units and performed the simulation over 500 time units. All simulations were repeated 20 times and the average results have been reported. 
Figure 3.10: Online routing techniques by flow scheduling policy over the AT&T [5] and Cogent [1] topologies, assuming an arrival rate of 1 and a flow size parameter of 50, over 500 time units. All simulations were repeated 20 times and the average results are reported along with standard deviations.

Figure 3.11: Online routing techniques by flow scheduling policy over the GScale [2] and ANS [4] topologies, assuming an arrival rate of 1 and a flow size parameter of 50, over 500 time units. All simulations were repeated 20 times and the average results are reported along with standard deviations.

Figure 3.12: Online routing techniques by flow scheduling policy over the AGIS [3] topology, assuming an arrival rate of 1 and a flow size parameter of 50, over 500 time units. All simulations were repeated 20 times and the average results are reported along with standard deviations.

Running Time: BWRHF aims to find one minimum weight path using Dijkstra's algorithm, which is on average much less computationally intensive than BWRH. We implemented Algorithm 2 in Java using the JGraphT library. We performed simulations while varying the arrival rate from 1 to 10 and the flow size parameter from 5 to 50 over 1000 flow arrivals per experiment, which covers both lightly and heavily loaded regimes. We also experimented with all four topologies mentioned earlier, both traffic patterns of light-tailed and heavy-tailed, and all three scheduling policies of FCFS, SRPT, and Fair Sharing. The maximum running time of Algorithm 1 was 17.88 milliseconds, and the average of the maximum running time across all experiments was 1.38 milliseconds. This latency is about 10× less than what was observed for BWRH under identical circumstances.

Figure 3.13: BWRHF's optimality gap for an arrival rate of 10 and a flow size parameter of 50, computed for 1000 flow arrivals.
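To make this concrete, the short sketch below shows one way a BWRHF-style path selection could be implemented: each edge is weighted by the data units still outstanding on it plus the new flow's size, and Dijkstra's algorithm returns the minimum weight path. The edge-decomposable weighting and all names here are assumptions of the sketch; this is not the dissertation's Java/JGraphT implementation.

# Minimal sketch (not the dissertation's code): pick a path for a new flow by
# running Dijkstra over edge weights that reflect outstanding traffic.
# 'outstanding' maps each edge to the data units still scheduled on it.
import networkx as nx

def bwrhf_style_path(graph: nx.Graph, outstanding: dict, src, dst, flow_size: int):
    # Weight every edge by the load already on it plus the new flow's size.
    for u, v in graph.edges():
        load = outstanding.get((u, v), 0) + outstanding.get((v, u), 0)
        graph[u][v]["weight"] = load + flow_size
    # Minimum-weight path under these weights (Dijkstra).
    return nx.shortest_path(graph, src, dst, weight="weight")

# Example: a 4-node ring where one edge is heavily loaded.
G = nx.cycle_graph(4)
path = bwrhf_style_path(G, {(0, 1): 100}, 0, 1, flow_size=10)
print(path)  # expected to avoid the loaded edge (0, 1), e.g. [0, 3, 2, 1]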
3.6 Conclusions

In this chapter, we explored a variety of routing heuristics and showed that current routing techniques fall short in reducing the completion times of inter-DC transfers, even when compared to several simple routing heuristics that we devised. We then presented a new technique for routing based on flow size information, called Best Worst-case Routing (BWR), to reduce flow completion times. Accordingly, the online routing problem turns into finding a minimum weight path on the topology from the source to the destination, where the weight is computed by summing up the number of remaining data units of all the flows that have a common edge with the path. Since this is a hard problem, we developed two fast heuristics with small average optimality gaps. We also discussed how information from a real network scenario could be used as input to our network model to find a path on an actual inter-DC network for an incoming flow.

Chapter 4
Fast Deadline-based Admission Control for Inter-DC Transfers

We consider the problem of admission control for point to point inter-DC transfers with deadlines. As the total capacity of inter-DC networks is limited, the purpose of admission control is to only accept new transfers when we can complete them prior to their deadlines while meeting the deadlines of all other transfers already in the system. To achieve this, traffic scheduling is needed for future timeslots, because by focusing only on the current timeslot we cannot guarantee that admitted transfers will finish before their deadlines [46]. Besides, any algorithm used to perform such inter-DC admission control should maximize the transfer admission rate and make efficient use of existing network resources. Speed in processing new transfer requests is another requirement, because for large scale applications with millions of users, a large number of transfers may have to be processed and allocated every minute. In this chapter, we propose and discuss a new scheduling policy called As Late As Possible (ALAP) scheduling and combine it with a novel routing policy to perform fast and effective admission control.

4.1 Background and Related Work

There is considerable work on maximizing the number of deadline-meeting flows for traffic inside datacenters. These approaches, however, do not perform admission control, which leads to wasted bandwidth. In [49,86], the authors propose deadline-aware transport protocols which increase the number of transfers that complete prior to their assigned deadlines by adjusting the transmission rate of such transfers based on their deadlines. Also, multiple previous studies have focused on improving the efficiency and performance of inter-DC communications through proper scheduling of transfers. In [46], the authors propose TEMPUS, which improves fairness by maximizing the minimum portion of transfers delivered to destination before the transfer deadlines. TEMPUS cannot guarantee that admitted transfers are completed prior to their deadlines. In [55], the authors propose Deadline-based Network Abstraction (DNA), which allows tenants to specify deadlines for transfers, and a system called Amoeba which performs admission control for new inter-DC transfers. When a request is submitted, Amoeba formulates an optimization scenario, performs feasibility checks, and decides whether the new request can be satisfied using available resources.
If a transfer cannot be completed prior to its deadline, Amoeba tries to reschedule a subset of previously admitted requests to push their traffic out of the new request's timeline. The admission process is performed on a first-come-first-served (FCFS) basis and requests are not preempted, that is, the system does not drop a previously admitted request, as this can lead to thrashing.

4.2 Fast Admission Control on a Network Path

We discuss a new scheduling approach that allows fast admission control over a single network path. We will extend this idea to general networks in the next section.

4.2.1 System Model

In this section, we consider a simple topology where multiple transfers are scheduled over the same path. We will use the same notation as that in Table 2.2. Assume we are allocating traffic for a timeline starting at t_now, representing the current time, and ending at t_end, which corresponds to the latest deadline over all submitted requests. New requests may be submitted to the scheduler at any time. Every request R_i is identified with two parameters V_{R_i} and t^d_{R_i}, representing the request size and deadline, respectively. Since all requests are scheduled over the same path, they all have the same source and destination. Requests are instantly allocated upon arrival over timeslots for which t > t_now. We consider a TES that receives the inter-DC transfer requests and decides whether they can be admitted. If yes, the TES also has to compute a transmission schedule which determines the rate at which the source node should send packets associated with every transfer per timeslot.

4.2.2 Currently Used Approach

To perform admission control, one can formulate and solve a linear program (LP) involving all current transfers and the new transfer, with demand and capacity constraints populated based on link capacities (for the links on the path) and request volumes. We can then attempt to solve this LP. If this LP is feasible, then the transfer can be admitted. This LP has to be solved every time a new request is submitted and can result in changing the allocation of already scheduled requests. The problem with this approach is its high complexity (solving possibly large LPs over and over is computationally inefficient) as the frequency of arrivals increases.

4.2.3 As Late As Possible (ALAP) Scheduling

We propose As Late As Possible (ALAP) scheduling [87], a fast traffic allocation technique that minimizes the time required to perform the admission process. It avoids rescheduling already admitted requests in order to quickly decide whether a new request can be admitted. It also achieves high utilization and can efficiently use network resources. We present the rules based on which ALAP operates:

Rule 1: Similar to previous schemes [55], preemption is not supported. Preempting a request that is partly transmitted is wasteful. Also, it may result in thrashing if requests are consecutively preempted in favor of future requests.

Rule 2: To be fast, ALAP does not change the allocation of already allocated traffic unless there is leftover bandwidth in the current timeslot (t_now). In that case, it fetches traffic from the earliest timeslot that is not empty and sends it. This is done until either we fully utilize the current timeslot or there is no more traffic to send.

When a new transfer R_new is submitted, ALAP creates a small LP, involving only the new request, to schedule it. The number of variables in this LP is (t^d_{R_new} − t_now).
Assume the amount of bandwidth allocated to the new transfer at time t is f_P^new(t), and C·B(t) is the residual capacity on the path at timeslot t, assuming a path capacity of C = min_{e∈P}(C_e) and available bandwidth of B(t) = min_{e∈P}(B_e(t)) for all t, where P is the path on which we perform admission control. We use the LP of Equations 4.2–4.4, with the objective function of Equation 4.1, to do the allocation. If this LP does not yield a feasible solution, we reject the request.

U(R_new) ≜ Σ_{t = t_now+1}^{t^d_{R_new}} t · f_P^new(t)    (4.1)

max U(R_new)    (4.2)

Σ_{t = t_now+1}^{t^d_{R_new}} f_P^new(t) = V_{R_new}    (4.3)

0 ≤ f_P^new(t) ≤ C·B(t),  t_now < t ≤ t^d_{R_new}    (4.4)

Now consider a scenario where transfers R_1, R_2, ..., R_K arrive at the network in order to be allocated on path P. We show that upon arrival of R_k, 1 ≤ k ≤ K, the ALAP allocation of previously admitted requests is such that we cannot increase the chance of admission for R_k by rearranging the allocation of already allocated requests (the previous k − 1 requests). Recall that the deadline of R_k is denoted t^d_{R_k} and that at any time t, the latest deadline of all admitted requests is t_end.

Theorem 1. If we draw a vertical line at time t^d_{R_k} ≤ t_end in our traffic allocation, it is not possible to increase the free space behind the line by moving traffic from the left side of the line (t ≤ t^d_{R_k}) to the right side (t_end ≥ t > t^d_{R_k}).

Proof. Let us assume we have the allocation shown in Figure 4.1 for the first k − 1 requests on path P. To schedule all requests, we used the utility function of Equation 4.1, which assigns a higher weight to later timeslots. Let us assume that we can move some traffic volume from the left side to the right side. If so, this volume belongs to at least one of the admitted requests, which means we would be able to further increase the utility for that request. This is not possible because the LP of Equations 4.2–4.4 gives the maximum utility. That means if we were to move traffic from the left side of the line to the right side, it would result in either a violation of link capacity constraints or a violation of deadline constraints.

Figure 4.1: A traffic allocation used in the proof of Theorem 1.

Now let us assume a new transfer arrives. If it can be allocated using the residual link capacity on all the links of path P, then we can admit it. If not, based on Theorem 1, there is no way we can shift already allocated traffic to accommodate the new transfer. Since every new transfer is scheduled as close as possible to its deadline and its traffic cannot be pushed any closer to the deadline, we refer to this policy as As Late As Possible (ALAP) scheduling.

Figure 4.2: An example of ALAP allocation.

Figure 4.2 provides an example of the ALAP allocation technique. As can be seen, when the first transfer is received, the timeline is empty and therefore it is allocated adjacent to its deadline. The second transfer is allocated as close as possible to its deadline. The implication of this type of scheduling is that requests do not use resources until absolutely necessary. This means resources will be available to other requests that may currently need them. When the third transfer arrives, resources are free and it just grabs as much bandwidth as needed. If we had allocated the first two requests closer to the current time, we may have had to either reject the third transfer or move the first two transfers ahead, freeing resources for the third transfer (which would have required rescheduling).
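As a concrete illustration, the sketch below allocates a new request backward from its deadline over a single path, which yields the as-late-as-possible schedule that the LP of Equations 4.1–4.4 produces whenever it is feasible. The data structures and names are illustrative assumptions, not the dissertation's implementation, and the timeslot length is taken as 1 for simplicity.

# Minimal sketch (illustrative): ALAP allocation of a new request on one path.
# 'residual[t]' is the residual capacity of the path at future timeslot t
# (the minimum residual capacity over its links).
def alap_allocate(residual: dict, volume: float, t_now: int, deadline: int):
    allocation = {}  # timeslot -> rate allocated to the new request
    remaining = volume
    t = deadline
    # Fill timeslots starting at the deadline and moving backward.
    while remaining > 0 and t > t_now:
        rate = min(residual[t], remaining)
        if rate > 0:
            allocation[t] = rate
            remaining -= rate
        t -= 1
    if remaining > 0:
        return None  # infeasible: reject the request
    return allocation

# Example: capacity 1 per timeslot, 2.5 units due by timeslot 5.
residual = {t: 1.0 for t in range(1, 6)}
print(alap_allocate(residual, 2.5, t_now=0, deadline=5))
# {5: 1.0, 4: 1.0, 3: 0.5} -> traffic sits as close to the deadline as possible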
4.2.4 Simulation Results

We compare the performance and speed of ALAP with Amoeba [55]. Other schemes, such as [44,46], are deadline-agnostic and have an effective link utilization of less than 50% [55]. Amoeba, on the other hand, only accepts requests when it can guarantee that the deadline can be fully met.

Figure 4.3: Comparison between Amoeba and ALAP scheduling.

Setup: We consider a topology with multiple equal-capacity links with a capacity of 1, attached in a line, and traffic is transmitted from one end to the other. We assume that high priority traffic (e.g., user generated, real-time, etc.) takes a fixed amount of bandwidth and allocate the leftover among inter-DC transfer requests. Simulation is performed for 576 timeslots, each lasting 5 minutes, which is equal to 2 days. We performed the simulations three times and calculated the average.

Metrics: The fraction of inter-DC transfer requests that were rejected, average link utilization, and average allocation time, in timeslots, per request are the three metrics measured and presented.

Workload: We generate inter-DC transfers according to a Poisson distribution of rate 1 to 8 request(s) per timeslot. The difference between the arrival time of requests and their deadlines follows an exponential distribution with an average of 12 timeslots. In addition, the demand of each request also follows an exponential distribution with an average of 0.286 (a maximum of 1 unit of traffic can be sent in each timeslot on every link).

Figure 4.3 shows the aforementioned simulation metrics for both Amoeba and ALAP. As can be seen, both algorithms result in similar rejection rates (and so admission rates) and utilization. However, ALAP achieves the same performance metrics with much less complexity: ALAP is up to 15× faster than Amoeba. Also, the complexity of ALAP grows slowly as the frequency of arrivals increases, i.e., up to 1.6× while arrivals increase by a factor of 8. With regards to the trend in time complexity shown in Figure 4.3, when the request arrival rate is small, most of the capacity is left unused. Therefore, Amoeba does not have to move already allocated requests to push in a new one. As the arrival rate increases, we see higher utilization. Starting at an arrival rate of 4, utilization grows close to 1 and we see a large jump in the time complexity of Amoeba (by a factor of 3.7). That is because Amoeba has to move around multiple already allocated requests to push in the new request. For an arrival rate of 8 requests per timeslot, we see that both algorithms drop almost half of the requests. This can happen as a result of capacity loss in the network, for example, when a datacenter is connected using only two links and one of them fails for a few timeslots. While Amoeba can get really slow, ALAP scheduling is able to handle such situations almost as fast as when there is low link utilization.

4.3 Application of ALAP over General Network Topologies

We empirically showed that ALAP can speed up the allocation process by allowing new transfers to be scheduled considering only the residual bandwidth on the edges of a path P, which results in the creation of much smaller LPs.
In this section, we consider the routing problem in addition to the ALAP scheduling policy for admission control over a general network. We focus on single path routing and develop a solution called DCRoute [88].

Minimizing Packet Reordering: Avoiding packet reordering allows data to be instantly delivered to applications upon arrival of packets. In addition, inter-DC networks have characteristics similar to WAN networks (including asymmetric link delays and large delays for links that connect distant locations), for which multiplexing packets over different paths has been shown to considerably degrade TCP performance [89]. Putting out-of-order packets and segments back in order can be expensive in terms of memory and CPU usage, especially when transmitting at high rates.

Admission Control over General Networks: In contrast to routing over a single path, over a network each request is routed on multiple links and there are many ways to schedule requests ALAP. If some links are used by multiple requests routed on different edges, how traffic is allocated on common links can affect multiple other links, which in turn affects the requests that use those links later on. We propose a routing heuristic that allows us to select a least loaded path for a new request, over which we then attempt to allocate it. We will show that using the ALAP scheduling policy, we can greatly speed up the allocation process while sacrificing negligible performance.

Table 4.1: Variables used in this chapter in addition to those in Table 2.2
Variable: Definition
L_e(t): Total load currently scheduled on edge e prior to and including timeslot t
L_e: Total load currently scheduled on edge e (same as L_e(t_now))
V^r_{R_i}: Current residual demand of request R_i

4.3.1 System Model

At any given moment, we have two parameters t_now and t_end, which represent the current timeslot and the latest deadline among all current transfers, respectively. A request arriving sometime in timeslot t can be allocated starting at timeslot t + 1, since the schedule and transmission rate for the current timeslot are already decided and broadcast to all senders. Also, at any moment t, t_now is the timeslot that includes t (current timeslot), and t_now + 1 is the next available timeslot for allocation (next timeslot). A request is considered active if it is admitted into the system and its deadline has not passed yet. Some active requests may take many timeslots to complete transmission. The total unsatisfied demand of an active request is called the residual demand of that request. We will use the same notation as that in Table 2.2, along with some additional variables defined in this section as shown in Table 4.1.

Definition of Edge Load L_e and L_e(t): We define a new metric called edge load, which captures the total remaining volume of traffic per edge for all the transfers that share that edge. This metric provides a measure of how busy a link is expected to be on average over future timeslots. L_e(t) is the total volume of traffic scheduled on an edge prior to and including timeslot t. L_e can then be written as L_e(t_end). For a new request with a deadline of t^d_{R_i}, it only makes sense to consider the traffic scheduled on edges prior to t^d_{R_i}, i.e., L_e(t^d_{R_i}), ∀e ∈ E_G is the metric we will use to select a path. Upon arrival of a transfer request, a central controller decides whether it is possible to allocate it, considering criteria that include the total available bandwidth over future timeslots. If there is not enough room to allocate a request, the request is rejected and can be resubmitted to the system later with a new deadline.
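To make this bookkeeping concrete, the sketch below shows one simple way to maintain per-edge allocations over future timeslots so that L_e(t) and the residual capacity B_e(t) can be queried quickly. The structure and names are illustrative assumptions rather than DCRoute's actual data structures.

# Minimal sketch (illustrative): per-edge schedule bookkeeping for L_e(t) and B_e(t).
class EdgeSchedule:
    def __init__(self, capacity: float, horizon: int):
        self.capacity = capacity
        # allocated[t] = total traffic scheduled on this edge at future timeslot t
        self.allocated = [0.0] * (horizon + 1)

    def residual(self, t: int) -> float:          # B_e(t)
        return self.capacity - self.allocated[t]

    def load_up_to(self, t: int) -> float:        # L_e(t): load prior to and including t
        return sum(self.allocated[1:t + 1])

    def add(self, t: int, amount: float):
        self.allocated[t] += amount

# Example: an edge of capacity 1 with traffic scheduled at timeslots 3 and 4.
e = EdgeSchedule(capacity=1.0, horizon=10)
e.add(3, 0.5)
e.add(4, 1.0)
print(e.load_up_to(4), e.residual(4))  # 1.5 0.0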
Allocation Problem: Given active requests R_1 through R_n with residual demands V^r_{R_1} to V^r_{R_n} (0 < V^r_{R_i}, 1 ≤ i ≤ n), is it possible to allocate a new request R_{n+1}? If yes, we want to find a valid path over the inter-DC network and a transmission schedule that respects capacity and deadline constraints.

There are many ways to formulate this as an optimization problem. We can solve this allocation problem by forming a linear program (LP) considering capacity constraints of the network edges as well as demand constraints of requests, while considering a subset of the available paths between the source and the destination of the new request. We can also formulate an edge-based optimization problem that automatically considers all possible paths. These formulations, however, do not honor the single path routing constraint that we impose to minimize packet reordering. Adding the single path constraint turns this into a Mixed Integer Linear Program (MILP), which is in general NP-Hard. If the constructed LP (or MILP) is feasible, the solution gives us a possible allocation. Although this approach may be straightforward, considering the number of active requests, the number of links in the network graph, and how far we are planning ahead into the future due to deadlines (t_end), the resulting LP (or MILP) could be large and may take a long time to solve. One way to speed up this process is to limit the number of possible paths between every pair of nodes [46], for example, by using only the K-Shortest Paths [55]. Another method to speed up is to limit the number of considered active requests based on some criterion [55], such as having a common edge with the new request on their paths (if we know what path or potential paths we will assign to the new request). It is also possible to use custom iterative methods to solve the resulting LP models faster based on the solutions of previous LP models, in a way similar to the water filling process [46].

4.3.2 Network-wide ALAP Scheduling

We avoid creating an LP model by employing a fast routing heuristic that selects a path according to the total load scheduled on network edges, and by allocating new requests knowing only the residual bandwidth on the edges over different timeslots. DCRoute relies on the following three rules.

Rule 1: A path P_i is selected for every request R_i upon its arrival based on the total outstanding load on the edges of the candidate paths.

Rule 2: R_i is initially allocated according to the ALAP policy on P_i.

Rule 3: If the upcoming timeslot is underutilized, network utilization is maximized by pulling traffic from the closest timeslots in the future.

Pulling traffic from the closest future timeslots to maximize utilization allows the ALAP property of the allocation to hold afterwards. That is, all residual demands will still be allocated as close to their deadlines as possible.

Figure 4.4: An example of improving utilization (i.e., PullBack phase) while keeping the final allocation ALAP (i.e., PushForward phase).
Over a network, however, requests with different paths could have common edges, which can create complex dependencies that prevent us from pulling traffic from the earliest timeslots with non-zero allocation. This means that, to maximize utilization, we may have to pull traffic from later timeslots, which might render the resulting allocation non-ALAP. To fix this, we add a procedure that runs afterwards, scans the timeline, and pushes the allocation forward as much as possible to make it ALAP again. Figure 4.4 shows an example of this process. There are three different requests, all of which have the same deadline. It is not possible to pull back the green request as link E1 is already occupied. Therefore, we have to pull the orange request (PullBack phase). Afterward, the allocation is no longer ALAP, so we push the green request toward its deadline (PushForward phase). The final assignment is ALAP, and the utilization of the upcoming timeslot is maximum.

4.3.3 Load-based Dynamic Routing

The next part is assigning paths to new transfers as they arrive. A transfer from any source to any destination can generally be routed over many paths. To avoid packet reordering, we limit the number of paths per transfer to 1. In general, one can assign static paths to every new request given the source and the destination, just like the K-Shortest Paths approach. However, as we will demonstrate, it is better to assign paths to new requests according to their sizes. To understand why this is important, we created the example of Figure 4.5. By assigning shorter paths to larger transfers, the total capacity usage decreases across the network, leaving more room for future requests on average. Such savings can pile up over time with the arrival of many transfers. This is especially important if transfer sizes are skewed, which is what a study from Facebook confirms [12].
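The saving can be quantified with a quick calculation: on equal-capacity links, a transfer of volume V routed over a path with h hops consumes V × h units of capacity in total, so assigning the largest volumes to the shortest candidate paths minimizes the aggregate usage. The numbers below are hypothetical and only illustrate the arithmetic; they are not taken from Figure 4.5.

# Illustrative arithmetic (hypothetical volumes and hop counts): total capacity
# usage is the sum over transfers of volume x number of hops on the assigned path.
def total_usage(volumes, hops):
    return sum(v * h for v, h in zip(volumes, hops))

volumes = [10, 5, 2, 1]                    # transfer sizes (largest first)
print(total_usage(volumes, [1, 2, 3, 4]))  # large transfers on short paths -> 30
print(total_usage(volumes, [4, 3, 2, 1]))  # large transfers on long paths  -> 60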
Next, we would like the routing to assign different transfers to different paths as much as possible to balance load across the network. If all transfers are assigned to the shortest path, it will be overloaded and slowed down while there is leftover capacity over some longer paths. As a result, we need a path selection technique that takes into account both load balancing and path assignment according to volumes. A well-established technique is to assign available paths some cost, calculated as a function of transfer size and path properties, and select the path with minimum cost. We propose a straightforward cost assignment scheme that meets the stated criteria and is quick to compute.

Routing Cost: Given a new transfer R_new and a set of available paths P, for every path P ∈ P the path cost C_P(t) is defined as the total outstanding load prior to t, calculated by summing up the total load scheduled on P prior to t if R_new were to be put on P, considering the new transfer's size V_{R_new}. Let us assume a graph G(V, E) connecting datacenters with bidirectional links of equal capacity, for simplicity. We have variables L_e(t) that represent the total sum of traffic volume scheduled over edge e from time t_now + 1 to t (total load that is scheduled but not yet sent prior to time t). The value of L_e(t) depends on transfer arrivals. As new transfers arrive this value increases on some edges, and as we send traffic over time, this value decreases. With this notation, the cost assigned to path P with |P| edges, given transfer R_new with deadline t^d_{R_new} and volume V_{R_new}, will be:

C_P(t) = Σ_{e∈P} L_e(t) + V_{R_new}·|P| = Σ_{e∈P} (L_e(t) + V_{R_new})    (4.5)

Routing Objective: We want to select the path P with the minimum value of C_P(t^d_{R_new}) among all valid paths for R_new. This means selecting the path over which routing R_new results in the minimum total load (considering R_new itself) prior to and including t^d_{R_new}.

Implications: Since this cost assignment is edge decomposable (i.e., the cost of a path is the sum of the costs of its edges), a path with minimum cost can simply be selected using Dijkstra's shortest path algorithm. For small transfers, where V_{R_new} is much smaller than L_e(t^d_{R_new}), the total already scheduled load on edges is dominant and as a result the assignment selects paths with minimum total load prior to the new transfer's deadline. If V_{R_new} is considerably larger than L_e(t^d_{R_new}) for candidate paths, the cost function leans toward selecting shorter paths to minimize network capacity usage. This is especially effective for heavy-tailed transfer size distributions: the few enormous transfers will be scheduled on the shortest paths while the rest of the transfers are distributed across longer paths for load balancing.

Figure 4.5: An example of assigning paths to transfers and their total network capacity use, assuming that no two transfers should be assigned the same paths for load balancing.

4.3.4 DCRoute Algorithm

Every time a new request is submitted to the system, t_end is updated to the latest deadline. We define the active window as the range of timeslots, over all edges, from time t_now + 1 to t_end, which are the timeslots DCRoute operates on. DCRoute is made up of four procedures explained in the following.

Allocate(R_new): Algorithm 3 is executed upon arrival of a new request R_new and performs path selection, admission control, and bandwidth allocation. To do so, it assigns a cost of L_e(t^d_{R_new}) + V_{R_new} to every edge e ∈ E_G of the graph and then runs Dijkstra's algorithm to select the path P with minimum cost. It then tries to schedule transfer R_new on P according to the ALAP policy, starting from timeslot t^d_{R_new} backward, until R_new is completely satisfied. It rejects the request if there is not enough capacity on P from t_now to t^d_{R_new}.

Algorithm 3: Allocate(R_new)
Input: R_new(V_{R_new}, S_{R_new}, D_{R_new}, t^d_{R_new}), G(V, E), ω, L_e(t) and B_e(t), ∀e ∈ E_G, t > t_now
Output: Whether R_new should be admitted, and a minimum cost path P
1  To every edge e ∈ E_G, assign cost L_e(t^d_{R_new}) + V_{R_new};
2  Find path P by running Dijkstra's algorithm for the shortest (minimum cost) path;
3  t' ← t^d_{R_new} and V' ← V_{R_new};
4  while V' > 0 and t' > t_now do
5      B_P(t') ← min_{e∈E_P} (B_e(t'));
6      Schedule R_new on P with rate min(B_P(t'), V'/ω) at timeslot t';
7      V' ← V' − min(B_P(t'), V'/ω)·ω and t' ← t' − 1;
8  return P if V' = 0, otherwise, reject R_new;
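For illustration, the sketch below combines the two steps of Algorithm 3 (edge costs of L_e(t^d_{R_new}) + V_{R_new} followed by Dijkstra, then ALAP allocation on the chosen path). All names and structures are assumptions made for the example, not DCRoute's actual implementation; the per-edge objects are assumed to offer the simple bookkeeping interface sketched earlier in this section.

# Minimal sketch (illustrative) of Algorithm 3: load-aware path selection plus
# ALAP allocation. 'edges' maps frozenset({u, v}) -> an object providing
# load_up_to(t), residual(t) and add(t, amount).
import networkx as nx

def dcroute_allocate(graph, edges, volume, src, dst, t_now, deadline):
    # Step 1: cost each edge with L_e(deadline) + volume, then run Dijkstra.
    for u, v in graph.edges():
        graph[u][v]["cost"] = edges[frozenset((u, v))].load_up_to(deadline) + volume
    path = nx.shortest_path(graph, src, dst, weight="cost")
    path_edges = [frozenset(e) for e in zip(path, path[1:])]

    # Step 2: ALAP allocation over the selected path, backward from the deadline.
    remaining, schedule, t = volume, {}, deadline
    while remaining > 0 and t > t_now:
        rate = min(min(edges[e].residual(t) for e in path_edges), remaining)
        if rate > 0:
            schedule[t], remaining = rate, remaining - rate
        t -= 1
    if remaining > 0:
        return None                       # reject: deadline cannot be met
    for slot, rate in schedule.items():   # commit the admitted allocation
        for e in path_edges:
            edges[e].add(slot, rate)
    return path, schedule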
PullBack(): This procedure sweeps the timeslots from t_now + 2 to t_end and pulls traffic back to the next timeslot to be scheduled, i.e., t_now + 1. The objective is to maximize network resource utilization. When pulling back traffic, all edges on a transfer's path have to be checked for available capacity and updated together and atomically as we pull traffic back.

PushForward(): After pulling some traffic back, it may be possible for some other traffic to be pushed ahead even further to make the allocation ALAP. This procedure scans all future timeslots starting from t_now + 2 and makes sure that all demands are allocated ALAP. If not, it moves as much traffic as possible to future timeslots until all residual demands are ALAP. Note that there may be many ALAP schedules due to spatial and temporal dependencies across transfers. This procedure finds one such schedule by scanning through time and edges in a fixed order.

Walk(): This procedure is executed when the allocation for the next timeslot is final. It broadcasts to all datacenters the allocation finalized for the next timeslot and adjusts requests' remaining demands accordingly by deducting what is scheduled to be sent from the total demand.

4.3.5 Simulation Results

In this section, we perform simulations to evaluate the performance of DCRoute. We generate synthetic traffic requests with Poisson arrival and input the traffic to both DCRoute and a few other techniques that can be used for deadline-aware traffic allocation. Two metrics are measured and compared: allocation time and fraction of rejected traffic, both of which are desired to be small.

Simulation Parameters: We used the same traffic distributions as described in [55]. Requests arrive with a Poisson distribution of rate λ. Also, the total demand of each request R is distributed exponentially with mean 1/8, proportional to the maximum transmission volume possible prior to t^d_R. In addition, the deadline of requests is exponentially distributed, for which we assumed a mean of 10 timeslots. We performed the simulations over 500 timeslots. We considered a uniform link capacity of 1 for all edges. We compare DCRoute with the following allocation schemes, for all of which we used the same objective function as [55]:

Global LP: This technique is the most general and flexible way of allocation, which routes traffic over all possible edges. All active requests are considered for all timeslots on all edges, creating a potentially large linear program. The solution here gives us a lower bound on traffic rejection rate.

K-Shortest Paths: Same as Global LP, however, only the K-Shortest Paths between each pair of nodes are considered in routing. The traffic is allocated using a linear program over such paths. We simulated four cases of K ∈ {1, 3, 5, 7}. It is obvious that as K increases, the overall rejection rate will decrease as we have higher flexibility for choosing paths and multiplexing traffic.

Pseudo-Integer Programming (PIP): In terms of traffic rejection rate, comparing DCRoute with the previous two techniques is not fair as they allow multiplexing packets on multiple paths. The aim of this technique is to find a lower bound on traffic rejection rate when all packets of each request are sent over a single path. To do so, the general way is to create an integer program involving a list of possible paths (maybe all paths) for the new request and fixed paths for requests already allocated. The resulting model would be a non-linear integer program which cannot be solved using standard optimization libraries. We instead created a number of linear programs, each assigning one of the possible K-Shortest Paths to the newly arriving request. We then compare the objective values manually and choose the best possible path. In our implementation, we chose K = 20. This K seems to be more than necessary as we saw negligible improvement in traffic rejection rate even when increasing K from 5 to 7.
Using PIP, the path over which a request is transferred is decided upon admission and does not change afterwards. We implemented two versions of this scheme:

Pure Minimum Cost (PMC): We choose the path that results in the smallest objective value.

Shortest Path, Minimum Cost (SPMC): Amongst all paths that result in a feasible solution and have the least number of hops, we choose the one with the smallest objective value.

Experiment 1: Google's GScale Network

The GScale network [2], comprised of 12 nodes and 19 links, connects Google datacenters worldwide. We used the same topology to evaluate DCRoute as well as the other allocation schemes. Figure 4.6 shows the rejection rate of different techniques for different arrival rates, from low load (λ = 1) to high load (λ = 15). We have included the schemes that potentially multiplex traffic over multiple paths just to provide a lower bound. Comparing with the PMC and SPMC schemes over all arrival rates, DCRoute performs < 2% worse than the one with the minimum rejection rate. Also, compared to all schemes, DCRoute rejects at most 4% more traffic. Figure 4.6 also shows the relative time to process a request using different schemes. This time is calculated by dividing the total time to allocate and adjust all requests over all timeslots by the total number of requests. DCRoute is about 3 orders of magnitude faster than either PMC or SPMC. It should be noted that the rate at which time complexity grows drops as we move toward higher arrival rates, since there is less capacity available for new requests and many arriving requests get rejected by failing simple capacity constraint checks.

Experiment 2: Network with Variable Size

We simulated the different methods against four networks of 5 to 20 nodes: (N, M) ∈ {(5, 7), (10, 17), (15, 27), (20, 37)}. In our topology, each node was connected to 3 or 4 other nodes at most 2 hops away. The arrival rate was kept constant at λ = 6.0 for all cases. Figure 4.7 shows the rejection rate of different schemes for different network sizes. As the network size increases, since λ is kept constant, the total capacity of the network increases compared to the total demand of requests. As a result, for a scheme that multiplexes request traffic over different paths, we expect to see a decrease in rejection rate. For the K-Shortest Paths cases with K ∈ {1, 3}, we see an increase in rejection rate, which we think is because these schemes cannot multiplex packets that much. Increasing the network size for these cases can cause more requests to have common links, as the network is sparsely connected, creating more bottlenecks and resulting in a higher rejection rate.

Figure 4.6: Total % of rejected traffic and relative request processing time for the GScale network with 12 nodes and 19 links.

Figure 4.7: Total % of rejected traffic and relative request processing time for networks with different sizes.
PMC has a high rejection rate for small networks, since choosing the minimum cost path might result in selecting longer (more hops) paths that create a larger number of bottlenecks due to collision with other requests. As the network size increases, there are more paths to choose from, which results in fewer bottlenecks and therefore a lower rejection rate. In contrast, SPMC enforces the selection of paths with a smaller number of hops, resulting in lower rejection rates for small networks (due to request paths colliding less) and more rejections as the network grows due to less diversity in the chosen paths. Compared to these two approaches, DCRoute balances the choice between shorter and longer paths. The assigned path has the least sum of load on the entire path and the least bottleneck load among all such paths. Paths with heavily loaded links and an unnecessarily large number of hops are avoided. As a result, the rejection rate compared to min(PMC, SPMC) is relatively small (< 3%) for all network sizes. Also, as Figure 4.7 shows, similar to the previous simulation, DCRoute is almost three orders of magnitude faster than the PIP schemes and more than 200× faster than all considered schemes.

4.4 Admission Control with Multipath ALAP Scheduling

In some scenarios we may be inclined to pay the reordering cost in order to increase throughput, especially since inter-DC capacity is costly. In case packet reordering is not an issue, we can use multipath routing to increase network throughput and maximize the chances of admission for new transfers with deadlines. According to the ALAP policy, we will need to schedule traffic as close as possible to the deadlines over the multiple paths. That can be done by starting from the deadline on both paths and allocating as much as possible, then moving one timeslot back and allocating as much as possible on both paths, and so on. This has been shown in Figure 4.8.

Figure 4.8: An example of multipath ALAP scheduling: traffic is allocated on edge-disjoint paths from the deadline backward in parallel.

4.4.1 Multipath Routing

There are a variety of ways to select multiple paths for new transfers. We focus on the application of parallel edge-disjoint paths to increase throughput. The benefit of using edge-disjoint paths is that the traffic for the same transfer will not have to compete with itself over common edges. We explain how multiple paths are selected and name our technique MP-DCRoute. We want to select paths in a similar way to the approach presented in 4.3.3, which allowed for quick selection of paths while balancing the load across network edges. To select more than one path with such properties, after finding the first path, we mark all of its edges as deleted and then search for the next path. This way, we are sure to obtain the same good load balancing properties while guaranteeing that the paths are edge-disjoint. We can keep searching for new paths until no more paths remain, or we can terminate the search as soon as we find a given number of paths. We combined these two conditions to allow for up to K load balancing paths per transfer, where K is a configuration parameter. The parameter K needs to be selected carefully, as using too many parallel paths per transfer can waste bandwidth and exhaust network capacity. That is because as we select more paths, the paths tend to grow longer or use edges that are heavily loaded. This means that, under light load, using more paths can improve throughput, while under heavy load, doing so can quickly saturate the network and lead to rejection of transfers. In general, K can be selected adaptively according to the network's overall load factor. That is, operators can monitor the incoming traffic load and update K accordingly for new transfers.
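A minimal sketch of the edge-disjoint selection described above: repeatedly pick a minimum-cost path under the load-aware edge costs, then remove its edges before searching for the next one, stopping after K paths or when the endpoints become disconnected. The cost attribute and names are illustrative assumptions consistent with 4.3.3, not MP-DCRoute's actual code.

# Illustrative sketch of MP-DCRoute-style path selection: up to K edge-disjoint,
# load-aware minimum-cost paths, found by deleting the edges of each chosen path.
import networkx as nx

def edge_disjoint_paths(graph, src, dst, k, cost="cost"):
    g = graph.copy()          # work on a copy so the topology is preserved
    paths = []
    while len(paths) < k:
        try:
            path = nx.shortest_path(g, src, dst, weight=cost)
        except nx.NetworkXNoPath:
            break             # no more edge-disjoint paths remain
        paths.append(path)
        g.remove_edges_from(zip(path, path[1:]))
    return paths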
4.4.2 Simulation Results

In this section, we perform simulations to evaluate the performance of MP-DCRoute. We generate synthetic traffic requests with Poisson arrival and input the traffic to both MP-DCRoute and the DCRoute scheme presented in the previous section. Three metrics are measured and compared: allocation time, fraction of rejected requests, and fraction of rejected traffic, all of which are desired to be small.

Simulation Parameters: We used the same traffic distributions as described in [55]. Requests arrive with a Poisson distribution of rate λ. Also, the total demand of each request R_new is distributed exponentially with mean 1/2, proportional to the maximum transmission volume possible prior to t^d_{R_new}. In addition, the deadline of requests is exponentially distributed, for which we assumed a mean of 1 timeslot. We performed the simulations over 1000 timeslots. We considered a uniform link capacity of 1 for all edges. We compare the following allocation schemes, which are basically single path and multipath ALAP scheduling techniques:

DCRoute (1 path): The technique proposed in §4.3. It uses a single path that is selected adaptively according to network load to balance load and minimize packet reordering.

MP-DCRoute (up to K paths): We use the technique proposed in this section to select up to K edge-disjoint, adaptively selected paths that balance load across the network.

We compare DCRoute and MP-DCRoute over three different topologies, as shown in Figures 4.9, 4.10 and 4.11. In terms of the total traffic admitted and the total number of requests admitted, we see that MP-DCRoute does considerably better, i.e., up to 12% more traffic and up to 5% more transfers are admitted to the inter-DC network. We also see that the gain of using multiple paths shrinks as we increase the network load by increasing the arrival rate of transfers. Also, we see that all the benefit of using multiple paths is obtained with 2 paths, and increasing the number of paths to 3 has virtually no benefit (in most cases, using 3 paths instead of 2 hurts the performance). We then evaluate the running time of the different techniques, which is the total computation time to handle all 1000 timeslots. We see that MP-DCRoute can be between 2× and 3× slower than DCRoute, which is due to the time needed to find additional paths and schedule traffic over multiple paths per transfer. However, since the total time to process a single request is small (on the order of milliseconds), this should not cause any practical impediments.

Figure 4.9: Multipath ALAP scheduling over the GScale [2] topology.
Figure 4.10: Multipath ALAP scheduling over the ANS [4] topology.

Figure 4.11: Multipath ALAP scheduling over the Cogent [1] topology.

4.5 Conclusions

In this chapter, we discussed the problem of admission control for inter-DC transfers with deadlines, which is an essential problem given that inter-DC networks have limited capacity. Sending traffic without paying attention to deadlines could waste bandwidth, as the value of transfers completed past their deadlines may be significantly less. We discussed why current approaches based on linear programming or mixed integer linear programming are not effective in general, as they could take a long time to solve and require considerable computing resources. We presented a new scheduling technique called the As Late As Possible (ALAP) policy that allows the scheduler to quickly decide whether a new transfer can be admitted on a given path. We then developed an adaptive routing approach that balances load across the network and saves network capacity by routing larger transfers over shorter paths. Finally, we realized that, although using a single path per transfer can minimize packet reordering, which is a desired property, it can also limit the obtainable throughput. We therefore applied an edge-disjoint multipath routing technique that improves the traffic admitted to the network. We performed extensive simulations to confirm the effectiveness of our approaches, showing that our methods can reduce the time needed to perform admission control and compute a valid schedule by orders of magnitude, at little or no cost to the total traffic admitted to the inter-DC network.

Chapter 5
Efficient Point to Multipoint Transfers over Inter-DC Networks

As discussed in Chapter 1, a large volume of inter-DC traffic is due to replication of data and content from one datacenter to multiple other datacenters. We refer to such transfers as Point to Multipoint (P2MP) transfers, which have a known sender and set of receivers upon arrival. Also, in general, we do not have knowledge of the arrival times of these transfers and have to manage them as they arrive at the network, i.e., in an online fashion. We consider efficient routing and scheduling of P2MP data transfers, with the objective of minimizing transfer completion times and total network capacity consumption. Using centralized scheduling and load-aware multicast tree selection, we can significantly improve the performance. Our approach is different from traditional multicasting in that we select multicast trees atomically given the source and all the destinations, whereas traditional multicasting builds multicast trees incrementally as destinations join. With a global view of network topology and edge load status, it is possible to find near optimal weighted Steiner trees that connect any given source datacenter to its destination datacenters per P2MP transfer.
We define appropriate edge weights and select minimum weight Steiner trees, which lead to efficient bandwidth utilization across all network edges. To our knowledge, the research set forth was, at the time of publication, the first to explore and study efficient P2MP transfers over inter-DC networks. (This chapter was originally published in [83].)

Table 5.1: Various services that perform data replication.
Service: Replicas
Facebook: Across availability regions [90], 4 [91], for various object types including large machine learning configs [92]
CloudBasic SQL Server: Up to 4 secondary databases with active Geo-Replication (asynchronous) [93]
Azure SQL Database: Up to 4 secondary databases with active Geo-Replication (asynchronous) [94]
Oracle Directory Server: Up to the number of datacenters owned by an enterprise for regional load balancing of directory servers [95,96]
AWS Route 53 GLB: Across multiple regions and availability zones for global load balancing [97]
Youtube: Function of popularity, content potentially pushed to many locations (could be across 33 datacenters [98])
Netflix: Across 2 to 4 availability regions [99], and up to 233 cache locations distributed globally [36]

5.1 Background and Related Work

A variety of datacenter services replicate content and data from one location to many locations. Table 5.1 provides a brief list of how many replicas are made for some applications. Also, Figure 5.1 offers a list of applications that perform P2MP transfers and gives a short description of why such replication is done.

Figure 5.1: Applications that generate transfers potentially with multiple destinations.

One solution is to perform P2MP transfers as multiple independent P2P transfers that are scheduled separately [46,54,55,61,70,88,100–105]. There may, however, be more efficient ways, in terms of total bandwidth usage and transfer completion times, to perform P2MP transfers by sending at most one copy of the message across any link, given that the source datacenter and destination datacenters are known a priori. In Figure 5.2, an object X is to be transferred from datacenter S to two D datacenters, considering a link throughput of R. In order to send X to both destinations, one could initiate individual transfers, but that wastes bandwidth and increases delivery time since the link attached to S turns into a bottleneck. We present an elegant solution using minimum weight Steiner Trees [71] (a.k.a. Forwarding Trees, or Multicast Trees) for P2MP transfers that achieves reduced bandwidth usage and tail completion times for receivers. We briefly go over some of the related work in this space and survey their objectives and methods.

Internet Multicasting: A large body of general multicasting approaches have been proposed where receivers can join multicast groups anytime to receive required data, and multicast trees are incrementally built and pruned as nodes join or leave a multicast session, such as IP multicasting [106], TCP-SMO [107] and NORM [108]. These solutions focus on building and maintaining multicast trees, and do not consider link capacity and other ongoing multicast flows while building the trees.

Multicast Traffic Engineering: An interesting work [109] considers the online arrival of multicast requests with a specified bandwidth requirement. The authors provide an elegant solution to find a minimum weight Steiner tree for an arriving request with all edges having the requested available bandwidth.
This work assumes a fixed transmission rate per multicast tree, dynamic multicast receivers, and unknown termination time for multicast sessions, whereas we consider variable transmission rates over timeslots, fixed multicast receivers, and deem a multicast tree completed when all its receivers download a specific volume of data. MTRSA [110] considers a problem similar to [109] but in an offline scenario where all multicast requests are known beforehand, while taking into account the number of available forwarding rules per switch. MPMC [111,112] maximizes the throughput for a single multicast transfer by using multiple parallel multicast trees and coding techniques. None of these works aims to minimize the completion times of receivers while considering the total bandwidth consumption.

Figure 5.2: Inter-DC multicasting can reduce total bandwidth consumption as well as completion times of transfers (in the example, a forwarding tree delivers an object of size X to both destinations in time T = X/R using 3X total bandwidth, versus 2T and 4X for one-to-one transfers).

Datacenter Multicasting: A variety of solutions have been proposed for minimizing congestion across the intra-datacenter network by selecting multicast trees according to link utilization. Datacast [113] sends data over edge-disjoint Steiner trees found by pruning spanning trees over various topologies of FatTree, BCube, and Torus. AvRA [114] focuses on tree and FatTree topologies and builds minimum edge Steiner trees that connect the sender to all receivers as they join. MCTCP [115] reactively schedules flows according to link utilization. These works do not aim at minimizing the completion times of receivers and ignore the total bandwidth consumption.

Overlay Multicasting: With overlay networks, end-hosts can form a multicast forwarding tree in the application layer. RDCM [116] populates backup overlay networks as nodes join and transmits lost packets in a peer-to-peer fashion over them. NICE [117] creates hierarchical clusters of multicast peers and aims to minimize control traffic overhead. AMMO [118] allows applications to specify performance constraints for the selection of multi-metric overlay trees. DC2 [119] is a hierarchy-aware group communication technique to minimize cross-hierarchy communication. SplitStream [120] builds forests of multicast trees to distribute load across many machines. BDS [121] generates an application-level multicast overlay network, creates chunks of data, and transmits them in parallel over bottleneck-disjoint overlay paths to the receivers. Due to limited knowledge of the underlying physical network topology and condition (e.g., utilization, congestion or even failures), and limited or no control over how the underlying network routes traffic, overlay routing has limited capability in managing the total bandwidth usage and distribution of traffic to minimize completion times of receivers. In case such control and information are provided, for example by using a cross-layer approach, overlay multicasting can be used to realize solutions such as those presented in this dissertation.

Reliable Multicasting: Various techniques have been proposed to make multicasting reliable, including the use of coding and receiver (negative or positive) acknowledgments. Experiments have shown that using positive ACKs does not lead to ACK implosion for medium scale (sub-thousand) receiver groups [107]. TCP-XM [122] allows reliable delivery by using a combination of IP multicast and unicast for data delivery and re-transmissions.
MCTCP [115] applies standard TCP mechanisms for reliability. Another approach is for receivers to send NAKs upon expiration of some inactivity timer [108]. NAK suppression has been proposed to address implosion, which can be applied by routers [123]. Forward Error Correction (FEC) has been used to reduce re-transmissions [108] and improve completion times [124], examples of which include Raptor Codes [125] and Tornado Codes [126]. These techniques can be applied complementary to the algorithms and techniques presented in this dissertation.

Multicast Congestion Control: Existing approaches track the slowest receiver. PGMCC [127], MCTCP [115] and TCP-SMO [107] use window-based TCP-like congestion control to compete fairly with other flows. NORM [108] uses an equation-based rate control scheme. With rate allocation and end-host based rate limiting applied over inter-DC networks, the need for distributed congestion control becomes minimal; however, such techniques can still be used as a backup in case there is a need to fall back to distributed inter-DC traffic control.

Other Related Work: CastFlow [128] precalculates multicast spanning trees which can then be used at request arrival time for fast rule installation. ODPA [129] presents algorithms for dynamic adjustment of multicast spanning trees according to specific metrics. BIER [130] has been recently proposed to improve the scalability and allow frequent dynamic manipulation of multicast forwarding state in the network, and can be applied complementary to our solutions in this dissertation. Peer-to-peer approaches [131–133] aim to maximize throughput per receiver without considering physical network topology, link capacity, or total network bandwidth consumption. Store-and-Forward (SnF) approaches [54,100,102,134] focus on minimizing transit bandwidth costs, which does not apply to dedicated inter-DC networks. However, SnF can still be used to improve overall network utilization in the presence of diurnal link utilization patterns, transient bottleneck links, or for application layer multicasting. BDS [135] uses many parallel overlay paths from a multicast source to its destinations, storing and forwarding data from one destination to the next. Application of SnF for bulk multicast transfers considering the physical topology is complementary to our work in this dissertation. Recent research [136–139] also considers bulk multicast transfers with deadlines, with the objective of maximizing the number of transfers completed before the deadlines.

Table 5.2: Variables used in this chapter in addition to those in Table 2.2
Variable: Definition
L_e: Total load currently scheduled on edge e (same as L_e(t_now))
T_{R_i}: The forwarding tree (i.e., multicast tree) selected for request R_i

5.2 Adaptive Forwarding Tree Selection for P2MP Transfers

We present an efficient scheme for P2MP transfers called DCCast [83], which aims to optimize tail transfer completion times as well as total network capacity consumption. It selects forwarding trees according to a weight assignment that tries to balance load across the network.

5.2.1 System Model

To allow for flexible bandwidth allocation, we consider a slotted timeline [46,55,88] where the transmission rate of senders is constant during each timeslot, but can vary from one timeslot to the next. This can be achieved via rate-limiting at end-hosts [44,74].
A central scheduler is assumed that receives transfer requests from end-points, calculates their tem- poral schedule, and informs the end-points of rate-allocations when a timeslot begins. We focus on scheduling large transfers that take more than a few timeslots to nish and there- fore, the time to submit a transfer request, calculate the routes, and install forwarding rules is considered negligible in comparison. We assume equal capacity for all links in an online scenario where requests may arrive anytime. A more advanced solution that considers non- uniform link capacity is discussed in the next chapter. We will use the same notation as that in Table 2.2 with some additional variables in this section as shown in Table 5.2. Denition of Edge Load L e : We dene a new metric called edge load which provides a measure of how busy a link is expected to be on average over future timeslots. L e ;8e2 E E E G is the total volume of trac scheduled on an edge e which is computed by summing up the number of remaining bytes for all the transfers that share e at t now . 74 5.2.2 Selection of Forwarding Trees Our proposed approach is, for each P2MP transfer, to jointly route trac from source to all destinations over a forwarding tree to save bandwidth. Using a single forwarding tree for every transfer also minimizes packet reordering which is known to waste CPU and memory resources at the receiving ends especially at high rates [140,141]. To perform a P2MP transfer R new with volumeV Rnew , the source S Rnew transmits trac over a Steiner Tree that spans across D D D Rnew . At any timeslot, trac for any transfer ows with the same rate over all links of a forwarding tree to reach all the destinations at the same time. The problem of scheduling a P2MP transfer then translates to nding a forwarding tree and a transmission schedule over such a tree for every arriving transfer in an online manner. A relevant problem is the minimum weight Steiner tree [71] that can help minimize total bandwidth usage with proper weight assignment. Although it is a hard problem, heuristic algorithms exist that often provide near optimal solutions [142,143]. 5.2.3 Scheduling Policy When forwarding trees are found, we schedule trac over them according to First Come First Serve (FCFS) policy using all available residual bandwidth on links to minimize the completion times. This allows us to provide guarantees to users on when their transfers will complete upon their arrival. We do not use a preemptive scheme, such as Shortest Remaining Processing Time (SRPT), due to practical concerns: larger transfers might get postponed over and over which might lead to the starvation problem and it is not possible to make promises on exactly when a transfer would complete. Optimal scheduling discipline to minimize tail times rests on transfer size distribution [144]. 5.2.4 DCCast Algorithms DCCast is made up of two algorithms as follows. 2 Update(): This procedure is executed upon beginning of every timeslot. It simply dis- patches the transmission schedule, that is the rate for each transfer, to all senders to adjust their rates via rate-limiting and adjusts L e (e2 E G E G E G ) by deducting the total trac that was sent over e during current timeslot. Allocate(R new ): This procedure is run upon arrival of every request which nds a for- warding tree and schedulesR new to nish as early as possible. 
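Before the full pseudo-code is given below, the idea behind this allocation can be sketched as follows: once a forwarding tree has been chosen, the request is scheduled as early as possible, one timeslot at a time, with the tree's rate in each timeslot bounded by its least-available edge. This is an illustrative simplification and not the DCCast code; residual capacities are assumed to be kept in per-edge arrays indexed by future timeslots, volumes are expressed in capacity units per timeslot, and all names are our own.

import java.util.List;

/** Sketch of scheduling a newly arrived P2MP request over its selected forwarding tree. */
public class TreeRateAllocator {
    /** residual[e][t]: available capacity of edge e at future timeslot t.
     *  Returns the per-timeslot rate schedule for the new tree. */
    public static double[] allocateASAP(List<Integer> treeEdges, double[][] residual,
                                        double volume, int horizon) {
        double[] schedule = new double[horizon];          // rate per timeslot for this tree
        for (int t = 0; t < horizon && volume > 1e-9; t++) {
            // the tree's rate at slot t is limited by its edge with the least free capacity
            double bottleneck = Double.MAX_VALUE;
            for (int e : treeEdges) bottleneck = Math.min(bottleneck, residual[e][t]);
            double rate = Math.min(bottleneck, volume);   // never allocate more than remains
            if (rate <= 0) continue;
            schedule[t] = rate;
            volume -= rate;
            for (int e : treeEdges) residual[e][t] -= rate; // reserve capacity on every tree edge
        }
        return schedule;                                  // a zero tail means the transfer finished earlier
    }
}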
Pseudo-code of this function 2 An implementation of DCCast is available on Github: https://github.com/noormoha/DCCast 75 Algorithm 4: Allocate(R new ) Input: R(V Rnew ;S Rnew ;D D D Rnew ), G(V V V G ;E E E G ), !, L e and B e (t) for e2 E E E G and t>t now Output: Forwarding tree (minimum weight Steiner Tree) T Rnew and transmission schedule (trac allocation) for R new for t>t now 1 To every edge e2 E E E G , assign weight (L e + V Rnew ); 2 Find the minimum weight Steiner tree T Rnew that connects S Rnew [ D D D Rnew . We used GreedyFLAC [72,143]; 3 t 0 t now + 1 andV 0 V Rnew ; 4 whileV 0 > 0 do 5 B T Rnew (t 0 ) min e2E E E T Rnew (B e (t 0 )) ; 6 Schedule R new on T Rnew with rate min(B T Rnew (t 0 ); V 0 ! ) at timeslot t 0 ; 7 t 0 t 0 + 1 andV 0 V 0 min(B T Rnew (t 0 ); V 0 ! )! ; 8 return T Rnew and the transmission schedule of R new ; has been shown in Algorithm 3. Statically calculating forwarding trees can lead to creation of hot-spots, even if there exists one highly loaded edge that is shared by multiple trees. As a result, DCCast dynamically chooses a forwarding tree that reduces the tail transfer completion times while saving considerable bandwidth. It is possible that larger trees provide higher available bandwidth by using longer paths through least loaded edges, but using which would consume more overall bandwidth since they send same trac over more edges. To model this behavior, we use a weight assignment that allows balancing these two possibly con icting objectives. The weights represent trac load allocated on links. Selecting links with lower weights will improve load balancing that would be better for future requests. The trade o is in avoiding heavier links at the expense of getting larger trees for a more even distribution of load. The forwarding tree T Rnew selected by Algorithm 4 will have a total weight of: X e2E E E T Rnew (L e +V Rnew ) (5.1) This weight is essentially the total load overT Rnew if requestR new were to be allocated on it. Selecting trees with minimal total weight will most likely avoid highly loaded edges and larger trees. To nd an approximate minimum weight Steiner Tree, we used GreedyFLAC [72,143], which is quite fast and in practice provides results not far from the optimal. 76 Table 5.3: Schemes used for comparison. Scheme Method MINMAX Selects forwarding trees to minimize maximum load on any link. Schedules trac using FCFS policyx5.2. RANDOM Selects random forwarding trees. Schedules trac using FCFS policy x5.2. BATCHING Batches (enqueues) new requests arriving in time windows of T . At the end of batching windows, jointly schedules all new requests ac- cording to Shortest Job First (SJF) policy and picks their forwarding trees using weight assignment of Algorithm 3. SRPT Upon arrival of a new request, jointly reschedules all existing requests and the new request according to SRPT policyx5.2 and picks new forwarding trees for all requests using weight assignment of Algorithm 3. P2P-SRPT-LP Views each P2MP request as multiple independent point-to-point (P2P) requests. Uses a Linear Programming (LP) model along with SRPT policyx5.2 to (re)schedule each request overK-Shortest Paths between its source and destination upon arrival of new requests. P2P-FCFS-LP Similar to above while using FCFS policyx5.2. 5.2.5 Evaluation We evaluated DCCast using synthetic trac. We assumed a total capacity of 1:0 for each timeslot over every link. The arrival of requests followed a Poisson distribution with rate P 2MP = 1. 
Demand of every request was calculated using an exponential distribution with mean 20 added to a constant value of 10 (fixing the minimum demand to 10). All simulations were performed over as many timeslots as needed to finish all requests, with the arrival time of the last request set to 500 or less. Presented results are normalized by the minimum values in each chart. We measure three different metrics: total bandwidth used as well as mean and tail transfer completion times. The total bandwidth used is the sum of all traffic over all timeslots and all links, i.e., the total network capacity consumed during the simulation running time. The completion time of a transfer is defined as the time from its arrival until its last bit is delivered to the destination(s). We performed simulations using Google's GScale topology [2], with 12 nodes and 19 edges, on a single machine (Intel Core i7-6700T and 24 GBs of RAM). All simulations were coded in Java and used Gurobi Optimizer [145] to solve the linear programs for the P2P schemes. We increased the number of destinations (copies) per object from 1 to 6, picking recipients according to a uniform distribution. Table 5.3 lists the considered schemes. In this table, the first 4 approaches are P2MP schemes and the last 2 are P2P schemes that operate by breaking each P2MP transfer into multiple P2P transfers.

Figure 5.3: Tree Selection (GScale Topo). (Panels show mean and tail TCT versus the number of copies, 1 to 6, for DCCast, RANDOM, and MINMAX.)

Figure 5.4: Tree Selection (Random topology, |V_G| = 50). (Panels show mean and tail TCT versus the number of copies, 1 to 6, for DCCast, RANDOM, and MINMAX.)

We evaluated various forwarding tree selection criteria over both the GScale topology and a larger random topology with 50 nodes and 150 edges, as shown in Figures 5.3 and 5.4, respectively. In the case of GScale, DCCast performs slightly better than RANDOM and MINMAX in completion times while using equal overall bandwidth (not in figure). In the case of the larger random topologies, DCCast's advantage is more pronounced in completion times while it uses the same or less bandwidth (not in figure).

Figure 5.5: Various scheduling policies and the effect of batching. (Panels show mean and tail TCT versus the number of copies, 1 to 6, for DCCast, SRPT, and BATCH with T = 10, 50, 100.)

We also experimented with various scheduling disciplines over forwarding trees, as shown in Figure 5.5. The SRPT discipline performs considerably better with respect to mean completion times; it may, however, lead to starvation of larger transfers if smaller ones keep arriving. It also has to compute and install new forwarding trees and recalculate the whole schedule, for all requests currently in the system with residual demands, upon arrival of every new request. This could impose significant rule installation overhead, which is considered negligible in our evaluations, and might also lead to considerable packet loss and reordering. Batching improves performance marginally compared to DCCast and could be an alternate road to take. Generally, a smaller batch size results in a smaller initial scheduling latency, while a larger batch size makes it possible to employ the collective knowledge of many requests in a batch for optimized scheduling. Batching might be more effective for systems with bursty request arrival patterns. All schemes performed almost similarly regarding tail completion times and total bandwidth usage (not in figure).
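The compared disciplines differ mainly in how pending requests are ordered before residual bandwidth is handed out. The following sketch captures those orderings; the Request fields and class names are our own assumptions rather than the simulator's actual types.

import java.util.Comparator;

/** Orderings behind the compared disciplines; a dispatcher walks the sorted
 *  list and gives each request as much residual tree bandwidth as possible. */
public class SchedulingPolicies {
    public static class Request {
        long arrivalTime;       // when the request was submitted
        double residualVolume;  // demand left to deliver
        double totalVolume;     // original demand, used by SJF at batch boundaries
    }

    // FCFS: earlier arrivals first (used by DCCast)
    public static final Comparator<Request> FCFS =
            Comparator.comparingLong(r -> r.arrivalTime);

    // SRPT: smallest remaining demand first (preemptive; may starve large transfers)
    public static final Comparator<Request> SRPT =
            Comparator.comparingDouble(r -> r.residualVolume);

    // SJF: smallest original demand first (applied to each batch under BATCHING)
    public static final Comparator<Request> SJF =
            Comparator.comparingDouble(r -> r.totalVolume);
}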
In Figure 5.6, we compare DCCast with a point-to-point scheme (P2P-SRPT-LP) that uses the SRPT scheduling policy over a varying number of shortest paths (i.e., K shortest paths) and delivers each copy independently. The total bandwidth usage is close for all schemes when there is only one destination per request. Both the bandwidth usage and the tail completion times of DCCast are up to 50% less than those of P2P-SRPT-LP as the number of destinations per transfer increases. Although DCCast follows the FCFS policy, its mean completion time is close to that of P2P-SRPT-LP and surpasses it for 6 copies due to bandwidth savings, which leave more headroom for new transfers. In a different experiment, we compared DCCast with P2P-FCFS-LP and obtained somewhat similar results: DCCast again saved up to 50% bandwidth and reduced tail completion times by up to almost 50% as the number of destinations per transfer increased.

Finally, we studied the effect of load and network size on DCCast, comparing it with a P2P scheme based on 3-shortest paths. Figure 5.7 shows that as the network grows in size, there is only a minor change in the performance of P2MP routing. The total bandwidth usage increases since paths become longer; however, the bandwidth usage of the P2P scheme grows a little more than that of DCCast. Figure 5.8 shows the effect of input load on the performance of the same schemes. As can be seen, all performance metrics grow much more slowly for DCCast compared to P2P shortest paths (lower values are better).

Figure 5.6: DCCast vs Point-To-Point (P2P-SRPT-LP). (Panels show accumulative bandwidth used, mean TCT, and tail TCT versus the number of copies, 1 to 6, for DCCast and P2P-SRPT-LP with K = 1, 3, 5, 7.)

Figure 5.7: Performance of 3-Shortest Paths (P2P) vs DCCast as the network grows. (Panels show accumulative bandwidth used, mean TCT, and tail TCT versus the number of nodes, 50 to 250, with edges = nodes x 6; 5 copies, average request size of 2 full slots, lambda = 1, exponential request sizes.)

Figure 5.8: Performance of 3-Shortest Paths (P2P) vs DCCast as the incoming network load increases. (Panels show accumulative bandwidth used, mean TCT, and tail TCT versus the average request size in full slots; 5 copies, 50 nodes, 300 edges, lambda = 1, exponential request sizes.)

Computational Overhead: We used a large network with 50 nodes and 300 edges and considered P2MP transfers with 5 destinations per transfer.
Transfers were generated ac- cording to Poisson distribution with arrival times ranging from 0 to 1000 timeslots and the simulation ran until all transfers were completed. Mean processing time of a single timeslot increased from 1:2ms to 50ms per timeslot while increasing P 2MP from 1 to 10. Mean processing time of a single transfer (which accounts for nding a tree and scheduling the transfer) was 1:2ms and 5ms per transfer for P 2MP equal to 1 and 10, respectively. This is negligible compared to timeslot lengths of minutes in prior work [55]. We also looked at the computational overhead of DCCast as network size grows shown in Figure 5.9. As can be seen, the growth is sub-linear. 81 50 100 150 200 250 Nodes (edges = Nodes # 6) 400 600 800 1000 1200 1400 1600 1800 2000 2200 Computation time in milliseconds Copies = 5, Avg Req Size = 2 Full Slots Lambda = 1, Req Size Dist = Exponential DCCast Linear Growth Figure 5.9: Computational overhead of DCCast as network size grows. 5.3 Fast Admission Control for Point to Multipoint Transfers with Deadlines Existing techniques to performing inter-DC transfers are either unable to guarantee the deadlines for inter-DC multicast transfers or can only do so by treating multicast trans- fers as separate P2P transfers. We present Deadline-aware DCCast (DDCCast), a quick yet eective deadline aware point to multipoint technique based on the ALAP trac al- location policy. DDCCast performs careful admission control using temporal planning, rate-allocation, and rate-limiting to avoid congestion while sending trac over forwarding trees that are adaptively selected to reduce network capacity consumption and maximize the number of admitted transfers. We perform experiments conrming DDCCast's poten- tial to reduce total bandwidth usage by up to 45% while admitting up to 25% more trac into the network compared to alternatives that guarantee deadlines. 5.3.1 System Model We use the same notations expressed earlier in Table 2.2 and Table 5.2. Similarly, to provide exible bandwidth allocation, we consider a slotted timeline [46, 55, 88] where the transmission rate of senders is constant during each slot, but can be updated from one slot to the next. This can be achieved using rate-limiting techniques at the end-points [44,74]. A central scheduler is assumed that receives transfer requests from end-points, per- forms admission control to determine feasibility, calculates an initial temporal schedule, and informs the end-points of next timeslot's rate-allocation when the timeslot begins. The 82 allocation for future slots can change as new requests are submitted, however, only the scheduler knows about schedules beyond the current timeslot and it can update such sched- ules as new requests are submitted. We focus on scheduling large transfers that can take minutes or more to complete [46] and therefore, the time to submit a transfer request, cal- culate the routes, and install forwarding rules is considered negligible in comparison. We also assume equal link capacity for all links to simplify the problem. We consider an online scenario where requests may arrive at any time and go through an admission control pro- cess; if admitted, they are scheduled to be completed prior to their deadlines. To prevent thrashing, similar to previous works [55,88], we also assume that once a request is admitted, it cannot be evicted. A transfer requestR i is considered active if it has been admitted but not completed. 
At any moment, there may beK 0 dierent active requests with various deadlines. We dene active window as the range of time from t now + 1 (next timeslot) to t end , the timeslot of the latest deadline, dened as max(t d R i 1 ;:::;t d R i K ). At the end of each timeslot, all requests can be updated to re ect their remaining (residual) demands by deducting volume sent during a timeslot from their total demand at the beginning of a timeslot. To perform a P2MP transfer R, the source S R transmits trac over a Steiner Tree [71] that spans across all destinations D D D R 1 to D D D Rn which we refer to as the P2MP request's forwarding tree. The transmission rate over a forwarding tree at every timeslot is the minimum of available bandwidth over all edges of the tree at that timeslot. 5.3.2 Point to Multipoint Transfers with Deadlines We focus on the case when a P2MP transfer is only valuable if all of its destinations receive the associated object prior to the specied deadline, i.e., all receivers have the same deadline. As a result, a transfer should only be accepted if this requirement can be guaranteed given no failures or unexpected loss of capacity across the network. P2MP Deadline Problem: Determine feasibility of allocating transfer R K+1 using any forwarding tree over the inter-DC network G, given K existing requests R 1 to R K with residual demandsV r R 1 toV r R K each with their own forwarding trees. If feasible, the transfer is admitted and the algorithm should determine the forwarding tree that minimizes overall bandwidth consumption. The objective is to maximize the total trac admitted into the network. The most general approach to solving the P2MP Deadline Problem is to form a Mixed Integer Linear Program (MILP) that considers capacity of links over various timeslots along 83 • Update() – Is executed at the end of every timeslot • Dispatches rate-allocations to end-points (i.e., senders) for rate-limiting • Allocate( ) – Is executed upon arrival of a transfer request 1. Selects a forwarding tree for request 2. Performs rate-allocation over DCCast Procedures DC1 DC2 DC3 Update() -Maximize Utilization -Dispatch Rates Allocate(R) -Forwarding Tree Selection -ALAP rate-allocation Rate Allocation Database Rates Requests TE Server Figure 5.10: DDCCast (Deadline-Aware DCCast 5.2) architecture. with transfer deadlines and reschedules all active requests along with the new request. The solution would be a new schedule for every active transfer (over the same trees) and a new tree with a rate allocation schedule for the new request. Solving MILPs can be computationally intensive and may take a long time. This is especially problematic if MILPs have to be solved upon arrival of requests for admission control where admission control latency can lead to creation of backlogs. We discuss our fast heuristic next. 5.3.3 Deadline-aware DCCast (DDCCast) The architecture of DDCCast (Deadline-Aware DCCast [83, 146]) is shown in Figure 5.10. There are two main procedures of Update() and Allocate(R new ). The former simply reads the rate-allocations from the database and dispatches them to all end-points at the beginning of every timeslot. The latter performs admission control, forwarding tree selection and rate-allocation according to the ALAP policy. The rates are then updated in a database. 
Also, at the beginning of every timeslot, if there is unused capacity, the Update() procedure moves back some of the future allocations, starting with the closest allocation to the current timeslot that can be moved back, to maximize utilization. Afterwards, to keep the allocation ALAP, it may sweep the timeline and further push any allocations that can be pushed forward closest to their deadlines. This technique is similar to the one used by DCRoute in Chapter 4 with the minor dierence that it is applied over the edges of multicast trees. We discuss the main parts of DDCCast in the following. Forwarding Tree Selection For every new transfer, this procedure selects a forwarding tree that connects the sender to all receivers over the inter-DC network. This is done by assigning weights to edges of the inter-DC network and selecting a minimum weight Steiner Tree [71]. Weight of a forwarding 84 tree is sum of the weights of its edges. For every transferR new with volumeV Rnew , we assign edge e2 E E E G of the inter-DC network a weight of (V Rnew +L e (t d Rnew )) where L e (t) is the total load on edge e up to and including timeslot t. Running a minimum weight Steiner Tree heuristic gives us a forwarding tree T Rnew . This process is performed only once for every request upon their arrival. We explain the motivation behind our approach to tree selection. Ideally for routing, we seek a tree with minimum number of edges that connects the source datacenter to all destination datacenters (i.e., a minimum edge Steiner Tree), but such tree may not have enough capacity available on all edges to complete R new prior to t d Rnew . Therefore, a dierent Steiner tree, which can be larger but oers more available bandwidth may be chosen. It is possible that larger trees provide higher available capacity by using longer paths through least loaded edges, but they consume more bandwidth since they sendV Rnew over a larger number of edges. To model this behavior, we use a weight assignment that allows balancing two possibly con icting objectives, i.e., nding the forwarding tree with highest available capacity by potentially taking longer paths (to balance load across the network), while minimizing the total network capacity used by minimizing the number of edges used. Our evaluations presented earlier in 5.2 show that this cost assignment performs more eectively compared to minimizing the maximum utilization on the network which is a well-known policy that is frequently used for trac engineering over wide area networks. Admission Control After nding a P2MP forwarding tree, we need to rst verify if the new transfer can be accommodated over the tree. We perform admission control by calculating the available bandwidth over the tree (i.e.,8e2 E E E T Rnew ) for all timeslots of t now + 1 to t d Rnew . We then sum the available bandwidth across these timeslots and admit the request if the total is not less thanV Rnew . This admission control approach does not guarantee that a rejected request could not have been accommodated on G. It is possible that a request is rejected although it could have been accepted if a dierent forwarding tree had been chosen. In general, nding the tree with maximum available bandwidth prior to a deadline is a hard problem given that the maximum available rate over a tree is the minimum of what is available over its edges per timeslot. 
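The admission test and ALAP placement described above amount to the following sketch, which assumes the forwarding tree has already been selected and that residual link capacities are tracked per edge and per timeslot. It is an illustration of the idea rather than the DDCCast implementation, and all names are our own.

import java.util.List;

/** Sketch of deadline admission control and ALAP placement over a fixed forwarding tree. */
public class DeadlineAdmission {
    /** residual[e][t]: free capacity of edge e at timeslot t. Returns the per-slot
     *  allocation if the request fits before its deadline, or null if it is rejected. */
    public static double[] admitALAP(List<Integer> treeEdges, double[][] residual,
                                     double volume, int tNow, int deadline) {
        int horizon = residual[treeEdges.get(0)].length;
        if (deadline >= horizon) return null;
        // 1) Admission test: sum the tree's available bandwidth over timeslots before the deadline.
        double available = 0;
        double[] free = new double[deadline + 1];
        for (int t = tNow + 1; t <= deadline; t++) {
            double f = Double.MAX_VALUE;
            for (int e : treeEdges) f = Math.min(f, residual[e][t]);
            free[t] = f;
            available += f;
        }
        if (available < volume) return null;             // the deadline cannot be guaranteed
        // 2) ALAP placement: fill timeslots starting from the deadline, moving backwards.
        double[] schedule = new double[deadline + 1];
        for (int t = deadline; t > tNow && volume > 1e-9; t--) {
            double r = Math.min(free[t], volume);
            schedule[t] = r;
            volume -= r;
            for (int e : treeEdges) residual[e][t] -= r;  // reserve capacity on every tree edge
        }
        return schedule;
    }
}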
In addition, even if this problem could be optimally solved in polynomial time, it is unclear whether it would lead to an improved solution since this is an online resource packing problem with multiple capacity and demand constraints. 85 Trac Allocation and Adjustment Once admitted, the trac allocation process places every new request according to ALAP policy which guarantees meeting deadlines while postponing the use of bandwidth until necessary. Adjustments are done in Update() procedure upon beginning of timeslots. To maximize utilization and use the network eciently, we adjust the schedules when there is unused capacity. Upon the beginning of every timeslot, we pull trac from closest timeslots in the future over each forwarding tree and send it in current timeslot, if there is available capacity across all edges of such a P2MP forwarding tree. For a network, it may not be possible to schedule trac ALAP on all edges since allocations may need to span over multiple edges all of which may not have available bandwidth. Therefore, after maximizing the utilization of the upcoming timeslot (i.e., t now + 1), we sweep the timeline starting t now + 2 and push allocations forward as much as possible until no schedule can be pushed further toward its deadline. 5.3.4 Evaluation We evaluated DDCCast using synthetic trac generated in accordance with several related works [55, 88]. The arrival of requests followed a Poisson distribution with rate . The deadlinet d Rnew of every requestR new was generated using an exponential distribution with a mean value of 10 timeslots. Demand ofR new was then calculated using another exponential distribution with a mean of t d Rnew tnow 8 . All simulations were performed over 500 timeslots and each scenario was repeated 10 times and the average measurements have been reported. We assumed a total capacity of 1:0 for every timeslot over every link. Setup: We performed our simulations over Google's GScale topology [2] with 12 datacenters and 19 bidirectional edges. We assumed a machine attached to each datacenter generating trac destined to other (multiple) datacenters. The simulations were performed on a single machine equipped with an Intel Core i7-6700T CPU and 24GBs of RAM. All simulations were coded in Java, and to solve linear programs for Amoeba, we used Gurobi [145]. Performance Metrics: We measured two metrics of total bandwidth used and total trac volume admitted. Both parameters were calculated over the whole network and all timeslots. The rst parameter is the sum of all trac over all timeslots and all links. The second parameter determines what volume of oered load from all end-points was admitted into the network. 86 12345 Number of destinations per request 0.7 0.75 0.8 0.85 0.9 0.95 1 1.05 Mean Total Traffic Admitted (Normalized by DDCCast) DDCCast DCRoute Amoeba 12345 Number of destinations per request 1 1.5 2 2.5 3 3.5 4 4.5 Mean Total Bandwidth Used (Normalized by minimum data point) DDCCast DCRoute Amoeba Figure 5.11: Capacity consumption and total admitted trac byjD D D R j (given = 2) Schemes: Following schemes were considered: DDCCast, DCRoute, 3 and Amoeba [55] all of which aim to guarantee the deadlines, maximize total utilization, and perform admission control. DCRoute and Amoeba do not have the notion of point to multipoint forwarding trees. 
As a result, to perform the following simulations, each P2MP transfer with multiple destinations in DDCCast is broken into several independent P2P transfers from the source to each destination and then plugged into DCRoute and Amoeba. We only compare DDCCast with these two works since other works either do not support deadlines [44,74] or focus on dierent objectives. Eect of Number of Destinations Figure 5.11 shows the results of this experiment. We increased the number of destinations for each transfer from 1 to 5 and picked random destinations for each transfer. The total volume of trac used by Amoeba [55] is up to 1:8 the volume used by DDCCast. Even in case of one destination Amoeba uses 1:2 the bandwidth of DCCast and DCRoute. This occurs because Amoeba routes trac across theK static shortest paths, and asK increases, some of these paths may not be as short as the shortest path. Therefore, even for a small incoming network load, a portion of trac may traverse longer paths and increase total bandwidth usage. DDCCast saves bandwidth by using P2MP forwarding trees. DDCCast admits 25% more trac compared to Amoeba when sending objects to 5 destinations while using 45% less overall network capacity. 3 DCRoute was presented earlier in Chapter 4. 87 12345 Request Arrival Rate ( ) 0.85 0.9 0.95 1 Mean Total Traffic Admitted (Normalized by DDCCast) DDCCast DCRoute Amoeba 12345 Request Arrival Rate ( ) 1 2 3 4 5 6 7 8 Mean Total Bandwidth Used (Normalized by minimum data point) DDCCast DCRoute Amoeba Figure 5.12: Capacity consumption and total admitted trac by (givenjD D D R i j = 3;8i) Eect of Transfer Arrival Rate (i.e., Incoming Load) We investigate the eect of while sending an object to three destinations. Results of this experiment have been shown in Figure 5.12. Volume of admitted trac is about 10% higher for DDCCast compared with other two schemes over all arrival rates. Also, similar to the previous experiment, DDCCast's total bandwidth usage is between 37% to 45% less than Amoeba [55] and 28% less than DCRoute [88]. 5.4 Conclusions In this chapter, we studied ecient inter-DC P2MP transfers which are multicast transfers with known source and set of receivers upon submission to the inter-DC network. We in- vestigated an adaptive approach to selection of forwarding trees (i.e., multicast trees) which reduced total capacity consumption while balancing load across the network. It is possible to set up such trees using commodity hardware that support multicast forwarding, or SDN frameworks such as OpenFlow [147] along with application of Group Tables [148]. 4 Such trees can be congured upon arrival of transfers and torn down upon their completion. Our evaluations show that by adaptively selecting forwarding trees according to edge load and transfer size, we can reduce the total network capacity consumption while either reducing completion times, or admitting more trac given guaranteed deadlines. 4 See Appendix B for a discussion of switch support for group tables. 88 Chapter 6 Speeding up P2MP Transfers using Receiver Set Partitioning In the previous chapter, we discussed using atomically selected forwarding trees (i.e., multi- cast trees) to copy an object from one datacenter to multiple datacenters over an inter-DC network. This allowed us to save network capacity while reducing the time needed to cast objects to many locations. 
Although one can perform inter-DC P2MP transfers using a sin- gle multicast forwarding tree, that might lead to poor performance as the slowest receiver on each tree dictates the completion time for all receivers. In this chapter, we discuss using multiple trees per transfer, each connected to a subset of receivers, which alleviates this concern. The choice of multicast trees also determines the total bandwidth usage. We approach this problem by breaking it into three sub-problems of partitioning, tree selection, and rate allocation. We present an algorithm, called QuickCast, which is com- putationally fast and allows us to signicantly speed up multiple receivers per multicast transfer with control over extra bandwidth consumption. We evaluate QuickCast against a variety of synthetic and real trac patterns as well as real WAN topologies. Compared to performing bulk multicast transfers as separate unicast transfers, QuickCast achieves up to 3:64 reduction in mean completion times while at the same time using 0:71 the bandwidth. Also, QuickCast allows the top 50% of receivers to complete between 3 to 35 faster on average compared with when a single forwarding multicast tree is used for data delivery. 89 6.1 Background and Related Work In general, it is not required that the receivers of a P2MP transfer complete data reception at the same time. For many applications, speeding up several receivers per P2MP transfer can translate to improved end-user quality of experience and increased availability. For example, faster replication of video content to regional datacenters enhances average user's experience in social media applications or making a newly trained model available at re- gional datacenters allows speedier access to new application features for millions of users. Several recent works focus on improving the performance of unicast transfers over dedicated inter-DC networks [2,44,46,55,57]. However, performing bulk multicast transfers as many separate unicast transfers can lead to excessive bandwidth usage and will increase receiver completion times. Although there exists extensive work on multicasting, it is not possible to apply those solutions to our problem as existing research has focused on dierent goals and considers dierent constraints. For example, earlier research in multicasting aims at dynamically building and pruning multicast trees as receivers join or leave [106], building multicast overlays that reduce control trac overhead and improve scalability [117], or choosing multicast trees that satisfy a xed available bandwidth across all edges as requested by applications [109, 110], minimize congestion within datacenters [113, 114], reduce data re- covery costs assuming some recovery nodes [149], or maximize the throughput of a single multicast ow [111, 112]. To our knowledge, none of the related research eorts aimed at minimizing the mean completion times of receivers for concurrent bulk multicast transfers while considering the overall bandwidth usage, which is the focus of this chapter. In this chapter, we break the bulk multicast transfer routing, and scheduling problem with the objective of minimizing mean completion times of receivers into three sub-problems of the receiver set partitioning, multicast forwarding tree selection per receiver partition, and rate allocation per forwarding tree. We brie y describe each problem as follows. 
Receiver Set Partitioning: As dierent receivers can have dierent completion times, a natural way to improve completion times is to partition receivers into multiple sets with each receiver set having a separate tree. This reduces the eect of slow receivers on faster ones. We employ a partitioning technique that groups receivers of every bulk multicast transfer into multiple partitions according to their mutual distance (in hops) on the inter- DC graph. With this approach, the partitioning of receivers into any N > 1 partitions consumes minimal additional bandwidth on average. We also oer a conguration parame- ter called the partitioning factor that is used to decide on the right number of partitions that 90 create a balance between receiver completion times improvements and the total bandwidth consumption. Forwarding Tree Selection: To avoid heavily loaded routes, multicast trees should be chosen dynamically per partition according to the receivers in that partition and the distri- bution of trac load across network edges. We utilize a computationally ecient approach for forwarding tree selection that connects a sender to a partition of its receivers by as- signing weights to edges of the inter-DC graph, and using a minimum weight Steiner tree heuristic. We dene a weight assignment according to the trac load scheduled on edges and their capacity and empirically show that this weight assignment oers improved receiver completion times at minimal bandwidth consumption. Rate Allocation: Given the receiver partitions and their forwarding trees, formulating the rate allocation for minimizing mean completion times of receivers leads to a hard problem. We consider the popular scheduling policies of fair sharing, Shortest Remaining Processing Time (SRPT), and First Come First Serve (FCFS). We reason why fair sharing is preferred compared to policies that strictly prioritize transfers (i.e., SRPT, FCFS, etc.) for network throughput maximization when focusing on bulk multicast transfers especially ones with many receivers per transfer. We empirically show that using max-min fairness [85], which is a form of fair sharing, we can considerably improve the average network throughput which in turn reduces receiver completion times. Motivating Example Figure 6.1 shows an example of delivering a large object X from source S to destinations ft 1 ;t 2 ;t 3 ;t 4 g which has a volume of 100 units. We have two types of links with capacities of 1 and 10 units of trac per time unit. We can use a single multicast tree to connect the sender to all receivers which will allow us to transmit at the bottleneck rate of 1 to all receivers. However, one can group receivers into two partitions of P 1 and P 2 and attach each partition with a separate multicast tree. Then we can select transmission rates so that we minimize the mean completion times. In this case, assigning a rate of 1 to the tree attached to P 1 and a rate of 9 to the tree attached to P 2 will attain this goal while respecting link capacity over all links (the link attached toS is the bottleneck). As another possibility, we could have assigned a rate of 10 to the tree attached to P 2, allowingft 3 ;t 4 g to nish in 10 units of time, while suspending the tree attached to P 1 until time 11. As a result, the tree attached toP 1 would have started at 11 allowingft 1 ;t 2 g to nish at 110. 
In this dissertation, we aim to improve the speed of several receivers per bulk multicast transfer 91 P2 P1 t 4 t 3 t 2 t 1 S X = 100 C = 1 C = 10 t 4 t 3 t 2 t 1 S X = 100 C = 1 C = 10 rate = 1 rate = 9 rate = 1 Individual Completion Times (Time to deliver X to receiver t i , ∀i) Mean Completion Times t 1 t 2 t 3 t 4 Setup #1 100 100 100 100 100 Setup #2 100 100 11.1 11.1 55.55 Setup #1 Setup #2 Figure 6.1: Using multiple smaller multicast trees we can improve the completion times of several receivers while marginally increasing total network capacity consumption. without hurting the completion times of the slow receivers. In computing the completion times, we ignore the propagation and queuing latencies as the focus of this dissertation is on delivering bulk objects for which the transmission time dominates the propagation or queuing latency along the trees. 6.2 System Model We consider a scenario where bulk multicast transfers arrive at the inter-DC network in an online fashion. We will use the same notations as that of Table 2.2. We will also use some additional denitions as described in Table 6.1. In general, synchronization is not required across receivers of a bulk multicast transfer and therefore, receivers are allowed to complete at dierent times as long as they all receive the multicast object completely. Incoming requests are processed as they come by a trac engineering server that manages the forwarding state of the whole network in a logically centralized manner for installation and eviction of multicast trees. Upon arrival of a request, this server decides on the number of partitions and receivers that are grouped per partition and a multicast tree per partition. Periodically, the TES computes the transmission rates for all multicast trees at the beginning of every timeslot and dispatches them to senders for rate limiting. This allows for a congestion free network since the rates are computed according to link capacity constraints 92 Table 6.1: Denition of variables used in this chapter besides those dened in Table 2.2. Variable Denition U e Edge e's bandwidth utilization, 0U e 1 T A directed Steiner tree connected to a partition of receivers P A partition of receivers of a request R P P P R Sethi of partitions of request R,jP P P R jjD D D R j T P The forwarding tree (i.e., multicast tree) of partition P r T P (t) The transmission rate over T P of partition P2 P P P at timeslot t V r P Residual volume of some partition P2 P P P L e Edge e's total trac load at time t now , i.e., total outstanding bytes scaled by e's inverse capacity p f 1 Conguration parameter; determines a partitioning cost threshold N max Conguration parameter; maximum number of partitions allowed per transfer and other ongoing transfers. To minimize control plane overhead, partitions and forwarding trees are xed once they are established for an incoming transfer. In this context, the bulk multicast transfer routing and scheduling problem can be formally stated as follows. Partitioning Problem: Given an inter-DC network G(V V V G ;E E E G ) with the edge capacity C e ;8e2 E E E G and the set of all partitionsfP2 P P P R j8R2 R R R;V r P > 0g, for a newly arriving bulk multicast transferR new , the trac engineering server needs to compute a set of receiver partitions P P P R each with one or more receivers, and select a forwarding tree T P ;8P2 P P P R . Rate-allocation Problem: Per timeslott, the trac engineering server needs to compute the ratesr T P (t);fP2 P P P R j8R2 R R R;V r P > 0g. 
The objective is to minimize the average time for a receiver to complete data reception while keeping the total bandwidth consumption below a certain threshold compared to the minimum possible, i.e., a minimum edge Steiner tree per transfer. Both the number of ways to partition receivers into subsets and the number of candidate forwarding trees per subset grow exponentially with the problem size. It is, in general, not clear how partitioning and selection of forwarding trees correlate with both receiver completion times and total bandwidth usage. Even the simple objective of minimizing the total bandwidth usage is a hard problem. Also, assuming known forwarding trees, selecting transmission rates per timeslot per tree for minimization of mean receiver completion times is a hard problem. Finally, this is an online problem with unknown future arrivals which adds to the complexity. 93 6.3 Optimizing Receiver Completion Times with Minimum Bandwidth Usage As stated earlier, we need to address the three sub-problems of receiver set partitioning, tree selection, and rate allocation. Since the partitioning sub-problem uses the tree selection sub-problem, we rst discuss tree selection in the following. As the last problem, we will address rate allocation. Since the total bandwidth usage is a function of transfer properties, i.e., number of receivers, transfer volume, and the location of sender and receivers, and the network topology, it is highly sophisticated to design a solution that guarantees a limit on the total bandwidth usage. Instead, we aim to reduce the receiver completion times while minimally increasing bandwidth usage. 6.3.1 Forwarding Tree Selection The tree selection problem states that given a network topology with link capacity knowl- edge, how to choose a Steiner tree that connects a sender to all of its receivers. The objective is to minimize the completion times of receivers 1 while minimally increasing the total band- width usage. Since the total bandwidth usage is directly proportional to the number of edges on selected trees, we would want to keep trees as small as possible. Reduction in completion times can be achieved by avoiding edges that have a large outstanding trac load. For this purpose, we use an approach similar to the one used in Chapter 5 which worked by assigning proper weights to the edges of the inter-DC graph and choosing a minimum weight Steiner tree. The weight assignment we use next also takes into account the variable link capacities over the topology. Weight Assignment: We use the metric of link load L e ;8e2 E E E G that is dened in Table 6.1 and can be computed as L e = 1 Ce P P2P P P Rnew ;8Rj e2E E E T P V r P . Note that this is dierent from what we used in Chapter 5 in that we divide the total outstanding volume of trac allocated on a link by its capacity. We can compute a link's load since we know the remaining volume of current transfers and the edges that they use. A link's load is a measure of how busy it is expected to be in the next few timeslots. It increases as new transfers are scheduled on a link, and diminishes as trac ows through it. To select a forwarding tree from a source to a set of receivers, we use an edge weight of L e + V Rnew Ce and select a minimum weight Steiner tree. The selected tree will most likely exclude any links that are expected to be highly busy. Addition of 1 All receivers on a tree complete at the same time. 
94 Algorithm 5: Forwarding Tree Selection Algorithm Input: Request R new , partition P2 P P P Rnew , G(V V V G ;E E E G ), and L e ;8e2 E E E G Output: A forwarding tree (set of edges) 1 ComputeTree (P;R new ) 2 Assign a weight of (L e + V Rnew Ce ) to every edge e,8e2 E E E G ; 3 Find a minimum weight Steiner tree T P which connects the nodesfS Rnew [Pg; 4 L e L e + V Rnew Ce ; 8e2 E E E T P ; 5 return T P ; the second element in the weight (new request's volume divided by capacity) helps select smaller trees in case there is not much load on most edges. Algorithm 5 applies the weight assignment approach mentioned above to select a for- warding tree that balances the trac load across available trees and nds a minimum weight Steiner tree using the GreedyFLAC heuristic [143]. Inx6.4, we explore a variety of weights for forwarding tree selection as shown in Table 6.4 and see that this weight assignment provides consistently close to minimum values for the three performance metrics of mean and tail receiver completion times as well as total bandwidth usage. Worst-case Complexity: Algorithm 5 computes one minimum weight Steiner tree. For a requestR new , the worst-case complexity of Algorithm 5 is O(jV V V G j 3 jD D D Rnew j 2 +jE E E G j) given the complexity of GreedyFLAC [143]. 6.3.2 Receiver Set Partitioning The maximum transmission rate on a tree is that of the link with minimum capacity. To improve bandwidth utilization of inter-DC backbone, we can replace a large forwarding tree with multiple smaller trees each connecting the source to a subset of receivers. By partitioning, we isolate some receivers from the bottlenecks allowing them to receive data at a higher rate. We aim to nd a set of partitions each with at least one receiver that allows for reducing the average receiver completion times while minimally increasing the bandwidth usage. Bottlenecks may appear either due to competing transfers or dierences in link capacity. In the former case, some edges may be shared by multiple trees which lead to lower available bandwidth per tree. Such conditions may arise more frequently under heavy load. In the latter case, dierences in link capacity can increase completion times especially in large networks and with many receivers per transfer. 95 Receiver set partitioning to minimize the impact of bottlenecks and reduce completion times is a sophisticated open problem. It is best if partitions are selected in a way that no additional bottlenecks are created. Also, increasing the number of partitions may in general increase bandwidth consumption (multiple smaller trees may have more edges in total com- pared to one large tree). Therefore, we need to come up with the right number of partitions and receivers that are grouped per partition. We propose a partitioning approach, called the hierarchical partitioning, that is computationally ecient and uses a partitioning factor to decide on the number of partitions and receivers that are grouped in those partitions. Number of Partitions Transfers may have a highly varying number of receivers. Generally, the number of parti- tions should be computed based on the number of receivers, where they are located in the network, and the network topology. Also, using more partitions can lead to the creation of unnecessary bottlenecks due to shared links. We compute the number of partitions per transfer according to the total trac load on network edges and considering a threshold that limits the cost of additional bandwidth consumption. 
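As a preview of the hierarchical scheme detailed below (and given in full as Algorithm 6), the following sketch builds the agglomerative, average-linkage hierarchy of receiver partitions from pairwise hop distances; each layer of the returned hierarchy is one candidate partitioning, and the caller then picks the layer with the most partitions whose trees fit the p_f budget. This is an illustrative simplification with our own names, not the QuickCast code.

import java.util.ArrayList;
import java.util.List;

/** Sketch: builds the agglomerative (average-linkage) hierarchy of receiver partitions
 *  from pairwise hop distances; layer k of the result has (number of receivers - k) partitions. */
public class ReceiverPartitioning {
    public static List<List<List<Integer>>> buildHierarchy(double[][] hopDist) {
        int n = hopDist.length;
        List<List<Integer>> parts = new ArrayList<>();
        for (int i = 0; i < n; i++) { List<Integer> p = new ArrayList<>(); p.add(i); parts.add(p); }
        List<List<List<Integer>>> layers = new ArrayList<>();
        layers.add(deepCopy(parts));                          // start: every receiver in its own partition
        while (parts.size() > 1) {
            int bi = 0, bj = 1; double best = Double.MAX_VALUE;
            for (int i = 0; i < parts.size(); i++)
                for (int j = i + 1; j < parts.size(); j++) {
                    double d = avgLinkage(parts.get(i), parts.get(j), hopDist);
                    if (d < best) { best = d; bi = i; bj = j; }
                }
            parts.get(bi).addAll(parts.get(bj));              // merge the two closest partitions
            parts.remove(bj);
            layers.add(deepCopy(parts));
        }
        return layers;
    }

    private static double avgLinkage(List<Integer> a, List<Integer> b, double[][] d) {
        double sum = 0;
        for (int x : a) for (int y : b) sum += d[x][y];
        return sum / (a.size() * b.size());
    }

    private static List<List<Integer>> deepCopy(List<List<Integer>> src) {
        List<List<Integer>> out = new ArrayList<>();
        for (List<Integer> p : src) out.add(new ArrayList<>(p));
        return out;
    }
}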
Limitations of Partitioning Partitioning, in general, cannot improve tail completion times of transfers as tail is usually driven by physical resource constraints, i.e., low capacity links or links with high contention. Hierarchical Partitioning We group receivers into partitions according to their mutual distance which is dened as the number of hops on the shortest hop path that connects any two receivers. Hierarchical clustering [150] approaches such as agglomerative clustering can be used to compute the groups by initially assuming that every receiver has its partition and then by merging the two closest partitions at each step which generates a hierarchy of partitioning solutions. Each layer of the hierarchy then gives us one possible solution with a given number of partitions. With this approach, the partitioning of receivers into any N > 1 partitions consumes minimal additional bandwidth on average compared to any other partitioning with N par- titions. That is because assigning a receiver to any other partition will likely increase the total number of edges needed to connect the source to all receivers; otherwise, that receiver would not have been grouped with the other receivers in its current partition in the rst 96 place. There is, however, no guarantee since hierarchical clustering works based on a greedy heuristic. After building a partitioning hierarchy, the algorithm selects the layer with the maximum number of partitions whose total sum of tree weights stays below a threshold that can be congured as a system parameter. Choosing the maximum partitions allows us to minimize the eect of slow receivers given the threshold, which is a multiple of the weight of a single tree that would connect the sender to all receivers and can be looked at as a bandwidth budget. We call the multiplication coecient the partitioning factorp f . Algorithm 6 shows this process in detail. The partitioning factor p f plays a vital role in the operation of QuickCast as it determines the extra cost we are willing to pay in bandwidth for improved completion times. In general, ap f greater than one but close to it should allow partitioning to separate very slow receivers from several other nodes. A p f that is considerably larger than one may generate too many partitions and potentially create many shared links which reduce throughput and additional edges that increase bandwidth usage. If p f is less than one, a single partition will be used. Worst-case Complexity: Algorithm 6 performs multiple calls to the GreedyFLAC [143]. It uses the hierarchical clustering with average linkage which has a worst-case complexity ofO(jD D D Rnew j 3 ). To compute the pairwise distances of receivers, we use breadth rst search with has a complexity ofO(jV V V G j+jE E E G j). Worst-case complexity of Algorithm 6 isO((jV V V G j 3 + jE E E G j)jD D D Rnew j 2 +jD D D Rnew j 3 ). 6.3.3 Rate Allocation To compute the transmission rates per tree per timeslot, one can formulate an optimization problem with the capacity and demand constraints, and consider minimizing the mean receiver completion times as the objective. This is, however, a hard problem and can be modeled using mixed-integer programming by assuming a binary variable per timeslot per tree that shows whether that tree has completed by that timeslot. One can come up with approximation algorithms to this problem which is considered part of the future work. 
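As a concrete illustration of the fair-sharing option adopted later in this chapter, max-min fair rates for the active trees of a single timeslot can be computed by progressive filling: all unfrozen trees are raised at the same pace until some edge saturates, trees crossing that edge are frozen at their current rate, and the process repeats. The sketch below follows this idea; it omits demand caps (a tree that needs less than its fair share) for brevity, and the data layout and names are our own.

import java.util.List;

/** Sketch: per-timeslot max-min fair rates for a set of multicast trees sharing edges. */
public class MaxMinRates {
    /** trees.get(i) lists the edge ids used by tree i; capacity[e] is edge e's
     *  capacity for this timeslot. Returns one rate per tree. */
    public static double[] progressiveFilling(List<List<Integer>> trees, double[] capacity) {
        int n = trees.size();
        double[] rate = new double[n];
        boolean[] frozen = new boolean[n];
        double[] remaining = capacity.clone();
        int active = n;
        while (active > 0) {
            // the smallest equal increment that saturates at least one edge
            double inc = Double.MAX_VALUE;
            for (int e = 0; e < capacity.length; e++) {
                int users = 0;
                for (int i = 0; i < n; i++) if (!frozen[i] && trees.get(i).contains(e)) users++;
                if (users > 0) inc = Math.min(inc, remaining[e] / users);
            }
            if (inc == Double.MAX_VALUE) break;              // no active tree uses any edge
            for (int i = 0; i < n; i++) if (!frozen[i]) rate[i] += inc;
            for (int e = 0; e < capacity.length; e++) {
                int users = 0;
                for (int i = 0; i < n; i++) if (!frozen[i] && trees.get(i).contains(e)) users++;
                remaining[e] -= inc * users;
            }
            // freeze every tree that now crosses a saturated edge
            for (int i = 0; i < n; i++) {
                if (frozen[i]) continue;
                for (int e : trees.get(i)) {
                    if (remaining[e] <= 1e-9) { frozen[i] = true; active--; break; }
                }
            }
        }
        return rate;
    }
}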
We consider the three popular scheduling policies of FCFS, SRPT, and fair sharing according to max-min fairness [85] which have been extensively used for network scheduling. These policies can be applied independently of partitioning and forwarding tree selection techniques. Each one of these three policies has its unique features. FCFS and SRPT both prioritize transfers; the former according to arrival times and the latter according to transfer volumes and so obtain a meager fairness score [151]. SRPT has been extensively 97 Algorithm 6: Compute Partitions and Trees Input: Request R new , G(V V V G ;E E E G ), and L e ;8e2 E E E G Output: Pairs of (partition, forwarding tree) 1 ComputePartitionsAndTrees (R new ;N max ) 2 Assign a weight of (L e + V Rnew Ce ) to e,8e2 E E E G ; 3 Find the minimum weight Steiner tree T Rnew which connects the nodesfS Rnew [ D D D Rnew g and its total weight W T Rnew ; 4 foreachf;g;2 D D D Rnew ;2 D D D Rnew ; 6= do 5 DIST ; number of edges on the minimum hop path from to ; 6 Compute the agglomerative clustering hierarchy for D D D Rnew using average linkage and distance DIST i;j which will have l clusters at layer 1ljD D D Rnew j; 7 for l = min(N max ;jD D D Rnew j) to 2 by1 do 8 P P P l set of clusters at layer l of agglomerative hierarchy, each cluster forms a partition; 9 foreach P2 P P P l do 10 Find the minimum weight Steiner tree T P which connects the nodesfS Rnew [Pg; 11 if P P2P P P l W T P p f W T Rnew then 12 foreach P2 P P P l do 13 T P ComputeTree (P ,R new ); 14 return (P; T P ); 8P2 P P P l ; 15 L e L e + V Rnew Ce ; 8e2T Rnew ; 16 return (D D D Rnew ; T Rnew ); used for minimizing ow completion times within datacenters [152{154]. Strictly prioritizing transfers over forwarding trees (as done by SRPT and FCFS), however, can lead to low overall link utilization and increased completion times, especially when trees are large. This might happen due to bandwidth contention on shared edges which can prevent some transfers from making progress. Fair sharing allows all transfers to make progress which mitigates such contention enabling concurrent multicast transfers to all make progress. In x6.4.3, we empirically compare the performance of these scheduling policies and show that fair sharing based on max-min fairness can signicantly outperform both FCFS and SRPT in average network throughput especially with a larger number of receivers per tree. As a result, we will use QuickCast along with the fair sharing policy based on max-min fairness. The TES periodically computes the transmission rates per multicast tree every timeslot 98 to maximize utilization and cope with inaccurate inter-DC link capacity measurements, imprecise rate limiting, and dropped packets due to corruption. To account for inaccurate rate limiting, dropped packets and link capacity estimation errors, which all can lead to a dierence between the actual volume of data delivered and the number of bytes transmitted, we propose that senders keep track of actual data delivered to their receivers per forwarding tree. At the end of every timeslot, every sender reports to the trac engineering server how much data it was able to deliver allowing it to compute rates accordingly for the timeslot that follows. Newly arriving transfers will be assigned rates starting the next timeslot. 6.4 Evaluation We considered various topologies and transfer size distributions as shown in Tables 6.2 and 6.3. 
Also, for Algorithm 6, unless otherwise stated, we usedp f = 1:1 which limits the overall bandwidth usage while oering signicant gains. In the following sections, we rst evaluated a variety of weight assignments for multicast tree selection considering receiver completion times and bandwidth usage. We showed that the weight proposed in Algorithm 5 oers close to minimum completion times with minimal extra bandwidth consumption. Next, we evaluated the proposed partitioning technique and considered two cases of N max = 2, 2 and N max =jD D D Rnew j. We measured the performance of QuickCast while varying the number of receivers and showed that it oers consistent gains. We also measured the speedup observed by dierent receivers ranked by their speed per multicast transfer, and the eect of partitioning factorp f on the gains in completion times as well as bandwidth usage. In addition, we evaluated the eect of dierent scheduling policies on average network throughput and showed that with increasing number of multicast receivers, fair sharing oers higher throughput compared to both FCFS and SRPT. Finally, we showed that QuickCast is computationally fast by measuring its running time and that the maximum number of group table forwarding entries it uses across all switches is only a fraction of what is usually available in a physical switch across the several considered scenarios. Network Topologies: Table 6.2 shows the list of topologies we considered. These topolo- gies provide capacity information for all links which range from 45 Mbps to 10 Gbps. We normalized all link capacities dividing them by the maximum link capacity. We also assumed all bidirectional links with equal capacity in either direction. 2 Two partitions is the minimum needed to separate several receivers from the slowest receiver per P2MP transfer. 99 Table 6.2: Various topologies used in evaluation. Name Description ANS [4] A backbone and transit network that spans across the United States with 18 nodes and 25 links. All links have equal capacity of 45 Mbps. GEANT [6] A backbone and transit network that spans across the Europe with 34 nodes and 52 links. Link capacity ranges from 45 Mbps to 10 Gbps. UNINETT [155] A large-sized backbone that spans across Norway with 69 nodes and 98 links. Most links have a capacity of 1, 2.5 or 10 Gbps. Trac Patterns: Table 6.3 shows the considered distributions for transfer volumes. Trans- fer arrival followed a Poisson distribution with rate . We considered no units for time or bandwidth. For all simulations, we assumed a timeslot length of = 1:0. For Pareto dis- tribution, we considered a minimum transfer volume equal to that of 2 full timeslots and limited maximum transfer volume to that of 2000 full timeslots. Unless otherwise stated, we considered an average demand equal to volume of 20 full timeslots per transfer for all trac distributions (we xed the mean values of all distributions to the same value). Per simulation instance, we assumed equal number of transfers per sender and for every trans- fer, we selected the receivers from all existing nodes according to the uniform distribution (with equal probability from all nodes). Assumptions: We focused on computing gains and assumed accurate knowledge of inter- DC link capacity, and precise rate control at the end-points which together lead to a con- gestion free network. We also assumed no dropped packets due to corruption or errors, and no link failures. Simulation Setup: We developed a simulator in Java (JDK 8). 
Assumptions: We focused on computing gains and assumed accurate knowledge of inter-DC link capacity and precise rate control at the end-points, which together lead to a congestion free network. We also assumed no dropped packets due to corruption or errors, and no link failures.

Simulation Setup: We developed a simulator in Java (JDK 8). We performed all simulations on one machine (Core i7-6700 and 24 GB of RAM). We used the Java implementation of GreedyFLAC [72] for minimum weight Steiner trees.

6.4.1 Weight Assignment Techniques for Tree Selection

We empirically evaluate and analyze several weights for the selection of forwarding trees. Table 6.4 lists the weight assignment approaches considered for tree selection (please see Table 6.1 for the definition of variables). We considered three edge weight metrics: utilization (i.e., the fraction of a link's bandwidth currently in use), load (i.e., the total volume of traffic that an edge will carry starting from the current time), and load plus the volume of the newly arriving transfer request. We also considered the weight of a tree to be either the weight of its edge with maximum weight or the sum of the weights of its edges. An exponential weight is used to approximate the selection of trees with minimum highest weight, similar to the approach used in [46]. The benefit of weight #6 over #5 is that in case there is no load or minimal load on some edges, selecting the minimum weight tree will lead to minimum edge trees that reduce bandwidth usage. Also, with this approach, we tend to avoid large trees for large transfers, which helps further reduce bandwidth usage.

Table 6.4: Various weights for tree selection for incoming request R_new (weight of edge e, for all e ∈ E_G, and properties of the selected trees).
#1: 1.0 — A fixed minimum edge Steiner tree
#2: exp(U_e) — Minimum highest utilization over edges
#3: exp(L_e) — Minimum highest load over edges
#4: U_e — Minimum sum of utilization over edges
#5: L_e — Minimum sum of load over edges
#6: L_e + V_Rnew/C_e — Minimum final sum of load over edges
#7: 1.0 + exp(U_e)/Σ_{e∈E_G} exp(U_e) — Minimum edges, min-max utilization
#8: 1.0 + exp(L_e)/Σ_{e∈E_G} exp(L_e) — Minimum edges, min-max load
#9: 1.0 + U_e/Σ_{e∈E_G} U_e — Minimum edges, min-sum of utilization
#10: 1.0 + L_e/Σ_{e∈E_G} L_e — Minimum edges, min-sum of load
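The following sketch illustrates weight assignment #6 feeding a minimum weight Steiner tree heuristic. The Edge type and the SteinerHeuristic interface are illustrative placeholders; the heuristic itself (GreedyFLAC in our simulator) is assumed to be provided elsewhere.

import java.util.List;

// Sketch of weight assignment #6: weight(e) = L_e + V_Rnew / C_e, i.e., the
// edge's outstanding load plus the time the new transfer alone would need on
// that edge. The minimum weight Steiner tree routine is abstracted behind an
// assumed interface.
public class LoadAwareTreeSelection {
    public static class Edge {
        double load;      // L_e
        double capacity;  // C_e
        double weight;
    }

    public interface SteinerHeuristic {
        List<Edge> minWeightSteinerTree(List<Edge> weightedEdges, List<Integer> terminals);
    }

    public static List<Edge> selectTree(List<Edge> edges, double newTransferVolume,
                                        List<Integer> terminals, SteinerHeuristic heuristic) {
        for (Edge e : edges) {
            e.weight = e.load + newTransferVolume / e.capacity; // weight #6 from Table 6.4
        }
        return heuristic.minWeightSteinerTree(edges, terminals);
    }
}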
Figure 6.2 shows our simulation results of receiver completion times for bulk multicast transfers with 10 receivers and a fixed arrival rate of λ = 1. We considered both light-tailed and heavy-tailed transfer volume distributions.

Figure 6.2: Evaluation of various weights for tree selection (F, S and M refer to the scheduling policies FCFS, SRPT and Fair Sharing, respectively). The figure reports mean receiver completion times, tail receiver completion times, and total bandwidth used for each weight (#1 to #10) over the ANS and GEANT topologies under light-tailed and heavy-tailed traffic, with every result binned by its distance from the minimum (less than 10%, 20%, 30%, 40%, 50%, or at least 50% above the minimum).

Techniques #1, #7, #8, #9 and #10 all used minimal edge Steiner trees, and so offer minimum bandwidth usage. However, this comes at the cost of increased completion times, especially when edges have non-homogeneous capacities. Techniques #2 and #4 use utilization as the criterion for load balancing. Minimizing maximum link utilization has long been a popular objective for traffic engineering over WAN. As can be seen, they have the highest bandwidth usage compared to other techniques (up to 40% above the minimum) for almost all scenarios, while their completion times are at least 20% worse than the minimum for several scenarios. Techniques #3, #5, and #6 operate based on link load (i.e., the total outstanding volume of traffic per edge), among which technique #3 (minimizing maximum load) has the highest variation between best and worst case performance (up to 40% worse than the minimum in mean completion times). Techniques #5 and #6 (minimizing the sum of load excluding and including the new multicast request, respectively), on the other hand, offer consistently good performance that is up to 13% above the minimum (for all performance metrics) across all scheduling policies, topologies, and traffic patterns. These techniques offer lower completion times for the GEANT topology with non-uniform link capacity. Technique #6 also provides slightly better bandwidth usage and better completion times compared to #5 for the majority of scenarios (not shown).
Our proposals rely on technique #6 for the selection of load-aware forwarding trees, as shown in Algorithm 5.

6.4.2 Receiver Set Partitioning

Receiver set partitioning allows the separation of faster receivers from the slowest (or slower) ones. This is essential to improve network utilization and speed up transfers when there are competing transfers or physical bottlenecks. For example, both GEANT and UNINETT have edges that vary by at least a factor of 10 in capacity. We evaluate QuickCast over a variety of scenarios.

Effect of Number of Receivers

We provide an overall comparison of several schemes (QuickCast, Single Load-Aware Steiner Tree, and DCCast [83]) along with two basic solutions, a minimum edge Steiner tree and unicast minimum hop path routing, as shown in Figure 6.3. We also considered both light and heavy load regimes. We used real inter-DC traffic patterns reported by Facebook for the two applications of Cache-Follower and Hadoop [12]. Also, all schemes use the fair sharing rate allocation based on max-min fairness except DCCast, which uses the FCFS policy.

The minimum edge Steiner tree leads to the minimum bandwidth consumption. The unicast minimum hop path routing approach separates all receivers per bulk multicast transfer. It, however, uses a significantly larger volume of traffic and also does not offer the best mean completion times for the following reasons. First, it exhausts network capacity quickly, which increases tail completion times by a significant factor (not shown here). Second, it can lead to many additional shared links that increase contention across flows and reduce throughput per receiver. The significant increase in completion times at higher percentiles increases the average completion times of the unicast approach.

With N_max = |D_Rnew|, we see that QuickCast offers the best mean and median completion times, i.e., up to 2.84× less compared to QuickCast with N_max = 2, up to 3.64× less compared to unicast minimum hop routing, and up to 3.33× less than the single load-aware Steiner tree. To achieve this gain, QuickCast with N_max = |D_Rnew| uses at most 1.49× more bandwidth compared to using minimum edge Steiner trees, which is still 1.4× less than the bandwidth usage of unicast minimum hop routing. We also see that while increasing the number of receivers, QuickCast with N_max = |D_Rnew| offers consistently small median completion times by separating fast and slow receivers, since the number of partitions is not limited. Overall, we see a higher gain under light load as there is more capacity available to utilize. We also recognize that QuickCast with either N_max = 2 or N_max = |D_Rnew| performs almost always better than unicast minimum hop routing in mean completion times.

Speedup by Receiver Rank

Figure 6.4 shows how QuickCast can speed up multiple receivers per transfer by separating them from the slower receivers. The gains are normalized by the case where a single partition is used per bulk multicast transfer. When the number of partitions is limited to two, similar to [84], the highest gain is usually obtained by the first two to three receivers; by allowing more partitions, we can get considerably higher gain for a significant fraction of receivers. Also, by not limiting the partitions to two, we see higher gains for all receiver ranks, above 2× for multiple receiver ranks. This comes at the cost of higher bandwidth consumption, which we saw earlier in the previous experiment.
Figure 6.3: Various schemes for bulk multicast transfers over the GEANT topology under (a) λ = 1 (heavy load) and (b) λ = 0.001 (light load). All schemes use max-min fair rates except for DCCast, which uses FCFS. The panels show mean completion times, median completion times, and total bandwidth versus the number of receivers for the Cache-Follower and Hadoop traffic patterns of Table 6.3; plots are normalized by the minimum (lower is better).

Figure 6.4: Mean receiver completion time speedup (larger is better) of receivers compared to the single load-aware Steiner tree (Algorithm 5) by their rank (receivers sorted by their speed from fastest to slowest per transfer) over the GEANT and UNINETT topologies with 4 and 16 receivers. Receivers were selected according to the uniform distribution from all nodes, and we considered λ = 1.

Partitioning Factor

The performance of QuickCast as a function of the partitioning factor p_f is shown in Figure 6.5, where gains are normalized by the single load-aware Steiner tree, which uses a single partition per bulk multicast transfer. We computed per receiver mean and 95th percentile completion times as well as bandwidth usage. As can be seen, bandwidth consumption increases with the partitioning factor as more requests' receivers are partitioned into two or more groups. The gains in completion times keep increasing as we increase p_f if N_max is not limited. That, however, can ultimately lead to unicast delivery to all receivers (i.e., every receiver as a separate partition) and excessive bandwidth usage. We see a diminishing-returns type of curve as p_f is increased, with the highest returns coming when we increase p_f from 1 to 1.1 (marked with a green dashed line). That is because using too many partitions can saturate network capacity while not improving the separation of fast and slow nodes considerably.

At p_f = 1.1, we see up to 10% additional bandwidth usage compared to the single load-aware Steiner tree, while mean completion times improve by between 40% and 50%. According to other experiments not shown here, with large p_f it is even possible to see reductions in gain that come from excessive bandwidth consumption and increased contention over capacity. Note that this experiment was performed considering four receivers per bulk multicast transfer.
Using more receivers can lead to more bandwidth usage for the same p_f, an increased slope at values of p_f close to 1, and faster saturation of network capacity as we increase p_f. Therefore, using a smaller p_f is preferred with more receivers per transfer.

Figure 6.5: Performance of QuickCast as a function of the partitioning factor p_f over the GEANT and UNINETT topologies. The panels show mean speedup, 95th percentile speedup, and total bandwidth relative to the single load-aware tree for light-tailed and heavy-tailed traffic. We assumed 4 receivers and an arrival rate of λ = 1.

6.4.3 Effect of Rate Allocation Policies

As explained earlier in §6.3.3, when scheduling traffic over large forwarding trees, fair sharing can sometimes offer significantly higher throughput and hence better completion times. We performed an experiment over the ANS topology with both light-tailed and heavy-tailed traffic distributions. The ANS topology has uniform link capacity across all edges, which helps us rule out the effect of capacity variations on the throughput obtained via different scheduling policies. We also considered an increasing number of receivers from 4 to 8 and 16. Figure 6.6 shows the results. We see that fair sharing offers a higher average throughput across all ongoing transfers compared to FCFS and SRPT, and that with more receivers, the benefit of using fair sharing increases, reaching up to 1.5× with 16 receivers per transfer.

Figure 6.6: Average throughput of bulk multicast transfers obtained by running different scheduling policies (fair sharing with max-min fairness, FCFS, and SRPT) on the ANS topology with light-tailed and heavy-tailed traffic and 4, 8, and 16 receivers. We started 100 transfers at time zero, and senders and receivers were selected according to the uniform distribution. Each group of bars is normalized by the minimum in that group.

6.4.4 Running Time

To ensure the scalability of the proposed algorithms, we measured their running time over various topologies (with different sizes) and with varying rates of arrival. We assumed two arrival rates of λ = 0.001 and λ = 1, which account for light and heavy load regimes. We also considered eight receivers per transfer and all three topologies of ANS, GEANT, and UNINETT. We saw that the running times of Algorithms 5 and 6 remained below one millisecond and 20 milliseconds, respectively, across all of these scenarios. These numbers are less than the propagation latency between the majority of senders and receivers over the considered topologies (a simple TCP handshake would take at least twice the propagation latency). More efficient realization of these algorithms can further reduce their running time (e.g., implementation in C/C++ instead of Java).

6.4.5 Forwarding Plane Resource Usage

QuickCast can be realized using software-defined networking and OpenFlow compatible switches. To forward packets to multiple outgoing ports on switches where trees branch out to numerous edges, we can use group tables, which have been supported by OpenFlow since early versions.
Besides, an increasing number of physical switch makers have added support for group tables. To allow forwarding to multiple outgoing ports, the group table entries should be of type "ALL", i.e., OFPGT_ALL in the OpenFlow specifications. Group table entries are highly scarce (compared to TCAM entries) and so should be used with care. Some new switches support 512 or 1024 entries per switch. Another critical parameter is the maximum number of action buckets per entry, which primarily determines the maximum possible branching degree for trees. Across the switches we looked at, we found that the minimum supported value was 8 action buckets, which should be enough for WAN topologies as most of them do not have nodes with such a high connectivity degree.

In general, reasoning about the number of group table entries needed to realize different schemes is hard since it depends on how the trees are formed, which is highly intertwined with the edge weights that depend on the distribution of load. For example, consider a complete binary tree with 8 receivers as leaves and the sender at the root. This will require 6 group table entries to transmit to all receivers, with two action buckets per intermediate node on the tree (branching at the sender does not need a group table entry). If instead we used an intermediate node to connect to all receivers with a branching degree of 8, we would only need one group table entry with eight action buckets.

We measured the number of group table entries needed to realize QuickCast. We computed the average of the maximum, and the maximum of the maximum, number of entries used per switch during the simulation for the topologies of ANS, GEANT, and UNINETT, with arrival rates of λ = 0.001 and λ = 1, considering both light-tailed and heavy-tailed traffic patterns and assuming that each bulk multicast transfer had eight receivers. The experiment was terminated when 200 transfers had arrived. Looking at the maximum helps us see whether there are enough entries at all times to handle all concurrent transfers. Interestingly, we saw that by using multiple trees per transfer, both the average and the maximum of the maximum number of group table entries used were less than when a single tree was used per transfer. One reason is that using a single tree slows down faster receivers, which may lead to more concurrent receivers that increase the number of group entries. Also, by partitioning receivers, we make subsequent trees smaller and allow them to branch out closer to their receivers, which balances the use of group table entries across the switches, reducing the maximum. Finally, by using more partitions, the maximum number of times a tree needs to branch to reach all of its receivers decreases.

Across all the scenarios considered above, the maximum of the maximum group table entries at any timeslot was 123, and the average of the maximum was at most 68 for QuickCast. Furthermore, by setting N_max = |D_Rnew|, which allows for more partitions, the maximum of the maximum group table entries decreased by up to 17% across all scenarios.
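To make this accounting concrete, the sketch below counts the group table entries and action buckets each switch would need to realize a single forwarding tree under the convention used here (one entry of type ALL per branching node, with branching at the sender handled by the source itself). The tree representation is an assumption of this example.

import java.util.*;

// Sketch: count group table entries needed to realize one forwarding tree.
// A node that forwards onto two or more outgoing tree edges needs one group
// entry with one action bucket per outgoing edge; the sender needs no entry.
public class GroupTableUsage {
    // childrenOf maps each tree node to its children in the forwarding tree.
    public static Map<Integer, Integer> bucketsPerSwitch(
            Map<Integer, List<Integer>> childrenOf, int sender) {
        Map<Integer, Integer> buckets = new HashMap<>();
        for (Map.Entry<Integer, List<Integer>> n : childrenOf.entrySet()) {
            int node = n.getKey();
            int outDegree = n.getValue().size();
            if (node != sender && outDegree >= 2) {
                buckets.put(node, outDegree); // one entry, outDegree action buckets
            }
        }
        return buckets;
    }

    public static void main(String[] args) {
        // Complete binary tree: sender 0 -> {1, 2}, 1 -> {3, 4}, ..., 8 leaf receivers.
        Map<Integer, List<Integer>> t = new HashMap<>();
        for (int n = 0; n <= 6; n++) t.put(n, Arrays.asList(2 * n + 1, 2 * n + 2));
        Map<Integer, Integer> use = bucketsPerSwitch(t, 0);
        System.out.println(use.size() + " group entries"); // 6, matching the example above
    }
}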
6.5 Conclusions

Many P2MP transfers do not require that all receivers finish reception at the same time. Moreover, attaching all receivers of a P2MP transfer to the sender using a single forwarding tree limits the speed of all receivers to that of the slowest one. We introduced the bulk multicast routing and scheduling problem to minimize the mean completion times of receivers and split it into the three sub-problems of receiver set partitioning, tree selection, and rate allocation. We then presented QuickCast, which applies three heuristic techniques to offer approximate solutions to these three hard sub-problems. We performed extensive evaluations to validate the effectiveness of QuickCast. In general, the gains are a function of network connectivity, link capacities, and transfer properties. Considering multiple network topologies and transfer size distributions, we found that QuickCast offers significant speedups for multiple receivers per P2MP transfer while negligibly increasing the total bandwidth consumption. Interestingly, we also found that the number of forwarding rules at network switches needed to realize QuickCast can be considerably less than when a single forwarding tree is used per P2MP transfer, which makes it more practical.

Chapter 7
Mixed Completion Time Objectives for P2MP Transfers over Inter-DC Networks

Bulk transfers from one to multiple datacenters can have many different completion time objectives, ranging from quickly replicating some k copies to minimizing the time by which the last destination receives a full replica. We design an SDN-style wide-area traffic scheduler that optimizes different completion time objectives for various requests. The scheduler builds, for each bulk transfer, one or more multicast forwarding trees which preferentially use lightly loaded network links. Multiple multicast trees are used per bulk transfer to insulate destinations that have higher available bandwidth, and can hence finish quickly, from congested destinations.

When the receivers of a bulk multicast transfer have very different network bandwidth available on paths from the sender, the slowest receiver dictates the completion time for all receivers. As discussed in Chapter 6, using multiple multicast trees to separate the faster receivers improves the average receiver's completion time. However, each additional tree consumes more network bandwidth, and at the extremum, this idea devolves to one tree per receiver. We aim to answer the following questions: 1. What is the right number of trees per transfer? 2. Which receivers should be grouped in each tree?

We analyze a relaxed version of this partitioning problem where each partition is a subset of receivers attached to the sender with a separate forwarding tree. We first propose a partitioning technique that reduces the average completion times of receivers by isolating slow and fast receivers. We study this approach in the relaxed setting of a congestion-free network core, i.e., the links in/out of the datacenters are the capacity bottlenecks, and considering max-min fair rate allocation from the underlying network. We then develop a partitioning technique for real-world inter-datacenter networks, without relaxations, inspired by the findings from studying the relaxed scenario. The partitioning technique operates by building a hierarchy of valid partitioning solutions and selecting the one that offers the best average receiver completion times. Our evaluation of this partitioning technique on real-world topologies, including ones with bottlenecks in the network core, shows that the technique yields completion times that are close to a lower bound and hence nearly optimal.
Moreover, we incorporate binary objective vectors, which allow applications to indicate transfer-specific objectives for receivers' completion times. Using the application-provided objective vectors, we can optimize for mixed completion time objectives based on the trade-off between total network capacity consumption and the receivers' average completion times. We present the Iris heuristic, which computes a partitioning of receivers for every transfer given a binary objective vector. Iris aims to minimize the completion time of receivers whose rank is indicated by applications/users with a one in the objective vector while saving as much bandwidth as possible by grouping receivers whose ranks are indicated with consecutive zeros in the objective vector.

Iris operates in a logically centralized manner, receives bulk multicast transfer requests from end-points, and computes receiver partitions along with their multicast forwarding trees. We create forwarding trees using group tables [156]. Iris uses a RESTful API to communicate with the end-points, allowing them to specify their transfer properties and requirements (i.e., objective vectors), using which it computes and installs the required rules in the forwarding plane. We believe our techniques are easily applicable in today's inter-DC networks [2, 42, 43].

We perform extensive simulations and Mininet emulations with Iris using synthetic and real-world Facebook inter-DC traffic patterns over large WAN topologies. Simulation results show that Iris speeds up transfers to a small number of receivers (e.g., 8 receivers) by 2× in average completion time while the bandwidth used is 1.13× compared to the state-of-the-art. Transfers with more receivers receive larger benefits. For transfers to at least 16 receivers, 75% of the receivers complete at least 5× faster and the fastest receiver completes 2.5× faster compared to the state-of-the-art. Compared to performing multicast as multiple unicast transfers with shortest path routing, Iris reduces mean completion times by about 2× while using 0.66× the bandwidth. Finally, Mininet emulations show that Iris reduces the maximum number of group table entries needed by up to 3×.

Motivating Examples: Back-end geo-distributed applications running on datacenters can have different requirements on how their objects are replicated to other datacenters. Hence, inter-DC traffic is usually a mix of transfers with various completion time objectives. For example, while replicating n copies of an object to n different datacenters/locations, one application may want to transfer k copies quickly to any k among the n given receivers, and another application may want to minimize the time when the last copy finishes. In the former case, grouping the slower n−k receivers into one partition consumes less bandwidth, and this spare bandwidth could be used to speed up other transfers. In the latter case, by grouping all receivers except the slowest receiver together (i.e., into one tree), we can isolate them from the slowest receiver with minimal bandwidth consumption. Minimizing the completion times of all receivers is another possible objective. Our technique takes as input a binary objective vector whose i-th element expresses interest in the completion time of the i-th fastest receiver; it aims to minimize the completion times of receivers whose rank is set to one in this objective vector.
It is easy to see that the following values of the objective vector achieve the goals discussed so far: with k = 1 and n = 3, the vectors {1, 0, 0}, {0, 0, 1} and {1, 1, 1} aim to minimize the completion time of the fastest k out of n receivers, the slowest receiver, and all receivers, respectively.

7.1 System Model

Similar to previous chapters, a TES runs our algorithms in a logically centralized manner to decide how traffic is forwarded in-network. P2MP transfers are processed in an online fashion as they arrive, with the main objective of optimizing completion times. Also, forwarding entries, which are installed for every transfer upon arrival, are fixed until the transfer's completion and may only be updated in case of failures.

We consider max-min fair [85] rate allocation across multicast forwarding trees. Traffic is transmitted at the same rate from the source to all the receivers attached to a forwarding tree. To reach max-min fair rates, such rates can either be computed centrally over specific time periods, i.e., timeslots, and then be used for end-point traffic shaping, or end-points can gradually converge to such rates in a distributed fashion in a way similar to TCP [115] (fairness is considered across trees). In our evaluations, we will consider the former approach for increased network utilization. Using a fair sharing policy addresses the starvation problem (as in the SRPT policy) and prevents larger transfers from blocking edges (as in the FCFS policy).

We use the notion of objective vectors to allow applications to define transfer-specific requirements, which in general can improve overall system performance and reduce bandwidth consumption. An objective vector for a transfer is a vector of zeros and ones that is the same size as the number of receivers of that transfer. From left to right, binary digit i in this vector is associated with the i-th fastest receiver. A one in the objective vector indicates that we are specifically interested in the completion time of the receiver associated with that rank in the vector. By assigning zeros and ones to different receiver ranks, it is possible to respect different applications' preferences or requirements while allowing the system to further optimize bandwidth consumption. The application/user, however, need not be aware of the mapping between the downlink speeds (rank in the objective vector) and the receivers themselves. Table 7.1 offers several examples. For instance, an objective vector of {0, 0, 0, 1, 0, 0, 0, 0} indicates the application's interest in the fourth fastest receiver. To respect the application's objective, we initially isolate the fourth receiver and do not group it with any other receiver. The first three fastest receivers can be grouped into a partition to save bandwidth. The same goes for the four slowest receivers. However, we do not group all receivers indicated with zeros into one partition initially (i.e., the top three receivers and the bottom four) to avoid slowing some of them down unnecessarily (in this case, the top three receivers). This forms the basis for the partitioning technique proposed in §6.3.2, which operates by building a hierarchy with multiple layers, where each layer is a valid partitioning solution, and selects the layer that gives the smallest average receiver completion times.

Table 7.1: Behavior of several objective vectors (given n receivers).
{1, ..., 1} (n ones) — Interested in the completion times of all individual receivers.
{1, ..., 1, 0, ..., 0} (k ones followed by n−k zeros) — Interested in the completion times of the top k receivers (groups the bottom n−k receivers to save bandwidth).
{0, ..., 0, 1, 0, ..., 0} (k−1 zeros, a one, then n−k zeros) — Interested in the completion time of the k-th receiver (groups the top k−1 receivers into a fast partition and the bottom n−k receivers into a slow one to save bandwidth).
{0, ..., 0} (n zeros) — Not interested in the completion time of any specific receiver (all receivers form a single partition).
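The sketch below shows how base partitions could be derived from a rank-sorted receiver list and a binary objective vector following the grouping rule just described; the names are illustrative and the sketch omits the tree selection that follows.

import java.util.*;

// Sketch: build base partitions from receivers sorted fastest to slowest and a
// binary objective vector of the same length. Receivers marked 1 become
// singleton partitions; runs of consecutive 0s are grouped into one partition.
public class BasePartitions {
    public static List<List<Integer>> build(List<Integer> receiversByRank, int[] objective) {
        List<List<Integer>> partitions = new ArrayList<>();
        List<Integer> currentZeroRun = new ArrayList<>();
        for (int i = 0; i < receiversByRank.size(); i++) {
            if (objective[i] == 1) {
                if (!currentZeroRun.isEmpty()) {          // close the pending run of 0s
                    partitions.add(currentZeroRun);
                    currentZeroRun = new ArrayList<>();
                }
                partitions.add(Collections.singletonList(receiversByRank.get(i)));
            } else {
                currentZeroRun.add(receiversByRank.get(i));
            }
        }
        if (!currentZeroRun.isEmpty()) partitions.add(currentZeroRun);
        return partitions;
    }

    public static void main(String[] args) {
        // The example above, {0,0,0,1,0,0,0,0}, over receivers ranked 1..8.
        List<Integer> ranked = Arrays.asList(1, 2, 3, 4, 5, 6, 7, 8);
        int[] omega = {0, 0, 0, 1, 0, 0, 0, 0};
        System.out.println(build(ranked, omega)); // [[1, 2, 3], [4], [5, 6, 7, 8]]
    }
}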
Problem Statement: Given an inter-DC topology with known available bandwidth per link, the traffic engineering server is responsible for partitioning receivers and selecting a forwarding tree per partition for every incoming bulk multicast transfer. A bulk multicast transfer is specified by its source, its set of receivers, and the volume of data to be delivered. The primary objective is minimizing average receiver completion times. In case an objective vector is specified, we want to minimize the average completion times of receivers whose ranks are indicated with a 1 in the vector, as well as, as groups, receivers indicated with consecutive 0's in the vector (receivers noted with consecutive 0's use the same forwarding tree and will have the same completion times). Minimizing bandwidth consumption, which is directly proportional to the size of the selected forwarding trees, is considered a secondary objective.

7.1.1 Online Greedy Optimization Model

The online bulk multicast partitioning and forwarding tree selection problem can be formulated using Eq. 7.1-7.3, with the added constraint that our rate allocation is max-min fair across forwarding trees for any selection of the partitions and the trees. We will use the notation defined in Table 2.2 as well as that in Table 7.2. The set R includes both the new transfer R_new and all the transfers already in the system for which we already have the partitions and forwarding trees. The optimization objective of Eq. 7.1 is to minimize the weighted sum of completion times of the receivers of all requests R ∈ R according to their objective vectors, and the total bandwidth consumption of R_new given by partitioning its receivers and selecting their forwarding trees (indicated by the term Σ_{P ∈ P_Rnew} V_P |T_P|). Operators can choose the non-negative coefficient η according to the overall system objective to give a higher weight to minimizing the weighted completion times of receivers than to reducing bandwidth consumption. Eq. 7.2 shows the demand constraints, which state that the total sum of transmission rates over every tree for future timeslots is equal to the remaining volume of data per partition (each partition uses one tree). Eq. 7.3 presents the capacity constraints, which state that the total sum of transmission rates per timeslot over all trees that share a common edge must not go beyond its available bandwidth.

Table 7.2: Definition of variables used in this chapter besides those defined in Table 2.2.
T — A directed Steiner tree.
r_T(t) — The transmission rate over tree T at timeslot t.
P — A receiver partition of some request.
P_R — Set of partitions of some request R.
T_P — The forwarding tree of partition P.
V^r_P — Current residual volume of partition P of request R.
τ_P — Estimated minimum completion time of partition P.
L_e — Edge e's total load (see §7.3.1).
ω_R — Objective vector assigned to request R.
ω*_R — Weighted completion time vector computed from ω_R by replacing the last zero in each run of consecutive zeros with the number of zeros in that run (e.g., ω_R = {0, 0, 0, 1, 0, 0} → ω*_R = {0, 0, 3, 1, 0, 2}).
τ_R — Vector of completion times of the receivers of request R sorted from fastest to slowest.
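As a small illustration of the weighted completion time vector defined in Table 7.2, the following sketch derives it from a binary objective vector; the method and class names are ours.

import java.util.Arrays;

// Sketch: derive the weighted completion time vector (omega*) from a binary
// objective vector by replacing the last zero of each run of consecutive zeros
// with the length of that run, e.g., {0,0,0,1,0,0} -> {0,0,3,1,0,2}.
public class WeightedObjectiveVector {
    public static int[] toWeights(int[] omega) {
        int[] weights = Arrays.copyOf(omega, omega.length);
        int runLength = 0;
        for (int i = 0; i < omega.length; i++) {
            if (omega[i] == 0) {
                runLength++;
                // If the run ends here (next is 1 or end of vector), record its length.
                if (i + 1 == omega.length || omega[i + 1] == 1) {
                    weights[i] = runLength;
                    runLength = 0;
                }
            }
        }
        return weights;
    }

    public static void main(String[] args) {
        System.out.println(Arrays.toString(toWeights(new int[]{0, 0, 0, 1, 0, 0})));
        // prints [0, 0, 3, 1, 0, 2]
    }
}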
\min \;\; \eta \sum_{R \in \mathbf{R}} \tau_R \cdot \omega^{*}_{R} \; + \; \sum_{P \in \mathbf{P}_{R_{new}}} V_P \, |T_P|  \qquad (7.1)

Subject to

\sum_{t} r_{T_P}(t) = V^{r}_{P} \qquad \forall P \in \mathbf{P}_R,\; R \in \mathbf{R}  \qquad (7.2)

\sum_{\{P \in \mathbf{P}_R,\, R \in \mathbf{R} \,\mid\, e \in T_P\}} r_{T_P}(t) \le B_e(t) \qquad \forall t, e  \qquad (7.3)

This online discrete optimization problem is highly complex, as it is unclear how receivers should be partitioned into multiple subsets to reduce completion times and there is an exponential number of possibilities. The selection of forwarding trees to minimize completion times is also a hard problem. In §7.3, we will present a heuristic that aims to approximate a solution to this optimization problem, inspired by the findings in §7.2.

7.2 Partitioning of Receivers on a Relaxed Topology

Due to the high complexity of the partitioning problem as a result of the physical topology, we first study a relaxed topology where every datacenter is attached with a single uplink/downlink to a network with infinite core capacity (and so the network core cannot become a bottleneck). As shown in Figure 7.1, the sender has a maximum uplink rate of r_s and transmits to a set of n receivers with different maximum downlink rates of r_i, ∀i ∈ {1, ..., n}.

Figure 7.1: A relaxed topology with infinite core capacity, and uplink and downlink capacities of r_s and r_1 ≥ ... ≥ r_n.

In §7.3.1, we discuss a load-balancing forwarding tree selection approach that aims to distribute load across the network to minimize the effect of bottlenecks within the network core. Also, inspired by the findings in this section, we will develop an effective partitioning heuristic in §6.3.2.

Without loss of generality, let us also assume that the receivers in Figure 7.1 are sorted by their downlink rates in descending order. The sender can initiate multicast flows to any partition (i.e., a subset of receivers) given that every receiver appears in exactly one partition. All receivers in a partition will have the same multicast rate, which is the rate of the slowest receiver in the partition. To compute rates at the uplink, we consider the max-min fair rate allocation policy (see §7.1). In this context, we would like to compute the number of partitions as well as the receivers that should be grouped per partition to minimize mean completion times.

Theorem 1. Given receivers sorted by their downlink rates, partitioning that groups consecutive receivers is Pareto-optimal with regard to minimizing completion times.
Proof. We use proof by contradiction. Let us assume a partitioning where non-consecutive receivers are grouped together, that is, there exist two partitions P_1 and P_2 where part of partition P_1 falls in between receivers of P_2 or the other way around. Let us call the slowest receivers of P_1 and P_2, j_1 and j_2, respectively. Across j_1 and j_2, let us pick the fastest and call it f(j_1, j_2). If f(j_1, j_2) = j_1 (i.e., in the non-decreasing order of downlink speed from left to right, P_2 appears before P_1, as in P_2{...} P_1{..., j_1} P_2{..., j_2} ...), then by swapping the fastest receiver in P_2 and j_1, we can improve the rate of P_1 while keeping the rate of P_2 the same. If f(j_1, j_2) = j_2, then by swapping the fastest receiver in P_1 and j_2, we can improve the rate of P_2 while keeping the rate of P_1 the same. This can be done in both cases without changing the number of partitions or the number of receivers per partition across all partitions. Since the new partitioning has a higher or equal achievable rate for one of the partitions, the total average completion times will be less than or equal to that of the original partitioning, which means the original partitioning could not have been optimal.

7.2.1 Our Partitioning Approach

Based on Theorem 1, the number of possible partitioning scenarios that can be considered for minimum average completion times is the number of compositions of the integer n, that is, 2^{n−1}, which can be a large space to search. To reduce complexity, we isolate slow receivers from the rest of the receivers to minimize their effect. In other words, given an integer 1 ≤ M ≤ n, we group the first n−M+1 fastest receivers into one partition and keep the rest of the receivers as separate 1-receiver partitions (M−1 in total). Since we do not know the value of the integer M, we try all possible values, that is, n in total, which helps us find the right threshold for the separation of fast and slow receivers. In particular, we compute the total average downlink rate of all receivers for the given transfer for every value of M and select the M that maximizes the average rate (or, alternatively, minimizes the average completion times of the receivers). As shown in Figure 7.1, the uplink at the sender has a rate of r_s, which will be divided across all the multicast flows that deliver data to the receivers. Isolating a slow receiver only takes a small fraction of the sender's uplink, which is why this technique is effective, as we will later see in the evaluations. An example of this approach and how it compares with the optimal solution is shown in Figure 7.2, where our solution selects M = 3 partitions, isolating the two slow receivers.

Figure 7.2: Various partitioning solutions for a scenario with four receivers (isolating slow receivers, which is our approach, the optimal solution, and isolating fast receivers, with mean receiver rates of 4.5, 5, and 2.75, respectively). Numbers show the downlink and uplink speeds of nodes, and curly brackets indicate the partitions where all nodes in a partition receive data at the same rate. The objective is to maximize the average rate of receivers given the max-min fairness policy.

A main determining factor in the effectiveness of this approach is how r_s compares with Σ_{1≤i≤n} r_i. If r_s is larger, then simply using n partitions will offer the maximum total rate to the receivers. The opposite is when r_s is much smaller than Σ_{1≤i≤n} r_i, in which case using a single partition offers the highest total rate. In other cases, given the partitioning approach mentioned above, the worst-case scenario happens when there are many slow receivers and only a handful of fast receivers. An example is shown in Figure 7.3. In the scenario on the left, our approach groups all the receivers into one partition where they all receive data at the rate of one. That is because by isolating slow receivers we can either get a rate of one, or less than one if we isolate more than nine slow receivers, which means using one partition is enough. The optimal case, however, groups all the slow receivers into one partition. In general, scenarios like this rarely happen as the number of slow receivers over inter-datacenter networks is usually small, i.e., most datacenters are connected using high capacity links with large available bandwidth.

Figure 7.3: A worst-case scenario for the proposed partitioning approach (a sender with 20 receivers, where our approach attains a mean receiver rate of 1 while the optimal attains 1.4). Numbers within the nodes show the downlink and uplink speeds of nodes, and curly brackets indicate the partitions where all nodes in a partition receive data at the same rate. The objective is to maximize the average rate of receivers given the max-min fairness policy.
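A minimal sketch of this M-selection procedure on the relaxed topology follows, assuming max-min fair (water-filling) shares of the sender uplink with each flow capped by its partition's slowest downlink; names and data layout are illustrative.

import java.util.Arrays;

// Sketch: for every M, group the n-M+1 fastest receivers into one partition
// and keep the M-1 slowest as singletons, allocate max-min fair shares of the
// sender uplink (each flow capped by its partition's slowest downlink), and
// keep the M that maximizes the average receiver rate.
public class RelaxedPartitioning {
    // Max-min fair shares of 'total' among flows with per-flow caps (water-filling).
    static double[] maxMinShares(double[] caps, double total) {
        Integer[] order = new Integer[caps.length];
        for (int i = 0; i < caps.length; i++) order[i] = i;
        Arrays.sort(order, (a, b) -> Double.compare(caps[a], caps[b]));
        double[] share = new double[caps.length];
        double left = total;
        for (int k = 0; k < order.length; k++) {
            double fair = left / (order.length - k);
            int i = order[k];
            share[i] = Math.min(caps[i], fair);
            left -= share[i];
        }
        return share;
    }

    // downlinks sorted descending (fastest first); returns the best M.
    public static int bestM(double[] downlinks, double uplink) {
        int n = downlinks.length, best = 1;
        double bestAvg = -1;
        for (int m = 1; m <= n; m++) {
            double[] caps = new double[m];
            caps[0] = downlinks[n - m];                               // slowest member of the big partition
            for (int j = 1; j < m; j++) caps[j] = downlinks[n - m + j]; // singletons
            double[] share = maxMinShares(caps, uplink);
            double sum = (n - m + 1) * share[0];                      // every member of the big partition
            for (int j = 1; j < m; j++) sum += share[j];
            if (sum / n > bestAvg) { bestAvg = sum / n; best = m; }
        }
        return best;
    }

    public static void main(String[] args) {
        // The Figure 7.2 example: uplink 10, downlinks {10, 10, 1, 1} -> M = 3.
        System.out.println(bestM(new double[]{10, 10, 1, 1}, 10));
    }
}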
In general, since we consider all values of M from 1 to n partitions, the solution obtained from our partitioning approach cannot be worse than the two baseline approaches of using a single multicast tree for all receivers and unicasting to all receivers over separate paths.

7.2.2 Incorporating Objective Vectors

We allow users to supply an objective vector along with their multicast transfers to better optimize network performance, that is, total network capacity consumption and receiver completion times. We incorporate the objective vectors by grouping receivers with consecutive ranks that are indicated with zeros in the objective vector and treating them as one partition in the whole process. That is because the users have indicated no interest in the completion times of those receivers, so we might as well reduce the network capacity usage by grouping them from the beginning. Figure 7.4 shows an example of building possible solutions by isolating slow receivers and incorporating the user-supplied objective vector, which we refer to as the partitioning hierarchy. Please note that this hierarchy moves in the reverse direction, that is, instead of isolating slow receivers, it merges fast receivers from the bottom to the top. Each layer in this hierarchy, labeled P_i, 1 ≤ i ≤ 5, represents a valid partitioning solution (the associated network topology is not shown). We see that receivers indicated with consecutive zeros in ω_R are merged into one big partition at the base layer, P_5. Also, we see that as we move up, the two fastest partitions at each layer are merged, which reduces total bandwidth consumption. For each layer, we compute the average completion time of receivers and then select the layer that offers the least value; in this case, P_3 was chosen.

Figure 7.4: Example of a partitioning hierarchy for a transfer with 10 receivers (the topology is not shown).

7.3 Iris

We apply the partitioning technique discussed in the previous section to real-world inter-datacenter networks. We develop a heuristic for partitioning receivers on real-world topologies without the relaxations of §7.2. We generate multiple valid partitioning solutions in the form of a hierarchy where the layers of the hierarchy present feasible partitioning solutions and each layer is formed by merging the two fastest partitions of the layer below. (In general, it is not possible to offer optimality guarantees due to the highly varying factors of network topology, transfer arrivals, and the distribution of transfer volumes. However, our extensive simulations in §7.4 show that our approach can offer significant improvement over other approaches across various topologies and traffic patterns. Also, as a result of building a hierarchy of partitioning options and selecting the best one, our solution will be at least as good as either using a single multicast tree or unicasting to all receivers.)

We present Iris, a heuristic that runs on the traffic engineering server to manage bulk multicast transfers (unicast transfers are a special case with a single receiver). When a bulk multicast transfer arrives at an end-point, it communicates the request to the traffic engineering server, which then invokes Iris.
It uses the knowledge of the physical layer topology and the available bandwidth on edges, after deducting the share of high priority user traffic and other running transfers, to compute partitions and forwarding trees. The traffic engineering server pulls the end-points' actual progress periodically to determine their exact remaining volumes across transfers and to compute the total outstanding load per edge for all edges. Iris consists of four modules, as shown in Figure 7.5, which we discuss in the following subsections. Iris aims to find an approximate solution to the optimization problem of Eq. 7.1 assuming η ≫ 1 to prioritize minimizing completion times over minimizing bandwidth consumption. We will empirically evaluate Iris by comparing it to recent work and a lower bound in §7.4.

Figure 7.5: Pipeline of Iris. The traffic engineering server receives bulk multicast transfer requests from senders and, using information about existing transfers (receiver partitions, remaining volumes, forwarding trees), runs the modules that compute forwarding trees, estimate minimum completion times, rank receivers, and compute receiver partitions, then installs the forwarding state at intermediate nodes (datacenters, IXPs, PoPs, etc.).

7.3.1 Choosing Forwarding Trees

Load aware forwarding trees are selected given the link capacity information of the topology and according to the other ongoing bulk multicast transfers across the network, to reduce completion times by mitigating the effect of bottlenecks. Tree selection should also aim to keep bandwidth consumption low by minimizing the number of edges per tree, where an edge could refer to any of the links on the physical topology. To select a forwarding tree, a general approach that can capture a wide range of selection policies is to assign weights to the edges of the inter-DC graph G and select a minimum weight Steiner tree [71]. Per edge e ∈ E_G, we assume a virtual queue that increases by the volume of every transfer scheduled on that edge and decreases as traffic flows through it. Since edges differ in capacity, completing the same virtual queue size may need significantly different times on different links. We define a metric called load as L_e = (1/B_e) Σ_{{P ∈ P_R, ∀R ∈ R | e ∈ T_P}} V^r_P. This equation sums up the remaining volumes of all trees that use a specific edge (the total virtual queue size) and divides that by the average available bandwidth on that edge to compute the minimum possible time it takes for all ongoing transfers on that edge to complete. To keep completion times low, we need to avoid edges for which this value is large.
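A short sketch of this load metric follows; the Tree type is an illustrative placeholder for the per-transfer state kept by the traffic engineering server.

import java.util.List;

// Sketch of the load metric defined above: L_e sums the residual volumes of
// all forwarding trees that cross edge e and divides by the edge's average
// available bandwidth, giving the minimum time needed to drain that edge.
public class EdgeLoad {
    public static class Tree {
        List<Integer> edges;     // ids of edges used by this forwarding tree
        double residualVolume;   // V^r_P: data still to be delivered
    }

    public static double loadOfEdge(int edgeId, double avgAvailableBandwidth,
                                    List<Tree> ongoingTrees) {
        double outstanding = 0;
        for (Tree t : ongoingTrees) {
            if (t.edges.contains(edgeId)) outstanding += t.residualVolume;
        }
        return outstanding / avgAvailableBandwidth; // L_e
    }
}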
With this metric available, to select a forwarding tree given a sender and several receivers, we first assign an edge weight of L_e + V_Rnew/B_e to all edges e ∈ E_G and then select a minimum weight Steiner tree, as shown in Algorithm 7. With this edge weight, compared to edge utilization, which has been extensively used in the literature for traffic engineering, we achieve a more stable measure of how busy a link is expected to be in the near future on average. We considered the second term in the edge weight to reduce total bandwidth use when there are multiple trees with the same weight. It also leads to the selection of smaller trees for larger transfers, which further decreases the total bandwidth consumption of Iris in the long run.

Algorithm 7: Compute A Forwarding Tree
Input: Steiner tree terminal nodes (e.g., {S_Rnew ∪ D_Rnew}), request R_new
Output: A Steiner tree
1  CompForwardingTree(terminals, R_new)
2    To every edge e ∈ E_G, assign a weight of (L_e + V_Rnew/B_e);
3    return a minimum weight Steiner tree that connects the given terminal nodes (we used a heuristic [72]);

7.3.2 Estimating Minimum Completion Times

The purpose of this procedure is to estimate the minimum completion time of different partitions of a given transfer, considering the available bandwidth over the edges and applying max-min fair rate allocation when there are shared links across forwarding trees. Algorithms 9 and 10 then use the minimum completion time per partition to rank the receivers (i.e., faster receivers have an earlier completion time) and to decide which partitions to merge. Computing the minimum completion times is done by assuming that the new transfer request has access to all the available bandwidth, which is much faster than computing the exact completion times. Besides, calculating the exact completion times is not particularly more effective due to the continuously changing state of the system as new transfer requests arrive. Since the available bandwidth over future timeslots is not precisely known, we can use estimated values similar to other work [46, 55, 102]. Algorithm 8 shows how the minimum completion times are computed.

Algorithm 8: Computing Minimum Completion Times
Input: Request R_new, a set of partitions P where P ⊆ D_Rnew, ∀P ∈ P
Output: The minimum completion time of every partition in P
1  MinimumCompletionTimes(P, R_new)
2    F ← ∅, t ← t_now + 1;
3    V^r_P ← V_Rnew, ∀P ∈ P;
4    T_P ← CompForwardingTree(P, R_new), ∀P ∈ P;
5    while |F| < |P| do
6      Compute r_{T_P}(t), ∀P ∈ {P − F}, the max-min fair rate [85] allocated to tree T_P at timeslot t given the available bandwidth B_e(t) on every edge e ∈ E_G;
7      V^r_P ← V^r_P − ω r_{T_P}(t), ∀P ∈ {P − F};
8      foreach P ∈ {P − F} do
9        if V^r_P = 0 then
10         τ_P ← t, F ← F ∪ {P};
11     t ← t + 1;
12   return τ_P, ∀P ∈ P

7.3.3 Assigning Ranks to Receivers

Algorithm 9 assigns ranks to individual receivers according to their minimum completion times, taking into account the available bandwidth over edges as well as the edges' load in the path selection process. This ranking is used along with the provided objective vector later to partition receivers.

Algorithm 9: Assign Receiver Ranks
Input: Request R_new
Output: The rank of every receiver r ∈ D_Rnew
1  AssignReceiverRanks(R_new)
2    /* Every receiver is treated as a separate partition */
3    {τ_r, ∀r ∈ D_Rnew} ← MinimumCompletionTimes(D_Rnew, R_new);
4    rank(r) ← position of receiver r in the list of all receivers sorted by their estimated minimum completion times (the fastest receiver is assigned a rank of 1), ∀r ∈ D_Rnew;
5    return rank(r), ∀r ∈ D_Rnew;
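The sketch below outlines receiver ranking in the spirit of Algorithm 9, with the completion time estimator (the Algorithm 8 logic) abstracted behind an assumed interface.

import java.util.*;

// Sketch: every receiver is treated as a one-receiver partition, its minimum
// completion time is estimated by a routine assumed to exist elsewhere, and
// ranks are assigned by sorting those estimates in ascending order.
public class ReceiverRanking {
    public interface CompletionTimeEstimator {
        double minimumCompletionTime(int receiver); // e.g., backed by the Algorithm 8 logic
    }

    public static Map<Integer, Integer> rank(List<Integer> receivers,
                                             CompletionTimeEstimator estimator) {
        List<Integer> sorted = new ArrayList<>(receivers);
        sorted.sort(Comparator.comparingDouble(estimator::minimumCompletionTime));
        Map<Integer, Integer> rankOf = new HashMap<>();
        for (int i = 0; i < sorted.size(); i++) {
            rankOf.put(sorted.get(i), i + 1); // fastest receiver gets rank 1
        }
        return rankOf;
    }
}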
7.3.4 The Iris Algorithm

The Iris algorithm computes receiver partitions using hierarchical partitioning and assigns each partition a multicast forwarding tree. The partitioning problem is solved per transfer and determines the number of partitions and the receivers that are grouped per partition. Iris uses a partitioning technique inspired by the findings of §7.2 that is computationally fast, significantly improves receiver completion times, and operates relying only on the network topology and the available bandwidth per edge (i.e., after deducting the quota of higher priority user traffic). Algorithm 10 illustrates how Iris partitions receivers given an objective vector.

Given that each node in a real-world topology may have multiple interfaces, we cannot directly compute the right number of partitions using Theorem 2. As a result, we build a partitioning hierarchy with multiple layers and examine the various numbers of partitions from the bottom to the top of the hierarchy while looking at the average of the minimum completion times. By building a hierarchy, we account for the discrete nature of forwarding tree selection on the physical network topology. The process consists of two steps as follows. We first use the receiver ranks from Algorithm 9 and the objective vector to create the base of the partitioning hierarchy, P_base. We sort the receivers by their ranks from fastest to slowest and then group them according to the weights in the objective vector. For any receiver whose rank in the objective vector has a value of 1, we consider a separate partition (a single node partition), which allows the receiver to complete as fast as possible by not attaching it to any other receiver. Next, we group receivers with consecutive ranks that are assigned a value of 0 in the objective vector into partitions with potentially more than one receiver, which allows us to save as much bandwidth as possible since the user has not indicated interest in their completion times.

Algorithm 10: Compute Receiver Partitions and Trees (Iris)
Input: Request R_new, binary objective vector ω_Rnew
Output: Partitions of request R_new and their forwarding trees
1  CompPartitionsAndTrees(R_new, ω_Rnew)
2    /* Initial partitioning using the objective vector ω_Rnew */
3    {rank(r), ∀r ∈ D_Rnew} ← AssignReceiverRanks(R_new);
4    D^s_Rnew ← receivers r ∈ D_Rnew sorted by rank(r) in ascending order;
5    P_base ← {every receiver r ∈ D_Rnew for which ω_Rnew⟨rank(r)⟩ is 1, as a separate partition} ∪ {groups of receivers that appear consecutively in D^s_Rnew for which ω_Rnew⟨rank(r)⟩ is 0, each group forming a separate partition};
6    /* Building the partitioning hierarchy for P_base */
7    P_{|P_base|} ← P_base;
8    for l = |P_base| down to 1 do
9      {τ_P, ∀P ∈ P_l} ← MinimumCompletionTimes(P_l, R_new);
10     µ_l ← Σ_{P ∈ P_l} (|P| τ_P);
11     Assuming partitions are sorted from left to right by increasing order of their receivers' ranks, merge the two partitions on the left, P and Q, to form PQ;
12     P_{l−1} ← {PQ} ∪ {P_l − {P, Q}};
13   Find l_min for which µ_{l_min} ≤ µ_l, ∀ 1 ≤ l ≤ |P_base|; if multiple layers attain the same minimum, choose the layer with the minimum total weight over all of its forwarding trees, i.e., select l_min to minimize Σ_{P ∈ P_{l_min}} Σ_{e ∈ T_P} (L_e + V_Rnew/B_e);
14   foreach P ∈ P_{l_min} do
15     T_P ← CompForwardingTree(P, R_new);
16     foreach e ∈ T_P do
17       L_e ← L_e + V_Rnew/B_e, W_e ← W_e + V_Rnew/B_e;
18   return (P, T_P), ∀P ∈ P_{l_min};

Now that we have a set of base partitions P_base, a heuristic creates a hierarchy of partitioning solutions with |P_base| layers, where every layer 1 ≤ l ≤ |P_base| is made up of a set of partitions P_l. Each layer is created by merging two partitions from the layer below, going from the bottom to the top of the hierarchy. At the bottom of the hierarchy, we have the base partitions. At any layer, any partition P is attached to the sender using a separate forwarding tree T_P. We first compute the average of the minimum completion times of all receivers at the bottom of the hierarchy. We then continue by merging the two partitions that hold the top-ranked (fastest) receivers. When merging two partitions, the faster partition is slowed down to the speed of the slower partition. A new forwarding tree is computed for the resulting partition, using the forwarding tree selection heuristic of Algorithm 7 applied to all receivers in that partition, and the average of the minimum completion times of all receivers is recomputed. This process continues until we reach a single partition that holds all receivers. In the end, we select the layer at which the average of the minimum completion times across all receivers is minimum, which gives us the number of partitions, the receivers that are grouped per partition, and their associated forwarding trees. If there are multiple layers with the minimum average completion times, the one with the minimum total forwarding tree weight across its forwarding trees is chosen, which on average leads to better load distribution.
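The following sketch outlines this layer search, merging the two fastest partitions per layer and keeping the layer with the smallest average of estimated minimum completion times; the evaluator interface is an assumed stand-in for Algorithms 7 and 8, and the tie-break by total tree weight is omitted for brevity.

import java.util.*;

// Sketch of the layer search: start from the base partitions (sorted fastest
// to slowest), repeatedly merge the two fastest partitions, and keep the layer
// whose average of estimated minimum receiver completion times is smallest.
public class PartitioningHierarchy {
    public interface LayerEvaluator {
        // Average of minimum completion times over all receivers of the layer.
        double averageCompletionTime(List<List<Integer>> partitions);
    }

    public static List<List<Integer>> bestLayer(List<List<Integer>> basePartitions,
                                                LayerEvaluator evaluator) {
        List<List<Integer>> layer = new ArrayList<>(basePartitions);
        List<List<Integer>> best = new ArrayList<>(layer);
        double bestScore = evaluator.averageCompletionTime(layer);
        while (layer.size() > 1) {
            // Merge the two fastest partitions (positions 0 and 1).
            List<Integer> merged = new ArrayList<>(layer.get(0));
            merged.addAll(layer.get(1));
            List<List<Integer>> next = new ArrayList<>();
            next.add(merged);
            next.addAll(layer.subList(2, layer.size()));
            double score = evaluator.averageCompletionTime(next);
            if (score < bestScore) {  // ties resolved here in favor of the earlier layer
                bestScore = score;
                best = new ArrayList<>(next);
            }
            layer = next;
        }
        return best;
    }
}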
7.4 Evaluation

We considered various topologies and transfer size distributions, as shown in Table 7.3. We selected two research topologies with given capacity information on edges from the Internet Topology Zoo [157]. We could not use other commercial topologies as their exact connectivity and link capacity information is not publicly disclosed. We also considered multiple transfer volume distributions, including synthetic (light-tailed and heavy-tailed) and real-world Facebook inter-DC traffic patterns (Hadoop and Cache-follower) [12]. The transfer arrival pattern followed a Poisson distribution with a rate of λ per timeslot. For simplicity, we assumed an equal number of receivers for all bulk multicast transfers per experiment. We performed simulations and Mininet emulations to evaluate Iris. We compare Iris with multiple baseline techniques and with QuickCast, presented in Chapter 6, which also focuses on partitioning receivers into groups for improved completion times.

Table 7.3: Various topologies and traffic patterns used in evaluation. One unit of traffic is equal to what can be transmitted at the rate of the fastest link over a given topology per timeslot.
Topology:
GEANT — Backbone and transit network across Europe with 34 nodes and 52 links. Link capacity from 45 Mbps to 10 Gbps.
UNINETT — Backbone network across Norway with 69 nodes and 98 links. Most links have a capacity of 1, 2.5 or 10 Gbps.
Traffic Pattern:
Light-tailed — Based on the Exponential distribution with a mean of 20 units per transfer.
Heavy-tailed — Based on the Pareto distribution with a minimum of 2 units, a mean of 20 units, and the maximum capped at 2000 units per transfer.
Hadoop — Generated by geo-distributed data analytics over Facebook's inter-DC WAN (distribution mean of 20 units per transfer).
Cache-follower — Generated by geo-distributed cache applications over Facebook's inter-DC WAN (distribution mean of 20 units per transfer).

7.4.1 Computing a Lower Bound

We develop a technique to compute a lower bound on receiver completion times by creating an aggregate topology from the actual topology. As shown in Figure 7.6, to create the aggregate topology, we combine all downlinks and uplinks, with rates r_i per interface i of each node, into a single uplink and downlink with their rates set to the sum of the rates of the physical links. Also, the aggregate topology connects all nodes in a star topology using their uplinks and downlinks and so assumes no bottlenecks within the network. Since this topology is a relaxed version of the physical topology, any solution that is valid for the physical topology is valid on this topology as well. Therefore, the solution to the aggregate topology is a lower bound that can be computed efficiently but may be inapplicable to the actual physical topology. We will use this approach in §7.4.2 for the evaluation of Iris.
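A minimal sketch of building the aggregate topology follows; the Link and StarNode types are illustrative.

import java.util.*;

// Sketch of the aggregate topology used for the lower bound: per node, all
// physical uplinks are collapsed into one uplink and all downlinks into one
// downlink, with rates set to the respective sums; the nodes are then assumed
// to be connected in a star with no core bottlenecks.
public class AggregateTopology {
    public static class Link { int src, dst; double rate; }
    public static class StarNode { double uplink, downlink; }

    public static Map<Integer, StarNode> build(List<Link> physicalLinks) {
        Map<Integer, StarNode> star = new HashMap<>();
        for (Link l : physicalLinks) {
            star.computeIfAbsent(l.src, k -> new StarNode()).uplink += l.rate;
            star.computeIfAbsent(l.dst, k -> new StarNode()).downlink += l.rate;
        }
        return star;
    }
}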
Also, the aggregate topology connects all nodes in a star topology using their uplinks and downlinks and therefore assumes no bottlenecks within the network. Since this topology is a relaxed version of the physical topology, any solution that is valid for the physical topology is valid on this topology as well. Therefore, the solution on the aggregate topology is a lower bound that can be computed efficiently but may be inapplicable to the actual physical topology. We will use this approach in Section 7.4.2 for the evaluation of Iris.

Figure 7.6: The physical topology, and the aggregate topology used to compute a lower bound on receiver completion times. Per node, the aggregate topology sums the interface rates into one uplink and one downlink. The aggregate topology is not part of how Iris operates and is only used for evaluation in this section.

7.4.2 Simulations

In simulations, we focus on computing gains and therefore assume no dropped packets and accurate max-min fair rates. We normalized link capacities by the maximum link rate per topology and fixed the timeslot length to ω = 1.0.

Accounting for the Effect of User Traffic: We account for the effect of higher priority user traffic in the simulations. The amount of available bandwidth per edge per timeslot, i.e., B_e(t), is computed by deducting the rate of user traffic from the link capacity C_e. Recent work has shown that this rate can be safely estimated [46, 102]. For the evaluations, we assume that user traffic can take up to 30% of a link's capacity with a minimum of 5%, and that its rate follows a periodic pattern going from low to high and back to low. Per link, we consider a random period in the range of 10 to 100 timeslots that is generated and assigned per experiment instance. A small sketch of this traffic model is given below.
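The following Python sketch shows one way to generate such a synthetic user-traffic pattern and the resulting available bandwidth B_e(t). The raised-cosine waveform and all names are illustrative assumptions; the dissertation only specifies the 5% to 30% range and the periodic low-high-low behavior.

import math
import random

def user_traffic_fraction(t, period, low=0.05, high=0.30):
    """Fraction of link capacity taken by user traffic at timeslot t.

    Follows a periodic low -> high -> low pattern between `low` and `high`.
    A raised cosine is used here purely as an illustrative waveform.
    """
    phase = (t % period) / period                    # position within the period, in [0, 1)
    return low + (high - low) * 0.5 * (1 - math.cos(2 * math.pi * phase))

def available_bandwidth(capacity, t, period):
    """B_e(t): link capacity minus the (estimated) user traffic rate."""
    return capacity * (1.0 - user_traffic_fraction(t, period))

# Each link gets its own random period between 10 and 100 timeslots (hypothetical links).
random.seed(0)
link_capacity = {("A", "B"): 1.0, ("B", "C"): 0.4}   # normalized capacities
link_period = {e: random.randint(10, 100) for e in link_capacity}

for t in range(3):
    for e, c in link_capacity.items():
        print(t, e, round(available_bandwidth(c, t, link_period[e]), 3))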
Minimizing Average Completion Times

This is the case where the objective vector is made of all ones. The partitioning hierarchy then begins with all receivers forming their own 1-receiver partitions. This is a highly general objective and can be considered the default approach when the application or user does not specify an objective vector. We discuss multiple simulation experiments. In Figure 7.7, we measure the completion times (mean and tail) as well as bandwidth consumption by the number of receivers (the tail is the 99.9th percentile). We consider two baseline cases: unicast shortest path and static single tree (i.e., minimum edge Steiner tree) routing. The shortest path routing is the unicast scenario that uses the minimum bandwidth possible. The minimum edge Steiner tree routing uses the minimum bandwidth possible while connecting all receivers with a single tree. The first observation is that unicast, although it leads to the highest separation of fast and slow receivers, does not lead to the fastest completion, as it can create many shared bottlenecks, which is why we see long tail times. Iris offers the minimum completion times (mean and tail) across all scenarios. Also, its completion times grow much more slowly than the others as the number of receivers (and so the overall network load) increases. This is while Iris uses only up to 35% additional bandwidth compared to the static single tree (unicast shortest path routing uses up to 2.25x the bandwidth). Compared to QuickCast, Iris offers up to 26% lower tail times and up to 2.72x better mean times while using up to 13% extra bandwidth.

Figure 7.7: Comparison of various techniques (Iris, QuickCast, Single Tree Load Aware, Single Tree Static, Unicast Shortest Path) by the number of multicast receivers over UNINETT with the Cache-Follower and Hadoop traffic patterns, showing mean completion times, tail completion times, and total bandwidth. Plots are normalized by the minimum data point (mean and tail charts are normalized by the same minimum), λ = 1, and lower values are better.

In Figure 7.8, we show the completion time speedup of receivers by their rank. As seen, the gains depend on the topology, the traffic pattern, and the receiver's rank. The dashed line is the baseline, i.e., the no-partitioning case. Compared to QuickCast [84], the fastest node always completes faster, and up to 2.25x faster, with Iris. Also, the majority of receivers complete significantly faster. In the case of four receivers, the top 75% of receivers complete between 2x and 4x faster than the baseline, and with sixteen receivers, the top 75% of receivers complete at least 8x faster than the baseline. This is while QuickCast's gain drops quickly to one after the top 25% of receivers.

Figure 7.8: Mean completion time speedup (larger is better) of receivers, normalized by the no-partitioning (load aware single tree) case, given their rank from fastest to slowest, for 4 and 16 receivers over GEANT (Cache-follower) and UNINETT (Hadoop). Every node initiates an equal number of transfers, receivers were selected according to the uniform distribution from all nodes, and we considered λ = 1.

In Figure 7.9, we measure the CDF of completion times for all receivers. As seen, tail completion times are two to three orders of magnitude longer than median completion times, which is due to variable link capacities and transfer volumes. We evaluate the completion times of QuickCast and Iris and compare them with a lower bound which considers the aggregate topology (see Section 7.4.1) and applies Theorem 2 directly. It is likely that no feasible solution exists that achieves this lower bound. Under a low arrival rate (light load), we see that Iris tracks the lower bound closely with a marginal difference. Under a high arrival rate (heavy load), Iris stays close to the lower bound for the lower and higher percentiles while not being far from it for the others.

Figure 7.9: CDF of receiver completion times over GEANT and UNINETT, for λ = 0.001 and λ = 1.0, with light-tailed and heavy-tailed transfer volumes. Every transfer has 8 receivers selected uniformly across all nodes. "Lower Bound" is computed by finding the aggregate topology and applying Theorem 2.

Other Objective Vectors

We discuss four different objective vectors A, B, C and D as shown in Figure 7.10. This figure shows the mean speedup of receivers given their ranks, and the bandwidth consumption associated with each vector. In A, we aim to finish one copy quickly while not being concerned with the completion times of other receivers.
We see a gain of between 9x and 18x across the two topologies considered for the first receiver. We also see that this approach uses much less extra bandwidth compared to a vector with more ones (e.g., case B). In B, we aim to speed up the first four receivers (we care about each one), while in C, we want to speed up the fourth receiver without directly concerning ourselves with the top three receivers. As can be seen, B offers increasing speedups for the top three receivers while C's speedup is flatter. Also, C uses less bandwidth compared to B by grouping the top three receivers into one partition at the base of the hierarchy. Finally, D's vector specifies that the application/user only cares about the completion time of the last receiver, which means that receiver will be put in a separate partition at the base of the hierarchy while the other receivers will be grouped into one partition. Since the slowest receiver is usually limited by its downlink speed, this cannot improve its completion time. However, with minimum extra bandwidth, this speeds up all receivers except the slowest by as much as possible. Except for the slowest, all receivers observe a speedup of between 3x and 6x while using 8% to 16% less bandwidth compared to B. A tradeoff is observed, that is, D offers a lower speedup but a consistent gain for more receivers with less bandwidth use compared to B.

Figure 7.10: Gain by rank for different receivers per transfer, averaged over all transfers, for the four objective vectors A, B, C and D (mean speedup by receiver rank and extra bandwidth, over GEANT with Cache-follower and UNINETT with Hadoop). We set λ = 0.1 and there are 8 receivers.

7.4.3 Mininet Emulations

We used Mininet to build and test a prototype of Iris, compared it with QuickCast, and set up the testbed on CloudLab [158]. We used Open vSwitch (OVS) 2.9 in the OpenFlow 1.3 compatibility mode along with the Floodlight controller 1.2, connecting them through a control network. We assumed fixed available bandwidth over edges according to the GEANT topology [6] while scaling the downlinks' capacity so that the maximum is 500 Mbps. We did this to reduce the CPU overhead of traffic shaping over TCLink Mininet modules. Our traffic engineering program communicated with the end-points through a RESTful API. We used NORM [159] for multicast session management along with its rate-control module. To increase efficiency, we computed max-min fair rates centrally at the traffic engineering program and let the end-points shape their traffic using NORM's rate control module. The experiment was performed using twelve trace files generated according to Facebook traffic patterns (concerning transfer volume) [12], and each trace file had 200 requests in total with an arrival rate of one request per timeslot based on the Poisson distribution. We also considered timeslots of one second, a minimum transfer volume of 5 MB, and limited the maximum transfer volume to 500 MB (which also matches the distribution of YouTube video sizes [160]).
We considered three schemes: Iris, QuickCast, and a single tree approach (no partitioning). The total emulation time was about 24 hours. Figure 7.11 shows our emulation results. To allow comparison between the tail (95th percentile) and mean values, we have normalized both plots by the same minimum in each row. Also, the group table usage plots are not normalized and show the actual average and the actual maximum across all switches. The reason why data points jump up and down is the randomness of the generated traces that comes from the transfers (volume, source, receivers, arrival pattern, etc.).

Completion Times and Bandwidth: Iris can improve on QuickCast by speeding up mean receiver completion times by up to 2.5x. It also offers up to 4x better mean completion times compared to using a single forwarding tree per transfer. We also see that, compared to using one multicast tree, Iris consumes at most 25% extra bandwidth.

Forwarding Plane: We see that Iris uses up to about 4x fewer group table entries at the switches where the maximum number of entries was exhausted, which allows more parallel transfers across the same network. Iris achieves this by allowing a larger number of partitions per transfer whenever it does not hurt the completion times. By allowing more partitions, each tree branches fewer times on average, reducing the number of group table entries.

Running Time: Across all experiments, the computation time needed for Iris to calculate partitions and forwarding trees stayed below 5 ms per request.

Figure 7.11: Mininet emulation results over GEANT with the Cache-Follower and Hadoop traffic patterns: mean and 95th percentile completion times, total bandwidth, and mean and tail group table entries for Iris, QuickCast, and Single Tree (Load Aware) versus the number of receivers.

7.4.4 Practical Concerns

New challenges, such as increased communication latency across network elements and failures, may arise while deploying Iris on a real-world geographically distributed network. Communication latency may not affect performance considerably, as we focus on long-running internal transfers that are notably more resilient to the latency overhead of scheduling and routing compared to interactive user traffic. Failures may affect physical links or the TES. The loss of a physical link can be addressed by rerouting the affected transfers reactively, either by the network controller or by using the SDN fast failover mechanisms. End-points may be equipped with distributed congestion control, such as the one presented in [115], which they can fall back to in case the centralized traffic engineering fails.

7.5 Conclusions

In this chapter, we presented the problem of grouping receivers into multiple partitions per P2MP transfer to minimize the effect of receiver downlink speed discrepancy on the completion
times of receivers. We analyzed a relaxed version of this problem and came up with a partitioning that minimizes mean completion times given max-min fair rates. We also set forth the idea of applications/users expressing their requirements in the form of binary objective vectors, which allows us to optimize resource consumption and performance further. We then described Iris, a system that computes partitions and forwarding trees for incoming bulk multicast transfers as they arrive, given their objective vectors. We showed that Iris can significantly reduce mean completion times with a small increase in bandwidth consumption and can fulfill the requirements expressed using objective vectors while saving bandwidth whenever possible. It is worth noting that the performance of any partitioning and forwarding tree selection algorithm rests profoundly on the network topology and transfer properties.

Chapter 8
Speeding up P2MP Transfers using Parallel Steiner Trees

In Chapters 5 to 7, we discussed different ways of managing Point to Multipoint (P2MP) inter-DC transfers using dynamically selected forwarding trees to balance load across the network and reduce network capacity consumption. In all past efforts, we attached each receiver to the sender using a single forwarding tree (in the case of partitioning, every receiver belonged to exactly one partition and so was connected to the sender using a single forwarding tree). In general, however, it may be possible to increase receivers' download speeds by using multiple parallel trees that connect the sender to all receivers (in the case of partitioning, all receivers in every partition are attached to the sender using one forwarding tree, as in Chapters 6 and 7), which is what we will explore in this chapter. We will show that by using two forwarding trees per receiver, we can reduce the completion times of receivers by up to 40% while only increasing the total network capacity usage by up to 10%. We also find that using more than two parallel trees offers a negligible benefit or even hurts performance due to excessive bandwidth usage and the creation of unnecessary bottlenecks.

8.1 Motivating Example

By using parallel trees, we can substantially increase the multicast forwarding throughput, possibly at little extra network capacity cost. Figure 8.1 shows how adding more trees can improve the overall receiver throughput.

Figure 8.1: Using parallel forwarding trees (here, one, two, and three trees from a sender S to receivers R1 and R2), we can increase the overall network throughput to all receivers. We may have to pay some extra bandwidth cost as we add more trees.

Assuming an equal link capacity of 1 for all edges, the single tree case on the left offers a total rate of 1. Adding one more tree in an edge-disjoint manner will double the rate. If we consider an equal division of traffic across the two trees, the total network bandwidth usage is not increased compared to the single tree case. Now consider the network on the right. We see three trees that together give us a total rate of 3. However, the last tree has four edges.
Assuming an equal division of traffic across all three trees, we see that this will increase the total bandwidth usage by a factor of about 1.11. Also, we see that adding even more trees will not help us improve completion times due to the creation of bottlenecks (since the trees will no longer be edge-disjoint).

8.2 System Model

We adopt the same system model as presented in Section 5.2.1. Namely, we consider a slotted timeline and a centralized traffic engineering mechanism that determines what trees will be used by an arriving transfer. The central controller also computes the rates at which senders transmit traffic on each tree. We also focus on bulk and internal data transfers that are not in the critical path of user experience and so are resilient to some degree of latency. We assume heterogeneous link capacities as presented by real WAN topologies. We will use the same notation as that in Tables 2.2 and 5.2.

8.3 Application of Parallel Forwarding Trees

We discuss how to dynamically select parallel edge-disjoint forwarding trees according to the network load across different edges, and then discuss various rate-allocation (i.e., traffic scheduling) policies.

8.3.1 Adaptive Edge-disjoint Parallel Forwarding Tree Selection

Although using a single forwarding tree for every transfer minimizes packet reordering and total network capacity consumption, it can considerably limit the overall achievable network throughput for P2MP transfers. Under light load, this leads to inefficient use of network capacity, artificially increasing the completion times of P2MP transfers. We discuss our approach to the selection of multiple forwarding trees. To perform a P2MP transfer R_new with volume V_R_new, the source S_R_new transmits traffic over edge-disjoint Steiner trees that span across D_R_new. In this chapter, we do not discuss receiver set partitioning, as that subject can be applied orthogonally to the parallel tree selection approach by treating each partition as a separate P2MP request. At any timeslot, traffic for any transfer flows with the same rate over all links of a forwarding tree to reach all the destinations at the same time. The problem of scheduling a P2MP transfer then translates to finding multiple forwarding trees and a transmission schedule over every tree for every arriving transfer in an online manner. A relevant problem is the minimum weight Steiner tree [71], which can help minimize total bandwidth usage with proper weight assignment. Although it is a hard problem, heuristic algorithms exist that often provide near optimal solutions [142, 143]. To select multiple Steiner trees, we use the metric load L_e, defined for every edge e as the total remaining volume of traffic over all the trees that include that edge. We first assign every edge a weight of W_e = (L_e + V_R_new) / C_e, which is the minimum time it would take for all the transfers that share that edge to complete (if R_new were to be placed on that edge). The algorithm starts by selecting a minimum weight Steiner tree using a heuristic algorithm. We then mark all of the edges of this tree as deleted and run the minimum weight Steiner tree selection algorithm again. This process is repeated until either no more trees can be found (i.e., some receivers would be disconnected) or we reach a maximum of K trees set by the operators as a configuration parameter. A compact sketch of this selection loop is given below.
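The following Python sketch illustrates this selection loop, using networkx's Steiner tree approximation as a stand-in for the heuristics of [142, 143]. The function names, the load bookkeeping, and the use of networkx are illustrative assumptions, not the dissertation's actual implementation.

import networkx as nx
from networkx.algorithms.approximation import steiner_tree

def select_parallel_trees(graph, source, receivers, volume, load, k_max):
    """Select up to k_max edge-disjoint forwarding trees for one P2MP transfer.

    graph: undirected nx.Graph with a 'capacity' attribute per edge (C_e).
    load:  dict mapping frozenset({u, v}) -> L_e, the total remaining volume
           of all trees that currently use edge (u, v).
    Edge weights are W_e = (L_e + volume) / C_e; after a tree is chosen, its
    edges are removed so that the next tree is edge-disjoint.
    """
    g = graph.copy()
    terminals = {source} | set(receivers)
    trees = []
    while len(trees) < k_max:
        reachable = nx.node_connected_component(g, source)
        if not terminals <= reachable:
            break                        # some receiver can no longer be reached: stop
        # Load-adaptive weights on the remaining edges.
        for u, v, data in g.edges(data=True):
            data["weight"] = (load.get(frozenset((u, v)), 0.0) + volume) / data["capacity"]
        h = g.subgraph(reachable)        # restrict to the component containing the terminals
        t = steiner_tree(h, list(terminals), weight="weight")
        trees.append(list(t.edges()))
        g.remove_edges_from(t.edges())   # enforce edge-disjointness across trees
    return trees

# Tiny example on a 4-node ring (hypothetical capacities), K = 2.
G = nx.Graph()
G.add_edge("S", "A", capacity=1.0); G.add_edge("A", "R", capacity=1.0)
G.add_edge("S", "B", capacity=0.5); G.add_edge("B", "R", capacity=0.5)
print(select_parallel_trees(G, "S", ["R"], volume=10.0, load={}, k_max=2))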
This approach offers several benefits. Since trees are selected dynamically as load changes on the edges, they tend to avoid highly busy links. Also, as the trees assigned to a transfer are edge-disjoint, this approach avoids creating additional bottlenecks that cause competition across the trees of the same transfer. Finally, by limiting the maximum number of trees, operators can choose between speeding up the transfers (using more trees) or minimizing total bandwidth consumption (using fewer trees) by changing the value of K. This value could be chosen as a function of network load, i.e., under heavier load operators can reduce K and increase it as load decreases.

Updating L_e: While using multiple forwarding trees, after selecting such trees, the load on their edges needs to be increased according to V_R_new. Since we do not know in advance how much of the traffic will be sent over each tree, it is unclear how much to increase the load on the edges of the different trees. This is because, depending on the scheduling policy used to send traffic and the future transfers that arrive, the volume of traffic sent over the different trees of a transfer can change. For example, if a transfer has two trees and one of them has to compete with a future transfer, the volume of traffic sent over the other tree will automatically increase as soon as the future transfer arrives. To address this, we use the following heuristic. We assume that, at any time, the remaining volume of a transfer is equally divided across all of its trees. If one tree sends a lot of traffic, that reduction in load is equally divided and deducted from all of the trees of that transfer. Although the exact load on every edge will potentially not be accurate, this approach offers an efficient approximation of load that helps us quickly select future forwarding trees.

8.3.2 Scheduling Policies

Similar to previous chapters, we consider the well-known scheduling policies of First Come First Serve (FCFS), Shortest Remaining Processing Time (SRPT), and fair sharing based on Max-Min Fairness (MMF). These scheduling policies have different properties. Fair sharing is the most widely used policy as it allows many users to fairly access the network bandwidth over network bottlenecks. SRPT allows more internal data transfers to be completed in any given period of time. FCFS can also be used to offer more accurate guarantees to applications on when their transfers will complete. A small sketch of max-min fair rate allocation over forwarding trees is given below.
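As an illustration of the fair sharing policy, the following Python sketch computes max-min fair rates for a set of forwarding trees sharing capacitated edges using progressive filling (water-filling). This is a generic textbook procedure written for illustration under stated assumptions, not code from the dissertation, and all names are hypothetical.

def max_min_fair_rates(trees, capacity):
    """Progressive filling over forwarding trees.

    trees: dict {tree_id: set of edges}; every tree sends at one rate over
    all of its edges. capacity: dict {edge: C_e}. Returns {tree_id: rate}.
    """
    rate = {t: 0.0 for t in trees}
    active = set(trees)                      # trees whose rate can still grow
    residual = dict(capacity)                # remaining capacity per edge
    while active:
        # For every edge used by an active tree, the equal increment it allows.
        increments = []
        for e, c in residual.items():
            users = [t for t in active if e in trees[t]]
            if users:
                increments.append((max(c, 0.0) / len(users), e, users))
        if not increments:
            break                            # active trees use no capacitated edge
        inc, bottleneck_edge, users = min(increments, key=lambda x: x[0])
        # Raise every active tree by the bottleneck increment.
        for t in active:
            rate[t] += inc
            for e in trees[t]:
                residual[e] -= inc
        # Trees crossing the saturated bottleneck edge are frozen.
        for t in users:
            active.discard(t)
    return rate

# Two trees sharing edge ("A", "B") with capacity 1, plus private edges.
trees = {"T1": {("S", "A"), ("A", "B")}, "T2": {("A", "B"), ("B", "R")}}
capacity = {("S", "A"): 10.0, ("A", "B"): 1.0, ("B", "R"): 0.3}
print(max_min_fair_rates(trees, capacity))   # T2 limited to 0.3, T1 gets 0.7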
8.4 Evaluation

We considered various topologies and transfer size distributions. In the following, we perform experiments to measure the effectiveness of using parallel forwarding trees over multiple topologies and using multiple transfer size distributions.

Network Topologies: We use the same topologies discussed in Chapter 6. These topologies provide capacity information for all links, which ranges from 45 Mbps to 10 Gbps. We normalized all link capacities by dividing them by the maximum link capacity. We also assumed all links are bidirectional with equal capacity in either direction.

Traffic Patterns: We use the same transfer size distributions discussed in Chapter 6. Transfer arrival followed a Poisson distribution with rate λ. We considered no units for time or bandwidth. For all simulations, we assumed a timeslot length of ω = 1.0. For the Pareto distribution, we considered a minimum transfer volume equal to that of 2 full timeslots and limited the maximum transfer volume to that of 2000 full timeslots. Unless otherwise stated, we considered an average demand equal to the volume of 20 full timeslots per transfer for all traffic distributions (we fixed the mean values of all distributions to the same value). Per simulation instance, we assumed an equal number of transfers per sender, and for every transfer we selected the receivers from all existing nodes according to the uniform distribution (with equal probability over all nodes).

Assumptions: We focused on computing gains and assumed accurate knowledge of inter-DC link capacity and precise rate control at the end-points, which together lead to a congestion free network. We also assumed no dropped packets due to corruption or errors, and no link failures.

Simulation Setup: We developed a simulator in Java (JDK 8). We performed all simulations on one machine (Core i7-6700 and 24 GB of RAM). We used the Java implementation of GreedyFLAC [72] for minimum weight Steiner trees.

8.4.1 Effect of Number of Parallel Trees

Figure 8.2 shows the effect of the maximum number of trees per transfer (i.e., K). We see that almost all the gain is obtained with 2 parallel trees, and increasing K further does not improve the completion times. Adding more trees, however, increases the total network bandwidth usage. We see that while the network bandwidth consumption increases by about 7% in the settings of this experiment, the mean completion times improve by up to 17% and the median completion times improve by up to 30%. The gain in mean completion times is less than that of the median as a result of the tail completion times, which are usually much higher than the median since the transfer size distribution is skewed. Also, it is worth noting that having parallel trees cannot improve tail completion times, as the tail is restricted by physical constraints such as low capacity links.

Figure 8.2: The effect of the number of parallel trees (K = 1, 2, 3) on total bandwidth consumption and transfer completion times (TCTs), for light-tailed and heavy-tailed transfer volumes. Other experiment parameters are λ = 0.01, max-min fair rate computation, and the GEANT [6] topology.

8.4.2 Effect of Number of Copies

In Figure 8.3 we explore the effect of the number of receivers per transfer. With more receivers, we will have larger trees, which in general makes it harder to find edge-disjoint parallel trees. As a result, we see that the gain in mean and median completion times drops with more receivers. One way to increase the effect of parallel trees in scenarios with many receivers per transfer is to reduce the number of receivers per tree by partitioning the receivers using some technique, for example those discussed in Chapters 6 and 7.

Figure 8.3: The effect of the number of receivers (4 and 8) on the total bandwidth consumption ratio (i.e., bandwidth of K=2 over bandwidth of K=1) and the transfer completion times (TCTs) gain (i.e., completion time of K=1 over completion time of K=2), for light-tailed and heavy-tailed transfer volumes. Other experiment parameters are λ = 0.01, max-min fair rate computation, and the GEANT [6] topology.

8.4.3 Effect of Transfer Size Distribution

Figure 8.4 shows the effect of different transfer size distributions, which include both synthetic and real distributions. Since trees are selected dynamically, we see that the total bandwidth consumption also changes with the traffic pattern. We also see that the gain in mean and median completion times depends highly on the traffic distribution, ranging from 5% to 30%.
Interestingly, we also see that the gain in mean completion times has an inverse relationship with that of the median completion times. We believe this behavior is a result of how the distributions affect the tail completion times. For example, with the synthetic light-tailed and heavy-tailed distributions, the tail grows larger with K = 2, while for the real traffic patterns of Cache-follower and Hadoop we see a decrease in tail completion times (not shown in the figure). The common result is that regardless of the traffic pattern, we always obtain considerable gains in either mean or median completion times with up to a 10% increase in bandwidth usage.

Figure 8.4: The effect of the transfer size distribution (Light-tailed, Heavy-tailed, Cache-follower, Hadoop) on the total bandwidth consumption ratio (i.e., bandwidth of K=2 over bandwidth of K=1) and the transfer completion times (TCTs) gain (i.e., completion time of K=1 over completion time of K=2). Other experiment parameters are λ = 0.01, 4 receivers per transfer, max-min fair rate computation, and the GEANT [6] topology.

8.4.4 Effect of Topology

We explore the effect of different topologies as shown in Figure 8.5. We see that GScale offers significantly higher gains in completion times compared to the other two topologies. That is because we assumed a uniform capacity of 1 across the edges of GScale, while GEANT and UNINETT have many low capacity edges which negatively affect the gains. We also see a higher bandwidth usage over GScale, up to 18%, which is due to the ability of the routing algorithm to use parallel trees for more transfers. GScale is a smaller topology and is better connected compared to UNINETT and GEANT, which is why we can build more parallel trees on average.

Figure 8.5: The effect of topology (GEANT, GScale, UNINETT) on the total bandwidth consumption ratio (i.e., bandwidth of K=2 over bandwidth of K=1) and the transfer completion times (TCTs) gain (i.e., completion time of K=1 over completion time of K=2). Other experiment parameters are λ = 0.01, 4 receivers per transfer, and max-min fair rate computation.

8.4.5 Effect of Scheduling Policies

Figure 8.6 shows the effect of the scheduling policies on the flow completion times gain and the total bandwidth use. We see that using parallel trees offers the most gain when applying the SRPT policy. This is because small transfers obtain much higher throughput as soon as they arrive, since the policy preempts any other larger transfers. Fair sharing and FCFS both offer considerable gains in median completion times, with fair sharing offering a higher average gain due to better tail completion times. In other words, with FCFS, a few large transfers can fully block some links and slow down all other transfers whose trees use those edges. Overall, we see that using parallel trees marginally increases bandwidth use while considerably improving completion times regardless of the scheduling policy.

Figure 8.6: The effect of the traffic scheduling policy (FCFS, SRPT, fair sharing with MMF) on the total bandwidth consumption ratio (i.e., bandwidth of K=2 over bandwidth of K=1) and the transfer completion times (TCTs) gain (i.e., completion time of K=1 over completion time of K=2). Other experiment parameters are λ = 0.01, 4 receivers per transfer, and the GEANT [6] topology.

8.4.6 Effect of Network Load

In Figure 8.7 we evaluate the effect of network load. Overall, it appears that with a lower network load, we observe higher gains in completion times and slightly higher bandwidth consumption. Under light load, most network edges are not loaded, and so using parallel trees allows us to increase throughput for ongoing transfers with minimal interference. As load increases, we expect higher contention across competing transfers for access to network capacity, which reduces the gains of having parallel trees.

Figure 8.7: The effect of the transfer arrival rate (i.e., network load, λ = 0.1, 0.01, 0.001) on the total bandwidth consumption ratio (i.e., bandwidth of K=2 over bandwidth of K=1) and the transfer completion times (TCTs) gain (i.e., completion time of K=1 over completion time of K=2). Other experiment parameters are 4 receivers per transfer, max-min fair rate computation, and the GEANT [6] topology.
8.5 Conclusions

In this chapter, we evaluated the benefits of parallel forwarding trees for inter-DC P2MP transfers. The approach is to use edge-disjoint forwarding trees to reduce the interference across the trees of one transfer while maximizing throughput. We used a load-adaptive approach for the selection of forwarding trees that selects up to K such trees that balance load across the network. We also discussed a weight assignment technique for updating load weights over the edges of trees for efficient computation of weights. According to our evaluations with different traffic patterns, topologies, network loads, numbers of parallel trees, and scheduling policies, we find that using up to two parallel trees per transfer can considerably improve the completion times of transfers while slightly increasing the total network bandwidth use. We also find that for better-connected networks with fewer bottlenecks, using parallel trees offers higher gains in completion times.

Chapter 9
Summary and Future Directions

As organizations continue to build more datacenters around the globe, communication across these datacenters becomes more and more important for highly distributed applications with globally distributed users. Increasingly, companies use private dedicated high speed networks to connect datacenters to offer high quality infrastructure for distributed applications. For such costly networks to be profitable, it is necessary to maximize performance and efficiency.

In this dissertation, we made the case for coordinated control of routing over inter-DC networks and of traffic transmission at the end-points. Since inter-DC networks are relatively small, with tens to hundreds of nodes, such coordination is possible and is currently used by multiple organizations. A traffic engineering server that is logically centralized receives traffic demands from end-points as well as network status updates from the network. Combined with the topology information, the traffic engineering server can then compute the routes over which traffic is forwarded over inter-DC networks and the rates at which traffic is transmitted from end-points.

We focused on multiple research domains concerned with traffic engineering over inter-DC networks. First, we noticed that a large portion of inter-DC traffic is formed by large inter-DC flows, which we refer to as transfers.
We realized that current adaptive routing techniques based on link utilization or static topology information are insufficient for minimizing the completion times of such transfers. We then developed Best Worst-case Routing (BWR), which is a routing heuristic that aims to route new transfers so as to minimize their worst-case completion times. We showed that this technique can improve completion times regardless of the scheduling policy used for the transmission of traffic.

We then discussed the deadline requirements of many inter-DC transfers and studied admission control for large transfers. Admission control helps prevent over-committing existing resources and makes sure that admitted transfers meet the deadlines they aimed for. Our major contribution has been to make such admission control as fast as possible to handle a large number of transfers as they arrive. The admission control considers both the routing of traffic and transmission control. We considered both cases of single path routing and multipath routing and showed that using up to two parallel paths offers considerable gains in admitted traffic. For fast admission control, we applied a new traffic allocation strategy that pushes the traffic of every transfer as close as possible to its deadline, which we call the As Late As Possible (ALAP) scheduling policy. With this allocation strategy, we can quickly determine if a new transfer can meet its deadline and compute a feasible allocation without formulating complex optimization problems. A small sketch of this allocation strategy over a single path is given below.
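To illustrate the idea, the following Python sketch allocates a deadline-constrained transfer over a single path using ALAP, filling timeslots backward from the deadline. The timeslot granularity, data structures, and names are illustrative assumptions rather than the dissertation's actual implementation.

def alap_allocate(volume, deadline, now, free_capacity):
    """Allocate `volume` units over timeslots (now, deadline], as late as possible.

    free_capacity: dict {timeslot: residual capacity of the path in that slot}.
    Returns {timeslot: allocated units} if the transfer can be admitted,
    or None if its deadline cannot be met (the transfer is rejected).
    """
    allocation = {}
    remaining = volume
    # Walk backward from the deadline toward the current timeslot.
    for t in range(deadline, now, -1):
        if remaining <= 0:
            break
        take = min(free_capacity.get(t, 0), remaining)
        if take > 0:
            allocation[t] = take
            remaining -= take
    if remaining > 0:
        return None                      # not enough residual capacity before the deadline
    for t, v in allocation.items():
        free_capacity[t] -= v            # commit the admitted transfer's reservation
    return allocation

# Path with 3 free units per timeslot; a 7-unit transfer due by timeslot 5.
free = {t: 3 for t in range(1, 11)}
print(alap_allocate(volume=7, deadline=5, now=0, free_capacity=free))
# -> {5: 3, 4: 3, 3: 1}: traffic is packed into the latest slots before the deadline.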
Next, we considered the problem of delivering objects from one location to multiple locations while paying attention to performance metrics such as completion times and deadlines. This problem has the one-to-many transmission property in common with the traditional multicasting problem, but has the added property that all the receivers of a transfer are known at its arrival time, which allows us to select a multicast tree upon its arrival. We called such transfers Point to Multipoint (P2MP) transfers. We used Steiner trees to minimize bandwidth usage while selecting them in a way that distributes load by shifting traffic across various trees to exercise all available capacity. This approach allowed us to reduce tail completion times while handling more traffic. We also discussed the same problem given deadlines for P2MP transfers and showed that, using the same adaptive tree selection technique combined with the ALAP scheduling policy, we can admit more traffic to the network and guarantee deadlines as well.

We then explored ways of further reducing the completion times of some receivers of P2MP transfers, given that not all the receivers have to receive the complete data at the same time. We observed that a single slow receiver can slow down all receivers attached to the sender on a tree and proposed to break receivers into multiple partitions. Each partition is then connected to the sender using a separate tree. By grouping receivers according to their download speeds or according to their proximity, we can then improve their overall reception rate. We presented algorithms for performing such partitioning and showed that it is effective. We also showed that the effectiveness of these techniques is a function of the network topology and link capacity distribution as well as the distribution of transfer volumes.

Finally, we aimed to further improve the completion times of a P2MP transfer by using parallel forwarding trees. We explored the application of edge-disjoint forwarding trees that are selected adaptively according to network load. We realized that parallel trees can considerably improve completion times while only minimally increasing bandwidth consumption. We also found that selecting more than two parallel trees does not offer any benefits in most cases but increases bandwidth consumption.

9.1 Future Directions

We propose a few research directions for interested researchers to explore. We categorize these ideas according to the part of the dissertation they target.

9.1.1 Adaptive Routing over Inter-DC Networks

We presented BWR as an effective routing technique that improves transfer completion times regardless of the scheduling policy used for traffic. The method we developed to compute the worst-case completion times of transfers per path can be further improved. Our current implementation, which merely sums up the remaining volumes of all the flows that intersect a path, is in general not tight enough and is too conservative as a worst-case estimate. For example, consider a path with two hops as shown in Figure 9.1. On the first hop, we intersect F1 with 4 remaining data units, and on the second hop, we intersect F2 with 5 remaining data units. The current method will simply use 4 + 5 = 9 as the worst-case start time on the path, but if the whole topology has only these two links, F1 and F2 can transmit in parallel, which means the worst case will be max(F1, F2), that is 5. Knowing this, an algorithm should prefer this path over another with a worst case of 8 (not shown in the figure). Computing the exact worst case may be possible by using a dependency graph, which can be computationally expensive. Furthermore, since optimizing one arriving flow does not necessarily provide benefits to future arrivals, we did not explore finding the exact worst case. Also, extending BWR to multipath routing, and the effect of inaccurate flow size information on routing performance, are other directions to explore.

Figure 9.1: Example scenario used in Section 9.1.1: a two-hop path whose first hop intersects flow F1 and whose second hop intersects flow F2.

9.1.2 Deadline-aware Point to Multipoint Transfers

The approach we presented in this dissertation for deadline-based admission control requires that all destinations can be reached before the deadline of that transfer. This, however, may be too restrictive for many applications: we might prefer to maximize the number of receivers that complete before the specified deadline per transfer while considering a minimum number of replicas that need to be made before that deadline. This objective is more practical in the sense that the minimum number of replicas represents some degree of reliability (which means we guarantee a required reliability degree) while allowing more transfers to be admitted, which increases network utilization and efficiency.

9.1.3 Receiver Completion Times of Point to Multipoint Transfers

Due to varying load on edges as a result of time zone differences, the total bandwidth per tree may not be significant as trees span many regions. To address this issue, store-and-forward can be used along with parallel trees to further utilize the capacity of wide area networks. With store-and-forward, one can build large scale overlay networks across datacenters and use intermediate nodes as large temporary buffers that store data in case the incoming rate is higher than the outgoing rate of a transfer. As time passes and bandwidth increases on the outgoing edges of such nodes, the temporary buffer will drain to the next hop overlay nodes. An overlay node can consist of multiple servers in every datacenter with enough capacity to store data over highly loaded hours and forward it later.
With this approach, overlay nodes will use simple point to point connections, but on a per-hop basis, to build a multicast overlay network [67].

9.1.4 Large-scale Implementation and Evaluation of Algorithms for Fast and Efficient Point to Multipoint Transfers

In this dissertation, we developed various algorithms for fast and efficient P2MP transfers and evaluated them through simulations. Large-scale evaluation of our techniques and algorithms over real inter-DC networks and using practical inter-DC applications is another direction for future research. For example, forwarding trees can be realized using SDN Group Tables [156], Bit Index Explicit Replication (BIER) [130], or standard multicast tables at the inter-DC switches. These approaches offer various trade-offs concerning the latency of installing a forwarding tree, the number of forwarding trees that can be set up at any given time, and the maximum rate at which traffic can be forwarded over forwarding trees. Comparison and analysis of how the various ways of implementing forwarding trees affect the efficiency and speed of inter-DC transfers is an exciting and valuable topic for future research.

Appendix A
NP-Hardness Proof for Best Worst-case Routing

From Chapter 3, we recall that over the inter-DC graph G, each edge e was associated with a set of ongoing flows F_e, and each flow F_i had a remaining volume V_i^r. The Best Worst-case Routing (BWR) problem was to select the path with minimum total weight between two vertices s and t, with the weight computed as the sum of the remaining volumes of all flows that have at least one common edge with the path. In other words, given a set of flows F, to find a best worst-case path we are looking for a subset of flows F' of F with minimum total sum of remaining data units such that a path from s to t still exists after removing all edges that carry a flow outside F'. In the following, we show that one instance of this problem, where the remaining volume of every flow is set to 1, is NP-Hard, and so BWR must be as well.

Problem 1. Consider the multi-graph of Figure A.1 and a set of labels L = {l_1, ..., l_m}. Between any two vertices i and i + 1, we have at least one edge, and each edge is associated with exactly one label. Also, there are no edges between vertices with non-consecutive numbers. We want to find a path P from s to t so that the total number of distinct labels on the edges associated with P is minimized.

Proposition 1. Problem 1 is NP-Hard.

Proof. We will reduce the well-known Set Cover problem to Problem 1. Consider an arbitrary instance of Set Cover with the universal set U = {1, 2, ..., n} and a collection of subsets S = {S_1, S_2, ..., S_m}. Construct a multi-graph G with n + 1 vertices labeled 0, 1, ..., n. For every S_j that contains element i in U, add an edge (i - 1, i) with label l_j to G corresponding to that subset. Now, any set cover using k subsets corresponds to a path from node 0 to node n that uses k labels. Conversely, any path on G from node 0 to node n which uses k labels covers all the elements of U and so corresponds to a set cover with k subsets. Therefore, finding a path in G with minimum total distinct labels corresponds to finding a minimum set cover on S, which is NP-hard. (Note that the multi-graph created here can be converted to a simple graph by replacing every edge with two edges labeled with the same label and a node in between.)

Figure A.1: The network used in Problem 1: a chain of vertices from s (vertex 0) to t (vertex n) with one or more labeled edges between every pair of consecutive vertices.

Problem 2. Consider some graph G(V, E) where each edge is associated with a set of labels drawn from L = {l_1, ..., l_m}. We want to find a path P from s to t so that the total number of distinct labels on the edges associated with P is minimized.
Proposition 2. Problem 2 is NP-Hard.

Proof. Problem 1 is an instance of Problem 2 where each edge is associated with exactly one label and there are edges only between consecutive nodes. Therefore, Problem 2 must be NP-Hard.

The flow routing problem has one more constraint, that is, all edges of a flow appear in consecutive order. In other words, all edges associated with a specific label appear on the graph G(V, E) in consecutive order from the source of the flow to its destination.

Proposition 3. Assuming a flow size of 1 for all flows, BWR is NP-Hard.

Proof. We will reduce Problem 2 to an instance of Problem 3, i.e., BWR with all flow sizes equal to 1. Let us take an instance of Problem 2. We associate every label from the set of all labels L with exactly one flow. For any label l, if all the edges labeled l on G are connected consecutively, we do not make any changes. Otherwise, we add dummy edges labeled l to G so that all edges with label l appear consecutively in the new graph G_1. We repeat this for all such labels l, arriving at graph G_k, where k is the number of labels whose edges do not appear consecutively on G. Next, to every dummy edge in G_k, we add m + 1 new and distinct dummy labels, which extends the set of all labels to L'. The graph G_k with the set of labels L' together form a valid instance of Problem 3. Any solution found for this instance is also a solution to Problem 2 and vice versa. That is because solutions found over the new graph with the new labels will not include any of the dummy edges, as such solutions would include at least m + 1 labels. This concludes our proof that BWR is NP-Hard.

Appendix B
SDN Switches that Support Group Table ALL

We list some Software Defined Networking (SDN) [76] products that support the Group Table ALL feature, which can be used to forward incoming packets to multiple outgoing ports via packet replication. When using Group Tables (a feature supported by OpenFlow [147] that allows complex group operations on incoming packets for purposes such as fast failover, load balancing, and multicasting [161]), each group entry can have multiple action buckets, each programmed with a set of actions. The "ALL" feature means all action buckets of a group entry will be executed, and each bucket will be supplied with a copy of the incoming packet that matches the group entry predicates. This feature has been in the OpenFlow standard since version 1.1 and was added for the purpose of flooding, broadcasting or multicasting [161]. Despite being part of the OpenFlow specifications, this feature has not been widely supported by switch vendors. As of 2016, physical switches on the market have started providing support for this feature. We merely list several products that currently support this feature and cite related documents which contain detailed information on how these features are actually supported (e.g., the maximum number of entries, the maximum number of action buckets per entry, and whether group chaining is allowed). Table B.1 provides a list of several products that can be used for building multicast forwarding trees.

Table B.1: SDN products with support for OFPGT_ALL.
Vendor Product HP [162] HP 5920 & 5900 Switch Series HP [163] HP 5130 EI Switch Series HP [164] HP Switch 2920 series, HP Switch 3500 series, HP Switch 3800 series, HP Switch 5400 series, v1 and v2 modules, HP Switch 5406R series, HP Switch 5412A series, HP Switch 6200 series, HP Switch 6600 series, HP Switch 8200 series, v1 and v2 modules Juniper Networks [165] MX Series, EX9200, QFX5100 and EX4600 Alcatel-Lucent [166] OmniSwitch 10K, OmniSwitch 9900, OmniSwitch 6900, OmniSwitch 6860, and OmniSwitch 6865 IBM [167] IBM System Networking RackSwitch G8264 Brocade [168] Brocade VDX 2741, Brocade VDX 6740, Brocade VDX 6940 and Brocade VDX 8770 157 Bibliography [1] The internet topology zoo (cogent). http://www.topology-zoo.org/files/Cogentco.gml. visited on July 19, 2017. [2] Sushant Jain, Alok Kumar, et al. B4: Experience with a globally-deployed software dened wan. SIGCOMM, 43(4):3{14, 2013. [3] http://www.topology-zoo.org/files/Agis.gml. [4] http://www.topology-zoo.org/files/Ans.gml. [5] The internet topology zoo (att north america). http://www.topology-zoo.org/files/AttMpls.gml. visited on July 19, 2017. [6] The Internet Topology Zoo (GEANT). http://www.topology-zoo.org/files/Geant2009.gml. [7] W. Xia, P. Zhao, Y. Wen, and H. Xie. A survey on data center networking (dcn): Infrastructure and operations. IEEE Communications Surveys Tutorials, 19(1):640{656, Firstquarter 2017. [8] J. Zhang, F. R. Yu, S. Wang, T. Huang, Z. Liu, and Y. Liu. Load balancing in data center networks: A survey. IEEE Communications Surveys Tutorials, 20(3):2324{2352, thirdquarter 2018. [9] Google Cloud: Products and services. [10] Directory of Azure Cloud Services. [11] Cloud Products. [12] Arjun Roy, Hongyi Zeng, Jasmeet Bagga, George Porter, and Alex C. Snoeren. Inside the social network's (datacenter) network. In SIGCOMM, pages 123{137. ACM, 2015. [13] Chen Zhang, Hans De Sterck, Ashraf Aboulnaga, Haig Djambazian, and Rob Sladek. Case study of scientic data processing on a cloud using hadoop. In International Symposium on High Performance Computing Systems and Applications, pages 400{415. Springer, 2009. [14] Compute engine - iaas - google cloud platform. https://cloud.google.com/compute/. [15] Microsoft azure: Cloud computing platform & services. https://azure.microsoft.com/. [16] Amazon web services (aws) - cloud computing services. https://aws.amazon.com/. [17] Arjun Singh, Joon Ong, Amit Agarwal, Glen Anderson, Ashby Armistead, Roy Bannon, Seb Boving, Gaurav Desai, Bob Felderman, Paulie Germano, Anand Kanagala, Je Provost, Jason Simmons, Eiichi Tanda, Jim Wanderer, Urs Hlzle, Stephen Stuart, and Amin Vahdat. Jupiter rising: A decade of clos topologies and centralized control in googles datacenter network. In Sigcomm '15, 2015. 158 [18] Albert Greenberg, James R. Hamilton, Navendu Jain, Srikanth Kandula, Changhoon Kim, Parantap Lahiri, David A. Maltz, Parveen Patel, and Sudipta Sengupta. VL2: A Scalable and Flexible Data Center Network. Commun. ACM, 54(3):95{104, March 2011. [19] Sriram Subramanian. Network in Hyper-scale data centers - Facebook, 2015. [20] Glenn Judd. Attaining the Promise and Avoiding the Pitfalls of TCP in the Datacenter. 12th USENIX Symposium on Networked Systems Design and Implementation (NSDI 15), pages 145{157, 2015. [21] Yibo Zhu, Haggai Eran, Daniel Firestone, Chuanxiong Guo, Marina Lipshteyn, Yehonatan Liron, Jitendra Padhye, Shachar Raindel, Mohamad Haj Yahia, and Ming Zhang. Congestion Control for Large-Scale RDMA Deployments. 
Abstract
As applications become more distributed to improve user experience and offer higher availability, businesses rely more than ever on geographically dispersed datacenters to host them. Dedicated inter-datacenter networks have been built that provide high visibility into network status and flexible control over traffic forwarding, offering quality communication across the instances of applications hosted on many datacenters. These networks are relatively small, with tens to hundreds of nodes, and are managed by the same organization that operates the datacenters, which makes centralized traffic engineering feasible. By coordinating data transmission from the services with routing over the inter-datacenter network, one can optimize network performance according to a variety of utility functions that take into account data transfer deadlines, network capacity consumption, and transfer completion times. Such optimization is especially relevant for long-running data transfers that occur across datacenters due to the replication of configuration data, multimedia content, and machine learning models.

In this dissertation, we study techniques and algorithms for fast and efficient data transfers across geographically dispersed datacenters over inter-datacenter networks. We discuss different forms and properties of inter-datacenter transfers and present a generalized optimization framework to maximize an operator-selected utility function. In the chapters that follow, we study in detail the problems of admission control for transfers with deadlines and of inter-datacenter multicast transfers. We present a variety of heuristic approaches while carefully considering their running time. For the admission control problem, our solutions significantly speed up the admission control process while admitting almost identical total traffic into the network. For the bulk multicasting problem, our techniques yield significant gains in receiver completion times with low computational complexity, which makes them highly applicable to inter-datacenter networks. Finally, we summarize our contributions and discuss possible future directions for researchers.
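To make the flavor of utility-driven, deadline-aware allocation described above concrete, the following is a minimal Python sketch, not code from the dissertation: it greedily allocates per-timeslot rates to transfers on fixed paths, serving earlier deadlines first. The transfer names, link labels, capacities, and the earliest-deadline-first policy are illustrative assumptions rather than the algorithms developed in the later chapters.

from dataclasses import dataclass

@dataclass
class Transfer:
    name: str
    path: list       # link identifiers along a fixed path, e.g. ["A-B", "B-C"] (hypothetical labels)
    volume: float    # remaining data volume (illustrative units, e.g. Gb)
    deadline: int    # last timeslot in which this transfer may still send

def allocate_slot(transfers, capacity, slot, slot_len=1.0):
    """Allocate one timeslot of link capacity, earliest deadline first (toy policy)."""
    residual = dict(capacity)                   # capacity left on each link in this slot
    rates = {}
    for t in sorted(transfers, key=lambda tr: tr.deadline):
        if t.volume <= 0 or t.deadline < slot:  # finished, or already past its deadline
            continue
        rate = min(residual[l] for l in t.path)  # bottleneck link along the path
        rate = min(rate, t.volume / slot_len)    # do not exceed the remaining demand
        for l in t.path:
            residual[l] -= rate
        t.volume -= rate * slot_len
        rates[t.name] = rate
    return rates

if __name__ == "__main__":
    capacity = {"A-B": 10.0, "B-C": 10.0}        # hypothetical per-link capacities (Gbps)
    transfers = [
        Transfer("ml-model-replica", ["A-B", "B-C"], volume=40.0, deadline=8),
        Transfer("index-push",       ["A-B"],        volume=15.0, deadline=2),
    ]
    for slot in range(3):
        print("slot", slot, allocate_slot(transfers, capacity, slot))

A real inter-datacenter traffic engineering system would solve this jointly with path selection and admission control, which is the subject of the chapters summarized above.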
Asset Metadata
Creator: Noormohammadpour, Mohammad (author)
Core Title: On efficient data transfers across geographically dispersed datacenters
School: Viterbi School of Engineering
Degree: Doctor of Philosophy
Degree Program: Electrical Engineering
Publication Date: 09/18/2019
Defense Date: 06/04/2019
Publisher: University of Southern California (original), University of Southern California. Libraries (digital)
Tag: inter-datacenter, OAI-PMH Harvest, optimization, routing, software-defined networking, Traffic engineering
Language: English
Contributor: Electronically uploaded by the author (provenance)
Advisor: Raghavendra, Cauligi (committee chair), Horowitz, Ellis (committee member), Prasanna, Viktor (committee member)
Creator Email: noormoha@alumni.usc.edu, noormoha@usc.edu
Permanent Link (DOI): https://doi.org/10.25549/usctheses-c89-220585
Unique identifier: UC11673081
Identifier: etd-Noormohamm-7818.pdf (filename), usctheses-c89-220585 (legacy record id)
Legacy Identifier: etd-Noormohamm-7818.pdf
Dmrecord: 220585
Document Type: Dissertation
Rights: Noormohammadpour, Mohammad
Type: texts
Source: University of Southern California (contributing entity), University of Southern California Dissertations and Theses (collection)
Access Conditions: The author retains rights to his/her dissertation, thesis or other graduate work according to U.S. copyright law. Electronic access is being provided by the USC Libraries in agreement with the a...
Repository Name: University of Southern California Digital Library
Repository Location: USC Digital Library, University of Southern California, University Park Campus MC 2810, 3434 South Grand Avenue, 2nd Floor, Los Angeles, California 90089-2810, USA