TOWARDS HIGHLY-AVAILABLE CLOUD AND CONTENT-PROVIDER NETWORKS

by

Mingyang Zhang

A Dissertation Presented to the
FACULTY OF THE USC GRADUATE SCHOOL
UNIVERSITY OF SOUTHERN CALIFORNIA
In Partial Fulfillment of the Requirements for the Degree
DOCTOR OF PHILOSOPHY (COMPUTER SCIENCE)

May 2022

Copyright 2022 Mingyang Zhang

Dedication

To my parents.

Acknowledgements

I am beyond grateful to my advisor, Ramesh Govindan, for his guidance, mentorship and encouragement along this journey. I learned so much from Ramesh: how to formulate a research problem, how to drive it from a vague idea to a large and concrete project, and how to convey my ideas clearly and concisely. From Ramesh, I not only learned how to do research but also how to be a better person. Ramesh will always be my role model in my future career. Being one of his students is the luckiest thing in my life.

I would love to thank Radhika Niranjan Mysore for mentoring me on the FatClique project. I learned so much from her, not only on topology design but also on how to be a good mentor and collaborator. I specially thank Radhika for her patient guidance and sincere encouragement. I will never forget the inspiring discussions with her.

I benefited a lot from my internship at Google, where I completed the Gemini project. I would like to thank Jeff Mogul, Rui Wang, Jianan Zhang and Amin Vahdat for their constant support and guidance. I am very grateful to Jeff for his detailed guidance and for helping me revise the paper until late at night. I thank Rui and Jianan for their constant support and inspiring discussions. I also learned a lot from Jad Hachem, Keqiang He, Chenguang Liu and Yikai Lin at Google.

I would also like to thank my collaborators on various projects, Pooria Namyar, Chao Wang, Sucha Supittayapornpong, Ying Zhang, and Wenfei Wu, for sharing their knowledge and great ideas with me. I would like to thank Barath Raghavan, Chao Wang, and Xuehai Qian for serving on my committee and providing feedback on this dissertation.

I thank previous and current NSL group members Rui Miao, Yitao Hu, Jianfeng Wang, Hang Qiu, Weiwu Pang, Yuliang Li, Pooria Namyar, Omid Alipourfard, Xiaochen Liu, Masoud Moshref Javadi, Zahaib Akhtar, Haonan Lu, Yi-Ching Chiu, Tobias Flach, Luis Pedrosa, Yurong Jiang, Xing Xu, and Jane Yen for their inspiring discussions and feedback on papers and talks. Your company brought so much joy to my Ph.D. life.

Finally, I would like to thank my parents, Haiyan and Wenju, for their enduring love and support. Without their support, this dissertation would not exist.

Table of Contents

Dedication
Acknowledgements
List Of Tables
List Of Figures
Abstract
Chapter 1: Introduction
  1.1 Large-Scale Datacenter Networks
  1.2 Challenges to Achieve High Availability
    1.2.1 Lifecycle Management Complexity
    1.2.2 Bursty Traffic with Practical Reconfigurability and Cost Constraints
    1.2.3 High Concurrency in the Microservices-based Control Plane
  1.3 Key Design Principles
  1.4 Contributions
  1.5 Organization
Chapter 2: Understanding Lifecycle Management Complexity of Datacenter Topologies
  2.1 Introduction
  2.2 Background
  2.3 Deployment Complexity
    2.3.1 Packaging, Placement, and Bundling
    2.3.2 Deployment Complexity Metrics
    2.3.3 Comparing Topology Classes
  2.4 Topology Expansion
    2.4.1 The Practice of Expansion
    2.4.2 An Expansion Step
    2.4.3 Expansion Complexity Metrics
    2.4.4 Comparing Topology Classes
  2.5 Towards Lower Lifecycle Complexity
    2.5.1 FatClique Construction
    2.5.2 FatClique Synthesis Algorithm
    2.5.3 FatClique Expansion
    2.5.4 Discussion
  2.6 Evaluating Lifecycle Complexity
    2.6.1 Methodology
    2.6.2 Patch Panel Placement
    2.6.3 Deployment Complexity
    2.6.4 Expansion Complexity
    2.6.5 FatClique Result Summary
  2.7 Conclusions and Future Work
Chapter 3: Gemini: Robust Topology-and-Traffic Engineering in Spine-Free Datacenters
  3.1 Introduction
  3.2 Motivation
  3.3 Measuring success
    3.3.1 Correlation between FCT and MLU
    3.3.2 FCT vs. Frequency of Overloaded Links
  3.4 Gemini Design
    3.4.1 Gemini Overview
    3.4.2 Minimizing Link Utilization
    3.4.3 Traffic Modeling
    3.4.4 Hedging
    3.4.5 Joint Solver Design
    3.4.6 Predictor and Controller operation
    3.4.7 Practical Considerations
  3.5 Evaluation
    3.5.1 Testbed Evaluation
    3.5.2 Large-scale Simulation
      3.5.2.1 Benefits of Demand-Awareness
      3.5.2.2 Prediction quality
    3.5.3 Sensitivity Analyses
Chapter 4: Towards Correct-by-Design Scalable Control Planes
  4.1 Introduction
    4.1.1 Large-scale Control Plane Basics
    4.1.2 Microservice-based Centralized Control Plane
    4.1.3 Modules in Microservice-based Control Plane
      4.1.3.1 SDN Applications
      4.1.3.2 Control Plane Core Modules
    4.1.4 Our Methodology
  4.2 Protocols
    4.2.1 Terminology
    4.2.2 Semantics of Switch and FPI States
    4.2.3 Basic DAG programming Protocols
    4.2.4 Basic Failover Protocols
    4.2.5 Basic Management Protocols
  4.3 Failure Models
    4.3.1 Complete Switch Failure Model
    4.3.2 Partial Transient Switch Failure Model
    4.3.3 Control Plane Failure Model
  4.4 Lessons from Bugs
    4.4.1 Type 1: State Management Issues
      4.4.1.1 Bugs Caused by Update-Before-Action and Solved by Action-Before-Update
      4.4.1.2 Bugs Caused by Update-Before-Action and Solvable by the Log-based Reconciliation
      4.4.1.3 Bugs Caused by Action-Before-Update
      4.4.1.4 Takeaways
    4.4.2 Type 2: Synchronization Issues
    4.4.3 Type 3: State Machines Related
      4.4.3.1 FPI State Machine Violation
      4.4.3.2 Wrong State Machine
    4.4.4 Type 4: Wrong Algorithms
  4.5 Convergence Time Analysis
    4.5.1 Total Protocol Convergence
    4.5.2 Added Convergence Time for a Solution
Chapter 5: Related Work
  5.1 Static Datacenter Topology Design
  5.2 Reconfigurable Datacenter Architecture
  5.3 Control Plane Design
Chapter 6: Conclusion
  6.1 Summary of Contributions
  6.2 Future Work
Bibliography
Appendix A: FatClique Algorithms and Analysis
  A.1 Clos Generation Algorithm
  A.2 Jellyfish Placement Algorithm
  A.3 Scale-invariance of Expansion
  A.4 FatClique Expansion Algorithm
  A.5 Expansion for Clos
  A.6 FatClique Topology Synthesis Algorithm
  A.7 Parameter Setting
  A.8 Other Metrics
Appendix B: Gemini Measurement and Prototype Results
  B.1 Correlation between FCT and MLU, over All Links
  B.2 FCTs in testbed experiments

List Of Tables

2.1 Deployment Complexity Comparison
2.2 Expansion Comparison (SLO = 90%)
2.3 Qualitative comparison of lifecycle management complexity
2.4 FatClique Variables
2.5 Scalability of Topologies
2.6 Capacities of topologies built with 32-port 40G switches. Small, medium and large scale topologies achieve 1/4, 4, and 16 times the capacity of Jupiter. The table also shows the sizes of the individual building blocks of these topologies in terms of number of switches. Abbreviations: e: edge, a: aggregation, sp: spine, cap: capacity, svr: server.
2.7 Bundle Types (Switch Radix = 32)
4.1 State Semantics
4.2 State Semantics
A.1 Datacenter settings, mostly from [61]
A.2 40G QSFP Mellanox cable lengths in meters (Length) and prices with transceivers (Price) [24]

List Of Figures

1.1 Three layers of the datacenter network
2.1 Fiber Re-bundling for Clos at Patch Panels
2.2 Clos Expansion with Patch Panels
2.3 Basic Rewiring Operations at a patch panel
2.4 Thin and Fat Edge Comparison
2.5 FatClique Topology
2.6 FatClique Block
2.7 FatClique Expansion example
2.8 Number of switches. C is Clos, J is Jellyfish, X is Xpander and F is FatClique.
2.9 Number of patch panels. C is Clos, J is Jellyfish, X is Xpander and F is FatClique.
2.10 Cabling cost. C is Clos, J is Jellyfish, X is Xpander and F is FatClique.
2.11 Expansion steps
2.12 Average Number of Rewired Links at a Single Patch Panel across Steps
3.1 FatTree: recursive Clos
3.2 Logical spine-free topology
3.3 Physical realization of spine-free topology, via patch panels
3.4 Normalized traffic vs. time: all pod-pairs, fabric F5
3.5 Average pod-level traffic skew over one month
3.6 Fraction of well-bounded pairs; higher is better
3.7 CDFs of demand-to-max ratio (DMR)
3.8 FCTs vs. p99 MLUs on production fabrics
3.9 FCTs vs. p99 ALUs on production fabrics
3.10 Gemini architecture. TM: traffic modeler; JS: joint topology and routing solver; RI: reconfiguration interval.
3.11 Per-link relationship: utilization vs. discard rate
3.12 Convex-hull-based Prediction
3.13 Hedging-based routing and topology. The example shows that hedging-based routing and topology engineering should be used for handling traffic bursts.
3.14 Multi-stage optimization example. Red dashed edges: trunks with maximum risk.
3.15 Due to pod speed heterogeneity (40G/100G), the uniform topology on the left cannot support the demand between pod 1 and pod 2. Since Gemini is demand-aware, it can find the feasible topology shown in the right figure.
3.16 Simulated vs. measured testbed utilization
3.17 Comparing baseline vs. predicted-best configs
3.18 P99.9 MLU impact of demand awareness
3.19 P99.9 ALU impact of demand awareness
3.20 P99.9 OLR impact of demand awareness
3.21 P99.9 stretch for one arbitrarily-chosen month ("M3") in our study; other months are qualitatively similar.
3.22 Gemini's predicted strategy vs. optimal strategy
3.23 Benefits from correct predictions
3.24 Misprediction cost, MLU (left) and ALU (right)
3.25 MLU/ALU vs. routing reconfiguration interval (r)
3.26 MLU/ALU vs. topology reconfig. interval (t)
3.27 MLU/ALU vs. number of matrices (k) in M3
3.28 MLU/ALU vs. aggregation window (w) in M3
4.1 Overall microservices-based control plane architecture
4.2 Packet loss in the last step parallel drain
A.1 Recursive Construction
A.2 Block-Based Construction 1
A.3 Block-Based Construction 2
A.4 Original topology is a Folded Clos with capacity 64. The required SLO during expansion is 75%, which means capacity should be no smaller than 48. There are 16 links on each pod. Due to the SLO constraint, for all plans, 4 links are allowed to be drained at each pod.
A.5 Clos Draining Link Redistribution Scheduling
A.6 Spectral Gap
A.7 Path Diversity for Small-scale Topologies
A.8 Path Diversity for Medium-scale Topologies
B.1 FCTs (inter-pod flows) vs. p99 MLUs (all links) on production fabrics
B.2 FCTs (inter-pod flows) vs. p99 ALUs (all links) on production fabrics
B.3 FCTs (inter-pod flows) vs. p99 OLRs (all links) on production fabrics
B.4 Testbed experiments – min-RTT, delivery rate
B.5 Testbed experiments – message-transfer latency

Abstract

Network availability is the biggest challenge faced by large content and cloud providers today, such as Google, Facebook, Microsoft, and Amazon, because of the scale and complexity of their networks. Network availability is measured by the percentage of time that a network is up; cloud providers are pushing hard to raise their network availability to 99.99% (or even 99.999%), which means that the network can be down for at most about 4 minutes (or 24 seconds) per month.

Achieving high availability is challenging. First, to keep up with ever-increasing traffic demand, cloud providers have to keep building and evolving their networks. Bugs and errors can be introduced by complex management operations during this process, which decreases network availability. Second, once network hardware is deployed, techniques such as traffic engineering are needed to make the best use of the resources; a fragile design might make the network congestion-prone under traffic bursts, which also decreases network availability. Third, control plane fragility can also impact network availability.

In this dissertation, we present algorithms, protocols and systems that can improve network availability at different layers of the network. We identify and study the following three key pieces for improving network availability. (i) In the management plane, we explored a new dimension for datacenter topology design, lifecycle management complexity, which attempts to understand the complexity of deploying a topology and expanding it. We devise complexity metrics for lifecycle management and design a new class of topologies, FatClique, which has lower management complexity and equivalent performance compared to previous designs.
(ii) In the data plane, we proposed Gemini, a spine-free datacenter architecture that can handle dynamic traffic workloads using commodity hardware while reconfiguring the network infrequently, rendering the spine-free architecture practical enough for deployment in the near future. (iii) In the control plane, we used formal methods to design and verify the correctness of the specification of a control plane design similar to Orion, Google's software-defined networking control plane.

Chapter 1: Introduction

Cloud and content providers keep building and evolving their large-scale datacenters to serve ever-increasing traffic [75, 33]. These datacenters are at the scale of hundreds of thousands of servers and tens of thousands of network switches [75, 72]. Network availability is measured by the percentage of time that a network is up; cloud providers are pushing hard to raise their network availability to 99.99% (or even 99.999%), which means that the network can be down for at most about 4 minutes (or 24 seconds) per month. Network availability is the biggest challenge faced by large content and cloud providers today, such as Google, Facebook, Microsoft, and Amazon. Based on a previous study [33] of Google's network infrastructure, it is challenging to maintain high availability for the following reasons: 1) scale and heterogeneity, which make failures the norm rather than the exception; 2) velocity of evolution, which keeps the network constantly changing; and 3) management complexity, which makes network management error-prone.
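To make these targets concrete, here is a quick back-of-the-envelope calculation (ours, assuming a 30-day month); it roughly reproduces the 4-minute and 24-second budgets quoted above.

```python
# Downtime budgets implied by the availability targets above, assuming a
# 30-day month; the "4 minutes / 24 seconds" figures in the text are the
# same calculation with slightly different rounding.
for availability in (0.9999, 0.99999):
    downtime_s = (1 - availability) * 30 * 24 * 3600
    print(f"{availability:.3%} -> {downtime_s / 60:.1f} min "
          f"({downtime_s:.0f} s) of downtime per month")
# 99.990% -> 4.3 min (259 s) of downtime per month
# 99.999% -> 0.4 min (26 s) of downtime per month
```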
1.1 Large-Scale Datacenter Networks

Services offered by cloud and content-provider networks are hosted on servers in large-scale datacenters. These servers are connected by a datacenter network. The datacenter network is a complex distributed system that consists of three logical layers, as shown in Figure 1.1. The top layer is the management plane, which stores network policy and generates high-level management intents and plans, such as topology expansions [33]. The middle layer is the control plane, which manages network state and programs the network devices [29, 46]. The bottom layer is the data plane, which consists of the network devices responsible for forwarding packets. Each layer faces unique challenges in achieving high availability.

Figure 1.1: Three layers of the datacenter network

1.2 Challenges to Achieve High Availability

1.2.1 Lifecycle Management Complexity

With datacenters living on for years, sometimes up to a decade [33, 75], their lifecycle costs can be high. A datacenter design that is hard to deploy can stall the rollout of services for months; this can be expensive considering the rate at which network demands have historically increased [75, 55]. A design that is hard to expand can leave the network functioning with degraded capacity, impacting the large array of services that depend on it.

Datacenter networks are dynamic and evolving. To keep up with ever-increasing traffic demand, cloud providers have to keep building and evolving their networks [33]. Bugs and errors can be introduced by management operations during this process, which decreases network availability. Examples of management operations include turning down a switch, replacing an optical fiber, rewiring the topology, and upgrading switch firmware. Based on a previous study from Google [33], 70% of network failures happen when a management operation is in progress. Therefore, a network design with high management complexity is error-prone and can lead to network failures that decrease network availability.

Low management complexity datacenter topology design. Over the past decade, there has been a long line of work on designing datacenter topologies [9, 81, 75, 76, 10, 17, 45, 8]. While most have focused on performance properties such as latency and throughput, and on resilience to link and switch failures, datacenter lifecycle management [72, 91] has largely been overlooked.

Multiple design objectives. It is a challenge to design a topology family with lower management complexity, lower cost, and high capacity at the same time. First, we lack a systematic understanding and quantification of management complexity. Second, due to the scale of real datacenter topologies (with tens of thousands of switching chips), finding a topology instance satisfying all objectives in reasonable time is challenging.

1.2.2 Bursty Traffic with Practical Reconfigurability and Cost Constraints

Congestion-freedom at low cost. Non-blocking datacenter networks that can serve any admissible traffic matrix, such as Clos [9], are expensive. To save cost, providers use oversubscribed networks, under-provisioning capacity at a particular level of the network hierarchy. These oversubscribed networks might become congested when traffic patterns change. As pointed out by [34], such designs can result in network congestion in practice, which reduces network availability significantly.

Constraints on practical reconfigurability. By reconfiguring routing and the topology, some blocking network designs [57, 58, 14] can maintain application-perceived performance in the face of dynamic workloads. However, current proposals rely on fast custom hardware that is not yet commercially available at scale [57, 58, 14]. In practice, commodity reconfigurable devices, such as patch panels, can only be reconfigured on the timescale of hours.

Dynamic datacenter traffic. Based on previous studies [34, 15], datacenter traffic is bursty, and most traffic consists of short flows that might last less than one millisecond. Trying to rapidly reconfigure the topology and routing upon a traffic burst might be too late given practical reconfigurability. Therefore, the topology and routing design must be robust to traffic bursts.

1.2.3 High Concurrency in the Microservices-based Control Plane

Robustness to failures. Centralized control planes for large datacenters or SDN-WANs are increasingly architected using microservices [29], for ease of development and management (a microservice architecture is an architectural style that structures an application as a collection of services which are independently scaled and loosely coupled through a network [5]). At the same time, these control planes must achieve high availability and fast convergence during topology changes or management operations, and must be robust to concurrent failures of switches and microservice components.

Race conditions in microservices-based systems. Because these components can exhibit significant concurrency, these designs are susceptible to errors resulting from distributed races (such as two components updating the same system state concurrently, and possibly inconsistently), some of them extremely subtle and triggered by a complex combination of failures [29].
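As a deliberately simplified illustration of such a race (our sketch, not code from the control plane studied in this dissertation), the following shows two components reading the same piece of shared state, updating their local copies independently, and writing back, so that the last writer silently erases the other's update.

```python
# Minimal "lost update" race: the interleaving is written out explicitly so
# the outcome is deterministic. The Store class and flow names are made up.
class Store:
    """Toy stand-in for shared control-plane state."""
    def __init__(self):
        self.flows = set()

    def read(self):
        return set(self.flows)     # each component works on its own copy

    def write(self, flows):
        self.flows = flows         # last writer wins


store = Store()
a_view = store.read()              # component A reads {}
b_view = store.read()              # component B reads {} concurrently
a_view.add("flow-to-pod-1")
store.write(a_view)                # A writes {flow-to-pod-1}
b_view.add("flow-to-pod-2")
store.write(b_view)                # B overwrites; A's update is lost
print(store.flows)                 # {'flow-to-pod-2'}
```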
1.3 Key Design Principles

To improve overall network availability, we need to improve the availability of each layer of the network: designing manageable topologies in the management plane, designing adaptive topology and routing in the data plane, and designing correct-by-design protocols in the control plane. In this thesis, we propose metrics, architectures, and protocols across multiple layers to systematically improve the availability of the whole network. Our designs are guided by three key design principles:

Datacenter topologies must enable low management complexity. There is a long line of research on datacenter topology design, as discussed in §5. However, few designs consider management complexity, which is essential to achieving high network availability. Following this first principle, we design metrics to quantify management complexity, study the management complexity of representative topology families, and then identify key insights for low-complexity topology design, such as the importance of uniformity and fat edges.

A reconfigurable datacenter architecture should be realizable with commodity hardware. To achieve good performance for reconfigurable networks, previous work mostly focuses on reconfiguring the topology frequently using fast, customized hardware. In practice, however, commodity reconfigurable optical switches can only be reconfigured on the timescale of milliseconds. We design a reconfigurable datacenter architecture around robust topology and traffic engineering instead of flow scheduling, and leverage a technique called hedging to handle traffic bursts; this achieves good performance using commodity hardware and needs to be reconfigured only at longer timescales.

A highly-available control plane should be correct-by-design. The network control plane is a mission-critical system on which not only the network but also the whole cloud system depends. At the same time, it is a complex distributed system with high concurrency, so it is imperative to have a correct design at an early stage of development. Instead of implementing the control plane directly, we use a specification language to formally specify the design and check its correctness using model checking, which can identify key design flaws without a real implementation. After iteratively fixing all design flaws, we obtain a provably-correct control plane design.

1.4 Contributions

We have identified and studied three key pieces of this dissertation, one in each layer of the network: 1) in the management plane, we propose metrics to systematically quantify lifecycle management complexity and propose a new family of cost-effective datacenter topologies with low management complexity; 2) in the data plane, we propose a spine-free reconfigurable datacenter architecture using commodity hardware to handle bursty datacenter traffic; 3) in the control plane, we formally specify and verify a whole microservices-based control plane, which is similar to Google's production control plane, and from the bugs we found during the verification process we distill lessons and principles for future control plane design. Next, we discuss these pieces in detail.

FatClique: a new topology family with low management complexity, high capacity and low cost. Most recent datacenter topology designs have focused on performance properties such as latency and throughput. We explore a new dimension, lifecycle management complexity, which attempts to understand the complexity of deploying a topology and expanding it.
By analyzing current practice in lifecycle management, we devise complexity metrics for lifecycle management, and show that existing topology classes have low lifecycle management complexity by some measures, but not by others. Motivated by this, we design a new class of topologies, FatClique, that, while being performance-equivalent to existing topologies, is comparable to, or better than, them by all our lifecycle management complexity metrics.

Gemini: a practical reconfigurable datacenter architecture with commodity hardware. To reduce cost, datacenter network operators are exploring blocking network designs. An example of such a design is a "spine-free" form of a Fat-Tree, in which pods directly connect to each other rather than via spine blocks. To maintain application-perceived performance in the face of dynamic workloads, these new designs must be able to reconfigure routing and the inter-pod topology. Gemini is a system designed to achieve these goals on commodity hardware while reconfiguring the network infrequently, rendering these blocking designs practical enough for deployment in the near future. The key to Gemini is the joint optimization of topology and routing, using as input a robust estimation of future traffic derived from multiple historical traffic matrices. Gemini "hedges" against unpredicted bursts by spreading them across multiple paths, to minimize packet loss in exchange for a small increase in path lengths. It incorporates a robust decision algorithm to determine when to reconfigure, and whether to use hedging. Data from tens of production fabrics allows us to categorize these fabrics as either low-volatility or high-volatility; these categories seem stable. For the former, Gemini finds topologies and routing with near-optimal performance and cost. For the latter, Gemini's use of multi-traffic-matrix optimization and hedging avoids the need for frequent topology reconfiguration, with only marginal increases in path length. As a result, Gemini can support existing workloads on these production fabrics using a spine-free topology that is half the cost of the existing topology on these fabrics.

Correct-by-design distributed protocols for the microservices-based control plane. Centralized control planes for large datacenters or SDN WANs are increasingly architected using microservices, for ease of development and management. At the same time, these control planes must achieve high availability and fast convergence during topology changes or management operations, and must be robust to concurrent failures of switches and microservice components. Because these components can exhibit significant concurrency, these designs are susceptible to errors resulting from distributed races, some of them extremely subtle and triggered by a complex combination of failures. In this dissertation, we have used formal methods to design and verify the correctness of the specification of a microservices-based control plane design similar to Orion. This design follows current practice and eschews heavyweight race-free implementations (e.g., using distributed transactions) for performance; we show how it is possible to achieve a provably correct specification of a fast control plane even in this setting, using a careful combination of ordering of operations, state recording and recovery, and careful management of state transitions.
This comes at a cost, however: the complexity of the specification is nearly two orders of magnitude higher than that required for correctness in the absence of failures.

1.5 Organization

The rest of the dissertation is organized as follows. Chapter 2 describes metrics for quantifying the lifecycle management complexity of datacenter topologies and a new family of topologies with low management complexity. Chapter 3 describes Gemini, a practical reconfigurable datacenter architecture with commodity hardware. Chapter 4 describes the specification and verification of distributed protocols for the microservices-based control plane. Chapter 5 summarizes related work. Chapter 6 discusses future work and concludes this dissertation.

Chapter 2: Understanding Lifecycle Management Complexity of Datacenter Topologies

2.1 Introduction

Over the past decade, there has been a long line of work on designing datacenter topologies [9, 81, 75, 76, 10, 17, 45, 8]. While most have focused on performance properties such as latency and throughput, and on resilience to link and switch failures, datacenter lifecycle management [72, 91] has largely been overlooked. Lifecycle management is the process of building a network, physically deploying it on a datacenter floor, and expanding it over several years so that it is available for use by a constantly increasing set of services.

With datacenters living on for years, sometimes up to a decade [75, 33], their lifecycle costs can be high. A datacenter design that is hard to deploy can stall the rollout of services for months; this can be expensive considering the rate at which network demands have historically increased [75, 55]. A design that is hard to expand can leave the network functioning with degraded capacity, impacting the large array of services that depend on it. It is therefore desirable to commit to a datacenter network design only after getting a sense of its lifecycle management cost and complexity over time. Unfortunately, the costs of the large array of components needed for deployment, such as switches, transceivers, cables, racks, patch panels (a patch panel, or wiring aggregator, is a device that simplifies cable re-wiring), and cable trays, are proprietary and change over time, and so are hard to quantify. An alternative approach is to develop complexity measures (as opposed to dollar costs) for lifecycle management, but as far as we know, no prior work has addressed this. In part, this is because intuitions about lifecycle management are developed over time and with operational experience, and these lessons are not made available universally. Unfortunately, in our experience, this lack of a clear understanding of lifecycle management complexity often results in costly mistakes in the design of datacenters that are discovered during deployment and therefore cannot be rectified. Our work is a first step towards useful characterizations of lifecycle management complexity.

Contributions. To this end, our work makes three contributions. First, we design several complexity metrics (§2.3 and §2.4) that can be indicative of lifecycle management costs (i.e., capital expenditure, time, and manpower required). These metrics include the number of: switches, patch panels, bundle types, expansion steps, and links to be re-wired at a patch panel rack during an expansion step. We design these metrics by identifying structural elements of network deployments that make their deployment and expansion challenging.
For instance, the number of switches in the topology determines how complex the network is in terms of packaging – laying out switches into homogeneous racks in a space-efficient manner. Wiring complexity can be assessed by the number of cable bundles and patch panels a design requires. As these increase, the complexity of manufacturing and packaging all the different cable bundles efficiently into cable trays, and then routing them from one patch panel to the next, can be expected to increase. Finally, because expansion is carried out in steps [91], where the network operates at degraded capacity at each step, the number of expansion steps is a measure of the reduced availability induced by lifecycle management. Wiring patterns also determine the number of links that need to be rewired at a patch panel during each step of expansion, a measure of step complexity [91].

Our second contribution is to use these metrics to compare the lifecycle management costs of the two main classes of datacenter topologies recently explored in the research literature (§2.2), Clos [9] and expander graphs [76, 81]. We find that neither class dominates the other: Clos has relatively lower wiring complexity; its symmetric design leads to more uniform bundling (and fewer cable bundle types); but expander graphs at certain scales can have simpler packaging requirements due to their edge expansion property [76], and they end up using far fewer switches than Clos to achieve the same network capacity. Expander graphs also demonstrate better expansion properties because they have fat edges (§2.4), which permit more links to be rewired in each step.

Finally, we design and synthesize a novel and practical class of topologies called FatClique (§2.5) that has lower overall lifecycle management complexity than Clos and expander graphs. We do this by combining favorable design elements from these two topology classes. By design, FatClique incorporates three levels of hierarchy and uses a clique as a building block while ensuring edge expansion. At every level of its hierarchy, FatClique is designed to have fat edges, for easier expansion, while using far fewer patch panels and therefore less inter-rack cabling. Evaluations of these topology classes at three different scales, the largest of which is 16× the size of Jupiter, show that FatClique is the best at most scales by all our complexity metrics. It uses 50% fewer switches and 33% fewer patch panels than Clos at large scale, and has a 23% lower cabling cost (an estimate we are able to derive from published cable prices). Finally, FatClique can permit fast expansion while degrading network capacity by small amounts (2.5-10%): at these levels, Clos can take 5× longer to expand the topology.

2.2 Background

Datacenter topology families. Data centers are often designed for high throughput, low latency, and resilience. Existing data center designs can be broadly classified into the following families: (a) Clos-like tree topologies, e.g., Google's Jupiter [75], Facebook's fbfabric [10], Microsoft's VL2 [34], F10 [53]; (b) expander-graph-based topologies, e.g., Jellyfish [76], Xpander [81]; (c) "direct" topologies built from multi-port servers, e.g., BCube [35], DCell [36]; (d) low-diameter, strongly-connected topologies that rely on high-radix switches, e.g., Slimfly [17], Dragonfly [45]; and (e) reconfigurable optical topologies like RotorNet and ProjectToR [57, 28, 32, 38, 94].
Of these, Clos- and expander-based topologies have been shown to scale using widely deployed merchant silicon. The ecosystem around the hardware used by these two classes (e.g., cabling, cable trays, rack sizes) is mature and well-understood, allowing us to quantify some of the operational complexity of these topologies. Direct multi-port server topologies and some reconfigurable optical topologies [57, 32, 38, 94] rely on newer hardware technologies that are not mainstream yet. It is hard to quantify the operational costs of these classes without making significant assumptions about such hardware. Low-diameter topologies like Slimfly [17] and Dragonfly [45] can be built with hardware that is available today, but they require strongly connected groups of switches. Their incremental expansion comes at high cost and complexity; high-radix switches either need to be deployed well in advance, or every switch in the topology needs to be upgraded during expansion, to preserve low diameter. To avoid estimating the operational complexity of topologies that rely on new hardware, or of topologies that unacceptably constrain expansion, we focus on the Clos and expander families.

Clos. A logical Clos topology with N servers can be constructed using switches of radix k connected in n = log_{k/2}(N/2) layers, based on a canonical recursive algorithm in [85]. (This equation yields a Clos with 1:1 oversubscription; for an oversubscription of x:y, n = log_{k/2}(yN/(2x)) layers are needed.) Fattree [9] and Jupiter [75] are special cases of the Clos topology with 3 and 5 layers, respectively. Clos construction naturally allows switches to be packaged together to form a chassis [75]. Since there is no known generic Clos packaging algorithm that can help design such a chassis for a Clos of any scale, we designed one to help our study of its operational complexity. We present this algorithm in §A.1.

Expander graphs. Jellyfish and Xpander benefit from the high edge expansion of expander graphs to use a near-optimal number of switches, while achieving the same bisection bandwidth as Clos-based topologies [81]. Xpander splits N servers among switches by attaching s servers to each switch. With a k-port switch, the remaining p = k - s ports are connected to other switches that are organized in p blocks called metanodes. Metanodes are groups of l = N/(s(p+1)) switches, which grow as the topology scale N increases. There are no connections between the switches of a metanode. Jellyfish is a degree-bounded random graph (see [76] for more details).

Takeaway. A topology with high edge expansion [81] can achieve a target capacity with fewer switches, leading to lower overall cost.
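As a quick sanity check of these sizing formulas, the short sketch below (ours; the parameter choices are illustrative) computes the Clos layer count and expander-style switch counts for radix-32 switches at the medium scale used later in this chapter.

```python
# Sizing formulas from Section 2.2; parameter choices are ours, for illustration.
import math

k = 32        # switch radix
N = 131_072   # number of servers (the "medium" scale used in Table 2.1)

# Clos with 1:1 oversubscription: n = log_{k/2}(N/2) layers.
n = math.log(N / 2, k / 2)
print(f"Clos layers: {n:.0f}")                        # -> 4, i.e., a 4-layer Clos

# Xpander-style sizing: s servers per switch leaves p = k - s switch-facing
# ports; each metanode contains l = N / (s * (p + 1)) switches.
s = 8
p = k - s
print(f"Expander switch count: {N // s}")             # -> 16384 switches
print(f"Switches per metanode: {N / (s * (p + 1)):.0f} (rounded)")
```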
2.3 Deployment Complexity

Deployment is the process of realizing a physical topology in a datacenter space (e.g., a building) from a given logical topology. Deployment complexity can be reduced by careful packaging, placement, and bundling strategies [75, 45, 8].

2.3.1 Packaging, Placement, and Bundling

Packaging of a topology involves the careful arrangement of switches into racks, while placement involves arranging these racks into rows on the datacenter floor. The spatial arrangement of the topology determines the type of cables needed between switches. For instance, if two connected switches are within the same rack, they can use short-range, cheaper copper cables, while connections between racks require more expensive optical cables. Optical cable costs are determined by two factors: the cost of transceivers and the length of cables (§2.3.2). Placement of switches on the datacenter floor can also determine costs: connecting two switches placed at opposite ends of the datacenter building might require long-range cables and high-end transceivers.

Chassis, racks, and blocks. Packaging connected switches into a single chassis using a backplane completely removes the need for physical connecting cables. At scale, the cost and complexity savings from using a chassis backplane can be significant. One or more interconnected chassis can be packed into racks such that: (a) racks are as homogeneous as possible, i.e., a topology makes use of only a few types of racks to simplify manufacturing, and (b) racks are packed as densely as possible to reduce space wastage. Some topologies define larger units of co-placement and packaging called blocks, which consist of groups of racks. Examples of blocks include pods in Fattree. External cabling from racks within a block is routed to wiring aggregators (i.e., patch panels [59]) to be routed to other blocks. For blocks to result in lower deployment complexity, three properties must be met: (a) the ports on the patch panel that a block connects to are not wasted when the topology is built out to full scale, (b) wiring out of the block should be as uniform as possible, and (c) racks in a block must be placed close to each other to reduce the length and complexity of wiring.

Bundling and cable trays. When multiple fibers from the same set of physically adjacent (or neighboring) racks are destined to another set of neighboring racks, these fibers can be bundled together. A fiber bundle is a fixed number of identical-length fibers between two clusters of switches or racks. Manufacturing bundles is simpler than manufacturing individual fibers, and handling such bundles significantly simplifies operational complexity. Cable bundling reduces capex and opex by around 40% in Jupiter [75]. Patch panels facilitate bundling, since a patch panel represents a convenient aggregation point to create and route bundles from the set of fibers destined to the same patch panel (or the same set of physically proximate patch panels).

Figure 2.1: Fiber Re-bundling for Clos at Patch Panels

Figure 2.1 shows a Clos topology instance (left) and its physical realization using patch panels (right). Each aggregation block in the Clos network connects with one link to each spine block. The figure on the right shows how these links are routed physically. Bundles with two fibers each from two aggregation blocks are routed to two (lower) patch panels. At each patch panel, these fibers are rebundled, by grouping fibers that go to the same spine into new bundles, and routed to two other (upper) patch panels that connect to the spines. The bundles from the upper patch panels are then routed to the spines. Figure 2.1 assumes that patch panels are used as follows: bundles are connected to both the front and back ports of patch panels. For example, bundles from the aggregation layer connect to front ports of patch panels and bundles from spines connect to back ports. This enables bundle aggregation and rebundling and simplifies topology expansion. ([91]'s usage of patch panels is slightly different: all bundles are connected to the front ports of patch panels, and links are established using jumper cables between the back ports. For patch panels of a given port count, both approaches require the same number of patch panels. Our approach enables bundling closer to the aggregation and spine layers; [91] does not describe how bundling is accomplished in their design.)
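The regrouping step at a lower patch panel can be sketched as follows (our illustration; the block and spine names are made up): fibers arrive bundled by source aggregation block and leave grouped by destination spine, so that each outgoing bundle heads toward a single upper patch panel.

```python
# A minimal sketch of the rebundling in Figure 2.1, regrouping fibers at one
# lower patch panel by destination spine. Names are illustrative only.
from collections import defaultdict

# (source aggregation block, destination spine) for each incoming fiber
incoming_fibers = [
    ("aggr-1", "spine-1"), ("aggr-1", "spine-2"),
    ("aggr-2", "spine-1"), ("aggr-2", "spine-2"),
]

outgoing_bundles = defaultdict(list)
for src, dst in incoming_fibers:
    outgoing_bundles[dst].append(src)      # regroup fibers by destination spine

for spine, members in sorted(outgoing_bundles.items()):
    print(f"bundle to {spine}: fibers from {members}")
# bundle to spine-1: fibers from ['aggr-1', 'aggr-2']
# bundle to spine-2: fibers from ['aggr-1', 'aggr-2']
```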
Bundles and fibers are routed through the datacenter on cable trays. The cables that aggregate at a patch panel rack must be routed overhead using over-row and cross-row trays [61]. Trays have capacity constraints [79], which can constrain rack placement, block sizes, and patch panel placement. Today, trays can support at most a few thousand fibers [79]. With current rack and cable tray sizes, a single rack of patch panels can be accommodated by four overhead cable trays, arranged in four directions. To avoid aggregating too many links in a single location, it is desirable to space such patch panel racks apart to accommodate more cable trays. This consideration in turn constrains block sizes: if cables from a block must all be routed locally, it is desirable that the block connect to only a single rack of patch panels.

2.3.2 Deployment Complexity Metrics

Based on the previous discussion, we identify several metrics that quantify the complexity of the two aspects of datacenter topology deployment: packaging and placement. In the next subsection, we use these metrics to identify differences between the Clos and expander graph topology classes.

Number of Switches. The total number of switches in the topology determines the capital expenditure for the topology, but it also determines the packaging complexity (switches need to be packed into chassis and racks) and the placement complexity (racks need to be placed on the datacenter floor).

Number of Patch Panels. By acting as bundle waypoints, the number of patch panels captures one measure of wiring complexity. The more patch panels there are, the shorter the cable lengths from switches to the nearest patch panel, but the fewer the bundling opportunities, and vice versa. The number of patch panels needed is a function of topological structure. For instance, in a Clos topology, if an aggregation layer fits into one rack or a neighboring set of racks, a patch panel is not needed between the ToR and the aggregation layer. However, for larger Clos topologies where an aggregation block can span multiple racks, ToR-to-aggregation links may need to be rebundled through a patch panel. We discuss this in detail in §2.6.2.

Number of Bundle Types. The number of patch panels alone does not capture wiring complexity. The other measure is the number of distinct bundle types. A bundle type is represented by a tuple of (a) the capacity, i.e., the number of fibers in the bundle, and (b) the length of the bundle. If a topology requires only a small number of bundle types, its bundling is more homogeneous; manufacturing and procuring such bundles is significantly simpler, and deploying the topology is also simplified, since fewer bundling errors are likely with fewer types.

These complexity measures are complete. The number of cable trays, the design of the chassis, and the number of racks can be derived from the number of switches (and the number of servers and the datacenter floor dimensions, which are inputs to the topology design). The number of cables and transceivers can be derived from the number of patch panels.
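To make the bundle-type metric concrete, a minimal sketch (ours; the bundle list is hypothetical) counts distinct (fiber count, length) pairs over a wiring plan:

```python
# A bundle type is the pair (number of fibers, length); counting distinct
# pairs over a (made-up) wiring plan yields the metric directly.
from collections import Counter

bundles = [                          # (fiber_count, length_in_meters)
    (16, 10), (16, 10), (16, 30),    # aggregation -> patch panel
    (32, 50), (32, 50), (32, 70),    # patch panel -> patch panel
    (16, 30), (32, 70),              # patch panel -> spine
]
bundle_types = Counter(bundles)
print(f"{len(bundle_types)} bundle types")           # -> 4 bundle types
for (fibers, length), count in sorted(bundle_types.items()):
    print(f"  {count} bundles of {fibers} fibers x {length} m")
```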
In some cases, a metric is related to another metric, but not completely subsumed by it. For example, the number of switches determines rack packaging, which only partially determines the number of transceivers per switch. The other determinant of this quantity is the connectivity in the logical topology (which switch is connected to which other switch). Similarly, the number of patch panels can influence the number of bundle types, but bundle types are also determined by logical connectivity.

2.3.3 Comparing Topology Classes

To understand how the two main classes of topologies compare by these metrics, we apply them to a Clos topology and a Jellyfish topology that support the same number of servers (131,072) and the same bisection bandwidth. This topology corresponds to twice the size of Jupiter. In §3.5, we perform a more thorough comparison at larger and smaller scales, and we describe the methodology by which these numbers were generated.

                   4-layer Clos (Medium)   Jellyfish
  #servers         131,072                 131,072
  #switches        28,672                  16,384
  #bundle types    74                      1,577
  #patch panels    5,546                   7,988
Table 2.1: Deployment Complexity Comparison

Table 2.1 shows that the two topology classes are qualitatively different by these metrics. Consistent with the finding in [76], Jellyfish needs only a little over half the switches of Clos to achieve comparable capacity, due to its high edge expansion. But by other measures, Clos performs better. It exposes far fewer ports outside the rack (a little over half that of Jellyfish); we say Clos has better port-hiding. A pod in this Clos contains 16 aggregation and 16 edge switches (we follow the definition of pod in [9]). The aggregation switches can be packed into a single rack, so bundles from edge switches to aggregation switches do not need to be rebundled through patch panels, and we only need two layers of patch panels between the aggregation and spine layers. In Jellyfish, by contrast, almost all links are inter-rack links, so it requires more patch panels. Moreover, for Clos, since each pod has the same number of links to each spine, all bundles in Clos have the same capacity (number of fibers). The lengths of bundles can differ, depending on the relative placement of the patch panels between the aggregation and spine layers, so Clos has 74 bundle types. Jellyfish, however, is a purely random graph without structure; to enable bundling, we group a fixed number of neighboring racks into blocks. Since connectivity is random, the number of links between blocks is not uniform, and Jellyfish needs almost 20× the number of bundle types. In §3.5, we show that Xpander also has qualitatively similar behavior at large scale.

Takeaway. Relative to a structured hierarchical class of topologies like Clos, the expander graph topology has inherently higher deployment complexity in terms of the number of bundle types and cannot support port-hiding well.

2.4 Topology Expansion

The second important component of topology lifecycle management is expansion. Datacenters are rarely deployed at maximal capacity in one shot; rather, they are gradually expanded as network capacity demands increase.

2.4.1 The Practice of Expansion

In-place Expansion. At a high level, expanding a topology involves two conceptual phases: (a) procuring new switches, servers, and cables and laying them out on the datacenter floor, and (b) re-wiring (or adding) links between switches in the existing topology and the new switches. Phase (b), the re-wiring phase, can potentially disrupt traffic; as links are re-wired, network capacity can drop, leading to traffic loss.
To avoid traffic loss, providers can either take the existing topology offline (migrating services away, for example, to another datacenter), or carefully schedule link re-wiring while carrying live traffic so as to maintain a desired target capacity. The first choice can impact service availability significantly. So, today, datacenters are expanded while carrying live traffic [72, 33, 75, 91]. To do this, expansion is carried out in steps, where at each step the capacity of the topology is guaranteed to be at least a percentage p of the capacity of the existing topology. This fraction is sometimes called the expansion SLO. Today, many providers operate at expansion SLOs of 75% [91]; higher SLOs of 85-90% impact availability budgets less while allowing providers to carry more traffic during expansion.

The unit of expansion. Since expansion involves procurement, topologies are usually expanded in discrete units called blocks to simplify procurement and layout logistics. In a structured topology, there are natural candidates for blocks. For example, in a Clos, a pod can be a block, while in an Xpander, the metanode can be a block. During expansion, a block is first fully assembled and placed, and links between switches within the block are connected (as an aside, an Xpander metanode has no such links). During the re-wiring phase, only links between existing blocks and new blocks are re-wired. (This phase does not re-wire links between switches within an existing block.) Aside from simplifying logistics, expanding at the granularity of a block preserves structure in structured topologies.

2.4.2 An Expansion Step

What happens during a step. Figure 2.2 shows an example of Clos expansion. The upper left figure shows a partially-deployed logical Clos, in which each spine and aggregation block are connected by two links. The upper right is the target fully-deployed Clos, where each spine and aggregation block are connected by a single link. During expansion, we need to redistribute half of the existing links (dashed) to the newly added spines without violating wiring and capacity constraints.

Figure 2.2: Clos Expansion with Patch Panels

Suppose we want to maintain 87.5% of the capacity of the topology (i.e., the expansion SLO is 0.875). This expansion will require 4 steps in total, where each patch panel is involved in 2 of these steps. In Figure 2.2, we show only the rewiring process at the second existing patch panel. To maintain 87.5% capacity at each pod, only one link is allowed to be drained at a time. In the first step, the red link from the first existing aggregation block and the green link from the second existing aggregation block are rewired to the first new spine block. In the second step, the orange link from the first existing aggregation block and the purple link from the second existing aggregation block are rewired to the first new spine block. A similar process happens at the first patch panel.
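The step count in this example follows from simple arithmetic, sketched below (the helper and the specific numbers are ours, chosen to mirror the walk-through above): the SLO bounds how many of a pod's uplinks may be drained at once, which in turn bounds how many links can move per step.

```python
# Step-count arithmetic for SLO-constrained expansion; parameters are assumed
# for illustration and are not taken from Figure 2.2 itself.
import math

def expansion_steps(uplinks_per_pod: int, links_to_rewire: int, slo: float) -> int:
    """Steps needed when at most floor((1 - slo) * uplinks_per_pod) links may
    be drained (and hence rewired) per pod at any one time."""
    drainable = math.floor((1 - slo) * uplinks_per_pod)
    if drainable == 0:
        raise ValueError("SLO leaves no headroom to drain any link")
    return math.ceil(links_to_rewire / drainable)

# E.g., a pod with 8 uplinks, 4 of which must move to new spine blocks, at an
# SLO of 87.5%: only one link may be drained at a time, so 4 steps are needed.
print(expansion_steps(uplinks_per_pod=8, links_to_rewire=4, slo=0.875))  # 4
```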
Draining a link involves programming the switches at each end of the link to disable the corresponding ports, and may also require reprogramming other switches or ports to route traffic around the disabled link. Second, one or more human operators physically rewire the links at a patch panel (explained in more detail below). Third, the newly wired links are tested for bit errors by sending test traffic through them. Finally, the new links are undrained. By far the most time consuming part of each step is the second sub-step, which requires human involvement. This sub-step is also the most important from an availability perspective; the longer this sub-step takes, the longer the datacenter operates at reduced capacity, which can impact availability targets [33].

The role of patch panels in re-wiring. The lower figure in Figure 2.2 depicts the physical realization of the (logical) re-wiring shown in the upper figure. (For simplicity, the figure only shows the re-wiring of links on one patch panel to a new pod.) Fibers and bundles originate and terminate at patch panels, so re-wiring requires reconnecting input and output ports at each patch panel. One important constraint in this process is that re-wiring cannot remove fibers that are already part of an existing bundle. Patch panels help localize rewiring and reuse existing cable bundling during expansions.

Figure 2.3 shows, in more detail, the rewiring process at a single patch panel. The leftmost figure shows the original wiring with connections (a, A), (b, B), (c, C), (d, D). To enable expansion, a topology is always deployed such that some ports at the patch panel are reserved for expansion steps. In the figure, we use these reserved ports to connect new fibers e, f, E and F (Phase 1). To get to a target wiring in the expanded network with connections (a, A), (b, B), (e, C), (f, D), (c, E), (d, F), the following steps are taken: (1) Traffic is drained from (c, C), (d, D); (2) Connections (c, C), (d, D) are rewired, with c being connected to E, d being connected to F, and so on; and (3) The new links are undrained, allowing traffic to use the new capacity.

Figure 2.3: Basic Rewiring Operations at a patch panel. Original wiring: (a, A), (b, B), (c, C), (d, D). Phase 1: route fibers {e, f, E, F} to the patch panel's reserved ports. Phase 2: rewire the patch panel to the target wiring.

Figure 2.4: Thin and Fat Edge Comparison. Draining 25% of the links at a thin edge causes 25% loss of server capacity; at a fat edge, 0% loss.

Metric | 4-layer Clos (Medium) | Jellyfish
Average # links rewired per patch panel rack | 832 | 470
Expansion steps | 6 | 3
North-to-south capacity ratio | 1 | 3
Table 2.2: Expansion Comparison (SLO = 90%)

2.4.3 Expansion Complexity Metrics

We identify two metrics that quantify expansion complexity and use these metrics to identify differences between Clos and Jellyfish in the next subsection.

Number of expansion steps. As mentioned, each expansion step requires a series of sub-steps which cannot be parallelized. Therefore the number of expansion steps determines the total time for expansion.

Average number of rewired links in a patch panel rack per step. With patch panels, manual rewiring dominates the time taken within each expansion step. Within steps, it is possible to parallelize rewiring across racks of patch panels. With such parallelization, the time taken to rewire a single patch panel rack dominates the time taken for each expansion step.
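To make these two metrics concrete, here is a minimal Python sketch (illustrative helper names, not part of the dissertation's tooling) that computes both metrics from an expansion plan, plus a lower bound on the number of steps implied by an expansion SLO for a thin-edged block:

```python
from math import ceil

def expansion_metrics(plan):
    """plan: list of steps; each step is a dict {rack_id: links_rewired_there}.
    Returns (number_of_steps, average_rewired_links_per_patch_panel_rack_per_step)."""
    num_steps = len(plan)
    per_rack_counts = [n for step in plan for n in step.values()]
    avg_rewired = sum(per_rack_counts) / len(per_rack_counts) if per_rack_counts else 0.0
    return num_steps, avg_rewired

def min_steps_lower_bound(rewire_fraction, slo):
    """Lower bound on expansion steps when, as for a thin-edged Clos pod
    (north-to-south ratio of 1), at most a (1 - slo) fraction of a block's
    links may be drained in any one step."""
    return ceil(rewire_fraction / (1.0 - slo))

# Example: rewiring half of a pod's uplinks at an 87.5% SLO needs at least 4 steps,
# matching the Figure 2.2 example.
print(min_steps_lower_bound(0.5, 0.875))
```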
2.4.4 Comparing Topology Classes

Table 2.2 shows the value of these measures for a medium-sized Clos and a comparable Jellyfish topology, when the expansion SLO is 90%. (§3.5 has more extensive comparisons for these metrics, and also describes the methodology more carefully.) In this setting, the number of links rewired per patch panel rack in Jellyfish is nearly a factor of two smaller than in Clos. Moreover, Jellyfish requires 3 steps, while Clos requires twice as many.

To understand why Jellyfish requires fewer steps, we define a metric called the north-to-south capacity ratio for a block. This is the ratio of the aggregate capacity of all "northbound" links exiting a block to the aggregate capacity of all "southbound" links to/from the servers within the block. Figure 2.4 illustrates this ratio: a thin edge (left) has an equal number of southbound and northbound links, while a fat edge (right) has more northbound links than southbound links. A Clos topology has a thin edge, i.e., this ratio is 1, since the block is a pod. Now, consider an expansion SLO of 75%. This means that the southbound aggregate capacity must be at least 75%. That implies that, for Clos, at most 25% of the links can be re-wired in a single step. However, Jellyfish has a much higher ratio of 3, i.e., it has a fat edge. This means that many more links can be rewired in a single step in Jellyfish than in Clos. This property of Jellyfish is what enables it to require fewer expansion steps.

Takeaway. Clos topologies re-wire more links in each patch panel during an expansion step and require many steps because they have a low north-to-south capacity ratio.

2.5 Towards Lower Lifecycle Complexity

Our discussions in §2.3 and §2.4, together with preliminary results presented in those sections (§3.5 has more extensive results), suggest the following qualitative comparison between Clos and the expander graph families with respect to lifecycle management costs (Table 2.3):
• Clos uses fewer bundle types and patch panels.
• Jellyfish has significantly lower switch counts, uses fewer expansion steps, and touches fewer links per patch panel during an expansion step.

Metric | 4-layer Clos (Medium) | Jellyfish
switches | | ✓
bundle types | ✓ |
patch panels | ✓ |
re-wired links per patch panel | | ✓
expansion steps | | ✓
Table 2.3: Qualitative comparison of lifecycle management complexity (✓ marks the topology with lower complexity for that metric)

Auxiliary Variable | Description
p_s = S_c - 1 | # ports per switch to connect other switches inside a sub-block
p_b = k - s - p_s - p_c | # ports per switch to connect other blocks
R_c = S_c (p_c + p_b) | radix of a sub-block
R_b = S_b S_c p_b | radix of a block
N_b = N / (S_b S_c s) | # blocks
L_cc = S_c p_c / (S_b - 1) | # links between two sub-blocks inside a block
L_bb = R_b / (N_b - 1) | # links between two blocks
Table 2.4: FatClique Variables

In all of these comparisons, we compare topologies with the same number of servers and the same bisection bandwidth. The question we ask in this section is: Is there a family of topologies which are comparable to, or dominate, both Clos and expander graphs by all our lifecycle management metrics? In this section, we present the design of the FatClique class of topologies and validate in §3.5 that FatClique answers this question affirmatively.

2.5.1 FatClique Construction

FatClique (Figure 2.5) combines the hierarchical structure of Clos with the edge expansion of expander graphs to achieve lower lifecycle management complexity. FatClique has three levels of hierarchy: individual sub-blocks (top left) are interconnected into a block (top right), and blocks are in turn interconnected to form FatClique (bottom).
The interconnection used at every level in the hierarchy is a clique, similar to Dragonfly [45]. Additionally, each level in the hierarchy is designed to have a fat edge (a north-to-south capacity ratio greater than 1). The cliques enable high edge expansion, while hierarchy enables lower wiring complexity than random-graph based expanders [76, 81].

Figure 2.5: FatClique Topology. A sub-block is a clique of switches; sub-blocks form a block (a clique of sub-blocks joined by local bundles); blocks form the whole network (a clique of blocks joined by global bundles).

FatClique is a class of topologies. To obtain an instance of this class, a topology designer specifies two input parameters: N, the number of servers, and k, the chip radix. A synthesis algorithm takes these as inputs, and attempts to instantiate four design variables that completely determine the FatClique instance (Table 2.4). These four design variables are:
• s, the number of ports in a switch that connect to servers
• p_c, the number of ports in each switch that connect to other sub-blocks inside a block
• S_c, the number of switches in a sub-block
• S_b, the number of sub-blocks in a block
The synthesis algorithm searches for the best combination of values for the design variables, guided by six constraints, C_1 through C_6, described below. The algorithm also defines auxiliary variables for convenience; these can be derived from the design variables (Table 2.4). We define these variables in the narrative below.

Sub-block connectivity. In FatClique, the sub-block forms the lowest level of the hierarchy, and contains switches and servers. All sub-blocks have the same structure. Servers are distributed uniformly among all switches of the topology, such that each sub-block has the same number of servers attached. However, because this number of servers may not be an exact multiple of the number of switches, we distribute the remainder across the switches, so that some switches may be connected to one more server than others. The alternative would have been to truncate or round up the number of servers per sub-block to be divisible by the number of switches in the sub-block, which could lead to overprovisioning or underprovisioning. Within a sub-block, every switch has a link to every other switch within its sub-block, to form a clique (or complete graph). To ensure a fat edge at the sub-block level, each switch must connect to more switches than servers, captured by the constraint C_1: s < r - s, where r is the switch radix and s is the number of ports on a switch connected to servers.

Block-level connectivity. The next level in the hierarchy is the block. Each sub-block is connected to other sub-blocks within a block using a clique (Figure 2.5, top-left). In this clique, each sub-block may have multiple links to another sub-block; these inter-sub-block links are evenly distributed among all switches in the sub-block such that every pair of switches from different sub-blocks has at most one link. Ensuring a fat edge at this level requires that a sub-block has more inter-sub-block and inter-block links egressing from the sub-block than the number of servers it connects to. Because sub-blocks contain switches which are nearly homogeneous (a switch may differ from another by one in the number of servers connected), this constraint is ensured if the sum of (a) the number of ports on each switch connected to other sub-blocks (p_c) and (b) those connected to other blocks (p_b, an auxiliary variable in Table 2.4; see also Figure 2.6) exceeds the number of servers connected to the switch (captured by C_2: p_c + p_b > s).
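The auxiliary variables in Table 2.4 are simple functions of the four design variables and the inputs N and k; the following Python sketch (illustrative names, assuming k counts all ports on a switch) makes those relationships explicit:

```python
from dataclasses import dataclass

@dataclass
class FatCliqueDesign:
    N: int    # total number of servers
    k: int    # chip (switch) radix
    s: int    # server-facing ports per switch
    p_c: int  # per-switch ports to other sub-blocks in the same block
    S_c: int  # switches per sub-block
    S_b: int  # sub-blocks per block

    def aux(self):
        """Derive the auxiliary variables of Table 2.4."""
        p_s = self.S_c - 1                       # intra-sub-block clique ports
        p_b = self.k - self.s - p_s - self.p_c   # per-switch ports to other blocks
        R_c = self.S_c * (self.p_c + p_b)        # sub-block radix
        R_b = self.S_b * self.S_c * p_b          # block radix
        N_b = self.N / (self.S_b * self.S_c * self.s)  # number of blocks
        L_cc = self.S_c * self.p_c / (self.S_b - 1)    # links between two sub-blocks
        L_bb = R_b / (N_b - 1)                   # links between two blocks
        return dict(p_s=p_s, p_b=p_b, R_c=R_c, R_b=R_b,
                    N_b=N_b, L_cc=L_cc, L_bb=L_bb)
```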
Figure 2.6: FatClique Block (sub-blocks, switches, and servers within a single block).

Inter-block connectivity. The top of the hierarchy is the overall network, in which each block is connected to every other block, resulting in a clique. The inter-block links are evenly distributed among all sub-blocks, and, within a sub-block, evenly among all switches. To ensure a fat edge at this level, the number of inter-block links at each switch should be larger than the number of servers it connects to, captured by C_3: p_b > s. Note that C_3 subsumes (is a stronger constraint than) C_2. Moreover, the constraint that blocks are connected in a clique imposes a constraint on the block radix (R_b, a derived variable). The block radix is the total number of links in a block destined to other blocks. R_b should be large enough to reach all other blocks (captured by C_4: R_b ≥ N_b - 1) such that the whole topology is a clique.

Incorporating rack space constraints. Beyond connectivity constraints, we need to consider packaging constraints in sub-block design. Ideally, we need to ensure that a sub-block fits completely into one or more racks with no wasted rack space. For example, if we use 58 RU racks, and each switch is to be connected to eight 1 RU servers, we can accommodate 6 switches per sub-block, leaving 58 - (6×8 + 6) = 4 RU in the rack for power supply and other equipment. In contrast, choosing 8 switches per sub-block would be a bad choice because it would need 8×8 + 8 = 72 RU of rack space, overflowing into a second rack that would have 44 RU un-utilized. We model this packaging fragmentation as a soft constraint: our synthesis algorithm generates multiple candidate assignments to the design variables that satisfy our constraints, and of these, we pick the alternative that has the lowest wasted rack space.

Topology | Scalability
3-layer Clos (Fattree) | 2(k/2)^3
4-layer Clos | 2(k/2)^4
5-layer Clos (Jupiter) | 2(k/2)^5
FatClique | O(k^5)
Table 2.5: Scalability of Topologies

Ensuring edge expansion. At each level of the hierarchy, edge expansion is ensured by using a clique. This is necessary for high edge expansion, but not sufficient, since it does not guarantee that every switch connects to as many other switches across the network as possible. One way to ensure this diversity is to make sure that each pair of switches is connected by at most one link. The constraints discussed so far do not ensure this. For instance, consider Figure 2.6, in which L_cc (another auxiliary variable in Table 2.4) is the number of links from one sub-block to another. If this number is greater than the number of switches S_c in the sub-block, then some pair of switches might have more than one link between them. Thus, C_5: L_cc ≤ S_c is the condition that ensures each pair of switches is connected by at most one link. Our topology synthesis algorithm generates assignments to the design variables, and a topology generator then assigns links to ensure this property (§2.5.2).

Incorporating patch panel constraints. The size of the block is also limited by the number of ports in a single patch panel rack (denoted by PPRack_ports). It is desirable to ensure that the inter-block links egressing each block connect to at most 1/2 the ports in a patch panel rack, so that the rest of the patch panel ports are available for external connections into the block (captured by C_6: R_b ≤ PPRack_ports / 2).
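Building on the sketch above, a candidate assignment can be screened against C_1 through C_6 with a simple predicate; this is an illustrative filter, assuming the switch radix r in C_1 equals the chip radix k and that pprack_ports is the port count of one patch-panel rack:

```python
def satisfies_constraints(d: FatCliqueDesign, pprack_ports: int) -> bool:
    """Return True if the candidate design d meets constraints C1-C6."""
    a = d.aux()
    r = d.k  # C1 is stated in terms of the switch radix r, assumed equal to the chip radix k
    return (
        d.s < r - d.s                      # C1: fat edge at the sub-block level
        and d.p_c + a["p_b"] > d.s         # C2: fat edge at the block level
        and a["p_b"] > d.s                 # C3: fat edge at the network level (subsumes C2)
        and a["R_b"] >= a["N_b"] - 1       # C4: block radix large enough to reach all other blocks
        and a["L_cc"] <= d.S_c             # C5: at most one link between any pair of switches
        and a["R_b"] <= pprack_ports / 2   # C6: leave half the patch-panel rack ports free
    )
```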
2.5.2 FatClique Synthesis Algorithm

Generating candidate assignments. The FatClique synthesis algorithm attempts to assign values to the design variables, subject to constraints C_1 to C_6. The algorithm enumerates all possible combinations of value assignments for these variables, and filters out each assignment that fails to satisfy all the constraints. For each remaining assignment, it generates the topology specified by the design variables, and determines whether the topology satisfies a required capacity Cap, which is an input to the algorithm. Each assignment that fails the capacity test is also filtered out, leaving a candidate set of assignments. These steps are described in §A.6.

FatClique placement. For each assignment in this candidate set, the synthesis algorithm generates a topology placement. Because FatClique's design is regular, its topology placement algorithm is conceptually simple. A sub-block may span one or more racks, and these racks are placed adjacent to each other. All sub-blocks within a block are arranged in a rectangular fashion on the datacenter floor. For example, if a block has 25 racks, it is arranged in a 5×5 pattern of racks. Blocks are then arranged in a similar grid-like fashion.

Selecting the best candidate. For each placement, the synthesizer computes the cabling cost of the resulting placement (using [24]), and picks the candidate with the lowest cost. This step is not shown in Algorithm 4. This approach implicitly filters out candidates whose sub-blocks cannot be efficiently packed into racks (§2.5.1).

2.5.3 FatClique Expansion

Re-wiring during expansion. Consider a small FatClique topology, shown at the top left of Figure 2.7, that has 3 blocks and L_bb = 5, i.e., five inter-block links between each pair of blocks. To expand it to a clique with six blocks, we would need to rewire the topology to have L'_bb = 2 (top right in Figure 2.7). This means we need to redistribute more than half (6 out of 10) of the existing links (red) at each block to new blocks without violating wiring and capacity constraints.

Figure 2.7: FatClique Expansion example. Top: the original FatClique (3 blocks, L_bb = 5) and the target FatClique (6 blocks, L'_bb = 2). Bottom: expansion at a patch panel, showing used ports, reserved ports, existing bundles, and rewired links.

The expansion process with patch panels is shown at the bottom of Figure 2.7. Similar to the procedure for Clos described in §2.4.1, all new blocks (shown in orange) are first deployed and interconnected, and links from the new blocks are routed to reserved ports on patch panels associated with existing blocks (shown in blue), before re-wiring begins. For FatClique, rewiring one existing link requires releasing one patch panel port so that a new link can be added. Since links are already part of existing bundles and routed through cable trays, we cannot rewire them directly, e.g., by rerouting a link from one patch panel to another. For example, link 1 (lower half of Figure 2.7) originally connects blocks 1 and 3 by connecting ports a and b on the patch panel. Suppose we want to remove that link, and add two links, one from block 1 to block 5 (labeled 3), and another from block 3 to block 5 (labeled 4). The part of the original link (labeled 1) between the two patch panels is already bundled, so we cannot physically reroute it from block 3 to block 5. Instead, we effect the re-wiring by releasing port a, connecting link 3 to port a, and connecting link 1 to port c. Logically, this is equivalent to connecting ports a and d, and b and c, on the patch panel shown in the lower half of Figure 2.7.
This preserves bundling, while permitting expansion. If the original topology has N_b blocks, then, by comparing the old and target topologies, the total number of rewired links is N_b (N_b - 1)(L_bb - L'_bb)/2. For this example, the total number of links to be rewired is 9.

Iterative Expansion Plan Generation. By design, FatClique has fat edges, which allows draining more and more links at each step of the expansion, as network capacity increases. At each step, we drain links across all blocks uniformly, so that each block loses the same aggregate capacity. However, the relationship between overall network capacity and the number of links drained at every block in FatClique is unclear, because traffic needs to be sent over non-shortest paths to fully utilize the fabric. Therefore, we use an iterative approach to expansion planning, where, at each step, we search for the maximal ratio of links to be drained that still preserves the expansion SLO. (§A.4 discusses the algorithm in more detail.) Our evaluation (§3.5) shows that the number of expansion steps computed by this algorithm is much smaller than that for expanding symmetric Clos.

2.5.4 Discussion

Achieving low complexity. By construction, FatClique achieves low lifecycle management complexity (Table 2.3), while ensuring full-bisection bandwidth. It ensures high edge expansion, resulting in fewer switches. By packaging clique connections into a sub-block, it exports fewer external ports, an idea we call port hiding. By employing hierarchy and a regular (non-random) structure, it permits bundling and requires fewer patch panels. By ensuring fat edges at each level of the hierarchy, it enables fewer re-wired links per patch panel, and fewer expansion steps. We quantify these in §3.5.

Scalability. Since Xpander and Jellyfish do not incorporate hierarchy, they can be scaled to arbitrarily large sizes. However, because Clos and FatClique are hierarchical, they can only scale to a fixed size for a given chip radix. Table 2.5 shows the maximum scale of each topology as a function of switch radix k. FatClique scales to the same order of magnitude as a 5-layer Clos. As shown in §3.5, both of them can scale to 64 times the bisection bandwidth of Jupiter.

FatClique and Dragonfly. FatClique is inspired by Dragonfly [45], and they are both hierarchical topologies that use cliques as building blocks, but they differ in several respects. First, for a given switch radix, FatClique can scale to larger topologies than Dragonfly because it incorporates one additional layer of hierarchy. Second, the Dragonfly class of topologies is defined by many more degrees of freedom than FatClique, so instantiating an instance of Dragonfly can require an expensive search [78]. In contrast, FatClique's constraints enable a more efficient search for candidate topologies. Finally, since Dragonfly does not explicitly incorporate constraints for expansion, a given instance of Dragonfly may not end up with fat edges.

Routing and Load Balancing on FatClique. Unlike for Clos, ECMP-based forwarding cannot be used to achieve high utilization in more recently proposed topologies [45, 81, 76, 44]. FatClique belongs to this latter class, for which a combination of ECMP and Valiant Load Balancing [89] has been shown to achieve performance comparable to Clos [44].
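Before turning to the evaluation, the expansion arithmetic of §2.5.3 can be sketched as follows; the drain-ratio search is shown here as a simple bisection over a caller-supplied SLO check, which stands in for the capacity evaluation detailed in §A.4 rather than reproducing it:

```python
def total_rewired_links(n_b: int, l_bb_old: int, l_bb_new: int) -> int:
    """Total links to rewire when an original FatClique with n_b blocks is expanded
    and the per-pair inter-block link count drops from l_bb_old to l_bb_new."""
    return n_b * (n_b - 1) * (l_bb_old - l_bb_new) // 2

assert total_rewired_links(3, 5, 2) == 9  # the 3-block example in Section 2.5.3

def max_drain_ratio(preserves_slo, lo=0.0, hi=1.0, tol=0.01):
    """Find (approximately) the largest fraction of links that can be drained in one
    expansion step while a caller-supplied capacity-vs-SLO predicate still holds."""
    while hi - lo > tol:
        mid = (lo + hi) / 2
        if preserves_slo(mid):
            lo = mid
        else:
            hi = mid
    return lo
```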
2.6 Evaluating Lifecycle Complexity

In this section, we compare three classes of topologies, Clos, expander graphs, and FatClique, by our complexity metrics.

Scale | Clos: e / a / sp / pod | Clos: cap | Clos: svr | FatClique: sub-block / block | FatClique: cap | FatClique: svr
Small | 1 / 16 / 1 / 32 | 327T | 8.2k | 6 / 150 | 337T | 8.1k
Medium | 1 / 16 / 48 / 32 | 5.24P | 131k | 6 / 150 | 5.40P | 132k
Large | 1 / 512 / 48 / 768 | 20.96P | 524k | 6 / 150 | 21.36P | 523k

Scale | Xpander: metanode | Xpander: cap | Xpander: svr | Jellyfish: cap | Jellyfish: svr
Small | 41 | 351T | 8.2k | 350T | 8.2k
Medium | 655 | 5.56P | 131k | 5.56P | 131k
Large | 2620 | 22.27P | 524k | 22.27P | 524k

Table 2.6: Capacities of topologies built with 32-port 40G switches. Small, medium and large scale topologies achieve 1/4, 4, and 16 times the capacity of Jupiter. The table also shows sizes of individual building blocks of these topologies in terms of number of switches. Abbreviations: e: edge, a: aggregation, sp: spine, cap: capacity, svr: server.

2.6.1 Methodology

Topology scales. Because the lifecycle complexity of topology classes can be a function of topology scale, we evaluate complexity across three different topology sizes based on the number of servers they support: small, medium, and large. Small topologies support as many servers as a 3-layer Clos topology. Medium topologies support as many servers as a 4-layer Clos. Large topologies support as many servers as 5-layer Clos topologies. (To achieve low wiring complexity, a full 5-layer Clos topology would require patch panel racks with four times as many ports as available today, so we restrict ourselves to the largest Clos that can be constructed with today's patch panel capacities.) All our experiments in this section are based on comparing topologies at the same scale.

At each scale, we generate one topology for each of Clos, Xpander, Jellyfish, and FatClique. The characteristics of these topologies are listed in Table 2.6. All these topologies use 32-port switching chips, the most common switch radix available today for all port capacities [18]. To compare topologies fairly, we need to equalize them first. Specifically, at a given scale, each topology has approximately the same bisection bandwidth, computed (following prior work [76, 81]) using METIS [43]. All topologies at the same scale support roughly the same number of servers; small, medium and large scale topologies achieve, respectively, 1/4, 4, and 16 times the capacity of Jupiter. (In §A.8, we also compare these topologies using two other metrics.) Table 2.6 also shows the scale of the individual building blocks of these topologies in terms of number of switches. For Clos, we use the algorithm in §A.1 to design building blocks (chassis) and then use them to compose Clos. One interesting aspect of this table is that, at the 3 scales we consider, a FatClique's sub-block and block designs are identical, suggesting lower manufacturing and assembly complexity. We plan to explore this dimension in future work.

For each topology we compute the metrics listed in Table 2.3: the number of switches, the number of bundle types, the number of patch panels, the average number of re-wired links at a patch panel during each expansion step, and the number of expansion steps. To compute these, we need component parameters, and placement and expansion algorithms for each topology class.

Component Parameters. In keeping with [17, 95], we use optical links for all inter-rack links. We use 96-port 1 RU patch panels [31] in our analysis. A 58 RU [68] rack with patch panels can aggregate 2 × 96 × 58 = 11,136 fibers. We call this rack a patch-panel rack. Most datacenter settings, such as rack dimensions, aisle dimensions, cable routing and distance between cable trays, follow practices in [61]. We list all parameters used in this section in §A.7.
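Since the patch-panel rack capacity above (96-port 1 RU panels, 58 panels per rack) recurs throughout the evaluation, a small helper for the associated arithmetic is sketched below; the reserve_fraction parameter is only an illustrative way to model ports held back for expansion, not a parameter from the dissertation:

```python
from math import ceil

PANEL_PORTS = 96       # ports per 1 RU patch panel
PANELS_PER_RACK = 58   # 1 RU panels in a 58 RU rack
FIBERS_PER_RACK = 2 * PANEL_PORTS * PANELS_PER_RACK  # 11,136 fibers per patch-panel rack

def patch_panel_racks_needed(fibers: int, reserve_fraction: float = 0.0) -> int:
    """Racks needed to terminate `fibers`, optionally reserving a fraction of
    capacity for future expansion (cf. the reserved ports in Figure 2.3)."""
    usable = FIBERS_PER_RACK * (1.0 - reserve_fraction)
    return ceil(fibers / usable)

# Per constraint C6, a block's R_b inter-block links should occupy at most
# half of one rack's ports, which corresponds to reserve_fraction = 0.5.
```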
Placement Algorithms. For Clos, following Facebook’s fbfabric [10], spine blocks are placed at the center of the datacenter, which might take multiple rows of racks, and pods are placed at two sides of spine blocks. Each pod is organized into a rectangular area with aggregation blocks placed in the middle to reduce the cable length from ToR to aggregation. FatClique’s placement algorithm is discussed in §2.5.2. For Xpander, we use the placement algorithm proposed in [44]. We follow the practice that all switches in a metanode are placed closed to each other. However, instead of placing a metanode into a row of racks, we place a metanode into a rectangular area of racks, which reduces cable lengths when metanodes are large. For Jellyfish, we design a random search algorithm to aggressively reduce the cable length (§A.2). Expansion Algorithms. For Clos, as shown in [91], it is fairly complex to compute the optimal number of rewired links for asymmetric Clos during expansion. However, when the original and target topologies are both symmetric, this number is easy to compute. For this case, we design an optimal algorithm (§A.5) which rewires the maximum number of links at each step and therefore uses the smallest number of steps to finish expansion. For FatClique, we use the algorithm discussed in §2.5.3. For Xpander and Jellyfish, we design an expansion algorithm based on the intuition from [81, 76] that, to expand a topology by n ports requires breaking n 2 existing links. Finally, we have found that for all topologies, the number of expansion steps at a given SLO is scale 34 invariant: it does not depend on the size of the original topology as long as the expansion ratio (target-topology-size-to-original-topology-size ratio) is fixed (§A.3). Presenting results. In order to bring out the relative merits of topologies, and trends of how cost and complexity increase with scale, we present values for metrics we measure for all topologies and scales in the same graph. In most cases, we present the absolute values of these metrics; in some cases though, because our three topologies span a large size range, for some metrics the results across topologies are so far apart that we are unable to do so without loss of information. In these cases, we normalize our results by the most expensive, or complex topology. 2.6.2 Patch Panel Placement The placement of patch panels is determined both by the structure of the topology and its scale. Between edge and aggregation layers in Clos. For small and medium scale Clos, no patch panels are needed between edge and aggregation layers. Each pod at these scales contains 16 aggregation switches, which can be packed into a single rack (we call this an aggregation-rack). Given that a pod at this scale is small, all links from the edge can connect to this rack. Since all links connect to one physical location, bundles form naturally. In this case, each bundle from edge racks contains 316 fibers 7 . Therefore, no patch panels are needed between edge and aggregation layers. However, a large Clos needs one layer of patch panels between edge and aggregation layers since a pod at this scale is large. An aggregation block consists of 16 middle blocks 8 , each with 32 switches. The aggregation block by itself occupies a single rack. Based on the logical connectivity, links from any edge need to connect to all middle blocks. Without using patch panels, each bundle could at most contain 316=16 = 3 fibers. 
In our design, we use patch panels to aggregate local bundles from the edges first and then rebundle them on patch panels to form new high-capacity bundles from the patch panels to the aggregation racks. Based on the patch panel rack capacity constraint, two patch panel racks are enough to form high-capacity bundles from the edge to the aggregation layer. Specifically, in our design 128 edge switches and 8 aggregation racks connect to a single patch panel. In this design, each edge-side bundle contains 48 fibers and each aggregation-side bundle contains 128 fibers.

7 In our setting, each rack with 58 RU can accommodate at most 3 switches and 48 associated servers. The total number of links out of this rack is 3×16 = 48.
8 We follow the terminology in [75]. A middle block is a sub-block in an aggregation block.

Between aggregation and spine layers. The topology between the aggregation and spine layers in Clos is much larger than that inside a pod. For this reason, to form high-capacity bundles, two layers of patch panels are needed. As shown in Figure 2.1, one layer of patch panels is placed near the spine blocks at the center of the datacenter floor. Each patch panel rack aggregates local bundles from four spine racks in the medium and large scale topologies. Similarly, another layer of patch panels is placed near the aggregation racks, permitting long bundles between those patch panels.

In expanders and FatClique. For Jellyfish, Xpander and FatClique, patch panels are deployed at the server block side and long bundles form between those patch panels. In FatClique, each block requires one patch panel rack (§2.5.3). In a large Xpander, since a metanode is too big (Table 2.6), it is not possible to use one patch panel rack to aggregate all links from a metanode. Therefore, we divide a metanode into homogeneous sections, called sub-metanodes, such that links from a sub-metanode can be aggregated at one patch panel rack. For Jellyfish, we partition the topology into groups, each of which contains the same number of switches as a block in FatClique, so each group needs one patch panel rack.

2.6.3 Deployment Complexity

In this section, we evaluate our different topologies by our three measures of deployment complexity (§2.3.2).

Number of Switches. Figure 2.8 shows how the different topologies compare in terms of the number of switches used at various topology scales. Figure 2.8(a) shows the total number of switches for the small topologies, Figure 2.8(b) for the medium, and Figure 2.8(c) for the large. The y-axes increase in scale by about an order of magnitude from left to right. FatClique has 20% fewer switches than Clos for a small topology, and 50% fewer for the large. The results for Jellyfish and Xpander are similar, consistent with findings in [81, 76]. This benefit comes from the edge expansion property of the non-Clos topologies we consider. This implies that Clos topologies, at large scale, may require nearly twice the capital expenditures for switches, racks, and space as the other topologies.

Figure 2.8: Number of switches at (a) small, (b) medium, and (c) large scale. C is Clos, J is Jellyfish, X is Xpander and F is FatClique.

Figure 2.9: Number of patch panels at (a) small, (b) medium, and (c) large scale. C is Clos, J is Jellyfish, X is Xpander and F is FatClique.

Number of Patch panels.
Figure 2.9 shows the number of patch panels at different scales. As before, across these graphs, the y-axis scale increases approximately by one order of magnitude from left to right. At small and medium scales, Clos relies on patch panels mainly for connections between aggregation and spine blocks. Of all topologies at these scales, Clos uses the fewest number of patch panels: FatClique uses about 11% more patch panels, and Jellyfish and Xpander 37 use almost 44-50% more. Xpander and Jellyfish rely on patch panels for all northbound links, and therefore in general, as scale increases, the number of patch panels in these networks grows (as seen by the increase in the y-axis scale from left to right). At large scale, however, Clos needs many more patch panels, comparable to Xpander and Jellyfish. At this scale, Clos aggregation blocks span multiple racks, and patch panels are also needed for connections between ToRs and aggregation blocks. Here, FatClique’s careful packaging strategy becomes more evident, as it needs nearly 25% fewer patch panels than Clos. The majority of patch panels used in FatClique at all scales comes from inter-block links (which increase with scale). For this metric, Clos and FatClique are comparable at small and medium scales, but FatClique dominates at large scale. Scale # Bundle Types Clos FatClique Xpander Jellyfish Small 8 11 11 28 Medium 74 61 976 1577 Large 322 212 3034 3678 Table 2.7: Bundle Types (Switch Radix = 32) Number of Bundle Types. Table 2.7 shows the number of bundle types used by different topologies at different scales. A bundle type (§2.3.1) is characterized by (a) the number of fibers in the bundle, and (b) the length of the bundle. The number of bundle types is a measure of wiring complexity. In this table, if bundles differ by more than 1m in length, they are designated as separate bundle types. Table 2.7 shows that Clos and FatClique use the fewest number of bundle types; this is due to the hierarchical structure of the topology, where links between different elements in the hierarchy can be bundled. As the topology size increases, the number of bundle types also increases in these topologies, by a factor of about 40 for Clos to 20 for FatClique when going from small to large topologies. 38 C J X F 0 1 2 3 4 5 1e6 (a) Small C J X F 0.0 0.2 0.4 0.6 0.8 1.0 1.2 1e8 (b) Medium C J X F 0 1 2 3 4 5 6 7 8 9 1e8 (c) Large Figure 2.10: Cabling cost. C is Clos,J is Jellyfish,X is Xpander and F is FatClique. On the other hand, Xpander and Jellyfish use an order of magnitude more bundle types compared to Clos and FatClique at medium and large scales, but use a comparable number for small scale topologies. Even at the small scale, Jellyfish uses many more bundle types because it uses a random connectivity pattern. At small scales Xpander metanodes use a single patch panel rack and bundles from all metanodes are uniform. With larger scales, Xpander metanodes become too big to connect to a single patch panel rack. We have to divide a metanode into several homogeneous sub-metanodes such that all links from sub-metanodes connect to a patch panel rack. However, because of the randomness in connectivity, this subdivision cannot ensure uniformity of bundles egressing sub-metanode patch panel racks, so we find that Xpander has a large number of bundle types in medium and large topologies. Thus, by this metric, Clos and FatClique have the lowest complexity across all three scales, while Xpander and Jellyfish have an order of magnitude more complexity. 
Moreover, across all metrics FatClique has lowest deployment complexity, especially at large scales. Case Study: Quantifying cabling costs. While not all aspects of lifecycle management complexity can be translated to actual dollar costs, it is possible to estimate one aspect, namely the cost of cables. Cabling cost includes the cost of transceivers and cables, and is reported to be the dominant component of overall datacenter network cost [75, 45]. We can estimate costs because our placement algorithms generate cable or bundle lengths, the topology packaging determines the 39 number of transceivers, and estimates of cable and transceiver costs as a function of cable length are publicly available [24]. Figure 2.10 quantifies the cabling cost of all topologies, across different scales. Clos has higher cabling costs at small and medium scales compared to expander graphs, although the relative difference decreases at medium scale. At large scales, the reverse is true. Clos is around 12% cheaper than Xpander in terms of cabling cost since Xpander does not support port-hiding at all and uses more long inter-rack cables. Thus, given that cabling cost is the dominant component of overall cost, it is unclear whether the tradeoff Xpander and Jellyfish makes in terms of number of switches and cabling design pays off in terms of capital expenditure, especially at large scale. We find that FatClique has the lowest cabling cost of the topologies we study with a cabling cost 23-36% less than Clos. This result came as a surprise to us, because intuitively topologies that require all-to-all clique like connections might use longer length cables (and therefore more expensive transceivers). However on deeper examination, we found that Clos uses a larger number of cables (especially inter-rack cables) compared to other topologies since it has a relatively higher number of switches (Figure 2.8) to achieve the same bisection bandwidth. Thus, more switches leads to more racks and datacenter floor area, which stretches the cable length. All those factors together explain why Clos cabling costs are higher than FatClique’s. Thus, from an equipment capital expenditure perspective, at large scale a FatClique can be at least 23% cheaper than a Clos, because it has at least 23% fewer switches, 33% fewer patch panel racks, and 23% lower cabling costs than Clos. 2.6.4 Expansion Complexity In this section, we evaluate topologies by our two measures of expansion complexity (§2.4.3): number of expansion steps required, and number of rewired-links per patch panel rack per step. Since the number of steps is scale-invariant (§2.6.1), we only present the results from expanding 40 0.75 0.80 0.85 0.90 0.95 SLO 0 5 10 15 20 Expansion Steps Expansion Ratio = 2 Fatclique Clos Xpander Jellyfish Figure 2.11: Expansion steps medium size topologies for both metrics 9 . When evaluating Clos, we study the expansion of symmetric Clos topologies; generic Clos expansion is studied in [91]. As discussed in §2.6.1, for symmetric Clos, we have developed an algorithm with optimal number of rewiring steps. Number of expansion steps. Figure 2.11 shows the number of steps (y-axis) required to expand topologies to twice their existing size (expansion ratio = 2) at different expansion SLOs (x-axis). We find that at 75% SLO, all topologies require the same number of expansion steps. But the number of steps required to expand Clos with tighter SLOs steeply increases. 
This is because the number of links that can be rewired per aggregation block in Clos per step, is limited (due to north-to-south capacity ratio §2.4.3) by the SLO. The tighter the SLO, fewer the number of links rewired per aggregation block per step, and larger the number of steps required to complete expansion. FatClique, Xpander and Jellyfish require fewer and comparable number of expansion steps due to their fat edge property, allowing many more links to be rewired per block per step. Their curves largely overlap (with FatClique taking one more step as SLO increases beyond 95%) . Number of rewired links per patch panel rack per step. This metric is an indication of the time it takes to finish an expansion step because, today, rewiring each patch panel requires a human operator [91]. A datacenter operator can reduce re-wiring time by employing staff to rewire each patch panel rack in parallel, in which case, the number of links per patch panel rack per step is a good indicator of the complexity of an expansion step. Figure 2.12 shows the average 9 We have verified that the relative trend in the number of re-wired links per patch panel holds for small and large topologies 41 0.75 0.8 0.85 0.9 0.95 0.975 0 150 300 450 600 750 900 1050 1200 1350 Average #rewired links at a single patch panel rack Expansion Ratio = 2 Fatclique Clos Xpander Jellyfish SLO Figure 2.12: Average Number of Rewired Links at a Single Patch Panel across Steps of the maximum rewired links per patch panel rack, per step (y-axis), when expanding to twice the topology size size at different SLOs (y-axis). Even though the north-to-south capacity ratio restricts the number of links that can be rewired in Clos per step, the number of rewired links per patch panel rack per step in Clos remains consistently higher than other topologies, until we hit 97.5% SLO. The reason is that the links that need to be rewired in Clos are usually concentrated in few patch panel racks by design. As such, it is harder to parallelize rewiring in Clos, than it is in the other topologies. FatClique has the lowest rewiring step complexity across all topologies. 2.6.5 FatClique Result Summary We find that FatClique is the best at most scales by all our complexity metrics. (The one exception is that at small and medium scales, Clos has slightly fewer patch panels). It uses 50% fewer switches and 33% fewer patch panels than Clos at large scale, and has a 23% lower cabling cost (an estimate we are able to derive from published cable prices). Finally, FatClique can permit fast expansion while degrading network capacity by small amounts (2.5-10%): at these levels, Clos can take 5 longer to expand the topology, and each step of Clos expansion can take longer than 42 FatClique because the number of links to be rewired at each step per patch panel can be 30-50% higher. 2.7 Conclusions and Future Work In this section, we have attempted to characterize the complexity of lifecycle management of datacenter topologies, an unexplored but critically important area of research. Lifecycle man- agement consists of network deployment and expansion, and we devise metrics that capture the complexity of each. We use these to compare topology classes explored in the research literature: Clos and expander graphs. We find that each class has low complexity by some metrics, but high by others. However, our evaluation suggests topological features important for low lifecycle complexity: hierarchy, edge expansion and fat edges. 
We design a family of topologies called FatClique that incorporates these features, and this class has low complexity by all our metrics at large scale. As the management complexity of networks increases, the importance of designing for manage- ability will increase in the coming years. Our work is only a first step in this direction; several future directions remain. Topology oversubscription. In our comparisons, we have only considered topologies with an over-subscription ratio of 1:1. Jupiter [75] permits over-subscription at the edge of the network, but there is anecdotal evidence that providers also over-subscribe at higher levels in Clos topologies. To explore the manageability of over-subscribed topologies it will be necessary to design over- subscription techniques in FatClique, Xpander and Jellyfish in a way in which all topologies can be compared on a equal footing. Topology heterogeneity. In practice, topologies have a long lifetime over which they accrue heterogeneity: new blocks with higher radix switches, patch panels with different port counts etc. These complicate lifecycle management. To evaluate these, we need to develop data-driven 43 models for how heterogeneity accrues in topologies over time and adapt our metrics for lifecycle complexity to accommodate heterogeneity. Other management problems. Our work focuses on topology lifecycle management, and explicitly does not consider other network management problems like fault isolation or control plane complexity. Designs for manageability must take these into account. 44 Chapter 3 Gemini: Robust Topology-and-Traffic Engineering in Spine-Free Datacenters 3.1 Introduction Datacenter topology designers must grapple with two competing objectives: cost and the need for reliable high performance, in spite of dynamic workloads. Today’s datacenter topologies use rearrangeably non-blocking designs to support any admissible traffic matrix [9, 75, 34, 11]. These networks are hierarchical: sets of top-of-rack (ToR) switches connect to non-blocking Clos-based pods, which are themselves connected, often by Clos-based spines – see Fig. 3.1. At larger scales (those which require an extra layer of switches), providing full bisection bandwidth between the pods becomes expensive. However, our observed inter-pod traffic aggregates are both non-uniform and somewhat predictable (see §3.2). This suggests we can deploy a more efficient inter-pod network with non-uniform connectivity, tuned to our predicted workloads rather than to the full-bisection worst case, while matching the performance of non-blocking topologies on those workloads. If our traffic prediction, however, is imperfect, we risk load imbalance and high packet loss. To reduce that risk, we can reconfigure the inter-pod topology (the DataCenter Network Interconnect, 45 Pod 1 Pod 2 Pod N Spine Block 1 Spine Block 2 Spine Block 3 Spine Block 4 Spine Block M Server racks with ToR switches DCNI Figure 3.1: FatTree: recursive Clos Pod 1 Pod 2 Pod N Transit (2-hop) path between Pods 1 & 2 Direct (1-hop) path between Pods 1 & 2 Interpod (DCNI) connections via lots of links Figure 3.2: Logical spine-free topology Pod 1 Pod 2 Pod N Patch panel 1 Patch panel 2 Patch panel 3 Patch panel 4 Patch panel M DCNI Transit (2-hop) path between Pods 1 & 2 Figure 3.3: Physical realization of spine-free topology, via patch panels DCNI) to match the traffic demand, or we can reconfigure routing to rebalance the load to match the DCNI – or we can do both. 
Prior research has explored reconfiguration at the top of the topology (the DCNI) through optical circuit switches (OCS). Helios [28] used an OCS to establish dedicated pod-to-pod circuits for long-lived elephant flows, and a separate spine network of electrical packet switches to serve latency-sensitive flows. However, no commercially-available OCS scales enough for Helios in large datacenters, and Helios required mechanisms to detect and re-route elephant flows. Other prior work uses reconfigurable ToR uplinks; as we discuss in §3.2, practical considerations rule out these designs in today’s large datacenters. Gemini: practical reconfigurability. This section describes Gemini, in which we replace the spine layer with a passive restriping layer that allows us to implement a reconfigurable DCNI; we reconfigure both the DCNI topology and routing at relatively long timescales; we use a novel approach to predict future inter-pod demand; and we jointly optimize DCNI topology and routing based on these predictions. This allows us to oversubscribe the DCNI (w.r.t. a worst-case traffic demand) without violating network-operator preferences for minimizing and balancing link utilizations. 1 Fig. 3.2 shows Gemini’s spine-free logical topology, where pods are directly connected. It also illustrates the possibility of using two-hop transit routing, to provide extra capacity between pairs 1 Readers should understand that we are not proposing a new fabric design with improved behavior on worst-case traffic; we are describing a complete system, including a control plane, that performs well on traffic patterns we can observe. Therefore, our evaluations are based on link utilization metrics (see §3.3), rather than theoretical properties (e.g., bisection bandwidth or oversubscription ratio) that ignore both the behavior of the control plane and the actual workload. 46 of pods. Fig. 3.3 shows how one can use a set of patch panels to create reconfigurable inter-pod links. A spine-free network is significantly less expensive than the equivalent spine-based network, as we discuss in §3.2. Because we only need to reconfigure the DCNI at long timescales, we can build the restrip- ing layer either out of patch panels (using humans to make changes) or relatively inexpensive, commercially-available OCSs. As the restriping technology improves, Gemini can exploit faster DCNI reconfiguration to yield better results. Our approach allows the restriping layer to be split across multiple, independent, and relatively small patch panels or OCSs, rather than requiring a single, impractically large switch or panel. We must also tolerate long timescales for reconfiguring inter-pod routing. This can take several seconds [41], because (1) one TCAM rule update can take several mSec [41], (2) we might have to reconfigure hundreds or thousands of rules, across hundreds of switches, and (3) routing-table updates must be carefully sequenced across switches, to minimize packet loss due to black holes and loops [50, 41, 70]. Given long reconfiguration timescales, reconfigurability is practical only if the interval between reconfigurations is two or three orders of magnitude larger than the reconfiguration time. Otherwise, we risk violating overall fabric-availibility SLOs, because during restriping, the fabric’s total capacity is somewhat reduced, increasing the chance that a traffic spike will cause packet loss. Also, restriping is potentially error-prone, so frequent restriping increases the risk of an SLO-impacting error. 
Contributions. We show that it is both possible and useful to reconfigure datacenter inter-pod topologies at infrequent intervals, by jointly optimizing topology and routing for predicted, skewed workloads. This allows us to eliminate 50% of the expensive long-range transceivers, and a large fraction of the network switches, without having to replace these with an expensive OCS. Specific contributions include: 47 Traffic prediction: It is widely assumed that datacenter traffic is unpredictable [34]. We show, through production-network measurements, that while inter-pod traffic matrices (TMs) do change at short time-scales, these TMs have both predictable and unpredictable components, and the predictable components can be stable over days or weeks (§3.2). Inspired by prior work on robust routing [12, 13, 82, 88], we base our predictions on the convex hull of a set of measured traffic matrices, over an aggregation window of a few days or weeks (§3.4.3). Optimization: Most robust optimization work [13, 12, 82, 88] only focuses on routing. Gemini jointly optimizes topology and routing to find more opportunities to reduce both worst-case maximium link utilization (MLU) and average link utilization (ALU), across all traffic matrices at the extrema of a convex hull (§3.4.5). Transit routing: While the direct (optical-only) path between a pair of pods minimizes latency and link loading, our optimizer sometimes finds a better solution (w.r.t. MLU) using transit routing, where packets from Pod A to Pod B travel via Pod C, to reduce congestion in the case where the direct A-to-B links are overloaded (§3.4.5). Hedging: Inter-pod demand can burst significantly at short timescales (§3.2). Practical “optimality” requires good behavior both on average and in the tail. To hedge against the risk of unexpected bursts, the optimizer can spread traffic across more paths than necessary for the expected case (§3.4.5). Hedging gives us more robust handling of a wider range of workloads, in exchange for a little expected-case path stretch due to more use of transit. Prior papers have often compared datacenter network designs based on flow-completion times (FCTs). However, our traffic-matrix traces lack FCT data, and we use simulation-based evaluations that yield link utilizations (we cannot accurately simulate FCTs, for a workload mixing thousands of applications, at scale). §3.3 reports on the correlation between FCTs and link utilizations from actual production networks. These results support our use of utilization-based metrics in evaluating Gemini against other designs (§3.5). 48 In particular, we show that a Gemini network outperforms a spine-free network constructed to have the same total DCNI cost, and often performs almost as well as a higher-cost spine-based DCNI. We also show how various aspects of Gemini, including topology reconfiguration, routing reconfiguration, hedging, and parameter choices contribute to the improvements. 3.2 Motivation Gemini’s design is motivated by several interesting properties of today’s DCNIs. We quantify these properties using measurements from 22 production fabrics. Each property motivates a different aspect of Gemini’s design. Cost. For practical reasons, such as bursty traffic, network operators avoid running networks at full utilization [75, 71, 87]. Therefore, a non-blocking fabric, such as a full-bisection bandwidth Clos, is too expensive, and operators rely on over-subscription [9, 71, 15] at different layers of the network. 
The dominant cost of a high-speed network is in the optics 2 and associated fiber [75, 96], and allowing 2:1 over-subscription of a Clos DNCI by removing half of the pod-to-spine links can remove almost half of the cost of this layer. Alternatively, by removing the spine blocks and directly connecting pods in Gemini, as shown in Figure 3.3, we achieve similar cost reductions while preserving much of the performance of a non-oversubscribed Clos. Note that the simplest spine-free design, a uniform-mesh DCNI with a Valiant Load Balancing (VLB) routing scheme [89], offers little cost benefit over a spine-full DCNI, because it would require provisioning each pod with a DCNI link capacity of twice its DCNI demand – an overprovisioning ratio of 2:1. This is required to serve all traffic patterns (especially worst-case ones) with demand- oblivious routing. This obliterates the cost savings from removing spines, by transferring that cost 2 E.g., assuming 40G long-reach optics at $7 Gb=s [52], removing each 100G spine block, with 512 ports, would save 360K in downlink optics – not counting switches, internal cabling, power, cooling, etc. 49 to the pods themselves. Gemini makes a spine-free DCNI feasible without overprovisioning the pods. Constraints on practical reconfigurability. Many recent proposals [94, 32, 38, 57, 58, 14] have proposed mechanisms for reconfiguring datacenter networks at the ToR level, using either a reconfigurable fabric of optical cables, free-space optics, or high-capacity wireless networks. In response to a shift in traffic demand, these approaches can reconfigure a fabric on timescales of nsec to msec. However, these proposals rely on custom hardware that is not yet commercially available at scale [57, 58, 14] or cannot be deployed in current production datacenters [44, 57], or require accurate flow size information for scheduling, which might not be always available [80]. Instead, Gemini yields goodresults with relatively infrequent DCNI reconfiguration (§3.5.3 discusses Gemini’s sensitivity to the reconfiguration interval.) 0 5000 10000 15000 20000 25000 30000 35000 40000 Time (minutes) 0.0 0.2 0.4 0.6 0.8 1.0 Normalized Traffic Volume Each color represents one pod-pair (“commodity”). Figure 3.4: Normalized traffic vs. time: all pod-pairs, fabric F5 Dynamicinter-podtraffic. Evenwhenhighlyaggregated, inter-podleveltrafficisquitedynamic. Figure 3.4 shows the inter-pod traffic of all pod pairs in an illustrative fabric, F5; traffic varies significantly at small time scales, with many traffic spikes. This dynamism makes it hard to accurately predict traffic matrices in real time. Because we need a robust, stable traffic model, we instead build models using the convex hull of past traffic matrices (§3.4.3). Skewed inter-pod traffic. For some fabrics, inter-pod traffic can be significantly skewed: a few pod-pairs account for a significant fraction of inter-pod traffic ([32] reported similar skew in 50 0 20 40 60 80 100 Percentage of commodities (sorted by their demand in the descending order) 0.0 0.2 0.4 0.6 0.8 1.0 CDF F1 F2 F3 F4 F5 F6 F7 F8 F9 F10 F11 F12 F13 F14 F15 F16 F17 F18 F19 F20 F21 F22 Figure 3.5: Average pod-level traffic skew over one month ToR-level traffic). Figure 3.5 shows that, for 11 of our 22 fabrics, 30% of pod-pairs (“commodities”) account for 80% of the traffic. (In other fabrics, the distribution is more uniform.) Traffic skew motivates two design decisions (§3.4.5). First, Gemini reconfigures DCNI topology to match the skewed demand. 
This Topology Engineering (ToE) gives higher DCNI capacity to pod-pairs that carry higher traffic volumes; relative to a uniform DCNI, ToE can reduce network congestion and more efficiently use costly resources. Prior ToE work has either augmented the DCNI with OCS-based reconfigurable paths for specific flows [28], or has focused on ToR-level reconfiguration; we believe reconfiguring to handle pod-pair demands, using a single DCNI, is novel. Second, Gemini can use the spare pod-internal capacity, in the lower-utilization pods, to transit traffic between two other pods when insufficient direct capacity exists. Gemini adapts to the degree of skew; in fabrics with relatively uniform traffic, Gemini generates a uniform DCNI and uses single-hop pod-to-pod paths. Those two design decisions are the core of our joint topology and routing solver (§3.4.5). Large predictable traffic component. For many fabrics, we have observed some traffic predictability at long time-scales. As a feasibility study, we “trained” on the maximum pod-to-pod demands over a sliding aggregation window of 7 days, then “tested” predictability. We quantify this, per pod-pair, as the ratio of demand on the next day over the prior-7-day maximum (demand- to-max ratio, DMR, for short). A pod-pair is well-bounded if its 99-th percentile DMR is below 1 – 51 F10 F4 F19 F2 F1 F15 F7 F13 F12 F22 F11 F16 F8 F14 F17 F18 F21 F9 F5 F6 F20 F3 0.5 0.6 0.7 0.8 0.9 1.0 Fabrics Fraction of well-bounded sd-pairs Figure 3.6: Fraction of well-bounded pairs; higher is better i.e., 99% of the next day’s demand does not exceed the previous week’s maximum; otherwise it is poorly-bounded Figure 3.6 plots the fraction p of well-bounded pod-pairs for 22 fabrics, measured over one month. For most fabrics, most pod-pairs are well-bounded; for 17 fabrics, p> 0:9. We refer to fabrics with p> 0:9 as mostly-bounded. Even for the least-predictable fabric, F3, p = 0:68. This suggests that it might be feasible to reconfigure the DCNI for some fabrics, no more than once per day, based on, say, a week’s worth of data, especially if we can route inter-pod demands to leverage statistical multiplexing. Small unpredictable traffic component: long tail distribution for some pod-pairs. Though most pod-pairs’ demands, at most times, are well-bounded, a few are not, and these have long-tailed DMR distributions. Figure 3.7 shows the pod-pair DMR distributions for two representative fabrics: F1, with p = 0:98 and F6, with p = 0:78. The maximum DMRs for F1 and F6 are 3 and 13. A long tail implies sudden traffic-pattern changes. To handle sudden changes, we could try to rapidly reconfigure the topology and routing, but that is difficult and risky; we prefer to proactively embed sufficient inherent robustness in the topology and routing, which motivates the risk-based design in §3.4.5. 52 0.0 0.5 1.0 1.5 2.0 2.5 3.0 0.0 0.2 0.4 0.6 0.8 1.0 F1 : largest tail=3 Fraction of poorly-bounded pairs : 0.02 0 2 4 6 8 10 12 14 0.0 0.2 0.4 0.6 0.8 1.0 F6 : largest tail=13 Fraction of poorly-bounded pairs : 0.22 DMR (the ratio of demand on the next day over the prior-7-day maximum) CDF Figure 3.7: CDFs of demand-to-max ratio (DMR) 3.3 Measuring success Networks exist to serve applications, and the gold standard for network evaluation has been Flow Completion Time (FCT)[26]. However, FCTs can be difficult to measure directly, and especially difficult to simulate at datacenter scale. 
It is much easier and more efficient to measure network-level metrics, such as link utilization and per-link loss rates, that are observable via mechanisms such as SNMP, scalably and without privacy concerns. We can aggregate these metrics over intervals chosen as a compromise between fidelity and feasibility. Gemini's traffic modeling and solver use aggregated inter-pod traffic traces as inputs. Our simulation-based evaluation (§3.5.2) likewise generates link utilizations, from which we can compute MLUs and ALUs. Intuitively, these link-utilization metrics should be correlated with FCTs, but is that intuition correct? Large changes in MLU (e.g., changing MLU from 1% to 99%) presumably harm FCTs, but what about the smaller changes we actually see when comparing Gemini to other designs, or when comparing parameter settings?

Figure 3.8: P99 transmission latency (normalized) vs. p99 MLU on production fabrics, for message-size buckets from 1KB to 2MB.

Figure 3.9: P99 transmission latency (normalized) vs. p99 ALU on production fabrics, for message-size buckets from 1KB to 2MB.

3.3.1 Correlation between FCT and MLU

Because we do have FCT data for some (not all) applications on our network, we could study the correlation between several FCT-based metrics and network metrics (ALU and MLU) in production. This is not an exhaustive or rigorous study, which remains a good topic for future work. Using data aggregated from 32 Fat-Tree datacenter fabrics over 7 days, we collected FCT metrics (transmission latencies for various message sizes) for flows between servers on different pods of the same fabrics; this focuses on the effects of DCNI utilizations, which we collected simultaneously. Figure 3.8 plots FCTs for several message-size buckets (the size shown for each bucket is the upper bound of that bucket), vs. DCNI MLU (in 5% buckets), normalized to the best sample for each size. The results suggest that p99 FCTs increase with MLU. This is consistent with prior work showing that, at high link utilizations, packet loss rates increase [47], which would be likely to increase FCTs. However, our experiments cannot conclusively establish the relationship between FCT and MLU, since there may be confounding factors (e.g., offered load); clarifying this relationship is future work. Figure 3.9 suggests a weaker effect of ALU on FCTs, until the ALU exceeds about 20%, where ALUs clearly appear to affect long-message FCTs. In §B.1 we plot FCTs vs. link utilization metrics for all links in each network, not just the DCNI links. However, those results are less indicative of whether the DCNI-only simulated utilizations in §3.5 would be predictive of FCT benefits.
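The link-utilization metrics used throughout this chapter can be derived directly from per-interval, per-link utilization samples. The sketch below is illustrative only (the function name, array layout, and 99.9th-percentile aggregation are our own choices, not Gemini's implementation); it shows one way to compute p99.9 MLU, p99.9 ALU, and the fraction of overloaded link samples, using the 0.8 overload threshold introduced in §3.3.2.

```python
import numpy as np

def fabric_metrics(util, overload_threshold=0.8):
    """util: 2-D array of link utilizations, shape (num_intervals, num_links),
    e.g. one row per 5-minute measurement interval.
    Returns p99.9 MLU, p99.9 ALU, and the overloaded-link ratio (OLR, see §3.3.2)."""
    mlu_per_interval = util.max(axis=1)    # most-utilized link in each interval
    alu_per_interval = util.mean(axis=1)   # average link utilization in each interval
    p999_mlu = np.percentile(mlu_per_interval, 99.9)
    p999_alu = np.percentile(alu_per_interval, 99.9)
    # OLR: fraction of (link, interval) samples whose utilization exceeds the threshold.
    olr = float((util > overload_threshold).mean())
    return p999_mlu, p999_alu, olr
```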
3.3.2 FCT vs. Frequency of Overloaded Links

MLU and ALU do not tell the whole story. Two Gemini solutions could have identical MLU and ALU, while one had relatively few overloaded links and the other had many overloaded links. One might expect the second to exhibit a much higher total loss rate, and therefore to have much worse FCTs [63, 54]. From measurements of loss rates vs. link utilization by others [47], we believe that links loaded above 0.8 (in a measurement interval, e.g., 5 minutes) should be treated as "overloaded." Therefore, one metric we will apply in our evaluations is the fraction of overloaded links ("Overloaded Link Ratio," OLR). We attempted to find a correlation between FCTs and DCNI OLRs in our own fabrics, but these were generally too lightly loaded to provide enough data. (Figure B.3 in §B.1 does suggest a correlation for all-links OLRs.)

3.4 Gemini Design

Gemini predicts for each fabric, using historical traffic data, a reconfiguration strategy that optimizes fabric metrics. This strategy determines both the DCNI topology as well as routing paths for inter-pod traffic.

3.4.1 Gemini Overview

Approach. Gemini addresses the challenges described in §3.2 as follows. (i) It optimizes the topology based on link utilization metrics; these have been shown to be correlated with loss rates [47]. (ii) It models traffic demand using a collection of historical traffic matrices, and uses this model to derive topology and routing configurations. (iii) It jointly optimizes topology and routing configuration, which enables it to identify opportunities to optimize link utilization aggressively. Its optimization formulation accounts for pod heterogeneity. (iv) It hedges against mispredictions in fabrics with short-term variability by spreading traffic across multiple paths (the shortest path and 2-hop paths). (v) Because fabrics vary in skew and predictability, and because hedging can increase path stretch, Gemini selects the best reconfiguration strategy (whether to use ToE or not, whether to use hedging or not) for each fabric by simulating, on historical traffic, the impact of these choices on link utilization. (vi) It incorporates several techniques to scale the search for the best configuration to large fabrics.

Figure 3.10: Gemini architecture. TM: traffic modeler; JS: joint topology and routing solver; TW: training window; AW: moving aggregation window; RI: reconfiguration interval. The Predictor operates offline on historical traffic; the Controller operates online on real-time traffic.

Gemini Architecture. Gemini consists of two components, a Predictor and an extended SDN Controller (Figure 3.10). The Predictor determines, given a historical trace of traffic matrices over a training window (e.g., one month) for a fabric, the best reconfiguration strategy to use over the next predicted interval, which we set to one month, leveraging the long-term predictability described in §3.2. A reconfiguration strategy consists of two decisions: (a) whether to reconfigure the topology, and (b) whether to hedge routing.
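To make this strategy space concrete, the following sketch (our own illustrative encoding, not Gemini's actual data structures) enumerates the four candidate strategies that the Predictor later evaluates (§3.4.6).

```python
from dataclasses import dataclass
from itertools import product

@dataclass(frozen=True)
class ReconfigStrategy:
    reconfigure_topology: bool  # non-uniform, demand-aware DCNI (ToE) vs. fixed uniform DCNI
    hedge_routing: bool         # spread demand over 1-hop and 2-hop paths vs. no hedging

# The Predictor considers all four combinations of the two binary choices.
CANDIDATE_STRATEGIES = [
    ReconfigStrategy(topo, hedge) for topo, hedge in product([True, False], repeat=2)
]
```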
Using the predicted strategy, to adapt to significant short-term variability (§3.2), the Controller adapts routing configurations at finer timescales, based on a moving aggregation window’s worth of traffic (e.g., one week; see §3.5.3 for a sensitivity analysis). The Controller always updates routing 4 For now, to minimize latency impact, we only consider 2-hop paths. We have left it to future work to explore longer paths. 56 configuration using a routing reconfiguration interval (15 minutes, in our simulations); hedging, if included in the strategy, is done via routing adjustments. Reconfiguring routing can take several seconds (§3.1), so a 15-minute interval is feasible. If the strategy also includes topology reconfiguration, the Controller changes the DCNI topology at a fixed topology reconfiguration interval . §3.5.3 discusses how results vary with intervals ranging from one day to multiple weeks, and shows that once per month is sufficient. This is feasible using patch panels; if faster reconfiguration is required, commercial OCS switches could be employed. (§3.5.3 shows that routing needs to be reconfigured much more often than topology.) Note that the Predictor and Controller both use the same components (Figure 3.10): a traffic modeler that abstracts traces into a compact traffic model, and a joint topology and routing solver that produces solutions necessary for prediction and configuration. We first describe these components, then describe how the Predictor and Controller use these. Before doing so, we discuss what metric(s) Gemini seeks to optimize. 3.4.2 Minimizing Link Utilization Recent work has focused on minimizing flow completion time (FCT) as the objective of reconfigu- ration. In large datacenters with centralized (SDN) control [75, 29], reconfiguring the DCNI to optimize individual flows does not scale, since several million flows can be active at any instant. Also, flow size information might not be available for scheduling [80]. Instead, Gemini makes reconfiguration decisions based on inter-pod traffic demand. This prevents us from using FCT as an optimization goal. Instead, Gemini optimizes link utilization, which is (non-linearly) correlated with packet discard rate (as shown in [47] for datacenters, and [30] for WANs; we have also independently verified this using measured utilization and discard rate of DCNI links in all of the fabrics in our dataset, Figure 3.11). This implies that at higher utilizations there is less headroom to tolerate bursts, which can cause high discard rates. A topology where every link operates at low to moderate 57 0.0 0.2 0.4 0.6 0.8 1.0 Link Utilization 10 6 10 5 10 4 10 3 10 2 10 1 10 0 Normalized Discard Rate Figure 3.11: Per-link relationship: utilization vs. discard rate. utilization is likely to provide better application-perceived performance than one in which a few links operate at high utilization. More specifically, Gemini minimizes the maximum link utilization (MLU) across the fabric, following prior work [88, 82, 13, 12] which uses the same objective for WAN traffic engineering. In addition, Gemini also aims to minimize the network path stretch, equivalent to minimizing the average link utilization (ALU). This ensures, when possible, that traffic is routed along low latency paths. This objective is secondary, since the latency impact of an extra hop within the datacenter is lower than that of packet discards. 
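For reference, these utilization objectives (and the path stretch used later) can be written compactly. This is our own consolidated notation, anticipating the trunk capacities $C_e$, demands $d_{i,j}$, and path split ratios $f_{i,j,p}$ defined in §3.4.5, with $\mathrm{load}_e = \sum_{i,j}\sum_{p \ni e} f_{i,j,p}\, d_{i,j}$:

$$
\mathrm{MLU} = \max_{e \in E} \frac{\mathrm{load}_e}{C_e}, \qquad
\mathrm{ALU} = \frac{1}{|E|} \sum_{e \in E} \frac{\mathrm{load}_e}{C_e}, \qquad
\mathrm{stretch} = \frac{\sum_{e \in E} \mathrm{load}_e}{\sum_{i,j} d_{i,j}}.
$$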
3.4.3 Traffic Modeling

Gemini exploits historical traffic matrices available from our networks to develop a model of traffic demand in a fabric. We capture traffic matrices as average pod-to-pod bandwidth utilization every five minutes for each fabric. Even for a single fabric, an aggregation window's worth of traffic matrices can be significant, corresponding to over 2000 traffic matrices. To scale better, Gemini abstracts this traffic matrix history into a compact traffic model, and makes configuration and prediction decisions based on this traffic model.

Gemini could have used the elementwise-maximal inter-pod traffic matrix (Maximal-TM), where each element is the maximum demand during the aggregation window for that pod-pair. This is a pessimistic choice, and can result in an inefficient use of network resources, since the maximum demands for different pod-pairs might not all occur at the same time. Instead, Gemini uses an approximate convex hull of all traffic matrices within the aggregation window, building upon prior work on robust routing [12, 13, 82]. That work, given an arbitrary topology, seeks a routing assignment for a collection of traffic matrices $\mathcal{T}$ such that the maximum MLU is minimized. Those papers show that it suffices to consider the traffic matrices at the extrema of the convex hull of $\mathcal{T}$. Gemini uses this property but, departing from this prior work, seeks to jointly optimize both topology and routing (§3.4.5). While this property reduces the number of traffic matrices to consider in the joint optimization, the number of extrema on the convex hull can be very large for the large aggregation windows necessary to ensure long-timescale reconfiguration (Figure 3.12). To further reduce computational complexity, Gemini leverages the technique used in [88]. Specifically, it groups traffic matrices into k clusters. Then, from all the traffic matrices in a cluster, we generate a critical traffic matrix whose elements are the element-wise maxima of the cluster's traffic matrices. Those critical TMs are extrema of an approximated convex hull which is strictly larger than the original one, as shown in Figure 3.12. Note that the Maximal-TM is a special case of our approach where k = 1. Optimizing over these critical TMs can minimize MLU when the future traffic falls within the approximated convex hull.

Figure 3.12: Convex-hull-based prediction. The original convex hull can have many extrema; clustering traffic matrices and taking element-wise maxima per cluster yields a small set of critical traffic matrices that form an approximated (slightly larger) convex hull.

The traffic model uses two parameters, the traffic aggregation window and the number of critical TMs. We evaluate their impact in §3.5.3.

3.4.4 Hedging

To mitigate the impact of short-term unpredictability, Gemini incorporates a technique we call hedging. This section describes the intuition underlying hedging; the next section formalizes this intuition.

Figure 3.13: Hedging-based routing and topology. With predicted demand D and real demand D + δ (a burst of δ), single-path routing on a uniform topology increases utilization on a single trunk by δ/C; hedging – either splitting traffic over three paths or allocating more capacity to the hot trunk – limits the increase to 0.33 δ/C.

Figure 3.13 illustrates the idea of using hedging to reduce the utilization surge due to a burst δ. The topology has only four pods. For simplicity, we focus on a particular pod-pair, 1-2. The three panels show the utilization increase due to a traffic burst δ.
The leftmost panel shows that when single shortest-path routing is used in a uniform topology, the burst can lead to a $\delta/C$ utilization increase on a single trunk. However, if we split the traffic over three paths (one one-hop path and two two-hop paths), as shown in the middle panel, the burst is spread across these paths, leading to only a $0.33\,\delta/C$ utilization increase. Alternatively, if we assign more capacity to trunk 1-2, as shown in the rightmost panel, we achieve the same utilization increase as in the middle panel. We call both strategies hedging: the former effects hedging through traffic engineering, the latter through topology engineering.

To use hedging in practice, Gemini must decide: 1) the traffic split ratio of a demand (denoted $f$) on each path, and 2) the right capacity to allocate on each trunk. Motivated by the example of Figure 3.13, we quantify the risk $r$ of a burst $\delta$ over a trunk with capacity $C$ as $f\delta/C$, which is exactly the utilization increase due to $\delta$ on the trunk. Minimizing $r$ forces bursts to be spread out over more paths, or causes more capacity to be allocated on some trunks; either way, it reduces ("hedges") the risk that any single path will be overloaded. One can estimate $\delta$ for each pod-pair based on past traffic data. However, in our experience, assigning the same $\delta$ to all pod-pairs reduces MLU effectively; therefore, we leave more accurate estimation of per-pod-pair $\delta$s to future work. Hedging has the undesirable side-effect of using two-hop paths, increasing path stretch. Gemini disables hedging for well-bounded fabrics, as discussed in §3.4.6.

3.4.5 Joint Solver Design

The core of Gemini is a solver that searches for the optimal topology and routing configuration, given the traffic model described in §3.4.3, and is designed to scale to large fabrics.

Figure 3.14: Multi-stage optimization example on a four-pod topology (demands 1-3: 200, 2-4: 100, 1-2: 50, 3-4: 50; a burst of 40 units). Red dashed edges: trunks with maximum risk.

Notation. We model the pod-level network as a directed graph $G(V,E)$, where the vertex set $V$ represents pods and the edge set $E$ represents directed logical trunks between pods. The traffic model $\mathcal{T}$ contains $m$ critical traffic matrices. In the $t$-th matrix, the $(i,j)$-th entry, denoted $d_{i,j,t}$, is the demand between pods $i$ and $j$. We use $f_{i,j,p}$ to denote the path split ratio of $d_{i,j,t}$ on path $p \in P_{i,j}$.

Modeling pod heterogeneity. Pods can be heterogeneous, as a result of incremental expansion [91]. We model this as follows: the $i$-th pod has a fixed radix $x_i$ (number of ports; in this section, we only consider the DCNI-connected "uplink" ports) and all ports of a pod have the same uplink rate $s_i$. (Uplink and downlink rates are the same.)
Pod $i$'s ports are partitioned into different subsets, such that each subset of ports is connected via fiber links to a distinct pod. The set of $n_e$ links connecting two pods constitutes a trunk between the pods. The capacity of the trunk, $C_e$, is determined by $n_e s_e$, where $s_e = \min(s_i, s_j)$. Modeling this heterogeneity explicitly allows Gemini to avoid connecting ports with different speeds (e.g., a 100G port to a 40G port), which wastes capacity. This, together with ToE, allows Gemini to satisfy demands that may be impossible to satisfy with a uniform topology, defined as a topology with the same number of links between each pod-pair, but not necessarily the same link speeds. Consider the example of Figure 3.15, in which there is a demand of 300 between pods 1 and 2 (each with 100G ports), and 50 between pods 3 and 4 (each with 40G ports). This demand cannot be satisfied with the uniform topology on the left, but can be satisfied with the topology on the right.

Figure 3.15: Due to pod speed heterogeneity (40G/100G), the uniform topology on the left cannot support the demand of 300 between pod 1 and pod 2 (its 1-2 capacity is only 180). Since Gemini is demand-aware, it finds a feasible topology, shown on the right, with a 1-2 capacity of 300.

Three-stage optimization. The solver has three optimization stages, each of which optimizes for a particular objective. We use Figure 3.14 to illustrate these stages.

Stage 1: Minimize MLU. The first stage generates a (potentially non-uniform) topology and routing that minimizes the MLU across all extreme traffic matrices in the traffic model. Denote by $u^*$ the resulting minimum MLU. Equation 3.1 ensures that all link utilizations are at most $u$ across all traffic matrices in $\mathcal{T}$. Equation 3.2 models the consequence of pod speed heterogeneity: the speed of a link connecting two speed-heterogeneous pods equals the smaller of the two end pods' speeds. Equation 3.3 ensures that the total number of links originating from pod $i$ does not exceed the pod radix $R_i$.

$$
\begin{aligned}
\min \quad & u \\
\text{s.t.} \quad & \sum_{i,j \in V} \; \sum_{\{p \,\mid\, p \in P_{i,j},\, e \in p\}} f_{i,j,p}\, d_{i,j,t} \;\le\; u\, C_e, \quad \forall e \in E,\ \forall t \in \mathcal{T} & (3.1) \\
& C_e = n_e s_e, \quad s_e = \min(s_i, s_j), \quad \forall e = (i,j) \in E & (3.2) \\
& \sum_{\{e \,\mid\, e \text{ originates from pod } i\}} n_e \;\le\; R_i, \quad \forall i \in V & (3.3) \\
& \text{Flow constraints for } f_{i,j,p}, \text{ and } n_e \ge 0 & (3.4)
\end{aligned}
$$

In our example, as shown in the leftmost panel of Figure 3.14, the first stage produces a highly skewed topology, based on the skewed TM, with shortest-path routing. However, this solution can create high risk on thin trunks (e.g., 1-2 and 3-4); given a traffic burst of 40 units, the utilization on those trunks could increase by up to 33%.

Stage 2: Enable hedging. This stage addresses risk hedging, formalizing the intuition described in §3.4.4. It achieves hedging by minimizing the maximum risk (§3.4.4) over all trunks. Equation 3.5 ensures that all trunk utilizations remain below the $u^*$ obtained in stage 1, and Equation 3.6 ensures that the maximum risk over all pod-pairs is less than $r$.

$$
\begin{aligned}
\min \quad & r \\
\text{s.t.} \quad & \sum_{i,j \in V} \; \sum_{\{p \,\mid\, p \in P_{i,j},\, e \in p\}} f_{i,j,p}\, d_{i,j,t} \;\le\; u^* C_e, \quad \forall e \in E,\ \forall t \in \mathcal{T} & (3.5) \\
& f_{i,j,p}\, \delta \;\le\; r\, C_e, \quad \forall i,j \in V,\ \forall p \in P_{i,j},\ \forall e \in p & (3.6) \\
& \text{Equations (3.2)–(3.4)}
\end{aligned}
$$

In our example (middle panel of Figure 3.14), hedging reduces the MLU on trunks 1-2 and 3-4 from 0.75 to 0.62 by jointly adjusting the topology and routing.
However, because it uses many two-hop paths to hedge, it results in a larger path stretch and ALU compared to the first-stage solution.

Stage 3: Minimize path stretch. This stage re-arranges the computed routing and topology solution to minimize path stretch while maintaining the $u^*$ and $r^*$ from the previous stages. The path stretch is defined as the total network load over all links divided by the total demand, where the total load equals the sum of the loads on all links. Since the total demand is a constant, we keep only the total load as the objective in the formulation. Equation 3.7 and Equation 3.8 ensure that the MLU and the maximum risk over all pod-pairs are less than $u^*$ and $r^*$, respectively. Note that since $u^*$ and $r^*$ are constants, the formulation is a linear program.

$$
\begin{aligned}
\min \quad & \sum_{t \in \mathcal{T}} \sum_{e \in E} \sum_{i,j \in V} \; \sum_{\{p \,\mid\, p \in P_{i,j},\, e \in p\}} f_{i,j,p}\, d_{i,j,t} \\
\text{s.t.} \quad & \sum_{i,j \in V} \; \sum_{\{p \,\mid\, p \in P_{i,j},\, e \in p\}} f_{i,j,p}\, d_{i,j,t} \;\le\; u^* C_e, \quad \forall e \in E,\ \forall t \in \mathcal{T} & (3.7) \\
& f_{i,j,p}\, \delta \;\le\; r^* C_e, \quad \forall i,j \in V,\ \forall p \in P_{i,j},\ \forall e \in p & (3.8) \\
& \text{Equations (3.2)–(3.4)}
\end{aligned}
$$

The rightmost panel in Figure 3.14 illustrates the topology and routing solution after stage 3, which enables more traffic to go through shortest paths without increasing $u^*$ and $r^*$.

Scaling the solver. The formulations in stage 1 and stage 2 are both non-linear: MLU and risk are multiplied by the trunk capacity, which is itself a variable (e.g., Equations 3.1 and 3.6). To make the solver tractable, we conduct a binary search over a range of values for MLU and risk. For a fixed value of MLU/risk, the stages become linear programs that can be solved efficiently. Each step bounds the objective function from above or below; we stop when the gap between the two bounds is below a threshold. We reduce the search space for the binary searches via several bounds. For MLU, the lower bound is the maximum ratio of a pod's aggregate demand to the pod's block capacity; the upper bound can be approximated by a simple algorithm, such as VLB over a uniform topology. The lower bound for risk is achieved when $\delta$ is equally split over all possible paths of equal capacity.

3.4.6 Predictor and Controller operation

Hedging can reduce MLU and allow Gemini to be more robust to misprediction, which is beneficial for fabrics with unpredictable traffic (i.e., those with few well-bounded pairs). However, it can force traffic over longer paths, increasing ALU, as in Figure 3.14. Gemini uses the Predictor to derive configurations with and without hedging, then picks the better configuration (the one with lower MLU, or, if the MLUs are comparable, the one with lower ALU). Thus, for a fabric with largely predictable traffic, Gemini avoids hedging because it increases ALU; for one with unpredictable traffic, it selects a hedging-based configuration because that has lower MLU.

To select a strategy, the Predictor needs a goal that depends on the network operator's objectives. Our operators prioritize the strategy that reduces the p99.9 MLU to within 5% of the best strategy, and breaks ties via the p99.9 ALU. (We define the p99.9 MLU with respect to a period, such as one month, by selecting the most-utilized link in each five-minute measurement interval and reporting the 99.9th percentile of these per-interval MLUs; the p99.9 ALU is the 99.9th percentile over the per-interval ALUs.)

As discussed earlier, both the Predictor and the Controller use the traffic model and the solver, but in slightly different ways. The Predictor takes a training window's worth of data (1 month), and runs the solver offline for four strategies combining two binary choices: Uniform vs. Non-uniform pod-to-pod topology, and Hedging vs. no-hedging for risk minimization. (The Predictor can disable topology reconfiguration by fixing $n_e$ in the joint formulation to the number of trunks in the uniform topology, rather than setting it as an optimization variable.)
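A minimal sketch of this selection logic is shown below. It is illustrative only: the function names and the simulate_strategy helper are hypothetical, the real Predictor runs the joint solver and simulator of Figure 3.10, and the exact form of the "within 5%" cushion is our own assumption.

```python
def pick_strategy(candidate_strategies, simulate_strategy, mlu_cushion=0.05):
    """Pick a reconfiguration strategy per the operator goal in §3.4.6:
    prefer p99.9 MLU within `mlu_cushion` of the best, break ties by p99.9 ALU.
    `simulate_strategy(s)` is assumed to replay the training window and
    return (p999_mlu, p999_alu) for strategy s."""
    results = {s: simulate_strategy(s) for s in candidate_strategies}
    best_mlu = min(mlu for mlu, _ in results.values())
    # Keep strategies whose p99.9 MLU is within the cushion of the best ...
    acceptable = [s for s, (mlu, _) in results.items()
                  if mlu <= best_mlu * (1 + mlu_cushion)]
    # ... and among those, pick the one with the lowest p99.9 ALU.
    return min(acceptable, key=lambda s: results[s][1])
```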
The Predictor simulates each strategy, over the training window, to estimate link utilizations.

The Controller reconfigures the fabric online using the predicted strategy. If the strategy includes topology reconfiguration, then at every topology reconfiguration interval, it computes a new topology that network operators can use to implement a reconfiguration. Note that the Controller always uses periodic routing reconfiguration; at each routing reconfiguration interval, it computes a routing solution and invokes the fabric SDN controller to update the switches.

3.4.7 Practical Considerations

Physical realization with Patch Panels. To realize a physical topology with patch panels as shown in Figure 3.3, we need to address two issues: (a) Gemini's solver might output fractional trunks, so Gemini must round fractional trunks to integers while maintaining the same pod radixes; and (b) any suggested topology reconfiguration should not require moving fibers between patch panels. In other words, any logical topology should be realizable by optical paths through fixed-radix patch panels. To round the fractional solution, we devise a rounding algorithm, Algorithm 1, which has the following property:

Theorem 1. Given a graph $G(V,E)$ that has even node degrees, arbitrary edge weights $n_e$, $\forall e \in E$, and no self-loops, Algorithm 1 can construct a graph $G'(V,E)$ with no self-loops in $O(|V|^2)$ time, with the same node degrees as $G$ and with $n'_e \in \{\lfloor n_e \rfloor, \lfloor n_e \rfloor + 1\}$.

Algorithm 1: Rounding fractional edges to integers while maintaining node degrees.
1. Round down the value of each edge $n_e$ to the largest integer not exceeding $n_e$, i.e., $\lfloor n_e \rfloor$. Denote the resulting graph by $G(V, E_0)$.
2. Compute the node degrees $y_v$ in $G(V, E_0)$. Let $z_v = x_v - y_v$ be the residue degree of node $v$. Sort the nodes in descending order of residue degree: $v_i(z_i)$, $\forall i \in \{1, 2, \ldots, |V|\}$.
3. Connect one edge between $v_1$ and each of the next $z_1$ nodes that have the largest residue degrees, i.e., $v_2, v_3, \ldots, v_{z_1+1}$. Let the resulting graph be $G(V, E_1)$.
4. Repeat Steps 2 and 3 until all residue degrees are zero.

Proof of Theorem 1: Given a graph $G(V,E)$ that has even node degrees $x_v$, $\forall v \in V$, arbitrary edge weights $n_e$, $\forall e \in E$, and no self-loops, Algorithm 1 computes a graph that has integer edge weights close to $n_e$ (either $\lfloor n_e \rfloor$ or $\lfloor n_e \rfloor + 1$), $\forall e \in E$, while maintaining the same node degrees. Since there are at most $|V|$ iterations of Steps 2 and 3, and it takes $O(|V|)$ time for integer sorting and for connecting at most $|V|$ edges in each iteration, Algorithm 1 runs in $O(|V|^2)$ time. We next prove that the algorithm outputs a graph that satisfies the degree and edge constraints.

By the Erdős–Gallai Theorem [22], given a node degree sequence $z_1 \ge z_2 \ge \cdots \ge z_n$ whose sum is even, the following inequality is sufficient for the existence of a simple graph, without parallel edges or self-loops, that satisfies the degree sequence:
$$\sum_{i=1}^{k} z_i \;\le\; k(k-1) + \sum_{i=k+1}^{n} \min(z_i, k), \qquad 1 \le k \le n. \qquad (3.9)$$
Moreover, if the inequality holds then, by Theorem 5 of [37], the algorithm that iteratively connects the node with the largest degree $z_1$ to the next $z_1$ nodes of largest degree constructs a simple graph $G(V, E')$.
Since there is at most one edge between any pair of nodes in a simple graph, $G(V, E_0 \cup E')$ is a graph with degrees $x_v$, $\forall v \in V$, and integer edge weights $\lfloor n_e \rfloor$ or $\lfloor n_e \rfloor + 1$, $\forall e \in E$. Moreover, $G(V, E_0 \cup E')$ does not have self-loops, since there is no self-loop in $E_0$ or $E'$.

It remains to prove that the sum of $z_1 \ge z_2 \ge \cdots \ge z_n$ is even and that inequality (3.9) holds. The total degree of the nodes in $G(V, E_0)$ is even, because every edge contributes an additional degree to two nodes. Moreover, $x_v$ is even, $\forall v \in V$. Therefore, $\sum_{v \in V} z_v = \sum_{v \in V} x_v - \sum_{v \in V} y_v$ is even. The residual degrees $z_i$ are non-zero only if the fractional edges adjacent to $i$ are rounded down by a value smaller than 1. Thus, there exist fractional edges $0 \le w_{ij} < 1$ that satisfy $\sum_{i=1}^{n} w_{ij} = z_j$ and $\sum_{j=1}^{n} w_{ij} = z_i$.
$$\sum_{i=1}^{k} z_i \;=\; \sum_{i=1}^{k}\sum_{j=1}^{n} w_{ij} \;=\; \sum_{i=1}^{k}\sum_{j=1}^{k} w_{ij} + \sum_{i=1}^{k}\sum_{j=k+1}^{n} w_{ij} \;\le\; k(k-1) + \sum_{i=1}^{k}\sum_{j=k+1}^{n} w_{ij} \;\le\; k(k-1) + \sum_{j=k+1}^{n} \min(z_j, k).$$
The first inequality holds because $w_{ij} < 1$ and $w_{ii} = 0$. The second inequality holds because $\sum_{i=1}^{k} w_{ij} \le z_j$ and $\sum_{i=1}^{k} w_{ij} \le k$.

To avoid moving fibers between patch panels, we use Theorem 2 to distribute links between pods and patch panels.

Theorem 2. If the radix of every pod is $2^k$, any topology that has integer numbers of inter-pod trunks can be constructed using $2^p$ patch panels ($p < k$), by connecting $2^{k-p}$ ports of every pod to every patch panel.

Proof of Theorem 2: If every node in $G$ has degree $2r$, then the graph can be decomposed in polynomial time into $r$ edge-disjoint 2-factors (subgraphs in which every node $v \in V$ has degree 2) [65]. Therefore, a graph with uniform node degree $2^k$ can be decomposed into $2^{k-1}$ 2-factors, which can be partitioned into $2^p$ groups of size $2^{k-1-p}$. Since there are two edges adjacent to each node in a 2-factor, there are a total of $2^{k-p}$ edges adjacent to each node in a group of 2-factors. The edges in each group of 2-factors can be constructed by one patch panel, because every pod has $2^{k-p}$ ports connected to each patch panel and links can be arbitrarily connected between ports within a patch panel.

Handling major workload changes. Predictions based on historical traffic might not handle large demand increases (e.g., adding a new large-scale service to a datacenter). We address this primarily via an approval process for admitting large-scale workloads, together with a continual process for adding new capacity (pods and DCNI resources) on a live network [72, 92, 86]. Gemini's topology-reconfiguration strategy can integrate exogenous predictions of future demand to inform our regular capacity-augmentation processes.

Wiring complexity. Modern datacenters use fiber bundling to reduce wiring complexity [75, 86], using a layer of patch panels between the pod and spine layers [92]. A spine-free topology's restriping layer permits fiber bundling in much the same way, even in the presence of pod heterogeneity (the proof of Theorem 2 describes one approach to this).

Topology restriping/expansion. Restriping or expanding non-blocking Clos topologies requires rewiring fibers. Algorithms for these operations [92] generate multi-step rewiring plans, to ensure that at each step fabric capacity does not drop below an operator-specified threshold. Spine-free topologies can use these algorithms, with minimal modifications.

3.5 Evaluation

In this section, we validate Gemini on a testbed (§3.5.1), and more extensively simulate Gemini using traces from 22 production fabrics (§3.5.2). In simulations, Gemini consistently improves link-utilization metrics (§3.3) over various baselines.
3.5.1 Testbed Evaluation We used a testbed, exposed to a production workload (including storage, search, computation, and video serving), to validate our simulator, and to test that Gemini works reliably. This gave us some ability to compare Gemini to a baseline, but operational constraints limited what experiments we could run, and these were not randomized controlled trials. The testbed has 12 pods, each with 256 100G ports, using a non-blocking Clos design similar to Jupiter [75]. Pods are connected by a DCNI with multiple patch panels as in Figure 3.3. Reconfiguration. The Gemini system interfaces with the fabric’s centralized control plane, which performs routing state (re)-configuration and collects traffic matrices. The fabric uses WCMP [93] 69 0.0 0.2 0.4 0.6 0.8 1.0 Utilization 0.0 0.5 1.0 CDF Simulated Measured p99.9 utilization: simulated = 0.716 vs. measured = 0.701 (2% error); p99.9 MLU: simulated = 0.836 vs. measured = 0.843 (< 1% error) Figure 3.16: Simulated v.s., measured testbed utilization to route traffic within pods. In this implementation, Gemini reconfigures routing once every 8 hours, and reconfigures topology in multiple steps to reduce the capacity degradation at each step, using an algorithm similar to [92] to generate the rewiring plan at each step. Configurations. We initially configured the testbed with a uniform topology, with hedging enabled, and collected traffic data for two weeks. Based on that traffic, Gemini’s Predictor recommended a (non-uniform, no-hedge) configuration. We then reconfigured the topology accordingly, and collected data for another two weeks. We refer to the initial configuration as baseline and the latter as predicted-best. Both configurations use demand-aware routing. Simulator validation. Using the testbed we compared our simulator’s predicted link utilizations against measurements, for a (different) 27-day period, with the testbed restored to the “baseline” configuration: (Uniform, hedging). Figure 3.16 shows that, at least in this configuration, the simulator agrees closely with ground truth. We have no reason to expect the simulator to be less accurate in other configurations. (Note also that the Predictor depends on the accuracy of the simulator, as shown in Figure 3.10.) 70 0 20 40 60 Utilization 0.2 0.4 0.6 0.8 1.0 CDF 0.25 0.35 0.45 0.55 0.65 MLU 0.0 0.2 0.4 0.6 0.8 1.0 0.05 0.15 0.25 ALU 0.0 0.2 0.4 0.6 0.8 1.0 1.00 1.25 1.50 1.75 Stretch 0.0 0.2 0.4 0.6 0.8 1.0 Predicted-best Baseline Figure 3.17: Comparing baseline vs. predicted-best configs Utilization metrics. Due to a data-retention mistake, we have no direct measurements of link utilizations during these testbed trials. We did retain directly-measured traffic matrix traces, so we reconstructed link utilizations from these traces using the simulator. Results. Because we used production traffic, the two configurations were under somewhat different workloads (predicted-best had 17% higher traffic, and a DMR of 5.49 v.s., 1.67), so these results are “suggestive” of improvements, but not proofs. Figure 3.17 shows CDFs of (simulated) link metrics comparing the baseline and predicted-best configurations, showing apparent reductions in utilization, ALU, and stretch, when using predicted- best, but increases in MLU. Some of the shift might be caused by the use of different workloads, but we lack data to separate that from the topology effect. However, the large improvement in stretch is consistent with the use of a non-uniform topology that avoids most transit routing. 
§B.2 provides some additional results from the testbed. 71 3.5.2 Large-scale Simulation To better understand the performance of Gemini at scale, we present results from a trace-driven evaluation, using data from production fabrics. To do this, we feed historical traffic matrices from these fabrics to a simulator using a Map-Reduce implementation of Gemini’s algorithms (§3.4). Datasets. We use six months’ worth of 5-minute traffic matrices from 22 fabrics in our whole fleet. 7 These fabrics span a range of topology sizes, utilization levels and traffic characteristics; some of the fabrics mix several line rates and/or pod radices in their DCNI. 8 The choice of a 5-minute window for traffic matrices can obscure traffic dynamics within the window, but this window size has long been an industry-standard collection interval [21, 34, 71, 87]. Gemini Prediction Methodology. For a given fabric, Gemini’s Predictor chooses the best of four possible demand-aware strategies listed in §3.4.6. For each fabric, we used a 1-month training window. 9 Baselines. We compare Gemini against three demand-oblivious baselines. The first, (Uniform, VLB) connects pods directly using a uniform topology with Valiant-loading-balancing based (VLB) routing. VLB splits traffic equally across all N one-hop and two-hop paths between ingress and egress pods [90]. The second, Same-cost Clos, is a 2:1 oversubscribed-Clos DCNI running ECMP. (Uniform, VLB) and Same-cost Clos have the same total DCNI cost (including switches, cables, and optical transceivers) as the one that Gemini uses. The third, Full Clos, is a Clos topology with a spine layer, with twice Gemini’s total DCNI cost. Parameters. Our baseline simulations use traffic models built with 12 critical traffic matrices, an aggregation window of a week, a topology reconfiguration rate of 1/day, and a routing reconfiguration period of 15 min. In §3.5.3, we evaluate Gemini’s sensitivity to these parameters. 7 For brevity, we often use the term “fabric” to refer to the combination of a workload trace and the specific fabric where it was obtained. 8 Absolute numbers for fabric size, throughput, latency, or overall loss rates are proprietary, so we cannot publish these. 9 We chose a 1-month training window partly for convenience; a sensitivity study for this parameter is future work. 72 Success metrics. We compare configurations and strategies on several metrics: MLU, ALU, and Overloaded Link Ratio (OLR) as discussed in §3.3. An ideal strategy would improve all three of these vs. the other options. 3.5.2.1 Benefits of Demand-Awareness. F1 F2 F3 F4 F5 F6 F7 F8 F9 F10 F11 F12 F13 F14 F15 F16 F17 F18 F19 F20 F21 F22 Fabric 0.0 0.5 1.0 1.5 2.0 P99.9 MLU Gemini (Uniform, VLB) Same-cost Clos Full Clos Bars above MLU=1.0 represent demands that cannot be feasibly routed Figure 3.18: P99.9 MLU impact of demand awareness F1 F2 F3 F4 F5 F6 F7 F8 F9 F10 F11 F12 F13 F14 F15 F16 F17 F18 F19 F20 F21 F22 Fabric 0.0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 P99.9 ALU Gemini (Uniform, VLB) Same-cost Clos Full Clos Figure 3.19: P99.9 ALU impact of demand awareness F1 F2 F3 F4 F5 F6 F7 F8 F9 F10 F11 F12 F13 F14 F15 F16 F17 F18 F19 F20 F21 F22 Fabric 10 -5 10 -4 10 -3 10 -2 10 -1 10 0 99.9 OLR Gemini (Uniform, VLB) Same-cost Clos Figure 3.20: P99.9 OLR impact of demand awareness 73 Our simulations demonstrate that Gemini’s demand-aware approach is superior to demand- oblivious approaches on MLU, ALU, and OLR. 
Our evaluation here compares Gemini to other designs using p99.9 values for MLU (Figure 3.18), ALU (Figure 3.19), and OLR (Figure 3.20). These figures aggregate results (with error bars) over all 5 months. Gemini has lower p99.9 MLU than cost-competitive approaches. Across all fabrics, Gemini’s p99.9 MLU is always comparable to, or lower than, (Uniform, VLB) and Same-cost Clos. 10 The average difference between these two alternatives and Gemini is 42% and 30%, respectively. For about half of the fabrics, the demand-oblivious approaches do not result in feasible routing: the total demand on at least one link exceeds capacity (this is shown as an MLU greater than 1). Gemini, on the other hand, is able to accommodate traffic demands on every fabric. Same-cost Clos performs worse because the oversubscription is uniform at all pods, so “hot” pods which generate high demand can see high utilization. For such pods, Gemini would provision additional trunk capacity. (Uniform, VLB) has high MLU because all pods need to act as transit for all other pods; this can increase MLU for hot pods significantly. Compared to Gemini, Full Clos has smaller p99.9 MLU across all the fabrics. This is because Full Clos has twice the number of switches and links, can perfectly load balance traffic at each pod, and no pods carry transit traffic for other pods. Interestingly, for 17 of 22 fabrics, Gemini’s p99.9 MLU is at most 30% higher than Full Clos. For these fabrics, traffic is predictable, so Gemini’s configuration is able to achieve low p99.9 MLU because it is well matched to the traffic demand. Gemini’s p99.9 ALU is comparable to, or lower than, competing approaches. Across all fabrics, Gemini’s p99.9 ALU is always comparable to, or lower than, (Uniform, VLB) and Same-cost Clos for the same reasons discussed before. For highly-variable fabrics where Gemini decides to use hedging, which spreads traffic across longer transit paths, the p99.9 ALU is twice that of Full Clos. For the other fabrics, Gemini’s p99.9 ALU is similar to Full Clos. Thus, Gemini 10 Our use of fabrics with mixed line rates and radices explains why same-cost Clos sometimes outperforms VLB, which can suffer from hot spots in such cases. 74 appears to provide a better tradeoff between DCNI cost and ALU than any of the demand-oblivious options. Gemini’s OLR is better than any same-cost option. Figure 3.20 shows that Gemini’s OLR is always significantly better than (Uniform, VLB) and Same-cost Clos. The figure uses a log scale, and so omits bars where OLR=0 (no links exceeded 0.8 utilization), which is always the case for Full Clos and for Gemini on many fabrics. Gemini’s OLR is always below 1%, suggesting that FCTs would remain low. F1 F2 F3 F4 F5 F6 F7 F8 F9 F10 F11 F12 F13 F14 F15 F16 F17 F18 F19 F20 F21 F22 Fabrics 1.0 1.1 1.2 1.3 1.4 1.5 1.6 1.7 1.8 1.9 P99.9 Stretch (Non-uniform, hedge) (Uniform, hedge) (Non-uniform, no-hedge) (Uniform, no-hedge) Figure 3.21: P99.9 stretch for one arbitrarily-chosen month (“M3”) in our study; other months are qualitatively similar. Gemini’s stretch is typically low. Clos networks (with spines) and VLB networks always have an extra hop, compared to the best-case Gemini configurations. While Gemini sometimes does add a second hop, for transit routing, Figure 3.21 shows that the p99.9 stretch remains below 2, and when hedging is not necessary, the p99.9 stretch is usually close to 1.0 (ideal) because Gemini can exploit non-uniform topology. 
3.5.2.2 Prediction quality Our simulations demonstrate that Gemini usually makes correct predictions, that when these predictions are correct they are beneficial, and when they are wrong, the cost is relatively low. (Here, we use data from all five months of predictions.) Gemini usually recommends optimal strategies. In Figure 3.22, solid bars represent Gemini’s recommendations and hashed bars show the optimal (in hindsight) strategy, for each 75 F1 F2 F3 F4 F5 F6 F7 F8 F9 F10 F11 F12 F13 F14 F15 F16 F17 F18 F19 F20 F21 F22 Fabrics M1 M2 M3 M4 M5 Months Predicted Optimal (Non-uniform, hedge) (Uniform, hedge) (Non-uniform, no-hedge) (Uniform, no-hedge) Figure 3.22: Gemini’s predicted strategy v.s., optimal strategy fabric and each month; matching colors mean a correct prediction. Gemini overall correctly predicts 81% of the choices here, and (for these 5 months) predicts perfectly for 11 fabrics, and near-perfectly for 4 others. §3.5.2.2 discuss the benefits of good predictions and the costs of bad ones. Correct predictions are beneficial. Earlier we showed that Gemini typical predicts the optimal strategy; here we show the benefits of correct predictions. F1 F2 F3 F4 F5 F6 F7 F8 F9 F10 F11 F12 F13 F14 F15 F16 F17 F18 F19 F20 F21 F22 Fabric 0.05 0.00 0.05 0.10 0.15 0.20 0.25 0.30 MLU improvement MLU benefit from correct predictions (higher is better) F1 F2 F3 F4 F5 F6 F7 F8 F9 F10 F11 F12 F13 F14 F15 F16 F17 F18 F19 F20 F21 F22 Fabric 0.2 0.1 0.0 0.1 0.2 0.3 ALU improvement ALU benefit from correct predictions (higher is better) Figure 3.23: Benefits from correct predictions 76 Figure 3.23 shows the MLU and ALU improvements from correct predictions. Each vertical bar plots, for one fabric, the range, across 5 months, of improvements for MLU (respectively ALU) compared to every other strategy. In most cases, both MLU and ALU benefit from correct predictions; when MLU is worsened, it is still within the 5% cushion applied by the Predictor (§3.4.6), when it trades off MLU for improved ALU (e.g., fabrics F5, F6, F7, F9, F13, F14, F15, F17, F20). Misprediction costs are small. For fabrics where Gemini mispredicted at least once, Figure 3.24 shows the difference from optimal in p99.9 MLU (left) and MLU (right); the whiskers show the range across the mispredicted months. The worst-case increases are 15% for both MLU and ALU, but most MLU increases are within the Predictor’s 5% cushion; most ALU increases are also below 5%. In many cases, a misprediction still improves either MLU or ALU over the optimal strategy. F3 F7 F8 F9 F12 F13 F15 F16 F17 F19 F22 Fabric 0.05 0.00 0.05 0.10 0.15 MLU misprediction cost F3 F7 F8 F9 F12 F13 F15 F16 F17 F19 F22 Fabric 0.10 0.05 0.00 0.05 0.10 0.15 ALU misprediction cost p99.9 values; higher is worse Figure 3.24: Misprediction cost, MLU (left) and ALU (right) Taken together, these results suggest that, while there is room for improvement in Gemini’s prediction accuracy, even its mispredictions are mostly harmless. 3.5.3 Sensitivity Analyses Routing Reconfiguration Frequency. Figure 3.25 shows the impact of the routing reconfigu- ration interval (r) on MLU and ALU, for r between 5 minutes and 8 hours. (For this experiment, we fix the topology reconfiguration interval to 1 day. 
We only show the (Non-uniform, hedge) strategy for month M3; other strategies in all months are qualitatively similar.) For half of the fabrics, p99.9 MLU decreases as r decreases; the rest are insensitive to r. The p99.9 ALU is insensitive to r. r = 15 min is typically sufficient.

Figure 3.25: MLU/ALU vs. routing reconfiguration interval (r), for r from 5 minutes to 8 hours.

Figure 3.26: MLU/ALU vs. topology reconfiguration interval (t), for t from 1 day to 28 days.

Topology reconfiguration frequency. We studied the effects of several topology reconfiguration intervals (t) between 1 day and 4 weeks, with r fixed at 15 minutes. Our experimental results show that, for all fabrics, both MLU and ALU are independent of t, implying that topology reconfiguration can be infrequent (for our fabrics/workloads) and can be implemented with inexpensive patch panels.

Impact of multiple critical traffic matrices. In its traffic model, Gemini clusters traffic matrices (TMs), then selects critical TMs. We simulated MLU and ALU against the number of clusters k (or critical TMs), for k = 1, 4 and 12 (Figure 3.27). We observed that as k increases, p99.9 MLU decreases by 5% on average in half of the fabrics without increasing p99.9 ALU, but we observe diminishing returns: k = 12 is a sweet spot.

Figure 3.27: MLU/ALU vs. number of critical matrices (k) in M3.

Figure 3.28: MLU/ALU vs. aggregation window (w) in M3.

Impact of traffic aggregation window. In its traffic model, Gemini clusters traffic matrices within a traffic aggregation window w. We simulated MLU and ALU for w = 1 day, 3 days, and 7 days (Figure 3.28); we fixed r = 15 min and t = 1 day. Increasing w seems to improve MLU in 20% of the fabrics; it has no apparent effect on ALU.

Chapter 4

Towards Correct-by-Design Scalable Control Planes

4.1 Introduction

A large-scale network needs a highly scalable and available control plane to manage tens of thousands of switches and process millions of updates per second [29]. The high-level goal of the control plane is to manage various pieces of network state and to program the switches so that the network reaches its target state reliably, based on high-level intents, within a reasonable time.

4.1.1 Large-scale Control Plane Basics

Distributed vs. centralized. The first type of large-scale control plane design [7] relies on distributed protocols running on switches, such as BGP, chosen for its scalability and flexible policy control. In contrast to such designs, which are complex and hard to configure and manage, production-grade centralized control planes [75] have been developed, such as Orion [29], Onix [46] and ONOS [16]. In a centralized control plane, a logically centralized controller processes network events and programs the switches.
In this chapter, we mainly consider the specification and verification of the centralized control plane.

Consistent update in the centralized control plane. One advantage of the centralized control plane is that it has a global view of the whole network state, so the computed routing solution can be closer to optimal than with decentralized control.

To fully reap the benefit of centralized control, the control plane should support consistent updates, which guarantee the preservation of well-defined behaviors (such as loop-freedom and blackhole-freedom) when transitioning between states [70]. There are multiple approaches to achieving consistent updates [70, 42]. In this chapter, we consider the approach of [42] for its low switch memory footprint and generality. Specifically, dependencies among flow programming instructions (FPIs) to switches are captured by a directed acyclic graph (DAG). The control plane's job is to 1) compute the target network state based on the current data plane state, including topology and flows; 2) compute a DAG for scheduling FPIs to switches; and 3) schedule FPIs to switches and make sure they are correctly installed in a reasonable time.

Design goals. There are several strict goals for designing the control plane: 1) at-least-once installation of flow entries; 2) the final installed entries should eventually be correct; and 3) states in the data plane and control plane should eventually be consistent. In addition, there are several goals that are preferable to have: 1) exactly-once flow entry installation on each switch; and 2) low convergence time, defined as the duration between a triggering event and the completion of all resulting work in the data and control planes [29].

4.1.2 Microservice-based Centralized Control Plane

Microservices. More and more production software systems, such as those of Netflix, Amazon and Uber [3, 4, 6], are adopting a microservice-based architecture for its unique advantages over a monolithic design, such as high availability and decoupled scaling. The high availability of microservices does not come for free; instead of using complex distributed transactions, which tightly couple services temporally, microservice-based systems are usually eventually consistent, designed with the expectation that inconsistency will happen and should be handled by compensation techniques [29, 3]. In a microservice architecture, components are loosely coupled and communicate over the network using remote procedure calls (RPCs).

Google's microservice-based control plane. Orion adopts the microservice architecture and distributes its logic over multiple processes for scalability and fault-tolerance [29]. The architecture of Orion consists of two layers: SDN applications at the top and the Orion Core in the center. SDN applications store high-level network intents, compute the target network state, and provide routing mechanisms. The Orion Core is responsible for compiling high-level requirements from SDN applications into OpenFlow rules installed on switches, and for providing the data plane state to the SDN applications. Each layer consists of multiple components (corresponding to different processes) providing different functionalities. We depict the detailed modular decomposition, based on our understanding, in Figure 4.1.

4.1.3 Modules in Microservice-based Control Plane

We discuss the design and functionality of each module; each module can be further decomposed into submodules that reside in the same process and share memory, while processes on different servers communicate using RPCs.
4.1.3.1 SDN Applications

There are many SDN applications. In this chapter, we consider only a few key applications.

Routing engine (RE). The routing engine is responsible for the following tasks. 1) Compute the target network state, based either on intents from the upper layer, such as draining a switch, or on topology changes observed directly from the data plane, such as a switch failure; the submodule TE is responsible for this task. 2) Compute the new DAG for flow programming, based on the old and target network states; this is also handled by TE. 3) Schedule the FPIs, based on the DAG, to lower-layer submodules; this is handled by the Sequencer submodule in RE. 4) Finally, to communicate with the NIB, RE has a submodule called RENIBEventHandler, which routes messages received from the NIB to the other submodules.

Drainer. This module is responsible for sequencing and scheduling the drain/undrain requests from other applications. It has two submodules, a Sequencer and a NIBEventHandler.

Topology expander and upgrader. The topology expander generates plans to expand the network by adding more switches [91, 86], while the topology upgrader generates plans to upgrade switch firmware [75]. These applications can submit drain/undrain requests to the Drainer, and up/down requests to the topology manager.

4.1.3.2 Control Plane Core Modules

Each module is a single process or virtual machine located on a commodity server. Each module has multiple submodules, based on the services the module provides, and each submodule is a thread residing inside the module's process. All submodules inside a single module access the same copy of the data. For each module, we have a dedicated failover submodule to handle mastership transitions due to failures.

Network information base (NIB). The NIB, a logically centralized in-memory database, is an essential design choice in the control plane [29, 46]. The motivation for the NIB is multi-pronged. First, it provides a central communication point for all other modules, such as RE and OFC; to this end, the NIB maintains multiple event handlers to communicate with those modules. Second, it serves as a virtually centralized database to store network state, which can be used for failure recovery.

OpenFlow controller (OFC). OFC is responsible for programming switches based on instructions from the NIB. To increase parallelism, OFC has multiple worker threads used to send FPIs to switches. Also, since OFC interacts with both the NIB and the switches, it has an OFCNIBEventHandler used to receive messages from the NIB, and an EventHandler used to monitor and detect switch status changes. There is also another submodule, the MonitoringServer, used to receive FPI status updates from switches.

4.1.4 Our Methodology

Challenges. Designing a robust microservices-based control plane is challenging in the face of concurrency, control-plane component failures and switch failures. It is even more challenging in the absence of distributed transactions and, in the worst case (such as a NIB failure), in the absence of durable and persistent storage. For these reasons, control plane designs are susceptible to errors resulting from distributed races, some of which are extremely subtle and triggered only by a complex combination of failures.

PlusCal/TLA+/TLC. PlusCal is a language for writing algorithms, meant to replace pseudocode with precise, testable code [1]. Such algorithms can be used as a formal specification/model of a system.
Code written in PlusCal is translated into a TLA+ model that can be checked with the TLC model checker [84]. These tools have been used successfully by companies such as Amazon and Microsoft to check the correctness of their systems [62, 2].

Our approach. In this work, we use PlusCal to specify the control plane logic/protocols running on each module, and then verify the correctness of the specification using the TLC model checker in an iterative manner. Specifically, we start with a basic design; whenever the specification violates a desired property, the model checker outputs an execution trace (a bug), from which we find the root cause, fix it, and continue this process until model checking completes without finding a violation, meaning that we have obtained a provably correct protocol. Once we obtain a correct protocol under a given failure scenario, we repeat the above process under a new failure scenario, until we find a solution that is correct under all interesting failure scenarios.
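As a rough illustration of this workflow (not part of our actual tooling; check_with_tlc and fix_specification are hypothetical placeholders for invoking TLC and for the manual root-cause-and-fix step), the iteration can be pictured as follows.

```python
def harden_protocol(spec, failure_scenarios, check_with_tlc, fix_specification):
    """Iteratively model-check `spec` under each failure scenario.
    `check_with_tlc(spec, scenario)` is assumed to return None if all desired
    properties hold, or a counterexample trace otherwise; `fix_specification`
    stands in for the (manual) bug analysis and protocol fix."""
    for scenario in failure_scenarios:
        while True:
            trace = check_with_tlc(spec, scenario)
            if trace is None:           # no violation: correct under this scenario
                break
            spec = fix_specification(spec, trace)  # analyze the bug trace, fix, re-check
    return spec
```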
Due to the availability of this ground truth, Orion can sacrifice NIB durability for higher availability [29].

4.2.2 Semantics of Switch and FPI States

The core logic of the control plane is to maintain the state machine for each FPI and each switch such that the network eventually reaches a target state. We define basic states for FPIs, switches and controllers; the definitions can be found in Table 4.1.

4.2.3 Basic DAG Programming Protocols

Given a routing event, such as a topology change, RE needs to recompute a routing solution accordingly, which requires adding/deleting/modifying multiple rules on multiple switches. To avoid blackholes or loops, those rules are programmed in a consistent manner; specifically, they are programmed according to a DAG built with various data plane constraints in mind [70].

DAG programming example. We use an example of DAG programming to illustrate how modules interact. First, the RENIBEventHandler receives a routing event from the NIB, which triggers TE to compute a DAG based on the target network state and the current network state. For simplicity, we assume that there are only two FPIs in the DAG, with FPI2 depending on FPI1. The DAG is sent to the Sequencer, which puts FPI1 into the ScheduledFPISet, sends FPI1 to the NIB and then waits for the FPI status to become FPI_DONE. After the NIB receives FPI1, it puts it into its own copy of the FPIQueue and sends the FPIQueue to OFC. After the FPIQueue is received by the NIBEventHandler at OFC, one worker picks FPI1 for scheduling: the worker updates FPI1's status to FPI_SENT, sends FPI1 to the switch and dequeues FPI1 from the FPIQueue. Note that all these operations not only update OFC's local copy but also send the update to the NIB. After the FPI is performed at the switch, the switch sends an acknowledgement back to OFC, which is handled by the OFC monitoring server; the monitoring server updates the FPI status from FPI_SENT to FPI_DONE both locally and, asynchronously, remotely at the NIB. The FPI_DONE message is then propagated back to RE so that RE schedules FPI2.

4.2.4 Basic Failover Protocols

Reconciliation. As observed by previous work [64, 29], strong consistency is not only unnecessary but also harmful for achieving high availability (i.e., low latency). The key uniqueness of the control plane is that it can always learn the ground truth of the data plane by querying switch state. Depending on the design, eventual consistency or sequential consistency has been used by previous work [64, 29]. Without strong consistency, state loss is unavoidable after failures. One way to recover is to query data plane state from switches and intents from higher-level modules and thereby rebuild the pre-failure state. This process is called reconciliation.

Failover means the process of mastership transition between replicas. Based on previous design experience [19], failover protocols are less exercised than other parts of the system and are therefore rich in interesting bugs.

Failure-pattern-oblivious failover design. The design goal is the following: a single failover protocol (with multiple pieces running on different modules) that can tolerate an arbitrary combination of module failures, for example RE-plus-OFC failure or RE-plus-NIB failure. To achieve this goal, we follow the idea of the capability readiness protocol in Orion, which is used to order the resumption of operation after a controller failover. Specifically, the readiness protocol models the order of resumption using a DAG. We use TLA+ to write the formal specification of this protocol.
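The chapter does not spell out the capability readiness protocol itself, so the following is only a hedged Python sketch of the core idea: model the resumption dependencies as a DAG and resume modules in a topological order. The dependency map shown is an assumption for illustration (it mirrors the OFC-then-NIB-then-RE ordering used later in §4.5.1), not Orion's or our specification's exact DAG.

```python
from graphlib import TopologicalSorter  # Python 3.9+

# Each module resumes only after the modules it depends on have reported
# ready. This dependency map is illustrative, not the actual readiness DAG.
RESUME_AFTER = {
    "OFC": set(),        # OFC reconciles switch state first
    "NIB": {"OFC"},      # NIB rebuilds its copy from OFC (and RE intents)
    "RE":  {"NIB"},      # RE recomputes DAGs once the NIB is ready
}

def resumption_order(deps):
    """Return a module resumption order that respects the dependency DAG."""
    return list(TopologicalSorter(deps).static_order())

if __name__ == "__main__":
    print(resumption_order(RESUME_AFTER))   # ['OFC', 'NIB', 'RE']
```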
4.2.5 Basic Management Protocols

Management operations. The datacenter topology in a production environment is not static; many management operations may be ongoing at the same time. For example, network operators need to upgrade switch firmware to roll out new features [75]. Another common management operation is topology expansion, which adds new switches to the existing topology to add capacity [86]. Such management operations are usually maintained by different teams and submitted by different SDN applications atop RE.

Network drain. To ensure safety properties such as graceful capacity degradation and blackhole freedom, each of those management operations needs to be decomposed into multiple steps. Most management operations rely heavily on one primitive completing successfully: network drain. Network drain refers to the network programming tasks that stop traffic from flowing through a network entity, which can be a single link, a single switch or a network block such as a chassis or even a pod.

Architecture. Modules that generate network operations (called management applications, such as the topology upgrader and topology expander) lie at the top layer of the SDN architecture [29]. The Drainer is the module that sits between the management applications and RE to serialize all drain/undrain requests. Besides generating drain-related requests, management applications can also generate entity up/down requests, which are submitted to the topology manager, a submodule in OFC.

Basic state machine for management operations. As mentioned, to safely execute a management operation such as turning a switch down, the switch has to be drained first, namely traffic must be redirected away from the switch; otherwise packet loss will occur. Therefore, a switch-down operation can only be performed on a drained switch. Similarly, to undrain a switch, it has to be turned up first. To enforce these dependencies, each module needs to maintain a state machine for each switch. There are four states in the state machine: SW_UP, SW_DOWN, SW_DRAINED and SW_UNDRAINED. A sketch of this state machine and its guard appears after the example below.

Invariant. Besides the invariants mentioned before, the additional invariant to check during verification is whether a protocol enforces this state machine. An example violation of this invariant would be an undrained switch being turned down directly, without being drained first.

Single topology upgrade example. We use the life of a drain request in the control plane to illustrate how these modules work together to process a drain request successfully.
0) An SDN application starts processing the request by checking the switch status in the NIB. If the switch is undrained, the drain request is submitted to the NIB; if the switch is already drained, a success notification is returned to the requester.
1) Since the Drainer subscribes to drain requests in the NIB, it gets a notification about the drain request, which it enqueues into its drain queue.
2) The Drainer puts the drain request into the NIB so that RE will receive it.
3) RE gets the drain request, computes a new DAG and sends FPIs to OFC.
4) OFC programs the switches, which drains the to-be-drained switch.
5) The switches finish the FPI programming.
6) RE sends a success notification about the drain request to the Drainer.
7) The Drainer changes the switch status to SW_DRAINED and sends a notification back to the application.
8) The application dequeues the drain request and notifies the requester.
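As noted above, a switch may only be turned down once it is drained and may only be undrained once it is up. The sketch below is one plausible Python encoding of the four-state management state machine and the guard that enforces this invariant; the transition names and the SW_DOWN-to-SW_DRAINED recovery edge are assumptions for illustration, not the specification's exact operations.

```python
# Illustrative encoding of the management state machine and its guard.
ALLOWED = {
    ("SW_UNDRAINED", "drain"):   "SW_DRAINED",
    ("SW_DRAINED",   "undrain"): "SW_UNDRAINED",
    ("SW_DRAINED",   "down"):    "SW_DOWN",       # only a drained switch may go down
    ("SW_DOWN",      "up"):      "SW_DRAINED",    # assumed: comes back up still drained
}

def apply_op(status, op):
    """Return the next switch status, or raise if the transition is illegal."""
    try:
        return ALLOWED[(status, op)]
    except KeyError:
        raise ValueError(f"illegal transition: {op} on a {status} switch")

# The invariant violation from the text: turning down an undrained switch.
# apply_op("SW_UNDRAINED", "down")   -> ValueError
```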
4.3 Failure Models

A failure model has multiple dimensions. In this section, we describe the failure models we verified. We consider two dimensions of failures: permanent vs. transient, and partial vs. complete. The assumptions we make for each type are as follows: 1) complete failures: all state is lost after recovery; 2) partial failures: state is preserved; 3) permanent failures (applicable only to switches): these require a switch recovery protocol to reboot the switches, which we do not consider in this dissertation; 4) transient failures (applicable only to switches): the switch recovers eventually.

4.3.1 Complete Switch Failure Model

Complete permanent failure. After the switch fails, due to reasons such as power-off or hardware issues, the switch stops working and the control plane detects it. After recovery, the switch starts from an empty state.

Complete transient failure. In this failure model, the switch eventually recovers after a failure (e.g., the switch reboots due to a hardware fault). After the recovery, the switch starts from an empty state. In other words, it loses all its state, including the packets in its buffers and the entries in its TCAM. The challenging part of this failure mode is that the control plane does not know whether the switch is going to recover or not.

4.3.2 Partial Transient Switch Failure Model

NIC/ASIC transient failure. In the case of a NIC/ASIC failure, all in-flight packets destined to the failed switch are dropped as the TCP connection terminates. However, the switch can still process the instructions in its buffers and generate the acknowledgements for the control plane. These acknowledgements are forwarded to the control plane once the NIC/ASIC recovers. On the control plane side, it can detect the failure similarly to the complete failure cases, but it cannot distinguish between these two failure types.

OFA transient failure. Similar to a NIC/ASIC failure, an OFA failure results in TCP connection termination, and the control plane can use this signal as an indication of switch failure. When the OFA fails, the switch might lose part of its state (e.g., if the OFA is in the middle of processing an FPI, the FPI might get lost). However, the TCAM and buffer states are not affected by this failure.

Installer transient failure. Although an installer failure does not produce a signal observable by the control plane, we assume the OFA is able to detect the installer failure and report it to the control plane. In this failure mode, the switch might lose part of its state, similar to an OFA failure. However, the TCAM and buffer states are not affected; they are preserved.

CPU transient failure. A CPU failure causes all the applications running on top of it (i.e., the OFA and the installer) to terminate. Also, after the CPU failure, all buffer state is lost. However, the TCAM state is preserved after recovery.

4.3.3 Control Plane Failure Model

In the control plane, modules and submodules can fail concurrently and sequentially.

Module failure. When a complete module fails, we need a failover protocol to recover the state in the standby replica of the failed module. The standby replica could be cold, warm or hot depending on the replication protocol used. In this section, we only consider a cold standby, since it is the worst-case scenario from the perspective of failover.

Submodule failure.
When a submodule fails, the watchdog inside each module relaunches the submodule; when the new submodule comes up, all state belonging to the module, such as the FPIQueue, remains.

Multiple failures. We also consider multiple failures happening concurrently or sequentially.

4.4 Lessons from Bugs

Each row of the table below lists a category, the lesson, the root cause, the pattern, and the representative bugs.
- State management | Lesson: careful reordering of operations within a single component can solve some bugs | Root cause: UBA (removal before action) | Pattern: ABU works because if the failure happens before the action, taking the action upon recovery is safe; if the failure happens in between, the removal is safe since the action is already done | Representative bug: Bug12
- State management | Lesson: state recording and crash recovery is the final solution | Root cause: UBA (addition before action) | Pattern: after a failure, under UBA the added state indicates the action was taken, so the real action is skipped. ABU does not work for some of these bugs because 1) a switch failure can invalidate an FPI_SENT action that was taken, so the state update for the FPI should be skipped upon recovery, which ABU cannot achieve; 2) ABU can make an intent arrive at OFC before RE updates the intent, violating the invariant that RE should be ahead of OFC in terms of FPI intents | Representative bugs: Bug8 and Bug11
- State machine design | Lesson: guard state machine transitions | Root cause: bad state transition | Pattern: FPI_DONE is overwritten by FPI_SENT | Representative bug: Bug46
- State machine design | Lesson: incorporate states for ongoing operations | Root cause: wrong state machine | Pattern: 1) an ongoing drain state (between DRAINED and UNDRAINED); 2) an ongoing failover state (between NORMAL and ABNORMAL) | Representative bug: Bug48
- Synchronization | Lesson: synchronization needs to be complete for all components after failover | Root cause: incomplete synchronization | Pattern: after the NIB is synchronized by RE and OFC, it needs to update RE and OFC after failover | Representative bug: Bug41
- Synchronization | Lesson: synchronization signals need to be precise | Root cause: inaccurate synchronization | Pattern: 1) topology and flow updates should use different signals; 2) send deltas instead of complete information | Representative bugs: Bug47 and Bug42
- Management algorithm | Lesson: drain algorithm design should be done carefully | Root cause: partially correct algorithm | Pattern: 1) consider all types of failures; 2) consider sequencing carefully | Representative bugs: Bug41 and Bug51
Table 4.2: Summary of lessons distilled from the bugs we found

During our exploration, we found over 50 interesting bugs. After careful analysis, we find that even though many bugs happen in different scenarios (e.g., under different failures), some of them share a few patterns. For example, some bugs are caused by the same root cause (e.g., a wrong order of operations) or have the same consequence (e.g., final states being overwritten by some intermediate state). In this section, we group bugs into types based on their root cause and consequence.

4.4.1 Type 1: State Management Issues

A sequential program within a single component usually involves multiple local operations. Without failures or race conditions, the order of many such operations does not matter. However, this is not true under failures or race conditions. We find several bugs that occur due to a particular ordering of operations under failures and race conditions.

Why not use distributed transactions? One might wonder why these operations could not be put into an atomic step or a transaction. The main reason is as follows: if one of the operations involves the network, a distributed transaction is needed, which is notoriously hard to implement and deviates from the design philosophy of the microservices architecture.
Additionally, some of the bugs can be easily solved by reversing the order of the two operations, without introducing additional overhead compared to using locks.

Order definition. We call a particular order of two operations update-before-action (UBA) when the local status update happens before taking the action. In the routing context, the action can be 1) sending an FPI to switches/modules or 2) locking/unlocking an FPI. The update can be 1) updating the state of the FPI/switch/lock or 2) enqueuing/dequeuing the FPI or a general message from a set. Eight bugs in our exploration happen due to UBA. Similarly, we call the order action-before-update (ABU) when the status update happens after the action is taken. Five bugs in our exploration happen due to ABU.

4.4.1.1 Bugs Caused by Update-Before-Action and Solved by Action-Before-Update

Bug12. In this bug, an FPI successfully reaches the switch all the way through RE and NIB to OFC, and the switch successfully installs the FPI. The FPI_DONE update is received by the monitoring server and dequeued from the message queue. Then the monitoring server fails. Upon recovery, the new monitoring server has no way to learn about the FPI_DONE update anymore, which leads to a deadlock as in Bug8 and Bug13. The solution is simple: process the message before dequeuing it from the message queue (ABU), so that even with the monitoring server failure, the FPI_DONE message is still in the queue upon recovery.

Bug15. In this bug, a switch fails and recovers very quickly, which generates two routing events for the control plane. The event handler at OFC processes the first routing event, SW_DOWN, in the message queue and changes the switch status to SW_DOWN. Then, after it dequeues the second routing event, SW_UP, it fails before taking any action, which leaves the switch status at SW_DOWN forever, even after recovery (SW_UP is no longer in the queue). The solution is the same as in Bug12: use ABU, i.e., process the message before dequeuing it.

Bug17. In this bug, an FPI is sent to a switch by the worker thread before the switch CPU fails. Even though the switch receives the FPI, it will not install it due to the CPU failure. Then the switch recovers, but the worker thread fails right after the DB state is cleared, without having unlocked the FPI yet. Upon recovery, since the DB state is empty, no reconciliation is done by the new worker. However, because the FPI is still locked, no other worker will ever work on it, even though it has never been installed. The solution is ABU, i.e., unlock the FPI before clearing the DB.

Bug18. In this bug, an FPI is sent out by RE right after the switch ASIC fails, and then the sequencer fails before cleaning the DB. One OFC worker thread locks the FPI and forwards it to the switch. During RE reconciliation, the FPI is removed from the ScheduledFPISet since RE is not sure whether the FPI has been sent or not. The switch recovers and the OFC event handler sets the FPI status to FPI_NONE. RE reschedules the FPI. Another worker tries to acquire the FPI lock (i.e., a race condition between workers), which is still held by the previous worker. The new worker dequeues the FPI from the FPIQueue, so the FPI is never scheduled (even though the original worker thread unlocks the FPI later), and we have a deadlock. This bug happens under a race condition between workers combined with an RE failure and a partial switch failure. The solution is to use ABU, i.e., unlock the FPI before removing the FPI from the ScheduledFPISet.

Summary.
All these bugs share a common pattern: some state is removed before taking an action, so the state is gone after the failure and the action is never taken. For these bugs, since the update deletes a message, ABU preserves the to-do state (e.g., by processing the message before dequeuing it) and is therefore sufficient to fix them during reconciliation.

4.4.1.2 Bugs Caused by Update-Before-Action and Solvable by Log-Based Reconciliation

Bug8. RE puts an FPI into the ScheduledFPISet (used to keep track of scheduled FPIs) and then fails before sending it to the NIB (the action). Upon recovery, since the FPI is in the ScheduledFPISet, it will not be scheduled again. Thus the system goes into a deadlock state: RE waits for OFC to install the FPI, but OFC never receives it. ABU does not work in this case since it would lead to Bug4. The fundamental reason ABU is not ideal is that it violates a useful invariant the system should maintain: the upper-layer module should be the authority for intents generated at its layer, such as which FPI to schedule. ABU violates this invariant because the action is taken first, so the lower layer (i.e., OFC) gets ahead of the upper layer (i.e., RE), which leads to Bug4. The solution is the following: before putting an FPI into the ScheduledFPISet, RE logs the status of the FPI scheduling to a DB; specifically, we set the FPI status to STATUS_START in the DB, and upon failover recovery, if the status of the FPI in the DB is STATUS_START, we schedule it even if it is already in the ScheduledFPISet. We call this type of reconciliation log-based reconciliation.

Bug11. An FPI reaches OFC. An OFC worker thread updates the status to FPI_SENT but fails before forwarding the FPI (the action). Upon recovery, the FPI will not be sent to the switch since the FPI status is FPI_SENT. ABU does not work in this case for the following fundamental reason: an action taken (FPI_SENT) can be invalidated by a switch failure, since the switch loses all its state, and a state update for that action should then no longer take effect. The solution is similar to Bug8, i.e., use log-based reconciliation.

Bug13. Similar to Bug8, an FPI is put into the ScheduledFPISet and arrives at the FPIQueue at OFC. But after one OFC worker dequeues the FPI (the update), it fails immediately, without sending it to the switch (the action). During reconciliation, since the FPI is already in the ScheduledFPISet and not in the FPIQueue at OFC, it will not be scheduled again, which leads to a deadlock. The solution is log-based reconciliation, as in Bug8 and Bug11.

Summary. All these bugs share a common pattern: some state is added before taking an action. After a failure, the added state indicates the action was taken, so the real action is skipped. However, ABU does not work for these bugs for the reasons mentioned above: 1) ABU can violate the invariant that RE should be ahead of OFC in terms of FPI intents, i.e., the ScheduledFPISet; 2) a switch failure can invalidate an action already taken, so the corresponding state update should also be invalidated. The solution pattern is the following: log the metadata/semantics (recording that the state is updated but the action not yet taken), then add the state, then do the action. These bugs are always solvable by log-based reconciliation since all the necessary semantics are logged before the update.
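The summary above distills the pattern "log the metadata, then add the state, then do the action." Below is a minimal Python sketch of this log-based reconciliation pattern for RE's FPI scheduling; the helpers (`db`, `scheduled_fpi_set`, `send_to_nib`) are placeholders for illustration, not the specification's actual interfaces.

```python
# Sketch of the log-based reconciliation pattern: log intent -> add local
# state -> take the action. The log, not the possibly stale local state, is
# the authority after a crash. (Clearing log entries on FPI_DONE is omitted.)

def schedule_fpi(fpi, db, scheduled_fpi_set, send_to_nib):
    db[fpi] = "STATUS_START"      # 1) log intent durably before anything else
    scheduled_fpi_set.add(fpi)    # 2) update local state
    send_to_nib(fpi)              # 3) take the action

def recover_sequencer(db, scheduled_fpi_set, send_to_nib):
    # Re-drive any FPI whose log says scheduling started, even if it already
    # appears in the ScheduledFPISet (the Bug8 fix described above).
    for fpi, status in db.items():
        if status == "STATUS_START":
            scheduled_fpi_set.add(fpi)
            send_to_nib(fpi)
```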
4.4.1.3 Bugs Caused by Action-Before-Update

Bug3. The switch ASIC fails before the OFC worker thread forwards an FPI to it and before the worker changes the FPI status. Since the switch ASIC has failed, the FPI is discarded by the switch. Then the OFC event handler processes the SW_DOWN event and suspends the switch. The switch recovers from the failure. The OFC event handler processes the event, changes the switch status to SW_RUN and updates the FPI status to FPI_NONE. Finally, the worker changes the FPI status from FPI_NONE to FPI_SENT (a race condition between the worker and the event handler). Since the FPI has not been installed on the switch while both OFC and RE think it has, this leads to a deadlock similar to Bug2. The solution is to use UBA, i.e., update the FPI status to FPI_SENT before sending it to the switch, so that FPI_NONE is not overwritten by FPI_SENT.

Bug4. RE sends an FPI to the switch before putting it into the ScheduledFPISet. OFC then works faster than RE: an OFC worker starts to work on the FPI and tries to remove it from the ScheduledFPISet shared between RE and OFC, which does not yet contain the FPI, resulting in an exception. The solution is to use UBA, i.e., put the FPI into the ScheduledFPISet before sending it to OFC.

Summary. Even though UBA solves Bug3 and Bug4, it introduces new bugs if used elsewhere (Bug8 and Bug11). Therefore, for action-update pairs (such as sending an FPI to a switch and updating the FPI status) where neither UBA nor ABU works, we should use the log-based approach mentioned before.

4.4.1.4 Takeaways

Common bug pattern. All bugs in this section are caused by the interleaving of an update-action pair with other operations under either failures or race conditions.

Summary. For some action-update pairs, such as those in §4.4.1.1, ABU should be used; it is preferable to the log-based approach, which introduces additional logging overhead (§4.5). However, for some bugs neither UBA nor ABU works, due to the effect of failures or the need to maintain a system invariant; for these we need the log-based approach, bookkeeping some ordering information for later use during reconciliation.

4.4.2 Type 2: Synchronization Issues

To complete a control plane task, such as rule installation, multiple components (within and across modules) need to synchronize, which requires correct and timely signaling between them. We found several bugs caused by either missing or imprecise signaling between components, mostly under failures and race conditions.

Bug9: no sync between OFC workers. The RE sequencer adds an FPI to the ScheduledFPISet, logs the metadata and enqueues the FPI into the FPIQueue. One worker thread starts working on the FPI. The sequencer fails and then recovers with the FPI removed from the ScheduledFPISet, which makes the FPI get rescheduled. Another worker starts working on the rescheduled FPI, which is identical to the original one. In this bug, two workers unnecessarily work on the identical FPI, since there is no synchronization between them. The solution is to use a semaphore: a worker locks the semaphore after reading the FPI, so only one worker thread has access to the FPI at any instant.

Bug41: no sync between NIB and RE after NIB failover. To compute a DAG, RE proactively sends a read request to the NIB to read the data plane state. The NIB receives the request, dequeues it and then fails before sending the read content back to RE. During NIB reconciliation, the NIB recovers by reading state from both OFC and RE. The system goes into a deadlock state: RE is waiting for a reply from the NIB and OFC is waiting for FPIs. The solution is to add a synchronization step to NIB reconciliation: after the NIB recovers, it needs to proactively notify RE about the data plane state and OFC about the FPIs from RE.

Bug22: no sync between TE and Sequencer.
Due to a switch failure event, the original DAG fed to the Sequencer becomes invalid. TE therefore computes a new DAG based on both the new topology and the ScheduledFPISet, and the new DAG contains FPI-removal instructions for the FPIs of the stale DAG. However, an FPI from the stale DAG gets scheduled right after the new DAG is computed. As a result, the FPI from the stale DAG is installed on switches, which is not the target network state specified by the new DAG. The solution is as follows. There is one Boss Sequencer and a set of Worker Sequencers. The Boss receives the DAGs and distributes them among the workers; the workers sequence and schedule the FPIs. In this architecture, when the topology changes, TE marks the original DAG as stale and waits for feedback from the Boss Sequencer. The Boss Sequencer terminates the corresponding workers and sends TERMINATE_DONE to TE. TE then computes the new DAG based on the ScheduledFPISet.

Bug42: NIB sends all RE data instead of a delta to OFC. FPI1 is successfully installed on a switch. However, the OFC worker is slow, which delays the removal of FPI1 from the FPIQueue at OFC (so FPI1 is still in the NIB). The OFC event handler sends FPI_DONE to the NIB, which forwards it to RE. RE schedules FPI2, puts it into the FPIQueue and updates the NIB remotely. The NIB sends the FPIQueue to the OFC workers. (Note that the FPIQueue contains both FPI1 and FPI2 due to the delayed removal of FPI1.) Then FPI1 is removed at both OFC and the NIB. However, the new FPIQueue arrives at OFC and OFC starts to schedule FPI1 again, which is unnecessary. The solution is to make the NIB send only delta updates to OFC to avoid unnecessary FPI scheduling. (An alternative solution is to perform a sanity check at OFC when enqueuing an FPI: if the FPI status is FPI_DONE, the FPI can be skipped. However, this approach has higher overhead than the fix above.)

Bug47: imprecise NIB update semantics. The Sequencer schedules FPIs based on two conditions: 1) the validity of the current DAG; 2) the current switch and FPI status of the current DAG. The Sequencer checks these conditions sequentially. The validity of the current DAG is configured by TE based on received data plane events. The bug happens under a slow TE and a fast Sequencer. When a SW_DOWN event happens, since TE is slow, the current DAG is not yet marked INVALID, so the Sequencer proceeds to the second check. Since NIB_UPDATE=TRUE (due to SW_DOWN), the Sequencer tries to compute new FPIs to schedule, after which NIB_UPDATE is set to FALSE. Since the NIB update was a topology change rather than an FPI status change, there are no new FPIs to schedule, and the Sequencer continues this process. However, this time the Sequencer goes into a deadlock: it keeps waiting for a new NIB update to trigger new FPIs to schedule, but since no FPIs were scheduled in the last round, no further NIB update will ever happen. Thus the new DAG is never scheduled because the Sequencer is blocked. The root cause of this bug is the ambiguous semantics of NIB_UPDATE, which does not distinguish FPI updates from switch updates. The solution is to use two flags to signal these two different events.
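The Bug47 fix above replaces a single ambiguous flag with two distinct signals. The sketch below illustrates the idea in Python; the signal and handler names are assumptions for illustration, not the specification's actual interface.

```python
from enum import Enum, auto

# Two distinct signals replace the single NIB_UPDATE flag, so a topology-only
# change can no longer consume the notification the Sequencer is waiting on
# for FPI progress.
class NibSignal(Enum):
    TOPOLOGY_CHANGED = auto()      # e.g. SW_DOWN / SW_UP
    FPI_STATUS_CHANGED = auto()    # e.g. FPI_SENT -> FPI_DONE

def on_nib_update(signal, te, sequencer):
    if signal is NibSignal.TOPOLOGY_CHANGED:
        te.invalidate_current_dag()        # TE recomputes the DAG (placeholder)
    elif signal is NibSignal.FPI_STATUS_CHANGED:
        sequencer.schedule_ready_fpis()    # Sequencer advances the DAG (placeholder)
```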
Bug49: imprecise signal of drain completion. The topology expander and upgrader can send concurrent drain and undrain requests to the Drainer. After a switch is drained, the SW_DRAINED status is propagated back to both the expander and the upgrader. Once a management application observes the target status, it proceeds to submit the next operation request in its queue. In this bug, the expander sends a drain request before the upgrader sends an undrain request. The drain completes successfully and the expander event handler updates the switch status to SW_DRAINED. However, the switch status is then changed to SW_UNDRAINED due to the undrain request submitted by the topology upgrader, and the expander never submits its next operation. The root cause of this bug is that we use only the target switch status as the signal for drain completion. Instead, we should use an explicit signal, DRAIN_OP_COMPLETION, which is not affected by switch status updates.

4.4.3 Type 3: State Machine Issues

Modules maintain partial or complete state machines (for FPIStatus, ControllerStatus and SWStatus) in a distributed manner. Such state machines should be correct in the first place, and the protocols' job is to enforce them (state A -> action -> state B). We have seen bugs caused by violations of the state machine, such as FPI_DONE being overwritten by FPI_SENT (Bug5), and bugs caused by missing critical states in the state machine itself (Bug46).

4.4.3.1 FPI State Machine Violation

Bug5. An FPI is successfully installed by a switch, and then the switch's ASIC fails. The OFC event handler detects the switch failure and suspends the switch, after which the switch recovers. The OFC event handler then changes the switch status to SW_RUN and sets the FPI status to FPI_NONE, assuming the FPI is not installed. Since the FPI status has been changed to FPI_NONE, RE schedules the FPI again. Before the new FPI is sent to the switch by OFC, the OFC monitoring server receives the FPI_DONE message for the original FPI from the switch and updates the local FPI status to FPI_DONE. However, due to the race condition between the monitoring server and the worker thread, the FPI status is then changed to FPI_SENT by the worker thread and the FPI is sent to the switch again. The root cause of this bug is that FPI_DONE should never be overwritten by FPI_SENT. The solution is the following: before scheduling an FPI, the worker needs to check the FPI status; if the FPI status is FPI_DONE, the FPI should be skipped.

Bug2. The OFC worker thread sends an FPI to a switch directly, without first updating its status to FPI_SENT. The FPI is installed on the switch and the monitoring server gets notified. The monitoring server updates the FPI status to FPI_DONE, which is later overwritten with FPI_SENT by the slow worker thread. This race condition between the monitoring server and the worker thread leads to a deadlock, since RE keeps waiting for FPI_DONE forever. There are two root causes for this bug: 1) action before update (the FPI is sent out before its status is updated); 2) FPI_DONE is overwritten by FPI_SENT. Either UBA or the state-machine check used in Bug5 solves this problem.

Bug44. A similar overwrite also happens after an OFC failover. Before the OFC failover, an FPI is sent to a switch. Due to eventual consistency, the FPI is not yet installed on the switch and the FPI_SENT notification has not yet been processed by the NIB. During failover, OFC queries the NIB for the FPIQueue and the switches for the FPI status in order to recover. Right after OFC fails over, the FPI is installed by the switch and FPI_DONE is sent back to the OFC monitoring server. But because the FPI is still in the FPIQueue after failover, it is scheduled again even though its status is already FPI_DONE. The solution is the same as for the previous bugs: check the FPI status before scheduling.

Bug46.
RE-issued reads and writes should be processed sequentially by the NIB, which is true without failures but not with concurrent NIB and OFC failures. During the NIB failover, the NIB sends RE an update about the data plane status. A read request is sent to the NIB by the sequencer, which is handled by the NIB's event handler for RE. The NIB event handler just updates its local FPI status, based on which the sequencer will afterwards compute new FPIs to send to the NIB. Since the OFC failover is still ongoing, OFC queries the latest FPIQueue status from RE via the NIB and then schedules the FPIs immediately. The FPI status is then changed to FPI_SENT at both the NIB and OFC. At this time, the RE event handler at the NIB processes the read request from RE, by which time the FPI status is FPI_SENT. Later, the FPI is installed on a switch and FPI_DONE is received by the OFC event handler at the NIB. Since both the OFC event handler and the RE event handler can send FPI status updates to RE, a race condition arises: FPI_DONE arrives at RE before FPI_SENT and gets overwritten, leading to a deadlock.

Summary. Due to the nature of distributed systems, many interleavings of operations are possible, especially under failures, and they are hard for programmers to reason about in advance. One strategy to avoid state machine violations caused by race conditions and failures is to defend in depth: whenever updating a state machine, atomically check its previous state and then perform the transition to the next state; when an illegal transition is encountered, skip it.

Bug29. In this bug, a switch fails and recovers. When RE processes the SW_UP event, two operations need to be done: step 1, change the switch status to SW_RUN; step 2, change the FPI status to FPI_NONE. The bug happens because these two steps interleave with other operations, one of which updates the FPI status to FPI_DONE. FPI_DONE is then overwritten by FPI_NONE in step 2, which leads to a permanent inconsistency between the data plane and the control plane.
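Below is a minimal sketch of the "defend in depth" guard described in the summary above: every status update atomically checks the current state and skips illegal transitions such as FPI_DONE to FPI_SENT. The transition table is an illustrative encoding, not the exact state machine in our specification.

```python
import threading

# Guarded FPI status transitions: check-and-transition under a lock, and
# silently skip illegal transitions (e.g., a late FPI_SENT after FPI_DONE).
LEGAL = {
    "FPI_NONE": {"FPI_SENT"},
    "FPI_SENT": {"FPI_DONE", "FPI_NONE"},   # FPI_NONE e.g. after a switch reset
    "FPI_DONE": set(),                      # terminal: never overwritten
}

class FpiStatus:
    def __init__(self):
        self._status = "FPI_NONE"
        self._lock = threading.Lock()

    def transition(self, new_status):
        """Atomically apply the transition; return False if it is illegal."""
        with self._lock:
            if new_status not in LEGAL[self._status]:
                return False
            self._status = new_status
            return True
```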
4.4.3.2 Wrong State Machine

Bug48: wrong state machine for drain/undrain states. Switch turn-up/turn-down requests and drain/undrain requests are submitted to the Drainer and to OFC separately. The original state machine has four switch states: SW_UP, SW_DOWN, SW_DRAINED and SW_UNDRAINED. If there is only one management application issuing management requests, our protocols work correctly. However, this is not true when multiple management applications submit requests at the same time, which causes this bug. The problem with the state machine is that, because it does not consider the time an operation takes, it does not incorporate transition states. Specifically, draining, undraining, turning up and turning down a switch all take time, which results in a transition state (we call it the ONGO state) between two final states. Without such a state, there is no way for OFC and the Drainer to know whether there is an ongoing operation. As a result, a down and an undrain can happen at the same time, and the down operation removes an undrained switch, which results in significant packet loss. The solution is therefore to define ONGO states for all operations: before the Drainer or OFC performs any operation, it queries the NIB for any ongoing operation on the relevant switch. If there is one, it aborts the operation and informs the upper layer, and the application takes further action based on its own logic; otherwise, it changes the switch status in the NIB to OP_ONGO so that other operations cannot be performed on the switch.

Bug43: wrong state machine for controller state. Initially there are only two states for a controller, CTRL_NORMAL and CTRL_ABNORMAL. If the state of a module is CTRL_ABNORMAL, other modules skip sending updates to it. However, similar to the previous bug, this state machine does not consider the transition state from abnormal to normal (i.e., failover), in which the controller can accept updates but is not yet ready to compute and schedule new FPIs. The bug unfolds under this state machine as follows. When an RE standby starts the failover process, it reads state from the NIB to recover, and only sets itself from CTRL_ABNORMAL to CTRL_NORMAL in the last step. Due to the semantics of CTRL_ABNORMAL, the NIB skips updating RE on FPI updates after RE queries it for failover. In this bug, one FPI_DONE is skipped by the NIB, which leads to a deadlock. The solution is to add a transition state, CTRL_REC, at the beginning of the failover process, so that RE does not miss any notification from the NIB.

Summary. These examples illustrate that, because operations take time, state machines should incorporate these delays, i.e., transition states should be added to the state machine.

Figure 4.2: Packet loss in the last-step parallel drain (panels: original, intermediate, last-step and target topology and routing)

4.4.4 Type 4: Wrong Algorithms

Different algorithms are needed at different components to handle 1) complex routing events from the data plane and 2) high-level management intents from the management plane. Algorithm design is challenging because the algorithm needs to be correct in all possible scenarios. In our exploration, we find that algorithms that are well designed for typical scenarios may fail under corner cases or particular failure patterns.

Bug45: hazardous drain. The drain algorithm for topology expansion is as follows: based on the required network SLO (i.e., preserving 80% of the original topology), at each step we can rewire two links, such as links 1 and 2 in Figure 4.2. We can also safely drain link 1 and link 2 in parallel under the SLO requirement, which reduces convergence time compared to a sequential drain. Specifically, even though there can be 10 in-links and only 9 out-links on superblock C, the out-links still have 90% of the capacity, which supports the maximum planned traffic. After draining links at each step, the drained links are moved to the newly added superblock D, set to undrained, and start to accommodate part of the traffic (the second subfigure in Figure 4.2). Based on the above analysis, the parallel drain is assumed to be safe for all later expansion steps. However, this is not true at the last step. As shown in the third subfigure of Figure 4.2, if links 1 and 2 are drained at the same time, the drain of link 1 can transiently complete before the drain of link 2, which leads to packet loss. The solution is to drain link 1 and link 2 sequentially: link 2 is drained first, so that all traffic is diverted to the rewired links (shown in green); then link 1 can be safely drained.

Bug7: wrong control plane event coalescing. Congestion in the control plane can sometimes result in data plane and control plane failures.
Therefore, it would be ideal if events received by the control plane could be coalesced, i.e., if processing only a few of them were sufficient. The bug we found happens when a switch fails transiently, i.e., the switch fails and recovers within a short time, which results in two control events being received by the OFC event handler. The coalescing algorithm chooses to discard the first event, SW_DOWN, and only processes the SW_UP event. However, since the current switch status is normal, no extra actions are taken, which is wrong: all state on the switch has been lost due to the transient switch-down event. Thus an FPI with FPI_SENT status at OFC will never be installed by the switch, which results in a system deadlock. The solution for this bug is to process every switch event.

Bug21: wrong TE algorithm. TE receives the SW_DOWN event but computes the new DAG based only on the topology, without considering the current data plane state. When the new DAG is installed, FPIs from both the new and the old DAG are installed on switches. The solution is to change the TE algorithm to compute the DAG based not only on the topology but also on the data plane state.

Summary. Control plane algorithms are hard to design due to tricky corner cases (Bug50) and failure scenarios (Bug51). When designing control plane algorithms, we need to consider all possible scenarios instead of relying only on typical ones. The methodology used in this section, i.e., formal specification and verification, provides an efficient way to find the defects of an algorithm that works well in typical scenarios.

4.5 Convergence Time Analysis

Convergence time (CT) is defined as the time a target data plane change (such as an FPI, a DAG or a management operation) issued by the control plane takes to take effect in the data plane. Convergence time is a critical availability metric for the routing system: the shorter the convergence time, the more available the network. In this section, we analyze the convergence time of our verified protocols in some typical scenarios. We try to answer the following questions:

- How much additional convergence time is needed to handle failures?
- How much additional convergence time is needed to handle a particular type of bug?

Methodology. In our convergence time analysis, we consider inter-process communication to be the main contributor to convergence time and use the round-trip time (RTT) as the basic unit of our analysis. We assume asynchronous RPC (as used in [29]) as the communication protocol underlying the microservices architecture.

Definitions. We define $t_{cd}$ as the RTT between the control plane (e.g., OFC) and switches, and $t_{cc}$ as the RTT between control plane processes (modules).

4.5.1 Total Protocol Convergence

CT of an FPI installation. In the monolithic architecture, the convergence time of an FPI installation is $T_{fpi} = 2t_{cd}$. In the microservices architecture, $T_{fpi} = 5t_{cc} + 2t_{cd}$. Below is the sequence of installing an FPI: 1) RE sends the FPI to the NIB ($t_{cc}$). 2) The NIB receives the FPI, processes it and sends it to OFC ($t_{cc}$). 3) OFC receives the FPI, updates the local FPI status to FPI_SENT and sends a write request to the NIB, which takes one $t_{cc}$ due to the RPC call. 4) After OFC receives the ACK from the NIB for the write request, it sends the FPI to the switch ($t_{cd}$). 5) The switch installs the FPI and sends an ACK back to OFC ($t_{cd}$). 6) OFC updates the local FPI status to FPI_DONE, which is then sent to the NIB ($t_{cc}$). 7) The NIB sends the FPI status update to RE ($t_{cc}$).
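As a quick check, summing the per-step costs of the sequence above (step numbers in the braces) reproduces the stated total:

```latex
\begin{align*}
T_{fpi} &= \underbrace{t_{cc} + t_{cc} + t_{cc}}_{\text{steps 1--3: RE$\to$NIB$\to$OFC$\to$NIB}}
         + \underbrace{t_{cd} + t_{cd}}_{\text{steps 4--5: OFC$\leftrightarrow$switch}}
         + \underbrace{t_{cc} + t_{cc}}_{\text{steps 6--7: OFC$\to$NIB$\to$RE}}
         = 5t_{cc} + 2t_{cd}.
\end{align*}
```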
CT of a DAG installation. $T_{dag} = n \cdot T_{fpi}$, where $n$ is the depth of the DAG.

CT of a management operation. The sequence of completing a drain operation issued by the topology expander is as follows: 1) The expander sends a drain request to the NIB ($t_{cc}$). 2) The NIB processes the drain request and forwards it to the Drainer ($t_{cc}$). 3) The Drainer receives the request and sends DRAIN_ONGO for the switch to the NIB ($t_{cc}$). 4) If the switch status is set to DRAIN_ONGO successfully, the NIB acknowledges the Drainer ($t_{cc}$). 5) The Drainer updates the local switch status to DRAIN_ONGO and sends the drain request to RE ($t_{cc}$). 6) RE computes the DAG to drain the switch, which takes $T_{dag}$. 7) After RE programs all related switches based on the DAG, it sends an acknowledgement to the NIB, and the NIB forwards it to the Drainer ($2t_{cc}$). 8) The Drainer updates the local switch status to DRAINED and sends the update to the NIB ($t_{cc}$). 9) The NIB updates its copy of the switch status to DRAINED and sends the update to the expander ($t_{cc}$), and the expander notifies the requester of the drain completion. Therefore, $T_m = 9t_{cc} + T_{dag}$.

CT of a DAG installation under catastrophic control plane failure. A catastrophic control plane failure is the worst-case control plane failure; its convergence time serves as an upper bound for any control plane failure. The convergence time is as follows: 1) When the standbys come up, they all try to query the components they depend on. OFC reads the data plane state synchronously, sets its state to OFC_READY and notifies the NIB about the data plane state ($t_{cc} + t_{cd}$). 2) The NIB stores the data plane state and forwards it to RE ($t_{cc}$). 3) RE recomputes a new DAG based on the data plane state and forwards FPIs to the NIB ($t_{cc}$). 4) The NIB processes the received FPIs and forwards them to OFC ($t_{cc}$). Therefore, $T_{cat} = t_{cd} + 4t_{cc}$.

4.5.2 Added Convergence Time for a Solution

For some bugs there is more than one possible solution. In this subsection, we discuss the implications of these solutions in terms of convergence time.

Log-based reconciliation. This approach logs metadata to the NIB before updating local state. Upon failure recovery, the module reads the metadata logged right before the failure (e.g., who holds the FPI lock), so that the recovered module or submodule knows what was done before the failure. Logging requires a synchronous RPC so that the log operation is known to have completed at the NIB. The overhead of log-based reconciliation has two parts. The first part lies in the normal logging operation, which takes one RTT. The second part lies in the reconciliation phase, which takes one RTT to read the metadata from the NIB. The limitation of this approach is that it stops working if the NIB fails.

Switch-side reconciliation. This approach takes one RTT to read the necessary state from all relevant switches.

Chapter 5

Related Work

This dissertation covers designs in different layers related to datacenter performance and availability. In this chapter, we present the related work in the data plane, control plane and management plane, respectively.

5.1 Static Datacenter Topology Design

Topology Design. Previous topology designs have focused on cost-effective, high-capacity and low-diameter datacenter topologies [23, 81, 76, 17, 45]. Although they achieve good performance and cost properties, the lifecycle management complexity of these topologies has not been investigated, either in the original papers or in subsequent work that has compared topologies [61, 66].
In contrast to these, we explore topology designs that have low lifecycle complexity. Recent work has explored datacenter topologies based on free-space optics [57, 32, 28, 38, 94], but because we lack operational experience with them at scale, it is harder to design and evaluate lifecycle complexity metrics for them.

Topology Expansion. Prior work has discussed several aspects of topology expansion [72, 76, 81, 25, 91]. Condor [72] permits synthesis of Clos-based datacenter topologies with declarative constraints, some of which can be used to specify expansion properties. A more recent paper [91] attempts to develop a target topology for expansion, given an existing Clos topology, that would require the least amount of link rewiring. REWIRE [25] finds target expansion topologies with the highest capacity and smallest latency, without preserving topological structure. Jellyfish [76] and Xpander [81] study the expansion properties of their topologies, but do not consider practical details of rewiring. Unlike these, our work examines lifecycle management as a whole, across different topology classes, and develops new performance-equivalent topologies with better lifecycle management properties.

5.2 Reconfigurable Datacenter Architecture

Non-blocking topologies: Previous datacenter work [9, 75, 76, 81, 86] focused on non-blocking topologies. For Clos-based designs like Jupiter [75], simple demand-oblivious routing schemes exist. Practical routing for other proposed topologies is an open question [44].

Reconfigurable designs with commercial OCS: Closest to our work is Helios [28], which augmented the spine layer with a reconfigurable OCS and segregated long flows to be routed via the OCS. Unlike Helios, Gemini leverages models from real-traffic observations and makes reconfiguration decisions on inter-pod demand, not individual flows.

Sec-level reconfigurability: Much work has focused on fine-timescale reconfiguration [28, 32, 38, 51, 67, 57, 58, 14]. In contrast, Gemini relies only on commodity hardware proven to work at large scales. It also attempts to minimize link utilizations, rather than directly minimizing FCTs, since collecting flow-size information in real time can be hard [80, 58, 60].

Robust routing design: Prior work has focused on robustness to failure [48, 77]. More relevant to Gemini is research focusing on robustness to traffic variations; some approaches [13, 12] are demand-oblivious and perform poorly relative to demand-aware approaches in practice [82]. Demand-aware approaches (e.g., [82, 88]) reconfigure based on multiple traffic matrices, as Gemini does, but focus on routing for fixed wide-area networks. Gemini jointly optimizes topology and routing, and scales to large datacenters with highly variable traffic.

5.3 Control Plane Design

TLA+ usage in practice. TLA+ has been used successfully by large companies such as Amazon and Microsoft to find design flaws in production-quality systems, as reported in [62, 2]. However, those systems are not network control planes. To our knowledge, we are the first to apply TLA+/PlusCal to check a microservices-based control plane, and based on the bug traces we found, we distill important lessons and principles for future designs.

Verification of real code. In recent years, many tools, such as Verdi, IronFleet and Yggdrasil, have been proposed for writing real systems that can be fully verified. Verdi [83] can only verify safety properties.
However, liveness is an important property for a network control plane, since it can significantly impact network availability. For other works, such as IronFleet [39] and Yggdrasil [74], it is not clear whether those techniques are scalable enough for production-quality control planes with tens of thousands of lines of code [46].

Formal methods in networking. Formal methods have gained a lot of attention in the networking community. NICE [20] uses model checking and symbolic execution to test SDN application logic. SDNRacer [27] and [73] can detect data plane inconsistencies from execution traces of the control plane and data plane. One common problem with these works is that the network control plane they consider is quite different from the microservices-based designs used in production [29]. Our work focuses on the design and verification of microservices-based control planes like Orion.

Chapter 6

Conclusion

Network availability is a big challenge for large cloud and content providers due to their scale and complexity. Our work is an attempt to improve network availability. In this chapter, we summarize our contributions in §6.1 and discuss future work in §6.2.

6.1 Summary of Contributions

Exploring a new dimension for datacenter topology design: lifecycle management. We propose a set of metrics for lifecycle management complexity based on first principles. These metrics can be used for future datacenter topology design and evaluation. We propose a new family of datacenter topologies, FatClique, which achieves high capacity, low cost and low management complexity compared to the existing state of the art. Evaluations of these topology classes at three different scales, the largest of which is 16x the size of Jupiter, show that FatClique is the best at most scales by all our complexity metrics. It uses 50% fewer switches and 33% fewer patch panels than Clos at large scale, and has a 23% lower cabling cost (an estimate we are able to derive from published cable prices). Finally, FatClique permits fast expansion while degrading network capacity by small amounts (2.5-10%): at these levels, Clos can take 5x longer to expand the topology. Due to this lower management complexity, such as faster topology expansion and deployment, network availability can be improved.

Reconfigurable datacenter architecture using commodity hardware. We propose a new datacenter architecture using commodity hardware, i.e., patch panels or commercial optical circuit switches. To handle traffic bursts under practical reconfigurability, we propose a scalable approach for joint topology and routing optimization. Data from tens of production fabrics allows us to categorize datacenters as either low- or high-volatility; these categories seem stable. For the former, Gemini finds topologies and routing with near-optimal performance and cost. For the latter, Gemini's use of multi-traffic-matrix optimization and hedging avoids the need for frequent topology reconfiguration, with only marginal increases in path length. As a result, Gemini can support existing workloads on these production fabrics using a spine-free topology that is half the cost of the existing topology on these datacenters.

Highly available, correct-by-design network control plane. To obtain a correct-by-design control plane, we formally and systematically verify its specifications under various failure scenarios. From the design bugs we found, we distill critical lessons and principles for control plane designers.

Simulation and prototyping.
In FatClique, we evaluate different topology families using simulations that model different parts of datacenter networks, such as cable trays, datacenter rack space and patch panels. Gemini has been systematically evaluated using a half-year traffic trace from tens of datacenters at Google, and a prototype has been built to validate the idea and the algorithms.

6.2 Future Work

This dissertation leaves the following open issues and directions for future work.

Practical routing for FatClique. FatClique is a new family of topologies that achieves low cost, low management complexity and high bisection bandwidth. However, how to design practical high-performance routing over FatClique is still an open problem. Since FatClique does not have as much path redundancy as Clos, a simple static routing scheme like WCMP might not work well under dynamic traffic. Based on previous research [44], a simple combination of shortest-path routing and Valiant load balancing does not work for expander-based topologies. Adaptive routing [56] that takes practical constraints into account might be a promising solution.

Metrics to capture the control plane complexity of different topology families. A topology design change might impact the control plane design significantly. Modern control plane designs such as Orion [29] rely on modularity to reduce the failure blast radius. How to design control domains to reduce the failure blast radius and routing convergence time is an interesting research direction.

Understanding the relationship between network-side metrics and host-side metrics. For many practical reasons, network operators might design and operate the network based on network-side metrics, which are different from the host-side metrics that end hosts really care about. Network-side metrics like link utilization (usually measured over a time window of seconds to minutes) and path stretch are easy to measure and intuitive to use as design metrics. However, the relationship between those network-side metrics and host-side metrics, such as flow completion time, is still an open question. Establishing this relationship (if possible) could help network operators design and operate the network better. Establishing it is not easy, for many reasons: 1) the metrics operate at different time scales; 2) many other confounding factors might affect the relationship, such as the topology, routing, the congestion control algorithm used, and the traffic distribution.

Host and datacenter network co-design. Gemini designs the topology and routing without relying on any support from the end hosts. However, if there are many microbursts, the performance of slow-timescale topology and traffic engineering under practical reconfigurability might not be optimal. To handle microbursts, end hosts could rate-limit flows based on their priority or quality of service (QoS), so that latency-sensitive flows always go through shorter paths or are even provisioned with additional capacity. With end-host support, Gemini could achieve better performance.

Transformation from formally verified specifications to real code. There is a big gap between specification and implementation; errors can arise while implementing a system from its specification. How to transform specifications into a correct implementation is an interesting research direction. Techniques proposed in [39, 83, 74] could potentially be leveraged to fill this gap.
However, some challenges remain: 1) verification of liveness properties during the transformation from specification to implementation needs to be solved; 2) more scalable techniques might be needed for production-grade network control planes; 3) techniques are needed to take network operators out of the verification loop, since they might not have expertise in verification.

Bibliography

[1] A High-Level View of TLA+. https://lamport.azurewebsites.net/tla/high-level-view.html.
[2] Industrial Use of TLA+.
[3] Adopting Microservices at Netflix: Lessons for Architectural Design. https://www.nginx.com/blog/microservices-at-netflix-architectural-best-practices/.
[4] Introducing Domain-Oriented Microservice Architecture. https://eng.uber.com/microservice-architecture/.
[5] What are microservices? https://microservices.io/.
[6] What Led Amazon to its Own Microservices Architecture. https://thenewstack.io/led-amazon-microservices-architecture/.
[7] Anubhavnidhi Abhashkumar, Kausik Subramanian, Alexey Andreyev, Hyojeong Kim, Nanda Kishore Salem, Jingyi Yang, Petr Lapukhov, Aditya Akella, and Hongyi Zeng. Running BGP in data centers at scale. In 18th USENIX Symposium on Networked Systems Design and Implementation (NSDI 21), pages 65–81. USENIX Association, April 2021.
[8] J. H. Ahn, N. Binkert, A. Davis, M. McLaren, and R. S. Schreiber. HyperX: Topology, routing, and packaging of efficient large-scale networks. In Proc. SC09, 2009.
[9] M. Al-Fares, A. Loukissas, and A. Vahdat. A scalable, commodity data center network architecture. In Proc. ACM SIGCOMM, 2008.
[10] A. Andreyev. Introducing data center fabric, the next-generation Facebook data center network. https://code.fb.com/production-engineering/introducing-data-center-fabric-the-next-generation-facebook-data-center-network/.
[11] A. Andreyev. Introducing data center fabric, the next-generation Facebook data center network. https://engineering.fb.com/production-engineering/introducing-data-center-fabric-the-next-generation-facebook-data-center-network/, 2014.
[12] David Applegate, Lee Breslau, and Edith Cohen. Coping with network failures: Routing strategies for optimal demand oblivious restoration. In Proc. ACM SIGMETRICS, 2004.
[13] David Applegate and Edith Cohen. Making intra-domain routing robust to changing and uncertain traffic demands: Understanding fundamental tradeoffs. In Proc. ACM SIGCOMM, 2003.
[14] Hitesh Ballani, Paolo Costa, Raphael Behrendt, Daniel Cletheroe, Istvan Haller, Krzysztof Jozwik, Fotini Karinou, Sophie Lange, Kai Shi, Benn Thomsen, and Hugh Williams. Sirius: A flat datacenter network with nanosecond optical switching. In Proceedings of the Annual Conference of the ACM Special Interest Group on Data Communication on the Applications, Technologies, Architectures, and Protocols for Computer Communication, SIGCOMM '20, 2020.
[15] Theophilus Benson, Aditya Akella, and David A. Maltz. Network traffic characteristics of data centers in the wild. In Proceedings of the 10th ACM SIGCOMM Conference on Internet Measurement, IMC '10, 2010.
[16] Pankaj Berde, Matteo Gerola, Jonathan Hart, Yuta Higuchi, Masayoshi Kobayashi, Toshio Koide, Bob Lantz, Brian O'Connor, Pavlin Radoslavov, William Snow, and Guru Parulkar. ONOS: Towards an open, distributed SDN OS. In Proceedings of the Third Workshop on Hot Topics in Software Defined Networking, HotSDN '14, New York, NY, USA, 2014. Association for Computing Machinery.
[17] M. Besta and T. Hoefler.
Slim fly: A cost effective low-diameter network topology. In Proc. SC14, 2014. [18] Broadcom Inc. Broadcom Tomahawk Swiching chips. https://www.broadcom.com/ products/ethernet-connectivity/switching/strataxgs/bcm56960-series. [19] Mike Burrows. The chubby lock service for loosely-coupled distributed systems. In Proceedings of the 7th Symposium on Operating Systems Design and Implementation, OSDI ’06, 2006. [20] Marco Canini, Daniele Venzano, Peter Peresini, Dejan Kostic, Jennifer Rexford, et al. A nice way to test openflow applications. In NSDI, volume 12, pages 127–140, 2012. [21] J. Case, R. Mundy, D. Partain, and B. Stewart. Rfc3410: Introduction and applicability statements for internet-standard management framework. Technical report, USA, 2002. [22] Sheshayya A Choudum. A simple proof of the Erdos-Gallai theorem on graph sequences. Bulletin of the Australian Mathematical Society, 33(1):67–70, 1986. [23] C. Clos. A study of non-blocking switching networks. The Bell System Technical Journal, 32(2):406–424, March 1953. [24] Colfax International. Colfax direct. http://www.colfaxdirect.com. [25] Andrew R. Curtis, Tommy Carpenter, Mustafa Elsheikh, Alejandro López-Ortiz, and Srini- vasan Keshav. Rewire: An optimization-based framework for unstructured data center network design. In Proc. IEEE INFOCOMM, 2012. [26] Nandita Dukkipati and Nick McKeown. Why flow-completion time is the right metric for congestion control. SIGCOMM Comput. Commun. Rev., 36(1):59–62, January 2006. [27] Ahmed El-Hassany, Jeremie Miserez, Pavol Bielik, Laurent Vanbever, and Martin Vechev. Sdnracer: Concurrency analysis for software-defined networks. In Proceedings of the 37th ACM SIGPLAN Conference on Programming Language Design and Implementation, PLDI ’16, page 402–415, New York, NY, USA, 2016. Association for Computing Machinery. [28] N. Farrington, G. Porter, S. Radhakrishnan, H. H. Bazzaz, V. Subramanya, Y. Fainman, G. Papen, and A. Vahdat. Helios: A hybrid electrical/optical switch architecture for modular data centers. In Proc. ACM SIGCOMM, 2010. 117 [29] Andrew D. Ferguson, Steve Gribble, Chi-Yao Hong, Charles Killian, Waqar Mohsin, Henrik Muehe, Joon Ong, Leon Poutievski, Arjun Singh, Lorenzo Vicisano, Richard Alimi, Shawn Shu- oshuo Chen, Mike Conley, Subhasree Mandal, Karthik Nagaraj, Kondapa Naidu Bollineni, Amr Sabaa, Shidong Zhang, Min Zhu, and Amin Vahdat. Orion: Google’s software-defined networking control plane. In 18th USENIX Symposium on Networked Systems Design and Implementation (NSDI 21), pages 83–98. USENIX Association, April 2021. [30] B. Fortz and M. Thorup. Internet traffic engineering by optimizing OSPF weights. In Proc. IEEE INFOCOM, 2000. [31] FS.COM. 96 Fibers 12x MTP/MPO-8 to LC/UPC Single Mode 1U 40GB QSFP+ Breakout Patch Panel Flat. https://www.fs.com/products/43552.html. [32] M. Ghobadi, R. Mahajan, A. Phanishayee, N. Devanur, J. Kulkarni, G. Ranade, P.-A. Blanche, H. Rastegarfar, M. Glick, and D. Kilper. Projector: Agile reconfigurable data center interconnect. In Proc. ACM SIGCOMM, 2016. [33] R. Govindan, I. Minei, M. Kallahalla, B. Koley, and A. Vahdat. Evolve or die: High-availability design principles drawn from googles network infrastructure. In Proc. ACM SIGCOMM, 2016. [34] A. Greenberg, J. R. Hamilton, N. Jain, S. Kandula, C. Kim, P. Lahiri, D. A. Maltz, P. Patel, and S. Sengupta. Vl2: a scalable and flexible data center network. In Proc. ACM SIGCOMM, 2009. [35] C. Guo, G. Lu, D. Li, H. Wu, X. Zhang, Y. Shi, C. Tian, Y. Zhang, and S. Lu. 
BCube: A High Performance, Server-centric Network Architecture for Modular Data Centers. In Proc. ACM SIGCOMM, 2008. [36] C. Guo, H. Wu, K. Tan, L. Shi, Y. Zhang, and S. Lu. DCell: A Scalable and Fault-tolerant Network Structure for Data Centers. In Proc. ACM SIGCOMM, 2008. [37] S Louis Hakimi. On realizability of a set of integers as degrees of the vertices of a linear graph. I. Journal of the Society for Industrial and Applied Mathematics, 10(3):496–506, 1962. [38] N. Hamedazimi, Z. Qazi, H. Gupta, V. Sekar, S. R. Das, J. P. Longtin, H. Shah, and A. Tanwer. Firefly: A reconfigurable wireless data center fabric using free-space optics. In Proc. ACM SIGCOMM, 2014. [39] Chris Hawblitzel, Jon Howell, Manos Kapritsos, Jacob R. Lorch, Bryan Parno, Michael L. Roberts, Srinath Setty, and Brian Zill. Ironfleet: Proving practical distributed systems correct. In Proceedings of the 25th Symposium on Operating Systems Principles, SOSP ’15, page 1–17, New York, NY, USA, 2015. Association for Computing Machinery. [40] Shlomo Hoory, Nathan Linial, and Avi Wigderson. Expander graphs and their applications. Bull. Amer. Math. Soc., 43(04):439–562, August 2006. [41] X. Jin, H. H. Liu, R. Gandhi, S. Kandula, R. Mahajan, M. Zhang, J. Rexford, and R. Wat- tenhofer. Dynamic scheduling of network updates. In Proc. ACM SIGCOMM, SIGCOMM ’14, 2014. [42] Xin Jin, Hongqiang Harry Liu, Rohan Gandhi, Srikanth Kandula, Ratul Mahajan, Ming Zhang, Jennifer Rexford, and Roger Wattenhofer. Dynamic scheduling of network updates. In Proceedings of the 2014 ACM Conference on SIGCOMM, SIGCOMM ’14, page 539–550, New York, NY, USA, 2014. Association for Computing Machinery. 118 [43] G. Karypis and V. Kumar. A fast and high quality multilevel scheme for partitioning irregular graphs. SIAM J. Sci. Comput., 1998. [44] S. Kassing, A. Valadarsky, G. Shahaf, M. Schapira, and A. Singla. Beyond fat-trees without antennae, mirrors, and disco-balls. In Proc. ACM SIGCOMM, 2017. [45] J. Kim, W. J. Dally, S. Scott, and D. Abts. Technology-driven, highly-scalable dragonfly topology. In 2008 International Symposium on Computer Architecture, 2008. [46] Teemu Koponen, Martin Casado, Matasha Gude, Jeremy Stribling, Leon Poutevski, Min Zhu, Rajiv Ramanathan, Yuichiro Iwata, Hiroaki Inoue, Takayuki Hama, and Scott Shenker. Onix: A distributed control platform for large-scale production networks. In OSDI, 2010. [47] Gautam Kumar, Nandita Dukkipati, Keon Jang, Hassan M. G. Wassel, Xian Wu, Behnam Montazeri, Yaogong Wang, Kevin Springborn, Christopher Alfeld, Michael Ryan, David Wetherall, and Amin Vahdat. Swift: Delay is simple and effective for congestion control in the datacenter. In Proc. ACM SIGCOMM, 2020. [48] Praveen Kumar, Yang Yuan, Chris Yu, Nate Foster, Robert Kleinberg, Petr Lapukhov, Chiun Lin Lim, and Robert Soulé. Semi-oblivious traffic engineering: The road not taken. In Proc. NSDI, 2018. [49] D. H. Lawrie. Access and alignment of data in an array processor. IEEE Trans. Computers, C-24(12):1145–1155, Dec 1975. [50] H. H. Liu, X. Wu, M. Zhang, L. Yuan, R. Wattenhofer, and D. Maltz. zupdate: Updating data center networks with zero loss. In Proc. ACM SIGCOMM, 2013. [51] He Liu, Matthew K. Mukerjee, Conglong Li, Nicolas Feltman, George Papen, Stefan Savage, Srinivasan Seshan, Geoffrey M. Voelker, David G. Andersen, Michael Kaminsky, George Porter, and Alex C. Snoeren. Scheduling techniques for hybrid circuit/packet networks. In Proc. ACM CoNEXT, 2015. [52] Hong Liu, Ryohei Urata, Xiang Zhou, and Amin Vahdat. 
Evolving Requirements and Trends of Datacenters Networks. Springer International Publishing, 2020. [53] V. Liu, D. Halperin, A. Krishnamurthy, and T. Anderson. F10: A fault-tolerant engineered network. In Proc. USENIX NSDI, 2013. [54] G. Luan. Estimating tcp flow completion time distributions. Journal of Communications and Networks, 21(1):61–68, 2019. [55] S. Mandal. Lessons learned from b4, google’s sdn wan. https://www.usenix. org/sites/default/files/conference/protected-files/atc15_slides_ mandal.pdf. [56] Nie McDonald, Mikhail Isaev, Adriana Flores, Al Davis, and John Kim. Practical and efficient incremental adaptive routing for hyperx networks. In Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis, SC ’19, New York, NY, USA, 2019. Association for Computing Machinery. [57] W. M. Mellette, R. McGuinness, A. Roy, A. Forencich, G. Papen, A. C. Snoeren, and G. Porter. Rotornet: A scalable, low-complexity, optical datacenter network. In Proc. ACM SIGCOMM, 2017. 119 [58] William M. Mellette, Rajdeep Das, Yibo Guo, Rob McGuinness, Alex C. Snoeren, and George Porter. Expanding across time to deliver bandwidth efficiency and low latency. In Proc. NSDI, 2020. [59] J. Mitchell. What are Patch Panels & When to Use Them? https://www. lonestarracks.com/news/2016/10/28/patch-panels/. [60] Tatsuya Mori, Masato Uchida, Ryoichi Kawahara, Jianping Pan, and Shigeki Goto. Identi- fying elephant flows through periodically sampled packets. In Proceedings of the 4th ACM SIGCOMM Conference on Internet Measurement, IMC ’04, 2004. [61] J. Mudigonda, P. Yalagandula, and J. C. Mogul. Taming the flying cable monster: A topology design and optimization framework for data-center networks. In Proc. USENIX ATC, 2011. [62] Chris Newcombe, Tim Rath, Fan Zhang, Bogdan Munteanu, Marc Brooker, and Michael Deardeuff. How amazon web services uses formal methods. Commun. ACM, 58(4):66–73, March 2015. [63] Jitendra Padhye, Victor Firoiu, Don Towsley, and Jim Kurose. Modeling TCP Throughput: A Simple Model and Its Empirical Validation. In Proc. SIGCOMM, page 303–314, 1998. [64] Aurojit Panda, Wenting Zheng, Xiaohe Hu, Arvind Krishnamurthy, and Scott Shenker. SCL: Simplifying distributed SDN control planes. In 14th USENIX Symposium on Networked Systems Design and Implementation (NSDI 17), pages 329–345, Boston, MA, March 2017. USENIX Association. [65] Julius Petersen et al. Die theorie der regulären graphs. Acta Mathematica, 15:193–220, 1891. [66] L. Popa, S. Ratnasamy, G. Iannaccone, A. Krishnamurthy, and I. Stoica. A cost comparison of datacenter network architectures. In Proceedings of the 6th International COnference, Co-NEXT ’10, 2010. [67] George Porter, Richard D. Strong, Nathan Farrington, Alex Forencich, Pang-Chen Sun, Tajana Rosing, Yeshaiahu Fainman, George Papen, and Amin Vahdat. Integrating microsecond circuit switching into the data center. In Proc. ACM SIGCOMM, 2013. [68] RackSolutions. Open Frame Server Racks. https://www.racksolutions.com/ server-racks-cabinets-enclosures.html. [69] Rackspace US, INC. The Rackspace Cloud. www.rackspacecloud.com. [70] Mark Reitblatt, Nate Foster, Jennifer Rexford, Cole Schlesinger, and David Walker. Abstrac- tions for network update. In Proc. ACM SIGCOMM, 2012. [71] Arjun Roy, Hongyi Zeng, Jasmeet Bagga, George Porter, and Alex C. Snoeren. Inside the social network’s (datacenter) network. In Proceedings of the 2015 ACM Conference on Special Interest Group on Data Communication, SIGCOMM ’15, 2015. [72] B. Schlinker, R. N. 
Mysore, S. Smith, J. C. Mogul, A. Vahdat, M. Yu, E. Katz-Bassett, and M. Rubin. Condor: Better topologies through declarative design. In Proc. USENIX NSDI, 2015. [73] Colin Scott, Andreas Wundsam, Barath Raghavan, Aurojit Panda, Andrew Or, Jefferson Lai, Eugene Huang, Zhi Liu, Ahmed El-Hassany, Sam Whitlock, H.B. Acharya, Kyriakos Zarifis, and Scott Shenker. Troubleshooting blackbox sdn control software with minimal causal sequences. In Proceedings of the 2014 ACM Conference on SIGCOMM, SIGCOMM ’14, page 395–406, New York, NY, USA, 2014. Association for Computing Machinery. 120 [74] Helgi Sigurbjarnarson, James Bornholt, Emina Torlak, and Xi Wang. Push-button verification of file systems via crash refinement. In 12th USENIX Symposium on Operating Systems Design and Implementation (OSDI 16), pages 1–16, Savannah, GA, November 2016. USENIX Association. [75] A. Singh, J. Ong, A. Agarwal, G. Anderson, A. Armistead, R. Bannon, S. Boving, G. Desai, B. Felderman, P. Germano, A. Kanagala, J. Provost, J. Simmons, E. Tanda, J. Wanderer, U. Hölzle, S. Stuart, and A. Vahdat. Jupiter rising: A decade of clos topologies and centralized control in google’s datacenter network. In Proc. ACM SIGCOMM, 2015. [76] A. Singla, C.-Y. Hong, L. Popa, and P. B. Godfrey. Jellyfish: Networking data centers randomly. In Proc. USENIX NSDI, 2012. [77] Martin Suchara, Dahai Xu, Robert Doverspike, David Johnson, and Jennifer Rexford. Network architecture for joint failure recovery and traffic engineering. In Proc. ACM SIGMETRICS, 2011. [78] M. Y. Teh, J. J. Wilke, K. Bergman, and S. Rumley. Design space exploration of the dragonfly topology. In ISC Workshops, 2017. [79] The Siemon Company. Trunk Cable Planning & Installation Guide. https://www.siemon. com/us/white_papers/07-09-24-trunk-cable-planning-installation.asp. [80] Vojislav Ðukić, Sangeetha Abdu Jyothi, Bojan Karlas, Muhsen Owaida, Ce Zhang, and Ankit Singla. Is advance knowledge of flow sizes a plausible assumption? In 16th USENIX Symposium on Networked Systems Design and Implementation (NSDI 19), 2019. [81] A.Valadarsky, G.Shahaf, M.Dinitz, andM.Schapira. Xpander: Towardsoptimal-performance datacenters. In Proc. ACM CoNEXT, 2016. [82] Hao Wang, Haiyong Xie, Lili Qiu, Yang Richard Yang, Yin Zhang, and Albert Greenberg. Cope: Traffic engineering in dynamic networks. In Proc. ACM SIGCOMM, 2006. [83] James R. Wilcox, Doug Woos, Pavel Panchekha, Zachary Tatlock, Xi Wang, Michael D. Ernst, and Thomas Anderson. Verdi: A framework for implementing and formally verifying distributed systems. In Proceedings of the 36th ACM SIGPLAN Conference on Programming Language Design and Implementation, PLDI ’15, page 357–368, New York, NY, USA, 2015. Association for Computing Machinery. [84] Yuan Yu, Panagiotis Manolios, and Leslie Lamport. Model checking tla+ specifications. In In Correct Hardware Design and Verification Methods (CHARME ’99), Laurence Pierre and Thomas Kropf editors. Lecture Notes in Computer Science, Springer-Verlag., volume 1703, pages 54–66, June 1999. [85] M. R. Zargham. Computer Architecture: Single and Parallel Systems. Prentice Hall, 1996. [86] MingyangZhang, RadhikaNiranjanMysore, SuchaSupittayapornpong, andRameshGovindan. Understanding lifecycle management complexity of datacenter topologies. In Proc. NSDI, 2019. [87] Qiao Zhang, Vincent Liu, Hongyi Zeng, and Arvind Krishnamurthy. High-resolution measure- ment of data center microbursts. In Proc. ACM SIGCOMM IMC, 2017. [88] Y. Zhang and Z. Ge. Finding critical traffic matrices. 
In 2005 International Conference on Dependable Systems and Networks (DSN'05), 2005.
[89] R. Zhang-Shen and N. McKeown. Designing a predictable internet backbone with valiant load-balancing. In Proc. IEEE IWQoS, 2005.
[90] R. Zhang-Shen and N. McKeown. Guaranteeing quality of service to peering traffic. In Proc. IEEE INFOCOM, 2008.
[91] S. Zhao, R. Wang, J. Zhou, J. Ong, J. Mogul, and A. Vahdat. Minimal rewiring: Efficient live expansion for Clos data center networks. In Proc. USENIX NSDI, 2019.
[92] Shizhen Zhao, Rui Wang, Junlan Zhou, Joon Ong, Jeffrey C. Mogul, and Amin Vahdat. Minimal rewiring: Efficient live expansion for Clos data center networks. In 16th USENIX Symposium on Networked Systems Design and Implementation (NSDI 19), 2019.
[93] Junlan Zhou, Malveeka Tewari, Min Zhu, Abdul Kabbani, Leonid B. Poutievski, Arjun Singh, and Amin Vahdat. WCMP: Weighted cost multipathing for improved fairness in data centers. In Proc. EuroSys, 2014.
[94] X. Zhou, Z. Zhang, Y. Zhu, Y. Li, S. Kumar, A. Vahdat, B. Y. Zhao, and H. Zheng. Mirror mirror on the ceiling: Flexible wireless links for data centers. In Proc. ACM SIGCOMM, 2012.
[95] D. Zhuo, M. Ghobadi, R. Mahajan, K.-T. Förster, A. Krishnamurthy, and T. Anderson. Understanding and mitigating packet corruption in data center networks. In Proc. ACM SIGCOMM, 2017.
[96] Danyang Zhuo, Monia Ghobadi, Ratul Mahajan, Amar Phanishayee, Xuan Kelvin Zou, Hang Guan, Arvind Krishnamurthy, and Thomas Anderson. RAIL: A case for redundant arrays of inexpensive links in data center networks. In Proc. NSDI, 2017.

Appendix A
FatClique Algorithms and Analysis

A.1 Clos Generation Algorithm

For Clos topologies, the canonical recursive algorithm in [85] can only generate non-modular topologies, as shown in Figure A.1. In practice, as shown in Jupiter [75], the topology is composed of heterogeneous building blocks (chassis), which are packed into a single rack and therefore enforce port hiding (the idea that as few ports as possible from a rack are exposed outside the rack). Although Jupiter is modular and supports port hiding, it is a single instance of a Clos-like topology with a specific set of parameters. We seek an algorithm that can take any valid set of Clos parameters and produce chassis-based topologies automatically. In addition, it would be desirable for this algorithm to generate all possible feasible topologies satisfying the parameters, so we can select the one that is most compactly packed. Our logical Clos generation algorithm achieves these goals. Specifically, the algorithm uses the following steps:

1. Compute the total number of layers of homogeneous switching chips needed. Namely, given N servers and radix-k switches, we use n = log_{k/2}(N/2) to compute the number of layers of chips, n.

2. Determine the number of layers of chips for the edge, aggregation, and core layers, represented by e, a, and s respectively, such that e + a + s = n.

3. Identify blocks for the edge, aggregation, and core layers. Clos networks rely on every edge being able to reach every spine through exactly one path, by fanning out via as many different aggregation blocks as possible (and vice versa). We find that the resulting interconnection is a derivative of the classical perfect-shuffle Omega network ([49], e.g., the aggregation blocks in Figure A.2 and Figure A.3). Therefore, we use Omega networks to build both the edge and aggregation blocks, and to define the connections between the edge-aggregation and aggregation-spine layers. The spine block, on the other hand, needs to be rearrangeably non-blocking, so it can relay flows from any edge to any other edge at full capacity. Therefore, it is built as a smaller Clos topology [23] (e.g., the spine blocks in Figure A.2).

4. Compose the whole network using the edge, aggregation, and core blocks. The process links all these blocks together and uses the same procedure as Jupiter [75].

We have verified that topologies generated by our construction algorithm, such as the ones in Figure A.2 and Figure A.3, are isomorphic to a topology generated using the canonical algorithm in Figure A.1. By changing different combinations of e, a, and s, we can obtain multiple candidate topologies, as shown in Figure A.2 and Figure A.3.

Figure A.1: Recursive Construction (an N=64 Clos built recursively from N=32, N=16, and N=8 Clos)
Figure A.2: Block-Based Construction 1 (N=64 Clos; edge: 1 switch; aggregation: 2-layer Omega, blocking; spine: 2-layer Clos)
Figure A.3: Block-Based Construction 2 (N=64 Clos; edge: 1 switch; aggregation: 3-layer Omega, blocking; spine: 1 switch)
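To make steps 1 and 2 concrete, the following Python sketch computes the number of chip layers for a given server count and radix, and enumerates the candidate (e, a, s) splits. The function names and the assumption that each of e, a, and s is at least one layer are illustrative choices, not part of the original generator.

    import math

    def num_chip_layers(num_servers: int, radix: int) -> int:
        """Number of layers n of radix-k chips needed for N servers: n = log_{k/2}(N/2)."""
        n = math.log(num_servers / 2, radix / 2)
        return math.ceil(round(n, 9))  # guard against floating-point noise

    def layer_splits(n: int):
        """Enumerate candidate (edge, aggregation, spine) layer counts with e + a + s = n,
        assuming each of e, a, s is at least one layer (an illustrative constraint)."""
        for e in range(1, n - 1):
            for a in range(1, n - e):
                s = n - e - a
                if s >= 1:
                    yield (e, a, s)

    if __name__ == "__main__":
        n = num_chip_layers(num_servers=64, radix=4)  # the N=64, k=4 example of Figures A.2/A.3
        print(n)                                      # -> 5
        print(list(layer_splits(n)))                  # includes (1, 2, 2) and (1, 3, 1)

The splits (1, 2, 2) and (1, 3, 1) correspond to the two block-based constructions in Figure A.2 and Figure A.3.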
A.2 Jellyfish Placement Algorithm

For Jellyfish, we use a heuristic random-search algorithm to place switches and servers. The algorithm works as follows. At each stage of the algorithm, a node can be in one of two states: placed or un-placed. A placed node is one that has been positioned in a rack. Each step of the algorithm randomly selects an un-placed node. If the selected node has logical neighbors that have already been placed, we place this node at the centroid of the area formed by its placed logical neighbors. If no placed neighbor exists, the algorithm randomly selects a rack in which to place the node. We have also tried other heuristics like neighbor-first, which tries to place a switch's logical neighbors as close as possible around it; however, this performs worse than our algorithm.
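A minimal Python sketch of this placement heuristic is shown below. The 2-D rack coordinates, the rack-capacity check, and the nearest-rack snap are simplifying assumptions; the actual simulator models the full datacenter floor plan.

    import random
    from typing import Dict, List, Tuple

    def place_nodes(neighbors: Dict[int, List[int]],
                    rack_coords: List[Tuple[float, float]],
                    rack_capacity: int,
                    seed: int = 0) -> Dict[int, int]:
        """Heuristic random-search placement: each un-placed node is placed either at the
        centroid of its already-placed logical neighbors or, failing that, in a random rack."""
        rng = random.Random(seed)
        load = [0] * len(rack_coords)          # how many nodes already sit in each rack
        placement: Dict[int, int] = {}         # node -> rack index

        def nearest_rack_with_space(x: float, y: float) -> int:
            # assumes total rack capacity is sufficient for all nodes
            candidates = [r for r in range(len(rack_coords)) if load[r] < rack_capacity]
            return min(candidates,
                       key=lambda r: (rack_coords[r][0] - x) ** 2 + (rack_coords[r][1] - y) ** 2)

        unplaced = list(neighbors.keys())
        rng.shuffle(unplaced)                  # random selection order
        for node in unplaced:
            placed_nbrs = [placement[v] for v in neighbors[node] if v in placement]
            if placed_nbrs:
                # centroid of the racks holding the placed neighbors
                cx = sum(rack_coords[r][0] for r in placed_nbrs) / len(placed_nbrs)
                cy = sum(rack_coords[r][1] for r in placed_nbrs) / len(placed_nbrs)
                rack = nearest_rack_with_space(cx, cy)
            else:
                rack = rng.choice([r for r in range(len(rack_coords)) if load[r] < rack_capacity])
            placement[node] = rack
            load[rack] += 1
        return placement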
A.3 Scale-invariance of Expansion

Scale-invariance of Expandability for Symmetric Clos. For a symmetric Clos network, the number of expansion steps is scale-invariant and independent of the degree to which the original topology is partially deployed. Consider a simplified Clos where the original topology has g aggregation blocks. Each aggregation block has p ports for spine-aggregation links, each of which has unit capacity. Assume the worst-case traffic, in which all sources are located in the left half of the aggregation blocks and all destinations are in the right half. This network contains gp/2 crossing links between the left and right halves. If, during expansion, the network is expected to support a demand of d units of capacity per aggregation block, the total demand traversing the cut between the left and right halves in one direction is dg/2. Then, the maximum number of links that can be redistributed in an expansion step is k = gp/2 − dg/2 = g(p − d)/2, which is linear in the number of aggregation blocks (network size). This linearity between k and g implies scale-invariant expandability: e.g., when the number of aggregation blocks is doubled to 2g, the maximum number of redistributed links per expansion step becomes 2k.

Scale-invariance of Expandability for Jellyfish, Xpander, and FatClique. A random graph consists of s nodes, which is a first-order approximation for Jellyfish's switch, Xpander's metanode, and FatClique's block. Each node has p inter-node ports, so there are sp/2 inter-node links. We can treat the network as a bipartite graph. We assume the worst-case traffic matrix, where all traffic is sent from one part of the bipartite graph to the other. Suppose an expansion SLO requires each source-destination node pair to support d units of demand. Then the total demand from all sources is ds/2. The probability of a link being a cross link is 1/2, and the expected number of cross links is sp/4. These cross links are expected to be the bottleneck between the source-destination pairs. Therefore, in the first expansion step, we can redistribute at most k = sp/4 − ds/2 = s(p/4 − d/2) links, and the maximum number of redistributed links is linear in the number of nodes (network size): e.g., if the number of nodes is doubled to 2s, we can redistribute 2k links in the first step. It is easy to see that, after each expansion step, the number of links added to the bottleneck is also linear in the number of nodes, so the expandability is scale-invariant.

A.4 FatClique Expansion Algorithm

Algorithm 2 shows the expansion algorithm for FatClique. The input to the algorithm includes the original and target topologies T_o and T_n, the link break ratio α used during an expansion step, and two multipliers β < 1 and γ > 1 that are used to adjust α based on network capacity; α specifies the fraction of existing links that must be broken for re-wiring. The output of the algorithm is the expansion plan, Plan. Our expansion algorithm is an iterative trial-and-error approach. Each iteration tries to find the right number of links to break while satisfying the aggregate capacity constraint and the edge capacity constraint, which guarantees that the north-to-south capacity ratio is always at least 1 during any expansion step. If all constraints are satisfied, we accept this step and tentatively increase the link break ratio (by multiplying it by γ), since the added capacity allows larger steps. Otherwise, the link break ratio is decreased conservatively (by multiplying it by β).

Algorithm 2: FatClique Expansion Plan Generation
  input: T_o, T_n, SLO
  output: Plan
  Initialize α ∈ (0,1), β ∈ (0,1), γ ∈ (1,∞)
  Find the total set of links to break, L, based on T_o and T_n
  Compute the original capacity c_0
  while |L| > 0 do
    Select a subset of links L_b from L uniformly across all blocks, where |L_b| = α|L|
    if L_b does not satisfy the edge capacity constraint then
      α = β·α; continue
    end
    Delete L_b from T_o
    c = ComputeCapacity(T_o)
    if c < c_0 · SLO then
      α = β·α; add L_b back to T_o
    else
      T_o = AddNewLinks(L_b, T_o, T_n); α = γ·α; Plan.add(L_b)
    end
  end
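To make the control flow of Algorithm 2 concrete, here is a minimal executable Python sketch. The topology operations are passed in as stub callbacks, the batch is sampled uniformly at random rather than uniformly across blocks, and the parameter names mirror the α, β, γ used above; this illustrates the trial-and-error loop under those simplifying assumptions, not the production planner.

    import random
    from typing import Callable, List, Set

    def fatclique_expansion_plan(links_to_break: Set[str],
                                 compute_capacity: Callable[[], float],
                                 edge_capacity_ok: Callable[[Set[str]], bool],
                                 drain: Callable[[Set[str]], None],
                                 undo_drain: Callable[[Set[str]], None],
                                 apply_rewire: Callable[[Set[str]], None],
                                 slo: float,
                                 alpha: float = 0.1, beta: float = 0.5, gamma: float = 1.5,
                                 seed: int = 0) -> List[Set[str]]:
        """Trial-and-error expansion planning in the spirit of Algorithm 2; all topology
        manipulation is delegated to caller-supplied callbacks."""
        rng = random.Random(seed)
        c0 = compute_capacity()
        plan: List[Set[str]] = []
        remaining = set(links_to_break)
        while remaining:
            batch_size = max(1, int(alpha * len(remaining)))
            batch = set(rng.sample(sorted(remaining), batch_size))
            if not edge_capacity_ok(batch):
                if batch_size == 1:
                    raise RuntimeError("cannot drain even one link without violating edge capacity")
                alpha *= beta                      # too aggressive: shrink the break ratio
                continue
            drain(batch)                           # tentatively remove the old links
            if compute_capacity() < c0 * slo:
                undo_drain(batch)
                if batch_size == 1:
                    raise RuntimeError("cannot drain even one link without violating the SLO")
                alpha *= beta                      # SLO violated: back off
                continue
            apply_rewire(batch)                    # install the corresponding new links
            alpha = min(gamma * alpha, 1.0)        # capacity grew: try a bigger step next time
            plan.append(batch)
            remaining -= batch
        return plan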
A.5 Expansion for Clos

Since the motivation of this work is to compare topologies, we only focus on developing optimal expansion solutions for symmetric Clos. More general algorithms for Clos expansion can be found in [91]. Also, similar to [91], we assume worst-case traffic matrices for Clos, i.e., servers under a pod send traffic at full capacity to servers in other pods.

Target Topology Generation. As mentioned in §2.4.1, a pod is the unit of expansion in Clos. When we add new pods and associated spines to a Clos topology for expansion, the wiring pattern inside a pod remains unchanged. To make the target topology non-blocking and to ease expansion (i.e., so that the number of to-be-redistributed links on each pod is the same), links from a pod should be distributed across all spines as evenly as possible.

Expansion plan generation. Once a target Clos topology is generated, the next step is to redistribute links to convert the original topology into the target topology. By comparing the original and target topologies, it is easy to figure out which new links should be routed to which patch panels to satisfy the wiring constraint. In this section, we mainly focus on how to drain links such that the capacity constraint is satisfied and the number of expansion steps is minimized.

Insight 1: The maximum number of rewired links at each pod is bounded. At each expansion step, when links are drained, network capacity drops. At the same time, as expansion proceeds and new devices are added incrementally, the overall network capacity increases gradually during the whole expansion process. In general, during expansion, this incrementally added capacity should be leveraged to speed up the expansion process. Due to the thin edges in Clos, however, no matter what the overall network capacity is, the maximum number of links that can be drained at each pod is bounded by the number of links on each pod multiplied by (1 − SLO). Figure A.4 shows an example. The leftmost figure is a folded Clos, where each pod has 16 links (4 trunks). If the SLO is 75%, the maximum number of links that can be drained in a single step is 16 × (1 − 0.75) = 4. Our expansion plan generation algorithm tries to achieve this bound at each pod at every single step.

Figure A.4: The original topology is a folded Clos with capacity 64. The required SLO during expansion is 75%, which means capacity should be no smaller than 48. There are 16 links on each pod, so for all plans, 4 links are allowed to be drained at each pod. Drain plan 1 drains links from two spines uniformly across pods (residual capacity 48/64 = 75%); drain plan 2 drains links from all spines randomly across all pods (residual capacity 40/64 = 62.5%).
Figure A.5: Clos Draining Link Redistribution Scheduling.

Insight 2: Drain links at spines uniformly across edges (pods). Given the number of links allowed to be drained at each pod, we need to carefully select which links to drain. Figure A.4 shows two draining plans. Drain plan 1 drains links from two spines uniformly across all pods. The residual capacity is 48, satisfying the SLO requirement of 75%. By uniformly, we mean that the number of drained links between the spine and each pod is the same. Drain plan 2 also drains 4 links from each pod, but not uniformly (for example, more links are drained at the third spine than at the fourth spine), which violates the SLO requirement since the residual capacity is only 40, smaller than the 48 of drain plan 1.

Insight 3: Create physical loops by selecting the right target spines. Ideally, drained links with the same index on a pod and on the same original spine should be redistributed to the same spine, because then the traffic sent from the pod to the target spine has a return path to the pod; otherwise, the traffic will be dropped. Figure A.5 illustrates this insight. The right side of the figure shows the performance of two redistribution plans; the y axis shows the normalized capacity of the network at each expansion step. In the first plan, link 1 is first moved to spine s1 (1-s1), followed by link 3 to the same spine s1 (3-s1), which causes the capacity to drop to 75% of its original value, since the two pods are connected by three paths instead of four. Once links 1 and 3 are undrained, s1 connects the two pods by a fourth path, and the normalized capacity is restored to 1. This redistribution step now provides leeway for supporting a 25% capacity loss in the next step. In this next step, links 2 and 4 are rewired to connect to s2. During the rewiring, capacity again drops to 75%, with three paths between the pods. On undraining links 2 and 4, the capacity is once again restored to 1. In contrast, redistribution plan 2 violates the SLO because it does not focus on restoring capacity by establishing paths via the new spine, as suggested by the insight (links 1 and 3 are moved to different spines).
Inspired by these insights, we designed Algorithm 3, which satisfies all our insights simultaneously when both the original and target topologies are symmetric. The algorithm is optimal since, at every expansion step, it achieves the upper bound on the number of links that can be drained; therefore, it uses the smallest number of steps to expand Clos.

The input to the algorithm is the original and new symmetric topologies T_o and T_n. We use T^o_sp and T^n_sp to represent the number of links between spine s and pod p in the old and new topology, respectively. Initially, T^o_s'p = 0, where s' is a new spine. The output of the algorithm is the draining plan Subplan_i for expansion step i. The final expansion plan is Plan = {Subplan_i}, and the number of subplans, |Plan|, is the total number of expansion steps. The algorithm starts by indexing old spines, new spines, and the links on each pod from left to right; these indices are critical for the correctness of the algorithm, since it relies on them to break ties when selecting spines and links to redistribute. Then, based on Insight 1, the algorithm computes the upper bound on the number of links to be redistributed at each pod, n_p. We show experimentally that our algorithm can always achieve this upper bound in each individual step as long as T_o and T_n are symmetric. Next, the algorithm iterates over all indexed old spines and tries to drain n_p links uniformly across all pods, so that Insight 2 is satisfied; it also compares the number of remaining to-be-redistributed links δ_sp with n_p, a check that matters only at the last expansion step. For each pod, the algorithm needs to find spines to which to redistribute links while satisfying the constraint in Insight 3, i.e., drained links with the same index on a pod and on the same original spine are redistributed to the same spine. Due to the indexing and the symmetric structure of Clos, our algorithm can always satisfy Insight 3. Specifically, when selecting spines, the spine s' satisfying Δ_s'p = T^n_s'p − T^o_s'p > 0 with the smallest index is considered first. When selecting links from a pod to redistribute, we always select the first n_a links.

Theorem 3. Algorithm 3 produces an optimal expansion plan for the Clos topology. The proof is simple: since at every expansion step our algorithm achieves the upper bound on the number of links that can be drained, it finishes the expansion in the smallest number of steps.
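The following Python sketch renders one expansion step in the spirit of Insights 1-3 (the full listing of Algorithm 3 appears later in this appendix). It is a simplification: the per-pod budget and the uniform drain follow Insights 1 and 2 directly, while Insight 3's same-target-spine constraint is only approximated by always choosing the lowest-indexed deficit spine. All names and the matrix representation are illustrative.

    from typing import List, Tuple

    def single_step_drain_plan(t_old: List[List[int]], t_new: List[List[int]],
                               links_per_pod: int, slo: float) -> List[Tuple[int, int, int, int]]:
        """One expansion step for a symmetric Clos.
        t_old[s][p] / t_new[s][p]: number of links between spine s and pod p before/after
        expansion (spines that exist only in the target may be missing from t_old).
        Returns a list of (old_spine, pod, new_spine, count) moves for this step."""
        num_spines, num_pods = len(t_new), len(t_new[0])
        cur = [row[:] for row in t_old]
        while len(cur) < num_spines:            # add rows for spines only in the target
            cur.append([0] * num_pods)
        # Insight 1: per pod, at most links_per_pod * (1 - SLO) links may be rewired per step.
        budget = [int(links_per_pod * (1 - slo)) for _ in range(num_pods)]
        moves: List[Tuple[int, int, int, int]] = []
        for s in range(len(t_old)):             # old spines, in index order
            for p in range(num_pods):           # Insight 2: spread the drain across all pods
                surplus = cur[s][p] - t_new[s][p]
                take = min(surplus, budget[p])
                while take > 0:
                    # Insight 3 (approximated): pick the lowest-indexed spine that still
                    # needs links for this pod, so matched links share a target spine.
                    tgt = next((t for t in range(num_spines) if t_new[t][p] > cur[t][p]), None)
                    if tgt is None:
                        break
                    moved = min(take, t_new[tgt][p] - cur[tgt][p])
                    cur[s][p] -= moved
                    cur[tgt][p] += moved
                    budget[p] -= moved
                    take -= moved
                    moves.append((s, p, tgt, moved))
        return moves

Invoking this function repeatedly, feeding the updated link counts back in as t_old, would yield a multi-step plan; the real Algorithm 3 additionally tracks link indices so that Insight 3 holds exactly.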
A.6 FatClique Topology Synthesis Algorithm

The topology synthesis algorithm for FatClique is shown in Algorithm 4. Essentially, the algorithm is a search algorithm that leverages the constraints C_1 to C_6 in §2.5.1 to prune the search space. It works as follows. The outermost loop enumerates the number of racks used for a sub-block. Based on the rack space constraints, the sub-block size S_c is determined. Next, the algorithm iterates over the number of sub-blocks in a block, S_b, whose size is constrained by MaxBlockSize. Inside this loop, we leverage constraints C_1 to C_6 and the derivations in §2.5.1 to find the feasible set of p_c, represented by P_c. Then we construct a FatClique topology from all the design variables and compute its capacity. If the capacity matches the target capacity, we add this topology to the candidate set. If the capacity is larger than required, the algorithm increases s by 1, which decreases the number of switches used, n = N/s (N is fixed), and therefore reduces the network capacity in the next search step. If the capacity is smaller than required, the algorithm decreases s by 1 to increase the number of switches and the capacity in the next search step.

Algorithm 3: Single-Step Clos Expansion Plan Generation
  input: T_o, T_n, SLO
  output: Subplan
  Index original and new spines from left to right, starting from 1
  Index links at each pod from left to right, starting from 1
  ∀ pod p: n_p = num_links_per_pod × (1 − SLO)     // Insight 1
  foreach original spine s do
    foreach pod p do                               // Insight 2
      δ_sp = T^o_sp − T^n_sp; n_p = min(n_p, δ_sp) // Insight 2
      while n_p > 0 do
        foreach new spine s' do                    // Insight 3
          Δ_s'p = T^n_s'p − T^o_s'p
          if Δ_s'p > 0 then break
        end
        n_a = min(Δ_s'p, n_p)
        Find the first n_a to-be-redistributed links, L_sp
        n_p = n_p − n_a; update(T_o)
        Subplan.add(L_sp)
      end
    end
  end

Algorithm 4: FatClique Topology Synthesis Algorithm
  input: N, r, Cap*, s_0
  output: candidate
  candidate = []
  for i = 1; i < MaxRackPerSubblock; i++ do
    s = s_0
    S_c = i × RackCapacity / (1 + s)
    for S_b = 1; S_b <= MaxBlockSize; S_b++ do
      P_c = CheckConstraints(S_c, S_b)
      foreach p_c in P_c do
        T = ConstructTopology(S_c, S_b, s, p_c)
        Cap = ComputeCapacity(T)
        if Cap < Cap* then
          s = s − 1
        else if Cap > Cap* then
          s = s + 1
        else
          candidate.append(T)
        end
      end
    end
  end

A.7 Parameter Setting

The cable prices with transceivers used in our evaluation are listed in Table A.2. We found that a simple linear model does not fit the data; the data is better approximated by a piecewise linear function, where cables shorter than 100 meters are fit with one linear model and cables beyond 100 meters with another. The latter has a larger slope because, beyond 100 meters, more advanced and expensive transceivers are necessary. In our experiments, since we only know the discrete prices for cables and associated transceivers, we do the following: if a required cable length exactly matches an available length X, we use the exact price; otherwise, we use the price of the first available cable longer than the required length.

Rack width: 24 inches
Rack depth: 28.875 inches
Rack height: 108 inches
Tray-to-rack distance: 24 inches
Distance between cross-trays: 48 inches
Aisle width: 48 inches
Rack units per rack: 58 RU [69]
Ports per patch panel: 48 [31]
Patch panel space: 1 RU
Cable tray size: 24 inches x 4 inches [79]
Table A.1: Datacenter settings, mostly from [61]
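The cable-price lookup described above can be written directly against the price points of Table A.2 (listed just below); rounding a required length up to the first stocked length is exactly the rule used in our experiments. The code layout and function name are illustrative.

    import bisect

    # (length in meters, price in USD with transceivers) from Table A.2 [24]
    CABLE_PRICES = [(3, 303), (5, 310), (10, 318), (15, 334), (20, 350), (30, 399),
                    (50, 489), (100, 753), (200, 1429), (300, 2095), (400, 2700)]
    LENGTHS = [length for length, _ in CABLE_PRICES]

    def cable_price(required_m: float) -> int:
        """Price of the shortest stocked cable at least required_m long
        (exact price when the length is stocked)."""
        i = bisect.bisect_left(LENGTHS, required_m)
        if i == len(CABLE_PRICES):
            raise ValueError("no stocked cable is long enough")
        return CABLE_PRICES[i][1]

    if __name__ == "__main__":
        print(cable_price(10))   # 318 (exact match)
        print(cable_price(42))   # 489 (rounded up to the 50 m cable)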
Figure A.6: Spectral gap of Jellyfish, Xpander, and FatClique at different topology scales (1k-4k nodes)
Figure A.7: Path Diversity for Small-scale Topologies (number of paths of length 4 and 5 for Clos, Jellyfish, Xpander, and FatClique)
Figure A.8: Path Diversity for Medium-scale Topologies (number of paths of length 6 and 7 for Clos, Jellyfish, Xpander, and FatClique)

Length (m):  3    5    10   15   20   30   50   100  200   300   400
Price ($):   303  310  318  334  350  399  489  753  1429  2095  2700
Table A.2: 40G QSFP Mellanox cable lengths in meters (Length) and prices with transceivers (Price) [24]

A.8 Other Metrics

In our evaluations, we have tried to compare topologies with qualitatively similar properties (§2.6). In this section, we quantify other properties of these topologies.

Edge Expansion and Spectral Gap. Since computing edge expansion is computationally hard, we follow the method in [81] and use the spectral gap [40] to approximate edge expansion; a larger spectral gap implies larger edge expansion. To fairly compare topologies, we first equalize their bisection bandwidth. As shown before, to achieve the same bisection bandwidth, Clos uses many more switches. Also, Clos is not a d-regular graph, and we do not know of a way to compute the spectral gap for Clos-like topologies. Therefore, we compare the spectral gap only for the d-regular graphs, Jellyfish, Xpander, and FatClique, at different scales (1k-4k nodes). The spectral gap is defined as follows [40]. Let G denote a d-regular topology and A(G) its adjacency matrix. The matrix A(G) has n real eigenvalues, which we denote by λ_1 ≥ λ_2 ≥ ... ≥ λ_n. The spectral gap is SG = d − λ_2. In our experiments, the chip radix is 32 and each node in those topologies connects to 8 servers, so d = 24. The result is shown in Figure A.6. First, we observe that the spectral gap stays roughly the same at different scales. Also, the spectral gap of FatClique is slightly lower than that of the other topologies, which implies that FatClique has slightly smaller edge expansion than Jellyfish and Xpander. This is to be expected, since FatClique adds some hierarchical structure to cliques.

Path Diversity. We compute the path diversity of the different topologies. For Clos, we only count the number of shortest paths between two ToR switches in different pods. For the other topologies, we count the number of paths that are no longer than the shortest paths in the same-scale Clos. For example, for the small-scale Clos, the shortest path length is 5, so we only count paths whose length is at most 5 in the other topologies. This is a rough metric for path diversity. The results are shown in Figure A.7 and Figure A.8. We find that Jellyfish, Xpander, and FatClique have the same level of path diversity, which is higher than that of Clos. Also, those topologies have shorter paths than Clos.
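The spectral-gap computation described above takes only a few lines of NumPy; in the sketch below, a random d-regular graph from networkx stands in for an actual Jellyfish, Xpander, or FatClique adjacency matrix, and the parameters match the d = 24, 1k-node setting of Figure A.6.

    import numpy as np
    import networkx as nx

    def spectral_gap(adjacency: np.ndarray, d: int) -> float:
        """SG = d - lambda_2, where lambda_2 is the second-largest eigenvalue of the
        adjacency matrix of a d-regular topology."""
        eigenvalues = np.sort(np.linalg.eigvalsh(adjacency))[::-1]   # descending order
        return float(d - eigenvalues[1])

    if __name__ == "__main__":
        d, n = 24, 1000                       # degree 24, 1000 switches
        g = nx.random_regular_graph(d, n, seed=0)
        print(spectral_gap(nx.to_numpy_array(g), d))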
Appendix B
Gemini Measurement and Prototype Results

B.1 Correlation between FCT and MLU, over All Links

In §3.3 we presented correlations between FCTs for inter-pod flows and DCNI-level link-utilization metrics. We also collected link-utilization data for all other links, including host-to-ToR links as well as pod-internal links. This data provides stronger evidence for correlations between link utilizations and FCTs, but is less indicative of whether the DCNI-only simulated utilizations in §3.5 would be predictive of FCT benefits.

Figure B.1 shows FCTs vs. p99 all-links MLUs; Figure B.2 shows FCTs vs. p99 all-links ALUs; Figure B.3 shows FCTs vs. all-links overloaded link ratios (OLRs). As in §3.3, FCT values are normalized to the best sample for each size, and the message size shown at the top of each graph is the upper bound for the message-size bucket represented by that graph. Note that with this dataset, there does appear to be a correlation between FCTs and OLRs.

B.2 FCTs in testbed experiments

Here we report some additional, inconclusive results regarding FCTs in the testbed experiments of §3.5.1. Note that the workload changed measurably between the baseline and best-predicted trials: the daily-average traffic volume increased by 17%; the fraction of well-bounded pod-pairs decreased from 0.93 to 0.825, indicating that the traffic became less predictable; and the maximum DMR (demand-to-max ratio) increased considerably, from 1.67 to 5.49, also indicating a decrease in predictability. We currently lack the testbed access that would allow us to repeat the experiments with less of a change in workload between trials.

FCT metrics. We collected per-flow metrics: min RTT, message transmission latency, and delivery rate (for transfers that were network-limited rather than application-limited). For each of these, we report the median and 99th-percentile values. Transmission latencies are bucketed by transfer size into five buckets ranging from 1KB to 2MB, plus a sixth bucket with transfers larger than 2MB. As in §3.3, we report FCT values that are normalized to the best sample for each size.

Figure B.4 shows how Gemini's suggested non-uniform topology and routing, in our testbed experiments, affects the normalized min-RTT and delivery rates observed at the endpoints; Figure B.5 shows the effects on FCTs. In both figures, the predicted-best topology appears to improve results in some cases, but worsens them in others. We lack sufficient information to conclude whether any of these changes are attributable to the configuration or to the difference in workload (especially the significant difference in variability). The developers of the FCT-measurement system have warned us that the p99 delivery-rate results could be unreliable, due to some practical difficulties in measuring these rates at the tail. They are more confident in the other measurements (RTT, FCT, and p50 delivery rates).
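As an illustration of how these FCT statistics are aggregated, the Python sketch below buckets per-message latencies by size, computes per-window percentiles, and normalizes each window to the best (lowest) value for that bucket, mirroring the "normalized to the best sample for each size" convention used in the figures. The bucket boundaries and data layout are assumptions for illustration only.

    import numpy as np
    from collections import defaultdict

    # Upper bounds (bytes) of the message-size buckets used in the figures, plus a catch-all.
    BUCKETS = [1 << 10, 8 << 10, 64 << 10, 256 << 10, 2 << 20, float("inf")]

    def bucket_of(size_bytes):
        return next(b for b in BUCKETS if size_bytes <= b)

    def window_percentile(samples, q=99):
        """samples: iterable of (message_size_bytes, tx_latency) for one measurement window.
        Returns {bucket: q-th percentile latency} for that window."""
        by_bucket = defaultdict(list)
        for size, latency in samples:
            by_bucket[bucket_of(size)].append(latency)
        return {b: float(np.percentile(v, q)) for b, v in by_bucket.items()}

    def normalize(window_stats):
        """window_stats: {window_id: {bucket: latency}}.  Divides each value by the best
        (lowest) value observed for that bucket across all windows."""
        best = defaultdict(lambda: float("inf"))
        for stats in window_stats.values():
            for b, v in stats.items():
                best[b] = min(best[b], v)
        return {w: {b: v / best[b] for b, v in stats.items()}
                for w, stats in window_stats.items()}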
Figure B.1: FCTs (inter-pod flows) vs p99 MLUs (all links) on production fabrics
Figure B.2: FCTs (inter-pod flows) vs p99 ALUs (all links) on production fabrics
Figure B.3: FCTs (inter-pod flows) vs p99 OLRs (all links) on production fabrics
Figure B.4: Testbed experiments – min-RTT, delivery rate ("Base" = baseline; "Pred." = predicted-best)
Figure B.5: Testbed experiments – message-transfer latency
networks