PERFORMANT, SCALABLE, AND EFFICIENT DEPLOYMENT OF NETWORK FUNCTION VIRTUALIZATION

by Jianfeng Wang

A Dissertation Presented to the
FACULTY OF THE USC GRADUATE SCHOOL
UNIVERSITY OF SOUTHERN CALIFORNIA
In Partial Fulfillment of the Requirements for the Degree
DOCTOR OF PHILOSOPHY
(ELECTRICAL ENGINEERING)

August 2023

Copyright 2023 Jianfeng Wang

Dedication

I dedicate this thesis to the countless individuals in the scientific community whose contributions have paved the way for my journey. It is my sincere hope that this work will also benefit humanity in a meaningful way.

Acknowledgements

I would like to express my deepest gratitude to those who have helped me while I was working on this dissertation and to those who have shaped my career and life with their advice and efforts.

To Ramesh Govindan and Barath Raghavan, my Ph.D. advisors, I am truly grateful. Both of them have been extremely encouraging, supportive, and resourceful. Their advice and well-structured training have improved my communication skills and my ability to think independently. We have had countless delightful and inspiring conversations over the years. I cannot imagine finishing this dissertation without their support.

Marcos A. M. Vieira has been an invaluable mentor and friend. He was a visiting scholar at USC in 2017. During those early days, he showed remarkable patience in guiding me through technical details, which turned out to be crucial for my research. I still remember those winter days, tirelessly debugging code together day after day. Marcos is a very kind and interesting person from whom I have learned a great deal.

I have also collaborated with many talented colleagues. Among them: Jane Yen was my first collaborator at USC and taught me a great deal about project management and communication. Zhuojin Li and Siddhant Gupta helped me with their solid engineering skills. During my internship at Google, Neal Cardwell and Tarun Bansal helped me tremendously and shaped my career. Many NSL members also contributed to my research: Sucha Supittayapornpong, Mingyang Zhang, Xiaochen Liu, Yitao Hu, and Rui Miao.

I am fortunate to have wonderful friends. Yeji Shen and Ce Yang have been my roommates for five years; we first met in our undergraduate years and are lifelong friends. Danyang Qiao and I have taken many adventurous trips across the US. Kaitai Zhang and Chengpeng Wang are incredibly supportive friends, willing to offer their help without hesitation. Jiang He and Hongyu Fu were there to support me in some of my darkest moments. I am thankful for all of them. In addition, I am grateful to the singer Li Chen for making exceptional albums; her voice was a constant companion during the lonely period of the COVID-19 lockdowns.

Finally, I would like to thank my lovely girlfriend Jing, my parents, and my grandparents. They have been cheering me up all the time, telling me that there is nothing to worry about and convincing me that the best is yet to come. I love you, Jing, Mom and Dad, Grandma and Grandpa.

Table of Contents

Dedication
Acknowledgements
List of Tables
List of Figures
Abstract

Chapter 1: Introduction
1.1 Background of NFs
1.2 Network Function Virtualization (NFV)
1.3 Design Space for NFV
1.4 Dissertation Overview
1.5 Dissertation Statement
1.6 Dissertation Outline

Chapter 2: Lemur: Meeting SLOs in Cross-platform NFV
2.1 Introduction
2.2 Lemur: Overview
2.3 The Placer
2.3.1 The Placement Problem
2.3.2 The Placement Algorithm
2.4 The Meta-Compiler
2.4.1 Synthesizing NF Chain Routing
2.4.2 Code Generation
2.5 Implementation
2.5.1 x86-based commodity server
2.5.1.1 BESS script generation
2.5.1.2 Shared Modules
2.5.1.3 Core Assignment
2.5.2 PISA switch
2.5.2.1 Algorithm of unifying P4 parsers
2.5.2.2 Algorithm of generating the global P4 pipeline
2.5.3 Execution: smartNIC
2.6 Evaluation
2.6.1 Methodology
2.6.2 Comparison Results
2.6.3 Other Experiments

Chapter 3: Quadrant: A Cloud-Deployable NF Virtualization Platform
3.1 Introduction
3.2 Quadrant Overview
3.3 Quadrant's Execution Model
3.3.1 Existing NF Execution Models
3.3.2 NF Chain Execution Model
3.4 Core Allocation and Scheduling
3.4.1 Controlling Chain Execution
3.4.2 Spatiotemporal Packet Isolation
3.4.3 Other Details
3.5 Auto-scaling in Quadrant
3.5.1 Monitoring and scaling signals
3.5.2 Quadrant Ingress
3.5.3 Scaling of NF Chains
3.6 Evaluation
3.6.1 Quantifying Reuse of Abstractions
3.6.2 Performance Comparisons: Isolation
3.6.3 Performance Comparisons: Scaling
3.6.4 Validating SLO-adherence with Scaling
3.6.5 Quantifying Isolation Overhead
3.6.6 Scaling to 40 and 100 GbE NIC
3.6.7 Cooperative Scheduling
3.7 Discussion

Chapter 4: Ironside: Sub-millisecond Latency SLOs for NFV
4.1 Introduction
4.2 Background, Motivation, and Approach
4.2.1 Background
4.2.2 Challenges
4.2.3 Approach
4.3 Ironside Design
4.3.1 Overview
4.3.2 Absorbing bursts: The Core Mapper
4.3.3 Core Efficiency: The Server Mapper
4.3.4 Minimizing Servers: The Ingress Mapper
4.4 Evaluation
4.4.1 Setup and Methodology
4.4.2 Latency Performance Comparisons
4.4.3 CPU Core Usage Comparisons
4.4.4 Ablation Study: The Core Mapper
4.4.5 Ablation Study: The Server Mapper
4.4.6 Analyzing Scheduling Overheads

Chapter 5: Literature Review
5.1 Middlebox and Network Function
5.2 NFV Frameworks
5.3 Hardware Offloading in NFV
5.4 Software-based Optimization in NFV
5.5 Multi-core CPU Scheduling
5.6 Other Relevant Topics
Chapter 6: Conclusions
6.1 Summary of Contributions
6.2 Summary of Results
6.3 Future Work
6.3.1 Exploring NF Use Cases
6.3.2 Cloud Integration of NFV

Bibliography

List of Tables

2.1 Lemur's SLOs capture key operator use cases.
2.2 Five canonical NF chains used for evaluation.
2.3 All supported NFs and available placement choices in Lemur. We artificially limit IPv4Fwd as P4-only for the sake of evaluation.
2.4 Example profiled NF costs (CPU cycles/packet).
3.1 A comparison of NFV platforms' properties that are key for being production-ready.
3.2 Per-core NF chain throughput (kpps) w/ and w/o coopsched.
3.3 Overheads under isolation variants.
3.4 Per-core chain throughput (kpps) under different batch settings.
4.1 A whale (i.e., a single flow whose packet processing requirement exceeds the capacity of a core) can inflate p99 latency. Each cell represents the p99 latency for the corresponding NF chain and trace in µs.
4.2 Many minnows (i.e., non-whale flows) can inflate p99 latency. Each cell represents the p99 latency for the corresponding NF chain and trace in µs.
4.3 Comparisons of end-to-end metrics (p50 and p99 latency, time-averaged CPU core usage, and loss rate) for chain 1 under the backbone traffic by Ironside and others, as a function of the system's latency target.
4.4 Comparisons of end-to-end metrics (p50 and p99 latency, time-averaged CPU core usage, and loss rate) for chain 2 under the backbone traffic by Ironside and others, as a function of the system's latency target.
4.5 Comparisons of end-to-end metrics (p50 and p99 latency, time-averaged CPU core usage, and loss rate) for running chain 1 under the AS traffic by Ironside and others, as a function of the system's latency target.
4.6 Comparisons of end-to-end metrics (p50 and p99 latency, time-averaged CPU core usage, and loss rate) for running chain 2 under the AS traffic by Ironside and others, as a function of the system's latency target.
4.7 Comparisons of end-to-end p50 and p99 latency and time-averaged CPU core usage by Ironside and Metron. In these experiments, we set Ironside's latency target to 2500 µs and 5000 µs, respectively. Ironside still produces lower p99 latency results and has similar CPU core usage.
4.8 Comparisons of end-to-end p99 latency achieved by Ironside and variants with different server mappers, as a function of the system's latency target. Turning off Ironside's boost mode or applying on-demand invocations of the server mapper cannot achieve sub-millisecond tail latency SLOs. This shows that NFV systems must resort to handling bursts in software.
4.9 The average execution time of Ironside's core mapper's core re-mapping process, as a function of packet queue length. This process runs at the end of each short epoch and is lightweight (i.e., <5 µs to re-map more than 1k packets).

List of Figures

1.1 The design space of NFV systems. An ideal NFV system should have high performance, scalability, and efficiency while offering the complete set of NF processing functionality in the cloud context.
2.1 Overview of Lemur's design.
2.2 Performance comparison of alternative schemes and of Lemur with toggled optimizations.
2.3 Performance comparison of Lemur running with different hardware and on multiple servers.
3.1 Quadrant's control plane.
3.2 A Quadrant worker.
3.3 Quadrant's controller interacts with Quadrant's ingress and worker subsystem to deploy containerized NFs for packet processing. Unshaded boxes are existing cloud components that Quadrant reuses, lightly shaded ones are components that Quadrant modifies, and darker ones represent new components specific to Quadrant embedded within the infrastructure.
3.4 Timeline of packets on a Quadrant worker. A packet is tagged at the ingress. 1. The NIC's L2 switch sends it to the NIC VF associated with the destined chain. The NIC VF DMAs packets to the first NF's memory space. 2. NF 1 processes the packet. 3. After NF 1's packet processing function returns, the packet is copied to the chain's pktbuf by the NF runtime if there are other NFs. This is necessary to ensure packet isolation, as the NIC's pktbuf should only be seen by NF 1. 3–5. A per-core cooperative scheduler controls the execution sequence of NFs to ensure temporal packet isolation. 6. The final NF asks the VF to send the packet out.
3.5 Throughput with increasing chain length for running an NF chain on a single core.
3.6 Core usage of NF chains implemented in Quadrant and Metron as a function of achieved tail latency.
3.7 End-to-end tail latency achieved by NF chains in Quadrant as a function of latency SLO.
3.8 End-to-end tail latency achieved under different levels of traffic dynamics. The latency SLO is 70 µs for all groups.
3.9 End-to-end latency CDF with SR-IOV on and off.
3.10 Per-packet cost of copying packets of different sizes.
3.11 End-to-end tail latency and CPU core usage achieved by Quadrant (Chain 1) as a function of latency SLO.
4.1 Ironside's hierarchical multi-scale allocation maps traffic to cores at three different spatial scales and temporal scales.
4.2 At each epoch, the NFV runtime maintains the number of packet and flow arrivals. At the end of an epoch, the core mapper flags potential SLO violations. If the backlog exceeds the core's capacity, the core mapper recruits auxiliary cores to handle excess traffic.
4.3 An example of the Pareto frontier for the number of active flows (with at least one packet arrival) (f) and the number of packets (p) that an NF chain is able to process within an epoch. As f increases, the chain must process fewer packets in order to avoid backlog.
4.4 The server mapper leverages RSS and updates the NIC's indirection table to apply bucket-to-core mappings. For each decision interval, it tracks the flow count and packet rate <f, r> for each RSS bucket, corresponding to one entry in the NIC's table. At the end of an interval, the server mapper decides the number of dedicated cores at a server: it finds overloaded cores, whose loads exceed the core's capacity, moves the minimum set of buckets from them to other cores, and then tries to reclaim cores. In the above case, one bucket is migrated from core 1 to core 2 to avoid overloading.
4.5 Comparisons of end-to-end p99 latency and time-averaged CPU cores achieved/used by Ironside and its variants, as a function of achieved p99 latency. Ironside w/o core-mapper has the worst latency result, which shows the importance of the core mapper. Ironside static-unsafe and Ironside static-safe only consider the packet count (or queue length) when predicting core usage. The former can have latency SLO violations, while the latter can result in more CPU core usage. Overall, Ironside's design performs the best in terms of meeting SLOs and then minimizing CPU core usage.
4.6 Comparisons of end-to-end p99 latency and time-averaged CPU cores achieved/used by Ironside and two other variants with different server mappers, as a function of achieved p99 latency. For chain 2, Ironside is able to meet all latency SLOs with the smallest time-averaged CPU core usage.

Abstract

Network Functions (NFs) are widely deployed to process network traffic. Network Function Virtualization (NFV) aims to replace traditional hardware middleboxes with software-based NFs deployed on commodity clouds. We posit that an ideal NFV system should perform well in three dimensions: performance, scalability and efficiency, and cloud deployability. However, today's NFV research often sacrifices one or two of these for the rest, which pushes NFV away from practical large-scale deployment. Advancing NFV has become an urgent matter as the scale of network processing has vastly increased in the Internet era. In this work, we explore design principles and mechanisms to advance the state of the art in NFV along these three dimensions. We start by analyzing the key aspects of NFV deployment: the execution of NFs, the mechanism for interconnecting NFs, and the scheduling of NF tasks.

For NF execution, we consider various hardware accelerators, and design and implement Lemur, a cross-platform NFV framework.
Lemur places and executes multiple NF chains across heterogeneous hardware while meeting SLOs. Lemur employs a placement algorithm that takes into account many hardware-related constraints and NF chain specifications, and a meta-compiler that automatically generates code, tables, and rules to stitch together cross-platform NF chain execution. We validate Lemur on a rack-scale testbed with various hardware accelerators and show that Lemur is the only strategy among those we compare that meets SLOs for NF chains while maximizing the system's marginal throughput.

For NF interconnection, we consider several design choices (CPU, memory, and NIC resource allocation) when deploying NF chains on commodity servers. We identify the isolation requirements of multi-tenant NF chains and characterize how NFs access various resources. Based on these access patterns, we describe the design and implementation of a high-performance spatiotemporal packet isolation mechanism that makes use of containerized NFs and NIC virtualization. We then develop a cloud-deployable NFV platform, Quadrant, that employs this isolation method and supports auto-scaling. We evaluate Quadrant on a rack-scale testbed and show that it achieves up to 2.31× the per-core throughput of state-of-the-art NFV systems.

For NF task scheduling, we find that real-world traffic exhibits burstiness that can cause latency spikes of up to tens of milliseconds. This prevents existing NFV systems from achieving low-latency SLOs. We describe a novel multi-scale core-scaling strategy that makes traffic-to-core allocation decisions at rack, server, and core spatial scales, and at increasingly finer timescales, to accommodate multi-timescale bursts. We have implemented Ironside, an NFV system using the proposed core-scaling strategy, and evaluated it under realistic traffic inputs. When compared with state-of-the-art approaches, Ironside is the only one capable of achieving sub-millisecond p99 latency SLOs with a comparable number of cores.

Chapter 1
Introduction

In this dissertation, we focus on an important and specialized category of applications known as Network Functions (NFs). Unlike web applications and data center microservices, NFs implement network processing functionality. Our primary research goal is to explore and propose new design principles and mechanisms for achieving high performance, scalability, and efficiency when deploying NFs on commodity clouds. This goal has proven challenging due to the unique characteristics exhibited by NFs. Consequently, it has gained significant attention from multiple research communities, including computer systems, networking, architecture, programming languages, software engineering, and more. With our distinct perspectives and insights, this dissertation aims to help transform how NFs are developed, deployed, and managed.

1.1 Background of NFs

Many organizations, such as Internet Service Providers (ISPs) and enterprises, operate networks to provide network connectivity to users. For instance, ISPs operate cellular and cable networks to provide 5G networking experiences, optimize network latency for mobile applications, and offload computations to edge clouds [89, 3]. Universities and institutions like USC operate campus networks so that students and staff can access university services and browse the Internet for online resources.
In these scenarios, the university or the company must provide network services that go beyond basic Internet connectivity. An ISP serving many customers must provide basic network features (encryption, authentication, accounting, etc.) and implement traffic engineering that optimizes the networking experience for customers. Likewise, a university can have multiple schools and departments, each with its own on-campus services, such as digital libraries, payment systems, campus management, and dining facilities. To cater to these needs, the university must provide network slicing, enforce access controls, consider load balancing, and provide network security features and traffic monitoring.

In short, organizations require ways to manage traffic policies for security, reliability, performance, and cost. These needs are fulfilled by network processing tasks and services. The research community has identified these tasks and services as a unique type of application and terms them Network Functions (NFs). In this dissertation, we focus on analyzing this type of application and exploring mechanisms that make NFs easier and cheaper to develop, deploy, and manage.

Some NF examples. Today's Internet consists of many types of networks: datacenter networks, enterprise networks, campus networks, ISP networks, cellular networks, airplane networks, and more. In these different contexts, NFs can vary significantly. To give readers a closer look at NFs, we describe some common NFs used in today's networks.

ISP networks. ISPs provide basic network connectivity functions, including the Dynamic Host Configuration Protocol (DHCP), the Domain Name System (DNS), L2/L3 traffic forwarding, and routing. In addition, they offer NFs to optimize user network experience and application performance. Quality of Service (QoS) NFs monitor network traffic, manage network resources (e.g., network buffers), and prioritize certain traffic types for better application performance. Content caching is often integrated with web services. It leverages a distributed computing infrastructure to deploy content caches, reducing the amount of data transmitted from the web server to the end user and thereby reducing web latency.

Mobile networks. Many NFs are needed in the context of mobile networks. The Evolved Packet Core (EPC) provides core network functionality for LTE and 5G. The Radio Access Network (RAN) is the crucial component that provides radio connectivity between mobile devices and the core infrastructure. The IP Multimedia Subsystem (IMS) is used to deliver multimedia services such as Voice over LTE (VoLTE) and Rich Communication Services (RCS).

Enterprise networks. These provide services for enterprise users, including staff members and customers. Security NFs are commonly used by enterprises. Firewalls act as security barriers for organizations and are configured to prevent unauthorized access. Firewalls can vary in their designs and implementations: a simple firewall may be an Access Control List (ACL) that whitelists a few IP ranges, while a complex one can be a Deep Packet Inspection (DPI) application (e.g., an Intrusion Detection System (IDS) [170]) that catches keywords in packets' payloads. Virtual Private Networks (VPNs) enable an encrypted connection over the Internet so that remote users can access a private network securely. Encryption and decryption are used to establish secure connections between two endpoints over an insecure network. Typically, an encryption module is deployed at the network ingress, closer to one end of the communication; the encrypted data is transmitted over the network, and at the network egress a decryption module receives and decrypts the data.
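To make the ACL-style firewall mentioned above concrete, the sketch below shows one minimal way such a whitelist check could be expressed. It is purely illustrative: the rule set and the function name acl_allows are hypothetical and do not correspond to any system described in this dissertation.

import ipaddress

# Hypothetical whitelist: only traffic destined to these ranges is allowed.
WHITELIST = [ipaddress.ip_network("10.0.0.0/8"),
             ipaddress.ip_network("192.168.1.0/24")]

def acl_allows(dst_ip: str) -> bool:
    """Return True if the packet's destination IP falls in a whitelisted range."""
    dst = ipaddress.ip_address(dst_ip)
    return any(dst in net for net in WHITELIST)

# Example: acl_allows("10.1.2.3") -> True; acl_allows("8.8.8.8") -> False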
Cloud and datacenter networks. Big IT companies run data centers to host web services, and some offer cloud computing capabilities to smaller companies. Here, NFs are mostly used to improve the reliability and performance of web services and to increase the utilization of the underlying infrastructure.

Network Address Translation (NAT) is not only used in data centers but is also deployed in almost every private network. A network operator often owns a set of public IP addresses but has far more users and devices that need access to the public Internet. NAT allows multiple devices to share a single public IP: it maintains a pool of available L4 ports and performs address translation for connections.

A Load Balancer (LB) distributes incoming traffic across backend servers for high availability. It also improves performance and resource utilization. For example, Google described the design and implementation of its software-based LB system, Maglev, in 2016 [26].

WAN optimizers are used in Wide-Area Networks (WANs) to minimize the volume of network traffic [4]. They reduce repeated downloads from remote servers by caching frequently accessed byte streams, such as documents, video chunks, and software packages.
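To make the NAT mechanism described above concrete (a single public IP shared by allocating an L4 port per connection), here is a simplified, hypothetical model. A real NAT must also handle protocols, mapping timeouts, and port exhaustion policies; the class and field names below are invented for illustration.

class SimpleNat:
    """Toy model of NAT port allocation; illustrative only."""
    def __init__(self, public_ip, port_range=range(10000, 20000)):
        self.public_ip = public_ip
        self.free_ports = list(port_range)   # pool of available L4 ports
        self.mapping = {}                    # (private_ip, private_port) -> public port

    def translate(self, private_ip, private_port):
        key = (private_ip, private_port)
        if key not in self.mapping:
            if not self.free_ports:
                raise RuntimeError("port pool exhausted")
            self.mapping[key] = self.free_ports.pop()
        # Outbound packets are rewritten to use (public_ip, allocated port).
        return self.public_ip, self.mapping[key]

# nat = SimpleNat("203.0.113.5"); nat.translate("10.0.0.7", 51512) -> ("203.0.113.5", 19999)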
In summary, NFs are applied to network traffic almost everywhere: they can deliver an enhanced networking experience (e.g., NAT, LB, VPN, and traffic monitoring), ensure network security and privacy (e.g., firewalls, encryption and decryption, VPN), and interact with user-facing applications to significantly improve the overall user experience (e.g., QoS and content caching). However, deploying and managing NFs can pose challenges for network operators, prompting the introduction of the concept of Network Function Virtualization (NFV). Next, we discuss NFV in more detail.

1.2 Network Function Virtualization (NFV)

In the previous section, we highlighted that network operators host various NFs in their networks to process network traffic and provide services to network users. With the rapid growth of the Internet, the demand for NFs has vastly increased, leading to greater deployment of NFs by network operators.

Two factors are driving this trend. First, the wide adoption of the Internet and its applications has resulted in a substantial increase in the number of Internet users, necessitating the deployment of more and more NFs to handle the ever-increasing network traffic. Second, the development and evolution of networked applications have further expanded the range of NFs required by network operators to effectively support and deliver innovations.

Problems for hosting NFs. Traditionally, NFs have been designed and developed as hardware middleboxes. Network appliance vendors, such as Cisco, Juniper, F5 Networks, and Palo Alto Networks, provide hardware-based solutions to network operators. Network operators often have to make purchases from many middlebox vendors. As a result, they tend to operate and manage a long list of different NFs, each with its unique hardware specification, functionality, and management interface.

In a survey paper [136], Sherry et al. studied NF deployments at many network operators, ranging from small to large organizations. The authors found that the number of hardware middleboxes is on the same order as the number of switches and routers, regardless of the size of the network operator. Also, these hardware middleboxes often come from many vendors with diverse management interfaces. This poses challenges for network operators because they not only have to make significant investments in purchasing middlebox appliances, but also have to spend great effort running, configuring, and managing these devices. In particular, NFs need to process traffic in sequence, i.e., as an NF chain. Orchestrating middleboxes correctly to meet operational requirements can be difficult and error-prone.

Additionally, middleboxes need to be upgraded as new applications and use cases keep evolving. Even though network appliance vendors have made significant efforts to design and develop new solutions that meet these changing needs, they still cannot deliver quickly enough to keep pace with today's Internet. For this reason, hardware-based solutions have intrinsic limitations.

Finally, the traffic volume can keep increasing and exceed the processing capacity of a single middlebox device. Each middlebox is a powerful piece of hardware that can be expensive. This requires network operators to scale out the middlebox infrastructure to prepare for peak loads by adding new hardware, even though the infrastructure's utilization may be low most of the time. Scaling out devices, together with the complex orchestration and management interfaces, can be a great challenge for network operators. In the end, network operators may spend significant resources to maintain and scale up this infrastructure via a specialized network engineering team.

The NFV proposal. Given the drawbacks of the traditional NF solution, the research community and industry have come up with the NFV proposal: replace hardware middleboxes with software NFs deployed on commodity servers. Contrary to on-premise deployments or specialized hardware, NFV virtualizes middleboxes as software packages running on general cloud infrastructure. This scheme comes with many benefits.

Lower infrastructure cost. Cloud computing has demonstrated great capabilities for hosting software services for many clients over a shared infrastructure. By multiplexing the cloud infrastructure, cloud platforms can reduce the cost and complexity of building, maintaining, and upgrading compute infrastructure for cloud tenants. If successful, NFV can similarly reduce the cost and complexity of purchasing, operating, and managing middleboxes for network operators.

Faster development. Developing NFs as software packages rather than hardware middleboxes can greatly expedite the development of new NFs. With NFV, network operators can upgrade existing NFs and develop new NFs easily, as if deploying new pieces of software.

Better dynamic scaling. Software NFs can benefit from dynamic scaling and use just enough resources for computation, a technique that has been applied to various workloads [158]. Cloud platforms offer fine-grained scaling mechanisms (such as serverless functions) that enable a pay-as-you-go compute model. This can further reduce operational costs.

Easier to manage. NFV can reduce management overheads. Orchestrating software NFs is much easier with the help of cloud networking techniques, such as Software-Defined Networking (SDN) and virtual networks [62, 110]. Managing NFs can be easier still if a unified software management interface exists.

Problems for NFV.
After the NFV proposal came out, the networking and systems community hoped that NFV would enable the rapid development of new NFs and leverage existing commodity cloud computing infrastructure to reduce management overhead. However, despite these advances, NFV has shown little progress toward large-scale deployments. To date, we know of no production-ready, software-based NFV system that has been adopted by large network operators. After these years of research in NFV, we are still far from "making middleboxes someone else's problem" [136]. As a result, the industry has doubled down on custom hardware solutions [107] and complex, bespoke NFV deployments [100].

In this dissertation, we ask: why has the transition from hardware middleboxes to software NFs been so slow? We believe this is because the state of the art in NFV has not met all production-ready requirements. Looking at the success of cloud computing in deploying general applications, we find that large-scale deployments of performant, efficient, and easily managed computations with performance objectives can be achieved by scale-out computations on commodity servers and network devices, using standard OS mechanisms for resource management. To completely replace hardware-based solutions, NFV has similar requirements: performance, scalability and efficiency, generality, and cloud deployability.

Our observations. We posit that a production-ready NFV system should do well in these dimensions:

(1) Performance: it must meet the service level objective (SLO) associated with each NF chain (in terms of throughput and latency), because NFs can be critical network processing tasks and affect user experience. Under peak loads, it must serve dynamic traffic on the order of 10s-100s of Gbps.

Figure 1.1: The design space of NFV systems. An ideal NFV system should have high performance, scalability, and efficiency while offering the complete set of NF processing functionality in the cloud context.

(2) Scalability and efficiency: it must scale up and down quickly to adapt to load changes and consume just enough system resources to meet packet processing needs, as NFV promises cost savings by multiplexing the underlying infrastructure.

(3) Cloud deployability: it must be able to deploy and orchestrate NFs manufactured by different vendors to form NF chains that process traffic from many network operators. In addition, it must leverage commodity cloud infrastructure as much as possible to minimize management overheads.

1.3 Design Space for NFV

The design space can be classified along three dimensions, as shown in Figure 1.1.

Performance. NFV systems' performance can directly affect users' networking experience. To push the adoption of NFV, performance metrics for NFs deployed under NFV should match those of NFs deployed using hardware-based solutions or bespoke deployments. Example metrics include throughput; median, tail, and worst-case packet processing latency; packet loss rate; and energy usage.

Scalability and efficiency. NFV promises to eliminate the difficulty of dynamic scaling and to reduce operational and management costs with cloud computing technology. Scalability and efficiency are often considered together because the system's scaling behavior can affect efficiency.
Under this dimension, NFV systems may demonstrate different levels of dynamic scaling, i.e., whether they can adapt to traffic dynamics quickly while using as few resources as possible.

Cloud deployability. Cloud deployability indicates the difficulty of deploying the target NFV solution in a commercial cloud context. NFV requires utilizing the existing cloud infrastructure so that network operators can leverage the power and economics of commoditized cloud computing infrastructure. It is important to consider deployability in the cloud computing context when designing NFV systems: making too many assumptions reduces cloud deployability, while making none can result in poor performance, scalability, or efficiency.

As shown in Figure 1.1, our work investigates the design space of NFV systems. We aim to explore principles and mechanisms that advance the design and implementation of NFV systems along these three dimensions. Our principle is to make as few assumptions as possible to maximize cloud deployability while pursuing advancements in the other two dimensions. Lemur, a cross-platform NFV meta-compiler, is designed to push NFV's performance to its limit by utilizing on-path hardware accelerators. Unlike other hardware offloading work, Lemur considers cloud deployability by handling hardware offloading automatically and systematically. Quadrant, an NFV platform with a novel lightweight isolation mechanism, is designed to support low-overhead isolation and orchestration for deploying closed-source NFs from third-party vendors. It also attempts to provide performance SLOs by introducing auto-scaling to NFV systems. Finally, Ironside, an NFV task scheduler with a new hierarchical multi-scale strategy, is designed to push scalability and efficiency to their limits. In this work, we offer new insights gained from real-world traffic analysis and explore the key set of mechanisms for achieving sub-millisecond tail latency SLOs for running NF chains.

1.4 Dissertation Overview

In this dissertation, we study, understand, and explore design principles and mechanisms that advance NFV in performance, scalability and efficiency, and cloud deployability.

We show that NFV systems can benefit from offloading computations to on-path hardware accelerators, including OpenFlow/P4 programmable switches and/or smart NICs. Our framework Lemur takes as input the cluster's hardware specifications and NF chains with SLOs. To reduce management overheads, we seek to automatically place, configure, and execute multiple NF chains across heterogeneous hardware. To this end, we design an NF Placer that determines NF placement and addresses several challenges: determining hardware acceleration under resource constraints and deciding the scale-out for NFs on a cluster of multi-core servers. We also describe the design of a meta-compiler that produces low-overhead coordination code and tables to ensure that each NF chain executes as determined by the placer. In Chapter 2, we implement Lemur on a real cluster with heterogeneous hardware accelerators. We evaluate Lemur by validating our meta-compiler's design and the effectiveness of the NF placer on this testbed.

Next, we show that software NFV can also achieve great performance by rethinking the inter-NF isolation mechanism. NFs need to process traffic in sequence, while NFV platforms handle packet forwarding and isolation for NFs.
In Chapter 3, we describe the design of a lightweight isolation mechanism for NFs deployed on the same server. We first analyze the memory access patterns of NF chains on the same server. Then, we identify unnecessary overheads of packet copying and transmission that state-of-the-art NFV systems require to achieve isolation between NFs. Our proposed isolation mechanism avoids these overheads by redesigning memory allocation and CPU core scheduling. In addition, we also begin to explore the auto-scaling concept for NFV systems in this work. We validate the effectiveness of our isolation and scaling mechanisms on a rack-scale cluster by comparing them with state-of-the-art NFV systems.

Finally, we aim to push performance, scalability, and efficiency to their limits without sacrificing cloud deployability. Under real-world traffic, we identify key challenges that prevent existing NFV systems from achieving sub-millisecond tail latency SLOs. With traffic analysis, we demonstrate two types of bursts and posit that NFV systems cannot achieve dynamic scaling under stringent latency SLOs if they ignore bursts. We propose a novel task scheduler that optimizes CPU core efficiency and can detect and react to traffic bursts at different scales. We evaluate our system Ironside on a rack-scale cluster and show that it can achieve sub-millisecond latency SLOs (more than 10× better than alternatives) with comparable CPU core usage.

Potential research outcomes. We envision that our work will advance research on NFV systems and promote the adoption of NFV techniques by network operators for more use cases. As of 2023, we have seen much evidence of our research's impact. CDN providers and startups have started to offer new network security services (one NF use case) that protect end users' Internet privacy [52, 54]. They mainly use software-based solutions deployed on public clouds, reusing the design principles and software artifacts from Quadrant. Inspired by Lemur, large enterprises, such as TikTok, have started to use a systematic hardware offloading method to build a unified network gateway with many NFs (e.g., firewall, NAT, and LB).

1.5 Dissertation Statement

This dissertation advocates for the following principles in the context of rack-scale NFV deployment: to apply hardware offloading with programmable hardware platforms, to employ a lightweight packet isolation mechanism for interconnecting NFs seamlessly, and to use a task scheduler capable of detecting and reacting to traffic bursts. With these mechanisms, NFV systems can achieve high performance, scalability, and efficiency. They can effectively adapt to traffic dynamics, even under adversarial traffic inputs, while utilizing fewer resources (in terms of CPU core hours) and delivering better tail latency.

1.6 Dissertation Outline

This dissertation is structured as follows:

Chapter 2: Lemur: Meeting SLOs in Cross-platform NFV. In this chapter, we present our work on designing and implementing a cross-platform meta-compiler, named Lemur. Lemur focuses on automatically leveraging on-path programmable hardware platforms to accelerate NFV workloads. With Lemur's ability to utilize hardware offloading automatically, we then delve into the software aspect of NFV deployment.

Chapter 3: Quadrant: A Cloud-Deployable NF Virtualization Platform.
In this chapter, we explore the challenges of hosting multiple third-party NFs without mutual trust and propose a novel NF isolation mechanism, which eliminates the unnecessary overheads of copying and transmitting packets between NFs. Additionally, we start exploring the design of auto-scaling for NFV workloads under dynamic traffic.

Chapter 4: Ironside: Sub-millisecond Latency SLOs for NFV. In this chapter, we provide new insights gained from analyzing real-world traffic traces. We identify two types of bursts in traffic that pose significant challenges in meeting sub-millisecond tail latency SLOs. Building on these observations, we present our work on pushing NFV's auto-scaling performance to the extreme. We demonstrate that our proposed rack-scale NFV scheduler can achieve sub-millisecond tail latency SLOs without substantial increases in the number of CPU cores utilized for serving NFV workloads.

In Chapter 5, we conduct a comprehensive review of the existing literature, focusing on relevant research directions, including NFs, NFV, multi-core CPU scheduling, network packet scheduling, and more.

In Chapter 6, we conclude this dissertation and discuss potential directions for further research development in NFV.

We hope that the outlined structure of this dissertation provides a clear path for readers to understand our research efforts, insights, and results.

Chapter 2
Lemur: Meeting SLOs in Cross-platform NFV

Over the last ten years, network operators have begun to deploy virtualized network functions (NFs). These NFs typically perform packet processing in software on commodity servers. They replace specialized hardware middleboxes, leveraging cheaper commodity servers and cloud-like service management. They also permit flexible orchestration of the data plane by chaining together NFs (into NF chains) to meet operator needs. An important large-footprint use case is a rack-scale deployment of servers to run NFs for traffic ingressing or egressing a telecom central office, an ISP Point of Presence (PoP), or an enterprise border. This is the setting we consider in this dissertation.

2.1 Introduction

Industry interest has prompted two threads of NFV work:

Software NFs. One thread has focused on programming and orchestrating NFs and achieving elastic scaling (e.g., [36, 105, 109]), and on improving their performance (e.g., [10, 148, 149]). However, a key factor in practical deployments, the ability to meet service-level objectives (SLOs) for traffic processed by an NF chain, has received less attention [149]. Consider an ISP that serves residential or enterprise customers. It may want to apply security or isolation policies on traffic from these customers using NF chains. In doing so, however, the ISP runs the risk of violating traffic SLOs that it established with its customers, because the NFs add processing overhead [27]. In discussions with ISPs, we have found that such SLOs usually have three required components: a minimum rate requirement on aggregate traffic processed by an NF chain, a maximum rate bound that limits bursts, and a maximum delay imposed by the NF chain. Accompanying these SLOs is a pricing model that sets a fixed price for the minimum rate, and a usage-sensitive price for traffic above the minimum rate.
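As an illustration of this pricing model, the small sketch below computes the revenue an operator would earn from a chain's achieved rate, assuming a fixed price up to the minimum rate and a usage-sensitive price above it. The function name and prices are hypothetical, chosen only to make the structure of the SLO concrete; they are not drawn from the systems in this dissertation.

def chain_revenue(rate_gbps, t_min_gbps, base_price, per_gbps_price):
    """Fixed price covers traffic up to t_min; usage above t_min is billed per Gbps."""
    marginal = max(0.0, rate_gbps - t_min_gbps)   # the "marginal rate" that earns extra revenue
    return base_price + per_gbps_price * marginal

# Example: chain_revenue(rate_gbps=12, t_min_gbps=10, base_price=100, per_gbps_price=8) -> 116.0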
Hardware acceleration. A second thread stems from the growing realization that custom hardware middleboxes, while inflexible, delivered greater predictability and performance than software-based NFs. In response, researchers and industry alike have looked to leverage hardware acceleration (in the form of Protocol Independent Switch Architecture or PISA hardware [90, 45], SmartNICs [28, 72], GPUs [137, 23], and network FPGAs [17, 82]). This line of work explores hardware acceleration on an NF-by-NF basis, yielding a suite of useful, but piecemeal, higher-performance implementations.

Hardware acceleration is particularly important in our setting, as server scaling has its limits in a rack-scale deployment where space and power considerations are important. For example, the 32-port Barefoot Tofino-based PISA switch [12] we use, which has 3.2 Tbps of capacity, consumes about 450 W, comparable to a 1U two-socket Intel Xeon-based server. However, due to the limitations of the P4 programming model and limited hardware resources, not all NFs can run on switches. Leveraging hardware acceleration through manual configuration has always been possible, but ISPs that have taken this approach have suffered from the high management overhead such a manual approach imposes.

In addition, we found when working on NFV in industry that generic cloud computing platforms (e.g., OpenStack, Kubernetes) aided automation but yielded abysmal performance. On the other hand, hand-tuned NFV deployments were difficult to optimize and maintain. Neither provided the requisite SLO guarantees. Lemur aims to get the best of both worlds: automation and highly tuned performance while meeting SLOs.

Goal. Given multiple NF chains and their associated SLOs, we seek to automatically place, configure, and execute multiple chains across heterogeneous hardware such that: (a) each NF chain receives at least its minimum rate, and (b) the total marginal rate (the rate above the minimum which each chain can burst) is maximized, which maximizes revenue for the ISP. Automatic configuration means: (a) deciding where an NF should run (in software, on a PISA or OpenFlow switch, or in a SmartNIC), and the degree of NF scale-out required (using multiple cores) to achieve the SLOs, and (b) executing each NF chain across hardware with little operator intervention. In doing so, we must leverage all hardware made available on-path, thereby providing operational flexibility. Lemur's goal is not to provide a unified programming language for heterogeneous platforms. Instead, it respects existing hardware offload efforts and embodies a practical approach to generating appropriate NF placements.

Automatic configuration poses several challenges. How do we specify NF chains in a manner amenable to analysis for acceleration and scaling decisions, and to compilation for execution? How do we determine which hardware element (CPU, PISA switch, smart NIC, OpenFlow switch) each NF in the chain should run on? How do we scale partial NF chains by replicating them across multiple cores in order to meet SLOs? How do we respect constraints imposed by hardware accelerators (e.g., pipeline stages on PISA switches)? How do we accommodate link capacity constraints between switches and servers? How do we work around limitations in the programming and execution models used by hardware accelerators?

Lemur: Approach and Contributions. In this chapter, we present Lemur, a system to address these challenges. Lemur takes as input a high-level description of multiple NF chain DAGs and their associated SLOs.
Lemur's output is a placement configuration for each NF chain, along with coordination code that ensures each NF executes on the hardware element specified by the placement.

Contributions. First, Lemur provides the Placer, which determines NF placement and provisioning and addresses the several competing challenges identified above: determining hardware acceleration, deciding the scale-out for NFs on multi-core systems, respecting link capacities, and meeting hardware-specific constraints (§2.2). Our work includes an oft-ignored element of performance prediction, run-to-completion NF execution, as opposed to cross-core execution. A MILP formulation can address a scalable run-to-completion formulation while meeting SLO requirements and link-capacity constraints, but off-the-shelf solvers cannot determine whether a set of NF chains respects hardware constraints, since that requires actually invoking the hardware-specific compiler. An alternative, optimal approach (§2.3) leverages the structure of the problem to (a) enumerate placements of NFs on different hardware elements, (b) use resource and performance profiles of each NF on each hardware element to determine how to scale out server-placed NFs to multiple cores in each placement, (c) determine which placements maximize the aggregate marginal throughput while satisfying link capacity constraints, and (d) select a placement that respects hardware limitations. Enumerating placements is computationally expensive, so we develop (§2.3) a heuristic capable of near-optimal performance with very low placement delay. Our placement algorithm addresses the issue that today's PISA switches do not expose an inexpensive API to check the feasibility of placements.

Second, Lemur provides a meta-compiler that, given NF implementations, produces low-overhead coordination code and tables to ensure that the NF chain executes as determined by the Placer. The meta-compiler automatically (§2.4) reasons about DAGs in NF chains and generates code for function chaining. A key architectural novelty in Lemur's meta-compiler is the use of a top-of-rack (ToR) PISA switch as a coordinator (in addition to acting as an NF accelerator), which shares fate and improves performance. Additionally, our meta-compiler overcomes another limitation of PISA switches: the programming model of these switches does not permit reasoning about modular NFs. Recent work [143] has explored language support for modular and composable P4 programs; in contrast, Lemur targets minimal changes to P4 to support NF composition.

In our evaluations, we use canonical NF chains [69], which in our experience both in industry and research reflect actual deployments. For these, Lemur outperforms alternative approaches. It finds feasible placements in all our experiments, while other approaches find feasible placements in about 17-76% of the cases. Lemur also obtains a maximum marginal throughput advantage over competing approaches of more than 50% of the link capacity across our experiments. We demonstrate that Lemur's meta-compiler can reduce manual labor: nearly 30% of the code in Lemur is auto-generated. Finally, Lemur's heuristic placement algorithm can generate near-optimal placements in a little over three seconds. In addition, we have open-sourced our implementation and MILP formulation on GitHub at https://github.com/USC-NSL/Lemur.
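The enumerate-then-select structure described in steps (a)-(d) above can be pictured with the following highly simplified sketch. It is not Lemur's implementation: every helper passed in (candidate generation, profile-driven scale-out, and the feasibility checks) is a placeholder standing in for the mechanisms detailed in §2.3.

def select_placement(chains, enumerate_placements, scale_out, marginal_tput,
                     respects_links, fits_hardware):
    """Sketch of (a) enumerate, (b) provision cores, (c) rank by marginal throughput,
    (d) keep only placements that respect hardware limits."""
    best = None
    for candidate in enumerate_placements(chains):   # (a) NF-to-hardware mappings
        candidate = scale_out(candidate)             # (b) profile-driven core allocation
        if not respects_links(candidate):            # (c) link-capacity feasibility
            continue
        if not fits_hardware(candidate):             # (d) e.g., PISA pipeline-stage limits
            continue
        if best is None or marginal_tput(candidate) > marginal_tput(best):
            best = candidate
    return best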
2.2 Lemur: Overview

Lemur's goal is to automatically assign resources to NF chains in order to satisfy SLOs while maximizing revenue for ISPs. To achieve these goals, Lemur accelerates NFs using on-path hardware, such as programmable switches and smart NICs. Next, we provide a high-level overview of Lemur and how we specify NF chains. We then summarize Lemur's key ideas.

Overview. The input to Lemur (Figure 2.1) is an NF chain specification that, for each chain, describes which traffic aggregates to apply it to, the DAG of NFs, and the corresponding SLO. The Placer consumes the specs and determines, for each NF in every NF chain, whether Lemur should execute that NF in on-path hardware, and if so, on which element. If the Placer decides to run the NF on a server, it also determines how many cores to allocate to the NF. The resulting placement configuration of the NF chains is guaranteed to satisfy the specified SLOs.

Figure 2.1: Overview of Lemur's design.

Given the specification and the Placer's placement configuration, the meta-compiler parses the specification, selects the NF implementation for the hardware target identified by the configuration, and automatically generates code to route traffic between NFs in the NF chain. For example, traffic may first ingress the ISP at the PISA switch, then traverse NFs on a server and a smart NIC. Traffic may bounce back to an NF on the PISA switch before returning to a server (which may be necessary to satisfy SLOs, as we discuss later), and return again to the switch before egressing the ISP.

Specifying NF chains. Lemur provides a natural and abstract means for operators to specify NF chains. Inspired by BESS [10], our specification is not novel but is critical for enabling automated placement and execution. Using hardware middleboxes typically required an operator to manually configure pipelines, and a core NFV aim was to automate such work. However, Tier-1 operators, even in non-hardware-accelerated settings (such as VM-based NFs using SR-IOV), do manual setup, and we have learned from them that they frequently leave resources stranded due to configuration complexity. In Lemur, even though it supports multiple hardware platforms, NF chain configuration is as straightforward as with software-only NFV.

A Lemur user (e.g., an ISP operator) specifies NF chains using a dataflow language, where the nodes represent NFs and the edges represent packet flow between NFs. This example specifies an NF chain:

ACL -> Encryption -> Forward

In the example, ACL, Encryption, and Forward are all NF names. This NF chain represents a packet processing pipeline applied to traffic so that incoming packets are filtered by an access control list (ACL) NF, then encrypted by an encryption NF, and finally emitted out an appropriate port based on MAC address-based forwarding. An NF can have parameters. For example, the ACL can have an associated rule:

ACL(rules=[{'dst_ip': '10.0.0.0/8', 'drop': False}])

to drop packets other than those destined to 10.0.0.0/8. Also, an NF chain may specify conditional execution through branching, like:

ACL -> [{'vlan_tag': 0x1, Encryption}] -> Forward

which encrypts packets matching a specific VLAN tag. This specification is high-level and declarative. NFs in NF chain specifications use a predefined but extensible vocabulary. They do not specify where or how an NF executes. For example, Lemur users do not need to know whether an ACL NF runs on hardware accelerators or on x86 servers. Indeed, for one NF chain, Lemur might decide to configure an ACL on a PISA switch, but for another, in software.
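One way to picture how such a declarative specification could be represented internally is the sketch below: each NF becomes a node in a DAG, and branches carry optional match conditions. This is a hypothetical illustration, not Lemur's actual data model; the class name NFNode and its fields are invented.

class NFNode:
    """A node in an NF chain DAG: the NF's name, parameters, and successors."""
    def __init__(self, name, params=None):
        self.name = name
        self.params = params or {}
        self.next = []   # list of (match_condition, NFNode); None means "all traffic"

    def then(self, node, condition=None):
        self.next.append((condition, node))
        return node

# ACL -> [{'vlan_tag': 0x1, Encryption}] -> Forward expressed as a DAG:
acl = NFNode("ACL", {"rules": [{"dst_ip": "10.0.0.0/8", "drop": False}]})
fwd = NFNode("Forward")
enc = acl.then(NFNode("Encryption"), condition={"vlan_tag": 0x1})
enc.then(fwd)
acl.then(fwd)   # traffic that does not match the VLAN tag skips encryption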
Indeed, for one NF chain, Lemur might decide to configure an ACL on a PISA switch, but for another, in software. Each chain processes traffic from one or more traffic aggregates. An aggregate specifies a combination of flow 5-tuple values (source and destination IP addresses and port numbers, as well as protocol number); in our setting, an aggregate may represent traffic from a customer, for example.

Specifying performance objectives. Finally, for each traffic aggregate, the operator specifies the SLO that must be satisfied by the associated NF chain. We have derived Lemur's SLO specifications from discussions with operators; the specifications are simple yet capture important use cases. For each NF chain and traffic aggregate, the operator can specify: a minimum throughput t_min, a maximum throughput t_max, and a maximum delay d_max. Lemur must provision for the NF chain to achieve at least t_min throughput with at most d_max delay. Operators also permit traffic to burst up to t_max. These bounds also determine pricing: operators often charge a fixed price for t_min, with usage-based pricing above that rate; this pricing is contractual with customers and must not be violated by the NF chains that the operator applies to customer traffic (for example, to enforce its own security policies). Finally, because traffic usage beyond t_min generates revenue, Lemur attempts to maximize the aggregate marginal throughput (the traffic rate in excess of t_min).

Our simple SLO spec can capture several key use cases (Table 2.1). Large carriers often sell enterprises and smaller operators virtual and/or elastic pipes. In most cases, they promise a fixed minimum rate that is (in the case of an elastic pipe) burstable up to some maximum rate, with a delay bound. Residential traffic, on the other hand, is typically advertised as an elastic pipe but given metered bulk in reality. Bulk is used for low-priority traffic that consumes excess resources after other demands are met. Finally, customers with bursty demand and a willingness to pay for any level of usage can select infinite pipes (which are of course limited by hardware and interconnects).

Use case        t_min   t_max   Description
Bulk            0       ∞       Best effort
Metered bulk    0       α       Best effort, capped at α
Virtual pipe    α       α       Exactly α guaranteed
Elastic pipe    α       β       At least α with bursts up to β
Infinite pipe   α       ∞       At least α
Table 2.1: Lemur's SLOs capture key operator use cases.

2.3 The Placer

Placement of NFs is a key problem in NFV. Prior work aimed for unified orchestration and execution within one platform (e.g., software), and on monolithic NFs. Lemur's Placer faces harder challenges: it must not only perform NF placement with limited resources, but do so while avoiding SLO violations, and while accommodating multiple hardware categories (e.g., PISA switches, smart NICs) which have not only different resources but different types of resources.

2.3.1 The Placement Problem

The input to Placer is a collection of NF chains, and associated SLOs (e.g., t_min and t_max for each chain). In addition, Placer is also given the underlying topology, consisting of a single PISA switch connected to several servers. Each server may have one or more attached smart NICs. Placer produces a placement that specifies whether each NF in an NF chain should run on the PISA switch, a smart NIC (and which smart NIC), an OpenFlow switch, or a server (and which server and configuration). If it places an NF on a server, Placer also specifies NF core allocation.
Given NUMA in modern servers, and the association between CPU sockets and NICs, Placer also specifies the NIC to which each chain is assigned. A placement is feasible if the following conditions hold: (a) each NF chain receives at least t_min; (b) the NFs allocated to the PISA switch collectively fit into the switch; (c) the placement respects the core count of every server; and (d) the aggregate traffic resulting from the placement does not exceed the capacity of any network link.

Placer aims to generate a feasible placement with maximal aggregate marginal throughput: the difference between a chain's estimated throughput and its t_min. To check whether an NF chain meets its SLO, Placer estimates the throughput of each chain. As discussed in §2.2, this objective is natural in our setting because it maximizes revenue for the ISP.*

* More fine-grained objectives may also make sense in our setting; an ISP may wish to allocate higher marginal rates to certain customers, or ensure proportional fairness in rate allocation. We leave this to future work.

Challenges. Several aspects make this placement problem hard. First, some NFs can be placed on servers, switches, or smart NICs, while others have limited placement options (e.g., PISA switches cannot currently perform payload encryption). Second, different hardware resources have different constraints. PISA switches can process NFs at line rate but have limited pipeline stages and memory. Servers are less constrained and more general but are slower. Smart NICs occupy a midpoint.

Beyond placement, Placer has to meet SLOs. To do this, it needs to estimate the throughput achieved by an NF chain in a given placement. This throughput is primarily constrained by the processing on servers and smart NICs (since PISA switches process at line rate). When it places NFs on servers, Placer must also minimize overhead in NF execution, which allows efficient packing of NFs into servers. Consider two successive NFs A and B in a chain: Placer must decide, while meeting SLOs, whether to place these on the same core (to avoid copying costs), or on different cores (to permit parallelism) [182]. To meet the SLO, it may be necessary to replicate A across several cores.

Alternative Approaches. For concreteness, consider two strawman approaches. The first places an NF on the PISA switch whenever a switch implementation exists. This may be infeasible depending on the chain, since it may exceed the number of switch stages. The second, at the other end of the spectrum, places an NF on a server if a software version exists, which may be infeasible because there may be too few cores to satisfy t_min for one or more chains.

Prior work has considered placing VM-based NFs on server cores while satisfying SLOs, a mixed integer programming problem [70, 94] solved either using heuristics [70] or with an MILP solver [94]. Other work has explored minimum-bounce placements [105]. However, in this context, such bounces between nodes may be unavoidable. Consider a chain with five NFs A-E where B and D only have software implementations, A and E only switch implementations, but C can be executed on either. A minimum-bounce placement would force server placement of C. This may be sub-optimal and fail to meet t_min: another NF chain could use the core(s) allocated to C to achieve a feasible solution, or one with higher marginal throughput.

2.3.2 The Placement Algorithm

Lemur's placement algorithm overcomes these challenges.

Profiling and Estimated Throughput.
To estimate the throughput of an NF chain, Placer precomputes profiles for each NF on a server and/or smart NIC. NF B's profile is the CPU cycle count c required to execute it. Given the CPU clock rate f, the estimated rate for B is f/c. Placer might allocate k cores to B, in which case its rate is kf/c. If an NF chain placement has multiple server- (or smart NIC-) placed NFs, the estimated rate of the NF chain is the minimum of all the per-NF (or, per NF sub-group, as discussed below) estimated rates.†

The cycle count of an NF may be a function of NF state or traffic. For example, ACL processing may depend on table sizes; we profile cycle counts for different sizes and use a linear model to predict the processing costs. In other cases, such as NAT, we may not know the size of the state a priori, in which case we aim to compute a worst-case cycle count. Finally, for some NFs such as Dedup, the cycle count might depend on the degree of redundancy in the packet; in this case we compute a worst-case cycle count, and plan to explore better profiling techniques in the future. Placer decouples profiling from placement, so it can directly leverage improvements in profiling (such as those based on operator-specific knowledge).

† As ResQ [149] shows, there are subtle NF performance interactions in software that must be accounted for, such as cache effects; ResQ is complementary to Lemur and could be used to improve our estimation [149].

Brute-force Placement. Placement lends itself to an optimization formulation. We cast the placement problem as an MILP, but for one key component: it is hard to estimate a priori the number of PISA switch stages used by a placement because the PISA compiler (for Barefoot's Tofino [97]) performs stage packing. We could have modeled the PISA switch placement conservatively [42], but this would have resulted in stranded resources. An alternative is brute-force placement, which: (a) enumerates placement patterns, (b) searches through core allocations for each pattern, and (c) finds the maximum marginal throughput for a pattern and core allocation.

Enumerating Placement Patterns. Brute-force placement first enumerates patterns of all possible NF placements across available hardware for the given DAG. For example, for a chain A->B->C->D, one possible placement is A on the PISA switch, B and C on a server, and D back on the PISA switch. Another placement might place D on a smart NIC. The space of patterns is large but constrained by the fact that not all NFs can run on all platforms: e.g., a Dedup NF that de-duplicates packet payloads can only run on a server.

Dealing with branches in chains. In enumerating placement patterns for NF chains with branches, we decompose such chains into linear chains. Thus, if a chain branches from NF X to two NFs Y and Z, and then merges back into an NF W, we decompose these into two chains X->Y->W and X->Z->W. Here we assume knowledge of traffic splits across the two chains (in our discussions, operators estimate these using historical measurements). Later we merge the throughput estimates for X and W.

Searching through Core Allocations. For each pattern, brute-force placement searches through possible core allocations for the server-placed NFs. In our example above, if B and C are assigned to a server, we run them to completion on one core (i.e., a packet batch is fully processed by both NFs before B starts processing the next batch). In this case, we say B and C are part of a single subgroup, and, in making core allocation decisions, we treat a subgroup as a single entity. Subgrouping has two advantages.
First, run-to-completion is fast because it permits zero-copy packet transfers between NFs, has no scheduling overhead, and has no cross-core communication. Second, subgrouping involves a search over fewer patterns and fewer core allocations.

Subgrouping and run-to-completion are not always optimal. Consider two NFs B and C (where C comes after B in an NF chain), each with a cycle cost of x. The throughput of the subgroup BC is f/(2x). Instead of sub-grouping, one can run B and C on separate cores; the throughput for the two NFs would be 2f/(2x+δ), where δ is the cross-core or cross-socket cost. Depending on the relative values of x and δ, this throughput can be higher than run-to-completion. However, modeling cross-core and cross-socket costs is complex, especially considering cache effects [149]; we profile conservatively (§2.6.3), leaving more sophisticated profiling to future work.

Replicating stateful and branch/merge NFs. Brute-force placement does not replicate any subgroup containing one or more stateful NFs even though it may be possible to do so. For example, NAT can be replicated by partitioning the port space to minimize cross-core communication. Our current implementation does not do this yet, in part because automatically generating this replication in a meta-compiler (§2.4) is difficult, and we have left it to future work. We plan to leverage work on stateful NF scaling, such as S6 [164]. Meta-compilation complexity also motivates us to avoid replicating NFs where branching or merging occurs.

Finding Maximum Marginal Throughput. For a given placement pattern and a given core allocation, we can find the maximum throughput achievable for each NF chain from our cycle cost profiles. The throughput is constrained either by a server subgroup or a smart NIC NF. We compute the NF chain's estimated throughput, as discussed above, as the minimum of the throughputs of all NF subgroups or smart NIC NFs in the chain.

However, the sum of NF chain rates can overwhelm a NIC, so we must find assignments for NF chains that respect NIC capacities while maximizing aggregate marginal throughput. This problem is complex when two subgroups in an NF chain may be placed on different servers, or on different NIC interfaces of a server with multiple NICs. Brute-force placement uses a linear program to determine the maximum marginal throughput.

Putting it all together. Brute-force placement lists possible placements (where a placement includes a pattern, a core allocation for each subgroup, and the rates assigned to NF chains), ordered by decreasing maximum marginal throughput. We then iteratively call a PISA compiler to find the highest-ranked placement within the switch's stage constraints.

A Fast, Scalable Heuristic. Brute-force placement has two expensive pieces: enumerating placements and core allocations, and compiling placements with a PISA switch's compiler. Next we describe a low-complexity heuristic that reduces the cost of both of these and is several orders of magnitude faster than brute-force placement. Unless otherwise indicated, Placer uses this heuristic, which has three steps.

1. Check stage constraints. Placer greedily places as many NFs on the PISA switch as possible. If this placement exceeds the switch's stages, it iteratively moves the lowest cycle cost NF away from the switch until it finds a placement that respects the stage constraints.
The rationale behind removing the lowest cycle cost NF first is: since the PISA switch guarantees line-rate for any chain that fits the switch resources, if a high cost NF and a low cost NF use the same number of stages, it is always better to remove the low-cost NF (since it is more likely that we can pack this on the server and satisfy SLOs). Thus, unlike brute-force placement, Placer checks the switch stage constraint first , which more effectively prunes the search space. The output of this step is a baseline placement. The next step may remove NFs assigned to the PISA switch in the baseline placement to explore alternative placements (as described below); however, it never adds an NF to a PISA switch, guaranteeing that the final placement always respects the switch constraint. 2. Coalesce sub-groups. Even with the baseline placement, the search space is still large: we can offload each PISA switch NF (or combinations thereof) to the server to see if these result in higher marginal throughputs. Each such offload presents an opportunity to coalesce sub-groups. To see why, consider a 26 chain{A->B}->C->{D->E} where the{} denote server-placed subgroups. In this example,C is a PISA switch NF. MovingC to the server enables coalescing the two sub-groups into a single sub-group, freeing up a core that can help make another NF chain feasible, or increase overall marginal throughput. To make an optimal coalescing decision, Placer needs to consider core allocation, but this can involve an expensive search since other factors (such as NIC link capacity) constrain core allocation. Thus it decouples coalescing from core allocation and uses three simple rules to coalesce sub-groups. Consider two subgroups{A->B} and{D->E}. Placer coalesces these only if the resulting marginal throughput from allocating two cores to the coalesced sub-group is higher than allocating one core to each sub-group. We call thisstrictcoalescing. However, there are other situations in which coalescing might be beneficial (with appropriate core allocation) because they can free up cores for use by other NF chains, and we consider two. Inaggressivecoalescing, Placer coalesces two subgroups as long as the SLO is not violated; this is aggressive because it can potentially backfire and result in lower overall marginal throughput. In conservative coalescing, Placer coalesces two sub-groups only if the chain’s throughput does not decrease. The output of this step is three different placements: the baseline placement, an aggressive placement which applies strict and aggressive coalescing, and a conservative placement which applies strict and conservative coalescing. 3. Maximize marginal throughputs. For each of the three placements, Placer generates core allo- cations, runs the LP to compute marginal throughput under link constraints, and picks the configuration with the highest marginal throughput. Dynamics. The placement algorithm runs when an NF chain config changes: e.g., when an operator adds or removes an NF from a chain, changes an SLO, or updates the traffic aggregate associated with a chain. As we show in §2.6, Placer is fast enough to handle these dynamics. However, these kinds of changes need additional run-time support to dynamically reconfigure NF chains without impacting traffic; such support is usually found in NFV orchestration frameworks, into which we expect Lemur to be integrated. 
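To summarize the heuristic, here is a condensed sketch in Python-like pseudocode. It is a simplification of Placer, not its actual code: helpers such as fits_in_switch (which invokes the vendor P4 compiler), the coalescing routines, the core allocator, and the LP solver are assumed rather than shown.

def heuristic_place(chains, switch, servers, profiles):
    """Simplified sketch of Placer's three-step heuristic (illustrative only)."""
    # Step 1: check stage constraints. Greedily put every NF that has a P4
    # implementation on the PISA switch, then evict the lowest-cycle-cost NF
    # until the P4 compiler confirms the program fits in the switch stages.
    on_switch = {nf for chain in chains for nf in chain.nfs if nf.has_p4_impl}
    while not fits_in_switch(on_switch, switch):       # calls the vendor P4 compiler
        cheapest = min(on_switch, key=lambda nf: profiles[nf].cycles)
        on_switch.discard(cheapest)                    # cheap NFs are easiest to absorb on servers
    baseline = build_placement(chains, on_switch, servers)

    # Step 2: coalesce sub-groups. Moving a switch NF back to the server can
    # merge adjacent server sub-groups and free up cores for other chains.
    candidates = [
        baseline,
        coalesce(baseline, strict=True, mode="aggressive"),    # merge unless the SLO breaks
        coalesce(baseline, strict=True, mode="conservative"),  # merge only if the chain rate does not drop
    ]

    # Step 3: maximize marginal throughput. Allocate cores, solve the LP under
    # link/NIC capacity constraints, and keep the best feasible candidate.
    best_rate, best = -1, None
    for placement in candidates:
        allocate_cores(placement, profiles)
        rate = solve_marginal_throughput_lp(placement)  # None if infeasible
        if rate is not None and rate > best_rate:
            best_rate, best = rate, placement
    return best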
27 2.4 TheMeta-Compiler Lemur’smeta-compiler integrates many different execution platforms, each with its own execution model, language(s), and toolchains. It takes as input the NF chain specifications (§2.2), and automates the entire process of generating and running code for all NF chains. To do this, it parses the NF chain specifications, and develops an intermediate graph representation of all the NFs. In this NF-graph, nodes are NFs, links represent data-flows, and each node is associated with attributes that govern placement and other infor- mation. The meta-compiler then feeds the NF-graph to Placer to find the highest marginal throughput placement. Using this placement, the meta-compiler synthesizes (a) NF chain routing and (b) NF code generation. These synthesis tasks are aided by the meta-compiler’s library of NF implementations. 2.4.1 SynthesizingNFChainRouting Given a placement, traffic that matches an NF chain must traverse each of its NFs, across different platforms in the correct order. Lemur must synthesize routing configurations to deliver packets from each NF to the next NF in the NF chain, which may be on a different platform. To do this, we use the Network Service Header (NSH) [124], which tags packets with a service path index (SPI) and service ID (SI); a service path is equivalent to a linear NF chain, and a service ID helps sequence execution of NFs within a single chain. The meta-compiler’s first step, after placement, is to assign SPI and SI values nodes in the NF-graph. Then it needs to synthesize code for routing between NFs in each chain. For this, the meta-compiler pre- defines implementations of two modules for each platform–encap and decap–which respectively add and remove NSH or its equivalent. Having determined the SPI and SI values, the meta-compiler must generate code for each platform to affect the routing between NFs. To minimize encap and decap overhead, Lemur concatenates NFs in a single service path and only generates encap and decap modules at the head and tail of that service path. 28 For example, consider an NF chainA->B->C->D, whereB andC are on a server andA andD are on the switch. The meta-compiler inserts code to set the initial SPI/SI values in the switch, then inserts code after NFA to forward the packet to the server. On the server, B andC run to completion, but the meta-compiler must insert code to increment the SI value, and, after C’s completion, code to route the packet back to the switch. Finally, it must add code to strip NSH afterD. This example shows a key part of Lemur’s design: here, the PISA switchcoordinates execution of the NF chain via routing. This is natural as all traffic enters/exits the PISA switch ToR. 2.4.2 CodeGeneration The meta-compiler generates code from the built-in NF implementation library. For example, the library might have anACL implementation for the PISA switch and an x86 server, and aDedup implementation only for an x86 server (because PISA switches cannot implementDedup). Using these, the meta-compiler can generate code for the NF chain ACL->Dedup as follows: if ACL is placed on the PISA switch, it generates the appropriate routing PISA code as above, prepends it to ACL’s PISA implementation, and performs similar steps for the x86Dedup code. This is conceptually easy but complicated by the platforms: a Barefoot Tofino-based programmable switch and x86 commodity servers running BESS [10]. ‡ SynthesizingP4NFchains. PISA switches are programmed using P4 [11] with monolithic programs for packet processing. 
However Lemur requires composability of NF chains in P4. To enable this, programmers must be able to write standalone P4 NFs that can then be composed into NF chains. Rather than invent a new language, we minimally extended P4’s syntax to allow users to specify standalone NFs. We also ‡ The meta-compiler supports eBPF on Netronome’s Agilio CX 1x40 Gbps Smart NIC; the code generation technique described above suffices. 29 developed an associated pre-processor to the meta-compiler that parses these extensions. The following paragraphs describe our extensions to the P4 syntax, and the pre-processor. Defining standalone P4 NFs. In Lemur, NF-developers can write a standalone P4 NF in much the same way as they write a regular P4 program: by defining headers, per-packet metadata, header parser specification, match/action tables, and control flow of the packet processing pipeline. Lemur makes small changes in the way programmers specify headers, metadata and parsers. For each P4 NF to be standalone, the NF developer cannot know the actions a packet will be subject to after the NF is processed. Lemur assumes that the NF will pass all packets to the next NF in the chain. However, the programmer can set thedrop _ flag in metadata to ensure that a packet is not passed to the next NF. This is useful in implementing firewalls. One key feature of P4 is that it is protocol-independent. When Lemur aims to unify standalone P4 NFs, it must ensure agreement on how to parse headers. Hence for each P4 NF, programmers must specify headers and header layouts, and then specify how these headers are parsed. An NF-developer may not know a priori all the headers and their associated layouts; this is only known after placement is finalized. So an NF-developer must specify headers and parsers in a manner amenable to composability: the meta- compiler must be able to combine header parsers of P4 NFs when generating code for a set of NF chains. To achieve this, Lemur provides an interface for NF-developers to describe headers and parsers. It provides a library of predefined headers (along with their layouts). NF-developers may extend this library. When writing a P4 NF, they simply list the headers they wish to use, and describe anNF-localparser via a simple graph definition language. Composing P4 NFs into chains. After Placer runs, Lemur reads in the P4 NF modules and merges them together as a single unified P4 program. In addition to name mangling P4 NFs to ensure uniqueness, and eliminating redundant headers, the meta-compiler implements two key algorithms. 30 The first algorithm auto-generates the unified parser from the NF-local parsers, specified by the NF developer, by merging the NF-local parsers: it takes the union of the next header choices for each unique header in a parse tree. The meta-compiler then auto-generates the headers and the parser definitions, using layouts from the header library (see §2.5.2.1). The meta-compiler must also assemble the per-NF tables and actions into a global sequence of tables and actions consistent with dependencies between NFs in the NF chain definitions. One naive solution is to generate code for NFs in a topological-sort order, and place a check at the beginning of each NF. However, P4 programs generated in this manner can waste many switch stages. 
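As a rough illustration of the first algorithm, the sketch below merges NF-local parsers represented as maps from a header to the set of headers that may follow it; this representation and the function name are ours, not Lemur's actual data structures, and the real algorithm additionally rejects placements whose parsers have conflicting transitions (§2.5.2.1).

def merge_parsers(nf_parsers):
    """Union NF-local parse graphs into a single parser graph (sketch).

    Each parser maps a header name to the set of possible next headers,
    e.g. {'ethernet': {'ipv4'}, 'ipv4': {'tcp', 'udp'}}."""
    unified = {}
    for parser in nf_parsers:
        for header, next_headers in parser.items():
            unified.setdefault(header, set()).update(next_headers)
    return unified

# Example: an ACL NF that parses IPv4/TCP/UDP and a tunnel NF that also parses VLAN.
acl_parser = {'ethernet': {'ipv4'}, 'ipv4': {'tcp', 'udp'}}
tunnel_parser = {'ethernet': {'vlan'}, 'vlan': {'ipv4'}, 'ipv4': {'tcp'}}
unified = merge_parsers([acl_parser, tunnel_parser])
# Result: ethernet may be followed by ipv4 or vlan; vlan by ipv4; ipv4 by tcp or udp.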
Our experience with resource mappings for many compiled P4 programs points to two important facts that must be considered when generating a unified P4 pipeline: (1) noloop: a match/action table cannot be revisited in the pipeline; consider a merge at the end of a branch, where NFs A and B merge into C. The tables from C must be applied in the unified pipeline exactly once, and after all tables from A andB. (2) ordering dependent tables: two match/action tables cannot be packed on the same stage if they have dependencies between them. Therefore, the challenge is to convert a DAG of NFs (with many branching and merging points) into a tree struct that must respect all dependencies in the original NF DAG and must not introduce unnecessary dependencies between NFs. The meta-compiler statically analyzes the NF-graph for these dependencies and generates tables with this property (details in §2.5.2.2). Resource-AwareCodeGeneration. The generated NF code must conserve constrained resources on our platforms: PISA pipeline stages, and server cores. Minimizing PISA switch stage usage. Of the various constraints (DRAM, TCAM, matching bits, stages), the number of stages is most constraining (it is the constraint that is easiest to violate). As pre- vious studies [57] have shown, in a P4 switch table dependencies can rapidly consume available stages. Therefore, we optimized switch stage usage by eliminating table dependencies ([57] uses similar tech- niques for standard P4 programs). These optimizations use a static analysis of the NF chain graph, similar 31 to the one described above, and execute the following assertions: (a) Do not generate code to insert an NSH header if a chain is placed by Placer entirely on the PISA switch; (b) Instead of updating the SI values after each P4 NF, update it once at the end of a chain of sequential NFs; (c) To steer packets returning from the server to the correct next NF in the chain, incorporate the steering into the first switch stage which also steers previously unseen packets; and (d) Allow the P4 compiler to pack parallel branches into the same set of switch stages by expressing the exclusivity among these branches explicitly in the generated P4 code. CodegenforBESSpacketsteeringandNFscheduling. BESS is a DPDK-based software switch that supports standalone NFs and NF-chaining, so we did not need to extend the BESS programming model (see also §2.5.1.1). Placer determines NF run-to-completion subgroups and how many cores are allocated per subgroup. The meta-compiler must generate code to demultiplex packets to the right subgroup and further the right subgroup instance. This demultiplexer module also decapsulates NSH headers because BESS NF imple- mentations aren’t aware of this header; an auto-generated multiplexer module at the end re-inserts this header. In Lemur, the demultiplexer runs on a single core, pulls packets from the NIC, and steers packets to the subgroup (§2.5.1.2). This incurs cross-core communication costs; in future work we intend to generate PISA switch code to tag and steer packets to specific cores as in Metron [60]. Finally, the meta-compiler uses BESS’s scheduler, which supports hierarchical scheduling policies via a per-core tree with NF leaves and policy interior nodes. Given subgroups and core allocations, the meta- compiler specifies NF scheduling. Placer might choose to allocate multiple subgroups to the same core, and the meta-compiler generates code to schedule these subgroups round-robin. We also use the scheduler to enforcet max . 
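For a flavor of what this generated configuration looks like, the BESS-script fragment below sketches a single run-to-completion sub-group; PortInc and PortOut are standard BESS modules, NSHdecap and NSHencap are the custom Lemur modules described in §2.5.1.2, and the port names, arguments, and core assignment are illustrative rather than the exact generated code.

# Hypothetical auto-generated BESS fragment for one server sub-group.
port_inc = PortInc(port='pmd0')      # poll packets from the NIC
nsh_decap = NSHdecap()               # strip NSH and pick the sub-group by SPI/SI
acl0 = ACL(rules=[{'dst_ip': '10.0.0.0/8', 'drop': False}])
enc0 = Encryption()
nsh_encap = NSHencap(spi=1, si=3)    # re-tag the packet for the next hop on the switch
port_out = PortOut(port='pmd0')

# ACL and Encryption form one run-to-completion sub-group on a single core.
port_inc -> nsh_decap -> acl0 -> enc0 -> nsh_encap -> port_out
# (The generated script also attaches this pipeline to the worker core(s)
# chosen by Placer via BESS's scheduler tree; that part is elided here.)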
(More discussions in §2.5.1.3) 32 2.5 Implementation In this section, we include hardware and software switch implementation details. 2.5.1 x86-basedcommodityserver 2.5.1.1 BESSscriptgeneration Lemur’s chain specification is inspired by BESS’ script language. When specifying a pipeline, the user of BESS writes simple languages to concatenate NFs with arrows, and our Lemur user-level configuration adopts BESS script language style with small revisions. As such, Lemur can use BESS’s parser with two small modifications, described below. Instance Name parsing BESS allows users to define several module instances that belong to the same module class, and our configuration file language also supports instance naming convention. Lemur user can declare several instance names to represent multiple BESS module instances. For example, there is an access control (ACL) module class, and users can define an ’ACL0’ instance that uses ACL module class. Analogously, Lemur users are allowed to create instance name ’ACL0’ for ACL NF. Lemur also supports macro definitions for arguments for module creation. To support both of these, we added functionality to the parser. 2.5.1.2 SharedModules In a BESS pipeline, some modules are shared by all contiguous subgroups. Specifically, ‘PortInc’ and ‘PortOut’ modules are used to pull and push packets from the NIC in poll mode. Similarly, all packets are required to decapsulate the NSH header and be distributed to a corresponding contiguous subgroup for further processing, and for this we introduce a custom ‘NSHdecap’ module into each BESS pipeline, to be shared by all contiguous subgroups. Before pushing packets to the NIC, packets are required to be encapsulated with the NSH header again to indicate the downstream module in another platform what 33 next NF processing should be applied. Hence, the final step to wrap up a subgroup processing in BESS is to use a custom ‘NSHencap’ module to tag the next service path index and service index pair. 2.5.1.3 CoreAssignment Given the placement solution returned from Lemur Placer, BESS code generator automatically translates the optimization result to manage the pipeline scheduler. BESS’s scheduler is responsible for managing the execution of modules to process traffic in a whole pipeline; BESS separates the module graph from the scheduler tree, which is a per-core tree of logical (interior nodes) or physical (leaf nodes) schedule-able entities akin to Linux tc, enabling the implementation of complex hierarchical scheduling policies. By default, a single pipeline is assigned to the first system core under a round-robin root node and would be assigned with one core to handle corresponding traffic. When BESS’s code generator receives a placement solution from the optimizer, it pre-computes the optimal core placement to maximize throughput. According to this core allocation, we allocate cores to contiguous NF subgroups via the BESS scheduler. This allocation of cores to subgroups is done carefully to avoid violation of mutual exclusion in the NF DAG and to avoid fragmentation of stateful NFs. Ultimately, the overall chain throughput is limited by some contiguous subgroup, and since NF and/or subgroups can have dramatically different costs, we find that despite multi-chain allocation it is ultimately most meaningful to analyze each chain independently. 2.5.2 PISAswitch Deploying NFs in switch hardware is different from deploying them in a server. Hardware switches process packets with a pipeline of switch stages. 
P4 switches require a platform-specific compiler that compiles a P4 program into a binary configuration for switch ASICs. The binary configuration maps a switch abstraction into the underlying hardware resources, such as per-stage register bits, TCAM, SRAM, ALUs and so on. 34 To generate a P4 program and get it compiled into the switch hardware, Lemur’s meta compiler unifies many P4 NFs and generates a single P4 program that (1) has a unified packet header parser that can recog- nize all possible packet headers from each individual NFs, and (2) has a complete set of match-action tables and ensures packets traverse through them in the correct order. With Lemur, NF developers can build new P4 NFs as standalone NFs. They use Lemur’s extended P4 syntax to design new P4 NFs, or modify the P4 implementations of NFs slightly to make them recognizable by Lemur’s meta-compiler. Then, Lemur’s meta-compiler is responsible for unifying P4 NFs into a final P4 program by unifying header parser trees, and composing them into a switch pipeline. 2.5.2.1 AlgorithmofunifyingP4parsers In an abstract P4 switch model, a header parser is a parser tree that has each tree node representing a unique packet header and is usually rooted at Ethernet header. It is an ordered tree and contains a number of transitions from one header to next possible headers. To unify P4 NFs, Lemur’s meta-compiler must provide an unified header parser that parses all necessary packet headers. To do so, the meta-compiler starts from an empty parse tree and merges each P4 NF’s parse tree into that unified tree. To merge a new parse tree, it traverses the new tree and visits all parsing states (i.e. headers). At each parsing state, it compares all state transitions between the new tree and the unified tree, and integrates any non-existing transitions and new headers into the unified tree. If the meta-compiler encounters a conflicting header transitions, then it rejects this placement because at least two NFs conflict with each other and cannot be placed at the P4 switch together. 2.5.2.2 AlgorithmofgeneratingtheglobalP4pipeline Another important aspect of Lemur’s meta-compiler is to unify match-action tables from all P4 NFs into one final P4 switch pipeline. 35 At the pre-processing stage, Lemur’s meta-compiler concatenates NFs into a subgroup if they are in a sequential order and have no branches or merges in between. This saves switch’s resources because packets’ NSH headers do not get unnecessarily updated and matched when they traverse NFs in the sub- group. Concatenating P4 NFs into subgroups also simplifies the control flow of the final P4 pipeline. The output is a P4 subgroup DAG. In this DAG, nodes are categorized as normal (leaf) nodes, branching nodes and merging nodes. To convert a P4 subgroup DAG to a P4 subgroup tree, the meta-compiler handles branching nodes and merging nodes differently. A branching node is a subgroup node that has multiple downstream subgroup nodes. Traffic is split into downstream subgroups with a set of BPF rules according to the NF chain specification. To handle a branching node, Lemur’s code generator generates and inserts a customized traffic-splitting table that is pre-populated with BPF rules to split traffic. When packets arrive at the branching point, the table matches on packets’ traffic classes and decides which branch the packet should be forwarded to. Decisions are stored in a per-packet metadata field. Then, the meta-compiler steps into all branches one at a time. 
It generates code for each branch individually and places a condition checking before that branch. This is to make sure that only destined packets should be processed by a branch. This design only introduces necessary dependencies between upstream subgroups and downstream subgroups, and does not introduce unnecessary dependencies among different downstream subgroups. This allows a platform-specific P4 compiler to pack parallel branches into the same set of switch stages whenever possible. A merging node is a subgroup node with multiple upstream subgroup nodes. It is the merging point of multiple branches. In a P4 switch pipeline, a table cannot be revisited twice as the pipeline must be a tree structure. Therefore, Lemur must choose the right place to generate code for a merging node, and ensure that all its upstream branches can finally reach the merging subgroup node. The code generator implements this by detaching a merging node from the P4 subgroup DAG and re-attaching it to its all direct-predecessors’ common ancestor node. That ancestor node has just the right scope to ensure that 36 all branches can reach the merging node. The merging node is placed at the same level as the ancestor’s children. When traversing the P4 subgroup tree, the code generator must visit all non-merging nodes first. The code generator also places a condition check on packets’ metadata to select packets that are necessarily processed by NFs in merging nodes. After dealing with branching and merging nodes, Lemur’s code generator takes the P4 subgroup tree. It traverses the tree recursively in Preorder, and generates P4 code for each subgroup node. 2.5.3 Execution: smartNIC We use the Netronome Agilio CX 1x40 Gbps SmartNIC. This smartNIC is capable of executing eBPF (ex- tended Berkeley Packet Filter) programs. The NFs are programmed in C language and then compiled to the eBPF target. We then load the eBPF program in the SmartNIC, offloading the computation to the NIC. XDP (eXpress Data Path) is used to hook the ingress traffic to the SmartNIC, which is running the eBPF program. Programming the SmartNIC with eBPF technology presents some challenges. It has only 512 bytes of memory stack. It can only load 4196 instructions. There can be no function call. Moreover, to load the program in the SmartNIC, the code has to pass a verifier. The verifier does not allow back-edge jump (for, while). We solved these challenges by optimizing the code for 64-bit implementation, using loop unrolling to avoid for (back-edge), and inlining all function calls. 2.6 Evaluation We compare Lemur against several other alternatives and illustrate features of Lemur’s design. 37 Chain Specification Chain 1 BPF->Subchain 7->BPF->UrlFilter->Subchain 8 ↘ Subchain 8 ↘ Subchain 8 Chain 2 Encrypt->LB->3xNAT(branched)->IPv4Fwd Chain 3 Dedup->ACL->Limiter->LB->IPv4Fwd Chain 4 Dedup->ACL->Monitor->Tunnel->BPF-> 3xSubchain 6 (branched)->IPv4Fwd Chain 5 ACL->UrlFilter->Fast Encrypt->IPv4Fwd Subchain 6 LB->Limiter->ACL Subchain 7 ACL->Limiter Subchain 8 Detunnel->Encrypt->IPv4Fwd Table 2.2: Five canonical NF chains used for evaluation. 2.6.1 Methodology Implementation. Our Lemur implementation has three key pieces: (1) NF implementations in C, C++, and P4, (2) the Placer, and (3) the meta-compiler. Our NFs require 1396 lines of C++ (new BESS modules), 412 lines of C (eBPF code to run on the SmartNIC), and 1273 lines of P4 (P4 libraries). The Placer consists of 841 lines of Python. 
The meta-compiler consists of 6450 lines of Python composed of 2564 lines for the parser core, 312 lines for the BESS code generator, 3142 lines for the P4 code generator, 434 lines for OpenFlow, and 120 lines of ANTLR to parse NF chain specifications. Experimentsetup. Most of our experiments use two servers connected to a PISA switch functioning as a ToR. Both servers run BESS, one as a traffic generator and the other for NFs. Our PISA hardware is an Edgecore 100BF-32X with a Barefoot Tofino switching chip with 32x100G ports. The traffic generator is a dual-CPU 40-core 2.2 GHz Xeon E5-2630 with one Mellanox 100Gbps MCX515A-CCAT NIC. The BESS server for Lemur is a dual-CPU 8-core 1.7 GHz Xeon Bronze 3106 with one 40Gbps single-port XL710 Intel NIC. In some experiments, we use a Netronome Agilio CX 1x40 Gbps NIC or Edgecore AS5712-54X OpenFlow switch. NFsandNFchains. Our experiments use five different canonical chains, shown in Table 2.2. These rep- resent a range of use cases selected from [69] and from our discussions with ISPs. These canonical chains are composed of numerous NF implementations across the three platforms for which we have developed 38 NF Spec C++ P4 eBPF OF Encrypt 128-bit AES-CBC • Decrypt 128-bit AES-CBC • Fast Enc. 128-bit Chacha • • Dedup Network RE [5] • Tunnel Push VLAN tag • • • • Detunnel Pop VLAN tag • • • • IPv4Fwd IP Address match • • • • Limiter Token bucket • Url Filter HTML Filter • Monitor Per-flow statistics • • NAT Carrier-grade NAT • • LB Layer-4 load balance • • • Match Flexible BPF Match • • • ACL ACL on src/dst fields • • • • Table 2.3: All supported NFs and available placement choices in Lemur. We artificially limit IPv4Fwd as P4-only for the sake of evaluation. Lemur thus far. We include a summary of each network function, its corresponding implementation, and the placement choices available in Table 2.3. Two NFs, in bold, cannot be replicated across multiple cores. Comparison. We compare Lemur, which runs the heuristic placement described in §2.3.2, against alter- native strategies. Each of these alternatives corresponds to approaches described in prior work. Optimal runs the brute-force placement algorithm. HW Preferred places as many modules as possible on the PISA switch, which models the preferential use of accelerated hardware [90]. SW Preferred places all NFs with software implementations in software (BESS), which models the preferential deployment of NFs on com- modity servers with kernel-bypassing techniques [109]. MinimumBounce minimizes the bounces between the switch and servers, emulating prior work (e.g., using Kernighan-Lin in E2 [105]). Greedy selects HW- preferred chain placement and allocates cores to first meet the minimum rate requirement of each chain; once the minimum requirement is satisfied, it greedily allocates spare cores to chains sequentially by index. Once a chain’s maximum rate is reached, it moves on to the next chain to allocate spare cores, possibly causing the first chain to take resources that another chain would need to achieve a more globally-ideal allocation. Our evaluation goal is to show that an approach that holistically considers both accelerators 39 and commodity servers and trades off traffic bounces when necessary to accommodate more chains, can do much better than approaches that focus on a single dimension. Experiment Design. The input to our experiments is a collection of chains, together with an SLO for each chain. The space of possible SLOs is large. 
We systematically explore part of this space as follows. For each chain, we first define its base rate as the rate it would achieve if only one core were allocated to the slowest software NF in the chain. Then, we perform experiments in which each chain's t_min is set to δ times the base rate. We vary δ from 0.5 to 4.0, in steps of 0.5. As δ increases, it becomes harder for schemes to satisfy SLOs, since they need to either allocate more switch resources or cores. In all experiments, we set t_max to be 100 Gbps.

Metrics. For each experiment, we first compute the placement generated by Lemur and the other schemes and then use the meta-compiler to generate code. Thereafter, we execute the NF chain configuration on the testbed, but only when the placement is feasible (i.e., meets SLOs). We measure and report the aggregate throughput achieved, from which we can derive the aggregate marginal throughput of each scheme.

2.6.2 Comparison Results

We test Lemur against alternatives with (subsets of) Chains 1-4 in Table 2.2 (we use the fifth for Smart NIC experiments).

Overall results. Figure 2.2 compares Lemur performance with the alternatives. These graphs show δ on the x-axis and aggregate throughput in Gbps on the y-axis. Each scheme is shown by a vertical bar, and the aggregate t_min is shown by a hashed blue rectangle for each value of δ. The difference between each vertical bar and the top of the hashed rectangle is its aggregate marginal throughput. The absence of a vertical bar for a scheme for a given δ indicates that the scheme could not generate a feasible solution at that δ.

[Figure 2.2: Performance comparison of alternative schemes and of Lemur with toggled optimizations. Each panel plots aggregate throughput (Gbps) against the minimal rate requirement (δ) for Optimal, HW Preferred, SW Preferred, Minimum bounce, Greedy, and Lemur, along with Lemur's predicted throughput and the minimum requirement. Panels: (a) NF chains {1,2,3,4}; (b) NF chains {1,2,3}; (c) NF chains {1,2,4}; (d) NF chains {1,3,4}; (e) NF chains {2,3,4}; (f) NF chains {1,2,3} with/without optimization.]

Figure 2.2(a-e) show experiments for different chain combinations: all four of chains 1-4, and all 3-chain combinations of these 4 chains. In all experiments, as δ increases, Lemur is the only one that produces a feasible solution.

Comparison with Optimal. Moreover, across all experiments, Lemur's heuristic is able to find an SLO-satisfied solution for all 29 sets, matching the brute-force placement. In addition, Lemur achieves the same marginal throughput as Optimal in all but one of the experiments; in that one case, Lemur still outperforms all other alternatives.
Moreover, as δ increases, the total aggregate throughput of the chains decreases. This is because all chains place greater demands to meet their minimum rates and some chains are significantly more expen- sive than others; as a result, increasingδ forces the reallocation of resources towards expensive chains (in order to meet their SLOs) and away from faster chains that could have delivered aggregate throughput gains. Four chain experiment. In Figure 2.2a Lemur performs better than the alternatives as it frees up and then uses spare cores to meet chain SLOs; there are either insufficient cores or insufficient switch pipeline 41 stages for other schemes that waste cores on chains that will overshoot their SLOs while failing to meet the SLOs for others. At aδ of 0.5, all approaches find feasible solutions, but Lemur has the highest marginal throughput. By δ 1.0, only the Greedy approach and HW Preferred approach are able to compete with Lemur. Atδ of 1.5, Lemur is the only approach that provides a feasible solution. The reasons for these are varied, and somewhat nuanced, and are better illustrated by our 3-chain experiments. Threechainexperiments. In 3-chain configurations (Figures 2.2b–2.2e) we find that Lemur consistently provides higher marginal throughputs than the alternatives, and finds feasible solutions at higher δ even when other alternatives cannot. Minimum Bounce. Minimum bounce provides comparable marginal throughput to Lemur for low values of δ , but beyond a δ of 1.0, it fails to find a solution. This is because it is unwilling to move an intermediate NF to P4 as it attempts to avoid bounces. Adding the bounce might allow an NF to use P4, freeing up server resources to satisfy SLO. HWPreferred. HW Preferred delivers the same rate regardless ofδ because it maximizes P4 process- ing, and otherwise allocates spare cores evenly among chains. While effective at lower δ values, it fails once the SLO for a slower chain cannot be satisfied because of insufficient cores. In both the 4-chain and the 3-chain experiments, the HW Preferred solution fits in the switch; below we discuss an example where alternatives exceed switch stage limits, but Lemur does not. SWPreferred. SW Preferred fails to scale because all NFs are in one subgroup, and we do not replicate stateful NFs or branch/merge NFs. So, SLOs cannot be satisfied even at low δ . Lemur, though it is subject to the same replication constraints is able to find a solution with high marginal throughput. Greedy. Greedy performs quite well in all our experiments as it uses hardware when possible and attempts to meet the minimum SLO using differential core allocation across chains. Greedy differs from HW Preferred as it does not evenly distribute cores to chains but instead does so preferentially to meet SLOs using Lemur’s profiling (§2.3.2). Even so, it fails to find a feasible placement at higher values of δ 42 when Lemur can. The reason is subtle: while Greedy is the only one of our alternatives that targets SLOs, it starts with a HW Preferred placement instead of a full exploration. Thus Greedy may fail to satisfy SLOs because of a lack of cores. Consider a chain A->B->C->D->E, where B and D are on the server, the rest on the switch. Greedy is forced to allocate one core each to B and D, while Lemur can potentially place C on the server and allocate a core to the subgroup B->C->D. (Our heuristic implements such optimizations.) Specificchains. 
In Figure 2.2b all schemes deliver higher rates and meet higherδ requirements since this experiment omits Chain 4, which, as seen in Table 2.2, is complex. In NF chains{1,2,3} and NF chains{1,2,4}, whenδ = 0.5, we note that Minimum bounce is com- petitive to Lemur, but in NF chains{1,3,4} it underperforms. This is due to Chain 1, for which Minimum Bounce tries to minimize the number of bounces although it is possible to move some modules to P4. Under the same constraint as discussed in §2.3.2, Lemur finds that one more bounce of Chain 1 is beneficial and enables an expensive subgroup to be replicated, which allows spare cores to be allocated to faster chains. Instead, Minimum Bounce is unwilling to trade the bounce and it allocates the cores to slow chains, with lower total throughput. ComparisonSummary. Across all experiments, Lemur canalways find a feasible solution while other approaches only do 17-76% of the time. Moreover, overall, Lemur obtains a marginal throughput lead ranging from 500 Mbps to nearly 24 Gbps (at the latter end, more than 50% of link capacity). Profiling and Performance prediction. In all figures in Figure 2.2, we show the predicted aggregate throughput as a⋄ above the Lemur bar. This prediction is the sum of estimated rates (§2.3.2) of all chains. In general, the predicted throughput closely matches the measured throughput; in this section, we explore why the match is close, and when it is not perfect. Predictionsareconservative. This prediction depends on profiling and is not always perfect. For ex- ample, Greedy, which uses profiles, exhibits non-monotonic aggregate throughput for NF chains {1,2,4} 43 forδ = 1 andδ = 0.5. When we profile an NF, we pick the worst-case cycle count reported by BESS. In some runs, NFs see lower cycle counts and therefore higher rates. In this case, the predicted aggregate rate for both values ofδ were the same, butδ =1 saw lower cycle counts and a higher rate in our experiments. Cross-socketcosts. Our profiles assume worst-case cross-socket costs. Our server has 2 sockets, and the NIC is connected to one of them. If a subgroup is replicated on cores on the same socket as the NIC, our measured rates will be higher than predicted; this occurs in most experiment sets. Data-dependent NFs. Performance prediction may be inaccurate for an NF like dedup. This does not occur in our experiments, but this NF is interesting in two ways: (a) the number of cycles to process a packet can vary due to packet contents; and (b) the NF’s packet egress rate is less than its ingress rate. We leave exploration of this to future work. Thestabilityofprofiledcyclecosts. Table 2.4 shows the statistics of cycle costs for several NFs, across 500 profiling runs. § In general, these profile costs are extremely stable, with the worst-case cycle cost being within 6.5% of the average cycle cost. This is surprising, but explains why our predicted throughput matches the measured throughput so well. To understand this better, we tried to understand the effect of under-estimating the cycle costs. We conducted an experiment in which we reduced the profiled costs by a fraction, ranging from 1% to 10%, mimicking errors in profiling. Wefoundthat,evenwiththeseerrors,Lemur produces a configuration with the same aggregate marginal throughput as the baseline, up to 8% errors . The stability of the cycle costs, and the relative insensitivity of Lemur to errors, is encouraging and explains why our predicted throughputs match measured throughputs. 
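As a concrete (approximate) example of how such profiles become predictions: with the NF server's 1.7 GHz cores and the mean costs in Table 2.4, one core running Encrypt is predicted to sustain roughly 1.7e9 / 8593, about 0.2 Mpps, and a chain's predicted rate is the minimum over its sub-groups (§2.3.2). The snippet below simply restates this arithmetic; the two-NF combination is hypothetical.

CLOCK_HZ = 1.7e9  # base clock of the Xeon Bronze 3106 used as the NF server

def predicted_pps(cycles_per_packet, cores=1):
    """Estimated packet rate of an NF or sub-group: k * f / c (see section 2.3.2)."""
    return cores * CLOCK_HZ / cycles_per_packet

encrypt_rate = predicted_pps(8593)       # ~0.20 Mpps on one core (Table 2.4, same-NUMA)
dedup_rate = predicted_pps(30182, 2)     # ~0.11 Mpps even with two cores
chain_rate = min(encrypt_rate, dedup_rate)  # a chain with both would be limited by Dedup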
§ We generate traffic in two ways that exercise worst-case NF behavior. For NFs that perform poorly with long-lived traffic, we generated 30-50 uniformly distributed long-lived flows. For NFs that perform poorly with short-lived flows, we generated 3.2 Mpps of traffic with 10,000 new flows/sec, each lasting 1 second.

NF                   NUMA   Mean    Min     Max
Encrypt              Same   8593    8405    8777
Encrypt              Diff   8950    8755    9123
Dedup                Same   30182   29202   30867
Dedup                Diff   31188   29969   33185
ACL (1024 rules)     Same   3841    3801    4008
ACL (1024 rules)     Diff   4020    3943    4091
NAT (12000 entries)  Same   463     459     477
NAT (12000 entries)  Diff   496     491     507
Table 2.4: Example profiled NF costs (CPU cycles/packet).

Cache effects. ResQ [149] examined NF cache effects and showed that NF profiling can be cache sensitive, especially if packet queues before NFs are large and are shared across NFs. In Lemur, we have experimentally verified that our queues are short; in this setting, [149] shows that NF profiling variability can be bounded to within 3%, consistent with our findings.

An extreme configuration: P4 stage constraints. So far, switch stage constraints have been implicit since link and core constraints dominate. Next we consider an extreme NF chain configuration that causes the switch to run out of stages. This is a variant of Chain 2 without encryption: BPF->11 NAT (branched)->IPv4Fwd, and we choose δ = 0.5, for which we expect the chain minimum rate requirement to be about 44.9 Gbps. Here, SW Preferred fails to satisfy SLOs while all other alternatives exceed the number of switch stages. By contrast, Lemur finds a feasible solution, placing 10 of the NATs in the switch, and one on the server.

This illustrates the importance of using the P4 compiler to compute stage usage. Initially, we attempted to estimate stage usage by analyzing placement results, using a technique from recent work [42]. But such estimates were very conservative. For the 10-NAT placement, it estimated 14 stages, while the compiler could fit these into 12 stages using internal black-box optimizations. This shows the importance of our dependency elimination algorithms for stage compaction (§2.4.2); without it, the 10-NAT placement would have required 27 stages.

2.6.3 Other Experiments

Importance of Lemur Components. Lemur uses both NF profiling and subgroup scaling with core allocation to meet SLOs. Figure 2.2f considers removal of each feature in turn.

No Profiling. Here we assume all NFs have the same cycle cost. Because this variant is unable to distinguish between expensive and cheap NFs, it generally has lower marginal throughput, and becomes infeasible for higher values of δ because it needlessly gives cores to cheap NFs.

No Core Allocation. Here we assign no extra cores to scale subgroups; this variant can only satisfy SLOs at δ = 0.5.

[Figure 2.3: Performance comparison of Lemur running with different hardware and on multiple servers. Each panel plots aggregate throughput (Gbps) against the minimal rate requirement (δ). Panels: (a) Two servers (Lemur on two servers vs. one server); (b) SmartNIC (Lemur with and without the SmartNIC); (c) OpenFlow switch (ACL on OpenFlow vs. ACL in software).]

Placement across multiple servers. Above we evaluate with a 16-core server with a single NIC, but this is not a limitation for Lemur. Lemur can reason about multi-server placements.
To show this, we run an experiment using NF chains{1,2,3} where Lemur is used to place the 3-chain set on (a) a single 8-core server, and (b) two 8-core servers. Figure 2.3a shows that, whenδ =0.5, the single server gets less than half the aggregate throughput of the 2-server experiment. However, for a very subtle reason, whenδ = 1.5, Lemur cannot find a feasible solution for the single serve case. Chain 3 contains the following sub-chain Dedup->ACL->Limiter. The base rate of this chain is bottlenecked byDedup. Whenδ =0.5, Lemur is able to allocate one core to the subgroupDedup->ACL->Limiter, because the additional modules ACL andLimiterS are relatively lightweight, so one core can satisfy their SLO. However, whenδ =1.5, this subgroup assignment can no longer satisfy the SLO, and Lemur, to satisfy the chain’s SLO (a) offloads 46 ACL to the switch, (b) replicatesDedup on two cores, and (c) allocates a 3rd core forLimiter which is non-replicable. It thus runs out of cores in the 1-server case. PlacementonaSmartNIC. Lemur can accelerate NFs across multiple types of hardware. To demonstrate this, Figure 2.3b shows an experiment on Chain 5, which includes the ChaCha NF [98, 144]. ChaCha cannot be implemented in the P4 switch but can be in the SmartNIC as well as and on x86 servers. Our SmartNIC implementation (which uses eBPF on a 40G Netronome NIC) is more than 10× faster than on the server. In Figure 2.3b, Lemur is able to achieve higher aggregate throughput at lowerδ by offloading ChaCha to the SmartNIC. At δ = 1.5, Lemur cannot produce a server-only solution since the t min is too high for a server (even with multiple cores). This shows that Lemur can achieve close to the line rate of 40 Gbps, lower only due to NSH header overhead. Placement on an OpenFlow switch. OpenFlow switches are ubiquitous today, unlike PISA switches. We show how Lemur can use an OpenFlow switch in place of a PISA switch. Unlike a PISA switch, an OpenFlow switch has fixed table order, so the Placer must check whether a configuration violates the switch table order to execute NFs. Also, Openflow switches do not support NSH; Lemur uses VLAN in its place (specifically, the 12-bit vid field as SPI-SI to demultiplex packets for different subgroups). This somewhat limits how many chains and how many NFs can be configured, but using Lemur with an Open- Flow switch still provides overall benefits. To demonstrate how an Openflow switch can accelerate NFs, we compare offloading ACL, or not, to an OpenFlow switch on chain 3, as shown in Figure 2.3c; this can accommodate up to 7710 Mbps traffic for that chain, while stitching ACL via a commodity server can only achieve 693 Mbps. Lemur decouples the performance optimization and the code generation from the actual deployment of NF chains. This makes it extensible to other hardware platforms as well. Addinglatencyconstraints. Lemur can reason about latency constraints, which are encoded into our LP formulation (§2.3.2). The switch vendor’s EULA disallows reporting latency details, but we show a single 47 experiment in which we model chain latency using (a) propagation and transmission latency between switch and server, and (b) NF execution delay. Here we used Chain 1 and Chain 4, and assigned each a latency constraint of 45µ s. This allows Lemur to increase marginal throughput at the expense of additional bounces between the server and switch, and for this case we get over 21 Gbps. 
When we constrain the latency to 25µ s, Lemur is forced to reduce the number of bounces and can only achieve 9 Gbps. ¶ Meta-compilerBenefitsandOverhead. The meta-compiler automates coding tasks that would other- wise have to be performed by a system administrator. We quantify this benefit by counting the lines of code auto-generated by Lemur. The most significant code generation component is for P4, and for NF chains {1,2,3,4} more than a third of the total code (about 820 out of 1700 lines) is auto-generated, with most of the auto-generated code (600 lines) providing packet steering. Flexible NF-chain composition comes at a cost which takes two forms: additional stage usage in P4, and additional cycle costs in BESS. We have to burn two P4 stages, one each to encapsulate and decapsulate packets. Our BESS cycle cost overheads for these are modest at about 220 cycles. The server also incurs about 180 cycles to load-balance packets when a subgroup is allocated to multiple cores. These overheads are a small fraction of NF cycle costs and of the coordination overheads imposed by any framework or virtual switch. ScalingPlacerComputation. Brute-force placement is slow; for the 4-chain case (34 NF instances in to- tal) it takes 14901 seconds (~4 hours). Our heuristic is far faster, taking 3.5 s for the 4-chain case, motivating our careful design. ¶ Sources of latency include DPDK and switch queueing, and encap/decap overheads. The 9 Gbps drop occurs because the higher throughput placement violates latency SLO (due to multiple bounces), so Lemur picks an SLO-compliant alternative with lower throughput. 48 Chapter3 Quadrant: ACloud-DeployableNFVirtualizationPlatform Cloud computing has demonstrated that manageable, fast, large-scale, multi-tenant processing with per- formance objectives can be achieved by scale-out computing on commodity clusters and network hard- ware, using standard OS abstractions for resource management. For NFV to be production ready similarly requires performance, scalability, and isolation. NFV workloads involve the deployment of chains of NFs from different vendors in a multi-tenant environment. For this reason, we posit that NFV must leverage cloud computing infrastructure as much as possible. Put differently, for deployability and ease of manage- ment, NFV needs to conform to the infrastructure rather than the other way around. The ideal for NFV is that, to both the network operator and the cloud operator, running NF chains should appear no different from running a cloud application. In this chapter, we ask the following questions: what are the set of necessary properties for NFV deployment? And, is it possible to achieve these in today’s cloud context? 3.1 Introduction Network Function Virtualization (NFV) enables both simple (e.g., VLAN tunneling) and complex (e.g., traf- fic inference) packet processing using software-based Network Functions (NFs). Over the last decade, much research has explored the design of NFV platforms or components thereof (e.g., [172, 105, 164, 60, 109, 127, 49 140] among others). Despite this, we know of no widely deployed NFV platforms that have achieved the original goal of NFV: making hardware middleboxes “someone else’s problem” [136]. Instead, industry has doubled down on custom hardware solutions [106] and complex and bespoke NFV frameworks [100]. We posit that NFV can achieve its original goals using an NFV platform architected as a cloud service. 
In fact, cloud-deployability of NFV is fast becoming a necessity, driven by the move to cloud-hosted 5G cellular function virtualization [89]. A cloud-deployable NFV platform must concurrently support several functional and performance requirements identified by prior work: (1) chaining multiple, possibly-stateful third-party NFs to achieve operator objectives [172, 105, 164]; (2) NF-state and traffic isolation between mutually-untrusted, third-party NFs [109, 127]; (3) near-line-rate, high-throughput packet processing [60]; and (4) latency and throughput SLO-adherence [172]. In addition, it must substantially reuse cloud com- puting infrastructure and abstractions [140]. To our knowledge, no prior work satisfies all these objectives (Table 3.1). Recent work employs clean-slate custom interfaces, runtimes, and control planes [105, 60, 109, 127, 67], but has not achieved NF chaining, isolation, and scaling without losing generality, performance, or ease of deployment. Many approaches break layering and isolation for performance [105, 60, 67], or leverage specialized hardware [72, 172, 64] at the cost of poor deployability. To support untrusted third-party NFs, other solutions either use language-based isolation [109, 117], losing generality or require expensive per- hop packet copying [127], sacrificing performance (§3.3.1). Most relevant is SNF [140], a recent effort on cloud-based NFV. It sheds significant insight on distributing traffic among NF instances but does not sup- port SLO-aware chaining and ignores optimization opportunities available in today’s cloud infrastructure, such as using hardware and OS kernel features. In this chapter, we describe the design and implementation of Quadrant, an NFV platform that achieves key functional and performance requirements (Table 3.1), while significantly reusing cloud infrastructure and abstractions. It uses containers to run NFs, NIC virtualization and software-based packet steering 50 KeyProperty Quadrant NetBricks [109] EdgeOS [127] Metron [60] SNF [140] Performance High Medium Medium High Low Isolation ✓ ✓ ✓ ✗ ✓ Stateful-NF Support ✓ ✗ ✗ ✓ ✓ Third-party Compatibility ✓ ✗ ✓ ✗ ✓ SLO-aware Chaining ✓ ✗ ✗ ✗ ✗ Failure Resilience ✓ ✗ ✗ ✗ ✓ Table 3.1: A comparison of NFV platforms’ properties that are key for being production-ready. to balance load among NFs, extends a standard cluster management system (Kubernetes) to auto-scale NF processing and ensure failure resilience of NF chains, and standard OS kernel mechanisms to achieve isolation without sacrificing performance. Contributions. We make the following contributions: High-performancespatiotemporalpacketisolation. Quadrant’s use of containerized NFs, together with NIC virtualization, ensures that an NF chain can only see its own traffic. Quadrant also ensures a stronger form of packet isolation (§3.4.2): an NF in a chain can access a packet only after its predecessor NF. Quadrant achieves this by spatially isolating the first NF from the others in the chain using a packet copy. Subsequent NFs can process packets in a zero-copy fashion, with temporal isolation enforced by CPU scheduling. This approach is general and transparent to NF implementations, and requires no language support. Performance-aware scheduling. For performance, Quadrant dedicates cores to NF chains and uses kernel bypass to deliver packets to NFs, and uses standard OS interfaces to cooperatively schedule NF threads from different NFs to mimic run-to-completion [60], proven to be essential for high NFV perfor- mance (§3.4). 
Run-to-completion processes a batch of packets; Quadrant selects batch sizes that satisfy SLOs while minimizing context-switch overhead.

SLO-aware auto-scaling. In response to changes in traffic, Quadrant auto-scales NF chains by dynamically adjusting the number of NF chain instances to minimize CPU core usage while preserving latency SLOs (§3.5). This flexibility allows tenants to trade off latency for lower cost, a capability present in only a few bespoke NFV systems [149, 172].

Figure 3.1: Quadrant's control plane.
Figure 3.2: A Quadrant worker.
Figure 3.3: Quadrant's controller interacts with Quadrant's ingress and worker subsystem to deploy containerized NFs for packet processing. Unshaded boxes are existing cloud components that Quadrant reuses, lightly shaded ones are components that Quadrant modifies, and darker ones represent new components specific to Quadrant embedded within the infrastructure.

We find (§3.6) that Quadrant achieves up to 2.31× the per-core throughput when compared against state-of-the-art NFV systems [109, 127] that use alternative isolation mechanisms. Under dynamic traffic loads, Quadrant achieves zero packet losses and is able to satisfy tail-latency SLOs. Compared to a highly-optimized NFV system that does not provide packet isolation and is not designed to satisfy latency SLOs (but is designed to minimize latency) [60], Quadrant uses slightly more CPU cores (12-32%) while achieving isolation and satisfying latency SLOs. Quadrant's total code base is less than half the size of existing NFV platforms [109, 60, 140], and just 3% of an existing open-source Function-as-a-Service (FaaS) platform [101].

3.2 Quadrant Overview

Here we sketch Quadrant's architecture (Figure 3.3), which we detail in later sections.

Quadrant Interface. Clients access today's cloud services via REST APIs, and front-ends access back-ends via RPC. Both abstractions work well for normal web requests but can be inappropriate and heavyweight for NFs that process traffic at the packet level, and can introduce significant overhead in the form of unnecessary network headers and additional protocol processing [140]. As a cloud service for deploying custom NFs, Quadrant needs an efficient programming model that allows developers to easily create custom NF logic for packet processing. Quadrant adopts an event-based programming model widely used for web services, and adjusts it so that NFs accept a raw packet struct (a pointer to a packet) as input: they are handler functions for raw-packet events (rather than web or RPC requests). NFs can have state, and they share state using a standard distributed key-value store (e.g., Redis [132]). They invoke a Quadrant runtime that abstracts access to packets and state via library APIs (Figure 3.3(B)). Indeed, this abstraction is standard for commercial NFs that attach to virtual Ethernet devices.
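To make this programming model concrete, the following is a minimal sketch of what an NF handler could look like under an event-based, raw-packet interface. It is illustrative only: the runtime object, method names (on_packet, flow_key, state.read/update), and return values are assumptions for the example, not Quadrant's actual library API.

```python
# A minimal sketch of a stateful NF written against an event-based,
# raw-packet programming model like the one described above.
# All names here are hypothetical and used only for illustration.

class CountingFirewall:
    """Drops packets from a blocked source and counts packets per flow."""

    def __init__(self, runtime):
        self.rt = runtime                   # library handle for packets and state
        self.blocked = {"203.0.113.7"}      # static example configuration

    def on_packet(self, pkt):
        # pkt is a raw packet struct exposed by the runtime (a zero-copy view).
        if self.rt.src_ip(pkt) in self.blocked:
            return self.rt.DROP
        flow = self.rt.flow_key(pkt)        # e.g., the 5-tuple of the packet
        # Shared state lives in a distributed key-value store (e.g., Redis);
        # the runtime caches it locally and synchronizes it transparently.
        count = self.rt.state.read(flow) or 0
        self.rt.state.update(flow, count + 1)
        return self.rt.FORWARD              # hand the packet to the next NF
```

The state read/update calls in this sketch correspond to the stateful-NF APIs described later (§3.3.2).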
A Quadrant customer (e.g., an organization or an ISP) can then assemble an NF chain of such NFs along with (a) a traffic filter specification for what traffic is to be processed by the chain; and (b) a per-packet latency SLO. QuadrantDesign. Quadrant’s architecture reuses existing cloud infrastructure (Figure 3.3(A)). It assumes commodity servers and OpenFlow-enabled switches, and reuses cloud-native worker subsystems (e.g., Kubernetes) that manage a pool of worker servers and allocate system resources (NIC, CPU core and memory) to NFs. Each worker executes NFs encapsulated in containers ∗ ; in Quadrant, each container hosts an NF. † For custom NFs, customers provide container images for each NF: they compile and containerize each NF together with the NF runtime. For third-party NFs, Quadrant’s runtime offers a virtual Ethernet ∗ Containers and VMs are common isolation mechanisms for applications. We adopt NF containers because they are 1.5-2.3× faster than NF VMs [109]. † This also includes cases where NFs can be concatenated into a single NF with tools such as OpenBox [13]. 53 interface (a.k.a. veth) that is an API wrapper for exchanging packets, which is a standard interface for NFs in production environments. NF images are ready for deployment once uploaded to Quadrant. For each worker pool, Quadrant requires two new components: a Quadrant controller and the Quad- rantingress. At each worker, Quadrant adds ascheduler and a Quadrantagent. The controller manages the deployment of NF instances by interacting with the worker subsystem to deploy Quadrant components (an ingress, per-worker scheduler, and agents) prior to startup. At runtime, the Quadrant controller uses the worker subsystem to deploy NFs as containers. It collects NF performance statistics from each Quad- rant agent, serves queries from the ingress, or pushes load balancing decisions to the ingress which then enforces them by modifying a flow table. Traffic enters and leaves the system at the ingress which forwards traffic using flow entries that enforce the Quadrant controller’s workload assignment strategies. A flow entry forwards traffic to a deployed NF chain instance. When a new flow arrives, the ingress queries the Quadrant controller (or uses prefetched queries) to instantiate a new flow entry and routes subsequent packets in the flow to the corresponding NF chain instance. By design, this architecture is similar to that of Function-as-a-Service (FaaS), because NF chains re- semble cloud functions: they are event-based and require scalability, but also have significantly different functional and performance needs. Indeed, our Quadrant implementation is built upon an open-source FaaS platform, OpenFaaS [101], that can be easily deployed in commodity clouds. Quadrant re-uses Open- FaaS’s worker subsystem for managing workers and deploying services. The worker subsystem (imple- mented using Kubernetes in some FaaS implementations [101]) manages the system resources, i.e., a pool of worker machines. Each worker machine executes functions encapsulated in containers; in Quadrant, each container hosts an NF. Customers provide container images for each NF in a chain. The centralized Quadrant controller manages the deployment of FaaS services by interacting with this worker subsystem to deploy Quadrant components (an ingress, per-worker scheduler and agents) prior to startup. 
However, 54 it incorporates four novel features designed to address NFV requirements (§3.1): a novel execution model (§3.3) that permits high throughput packet processing, a core allocation and scheduling strategy (§3.4) that minimizes latency and overhead, a packet isolation strategy that permits third-party NFs to run fast and securely (§3.4.2), and an auto-scaling technique that minimizes CPU core usage while being able to meet latency SLOs (§3.5). We describe these next. 3.3 Quadrant’sExecutionModel An execution model describes how an application is executed, what memory it can access, and how it accesses the NIC resource. It is critical for achieving key functional and performance requirements of NFV described in Table 3.1. NFs can be seen as network applications that operate on network packets and internal state. Quadrant users write packet processing functions, and use runtime APIs to access packets and state. NFs run as processes with isolated memory for NF states. Packet memory is carefully managed by Quadrant to enable memory sharing, avoiding unnecessary packet copying. Each NF can have many instances that run as processes across multiple cores, each with a data-plane thread managed by Quadrant’s scheduler. We argue that this design is critical for achieving both performance and isolation. To provision NFs, Quadrant acts as a cluster system manager to allocate resources,e.g., CPU cores, memory, and network interfaces, to NFs. Quadrant tracks the liveness and performanace for each chain with per-worker Quadrant agents. To scale NFs, Quadrant adjusts the number of instances allocated for each chain with the goal of minimizing CPU core usage while meeting SLOs. In this section, we describe Quadrant’s NF execution model and how it differs qualitatively from prior work. 55 3.3.1 ExistingNFExecutionModels Prior work has explored different NF execution models that dictate how NFs share packet memory, how the runtimes steer packets to NFs, and how they schedule NF execution. Memory model. Prior work has explored three different models for NFs running on the same worker machine: they (1) may share NF state memory, and packet buffer memory ( e.g., in Metron [60] and Net- Bricks [109]), (2) do not share NF state memory, but share packet buffer memory globally ( e.g., in E2 [105] and NFVnice [67]), or (3) do not share either NF state memory or packet buffer memory ( e.g., in Ed- geOS [127]). Network I/O model. Packets must be sent to a specific NF running on a specific server core. In many NFV platforms, such as E2 [105], NFVnice [67], and EdgeOS [127], a hardware switch forwards packets to specific worker machines. Once packets arrive at the server’s NIC, a virtual switch forwards traffic locally. In a multi-tenant environment, the vSwitch has read and write access to each individual NF’s memory space, and copies packets when forwarding them from an upstream NF to its downstream. The vSwitch can become a bottleneck for both intra- and inter-machine traffic. To scale it up, a runtime can add CPU cores for vSwitches, but is a waste of otherwise productive cores. On our test machine, a CPU core can achieve a 6.9 Gbps throughput when forwarding 64-byte packets (or 13.5 Mpps). Consider a chain with 4 NFs running on a server with a 10 Gbps NIC. The aggregate traffic volume can reach 40 Gbps at peak on the vSwitch, which requires at least7 CPU cores to run vSwitches (more if traffic is not evenly distributed across the vSwitches). 
An alternative approach is to offload packet switching to the ToR switch and the NIC’s internal switch. Both switches coordinate to ensure packets arrive at the target machine’s/target process’s memory. When a packet hits the ToR, the switch not only forwards the packet to a dedicated machine but also facilitates intra-machine forwarding via L2 tagging. This approach eliminates the need to run a vSwitch. However, it can only ensure that packets are received by the first NF in a chain. Metron [60] and NetBricks [109] 56 take this approach but rely on a strong assumption: that all NFs can be compiled and run in a single process. However, many popular NFs are commercially available only as containers or VMs and cannot be compiled with other NFs to form a single binary that runs the NF chain. Even if that were possible, the packet isolation requirement constrains flexibility significantly, since it can only then be achieved using language-based memory isolation (e.g., by using Rust [109, 117]). CPU scheduling model. Memory and network I/O models also impact CPU resource allocation and scheduling of NFs and NF chains. When NF chains run in a single process (as in Metron and NetBricks), those runtimes can dedicate a core to an entire chain. When NFs run in separate processes (as in E2 or NFVnice), runtimes must decide whether to allocate one or more cores to a chain, and how to schedule each NF. 3.3.2 NFChainExecutionModel To ensure minimal changes to existing cloud infrastructures, Quadrant chooses an execution model that sits in a different point in the design space: it deploys each NF in a chain as a container. NFs can share packet buffers (as in Metron or NetBricks), but packet isolation is enforced through OS protection, careful scheduling, and packet copying (unlike NetBricks, which relies on language-specific isolation). Quadrant uses NIC I/O virtualization and kernel bypass to reduce packet steering overhead. The rest of this subsection describes some of the details of Quadrant’s packet I/O and memory models. The next section describes CPU allocation and scheduling. Packet I/O. Quadrant uses DPDK for fast userspace networking to handle packet I/O for NFs. Because other cloud services may use the kernel networking stack and run on the same worker, Quadrant must use userspace networking for NFs while being compatible for kernel networking options. To do so, it uses Single-root Input/Output Virtualization [145] (SR-IOV) to virtualize the NIC hardware. SR-IOV allows a 57 PCIe device to appear as many physical devices (vNICs). With SR-IOV, NIC hardware generates one Phys- ical Function (PF) that controls the physical device, and many Virtual Functions (VFs) that are lightweight PCIe functions with the necessary hardware access for receiving and transmitting data. ‡ On a worker, the Quadrant agent (Figure 3.3) manages the virtualized devices via kernel APIs through the PF. FlowtoChainMapping. Quadrant maps flows to NF chains at its ingress (§3.5.2). Before the Quadrant controller allocates a CPU core to a chain, the Quadrant agent sets up a VF to the chain and pins the chain to its allocated core (§3.4). § Later, the hardware switch, when matching a flow, rewrites the MAC address of the packet to be the one from the corresponding VF’s MAC address. This approach enables outsourcing flow dispatching and provides a flow-level granularity. Memory. 
In Quadrant, a runtime on behalf of an NF chain initializes a file-backed dedicated memory region that holds fixed-size packet structures for incoming packets. It also creates a ring buffer that holds packet descriptors that point to these packet structures. To receive packets from the virtualized NIC, the NF runtime passes this ring buffer to its associated VF so that the NIC hardware can perform DMA directly to the NF runtime’s memory. NF State Management. Stateful NF (e.g., IDS) packet processing depends on both the packet itself and the NF’s current state. Prior work (e.g., statelessNF [58], S6 [164], SNF [140]) has demonstrated that it is feasible to efficiently decouple NF processing from state, because most stateful NFs only have to access remote state 1–5 times per connection/flow [58]. In Quadrant, we leverage this observation to maintain per-NF global state remotely in Redis, while providing efficient caching to mitigate the latency overhead of pulling state from the external store. Quad- rant’s programming model exposes a set of simple APIs for writing a stateful NF:update(flow, val) ‡ Using multi-queue NICs may lead to performance isolation issues that have solutions proposed by recent research to improve fairness and performance [146, 47]. In Quadrant, NICs do not involve complex packet scheduling. Instead, they just dispatch packets based on L2 headers, so simply applying a bandwidth limit to VFs is sufficient to avoid this issue. § Mellanox ConnectX-5 100 GbE NICs and Intel XL710 40 GbE NICs support up to 128 VFs, while Intel E810 100 GbE NICs can support up to 256 VFs. With a large number of VFs, Quadrant can saturate all cores on modern platforms, even for a hundred cores. 58 andread(flow, val), whereflow corresponds to a BPF matching rule. Besides global NF state, Quad- rant’s NF runtime maintains general NF state in a hash table locally so that the user-defined NF can process most packets with state present in its local memory. The runtime makes the state synchronization trans- parent to the NF by interacting with the external Redis service, and ensures that each NF can only access its own state. It processes packets in batches (§3.4), and for each packet batch, the runtime batches all state accesses required by all packets prior to processing. It pulls state from Redis with a batched read request to amortize the per-packet state access delay. Once an NF calls update, its runtime issues a request to the local Quadrant agent to update global state in the Redis service and the packet triggering the state update. The agent releases the packet once the global state has been updated. This is necessary to keep NF state consistent: the packet won’t reach its destination unless the NF’s global state has been updated. This design also avoids doing state synchro- nization operations in the data plane, and minimizes Quadrant’s state synchronization’s impacts on the overall end-to-end latency. In Quadrant, each NF is associated with a unique hash key, which is used to tag NF states in the Redis service. This is useful to recover the state of a single NF instance when migrating flows from it or recovering from failure (§3.4.3). Quadrant’s state consistency mechanism builds on Redis’s consistency guarantee. In Redis, acknowl- edged writes are committed and never lost and reads return the most recent committed write [132]. 
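As a rough illustration of the pattern just described (per-NF keys, a local cache, batched reads from the remote store, and a synchronous, acknowledged write path), the sketch below uses the redis-py client. The class structure and names are assumptions for illustration; in Quadrant the write goes through the per-worker agent rather than directly from the NF runtime.

```python
import redis

class StateProxy:
    """Per-NF state proxy: a local cache in front of a remote Redis store."""

    def __init__(self, nf_key, host="localhost"):
        self.nf_key = nf_key               # unique per-NF key used to tag state
        self.db = redis.Redis(host=host)
        self.cache = {}                    # local hash table of per-flow state

    def prefetch(self, flows):
        # Batch all state reads needed by a packet batch into one round trip.
        missing = [f for f in flows if f not in self.cache]
        if missing:
            pipe = self.db.pipeline()
            for f in missing:
                pipe.get(f"{self.nf_key}:{f}")
            for f, val in zip(missing, pipe.execute()):
                self.cache[f] = val

    def read(self, flow):
        return self.cache.get(flow)

    def update(self, flow, val):
        self.cache[flow] = val
        # Synchronous write: returns only after Redis acknowledges the commit,
        # so the caller can safely release the triggering packet afterwards.
        self.db.set(f"{self.nf_key}:{flow}", val)
```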
In Quadrant, an NF emits a packet only after receiving a state update acknowledgment, and starts processing a migrated flow only after emitting packets from the original core. When an NF updates per-flow state (see also §3.7), this ensures state consistency. This can add some delay, but our experiments demonstrate that, despite this, Quadrant can achieve its performance goals. 59 3.4 CoreAllocationandScheduling In Quadrant, each NF is deployed as an individual container in a Kubernetes cluster. Quadrant dedicates a core to all NFs in a chain; that core serves a traffic aggregate assigned to that chain. When the total traffic exceeds the capacity of a single core, Quadrant spins up another chain instance on another core, and splits incoming traffic between NF chain instances (§3.5). The Quadrant controller manages all NFs via Kubernetes APIs to control the allocation of memory, CPU share, and disk space. 3.4.1 ControllingChainExecution Userspace I/O and shared memory can reduce overhead, but to be able to process packets at high through- put and low latency, Quadrant must have tight control over NF chain execution. As discussed earlier, custom NF platforms use two different approaches. One approach bundles NFs in an NF chain into a single process toruntocompletion in which each NF in the chain processes a batch of packets before moving onto the next batch (as in Metron or NetBricks). This approach ensures high performance and predictability by amortizing overhead over a packet batch. To achieve packet isolation, NetBricks relies on language isolation, so cannot support third-party NFs. The second approach, used by NFVnice and others, is to run each NF in a separate process and use vSwitches for packet forwarding, which ensures isolation but incurs high overhead by copying packets, requiring careful CPU allocation and scheduling (e.g., tuning CFS and using ECN for backpressure in NFVnice). Instead, Quadrant aims for the best of both worlds: it does not force developers to write and release NF code in a specific programming language; it also avoids overheads and complexity brought by approaches that use vSwitches. Quadrant introduces spatiotemporal packet isolation in which NF chains operate on 1) spatially-isolated packet memory regions (as opposed to the typical model in run-to-completion software switches such as BESS, in which all NF chains on a machine run in the same memory) and 2) are temporally isolated through careful sequencing of their execution, which proceeds in a run-to-completion fashion 60 across processes and uses cooperative scheduling mechanisms to hand off control at the natural execution boundary of packet batch handoff (§3.4.2). This isolation ensures that NF chains (which may process different customers’ traffic) cannot see each others’ packet streams or state, and even within a chain each NF maintains private state and only gets to execute (and thus access packet memory) when it is expected to perform packet processing in the chain. Enforcingrun-to-completionscheduling. Quadrant uses a per-core NF Cooperative Scheduler. All NF containers in a chain are assigned to a single core; each runs two processes. The NF process is single threaded and processes traffic. The NF runtime process has an RPC server to control the NF and a moni- toring thread to collect statistics (§3.5). To avoid interfering with packet processing, the monitoring thread runs on a separate core. The runtime is invisible to NF authors. 
To tightly coordinate NF chain execution, Quadrant uses Linux's real-time (RT) scheduling support, and manages NF threads' real-time priorities and schedules them using a FIFO policy. We use this policy to emulate, as described below, NF chain run-to-completion execution in which each NF in the chain processes a batch of packets in sequence.

Scheduling model. In Quadrant's cooperative scheduling, an upstream NF runs in a loop to process individual packets of a given batch, and then yields the core to its downstream NF. This is transparent to the NFs: once the user-defined NF finishes processing, the NF runtime determines whether to transmit the packet batch to the downstream NF; if so, the runtime invokes yield. (To deal with non-responsive NFs, the runtime terminates chain execution if an NF fails to yield after a conservative timeout.) For this, the Cooperative Scheduler has to bypass the underlying scheduler (CFS in our implementation) and take full control of a core. Internally, the scheduler maintains two FIFO queues: a run queue that contains runnable NFs, and a wait queue that contains all idle NFs. It offers a set of APIs that the NF runtime can use to transfer the ownership of a chain's NF processes from CFS to the Cooperative Scheduler. These APIs are used by the Quadrant agent, which runs as a privileged process. NFs themselves cannot access these APIs, so they cannot change scheduling priorities or core affinity. Once a chain is deployed, all NFs are managed by the Cooperative Scheduler, and are placed in the scheduler's wait queue as detached. Once an NF chain switches into the attached state (see below), the Cooperative Scheduler pushes the NFs of this chain into its run queue and ensures that the original NF dependencies are preserved in the run queue. To detach a chain, the Cooperative Scheduler waits for the chain to finish processing a batch of packets, if any, and then moves these NF processes back to the wait queue.

How scheduling works. Once an NF starts, Quadrant's NF runtime reports its thread ID (tid) to the Quadrant agent running on the same worker. Once all NFs are ready, the Quadrant agent registers their tids as a scheduling group (called a sgroup) with the Cooperative Scheduler. Thereafter, the cooperative scheduler takes full control of the NFs. An NF chain starts in the detached state. When the Quadrant controller assigns flows to the chain (§3.5), the Cooperative Scheduler attaches the chain to the core. When the monitoring thread sees that no traffic has arrived for the chain, the scheduler detaches the chain, so the Quadrant controller can re-assign the core. For attach and detach operations, and to schedule NF chain execution, the Cooperative Scheduler has a master thread that serves scheduling requests and runs one enforcer thread on each managed core. The scheduler uses two features of Linux FIFO thread scheduling: 1) high-priority threads preempt low-priority threads, and 2) a thread is executed once it is at the head of the run queue, and is moved to the tail after it finishes. An enforcer thread is raised to the highest priority when enforcing scheduling decisions. When an NF chain is instantiated on a core, the enforcer thread registers the corresponding NF processes as low-priority FIFO threads so that they are appended to the wait queue. When attaching the NF chain, it moves NF processes to the run queue by assigning them a higher priority, and vice versa when detaching a sgroup.
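The attach and detach operations described above map onto standard Linux scheduling APIs. The sketch below is a simplified, single-core approximation (it omits the master and enforcer threads and all error handling); the priority values and the chain_tids argument are assumptions for illustration, and the calls require the privileges that the Quadrant agent already runs with.

```python
import os

CORE = 1          # core managed by the cooperative scheduler (assumed)
WAIT_PRIO = 1     # low FIFO priority: conceptually the wait queue
RUN_PRIO = 10     # higher FIFO priority: conceptually the run queue

def register_sgroup(chain_tids):
    """Pin each NF thread of a chain to the managed core as a low-priority FIFO thread."""
    for tid in chain_tids:
        os.sched_setaffinity(tid, {CORE})
        os.sched_setscheduler(tid, os.SCHED_FIFO, os.sched_param(WAIT_PRIO))

def attach(chain_tids):
    """Raise priorities in chain order, approximating the run-queue insertion order."""
    for tid in chain_tids:                 # upstream NFs are enqueued first
        os.sched_setscheduler(tid, os.SCHED_FIFO, os.sched_param(RUN_PRIO))

def detach(chain_tids):
    """Return the chain's threads to the wait queue (e.g., when no traffic arrives)."""
    for tid in chain_tids:
        os.sched_setscheduler(tid, os.SCHED_FIFO, os.sched_param(WAIT_PRIO))
```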
Operations are done in the sequence that NFs are positioned in the NF chain, so when an NF yields, the CPU scheduler automatically schedules the next NF in the chain. In this model, each worker machine splits CPU cores into two groups. One group is managed by the Cooperative Scheduler, while the other runs normal threads managed by CFS. We use a standard kernel and support different schedulers on different cores. This enables running NF and non-NF workloads on the same machine. Recent research [103, 34] shows that this is critical for achieving high CPU core efficiency for latency-sensitive applications.

Estimating minimum batch size. The Cooperative Scheduler introduces N context switches for a chain with N NFs. Without packet batching, a core may incur significant context-switch overhead. Quadrant estimates the minimum batch size required to bound the context-switch overhead within a fraction p (which is configurable). Let $\tilde{r}$ be the packet rate when running the NF chain in a single thread, i.e., the maximum achievable rate. If $F$ is the processor clock frequency and $S_i$ is the cycle count the $i$-th NF in a chain needs to process a packet, then $\tilde{r} = F / \sum_{i=1}^{N} S_i$. The actual packet rate is given by:

$$r = \frac{F}{\sum_{i=1}^{N} S_i + \frac{N \cdot C_{ctx}}{B}} \qquad (3.1)$$

where $C_{ctx}$ is the context-switch cost and $B$ is the batch size. To bound the overhead to a fraction $p$, we simply solve for the smallest $B$ that satisfies the inequality $\tilde{r}/r \leq (1+p)$.

Figure 3.4: Timeline of packets on a Quadrant worker. A packet is tagged at the ingress. 1. The NIC's L2 switch sends it to the NIC VF associated with the destined chain; the NIC VF DMAs packets to the first NF's memory space. 2. NF 1 processes the packet. 3. After NF 1's packet-processing function returns, the packet is copied to the chain's pktbuf by the NF runtime if there are other NFs; this is necessary to ensure packet isolation, since the NIC's pktbuf should be seen only by NF 1. 3-5. A per-core cooperative scheduler controls the execution sequence of NFs to ensure temporal packet isolation. 6. The final NF asks the VF to send the packet out.

3.4.2 Spatiotemporal Packet Isolation

What is packet isolation? Quadrant targets support for third-party NFs (e.g., a Palo Alto Networks firewall [108]) in multi-tenant settings where each chain may consist of NFs from multiple vendors, and each chain may be responsible for processing a specific customer's traffic. For this, Quadrant must ensure (1) memory isolation: each NF must have its own private memory for maintaining NF state; and (2) packet isolation: within an NF chain, an NF should not be able to access a packet until its predecessor NF has finished processing the packet, and across chains, an NF should not be able to access packets not destined to its own chain.

Achieving Isolation in Quadrant. Since each NF is encapsulated in a container, memory isolation for NF state is trivially ensured. Quadrant uses shared memory to effect zero-copy packet transfers. Figure 3.4 describes how Quadrant achieves packet isolation while permitting (near) zero-copy transfers. The key idea is to use shared packet memory for NFs to avoid packet copying whenever possible, and to control access to the shared memory via cooperative scheduling to provide lightweight isolation. Quadrant allocates each NF chain a separate virtual NIC with SR-IOV, each initialized with a separate ring buffer queue that holds packets for the chain. Upon packet arrival, the NIC hardware directly DMAs packets to this queue.
Ideally, NFs within the chain access the queue directly in the shared memory region, avoiding copying. However, this can violate packet isolation because a downstream NF could access shared memory while the NIC hardware writes to it. To avoid this, Quadrant gives only the first NF in the chain access to the NIC packet queue, and also allocates a second packet queue for each NF chain. This second queue holds packets for downstream NFs in the chain, and is shared among those NFs. Thus, the first NF can access the NIC packet queue and is spatially isolated from other chains and from downstream NFs. It processes each batch of packets and copies it to the second packet queue. Quadrant then temporally isolates the second packet queue across all downstream NFs through coop- erative scheduling. Cooperative scheduling ensures NFs run in the order they appear in the chain, so even though it has access to shared memory, a downstream NF cannot access a batch that has not been pro- cessed by an upstream NF since it will not be scheduled. This permits zero-copy packet transfer between all NFs except the first. For a chain with only one NF, Quadrant omits the unnecessary packet copying and cooperative schedul- ing. The Quadrant NF runtime also applies an optimization that prefetches packet headers into the L1 cache before calling the user-defined NF for processing. This optimization can improve performance (§3.6.2). Finally, Quadrant allocates each chain its own packet queues, and does not share queues across chains. This ensures spatial packet isolation across different chains. 3.4.3 OtherDetails Mitigating startup cost. Auto-scaling may need to allocate a new worker to an NF chain. Cold-starts can incur significant delay, especially since Quadrant uses user-space networking libraries that can incur 500 ms or more to set up memory buffers. This delay can result in SLO violations. Like prior work [86, 65 99] on reducing serverless startup time, Quadrant keeps a pool of pre-deployed NF chains that start in the detached state and do not consume CPU resources. Failureresilience. Quadrant is resilient to NF failures. Each NF monitor tracks liveness of each NF in a chain by tracking the progress of per-NF packet counters. Other Quadrant components like the controller, the agents, and the ingress are instantiated by Kubernetes, which manages their recovery. Once it detects a failed NF, the controller must migrate flows assigned to it to another worker. This is conceptually identical to the flow migration (§3.5.3) discussed above. 3.5 Auto-scalinginQuadrant Quadrantauto-scales (adapts resources allocated to) NF chains in response to traffic volume changes. Quad- rant uses an architecture (Figure 3.3) similar to other cloud services [101, 154]: its controller coordinates with the global ingress and worker machines for auto-scaling. The ingress forwards requests to idle worker instances. The controller manages the pool of instances to handle dynamic traffic while achieving cost ef- ficiency. The controller is aided by a per-worker Quadrant agent that monitors NF performance and works with the cooperative scheduler to enforce scheduling policies. 3.5.1 Monitoringandscalingsignals TheNFmonitor. Monitoring is critical for scaling NF chains. At each NF, the NF monitor collects perfor- mance statistics, including NIC queue length, the instantaneous packet rate, and the per-batch execution time. 
The packet rate is measured as the average processing rate of the whole NF chain and NIC queue length is as reported by the NIC hardware. It also estimates per-batch execution time by recording the global CPU cycle counter at the beginning and the end of sampled executions. A chain’s latency SLO is the upper-bound for the tail (defined as the 99th percentile) end-to-end latency. ∥ ∥ End-to-end packet latency measures the time a packet spends in Quadrant, including both packet processing and network transmission. 66 To avoid interfering with data-plane processing, the NF monitor runs in a separate thread and is not scheduled on a core running NFs. Each NF monitor maintains statistics and sends updates to the Quadrant controller only when significant events occur (to minimize control overhead), such as when queue lengths or packet rates exceed a threshold. Signalsusedbyauto-scalingalgorithms. Quadrant’s scaling algorithm estimatesend-to-endtaillatency and the packet load (defined below) to determine when to scale up or down. To estimate the end-to-end tail latency, Quadrant estimates the p99th duration that a packet spends on a worker (the worker latency), and the p99th network transmission latency. It estimates the worker latency as 2× the p99 per-batch execution time acquired from the monitoring service, as a packet may have to wait for the previous and current batch.We use a function of the link’s throughput for the network transmission delay and use offline profiling to map a worker’s throughput to the p99 network transmission latency. Our end-to-end latency estimation is conservative because (1) the worker latency is the worst-case latency, and (2) the p99th end-to-end latency is less than or equal to the sum of the p99th worker latency and network transmission latency. Quadrant also measures thepacketload as the ratio between the current packet rate and the maximum packet rate. ∗∗ 3.5.2 QuadrantIngress Quadrant’s ingress implements its controller’s load balancing decisions. It adapts existing load-balancers to ensure flow-consistent forwarding decisions. To do this, it pre-fetches from the controller a list of (worker, core) pairs, and their associated load, to assign to new flows. When a new flow arrives, it assigns it to a worker based on the associated load and installs a flow entry. These actions can be implemented either in hardware or software (we have implemented both). ∗∗ Queueing theory notes that the delay can skyrocket as the arrival rate nears the service rate. Quadrant avoids scheduling a chain close to its maximum rate because a small rate increase can significantly increase the latency; it stops assigning more flows to a chain above a given load (e.g., 90%). 67 3.5.3 ScalingofNFChains Quadrant tries to schedule chains on the fewest CPU cores that can serve traffic while meeting SLOs. It does so by (a) carefully managing flow-to-worker mappings, and (b) monitoring SLOs and migrating flows to avoid SLO violations. Managing flow-to-worker mappings. The Quadrant controller uses the per-chain end-to-end latency (§3.5.1) estimation as the primary scaling signal to balance loads among workers to avoid SLO violations. It uses a hysteresis-based approach to control the end-to-end latency under a given latency SLO, while maximizing core utilization. SupposeT slo is the target chain’s SLO. Quadrant uses two thresholds: alower thresholdαT slo ; theupper threshold isβT slo (0<α<β < 1). 
The Quadrant controller only assigns new flows to chains whose estimated end-to-end latency is less than the lower threshold. Of these, it selects the chain with the highest packet load (§3.5.1), thereby ensuring that Quadrant uses the fewest cores. Finally, it stops assigning new flows to a chain whose estimated p99 latency is between the two thresholds.

Migrating flows to meet SLOs. Due to traffic dynamics, a chain's estimated end-to-end latency can exceed the upper threshold; the controller then moves flows away from this chain until its end-to-end latency falls below the lower threshold. Migrating flows reduces the queueing delay. According to Little's law [80], the average packet queueing delay is $d = 1/(r_{max} - r)$, where $r_{max}$ is the maximum packet rate of a chain running on a core, and $r$ is the chain's current packet rate. The slope of the queueing delay curve is:

$$\frac{\delta d}{\delta r} = \frac{1}{(r_{max} - r)^2} \qquad (3.2)$$

Rewriting (3.2) gives the following form:

$$\frac{\delta r}{r} = \frac{\delta d}{d}\left(\frac{r_{max}}{r} - 1\right) \qquad (3.3)$$

where $\delta r / r$ is the packet-rate change ratio, $\delta d / d$ is the latency change ratio, and the rate-adapting term $(\frac{r_{max}}{r} - 1)$ indicates that a larger decrease in packet rate is necessary to decrease the latency by the same ratio when the packet rate $r$ is low. With the above intuition, we set the sum of packet rates $\Delta r$ for migrated flows as a function of the chain's current packet rate $r_{curr}$ and its estimated latency $t_{curr}$. Note that $t_{curr} > \beta T_{slo}$, where $T_{slo}$ is the latency SLO and $\beta T_{slo}$ is the upper latency threshold (§3.5.3). Quadrant uses the lower threshold $\alpha T_{slo}$ as the target latency for the migration, and calculates the sum of migrated flows' packet rates as††:

$$\Delta r = r_{curr} \cdot \frac{t_{curr} - \alpha T_{slo}}{t_{curr}} \cdot \left(\frac{r_{max}}{r_{curr}} - 1\right) \qquad (3.4)$$

Alternatively, Quadrant can migrate flows so that the aggregated packet rate is proportional to the latency change ratio without the rate-adapting term. We evaluate both options in the Appendix. Quadrant's runtime manages the migration of stateful NFs to a new worker. The runtime on the old worker synchronizes NF state with Redis before emitting packets in a batch. When a flow migrates to another worker, that worker's runtime fetches the related state from Redis before processing packets.

Reclaiming idle cores. Finally, when an NF thread becomes idle (all flows previously assigned to it have completed), Quadrant reclaims the assigned core. Quadrant could instead have migrated flows away from underutilized NF chains, but this would have complicated state management for stateful NFs. We leave this optimization to future work.

†† Due to measurement errors, $r_{curr}$ samples (calculated using the chain's packet counter) may be higher than $r_{max}$ (calculated using the amortized per-packet cycle cost). To avoid a negative value, we apply a hard lower bound of 0.25 to the rate-adapting term when $r_{curr} \geq 0.8\, r_{max}$.

3.6 Evaluation

Next we substantiate the claims listed in Table 3.1: Quadrant ensures high performance and meets SLOs, provides NF isolation, supports stateful NFs, and is robust to NF failure, while reusing existing cloud components.

Implementation. Quadrant is built upon OpenFaaS [101], an open-source FaaS platform for hosting serverless functions. OpenFaaS consists of infrastructure and application layers, and uses Kubernetes, Docker, and the Container Registry. Quadrant reuses these APIs to manage and deploy NFs. OpenFaaS uses its gateway to trigger functions, and Quadrant adds an ingress.
Incoming traffic is split at the system gateway; normal application requests are forwarded to OpenFaaS’s gateway, while NFV traffic is forwarded to the Quadrant ingress. OpenFaaS uses a function runtime that maintains a tunnel to the FaaS gateway, and hands off requests to user-defined functions; instead, Quadrant uses the above mechanisms to receive traffic from its ingress (§3.5). Quadrant reuses OpenFaaS’s general framework and relies on a per-worker agent for NF performance monitoring and its cooperative scheduler for enforcing scheduling policies. We quantify Quadrant’s additions in §3.6.1. Experimentsetup. We use Cloudlab [24] and run experiments on a cluster of 10 servers, and configure both DPDK and SR-IOV. Each server has dual-CPU 16-core 2.4 GHz Intel Xeon E5-2630 (Haswell) CPUs with 128 GB memory (DDR4 1866 MHz). To reduce jitter, we disable hyperthreading and CPU frequency scaling. Each server has one dual-port 10 GbE Intel X520-DA2 NIC. Both are connected to an experimental LAN for data-plane traffic. Each machine has one 1 GbE Intel NIC for control and management traffic. 70 Servers connect to a Cisco C3172PQs ToR switch with 48 10 GbE ‡‡ ports and Openflow v1.3 support. The traffic generator and the Quadrant ingress run on dedicated machines. MethodologyandMetrics. Our experiments use end-to-end traffic with 3 canonical chains from light to heavy CPU cycle cost, from documented use cases [69]. Chain 1 is an L2/L3 tunneling pipeline: Tunnel→IPForward; Chain 2 is an expensive chain with DPI and encryption NFs: ACL→UrlFilter→Encrypt; Chain 3 is a state-heavy chain that requires connection consistency: ACL→NAT. Tunnel parses a packet’s header, determines its VLAN TCI value, and appends a VLAN tag to the packet. ACL enforces 1500 access control rules. URL Filter performs TCP reconstruction for client sessions and applies complex string matching rules (e.g., Snort [129] rules) to block connections mentioning banned URLs. Encrypt encrypts each packet payload with 128-bit ChaCha. NAT maintains a list of available L4 ports and performs address translation for connections, assigning a free port and maintaining this port mapping for a connection’s lifetime. Key performance metrics include end-to-end latency distribution and packet loss rate and time-average and max CPU core usage for the test duration. The traffic generator uses BESS [10] to generate flows with synthetic test traffic. For both microbenchmarks and cluster-scale experiments, we run a DPDK-based traffic generator on a dedicated server and collect performance metrics. 3.6.1 QuantifyingReuseofAbstractions Quadrant’s deployability stems from its reuse of existing cloud frameworks and its limited new code. Quad- rant adds code in three categories. The first is code for NFV at the (edge) cloud (independent of Quadrant), 4150 LOC, including for packet processing, monitoring, isolation, SLO scaling, and core reclaiming. The second category contains 1210 LOC to support Quadrant’s specific mechanisms, including isolation with shared memory and SLO-adherent chaining. The third category is 4200 LOC to leverage standard APIs, ‡‡ We also conducted one experiment (§3.6.6) using 40/100 GbE NICs on our own testbed. (Our experiments use servers w/ Inter CPUs and an OpenFlow-enabled network. Cloudlab does not support these for 40/100 GbE.) 71 including run-to-completion scheduling, supporting statefulness and packet processing interfaces, and co- operative scheduling. 
The rest is for CLI and debugging tools, which are nice to have but not necessary. By comparison, OpenFaaS [101] is 345k LOC, OpenLambda [48] is 217k, NetBricks [109] is 31k, Metron [60] is 30k for its control plane, and SNF [140] is 20k. In summary, Quadrant adds a small fraction to existing FaaS systems (2.7% of OpenFaaS). Further, Quadrant uses far fewer lines of code than custom NFV systems because it reuses existing abstractions judiciously, and it only requires about 1k lines of custom code (the second category above).

3.6.2 Performance Comparisons: Isolation

Next, we compare Quadrant against other NFV systems that make different isolation choices. For this experiment, we use chains of many instances of a canonical Berkeley Packet Filter (BPF) [79] NF that parses packet headers and performs 200 longest-prefix matches on packet 5-tuples.§§ Our evaluations vary NF chain lengths as in prior work [60, 73].

Isolation via copying. EdgeOS [127] supports isolation via data copying. We emulate EdgeOS on top of a reimplementation of NFVnice [67] with the same set of mechanisms for packet copying, scheduling notifications, and cache-line optimizations. We use NFVnice's master module to move packets between NF processes. The master module runs as a multi-threaded process with one RX thread for receiving packets from the NIC, one TX thread for transmitting packets among NFs, and one wake-up thread for notifying an NF that a message has arrived at its message buffer. All three threads run on dedicated cores to maximize performance.

§§ Fixing the number of per-packet memory accesses is important for a reason described later. We have also experimented with different values of the number of matches, and omit results for brevity.

Figure 3.5: Throughput with increasing chain length for running an NF chain on a single core.

Isolation via safe languages. NetBricks [109] uses compile-time language support from Rust to ensure isolation among NFs, plus a run-time array bounds check. We reuse NetBricks's open-source implementation.

Results. Figure 3.5 shows the throughput of different isolation approaches. Quadrant outperforms NetBricks (1.21-1.51×) and NFVnice w/ packet copying (1.61-2.31×). NFVnice with packet copying achieves 62% throughput relative to Quadrant with a single-NF chain. As chain length increases, its throughput decreases despite its 3 extra CPU cores for transmitting packets among NFs, because of (a) cross-core packet copy overheads and (b) load imbalance across NFs, since NFVnice tunes scheduling shares for NFs on a single core using Linux's cgroup mechanism. NetBricks suffers from memory access overheads due to array bounds checks; in our experiments, memory accesses are incurred during longest-prefix matches. These overheads become significant when packets trigger complex computations, which explains its drop in performance. To validate this assertion,
Compared to this unsafe-but-fast variant, Quadrant has an overhead that remains at the same level regardless of the chain length: Quadrant achieves a 90.2%-94.2% per-core throughput when deploying a multi-NF chain while providing isolation. Thus, Quadrant pays a 6-10% penalty for achieving isolation. For single-NF chains, we turned off the prefetch-into-L1 optimization described in §3.4.2 in Quadrant’s variant, and found that Quadrant achieves slightly better performance. 3.6.3 PerformanceComparisons: Scaling Quadrant scales chains to meet their latency SLOs. We quantify CPU core usage when deploying chains. Here, we compare Quadrant against Metron [60], a high-performance NFV platform, in the same end- to-end deployment setting. Metron auto-scales core usage, but does not support SLO-adherence. E2 [105] and OpenBox [13] also have the same property, but Metron outperforms them, so we compare only against Metron. Metron does not provide packet isolation, so we do not include it in isolation comparisons. Before each experiment, an NF chain specification is passed to both systems’ controllers to deploy NFs in the test cluster. Metron also uses a hardware switch to dispatch traffic, and has its own CPU scaling mechanism. Unlike Quadrant, it compiles NFs into a single process, and runs-to-completion each chain as 74 a thread. Each Metron runtime is a multi-threaded process that takes all resources on a worker machine to execute chains with no isolation. Results. Across all experiments, both systems achieve a zero loss rate. Thus, we compare two systems by looking at the tail latency, and the CPU core usage when they serve the test traffic (100 million pack- ets). Quadrant can meet the tail latency SLO for all chains. Metron targets zero loss, not SLO adherence. Figure 3.6 plots the CPU core usage as a function of achieved tail latency by both systems. Metron does not adjust its CPU core usage for different latency SLOs, while Quadrant is able to adjust the number of cores used to serve traffic under different SLOs, to trade off latency and efficiency; it dedicates more cores for a stringent SLO. To fairly compare CPU core usage, we select Quadrant’s samples whose tail latency are smaller but closest to Metron’s achieved latency, and compare the time-averaged CPU cores again Metron. For Chain 1, Quadrant achieves 82.7µ s latency using 3.61 cores on average, about 12% more than Metron (they both use the same number of max cores), while Metron achieves 85.4µ s latency. For Chain 2, Quadrant achieves comparable latency, uses about 23% more cores on average (14.38 vs. Metron’s 11.66). Results are similar for Chain 3. Quadrant’s higher core usage results from its support for isolation, its SLO-adherence (both of which Metron lack), and its scaling algorithm (different from Metron’s, §3.5.3). Quadrant incurs multiple context switches in scheduling a chain. With a tight latency SLO, Quadrant uses smaller batch sizes, resulting in a higher amortized per-packet overhead; this is more significant for light chains ( e.g., Chain 3). However, the absolute number of extra cores remains small because such chains run at high per-core throughput. Quadrant’s monitoring may notify users if chains have small batch size due to a stringent SLO; they can relax the latency SLO or proceed with higher overhead. To understand the impact of the scaling algorithm by itself, we port Metron’s scaling algorithm to Quadrant, and implement a variant, called Quadrant-Metron. 
Figure 3.6 shows the achieved latency and CPU core usage for this variant. Like Metron, Quadrant-Metron does not adjust CPU core usage for different latency SLOs; for Chain 1, Quadrant achieves 128.4 µs, but Quadrant-Metron achieves 173.7 µs latency and uses 16% more cores on average. Similar results hold for the other chains, and validate our decision to design a new scaling algorithm instead of using Metron's.

Figure 3.6: Core usage of NF chains implemented in Quadrant and Metron as a function of achieved tail latency (Chains 1-3; average and maximum core counts for Quadrant, Metron, and Quadrant-Metron).

3.6.4 Validating SLO-adherence with Scaling

Methodology. We evaluate Quadrant's SLO-adherence in scaling different chains under traffic dynamics. For each experiment, we run a DPDK-based flow generator to generate traffic at 10 flows/s with a median packet size of 1024 bytes, which we selected through trace analysis [22]. The traffic generator gradually increases the number of flows and reaches the maximum throughput after 60 seconds, with a peak load of 18 Gbps.¶¶ The traffic generator then stays steady at the maximum rate until 100 million packets are sent. All traffic enters the system through a switch. We evaluate end-to-end metrics, including the tail latency and the time-averaged number of cores for deploying chains.

¶¶ This traffic volume is similar to that used by prior work [58].

Figure 3.7: End-to-end tail latency achieved by NF chains in Quadrant as a function of latency SLO (Chains 1-3; each panel also shows the latency SLO and the lower latency threshold for reference).

SLO-adherence for different NF chains. Quadrant scales chains to meet latency SLOs. Quadrant estimates a chain's tail latency, and uses it as a knob to control the end-to-end delay for packets being processed by the chains. We evaluate Quadrant's ability to control the end-to-end tail latency under different SLOs with all test chains.

Results. Figure 3.7 shows the end-to-end tail (p99) latency achieved by Quadrant as a function of the latency SLO. For each chain, Quadrant meets the tail latency SLO for all tested SLOs. At a higher latency SLO, both the lower latency threshold and the tail latency are higher. We see the cause of this behaviour: Quadrant's controller migrates flows from a chain when its estimated latency exceeds the upper latency threshold, and it sets the lower threshold as the latency target (§3.5.3). This feature aligns with the trade-off between latency and efficiency: for a given traffic input, tolerating a higher tail latency results in a higher per-core throughput, which means Quadrant can devote fewer CPU cores to serve traffic. This feature is important in the cloud context: Quadrant can use the right level of system resources to meet the latency SLO. We note that Chain 1 and Chain 2 have tail latency close to the lower latency thresholds.
Chain 3 behaves differently: its tail latency stops increasing after its latency SLO is greater than 130 µs, because Chain 3 deployments have reached the per-core packet load limit. In §3.5, Quadrant avoids executing chains close to their max per-core packet rate. For these cases, the per-core rate is high enough that it is less beneficial to pursue a higher per-core efficiency at the cost of making the end-to-end latency unstable.

SLO-adherence under traffic dynamics. It is important that Quadrant works for different traffic inputs, so we verify Quadrant's ability to control latency with such inputs. To do so, we deploy chains with a fixed latency SLO to see whether Quadrant can control latency with traffic dynamics. We gradually increase traffic by randomly accelerating a subset of flows by 30% of their packet rates for half of a flow duration. We vary the percentage of flows with an increased packet rate, and measure Quadrant's latency performance.

Results. We show the tail latency under traffic inputs with different subsets of flows with an increased packet rate (Figure 3.8). For all these cases, Quadrant is able to meet the tail end-to-end latency SLO; in fact, all groups achieve similar latency results regardless of the input.

Figure 3.8: End-to-end tail latency achieved under different levels of traffic dynamics. Latency SLO is 70 µs for all groups.

3.6.5 Quantifying Isolation Overhead

Spatial isolation overhead. Spatial isolation overhead results from SR-IOV; we compare running a test NF with and without SR-IOV enabled. Our test NF is an Empty module, so it only involves swapping the dst and src Ethernet addresses of a packet to send it back.

Results. Figure 3.9 shows that running with SR-IOV adds only 0.1 µs latency for both 80-byte and 1500-byte packets. We also find that the maximum throughput achieved by an SR-IOV enabled NIC is ≥ 99.6% of the throughput achieved by a NIC running in a non-virtualized mode.

Figure 3.9: End-to-end latency CDF with SR-IOV on and off.

Temporal isolation overhead. NFs in a chain hand off packet ownership. Packet isolation requires that an NF in a chain can only acquire packet ownership after its predecessor finishes processing the packet (§3.4.2). To quantify this temporal isolation overhead, we evaluate using a multi-NF chain.

Results. Figure 3.10 shows the p50 and p99 CPU cycle cost for copying one packet of different sizes. The median cost to copy a 100-byte packet is 247 cycles and, for a 1500-byte packet, 467 cycles. This small difference is due to the cost of allocating a packet struct.

Figure 3.10: Per-packet cost of copying packets of different sizes.

Scheduling NFs cooperatively involves context switches between NF threads that belong to different NF processes. We profile the average cost of context switches between NFs: 2143 cycles per context switch. Note that this context switch cost is amortized among the batch of packets in each execution. For a default 32-packet batch, the amortized cost is only 67 cycles per packet. This cost is 27% and 14% of the cost of copying a 64-byte and a 1500-byte packet, respectively. Further, it is only 31% of the cost of forwarding a packet via a vSwitch with packet copying, as in EdgeOS. (Quadrant has zero software packet switching cost because it uses the ToR switch and the NIC's L2 to dispatch packets to different chains.)

Using munmap/mmap for transferring packet ownership. For isolation we could have used munmap and mmap to explicitly manage the ownership of the shared packet buffer. munmap requires 4083 cycles, and mmap 8495 cycles. With all packets placed in the same memory page, we need one munmap and mmap to transfer the page to a different process. This costs significantly more (5.87×) than the context switch, justifying our approach to isolation.
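To make these per-call costs concrete, the listing below is a minimal microbenchmark sketch, not Quadrant's actual profiling harness, that estimates the cycle cost of mapping and unmapping a 2 MB anonymous buffer using rdtsc. The buffer size, iteration count, and use of anonymous (rather than shared packet-pool) memory are illustrative assumptions, and absolute numbers will vary with hardware and kernel version.

// Minimal microbenchmark sketch (not Quadrant's profiling code): estimate the
// CPU-cycle cost of mmap/munmap on a 2 MB anonymous buffer, the operations
// considered above for transferring packet-buffer ownership between processes.
#include <sys/mman.h>
#include <x86intrin.h>   // __rdtsc()
#include <cstdint>
#include <cstdio>

static inline uint64_t cycles() { return __rdtsc(); }

int main() {
  constexpr size_t kLen = 2 * 1024 * 1024;   // one 2 MB region of packet memory
  constexpr int kIters = 10000;
  uint64_t map_total = 0, unmap_total = 0;

  for (int i = 0; i < kIters; i++) {
    uint64_t t0 = cycles();
    void* buf = mmap(nullptr, kLen, PROT_READ | PROT_WRITE,
                     MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    uint64_t t1 = cycles();
    if (buf == MAP_FAILED) { perror("mmap"); return 1; }
    static_cast<char*>(buf)[0] = 1;          // touch the mapping before unmapping
    uint64_t t2 = cycles();
    munmap(buf, kLen);
    uint64_t t3 = cycles();
    map_total += t1 - t0;
    unmap_total += t3 - t2;
  }
  std::printf("avg mmap: %lu cycles, avg munmap: %lu cycles\n",
              (unsigned long)(map_total / kIters),
              (unsigned long)(unmap_total / kIters));
  return 0;
}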
3.6.6 Scaling to 40 and 100 GbE NICs

Cloudlab only supports OpenFlow for 10 GbE NICs, so most of our experiments use those. To show that Quadrant scales to 40/100 GbE NICs, we set up a separate two-node cluster: one node as the traffic generator and the other one as the Quadrant worker. The traffic generator is a dual-socket 20-core 2.2 GHz Xeon E5-2630. The Quadrant worker is a dual-socket 16-core 1.7 GHz Xeon Bronze 3106. Both servers have one 100 Gbps single-port Mellanox ConnectX-5 NIC. The worker has one additional 40 Gbps single-port Intel XL710 NIC. They connect to an Edgecore 100BF-32X (32x100G) switch. Experiments in §3.6.3 and §3.6.4 use this setup.

Figure 3.11: End-to-end tail latency and CPU core usage achieved by Quadrant (Chain 1) as a function of latency SLO, for (a) the Intel XL710 40G NIC and (b) the Mellanox ConnectX-5 100G NIC.

Results. Figure 3.11 shows that, for Chain 1, Quadrant is able, as before, to adjust the number of cores used to serve traffic for different latency SLOs, and to utilize all available cores on the worker to meet stringent SLOs, both for 40 GbE and 100 GbE. Chain 3 behaves similarly to Chain 1, but Chain 2, because it is CPU heavy, needs more cores than we have to saturate the NICs, so we have omitted that experiment. Overall, these experiments show that Quadrant's design scales seamlessly at higher NIC speeds.

3.6.7 Cooperative Scheduling

Do we need the Cooperative Scheduler? Quadrant's cooperative scheduler enables packet isolation, even for third-party NFs. A weaker form of isolation, assuming that NFs can be trusted, can be achieved using the Linux CFS scheduler, together with explicit handoff from one NF to another using shared memory (an NF sets a flag to indicate packets are ready to be processed by the next downstream NF). Unfortunately, this weaker alternative is also slower (Table 3.2); Quadrant w/ Cooperative Scheduler outperforms Quadrant w/ CFS by 40.7-95.2%. Note that the latter still outperforms NFVNice w/ pkt copy, as Quadrant does not require expensive cross-core packet copying for each inter-NF hop.

NF Chain Length            1     2     3     4     5     6
Quadrant (coop. sched.)  4700  2200  1520  1180   960   810
Quadrant (CFS)           3340  1530   980   680   520   415
NFVNice w/ pkt copy      2920  1230   815   545   425   350
Table 3.2: Per-core NF chain throughput (kpps) with and without cooperative scheduling.
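For concreteness, the listing below sketches the flag-based handoff used by this weaker CFS variant. It is an illustrative single-process sketch in which threads stand in for the separate NF processes; the 32-entry batch and the field names are assumptions, not Quadrant's actual shared-memory layout, and a real deployment would place the flag and batch descriptor in a region mapped by both NF processes.

// Minimal sketch of the flag-based handoff in the weaker CFS variant: the
// upstream NF publishes a batch and sets a ready flag; the downstream NF waits
// (yielding to the scheduler) until the flag is set, then consumes the batch.
#include <atomic>
#include <array>
#include <cstdio>
#include <thread>

struct HandoffSlot {
  std::array<int, 32> batch{};          // stand-in for 32 packet descriptors
  std::atomic<bool> ready{false};       // set by upstream, cleared by downstream
};

int main() {
  HandoffSlot slot;

  std::thread upstream([&] {
    for (int round = 0; round < 4; round++) {
      for (int i = 0; i < 32; i++) slot.batch[i] = round * 32 + i;  // "process" packets
      slot.ready.store(true, std::memory_order_release);            // hand off ownership
      while (slot.ready.load(std::memory_order_acquire))            // wait until consumed
        std::this_thread::yield();
    }
  });

  std::thread downstream([&] {
    for (int round = 0; round < 4; round++) {
      while (!slot.ready.load(std::memory_order_acquire))           // wait for a batch
        std::this_thread::yield();                                  // let CFS run others
      int sum = 0;
      for (int v : slot.batch) sum += v;                            // "process" the batch
      std::printf("round %d consumed batch, checksum %d\n", round, sum);
      slot.ready.store(false, std::memory_order_release);           // return ownership
    }
  });

  upstream.join();
  downstream.join();
  return 0;
}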
Cache and TLB effects. Cooperative scheduling involves context switches between NF processes in a chain. It can also flush caches and TLBs; we conduct an experiment to quantify these. We run the same test chain, with 5 BPF modules, as in §3.6.2, with four experimental groups: 1) Quadrant: the vanilla deployment w/o adaptive batch optimization; 2) Local mem: the vanilla deployment that operates on one dummy packet in the shared memory region; 3) Dummy NF: a chain of dummy NFs that do not process packets, but simulate NF cycle costs; 4) Single thread: a chain that runs in a single thread. The traffic generator produces traffic (1024 B packets) to saturate the chain's NIC queue so that each chain runs at a batch size of 32, the NIC's default batch size. The TLB and cache misses are measured as the average over 5 measurements, each covering a 15-second execution.

Metric                       Quadrant      Local mem     Dummy NF      Single thread
Per-packet cycles            2846          2746          2730          2592
Chain delay (cycles), p50    121536        116232        116008        85980
Chain delay (cycles), p99    128116        122324        117892        87492
dTLB misses [count]          72,864,218    68,259,827    61,251,661    1,185,591
dTLB misses [%]              0.52%         0.48%         4.07%         0.00%
iTLB misses [count]          11,542,564    12,325,256    9,615,381     591,737
iTLB misses [%]              478.18%       488.28%       379.31%       182.39%
LLC cache misses             18,923,855    6,710,255     6,699,896     12,963,766
L1 dcache misses             508,578,460   417,298,551   333,262,806   327,837,034
L1 icache misses             41,272,281    37,127,027    30,856,178    17,568,605
Table 3.3: Overheads under isolation variants.

Results. Table 3.3 shows NF runtime statistics. For all multi-process groups, we see higher iTLB and dTLB misses. As shown, the number of dTLB misses is less than 1% of dTLB hits for cases that run a non-dummy NF, though dTLB misses are less important for an NF's performance. (Quadrant uses 2 MB hugepages for packet memory for each NF chain. Increasing this to 1 GB does not alter Quadrant's max throughput as there are few dTLB misses.)

All multi-process groups see higher iTLB misses compared to the single-thread case because NF processes do not share code in memory. Local mem and Dummy NF perform similarly in terms of per-packet cycle cost (and the number of cache misses) because Local mem processes one packet that resides in the chain's local memory and is likely to benefit from the L3 cache. Quadrant has a slightly higher per-packet cost compared to the other two multi-process cases. We find that the extra cost comes only from the first NF, which copies incoming packets. The 2nd-4th NFs in groups 1)-3) have the same cycle cost (509 cycles / packet). These NFs benefit from L3 caching because the first NF's runtime loads packets into the cache when copying them from the NIC's buffer to the per-chain packet buffer.

In the above four cases, two major differences explain the per-chain cycle cost: a) iTLB misses when deploying as a multi-process chain, and b) L3 cache misses when processing network traffic. Quadrant incurs both; Local mem and Dummy NF only the former, and Single thread only the latter. We calculate the cycle cost of each for the 5-NF chain: for iTLB misses it is 254 cycles / packet (or 50.8 cycles / packet / hop), and cache misses add 100 cycles / packet. The former is extra overhead of a context switch, which could be reduced by tagged TLBs, while cache misses are unavoidable. Overall, the amortized TLB overhead is relatively small compared to the context switch itself.

Batching to reduce overhead. Quadrant uses batching to amortize context switching overheads, and estimates an appropriate batch size for an NF chain (§3.4). Here, we show Quadrant's batching by comparing the maximum per-core throughput produced under Quadrant's batching and other schemes that use a fixed batch size for different chains.

NF Chain   Quadrant   Fixed-small (batch=32)   Fixed-medium (batch=128)   Fixed-large (batch=512)
Chain 1    306        246                      305                        260
Chain 2    116        106                      116                        62
Chain 3    322        260                      313                        142
Table 3.4: Per-core chain throughput (kpps) under different batch settings.

Results.
Table 3.4 demonstrates that Quadrant’s batch setting performs significantly better than Fixed-small (which uses a small batch size of 32) andFixed-large (batch size of 512) batch settings, and always produces a throughput that matches the highest throughput among all experimental groups. Surprisingly, using a large batch size decreases per-core throughput of NF chains. 3.7 Discussion Quadrant can leverage prior work to scale and better support stateful NFs. Consistencymodel. While Quadrant is consistent for the per-flow state, it needs to be extended to ensure consistent updates to global states (e.g., per-device packet counts in 4G/5G Evolved Packet Core (EPC)). 84 Prior work (S6 [164]) has explored mechanisms to ensure eventual consistency of global states and can be incorporated into Quadrant. Ingress scalability. Quadrant’s ingress runs as software, and installs rules in hardware switches, for when OpenFlow becomes available. Fastpass [116] has demonstrated that software-based per-flow routing is feasible and efficient even at the data center scale. SNF [140] used software ingress, and it also showed that its implementation incurs negligible latency and can scale out dynamically to adapt to traffic volume. Quadrant can incorporate these to scale better. Hardwareingress. Modern hardware switches have enough resources for per-flow rules (e.g., NoviFlow switches have 225K entries [60]). Finally, in a multi-tenant cloud environment, Genesis [147] has explored managing flow space of hardware switches for isolation. 85 Chapter4 Ironside: Sub-millisecondLatencySLOsforNFV 4.1 Introduction Network Function Virtualization (NFV) enables processing packets in software, often using a chain of Network Functions (NFs). NFs can filter traffic, encrypt it, modify headers or payload in transit, and so on. A long line of work has shown that NFV improves network manageability over hardware middleboxes [136, 105, 109, 148, 67, 149, 152]. More recent work has demonstrated that NFV is performant — on commodity hardware, NFV systems can process complex NF chains at line rates [60, 9, 172, 140, 152]. Spurred by these developments, cloud providers have started to offer NFV capabilities to customers. This is most visible at the edge of today’s distributed clouds [2, 1, 8], where cloud racks placed at the edge of a cloud provider’s global network expose to customers the ability to process packets. In this setting, the latency of packet processing must be as small as possible; at the edge, network latencies are low (in nearly 50 countries, RTTs to the cloud are less than 20 ms [95]). Execution of NF chains is pure overhead, and at the edge, even latency of a few milliseconds can represent significant overhead, especially since increased latency can adversely impact revenue. As processor speeds increase, many NF chains can process a single packet within tens of microseconds. Yet, today’s NFV systems exhibit 99th percentile (p99) latencies on the order of several milliseconds or tens of milliseconds (§3.6). As we describe below, much of the increased latency comes from queueing. In this 86 chapter, we ask: is it possible to design an NFV system that offers a sub-millisecond p99 latency service-level objective (SLO)? To our knowledge, no other work has explored this capability. A primary challenge we face is the flow-affinity requirement for NFV (§4.2.1). NFV systems assign flows to cores so that all packets in a flow are processed by the same core. 
This avoids expensive cross- core synchronization for stateful NFs which update shared states on every packet. To ensure low latency while preserving flow-affinity, NFV systems auto-scale core allocations [60, 140, 152] — as traffic increases or decreases, they scale up or down (respectively) the number of cores allocated to traffic. This ensures that traffic can be processed in a timely manner, while at the same time minimizing core usage. The latter is especially important in the cloud context, where cores not used for packet processing can be used to process revenue-generating application workloads. Auto-scaling algorithms in today’s NFV systems fail to react to two kinds of bursts seen in real traffic that can significantly increase queueing latency (§4.2.2). The first results from instantaneous packet rates in a flow that exceeds the processing capacity of a single core. The second is a burst of a large number of flow arrivals that also exceeds core capacity since each new flow requires additional setup overhead for stateful NFs. These bursts can be short-lived (10 ms or shorter) and can impose queueing latencies of several milliseconds (§4.2.2). Existing NFV auto-scaling approaches fail to achieve low latency in real traffic for three reasons (§4.2.3): they make core allocation decisions based on observed packet rates alone, so cannot account for flow arrival bursts; they use time-averaged estimates that cannot capture short-timescale bursts; they make core allocation decisions at longer timescales (of 100 ms to 1 s) than the duration of bursts in real traffic. Contributions. This chapter describes the design and implementation of Ironside, a rack-scale NFV sys- tem that achieves sub-millisecond latency SLOs while ensuring high CPU efficiency (§4.3). Its contributions include: 87 • A novel hierarchical multi-scale flow-to-core mapping strategy that distributes core allocations across different spatial scales (ingress, server, core) and different temporal scales (seconds at ingress to microsec- onds at core (§4.3.1)). This ensures that Ironside can adapt to traffic changes at different timescales while ensuring scalable core allocation decisions by making coarse-grained decisions at the ingress (which sees the most traffic) and fine-grained decisions at the core (which sees relatively lower traffic). • A more accurate core capacity estimation technique for NFs that takes instantaneous flow counts as well as packet rates into account. • A lightweight mechanism to recruitauxiliary cores to handle bursts that can potentially violate latency SLOs. • A fast algorithm to remap RSS buckets at a server to core to ensure that no core experiences sustained overload. Experiments using traffic traces from a backbone as well as from within an ISP (§4.4) show that Ironside can support p99 latency SLOs on the order of 100-500µ s while the state-of-the-art NFV systems [60, 152, 16] exhibit 10× higher p99 latency SLOs. Moreover, Ironside’s CPU efficiency is comparable to, or better than, all of these systems. An ablation study shows that each of its design decisions contributes significantly to Ironside’s performance. 4.2 Background,Motivation,andApproach To motivate the problem that Ironside addresses and the approach it takes, we begin with some background. 4.2.1 Background DeploymentScenario. Many cloud providers usedistributedcloud designs with rack deployments at the edge of the network [2, 1, 8]. 
These rack deployments are either within the premises of an enterprise or at the ingress of a carrier (e.g., a telco). Each rack contains a ToR switch and many multi-core servers. Each 88 server core can run edge applications, such as those envisioned for multi-access edge computing (MEC). In addition, CPU cores can also run software NFs: indeed, cloud provider software in these deployments explicitly supports software NFs. In this chapter, we target NF processing (i.e., packet-level processing) in these deployments. NFs and NF chains. NFs may process packets in a variety of ways: translate network addresses (e.g., NAT), monitor and filter traffic ( e.g., using ACLs or applying advanced pattern matching techniques on packet payloads), encrypt traffic, cache web contents, and so forth. Unlike NFs, microservices and other applications operate at the bytestream level or the request level, not on individual packets. In addition, NFs often maintain per-flow state when processing all packets from each individual flow. An NF chain is a sequence of NFs that process packets belonging to flows. For example, an NF chain may first apply an access control list to a flow, then filter URLs using deep packet inspection, and finally encrypt packets in the flow (§3.6 contains other examples of NF chains.) A single edge rack may execute multiple NF chains to serve different classes of traffic. For each NF chain, an operator defines which traffic class (or traffic aggregate) the chain should process. LatencySLO. In the edge rack setting, it is desirable to control the latency incurred in processing a packet using an NF chain. This enables cloud providers to meet end-to-end latency targets for applications. This bound, called a latency SLO, is expressed in terms of a high percentile (e.g., the p95 or p99 percentile, denoted p95 or p99) of the latency experienced by a packet traversing the NF chain. This latency consists of two primary parts: the time taken by the NF chain to process a packet, and the queueing delay experienced by packets. Flow Affinity Requirement. Many NFs are stateful: they create and update the internal state when processing packets. Most stateful NFs maintain per-flow state [58]: an NF creates new states when it encounters a new flow, and every packet of the flow can potentially update the per-flow state. If packets of a flow are processed by different cores, NFs can incur significant synchronization and cache coherence 89 overhead to ensure consistent updates to the per-flow state across cores. For this reason, NF systems usually preserve flow-affinity : to the extent possible, packets of a flow are processed by a single core. When a flow migration happens, all related NF states associated with the flow need to be migrated to the new core before the core can start processing the flow. Auto-scaling. NFV systems like Metron [60], RSS++ [9], Quadrant [152], and Dyssect [16] assign NFs or NF chains to individual cores. In the settings that these papers consider, as in ours, the volume of traffic processed by an NF chain generally exceeds what a single core can handle, so each NF or NF chain is allocated more than one core. These systemsauto-scale CPU core allocations: they dynamically change the number of CPU cores as- signed to an NF chain based on traffic, all without requiring human intervention. 
Specifically, auto-scaling (a) dynamically tracks increases and decreases in traffic and (b) correspondingly increases or decreases the number of cores assigned to each NF chain, with the aim of using the fewest cores necessary to process the traffic for that NF chain. This permits CPU efficiency, since cores not used for packet processing can be used for application or background tasks.

Existing systems auto-scale core usage with different goals. Metron [60] auto-scales to ensure enough cores to handle the current traffic while minimizing packet losses due to any overloaded cores. Quadrant [152] auto-scales to ensure a target p99 latency SLO, while RSS++ [9] and Dyssect [16] auto-scale to ensure a target p50 median latency. (Quadrant achieves such SLOs, but on synthetic, non-bursty traffic; in §4.4, we show that, on real traces, its p99 latencies can be as high as several milliseconds.) Our work focuses on auto-scaling for ensuring p99 sub-millisecond latency SLOs.

4.2.2 Challenges

Two factors make auto-scaling challenging. The first is the latency SLO. If an NFV system allocates too few cores, packets can experience queueing, and significant SLO violations can occur. If it allocates too many, CPU efficiency can suffer. The second is flow affinity. An auto-scaling solution must not violate flow affinity; doing so would complicate NF design and incur significant performance penalties.

Beyond this, bursty traffic presents a challenge for efficient auto-scaling. When a traffic burst arrives, if an NF chain does not have an adequate number of cores, packets in the burst can encounter a queueing delay. If the burst is small, or if the latency SLO is loose, SLO violations may not occur. However, we find these extreme traffic cases do exist in real-world traffic, and they cause significant SLO violations.

Superbursts. In real packet traces, two types of large bursts (superbursts) can occur that can result in SLO violations. To our knowledge, we are the first to identify superbursts as a challenge for core allocation in NF systems.

NF Chain          Backbone    AS
Light (5 µs)      831.3       53.8
Medium (10 µs)    2,690       578
Heavy (20 µs)     6,648       3,584
Table 4.1: A whale (i.e., a single flow whose packet processing requirement exceeds the capacity of a core) can inflate p99 latency. Each cell reports the p99 latency, in µs, for the corresponding NF chain and trace.

NF Chain          Backbone    AS
Light (5 µs)      4,029       3,062
Medium (10 µs)    3,830       14,241
Heavy (20 µs)     4,671       14,558
Table 4.2: Many minnows (i.e., non-whale flows) can inflate p99 latency. Each cell reports the p99 latency, in µs, for the corresponding NF chain and trace.

Whales. A whale is (a part of) a single flow whose packet processing requirements exceed the capacity of a single core. For example, suppose an NF chain requires 10 µs to process a packet; if more than 10^3 packets arrive on a single flow within 10 ms, this would exceed the core's processing capacity during that time window. Whales arise because of the flow-affinity constraint in NF systems.

Packet traces captured in the wild contain whales. To show this, we use two packet traces: a backbone trace [15], and one collected in a large autonomous system (AS) [87]. For these two, we computed the p99 latency resulting from processing each flow on a separate core using 3 different types of NF chains: a light chain that requires 5 µs of processing, and medium and heavy chains that require 10 and 20 µs respectively.
To do this, we use a packet-level discrete simulation framework which simulates a rack-scale NFV system, 91 including one ToR switch and many multi-core servers. It takes a realistic traffic trace as input, replays the trace and directs traffic to the simulated cluster running NF chains. In this setup, since every flow is processed by a single core, one expects the p99 latency to be close to the NF chain’s processing time. However, we find the overall p99 latency to be more than two orders of magnitude higher in some cases (Table 4.1): 6.6 ms for the backbone trace and 3.6 ms for the AS trace for the heavy chain. This indicates that in both those traces, one or more flows exhibit packet arrivals that, in the short term, exceed core capacity and cause significant queueing that affects the overall latency. Minnows. Minnows represent many active flows at a single core whose collective processing require- ments exceed the capacity of a single core. During auto-scaling, minnows can cause tail latency SLO violations even in the absence of whales. This is because: (a) if the corresponding NF chain is stateful, the number of flows (possibly bursty ones) with packet arrivals within a short time period can be significantly larger than the median; (b) the state creation overhead for each flow can overwhelm core capacity. Real packet traces also contain minnows. Our methodology to demonstrate this uses the same two traces as described above, but: (a) uses flow-hashing (similar to these used in RSS-based systems) to dis- tribute flows to cores, (b) assigns as many cores as necessary, so that the p50 latency for the entire trace is less than 10× the NF processing time and (c) shapes each flow, so that any burstiness is only due to flow arrivals, not packet arrivals. Ideally, because traffic is shaped, one would expect the p99 latency to be close to the processing time. However, as Table 4.2 shows, the p99 latency is much higher (nearly 14.6 ms for the heavy chain on the AS trace). This shows that having minnows in traffic can adversely impact an NFV system’s tail latency performance and thus cause latency SLO violations. 92 4.2.3 Approach Shortcomings of Existing Approaches. None of the existing NFV systems [152, 60, 16] can maintain low p99 latency SLOs in the presence of superbursts. We demonstrate this in §3.6 using experiments on a cluster testbed using real traces, but intuitively this is because: • They use only packet counts (or queue lengths) to make core reallocation decisions. This does not consider minnow superbursts. • Some of them make core reallocation decisions at the traffic ingress using smoothed traffic statistics collected from individual servers. Smoothed traffic statistics cannot capture traffic demand shifts at short timescales. Smoothing introduces delay in the measurement-reaction loop and prevents the ingress from reacting quickly to transient bursts at individual CPU cores. • All of them make core reallocation decisions at a relatively coarse timescale (of 100 ms to 1 s). This is not enough if the p99 latency SLO is, say 500µ s. To meet the latter SLO, the system would need to detect and react † to potential latency violations at a finer timescale than 500 µ s, or conservatively over-allocate cores across longer time intervals, sacrificing CPU efficiency. Ironside’s Approach. Ironside reactively scales core allocation to achieve sub-millisecond p99 latency SLOs for traffic containing superbursts while still ensuring high CPU efficiency. 
It does this using two key ideas: hierarchical multi-scale allocation at different spatial and temporal scales, and the use of flow counts, in addition to packet counts or rates, to estimate core capacity. In the next section, we describe these in detail.

4.3 Ironside Design

This section begins with a high-level overview of Ironside before describing its algorithms in detail.

† We cannot predict when minnows or whales occur. Proactive approaches such as splitting flows across cores (for whales) or conservatively allocating cores can result in inefficient CPU usage and/or increased processing overhead. Thus, we focus on reactive strategies: detecting whales and minnows as quickly as possible, and quickly making core re-allocation decisions.

4.3.1 Overview

Goal. Ironside is a rack-scale NFV orchestration system that simultaneously seeks to achieve high CPU efficiency (measured by core-hours used) while ensuring sub-millisecond p99 latency SLOs.

Assumptions. (1) Users of Ironside (e.g., network operators) specify a set of NF chains to run on the rack. Each chain handles a traffic aggregate (i.e., a group of flows) defined by the operator, who also specifies a p99 latency SLO for the chain. Ironside aims to satisfy this SLO for flows processed by the NF chain. (2) Traffic aggregates are large enough that Ironside can dedicate a server to handle a chain; Ironside targets the enterprise edge or ISP ingress, where traffic volumes can be significant. (3) In most cases, an NF chain runs-to-completion [152] on a core; run-to-completion has been shown to be a low-overhead execution strategy for NF chains. When a core doesn't run NF chains, it can execute other edge applications.

The primary challenge. Given the flow-affinity constraint for NF processing, the primary challenge in the design of Ironside is to determine a policy (or set of policies), together with associated mechanisms, for mapping flows to cores such that Ironside can achieve the goal described above. As discussed in §4.2.3, purely ingress-based allocation, using packet statistics, at the timescale of 100s of milliseconds, cannot work for Ironside. This begs the question: where should allocation decisions be made, and at what timescales?

Hierarchical Multi-scale Allocation. The key idea in Ironside is to map flows to cores at three different spatial scales (ingress, server, and core), and at three different temporal scales (seconds, 100s of milliseconds, and 100s of µs, respectively), each with different objectives (Figure 4.1):

• At the ingress mapper, Ironside assigns flows to servers with the aim of minimizing the number of servers dedicated to an NF chain.

• Ironside's server mapper, on each Ironside server running NF chains, attempts to use the fewest cores with high utilization, allowing Ironside to handle or adapt to sustained 100s-of-millisecond timescale bursts. It uses dedicated cores for NF processing.

• At each core, Ironside's core mapper reactively mitigates µs-scale bursts by recruiting auxiliary cores from a pool.

Figure 4.1: Ironside's hierarchical multi-scale allocation maps traffic to cores at three different spatial scales and temporal scales.

An auxiliary core is generally used for running other applications (e.g., latency-insensitive 5G components, IoT computations, or other data processing tasks), but can be temporarily recruited for NF chain processing by preempting these applications.
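To make the role of auxiliary cores concrete, the listing below is an illustrative-only sketch of an auxiliary core's control loop: it runs application work until NF work is posted, drains that work, and then resumes the application. The names (Mailbox, WorkItem, run_app_task, run_nf_chain) are hypothetical stand-ins, not Ironside's actual interfaces.

// Illustrative-only sketch of an auxiliary core's control loop (not Ironside's code).
#include <atomic>
#include <cstdio>
#include <deque>
#include <mutex>
#include <optional>
#include <utility>

struct Packet { int id; };
struct WorkItem { std::deque<Packet> backlog; };    // flows/packets migrated to this core

// Single-slot mailbox through which a dedicated core's mapper posts NF work.
struct Mailbox {
  std::mutex mu;
  std::optional<WorkItem> pending;
  void post(WorkItem w) {
    std::lock_guard<std::mutex> g(mu);
    pending = std::move(w);
  }
  std::optional<WorkItem> take() {
    std::lock_guard<std::mutex> g(mu);
    std::optional<WorkItem> w;
    w.swap(pending);
    return w;
  }
};

// Hypothetical stand-ins for the edge application and the NF (sub)chain.
void run_app_task() { /* latency-insensitive application work */ }
void run_nf_chain(const Packet& p) { std::printf("NF chain processed packet %d\n", p.id); }

// The auxiliary core runs the application until NF work is posted, drains it,
// then resumes application processing.
void auxiliary_core_loop(Mailbox& mbox, const std::atomic<bool>& stop) {
  while (!stop.load(std::memory_order_relaxed)) {
    if (auto work = mbox.take()) {
      for (const Packet& p : work->backlog) run_nf_chain(p);
      continue;                      // check for further NF work before resuming the app
    }
    run_app_task();
  }
}

int main() {
  Mailbox mbox;
  // In a real deployment, auxiliary_core_loop runs pinned to the auxiliary core;
  // here we post one burst and drain it once for illustration.
  mbox.post(WorkItem{{Packet{1}, Packet{2}, Packet{3}}});
  if (auto work = mbox.take())
    for (const Packet& p : work->backlog) run_nf_chain(p);
  return 0;
}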
Handling minnows and whales. The core mapper, when it detects potential SLO violations due to minnows, rapidly recruits one or more auxiliary cores (within the same server) and migrates flows to them. When the core mapper detects a whale, it splits NFs in the NF chain, executing one or more NFs on an auxiliary core. Once auxiliary cores finish processing the assigned flows, they resume application processing.

Dealing with flow count dependence. To map flows to cores, NFV systems need to estimate or predict, for an NF chain, how many packets or flows a chain can handle for a given latency SLO. Some systems determine this dynamically and reactively [60]; others use core capacity predictors derived using offline profiling [152]. Given Ironside's stringent time requirements, it uses the latter approach. However, unlike prior systems whose core capacity predictors are based on packet counts, Ironside's predictor estimates core capacity in terms of both the flow count and the packet count, since high flow counts can violate latency SLOs (§4.2.2).

In the following subsections, we describe hierarchical multi-scale allocation in more detail. To simplify the description, we focus on a single NF chain hosted on the rack. Our Ironside implementation supports, using straightforward extensions, multiple NF chains running on the rack.

4.3.2 Absorbing bursts: The Core Mapper

Goal and challenge. The server mapper assigns a set of flows mapped to a NIC queue to a dedicated core. The core mapper, one instance of which runs on each dedicated core, seeks to prevent SLO violations on each of its assigned flows. To do this, it must make flow re-mapping decisions at timescales smaller than the target latency SLO (of 100s of µs).

Approach. The core mapper continuously and at microsecond scales (a) determines potential SLO violations, (b) predicts additional core capacity required to avoid these violations, and (c) re-maps some of its flows to one or more auxiliary cores.

Determining Potential SLO Violations. Given a p99 latency SLO of L, Ironside must monitor traffic (and remap flows) at timescales shorter than L in order to detect violations. While many choices are possible for the timescale, a sufficient one is a timescale of L/2. This interval is called an epoch. For example, if L is 200 µs, the core mapper uses an epoch of 100 µs.
Within each epoch, the core mapper seeks to maintain the following invariant: it allocates enough CPU cores to process, within the epoch, any flow that arrives before the beginning of the epoch. With this invariant, it can ensure that no packet stays in the core for more than two epochs, which is the latency SLO. These epoch durations are 3-4 orders of magnitude smaller than those of other NFV systems [9, 152, 16]. To determine potential SLO violations, at the end of every epoch, Ironside: (a) counts the number of packets and active flows in the epoch; and (b) predicts whether the dedicated core can process these without causing SLO violations.

Determining queue occupancy. At each dedicated core, the NFV runtime (which delivers packets to be processed by the NF chain) continuously pulls packets from the NIC queue, and tracks packet and flow arrivals (Figure 4.2). It also tracks which packets and which flows have been processed within an epoch. From these, it estimates the count of packets, and the number of new flow arrivals, in the queue during the current epoch (the backlog).

Figure 4.2: At each epoch, the NFV runtime maintains the number of packet and flow arrivals. At the end of an epoch, the core mapper flags potential SLO violations. If the backlog exceeds the core's capacity, the core mapper recruits auxiliary cores to handle excess traffic.

Predicting core usage. Using these two quantities, Ironside determines whether the dedicated core can handle the backlog. If not, it must recruit auxiliary cores. Ironside trains a data-driven predictor obtained from real traces, leveraging the observation made in ResQ [149] and Lemur [172] that performance predictors are accurate for NFs. This training works as follows. Ironside processes the real trace using the complete system (including ingress and server mappers), but disables the core mapper's flow migration component (discussed below). This results in epochs with SLO violations. In each such epoch, it records the number of flows (f) and packets (p) the core was actually able to process. (If the packet size distribution changes, Ironside might need to re-train the predictor; ResQ [149] also needs to retrain. Training the predictor takes less than one minute, and we can train many performance predictors in parallel under different traffic distributions.) Each such tuple ⟨f,p⟩ represents a measure of the capacity of the core, and the collection of tuples forms a Pareto-frontier. Any ⟨f,p⟩ outside this frontier cannot be processed by the core. For example, if a tuple ⟨55,64⟩ is obtained during training, the core cannot process a backlog of ⟨55,220⟩; it must recruit one or more auxiliary cores to process the residual backlog.

This intuition forms the basis for the core mapper's capacity predictor, which it uses to determine how much of the backlog can be processed by the dedicated core. It also uses this to determine how many auxiliary cores it needs to process the residual backlog. It does this by repeatedly assigning part of the residual backlog to an auxiliary core, using the predictor, until all the backlog has been assigned. Figure 4.3 shows the Pareto-frontier for a stateful NF chain (used in §3.6) under an epoch size of 200 µs. The Pareto-frontier exhibits discrete jumps of 32 packets, since packets are processed in batches of 32 packets.

Figure 4.3: An example of the Pareto-frontier for the number of active flows (with at least one packet arrival) (f) and the number of packets (p) that an NF chain is able to process within an epoch. As f increases, the chain must process fewer packets in order to avoid backlog.

Thus, at the end of this step, the core mapper develops an assignment of ⟨f′_i, p′_i⟩ tuples to the dedicated core and to each auxiliary core (the subscript i denotes the corresponding core), such that the sum of the f′s and p′s corresponds to the backlog.
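The listing below is a minimal sketch, not Ironside's implementation, of this greedy splitting step. It assumes the profiled frontier is available as a map from active-flow count to the largest packet count a core handled in one epoch, and that per-flow queued packet counts are known; the frontier points and backlog in main() are made up for illustration.

// Minimal sketch (not Ironside's implementation) of splitting an epoch's
// backlog across the dedicated core and auxiliary cores using the profiled
// <flows, packets> Pareto-frontier described above.
#include <cstdio>
#include <map>
#include <utility>
#include <vector>

using FlowBacklog = std::pair<int, int>;   // <flow id, packets queued for that flow>

// Conservative capacity lookup: use the profiled point with the next-larger flow count.
int max_packets_for(const std::map<int, int>& frontier, int flows) {
  auto it = frontier.lower_bound(flows);
  return it == frontier.end() ? 0 : it->second;
}

// Greedily pack flows into per-core <f, p> assignments that stay within the frontier.
// (A single flow whose packets alone exceed the frontier is a whale; the text handles
// that case separately, by splitting the NF chain.)
std::vector<std::pair<int, int>> split_backlog(const std::map<int, int>& frontier,
                                               const std::vector<FlowBacklog>& backlog) {
  std::vector<std::pair<int, int>> cores;  // cores[0] is the dedicated core's share
  int f = 0, p = 0;
  for (const auto& fb : backlog) {
    int pkts = fb.second;
    if (f > 0 && p + pkts > max_packets_for(frontier, f + 1)) {
      cores.emplace_back(f, p);            // this core is full: recruit an auxiliary core
      f = 0;
      p = 0;
    }
    f += 1;                                // flow-affinity: a flow goes to exactly one core
    p += pkts;
  }
  if (f > 0) cores.emplace_back(f, p);
  return cores;
}

int main() {
  // Illustrative frontier points (cf. Figure 4.3); real values come from offline profiling.
  std::map<int, int> frontier = {{10, 160}, {30, 128}, {55, 96}, {80, 32}};
  std::vector<FlowBacklog> backlog;
  for (int i = 0; i < 55; i++) backlog.push_back({i, 4});   // 55 flows, 4 packets each (220 total)
  for (const auto& core : split_backlog(frontier, backlog))
    std::printf("core gets <%d flows, %d packets>\n", core.first, core.second);
  return 0;
}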
In the above example, the original backlog may be split into three groups of flows: ⟨26,95⟩,⟨22,80⟩, and⟨7,45⟩ so that each group of flows is within the Pareto-frontier and can be processed by a core in an epoch. Core Re-mapping. After recruiting auxiliary cores from a pool, Ironside must migratef ′ i flows with p ′ i packets to thei-th auxiliary core. For each auxiliary core, the core mappermigrates a subset of flows in its backlog to asoftwarequeue associated with the auxiliary core. To migrate a flow, the core mapper creates a software queue associated with the auxiliary core, and moves packets from its own queue to that software queue. In turn, each auxiliary core runs a core mapper to avoid SLO violations due to traffic bursts in subsequent epochs. The tricky step in core re-mapping is handling whales (§4.2.2). Since Ironside must maintain flow- affinity, it can only split NFs: if a chain has 3 NFs, we can either move the last 1 or 2 NFs to an auxiliary core. This ensures that more packets can be processed, so Ironside can drain the backlog. To do this, given a particular packet backlog p for a whale, Ironside needs to determine the best splitting strategy to use. It does this by developing another predictor by profiling different packet counts and different splitting strategies. In the event that the packet count cannot be satisfied by any profiled splitting strategy, Ironside processes the flow best-effort (which can result in SLO violations), but also notifies the operator. 4.3.3 CoreEfficiency: TheServerMapper Goal and challenge. The core mapper makes decisions every 100s ofµ s and can handle bursts on that timescale, but flow rates and counts may change over longer timescales. The server mapper seeks to ensure high core efficiency by continuously re-allocating flows to the minimum number of dedicated cores necessary to process the offered load. If we could do this atµ s timescales we would not need the core mapper, but it is non-trivial to re-map large traffic aggregates at those time-scales: software-based forwarding (e.g., via vSwitch) is expensive 99 and hard to scale, especially when a NIC’s line rate reaches 100 Gbps. Hardware-based approaches rely on reprogramming flow tables at NICs or switches, which cannot be done at µ s timescales. Thus, we choose a hierarchical design: the server mapper provisions dedicated cores, and the core mapper recruits auxiliary cores to clear short-term backlogs. The key challenge for the server mapper is to (a) quickly estimate the number of dedicated cores necessary to serve the offered load and (b) rapidly re-compute flow-to-core mappings. The smaller the timescale at which it can do this, the higher Ironside’s CPU efficiency. In our current design, the server mapper runs every 1 second. Smaller intervals do not further improve the system’s core efficiency and may make Ironside unstable because measurements of packet rate and flow rate can become inaccurate at smaller timescales. Dedicatedcorecountpredictor. To predict the number of dedicated cores, Ironside needs a core capacity predictor. The core mapper uses such a predictor, but that predicts the backlog that a core can drain within a short time interval. Unlike the core mapper, the server mapper needs a predictor that can estimate what offered load (combination of active flow count f and aggregate packet rate r) a core can serve without being overloaded at a longer timescale. 
To obtain this, for various values off, for each NF chain and its associated latency SLO, Ironside empirically determines what packet rater the core can support without violating the latency SLO. Using this data, given an aggregate flow count f and an aggregate packet count p, Ironside distributes this offered load across as few dedicated cores as possible; we describe this below. For this predictor, Ironside uses synthetic traffic, which does not capture the burstiness in real traffic. For this reason, the server mapper’s estimate is conservative: it does not aim to guarantee the tail latency target, instead, it relies on the core mapper to adapt to bursty traffic on-demand. Fastflow-to-coreremapping. This is the server mapper’s primary function. The server mapper knows the aggregate flow count f ′ and packet rate r ′ of traffic at each active dedicated core. Its core capacity 100 predictor can determine, for a given core, what flow count ˆ f and packet rate ˆ r can be supported on the core. It then seeks to (re)distribute incoming traffic to the fewest number of dedicated cores possible. Leveraging RSS for fast mapping. We borrow and adapt an idea from prior work [9, 16]. Many NICs support Receive-Side-Scaling (RSS) [53], which hashes incoming traffic into a large ( e.g., 128 or 512) number of RSS buckets. Each bucket, in turn, can be bound to a NIC queue assigned to one dedicated core; traffic from this bucket is destined to that core. In general, since the number of buckets is larger than the number of cores, multiple buckets may be bound to a core. Ironside leverages this observation, albeit differently from [9]. Fast heuristics for traffic re-distribution. For each RSS bucket, Ironside tracks the average active flow count and packet rate for each decision interval. Then, for each core, it needs to find a group of buckets satisfying the following constraints: (1) The group of buckets is uniquely assigned to the core, (2) The sum of each bucket’s⟨f,r⟩ tuple, denoted as⟨f ′ ,r ′ ⟩, does not exceed the predicted capacity⟨ ˆ f,ˆ r⟩. In addition, the server mapper has to ensure that all buckets are assigned to cores. It is possible to model this problem as an optimization formulation whose objective is to minimize the number of dedicated cores. Unfortunately, this formulation (which we omit for brevity) is a constrained mixed-integer linear problem (MILP), which can take hundreds of ms or more to solve using a commercial solver. Instead, we use a fast greedy heuristic, which attempts to use the fewest dedicated cores, but models a key constraint that is hard to express in a mathematical formulation: minimizing the number of changes to bucket-to-core mappings. This ensures that flows do not need to be frequently migrated between cores, which can degrade the performance of NF processing as it impacts cache locality. The heuristic (Figure 4.4) handles cores whose load has increased (because of flow arrivals or traffic increases), as well as those whose load has decreased (because of departures): • It first determines which dedicated cores are overloaded (i.e., their⟨f ′ ,r ′ ⟩ exceeds core capacity). 
For each, it determines the fewest buckets that need to be migrated away to other dedicated cores to ensure the remaining traffic fits within the core capacity. It moves these buckets to other under-loaded cores using a first-fit strategy. If no such core exists, it allocates a new dedicated core.

• Then, it starts with the least loaded core and attempts to migrate its existing buckets to other cores, reclaiming the core if successful. It repeats this until no cores can be reclaimed.

Figure 4.4: The server mapper leverages RSS and updates the NIC's indirection table to apply bucket-to-core mappings. For each decision interval, it tracks the flow count and packet rate ⟨f,r⟩ for each RSS bucket, corresponding to one entry in the NIC's table. At the end of an interval, the server mapper decides the number of dedicated cores at a server: it finds overloaded cores, whose loads exceed the core's capacity, moves the minimum set of buckets from them to other cores, and then tries to reclaim cores. In the above case, one bucket is migrated from core 1 to core 2 to avoid overloading.

In some NICs, updating the NIC's flow tables and RSS configurations can incur millisecond-level delays [9, 59, 16]. (Some NICs, e.g., Intel, may be able to update bucket-to-core mappings in µs; Ironside seeks to be hardware-independent, so it does not assume fast updates.) To minimize the impact, Ironside uses the NIC's flow table (rather than the NIC's redirection table, or RETA) to maintain bucket-to-core mappings, and writes the new rule specifying the new bucket-to-core mapping before deleting the old rule, to ensure no packets are dropped. Because this can take up to a few milliseconds, Ironside conservatively chooses a decision interval of 1 s, to minimize overhead and to avoid oscillations in bucket-to-core assignments.

Boost mode for dedicated cores. In between periodic invocations of the server mapper, a dedicated core can see large backlogs that last for more than a few epochs. Ironside must handle such overloads to minimize SLO violations, since a dedicated core can dispatch only a limited number of packets within an epoch. To address this, we tried to invoke the server mapper on-demand (in between two periodic invocations). However, this still led to a high tail latency because of the delay in updating the RSS table in NIC hardware. Instead, under such overload, Ironside uses a boost mode for dedicated cores in software. In this mode, the dedicated core skips NF processing and focuses on pulling packets from the NIC queue and dispatching them to auxiliary cores. To do this, it recruits another core to conduct NF processing on its behalf. Once the queue has been drained, it can resume NF processing. In §4.4.5, we show that enabling boost mode is necessary for meeting tail latency SLOs.

4.3.4 Minimizing Servers: The Ingress Mapper

The ingress mapper assigns flows to servers to minimize the number of servers assigned to the NF chain. A rack may see many flows, so the ingress mapper cannot scale if it needs to make per-flow server assignment decisions. Instead, the ingress mapper makes assignment decisions on a prefix of the destination address.
Specifically, it assigns flows with the same destination prefix of length k to the same server. This requires 2^(32−k) routing entries for IPv4. k can be chosen to match ToR switch table capacities. When traffic arrives at a previously unoccupied prefix (i.e., one for which no packets have been seen), it steers that prefix to the server with the largest number of dedicated cores, as long as that number is under a threshold τ. If no such server exists, the ingress mapper finds a new one from the rack and steers traffic to it. Thus, new servers are recruited only when all other servers are near the target load. The parameter τ ensures that each server has sufficient auxiliary cores to handle bursts.

When it changes the number of dedicated cores, each server mapper reports this number to the ingress mapper, which uses this to steer prefixes. Ironside does not attempt to consolidate servers with few dedicated cores. This requires cross-server NF state migration, but is also unnecessary since we assume that non-dedicated cores (on any server) can be used to run edge applications.

4.4 Evaluation

In this section, we evaluate: Ironside's ability to achieve sub-millisecond tail latency SLOs (§4.4.2), Ironside's CPU core efficiency relative to other NFV systems (§4.4.3), and how its design choices contribute to its performance (§4.4.4-§4.4.5).

Implementation. Ironside uses BESS [10], a high-performance DPDK-based switch [31]. Its design consists of two parts: a rack-scale ingress and a worker, requiring 2.9k lines of C++. The former implements statistics collection from workers and runs the ingress mapper (§4.3.4). The latter contains the server mapper (§4.3.3), the core mapper (§4.3.2), and additional functionality to enable dedicated cores to pull traffic from the NIC queues into software queues, as well as to recruit auxiliary cores to handle bursts. In addition, Ironside's framework for training core capacity predictors adds 1.2k lines of Python.

4.4.1 Setup and Methodology

Testbed. We use Cloudlab [24] to run experiments on a cluster of 5 servers under a single ToR. Each server has dual-CPU 32-core 2.8 GHz AMD EPYC Milan CPUs with 256 GB ECC memory (DDR4 3200 MHz). To reduce jitter, we disable SMT and CPU frequency scaling. Each server has a 100 GbE Mellanox ConnectX-6 NIC for data traffic and a separate NIC for control and management traffic.

Methodology. On this cluster, we run Ironside and other NFV systems to serve NF chains that process real-world traffic, and measure end-to-end packet processing latency and CPU core usage metrics. We use two canonical stateful NF chains, from documented use cases [69, 73]. Chain 1 is a flow monitoring and load-balancing pipeline: ACL → LB → Accounting. Chain 2 is a more compute-intensive chain with DPI and encryption NFs: ACL → UrlFilter → Encrypt. These NFs and NF chains are frequently used for evaluating NFV systems' performance [60, 140, 152]. ACL enforces 1.5k access control rules. LB is an L4 load balancer that distributes flows across a list of backend servers. Accounting is a monitoring NF that tracks the data usage of each flow. UrlFilter performs TCP reconstruction over flows, and applies complex string matching rules (i.e., Snort [45] rules) to block connections mentioning banned URLs. Encrypt encrypts each packet with 128-bit ChaCha [98, 144]. NAT maintains a list of available L4 ports and performs address translation for connections, assigning a free port and maintaining this port mapping for a connection's lifetime.
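As an illustration of the per-flow state such NFs maintain, and thus why flow affinity matters, the listing below is a minimal sketch of an accounting-style NF. It is not BESS's implementation; the struct layout, names, and use of std::map are assumptions for illustration. Because every packet of a flow updates the flow's entry, keeping all of a flow's packets on one core lets this state stay core-local and unsynchronized.

// Minimal sketch (not BESS's implementation) of per-flow state in an
// accounting-style NF: a byte counter keyed by the flow 5-tuple.
#include <cstdint>
#include <cstdio>
#include <map>
#include <tuple>

struct FiveTuple {
  uint32_t src_ip, dst_ip;
  uint16_t src_port, dst_port;
  uint8_t proto;
  bool operator<(const FiveTuple& o) const {
    return std::tie(src_ip, dst_ip, src_port, dst_port, proto) <
           std::tie(o.src_ip, o.dst_ip, o.src_port, o.dst_port, o.proto);
  }
};

struct Packet { FiveTuple flow; uint32_t bytes; };

class AccountingNF {
 public:
  // Called for every packet of the chain's traffic handled by this core.
  void process(const Packet& pkt) {
    usage_[pkt.flow] += pkt.bytes;        // creates the flow's state on its first packet
  }
  uint64_t usage(const FiveTuple& flow) const {
    auto it = usage_.find(flow);
    return it == usage_.end() ? 0 : it->second;
  }
 private:
  std::map<FiveTuple, uint64_t> usage_;   // per-flow state, private to this core
};

int main() {
  AccountingNF acct;
  FiveTuple f{0x0a000001, 0x0a000002, 1234, 80, 6};
  acct.process({f, 1500});
  acct.process({f, 400});
  std::printf("flow used %lu bytes\n", static_cast<unsigned long>(acct.usage(f)));
  return 0;
}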
We use BESS’s implementation of these NFs; they have been used in prior NFV papers [73, 152, 16]. Traffic Traces. Our experiments use two real traffic traces ∥ (§4.2). The backbone trace contains 4.9M flows, with flow sizes from 54 bytes to 206M bytes, and a maximum duration of 52 seconds. The flow arrival rates are 31.2k (average) and 51.2k (max) flows per second, and packet sizes are 988 (average) and 1514 (max) bytes. The AS trace contains 1.6M flows ranging from 54 bytes to 514M bytes in size, with a maximum duration of 205 seconds. The flow arrival rates are 10.1k (average) and 39.0k (max) per second. The packet sizes are 976 (average) and 1514 (max) bytes. Thus, the AS trace exhibits a larger range of flow sizes and durations, but slightly lower flow rates, than the backbone trace. We train our core capacity predictors (§4.3.2, §4.3.3) on 15 s of traffic the backbone trace, and test on the rest. We run a BESS-based traffic generator on a dedicated server. Each experiment runs a single combina- tion of NF chain and traffic trace. For our choice of chains and traffic traces, 5 servers suffice to handle the workload, hence our choice of cluster size. Key Metrics. We quantify Ironside performance using: (a) end-to-end latency, for which we use two measures: the median (p50) and the p99 tail latency; and (b)CPUcoreusage in terms of total CPU-hour(s) ∥ While we target Ironside for edge settings, in the evaluation we use backbone and AS traces since edge traffic in a few years may increase to those levels. 105 or the time-averaged CPU cores for serving test traffic. Ideally, Ironside should exhibit low CPU core usage while ensuring the target p99 latency SLO. Comparisons. We compare Ironside against three prior systems,i.e., Metron, Quadrant, and Dyssect, that auto-scale NF chains under traffic dynamics. Metron [60] supports scaling of NF chains. It enables L2 switching at NICs and modifies L2 headers at the ToR switch to enforce routing decisions. It detects overloaded cores by periodically checking packet losses (via packet counters) at the switch, and splits a core’s traffic aggregate into two subsets and migrates one subset to a new core when it detects core overload. Metron does not explicitly support latency SLOs. Quadrant [152] explicitly targets p99 latency SLOs, scales NF chains by collecting at its ingress the worst-case packet delay at each core, and uses a hysteresis-based approach to control end-to-end latency of each core running a chain. It can satisfy p99 latency SLOs for synthetic traffic [152]. Real traffic traces contain significant bursts that can, as we show below, impact Quadrant’s performance. Dyssect [16] seeks to minimize p50 latency of serving NF chains. It uses RSS to scale NF chains at a worker, and periodically migrates a subset of flows to new CPU cores, using an MILP formulation. How- ever, Dyssect is not designed for a rack-scale system; to enable the rack-scale experiments for Dyssect, we apply Ironside’s rack-scale ingress to distribute traffic to Dyssect workers. Dyssect improves on the design of, and outperforms RSS++ [16], so we don’t evaluate against RSS++. 4.4.2 LatencyPerformanceComparisons We first compare Ironside’s latency against other NFV systems. Tables 4.3, 4.4, 4.5,and 4.6 show the p50 and p99 latency of different NFV platforms when serving the backbone traffic and the AS traffic. Across all platforms, Ironside is the only one capable of achieving sub-millisecond (100-500µ s) p99 latency SLOs. 
All platforms do exhibit sub-millisecond p50 latency, showing that their scaling techniques work well to bound median latency.

Table 4.3: Comparisons of end-to-end metrics (p50 and p99 latency, the time-averaged CPU core usage, and loss rate) for chain 1 under the backbone traffic by Ironside and others, as a function of the system's latency target.

Platform | Latency SLO [µs] | p50 [µs] | p99 [µs] | Avg. Cores | Loss Rate
Ironside | 100 | 28.8 | 89.4 | 34.6 | 0.006
Ironside | 200 | 32.1 | 174.4 | 30.7 | 0.006
Ironside | 300 | 30.7 | 238.8 | 30.1 | 0.005
Ironside | 400 | 34.5 | 254.4 | 30.5 | 0.010
Ironside | 500 | 31.1 | 298.2 | 29.7 | 0.005
Metron | - | 57.7 | 2541 | 21.8 | 0.024
Quadrant | 100 | 30.8 | 3051 | 45.3 | 0.062
Quadrant | 200 | 39.1 | 3021 | 43.3 | 0.057
Quadrant | 300 | 38.8 | 3015 | 43.3 | 0.056
Quadrant | 400 | 39.9 | 2988 | 43.1 | 0.052
Quadrant | 500 | 53.7 | 2994 | 41.7 | 0.044
Dyssect | 100 | 63.0 | 78798 | 34.7 | 0.224
Dyssect | 200 | 68.0 | 68356 | 31.9 | 0.106
Dyssect | 300 | 78.1 | 70392 | 32.7 | 0.249
Dyssect | 400 | 76.2 | 64885 | 30.1 | 0.218
Dyssect | 500 | 68.7 | 71586 | 31.3 | 0.194

Table 4.4: Comparisons of end-to-end metrics (p50 and p99 latency, the time-averaged CPU core usage, and loss rate) for chain 2 under the backbone traffic by Ironside and others, as a function of the system's latency target.

Platform | Latency SLO [µs] | p50 [µs] | p99 [µs] | Avg. Cores | Loss Rate
Ironside | 100 | 18.0 | 101.4 | 56.0 | 0.004
Ironside | 200 | 22.8 | 180.3 | 41.9 | 0.010
Ironside | 300 | 24.8 | 197.2 | 42.8 | 0.002
Ironside | 400 | 24.9 | 210.3 | 40.3 | 0.007
Ironside | 500 | 26.8 | 270.5 | 40.6 | 0.003
Metron | - | 254.5 | 5580 | 32.1 | 0.044
Quadrant | 100 | 34.0 | 5459 | 76.7 | 0.126
Quadrant | 200 | 34.7 | 5465 | 78.3 | 0.133
Quadrant | 300 | 34.0 | 5452 | 75.5 | 0.132
Quadrant | 400 | 34.5 | 5473 | 77.1 | 0.137
Quadrant | 500 | 34.2 | 5455 | 77.2 | 0.135
Dyssect | 100 | 62.6 | 78955 | 50.6 | 0.294
Dyssect | 200 | 61.0 | 62528 | 45.6 | 0.246
Dyssect | 300 | 60.2 | 77793 | 47.4 | 0.279
Dyssect | 400 | 50.2 | 87130 | 48.2 | 0.208
Dyssect | 500 | 61.4 | 78030 | 44.8 | 0.214

Table 4.5: Comparisons of end-to-end metrics (p50 and p99 latency, the time-averaged CPU core usage, and loss rate) for running chain 1 under the AS traffic by Ironside and others, as a function of the system's latency target.

Platform | Latency SLO [µs] | p50 [µs] | p99 [µs] | Avg. Cores | Loss Rate
Ironside | 100 | 22.7 | 51.9 | 17.5 | 0.010
Ironside | 200 | 22.9 | 52.1 | 17.4 | 0.011
Ironside | 300 | 23.1 | 53.4 | 15.0 | 0.006
Ironside | 400 | 23.4 | 53.2 | 16.2 | 0.013
Ironside | 500 | 23.5 | 53.5 | 17.5 | 0.008
Metron | - | 48.6 | 1926 | 11.9 | 0.014
Quadrant | 100 | 37.2 | 1026.5 | 17.9 | 0.014
Quadrant | 200 | 30.7 | 930.0 | 17.3 | 0.010
Quadrant | 300 | 29.5 | 765.2 | 16.9 | 0.013
Quadrant | 400 | 29.3 | 556.5 | 17.3 | 0.011
Quadrant | 500 | 28.2 | 810.2 | 16.4 | 0.011
Dyssect | 100 | 17.7 | 13586 | 17.9 | 0.044
Dyssect | 200 | 17.6 | 13031 | 17.2 | 0.054
Dyssect | 300 | 17.8 | 12672 | 17.9 | 0.050
Dyssect | 400 | 18.2 | 12480 | 15.9 | 0.052
Dyssect | 500 | 19.1 | 22776 | 16.5 | 0.049

Table 4.6: Comparisons of end-to-end metrics (p50 and p99 latency, the time-averaged CPU core usage, and loss rate) for running chain 2 under the AS traffic by Ironside and others, as a function of the system's latency target.

Platform | Latency SLO [µs] | p50 [µs] | p99 [µs] | Avg. Cores | Loss Rate
Ironside | 100 | 33.1 | 87.0 | 25.2 | 0.006
Ironside | 200 | 41.6 | 103.3 | 18.1 | 0.008
Ironside | 300 | 48.8 | 105.6 | 16.3 | 0.008
Ironside | 400 | 50.0 | 108.0 | 18.4 | 0.007
Ironside | 500 | 60.8 | 103.3 | 16.6 | 0.007
Metron | - | 78.7 | 5310 | 14.8 | 0.029
Quadrant | 100 | 70.4 | 5352 | 19.2 | 0.073
Quadrant | 200 | 69.9 | 5363 | 18.6 | 0.072
Quadrant | 300 | 70.5 | 5351 | 19.3 | 0.074
Quadrant | 400 | 70.4 | 5369 | 18.2 | 0.073
Quadrant | 500 | 69.9 | 5373 | 18.2 | 0.072
Dyssect | 100 | 24.2 | 48041 | 18.4 | 0.056
Dyssect | 200 | 24.1 | 57273 | 18.1 | 0.049
Dyssect | 300 | 23.2 | 34573 | 16.9 | 0.041
Dyssect | 400 | 24.4 | 50812 | 17.0 | 0.040
Dyssect | 500 | 23.8 | 47229 | 16.7 | 0.044

Metron, which does not explicitly support SLOs (so has only a single entry in the tables), achieves 2541 µs and 5580 µs p99 latency for chain 1 and chain 2 respectively (backbone traffic), and 1926 µs and 5310 µs respectively (AS traffic). It also has the highest p50 latency among all platforms because it relies on hardware to achieve dynamic scaling. Detecting packet losses by sending periodic control-plane queries to collect packet counter statistics at ToR switches cannot avoid high latency caused by short-term bursts. Moreover, re-programming switch tables (for splitting traffic aggregates) can take hundreds of milliseconds.

Quadrant supports p99 latency SLO targets. However, in our experiments, it incurs 2988-3051 µs (resp. 5452-5473 µs) p99 latency for chain 1 (resp. chain 2) under the backbone traffic. Even though it actively monitors each core's worst-case packet processing delay, it still fails to deliver good tail latency on highly bursty real-world traces, for three reasons. 1) Improper scaling signal: for each core, Quadrant maintains a moving average of the worst-case packet processing delay, reported to a rack-scale ingress every few hundred milliseconds. This signal is insufficient to detect transient bursts quickly. 2) Delayed scaling: Quadrant makes scaling decisions using one rack-scale controller (rather than making local decisions at each worker), which re-balances flows from overloaded cores every 50 milliseconds and enforces scaling via RPCs that add extra millisecond-level delays. 3) Inaccurate scaling: Quadrant migrates at most half of the flows from a core in a scaling period, which may not work well because a significant number of flows can arrive within its scaling period.

Dyssect supports p50 latency SLO targets. To optimize p50 latency, it migrates flows from cores to control each core's load, defined as the percentage of CPU time spent executing actual NF processing. Dyssect produces good p50 latency results: 35 µs and 60 µs (backbone traffic), but has the highest p99 latency of all systems: 78 ms and 87 ms for chain 1 and chain 2 (backbone traffic). Core load is an insufficient scaling signal for maintaining low tail latency. Dyssect also uses a 100-ms scaling period, a compute-intensive MILP formulation to determine flow migrations, and locks in its data-plane implementation, all of which increase latency. Dyssect's median latency is smaller than Ironside's under the AS traffic, but higher under the backbone traffic. The AS trace has lower flow and packet rates; Dyssect optimizes for median latency and works well in this scenario. However, it suffers more queue overflows under the more challenging backbone traffic, and thus produces higher median latency there.

In general, for all systems, p99 latencies are higher for chain 2 than for chain 1, since chain 2 is more compute-intensive. The one exception is Dyssect, whose p99 is comparable for both chains on the backbone trace, which has a much higher flow rate.

Finally, Ironside achieves superior latency performance while also incurring the lowest packet loss rate among all systems (see the Loss Rate column). In all systems, most packet losses occur due to queue overflows caused by improper scaling decisions. Ironside auto-scales quickly, and thus incurs the fewest losses.
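To illustrate why a period-averaged core-load signal reacts sluggishly to transient bursts while an instantaneous queue-length signal does not, here is a minimal, self-contained sketch. It is not code from Dyssect or Ironside; the period and epoch lengths, the cycle counts (roughly a 2.2 GHz core is assumed), and the burst shape are illustrative assumptions.

```cpp
#include <cstdint>
#include <iostream>

// Two candidate scaling signals for one core, evaluated over a scaling period.
// 1) Period-averaged core load (a Dyssect-style signal): the fraction of CPU
//    cycles spent on NF processing during the period. A burst lasting a few
//    hundred microseconds barely moves this average when the period is 100 ms.
// 2) Instantaneous queue length: checked every short epoch, it reflects the
//    burst immediately.
struct CoreSignals {
  uint64_t busy_cycles = 0;   // cycles spent processing packets this period
  uint64_t total_cycles = 0;  // cycles elapsed in the current period
  uint32_t queue_len = 0;     // packets currently waiting on this core

  void OnEpoch(uint64_t busy, uint64_t elapsed, uint32_t arrivals,
               uint32_t drained) {
    busy_cycles += busy;
    total_cycles += elapsed;
    queue_len += arrivals;
    queue_len -= (drained < queue_len ? drained : queue_len);
  }
  double PeriodLoad() const {
    return total_cycles ? static_cast<double>(busy_cycles) / total_cycles : 0.0;
  }
};

int main() {
  CoreSignals sig;
  // 1000 epochs of 100 us = one 100 ms period; the core is lightly loaded
  // except for a 5-epoch burst in the middle (all values are assumptions).
  for (int epoch = 0; epoch < 1000; ++epoch) {
    bool burst = (epoch >= 500 && epoch < 505);
    uint64_t busy = burst ? 200000 : 20000;  // cycles of NF work this epoch
    uint32_t arrivals = burst ? 600 : 30;    // packets arriving this epoch
    uint32_t drained = 64;                   // packets processed per epoch
    sig.OnEpoch(busy, /*elapsed=*/220000, arrivals, drained);
    if (burst)  // the queue-length signal reacts within one short epoch
      std::cout << "epoch " << epoch << " queue=" << sig.queue_len << "\n";
  }
  // The period-averaged load stays low, so a load-based trigger fires late.
  std::cout << "period-averaged load = " << sig.PeriodLoad() << "\n";
  return 0;
}
```

In this toy scenario, the 100-ms average load stays below 10% even though a few 100-µs epochs queue thousands of packets; a signal sampled per short epoch sees the backlog immediately, which is the intuition behind scaling on per-epoch queue state rather than period-averaged load.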
To contextualize acceptable loss rates: according to a mobile cloud gaming study [32], major US cellular carriers achieve 1.7-4.8% loss rates for 4G services and 0.2-4.2% for 5G services. Ironside's loss rate is at the lower end of the 5G range.

To conclude, we find that prior systems are unable to ensure sub-ms tail latency SLOs if (1) they only use hardware-based scaling, or (2) they lack mechanisms to quickly detect and react to traffic bursts. Ironside uses a hybrid scaling design and relies on its core mapper to handle transient bursts quickly.

4.4.3 CPU Core Usage Comparisons

To compare the CPU core usage of different NFV systems, in each 100-µs epoch we record the number of active CPU cores, and then calculate the time-averaged core usage across the entire experiment. Tables 4.3, 4.4, 4.5, and 4.6 show CPU core usage for Ironside and the other NFV systems. In addition to having 1-2 orders of magnitude lower tail latency than the other NFV systems, Ironside uses up to 34% (chain 1) / 48% (chain 2) less CPU under the backbone traffic and up to 17% (chain 1) / 15% (chain 2) less under the AS traffic compared to Quadrant, and up to 14% (chain 1) / 20% (chain 2) less under the backbone traffic and 16% (chain 1) / 11% (chain 2) less under the AS traffic compared to Dyssect. Quadrant monitors per-core worst-case packet delay to make flow migration decisions. Transient bursts can increase worst-case delays and trigger flow re-allocation to a new core, which increases core usage. Dyssect's CPU core usage is slightly better than Quadrant's, and comparable to Ironside's. This is because Dyssect uses the CPU core load as its scaling signal, which is less sensitive to bursts.

Compared to Metron, Ironside uses slightly more (10-26%) CPU cores (but has an order of magnitude lower p99 latency), as Metron uses efficient hardware-based scaling. Though Metron aims to minimize packet losses during scaling, it does not aggressively optimize end-to-end latency (thus producing higher p50 and p99 latency). In contrast, Ironside achieves an order of magnitude lower p99 latency, prevents packet losses by allocating traffic to cores with its core capacity predictors, and demonstrates lower packet loss rates. To better compare Metron's and Ironside's core usage, we set Ironside's p99 latency SLO to the p99 latency achieved by Metron. Table 4.7 shows that Ironside still produces lower p99 latency under these loose latency SLOs, and uses a comparable number of CPU cores to Metron. In this experiment, Ironside is less aggressive in migrating flows to auxiliary cores, so it uses fewer cores than it does under tighter SLOs.

Table 4.7: Comparisons of end-to-end p50 and p99 latency and the time-averaged CPU core usage by Ironside and Metron. In these experiments, we set Ironside's latency target to 2500 µs and 5000 µs for chain 1 and chain 2 respectively. Ironside still produces lower p99 latency results, and has similar CPU core usage.

Chain | Platform | Avg. Cores | p50 [µs] | p99 [µs]
Chain 1 | Metron | 21.8 | 47.7 | 2540
Chain 1 | Ironside | 23.9 | 51.3 | 2126
Chain 2 | Metron | 32.1 | 254.5 | 5580
Chain 2 | Ironside | 30.4 | 374.2 | 4567

4.4.4 Ablation Study: The Core Mapper

To illustrate the importance of the core mapper, we evaluate Ironside against the following variants. Ironside static-safe: this variant only considers the packet count (or queue length) when predicting core usage.
This core mapper makes safe choices by always using the smallest packet count regardless of the number of flow arrivals. For example, in Figure 4.3, it will determine a core's processing capability to be 64 packets for any number of flow arrivals. Ironside static-unsafe: this variant is similar to the above, but it always uses the largest packet count regardless of the number of flow arrivals. For example, in Figure 4.3, it will determine a core's processing capability to be 160 packets for any number of flow arrivals. Ironside w/o core mapper: this variant disables the core mapper. It, however, keeps the server mapper, which makes scaling decisions at a much coarser time granularity.

Figure 4.5: Comparisons of end-to-end p99 latency and time-averaged CPU cores achieved/used by Ironside and its variants, as a function of achieved p99 latency. (Panels: p99 latency [µs, log scale] and number of CPU cores vs. latency SLO [µs], for chain 1 and chain 2.) Ironside w/o core-mapper has the worst latency result, which shows the importance of the core mapper. Ironside static-unsafe and Ironside static-safe only consider the packet count (or queue length) when predicting core usage. The former can have latency SLO violations, while the latter can result in more CPU core usage. Overall, Ironside's design performs the best in terms of meeting SLOs and then minimizing CPU core usage.

Figure 4.5 demonstrates that Ironside can provide sub-ms latency SLOs only when using its core mapper or the static-safe core mapper. Without the core mapper, Ironside is unable to achieve sub-ms latency. When using the static-unsafe core mapper, Ironside achieves sub-ms p99 latency, but violates SLOs in some cases. The static-safe variant can result in a smaller p99 latency in most cases, but may use up to 13% more CPU cores compared to Ironside's design.

4.4.5 Ablation Study: The Server Mapper

Importance of flow counts. Ironside's server mapper finds the fewest dedicated cores on an Ironside worker in response to persistent load shifts. As with the core mapper, we explore two variants: Ironside static-safe-server allocates cores based on the lowest packet rate seen by the core capacity predictor (§4.3.3), regardless of the number of active flows on a dedicated core. Ironside static-unsafe-server allocates cores based on the highest packet rate seen by the core capacity predictor.

Figure 4.6: Comparisons of end-to-end p99 latency and time-averaged CPU cores achieved/used by Ironside and two variants with different server mappers, as a function of achieved p99 latency. (Panels: p99 latency [µs] and number of CPU cores vs. latency SLO [µs], for chain 1 and chain 2.) For chain 2, Ironside is able to meet all latency SLOs with the smallest time-averaged CPU cores.

Figure 4.6 shows that all three designs can meet sub-ms p99 latency SLOs with comparable CPU core usage for chain 1.
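The core-mapper and server-mapper variants above differ only in how they consult the core capacity predictor. The sketch below illustrates this distinction; the table contents (with 64 and 160 packets per epoch as the extremes, echoing the Figure 4.3 example) and the bucketing of flow arrivals are illustrative assumptions, not Ironside's actual profiled values.

```cpp
#include <algorithm>
#include <cstdint>
#include <vector>

// Per-chain profile: estimated packets a core can process in one short epoch,
// indexed by the number of new flow arrivals observed in that epoch. Values
// are hypothetical; 64 and 160 mirror the extremes cited for Figure 4.3.
// More new flows means more per-flow setup work, so capacity drops.
const std::vector<uint32_t> kCapacityByFlowArrivals = {160, 128, 96, 80, 64};

uint32_t BucketOf(uint32_t new_flows) {
  // Hypothetical bucketing of the flow-arrival count into table indices.
  uint32_t bucket = new_flows / 8;
  return std::min<uint32_t>(bucket, kCapacityByFlowArrivals.size() - 1);
}

// Ironside-style lookup: capacity depends on the instantaneous flow arrivals.
uint32_t PredictCapacity(uint32_t new_flows) {
  return kCapacityByFlowArrivals[BucketOf(new_flows)];
}

// The ablation variants collapse the table to a single value.
uint32_t PredictCapacityStaticSafe() {    // always the smallest packet count
  return *std::min_element(kCapacityByFlowArrivals.begin(),
                           kCapacityByFlowArrivals.end());  // 64
}
uint32_t PredictCapacityStaticUnsafe() {  // always the largest packet count
  return *std::max_element(kCapacityByFlowArrivals.begin(),
                           kCapacityByFlowArrivals.end());  // 160
}
```

Under-estimating capacity (static-safe) never overloads a core but recruits auxiliary cores earlier than necessary, while over-estimating it (static-unsafe) defers recruitment and can let queues build during bursts, which matches the SLO violations and extra core usage observed above.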
Returning to Figure 4.6: for chain 2, Ironside's design dominates both in terms of latency and CPU core usage, indicating that, in some settings, the server mapper must consider both flow and packet rates for efficient SLO compliance.

The necessity of boost-mode. Instead of reconfiguring RSS tables under persistent overload, Ironside's boost-mode uses the dedicated core to drain packet queues, and migrates NF processing to other cores (§4.3.3). To show the importance of boost-mode, we compare Ironside against two variants: 1) Ironside w/o boost-mode: an Ironside variant that disables boost-mode; 2) Ironside w/ on-demand invocations: instead of enabling boost-mode, this variant invokes the server mapper's RSS-based flow-to-core remapping immediately when a persistent overload appears.

Results. Figure 4.8 shows that enabling boost-mode is crucial for achieving sub-ms tail latency SLOs. The two other Ironside variants can only achieve ms-scale tail latency: the first because it cannot handle load changes at 100s-of-ms or second timescales, and the second because of the overhead and delay of re-programming the NIC's RSS configuration.

Figure 4.8: Comparisons of end-to-end p99 latency achieved by Ironside and its boost-mode variants, as a function of the system's latency target. (Panels: p99 latency [µs, log scale] vs. latency SLO [µs], for chain 1 and chain 2.) Turning off Ironside's boost-mode or applying on-demand invocations of the server mapper cannot achieve sub-millisecond tail latency SLOs. This shows that NFV systems need to handle bursts in software.

Table 4.9: The average execution time of Ironside's core mapper's core re-mapping process, as a function of packet queue length. This process runs at the end of each short epoch and is lightweight (i.e., <5 µs to re-map more than 1k packets).

Queue length | Core re-mapping [cycles] | Core re-mapping [µs]
128 | 4.2k | 1.5
256 | 7.1k | 2.6
512 | 7.7k | 2.8
1024 | 12.2k | 4.4

4.4.6 Analyzing Scheduling Overheads

Enqueue and Dequeue Overhead. In Ironside, a dedicated core pulls a packet from a NIC queue and enqueues it to a software queue (either owned by that core or by an auxiliary core) before it is processed; the packet is dequeued and processed later. We now quantify this enqueue/dequeue overhead (Ironside uses a lock-less queue). By computing the average cycle cost of 10k enqueue and dequeue operations, we find that enqueue adds 40 cycles per batch and dequeue adds 32 cycles per batch. This overhead is small compared to the packet processing time of an NF chain.

Core-remapping Overhead. At the end of each short epoch, Ironside's core mapper invokes a core re-mapping to balance workloads on software queues and to recruit auxiliary cores when necessary. This core re-mapping includes decision and enforcement steps that scan all packets in a software queue (owned by a dedicated core). Table 4.9 shows benchmark results for the execution time of the core re-mapping process under different packet queue lengths. Even for more than 1k packets, this core re-mapping process finishes in less than 5 µs. In practice, Ironside controls software queue length with its server mapper design, so it expects far fewer packets in a software queue. This overhead is relatively small for 100s-of-µs latency SLOs. Extending Ironside to tighter SLOs might require a redesign, which we leave to future work.
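For readers who want to reproduce the style of the enqueue/dequeue measurement above, here is a minimal sketch of a cycle-accurate microbenchmark over a single-producer/single-consumer ring. It is not Ironside's benchmarking code; the ring implementation, batch size, and iteration count are illustrative assumptions.

```cpp
#include <cstdint>
#include <cstdio>
#include <x86intrin.h>  // __rdtsc(); assumes an x86 CPU

// A tiny single-producer/single-consumer ring of packet pointers, standing in
// for the lock-less software queue between a dedicated core and an auxiliary
// core. Capacity must be a power of two. This sketch is single-threaded, so
// no atomics are used; a real cross-core queue would need them.
struct SpscRing {
  static constexpr uint32_t kCap = 1024;
  void* slots[kCap];
  uint32_t head = 0, tail = 0;  // free-running counters

  uint32_t EnqueueBurst(void** pkts, uint32_t n) {
    uint32_t free_slots = kCap - (head - tail);
    uint32_t cnt = n < free_slots ? n : free_slots;
    for (uint32_t i = 0; i < cnt; ++i) slots[(head + i) & (kCap - 1)] = pkts[i];
    head += cnt;
    return cnt;
  }
  uint32_t DequeueBurst(void** pkts, uint32_t n) {
    uint32_t avail = head - tail;
    uint32_t cnt = n < avail ? n : avail;
    for (uint32_t i = 0; i < cnt; ++i) pkts[i] = slots[(tail + i) & (kCap - 1)];
    tail += cnt;
    return cnt;
  }
};

int main() {
  SpscRing ring;
  constexpr uint32_t kBatch = 32;     // packets per batch (assumption)
  constexpr uint32_t kIters = 10000;  // 10k enqueue + 10k dequeue operations
  void* batch[kBatch] = {};

  uint64_t enq_cycles = 0, deq_cycles = 0;
  for (uint32_t i = 0; i < kIters; ++i) {
    uint64_t t0 = __rdtsc();
    ring.EnqueueBurst(batch, kBatch);
    uint64_t t1 = __rdtsc();
    ring.DequeueBurst(batch, kBatch);
    uint64_t t2 = __rdtsc();
    enq_cycles += t1 - t0;
    deq_cycles += t2 - t1;
  }
  printf("avg enqueue: %.1f cycles/batch, avg dequeue: %.1f cycles/batch\n",
         enq_cycles / static_cast<double>(kIters),
         deq_cycles / static_cast<double>(kIters));
  return 0;
}
```

The same TSC-based timing pattern can wrap the core re-mapping routine to reproduce Table 4.9-style measurements as a function of queue length.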
114 Chapter5 LiteratureReview The number of Internet users has gone up from roughly 1 billion in 2005 to 5.5 billion in 2022, as re- ports have shown [128, 150]. Both the industry and the research community have noticed two trends: (1) more people are getting access to the Internet. (2) more companies, universities, and organizations are transforming legacy services into online IT services; The ever-increasing Internet services and users require network operators to put significant efforts into operating networks to support new devices and applications. In this transition, many researchers start to rethink the design of the development and de- ployment of network services. Achieving high throughput, low latency, and low development, operational, and management costs is the key. Our work contributes to network processing and services, cloud computing, and network management by advancing the design and implementation of Network Function Virtualization (NFV) technology. In this chapter, we review related work in these sectors. 5.1 MiddleboxandNetworkFunction NFs. A majority of NFs are networking functionalities deployed to operate a private network. Here are some examples. Firewalls provide some basic security protections by detecting and reacting to malicious network users. They can be implemented as Access Control List (ACL) [41], Filter [78], Snort [130] etc. 115 Load Balancers (LBs) distribute traffic across backend servers [26, 90], while NATs perform address trans- lations [25]. They support the deployment of distributed systems, such as web services. Security-related NFs, including Intrusion detection systems [185] and DDoS detection systems [133], are developed to pro- tect users. QoS NFs, such as Heavy Hitter detection [142], are used to monitor the network usage of users and to support traffic engineering. 5G deployments also involve many NFs, such as Access and Mobility Function (AMF) and Network Slice Selection Function (NSSF), and many others [3]. Data centers employ WAN optimizers to reduce network traffic across geo-distributed sites [4]. NFV was proposed to optimize the development and deployment of these NFs. Thus, we consider these NFs in our research extensively. Datacenterandcloudnetworking. Many data center and cloud networking functionalities are packet processing tasks and are categorized as NFs. Software switches [10, 44, 68] forward packets in software and ensure network access and performance policies to VMs, which is critical for cloud networking. Protocol offloading [55, 85] is a common technique to accelerate the networking stack. Many network-layer opti- mizations are conducted to accelerate performance for various workloads, such as Deep Neural Networks (DNNs) training and inference [115, 51], containers and VMs [63], big-data analytics [19], etc. These data center and cloud use cases may benefit from new design principles and techniques that optimize for NFV. For instance, Iron [62] reuses many NFV techniques and concepts to achieve better isolation for containers in the cloud. 5.2 NFVFrameworks In the past, most network functions were implemented as hardware middleboxes. As the network becomes more complex, network operators must purchase, orchestrate, manage, and upgrade many middleboxes. In 2012, a survey [136] of 57 enterprise network managers indicated that operating and managing a private network can be difficult and expensive. In the paper, Sherry et al. proposed the concept of NFV and demonstrated the feasibility of a prototype system APLOMB. 
Ideally, NFV would enable the rapid development of new NFs and reduce operational and management costs for network operators. To fulfill this goal, many researchers proposed NFV frameworks that are used to ease the development of NFs, and to execute, isolate, orchestrate, and scale NFs. Typically, an NFV framework consists of a data plane directly involved in packet processing and a control plane that handles all other aspects of NFV by controlling other components (including the data plane).

Data plane. NFV frameworks evolved from software switches in the early days. Some examples include the Click modular router [66] and the Snabb switch [111]. Click was developed in C++, while Snabb used Lua [102] for programming. They both employed the concept of modular network processing: developers wrote small packet processing modules, while users combined these modules to achieve a complex packet processing task. However, they were developed using the default Linux networking stack, which was proven to have significant overheads under high network I/O rates. In the meantime, NIC vendors collaborated and released DPDK [31], i.e., a user-space networking stack that bypasses the kernel. DPDK has enabled high-performance network I/O over commodity Network Interface Cards (NICs). Later, older software switches enhanced with DPDK, along with new packet processing frameworks built on DPDK (e.g., BESS [10, 44]), became the default packet processing engines used by NFV frameworks.

Control plane. The control plane of an NFV framework deals with load balancing, interconnection, declarative policies, SLO enforcement, and state management. In 2015, Palkar et al. introduced E2 [105], i.e., an NFV framework that unifies these control plane functionalities. E2 utilizes OpenFlow and software switches and uses Software-Defined Networking (SDN) to achieve service interconnection, overload detection, and scaling. E2 demonstrates the possibility of orchestrating NFs to achieve flexible NF chains on top of commodity cloud infrastructure. However, E2 does not consider any performance optimizations and thus remains a conceptual prototype.

Most NFs are stateful. Thus, state management is an important topic in NFV. Split-Merge [126] is the first paper that classifies NF states into internal states, external partitioned states, and external coherent states. It provides a good starting point for subsequent NFV frameworks to consider stateful NFs in their designs. In OpenNF [36], Gember-Jacobson et al. observe the complexity of managing NF states during dynamic scaling events. The authors discuss many scaling cases in which some flows need to be migrated to other NF instances, and propose solutions to avoid race conditions when managing distributed states. OpenNF proposes that NF state management should be done together with flow routing.

In NetBricks [109], Panda et al. notice the importance of offering highly optimized data plane and control plane building blocks to NF developers. They observe that using either containers or VMs for running NFs can introduce significant performance overheads, and then rethink the way NFs are built and run: they propose using Rust to develop and isolate NFs. Their approach is limited because NetBricks requires NF vendors to use Rust. However, the paper clearly identifies many runtime-specific functions, such as memory and packet isolation, and motivates many follow-ups. Our work Quadrant [152] proposes a general yet high-performance isolation mechanism that outperforms NetBricks.
SafeBricks [117] considers the NF execution in the cloud and argues for the need of protecting the client’s traffic. It leverages a combination of hardware enclaves and language-based enforcement. How- ever, SafeBricks has a similar problem as NetBricks. On the other side, Zhang et al. introduce OpenNetVM [180] that requires no programming language restrictions. OpenNetVM also provides data and control plane functions, supports DPDK, and considers multicore scalability. On top of OpenNetVM, Ren et al. propose EdgeOS [127] that supports memory and packet isolation in the multi-tenant cloud environment. EdgeOS deploys NFs as individual containers and employs an expensive inter-NF isolation mechanism via packet copying for each inter-NF hop. In Quadrant, we propose a lightweight isolation mechanism that outperforms EdgeOS. NFP [148] is an NFV framework that enables parallelism for NFV by eliminating unnecessary NF de- pendencies. This is orthogonal to our research and can be integrated into ours. 118 5.3 HardwareOffloadinginNFV Offloading computations. The past decade has witnessed a great advancement in hardware accelera- tors, including Field Programmable Gate Arrays (FPGAs), GPUs, smart NICs, and programmable switches. Many research efforts have focused on offloading computations to hardware to accelerate NFV. FPGAs. NFV systems often use NetFPGA, a specialized FPGA that comes with NIC ports and con- nects to the motherboard via PCIe. ClickNP [74] is the first FPGA-accelerated NFV platform. It extends the Click Modular Router [66]. It keeps Click’s modular programming abstraction and supports a C-like programming language. Our work is different: we are not to provide a unified programming interface for hardware-accelerated NFV. Pigasus [181] represents a different type of hardware offloading. In the paper, Zhaoetal. target a single complex IDS NF that is among the most demanding stateful NFs. Pigasus greatly improves IDS’s throughput by employing multi-stage processing and only leaving a very small amount of computations to be done in software. GPUs. NBA [64] extends Click to support GPU offloading and keeps the same modular programming abstraction. In G-Net [178], Zhangetal. observe the inefficiency of existing GPU virtualization approaches. They propose a new GPU virtualization scheme that enables spatial GPU sharing and provide packet and memory isolation for NFs offloaded to the same GPU. Grus [183] lowers latency for NFs on GPUs. In future work, we plan to extend our work Lemur [172] to GPUs using techniques from Grus. SmartNICs. Smart NICs offer energy-efficient processors that can be programmed to offload custom packet processing functions and do not require vast changes in the cloud infrastructure. However, smart NICs often have limited computing and memory capabilities. UNO [72] is an offloading strategy that 119 optimizes the offloading of both NFs and packet switching to smart NICs. Compared to UNO, Lemur is distinguished by its focus on meeting SLOs across chains and hardware platforms. Programmable switches. In around 2014, Protocol-Independent Packet Processors (P4) and pro- grammable switches were introduced by a thread of research [11, 141, 135]. Programmable switches showed strong capabilities in supporting custom packet processing pipelines at line rates of multiple Tbps. Since then, researchers have been seeking opportunities for offloading network processing tasks, in- cluding NFs, to these switches. 
Several papers [35, 90, 61, 21, 112] demonstrate the power of programmable switches in accelerating certain NFs, such as 5G Core, Firewalls, and LBs. They motivate our goal of of- floading NFs to programmable hardware. Lemur generalizes this line of work by considering placement across heterogeneous hardware and is also unique in considering SLOs. Sonata [42] utilizes programmable switches for achieving accurate network measurements. In contrast, Lemur is a generic framework for ac- celerating various NFs. Prior work also has utilized new features in hardware to improve NFV. For example, Metron [60] utilizes OpenFlow switches to offload NF tables and to steer traffic to CPU cores to achieve dynamic scaling. ResQ [149] uses Intel’s CAT to achieve better performance isolation for NFs deployed on the same server to meet performance SLOs. In another paper [139], Sieber et al. demonstrate that NFV can benefit from Data Direct I/O (DDIO) technology. 5.4 Software-basedOptimizationinNFV Dirty-slate optimizations. NFP [148] enables parallelism for NFV. Metron and NetBricks [60, 109] ad- vocates run-to-completion for NFV. Batchy [73] optimizes NF throughput by eliminating batch Fragmen- tation cases. NFVNice and EdgeOS [67, 127] interact with the CPU scheduler to schedule NFs and manage memory allocations for NFs. NFVNice works in a restricted setting without considering multi-tenant clus- ter deployments. EdgeOS provides isolation with high-performance overheads. Some researchers propose 120 system-level optimizations to accelerate operations in NF processing (e.g., instruction executions, memory and cache access) [155, 160, 161, 162, 38]. That said, this line of work is compatible with our work and some of these optimizations have been and could be further integrated. Another line of research [164, 58, 140] considers optimizations for stateful NFs. In Quadrant [152, 153] and Ironside [151], we also design our system to support stateful NFs and provide special NF-state management mechanisms to mask state-access overheads. Software NF placement. SmartChain [156], which explores the placement of software NFs between a smart NIC and CPU cores on a server. Lemur tackles a more practical deployment scenario: many chains deployed in a cluster where a switch interconnects many servers with or without a smart NIC. Other work [70, 71, 20, 40] solve an optimization problem targeting a general deployment scheme. Lemur’s heterogeneous-hardware architecture and run-to-completion execution are novel, and our model reflects practical deployment constraints, optimizing for throughput while meeting latency SLOs. Of these, Cziva et al. [20] and Laghrissi et al. [71] consider a case where user mobility affects the end-to-end NF- chain latency. Lemur targets a different setting where mobility is less of a concern: packet processing at the ingress to a cellular provider’s backhaul network. In [40], Gouareb et al. assume NFs are hosted on separate clusters and model the end-to-end latency by considering both the link delay and the inter-cloud delay. Lemur focuses on latency SLOs within the cluster since that is what a service provider can control. 5.5 Multi-coreCPUScheduling In addition to NFV Literature, our research draws inspiration from the research on multi-core CPU schedul- ing. This line of research has explored mechanisms for scheduling data center workloads (e.g., microser- vices and RPC services) to achieve low latency and high CPU core efficiency. We briefly discuss them. 
ZygOS [118] study scheduling policies for serving RPC requests in a server setting. In ZygOS, Prekas etal. study the queueing models to understand the queueing effects on a single server. They conclude that 121 the centralized queue model performs the best compared to others that use multiple queues. The centralized queue model guarantees that the system achieves work conservation: cores are always busy as long as there are requests on the queue. However, this model is impractical. The authors then propose a practical scheduling scheme, called work-stealing, to approximate the optimal model. At the rack scale, RackSched [184] shows that using Join-the-Shortest-Queue (JSQ) scheduling together with the work-stealing strategy at each server works the best. Since then, researchers have converged on the JSQ + the centralized queue model (or work stealing) strategy. Many follow-ups utilize the above principle in the system design. Among them: Arachne [123] allows applications to manage cores based on dynamic traffic loads. Shenango [104] co-locates latency-sensitive and batch-processing applications together to achieve higher CPU efficiency. Caladan [34] mitigates inter- ference for co-located tasks for better latency performance. A most recent follow-up paper [88] explores optimal combinations of load balancing and scaling methods forµ s-scale RPC applications. Our work Ironside is inspired by this body of work. However, we note that the flow affinity constraint makes NF chain core allocation fundamentally different from them. To understand this, we discuss why neither JSQ nor the work-stealing scheme can be applied in the NFV context. Say an NFV system receives a new packet. At the rack level, it must send the packet to the server responsible for processing the packet’s corresponding flow. This can break JSQ because the system may send the packet to a server without the shortest queue. At the server level, the centralized queue model does not work anymore: even if a core has an empty queue and is idle, the core still cannot pull packets from other cores’ queues. Doing so requires the NFV system to use cross-core synchronizations at the data plane, which is avoided by NFV systems for performance issues [105, 58, 164, 140, 16]. Therefore, in Chapter 4, we ask for a new mechanism required to handle dynamic scaling for NFV. In the future, we plan to support and optimize NFV under more network protocols and practical deployment cases, such as MPTCP [174, 14]. 122 5.6 OtherRelevantTopics Packet scheduling. Using in-network packet scheduling to reduce queueing delay for better latency is not a new idea [116, 93, 43]. RC3 [93] assigns service tiers for TCP sessions so that high-priority flows can take resources from low-priority flows to shorten flow completion time. Ironside co-locates NF chains with best-effort workloads to improve NF chains’ latency performance by borrowing resources from best-effort workloads. Fastpath [116] demonstrates the ability to achieve a near-zero queueing delay with precise in-network packet scheduling. This motivates the design of Ironside: with accurate task scheduling at fine timescales, queueing delay can be reduced significantly for NFV workloads. Moreover, there is a line of research on packet scheduling at the data plane using programmable switches [92, 141, 138, 173]. In future work, we intend to consider NFV task scheduling at the data plane to further reduce overheads of software load balancing. Function-as-a-Service(FaaS). 
FaaS is a serverless computing paradigm that aims to reduce operational and management overheads for hosting web services. FaaS is available on edge and cloud platforms [134, 33, 91, 39], and as open-source software [101, 49]. Research on FaaS falls into two categories. The first improves aspects of FaaS platforms: Sock [99] and LightVM [86] optimize sandbox startup time, SAND [7] optimizes IPC performance, and E3 [81] accelerates FaaS execution with smart NICs to improve cost-efficiency. The second explores new applications made possible by FaaS, such as real-time big data analysis [119, 65] and scalable video encoding [30]. SNF [140] proposed using FaaS to execute NFs; however, SNF does not execute NF chains. FaaS platforms are designed for latency-insensitive general applications and not for NFV. Our research borrows the auto-scaling idea from FaaS platforms and proposes new design principles.

Edge computing. Besides edge cloud offerings from large cloud providers, edge computing also includes enabling and optimizing computation on low-performance devices and intermittently powered systems [84, 83, 122, 96]. In this dissertation, we focus on high-performance NFV deployments in the cloud context, where resources are often abundant. We also intend to extend NFV to work on low-performance edge devices, to explore and support more networking use cases [171, 51, 176, 157, 159, 163].

Finally, this dissertation has been made public in a series of conference papers and technical reports [172, 153, 152, 151, 46, 125].

Chapter 6

Conclusions

In this chapter, we conclude the dissertation. §6.1 lists the major contributions. §6.2 summarizes the primary results of our research. §6.3 describes potential directions for future work.

6.1 Summary of Contributions

This dissertation makes the following contributions to NFV research:
- The recognition that existing design principles and mechanisms, particularly in the execution, the interconnection, and the task scheduling of NFs, may be suboptimal for NFV.
- The design and implementation of a cross-platform NFV platform, Lemur, demonstrating an automated offloading paradigm that optimizes performance and respects hardware-related constraints.
- The design and implementation of a lightweight NF isolation mechanism, increasing per-core throughput by a factor of 2.3 by eliminating unnecessary packet copying and transmissions for NFs deployed on the same server.
- The observation that two types of bursts in real-world traffic make it challenging to achieve low latency for NFV.
- The design and implementation of a rack-scale NFV task scheduler, reducing p99 latency by a factor of 10. The scheduler uses a novel traffic assignment strategy that assigns CPU cores at different spatial and time scales to react quickly and efficiently to bursts.

6.2 Summary of Results

NFV aims to replace hardware middleboxes with software NFs deployed on commodity clouds. Today's cloud platforms are mainly designed and optimized to meet the requirements of general applications. However, NFs are network processing tasks and can have many unique characteristics. This motivates the reexamination of the design principles and mechanisms used to support NFs on cloud platforms.

In this dissertation, we have studied design principles and mechanisms that can improve NFV systems along three dimensions: performance, scalability and efficiency, and cloud deployability. We posit that even if NFV workloads are unique, we can still have performant, scalable, and efficient NFV deployed on top of a commodity cloud.
This fulfills NFV’s vision: to utilize cloud infrastructure to reduce operational and management overheads for operators. We analyze the execution model of NF chains, and nail down three critical aspects of NFV deployment: the execution of NFs, the interconnection of NFs, and the task assignment for NFs. We then show that cloud-based NFV can be performant without sacrificing scalability, efficiency, or cloud deployability. With three systematic solutions, we have shown that NFV systems can benefit from offloading computations automatically to on-path hardware accelerators, adopting a lightweight inter-NF isolation mechanism, and employing a task scheduler with a multi-scale core-scaling strategy. In the first study, we describe the design and implementation of Lemur, i.e., a cross-platform NFV meta-compiler. Lemur automatically determines the optimal hardware offloading strategy under various hardware resource constraints and link capacity constraints. Besides the strategy, Lemur also generates low-overhead code or tables to coordinate NFs deployed across platforms to process traffic. We have validated our algorithm and system design by experimentally deploying multiple NF chains on a rack- scale testbed. In the second study, we focus on NFs deployed on commodity servers. NFV systems are responsible for hosting and isolating many NFs on a server and for moving packets between NFs. We show that existing 126 NFV systems either make impractical assumptions or add significant overheads when transmitting packets and isolating NFs. In Chapter 3, we propose a lightweight isolation mechanism that avoids unnecessary packet copying and transmission operations. We validate the design of the isolation mechanism and show that it improves per-core throughput up to 2.3× compared to alternatives. In the third study, we study the auto-scaling mechanism for achieving low latency under traffic dynam- ics. NF chains can negatively impact user experience if they introduce high latency. We start by proposing a baseline auto-scaling scheme that supports real-time monitoring and reacts to overloads. However, when examining it under realistic traffic inputs, we observe high tail latency of up to 10s milliseconds. With traf- fic analysis, we find two types of superbursts that are responsible for the impacted tail latency. However, none of the existing dynamic scaling mechanisms is capable of handling these superbursts: they use in- adequate scaling signals, make scaling decisions in large timescales, and ignore the performance effect of NFV’s flow processing. In Chapter 4, we solve these problems with a novel multi-scale core-scaling strategy that makes auto-scaling decisions at rack, server, and core-spatial scales, and at increasingly finer timescales. Additionally, this strategy utilizes a more accurate core capacity estimation technique that takes instantaneous flow counts and packet rates into account. In evaluation, we show that our auto- scaling solution can support p99 latency SLOs on the order of 100-500µ s while others exhibit 10× higher p99 latency SLOs. 6.3 FutureWork In the context of NFV, we believe it is interesting to explore the following two directions for future work. 6.3.1 ExploringNFUseCases NFs represent network processing tasks, stemming from many traditional use cases from network oper- ators. Over the past decade, the research community has devoted many efforts to proposing innovative 127 ideas for the NFV system design. 
However, less attention was given to exploring new use cases of network processing and NFV. As of today, NFV systems are still mainly used to host and maintain NFs that were introduced more than a decade ago. We believe that: it is important to start exploring more use cases that can benefit from relevant NFV techniques. Privacy-preservingNFs. Nowadays, companies are having huge incentives to collect user data to support using AI and data analytics tools. However, few companies are capable of providing necessary security or privacy guarantees to their users. We’ve seen frequent data breaches of network operators [131, 18, 113] and cloud services [56, 114]. Thus, providing privacy-preserving network services can be an urgent matter. NFs can be a suitable place to integrate new security and privacy-related functions [179, 120, 121] because NFs are often deployed as the first hop or the last hop in a network and are on the critical path. Fortunately, we start to see companies that make efforts on providing privacy-preserving network services for users, such as Apple private relay [52] and INVISV PGPP service [54]. We envision the number and the scale of privacy-preserving NFs will go up in the following years. Application-specificNFs. These NFs are tightly coupled with applications to offer better performance. Network optimizations have been applied to many applications, including web browsing [29] and video streaming [6]. However, most existing solutions are deployed either at the end host or at the server, treating the network as a mystery box. Application-specific NFs can open the gate for in-network performance optimizations. They can create more possibilities to improve applications’ performance, such as optimizing the allocation of shared resources across devices, across users, and even across applications. Mobile and edge cloud use cases. Cloud providers have started to offer edge cloud capabilities, aim- ing to host communication and computing services (e.g., 5G and wireless functions [3, 76, 77] and edge computation offloading) closer to users [2, 1, 8]. We expect more applications and ISP functionalities to be offloaded to edge clouds, including sensor data processing, IoT and smart home applications [50, 177, 128 37, 165, 169, 168, 166, 167, 75, 175]. NFV can play an important role in the edge, by offering customized network services to manage edge applications. Traffic engineering can be applied to optimize resource allocations and to improve networking delays for such applications. NFV can also help decide to balance the trade-off between performance and power usage for mobile devices. 6.3.2 CloudIntegrationofNFV On the other hand, it is also important to keep improving NFV to make it further integrated into the cloud computing world. To this end, here are several aspects that may add value to existing NFV research. Firstly, future research can investigate many deployment-specific problems, such as handling failures, improving energy efficiency, and achieving dynamic upgrading. Such research would first start by bor- rowing design principles for other cloud applications, and would then examine these design principles with their underlying assumptions, and determine whether they can be applied to NFV. Notice that NFV demonstrates a set of unique characteristics (i.e., stateful packet processing with partitionable states and flow affinity constraint). New design principles may be needed. Secondly, future research can seek to ease the development of NFs. 
Nowadays, developing NFs often requires expertise in computer networking and systems, and sometimes computer hardware. Also, hard- ware accelerators keep evolving and come with new programming paradigms each time, which can be challenging for NF developers. Today’s machine learning solves this problem with a unified programming framework (such as OpenCL and TensorFlow) that supports many hardware backends. Inspired by this, future work can investigate the design and implementation of a unified programming framework. The framework should support common programming language frontends to reduce the learning curve and support many popular backends. Thirdly, future research can seek to mitigate the gap between NFs and other general applications. Serverless computing is a popular development model that works well for creating web services. The key 129 idea is to partition a web application into multiple microservices deployed and scaled separately, which is similar to these datacenter services. Future work would consider applying the microservice model to NFV as well, and extend microservices to support both network-level processing and application-level process- ing. Mitigating the gap between these two types of applications can increase infrastructure efficiency due to a higher degree of multiplexing the underlying cloud infrastructure. 130 Bibliography [1] Amazon Web Services (AWS). AWS marketplace: F5 Distributed Cloud Services. Available at: https://aws.amazon.com/marketplace/pp/prodview-cn54su2icbkkc. 2022. [2] Google Cloud Platform (GCP). Introducing Google Distributed Cloud. Available at: https://cloud.google.com/blog/topics/hybrid-cloud/announcing-google-distributed- cloud-edge-and-hosted. 2022. [3] Sherif Abdelwahab, Bechir Hamdaoui, Mohsen Guizani, and Taieb Znati. “Network function virtualization in 5G”. In: IEEE Communications Magazine 54.4 (2016), pp. 84–91.doi: 10.1109/MCOM.2016.7452271. [4] Bhavish Aggarwal, Aditya Akella, Ashok Anand, Athula Balachandran, Pushkar Chitnis, Chitra Muthukrishnan, Ramachandran Ramjee, and George Varghese. “EndRE: An End-System Redundancy Elimination Service for Enterprises”. In: 7th USENIX Symposium on Networked Systems Design and Implementation (NSDI 10). San Jose, CA: USENIX Association, Apr. 2010.url: https://www.usenix.org/conference/nsdi10-0/endre-end-system-redundancy- elimination-service-enterprises. [5] Bhavish Aggarwal, Aditya Akella, Ashok Anand, Athula Balachandran, Pushkar Chitnis, Chitra Muthukrishnan, Ramachandran Ramjee, and George Varghese. “EndRE: An End-system Redundancy Elimination Service for Enterprises”. In: Proceedings of USENIX/ACM NSDI. 2010. [6] Zahaib Akhtar, Yun Seong Nam, Ramesh Govindan, Sanjay Rao, Jessica Chen, Ethan Katz-Bassett, Bruno Ribeiro, Jibin Zhan, and Hui Zhang. “Oboe: Auto-Tuning Video ABR Algorithms to Network Conditions”. In: Proceedings of the 2018 Conference of the ACM Special Interest Group on Data Communication. SIGCOMM ’18. Budapest, Hungary: Association for Computing Machinery, 2018, pp. 44–58.isbn: 9781450355674.doi: 10.1145/3230543.3230558. [7] Istemi Ekin Akkus, Ruichuan Chen, Ivica Rimac, Manuel Stein, Klaus Satzke, Andre Beck, Paarijaat Aditya, and Volker Hilt. “SAND: Towards High-Performance Serverless Computing”. In: 2018 USENIX Annual Technical Conference (USENIX ATC 18). Boston, MA: USENIX Association, 2018, pp. 923–935.isbn: 978-1-931971-44-7.url: %5Curl%7Bhttps://www.usenix.org/conference/atc18/presentation/akkus%7D. [8] Microsoft Azure. Microsoft Azure: Hybrid And Multi-cloud Solutions. 
Available at: https://azure.microsoft.com/en-us/solutions/hybrid-cloud-app. 2022. 131 [9] Tom Barbette, Georgios P. Katsikas, Gerald Q. Maguire, and Dejan Kostić. “RSS++: Load and State-Aware Receive Side Scaling”. In:Proceedingsofthe15thInternationalConferenceonEmerging Networking Experiments And Technologies. CoNEXT ’19. Orlando, Florida: Association for Computing Machinery, 2019, pp. 318–333.isbn: 9781450369985.doi: 10.1145/3359989.3365412. [10] BESS (Berkeley Extensible Software Switch). The official BESS Github repository: https://github.com/NetSys/bess. 2022. [11] Pat Bosshart, Dan Daly, Glen Gibb, Martin Izzard, Nick McKeown, Jennifer Rexford, Cole Schlesinger, Dan Talayco, Amin Vahdat, George Varghese, and David Walker. “P4: Programming Protocol-independent Packet Processors”. In: SIGCOMM Comput. Commun. Rev. 44.3 (July 2014), pp. 87–95.issn: 0146-4833.doi: 10.1145/2656877.2656890. [12] Pat Bosshart, Glen Gibb, Hun-Seok Kim, George Varghese, Nick McKeown, Martin Izzard, Fernando Mujica, and Mark Horowitz. “Forwarding metamorphosis: Fast programmable match-action processing in hardware for SDN”. In: ACM SIGCOMM Computer Communication Review. Vol. 43. ACM. 2013, pp. 99–110. [13] Anat Bremler-Barr, Yotam Harchol, and David Hay. “OpenBox: A Software-Defined Framework for Developing, Deploying, and Managing Network Functions”. In: Proceedings of the 2016 ACM SIGCOMM Conference. SIGCOMM ’16. Florianopolis, Brazil: Association for Computing Machinery, 2016, p. 511.isbn: 9781450341936.doi: 10.1145/2934872.2934875. [14] Milind M Buddhikot and ZENG Yijing. Designs of an MPTCP-aware load balancer and load balancer using the designs. US Patent 11,223,565. Jan. 2022. [15] CAIDA UCSD Anonymized Internet Traces. Available at http://www.caida.org/data/passive/passive _ dataset.xml. 2019. [16] Fabricio B. Carvalho, Ronaldo A. Ferreira, Italo Cunha, Marcos A. M. Vieira, and Murali K. Ramanathan. “Dyssect: Dynamic Scaling of Stateful Network Functions”. In: Proceedings of IEEE International Conference on Computer Communications. (INFOCOM) ’22. Online: Institute of Electrical and Electronics Engineers, 2022, pp. 1529–1538. [17] Adrian M Caulfield, Eric S Chung, Andrew Putnam, Hari Angepat, Jeremy Fowers, Michael Haselman, Stephen Heil, Matt Humphrey, Puneet Kaur, Joo-Young Kim, et al. “A cloud-scale acceleration architecture”. In: 2016 49th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO). IEEE. 2016, pp. 1–13. [18] Niraj Chokshi. T-Mobile Says Hacker Got Data From 37 Million Customer Accounts. https://www.nytimes.com/2023/01/19/business/t-mobile-hacked-data-breach.html. Jan. 2023. [19] Paolo Costa, Austin Donnelly, Antony Rowstron, and Greg O’Shea. “Camdoop: Exploiting In-network Aggregation for Big Data Applications”. In: 9th USENIX Symposium on Networked Systems Design and Implementation (NSDI 12). San Jose, CA: USENIX Association, Apr. 2012, pp. 29–42.url: https://www.usenix.org/conference/nsdi12/technical-sessions/presentation/costa. 132 [20] Richard Cziva, Christos Anagnostopoulos, and Dimitrios P Pezaros. “Dynamic, latency-optimal VNF placement at the network edge”. In: Ieee infocom 2018-ieee conference on computer communications. IEEE. 2018, pp. 693–701. [21] Rakesh Datta, Sean Choi, Anurag Chowdhary, and Younghee Park. “P4guard: Designing p4 based firewall”. In: MILCOM 2018-2018 IEEE Military Communications Conference (MILCOM). IEEE. 2018, pp. 1–6. [22] Digital Corpora Enterprise Network Traces. 
Available at https://downloads.digitalcorpora.org/corpora/scenarios/2009-m57-patents/net. 2009. [23] J. Duato, A. J. Peña, F. Silla, R. Mayo, and E. S. Quintana-Ortí. “rCUDA: Reducing the number of GPU-based accelerators in high performance clusters”. In: 2010 International Conference on High Performance Computing Simulation. June 2010, pp. 224–231.doi: 10.1109/HPCS.2010.5547126. [24] Dmitry Duplyakin, Robert Ricci, Aleksander Maricq, Gary Wong, Jonathon Duerig, Eric Eide, Leigh Stoller, Mike Hibler, David Johnson, Kirk Webb, Aditya Akella, Kuangching Wang, Glenn Ricart, Larry Landweber, Chip Elliott, Michael Zink, Emmanuel Cecchet, Snigdhaswin Kar, and Prabodh Mishra. “The Design and Operation of CloudLab”. In: 2019 USENIX Annual Technical Conference (USENIX ATC 19). Renton, WA: USENIX Association, July 2019, pp. 1–14.isbn: 978-1-939133-03-8.url: https://www.usenix.org/conference/atc19/presentation/duplyakin. [25] Kjeld Egevang and Paul Francis. RFC 1631: The IP network address translator (NAT). Tech. rep. https://www.rfc-editor.org/rfc/rfc1631. 1994. [26] Daniel E. Eisenbud, Cheng Yi, Carlo Contavalli, Cody Smith, Roman Kononov, Eric Mann-Hielscher, Ardas Cilingiroglu, Bin Cheyney, Wentao Shang, and Jinnah Dylan Hosein. “Maglev: A Fast and Reliable Software Network Load Balancer”. In: 13th USENIX Symposium on Networked Systems Design and Implementation (NSDI 16). Santa Clara, CA: USENIX Association, Mar. 2016, pp. 523–535.isbn: 978-1-931971-29-4.url: https: //www.usenix.org/conference/nsdi16/technical-sessions/presentation/eisenbud. [27] Abdessalam Elhabbash, Assylbek Jumagaliyev, Gordon S Blair, and Yehia Elkhatib. “SLO-ML: A Language for Service Level Objective Modelling in Multi-cloud Applications”. In: Proceedings of the 12th IEEE/ACM International Conference on Utility and Cloud Computing. 2019, pp. 241–250. [28] Daniel Firestone, Andrew Putnam, Sambhrama Mundkur, Derek Chiou, Alireza Dabagh, Mike Andrewartha, Hari Angepat, Vivek Bhanu, Adrian Caulfield, Eric Chung, et al. “Azure accelerated networking: SmartNICs in the public cloud”. In: 15th{USENIX} Symposium on Networked Systems Design and Implementation ({NSDI} 18). 2018, pp. 51–66. [29] Tobias Flach, Nandita Dukkipati, Andreas Terzis, Barath Raghavan, Neal Cardwell, Yuchung Cheng, Ankur Jain, Shuai Hao, Ethan Katz-Bassett, and Ramesh Govindan. “Reducing Web Latency: The Virtue of Gentle Aggression”. In: SIGCOMM Comput. Commun. Rev. 43.4 (Aug. 2013), pp. 159–170.issn: 0146-4833.doi: 10.1145/2534169.2486014. 133 [30] Sadjad Fouladi, Riad S. Wahby, Brennan Shacklett, Karthikeyan Vasuki Balasubramaniam, William Zeng, Rahul Bhalerao, Anirudh Sivaraman, George Porter, and Keith Winstein. “Encoding, Fast and Slow: Low-Latency Video Processing Using Thousands of Tiny Threads”. In: 14thUSENIXSymposiumonNetworkedSystemsDesignandImplementation(NSDI17). Boston, MA: USENIX Association, Mar. 2017, pp. 363–376.isbn: 978-1-931971-37-9.url: https: //www.usenix.org/conference/nsdi17/technical-sessions/presentation/fouladi. [31] Linux Foundation. Data Plane Development Kit (DPDK). The official DPDK website: http://www.dpdk.org. 2015. [32] Linux Foundation. Mobile cloud gaming: the real-world cloud gaming experience in Los Angeles. The official RootMetrics website: https://rootmetrics.com/en-US/content/us-LA-gaming-report-2020. May 2020. [33] The Apache Software Foundation. Apache OpenWhisk. http://openwhisk.org/. Accessed on May 2020. 2017. [34] Joshua Fried, Zhenyuan Ruan, Amy Ousterhout, and Adam Belay. 