ALGORITHMS AND ARCHITECTURES FOR HIGH-PERFORMANCE IP LOOKUP AND PACKET CLASSIFICATION ENGINES by Thilan Ganegedara A Dissertation Presented to the FACULTY OF THE GRADUATE SCHOOL UNIVERSITY OF SOUTHERN CALIFORNIA In Partial Fulfillment of the Requirements for the Degree DOCTOR OF PHILOSOPHY (ELECTRICAL ENGINEERING) December 2013 Copyright 2013 Thilan Ganegedara Dedication To my wife and my parents. ii Acknowledgments First and foremost, I would like to express my sincere gratitude to my advisor, Prof. Viktor Prasanna. I am truly grateful for him for accepting me as a student to his research group. I would also like to thank my qualifying examination and Ph.D. defense committee: Dr. Gordon Brebner, Prof. Ramesh Govindan, Prof. Timothy Pinkston, Prof. Cauligi Raghavendra and Prof. John Sylvester for their encouragement, constructive criticisms and guidance. A special thanks to Dr. Gordon Brebner for giving me the opportunity to work with him at Xilinx Inc. as a visiting researcher. I was blessed with a very supportive and an intellectual research group, who made my studies at University of Southern California a very pleasant one. In no particular order, I would like to thank my research associates Nam Ma, Hoang Le, Hyeran Jeon, Mike Giakkoupis, Lucia Sun, Swapnil Haria, Gagandeep Singh, Qingbo Wang, Weirong Jiang, Edward Yang, Yun Qu, Da Tong, Andrea Sanny, Kiran Matam, Ren Chen and Shi- jie Zhou. The Electrical Engineering department staff is also gratefully acknowledged for always being very friendly and supportive whenever I reached out to them. My undergraduate advisors Dr. Lilantha Samaranayake, Prof. Nimal Ekanayake, Dr. Sanath Alahakoon and Prof. Ratnajeevan Hoole are sincerely acknowledged for their continuous support during my undergraduate studies and my higher studies application process. iii I am also deeply grateful to my graduate advisors Ms. Diane Demetras and Ms. Jennifer Gerson, and associate professor Bhaskar Krishnamachari for their kind advices and immense support during the thesis submission process, which ensured timely sub- mission of my thesis. Last but certainly not least, my heartiest gratitude goes to my loving wife, Dulanjalie Ganegedara, and my caring parents, Sugath and Padmini Ganegedara, without whom, none of this would have been possible. Their unwavering support, encouragement, and motivation has made me the person that I am today. I could not have wished for a better spouse or better parents. A special thanks to you from the bottom of my heart. iv Table of Contents Dedication ii Acknowledgments iii List of Figures ix List of Tables xii Abstract xiii Chapter 1: Introduction 1 1.1 Trends of the Internet . . . . . . . . . . . . . . . . . . . . . . . . . 1 1.2 Internet Routers . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2 1.2.1 Overview . . . . . . . . . . . . . . . . . . . . . . . . . . . 2 1.2.2 Design Considerations . . . . . . . . . . . . . . . . . . . . 4 1.3 IP Lookup . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7 1.3.1 Longest Prefix Matching (LPM) . . . . . . . . . . . . . . . 7 1.3.2 IPv4 and IPv6 Addressing . . . . . . . . . . . . . . . . . . 8 1.4 Packet Classification . . . . . . . . . . . . . . . . . . . . . . . . . 10 1.5 Motivations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11 1.5.1 Router Virtualization . . . . . . . . . . . . . . . . . . . . . 12 1.5.2 IPv6 Lookup for Backbone Networks . . . . . . . . . . . . 13 1.5.3 Packet Classification . . . . . . . . . . . . . . . . . . . . . 
14 1.6 Contributions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 16 1.6.1 Scalable Router Virtualization with Dynamic Updates . . . . 16 1.6.2 Performance Modeling of Virtual Routers . . . . . . . . . . 17 1.6.3 High Performance IPv6 Forwarding for Backbone Routers . . . . . . . . . . . . . . . . . . . . . . . 19 1.6.4 Ruleset-Feature Independent Packet Classification . . . . . . 21 1.7 Organization . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 23 v Chapter 2: Background and Related Work 24 2.1 Networking Platforms . . . . . . . . . . . . . . . . . . . . . . . . . 24 2.1.1 Field Programmable Gate Array (FPGA) . . . . . . . . . . . 25 2.1.2 General Purpose Multi-Core . . . . . . . . . . . . . . . . . 27 2.2 IP Lookup . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 29 2.2.1 Background . . . . . . . . . . . . . . . . . . . . . . . . . . 29 2.2.2 Related Work . . . . . . . . . . . . . . . . . . . . . . . . . 30 2.3 Router Virtualization . . . . . . . . . . . . . . . . . . . . . . . . . 33 2.3.1 Background . . . . . . . . . . . . . . . . . . . . . . . . . . 33 2.3.2 Related Work . . . . . . . . . . . . . . . . . . . . . . . . . 35 2.4 Packet Classification . . . . . . . . . . . . . . . . . . . . . . . . . 38 2.4.1 Background . . . . . . . . . . . . . . . . . . . . . . . . . . 38 2.4.2 Related Work . . . . . . . . . . . . . . . . . . . . . . . . . 39 Chapter 3: Scalable Router Virtualization with Dynamic Updates 42 3.1 Fill-In Algorithm and Data Structure . . . . . . . . . . . . . . . . . 42 3.1.1 Problem Definition . . . . . . . . . . . . . . . . . . . . . . 42 3.1.2 Fill-In: A Distance-Based Mapping Technique . . . . . . . 43 3.1.3 Node Structure for Fill-In . . . . . . . . . . . . . . . . . . . 46 3.1.4 Memory Requirement Analysis . . . . . . . . . . . . . . . . 47 3.2 FPGA Implementation . . . . . . . . . . . . . . . . . . . . . . . . 47 3.2.1 Architecture: IP Lookup . . . . . . . . . . . . . . . . . . . 48 3.2.2 Architecture: Incremental Updates . . . . . . . . . . . . . . 50 3.3 Experimental Results . . . . . . . . . . . . . . . . . . . . . . . . . 52 3.3.1 FPGA Platform and Routing Table Sources . . . . . . . . . 52 3.3.2 Update Capability . . . . . . . . . . . . . . . . . . . . . . . 52 3.3.3 Scalability . . . . . . . . . . . . . . . . . . . . . . . . . . . 54 3.3.4 Throughput and Resource Usage . . . . . . . . . . . . . . . 56 3.4 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 58 Chapter 4: Performance Modeling of Virtual Routers 60 4.1 Notations and Assumptions . . . . . . . . . . . . . . . . . . . . . . 60 4.2 Router Models for Power Estimation . . . . . . . . . . . . . . . . . 62 4.2.1 Non-virtualized . . . . . . . . . . . . . . . . . . . . . . . . 63 4.2.2 Virtualized-separate . . . . . . . . . . . . . . . . . . . . . . 64 4.2.3 Virtualized-merged . . . . . . . . . . . . . . . . . . . . . . 65 4.3 Virtual Routers on FPGA . . . . . . . . . . . . . . . . . . . . . . . 66 4.3.1 Static power . . . . . . . . . . . . . . . . . . . . . . . . . . 67 4.3.2 Power Consumed by Memory . . . . . . . . . . . . . . . . 68 4.3.3 Power Consumed by Logic . . . . . . . . . . . . . . . . . . 69 4.3.4 Pipelined IP Lookup . . . . . . . . . . . . . . . . . . . . . 70 4.3.5 Routing Tables . . . . . . . . . . . . . . . . . . . . . . . . 72 vi 4.4 Virtualized Router: Power Performance . . . . . . . . . . . . . . . 73 4.4.1 Total Power Dissipation: Experimental vs. Estimation . . . . 74 4.4.2 Power Efficiency . . . . . . . . . . . . . . . . . . 
. . . . . 78 4.5 Generalization of Virtual Routers . . . . . . . . . . . . . . . . . . . 79 4.5.1 Virtual Router Models . . . . . . . . . . . . . . . . . . . . 79 4.5.2 Memory Footprint . . . . . . . . . . . . . . . . . . . . . . 86 4.5.3 Grouped Router Virtualization . . . . . . . . . . . . . . . . 87 4.6 Virtualized Router Architecture on FPGA . . . . . . . . . . . . . . 91 4.7 Performance Evaluation . . . . . . . . . . . . . . . . . . . . . . . . 93 4.7.1 Throughput . . . . . . . . . . . . . . . . . . . . . . . . . . 94 4.7.2 Power Efficiency . . . . . . . . . . . . . . . . . . . . . . . 96 4.7.3 Chip Floor-planning . . . . . . . . . . . . . . . . . . . . . 99 4.8 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 102 Chapter 5: High Performance IPv6 Forwarding for Backbone Routers 104 5.1 Routing Table Statistics . . . . . . . . . . . . . . . . . . . . . . . . 104 5.2 Range Tree-based IPv6 Lookup Approach . . . . . . . . . . . . . . 108 5.2.1 Enabling Parallelism for IP lookup . . . . . . . . . . . . . . 109 5.2.2 Disjoint Grouping of Prefixes . . . . . . . . . . . . . . . . . 110 5.2.3 Algorithm and Partitioning . . . . . . . . . . . . . . . . . . 111 5.3 IPv6 Lookup Architecture . . . . . . . . . . . . . . . . . . . . . . . 114 5.3.1 Hardware Architecture . . . . . . . . . . . . . . . . . . . . 114 5.3.2 Software Architecture . . . . . . . . . . . . . . . . . . . . . 116 5.4 Lookup Engine Performance . . . . . . . . . . . . . . . . . . . . . 118 5.4.1 Hardware Architecture . . . . . . . . . . . . . . . . . . . . 118 5.4.2 Software Lookup Engine . . . . . . . . . . . . . . . . . . . 122 5.5 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 128 Chapter 6: Ruleset-Feature Independent Packet Classification 129 6.1 Motivation and Algorithm . . . . . . . . . . . . . . . . . . . . . . . 129 6.1.1 Bit-Vector Based Packet Classification . . . . . . . . . . . . 130 6.1.2 StrideBV Algorithm . . . . . . . . . . . . . . . . . . . . . 130 6.1.3 Modularization and Integration of Range Search . . . . . . . 135 6.2 Hardware Architecture for Packet Classification . . . . . . . . . . . 139 6.2.1 StrideBV Architecture . . . . . . . . . . . . . . . . . . . . 139 6.2.2 Range Search Integration . . . . . . . . . . . . . . . . . . . 142 6.2.3 Modularization . . . . . . . . . . . . . . . . . . . . . . . . 143 6.3 Performance Evaluation . . . . . . . . . . . . . . . . . . . . . . . . 145 6.3.1 Memory Requirement . . . . . . . . . . . . . . . . . . . . . 145 6.3.2 Throughput and Packet Latency . . . . . . . . . . . . . . . 148 6.3.3 Power Efficiency . . . . . . . . . . . . . . . . . . . . . . . 152 vii 6.3.4 Comparison with Existing Literature . . . . . . . . . . . . . 155 6.4 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 156 Chapter 7: Conclusion 158 Bibliography 163 viii List of Figures 1.1 High level system architecture of a router . . . . . . . . . . . . . . 4 1.2 An example illustrating the Longest Prefix Matching (LPM) algo- rithm . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7 1.3 Address architecture of (a) IPv4 and (b) IPv6 . . . . . . . . . . . . 9 2.1 Resource layout of modern FPGAs. The 4 quadrants illustrate four possible use-case scenarios using different types of resources avail- able. Various other configurations are possible by utilizing the on- chip resources. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 27 2.2 High level architecture of a modern multi-core processor and the memory system. 
The numbers within brackets indicate the typical latency values associated with different types of memory units. . . . 29 2.3 An example routing table and its corresponding trie . . . . . . . . . 30 2.4 An example virtualized networking topology . . . . . . . . . . . . . 34 2.5 Router virtualization approaches (a) Merged and (b) Separate . . . . 36 3.1 Shared leaf node data structure for merged virtualized routers. . . . 43 3.2 Virtual tries for tables A and B. . . . . . . . . . . . . . . . . . . . . 44 3.3 Fill-In trie of virtual tries A and B. . . . . . . . . . . . . . . . . . . 45 3.4 Node structures for IP lookup with Fill-In: (a) IPv4 and (b) IPv6. . . 48 3.5 IP lookup architecture for Fill-In on FPGA. Two packets/updates can be fed, every clock cycle, to the parallel-linear pipeline. . . . . . 49 3.6 Support for updates (a) Modifies and (b) Inserts/Deletes with dual- ported BRAM. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 51 3.7 Variation of pipeline length for differentC values. . . . . . . . . . . 54 3.9 Performance (clock frequency and throughput) and resource usage (number of BRAM blocks and slices used) of the FPGA-based archi- tecture. X-axis shows the number of virtual routers and within parenthesis, the number of pipeline stages. . . . . . . . . . . . . . . 57 4.1 BRAM power variation with operating frequency (Note: The num- ber within parenthesis denotes the speed grade). . . . . . . . . . . . 69 ix 4.2 Per stage logic and signal power consumption (Note: The value inside parenthesis denotes the speed grade). . . . . . . . . . . . . . 71 4.3 Pointer and NHI memory requirements for merged ( = 80% and = 20%) and separate approaches. . . . . . . . . . . . . . . . . . 73 4.4 Comparison of total power consumption in virtualized and non-virtualized schemes for speed grades -2 (top) and -1L (bottom). . . . . . . . . . 74 4.5 Comparison of total power consumption in different virtualized schemes for speed grades -2 (top) and -1L (bottom). . . . . . . . . . . . . . . 75 4.6 Percentage error of the model estimation compared with the experi- mental results for speed grades -2 (top) and -1L (bottom). . . . . . . 76 4.7 Power dissipated per unit throughput for speed grades -2 (left) and -1L (right). . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 77 4.8 Normalized node distribution for real IPv4 routing tables. . . . . . . 81 4.9 Pointer and NHI memory variation for increasing number of virtual routers for VS, VM for = 0:2 and VM for = 0:8. . . . . . . . . 84 4.10 Total memory consumption of VM and VS approaches. . . . . . . . 87 4.11 Total memory consumption of VG for = 20% and VG for = 80% vs. VS and VM for = 80%. . . . . . . . . . . . . . . . . . . 89 4.12 Pipelined FPGA architecture for the VG approach. . . . . . . . . . 92 4.13 Throughput variation for increasing number of virtual routers. . . . 95 4.14 Total power consumption variation with increasing number of vir- tual routers. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 99 4.15 Power efficiency variation with increasing number of virtual routers. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 99 5.1 IPv6 address architecture. . . . . . . . . . . . . . . . . . . . . . . . 105 5.2 Normalized prefix length distribution for real and synthetic IPv6 routing tables generated using RRC00 backbone routing table . . . . 108 5.3 Partitioning of real ( e 10 K) and synthetic ( e 350 K) routing tables with and without the use of Algorithm 2. 
Figures 5.3a, 5.3c, 5.3e show the partitioning of real routing table and Figures 5.3b, 5.3d, 5.3f shows the partitioning of the synthetic routing table. Note that the upper X-axis corresponds to aggregated partition numbers and the lower X-axis corresponds to the initial partition numbers. Partitions are sorted based on the number of prefixes contained. . . . . . . . . 112 5.4 Pipelined IPv6 lookup architecture on FPGA. The two pipelines shown are aligned in such a way that the stage memory of the range trees are aligned along BRAM columns on FPGA for improved resource usage. . . . . . . . . . . . . . . . . . . . . . . . . . . . . 116 5.5 Hierarchical multi-threaded architecture of the proposed IPv6 lookup engine. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 117 x 5.6 Throughput and memory footprint of the hardware lookup engine for increasing routing table size. The dotted line denotes the maxi- mum on-chip memory available on the Virtex 7 X1140T FPGA. . . 119 5.7 Power consumption of the IPv6 lookup engine for increasing routing table size. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 120 5.8 Power and memory requirement variation with increasing number of partitions. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 120 5.9 Performance of the software lookup engine for varying IPPT val- ues. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 123 5.10 Best and worst case performance of master thread only approach. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 125 5.11 Scalability of the software IP lookup engine. . . . . . . . . . . . . . 126 5.12 Architecture with integrated initial lookup and master-only approach. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 127 5.13 Performance of the multi-core IPv6 engine with initial lookup inte- grated. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 127 6.1 FSBV bit-vector generation and header processing example. . . . . 133 6.2 StrideBV pipelined architecture (BVP/BVM/BVR - Previous/Memory/Resultant Bit-Vectors, HDR - 5-field header, Stride -k bit header stride) . . . . . . . . . . . . . . . . . . . . . . . . . . 140 6.3 Memory efficient range search implementation on FPGA via explicit range storage . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 141 6.4 Serial architecture for latency tolerant applications. . . . . . . . . . 143 6.5 Parallel architecture for low latency applications. . . . . . . . . . . 143 6.6 Memory requirement of the proposed solution for stride sizesk = f2; 4; 8g and increasing classifier size. . . . . . . . . . . . . . . . . 146 6.7 Variation of 1) multiplication factor, 2) classification latency and 3) BRAM utilization with increasing stride size. . . . . . . . . . . . . 148 6.8 Tradeoff between memory and logic slice utilization with increasing stride size. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 154 6.9 Power consumption of a) serial and b) parallel architectures. . . . . 155 xi List of Tables 2.1 An example 5-tuple rule set. (SA/DA:8-bit; SP/DP: 4-bit; Protocol: 2-bit) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 38 3.1 Memory Requirement Analysis . . . . . . . . . . . . . . . . . . . . 48 4.1 Notations and Symbols . . . . . . . . . . . . . . . . . . . . . . . . 61 4.2 Virtex 6 XC6VLX760 Device Specs . . . . . . . . . . . . . . . . . 66 4.3 BRAM power model . . . . . . . . . . . . . . . . . . . . . . . . . 69 4.4 Notations and Symbols . . . . . . . . 
. . . 80
4.5 Scalability of different virtualization approaches on a Virtex 7 2000T device using on-chip memory . . . 90
4.6 Performance before and after chip floor-planning . . . 100
5.1 Routing table statistics obtained from [50] dated 07/30/2012 . . . 106
6.1 Comparison of variations to StrideBV . . . 134
6.2 Performance of the serial and parallel architectures . . . 149
6.3 Performance comparison with existing literature (*Has no support for arbitrary ranges. Inclusion of arbitrary ranges could dramatically increase memory required per rule.) . . . 154

Abstract

The Internet has become ubiquitous within the past few decades. The number of active users of the Internet has reached 2.5 billion and the number of Internet-connected devices has reached 11 billion as of the year 2012. Considering this proliferation of Internet users and devices, forecasts show that network traffic is expected to grow threefold between 2012 and 2017, resulting in 1.4 Zettabytes of data exchanged on the Internet in the year 2017. These enormous amounts of traffic demand high forwarding rates to satisfy the requirements of various time-critical applications. For example, multimedia applications such as video streaming, Voice over IP (VoIP) and gaming require high bandwidth and low latency packet delivery. To meet such demands, network speeds have increased significantly since the inception of the Internet, from 10 Mbps to 100 Gbps within three decades. Such improvements in throughput are facilitated by advancements in the underlying forwarding algorithms and the processing platforms used for networking.

The goal of this research is to harness the processing capabilities and memory capacities of current state-of-the-art hardware and software platforms to devise wire-speed packet forwarding engines that are suitable for the future Internet. Even though existing networking platforms possess the raw processing power and memory capacity, designing packet forwarding engines that meet the performance demands of future networks is not straightforward. It requires leveraging both algorithmic and architectural aspects of the solution and the platform, respectively, which forms the basis for our research. Specifically, four research problems are studied in this dissertation. They are as follows:

Scalable router virtualization with dynamic updates: With the advent of data centers and cloud computing, router virtualization is gaining popularity in the networking industry. Dedicated networking equipment on a per user (or virtual network) basis is expensive as well as not scalable. Router virtualization allows consolidation of multiple physical routers onto a single shared platform. In this research, scalable algorithms and architectures for large-scale router virtualization are developed. Update capabilities are integrated into the lookup architecture to enable non-blocking, incremental routing table updates.

Performance modeling of virtual routers: Mapping multiple virtual routing tables onto a shared physical platform is challenging under stringent memory constraints, especially on hardware platforms. Furthermore, it is important to know how many virtual networks can be supported on a given amount of hardware resources and what the performance would be.
Hence, theoretical models for virtualized router performance are developed and a comprehensive performance evaluation of virtual routers is presented.

High performance IPv6 forwarding for backbone routers: The successor of the most prevalent logical addressing scheme in the Internet (IPv4) is IPv6. With this, several challenges arise from the packet forwarding engine's standpoint: 1) increased routing table storage requirement, 2) increased lookup complexity, and 3) sustaining high performance. A versatile IPv6 lookup engine is developed that is suitable for both software and hardware platforms. The performance of the proposed approach, evaluated on both software and hardware platforms, shows that the solution is suitable for deployment in state-of-the-art 100 Gbps line-cards.

Ruleset-feature independent packet classification: Most packet classification solutions rely on various features of the classifier (or ruleset) to achieve low memory consumption and their reported performance. However, the unavailability of such classifier features may cause such solutions to yield poor performance, rendering them suitable for only a subset of classifiers. A ruleset-feature independent packet classification engine that delivers deterministic performance for any classifier is proposed and evaluated.

The aforementioned solutions are evaluated using state-of-the-art Field Programmable Gate Arrays (FPGAs). The IPv6 forwarding engine is implemented on both FPGA and general purpose multi-core processors to illustrate the versatility of the proposed solution. Performance evaluations demonstrate superior results compared with existing solutions, with respect to throughput, memory consumption and power consumption. While the Internet backbone links are being upgraded to 100 Gbps rates, 400 Gbps and even 1 Tbps links are on the roadmap. Achieving such throughput rates while ensuring that power and packet latency demands are met is a challenging task. This dissertation takes a step in this direction by proposing and developing novel lookup algorithms for packet forwarding engines that will meet and exceed the demands of the future Internet.

Chapter 1
Introduction

1.1 Trends of the Internet

The Internet has become an integral part of modern society. This has resulted in widespread use of the Internet throughout the world, which has caused it to transform from a network of a few hundred nodes to a network of several billion nodes in three decades. Forecasts [61, 62] indicate that it will continue to grow in an exponential manner, which will take us to the Zettabyte era in the year 2015. By 2017, the Internet is expected to change in the following manner compared with its status in 2012 [61]:

Number of users: 2.3 billion to 3.6 billion, a 1.5× growth
Total number of Internet connected devices: 11.5 billion to 19.2 billion, a 1.7× growth
Global IP traffic: Expected to increase 3-fold, a compound annual growth rate (CAGR) of 23%
Global Internet video: Expected to increase 4-fold, a CAGR of 30%
Global mobile data traffic: Expected to increase 13-fold, a CAGR of 66%

Such enormous growth rates pose great challenges to the current Internet infrastructure. Within 5 years, the Internet infrastructure has to scale to support a network that is nearly two times bigger, that carries nearly three times more data, with four times faster network speeds than it does today. In order to facilitate these demands of the future Internet, routers will have to host larger forwarding information bases (e.g.
routing tables), forward traffic at faster rates, ensure the critical Quality of Service (QoS) demands of real-time applications and, in addition, operate under a limited power budget. This calls for novel packet forwarding engine architectures with capabilities and efficiency superior to those available today.

1.2 Internet Routers

1.2.1 Overview

The kernel function of a router is to perform end-to-end packet delivery through a network of connected devices. Internet Protocol (IP) addresses are used to address the devices in the Internet, and packet routing is done based on a routing table stored in the routers. A routing table contains a set of ⟨prefix, action⟩ pairs, which defines how to route packets belonging to different networks to ensure end-to-end packet delivery. A prefix is a convenient way to address a collection of devices using a single entry rather than having one entry per address. An example of prefix notation is 192.168.10.0/24, which is a compact representation of the IP addresses in the range [192.168.10.0, 192.168.10.255]. The action field indicates through which port of the router the packets that match the prefix need to be forwarded. Between the source and the destination of a packet, multiple intermediate routers may exist depending on the path selected for packet delivery. The routing table residing in a router can be generated in two ways:

Static configuration: The network administrator updates the routing table based on the network topology information. While this is a feasible approach for small networks, in networks as large as the Internet, static configuration can be an impossible task. The main reasons are the size of the network and the possibility of link or node (i.e. router) failure. In the case of a link or node failure, static configuration either has to have backup paths configured or the network administrator has to manually intervene and update the routing table to account for network topology changes. In large networks, this becomes an impractical task to perform.

Dynamic configuration: A routing protocol, an algorithm running inside a router, monitors and updates the routing table as necessary. The routing protocol takes abrupt link and node failures in its connected network into account and updates the routing table to indicate possible alternative paths that route packets to their final destination without any loss of service. For this reason, all the routers in the Internet use dynamic configuration, and several routing protocols are available for use depending on the level at which the router appears in the Internet (e.g., customer-edge, network-edge, core).

A physical router is logically separated into two planes to simplify its operation: the control plane and the data plane. The control plane runs the router operating system, which provides administrative interfaces, runs the aforementioned routing protocols and administers resource management tasks. The data plane processes the data packets arriving from the network interfaces, makes the routing decision and forwards the packet through the appropriate network interface. It uses the routing table populated via static/dynamic configuration and forwards the packets accordingly. The interaction between the control and data planes occurs when the routing table or the forwarding information base residing in the data plane needs to be updated to account for any changes that occurred in the network topology.
In such events, the control plane will send control packets to the data plane with the update information, and the data plane makes the changes to the routing table accordingly. Also, the data plane collects information related to packet flows that is important for planning and billing purposes, and this information is sent to the control plane periodically. The high level system architecture of a router is illustrated in Figure 1.1.

Figure 1.1: High level system architecture of a router

In this research, the focus is solely on optimizing the router data plane for different time-critical networking tasks using state-of-the-art networking platforms.

1.2.2 Design Considerations

Modern routers have stringent performance demands in order to facilitate the immense data exchange taking place in the Internet. These demands continue to heighten with the ever-increasing growth of network traffic and the QoS requirements of multimedia applications. Such applications include video streaming, audio/video conferencing, Voice over IP (VoIP) and gaming. Low latency and high throughput are critical for these applications. On the other hand, from the network equipment standpoint, the resource (logic, computing, memory) consumption and power consumption need to be within acceptable limits while ensuring that the required performance is delivered. Further, networking equipment needs to be scalable in order to facilitate the growth of networks. All these demands require a careful balance of performance, resource consumption and power consumption, which is the basis for the research presented in this dissertation. In this section, we describe the design considerations of modern routers in detail.

Throughput: Throughput of a router measures the amount of data that can be forwarded in unit time. Since a router operates on IP packets, the throughput is related to the packet size. However, the packets in Internet traffic can be of disparate sizes¹. For this reason, two metrics exist in the literature for reporting the packet processing speed of a router. One is lookups per second (lps or LPS) and the other is bits per second (bps)². The former simply states how many packets are forwarded by the router in unit time and does not take packet size into account. The latter reports how many data bits were forwarded in unit time. In order to report the worst case throughput, the minimum packet size is often used, which is 40 bytes for IPv4 packets and 64 bytes for IPv6 packets.

Packet latency: The time difference between the ingress and egress times of a packet at a router is known as the packet latency. It denotes the delay a packet incurs in passing through a router. Depending on the amount of traffic, the congestion level and the processing capabilities, the packet latency can vary from router to router and even for different packets inside the same router. Packet latency includes latency introduced by the interfaces, header parser, table lookup, scheduler, etc.

¹ The packet size is observed to be in the range of 40-1500 bytes [51].
² Both metrics are often augmented with prefix multipliers; in the literature, terms such as Million lookups per second (Mlps or MLPS) and Giga bits per second (Gbps) are commonly used.
Resource consumption: We define the term resource in a broad sense at this stage. A resource can be a logic unit, computational unit, buffer or memory. Logic and computational resources are required to perform various logical and/or computational operations on the packet header. On hardware platforms, logic and computational resources are available at the granularity of logic gates and simple arithmetic units, while on General Purpose Processors (GPPs) they are available at the granularity of processing cores. Buffers are often used for packet queues and to generate pipelined architectures. Memory resources are used to store the forwarding information base(s). Depending on the considered networking platform, one or more of these resources may not be abundant, hence requiring careful consideration when designing architectures for high-performance networking. Resource consumption is often measured as a ratio or a percentage which indicates how much of the available resources is consumed.

Power consumption: In order to deliver multi-gigabit per second throughput, networking devices consume significant amounts of power. In [44] it has been shown that the power densities of modern routers will exceed their capacity if the trend in throughput continues to grow. Also, it is shown that in a router, 60+% of the power is dissipated for layer 3 (networking layer) operation. Hence, power consumption of networking infrastructure has become a major concern. Power consumption of a router is measured using the amount of power consumed per unit throughput, for example Watts per Gbps. This metric gives an idea of how much power needs to be dissipated to deliver a unit of throughput, which is an appropriate metric in the networking domain. This metric defines the power efficiency of the architecture.

Figure 1.2: An example illustrating the Longest Prefix Matching (LPM) algorithm. (The routing table contains 192.168.10.0/24 -> Port 1 and 192.168.0.0/16 -> Port 0; for an incoming packet with destination IP 192.168.10.54, both prefixes match, but the first entry has a longer length (24 vs. 16) and is selected as the LPM, so the packet is forwarded through Port 1.)

Scalability: Network infrastructure needs to be able to support the continuous growth of the Internet. From the network equipment standpoint, this means the router has to be capable of: 1) supporting larger routing tables without the need for significant architectural and/or algorithmic modifications, and 2) sustaining the performance despite the increasing routing table size. Scalability of a networking device indicates whether a solution will be able to deliver the aforementioned demands with growing network sizes.

1.3 IP Lookup

1.3.1 Longest Prefix Matching (LPM)

IP lookup is the kernel function of a router, which performs the routing table lookup for incoming packets. For each incoming packet, the destination IP is extracted from the layer 3 packet header and is looked up against a routing table which contains the forwarding information. As mentioned previously, the routing table can be generated statically or dynamically. As shown in Figure 1.2, an incoming packet can match multiple entries in a routing table. In order to resolve the final match, Longest Prefix Matching (LPM) [68] is used. As the name suggests, in LPM, the longest prefix that matches the destination IP of a packet is used as the final match. The prefixes of a routing table are of different lengths and they follow Classless Inter-Domain Routing (CIDR) addressing [66].
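As a concrete illustration of LPM, the short C++ sketch below performs the lookup of Figure 1.2 with a plain linear scan that keeps the longest matching entry. The code and its names are ours, added purely for illustration; the chapters that follow replace this naive scan with trie, tree and hash based structures.

    #include <cstdint>
    #include <cstdio>
    #include <vector>

    struct Entry { uint32_t prefix; int len; int port; };

    // Longest Prefix Matching by linear scan: every entry whose first 'len'
    // bits equal those of the destination address matches; the longest wins.
    int lpm(const std::vector<Entry>& table, uint32_t dst) {
        int best_len = -1, best_port = -1;
        for (const Entry& e : table) {
            uint32_t mask = (e.len == 0) ? 0 : 0xFFFFFFFFu << (32 - e.len);
            if ((dst & mask) == (e.prefix & mask) && e.len > best_len) {
                best_len = e.len;
                best_port = e.port;
            }
        }
        return best_port;   // -1 if no prefix matches
    }

    int main() {
        // The routing table of Figure 1.2: 192.168.10.0/24 -> port 1, 192.168.0.0/16 -> port 0
        std::vector<Entry> table = {
            { (192u << 24) | (168u << 16) | (10u << 8), 24, 1 },
            { (192u << 24) | (168u << 16),              16, 0 },
        };
        uint32_t dst = (192u << 24) | (168u << 16) | (10u << 8) | 54u;   // 192.168.10.54
        std::printf("forward through port %d\n", lpm(table, dst));       // prints port 1
        return 0;
    }

The same masking step makes the earlier prefix example concrete: with a /24 length, 192.168.10.0/24 covers exactly 192.168.10.0 through 192.168.10.255. It also shows why the lookup must be fast: at a 100 Gbps line rate carrying back-to-back minimum size 40 byte (320 bit) IPv4 packets, the engine must complete 100×10⁹ / 320 ≈ 312.5 million such lookups per second (roughly 195 MLPS for 64 byte IPv6 packets).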
In order to realize LPM, several data structures are used in the literature. A trie is one such tree-like data structure, which is traversed using the individual bits of the incoming packet's destination IP address. The depth of the trie at which a given prefix appears corresponds to the length of the prefix. Using this property, LPM can be guaranteed by updating the result with the most recently visited prefix's forwarding information as the trie is traversed. For a W bit field, a uni-bit trie based solution requires O(W) memory accesses (or equivalently, latency) to complete the search. A tree is another data structure that can be used to perform LPM. With tree data structures, the destination IP is used as the key for the search and the traversal decision is made based on the result of the key-value comparison. For N prefix values, the latency of tree based solutions is O(log N), which is desirable. Hashing is another technique widely used in the literature, which yields O(1) latency. However, it must be noted that when using hashing functions that are non-perfect, collisions can occur and a collision resolving mechanism needs to be used in such cases. When perfect hashing functions [70] are used, the memory footprint of the routing table storage can grow significantly with the routing table size; however, a lookup is guaranteed to be completed in O(1) time without the need for further processing.

1.3.2 IPv4 and IPv6 Addressing

IPv4 is the most prevalent logical addressing scheme³ in the current Internet. IPv4 is a 32 bit addressing scheme which is theoretically capable of addressing up to 4 billion devices. Addressing using IPv4 follows CIDR addressing and is organized as a combination of network identifier and host identifier, as shown in Figure 1.3(a). With the current growth rate of the Internet, it is evident that IPv4 addressing is not a sustainable addressing scheme. The Internet Assigned Numbers Authority (IANA) announced the exhaustion of IPv4 addresses in February 2011 [13] and has now started the allocation
The main challenges with IPv6 are: 1) increased routing table size and 2) increased complexity of the IP lookup process, which is caused by the increased bit length of addresses and prefixes. 3 Logical, due to the fact that an IP address is not hardcoded into a device, rather can be changed depending on the connected network. 9 1.4 Packet Classification Packet classification is a prominent technique used in networking equipment for various purposes. Its applications are diverse, including network security, access control lists, traffic accounting and flow identification [25, 30, 59, 64]. It is often used as a technique to classify packets into flows in order to facilitate stateful monitoring and/or processing of network traffic. Stateful inspection of packets allows packets to be inspected as a stream rather than as individual packets. This enables the detection of signatures, regular expressions, etc., that are distributed across multiple packets, which is not possible with stateless packet inspection. To perform packet classification, one or more header fields of an incoming packet is checked against a set of predefined rules, usually referred to as a ruleset or a classifier. The number of header fields considered for the packet classification process can vary depending on the requirements of the classification engine. While up to 8 fields can be specified, the most prominent packet classification scheme uses 5 fields of the packet header [63]. Hence, this scheme is called 5 field or 5 tuple packet classification. The 5 fields are as followshSource IP (32 bits), Destination IP (32 bits), Source Port (16 bits), Destination Port (16 bits), Protocol (8 bits)i. In order for a packet to match a rule in the classifier, all header fields of the packet need to match the corresponding fields of the rule. Similar to IP lookup, multiple rules in the classifier can match an incoming packet header. In order to resolve to a final match, the rules in a classifier are prioritized. The priority is defined as the order in which the rules appear in the classifier — the first rule having the highest priority and the last rule having the lowest priority. If more than one rule matches, the rule with the lowest index is used as the final match. 10 1.5 Motivations Algorithmic solutions for networking applications have gained interest in research as well as industrial communities. The main reason being, brute-force, simplistic solu- tions such as Ternary Content Addressable Memory (TCAM) [45] are expensive and power hungry. TCAMs execute a massively parallel search on a per packet basis, which causes the amount of computations done for a lookup operation to be proportional to the size of the routing table/classifier. This causes TCAMs to consume excessive amounts of power especially for large routing tables/classifiers. Algorithmic solutions on other hand, can be developed to meet the performance requirements of networks while operat- ing under a low power budget. This dissertation explores Static Random Access Mem- ory (SRAM) [54] based pipelined architectures for high performance networking. When considering a storing and searching a single bit, SRAMs have consume lower power than TCAM and are not expensive (we elaborate on this comparison in later chapters). While Dynamic RAM (DRAM) [53] based solutions can be developed, the indetermin- istic memory access times and the access latencies associated with periodic refreshing of memory cells can yield poor performance. 
Even though DRAMs are available in orders of magnitude larger capacities than SRAM, state-of-the-art FPGA devices offer adequate SRAM memory capacities for the considered applications. Reduced Latency DRAM (RLDRAM) [46] is an alternative type of memory that has similar characteris- tics as SRAM, but is available in larger capacity than SRAM. The proposed designs can benefit from such types of memories for higher scalability. Further, for the IPv6 solution, we considered general purpose multi-core platforms for implementation. With the stagnation of operating frequency of general purpose pro- cessors (GPPs), multi-core computing has become the means by which high perfor- mance is achieved. Although state-of-the-art multi-core platforms offer a large number of cores, for networking applications, it is not straightforward how the available cores 11 can be utilized in an efficient manner to deliver high performance. With a hierarchical memory sub-system and multiple cores communicating via shared memory, it necessi- tate careful planning to overcome performance degradations caused by memory access latencies. IP lookup and packet classification are memory-bound applications, hence it is critical to minimize memory related delays in the lookup process. However, the abundant parallelism offered in GPPs can be harnessed to achieve high performance via algorithmic optimizations. This inspired us to explore these multi-core platforms to develop high speed packet processing engines for the Internet backbone. The motivations for exploring the four research problems considered in this disser- tation are explained in detail in the following sections. 1.5.1 Router Virtualization Network virtualization [6] was introduced to overcome the deficiencies inherent in tra- ditional networks, such as underutilization and protocol rigidity. It enables Internet Ser- vice Providers (ISPs) to define multiple virtual networks on top of a physical network so that underutilized networking hardware can be efficiently used to accommodate multiple networks. This makes device consolidation possible, which results in significant savings in terms of the cost of networking devices, power consumption and maintenance cost. Further, each virtual network can run different protocols for packet forwarding which increases the flexibility in the network [6]. A virtualized router has to maintain multiple routing tables to serve traffic from mul- tiple virtual networks which is achieved via router virtualization. Each virtual routing table represents a particular virtual network. Packets from each virtual network is routed in the network based on the corresponding virtual routing table. A virtualized router is able to distinguish traffic from different virtual networks and load the corresponding 12 routing table data in order to perform the packet forwarding task. However, there are challenges associated with this when it comes to implementation. The main challenge in virtualizing a router is increasing the number of virtual routers hosted per chip. Even though state-of-the-art hardware platforms offer adequate mem- ory capacities, with an increasing number of virtual routers, the available memory resources can be easily exhausted. Unless there are abundant resources, techniques to reduce the resource requirement should be considered. 
We observed that the mem- ory consumption of most of the existing router virtualization schemes increase dramati- cally with the number of virtual routers due to the data structure adopted to perform IP lookup. Also, due to the same reason, updating routing tables has become a costly oper- ation that requires several node-level updates. In virtual router environments, updates can be frequent due to the increased number of networks hosted per router, compared with a non-virtualized scenario. Hence, costly and frequent update operations ultimately degrade the throughput of the IP lookup engine. Further, due to the increased resource (memory, logic and routing) requirement to store and access multiple virtual routing tables, the power consumed by the architecture increases. While the power consumed by the FPGA device is relatively low, it is possi- ble to reduce the power consumption via algorithmic optimizations. This improves the power efficiency of the networking hardware, which is desirable from the ISP’s stand- point. 1.5.2 IPv6 Lookup for Backbone Networks Backbone networks have stringent performance constraints. While high throughput is a critical requirement to forward the immense amounts of traffic flowing in the Internet, low packet latency is also essential to ensure QoS guarantees of real-time applications such as V oIP and multimedia streaming applications. The main challenge with backbone 13 routers is their routing table size. Since backbone routers are the main carriers of Inter- net data traffic, the number of networks connected to backbone routers is large, which is reflected in the routing table sizes. The publicly available backbone routing tables contain nearly 500 K prefixes per routing table [49]. Large routing tables require large memory capacity for storage. Further, with the increased prefix length of IPv6 address- ing, the memory limitation is more pronounced especially on hardware platforms. Another aspect of using larger memory to store the routing tables is the increased power consumption. Performing the search using a single search structure may yield high power consumption. However, it is possible to decompose the IP lookup prob- lem into a set of sub-problems using algorithmic techniques, which reduces the search space for a single lookup, thereby reducing the overall power consumption of the lookup engine. Both software and hardware platforms are capable of delivering the performance demands of backbone routers. With platform-specific optimizations, these kernels can be optimized to meet throughput, latency and power requirements of future backbone networks. 1.5.3 Packet Classification Even though packet classification is a well studied problem in the literature, most, if not all, solutions are reliant on classifier features. Considering the diversity of applications where packet classification is used, classifier feature dependent solutions can potentially yield poor performance when the required features are not present. A solution that is independent of classifier features guarantees performance of the classification engine, regardless of the classifier features. This results in a robust packet classification solution suitable for any application. 14 The de-facto ruleset feature independent solution used in the literature is TCAM. However, due to range search 4 , TCAM solutions may require more storage capacity than what is originally needed to store the classifier. Hence, with TCAM, both scalability and power consumption becomes limiting factors. 
With algorithmic solutions, such limitations can be overcome to design high performance packet classification engines that consume low power and deliver high throughput. Most existing algorithmic solutions use tree-like search structures to perform packet classification and pipelining is often adopted to perform the search at high clock fre- quencies (on hardware platforms). Due to the exponential (or sub-exponential) memory increase with the increasing tree depth, such solutions suffer from performance degrada- tion due to longer memory access times. The operating frequency is typically governed by the largest memory component in the pipelined architecture. Hence, tree data struc- tures cause the performance to be limited by the memory access time of the largest (slowest) memory stage. Such limitations can be overcome to achieve high performance packet classification engines. For example, a pipeline with uniformly balanced stage memory across the pipeline will have a smaller largest stage memory size, hence lower memory access time, yielding a faster packet forwarding engine. However, achieving perfect memory balancing requires algorithmic innovations. 4 In packet classification, three types of search operations can exist for a given header lookup. 1) Prefix search, 2) range search, and 3) exact match. Prefix search is similar to IP lookup where the search is done only for a specified bit length of the key and the rest of the key is considered matched (wildcard). Exact match, as the name suggests, requires all the bits of the key to be matched. In range search, a lower bound and an upper bound is specified, and if the key belongs in the range, it is considered as a match. Although prefix search can also be considered as a range search, the lower and upper bounds are encompassed in the prefix. In range search, however, the lower and upper bounds can be arbitrary, hence requires special handling. 15 1.6 Contributions The major contributions of this dissertation are summarized below. The main goal is to devise memory efficient, high performance and power efficient algorithmic solutions for: 1) virtualized routers, 2) backbone IPv6 forwarding engines and 3) packet classi- fication engines. The features offered on modern FPGA and multi-core platforms are exploited in order to achieve these goals while meeting and exceeding demands of cur- rent networking environments. 1.6.1 Scalable Router Virtualization with Dynamic Updates An update in a router can be of three forms: 1) Modify 2) Insert and 3) Delete. These updates should be applied immediately (i.e., on-the-fly), in order for the packets to be routed correctly. In a virtualized router, updates are more frequent since table update requests from multiple routers should be served . Handling incremental updates in a non- virtualized router itself is a non-trivial problem [4, 11]. Thus, for a virtualized router, this problem is aggravated and may potentially degrade the performance of the router. The routing table residing in a router’s memory can be updated using two methods: 1) recompute and reload, or 2) modify the memory while the router is in operation. The former is the simplest but requires blocking of network traffic. The latter can be blocking or non-blocking, but is tedious to be performed in hardware, and is known as incremental (table) updates or simply, updates. Depending on the amount of preprocessing required, incremental updates can be applied on-the-fly, or can possibly be delayed. 
In this research, we propose 1) a technique named Fill-In, to consolidate multiple virtual routers to a single platform in an update friendly fashion, and 2) an FPGA-based architecture that supports on-the-fly incremental updates. By exploiting the dual-ported memory modules available in state-of-the-art FPGA devices, we support uninterrupted 16 network traffic at 150 Gbps throughput for minimum size (40 byte) packets, using a parallel-linear-pipelined architecture. With our proposed table update techniques, we show that our architecture can handle a route update with a single write instruction. In addition, we show that the proposed scheme achieves comparable scalability compared with existing techniques for router virtualization. The main contributions are [24]: Support for incremental updates: Fill-In, an update friendly routing table merging technique that leads to an FPGA-based lookup architecture that efficiently handles both intermittent and frequent updates, without interrupting network traffic. Scalability: Comparable scalability with those of existing solutions with respect to the number of prefixes/routing tables. Fine-grained resource management: Ability to define the memory usage at each stage of the pipeline to utilize the available memory in an efficient manner. High-speed hardware architecture: Parallel-linear-pipelined FPGA architecture that yields a throughput of 150 Gbps. 1.6.2 Performance Modeling of Virtual Routers Various methods exist in the literature to perform router virtualization in the data plane [9, 14, 36, 58, 65]. While these approaches have their own benefits, we provide a generalized performance evaluation of all these approaches by abstracting their fea- tures from algorithm and data structure standpoints. First, we provide analytical models to estimate the power consumption of different virtual router scenarios (from the packet forwarding engine’s perspective) and compare the benefits achieved by using each model. We give a comprehensive comparison of virtualized vs. non-virtualized routers from a power standpoint. In addition, we also 17 compare the main two virtualization schemes to show the advantages and disadvantages of using each approach. The benefits of using low power features of FPGA is highlighted from both power and throughput standpoints. Then we refine the power models that we propose for FPGA based virtual router architectures. We also parameterize the memory consumption of these approaches and compute the memory consumption for near-worst case and near-best case scenarios (as described in Chapter 4). By using the results obtained from our parameterized experiments, we design the hardware architecture to perform packet forwarding using these approaches and evaluate their performance on a state-of-the-art FPGA device based on post place- and-route results. Performance is measured with respect to throughput, resource, and power consumption. Next, we evaluate the scalability of the virtualization approaches by examining the number of virtual networks that can be supported on a single chip. We also propose a novel router virtualization approach which employs routing table grouping to further improve the scalability of the current approaches. Network/router virtualization is usually adopted in the edge-network level (similar to cloud environ- ments). 
We use an edge network routing table size of 10000 prefixes per virtual routing table and show that in the near-worst case scenario, using the proposed grouping tech- nique, 50 virtual networks can be hosted on a single FPGA chip, while operating at 20+ Gbps rates. The main contributions of this research are [19, 22]: An accurate analytical model to estimate power savings achieved using router virtualization. Exploration of low power FPGAs to achieve greater power benefits in networking applications. 18 Generalization of current router virtualization approaches from algorithm and data structure standpoints. Using this, we evaluate the scalability in terms of the num- ber of virtual networks hosted on a single FPGA. Performance evaluation of the generalized router models from memory, through- put, and power perspectives. Novel router virtualization approach based on routing table grouping, to enhance the scalability of virtual routers. Chip floor-planning to efficiently utilize the real estate available on FPGA and for enhanced performance of pipelined architectures. 1.6.3 High Performance IPv6 Forwarding for Backbone Routers We provide a hardware and a software solution for this problem. The benefits of the two implementations are presented separately. Hardware Solution We propose a solution to perform wire-speed and large scale IPv6 packet forwarding on FPGA. We devise an algorithm that partitions a routing table into a set of disjoint, yet balanced groups, which enables us to search only one partition to find the matching pre- fix for a given packet header. This feature can be used to disable the inactive partitions in order to reduce the total power consumption. Even though routing table partitioning has been studied [42, 52, 78], disjoint partitioning and early identification of the parti- tions has not been exploited to enhance power efficiency of packet forwarding engines. Based on this partitioning, we propose a range tree [78] based solution to achieve high packet forwarding rates. We evaluate our solution on a state-of-the-art FPGA device. Post place-and-route results show that a throughput of 200+ Gbps can be achieved for 19 a 1 million entry IPv6 routing table. Further, compared with state-of-the-art TCAM the proposed lookup engine is 50 power efficient. The research contributions are [23]: A high performance IPv6 lookup engine on FPGA that delivers 200+ Gbps (400+ Million Lookups Per Second (MLPS) 5 for large backbone routing tables. A routing table partitioning algorithm to produce disjoint and balanced partitions. A parallel pipelined architecture that can take advantage of FPGA chip resources efficiently via floor-planning. Compact representation of the routing table for improved memory efficiency on hardware which yields a single-chip solution for large-scale IPv6 backbone rout- ing. Dynamic power efficiency of 65 compared with TCAM and 2:5 is achieved with 8 partitions compared with baseline (i.e. no partitioning). Software Solution We use the same partitioning scheme proposed above and present a solution to perform wire-speed IPv6 packet forwarding on modern multi-core platforms. The disjoint parti- tioning is highly desired on multi-core platforms since 1) a lookup operation is limited to a single partition which is of smaller size than the original routing table, and 2) this pro- vides opportunity for more parallelism. Using the range tree data structure, we achieve high packet forwarding rates on multi-core platforms due to low lookup latency. 
We evaluate our solution on two state-of-the-art multi-core systems and show that throughputs of 100+ Gbps can be achieved, which makes them suitable for state-of-the-art line cards on backbone routers. We summarize the contributions as follows [21]:

- An easily parallelizable solution able to exploit the parallel processing capabilities available on modern multi-core platforms.
- Reduced worst case packet latency via routing table partitioning.
- A hierarchical multi-threaded lookup architecture that can be mapped onto multi-core platforms efficiently.
- 230+ Million Lookups Per Second (MLPS) rates for 2 million entry IPv6 routing tables on state-of-the-art multi-core processors.

1.6.4 Ruleset-Feature Independent Packet Classification

A novel architecture for packet classification whose performance is independent of ruleset features and is suitable for hardware platforms is presented. We use a Bit-Vector (BV) based approach [60] to represent the ruleset. Each rule is decomposed into a collection of sub-rules, which are formed by partitioning the original rule. For this, we develop an algorithm named StrideBV to generate the sub-rules, which is an extension of the Field Split Bit Vector (FSBV) algorithm proposed in [31]. Our solution offers the flexibility of deciding the bit width of each sub-rule, which in turn decides the performance trade-off of the architecture. In order to handle arbitrary ranges, we augment the architecture with explicit range search capability, which does not require any range-to-prefix conversion. This yields higher memory efficiency, which enhances the scalability of our approach. In addition, we propose a rule priority based partitioning scheme which allows us to modularize our solution and eliminate the inherent performance limitations of the traditional BV approaches [60]. (A simplified sketch of the underlying bit-vector matching idea is given at the end of this section.)

We evaluate the proposed solution on a state-of-the-art FPGA device and report the post place-and-route performance results. The modular packet classification engine is implemented as a pipelined architecture, and we measure its performance with respect to throughput, memory, and power consumption. Our extensive experiments show that for varying module sizes, the throughput remains above 100 Gbps for minimum size packets, which makes our solution ideal for 100G line-card solutions. As for scalability, on the considered FPGA our solution is able to support rulesets of up to 28 K 5-field rules while utilizing only the on-chip memory. The power consumption of the proposed architecture is 0.8 mW/rule on average, which is 5× lower than that of TCAM solutions. To the best of our knowledge, this is the first solution available in the literature that delivers deterministic (i.e., ruleset-feature independent) performance for any given ruleset. To summarize, we list our contributions in this research as follows [15, 20]:

- A memory-efficient bit vector-based packet classification algorithm suitable for hardware implementation.
- A modular architecture to perform large-scale high-speed (100+ Gbps) packet classification on a single chip.
- Inclusion of range search in BV-based solutions to eliminate the expensive range-to-prefix conversion.
- Support for up to 28 K rule classifiers using on-chip distributed and block Random Access Memory (RAM) on a state-of-the-art FPGA.
- Deterministic performance for any given ruleset due to ruleset-feature independence.
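To make the bit-vector idea concrete, the following is a minimal software sketch of stride-based matching in the spirit of StrideBV. It is our own illustration, not the dissertation's architecture: the ruleset size, the 8-bit stride width and the table layout are assumptions, and the explicit range-search and modular-partitioning features described above are omitted.

```cpp
#include <bitset>
#include <cstdint>
#include <vector>

// Each stride of the packet header indexes a small table that returns a bit
// vector; bit r stays set only while rule r is still a possible match.
constexpr int kNumRules   = 64;  // hypothetical ruleset size
constexpr int kStrideBits = 8;   // bit width of each sub-rule (a design choice)
using BV = std::bitset<kNumRules>;

struct StrideTable {
    // One precomputed bit vector per possible stride value.
    std::vector<BV> bv = std::vector<BV>(1u << kStrideBits);
};

// Classify by ANDing the per-stride bit vectors; the lowest set bit of the
// final vector is the highest-priority matching rule.
int classify(const std::vector<StrideTable>& stages,
             const std::vector<uint8_t>& headerStrides) {
    BV match;
    match.set();                                  // all rules start alive
    for (size_t s = 0; s < stages.size(); ++s)
        match &= stages[s].bv[headerStrides[s]];  // one table lookup per stride
    for (int r = 0; r < kNumRules; ++r)           // priority = lowest rule index
        if (match.test(r)) return r;
    return -1;                                    // no rule matched
}
```

In the hardware architecture each stride lookup corresponds to one pipeline stage, so the AND chain above maps naturally onto a linear pipeline whose depth, and hence whose throughput/latency trade-off, is set by the chosen stride width.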
1.7 Organization

The rest of this dissertation is organized in the following manner. Chapter 2 discusses background and related work in the areas of router virtualization, IPv6 lookup and packet classification. Details about state-of-the-art software and hardware networking platforms are also briefly discussed in that chapter. Chapter 3 introduces the scalable router virtualization solution and an architecture with dynamic update capabilities. Chapter 4 describes the performance modeling and analysis of virtual routers on FPGA. Chapter 5 discusses the hardware and software solutions developed for IPv6 lookup for backbone routers. Chapter 6 presents the novel ruleset-feature independent packet classification architecture. Chapter 7 summarizes the research contributions of this dissertation and discusses potential future directions of this research.

Chapter 2 Background and Related Work

2.1 Networking Platforms

Both hardware and software platforms are used to implement high performance networking equipment. These two broad categories offer distinct features which make them suitable for various time-critical networking tasks. Here, we briefly discuss the pros and cons of each.

Hardware platforms: These platforms can be used to develop custom solutions for specific applications and are often used as application accelerators. The architecture to perform the desired operation needs to be built using logic and memory resources. Due to their custom nature, they deliver higher performance than general purpose platforms. Pipelining and parallelization are often employed to improve the throughput of the considered application. Field Programmable Gate Arrays (FPGAs) and Application Specific Integrated Circuits (ASICs) are two of the most prominent platforms used to implement hardware accelerators and high performance architectures for time-critical applications.

Software platforms: These are general purpose computing platforms that accommodate a wide variety of computational tasks. At a high level, a processing core contains arithmetic and logic units (ALUs), a cache-based memory hierarchy with one or more levels, and interfaces to external Input/Output (I/O) devices. Modern GPPs contain multiple such cores, which communicate using a shared memory model and can run multiple concurrent threads at a given point in time. In the context of this research, the latency of the table lookup operation needs to be minimized in order to achieve high performance; this remains important even with the abundant thread-level parallelism available. General purpose multi-core platforms and Network Processing Units (NPUs) are examples of software platforms.

In this dissertation, the proposed solutions are evaluated on state-of-the-art FPGA platforms, and the IPv6 solution is evaluated on both FPGA and general purpose multi-core platforms. The rationale for the platform selection is described in the following sections.

2.1.1 Field Programmable Gate Array (FPGA)

FPGA is a Static Random Access Memory (SRAM) based platform that offers various computational, logic, interconnect/routing and memory resources on the chip. These resources are organized in granularities that are specific to each resource type. Logic resources are arranged in logic slices, and such logic slices contain logic, arithmetic and routing resources. Logic and arithmetic resources are used to build computational elements.
The routing resources can connect neighboring slices to create larger designs that require more resources than what is available in a single slice. For example, implementing a complex computing kernel may require multiple logic slices. For floating point operations, dedicated Digital Signal Processing (DSP) blocks are available.

As for memory resources, a percentage of the logic slices can be configured as miniature blocks of memory. Since the logic slices are distributed across the chip, this type of memory is referred to as distributed RAM (dRAM). However, the amount of memory available as dRAM is limited and can be insufficient for certain applications. Another type of memory called block RAM (BRAM) is also available on FPGA chips, which is of higher capacity than dRAM. BRAM is also available at a coarser granularity than distributed RAM: it is concentrated into columns on the FPGA and is accessible as independent blocks of memory, and multiple such blocks may be cascaded to build larger memory blocks. Both dRAM and BRAM offer a dual-ported feature, which enables these memory blocks to serve two concurrent memory operations in a single clock cycle.

Using these on-chip resources, fairly large architectures can be implemented. If more resources are needed, the I/O interfaces can be used to connect external devices such as memory (SRAM or Dynamic RAM (DRAM)) and other processing units (general purpose processors, etc.). These I/O interfaces are also used to build high-speed interfaces for data communication applications, and dedicated high-speed transceivers, such as Serializer/Deserializer (SerDes) and GTX/GTL, are also present on modern FPGAs. Further, stacking of multiple FPGAs is another emerging technology which allows multiple chips to operate in a coherent manner [73]. The high-level architecture of an FPGA chip is shown in Figure 2.1; it depicts the layout of the different resource types on an actual FPGA chip, at a small scale. We show four possible configurations that use combinations of the different resource types available on FPGA in the four quadrants of the chip. Various other configurations are possible depending on the design requirements.

Figure 2.1: Resource layout of modern FPGAs. The four quadrants illustrate four possible use-case scenarios using different types of resources available. Various other configurations are possible by utilizing the on-chip resources.

FPGA lies between general purpose multi-core processors and application specific ASICs in the spectrum. It possesses the ability to be reconfigured, both statically and dynamically, which is not possible with ASIC architectures. This is important in the domain of networking, with its continually changing networking standards and protocols. With rigid ASIC architectures, adapting to new standards or protocols can be impossible, depending on the flexibility offered in the architecture. Also, the Non-Recurring Engineering (NRE) cost of ASICs makes them an expensive option compared with FPGA, despite their high performance. However, FPGAs deliver superior performance compared with GPPs, especially in the networking context, due to the custom nature of the architecture.
Operating frequencies of 200-400 MHz are typical, and with pipelining, multi-gigabit lookup engines can be implemented using the on-chip resources of the device.

2.1.2 General Purpose Multi-Core

These platforms are mainly used to execute compute intensive tasks, and such tasks involve various floating and fixed point computations. The basic unit is a processing core that has a pipelined Central Processing Unit (CPU) and a memory sub-system. The user simply writes instructions to carry out the necessary operations to perform the desired task and, unlike on hardware platforms, different programs can be run on the same processor without any architectural modifications, hence the name general purpose.

Typically, the memory sub-system inside a processor has the hierarchy of 1) registers, 2) Level 1 (L1) cache and 3) Level 2 (L2) cache. There is a trade-off between size and speed in the memory sub-system: moving away from the processing core, the size of the memory unit increases, but the speed decreases. This is more pronounced in multi-processor systems, where the communication among processors takes place via a shared memory system. For this reason, these platforms are also called Non-Uniform Memory Access (NUMA) systems.

A single core is capable of running multiple threads concurrently in a time-shared fashion to improve core utilization despite the memory and/or I/O latencies of the executing threads. When one thread is waiting for resources, another waiting thread, whose resources are available, can become active and start using the processor core. The switching of threads inside a core is known as context switching. Depending on the resources available, context switching can be a costly or an inexpensive operation. Regardless, the parallel execution of threads can be employed to greatly improve the execution time of a given process. In addition, modern GPPs host multiple processing cores per chip, which further enhances the parallel processing capabilities. These cores communicate via a shared memory system (often a higher level cache memory). The high level architecture of a modern multi-core processor is shown in Figure 2.2.

Figure 2.2: High level architecture of a modern multi-core processor and the memory system. The numbers within brackets indicate the typical latency values associated with different types of memory units.

The memory system of a GPP extends from the CPU registers, L1 cache (dedicated per core), L2 cache (dedicated per core), L3 cache (shared) and main memory to disk storage. Hence, compared with FPGA, memory resources on GPPs are abundant. However, as noted above, depending on the amount of separation between the memory unit and the processor, the access times can vary from a few nanoseconds to a few milliseconds. Hence, achieving high performance on GPPs can be challenging for memory-bound applications. Even though CPUs are pipelined at the instruction level and run at high operating frequencies (typically over 2 GHz), memory access and context switching latencies make their performance low compared with their hardware counterparts, especially in the networking domain. Hence, careful planning of resources and computing kernels is imperative to achieve high performance on these platforms.
2.2 IP Lookup

2.2.1 Background

The kernel task of a router is to route packets from one network to another, such that each packet reaches its intended destination while meeting the Quality of Service (QoS) requirements. To do this, routers perform an operation called IP lookup on all incoming packets, in which the incoming packet's destination IP address is looked up against a table of routing prefixes to identify the Longest Prefix Match (LPM) [68]. When the LPM prefix is found, the packet is forwarded through the network interface specified in the next-hop information (NHI) (or action) field of the prefix.

Figure 2.3: An example routing table and its corresponding trie. The example table maps the prefixes 10*, 01*, 111*, 1*, 001* and 00* to the actions P1, P2, P3, P4, P5 and P4, respectively.

Various methods are employed to perform IP lookup, and one widely-used technique is trie/tree based IP lookup [14, 35]. In trie-based approaches, a trie is built using the routing table prefixes, as shown in Figure 2.3. When a packet lookup needs to be performed, the trie is traversed based on the bit values of the incoming packet's destination IP address. Several variants of trie implementations exist to reduce the memory consumption, such as leaf-pushing [52]. The number of nodes in a trie, and hence its memory requirement, depends highly on the characteristics of the prefixes in a routing table, and for a uni-bit trie the lookup time is O(W), where W is the bit length of a prefix. Various other data structures are also used to perform IP lookup, and they are discussed in this dissertation.
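A minimal software version of the uni-bit trie and the O(W) longest prefix match described above may help fix the idea. This is our own sketch; the engines developed in this dissertation are pipelined hardware variants of this basic structure.

```cpp
#include <cstdint>
#include <memory>
#include <string>

// Uni-bit trie node: one child per bit value, plus an optional next hop.
struct TrieNode {
    std::unique_ptr<TrieNode> child[2];
    int nextHop = -1;               // -1 means no prefix ends at this node
};

void insertPrefix(TrieNode* root, const std::string& prefix, int nextHop) {
    TrieNode* n = root;
    for (char bit : prefix) {       // prefix given as a bit string, e.g. "001"
        int b = bit - '0';
        if (!n->child[b]) n->child[b] = std::make_unique<TrieNode>();
        n = n->child[b].get();
    }
    n->nextHop = nextHop;
}

// Traverse at most W bits of the destination address; the deepest prefix node
// seen on the way down is the longest prefix match, hence O(W) per lookup.
int longestPrefixMatch(const TrieNode* root, uint32_t addr, int W = 32) {
    int best = -1;
    const TrieNode* n = root;
    for (int i = 0; i < W && n; ++i) {
        if (n->nextHop != -1) best = n->nextHop;
        int b = (addr >> (W - 1 - i)) & 1;      // consume bits MSB first
        n = n->child[b] ? n->child[b].get() : nullptr;
    }
    if (n && n->nextHop != -1) best = n->nextHop;
    return best;                                 // next hop, or -1 if no match
}
```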
2.2.2 Related Work

IPv6 Lookup

Recently, there have been many efforts to devise memory and/or resource efficient IPv6 lookup engines that deliver high performance. In [27], the IPv6 engine was realized using a combination of Content Addressable Memory (CAM) and a Programmable Logic Device (PLD), in which the prefix search was performed by the CAMs while the PLD aggregated the results from the CAMs (priority encoder) to yield the Longest Prefix Match (LPM) [68]. Although the routing table was distributed across multiple CAMs based on the subnet mask length, all the CAMs were required to perform the search, increasing the amount of computation performed per lookup. The architecture yields lower throughput due to the lack of pipelining and the delays introduced by the priority encoder. A multi-bit trie based solution was explored in [28] that exploited the prefix length distribution of IPv6 routing tables. The lookup engine employed a combination of trie search and hashing to perform the search. The performance was evaluated on an Intel IXP2800 network processing unit (NPU) that runs at 700 MHz. For a 400 K entry synthetic IPv6 routing table, their solution required 35 MBytes (280 Mbits) of memory and operated at 21 MLPS rates (multi-threaded).

In [75] the authors propose a partitioning and binary search tree based solution for high performance packet forwarding on multi-core platforms. However, no balancing of partitions is proposed, which causes packets belonging to different partitions to experience different delays. Further, as noted by the authors, the solution is not scalable to IPv6 despite the 700+ MLPS lookup rates. [78] proposes a range tree based solution for IPv6 which partitions the routing table by exploiting the span of IPv6 prefixes. This requires searching each partition, one after the other, in order to arrive at the LPM. Such sequential operations on multi-core platforms can yield poor performance due to increased packet lookup latency.

FlashTrie [3] introduces a new data structure named prefix compressed trie to improve the on-chip memory efficiency of the solution, and relies heavily on external DRAM to store the forwarding information. The unpredictable nature of DRAM operation and the lack of real simulations make it difficult to conclude whether 100 Gbps can be achieved for IPv6 lookup with the proposed solution.

In [42], a 2-3 tree based solution using routing table partitioning was introduced. The partitioning was performed in order to minimize the memory footprint of the table storage. Here also, since the partitioning is not disjoint, all the partitions need to be searched for a single packet lookup. To realize this, they implemented a multi-pipelined architecture on FPGA and Application Specific Integrated Circuit (ASIC) platforms and showed that 100 Gbps throughput rates can be achieved. The proposed solution supports 230 K IPv6 prefixes on a single FPGA using 36 Mbits of on-chip memory. The authors also propose a power optimization scheme using stage memory partitioning; however, this results in increased routing complexity, which ultimately affects throughput adversely.

Power Efficiency

Performance of networking equipment is often measured using throughput and power efficiency. Using pipelining and custom architectures on hardware platforms, achieving multi gigabits per second throughput has become relatively easy [1, 2]. However, improving the power efficiency of networking equipment is critical, but not straightforward. With the constantly growing size and demands of the Internet, heat dissipation of equipment has become a major issue [8, 44]. Several solutions have been proposed to improve the power efficiency of networking equipment; both TCAM and algorithmic solutions exist in the literature. In [34], trie partitioning methods are used to reduce the pipeline depth as well as the per stage memory requirement, in order to reduce the power consumed per lookup. Memory balancing is integrated with these solutions to further enhance power efficiency. TCAMs are known to be power hungry due to their massively parallel search. However, by properly organizing the TCAMs (with the aid of some algorithmic techniques), the power consumption of TCAM based approaches can also be reduced. In [77], the authors propose a load balancing scheme for multi-chip TCAM based IP lookup. By controlling the TCAM entries triggered by a lookup, power efficiency is achieved. IPStash [38] is an alternative to TCAM and is based on a memory architecture that is similar to a set associative memory. By appropriately mapping the routing table to the set associative memory and using controlled prefix expansion, the authors achieve 35% power savings compared to state-of-the-art TCAM solutions.

2.3 Router Virtualization

2.3.1 Background

In order to realize network virtualization, the networking equipment ought to have virtualization capabilities. This is essential for keeping each user's information transparent to the other users, which ensures security. This is often achieved by associating a Virtual Network IDentifier (VNID) with each user and encapsulating packets with the VNID.
Protocols such as Multi-Protocol Label Switching (MPLS) [14, 69] are used for the encapsulation, to hide routing information from higher layers of the protocol stack. This also implies that separate routing information has to be maintained for each virtual network in order for the packets to be routed correctly. Hence, each network has a VNID as well as its own routing table. One approach to implement such networks is to maintain dedicated networking hardware (routers, links, etc.) for each individual network [9, 37]. This approach is not only expensive, but also increases the management overhead in terms of maintaining multiple network devices and links. From an orthogonal perspective, today's networking platforms are equipped with abundant computational and memory resources; hence, allocating such a device to a single network wastes both chip resources and cost.

Figure 2.4: An example virtualized networking topology, with two virtual networks (A and B) hosted on a physical network spanning two infrastructure providers.

Router virtualization is a method by which multiple virtual networks can be hosted on a shared hardware platform while delivering the performance required by each virtual network. The benefits of router virtualization are numerous. We list a few of them below:

- The hardware resources are efficiently utilized, since implementing multiple virtual networks on a single hardware platform exercises more of the platform's resources than implementing only one network would. Hence, the cost effectiveness is improved.
- The consolidation of networking hardware provides a single point of administration. In networking, this is a desirable feature since the management overhead is greatly reduced.
- Due to the aggregated nature of the networking equipment, the space requirement is reduced, and the investment in cooling can also be reduced significantly.

Virtualization Approaches

Router virtualization is an emerging research area that has attracted interest in both the industrial and academic communities. The main advantage of router virtualization is that all the networking equipment can be brought into a single administrative domain, which makes the management tasks much easier. In addition, several other benefits, such as reduction in equipment cost and space, are also prominent. In the context of this work, we consider virtualization of the data plane of the router. Control plane virtualization can be achieved simply by adopting existing Operating System (OS) virtualization techniques. However, data plane virtualization requires more careful consideration, since multiple factors such as throughput, resource limitations and power come into the picture.

From an industry standpoint, deployment of router virtualization can be seen in the Juniper J series routers [37] and the Cisco Catalyst 6500 router [7, 9]. In the research community, two main categories of virtualization techniques can be found: 1) separate and 2) merged. As the names suggest, in the separate case [65], each virtual network gets its own lookup engine, whereas in the merged case [14, 24, 40, 58], all the virtual networks share a single lookup engine using some merging technique. The merging process exploits the structural similarity of tries to reduce the addition of new nodes by increasing node sharing. This leads to resource efficiency. The two router virtualization approaches are depicted in Figure 2.5.
2.3.2 Related Work

In [6], the authors present a comprehensive study on network virtualization in terms of its advantages, design goals and the possible challenges that can arise. Despite its advantages, little work has been done in terms of algorithmic/architectural contributions to support network virtualization on the physical networking substrate.

Figure 2.5: Router virtualization approaches: (a) Merged, a shared router serving K virtual networks, and (b) Separate, with one logical router per virtual network behind a distributor.

As stated previously, two approaches for router virtualization exist in the literature. One is the merged approach, in which all the routing tables are merged into a shared search structure. The other is the separate approach, in which there is a separate search structure instance for each virtual network. Each approach has its own advantages and disadvantages and can be adopted depending on the requirements of the network and the available resources.

Several networking device manufacturers have introduced network virtualization on their routers [9, 37] to support up to hundreds of virtual networks. Cisco [9] proposes a software and hardware virtualized router, and Juniper [37] achieves router virtualization by instantiating multiple router instances on a single hardware router to enforce security and isolation among virtual routers.

In [14] the authors present a memory-efficient data structure for IP lookup in a virtualized router. They take the merged approach and achieve significant memory savings by using a shared data structure. The operation of the algorithm is simple: it takes the tries built for each routing table one by one and merges them together to form a single trie, the merged trie. The merged trie contains the forwarding information for all the virtual routing tables. When performing IP lookup, the first phase simply does a trie traversal until it reaches a leaf node. In the second phase, the leaf node provides the forwarding information for all the considered virtual networks in the form of a vector, and the VNID of the packet is used as an index to locate the corresponding forwarding information in the vector. At this step, the IP lookup operation is complete. Their algorithm performs well when the routing tables have similar structure; otherwise, the memory requirement increases significantly. (A simplified sketch of this two-phase lookup is given below.) Song et al. [58] build on this approach and propose an algorithm named trie braiding to increase the overlap among the different tries. Instead of merely overlaying the tries, they twist the tries at sub-trie levels to improve the node overlap. In order to keep track of the twisting of the sub-tries, they introduce braiding bits in the data structure. Even though the authors achieve high memory efficiency, updating the virtual routing tables requires re-computation of the merged trie, which is a costly operation.

In [65] the authors take the separate approach, in which they implement multiple virtual router instances on hardware (on NetFPGA [47]) and in software (running on a virtualized general purpose computer). They show the scalability of their design for up to 16 routing tables. Due to the communication overhead between the hardware (NetFPGA) and software routers, the throughput of their virtualized router is relatively low (about 100 Mbps), and the scalability is limited due to extensive hardware resource usage.
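The two-phase lookup of the merged approach in [14] can be illustrated with the following simplified sketch. This is our own illustration, not the authors' code; leaf pushing and the actual memory layout are abstracted away, and K denotes the number of virtual networks.

```cpp
#include <cstdint>
#include <memory>
#include <vector>

// Shared trie node whose leaves carry one next-hop entry per virtual network.
struct SharedNode {
    std::unique_ptr<SharedNode> child[2];
    std::vector<int> nhiPerVnid;   // empty for internal nodes; size K at leaves
};

int mergedLookup(const SharedNode* root, uint32_t addr, int vnid, int W = 32) {
    const SharedNode* n = root;
    // Phase 1: traverse the shared trie on the destination address bits.
    for (int i = 0; i < W; ++i) {
        int b = (addr >> (W - 1 - i)) & 1;
        if (!n->child[b]) break;                 // reached a leaf
        n = n->child[b].get();
    }
    // Phase 2: the packet's VNID selects this network's entry in the NHI vector.
    if (n->nhiPerVnid.empty() || vnid >= (int)n->nhiPerVnid.size()) return -1;
    return n->nhiPerVnid[vnid];
}
```

The width of the per-leaf vector grows with K, which is exactly the wide read-word and redundancy problem that the Fill-In scheme of Chapter 3 is designed to avoid.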
Router virtualization is also present in OpenFlow [48] environments, where a centralized router (with control and data planes) manages a set of forwarding data planes. In [5], a router virtualization framework for OpenFlow is presented.

2.4 Packet Classification

2.4.1 Background

As introduced in Chapter 1, packet classification is a technique used in many networking applications. Compared with single field lookup, packet classification matches multiple header fields of the incoming packet to decide the forwarding information [18]. An example 5-field packet classification ruleset is shown in Table 2.1. The fields SA and DA refer to the Source and Destination IP addresses, and SP and DP refer to the Source and Destination port numbers. A rule matches an incoming packet when every field of the packet header matches the corresponding field of the rule. Note that multiple rules can match a given packet header, and the highest priority rule (i.e., the matching rule with the lowest index) is used as the final match. For example, a packet header ⟨10001010, 11101101, 4, 7, 10⟩ matches both rules R1 and R2; however, rule R1 is used as the final match based on priority. (A reference implementation of this matching semantics is sketched after the table.)

Table 2.1: An example 5-tuple rule set. (SA/DA: 8-bit; SP/DP: 4-bit; Protocol: 2-bit)

Rule  Source IP (SA)  Destination IP (DA)  Source Port (SP)  Destination Port (DP)  Protocol  Action
R1    *               *                    2 - 9             6 - 11                 *         act0
R2    1*              11*                  3 - 8             1 - 8                  10        act0
R3    0*              0110*                9 - 12            10 - 13                11        act1
R4    0*              11*                  11 - 14           4 - 8                  *         act2
R5    011*            11*                  1 - 4             9 - 15                 10        act2
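The matching semantics of Table 2.1 can be pinned down with a naive linear-scan reference implementation. This is our own illustration of the semantics only; practical classifiers, including those surveyed next, avoid a per-rule scan.

```cpp
#include <cstdint>
#include <string>
#include <vector>

// One 5-field rule; field widths follow Table 2.1 (8/8/4/4/2 bits).
struct Rule {
    std::string saPrefix, daPrefix;   // e.g. "1", "11"; empty string encodes '*'
    int spLo, spHi, dpLo, dpHi;       // inclusive port ranges
    int proto;                        // -1 encodes '*'
    std::string action;
};

static bool prefixMatch(const std::string& prefix, uint8_t value, int width) {
    for (int i = 0; i < (int)prefix.size(); ++i)
        if (((value >> (width - 1 - i)) & 1) != prefix[i] - '0') return false;
    return true;                      // empty prefix ('*') matches everything
}

// Returns the index of the highest priority matching rule, or -1 if none match.
int classify(const std::vector<Rule>& rules,
             uint8_t sa, uint8_t da, int sp, int dp, int proto) {
    for (int i = 0; i < (int)rules.size(); ++i) {
        const Rule& r = rules[i];
        if (prefixMatch(r.saPrefix, sa, 8) && prefixMatch(r.daPrefix, da, 8) &&
            sp >= r.spLo && sp <= r.spHi && dp >= r.dpLo && dp <= r.dpHi &&
            (r.proto == -1 || r.proto == proto))
            return i;                 // rules are stored in priority order
    }
    return -1;
}
```

With the rules of Table 2.1 stored in priority order, the example header above returns index 0, i.e., rule R1, even though R2 also matches.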
2.4.2 Related Work

Solutions for packet classification can be categorized into two main groups: 1) decomposition based and 2) decision tree based. Decomposition based approaches are two-phased: in the first phase, individual field searches are performed separately to produce partial search results, and in the second phase the partial results are combined using an aggregation network. Various solution techniques based on Ternary Content Addressable Memory (TCAM), bit vectors, tree/trie traversal and hashing are present in the literature.

By combining TCAMs and the BV algorithm, Song et al. [59] present an architecture called BV-TCAM for multi-match packet classification. A TCAM performs prefix or exact match, while a multi-bit trie implemented with Tree Bitmap [12] is used for source or destination port lookup. The authors predict that their design can achieve 10 Gbps throughput when implemented and pipelined on advanced FPGAs.

Taylor et al. [64] introduced Distributed Cross-producting of Field Labels (DCFL), which is a decomposition-based algorithm leveraging several observations of the structure of real filter sets. They decompose the multi-field searching problem and use independent search engines, which can operate in parallel to find the matching conditions for each filter field. DCFL uses a network of efficient aggregation nodes, employing Bloom Filters and encoding intermediate search results. As a result, the algorithm avoids the exponential increase in time or space incurred when performing this operation in a single step. The authors predict that an optimized implementation of DCFL can provide over 100 million lookups per second and store over 200 K rules in the current generation of FPGA or ASIC without the need for external memory. However, their prediction is based on the maximum clock frequency of FPGA devices and a logic intensive approach using Bloom Filters. This approach may not be optimal for FPGA implementation due to long logic paths and large routing delays. Furthermore, the estimated number of rules is based only on the assumption of statistics similar to those of the currently available rule sets. Jedhe et al. [30] realize the DCFL architecture in their complete firewall implementation on a Xilinx Virtex 2 Pro FPGA, using a memory intensive approach, as opposed to the logic intensive one, so that on-the-fly updates are feasible. They achieve a throughput of 50 MLPS for a rule set of 128 entries. They also predict that the throughput can reach 24 Gbps when the design is implemented on Virtex-5 FPGAs.

Decision tree based approaches are radically different in that the ruleset is mapped to a multi-dimensional space, where each dimension represents a header field used in the classifier [25, 56]. Each rule forms a hypercube in the multi-dimensional space, which represents the volume covered by that rule. Due to rule overlap, the hypercubes formed by the rules intersect with each other. A packet header becomes a point in this space, and the challenge is to identify the highest priority hypercube this point belongs to. A decision tree partitions the multi-dimensional space so that the search can be performed effectively, guided by the packet header. Luo et al. [43] propose a method called explicit range search to allow more cuts per node than the HyperCuts algorithm. The tree height is dramatically reduced at the cost of increased memory consumption. At each internal node, a varying number of memory accesses may be needed to determine which child node to traverse, which may be infeasible for pipelining. Since the authors do not implement their design on FPGA, the actual performance results are unclear. To achieve power efficiency, Kennedy et al. [39] implement a simplified HyperCuts algorithm on an Altera Cyclone 3 FPGA. They store up to hundreds of rules in each leaf node and match them in parallel, resulting in a low clock frequency (32 MHz reported in [39]). Since the search in the decision tree is not pipelined, their implementation can sustain only 0.47 Gbps in the worst case, where it takes 23 clock cycles to classify a packet for a ruleset of size 20 K.

Chapter 3 Scalable Router Virtualization with Dynamic Updates

Router virtualization has gained interest in both industrial and research arenas due to its inherent benefits. Existing merged router virtualization schemes suffer from the wide forwarding information vectors stored at the leaf node level of the trie, which limit throughput due to the larger read-word size; the memory footprint also increases due to the redundancy introduced in the merging process. A novel merging scheme is introduced that avoids wide forwarding information vectors and provides the flexibility to define the memory distribution of the pipeline. Facilities for incremental updates that do not require blocking of network traffic are integrated into the architecture, which makes the developed architecture an update friendly solution for router virtualization.

3.1 Fill-In Algorithm and Data Structure

3.1.1 Problem Definition

Given K virtual routing tables R_i, i = 0, 1, ..., K−1, find 1) an algorithm to merge the routing tables with support for on-the-fly incremental updates, and 2) a scalable virtualized router architecture that can sustain the required throughput levels (specified by the ISPs). Further, the updates should not cause network traffic to be blocked or interrupted.
Figure 3.1: Shared leaf node data structure for merged virtualized routers (left and right pointers plus one NHI field per virtual network, NHI 1 through NHI K).

3.1.2 Fill-In: A Distance-Based Mapping Technique

In the merged virtualization approach, all the virtual routing tables are mapped onto one search tree. Depending on the mapping technique, the performance of a given router virtualization scheme can vary significantly [14, 17, 40, 58]. With the trie data structure, the issue arises mainly because of node sharing. For example, in [14, 17] and [58], a single node has to serve multiple virtual networks, so each node has to carry an NHI field for all K virtual routing tables. An example of a shared node is illustrated in Figure 3.1. Techniques such as leaf pushing can be applied to reduce the memory consumption by pushing all the next hop information to the leaf nodes [52]. However, as we show in Section 3.3, this makes incremental updates more difficult, and memory efficiency can also be poor.

Considering these drawbacks of node sharing, we introduce a non-shared trie data structure for virtualized routers. However, we employ the merged scheme due to its resource efficiency. In order to map multiple virtual routers onto a single lookup architecture, we use a distance-based mapping scheme called Fill-In. Fill-In takes the uni-bit tries built for all the virtual routing tables as input and builds a single search tree that can be used to perform IP lookup for all the input virtual routing tables. We describe our distance-based mapping algorithm using the two virtual tries shown in Figure 3.2. Our algorithm shares similarities with [32]; however, we use Fill-In to merge routing tables and facilitate updates rather than for memory balancing.

Figure 3.2: Virtual tries for tables A and B. Table A contains the prefixes 000*, 01*, 0101*, 10*, 1011* and 11*; table B contains 00*, 01* and 11*.

The granularity of resources on FPGA depends on the underlying architecture. On modern FPGAs, BRAM is organized in 36 Kb blocks (18 Kb in older devices). When distributed RAM is completely used, or when it cannot accommodate the requirement, data should be stored in BRAM. In such situations, a complete BRAM block is allocated irrespective of the amount of data to be stored. If we were to use the binary tries shown in Figure 3.2 directly, the BRAM utilization would be low, leading to inefficient memory usage. Fill-In takes this fact into account as well. The designer has complete control over the number of nodes at each level, which we call the node distribution. Depending on the memory organization, the desired node distribution can change; for a binary trie, however, it is fixed. With Fill-In, an arbitrary node distribution can be used. For this, we introduce the Node Distribution Function (NDF), with which the memory distribution of the pipeline can be specified. To show the effect of the NDF on the final architecture, for our implementation we consider the following linear function:

NDF(l) = l · K · C
Note that this function can be changed, depend- ing on the resource usage requirements. Also,l = 0 is reserved for the root node. Fill-In merges the virtual tries one by one, constrained to the node distribution func- tion described above. The algorithm is described in Algorithm 1. The Fill-In trie for the two tries in Figure 3.2 is shown in Figure 3.3. In this figure, we have used arbitrary NDF values to illustrate the operation of our scheme. When two nodes from consecutive levels are mapped to the same level of the Filled- In trie, the search becomes complicated. In a pipelined implementation, this causes the search to perform two lookups if a child node and its parent exist in the same level. Since this degrades the performance, we avoid the above by checking the consistency of levels, of a node and its parent. For example, as shown in Figure 3.3, even though the NDF of level 1 is 5, level 2 nodes of trie B cannot be mapped to level 1 of Filled-In trie due to level inconsistency. As a consequence of the above, and the constraints imposed by NDF, a node might not end up in its original level in the trie. This level shift is recorded as the distance value, in a node’s respective parent. As Figure 3.3 shows (the value within parenthesis shows a node’s level shift), the last node of trie B cannot be mapped to level 2, since 45 the NDF value is 6. Therefore, it is moved to the immediate lower level (level 3 in this case). This level shift is stored in the parent node, for both left and right children. These shifts might cause the number of levels to increase beyond the bit width of an IP address, depending on the NDF used. However, as we show in Section 3.3, this can be avoided by selecting the NDF appropriately. Algorithm 1 Fill-In(NDF) Require: Virtual triesT k ,k = 0; 1;:::;K 1, Empty trieT m 1: for all Virtual trieT k do 2: Begin atl = 0 3: while levell<L do 4: for all Noden ofT k at levell do 5: if LevelInconsistent(n) then 6: l l + 1; Continue; 7: ifjNodes@Level(l;T m )j<NDF (l) then 8: Addn toT m 9: Updaten’s parent’s pointers 10: Updaten’s distance 11: else 12: l l + 1; Continue; returnT m 3.1.3 Node Structure for Fill-In Having a uniform node structure eases the update complexity of the architecture, espe- cially on hardware. When a node is created, its structure cannot be changed at run- time, unless reconfiguration capabilities are integrated into the architecture. Even with reconfiguration capabilities, if changing the node requires more memory than what it is originally allocated, such operations cannot be performed. Leaf pushing, for example, creates two types of nodes: leaf and non-leaf. Non-leaf nodes contain two pointer fields (e.g. 216 bit) whereas the leaf nodes only contain the next hop information (e.g. 6 bit). Suppose the router received an update to change a leaf node to a non-leaf node. Unless 46 all the nodes are allocated the memory required for a non-leaf node, it is impossible to apply this change without a complete reconfiguration. This includes an undesirable delay overhead. On the other hand, allocating maximum size for all the nodes results in poor memory efficiencies. This issue is more noticeable in virtualized schemes. A uniform node structure overcomes this issue since all the nodes have exact same structure. Thus applying an update can be done quickly and incrementally. In our merging scheme, we require each node to have all the necessary fields, even if some of them might not be used at a particular level. 
In typical trie data structures, uniformity can be achieved by letting all the nodes have two pointer fields and a next hop information field. In our scheme, however, we additionally require each parent node to store the distance values of its child nodes. The above method is sufficient for IPv4 (32 bit prefixes). For IPv6 (effectively 64 bit prefixes), path compression [52] can be used to alleviate the address length issue. The node structure used for IPv4 and the potential node structure for IPv6 are shown in Figure 3.4.

Figure 3.4: Node structures for IP lookup with Fill-In: (a) IPv4, with 16-bit left and right pointers, 3-bit left and right distances, a 6-bit next-hop information field and valid/prefix bits; (b) IPv6, which additionally carries 3-bit left and right compression fields.

3.1.4 Memory Requirement Analysis

We give a brief theoretical analysis of the memory requirement of our scheme, compared with the existing approaches, in Table 3.1. The notation used is as follows: K - number of virtual routing tables/routers, N_k - number of nodes in virtual trie k, N_max - maximum of all N_k, P - bit width of the pointer field, D - bit width of the distance field, H - bit width of the next hop information field.

Table 3.1: Memory Requirement Analysis

Scheme                    Memory requirement
Fill-In                   Σ_{k=0}^{K−1} N_k (2P + 2D + H)
Separate [65]             Σ_{k=0}^{K−1} N_k (2P + H)
Simple overlaying [14]    (N_max / 2) (2P + H·K)
Trie braiding [58]        (N_max / 2) (2P + H·K + K)
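Pulling Sections 3.1.2 through 3.1.4 together, the following is a simplified software model of the Fill-In mapping and its memory cost. It is our own sketch rather than the FPGA implementation: the node mirrors the Figure 3.4(a) layout, ndf() is the linear function of Section 3.1.2, and fillIn() follows the spirit of Algorithm 1 by greedily placing each node at the first stage that satisfies the capacity and level-consistency constraints, instead of sweeping whole levels exactly as Algorithm 1 does.

```cpp
#include <queue>
#include <vector>

// Node of one virtual trie, annotated with its assigned pipeline stage.
// On FPGA the node is a packed 46-bit word (Figure 3.4(a)): two 16-bit child
// pointers, two 3-bit distances, a 6-bit next hop, and the valid/prefix bits.
struct Node {
    Node* child[2] = {nullptr, nullptr};
    int   stage = -1;             // level of the Fill-In trie this node maps to
    int   childDist[2] = {0, 0};  // stages to skip to reach each child
    int   nextHop = -1;
};

// Linear node distribution function NDF(l) = l * K * C; level 0 holds the root.
int ndf(int level, int K, int C) { return level == 0 ? 1 : level * K * C; }

// Map one virtual trie onto the shared pipeline. stageCount[l] holds the number
// of nodes already placed at stage l across all previously merged tries.
void fillIn(Node* root, int K, int C, std::vector<int>& stageCount) {
    std::queue<Node*> q;
    root->stage = 0;   // roots are reached via the initial per-VID lookup table
    q.push(root);
    while (!q.empty()) {
        Node* n = q.front(); q.pop();
        for (int b = 0; b < 2; ++b) {
            Node* c = n->child[b];
            if (!c) continue;
            int l = n->stage + 1;        // level consistency: child below parent
            while (true) {
                if ((int)stageCount.size() <= l) stageCount.resize(l + 1, 0);
                if (stageCount[l] < ndf(l, K, C)) break;  // room at this stage?
                ++l;                     // stage full: push the child further down
            }
            c->stage = l;
            ++stageCount[l];
            n->childDist[b] = l - n->stage - 1;  // distance stored in the parent
            q.push(c);
        }
    }
}

// Memory requirement of Fill-In from Table 3.1: sum of N_k * (2P + 2D + H) bits.
long long fillInMemoryBits(const std::vector<long long>& nodesPerTrie,
                           int P = 16, int D = 3, int H = 6) {
    long long bits = 0;
    for (long long nk : nodesPerTrie) bits += nk * (2 * P + 2 * D + H);
    return bits;
}
```

Calling fillIn() once per virtual trie with a shared stageCount vector mirrors the one-by-one merging order of Algorithm 1; the recorded childDist values are exactly the distances that later drive the No-op stages of the pipelined lookup described in Section 3.2.1.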
3.2 FPGA Implementation

We use a parallel-linear pipelined architecture to perform IP lookup on our Fill-In trie and to perform on-the-fly incremental updates while ensuring the throughput requirements of the virtual routers. The IP lookup portion and the update portion of our architecture are described in Sections 3.2.1 and 3.2.2, respectively.

3.2.1 Architecture: IP Lookup

IP lookup in Fill-In is similar to any trie based pipelined IP lookup architecture, except for the additional distance field. The distance value simply lets a packet skip a few stages of the pipeline. This is achieved by executing a No-op (no operation) whenever the distance is non-zero. The distance value is decremented as the packet traverses the pipeline, and when the value becomes zero, the appropriate memory address is accessed, either to decide the next node to visit or to acquire the next-hop information. The IP lookup architecture is illustrated in Figure 3.5. The valid and prefix bits, shown in Figure 3.4, are examined during the search to decide the next node to visit and to update the next-hop field, respectively.

Figure 3.5: IP lookup architecture for Fill-In on FPGA. Two packets/updates can be fed, every clock cycle, to the parallel-linear pipeline.

To take advantage of the dual-ported BRAMs, we implement dual-linear pipelines, through which the throughput of the architecture can be doubled with little logic overhead. This allows the two pipelines to share the same stage memories and perform parallelized packet processing. With dual-ported BRAM, two memory accesses can be served independently in a single clock cycle. These memory accesses can be Reads (R) or Writes (W). For IP lookup, R operations are used, whereas for updates (Section 3.2.2) W operations are used. Depending on the inputs, the two pipelines can operate in the following modes: RR, RW, WR or WW.

3.2.2 Architecture: Incremental Updates

In order to provide support for incremental updates, we augment each stage of the pipeline shown in Figure 3.5 with an update module. As mentioned earlier, three types of updates exist: 1) modify, 2) insert and 3) delete. Of the three, modifications are the easiest: an existing prefix can be updated by sending a write bubble with the corresponding stage, memory address and updated prefix information. It should be noted that in [14] and [58], even a prefix update requires a complete reloading of the updated routing table, since leaf-pushing can potentially require multiple node updates in a single stage for a prefix modification.

Inserts and deletes are more complex than a prefix modification. However, by using a write bubble table [40] and the dual-ported memory on FPGA, these updates can be easily handled in our pipelined architecture. For a trie data structure update, only a single node update is required at a given stage. This requires one write operation at the respective stage, which can be performed on either of the two pipelines; the remaining pipeline can be used for IP lookup. Since updates do not require regular traffic to be blocked, they are non-blocking. Figure 3.6 illustrates our architecture for the three types of updates mentioned earlier. Modifications are treated differently from inserts/deletes, and the two cases are distinguished by the update type field in the write bubble. Currently, we set the size of the write bubble table of our architecture to the number of virtual routing tables supported, so that each virtual router gets an opportunity to queue one update per lookup operation. Update scheduling is beyond the scope of this research, hence it is not discussed.

Figure 3.6: Support for updates: (a) modifications and (b) inserts/deletes, with dual-ported BRAM.
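A deliberately simplified software model of the write bubble mechanism is given below. It is our own sketch, and it collapses the per-stage write bubble table of Figure 3.6(b) into a single explicit write, which is the net effect of one route update; field widths and names are illustrative.

```cpp
#include <cstdint>
#include <vector>

// One write bubble: a single-node write destined for one pipeline stage.
struct WriteBubble {
    bool     isModify;     // true: prefix modification, false: insert/delete
    uint8_t  stageId;      // pipeline stage whose memory is written
    uint16_t address;      // node address within that stage's memory
    uint64_t newNode;      // packed node contents (see the 46-bit node layout)
};

// One stage's dual-ported memory: port A keeps serving lookups (reads) while
// port B applies at most one write bubble per clock cycle.
struct StageMemory {
    std::vector<uint64_t> words;   // sized to the stage's memory depth
    uint64_t readPortA(uint16_t addr) const { return words[addr]; }
    void     writePortB(const WriteBubble& wb) { words[wb.address] = wb.newNode; }
};

// Applying a route update therefore costs a single write at a single stage,
// without blocking the lookup traffic on the other port.
void applyBubble(std::vector<StageMemory>& pipeline, const WriteBubble& wb) {
    pipeline[wb.stageId].writePortB(wb);
}
```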
3.3 Experimental Results

3.3.1 FPGA Platform and Routing Table Sources

For our experiments, we considered a Xilinx Virtex 5 device (XC5VLX220) as our target platform. It has 6912 Kb of BRAM, 2280 Kb of distributed RAM and supports up to a 550 MHz clock frequency. This device provides adequate resources (memory and logic) for our experiments.

In order to evaluate the performance of our approach, we obtained several real routing tables from [49]. We considered routing tables of 17 different edge networks connected to the Internet. The routing table sizes varied from 37 to 3725 prefixes, with a total of 14094 prefixes. We avoided using the publicly available core routing tables, as their prefix distributions are similar and they have nearly the same number of prefixes. Further, we avoided partitioning core routing tables to generate smaller routing tables, as this causes the generated routing tables to be unrealistic. We do not assume any particular structure for the routing tables, as done in [14] and [58]. Also, we make no assumptions on the sizes of the routing tables considered. This makes our solution generic for any set of virtual routing tables.

3.3.2 Update Capability

Our main claim in this work is support for on-the-fly incremental updates. Using the FPGA-based architecture introduced in Section 3.2, we support all three types of table updates. Each update requires only one write bubble, which can be executed in L clock cycles. However, it should be noted that depending on the NDF used, the length of the pipeline can increase. By choosing an appropriate NDF, one can avoid or limit the increase in the pipeline length. Figure 3.7 illustrates the effect of the NDF on the pipeline length. For the experiments, we used the NDF mentioned in Section 3.1. It can be seen that the choice of the value C directly affects the length of the pipeline. Inside each column, we show the maximum distance a node has experienced. In our experiments, a C value of 15 resulted in zero distance for all the nodes.

Figure 3.7: Variation of pipeline length for different C values (number of pipeline stages vs. number of routing tables, for C = 5, 10 and 15).

Running Fill-In on a trie with N nodes takes O(N) time. For the largest routing table used in our experiments, it took only 0.02 ms to complete the execution of Fill-In on a dual quad-core AMD Opteron 2350 processor running at 2.0 GHz. Our experiments show that computing the write bubble content for a single table update can be done at wire speed. Therefore, Fill-In does not become a bottleneck for the update process. Previous work such as trie braiding [58] takes time in the order of seconds for preprocessing, despite its high memory efficiency.

It should be noted that when using a shared NHI vector, as done in [14] and [58], a single routing table update may require regeneration of the merged trie, which is a costly operation. The rationale behind this requirement is that a single routing table update may necessitate node updates at multiple levels. For example, new nodes may need to be added to the trie, but since the addition of a single node may cause significant changes to the merged trie and the NHI vectors, applying such an update on-the-fly can be challenging.

One remedy for such a case is that a shadow merged trie can be maintained in addition to the original trie. The shadow trie is updated based on the received updates while the original trie performs packet lookup. Periodically, the roles of the shadow and original tries can be swapped in order to bring the routing table updates into effect. However, this method also has undesirable effects, especially on hardware platforms. For example, shadow trie storage demands twice the memory capacity required for a single merged trie. This limits the scalability of the solution. Also, the process of applying an update can be a challenging task on hardware due to the aforementioned reasons. Hence, offline trie generation may be necessary. This demands periodic loading of the entire merged trie onto the FPGA device. This communication can potentially degrade the performance of the lookup engine.

3.3.3 Scalability

Scalability is the ability of a router to accommodate the growth of a network, given the hardware resources (e.g., memory, logic, etc.). This can be measured with respect to the number of prefixes or routing tables hosted per chip. We use the total memory consumption (pointer and next-hop information) as the metric.
Figure 3.8a compares various techniques with respect to this metric when the number of routing tables is varied. Note that this metric is meaningful only if the routing tables are of the same size. In our experiments, we use routing tables of different sizes; hence, Figure 3.8a does not reflect the actual memory increase of our scheme. The memory increase with the number of prefixes is shown in Figure 3.8b. In both figures, it can be seen that Fill-In achieves an almost linear memory increase.

Note that for trie braiding we have used the theoretical minimum memory requirement given in Table 3.1. The memory requirement of this scheme can increase depending on the compatibility and the variance of the sizes of the routing tables considered. Also, as shown in [19], the memory consumption of merged router virtualization schemes with shared NHI vectors increases significantly with an increasing number of virtual routers; this holds true even when the routing tables are compatible. Hence, using a shared NHI vector approach can have adverse effects on scalability. As a remedy, solutions such as [26] have been proposed to alleviate the memory increase by compactly storing the shared NHI vectors as binary search trees. However, this hampers routing table updates. Separate virtualization is also a viable solution with its lower memory footprint. However, it must be noted that the requirement of having one logical packet forwarding engine per virtual routing table demands a higher amount of logic resources.

Essentially, the amount of logic resources consumed is proportional to the number of virtual routers, whereas in the merged scenario it is almost fixed. This has an impact on two performance metrics: 1) throughput and 2) power. When the number of signals/connections in an FPGA design increases, the performance tends to decrease due to the increase in the longest path delay. This is expected, since the amount of routing resources available on a chip is limited; hence, when more routing resources are used, the probability of finding a shorter path decreases. This causes the clock rate to decrease, degrading the throughput. Also, since "clocked" logic resources cannot be turned off via clock gating, the logic power consumption also increases linearly with the number of virtual routing tables. This causes the overall power consumption to increase, lowering the power efficiency of the virtualized router.

Figure 3.8: Memory requirement for increasing (a) number of virtual routing tables and (b) number of prefixes, comparing Separate, Simple merge, Trie braiding (theoretical minimum) and Fill-In.

3.3.4 Throughput and Resource Usage

Here we analyze the performance and resource usage of our FPGA-based architecture. The performance is measured in terms of clock frequency and throughput for minimum size (40 Byte) packets and is presented in Figure 3.9. Since the architecture is a linear pipeline, the throughput is governed by the clock rate of the architecture. Typically, for architectures of this nature, the clock rate is dictated by the pipeline stage with the largest memory block.
In most other designs, due to the sub-exponential growth of memory across the pipeline, the largest stage memory deteriorates the throughput. However, due to the effect of the NDF, in this architecture the size of a given stage memory can be explicitly defined. Even though this slightly increases the number of stages of the pipeline, and hence the latency, the gain in throughput is more appealing in the arena of high-speed networking.

Figure 3.9: Performance (clock frequency and throughput) and resource usage (number of BRAM blocks and slices used) of the FPGA-based architecture. The x-axis shows the number of virtual routers and, within parentheses, the number of pipeline stages.

Our experiments show that the proposed architecture is able to sustain a throughput of 150 Gbps on average. For these experiments, we used dual-linear pipelines exploiting the dual-ported feature of the BRAM available on the FPGA chip. With this feature, two concurrent and independent memory accesses can be served by a single memory block, which enables us to essentially double the throughput of the proposed architecture.

The clock frequency decreases with the number of virtual routing tables. This is due to the increasing size of the stage memory. When going from 4 to 8 virtual routers, a significant drop in clock rate can be observed. Such effects are due to the unplanned layout of the architecture on the FPGA chip. Without the use of floor-planning techniques [71], the architecture lacks layout information, which becomes crucial when performing signal routing. With the increasing number of pipeline stages and the increasing size of stage memory blocks, the amount of routing increases. Without the layout information, the routing tool is unable to comprehend the logical structure of the architecture. This translates to a poorly routed design, which yields lower performance. Floor-planning techniques can be adopted in order to enhance the performance of such architectures.

Further, the number of BRAM blocks and logic slices used is measured. To show the effect of the distance, we use the pipeline constructed for C = 5. From the figures, it can be seen that the amount of resources required increases with an increasing number of virtual routers. As discussed earlier, with an increasing number of virtual routers the memory consumption grows, which in turn increases the number of BRAM blocks used. The increase in logic resource consumption is due to 1) the increased pipeline length and 2) the additional routing required to create larger memory blocks with increasing stage memory size.
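As a back-of-the-envelope check (our own calculation, not taken from the text), the throughput of the dual-linear pipeline follows directly from the clock rate, since two minimum size (40 byte) packets enter per clock cycle:

```latex
\mathrm{Throughput} \;=\; 2 \times f_{\mathrm{clk}} \times (40 \times 8)\ \mathrm{bits}
                    \;=\; 640 \, f_{\mathrm{clk}} \ \mathrm{bits\ per\ second}
```

For example, a clock rate of about 235 MHz yields roughly 150 Gbps, consistent with the average throughput and the clock frequencies reported in Figure 3.9.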
Therefore, we considered the proposed mapping technique to avoid node sharing, which resulted in an update-friendly, yet scalable, solution.

It must be noted that the update facility described here assumes that there is additional memory available in each pipeline stage for the new nodes to be created. Over a period of time, new nodes may be created and some nodes may even be deleted. Via techniques such as bookkeeping, management of used and unused memory can be done in an efficient manner. Such extensions can help reduce the amount of additional memory required in each stage.

For the sake of simplicity, for the experiments conducted in this research, a simple Node Distribution Function (NDF) was used. Various NDFs can be exploited to achieve an optimal balance between throughput and packet latency. However, it is inevitable that even with the best NDF, with an increasing number of virtual routing tables, the pipeline length will increase. This issue will be more pronounced in the case of IPv6 as opposed to IPv4. In such scenarios, path compression techniques can be used as a remedy.

Chapter 4
Performance Modeling of Virtual Routers

This chapter is organized in two phases. The first phase describes the analytical models for power estimation of virtual routers and the second phase presents a comprehensive performance analysis of virtual routers.

4.1 Notations and Assumptions

To denote the various schemes, we use a set of abbreviations, as follows:
NV: Non-virtualized (conventional)
VS: Virtualized separate approach
VM: Virtualized merged approach

Table 4.1: Notations and Symbols
No. of virtual networks: K
Virtual router i: VR_i
Lookup pipeline i: P_i
No. of stages per pipeline: N
Memory size of stage j of pipeline i: M_{i,j}
Logic slices in stage j of pipeline i: L_{i,j}
Power consumed: P(.)
Device (FPGA/ASIC/etc.): D
Leakage power: P_L
Utilization of virtual router i: u_i
Node overlap ratio: alpha
Node non-overlap ratio (1 - alpha): alpha'

In this study, we make several assumptions to simplify the modeling of virtual routers. However, it should be noted that these assumptions may be altered depending on the considered application.

Assumption 1: Network traffic is uniformly distributed among the K virtual routers. In other words, u_i = 1/K for i = 0, 1, ..., K - 1.

While different and more realistic network traffic models exist, the goal of this research is to model and understand the performance of different router virtualization schemes from the performance perspective. Therefore, not much emphasis is given to the traffic patterns, as the experimental results obtained in this research can be easily extended by modifying the u_i parameters appropriately.

Assumption 2: All routing tables are of the same size. An upper bound is assumed, considering a real-life edge-level routing table (10000 prefixes), to simulate a worst-case scenario. This translates to M_{i,j} = M_{k,j} for i, k = 0, 1, ..., K - 1 and j = 0, 1, ..., N - 1.

Currently, there are no publicly available virtual routing tables. Hence, a worst-case edge-level routing table size (10 K prefixes) was assumed. Also, since the distribution of the sizes of virtual routing tables is not available publicly, in order to simulate a worst case, all virtual routing tables were assumed to be of the same size.

Assumption 3: In the case of virtualized-separate, packets belonging to different virtual networks are assumed to be properly distributed among the virtual router instances and the packet distributor energy is considered negligible.
In the case of virtualized networks, packets from various virtual networks may arrive in a single stream of packets. In order to perform packet lookup, it is essential to separate the traffic belonging to different networks. However, this is a trivial task, as the only operation required is to inspect the virtual network identifier of the packet and identify which pipeline to employ to perform the lookup. A Binary CAM (BCAM) or a direct-lookup table can be used for this purpose. Due to the small size of this unit, the power consumed by it is negligible compared with that of the pipeline.

Assumption 4: Merging of routing tables is generalized to make our analysis generic. The node overlap ratio is defined as the amount of node overlap in a given level, or equivalently as follows:

\alpha = \frac{\text{number of common nodes}}{\text{total number of nodes}}

While this metric can be given various other definitions, for our analysis the interest is to find out how much additional memory will be required to include another virtual network on the same platform. For this purpose, the proposed node overlap ratio proved effective. Use of other definitions for alpha will require the analysis to be redone.

4.2 Router Models for Power Estimation

The performance modeling done in this work is for the layer 3 lookup operation of a router. We consider three main types of routers: 1) Non-virtualized, 2) Virtualized-Separate and 3) Virtualized-Merged, which are described in detail below. We provide comprehensive models to estimate the power consumption of a router in these three different scenarios. For this work, we consider linear pipelined lookup architectures only. Hence, we consider tree/trie structures for IP lookup which are mapped onto the stages of a pipeline.

We consider three main components that contribute to the power consumption of a router's data plane operation. Leakage power represents the static power dissipation, while power consumed by logic and memory accounts for dynamic power. The static power is proportional to the area of the device used, while dynamic power depends strongly on the clock frequency, the type (logic, memory, etc.) and the amount of resources used. Hence, we first show the resource consumption for each setup and translate that to the power consumption on a per-resource-type basis.

When the router is not serving any packets, the logic or memory resources can be sent to an idle mode. Hence, during the off period of the duty cycle, the dynamic power can be assumed to be zero, but the static power is dissipated constantly since the device has to be operating regardless of the duty cycle. Turning off the logic and memory resources can be effectively done using clock enable signals (boolean flags indicating whether service is required or not) and clock gating, respectively.

4.2.1 Non-virtualized

A non-virtualized router is the conventional approach in networking. Network equipment is dedicated to individual networks and the utilization of each piece of equipment is fairly low due to the behavior of edge-network users. The resource consumption is expressed in Eq. 4.1. The device D here refers to the chip on which the lookup engine is implemented. Since we have multiple pieces of equipment, multiple devices are required; hence, the static power consumed increases proportionally to the number of networks. For dynamic power, we introduce the utilization u_i for a fair comparison. As stated in Assumption 1, we assume a uniform distribution of packets across the virtual networks.
If required, more complex distributions can be modeled by appropriately changing the u_i values. Power consumed in the non-virtualized case is expressed in Eq. 4.2.

R_{NV} = \sum_{i=0}^{K-1} \left( D + \sum_{j=0}^{N-1} (L_{i,j} + M_{i,j}) \right)    (4.1)

P_{NV} = \sum_{i=0}^{K-1} \left( P_L + u_i \sum_{j=0}^{N-1} (P(L_{i,j}) + P(M_{i,j})) \right)    (4.2)

4.2.2 Virtualized-separate

The virtualized-separate case is very similar to the non-virtualized case, except for the fact that now we have a single shared platform hosting all the virtual routers. Hence, the static power dissipation is brought down nearly by a factor of K. However, the dynamic power consumption remains the same, with its correlation to the utilization. The resource utilization and power models are expressed in Eq. 4.3 and Eq. 4.4, respectively.

R_{VS} = D + \sum_{i=0}^{K-1} \sum_{j=0}^{N-1} (L_{i,j} + M_{i,j})    (4.3)

P_{VS} = P_L + \sum_{i=0}^{K-1} u_i \sum_{j=0}^{N-1} (P(L_{i,j}) + P(M_{i,j}))    (4.4)

It should be noted that the separate approach has its disadvantages, just as in the non-virtualized case. The number of separate lookup instances that can be implemented on a given device is limited by the available resources. Hence, the scalability of the separate virtualization approach is dictated by the platform used. However, from a power perspective, we have fine-grained control over the resources and can temporarily turn off the resources that are not being used, while using a single device.

4.2.3 Virtualized-merged

This approach is radically different from the previous two cases. Here, the multiple virtual routing tables are merged (using some table merging technique) to produce a single lookup tree. The incoming packet stream, consisting of packets from different virtual networks, is sent through the lookup engine and, based on the virtual network identifier (VNID), the router loads the corresponding routing table data and forwards the packet. Hence, the router hardware is time-shared among the virtual networks (in the case of separate router virtualization, the hardware was space-shared). The resource utilization and power models are given in Eq. 4.5 and Eq. 4.6, respectively. Since there is only one lookup pipeline, we use the index 0 for the pipeline instead of using i.

R_{VM} = D + \sum_{j=0}^{N-1} L_{0,j} + M_{VM,\alpha,K}    (4.5)

P_{VM} = P_L + \sum_{j=0}^{N-1} P(L_{0,j}) + P(M_{VM,\alpha,K})    (4.6)

For this discussion, we do not elaborate on the memory requirement of the VM approach and denote it as M_{VM,alpha,K}. A detailed description of this calculation is presented in Section 4.5.1, considering the number of virtual routers, the node count of the tries being merged and the node overlap ratio.

In the case of merged, the scalability limitation has two aspects. The first is the resource limitation. The purpose of merging is to reduce the overall memory requirement. However, as we merge multiple routing tables, the total size of memory required to store the merged lookup tree may exceed the memory available on the device. The second aspect is that when we merge two routing tables, the lookup engine has to be able to sustain the required throughputs of the two virtual networks, even in the worst case. When multiple such routing tables are merged, the throughput is shared among the virtual networks; hence, at some point, the lookup engine may fail to sustain the required throughput. These are the major limitations of the merged approach. However, the merged approach is more scalable than the separate approach considering resource consumption. Here, the resources refer to memory and logic resources.
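To make the composition of Eq. 4.1 through Eq. 4.6 concrete, the following minimal Python sketch evaluates the three power models. It is illustrative only: the utilization notation u_i follows the assumptions above, while the function names, stage counts and per-stage wattages are placeholders rather than values measured in this work.

def power_nv(K, P_L, u, P_logic, P_mem):
    """Eq. 4.2: K separate devices, each paying full leakage; dynamic power scaled by utilization u[i]."""
    return sum(P_L + u[i] * sum(pl + pm for pl, pm in zip(P_logic[i], P_mem[i]))
               for i in range(K))

def power_vs(K, P_L, u, P_logic, P_mem):
    """Eq. 4.4: one device (single leakage term), one pipeline per virtual router."""
    return P_L + sum(u[i] * sum(pl + pm for pl, pm in zip(P_logic[i], P_mem[i]))
                     for i in range(K))

def power_vm(P_L, P_logic0, P_mem_merged):
    """Eq. 4.6: one device and one shared pipeline; merged-trie memory power passed in as a whole."""
    return P_L + sum(P_logic0) + P_mem_merged

if __name__ == "__main__":
    K, N = 4, 28                                   # virtual networks, pipeline stages (assumed)
    u = [1.0 / K] * K                              # Assumption 1: uniform utilization
    P_logic = [[1.5e-3] * N for _ in range(K)]     # W per stage, placeholder
    P_mem = [[4.0e-3] * N for _ in range(K)]       # W per stage, placeholder
    print(power_nv(K, 4.5, u, P_logic, P_mem))     # K devices
    print(power_vs(K, 4.5, u, P_logic, P_mem))     # one device, K pipelines
    print(power_vm(4.5, P_logic[0], 0.3))          # one device, one merged pipeline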
4.3 Virtual Routers on FPGA

In the previous section, we discussed how to model the power consumption of a virtual router. We now focus on implementing these different architectures on a state-of-the-art FPGA. For these experiments, we consider a Xilinx Virtex 6 platform (XC6VLX760) under two speed grade scenarios: 1) speed grade -2 for high performance and 2) speed grade -1L for lower power. This device was chosen considering its on-chip resources, listed in Table 4.2. In order to support multiple virtual networks, having abundant on-chip resources, mainly Block RAM, distributed RAM and I/O (Input/Output) pins, is critical.

Table 4.2: Virtex 6 XC6VLX760 Device Specs
Logic Cells: 758K
Max. distributed RAM: 8 Mb
Block RAM: 26 Mb
Max. I/O pins: 1200

In the proposed model, we consider three main contributors to power: static, logic and memory. We initially identify representative values and/or functions for these components (P_L, P(L_{i,j}) and P(M_{i,j})) on the aforementioned two platforms. For all our power calculations, we use the Xilinx XPower Analyzer (XPA) and XPower Estimator (XPE) tools. These tools provide a means by which a given design can be evaluated from a power standpoint at the resource-type level and at different operational frequencies.

4.3.1 Static power

The static power is the minimum power required to keep the device "powered up" with no switching. Even though static power does not depend on the frequency at which the device operates, it is proportional to the area of the device, the process technology, and the operating temperature (which affects the leakage current). Various circuit optimization techniques can be adopted to reduce this power component and we see such deployments in low power FPGA devices. The main distinction between the high-performance and low power variants is the supply current, which is significantly lower in the low power FPGAs [72]. In our case, we examined the static power dissipation of the device under the two speed grades and the results are as follows:

Speed grade -2: 4.5 W (plus or minus 5%)
Speed grade -1L: 3.1 W (plus or minus 5%)

The variation is based on the amount of resources used (or, equivalently, the area covered by the used resources). We observed a maximum deviation of 5% in our application and the value may vary depending on the resource consumption.

4.3.2 Power Consumed by Memory

As mentioned in Section 2.1, two types of memory are available on FPGAs, namely distributed RAM and block RAM. Even though both types of memory may be used in our applications, for simplicity, we assume only BRAM is used. On the device we are considering, 26 Mb of BRAM is available. However, BRAM (on Xilinx devices) is organized into 36 Kb blocks (each containing two independent 18 Kb blocks). Hence, no matter how small the amount of memory required, a BRAM block has to be assigned to serve the purpose. Therefore, BRAM power is determined by the number of blocks used rather than the total size of memory. The other determining factors are 1) operating frequency, 2) duty cycle, 3) write rate, and 4) bit width of the read-out data. We assumed a write rate of 1% (low update rate) and 18-bit-wide data for the comparison. In our experiments, we noted that the effect of bit width is negligible compared with the effect of the other parameters. We conducted experiments using the XPE tool to analyze the behavior of BRAM based on size and operating frequency. The observation was that BRAM power monotonically increased with both size and operating frequency.
However, it should be noted that the behavior of the 18 Kb and 36 Kb modules was different. Dynamic power is computed as Capacitance x Voltage^2 x Switching Frequency. Since the other two parameters do not change, in Figure 4.1 the dynamic power is proportional to the switching (operating) frequency. Depending on the number of BRAM blocks used, the per-block power consumption can be used to evaluate the dynamic power consumption of larger BRAM-based memories. Using these details, we generate a power model for BRAM under different scenarios. The model is summarized in Table 4.3. The notations used in the table are M, the memory requirement in bits, and f, the operating frequency in MHz. These results can be used to predict the P(M_{i,j}) values in the models proposed in Section 4.2.

Figure 4.1: BRAM power variation with operating frequency (Note: the number within parentheses denotes the speed grade).

Table 4.3: BRAM power model (per-block dynamic power in microwatts, f in MHz)
18 Kb (-2): ceil(M/18K) x 13.65 x f
36 Kb (-2): ceil(M/36K) x 24.60 x f
18 Kb (-1L): ceil(M/18K) x 11.00 x f
36 Kb (-1L): ceil(M/36K) x 19.70 x f

4.3.3 Power Consumed by Logic

In most studies related to networking, the power consumed by logic is considered negligible compared to that of memory [31, 35, 41]. However, in our study, we identified that logic power (including signal power) can become relatively significant. Logic power is distributed among Look-Up Tables (LUTs), shift registers, distributed RAM and flip-flops. Signal power includes the power dissipated when communicating among the aforementioned logic resources as well as memory components. In order to avoid clutter, we treat both logic and signal power as a whole and present the results.

In order to evaluate logic power, we stay at the granularity of a single processing element (PE) of a pipeline stage. This includes the stage registers and any type of logic resources that are required to perform the memory access and computations required at each stage. In the case of our uni-bit trie, the logic resource consumption was as follows:

Slice registers as flip-flops: 1689
Slice LUTs as logic: 336
Slice LUTs as memory: 126
Slice LUTs as routing: 376

The power consumed depends on the frequency of operation and the amount of resources used. The observation was that logic power increases linearly with the number of pipeline stages. The variation with frequency is illustrated in Figure 4.2. Further, for a trie-based IP lookup implementation, the per-stage logic power dissipation as a function of the operating frequency f, in MHz, can be expressed as:

Speed grade -2: 5.180 x f microwatts
Speed grade -1L: 3.937 x f microwatts

Figure 4.2: Per stage logic and signal power consumption (Note: the value inside parentheses denotes the speed grade).

4.3.4 Pipelined IP Lookup

Algorithmic (i.e., trie/tree based) IP lookup has become popular over TCAM-based IP lookup due to its flexibility and scalability. Mapping such trie/tree based solutions to FPGA platforms can be done efficiently. Most router virtualization solutions are trie based [14, 17, 24, 58]. Hence we use the trie as the representative example. Each trie level is mapped onto a pipeline stage and each stage is associated with an independently accessible memory [33, 34, 40]. When a lookup request is received, the packet traverses the
pipeline similar to the trie traversal and, at the end of the pipeline, outputs the appropriate next-hop port information (NHI). Generally, the NHI is stored at the leaf nodes of the trie (nodes that do not possess any child nodes) using techniques such as leaf pushing [52], in order to reduce the memory consumption for trie storage. In the case of virtualization, a leaf node is simply a vector that holds routing information corresponding to all the considered virtual networks, and the vector is indexed using the VNID to extract the forwarding information [14, 17].

The three cases considered here (non-virtualized, virtualized-separate and virtualized-merged) have the same architecture with the following distinctions:

Non-virtualized implements a single lookup engine on a single device (i.e., FPGA) and all the devices are dedicated on a per-network basis. For a K virtual network scenario, K devices are required.

Virtualized-separate implements multiple lookup engines on a single device and between two lookup engines there is no resource sharing except for the FPGA fabric itself.

Virtualized-merged implements a single shared lookup engine and all the virtual networks share the same memory and logic in the lookup engine. The amount of logic used remains almost the same as in the two other cases; however, the amount of memory required may significantly increase depending on the node overlap ratio, alpha.

4.3.5 Routing Tables

Router virtualization is most effective at the edge level of the network, since the problem of underutilization is most prevalent at the edge network level. In order to demonstrate the results for a more realistic scenario, we use routing tables from real networks obtained from [49]. To simplify the implementation, we assume all the routing tables to be of the same size and we use the largest routing table we obtained from [49] to report the results for the worst-case scenario. This particular routing table consisted of 3725 prefixes and the corresponding trie had 9726 nodes without leaf pushing and 16127 nodes with leaf pushing.

Figure 4.3 illustrates the effect of virtualization on memory consumption under different scenarios and shows the amount of memory used for pointers (non-leaf nodes) and for NHI/forwarding information (leaf nodes). It can be seen that the memory savings achieved by the merged schemes is highly dependent on the node overlap ratio, alpha. It is also clear that the pointer saving becomes less effective as the number of virtual routers increases and alpha decreases. Since we cannot assume any particular structure for the considered routing tables, alpha cannot be determined in advance, which leads to non-deterministic memory requirements, whereas in the separate (even non-virtualized) approach, the memory requirement is deterministic. Also, it should be noted that merging schemes are appropriate (from a memory standpoint) when the number of virtual routers is small.

Figure 4.3: Pointer and NHI memory requirements for merged (alpha = 80% and alpha = 20%) and separate approaches.

4.4 Virtualized Router: Power Performance

In the previous section, we observed how each power component behaves for the two scenarios and derived relationships in terms of operating frequency and the amount of logic resources required.
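Those relationships can be collected into a small per-stage power calculator. The sketch below is illustrative: the coefficients are the ones quoted in Table 4.3 and Section 4.3.3 (interpreted here in microwatts per MHz), while the function names and example stage sizes are assumptions made only for this sketch.

import math

BRAM_UW_PER_MHZ = {                      # (block size in Kb, speed grade) -> coefficient
    (18, "-2"): 13.65, (36, "-2"): 24.60,
    (18, "-1L"): 11.00, (36, "-1L"): 19.70,
}
LOGIC_UW_PER_MHZ = {"-2": 5.180, "-1L": 3.937}    # per pipeline stage (logic + signal)

def bram_power_uw(mem_bits, f_mhz, block_kb=36, grade="-2"):
    # Table 4.3: power follows the number of BRAM blocks, not the bits actually used.
    blocks = math.ceil(mem_bits / (block_kb * 1024))
    return blocks * BRAM_UW_PER_MHZ[(block_kb, grade)] * f_mhz

def stage_power_uw(mem_bits, f_mhz, grade="-2"):
    # One stage = one PE (logic and signal power) plus its stage memory.
    return LOGIC_UW_PER_MHZ[grade] * f_mhz + bram_power_uw(mem_bits, f_mhz, grade=grade)

if __name__ == "__main__":
    stage_bits = [4096 * (i + 1) for i in range(28)]          # hypothetical stage sizes
    total_uw = sum(stage_power_uw(m, f_mhz=300) for m in stage_bits)
    print(f"dynamic pipeline power ~ {total_uw / 1e6:.3f} W at 300 MHz")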
However, the standard metric used for power measurements in routers is Watts per Gbps. This describes how much energy is spent to provide a unit of throughput. In order to evaluate the two virtualized routers against the non-virtualized router, we analyzed the post place-and-route results of these architectures on the device we considered (Virtex 6 LX760) under the two speed grades (-2 and -1L). Without loss of generality, for all pipelines we assume a length of 28 stages.

Figure 4.4: Comparison of total power consumption in virtualized and non-virtualized schemes for speed grades -2 (top) and -1L (bottom).

4.4.1 Total Power Dissipation: Experimental vs. Estimation

Here, we validate the models proposed in Section 4.3 against the experimental results we obtained. To clearly illustrate the performance of the different schemes, we first compare the total power utilized by all the schemes (non-virtualized, virtualized-separate and virtualized-merged for alpha = 20% and alpha = 80%) and then we show the comparison of all the virtualized schemes. These results are shown in Figure 4.4 and Figure 4.5, respectively. It can be observed clearly that the non-virtualized router consumes power proportional to the number of (virtual) networks. In contrast, the virtualized routers consume a very small amount of power since the static power consumed by the lookup engine is shared among the considered virtual networks.

Figure 4.5: Comparison of total power consumption in different virtualized schemes for speed grades -2 (top) and -1L (bottom).

Another interesting observation is that in Figure 4.5, the total power dissipation decreases with the increasing number of virtual networks. According to the model (Eq. 4.4), the power consumption should remain the same since only one lookup engine is active at a given time (Assumption 1). However, the experimental value decreases due to various hardware optimizations applied when implementing multiple parallel architectures. We limited the maximum number of virtual networks to 15 since, in the case of virtualized-separate, the I/O pin requirement exceeded the available pins when the number of virtual networks was increased further. In a complete router implementation (parsing, lookup, editing, scheduling, etc.), this number may become even smaller when other inputs and outputs are considered. The goal of this work is to analyze the power behavior of the lookup portion of a router. Therefore, the above implementation stands as an accurate prototype for the considered purpose.

Figure 4.6: Percentage error of the model estimation compared with the experimental results for speed grades -2 (top) and -1L (bottom).

Figure 4.6 shows the percentage error of the models we proposed in Section 4.3.
The percentage error is calculated as follows:

\text{Percentage error} = \frac{P_{\text{Model}} - P_{\text{Experimental}}}{P_{\text{Experimental}}} \times 100\%

It can be seen that the model estimation is highly reliable, with a maximum error of 3%. The cause for this error is the various hardware optimizations that are performed by the synthesis tool when the amount of resources used increases. As shown in the figure, for non-virtualized and virtualized-separate, the error is much smaller compared with that of virtualized-merged. In the merged approach, we use more BRAM per pipeline stage to accommodate the increasing number of virtual routers. The synthesis tool performs various routing and placement optimizations to improve the performance of the design, which causes our predictions to deviate slightly from the exact measurement. Nevertheless, the model we proposed here provides an accurate means by which the power consumption of virtualized routers can be estimated.

Figure 4.7: Power dissipated per unit throughput for speed grades -2 (left) and -1L (right).

4.4.2 Power Efficiency

In the context of lookup engines, one important metric is the packet handling rate. In this work, we use Gigabits per second as the metric to measure the packet handling rate and, to compute it, we use a minimum packet size of 40 bytes. A router may use more power to support higher throughput, which might not be desirable. In order to compare such architectures with power-efficient architectures, we use the power dissipated per unit throughput as the metric for our comparisons. Figure 4.7 illustrates the comparison of the three approaches with respect to the considered metric.

The lower the mW/Gbps number is, the better the architecture. Therefore, analyzing the results in Figure 4.7, the virtualized-separate approach yields the best power efficiency. The conventional router is the second best, while the merged approach shows the worst performance. The main reason behind the poor performance of the merged approach is the reduction in operating frequency (hence, throughput) with the increasing number of virtual routers. Due to the higher resource consumption, the operating frequency decreases significantly. The power consumed by the resources increases, but the throughput drops. As a result, the power per unit throughput increases. The performance difference between the two cases, alpha = 20% and alpha = 80%, is also intuitive. When the node overlap ratio is much lower, the amount of resources consumed by the router increases, while the throughput decreases.

For both the total power consumed and the power per unit throughput, we have presented the results for the two speed grades (-2 and -1L). We observed about 30% lower power consumption when speed grade -1L was chosen compared with speed grade -2. However, the power saving comes at the expense of throughput. This fact becomes clear when comparing the two speed grades with respect to mW/Gbps in Figure 4.7: the two speed grades perform almost identically, with the same variation and performance numbers. Hence, low power FPGAs are suitable in environments where throughput is not a major concern.

4.5 Generalization of Virtual Routers

This is the second phase of the performance evaluation of virtual routers.
The previous sections discussed the analytical models and performance evaluation of virtual routers from solely a power standpoint. In what follows, we introduce a comprehensive performance evaluation of virtual routers on FPGA. The existing router virtualization schemes are generalized from a mathematical standpoint and their performance is evaluated. We also propose a novel grouped router virtualization approach which inherits the merits of both the separate and merged router virtualization schemes. Note that some of the material from the power modeling phase is repeated here in order to provide context for the second phase.

4.5.1 Virtual Router Models

On FPGA, two main types of resources are available on-chip: 1) memory and 2) logic. In applications of this nature, memory is often the main bottleneck, as routing table storage requires a considerable amount of memory. Typically, the logic resource consumption of trie-based IP lookup is below 10%, hence it is not a limiting factor. Therefore, we model each virtualization approach purely based on its memory consumption. The notations used are listed in Table 4.4. We have provided detailed descriptions for the parameters we introduce in this work which are not self-explanatory. Several of these notations were used in the first phase of this work; however, we redefine the notations used for the second phase for clarity.

Table 4.4: Notations and Symbols
No. of virtual networks: K
Virtual router i: VR_i
Bit width of IP address: W
No. of stages per pipeline: P
No. of prefixes per routing table: N
No. of nodes per leaf-pushed trie: n
Memory size of stage i of pipeline: M_i
Number of nodes at level i of the trie: n_i
Node overlap ratio: alpha
Node non-overlap ratio (1 - alpha): alpha'
Pointer bit width: p
Next-hop information (NHI) bit width: h
Number of routing tables per group: G

The trie for IP lookup is generated using a routing table (N prefixes). As mentioned earlier, the memory footprint of a uni-bit trie can be reduced by using techniques such as leaf-pushing. We employ leaf-pushing in our approach and the number of nodes in the leaf-pushed trie (n) determines the memory footprint. The leaf-pushed trie is mapped onto a pipelined lookup engine and the pipeline may contain O(W) stages, which is 32 in the case of IPv4 prefixes. However, depending on the longest prefix length occurring in the routing table, this number (P) may be less than W.

Each pipeline stage is equipped with its own stage memory and the memory distribution across the pipeline stages depends on the prefix distribution of the routing table, as we show in Section 4.5.1. However, despite slight variations, common characteristics can be found in routing tables, which make the memory distribution approximately similar for different routing tables. The node overlap ratio used in this phase is:

\alpha = \frac{\#\text{common nodes in primary and secondary}}{\#\text{nodes in primary}}

Figure 4.8: Normalized node distribution for real IPv4 routing tables.

Routing Tables

In order to evaluate the virtualization approaches, multiple routing tables are required. Virtualization generally occurs at the edge-network level. These networks are typically of smaller size compared with backbone networks and the routing table sizes may vary from a few tens to a few thousands of prefixes [49]. Currently, there are no virtual routing tables available publicly. Since the focus of this work is to generalize the virtualization approaches, using real routing tables is not critical.
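As a toy illustration of the node overlap ratio alpha defined above, the following sketch computes it for two small tries. The prefixes and the set-based trie representation are hypothetical; they are not taken from [49] or from the routing table model used in this work.

def trie_nodes(prefixes):
    # Uni-bit trie node paths induced by a list of binary prefix strings; "" is the root.
    nodes = {""}
    for prefix in prefixes:
        for length in range(1, len(prefix) + 1):
            nodes.add(prefix[:length])
    return nodes

def node_overlap_ratio(primary_prefixes, secondary_prefixes):
    # alpha = (#common nodes in primary and secondary) / (#nodes in primary)
    primary = trie_nodes(primary_prefixes)
    secondary = trie_nodes(secondary_prefixes)
    return len(primary & secondary) / len(primary)

if __name__ == "__main__":
    primary = ["0", "01", "0110", "10"]       # toy binary prefixes, not a real table
    secondary = ["0", "011", "11"]
    print(f"alpha = {node_overlap_ratio(primary, secondary):.2f}")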
We model a near-worst-case scenario by using a routing table of 10000 prefixes. However, the number of prefixes alone does not represent a routing table entirely. In order to simulate a realistic routing table, we extracted the per-level node/memory distribution from a set of real routing tables obtained from [49]. The averaged normalized per-level node distribution is depicted in Figure 4.8. In our work, we assumed the same node distribution and extrapolated to 10000-prefix routing tables by using FRuG [16]. Throughout, we use the same routing table model.

After the leaf-pushing process, the number of non-leaf nodes will be (n - 1)/2 and the number of leaf nodes will be (n + 1)/2. This is because every node in the tree, except for the leaf nodes, will have exactly two children. For non-leaf nodes, two pointer fields are required to point to the left and right children, and for leaf nodes, only an NHI field is required to indicate the packet forwarding information. Note that even though all the leaf nodes may not necessarily be at a single level, but spread over a few levels, they can easily be brought down to the lowest available level by including a single bit in every node indicating whether the node is a leaf or not, without affecting the total node count. In Table 4.4, the n_i values refer to the trie after applying this leaf node moving. Therefore, n_{P-1} essentially represents the number of leaf nodes in the original trie.

Virtual Separate (VS)

The virtual routers share the FPGA fabric by implementing separate IP lookup engines corresponding to each virtual router. The benefits of this approach are: 1) the operation of one virtual router does not affect the operation of the others, since the lookup engines are not time-shared; 2) a given lookup engine corresponds to only one routing table, therefore no complex merging algorithms are required. However, since multiple pipelined engines need to be implemented, the logic resource consumption may increase linearly with the number of virtual routers. This affects resource consumption as well as power consumption/efficiency. The total memory requirement of the VS approach can be expressed as:

M_{VS}(K) = K \left( \sum_{i=0}^{P-2} (n_i \cdot 2p) + n_{P-1} \cdot h \right)    (4.7)

The number of virtual routers K becomes a multiplication factor since the number of lookup engines is the same as the number of virtual routers. As mentioned earlier, in this work we simulate a near-worst-case scenario for edge network virtualization and consider a worst-case routing table size. Therefore, the memory requirement of a single trie is simply multiplied by the number of virtual routers to obtain the total memory footprint.

Virtual Merged (VM)

In the VM approach, the tries corresponding to virtual routing tables are merged using a merging algorithm [14, 58]. While various such algorithms exist, we generalize them by considering their key features. The merging algorithms attempt to exploit the structural similarity of tries by sharing common nodes available in the tries. In order to generalize these merging approaches, we use the node overlap ratio (alpha) introduced earlier, which is a measure of the number of nodes common to the two tries. This lets us model any merging algorithm since the node overlap can be accurately modeled using alpha. While we define the node overlap ratio as a single value, if desired, it can be defined at a finer granularity by defining a node overlap ratio for each level of the trie. However, this complicates the modeling process.
Therefore, we use a single node overlap ratio at the granularity of the whole trie, as opposed to per trie level, which simulates the overall effect of merging. The net effect we expect by introducing alpha is that in every level, alpha' * n_i new nodes are introduced. However, the initial few levels may potentially be fully occupied, since the maximum number of nodes for those levels is reached easily (e.g., only 4 nodes at level 2). Hence, there might not be any room for new nodes to be added. If modeling the behavior of overlap at each level is critical, then in such cases assigning the node overlap ratio on a per-level basis is useful. However, our interest is in the net effect on the total memory consumption, which can be sufficiently modeled using an overall overlap ratio. Therefore, the capacity of a given level is ignored under this model.

Figure 4.9: Pointer and NHI memory variation for increasing number of virtual routers for VS, VM for alpha = 0.2 and VM for alpha = 0.8. (a) Pointer memory. (b) NHI memory.

We now formulate the node count when two tries are merged together. According to the definition of alpha, when a secondary trie is merged onto the primary trie, alpha*n nodes are common to the two tries. The rest is not common and is therefore added to the primary trie. The number of new nodes added is alpha'*n. Therefore, the total number of nodes in the merged trie is n + alpha'*n = (1 + alpha')n. When considering the per-level node increase, we assume that in each trie level, alpha' * n_i new nodes are introduced. Hence, the total memory consumption after merging two routing tables can be expressed as:

M_{VM}(2) = (1 + \alpha') \left( \sum_{i=0}^{P-2} (n_i \cdot 2p) + n_{P-1} \cdot 2h \right)    (4.8)

The primary trie's node count in each level goes from n_i to (1 + alpha')n_i due to the additional nodes introduced by merging the secondary trie. Also note that after merging the two tries together, every leaf node has to contain two NHI fields, one for each virtual network. This introduces a significant amount of forwarding information duplication, which demands additional storage. As we will discuss later in this section, this also causes the memory requirement of the merged approach to increase dramatically when multiple virtual routers are merged together. In Section 4.5.3, we show how this increase can be controlled by creating groups of routing tables and building multiple merged tries rather than merging all the routing tables into a single merged trie.

Now, extending Eq. 4.8 to K virtual routers, the effect of alpha' forms a polynomial expansion, in which case the number of nodes per level increases by a factor of:

\binom{K-1}{0} + \binom{K-1}{1}\alpha' + \binom{K-1}{2}\alpha'^{2} + \cdots + \binom{K-1}{K-1}\alpha'^{K-1} = (1 + \alpha')^{K-1}    (4.9)

With this understanding, the memory requirement of the VM approach can be expressed as:

M_{VM}(K) = (1 + \alpha')^{K-1} \left( \sum_{i=0}^{P-2} (n_i \cdot 2p) + n_{P-1} \cdot Kh \right)    (4.10)

4.5.2 Memory Footprint

In order to understand the effect of alpha and K, we put our two virtualization approaches to the test using the aforementioned 10000-prefix routing table model. We used p = 16 and h = 6 as representative values of the child pointer and NHI bit widths. The memory footprint for increasing K is illustrated in Figure 4.9. We define the near-best and near-worst case as the cases where the trie node overlap, alpha, is 80% and 20%, respectively.
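Under these settings, Eq. 4.7 and Eq. 4.10 can be evaluated directly. The sketch below does so for an assumed per-level node distribution; the distribution is a synthetic placeholder rather than the FRuG-generated model, and only the constants p = 16 and h = 6 follow the text above.

def memory_vs(K, n_levels, p=16, h=6):
    # Eq. 4.7: K independent tries, 2 pointers per non-leaf node, one NHI field per leaf.
    pointers = sum(n_i * 2 * p for n_i in n_levels[:-1])
    nhi = n_levels[-1] * h
    return K * (pointers + nhi)

def memory_vm(K, n_levels, alpha, p=16, h=6):
    # Eq. 4.10: one shared trie grown by (1 + alpha')^(K-1); each leaf stores K NHI fields.
    growth = (1 + (1 - alpha)) ** (K - 1)
    pointers = sum(n_i * 2 * p for n_i in n_levels[:-1])
    nhi = n_levels[-1] * K * h
    return growth * (pointers + nhi)

if __name__ == "__main__":
    # Placeholder leaf-pushed trie: roughly 20000 nodes spread over 24 levels.
    n_levels = [min(2 ** i, 1500) for i in range(23)] + [10000]
    for K in (5, 10, 15, 20, 25):
        print(K,
              round(memory_vs(K, n_levels) / 1e6, 1),             # Mbit, VS
              round(memory_vm(K, n_levels, alpha=0.8) / 1e6, 1),  # Mbit, VM near-best
              round(memory_vm(K, n_levels, alpha=0.2) / 1e6, 1))  # Mbit, VM near-worst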
We evaluate these near-extreme scenarios to illustrate the behavior of virtualized routers in real-world situations, since 100% and 0% node overlap will not occur with real-world routing tables. We show the pointer memory and NHI memory consumption separately to show the contribution of each component to the total. Pipeline stages 0 to P - 2 represent the pointer memory and stage P - 1 represents the NHI memory.

It can be observed that with an increasing number of virtual routers, the NHI memory increases dramatically for the VM approaches, especially when alpha is small. This is due to: 1) the new nodes introduced at the leaf level, and 2) the increase in the number of bits per leaf node. When a new virtual router is added, in the VM approach, all the leaf nodes need to be augmented with another h bits indicating the NHI for the added virtual network. This can be addressed by compact representations of the NHI that eliminate the empty elements of the NHI vector [26]. However, for this work, we assume a vector of forwarding information (NHI) exists at each leaf node and the information corresponding to a particular virtual network is extracted using the VNID of the packet.

Figure 4.10 depicts the total memory consumption (i.e., pointer memory + NHI memory) of the router virtualization approaches with an increasing number of virtual networks. It can be seen that the VM approaches are eventually outperformed by the VS approach. In the VM approaches, the memory savings are achieved by sharing the non-leaf nodes, but in the process, the leaf nodes become large in size. The effect of this is marked when multiple virtual routers are merged. As seen from Figure 4.9, the impact of NHI memory is several times greater than that of pointer memory, especially when the number of virtual routers is high. Therefore, with respect to memory requirement, the VS approach is more appealing for memory-constrained environments. However, when it comes to implementation, lower hardware resource consumption is desired, as multiple packet processing functionalities are offered on a single chip and the available hardware has to be shared among these tasks. In the case of the VS approach, even though the memory consumption is lower, more resources are required to build the multiple pipelines. This tradeoff between memory and hardware resources leads us to the grouped router virtualization, which is discussed next.

Figure 4.10: Total memory consumption of VM and VS approaches.

4.5.3 Grouped Router Virtualization

The issue with the VM approach is that the NHI memory increases dramatically due to the increased number of leaf nodes as well as the increased number of virtual routers. However, it is not necessary for all the routing tables to be merged into a single lookup engine. As a solution, we propose grouped router virtualization, in which the routing tables are formed into groups, where each group contains a subset of the virtual routing tables and the groups are mutually exclusive. After the groups are formed, within each group the routing tables can be merged using a chosen merging algorithm, which produces a single merged trie for each group. These merged tries can be implemented as separate pipelines to perform IP lookup on the incoming packets, as sketched below.
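The sketch partitions K routing tables into mutually exclusive groups of size G and merges within each group. It is only a structural illustration: the order-based grouping and the set-union "merge" stand in for the similarity-based clustering and the actual merging algorithm that a real deployment would use.

def form_groups(tables, G):
    # Split the list of routing tables into ceil(len(tables)/G) disjoint groups.
    return [tables[i:i + G] for i in range(0, len(tables), G)]

def merge_group(tries):
    # Placeholder merge: union of node paths (stands in for a real merging algorithm).
    merged = set()
    for trie in tries:
        merged |= trie
    return merged

if __name__ == "__main__":
    # Each "trie" is represented as a set of node paths (see the earlier alpha sketch).
    tries = [{"", "0", "1", f"{i % 2}0"} for i in range(15)]   # 15 toy virtual routers
    groups = form_groups(tries, G=5)
    pipelines = [merge_group(g) for g in groups]               # one lookup pipeline per group
    print(f"{len(tries)} virtual routers -> {len(pipelines)} pipelines")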
The benefits of the grouped approach are twofold:

Since the number of routing tables merged in a group is less than the number of virtual routers (especially for large-scale router virtualization), the memory increase at the leaf nodes is reduced considerably. This makes the grouped virtualization (VG) approach more memory efficient than the VM approach.

Lookup engines are implemented on a per-group basis, as opposed to a per-virtual-network basis as in the VS approach. This reduces the hardware resources required to implement the virtualized router, which is attractive from the resource and power points of view.

Due to the above two reasons, the VG approach becomes a better candidate than the VM or VS approaches. As done for the VM and VS approaches, the memory consumption of the VG approach can be expressed mathematically as:

M_{VG}(K, G) = \frac{K}{G} (1 + \alpha')^{G-1} \left( \sum_{i=0}^{P-2} (n_i \cdot 2p) + n_{P-1} \cdot Gh \right)    (4.11)

Grouping of routing tables can be done based on various parameters. For example, the goal might be to reduce the intra-group merged trie memory consumption, in which case the grouping criterion will be to pick routing tables with similar properties into a single group. It should be noted that these properties may vary for different merging algorithms. Clustering algorithms can be employed to form these groups by taking into account the properties of the routing tables. If the grouping is done based on the aforementioned criterion, then the merged trie for a group will have lower memory consumption than with random grouping, as compatible routing tables are selected for each group. This improves node overlap, thereby reducing the memory required for both pointer and NHI storage.

We evaluate the effect of grouping from a memory consumption perspective in this section. Unless stated otherwise, we assume a group size of G = 5 throughout. This means that each group consists of 5 virtual routing tables. This number can be changed to adjust the number of groups formed and, hence, the number of lookup engines. Figure 4.11 illustrates the memory consumption of the grouped approach (VG), computed using Eq. 4.11, for alpha = 20% and alpha = 80% against VS and VM for alpha = 80%. This simulates the near-best and near-worst case scenarios for the VG approach.

Figure 4.11: Total memory consumption of VG for alpha = 20% and VG for alpha = 80% vs. VS and VM for alpha = 80%.

Table 4.5: Scalability of different virtualization approaches on a Virtex 7 2000T device using on-chip memory (total memory in Mb, maximum number of virtual routers, number of pipelines)
VS: 66.5 Mb, 70 virtual routers, 70 pipelines
VM (alpha = 20%): 61.5 Mb, 20 virtual routers, 1 pipeline
VM (alpha = 80%): 65.3 Mb, 42 virtual routers, 1 pipeline
VG, G = 5 (alpha = 20%): 65.1 Mb, 50 virtual routers, 10 pipelines
VG, G = 5 (alpha = 80%): 64.1 Mb, 115 virtual routers, 23 pipelines
VG, G = 10 (alpha = 20%): 56.6 Mb, 30 virtual routers, 3 pipelines
VG, G = 10 (alpha = 80%): 64.2 Mb, 230 virtual routers, 23 pipelines

From the figure, it can be seen that the memory consumption of the VG approach is close to that of the VS approach and VM for alpha = 80%. When the node overlap is high, the VG approach outperforms both the VS and VM approaches with respect to memory consumption. This is due to the non-leaf node sharing, which is possible in the VG approach but not in VS, and due to the limited leaf node size compared with the VM approach. Essentially, the VG approach inherits the merits of both the VS and VM approaches, which makes it an attractive solution for FPGA implementations. To summarize, we give the scalability of each approach on a Xilinx Virtex 7 V2000T FPGA in Table 4.5.
This table assumes 100% on-chip memory utilization on the FPGA; however, in a realistic scenario, memory utilization will be less than 100% due to the granularity at which the memory resources are available on the FPGA and the routing complexities that may arise. For example, the smallest block RAM available on the FPGA is 36 Kb and the entire 36 Kb of the block may not be utilized in a particular stage of the lookup pipeline. In such cases, the maximum number of virtual routers that can be hosted on a single chip may vary slightly.

From Table 4.5, it can be seen that the VG approach outperforms the VM approach with respect to the number of virtual routers that can be hosted on a single chip, at the expense of implementing multiple pipelines. When compared with the VS approach, even when the node overlap ratio is low, the scalability of the VG approach is on par with that of the VS approach, with lower resource consumption due to the lower number of pipelines. When alpha is high, the scalability of the VG approach is superior to both VM and VS. This implies that if routing tables can be grouped in such a way that the node overlap ratio is maximized, the VG approach is well suited for large-scale router virtualization.

4.6 Virtualized Router Architecture on FPGA

We implemented the three virtualized router approaches on FPGA as pipelined architectures to measure their performance quantitatively on the aforementioned FPGA. In this dissertation, we present the architecture of the VG approach only, since the architectural features of both VM and VS are present in VG, as VG is a hybrid of the two.

The pipelined architecture of the VG approach is depicted in Figure 4.12. The initial table lookup is to identify the corresponding trie root for the incoming packet. The VNID associated with the packet is used to index the table and the trie root information corresponding to it identifies the pipeline in which the routing information related to that particular virtual network is stored. After locating the pipeline, the trie is traversed using the destination IP address of the incoming packet. Up to this point, the router architecture is similar to that of the VS approach, as multiple pipelines exist and the pipeline corresponding to an incoming packet is located using the VNID of the packet. There will be K pipelines in the case of VS, but only K/G in the case of VG.

Figure 4.12: Pipelined FPGA architecture for the VG approach (K/G pipelines; each stage contains a PE and its stage memory, and the final stage extracts the NHI from the NHI vector using the VNID).

At the end of the trie traversal, the VS architecture will have the forwarding information of the packet and will use that information to forward it through the port indicated in the NHI field. On the VG architecture, however, similar to the VM approach, the NHI extraction is a two-step process. At the end of the trie traversal, the output is a vector of NHI, from which the NHI corresponding to the VNID of the packet has to be extracted. This can be done using a mechanism similar to the priority encoding process, with the VNID as the search key [20], as illustrated by the sketch below.
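The following is a behavioral software sketch (not RTL) of that lookup flow: VNID to (pipeline, trie root) via the initial table, bit-by-bit trie traversal on the destination IP, then NHI extraction from the leaf's NHI vector by VNID. The node layout, field names and toy tables are assumptions made for illustration only.

def vg_lookup(packet, vnid_table, pipelines):
    vnid, dest_ip = packet["vnid"], packet["dest_ip"]       # dest_ip as a 32-bit integer
    group, root = vnid_table[vnid]                          # initial direct-lookup table
    node = pipelines[group][root]
    level = 0
    while not node["is_leaf"]:                              # one iteration per pipeline stage
        bit = (dest_ip >> (31 - level)) & 1                 # inspect bit `level` of the IPv4 address
        node = pipelines[group][node["child"][bit]]         # left child on 0, right child on 1
        level += 1
    return node["nhi_vector"][vnid]                         # extract this virtual network's NHI

if __name__ == "__main__":
    # One group/pipeline holding a two-leaf toy trie; NHI vectors are indexed by VNID.
    pipelines = {0: {
        0: {"is_leaf": False, "child": [1, 2], "nhi_vector": None},
        1: {"is_leaf": True, "child": None, "nhi_vector": {7: 3, 9: 5}},
        2: {"is_leaf": True, "child": None, "nhi_vector": {7: 1, 9: 2}},
    }}
    vnid_table = {7: (0, 0), 9: (0, 0)}                     # VNID -> (group, trie root index)
    print(vg_lookup({"vnid": 7, "dest_ip": 0x0A000001}, vnid_table, pipelines))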
At this point, the forwarding information for the packet is extracted and the search is terminated.

The operation inside a processing element (PE) in a given stage i is simply to inspect the i-th bit of the destination IP address of the packet and load the left or right child pointer depending on the bit value (left pointer if 0, right pointer if 1). The trie nodes are stored in stage memory, which is realized using the on-chip memory available on the FPGA. Note that both Block RAM (BRAM) and distributed RAM available on the FPGA can be used for this purpose. The initial lookup table can be realized using a direct lookup table (storing 2^k entries for a k-bit key) or using a Binary Content Addressable Memory (BCAM). Due to its simplicity, the direct lookup table is preferred over the BCAM and hence is used in our work.

4.7 Performance Evaluation

As mentioned in Section 4.5.3, the target device in our work is the Xilinx Virtex 7 V2000T. The device comes with 21.5 Mb of distributed RAM, 46.5 Mb of BRAM and 1200 I/O pins available on-chip. An ample amount of logic resources is available on this FPGA, with nearly 2 million logic cells. This device was selected since it offers both high performance and high capacity in terms of on-chip resources. The architecture discussed in Section 4.6 was implemented on this device to evaluate the performance of the VS, VM and VG approaches with respect to throughput and power consumption. The results of these experiments are discussed in the following sections and all the reported results are post Place-and-Route (PAR) results.

4.7.1 Throughput

The throughput of a router is reported either in Million Lookups Per Second (MLPS) or Gigabits per second (Gbps). The conversion between the two is simple:

Gbps = (Packet size in bits x MLPS) / 1000

The performance of the architecture is decided by the clock frequency at which it operates, expressed in MHz. In the case of a single-pipeline design, the MLPS : MHz mapping is 1 : 1. When multiple pipelines exist, for example C pipelines, the mapping becomes C : 1, as each pipeline operates at the given clock frequency with a packet being output every clock cycle.

While the packet size may vary significantly for IP packets, the minimum packet size is used to calculate the throughput in Gbps in order to compute the worst-case throughput. In the case of IPv4 packets, the minimum packet size is 40 bytes (or 320 bits) and for IPv6, it is 64 bytes (or 512 bits). Since the proposed solution is for IPv4 router virtualization, we use a 40-byte packet size for all throughput calculations.

Figure 4.13 illustrates the throughput variation for an increasing number of virtual routing tables for the three virtualization approaches. The figure shows that the virtualized router architectures are able to operate beyond 10 Gbps rates, which is sufficient for edge-network operation. For these experiments, we have considered single-ported BRAM; however, the distributed RAM as well as the BRAM available on current FPGAs have a dual-ported feature, which can be used to issue two memory read operations in a single clock cycle. This feature can be used to essentially double the throughput of the architecture by taking two packet lookup requests per clock cycle as opposed to a single packet lookup request. The effect of enabling dual-ported BRAM on the overall clock rate is not significant; hence, the architecture is still able to operate at clock rates close to those of the single-ported designs.
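A small sketch of this conversion, using the 40-byte worst case and the C-pipeline scaling described above (the clock rates in the example are placeholders):

MIN_IPV4_PACKET_BITS = 40 * 8       # 320 bits

def throughput_gbps(clock_mhz, num_pipelines=1, packet_bits=MIN_IPV4_PACKET_BITS):
    mlps = clock_mhz * num_pipelines            # one lookup per pipeline per cycle
    return packet_bits * mlps / 1000.0          # Mbit/s -> Gbit/s

if __name__ == "__main__":
    print(throughput_gbps(clock_mhz=156))                    # single pipeline
    print(throughput_gbps(clock_mhz=250, num_pipelines=2))   # dual pipeline (dual-ported BRAM)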
Figure 4.13: Throughput variation for increasing number of virtual routers.

The main deciding factor when it comes to throughput is the memory consumption. The larger the memory, the longer the access time. The total memory consumption can be seen in Figure 4.10 and Figure 4.11. Since the BRAM is arranged in columns on the FPGA, when arranging the stage memory, multiple columns of BRAM might be required to realize the required amount of memory. In such cases, the communication paths become longer, causing the clock period to increase. This can be observed in the figure for the different virtualization schemes. For example, VM with alpha = 20% has the poorest memory efficiency, hence the lowest throughput. VG with alpha = 80% has the best memory efficiency, hence the highest throughput. The trend is common to all schemes in that the throughput decreases for an increasing number of virtual routers. This is expected due to the increased memory consumption and increased path delays.

When comparing Figure 4.9 and Figure 4.10 with Figure 4.13, it can be seen that there is an inverse relationship between memory and throughput. Especially for the VM approach, the last trie stage (comprising the NHI vectors) consumes the largest amount of memory; hence, the performance is often governed by the clock period of this stage. For example, for VM with alpha = 20% and 25 routers, the NHI memory is more than 80% of the total memory. In such cases, the effect of the NHI stage (i.e., the last stage) on the overall clock frequency is significant. However, factors such as the overall utilization of chip resources and the layout of the architecture also affect the overall performance. The magnitude of such effects can, however, be minimized by proper placement of logic and memory resources by exploiting chip floor-planning techniques.

However, when considering points such as VS for 20 routers and VM with alpha = 20% for 25 routers, it can be seen that the performance variation is not consistent. Such anomalies occur due to the component placement of the place-and-route tool, over which the user has little control if techniques such as floor-planning are not used. The results appearing in Figure 4.13 were obtained without floor-planning; hence the performance trend is violated in some instances. In Section 4.7.3, we discuss how floor-planning tools can be leveraged to enhance the performance of pipelined designs by manually assisting the place-and-route tool in placing components on the FPGA fabric.

4.7.2 Power Efficiency

The power consumption of the architecture was measured using the Xilinx XPower Analyzer (XPA) tool. We present power using two metrics: 1) power consumed by the architecture in Watts and 2) power efficiency measured in Watts dissipated per unit throughput. These two metrics give two perspectives on the power performance of the architecture. The power measurements include static (leakage) and dynamic power dissipated by the architecture. The dynamic power includes memory, logic, signal and clock power.

When computing these power values, we assumed clock gating to reduce memory power consumption. In the VS and VG approaches, not all the pipelines need to be activated for a single packet lookup operation, since only one pipeline corresponds to any incoming packet. Hence, only that particular pipeline needs to be activated.
By employing clock gating techniques, the stage memory of the inactive pipelines can be turned off with only a 2% increase in logic power consumption. Note that only the memory power consumption can be reduced by introducing clock gating, while all the other power components remain the same. We included this in our power analysis by setting the enable rates for each pipeline based on the activity needed, as follows:

VS approach: For an incoming packet lookup request, only one pipeline needs to be activated. Therefore, the enable rate for the total BRAM consumed is 1/K.

VG approach: For an incoming packet lookup request, a pipeline corresponding to only one group needs to be activated. Therefore, the enable rate for the consumed BRAM is G/K.

Figure 4.14 illustrates the power consumed by the architecture. It can be seen that the VS approach shows a consistent trend while the other approaches show inconsistent trends. This can be explained by two factors: 1) total memory consumption, and 2) operating frequency. For example, in the VM approach for alpha = 20%, it can be seen that the power consumption decreases when going from 15 routers to 20 routers. Between these two points, the memory consumption increases from 37 Mb to 61 Mb while the frequency decreases from 97 MHz to 54 MHz. Therefore, the effect of the larger memory is suppressed by the significant decrease in operating frequency. A similar effect can be observed in [19], where the power performance drops for the VS approach when going from 5 to 10 routers: the total power consumption decreases rather than increasing, despite the increased resource consumption (as we see in Figure 4.14). This is due to the sudden frequency drop between the two data points, from 350 MHz to 170 MHz. This significant decrease in frequency is due to the arbitrary component placement of the place-and-route tool. Hence, even though the power consumption should have increased with the increasing number of virtual routers (increasing logic and memory consumption), the dynamic power of the architecture decreases significantly due to the decrease in clock frequency. However, in Figure 4.14, since the clock frequency decreases gradually in the VS approach, the dynamic power consumption increases, which is a combined effect of the amount of resources consumed and the clock frequency.

In Figure 4.15, the trends in the curves show a consistent pattern, as we have taken both clock frequency and power consumption into account. In a networking context, this metric is more appealing as it expresses how many Watts will be dissipated if the throughput were to be increased by 1 Gbps. For this metric, the higher the value, the lower the performance, and vice versa. The figure shows that the VG approach yields the best power efficiency among the considered schemes and the VM approach has the worst power efficiency, while VS has moderate power efficiency. Even though the memory consumption of the VS approach is low, since K pipelines are implemented for the K virtual routers, the clock and signal power components increase significantly compared with the other approaches. Therefore, the power efficiency is not as high as that of the VG approach, which implements only K/G pipelines.

Figure 4.14: Total power consumption variation with increasing number of virtual routers.
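The enable-rate accounting described above can be sketched as follows; the full-activity BRAM power figure is a placeholder, and only the 1/K, G/K and 1 enable rates come from the text.

def gated_bram_power(p_bram_full, scheme, K, G=5):
    # With clock gating, only the active pipeline's BRAM toggles, so BRAM dynamic
    # power is scaled by the enable rate: 1/K for VS, G/K for VG, 1 for VM.
    enable_rate = {"VS": 1.0 / K, "VG": G / K, "VM": 1.0}[scheme]
    return p_bram_full * enable_rate

if __name__ == "__main__":
    p_bram_full = 1.2        # W with all pipelines' BRAM active (placeholder)
    for scheme in ("VS", "VG", "VM"):
        print(scheme, round(gated_bram_power(p_bram_full, scheme, K=20), 3), "W")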
Figure 4.15: Power efficiency variation with increasing number of virtual routers.

4.7.3 Chip Floor-planning

In Section 4.7.1, we mentioned that chip floor-planning can be employed to enhance the performance of the pipelined trie-search architectures on FPGA. When the architecture description is provided to the place-and-route tool, it tries to optimize the component placement in order to achieve performance according to the various optimization parameters set. However, it does not necessarily interpret the layout of the architecture and mimic that layout when placing components on the FPGA fabric. For example, pipelined trie-search architectures have a structure in which pipeline stages are arranged in a serial fashion, while inside each pipeline stage a PE and its stage memory work synchronously to carry out the operation assigned to that PE. If the stage memory is located far from the PE, the communication wire lengths increase, causing longer paths and hence longer clock periods. This degrades the performance of the architecture unexpectedly and causes inconsistencies in the obtained results, as pointed out in Section 4.7.1.

Table 4.6: Performance before and after chip floor-planning
Approach    | Before: Clock (MHz) / Throughput (Gbps) / Power eff. (mW/Gbps) | After: Clock (MHz) / Throughput (Gbps) / Power eff. (mW/Gbps) | Speedup
VS          | 105 / 33.6 / 89   | 242 / 77 / 57   | 2.3×
VM α = 20%  | 55 / 17.6 / 157   | 67 / 21 / 135   | 1.2×
VM α = 80%  | 95 / 30.4 / 96    | 121 / 39 / 81   | 1.3×
VG α = 20%  | 112 / 35.8 / 71   | 184 / 58 / 52   | 1.6×
VG α = 80%  | 117 / 37.4 / 65   | 201 / 64 / 43   | 1.7×

However, high-level tools exist with which the user can guide the place-and-route tool's operation by specifying the intended physical layout of the architecture. Simply put, the user can restrict each high-level module of the architecture to a user-defined region of the FPGA fabric, thereby constraining its placement to that particular region rather than letting the tool place components at distant locations. We used the Xilinx PlanAhead tool to perform chip floor-planning for the virtualized router architectures. For each approach, we drew the layout of the pipeline using the PlanAhead tool such that the stage memory of the pipeline is aligned closely with the BRAM columns available on the FPGA. The initial stages, which do not have a high memory requirement, were realized using distributed RAM as stage memory; therefore, the first few stages consumed only logic slices. The larger stages have a higher stage memory demand and were therefore realized using BRAM as stage memory. The corresponding PEs were placed close to the stage memory in order to minimize the routing distance between the stage memories and the processing elements.

In this work, we considered two orientations, vertical and horizontal. In the vertical version, the pipeline was laid along the BRAM columns, and in the horizontal version, the pipeline was laid tangential to the BRAM columns. The experimental results revealed that the horizontal arrangement is better suited for the considered architecture. The main issue with the vertical layout was that towards the last stages of the pipeline, the amount of memory available in a column is limited; therefore, BRAM blocks from other columns had to be connected to form the larger stage memories. This causes the routing between the PE and memory to become complex, degrading the clock frequency.
Therefore, we used the horizontal orientation for our architecture. On the considered device, 9 BRAM columns are available. Since the first few stages are purely logic-slice based, they can be arranged compactly on the FPGA fabric. The stages with larger memory can then be aligned with the 9 BRAM columns. However, since the number of BRAM-backed stages (32 minus the number of distributed-RAM stages) is greater than the number of BRAM columns, the pipeline needs to wrap around. This can be done either in a zig-zag fashion or in a snake-like fashion. The zig-zag approach requires long wires from one end of the chip to the other, which increases routing delays. Hence, the snake-like approach was chosen. This enabled us to compactly arrange the pipeline on the FPGA chip, which reduced routing delays and enhanced the association between PEs and stage memories compared with the case where no chip floor-planning was used.

We used the case of 20 virtual routers for all the approaches. The results we obtained are reported in Table 4.6. It is evident that the use of chip floor-planning yields a significant performance improvement compared with the case in which no floor-planning was used. Also note the improvement achieved with respect to power efficiency. For the same design, by employing chip floor-planning, the power dissipated per unit throughput can be reduced considerably (due to reduced signal routing complexity). Last but not least, this enables us to achieve consistent performance trends, eliminating the occasional anomalies.

4.8 Conclusion

Network virtualization has become an important component in the modern networking context, with datacenter networking and Software Defined Networking (SDN) being a few prominent applications. Router virtualization provides hardware support to realize network virtualization at the router hardware level. In this work, we presented a comprehensive analysis of existing virtualization approaches with respect to memory consumption, throughput, power and power efficiency on Field Programmable Gate Array (FPGA). We generalized the existing Virtualized Separate (VS) and Virtualized Merged (VM) approaches based on their vital features and discussed their best and worst case scenarios. In the analysis, it was shown that the VM approach has poor scalability, especially when the node overlap ratio (α) is low. It was also shown that, with an increasing number of virtual networks, the leaf node memory consumption increases dramatically due to the vector-style storage of Next-Hop Information (NHI). In addition to the performance analysis, we introduced a novel virtualization approach which is a hybrid of the VM and VS approaches, named Virtualized Grouped (VG). In the VG approach, instead of keeping the virtual routers entirely separate or maintaining a single virtualized router, the routing tables are grouped and subsets are formed. Each subset is implemented as a virtualized router and multiple such routers coexist on the same FPGA chip. We analyzed the performance of all three approaches under near-best and near-worst case node overlap ratios and discussed the scalability and performance of each approach. The results indicated that the VM approach has poor scalability in terms of the number of virtual networks that can be supported on a single FPGA chip. While the VS approach has good scalability, we showed that the VG approach is on par with the scalability of the VS approach.
We developed the pipelined IP lookup architectures for all three approaches and reported their post place-and-route performance on a state-of-the-art Xilinx Virtex 7 2000T FPGA. From the analysis, it was clear that the VM approach does not yield high performance due to its higher memory consumption. The VS approach renders high performance; however, since multiple pipelines need to be implemented, the logic power consumption increases significantly, causing lower power efficiency. On the other hand, the VG approach has the highest throughput and the highest power efficiency for high α, and is on par with the VS approach for low α. This superior performance is in part due to its moderate memory consumption and moderate logic resource consumption. In addition, we showed that chip floor-planning techniques can be employed to take control over component placement on the FPGA and effectively reduce the routing delays that can occur if only the place-and-route tool is used. On average, we observed a 1.6× speedup when floor-planning was used. This speedup also boosted the power efficiency of the architecture significantly.

Chapter 5
High Performance IPv6 Forwarding for Backbone Routers

With the exhaustion of the IPv4 address space, IPv6 is gaining popularity and is being widely adopted by ISPs. This chapter introduces a partitioning-based IPv6 lookup engine that is able to harness the processing capabilities of both software and hardware platforms. On software, the abundant thread-level parallelism is exploited to eliminate thread switching, context switching and memory-related delays in order to enhance the throughput. On hardware, the partitioning is used as a technique to improve the power efficiency of the lookup engine. By clock gating the stage memory of the inactive partitions, the power consumption of the lookup architecture is brought down by several orders of magnitude.

5.1 Routing Table Statistics

According to RFC-4291 [29], even though an IPv6 address is 128 bits in length, the prefix portion of it is only 64 bits, except for a few special cases. This is because an IPv6 address also carries the interface ID (64 bits) in the IEEE Extended Unique Identifier (EUI-64) format [29]. This results in 64 bit prefixes rather than 128 bit prefixes for IPv6. The address architecture of an IPv6 prefix is shown in Figure 5.1.

Figure 5.1: IPv6 address architecture (an n-bit subnet prefix followed by a (128 - n)-bit interface ID).

In order to explore the viability of our solution, we first explored the features of real-life IPv6 backbone routing tables. These routing tables were obtained from the RIPE Routing Information Service (RIS) project [50]. The collected routing tables are dated 07/30/2012 and their statistics are given in Table 5.1. As can be seen from Table 5.1, IPv6 prefixes currently constitute just above 2% of the total number of prefixes on average. However, this number is expected to grow significantly as networking companies adopt IPv6 and IPv4 eventually becomes obsolete. In such a scenario, resource requirements and latency issues will become major concerns in packet forwarding engines. Even though the number of IPv4 prefixes is in the range of 430 K on average, after removing the duplicate prefixes, the prefix count becomes roughly 350 K on average. In order to understand the memory consumption, we computed the memory requirement of these routing tables if they were implemented as a uni-bit trie and as a range tree.
The memory computations were carried out as per Eq. 5.1a through Eq. 5.1c. The notation is as follows, with the values used for the computations given in parentheses: N_T is the number of nodes in a trie; W_p is the number of bits per pointer field (15 bits); W_nhi is the number of bits per Next-Hop forwarding Information (NHI) field (6 bits); N_R is the number of ranges produced by the prefixes; and W_ip is the number of bits per prefix (64 bits). For the experiments, we considered a leaf-pushed trie, which essentially reduces the memory footprint of a trie. When converting a regular uni-bit trie to a leaf-pushed trie, the number of nodes increases, since new nodes are created when routing information is pushed down to the leaf level [52]. For the considered IPv6 routing tables, we observed a 1.866× node count inflation on average after leaf-pushing. In a leaf-pushed trie, half of the nodes (non-leaf nodes) contain pointers to child nodes and the other half (leaf nodes) contain NHI.

Table 5.1: Routing table statistics obtained from [50], dated 07/30/2012. The last five columns are IPv6 statistics.
RRC# | # IPv4 prefixes | # IPv6 prefixes | # Trie nodes | Trie (Mb) | # Ranges | Range tree Exp. (Mb) | Range tree Imp. (Mb)
0    | 454736 | 10197 | 133519 | 2.937 | 11209 | 1.524 | 0.807
1    | 415677 | 10113 | 128703 | 2.831 | 11153 | 1.517 | 0.803
2    | 272743 | 1369  | 14687  | 0.323 | 1469  | 0.200 | 0.106
3    | 422181 | 10060 | 124475 | 2.738 | 11157 | 1.517 | 0.803
4    | 425722 | 7378  | 77121  | 1.697 | 10430 | 1.418 | 0.751
5    | 419862 | 10051 | 123799 | 2.723 | 11168 | 1.519 | 0.804
7    | 425915 | 9887  | 113023 | 2.486 | 11072 | 1.506 | 0.797
10   | 416544 | 10055 | 124591 | 2.741 | 11159 | 1.518 | 0.803
11   | 423073 | 9956  | 121523 | 2.673 | 11037 | 1.501 | 0.795
12   | 428063 | 10180 | 135039 | 2.970 | 11190 | 1.522 | 0.806
13   | 431047 | 10032 | 127267 | 2.800 | 11079 | 1.507 | 0.798
14   | 425597 | 10035 | 123673 | 2.720 | 11191 | 1.522 | 0.806
15   | 442222 | 9963  | 112443 | 2.474 | 11141 | 1.515 | 0.802
16   | 407634 | 7826  | 83763  | 1.843 | 8473  | 1.152 | 0.610

For the range tree [78], we consider two approaches, namely explicit and implicit. In both versions, the ranges are arranged similar to a BST (both versions have the same structure), but the search operation differs. In the explicit version, the entire range is stored as lower and upper bounds; the incoming key is compared against both bounds and the traversal decision is made based on the result of the two comparisons. Since the ranges are disjoint, if the input key falls within a particular range, the search can be terminated at that node. This is helpful especially on software platforms, where early termination of the search yields lower packet latency and hence higher throughput. In the implicit version, only the upper bound is stored; the incoming packet is compared against the stored value and the search must proceed to the leaf-node level before terminating. This is useful on hardware platforms where linear pipelines are employed, since the entire pipeline needs to be traversed regardless of an early termination of the search.

M_trie = 1.866 × N_T × (2 × W_p + W_nhi)    (5.1a)
M_rtree-exp = N_R × (2 × W_ip + W_nhi)      (5.1b)
M_rtree-imp = N_R × (W_ip + W_nhi)          (5.1c)

Table 5.1 reveals that the real-life IPv6 routing tables are small; hence, the full effect of a backbone routing table cannot be observed using them. For this reason, and to evaluate the scalability of the proposed solution, we generated large IPv6 routing tables using an extended version of FRuG [16]. With FRuG, the prefix distribution and the structure of the seed routing tables are preserved when generating the synthetic routing tables.
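To make these memory models concrete, the following minimal sketch evaluates Eq. 5.1a through Eq. 5.1c in software. The bit widths are the ones stated above; the node and range counts passed in are illustrative placeholders rather than values taken from Table 5.1.

#include <cstdio>

// Field widths as stated in the text (bits).
const int W_p = 15;    // pointer field
const int W_nhi = 6;   // next-hop information field
const int W_ip = 64;   // IPv6 prefix

// Eq. 5.1a: leaf-pushed uni-bit trie (1.866 = average node-count inflation).
double m_trie_bits(long n_trie_nodes) { return 1.866 * n_trie_nodes * (2 * W_p + W_nhi); }
// Eq. 5.1b: explicit range tree (both bounds stored per range).
double m_rtree_exp_bits(long n_ranges) { return static_cast<double>(n_ranges) * (2 * W_ip + W_nhi); }
// Eq. 5.1c: implicit range tree (only the upper bound stored per range).
double m_rtree_imp_bits(long n_ranges) { return static_cast<double>(n_ranges) * (W_ip + W_nhi); }

int main() {
    long n_t = 100000, n_r = 10000;  // hypothetical counts, for illustration only
    std::printf("trie: %.2f Mbit, explicit: %.2f Mbit, implicit: %.2f Mbit\n",
                m_trie_bits(n_t) / 1e6, m_rtree_exp_bits(n_r) / 1e6, m_rtree_imp_bits(n_r) / 1e6);
    return 0;
}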
We show the normalized prefix length distribution of the real and synthetic routing tables generated using the RRC00 IPv6 routing table in Figure 5.2; the synthetic routing table has almost the same prefix length distribution as the real routing table. However, it should be noted that for the range tree, the worst case is that, if a particular routing table has N prefixes, the number of ranges that can exist is bounded by 2N - 1. We evaluated this empirically and found that the number of ranges per prefix is, on average, 1.14 for the real routing tables and 0.84 for the synthetic routing tables (a 350 K routing table generated from each real routing table). The reason for the lower value for the synthetic routing tables is the increased amount of prefix overlap.

Figure 5.2: Normalized prefix length distribution for real and synthetic IPv6 routing tables generated using the RRC00 backbone routing table.

The main use of the synthetic routing tables is to highlight the benefits of using a range tree as opposed to a trie. The memory requirement of a trie can increase significantly with an increasing number of prefixes, as the trie size grows dramatically depending on the prefix distribution of the routing table. This is in part due to the increased prefix length: since an IPv6 prefix is 64 bits in length, the effect of a single prefix is much greater than that of an IPv4 prefix. Further, the doubling of packet lookup latency also becomes a concern. In order to evaluate this quantitatively, we generated synthetic routing tables of 350 K prefixes for each real routing table and observed the memory consumption of each approach. The average values indicate that a uni-bit trie requires 29.19 Mbit, an explicit range tree requires 39.77 Mbit and an implicit range tree requires 21.05 Mbit. Note that all memory computations follow Eq. 5.1a through Eq. 5.1c and the aforementioned bit lengths.

5.2 Range Tree-based IPv6 Lookup Approach

Tree-based solutions are elegant in terms of the number of memory accesses required to perform a lookup, which is O(log N), where N is the number of keys/ranges [78]. In this section, we explain our approach for the IPv6 lookup engine that we propose for both hardware and software platforms. First we present the details of our solution and discuss how software and hardware platforms can benefit from such techniques.

5.2.1 Enabling Parallelism for IP lookup

Regardless of the platform and even the application, considering current trends, enabling parallelism is essential for high performance computing. IP lookup is not a compute-bound but a memory-bound operation. In such cases, parallelism is critical to exploit the multiple processing cores available on today's General Purpose Processors (GPPs). However, it is not straightforward to parallelize the IP lookup operation in an efficient manner on such platforms.

The most basic method of parallelization can be thought of as search tree duplication, in which case all partitions possess the entire routing table information and a lookup can be carried out independently of other packet lookups. However, this causes the memory required to store the search tree to increase proportionally to the number of partitions.
While this is not a feasible solution on hardware platforms in most cases (due to limited on-chip memory), on software platforms this translates to increased lookup time, since the cache memory will not be sufficient to store the duplicated search trees, especially for large routing tables. Also, partitioning a routing table in such a way that all partitions need to be searched in order to find the LPM is not efficient from a power consumption point of view [27, 42].

Another approach is to partition the routing table. A simplistic method is to place the prefixes in a set of buckets, randomly or sequentially as they appear in the routing table. This requires all the partitions to be searched whenever a packet requests a lookup. Even though such solutions are viable, they are not desirable on either hardware or software platforms, mainly due to the cost of a single lookup operation. Further, this hinders any power optimizations that could be applied, as we show later.

A more attractive method is to place the prefixes in a set of bins in such a way that the prefixes in one bin are disjoint from those in other bins. In the context of our problem, IP lookup, being disjoint means being able to search only one bin and find the corresponding routing information without consulting the other bins. Since packets correspond to different prefixes in a given trace, the more partitions that are present, the more likely packets are to be processed independently and in a parallel fashion. This provides opportunities for parallelism. However, it is imperative that the formed partitions are of similar sizes. Otherwise, it gives rise to various other issues such as fairness (some packets experiencing longer latency than others) and uneven memory distribution.

There are multiple ways to partition a routing table in this manner. One way is to consider the most discriminating bits of the prefixes in the routing table and partition based on them. The effectiveness of such a scheme highly depends on the bit-level features of a routing table, and it might become difficult to find such a partitioning scheme, especially when the routing table size increases. Also, the formed partitions may be of different sizes, which may affect the performance of the lookup scheme due to memory distribution imbalance.

5.2.2 Disjoint Grouping of Prefixes

In this work, we consider partitioning using the initial bits of the prefixes. Partitioning on the initial bits essentially divides the address space into multiple disjoint sections and treats the subset of prefixes belonging to each section as a smaller routing table. For example, if p bits are used, there can be as many as O(2^p) partitions. Depending on the prefix distribution of a given routing table, there may not be 2^p partitions, and some partitions may be very small while others are very large. Also, when partitioning this way, the prefixes with length less than p need to be expanded. Our analysis of real routing tables indicates that prefixes of length 25 and shorter constitute only 1% of the routing table. Hence, the effect of shorter prefixes is negligible.

The core benefits of this approach are threefold:

Easy identification of the corresponding partition: The only operation required to identify the partition a packet belongs to is to inspect the first few bits used for partitioning. This can be easily done using a lookup table with O(1) time complexity.
Balanced partitioning: Even though the initial partitioning may not be balanced, it is relatively easy to form balanced partitions by combining initial partitions, as we demonstrate later.

Smaller memory footprint: Since the initial bits are used to identify the partition, those bits are not relevant for the packet lookup process thereafter. Hence, a prefix (or a range) may be stored using less than 64 bits. Even though this is not particularly useful on byte-oriented systems (e.g., GPPs), on hardware platforms such features can be exploited.

5.2.3 Algorithm and Partitioning

In order to demonstrate the results of the considered partitioning scheme, we use the real [50] and synthetic RRC00 routing tables (since real RRC00 has the largest IPv6 routing table). Moreover, we consider partitioning of both real and synthetic routing tables to show that synthetic routing tables are not biased in any way when it comes to prefix distribution. As mentioned in Section 5.2.2, we use the initial few bits of the prefix and partition the routing table into subsets by grouping the prefixes with the same initial bits into a partition.

Figure 5.3: Partitioning of real (approx. 10 K) and synthetic (approx. 350 K) routing tables with and without the use of Algorithm 2. Figures 5.3a, 5.3c and 5.3e show the partitioning of the real routing table and Figures 5.3b, 5.3d and 5.3f show the partitioning of the synthetic routing table. Panels: (a) p = 10, n_i = 7, n_a = 4; (b) p = 10, n_i = 24, n_a = 5; (c) p = 15, n_i = 25, n_a = 4; (d) p = 15, n_i = 336, n_a = 12; (e) p = 20, n_i = 294, n_a = 6; (f) p = 20, n_i = 4854, n_a = 94. Note that the upper X-axis corresponds to aggregated partition numbers and the lower X-axis corresponds to the initial partition numbers. Partitions are sorted based on the number of prefixes contained.

Algorithm 2 Aggregated partitioning
1: procedure AGGREGATE(initial, aggregate, maxsize)
2:   initial.sort(descending count)
3:   bool done ← false
4:   while true do
5:     done ← true
6:     for ipart in initial.begin to initial.end do
7:       if !(ipart.taken) then
8:         partition newpart
9:         newpart.members ← ipart.members
10:        newpart.size = ipart.members.size
11:        ipart.taken ← true
12:        done ← false
13:        break
14:    if done then
15:      break
16:    for epart in initial.end to initial.begin do
17:      if !epart.taken then
18:        if newpart.size + epart.size <= maxsize then
19:          newpart.members ← newpart.members ∪ epart.members
20:          newpart.size += epart.members.size
21:          epart.taken ← true
22:        else
23:          break
24:    aggregate.push_back(newpart)
25: return aggregate

Although this works as a partitioning scheme, the resulting partitions can be of diverse sizes (i.e., number of prefixes), which causes memory imbalance across partitions as well as varying lookup latencies for packets belonging to different partitions. While a perfectly balanced partitioning might not be possible, depending on the prefix distribution, the variance in partition size can be reduced by augmenting the process. For this, we devised Algorithm 2 to reduce the variance in partition size.
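A compact software sketch of this two-step partitioning (grouping by the first p bits as in Section 5.2.2, then aggregating small groups in the spirit of Algorithm 2) is given below. The data layout and the maxsize policy are simplifying assumptions for illustration, not the exact implementation used in our experiments.

#include <algorithm>
#include <cstdint>
#include <map>
#include <vector>

struct Partition {
    std::vector<uint64_t> members;  // 64-bit prefix portions assigned to this partition
    bool taken = false;
};

// Step 1: group prefixes by their first p bits (Section 5.2.2).
std::vector<Partition> initial_partition(const std::vector<uint64_t>& prefixes, int p) {
    std::map<uint64_t, Partition> buckets;
    for (uint64_t pfx : prefixes) buckets[pfx >> (64 - p)].members.push_back(pfx);
    std::vector<Partition> initial;
    for (auto& kv : buckets) initial.push_back(std::move(kv.second));
    return initial;
}

// Step 2: aggregate initial partitions into similarly sized ones (Algorithm 2).
// Each aggregate is seeded with the largest unused partition and then filled with
// the smallest unused partitions while its size stays within maxsize.
std::vector<Partition> aggregate(std::vector<Partition> initial, std::size_t maxsize) {
    std::sort(initial.begin(), initial.end(), [](const Partition& a, const Partition& b) {
        return a.members.size() > b.members.size();
    });
    std::vector<Partition> out;
    while (true) {
        Partition agg;
        bool seeded = false;
        for (auto& ip : initial) {           // largest remaining partition seeds the aggregate
            if (!ip.taken) { agg.members = ip.members; ip.taken = true; seeded = true; break; }
        }
        if (!seeded) break;                  // all initial partitions consumed
        for (auto it = initial.rbegin(); it != initial.rend(); ++it) {  // smallest first
            if (it->taken) continue;
            if (agg.members.size() + it->members.size() > maxsize) break;
            agg.members.insert(agg.members.end(), it->members.begin(), it->members.end());
            it->taken = true;
        }
        out.push_back(std::move(agg));
    }
    return out;
}

In the experiments reported here, maxsize was set to the size of the largest initial partition, so the tuning knobs are p and maxsize, exactly as discussed for Figure 5.3.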
Figure 5.3 shows the effect of using Algorithm 2. The number of bits used for partitioning is denoted by p, and n_i and n_a stand for the initial and aggregated number of partitions, respectively. For the results shown in Figure 5.3, we varied p and set maxsize to the largest initial partition size of each scenario. One can use the tuning parameters p and maxsize to adjust the number of aggregated partitions (n_a) created.

Figure 5.3 shows that the aggregated partitions have nearly the same size, except for the last few partitions. This is due to the way the partitioning algorithm operates. As can be seen from Algorithm 2, the remaining smallest partitions are aggregated to form an aggregated partition such that its size satisfies the size requirement. The size requirement can be changed in order to vary the number of aggregated partitions generated. For the partitioning in Figure 5.3, we used maxsize as the size requirement. Hence, whenever the aggregated partition size reaches maxsize, the algorithm completes the generation of one aggregated partition. Towards the end, the last few partitions do not have enough smaller partitions left to reach sizes in the range of the maximum size; therefore, the algorithm creates a few smaller partitions. This variation ultimately translates to a difference in range tree height, and in our experiments we observed a maximum difference of two levels. For example, in the synthetic routing table case, when p = 15, the largest aggregated partition size is 32124 prefixes and the smallest is 10373, a difference of two tree levels. Hence, the overall effect of the partition size difference can be tolerated to some extent, as long as the packet latency for different partitions does not deteriorate significantly.

5.3 IPv6 Lookup Architecture

5.3.1 Hardware Architecture

The basic building block of the hardware architecture is a linear pipeline. Each stage consists of a Processing Element (PE), a stage memory and stage registers. We use the
We use Xilinx PlanAhead tool to draw the pipeline layout on the chip. Further, we exploit the various on-chip memory resources available on FPGA for higher scalability (i.e. larger routing table size) and performance. The first few stages of the range tree and the Bit Lookup Table (BLT) requires small amounts of memory, there- fore they can be realized using distributed Random Access Memory (RAM), which are at a finer granularity than block RAM. This improves the utilization of on-chip memory on FPGA by minimizing the unused memory space. 115 PE 0-0 Mem 0-0 PE 0-1 Mem 0-1 PE 0-n0 Mem 0-n0 PE 1-0 Mem 1-0 PE 1-1 Mem 1-1 PE 1-n1 Mem 1-n1 Bit Lookup Table (BLT) Val Ptr To PE 1-0 From BLT Figure 5.4: Pipelined IPv6 lookup architecture on FPGA. The two pipelines shown are aligned in such a way that the stage memory of the range trees are aligned along BRAM columns on FPGA for improved resource usage. 5.3.2 Software Architecture The initial portion of the lookup is the partition identification. This can be simply real- ized using an initial lookup table. For example, if the firstp bits were used for partition- ing, then the lookup table of size 2 p will hold the sub-tree pointer information for each partition. Multiple entries may point to the same sub-tree pointer due to the aggrega- tion step. With this approach, the partition search complexity becomesO(1) which is desirable for high-speed operation. For the main lookup engine that performs range tree search, we adopt a master- worker architecture. After the partition search (described previously), the packet is for- warded to the corresponding master thread, denotedM i , wherei is the corresponding aggregated partition index. Each M i creates a set of worker threads, denoted W ij , where j is the worker thread number. The architecture is flexible in that, the number 116 of worker threads created can be controlled. The overall architecture is depicted in Fig- ure 5.5. On multi-core platforms, it is desirable to have more partitions (i.e. more master threads) which provides more opportunity for parallelism. Further, when the number of partitions is higher, the number of prefixes (hence the height of the range tree) per partition decreases. This effectively reduces packet lookup latency. Also, the explicit range tree approach is more suited for the software engine since the search can be termi- nated when the corresponding node is found. This comes at the cost of higher memory consumption. However, on GPPs, this is a minor concern. M 0 M K-1 W 0 -0 W 0 -k 1 W K-1 -0 W K-1 -k K-1 Bit Lookup Table (BLT) Val Ptr K partitions Partition search Master Threads Worker Threads Figure 5.5: Hierarchical multi-threaded architecture of the proposed IPv6 lookup engine. 117 5.4 Lookup Engine Performance 5.4.1 Hardware Architecture Throughput Here we report the performance of the architecture presented in Figure 5.4. The pro- posed architecture was implemented on a Xilinx Virtex 7 X1140V FPGA and the post place-and-route results are reported here. We used both distributed and block RAM available on FPGA to facilitate higher scalability. The considered chip has 84 Mbit total memory available. The proposed solution scaled up to a 1 million entry IPv6 rout- ing tables while sustaining high lookup rates. We exploited the dual-ported feature of both distributed and block RAM on FPGA to enhance the throughput by processing two packet headers at a time by a single pipeline stage (via stage memory sharing). 
From Figure 5.6, it can be seen that the proposed architecture can operate at high lookup rates, leading to throughputs of 200+ Gbps per pipeline for minimum size (64 byte) packets. The decrease in performance is due to the larger stage memory when using larger routing tables. With increasing pipeline depth, the stage memory size increases exponentially. In order to generate large stage memory blocks on the FPGA, several BRAM blocks need to be combined. Due to the column-wise arrangement of BRAM on the FPGA, creating larger stage memories introduces penalties in access time. The effect of this can be seen in Figure 5.6. For these experiments, we set the number of partitions to 2 by setting the value of p appropriately. This allowed us to store a (64 - p)-bit key as opposed to a 64-bit key, which resulted in savings in memory footprint. Our experiments indicated that on the considered FPGA device, the proposed solution is able to scale up to 1 million entry routing tables while sustaining 200+ Gbps throughput rates.

Figure 5.6: Throughput and memory footprint of the hardware lookup engine for increasing routing table size. The dotted line denotes the maximum on-chip memory available on the Virtex 7 X1140T FPGA.

Power

Further, due to the disjoint grouping of the prefixes of the routing table, we were able to limit the power consumption of the architecture significantly. For the hardware approach, we limited the number of partitions used, as the gains achieved by increasing the number of partitions have diminishing returns, and when a larger number of bits is used for the initial partitioning of the routing table, the BLT becomes significantly larger, causing excessive memory consumption. Since only one partition needs to be searched, the dynamic power consumption of the architecture is significantly reduced compared with searching the full tree [42]. In Figure 5.7, we report the dynamic power consumption, which includes the power consumed by logic, memory and routing [10], measured using the Xilinx XPower Analyzer tool.

The lookup rate of the proposed IPv6 lookup engine is on par with that of [3, 42]; however, the power savings achieved in this work via partitioning are not possible with such solutions. Further, the scalability of the proposed solution is higher than that of both [3, 42]. For example, our architecture is able to support up to 1 million entry IPv6 routing tables, which is not possible with the two solutions in comparison without the use of large external memory.

Figure 5.7: Power consumption of the IPv6 lookup engine for increasing routing table size.

Figure 5.8: Power and memory requirement variation with increasing number of partitions.

Benefits of Partitioning

In order to assess the benefits of the disjoint and balanced partitioning, we used a 512 K entry routing table and evaluated performance for different numbers of partitions. Figure 5.8 summarizes the results of our experiments. As can be seen, the power efficiency, measured in power per unit throughput, improves with an increasing number of partitions.
This is due to the selective enabling of lookup pipelines, which is facilitated by the disjoint grouping of prefixes. Compared with a state-of-the-art TCAM [74] operating at 360 MHz and consuming 2 W/Mb, our calculations indicate that the proposed solution is 50× more power efficient on average.

The tapering off of the power savings is due to the increased routing and logic power with the increasing number of partitions. The tradeoff is between the increased logic and routing power, and the power savings from selective memory enabling. When more partitions are instantiated, the amount of non-memory resources (logic, interconnect, registers, etc.) that needs to be clocked increases. Even though memory power is dominant when the number of partitions is small, for a higher number of partitions the dynamic power consumed by non-memory resources becomes on par with the memory power consumption [10, 57]. Hence, the power benefits achieved by partitioning fade away. In Figure 5.8 we report the dynamic power efficiency, which is the amount of dynamic power dissipated per unit throughput.

Another aspect of partitioning is the memory savings. Since the first p bits of the prefixes are already processed by the initial bit lookup table (BLT), those bits need not be stored in the range tree nodes. Due to this, the memory requirement decreases as the number of partitions increases, since more bits are required to form larger numbers of partitions. In our experiments, the value of p was 0, 6, 10 and 14 for 1, 2, 4 and 8 partitions, respectively, and the resulting decrease in memory requirement is highlighted in Figure 5.8.

5.4.2 Software Lookup Engine

We use three state-of-the-art platforms to evaluate the proposed solution: the FPGA platform described above for the hardware architecture, and two multi-core platforms for the software architecture. For the software portion, we use a 2× AMD Opteron 6278 (2.4 GHz, 16 cores each) with 16 MB L2 cache, 16 MB L3 cache and 128 GB DDR3 main memory, and a 2× AMD Opteron 6220 (3.0 GHz, 8 cores each) with 8 MB L2 cache, 16 MB L3 cache and 64 GB DDR3 main memory, which we name 32C and 16C, respectively. We use two platforms to highlight the effect of a higher number of cores versus a higher clock frequency on the lookup engine performance.

For performance evaluation, we used synthetic routing tables generated using [16]. As for the packet traces, in order to evaluate the random access performance, we generated traces from the disjoint ranges produced from the routing tables in a uniformly random fashion. No temporal locality was assumed. If the number of generated ranges is N_R, then the probability of an IPv6 address being generated from a specific range is 1/N_R. The average tree depth traversed by a packet for a given trace can be calculated as Σ_{k=0}^{k_1} k · 2^k / (2^(k_1+1) - 1) ≈ k_1 - 1 for a complete tree with k_1 levels. This ensures that most of the generated IPv6 addresses correspond to a leaf or near-leaf level of the tree, which simulates a near worst case scenario. However, in a realistic environment, packets may terminate the search at intermediate levels, in which case the performance will be better than what is reported in this work. In this research, we evaluate the lookup portion of the software engine, i.e., the range tree lookup. Since the initial partition search takes O(1) time and the Bit-Lookup Table (BLT) is considerably smaller than the search trees, this operation can be performed within a few clock cycles.
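For illustration, the software lookup path just outlined, an O(1) BLT probe on the first p bits followed by an explicit range-tree search with early termination, can be sketched as follows. The structure layout and field names here are illustrative assumptions, not the exact implementation.

#include <cstdint>
#include <vector>

// Explicit range-tree node: both bounds are stored, so the search can stop
// as soon as the key falls inside a node's range (Section 5.1).
struct RangeNode {
    uint64_t low, high;        // disjoint range [low, high] over the 64-bit prefix space
    int nhi;                   // next-hop information
    int left = -1, right = -1; // child indices, -1 if absent
};

struct TreePartition {
    std::vector<RangeNode> tree; // BST over disjoint ranges, root at index 0
};

// BLT: indexed by the first p bits of the address, holds the partition (sub-tree) id.
struct Engine {
    int p;
    std::vector<int> blt;                  // size 2^p; many entries may share a partition
    std::vector<TreePartition> partitions; // one explicit range tree per aggregated partition
};

// Explicit search: compare against both bounds, terminate early on a hit.
int range_tree_lookup(const TreePartition& part, uint64_t key) {
    int idx = 0;
    while (idx != -1) {
        const RangeNode& n = part.tree[idx];
        if (key < n.low)       idx = n.left;
        else if (key > n.high) idx = n.right;
        else                   return n.nhi;  // key inside [low, high]: done
    }
    return -1; // no matching range
}

int lookup(const Engine& e, uint64_t addr64) {
    int part_id = e.blt[addr64 >> (64 - e.p)]; // O(1) partition identification
    return range_tree_lookup(e.partitions[part_id], addr64);
}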
Figure 5.9: Performance of the software lookup engine for varying IPPT values: (a) IPPT = 10000, (b) IPPT = 100000, (c) IPPT = 1000000.

Performance of Master-Worker Architecture

First, we examined the performance of the lookup engine for a fixed trace size. Note that for this experiment we assume the packet trace is evenly distributed across the partitions. Even though this simulates a best case scenario, later we discuss the worst case performance for the case in which all packets are directed to a single partition. We considered a 10 million IPv6 address trace and observed the performance variation for an increasing number of partitions. In this experiment, we controlled the number of packets each worker thread handles in order to observe its effect. We name this parameter IPs Per Thread (IPPT). Figure 5.9 illustrates the results of these experiments. We calculate Threads Per Partition (TPP) as the partition trace size divided by IPPT. The key observation is that the lookup rate is best when TPP is close to 1. This variation is mainly due to thread creation overhead. Since a new worker thread is created for each trace subset of IPPT IPs, the total number of threads created is high for low IPPT. For the case where TPP < 1 and the number of partitions is high, more master threads are created with low load. Both of these scenarios cause the performance to degrade, suppressing the gains achieved by enhanced parallelism and reduced search tree height. We keep the discussion of Figure 5.9 short due to space limitations. Instead, we describe how such thread creation overheads can be eliminated by modifying the proposed software architecture.

Performance of Master-only Architecture

The master-worker architecture requires dynamic thread creation and buffer management, which adversely affects the lookup rate. We conducted experiments for the case where the master thread itself takes care of the IP lookup process instead of creating worker threads. The performance values obtained are shown in Figure 5.10. We report performance for two scenarios: 1) when all masters are fully utilized (best case, parallel) and 2) when only one master thread is utilized (worst case, serial). For both cases, the trace size was kept the same. As expected, performance initially increases with an increasing number of partitions and then declines due to increased context switching overhead.

Figure 5.10: Best and worst case performance of the master-thread-only approach.
The 32C outperforms the 16C in the best case due to its higher core count; however, in the worst case, the higher clock frequency of the 16C gives it the advantage over the 32C.

The memory consumption of the proposed solution is 3× lower than that of [28], and the lookup rate is nearly 5× higher than that of both the [28] and [78] solutions, even after scaling for the technology gap. Even though the comparison is not detailed here due to space limitations, it is evident that the proposed solution outperforms the existing literature by fair margins. Compared with the IPv4 lookup engine proposed in [75], our IPv6 lookup engine delivers similar performance despite the increased lookup complexity and storage requirements.

In order to evaluate the performance for larger routing tables, we generated routing tables of various sizes that represent current backbone routing table sizes and beyond, and observed the performance. The results are shown in Figure 5.11. For this experiment, we fixed the number of partitions at 128, since 128 partitions yielded the best performance in Figure 5.10. The performance variation shown in Figure 5.11 is expected due to the larger routing table size, which translates to increased tree depth. Also, when considering the best-case performance curves, the higher core count of the 32C platform yields a higher lookup rate due to the higher parallelism available. However, when the routing table size increases, due to the size increase of each sub-tree, the 16C platform with its higher clock frequency outperforms the 32C platform (faster memory operations and context switching). In the worst case performance scenario, due to its higher clock frequency, the 16C platform delivers high performance. Note that for these experiments, we ensured that the total L3 cache size of the processor was exceeded, to observe the performance variation even when a portion of the routing table resides in main memory.

Figure 5.11: Scalability of the software IP lookup engine.

The previous experiments do not include the initial lookup portion of the lookup engine. Figure 5.13 illustrates the performance after integrating the initial lookup. The architecture adopted in these experiments is shown in Figure 5.12. The traffic from different network interfaces is buffered in corresponding buffers, and the traffic corresponding to each partition is identified by performing the initial lookup. The number of buffers is equal to the number of interfaces of the router; hence, an 8 bit interface ID yields 64 network interfaces. Since the buffers pertaining to each partition are also used by the range tree lookup, they need to be accessed in a mutually exclusive manner. For this, we implemented a producer-consumer type buffer using mutex locks. The trend seen in Figure 5.11 is present in Figure 5.13; however, the performance is lower due to the shared buffer effect. Nevertheless, even with the initial lookup, the proposed lookup engine is able to perform IPv6 lookup at 100+ Gbps rates.

Figure 5.12: Architecture with integrated initial lookup and master-only approach.

Figure 5.13: Performance of the multi-core IPv6 engine with initial lookup integrated.

5.5 Conclusion

In this work, we proposed a range tree based solution for IPv6 lookup that is suited for both software and hardware platforms.
We devised a tunable partitioning algorithm that forms a set of disjoint subsets of prefixes (of similar size), given a routing table. This enabled us to enhance parallelism on software platforms and to improve power and memory efficiency on hardware platforms. The solution was tested on state-of-the-art software and hardware platforms, and the experimental results revealed that the proposed solution is able to operate at 100+ and 200+ Gbps throughputs on software and hardware platforms, respectively, for a 1 million entry IPv6 backbone routing table.

In the case of the hardware solution, as the experimental results suggest, the benefits of partitioning fade away with an increasing number of partitions. The reason for this trend is that, even though selective enabling can be used to turn off unused stage memory blocks, processing elements cannot be turned off in a similar fashion. Hence, as multiple pipelines are added, the dynamic power consumption decreases initially and then starts to increase. This trade-off must be given careful consideration when deciding the number of partitions, especially for hardware platforms.

The software solution yielded superior performance without the initial lookup, and the performance decreased significantly with the integration of the initial lookup. This was due to the mutex-protected buffers used for each partition. Locking and unlocking the buffers introduces further delays into the IP lookup process, which in turn decreases throughput. Efforts can be made to enhance the performance of the mutex buffers or to eliminate the need for them completely, if possible. This would greatly improve the performance of the proposed software IPv6 lookup engine.

Chapter 6
Ruleset-Feature Independent Packet Classification

All existing packet classification solutions rely on one or more properties of the ruleset in order to improve the performance of the classifier engine. However, due to the diversity of applications in which packet classification is used, such features may not be present in all rulesets, which makes ruleset-feature dependent solutions less robust. In this chapter, to the best of our knowledge, we introduce the first ruleset-feature independent packet classification solution, which yields deterministic performance for any given ruleset. We present a modular architecture that delivers high performance, is suitable for state-of-the-art line-card operation, and can be a potential alternative to TCAMs for packet classification.

6.1 Motivation and Algorithm

We develop our complete solution in two phases: 1) the StrideBV algorithm, and 2) the integration of range search and the modularization of the StrideBV architecture. These two phases are discussed in detail in this section. First, we describe the usage of bit-vectors in the domain of packet classification.

6.1.1 Bit-Vector Based Packet Classification

In the basic BV approach, a ruleset of size N is represented in the form of an N-bit BV. Each bit-element in the vector represents the status of one rule, i.e., match or no-match (1 or 0, respectively), at a given point in the classification process. The BV approach can be used at multiple levels of the matching process. For example, it can be used to indicate the partial results of each individual field search. Aggregating the partial results is merely a matter of logically ANDing the BVs together to identify the potential match. BV has the capability of reporting multiple matches as well as reporting the highest priority match via priority encoding.
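As a simple software illustration of this merging step (not the hardware pipeline itself), the per-field bit-vectors can be ANDed together and the highest-priority match extracted as the index of the first set bit. The 64-rule word size below is an assumption made purely for brevity.

#include <cstdint>
#include <vector>

// One 64-bit word per field result: bit i is set iff rule i matched that field.
// Rules are assumed to be stored in decreasing priority order (rule 0 = highest).
int highest_priority_match(const std::vector<uint64_t>& field_bvs) {
    uint64_t match = ~0ULL;             // all rules match initially
    for (uint64_t bv : field_bvs)
        match &= bv;                    // aggregate the per-field partial results
    if (match == 0) return -1;          // no rule matched
    return __builtin_ctzll(match);      // lowest set bit = highest-priority rule (GCC/Clang builtin)
}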
The simplicity of the merging process makes the BV approach an attractive solution, although with increasing ruleset size, storing and retrieving the bit-vectors from memory causes the memory access delays to increase significantly, causing the throughput to degrade. In this work, we propose methods to circumvent such scalability limitations via priority-based ruleset partitioning.

6.1.2 StrideBV Algorithm

In [31], an algorithm named Field Split Bit-Vector (FSBV) was introduced to perform each individual field search as a chain of sub-field searches, in the form of bit-vectors. The motivation was to improve the memory efficiency of the packet classification engine by exploiting various ruleset features. Even though the FSBV algorithm itself does not rely on ruleset features, the architecture proposed in [31] relies on the features of the ruleset. However, as mentioned above, such ruleset features may not be present in all classifiers, and the lack thereof can potentially yield poor performance.

Considering the diverse nature of packet classification rulesets [63, 64], a robust solution that is able to guarantee the performance for any classifier is in demand. The FSBV algorithm proposed in [31] has the characteristic of being feature independent. However, the algorithm was applied only to a set of selected fields, with memory optimization being the major goal. This caused the overall solution to be ruleset-feature reliant. In this work, we generalize the FSBV algorithm and extend its use to build a ruleset-feature independent packet classification solution, named StrideBV.

In the original FSBV algorithm, each W-bit field was split into W 1-bit sub-fields. However, it is possible to generalize the sub-field bit length without affecting the operation of the packet classification process. In the proposed solution, a sub-field length of k bits is considered, and we refer to such a sub-field as a stride. For this reason, and because the underlying data structure is the bit-vector, the proposed solution is named StrideBV. We discuss the StrideBV algorithm in detail here.

Each k bits of the W-bit rule can perform independent matching on the corresponding k bits of an input packet header. In other words, we can divide the W-bit rule into W/k k-bit sub-fields. Each sub-field corresponds to 2^k N-bit-vectors: an N-bit-vector for each permutation of the k bits. A wildcard (*) value in a ternary string is mapped to all bit-vectors. To match an input header with this W-bit rule, each k bits of the header of the input packet will access the corresponding memory, whose depth is 2^k (the depth of a memory is defined as the number of entries in it; the width of a memory is defined as the number of bits in each entry), and return one N-bit-vector. Each such k-bit header lookup will produce an N-bit-vector, and these are bitwise ANDed together in a pipelined fashion to arrive at the matching results for the entire classifier. The memory requirement of the StrideBV approach is fixed for a classifier that can be represented in ternary string format (i.e., a string of 0, 1 and *) and amounts to 2^k × N × W/k bits. The search time is O(W/k), since a bitwise AND operation can be done in O(1) time in hardware and O(W/k) such operations need to be performed to complete the classification process. Figure 6.1 shows an example of applying the StrideBV algorithm for matching a 4-bit packet header against a 4-bit ruleset. For simplicity and clarity, we show the BV generation and the lookup process for k = 1.
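A small behavioral sketch of StrideBV for a general stride size k is given below; it is a software model of the per-stride tables and the AND chain, not the pipelined hardware, and the ternary-string rule representation and 64-rule limit are simplifying assumptions.

#include <cstdint>
#include <string>
#include <vector>

// Behavioral StrideBV model. Rules are W-bit ternary strings ('0', '1', '*'),
// most significant bit first. table[i][j] holds the N-bit vector for stride i
// and stride value j: bit n is set iff rule n's i-th stride matches value j.
struct StrideBV {
    int W, k;
    std::vector<std::vector<uint64_t>> table;  // supports up to 64 rules, for brevity

    StrideBV(const std::vector<std::string>& rules, int stride)
        : W(static_cast<int>(rules[0].size())), k(stride),
          table(W / stride, std::vector<uint64_t>(1u << stride, 0)) {
        for (std::size_t n = 0; n < rules.size(); ++n)
            for (int i = 0; i < W / k; ++i)
                for (unsigned j = 0; j < (1u << k); ++j) {
                    bool match = true;
                    for (int b = 0; b < k; ++b) {          // compare the k ternary bits
                        char t = rules[n][i * k + b];
                        unsigned hdr_bit = (j >> (k - 1 - b)) & 1u;
                        if (t != '*' && static_cast<unsigned>(t - '0') != hdr_bit) match = false;
                    }
                    if (match) table[i][j] |= (1ULL << n);
                }
    }

    // Classification: each k header bits address one table; the results are ANDed.
    uint64_t classify(const std::string& header) const {
        uint64_t v = ~0ULL;                                // all rules match initially
        for (int i = 0; i < W / k; ++i) {
            unsigned j = 0;
            for (int b = 0; b < k; ++b) j = (j << 1) | static_cast<unsigned>(header[i * k + b] - '0');
            v &= table[i][j];                              // bitwise AND with the stride's bit-vector
        }
        return v;                                          // bit n set means rule n matches
    }
};

For the four rules of Figure 6.1 (1001, 101*, 0100, 1*10) and the header 1011, only the second rule matches, so only bit 1 of the returned vector is set.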
One caveat with StrideBV is that it cannot handle arbitrary ranges in an efficient manner. The two port fields (SP and DP) may be represented as arbitrary ranges. For example, when a certain rule needs to filter the incoming traffic belonging to one of the well-known port numbers, the rule will have 0-1023 in its DP field. In such cases, the arbitrary ranges need to be converted into prefix format, where the value of each bit can be 0, 1, or * (the wildcard character). In the following analysis, we assume a port field of a rule can be represented as a ternary string, and later, in Section 6.1.3, we demonstrate how to eliminate this requirement and represent arbitrary ranges in a memory efficient manner for improved scalability.

The formal algorithms for building the bit-vectors and for performing packet classification are shown in Algorithms 3 and 4, respectively. The Permutations function accepts the k bits of the rule and generates a 2-dimensional array of size 2^k (height) × k (width), in which each row specifies the bit-vector values for the considered stride. For example, consider a stride size of k = 2 and a rule value of 0* for a particular stride. In this case, the array output by the Permutations function will be {11, 11, 01, 01}, which is the match result of {00, 01, 10, 11} against the rule value 0*. The value at the j-th position of the array indicates the bit-vector value for the considered stride when the input packet header has the value of the binary representation of j.

Figure 6.1: FSBV bit-vector generation and header processing example (rules R1 = 1001, R2 = 101*, R3 = 0100, R4 = 1*10; field value f = 1011).

Algorithm 3 Bit-vector Generation
Require: N rules, each represented as a W-bit ternary string: R_n = T_{n,W-1} T_{n,W-2} ... T_{n,0}, n = 0, 1, ..., N-1
Ensure: 2^k × W/k N-bit-vectors:
1: V_{i,j} = B_{i,j,N-1} B_{i,j,N-2} ... B_{i,j,0},
2:   i = 0, 1, ..., W/k - 1, and j = 0, 1, ..., 2^k - 1
3: Initialization: V_{i,j} ← 00...0 for all i, j
4: for n ← 0 to N - 1 do            (process R_n)
5:   for i ← 0 to W/k - 1 do
6:     S[2^k][k] = Permutations(R_n[ik : (i+1)k])
7:     for j ← 0 to 2^k - 1 do
8:       V_{i,j}[ik : (i+1)k] ← S[j]

Variants of StrideBV

We discuss two possible implementations of StrideBV. They differ in the way the bit-vectors are stored in a particular stage. The two storage options are:

1. Store the bit-vectors corresponding to the 2^k combinations of the k-bit stride in a single memory block and load a single bit-vector per stage.

2. Store 2k bit-vectors corresponding to the individual bits of the k-bit stride, load k bit-vectors per stage and perform a bitwise AND.

Table 6.1 compares the characteristics of the two approaches. The first method consumes more memory while reducing memory bandwidth, and the second method saves memory at the cost of memory bandwidth. However, it should be noted that in the second case, k N-bit AND operations need to be performed in a single stage. This increases the amount of work to be done per stage, which may potentially cause the clock period to increase. Since our goal is to implement a high-throughput packet classification engine, we opt for the first method at the cost of increased memory consumption.
Also, as we show in Section 6.3.1, the utilization of the memory blocks depends on the implementation variant adopted.

Algorithm 4 Packet Classification Process
Require: A W-bit packet header: P_{W-1} P_{W-2} ... P_0
Require: 2^k × W/k N-bit-vectors:
1: V_{i,j} = B_{i,j,N-1} B_{i,j,N-2} ... B_{i,j,0},
2:   i = 0, 1, ..., W/k - 1, and j = 0, 1, ..., 2^k - 1
Require: An N-bit-vector V_m to hold the match result
Ensure: the N-bit-vector V_m indicates all match results
3: Initialize V_m: V_m ← 11...1          (all rules match initially)
4: for i ← 0 to W/k - 1 do               (bit-wise AND)
5:   j = [P_{ik} : P_{(i+1)k}]
6:   V_m ← V_m & V_{i,j}

Table 6.1: Comparison of variations to StrideBV
Method | Memory size       | Total memory bandwidth | # stages
1      | (2^k / k) × N × W | N × W / k              | W/k
2      | N × W             | N × W                  | W/k

Multi-match to Highest-Priority Match

In [31, 59], the output of the lookup engine is the bit-vector that indicates the matching rules for the input packet header. This is desirable in environments such as Intrusion Detection Systems (IDSs), where reporting all matches is necessary for further processing. However, in packet classification, only the highest priority match is reported, since routing is the main concern.

The rules of a classifier are sorted in order of decreasing priority. Hence, extracting the highest priority match from the N-bit-vector translates to identifying the first bit position that is set to 1 when traversing the bit-vector from index 0 to N - 1. This task can be easily realized using a priority encoder. A straightforward priority encoder produces the result in a single cycle. However, when the length of the bit-vector increases, the time required to report the highest priority match increases proportionally. This causes the entire pipeline to run at a very slow clock rate for larger BV lengths, which affects the throughput.

As a remedy, we introduce a Pipelined Priority Encoder (PPE). A PPE for an N-bit-vector consists of log_B(N) stages, and since the work done per stage is trivial, the PPE is able to operate at very high frequencies. Here, the parameter B refers to the degree of the pipelined encoder, which essentially indicates how many comparisons are done in a particular stage, i.e., into how many partitions the bit-vector is split in a given stage. With the pipelined approach, the performance bottleneck introduced by the single-stage priority encoder can be effectively eliminated. In order to extract the matching result of the modularized architecture proposed in this work, we build a hierarchical PPE, which will be discussed in Section 6.1.3.

6.1.3 Modularization and Integration of Range Search

In this section, we show how range search can be integrated into StrideBV, which eliminates the need for range-to-prefix conversion. This improves the scalability of our solution in terms of the number of rules supported with a given amount of memory. We also discuss the modularization of the architecture, by which the performance limitations of traditional BV approaches can be circumvented. The combination of these three techniques forms the proposed architecture, which is completely independent of ruleset features and delivers guaranteed performance for any given ruleset.
6.1.3 Modularization and Integration of Range Search

In this section we show how range search can be integrated into StrideBV, which eliminates the need for range-to-prefix conversion. This improves the scalability of our solution in terms of the number of rules supported on a given amount of memory. We also discuss the modularization of the architecture, by which the performance limitations of traditional BV approaches can be circumvented. The combination of these three forms the proposed architecture, which is completely independent of ruleset features and delivers guaranteed performance for any given ruleset.

Modular BV Approach

Since each individual element of the BV is responsible for a single rule of the entire ruleset, each bit-level operation is completely independent of the rest of the bits in the BV. This allows us to partition the BV without affecting the operation of the BV approach. The benefit of partitioning is that we no longer need to load an N-bit BV at each pipeline stage, but rather an N/P-bit BV, where P is the number of partitions, reducing the per-stage memory bandwidth requirement by a factor of P. Partitioning is done based on rule priority, i.e., the first N/P rules go into the first partition, and so on. This eases the highest-priority match extraction process. The modular approach requires each partition to be implemented as a separate, independent pipeline. In an FPGA environment this does not become an issue, as multi-way pipelines can easily be implemented by exploiting the resources available on the device.

To better explain the severity of the unpartitioned BV approach, we provide quantitative insight using realistic example values. Consider a classifier (i.e., a ruleset) with 2000 rules and a pipelined architecture operating at 200 MHz. For a stride value of k = 3, the pipeline length becomes ceil(104/3) = 35. With no partitioning, each pipeline stage requests a 2000-bit wide word at a rate of 200 million requests per second. This translates to an on-chip memory bandwidth of 14 Tbps. On current FPGAs, sustaining such a high bandwidth is not possible. Therefore, the operating frequency drops with increasing ruleset size [20], affecting the performance of the packet classification engine.

Each partition produces a portion of the longer BV, referred to as a sub-BV hereafter, which contains its matching result(s). The aggregation of these sub-BVs is discussed in Section 6.1.3.

Scalable Range Matching on Hardware

As pointed out earlier, the effect of range-to-prefix conversion is significant, considering the fact that arbitrary ranges exist in real-life rulesets. Even ranges defined by the well-known port boundary can cause serious effects on the memory footprint when converted to prefix format. For example, the range 1024–65535 (all ports above the well-known port range 0–1023) converts to 6 individual prefixes, causing a single rule to expand into 6 rules. With two such ranges appearing in a single rule, for both source and destination port values, the rule expands into 36 rules. The effect of range-to-prefix conversion becomes even more severe for arbitrary ranges, with a worst-case O(W^2) expansion factor per rule.

As a remedy, we propose a method that causes only a minute increase in memory footprint. This is achieved by storing the lower and upper bounds of a range explicitly and performing comparison operations against the incoming header value along the pipeline. The same technique is used for both port fields. A naive application of this idea requires the storage of two separate rules, one for the lower bound and the other for the upper bound, which doubles the memory required to store the classifier and renders the solution less attractive. However, the explicit range storage is required only for the two port fields; storing the remaining fields in this format is not required. This method eliminates extra bit-vector storage and range-to-prefix expansion, which yields significant savings in memory consumption compared with solutions in which range-to-prefix conversion is required. Experimental results on real-life classifiers show that the range-to-prefix conversion process introduces a rule inflation factor of 6x in the worst case and 2.32x in the average case [63]. In order to facilitate this in hardware, we modify the pipeline stages corresponding to the port fields accordingly.
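For comparison, the conversion that the explicit-bound approach avoids can be sketched as the standard greedy range-to-prefix split. The helper below is illustrative only and is not part of the proposed architecture: for 16-bit ports it turns [0, 1023] into a single prefix but [1024, 65535] into 6 prefixes, and an arbitrary range on a W-bit field can require up to 2(W-1) prefixes.

#include <cstdint>
#include <string>
#include <vector>

// Greedy minimal prefix cover of the inclusive range [lo, hi] over w-bit values.
std::vector<std::string> range_to_prefixes(uint32_t lo, uint32_t hi, int w) {
    std::vector<std::string> out;
    uint64_t cur = lo, end = hi;
    while (cur <= end) {
        int s = 0;                         // grow the block 2^s while aligned and in range
        while (s < w && (cur & ((1ull << (s + 1)) - 1)) == 0 &&
               cur + (1ull << (s + 1)) - 1 <= end)
            ++s;
        std::string p;                     // w-bit ternary string, MSB first
        for (int b = w - 1; b >= 0; --b)
            p += (b < s) ? '*' : (((cur >> b) & 1ull) ? '1' : '0');
        out.push_back(p);
        cur += (1ull << s);
    }
    return out;
}

With explicit range storage, the same rule keeps just the pair (lo, hi) instead of this list of prefixes.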
The search operation is a sequential comparison of unsigned integers. The result of a sub-field comparison is carried on to the next stage, and the comparison completes when the last sub-field of a range field is reached. Since the incoming packet header value needs to be checked against both the lower and upper bounds of the rules, two values need to be loaded at each pipeline stage, per rule. These values can either be stored locally in the pipeline stage itself or stored in stage memory. In this work, we store them locally within the pipeline stage, by encoding the range values as logic functions. This allows us to achieve further improvements in memory efficiency, as the range-search stages do not require storage of BVs.

Multi-level Priority Encoding

With the modular approach, multiple pipelines are required to support large-scale packet classification. Multiple rules from a classifier may match an incoming packet; however, only the matching rule with the highest priority is applied to the packet in the case of packet classification. The priority is simply the order in which the rules appear in the classifier. In order to extract this information from the BVs output by the pipelines, we implement a two-stage, hierarchical priority encoder.

The first stage of the priority encoder deals with the sub-BVs produced by each sub-pipeline. It extracts the lowest bit index that is set to 1 in the sub-BV, which is the local highest-priority match (Idx_Local). In order to compute the global match, the global index (Idx_Global) needs to be computed as follows:

Idx_Global = Idx_Local + Idx_Pipe × (N/P)

where Idx_Pipe is the pipeline identifier, varying from 0 to P-1. By keeping the value of N/P (i.e., the partition size) a power of 2, the multiplication can be converted to a simple bit-shift operation.

When these global indices are produced by all the sub-pipelines, they are stored in a BV in the second stage, and the same pipelined priority encoding is carried out on this BV to produce the global highest-priority match. The latency introduced by the priority encoding process is log_B1(N/P) from the first stage and log_B2 P from the second, where B1 and B2 are two user-defined parameters with which the total latency can be controlled. B1 and B2 also define the degrees of the two priority encoders. In binary mode, the values are set to B1 = B2 = 2. However, for larger ruleset sizes, the latency introduced by a binary priority encoder can be high. Depending on the requirements, the parameters B1 and B2 can be chosen appropriately.
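As a numeric illustration of these parameters (the values of N, P, B1 and B2 below are assumed for illustration only): for a partition of N/P = 1024 rules with B1 = 4, the module-local encoder contributes log_4 1024 = 5 stages; with P = 16 partitions and B2 = 4, the global stage adds log_4 16 = 2 stages, i.e., 7 cycles of encoding latency in total, whereas the binary setting B1 = B2 = 2 would require 10 + 4 = 14 cycles.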
6.2 Hardware Architecture for Packet Classification

We discuss the hardware architecture in three steps: 1) the StrideBV architecture, 2) range search integration, and 3) modularization.

6.2.1 StrideBV Architecture

StrideBV has a regular pipelined structure in which the per-stage memory requirement is uniform across the pipeline. This is a unique feature of the StrideBV approach. Trie- and tree-based architectures, on the other hand, have an exponential memory distribution across the pipeline, which causes the stages with larger stage memory to influence the clock frequency. This ultimately degrades the throughput of the architecture.

In the StrideBV architecture, each pipeline stage uses k bits of the incoming packet header as the memory address into the stage memory. The output bit-vector of the stage memory (BVM) and the bit-vector generated by the previous stage (BVP) are bitwise ANDed together to produce the resultant bit-vector (BVR) of the current stage. The same stage construction is used throughout the pipeline. The final stride lookup stage outputs the multi-match result, and the PPE extracts the highest-priority match from the resultant bit-vector. Note that BVP in stage 0 is set to all 1's to indicate that the entire ruleset is initially considered as potential matches. As the lookup process progresses, the bits at rule indices that do not match the incoming header are set to 0 to indicate no-match. The architecture is presented in Figure 6.2.

[Figure 6.2: StrideBV pipelined architecture. W/k stride stages, each ANDing the stage memory output (BVM) with the previous stage's bit-vector (BVP) to form the resultant bit-vector (BVR), followed by a Pipelined Priority Encoder (PPE) of log_B N stages that outputs the highest-priority match. HDR denotes the 5-field header; Stride is a k-bit header stride.]

6.2.2 Range Search Integration

While there are numerous methods of handling ranges, in this work we adopt an explicit range search performed in a serial fashion. The main reason for this choice is to avoid significantly affecting the uniformity of the pipeline stages. As pointed out earlier, the StrideBV pipeline has a uniform structure, which can easily be mapped onto the FPGA fabric. In order to facilitate the stride access, at each range-search stage we consider only k bits of the header and compare them against k bits of the lower bounds (LBs) and upper bounds (UBs) of the range values, to determine whether the incoming value lies within the stored ranges. The search is done from the Most Significant Bits (MSBs) to the Least Significant Bits (LSBs) to finally arrive at which ranges matched and which did not. This result is then ANDed with the bit-vector generated by the StrideBV pipeline to combine the range-search results with the prefix/exact-match results. The architecture of the range search is shown in Figure 6.3.

[Figure 6.3: Memory-efficient range search implementation on FPGA via explicit range storage. An initial range-search stage, intermediate range-search stages, and a terminating stage compare each header stride against the stored lower bounds (LB) and upper bounds (UB), carrying the partial results BVGT and BVLT between stages, surrounded by regular StrideBV stages.]

The operation of the sequential comparison is illustrated in Figure 6.3. In each stage, the incoming packet's header stride is compared against the corresponding stride of the lower and upper bounds of the rules by performing greater-than-or-equal and less-than-or-equal operations, respectively. The result of the first stride comparison is forwarded to the second stride comparison, and so on. Only if the previous comparison returned true may the comparison in the current stage yield true. The results of these operations are stored in two BVs corresponding to the lower and upper bounds of the range values. When the sequential comparison terminates, these two BVs are ANDed together to form the complete match result for the range fields.
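The carried comparison can also be expressed in software. The sketch below is a minimal analogue, under the assumption of 16-bit port fields and a stride k that divides 16: it walks the port value in k-bit chunks from the MSB side and keeps a three-way state per bound, with one loop iteration per stride, where the hardware instead dedicates one stage per stride and collects the per-rule outcomes into the two BVs.

#include <cstdint>

// One rule's port range, stored explicitly as lower/upper bounds (16-bit ports).
struct PortRange { uint16_t lb, ub; };

// Stride-serial range check; comparison state: -1 = less, 0 = equal so far, +1 = greater.
bool in_range_stride_serial(uint16_t port, PortRange r, int k) {
    int cmp_lb = 0, cmp_ub = 0;
    for (int msb = 16 - k; msb >= 0; msb -= k) {
        unsigned p  = (port >> msb) & ((1u << k) - 1);   // header stride
        unsigned lo = (r.lb >> msb) & ((1u << k) - 1);   // lower-bound stride
        unsigned hi = (r.ub >> msb) & ((1u << k) - 1);   // upper-bound stride
        if (cmp_lb == 0) cmp_lb = (p > lo) - (p < lo);   // settles at the first difference
        if (cmp_ub == 0) cmp_ub = (p > hi) - (p < hi);
    }
    return cmp_lb >= 0 && cmp_ub <= 0;                   // port >= lb AND port <= ub
}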
6.2.3 Modularization

We present the overall architecture of the proposed solution in Figure 6.4. The strides that process header fields requiring prefix or exact match use StrideBV stages, and the strides that perform range comparisons use range-search stages. The search process is followed by the priority encoder network, which extracts the highest-priority match.

[Figure 6.4: Serial architecture for latency-tolerant applications. Each of the N/P modules is a pipeline of SA (StrideBV), DA (StrideBV), SP (Range), DP (Range) and PRT (StrideBV) stages followed by a per-module priority encoder; a main priority encoder combines the module results into the highest-priority match.]

[Figure 6.5: Parallel architecture for low-latency applications. Within each module, the SA, DA and {SP, DP, PRT} searches proceed in parallel (with delay stages for alignment) and are combined by an aggregator before priority encoding.]

Note that the order in which the header fields are inspected does not affect the final search result. Therefore, the StrideBV stages and range-search stages can be permuted in any order. Figure 6.4 takes a serial search approach, consuming consecutive strides of the header as the packet progresses through the pipeline stages. This causes the packet classification latency to increase linearly with the header length. For longer headers, this can be an undesirable feature, especially for applications that demand low-latency operation, such as multimedia, gaming and datacenter environments.

In order to accommodate such requirements, we propose a variant of the serial stride-search architecture. Figure 6.5 illustrates the low-latency variant of the proposed architecture. As mentioned previously, since the order of the search does not affect the final result, we can perform the stride search in parallel. For example, a 5-tuple header is 104 bits in length. For stride size k = 4, a serial stride search gives a packet classification latency of 26 clock cycles, excluding the priority encoder delay. However, if we perform the search in a parallel fashion, the latency can be brought down significantly. One possible arrangement is to perform the following three searches in parallel: {SA}, {DA}, {SP, DP, PRT}. In order to aggregate the results from the parallel search operations, an additional stage needs to be added. In such a scenario, using a stride size of k = 4 for this arrangement, the delay is only 11 clock cycles (with added delay stages for the SA and DA lookups), reducing the packet classification latency to 0.42x that of the serial search approach. This makes our solution suitable for networks that demand low-latency operation.

In both the serial and parallel orientations, each module is implemented as a separate pipeline. For a classifier of size N (i.e., N rules) and a partition size of P, there are N/P modules, hence N/P modular pipelines. All the pipelines receive the same input header and perform the search inside the pipeline, which is a series of bitwise AND operations. At the end of the classification step of a modular pipeline, each module generates its own sub-BV, which contains its match results. In order to extract the highest-priority match of each sub-BV, each modular pipeline is equipped with its own pipelined priority encoder. Once the module-local highest-priority match is extracted, this information is fed into the main priority encoder, which determines the overall highest-priority match. The packet is then treated based on the action specified in this rule.
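One way to express this aggregation step in software, assuming (as in the text) that the modules are filled in priority order and that the partition size is a power of two, is sketched below; the function name and the vector of per-module results are illustrative assumptions.

#include <cstddef>
#include <vector>

// local_idx[p] holds module p's local highest-priority match (the lowest set bit
// of its sub-BV), or -1 if the module reported no match. The global index is
// idx_local + p * (partition size); with a power-of-two partition size the
// multiplication is a shift.
int highest_priority_match(const std::vector<int>& local_idx,
                           int log2_partition_size) {
    for (std::size_t p = 0; p < local_idx.size(); ++p)
        if (local_idx[p] >= 0)                   // first module in priority order wins
            return (static_cast<int>(p) << log2_partition_size) + local_idx[p];
    return -1;                                   // no rule matched in any module
}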
6.3 Performance Evaluation

We evaluate the performance of both the serial and parallel versions of StrideBV on a state-of-the-art Virtex 7 2000T FPGA using post place-and-route results. We use memory consumption, throughput, power consumption and overall packet latency as the performance metrics in this evaluation. Further, we highlight the benefits of using FPGA floor-planning techniques to map the proposed architecture onto the FPGA fabric. Finally, we compare our solution against several existing packet classification schemes to show the improvements rendered by the proposed scheme.

6.3.1 Memory Requirement

The memory requirement of StrideBV for a given packet classification ruleset is strictly proportional to the number of rules in the classifier. Since range-to-prefix conversion is not required in our approach, each rule takes only O(1) space. Hence, for a classifier with N rules, the memory requirement is Θ(N). This unique property of our solution makes it a robust alternative to the ruleset-feature-dependent schemes available in the literature.

The proposed architecture is flexible in that the stride size and the orientation (serial or parallel) can be chosen based on the latency and memory constraints of a given network environment, to suit its requirements. However, note that the orientation of the architecture has no effect on the memory requirement, as no additional bit-vector storage is required to transform the architecture from serial to parallel or vice versa. Similarly, the modularization of the architecture has no effect on the memory requirement, since partitioning does not introduce any additional rules. Further, the proposed architecture is able to exploit the distributed RAM as well as the BRAM available on the device; hence, all on-chip memory resources can be made use of.

In Figure 6.6 we illustrate the memory requirement of the proposed architecture for various stride sizes. The pipeline stage memory size increases by a factor of 2^k and the pipeline length decreases by a factor of 1/k for a stride of size k bits. Hence, the overall effect of stride size on the memory consumption of the architecture is 2^k/k. Consequently, the latency of a packet decreases by a factor of 1/k at the expense of a 2^k/k memory increase. Note that this latency excludes the latency introduced by the priority encoder network; the overall latency is discussed in Section 6.3.2.

[Figure 6.6: Memory requirement of the proposed solution for stride sizes k = {2, 4, 8} and increasing classifier size (up to 60,000 rules), plotted against the Virtex 7 2000T on-chip memory capacity; memory footprint in Mbit on a logarithmic scale.]
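For the first storage method, each of the W/k stages holds 2^k N-bit vectors, so the total bit-vector storage is (2^k/k) × N × W bits. As a purely hypothetical plug-in of that formula: a classifier with N = 1024 rules and W = 104 header bits needs (2^4/4) × 1024 × 104 ≈ 0.43 Mbit at k = 4, but (2^8/8) × 1024 × 104 ≈ 3.4 Mbit at k = 8, which is the 2^k/k growth visible in Figure 6.6.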
Figure 6.6 assumes 100% utilization of the on-chip memory of the considered FPGA. In reality, however, on-chip memory is available at a coarse granularity. For example, the minimum distributed RAM size is 64 bits and the minimum block RAM size is 18 Kb. These memory blocks have predetermined height and width constraints that prevent one from using them at finer granularities. For example, the minimum depth configuration for block RAM on Virtex 7 devices is 512 words (height) of 72 bits (width). The effect of these constraints is discussed here.

Building stage memory of higher bit-widths (e.g., to store 1024-bit vectors) is achieved by cascading multiple memory blocks together. However, when considering the height of the stage memory, for a stride of size k bits the required height is 2^k. The minimum height supported by block RAM is 512; in order to fully utilize such a block RAM, the stride size has to be increased to k = 9. The side effect of using a higher stride size is that the memory required to store a rule increases by a factor of 2^k/k, as shown earlier. For example, compared with k = 2, the memory required to store a rule at k = 9 increases by a factor of 28. We show this tradeoff in Figure 6.7. The multiplication factor is simply the ratio between the amount of memory required to store a rule using a stride of k bits and the bit width of a single rule. This metric indicates how much space is required to store a single rule compared with its original size. It can be seen that the amount of space required to store a single rule increases exponentially with increasing stride size.

For the considered FPGA device, if all on-chip memory resources are exhaustively utilized, the maximum classifier size that can be accommodated is 28 K rules. Note that this constraint is imposed by the underlying architecture of the FPGA; such limitations can be eliminated on custom-built platforms such as ASIC, where the granularity of memory can be made finer than what is offered on modern FPGA platforms. Hence, the same architecture can be mapped onto ASIC, which can support larger classifier sizes and render higher throughput. However, the ASIC implementation of the proposed architecture and its performance evaluation are beyond the scope of this research.

[Figure 6.7: Variation of 1) the multiplication factor, 2) the classification latency (clock cycles) and 3) the BRAM utilization (%) with increasing stride size (k = 1 to 9).]

6.3.2 Throughput and Packet Latency

Throughput

In Section 6.2 we introduced the serial and parallel variants of the proposed architecture; we evaluate the performance of both in detail in this section. The performance of networking hardware is measured either in the number of packets forwarded per second (PPS) or in the number of data bits forwarded per second (bps). The conversion between the two metrics is straightforward: bps = packet size in bits × PPS. In order to report the worst-case throughput, we use the minimum packet size, which is 40 bytes for IPv4 packets.

Table 6.2 shows the performance of the architecture. For these experiments, we fixed the stride size to k = 4. The reason for this choice is that k = 3 and k = 4 yield the best tradeoff among the metrics we consider, as we will show later.

Table 6.2: Performance of the serial and parallel architectures
Partition size 16:   serial 526 MHz, 337 Gbps, latency 28 clocks (53 ns);  parallel 451 MHz, 289 Gbps, latency 16 clocks
Partition size 64:   serial 438 MHz, 280 Gbps, latency 29 clocks (66 ns);  parallel 378 MHz, 242 Gbps, latency 17 clocks
Partition size 256:  serial 296 MHz, 190 Gbps, latency 30 clocks (101 ns); parallel 257 MHz, 165 Gbps, latency 18 clocks
Partition size 1024: serial 211 MHz, 135 Gbps, latency 31 clocks (146 ns); parallel 174 MHz, 111 Gbps, latency 19 clocks
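The throughput entries in Table 6.2 are consistent with a simple back-of-the-envelope check: with the dual-ported stage memory described below, each pipeline accepts two minimum-size (40-byte, i.e., 320-bit) packets per cycle, so the 16-rule partition gives 526 MHz × 2 × 320 bits ≈ 337 Gbps, and the remaining rows scale the same way (e.g., 211 MHz × 2 × 320 bits ≈ 135 Gbps).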
The advantage of modularization is that instead of stor- ing and processing massively wide bit-vectors, the bit-vector length inside a module is restricted to a size such that the performance is not significantly deteriorated due to bit-vector length. With this approach, in order to support higher scalability, one simply can add more modules and the the partition size can also be chosen based the latency and throughput demands of the network. The superior flexibility offered in the proposed architecture is appealing in modern networking environments where the demands vary from network to network. We exploit the dual-ported feature of FPGA on-chip memory to increase the through- put of the architecture. Each modular pipeline is able to accept two packet headers per cycle. The reported clock frequency and throughput are for the dual-ported implementa- tion of the on-chip memory. For these experiments, the stage memory was built using the 149 distributed RAM available on FPGA. It is possible to implement the modular pipelines with BRAM and even a hybrid of distributed RAM and BRAM for higher scalability. Chip Floor-Planning With the current place-and-route tools, it is typical for a large design to incur longer wire delays due to unconstrained placement of logic/memory components of an architecture. This is the likely scenario if the place-and-route tool does not have the understanding of the physical layout of a design. However, using careful floor-planning, such anomalies can be avoided to ensure that the physical layout of the architecture is preserved even after it is mapped onto the FPGA fabric. The proposed architecture has a regular struc- ture and the same structure is maintained in all pipeline stages. Therefore, it is relatively easy to map the pipelined architecture onto the FPGA fabric in such a way that the rout- ing delays are minimized. While there are numerous ways of laying out the pipeline on the fabric, in our work we adopted the following methodology. The the pipeline stages were placed in a contiguous manner, which required a snake-like layout for the serial case and a straight-line layout for the parallel case. This enabled us to constrain the occupancy of a single modular pipeline to a specified area of the chip (localized rout- ing), which in-turn reduces its effect on the performance of the other modular pipelines. Due to this reason, even if more modular pipelines are added, the reported performance of a modular pipeline is not significantly deteriorated. Packet Latency The cumulative packet latency comprise of the packet classification latency and the latency introduced by the priority encoder. As mentioned previously, the latency intro- duced by the packet classifier can be calculated as W=k, where W is the header bit 150 length and k is the stride size in bits. Once the packet classification process is com- plete, the highest priority match of the sub-BV produced by each modular pipeline is extracted. Then the main priority encoder resolves the highest priority match from all sub-BVs to find the matching rule for the input packet header. The latency of the priority encoder can also be adjusted by appropriately choosing the degree. For example, a priority encoder with degreed and number of elementsn will introduce a delay of log d n. With higher values ofd, the latency can be minimized. How- ever, there is a tradeoff. Whend is increased, the number of serial comparisons that need to be made at a stage also increases proportionally. Hence, the delay also increases. 
The latencies reported in Table 6.2 are for d = 4, and comprise the packet classification latency and the total priority encoder latency of a modular pipeline.

The delay introduced by the main priority encoder depends on the size of the classifier under consideration. Considering the largest classifier that can be hosted on the given FPGA, a 28 K rule classifier, and a partition size of 1024, the main priority encoder handles a 28-entry bit-vector. Here also, the delay can be adjusted by choosing the degree of the priority encoder accordingly, similar to the priority encoder at the end of each modular pipeline. If we set d = 4, the total delay introduced by the priority encoder network is ceil(log_4 1024) + ceil(log_4 28), which amounts to 8 clock cycles.

6.3.3 Power Efficiency

While the throughput of networking hardware has dramatically increased over the past few decades, from 100 Mbps connections to 100 Gbps connections, the power consumption of networking hardware has also increased significantly. Power efficiency has gained much interest in the networking community for this reason. Operating under a restricted power budget is therefore imperative, which renders TCAMs a less attractive solution despite their simplicity. TCAMs have poor power efficiency due to the massively parallel exhaustive search conducted for every packet header. Current TCAM devices are able to operate at speeds of 360 MHz, support variable word sizes and are available in different capacities. A TCAM's power consumption ranges between 15 and 20 Watts/Mb [45], which is relatively high compared with the power consumption of an FPGA architecture performing the same operation. However, despite the high power consumption, the packet latency of a TCAM is minimal, at only a single clock cycle.

The power consumed by a hardware architecture is twofold: 1) static power and 2) dynamic power. Static power is the leakage power consumed by the chip and is proportional to the chip area. Dynamic power is proportional to the clock frequency of the design and the amount of resources that are currently active. (Even though other factors, such as operating temperature, affect both static and dynamic power, in this discussion we focus on the most prominent factors.) We consider both the static and dynamic components of power and evaluate the power efficiency of our architecture based on experimental results gathered from the Xilinx XPower Analyzer tool.

We first evaluated the effect of stride size on power consumption. With increasing stride size, the number of stages reduces, but the stage memory size grows exponentially. Therefore, there is a tradeoff between the power consumed by the logic portion of the architecture (stage registers, AND network, routing, etc.) and the power consumed by the memory portion of the architecture (stage memory to store the bit-vectors). Since the dynamic power is proportional to the amount of resources consumed, we first evaluated the number of logic slices consumed by the logic and memory components of the architecture.
The baseline was the architecture with stride k = 1, and we projected the results obtained from this experiment to larger stride sizes using the following relationships:

S_{L,k} = S_{L,1} / k
S_{M,k} = S_{L,1} × 2^k / k

where S_{L,k} and S_{M,k} denote the number of slices used for the logic and memory portions of the architecture, respectively, for stride size k. The projected results are depicted in Figure 6.8. Even though this does not directly translate to the power consumption of the different components, due to the effect of stride size on frequency, if we assume a fixed frequency for all stride sizes, Figure 6.8 can be used as a direct translation from resource consumption to power consumption.

[Figure 6.8: Tradeoff between memory (distributed RAM) and logic slice utilization with increasing stride size (k = 1 to 10); the number of logic slices (Logic, Distributed RAM, Total) is plotted on a logarithmic scale.]

The conclusion we can draw from Figure 6.8 is that increasing the stride size has its tradeoffs from both the memory and the power consumption standpoints. In our experiments, the stride size that yielded the best tradeoff between memory and power consumption was k = 4. Therefore, this stride size was used for all the experiments conducted on the proposed architecture. Figures 6.9a and 6.9b illustrate the results of the experiments conducted for k = 4 using different partition sizes. We show the static and dynamic power consumption, which together constitute the total power consumption of the architecture, in order to illustrate the scalability of the design with respect to power consumption. It can be seen that the static power is almost constant, with the dynamic power increasing with increasing partition size. This is expected, due to the increased resource usage (i.e., more units being "clocked"). For this reason, for smaller classifier sizes (e.g., fewer than 512 rules), the power efficiency per rule of the proposed solution is worse than for larger classifier sizes.

[Figure 6.9: Power consumption (static and dynamic) of the a) serial and b) parallel architectures for classifier sizes of 16, 64, 256 and 1024 rules.]

Table 6.3: Performance comparison with existing literature (*no support for arbitrary ranges; including arbitrary ranges could dramatically increase the memory required per rule)
Proposed - Serial:    135 Gbps, 52 bytes/rule, 31 clocks, 0.624 mW/rule, ruleset dependence NO, range-to-prefix NO
Proposed - Parallel:  111 Gbps, 52 bytes/rule, 19 clocks, 0.920 mW/rule, ruleset dependence NO, range-to-prefix NO
DCFL [64]:            19 Gbps, 90 bytes/rule, 5 clocks, power N/A, ruleset dependence HIGH, range-to-prefix NO
BV-TCAM [59]:         75 Gbps, 154 bytes/rule, 11 clocks, 0.846 mW/rule, ruleset dependence HIGH, range-to-prefix NO
Emulated TCAM [76]:   64 Gbps, 24 bytes/rule, 1 clock, power N/A, ruleset dependence LOW, range-to-prefix YES*
TCAM [45]:            115 Gbps, 30 bytes/rule, 1 clock, 4.901 mW/rule, ruleset dependence LOW, range-to-prefix YES

6.3.4 Comparison with Existing Literature

We compare several existing approaches with StrideBV; the details are presented in Table 6.3. For these experiments, we considered a classifier size of 512 rules. It must be noted that for the ruleset-feature-dependent solutions, the performance can vary significantly depending on the ruleset characteristics, whereas with the proposed solution the reported performance is guaranteed for any 512-rule classifier. Here, we assume that each rule is translated into 2.32 rules, which is the average case reported in [63]. For both proposed schemes, a stride of size k = 4 is assumed.
In order to compensate for the technology gap in cases where older devices were used (e.g., Xilinx Virtex II Pro), we assumed that all architectures operate at 300 MHz. We also report the worst-case memory consumption reported for each approach, to illustrate the caveats associated with ruleset-dependent solutions.

In [64], the solution requires 5 sequential memory accesses to complete the lookup process, which yields lower throughput. While that solution is memory efficient for specific classifiers, it has been shown that its memory requirement can go as high as 90 bytes/rule. In BV-TCAM [59], the authors use a TCAM generated on the FPGA. As shown in [55], the clock frequency that can be achieved with TCAMs built on FPGA is around 230 MHz for a size of 32 × 104. Hence, the clock frequency of the complete architecture will be governed by the clock frequency of the TCAM, despite our 300 MHz assumption.

We also indicate the ruleset dependence of each scheme to highlight the unique features of the proposed scheme. We use HIGH to denote algorithmic schemes that rely heavily on ruleset features, and LOW to denote schemes that require range-to-prefix conversion only. Note, however, that the effect of range-to-prefix conversion can be significant depending on the ranges present in the classifier. Unlike our scheme, [76] has no support for arbitrary ranges; including arbitrary ranges could dramatically increase the memory required per rule in [76].

6.4 Conclusion

We presented a modular architecture for high-speed and large-scale packet classification on Field Programmable Gate Array (FPGA). We employed the Bit-Vector (BV) approach and overcame its inherent performance bottlenecks via priority-based ruleset partitioning. Essentially, we reduced the per-pipeline memory bandwidth requirement to improve performance. Further, we incorporated range search into our architecture by performing an explicit range search, without affecting the architecture; this is a significant improvement over range-to-prefix conversion, which yields poor memory efficiency.

The proposed StrideBV architecture currently does not possess any update capabilities. The architecture can easily be augmented to support rule deletions by adding one pipeline stage in which a bit-vector is maintained to indicate the valid and invalid (i.e., deleted) rules. By ANDing the bit-vector generated by the StrideBV pipeline with the valid/invalid vector, only the valid rules are considered in the packet classification process. Modifications to existing rules (i.e., changing the action field of a rule) can also be made without significant changes to the architecture, using write-bubbles. However, enabling insertions can be a challenging task, since it requires computing the bit values for the new rule and inserting them along the pipeline. This requires the bit-vectors to have additional empty space into which new rules can be inserted.

Extending the StrideBV architecture to IPv6 can be challenging due to the increased prefix length: it will increase the packet delay and demand more memory resources. While the parallel variant of the StrideBV architecture can be adopted, the memory requirement of the architecture will increase notably. While similar arguments hold for platforms such as Ternary Content Addressable Memory (TCAM), further algorithmic enhancements to the StrideBV architecture may allow StrideBV to cope with the increased prefix length better than TCAM.
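As a rough illustration of the deletion mechanism outlined above (the stage and member names below are assumptions made for this sketch, not part of the implemented architecture), the extra pipeline stage amounts to one more bitwise AND:

#include <cstddef>
#include <vector>

// A final pipeline stage holding an N-bit valid vector: deleting rule n is a
// single bit clear, and the stage masks deleted rules out of the match result.
struct ValidStage {
    std::vector<bool> valid;                         // bit n == true -> rule n active

    void remove_rule(std::size_t n) { valid[n] = false; }

    void apply(std::vector<bool>& match) const {     // AND with the StrideBV output
        for (std::size_t n = 0; n < match.size(); ++n)
            match[n] = match[n] && valid[n];
    }
};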
Chapter 7

Conclusion

Packet forwarding engines are a prominent component of the Internet. Most high-speed network applications have become possible due to advancements in the underlying technologies of these engines. While achieving high performance is critical, many other aspects of packet forwarding engines, such as power efficiency and resource consumption, have recently gained attention in both research and industrial arenas. Meeting such stringent constraints requires advanced algorithmic techniques rather than brute-force solutions that offer little flexibility.

In this thesis, four solutions for different packet forwarding engines were proposed. Algorithms were devised to offer flexibility in forwarding engine design and to optimize one or more of the performance metrics of throughput, power, scalability and resource usage. Each of the solutions was evaluated on a state-of-the-art Field Programmable Gate Array (FPGA) with respect to the aforementioned parameters. Further, floor-planning techniques were explored in this thesis in order to further enhance the performance of the packet classification engines. Naive placement of memory and logic blocks on the FPGA fabric typically causes long wires and hence lower performance; however, it was shown that the performance can be enhanced via proper placement of the architectural blocks. We summarize our contributions in this work as follows:

Scalable Router Virtualization with Dynamic Updates: A novel routing table merging technique, Fill-In, was developed to alleviate the inherent performance limitations of the merged router virtualization approach, with support for non-blocking routing table updates. The flexibility offered by the merging algorithm provides the facility to define the memory distribution at each level of the pipeline. Further, pipelining dramatically improved the performance of the architecture, which yielded 150 Gbps throughput. This architecture can potentially be used as a memory-balanced architecture for further performance enhancement, and various Node Distribution Functions (NDFs) that suit the configuration of the platform under consideration can be explored. Optimizing node overlap is helpful only when the number of virtual routing tables is small; when the number of virtual routers increases, the memory requirement of the leaf nodes dominates the total memory requirement, so increasing node overlap does not result in a scalable solution for large-scale router virtualization. Also, no attempt was made to improve the power consumption of this architecture. Partitioning techniques can easily be adopted to reduce the overall power consumption of the architecture via clock gating. Clock gating can also be employed to disable memory blocks that do not need to be active to perform a lookup operation, which can considerably reduce the power consumption of the architecture.

Performance Modeling of Virtual Routers: In this research, we performed an extensive performance analysis of virtual routers on FPGA. Three approaches were analyzed, namely Non-Virtualized (NV), Virtualized Separate (VS) and Virtualized Merged (VM). These approaches were implemented on both high-performance and low-power FPGAs for power modeling. It was shown that the proposed models are able to estimate the power performance of the three router approaches with a ±3% accuracy.
A comprehensive performance evaluation of the virtual routers was also performed, which measured the performance of the virtual routers with respect to throughput, resource usage and power consumption. A novel grouped router virtualization (VG) approach was also proposed, which yields higher performance and scalability than both the VM and VS approaches. The grouped router virtualization approach was evaluated assuming that a grouping technique exists. Clustering algorithms can be adopted to perform this routing table grouping by considering the routing table properties. The performance of different routing table merging techniques depends highly on the properties of the routing tables being merged; by considering these properties as features for clustering, routing table grouping can be done in a manner that yields higher memory efficiency. In this work, the virtualization schemes were generalized to observe the performance trends of the different approaches and to understand their behavior on FPGAs. We observed the near-worst and near-best case behavior of these schemes; however, experiments with real-life routing tables will expose the practical use of these virtualization schemes.

High Performance IPv6 Forwarding for Backbone Routers: A routing table partitioning technique was developed that produces both disjoint and balanced partitions. This partitioning technique was applied to large-scale IPv6 backbone routing tables, and the resulting partitions were mapped onto both hardware and software platforms as range trees. Due to the improved parallelism available, the proposed technique yielded 200 Gbps on FPGA and 100 Gbps on multi-core platforms. In this work, the mapping of partitions to lookup engines was done as a one-to-one mapping. However, it is possible to extend this solution by adopting a many-to-one mapping of partitions to lookup engines. This provides the opportunity to perform the lookup operation with fewer lookup engines, which improves resource usage and opens possibilities for further power enhancements.

Ruleset-Feature Independent Packet Classification: The first algorithmic solution that does not rely on ruleset features was developed and evaluated on a state-of-the-art FPGA. Instead of performing the search using individual fields, a rule was divided into multiple sub-fields and the sub-field search was performed using bit-vectors. This yielded an architecture that is well suited for FPGA and other hardware platforms. Modularization of the architecture was introduced as a technique to improve the scalability of the proposed architecture, and range search was integrated into the architecture, which eliminated the use of costly range-to-prefix conversion. On FPGA, the performance of the architecture is constrained by the arrangement of memory; however, on platforms such as Application Specific Integrated Circuits (ASICs), the proposed architecture can be implemented in a compact manner, which will result in higher throughput than what is achieved on FPGA. Further, ruleset updates can also be integrated into the architecture by appropriately updating the bit-vectors, without interrupting the network traffic. In the current architecture, no effort was made to reduce power consumption; power optimizations can be included by disabling the operation of pipeline stages in which further processing is not necessary.
Further, chip floor-planning can be done at a fine-grained level to further improve the performance of our architecture.

It is evident that brute-force solutions will have poor scalability with increasing network sizes and increasing network demands. Algorithmic techniques that are flexible and offer higher scalability are being adopted in a widespread manner. Hence, inventing and optimizing algorithmic techniques for high-performance packet forwarding engines is imperative to the growth of the future Internet.

Bibliography

[1] Alcatel-Lucent. Alcatel-Lucent FP3 400G network processor. http://www.alcatel-lucent.com/fp3/.

[2] Mike Attig and Gordon Brebner. 400 Gb/s programmable packet parsing on a single FPGA. In Proc. 7th ACM/IEEE Symposium on Architectures for Networking and Communications Systems, pages 12–22, Oct. 2011.

[3] M. Bando, Y.-L. Lin, and H. J. Chao. FlashTrie: Beyond 100-Gb/s IP route lookup using hash-based prefix-compressed trie. Networking, IEEE/ACM Transactions on, 20(4):1262–1275, Aug. 2012.

[4] A. Basu and G. Narlikar. Fast incremental updates for pipelined forwarding engines. Networking, IEEE/ACM Transactions on, 13(3):690–703, 2005.

[5] Zdravko Bozakov. An open router virtualization framework using a programmable forwarding plane. In Proceedings of the ACM SIGCOMM 2010 conference, SIGCOMM '10, pages 439–440, New York, NY, USA, 2010. ACM.

[6] N.M. Chowdhury, Mosharaf Kabir, and Raouf Boutaba. A survey of network virtualization. Comput. Netw., 54(5):862–876, April 2010.

[7] Cisco. Cisco Catalyst 6500 virtual switching system 1440. http://www.cisco.com/en/US/products/ps9336/index.html.

[8] Cisco. Evaluating and enhancing green practices with Cisco Catalyst switching. http://www.cisco.com.

[9] Cisco. Hardware and software virtualized routers. http://www.cisco.com/en/US/solutions/collateral/ns341/ns524/ns562/ns573/white_paper_c11-512753_ns573_Networking_Solutions_White_Paper.html.

[10] J.A. Clarke, A.A. Gaffar, and G.A. Constantinides. Parameterized logic power consumption models for FPGA-based arithmetic. In Field Programmable Logic and Applications, 2005. International Conference on, pages 626–629, 2005.
In Communications (ICC), 2011 IEEE International Conference on, pages 1–5, 2011. [18] T. Ganegedara, V . Prasanna, and G. Brebner. Optimizing packet lookup in time and space on fpga. In Field Programmable Logic and Applications (FPL), 2012 22nd International Conference on, pages 270–276, 2012. [19] T. Ganegedara and V .K. Prasanna. Fpga-based router virtualization: A power perspective. In Parallel and Distributed Processing Symposium Workshops PhD Forum (IPDPSW), 2012 IEEE 26th International, pages 360 –367, may 2012. [20] T. Ganegedara and V .K. Prasanna. Stridebv: Single chip 400g+ packet classifi- cation. In High Performance Switching and Routing (HPSR), 2012 IEEE 13th International Conference on, pages 1 –6, june 2012. [21] T. Ganegedara and V .K. Prasanna. 100+ gbps ipv6 packet forwarding on multi- core platforms. In Global Communications Conference (Globecom), 2013, 2013. [22] T. Ganegedara and V .K. Prasanna. A comprehensive performance analysis of vir- tual routers on fpga. In Reconfigurable Technology and Systems (TRETS), 2013 Transactions on, 2013. 164 [23] T. Ganegedara and V .K. Prasanna. A high-performance ipv6 lookup engine on fpga. In Field Programmable Logic and Applications (FPL), 2013 International Conference on, 2013. [24] Thilan Ganegedara, Hoang Le, and V . Prasanna. Towards on-the-fly incremental updates for virtualized routers on fpga. In Field Programmable Logic and Appli- cations (FPL), 2011 International Conference on, 2011. [25] P. Gupta and N. McKeown. Classifying packets with hierarchical intelligent cut- tings. Micro, IEEE, 20(1):34 –41, jan/feb 2000. [26] S. Haria, T. Ganegedara, and V . Prasanna. Power-efficient and scalable virtual router architecture on fpga. In Reconfigurable Computing and FPGAs (ReConFig), 2012 International Conference on, pages 1–7, 2012. [27] T. Hayashi and T. Miyazaki. High-speed table lookup engine for ipv6 longest prefix match. In Global Telecommunications Conference, 1999. GLOBECOM ’99, volume 2, pages 1576 –1581 vol.2, 1999. [28] Xianghui Hu, Bei Hua, and Xinan Tang. Triec: a high-speed ipv6 lookup with fast updates using network processor. In Proceedings of the Second international con- ference on Embedded Software and Systems, ICESS’05, pages 117–128, Berlin, Heidelberg, 2005. Springer-Verlag. [29] Internet Engineering Task Force (IETF). Ipv6 addressing architecture. https: //tools.ietf.org/html/rfc4291. [30] G.S. Jedhe, A. Ramamoorthy, and K. Varghese. A scalable high throughput fire- wall in fpga. In Field-Programmable Custom Computing Machines, 2008. FCCM ’08. 16th International Symposium on, pages 43 –52, april 2008. [31] Weirong Jiang and Viktor K. Prasanna. Field-split parallel architecture for high performance multi-match packet classification using fpgas. In Proceedings of the twenty-first annual symposium on Parallelism in algorithms and architectures, SPAA ’09, pages 188–196, New York, NY , USA, 2009. ACM. [32] Weirong Jiang and V .K. Prasanna. A memory-balanced linear pipeline architecture for trie-based ip lookup. In High-Performance Interconnects, 2007. HOTI 2007. 15th Annual IEEE Symposium on, pages 83 –90, 2007. [33] Weirong Jiang and V .K. Prasanna. Multi-way pipelining for power-efficient ip lookup. In Global Telecommunications Conference, 2008. IEEE GLOBECOM 2008. IEEE, pages 1 –5, 30 2008-dec. 4 2008. 165 [34] Weirong Jiang and V .K. Prasanna. Towards green routers: Depth-bounded multi- pipeline architecture for power-efficient ip lookup. In Performance, Computing and Communications Conference, 2008. 
IPCCC 2008. IEEE International, pages 185 –192, dec. 2008. [35] Weirong Jiang, Qingbo Wang, and V .K. Prasanna. Beyond tcams: An sram-based parallel multi-pipeline architecture for terabit ip lookup. In INFOCOM 2008. The 27th Conference on Computer Communications. IEEE, pages 1786 –1794, april 2008. [36] Juniper. Control plane scaling and router virtualization. http://www. juniper.net/us/en/local/pdf/whitepapers/2000261-en.pdf. [37] Juniper. Jcs1200 control system. http://www.juniper.net/us/en/ local/pdf/whitepapers/2000261-en.pdf. [38] S. Kaxiras and G. Keramidas. Ipstash: a power-efficient memory architecture for ip-lookup. In Microarchitecture, 2003. MICRO-36. Proceedings. 36th Annual IEEE/ACM International Symposium on, pages 361 – 372, dec. 2003. [39] Alan Kennedy, Xiaojun Wang, Zhen Liu, and Bin Liu. Low power architecture for high speed packet classification. In Proceedings of the 4th ACM/IEEE Symposium on Architectures for Networking and Communications Systems, ANCS ’08, pages 131–140, New York, NY , USA, 2008. ACM. [40] Hoang Le, Thilan Ganegedara, and V . Prasanna. Memory-efficient and scalable virtual routers using fpga. In Field Programmable Gate Arrays (FPGA), 2011 International Symposium on, 2011. [41] Hoang Le and V .K. Prasanna. Scalable high throughput and power efficient ip- lookup on fpga. In Field Programmable Custom Computing Machines, 2009. FCCM ’09. 17th IEEE Symposium on, pages 167 –174, 2009. [42] Hoang Le and V .K. Prasanna. Scalable tree-based architectures for ipv4/v6 lookup using prefix partitioning. Computers, IEEE Transactions on, 61(7):1026 –1039, july 2012. [43] Yan Luo, Ke Xiang, and Sanping Li. Acceleration of decision tree searching for ip traffic classification. In Proceedings of the 4th ACM/IEEE Symposium on Archi- tectures for Networking and Communications Systems, ANCS ’08, pages 40–49, New York, NY , USA, 2008. ACM. [44] A. M. Lyons, D. T. Neilson, , and T. R. Salamon. Energy efficient strategies for high density telecom applications. Princeton University, Supelec, Ecole Centrale Paris and Alcatel-Lucent Bell Labs Workshop on Information, Energy and Envi- ronment, June 2008. 166 [45] Chad Meiners. Hardware based packet classification for high speed internet routers. http://sites.tums.ac.ir/superusers/111/Gallery/ 20120206081940Hardware.pdf. [46] Micron. Rldram overview. http://www.micron.com/products/dram/ rldram-memory. [47] NetFPGA. Netfpga boards. http://netfpga.org/. [48] Open Networking Foundation (ONF). Openflow. http://www.openflow. org/. [49] Potaroo. Bgp analysis reports. http://bgp.potaroo.net/. [50] RIPE. Ripe routing information service (ris).http://www.ris.ripe.net/. [51] John Heidemann Rishi Sinha, Christos Papadopoulos. Internet packet size dis- tributions: Some observations. http://www.isi.edu/ ˜ johnh/PAPERS/ Sinha07a.pdf. [52] M.A. Ruiz-Sanchez, E.W. Biersack, and W. Dabbous. Survey and taxonomy of ip address lookup algorithms. Network, IEEE, 15(2):8 –23, 2001. [53] Samsung. Dram overview. http://www.samsung.com/global/ business/semiconductor/product/dram. [54] Samsung. Sram product catalog. http://www.samsung.com/global/ business/semiconductor/product/sram/catalogue?iaId=181. [55] Andrea Sanny, Thilan Ganegedara, and Viktor Prasanna. A comparison of ruleset feature independent packet classification engines on fpga (unpublished). In Pro- ceedings of the 2013 IEEE Reconfigurable Architectures Workshop, IPDPS RAW ’13, 2013. [56] Sumeet Singh, Florin Baboescu, George Varghese, and Jia Wang. 
Packet classifi- cation using multidimensional cutting. In Proceedings of the 2003 conference on Applications, technologies, architectures, and protocols for computer communica- tions, SIGCOMM ’03, pages 213–224, New York, NY , USA, 2003. ACM. [57] Konstantinos Siozios, Konstantinos Tatas, Dimitrios Soudris, and Antonios Thanailakis. A novel methodology for designing high-performance and low- energy fpga routing architecture. In Proceedings of the 2006 ACM/SIGDA 14th international symposium on Field programmable gate arrays, FPGA ’06, pages 224–224, New York, NY , USA, 2006. ACM. 167 [58] Haoyu Song, M. Kodialam, Fang Hao, and T.V . Lakshman. Building scalable virtual routers with trie braiding. In INFOCOM, 2010 Proceedings IEEE, pages 1 –9, 2010. [59] Haoyu Song and John W. Lockwood. Efficient packet classification for network intrusion detection using fpga. In Proceedings of the 2005 ACM/SIGDA 13th inter- national symposium on Field-programmable gate arrays, FPGA ’05, pages 238– 245, New York, NY , USA, 2005. ACM. [60] T. Srinivasan, N. Dhanasekar, M. Nivedita, R. Dhivyakrishnan, and A.A. Azeezun- nisa. Scalable and parallel aggregated bit vector packet classification using prefix computation model. In Parallel Computing in Electrical Engineering, 2006. PAR ELEC 2006. International Symposium on, pages 139 –144, sept. 2006. [61] Cisco Systems. Vni forecast highlights. http://www.cisco.com/web/ solutions/sp/vni/vni_forecast_highlights/index.html. [62] Cisco Systems. The zettabyte era ˜ Ntrends and analysis. http: //www.cisco.com/en/US/solutions/collateral/ns341/ ns525/ns537/ns705/ns827/VNI_Hyperconnectivity_WP.html. [63] David E. Taylor. Survey and taxonomy of packet classification techniques. ACM Comput. Surv., 37(3):238–275, September 2005. [64] D.E. Taylor and J.S. Turner. Scalable packet classification using distributed crossproducing of field labels. In INFOCOM 2005. 24th Annual Joint Confer- ence of the IEEE Computer and Communications Societies. Proceedings IEEE, volume 1, pages 269 – 280 vol. 1, march 2005. [65] Deepak Unnikrishnan, Ramakrishna Vadlamani, Yong Liao, Abhishek Dwaraki, J´ er´ emie Crenne, Lixin Gao, and Russell Tessier. Scalable network virtualization using fpgas. In Proceedings of the 18th annual ACM/SIGDA international sympo- sium on Field programmable gate arrays, FPGA ’10, pages 219–228, New York, NY , USA, 2010. ACM. [66] Wikipedia. Classless inter-domain routing. http://en.wikipedia.org/ wiki/Classless_Inter-Domain_Routing. [67] Wikipedia. Ipv6. http://en.wikipedia.org/wiki/IPv6. [68] Wikipedia. Longest prefix match. http://en.wikipedia.org/wiki/ Longest_prefix_match. [69] Wikipedia. Multiprotocol label switching. http://en.wikipedia.org/ wiki/Multiprotocol_Label_Switching. 168 [70] Wikipedia. Perfect hash functions. http://en.wikipedia.org/wiki/ Perfect_hash_function. [71] Xilinx. Planahead design and analysis tool. http://www.xilinx.com/ tools/planahead.htm. [72] Xilinx. Spartan-3l low power fpga family. http://www.xilinx.com/ support/documentation/data_sheets/ds313.pdf. [73] Xilinx. Stacked silicon interconnect. http://www.xilinx.com/ products/technology/stacked-silicon-interconnect/ index.htm. [74] Heeyeol Yu. A memory- and time-efficient on-chip tcam minimizer for ip lookup. In Design, Automation Test in Europe Conference Exhibition (DATE), 2010, pages 926–931, 2010. [75] Marko Zec, Luigi Rizzo, and Miljenko Mikuc. Dxr: towards a billion routing lookups per second in software. SIGCOMM Comput. Commun. Rev., 42(5):29–36, September 2012. [76] Carlos A. 
Zerbini and Jorge M. Finochietto. Performance evaluation of packet classification on FPGA-based TCAM emulation architectures. In Proceedings of the 2012 IEEE Global Communications Conference, Globecom '12, 2012.

[77] Kai Zheng, Chengchen Hu, Hongbin Liu, and Bin Liu. An ultra high throughput and power efficient TCAM-based IP lookup engine. In INFOCOM 2004. Twenty-third Annual Joint Conference of the IEEE Computer and Communications Societies, volume 3, pages 1984–1994, March 2004.

[78] Pingfeng Zhong. An IPv6 address lookup algorithm based on recursive balanced multi-way range trees with efficient search and update. In Computer Science and Service System (CSSS), 2011 International Conference on, June 2011.
Abstract
The Internet has become ubiquitous within the past few decades. The number of active Internet users has reached 2.5 billion, and the number of Internet-connected devices has reached 11 billion as of 2012. Given this proliferation of Internet users and devices, forecasts show that network traffic is expected to grow threefold between 2012 and 2017, resulting in 1.4 Zettabytes of data exchanged on the Internet in 2017.

These enormous amounts of traffic demand high forwarding rates to satisfy the requirements of various time-critical applications. For example, multimedia applications such as video streaming, Voice over IP (VoIP) and gaming require high-bandwidth and low-latency packet delivery. To meet such demands, network speeds have increased significantly since the inception of the Internet