HIGH PERFORMANCE PACKET FORWARDING ON PARALLEL ARCHITECTURES

by

Weirong Jiang

A Dissertation Presented to the
FACULTY OF THE USC GRADUATE SCHOOL
UNIVERSITY OF SOUTHERN CALIFORNIA
In Partial Fulfillment of the Requirements for the Degree
DOCTOR OF PHILOSOPHY (COMPUTER ENGINEERING)

December 2010

Copyright 2010 Weirong Jiang

Dedication

To my wife and my parents

Acknowledgments

First, I must thank my advisor, Prof. Viktor K. Prasanna. His patient guidance and his insightful discussions are invaluable to me. This work would not have been possible without him.

I would also like to express my gratitude to my committee members: Prof. Murali Annavaram and Prof. Ramesh Govindan. They have provided many useful insights during this process. Additionally, Prof. Peter Beerel and Prof. Monte Ung provided helpful feedback at the time of my qualifying exam. I was privileged to receive guiding advice from Dr. Maya Gokhale when I was an intern at Lawrence Livermore National Laboratory. I would also like to thank Prof. Jun Li at Tsinghua University for the part that he has played in my success.

I have had the benefit of working with a wonderful and productive research group. I would especially like to acknowledge Qingbo Wang, Hoang Le, Yi-hua Edward Yang, Thilan Ganegedara, Ling Zhuo, Gerald Morris, Ronald Scrofano, Jingzhao Ou and Zachary Baker. I am grateful for their help, including from some who had already graduated when I joined the group. I also thank Janice Thompson and Aimee Barnard for their administrative assistance.

Last but not least, I must thank my wife, Yun Zhu, whose love and support are crucial for anything that I have ever accomplished. My parents, Chaodong Jiang and Fuqin Tian, deserve special recognition for all that they have done for me, both during this process and throughout my life. I could not have reached this point without their unconditional love and support.

Table of Contents

Dedication
Acknowledgments
List of Tables
List of Figures
Abstract

Chapter 1: Introduction
  1.1 Background
    1.1.1 Router Architecture
    1.1.2 Packet Forwarding
      1.1.2.1 IP Lookup
      1.1.2.2 Packet Classification
      1.1.2.3 Flexible Flow Matching
    1.1.3 Performance Challenges
  1.2 Motivation
  1.3 Contributions
  1.4 Organization

Chapter 2: Packet Forwarding Approaches
  2.1 Software Approaches
    2.1.1 Trie-based IP Lookup
    2.1.2 Packet Classification Algorithms
    2.1.3 Flexible Flow Matching
  2.2 Hardware Approaches
    2.2.1 TCAM
    2.2.2 SRAM-based Pipeline
    2.2.3 FPGA

Chapter 3: Pipeline Architectures for High-Throughput IP Lookup
  3.1 Memory-Balanced Linear Pipelines
    3.1.1 Fine-Grained Mapping
      3.1.1.1 Trie Partitioning
      3.1.1.2 Algorithm and Architecture
      3.1.1.3 Results
    3.1.2 Bidirectional Mapping
      3.1.2.1 Algorithms and Architecture
      3.1.2.2 Results
  3.2 Parallel Multi-Pipeline Architectures
    3.2.1 Memory Balancing Among Pipelines
      3.2.1.1 Approximation Algorithm
      3.2.1.2 Partitioning with Small TCAM
    3.2.2 Traffic Balancing Among Pipelines
      3.2.2.1 IP Caching
      3.2.2.2 Dynamic Subtrie-to-Pipeline Remapping
      3.2.2.3 Results
    3.2.3 Sequence Preserving
      3.2.3.1 Early Caching
      3.2.3.2 Output Delay
    3.2.4 Overall Performance

Chapter 4: Towards Green Routers: Power-Efficient IP Lookup
  4.1 Related Work
    4.1.1 Greening the Routers
    4.1.2 Power-Efficient IP Lookup Engines
  4.2 Architecture-Aware Data Structure Optimization
    4.2.1 Problem Formulation
      4.2.1.1 Non-Pipelined and Pipelined Engines
      4.2.1.2 Power Function of SRAM
    4.2.2 Special Case: Uniform Stride
    4.2.3 Dynamic Programming
    4.2.4 Performance Evaluation
      4.2.4.1 Results for Non-Pipelined Architecture
      4.2.4.2 Results for Pipelined Architecture
  4.3 Reducing Dynamic Power Dissipation
    4.3.1 Analysis and Motivation
    4.3.2 Architecture-Specific Techniques
      4.3.2.1 Inherent Caching
      4.3.2.2 Local Clocking
      4.3.2.3 Fine-Grained Memory Enabling
    4.3.3 Performance Evaluation

Chapter 5: Large-Scale Wire-Speed Packet Classification
  5.1 Our Approach
    5.1.1 Motivations
    5.1.2 Architecture Overview
    5.1.3 Algorithms
      5.1.3.1 Decision Tree Construction
      5.1.3.2 Tree-to-Pipeline Mapping
  5.2 Implementation
    5.2.1 Pipeline for Decision Tree
    5.2.2 Pipeline for Rule Lists
    5.2.3 Rule Update
  5.3 Experimental Results
    5.3.1 Algorithm Evaluation
    5.3.2 FPGA Implementation Results

Chapter 6: Scalable Architecture for Flexible Flow Matching
  6.1 Our Approach
    6.1.1 Heuristic
    6.1.2 Motivation
    6.1.3 Algorithms
    6.1.4 Architecture
  6.2 Experimental Results
    6.2.1 Experimental Setup
    6.2.2 Algorithm Evaluation
    6.2.3 Implementation Results

Chapter 7: Conclusion
  7.1 Summary of Contributions
  7.2 Future Work
    7.2.1 From IPv4 to IPv6
    7.2.2 Growing Table Size
    7.2.3 Dynamic Update
    7.2.4 Evolving with Packet Forwarding

References

List of Tables

1.1 Summary of packet forwarding at various levels
1.2 Example IP lookup table
1.3 Example packet classification rule set
1.4 Header fields supported in current OpenFlow
1.5 Example OpenFlow rule set
1.6 Comparison of TCAM and SRAM technologies
3.1 Representative routing tables
3.2 Traffic distribution over 8 pipelines
4.1 Representative routing tables (snapshot on 2009/04/01)
4.2 Real-life IP header traces
4.3 Resource utilization
5.1 Performance of algorithms for rule sets of various sizes
5.2 Resource utilization of the packet classification engine on FPGA
5.3 Performance comparison of FPGA-based packet classification engines
6.1 Breakdown of a P = 4-tree decision forest
6.2 Resource utilization of the 4-tree decision forest on FPGA

List of Figures

1.1 Block diagram of the router system architecture
2.1 (a) Prefix set; (b) Uni-bit trie; (c) Leaf-pushed trie; (d) Multi-bit trie
2.2 Example of HiCuts and HyperCuts decision trees
2.3 Ring pipeline and CAMP
3.1 OLP architecture
3.2 Prefix expansion ratio
3.3 Algorithm: node-to-stage mapping in OLP
3.4 Node distribution after fine-grained mapping
3.5 Bidirectional fine-grained mapping for the trie in Figure 2.1
3.6 Algorithm: Selecting the subtrie to be inverted
3.7 Algorithm: Bidirectional fine-grained mapping
3.8 Block diagram of the basic architecture of BiOLP
3.9 Bidirectional fine-grained mapping with different heuristics (Inversion factor = 1)
3.10 Bidirectional fine-grained mapping with various inversion factors (Largest leaf heuristic)
3.11 Node distribution over subtries
3.12 Algorithm: Subtrie-to-pipeline mapping
3.13 Node distribution over 8 pipelines (using the approximation algorithm)
3.14 Height-bounded partitioning and mapping
3.15 Algorithm: Height-bounded split: HBS(n, i)
3.16 Index table for the height-bounded splitting
3.17 Node distribution over 8 pipelines (with a small index TCAM)
3.18 POLP architecture (W = 3, P = 4)
3.19 Block diagram of the architecture with pre-caching (P = 8)
3.20 Algorithm: Subtrie-to-pipeline remapping
3.21 Throughput speedup with different numbers of pipelines (P = Pc = 1, 2, 4, 6, 8)
3.22 Node distribution over all stages in the 8-pipeline 25-stage architecture
4.1 Power function of SRAM sizes
4.2 Algorithm: FixedStride(W, k)
4.3 Power results of the non-pipelined architecture using (a) the uniform stride and (b) the optimal stride
4.4 Power results of the non-pipelined architecture using (a) Bs = 2 and (b) Bs = 6
4.5 Power results of the pipelined architecture using (a) the uniform stride and (b) the optimal stride
4.6 Power results of the pipelined architecture using (a) Bs = 2 and (b) Bs = 6
4.7 Power results of the pipelined architecture using (a) h = k +16 and (b) h = k2
4.8 Traffic rate variation over the time
4.9 Access frequency on each stage
4.10 Pipeline with inherent caching
4.11 Local clocking for one stage
4.12 Profiling of dynamic power consumption in a pipelined IP lookup engine
4.13 Power reduction with fine-grained memory enabling
4.14 Power reduction with inherent caching and local clocking
5.1 Motivating example
5.2 Block diagram of the two-dimensional linear dual-pipeline architecture
5.3 Algorithm: Building the decision tree
5.4 Building the decision tree for the example rule set
5.5 Mapping a decision tree onto pipeline stages
5.6 Algorithm: Mapping the decision tree onto a pipeline
5.7 Implementation of updatable, varying number of branches at each node
5.8 Implementation of rule matching
5.9 Distribution over Tree Pipeline stages for ACL 10K
6.1 Rule duplication in HyperCuts tree
6.2 Algorithm: Building the decision forest
6.3 Algorithm: Building the decision tree and the split-out set
6.4 Multi-pipeline architecture for searching the decision forest (P = 2)
6.5 Average memory requirement with increasing P
6.6 Tree depth with increasing P
6.7 Number of cutting fields with increasing P

Abstract

Packet forwarding has long been a performance bottleneck in Internet infrastructure, including routers and switches. While the throughput requirements continue to grow, power dissipation has emerged as an additional critical concern. Also, as the Internet continues to evolve, packet forwarding engines must be flexible in order to enable future innovations. Although ternary content addressable memories (TCAMs) have been widely used for packet forwarding, they have high power consumption and are inflexible for adapting to new addressing and routing protocols.

This thesis studies the use of low-power memory, such as static random access memory (SRAM), combined with application-specific integrated circuit (ASIC) / field-programmable gate array (FPGA) technology, to develop high-throughput, power-efficient, and flexible algorithmic solutions for various packet forwarding problems, which include IP lookup, packet classification, and flexible flow matching (such as OpenFlow).

We propose to map state-of-the-art packet forwarding algorithms onto SRAM-based parallel architectures. High throughput is achieved via pipelining and/or multi-processing. Several challenges for such algorithm-to-architecture mapping are addressed. Meanwhile, enabled by the customized architecture design, the algorithms are optimized to achieve memory and/or power/energy efficiency.

For IP lookup, we propose two mapping schemes to balance the memory distribution across the stages in a pipeline. In the case of multi-pipeline architectures, our schemes balance both the memory requirement and the traffic load among multiple pipelines. The intra-flow packet order is also preserved. In addition to the power reduction achieved by replacing TCAMs with SRAMs, we propose data structure and architectural optimizations to further lower the power/energy consumption for SRAM-based pipelined IP lookup engines.

For packet classification, we propose a decision-tree-based, two-dimensional dual-pipeline architecture. Several optimization techniques are proposed for the state-of-the-art decision-tree-based algorithm. As a result, the memory requirement is almost linear with the number of rules in the forwarding table.

Considering OpenFlow as a representative of flexible flow matching, we develop a framework to partition a given table of flexible flow rules into multiple subsets, each of which is built into a depth-bounded decision tree. The partitioning scheme is carefully designed to reduce the overall memory requirement.

We evaluate our solutions implemented on modern ASIC/FPGA and demonstrate their superior performance over the state-of-the-art with respect to throughput, memory requirement, and power/energy consumption.

Chapter 1
Introduction

The Internet is built as a packet-switching network.
The kernel function of Internet infras- tructure, including routers and switches, is to forward the packets that are received from one subnet to another subnet. The packet forwarding is accomplished by us- ing the header information extracted from a packet to look up the forwarding table maintained in the routers/ switches. Due to rapid growth of network traffic, packet forwarding has long been a performance bottleneck in routers/ switches. This chapter gives an overview of the key packet forwarding problems in the network infrastructure of today and the future. 1.1 Background 1.1.1 Router Architecture As shown in Figure 1.1, a router contains two main architectural components: a rout- ing engine and a packet forwarding engine. The routing engine on the control plane processes routing protocols, receives inputs from network administrators, and pro- duces the forwarding table. The packet forwarding engine on the data plane receives packets, matches the header information of the packet against the forwarding table to 1 identify the corresponding action, and applies the action for the packet. The routing engine and the forwarding engine perform their tasks independently, although they constantly communicate through high-throughput links [37]. Forwarding Table Forwarding Engine Routing Engine Routing Protocol Process Forwarding Table Updates Routing Protocol Packets Data Plane Control Plane Packet_in Packet_out Admin Input Admin Output Figure 1.1: Block diagram of the router system architecture 1.1.2 Packet Forwarding A network packet consists of various header fields. Based on which packet header field(s) are to be matched, we categorize packet forwarding in next-generation routers/ switches into IP lookup, packet classification and flexible flow matching, as summa- rized in Table 1.1, which will be explained in the rest of this section. Table 1.1: Summary of packet forwarding at various levels Packet field Matching type IP lookup Destination IP address Longest prefix matching Packet classification 5-tuple header Multi-field range matching OpenFlow switching 12-tuple header Exact/Wildcard matching 2 1.1.2.1 IP Lookup The core function of network routers is IP lookup, where the destination IP address of each packet is matched against the entries in the routing table. Each routing entry consists of a prefix and its corresponding next-hop interface. Table 1.2 shows a sim- ple routing table where we assume 8-bit IP addresses. A prefix in the routing table represents a subset of IP addresses that share the same prefix, and the prefix length is denoted by the number followed by the slash. The nature of IP lookup is longest pre- fix matching (LPM) [70]. In other words, an IP address may match multiple prefixes, but only the longest prefix is used to retrieve the next-hop information. For example, a packet with destination IP address 11010010 will match the prefixes 110* and 11* in Table 1.2. But 110* becomes the LPM. Therefore, that packet is forwarded to the corresponding next-hop interface, P6. Table 1.2: Example IP lookup table Prefix/Length Next-Hop Interface 00000000/1 P1 01000000/3 P4 10000000/2 P2 11000000/3 P6 11000000/2 P5 1.1.2.2 Packet Classification Packet classification [23] enables routers to support firewall processing, Quality of Service differentiation, virtual private networks, policy routing, and other value added services. 
An IP packet is usually classified based on the five fields in the packet header: 32-bit source/destination IP addresses (denoted SA/DA), 16-bit source/destination port numbers (denoted SP/DP), and 8-bit transport layer protocol (denoted Prtl). 3 Individual entries for classifying a packet are called rules or classifiers. Each rule has the associated value for each field, a priority, and an action to be taken if matched. Each field in a rule is allowed three kinds of matches: exact match, prefix match, or range match. In an exact match, the header field of the packet should exactly match the rule field. Exact match is used for the protocol field. In a prefix match, the rule field should be a prefix of the header field. IP address fields are specified as prefixes. In a range match, the header values should lie in the range specified by the rule. Ranges are useful for matching port numbers. Table 1.3 shows a simplified example, where we assume 8-bit source and destination IP addresses, 4-bit source and destination port numbers, and a 2-bit transport protocol value. Actually, both exact match and pre- fix match can be viewed as some form of range match. Exact match is identical to matching a range with equal upper and lower bounds. A prefix specifies a contiguous interval of values. For example, the prefix01 for 4-bit values can also be specified by the range [0100, 0111]. Hence, packet classification can be modeled as a problem of multi-field range matching. Table 1.3: Example packet classification rule set Rule SA (8-bit) DA (8-bit) SP (4-bit) DP (4-bit) Prtl (2-bit) Priority R1 1110* 01* [0001:0011] [0110:0111] 01 1 R2 0000* 1101* [1010:1101] [0010:0110] 10 1 R3 100* 00* [0010:0101] [0110:0111] 11 2 R4 10* 0011* [0000:0011] [0110:0111] 01 3 R5 111* 010* [0010:0101] [0110:0111] 00 3 R6 001* 1* [0001:0011] [0110:0111] 01 4 R7 0* 11* [0001:0011] [0110:0111] 10 5 R8 * 0110* [0000:1111] [0000:0111] 11 6 When a packet arrives at a router, its header is compared to a set of rules. A packet is considered matching a rule only if it matches all the fields within that rule. If a packet matches multiple rules, the matching rule with the highest priority is returned 4 in most applications. However, some applications require returning all the matching rules. Such a problem is called multi-match packet classification [90]. 1.1.2.3 Flexible Flow Matching Emerging network requirements, such as user-level and fine-grained security, mobil- ity, and reconfigurability, have made network virtualization an essential feature for next-generation enterprise, data center, and cloud computing networks. Major router vendors have recently initiated programs to provide open router platforms which al- low users to develop software extensions for proprietary hardware [57]. Such an open router platform requires the underlying forwarding hardware to be flexible and to pro- vide a clean interface for software [10]. One such effort is the OpenFlow switch, which manages explicitly the network flows, using a flow table with rich definition as the software-hardware interface [56]. The OpenFlow switch [62] brings programmability and flexibility to the network infrastructure by separating the control and the data planes of routers/switches and managing the control plane at few centralized servers. The major processing engine in the OpenFlow switch is flexible flow matching, where up to 12-tuple header fields of each packet are matched against all the flow rules [58, 51]. 
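Both the 5-tuple classification described above and the flexible flow matching introduced here reduce, conceptually, to per-field range checks: an exact value is a range of width one, a prefix is a contiguous interval, and a wildcard covers the whole field. The sketch below illustrates this unified view; the rule representation, function names, and priority handling are illustrative assumptions for exposition, not the design developed in this thesis.

```python
# Hypothetical representation: every field of a rule is stored as an inclusive
# range [lo, hi]. An exact value v is [v, v]; a prefix p of length L on a w-bit
# field (p left-aligned, zero-padded) is [p, p | ((1 << (w - L)) - 1)];
# a wildcard is [0, 2**w - 1].

def field_matches(value, lo, hi):
    return lo <= value <= hi

def best_match(rules, header):
    """Return the first rule whose every specified field covers the header.

    rules:  list of (priority, {field: (lo, hi)}), sorted from highest to
            lowest priority, so the first hit is the best match.
    header: {field: value} extracted from the packet.
    """
    for priority, fields in rules:
        if all(field_matches(header[f], lo, hi) for f, (lo, hi) in fields.items()):
            return priority, fields
    return None  # no match; an OpenFlow switch would send the packet to the controller
```

For instance, the 4-bit prefix 01* would be stored as the range [0100, 0111], matching the earlier example; collecting every hit instead of returning the first one would give the multi-match variant of the problem.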
The 12-tuple header fields supported in the current OpenFlow specification include the ingress port, source/ destination Ethernet addresses, Ethernet type, VLAN ID, VLAN priority, source/ des- tination IP addresses, IP protocol, IP Type of Service bits, and source/ destination port numbers [63]. Table 1.4 shows the width of each field 1 . Each field of an OpenFlow rule can be specified as either an exact number or a wild- card. IP address fields can also be specified as a prefix. Table 1.5 shows a simplified 1 The width of the ingress port is determined by the number of ports of the switch / router. For example, 6-bit ingress port indicates that the switch / router has up to 63 ports. 5 Table 1.4: Header fields supported in current OpenFlow Header field Notation # of bits Ingress port Variable Source Ethernet addresses Eth src 48 Destination Ethernet address Eth dst 48 Ethernet type Eth type 16 VLAN ID 12 VLAN Priority 3 Source IP address SA 32 Destination IP address DA 32 IP Protocol Prtl 8 IP Type of Service ToS 6 Source port SP 16 Destination port DP 16 example of OpenFlow rule table, where we consider 16-bit Eth src/dst, 8-bit SA/DA and 4-bit SP/DP. In the subsequent discussion, we have the following definitions: Simple rule is the flow rule in which all of the fields are specified as exact values, such as R10 in Table 1.5. Complex rule is the flow rule that contains wildcards or prefixes, such as R19 in Table 1.5. A packet is considered to be matching a rule if and only if its header content matches all the specified fields within that rule. If a packet matches multiple rules, then the matching rule with the highest priority is used. In OpenFlow, a simple rule always has the highest priority. If a packet does not match any rule, then the packet is forwarded to the centralized server. The server determines how to handle it and may register a new rule in the switches. Hence dynamic rule updating needs to be supported. 6 Table 1.5: Example OpenFlow rule set Rule Ingress Eth Eth Eth VLAN VLAN IP src IP dst IP IP Port src Port dst Action port src dst type ID priority (SA) (DA) Prtl ToS (SP) (DP) R1 * 00:13 00:06 * * * * * * * * * act0 R2 * 00:07 00:10 * * * * * * * * * act0 R3 * * 00:FF * * * * * * * * * act1 R4 * 00:1F * 0x8100 100 5 * * * * * * act1 R5 * * * 0x0800 * * * 01* * * * * act2 R6 * * * 0x0800 * * 001* 11* TCP * 10 15 act0 R7 * * * 0x0800 * * 001* 11* UDP * 2 11 act3 R8 * * * 0x0800 * * 100* 110* * * 5 6 act1 R9 5 00:FF 00:00 0x0800 4095 7 0011* 1100* TCP 0 2 5 act0 R10 1 00:1F 00:2A 0x0800 4095 7 01000001 10100011 TCP 0 2 7 act0 7 1.1.3 Performance Challenges As the Internet becomes even more pervasive, performance of the network infrastruc- ture that supports this universal connectivity becomes critical with respect to through- put and power. Traditionally, performance has been achieved by increasing the maxi- mum network throughput to handle bursty traffic. The harsh truth about the power-throughput relationship of today’s network infras- tructures is that power efficiency is often sacrificed in order to obtain higher throughput through brute-force expansion. A single high-end core router that switches 640 Gbps full-duplex network traffic can consume over 10 kW of power [13, 36], whereas a high-end service gateway capable of 10–45 Gbps routing and firewall throughput can take 1–5 kW of power [12, 35]. 
Both the large amount of total energy and the high power density cause serious problems for the industry and the environment through the operation and maintenance of this network equipment:

- The historical trend shows that the capacity of backbone routers had doubled every 18 months until 5 years ago. Today, terabit routers with 10–15 kW are at the limit due to the power density. As a result, a thirty-fold shortfall in capacity will be seen by 2015 as compared to the historical trend for single rack routers [54].

- The high power density imposes a strenuous burden on the cooling of the network equipment. According to [73], a hardware component that consumes 50–100 W/ft^2 can require 1.3x to 2.3x more power for its cooling. In other words, every watt that is saved from the critical operation of the network equipment reduces up to 3.3 watts of total power dissipation.

- The high power and cooling also imply high monetary investments and energy costs. This fact is especially true for network infrastructures where the equipment (nodes) often needs to be placed strategically near the center of metropolitan areas. At 500 watts per square foot, it will cost $5000/ft^2, or $250 million in total, to equip a 50,000-square-foot facility as a data center [4].

Packet forwarding has been a major performance bottleneck for network infrastructure [18]. Power/energy consumption by forwarding engines has become an increasingly critical concern [82, 20]. High performance packet forwarding systems require:

- High throughput to meet the increasing network traffic demand. Major ISPs, such as Sprint [67] and Verizon [68], have been deploying 40–100 Gbps links. An OC-768 (40 Gbps) link requires a processing rate of 125 million packets per second (MPPS) for minimum size (40 bytes) packets, since a 40-byte packet is 320 bits and 40 Gbps / 320 bits per packet = 125 MPPS. Such throughput is impossible to achieve by using any existing software-based solution [70, 23].

- Power efficiency to reduce heat dissipation. Recent investigations [54, 11] show that power dissipation has become the major limiting factor for next-generation routers and predict that expensive liquid cooling may be needed in the future. An analysis by researchers from Bell Labs [54] reveals that almost 2/3 of the power dissipation inside a core router is due to packet forwarding engines.

- Low memory usage to support an increasing number of entries. The current largest routing table has over 300K IPv4 (32-bit) prefixes [69]. In the future, both routing table and prefix sizes are expected to grow significantly when 128-bit IPv6 is adopted. Interfacing with external SRAMs should be enabled to handle even larger rule sets.

- On-the-fly update to prevent malfunctioning during updating. Since the forwarding table is frequently updated (particularly if the network is virtualized), the packet forwarding engine must support dynamic updates without much performance degradation.

1.2 Motivation

Increasing link rates demand that packet forwarding must be performed in hardware. Most hardware-based high-speed packet forwarding engines fall into two main categories: TCAM (ternary content addressable memory)-based and DRAM/SRAM (dynamic/static random access memory)-based solutions. Although TCAM-based engines are widely used in today's routers, their throughputs are limited by the relatively low clock rate of TCAMs. As a result of the massive parallelism inherent in their architectures, TCAMs do not scale well in terms of power consumption [92].
Further- more, TCAMs are expensive and offer little flexibility for adapting to new addressing and routing protocols [34]. Table 1.6: Comparison of TCAM and SRAM technologies TCAM (18 Mb chip) SRAM (18 Mb chip) Maximum clock rate (MHz) 250 [26] 450 [15, 72] Cell size (# of transistors per bit) [3] 16 6 Power consumption (Watts) 12 15 [94] 0.1 [8] As shown in Table 1.6, SRAMs outperform TCAMs of equal memory size, with respect to speed, density, and power dissipation. However, traditional SRAM-based algorithmic solutions must access a single large memory multiple times, which results in low throughput and high power/ energy consumption. Hence, our research focuses on employing multiple memory blocks, which can be accessed in parallel to speed up packet forwarding. These memory blocks can be organized in a pipeline or in multiple 10 pipelines. Several challenges must be addressed to make such solutions feasible. First, after mapping the algorithmic solutions onto our parallel architecture, the memory dis- tribution among these memory blocks should be balanced. Second, in case of multiple pipelines, the traffic on different pipelines should be balanced. Third, due to traffic balancing, the packets within the same flow may go out of order. Thus, new schemes are needed to preserve the intra-flow packet order. On the other hand, enabled by the customized architecture design, we rethink the existing algorithmic solutions and propose various optimizations to achieve power/ energy and/or memory efficiency. Other architectural methods, such as clock gating, are also integrated to reduce the dynamic power dissipation and the average energy consumption per packet. Some of our designs are evaluated based on ASIC implementation, while the oth- ers are prototyped on reconfigurable hardware, such as SRAM-based FPGAs. Unlike ASICs, FPGAs can be reprogrammed to suit a specific set of input data or operating situations, which allows for the flexibility of software with the power of fixed hard- ware. 1.3 Contributions The major contributions of this thesis, as summarized below, are in designing cus- tomized parallel SRAM-based architectures on FPGA/ ASIC to accommodate algo- rithmic optimizations, in order to achieve high-throughput low-power packet forward- ing for next-generation Internet infrastructure. 11 IP lookup We propose a heuristic to perform fine-grained node-to-stage mapping to achieve balanced memory distribution across the stages in a linear pipeline. The archi- tecture achieves a high throughput of one packet per clock cycle. To further improve the throughput, we propose the use of multiple pipelines. An approximation algo- rithm is proposed to solve the problem of balancing the memory distribution among multiple pipelines, which is NP-hard. Both IP/ prefix caching and dynamic subtrie remapping are incorporated to balance the traffic among multiple pipelines. We im- prove the caching scheme in order to utilize the locality inherent in Internet traffic and in the pipeline architecture. Lightweight schemes are developed to maintain the intra-flow packet order. Neither a large reorder buffer nor complex reorder logic is needed. The proposed 8-pipeline architecture can store a full backbone routing table with over 200K unique prefixes by using less than 3.6 MB of memory. It can achieve a high throughput of up to 11.72 billion packets per second (GPPS), i.e. 3.75 Tbps for minimum size (40 bytes) packets. 
Compared with TCAM, our architecture achieves 2:6-fold and fourteen-fold reduction in power and energy consumption, respectively. Power Reduction To further lower the power/ energy consumption for SRAM-based pipelined IP lookup engines, we propose data structure and architectural optimizations. We formulate the power consumption of a SRAM-based IP lookup engine as a function of the number of memory accesses (time) and the memory size (space). We revisit the conventional time-space trade-off in multi-bit tries. A dynamic programming frame- work is developed to determine the optimal strides for constructing tree bitmap coded multi-bit tries in order to minimize the worst-case power consumption. Several novel architecture-specific techniques, such as caching and fine-grained memory enabling, are incorporated into the pipelined IP lookup engine to reduce the dynamic power dis- sipation. Simulation experiments that use real-life traces show that our solutions can 12 achieve up to a fifteen-fold reduction in dynamic power dissipation over the baseline pipeline architecture that does not employ the proposed schemes. Packet classification We implement decision-tree-based packet classification algo- rithms onto FPGAs. We exploit the dual-port high-speed Block RAMs provided in Xilinx Virtex FPGAs and present a SRAM-based two-dimensional dual-pipeline ar- chitecture to achieve a high throughput of two packets per clock cycle (PPC). On-the- fly rule update without service interruption becomes feasible due to the memory-based linear architecture. Unlike a simple design that fixes the number of branches at each tree node, our design allows updating on-the-fly the number of branches that are on multiple packet header fields. All the routing paths are localized to avoid large routing delay so that a high clock frequency is achieved. Two optimization techniques, called rule overlap reduction and precise range cutting, are proposed to minimize the rule duplication. As a result, the memory requirement is almost linear with the number of rules. So, ten thousand rules can be fit into a single FPGA. The height of the tree is also reduced, which limits the number of pipeline stages. To map the tree onto the pipeline architecture, we introduce a fine-grained node-to-stage mapping scheme that allows imposing the bounds on the memory size and on the number of nodes in each stage. As a result, the memory utilization of the architecture is maximized. The mem- ory allocation scheme also enables the use of external SRAMs to handle even larger rule sets. Implementation results show that our architecture can store 10K 5-field rules in a single Xilinx Virtex-5 FPGA and sustain 80 Gbps throughput for minimum size (40 bytes) packets. To the best of our knowledge, our design is the first FPGA imple- mentation that is able to perform multi-field packet classification at wire speed while supporting a large rule set with 10K unique rules. 13 Flexible Flow Matching We propose a parallel architecture, named decision forest, for high-performance flexible flow matching. We develop a framework to partition a given table of flexible flow rules into multiple subsets of which each is built into a depth-bounded decision tree. The partitioning scheme is carefully designed to re- duce rule duplication during the construction of the decision trees. Thus, the overall memory requirement is significantly reduced. After such partitioning, the number of header fields that are used to build the decision tree for each rule subset is small. 
The reduced number of cutting fields leads to a reduction in the logic resource requirement. Exploiting the dual-port RAMs available in current FPGAs, we map each decision tree onto a linear pipeline in order to achieve high throughput. Our extensive experiments and FPGA implementation demonstrate the effectiveness of our scheme. Our design supports 1K flexible flow rules while sustaining 40 Gbps throughput for matching minimum size (40 bytes) packets. To the best of our knowledge, this FPGA design is the first that allows for flexible flow matching to achieve over 10 Gbps throughput.

1.4 Organization

The rest of this thesis is organized as follows. We first give an overview of the aforementioned three packet forwarding problems and their related solutions in Chapter 2. In Chapter 3, we detail our SRAM-based pipeline architectures for terabit IP lookup. In Chapter 4, we present data-structure- and architecture-level optimizations to lower the power consumption of IP lookup engines. In Chapter 5, we discuss our design for high performance packet classification on FPGA. In Chapter 6, we present our solutions for scalable flexible flow matching. Finally, in Chapter 7, we summarize our work and discuss areas for future study.

Chapter 2
Packet Forwarding Approaches

A plethora of research has been done on the subject of packet forwarding problems [70, 82]. As packet forwarding itself evolves from basic IP lookup to flexible flow matching, research interests are being renewed. This chapter reviews the existing packet forwarding approaches, including software and hardware solutions.

2.1 Software Approaches

2.1.1 Trie-based IP Lookup

The nature of IP lookup is longest prefix matching (LPM). The most common data structure in algorithmic solutions for performing LPM is some form of trie [70]. A trie is a binary tree, where a prefix is represented by a node. The value of the prefix corresponds to the path from the root of the tree to the node representing the prefix. The branching decisions are made based on the consecutive bits in the prefix. If only one bit is used to make a branching decision at a time, then the trie is called a uni-bit trie. The prefix set in Figure 2.1 (a) corresponds to the uni-bit trie in Figure 2.1 (b). For example, the prefix "010*" corresponds to the path that starts at the root and ends in node P3: first a left-turn (0), then a right-turn (1), and finally a turn to the left (0). Each trie node contains two fields: the represented prefix and the pointer to the child nodes. By using the optimization called leaf-pushing [80], each node needs only one field: either the pointer to the next-hop address or the pointer to the child nodes. Figure 2.1 (c) shows the leaf-pushed uni-bit trie that is derived from Figure 2.1 (b).

[Figure 2.1 contents: the example prefix set (01001*, 010*, 000*, 0*, 111*, 1100*, 011*, 01011*, labeled P1-P8) and the corresponding uni-bit, leaf-pushed, and multi-bit trie diagrams.]
Figure 2.1: (a) Prefix set; (b) Uni-bit trie; (c) Leaf-pushed trie; (d) Multi-bit trie.

Given a leaf-pushed uni-bit trie, IP lookup is performed by traversing the trie according to the bits in the IP address. When a leaf is reached, the prefix associated with the leaf is the longest matched prefix for that IP address. The corresponding next-hop information of that prefix is then retrieved.
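To make this lookup procedure concrete, here is a minimal sketch of traversing a leaf-pushed uni-bit trie bit by bit. The node layout and names are illustrative assumptions rather than the thesis's implementation.

```python
class TrieNode:
    """Node of a leaf-pushed uni-bit trie: a leaf stores a next hop,
    an internal node stores two children (left = bit 0, right = bit 1)."""
    def __init__(self, next_hop=None, left=None, right=None):
        self.next_hop = next_hop   # set only for leaves
        self.left = left
        self.right = right

    def is_leaf(self):
        return self.left is None and self.right is None


def lookup(root, ip, width=32):
    """Longest prefix match on a leaf-pushed uni-bit trie.

    ip is an integer; bits are consumed from the most significant bit.
    Because the trie is leaf-pushed, the first leaf reached holds the
    next hop of the longest matching prefix (or None for no match)."""
    node = root
    for i in range(width - 1, -1, -1):
        if node.is_leaf():
            break
        bit = (ip >> i) & 1
        node = node.right if bit else node.left
    return node.next_hop
```

For the routing table of Table 1.2, walking the leading bits 1, 1, 0 of address 11010010 reaches the leaf corresponding to 110* and returns next hop P6, as in the earlier example.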
The time to look up a uni-bit trie is equal to the prefix length. The use of multiple bits in one scan can increase the search speed. Such a trie is called a multi-bit trie. The number of bits scanned at a time is called the 16 stride. Figure 2.1 (d) shows the multi-bit trie for the prefix entries in Figure 2.1 (a). The root node uses a stride of 3, while the node that contains P3 uses a stride of 2. Multi-bit tries that use a larger stride usually result in a much larger memory require- ment, while some optimization schemes have been proposed for memory compression [18, 78]. The well-known tree bitmap algorithm [18] uses a pair of bit maps for each node in a multi-bit trie. One bit map represents the children that are actually present and the other represents the next hop information that is associated with the given node. Children of a node are stored in consecutive memory locations, which allows each node to use just a single child pointer. Similarly, another single pointer is used to reference the next hop information that is associated with a node. This representation allows every node in the multi-bit trie to occupy a small amount of memory. 2.1.2 Packet Classification Algorithms Multi-field packet classification can be modeled as a point projection problem, where each rule is represented as a multi-dimensional sub-space and the packet that is being classified is represented as a point. Although this problem has been well studied in the past ten years [23, 87, 22, 74], designing scalable solutions in the context of rapid growth of the network traffic and the number of rules is still a big challenge . A vast number of packet classification algorithms have been published in the past decade. Comprehensive surveys can be found in [23, 82]. Most of those algorithms fall into three categories: decision-tree-based, decomposition-based, and partitioning- based approaches. Decision-tree-based algorithms (such as HyperCuts [74]), take the geometric view of the packet classification problem. Each rule defines a hypercube in ad-dimensional 17 space whered is the number of header fields considered for packet classification. Each packet defines a point in this d-dimensional space. The decision tree construction algorithm employs several heuristics to cut the space recursively into smaller sub- spaces. Each subspace ends up with fewer rules, which to a point allows a low-cost linear search to find the best matching rule. After the decision tree is built, the algo- rithm to look up a packet is very simple. Based on the value of the packet header, the algorithm follows the cutting sequence to locate the target subspace (i.e. a leaf node in the decision tree) and then performs a linear search on the rules in this sub- space. Decision-tree-based algorithms allow incremental rule updates and scale better than decomposition-based algorithms. The outstanding representatives of decision- tree-based packet classification algorithms are HiCuts [22] and its enhanced version HyperCuts [74]. At each node of the decision tree, the search space is cut based on the information from one or more fields in the rule. HiCuts builds a decision tree using local optimization decisions at each node to choose the next dimension to cut and how many cuts to make in the chosen dimension. The HyperCuts algorithm, on the other hand, allows cutting on multiple fields per step, which results in a fatter and shorter decision tree. 
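As described above, a lookup follows the cutting sequence down to a leaf and then linearly searches the small rule list stored there. The sketch below illustrates that traversal for a HyperCuts-style tree; the node fields, the equal-sized-cut indexing, and the rule.matches() helper are illustrative assumptions, not the actual data layout of HiCuts or HyperCuts.

```python
class DTNode:
    """Hypothetical decision-tree node for illustration."""
    def __init__(self, cuts=None, children=None, rules=None):
        self.cuts = cuts          # internal node: list of (field, lo, hi, num_cuts)
        self.children = children  # internal node: child array (one slot per cut combination)
        self.rules = rules        # leaf node: small rule list, assumed in priority order


def classify(root, header):
    """Follow the cutting sequence to a leaf, then linearly search its rules."""
    node = root
    while node.rules is None:                     # internal node
        index, scale = 0, 1
        for field, lo, hi, num_cuts in node.cuts:
            width = (hi - lo + 1) // num_cuts     # equal-sized cuts per field (assumed)
            i = min((header[field] - lo) // width, num_cuts - 1)
            index += i * scale
            scale *= num_cuts
        node = node.children[index]
    for rule in node.rules:                       # low-cost linear search at the leaf
        if rule.matches(header):
            return rule
    return None
```

A HiCuts tree cuts on a single field per node (one entry in cuts), whereas a HyperCuts node may carry several entries, which is what makes its tree fatter and shorter.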
Figure 2.2 shows the example of the HiCuts and the HyperCuts decision trees for a set of 2-field rules that can be represented geometrically. These rules are actually R1-R5 given in Table 1.3, when only the SP and DP fields are considered. In Figure 2.2, (a) the X and Y axes correspond to the SP and DP fields for R1-R5 in Table 1.3; in (b) and (c), a rounded rectangle in yellow denotes an internal tree node, and a rectangle in gray denotes a leaf node.

[Figure 2.2 contents: (a) the rule set R1-R5 in the SP-DP plane; (b) a HiCuts tree cutting one field per node (2 cuts at a time); (c) a HyperCuts tree cutting both fields at the root (X: 2 cuts, Y: 2 cuts).]
Figure 2.2: Example of HiCuts and HyperCuts decision trees.

Decomposition-based algorithms perform independent searches on each field and finally combine the search results from all fields. Such algorithms are desirable for hardware implementation due to their parallel searches on multiple fields. However, substantial storage is usually needed to merge the independent search results to obtain the final result. Thus, decomposition-based algorithms have poor scalability and work well only for small-scale rule sets. Lakshman et al. [45] propose the Parallel Bit Vector (BV) algorithm, which is a decomposition-based algorithm that targets hardware implementation. It performs the parallel lookups on each individual field first. The lookup on each field returns a bit vector where each bit represents a rule. A bit is set if the corresponding rule is matched on this field; a bit is reset if the corresponding rule is not matched on this field. The result of the bitwise AND operation on these bit vectors indicates the set of rules that matches a given packet. The BV algorithm can provide a high throughput at the cost of low memory efficiency. The memory requirement is O(N^2), where N is the number of rules. Taylor et al. [83] introduce Distributed Crossproducting of Field Labels (DCFL), which is also a decomposition-based algorithm that leverages several observations of the structure of real filter sets. They decompose the multi-field searching problem and use independent search engines, which can operate in parallel to find the matching conditions for each filter field. Instead of using bit vectors, DCFL uses a network of efficient aggregation nodes by employing Bloom Filters and by encoding intermediate search results. As a result, the algorithm avoids the exponential increase in time or space that is incurred when performing this operation in a single step. The authors predict that an optimized implementation of DCFL can provide over 100 million packets per second (MPPS) and store over 200K rules in the current generation of FPGA or ASIC, without the need of external memories. However, their prediction is based on the maximum clock frequency of FPGA devices and a logic-intensive approach that uses Bloom Filters. This approach may not be optimal for FPGA implementation, due to long logic paths and large routing delays. Furthermore, the estimated number of rules is based only on the assumption of statistics similar to those of the currently available rule sets.

Both decision-tree-based and decomposition-based algorithms suffer from O(N^D) memory explosion in the worst case, where N denotes the number of rules and D the number of fields in a rule [82, 86]. To reduce memory consumption, some recent work [81, 95, 86] proposes to partition the original rule set into multiple subsets. The rules in each subset are mutually disjoint on one field or multiple fields, so that the search within each subset is simplified.
For example, the Independent Sets algorithm [81] partitions the rule set into many independent sets where one-dimensional search is performed within each independent set. The memory requirement is thus dramat- ically reduced at the cost of more time needed to search multiple subsets. Although the search process on multiple independent sets can be parallelized in hardware, the 20 nondeterministic and large number of independent sets (e.g. varying from 34 to 61 for different rule sets [81]) becomes a major challenge for hardware implementation. 2.1.3 Flexible Flow Matching Next-generation Internet requires processing rich and flexible flow information in the network infrastructure. One such effort is the OpenFlow switch, which manages ex- plicitly the network flows by using a flow table with rich definition as the software- hardware interface [56, 63]. Most of the existing work in developing flexible forward- ing hardware is focused on functionality rather than performance. In the software implementation of the OpenFlow switching, hashing is adopted for matching simple rules while linear search is performed on the complex rules. When the number of complex rules becomes large, using linear search leads to low throughput. Hashing cannot provide deterministic performance, due to potential collision, and is inefficient in handling wildcard or prefix matching [32]. Luo et al. [51] port the software im- plementation of the OpenFlow switching onto multi-core network processors. But the performance improvement is limited. 2.2 Hardware Approaches Due to the increasing throughput requirement, hardware approaches become a neces- sity for high-performance packet forwarding. 2.2.1 TCAM Ternary Content Addressable Memories (TCAMs) are widely deployed in high per- formance network routers for packet classification because of their unmatched lookup 21 throughput and generality. A TCAM is a special memory device that can store ternary bit strings and perform parallel searches on all of its entries simultaneously. In TCAMs, rules are represented as ternary bit strings and stored in decreasing priority order. Given a packet header, the search for the best matching rule with the highest prior- ity is performed on all the entries in parallel. The index of the first matching rule is then used to access a memory to retrieve the associated data for the matching rule. This elegant architecture allows packet classification at very high throughput. A com- mercially available TCAM chip can store more than 100K ternary rules. It can classify 250 million packets per second, which satisfies the throughput demands of all currently existing networks [26]. While TCAMs remain the most popular choice for high perfor- mance packet classification in network routers, the research on algorithmic alternatives for general packet classification continues because of the drawbacks of TCAM devices (as compared with SRAMs in Table 1.6). Most of TCAM-based packet classifica- tion solutions also suffer from range expansion when converting ranges into prefixes [46, 77]. 2.2.2 SRAM-based Pipeline As shown in Table 1.6, SRAMs outperform TCAMs with respect to speed, density, and power consumption. However, traditional SRAM-based solutions, most of which can be regarded as some form of tree traversal, need multiple clock cycles to complete a lookup. For example, trie [70], a tree-like data structure that represents a collection of prefixes, is widely used in SRAM-based IP lookup solutions. 
It needs multiple memory accesses to search a trie in order to find the longest matched prefix for an IP packet. 22 Several researchers have explored pipelining in order to improve the throughput significantly. Taking trie-based solutions as an example, a simple pipelining approach is to map each trie level onto a pipeline stage with its own memory and processing logic. One IP lookup can be performed every clock cycle. However, this approach results in unbalanced trie node distribution over the pipeline stages. Memory imbal- ancing has been identified as a dominant issue for pipelined architectures [21, 7]. In an unbalanced pipeline, the “fattest” stage, which stores the largest number of trie nodes, becomes a bottleneck. It adversely affects the overall performance of the pipeline for the following reasons: First, it needs more time to access the larger local memory. This leads to reduction in the global clock rate. Second, a fat stage results in many updates, due to the proportional relationship between the number of updates and the number of trie nodes stored in that stage. Particularly during the update process caused by inten- sive route insertion, the fattest stage can also result in memory overflow. Furthermore, since it is unclear at hardware design time which stage will be the fattest, memory with the maximum size must be allocated for each stage. This results in memory wastage. Basu et al. [7] and Kim et al. [40] both reduce the memory imbalance by using variable strides to minimize the largest trie level. However, even with their schemes, the size of the memory of different stages can have a large variation. As an improve- ment upon [40], Lu et al. [50] proposes a tree-packing heuristic to further balance the memory, but it does not solve the fundamental problem of how to retrieve one node’s descendants that are not allocated in the following stage. Furthermore, a variable stride multi-bit trie is difficult for hardware implementation especially if incremental updat- ing is needed [7]. Baboescu et al. [5] propose a Ring pipeline architecture for trie-based IP lookup. As shown in Figure 2.3(a), the memory stages are configured in a circular, multi-point access pipeline, so that lookups can be initiated at any stage. The trie is split into 23 many small subtries of equal size. These subtries are then mapped to different stages to create a balanced pipeline. Some subtries have to wrap around if their roots are mapped to the last several stages. Though all IP packets enter the pipeline from the first stage, their lookup processes may be activated at different stages. Hence, all the IP lookup packets must traverse the pipeline twice to complete the trie traversal. The throughput is thus 0.5 lookups per clock cycle. Kumar et al. [44] extend the circular pipeline with a new architecture called the Circular, Adaptive and Monotonic Pipeline (CAMP) shown in Figure 2.3(b). It has multiple entrance and exit points, so that the throughput can be increased at the cost of output disorder and delay variation. It employs several request queues to manage access conflicts between the new request and the one from the preceding stage. It can achieve a worst-case throughput of 0.8 lookups per clock cycle, while maintaining balanced memory across pipeline stages. Due to the non-linear structure, neither the Ring pipeline nor CAMP under worst cases can maintain a throughput of one lookup per clock cycle. 
Also, neither of them properly supports the write bubble proposed in [7] for the incremental route update.

[Figure 2.3 contents: (a) Ring pipeline architecture [5], a 4-stage pipeline with data paths active during odd and even cycles, where the indexed bits (00*, 01*, 10*, 11*) select the starting stage; (b) CAMP architecture [44], a 4-stage circular pipeline with a lookup table for the initial bits, per-stage request queues, and an optional reordering buffer.]
Figure 2.3: Ring pipeline and CAMP

2.2.3 FPGA

Field Programmable Gate Arrays (FPGAs) provide a fabric upon which applications can be built. FPGAs, in particular SRAM-based FPGAs from Xilinx or Altera, are based on lookup tables, flip-flops, and multiplexers. In these devices, an SRAM bank serves as a configuration memory that controls all of the functionality of the device, from the logic implemented to the signaling standards of the IO pins. The values in the lookup tables can produce any combinational logic functionality that is necessary, the flip-flops provide integrated state elements, and the SRAM-controlled routing directs logic values into the appropriate paths to produce the desired architecture. The device is composed of many thousands of basic logic cells that include the basic logic elements and, based on the device variety, includes fast ASIC multipliers, Ethernet MACs, local RAMs, and clock managers. FPGAs started out as prototyping devices, which allow for convenient development of glue-logic-type applications for connecting ASIC components without high VLSI design costs or large numbers of discrete standard logic gates. As the gate density of FPGA devices increased and application-specific ASIC blocks were added, the applications shifted from glue logic to a wide variety of solutions for signal processing and network problems. The devices have been deployed in the field as the final but still flexible solution. Because the device is controlled by the state of the SRAM bits, the functionality can be changed by changing the memory state. This can be useful, since logic can be customized for a particular set of input data.

By combining TCAMs and the BV algorithm, Song et al. [77] present an architecture called BV-TCAM for multi-match packet classification. A TCAM performs prefix or exact match, while a multi-bit trie implemented in Tree Bitmap [18] is used for source or destination port lookup. The authors never report the actual FPGA implementation results, though they claim that the whole circuit for 222 rules consumes less than 10% of the available logic and fewer than 20% of the available Block RAMs of a Xilinx XCV2000E FPGA. They also predict the design after pipelining can achieve 10 Gbps throughput when implemented on advanced FPGAs. Jedhe et al. [27] realize the DCFL architecture in their complete firewall implementation on a Xilinx Virtex 2 Pro FPGA, using a memory-intensive approach (not the logic-intensive one) so that on-the-fly update is feasible. They achieve a throughput of 50 MPPS for a rule set of 128 entries. They also predict that the throughput can be 24 Gbps when the design is implemented on Virtex-5 FPGAs. Papaefstathiou et al. [65] propose a memory-efficient decomposition-based packet classification algorithm, which uses multi-level Bloom Filters to combine the search results from all fields. Their FPGA implementation, called 2sBFCE [60], shows that the design can support 4K rules in 178 Kbytes of memory.
However, the design takes 26 clock cycles on average to classify a packet, which results in low throughput of 1.875 Gbps on average. Note that both DCFL [83] and 2sBFCE [60] may suffer from false positives due to their use of Bloom Filters, as discussed earlier. Two recent works [52, 39] discuss several issues on implementing decision-tree- based packet classification algorithms on FPGA, with different motivations. Luo et al. [52] propose a method called explicit range search to allow more cuts per node than the HyperCuts algorithm. The tree height is dramatically reduced at the cost of increased 26 memory consumption. At each internal node, a varying number of memory accesses may be needed to determine which child node to traverse, which may be infeasible for pipelining. Since the authors do not implement their design on FPGA, the actual performance results are unclear. To achieve power efficiency, Kennedy et al. [39] implement a simplified HyperCuts algorithm on an Altera Cyclone 3 FPGA. They store hundreds of rules in each leaf node and match them in parallel, which results in low clock frequency (such as 32 MHz reported in [39]). Their claim about supporting the OC-768 rate is based on the average cases that are tested using several specific traffic traces. Since the search in the decision tree is not pipelined, their implementation can sustain only 0.47 Gbps in the worst cases where it takes 23 clock cycles to classify a packet for the rule set FW 20K. Naous et al. [58] implement the OpenFlow switch on NetFPGA, which is a Xil- inx Virtex-2 Pro 50 FPGA board that is tailored for network applications. They use hashing for simple rules and inherit the drawbacks of hashing-based schemes. A small TCAM is implemented on FPGA for complex rules. Due to the high cost to implement TCAM on FPGA, their design can support only a few tens of complex rules. Though it is possible to use external TCAMs for large rule tables, high power consumption of TCAMs remains a big challenge. 27 Chapter 3 Pipeline Architectures for High-Throughput IP Lookup This chapter details our solutions to address the challenges of mapping trie-based IP lookup algorithms onto SRAM-based pipeline architectures. These challenges include: Memory balancing within a pipeline Memory balancing among multiple pipelines Traffic balancing among multiple pipelines Sequence preserving 3.1 Memory-Balanced Linear Pipelines 3.1.1 Fine-Grained Mapping Our first work [28] proposed a linear pipeline architecture, named OLP, with a fine- grained node-to-stage mapping scheme to distribute nodes of a leaf-pushed uni-bit trie evenly to different pipeline stages. By adding nops (no-operations) in the pipeline, OLP offers more freedom in mapping tree nodes to pipeline stages. The tree is par- titioned, and all subtrees are converted into queues and are mapped onto the pipeline 28 root 0 0 1 0 1 0 00 01 11 Stage 1 Stage 2 Stage 3 Stage 4 NOP Q1 Q2 Q3 P2 P6 P4 P5 P7 P8 P2 P1 (a) Trie Partition 00 0 0 1 0 1 1 1 1 0 e c d f g h root 1 0 0 P2 P1 P6 P7 P8 P4 P5 P3 P3 null 01 10 11 c f g h d e P3 d f 10 null 1 c P7 1 P6 P8 e 0 P3 (b) Subtrie-to-Queue Conversion (c) Node-to-Stage Mapping Stage 5 P4 P5 0 1 1 1 g P1 1 h Figure 3.1: OLP architecture. from the first stage. It can achieve a high throughput of one lookup per clock cycle and support write bubbles [7] for incremental updates without disrupting router operations. 
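A minimal sketch, under assumed class and field names, of the leaf-pushed uni-bit trie that OLP distributes over the pipeline: prefixes are inserted into a binary trie, internal nodes are completed to have two children, and next-hop information is pushed down so that it is stored only in leaf nodes. The prefix set is hypothetical.

```python
# Leaf-pushed uni-bit trie sketch (illustrative names, hypothetical prefixes).
class Node:
    def __init__(self):
        self.child = {}            # '0'/'1' -> Node
        self.next_hop = None

def insert(root, prefix, nh):
    n = root
    for b in prefix:
        n = n.child.setdefault(b, Node())
    n.next_hop = nh

def leaf_push(n, inherited=None):
    nh = n.next_hop if n.next_hop is not None else inherited
    if not n.child:
        n.next_hop = nh            # leaf keeps the longest matching prefix seen so far
        return
    for b in "01":                 # complete internal nodes so every leaf is reachable
        n.child.setdefault(b, Node())
    n.next_hop = None              # internal nodes no longer carry next-hop info
    for c in n.child.values():
        leaf_push(c, nh)

root = Node()
for nh, p in enumerate(["0", "000", "010", "0110", "11"]):
    insert(root, p, nh)
leaf_push(root)
```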
3.1.1.1 Trie Partitioning

Similar to the Ring pipeline and CAMP, we use prefix expansion to split the original trie (shown in Figure 2.1 (c)) into multiple subtries, as shown in Figure 3.1 (a). Several initial bits are used as the index to partition the trie into many disjoint subtries. The number of initial bits used is called the initial stride and is denoted as I. Since the first stage consists of all of the subtries' roots, it cannot be balanced by moving nodes or inserting nops. Adjusting the initial stride becomes the only way to balance the memory requirement of the first stage with that of the other stages. A larger I can result in more small subtries, which can help balance the memory distribution. However, prefix expansion may result in prefix duplication, where a prefix is copied to multiple subtries. Hence, a large I can result in many non-disjoint subtries. For example, if we use I = 4 to expand the prefixes in Figure 2.1 (a), then the prefix P3, whose length is 3, is copied to two subtries: the subtrie with the initial bits "0100" contains the prefixes P3 and P4, and the subtrie with "0101" contains the prefixes P3 and P5. Prefix duplication results in memory inefficiency and may increase the update cost.

Table 3.1: Representative routing tables
Routing table        Location    Date      # of prefixes   # of prefixes w/ length < 16
RIPE NCC (rrc00)     Amsterdam   20071130  243474          1949 (0.80%)
LINX (rrc01)         London      20071130  240797          1945 (0.81%)
SFINX (rrc02)        Paris       20071130  238089          1941 (0.82%)
AMS-IX (rrc03)       Amsterdam   20071130  246530          1950 (0.79%)
CIXP (rrc04)         Geneva      20071130  240180          1948 (0.81%)
VIX (rrc05)          Vienna      20071130  241948          1968 (0.81%)
JPIX (rrc06)         Otemachi    20071130  239332          1926 (0.80%)
NETNOD (rrc07)       Stockholm   20071130  248856          1943 (0.78%)
MAE-WEST (rrc08)     San Jose    20040901  83556           495 (0.59%)
TIX (rrc09)          Zurich      20040201  132786          991 (0.75%)
MIX (rrc10)          Milan       20071130  236991          1939 (0.82%)
NYIIX (rrc11)        New York    20071130  238836          1952 (0.82%)
DE-CIX (rrc12)       Frankfurt   20071130  243731          1999 (0.82%)
MSK-IX (rrc13)       Moscow      20071130  238461          1942 (0.81%)
PAIX (rrc14)         Palo Alto   20071130  243731          1949 (0.80%)
PTTMetro-SP (rrc15)  Sao Paulo   20071130  243242          1946 (0.80%)

We study the prefix length distribution based on four representative routing tables collected from [69]: rrc00, rrc01, rrc08, and rrc11. Their information is listed in Table 3.1. We obtain results similar to [6]: few prefixes are shorter than 16. Hence, using an I of less than 16 should not result in much duplication of prefixes. To find an appropriate I, we consider various values of I to partition the above four routing tables and examine the prefix expansion ratio (PER), which is defined in Equation (3.1):

PER = \frac{\sum_{i=1}^{K} PrefixCount(T_i)}{PrefixCount(T_o)}    (3.1)

where K denotes the number of subtries after partitioning, PrefixCount(T_i) the number of prefixes in the i-th subtrie, and PrefixCount(T_o) the number of prefixes in the original trie. Figure 3.2 shows the prefix expansion ratio for various values of I. Using an I of less than 12 results in little prefix duplication.

Figure 3.2: Prefix expansion ratio.
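A minimal sketch of the prefix expansion ratio of Equation (3.1): a prefix shorter than the initial stride I is duplicated into every I-bit index it covers, so it is counted 2^(I - length) times. The prefix set below is hypothetical.

```python
# Prefix expansion ratio sketch (hypothetical prefixes given as bit strings).
def prefix_expansion_ratio(prefixes, I):
    expanded = 0
    for p in prefixes:
        if len(p) >= I:
            expanded += 1                      # falls into exactly one subtrie
        else:
            expanded += 2 ** (I - len(p))      # duplicated into 2^(I-len) subtries
    return expanded / len(prefixes)

prefixes = ["0", "010", "0100", "01001", "0101", "110", "111"]
for I in (2, 3, 4):
    print(I, prefix_expansion_ratio(prefixes, I))
```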
3.1.1.2 Algorithm and Architecture

We define the height of a trie node to be the maximum distance from it to a leaf node. Using the following notations, the problem of memory balancing within a pipeline can be formulated as (3.2) with the constraint (3.3).

H denotes the number of pipeline stages.
M_i denotes the number of nodes mapped to the i-th stage.
T denotes a subtrie, and S_p the set of subtries that are assigned to the pipeline.
size(.) denotes the size, i.e., the number of nodes, of a subtrie.
R_n denotes the number of remaining nodes to be mapped onto stages; R_h denotes the number of remaining stages onto which the remaining nodes will be mapped.

\min \max_{i=1,2,\dots,H} M_i    (3.2)

\sum_{i=1}^{H} M_i = \sum_{T \in S_p} size(T)    (3.3)

Solving the above programming problem and obtaining the optimum value of M_i = \sum_{T \in S_p} size(T) / H is not difficult. However, since our architecture requires that the pipeline be linear, the following constraint must be met.

Constraint 1. If node A is an ancestor of node B in a subtrie, then A must be mapped to a stage preceding the stage to which B is mapped.

We use a simple heuristic to perform the node-to-stage mapping. As Figure 3.1 shows, by supporting nops, we allow the nodes on the same level of a subtrie to be mapped onto different pipeline stages. This heuristic provides more flexibility for mapping the trie nodes and helps achieve a balanced node distribution across the stages. We manage two lists: ReadyList and NextReadyList. The former stores the nodes that are available for filling the current stage, while the latter stores the nodes for filling the next stage. Since Stage 1 is dedicated to the subtries' roots, we start by filling the children of the roots into Stage 2. When filling a stage, the nodes in ReadyList are popped out and filled into the stage in the decreasing order of their heights. If a node is filled, its children are pushed into the NextReadyList. When a stage is full or ReadyList becomes empty, we move on to the next stage. At that time, the NextReadyList is merged into ReadyList. By this means, Constraint 1 is met. The complete algorithm is shown in Figure 3.3.

Input: S_p: the set of subtries assigned to the pipeline.
Output: H stages with mapped nodes.
1: Create and initialize two lists: ReadyList = ∅ and NextReadyList = ∅.
2: R_n = \sum_{T \in S_p} size(T); R_h = H.
3: Fill the roots of the subtries into Stage 1.
4: Push the children of the filled nodes into ReadyList.
5: R_n = R_n − M_1, R_h = R_h − 1.
6: for i = 2 to H do
7:   M_i = 0.
8:   Sort the nodes in ReadyList in the decreasing order of the node height.
9:   while M_i < R_n / R_h and ReadyList ≠ ∅ do
10:    Pop a node from ReadyList and fill it into Stage i. The popped node's children are pushed into NextReadyList.
11:  end while
12:  R_n = R_n − M_i, R_h = R_h − 1.
13:  Merge the NextReadyList into the ReadyList.
14: end for
15: if R_n > 0 then
16:   Return Failure.
17: else
18:   Return Success.
19: end if
Figure 3.3: Algorithm: node-to-stage mapping in OLP.

To allow two nodes on the same subtrie level to be mapped to different stages, we must implement the NOP (no-operation) in the pipeline. Our method is simple. Each node that is stored in the local memory of a pipeline stage has two fields: the memory address of its child node in the pipeline stage where the child node is stored, and the distance to that stage. When a packet passes through the pipeline, the distance value is decremented by 1 at each stage. When the distance value becomes 0, the child node's address is used to access the memory in that stage. In the proposed architecture, a lookup can be performed in each clock cycle. The delay for each lookup is constant and measured as the number of clock cycles, which is equal to the number of pipeline stages.
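A minimal sketch of the fine-grained node-to-stage mapping of Figure 3.3 (the ReadyList/NextReadyList heuristic). Node objects are assumed to expose .children (a list) and a precomputed .height; these names are illustrative, not from the dissertation.

```python
# OLP node-to-stage mapping sketch (assumed node attributes: .children, .height).
def map_nodes_to_stages(subtrie_roots, H):
    total = sum(subtrie_size(r) for r in subtrie_roots)
    stages = [list(subtrie_roots)]             # Stage 1 holds all subtrie roots
    ready = [c for r in subtrie_roots for c in r.children]
    remaining_nodes = total - len(subtrie_roots)
    remaining_stages = H - 1
    next_ready = []
    for _ in range(2, H + 1):
        stage = []
        quota = remaining_nodes / max(remaining_stages, 1)
        ready.sort(key=lambda n: n.height, reverse=True)
        while ready and len(stage) < quota:
            node = ready.pop(0)
            stage.append(node)
            next_ready.extend(node.children)   # children become ready for later stages
        stages.append(stage)
        remaining_nodes -= len(stage)
        remaining_stages -= 1
        ready.extend(next_ready)
        next_ready = []
    return stages if remaining_nodes == 0 else None   # None mirrors "Return Failure"

def subtrie_size(root):
    return 1 + sum(subtrie_size(c) for c in root.children)
```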
The memory requirement is proportional to the total number of subtrie nodes. Since OLP has the same entry point and unique exit point, it keeps the output sequence in the same order as input. As a linear pipeline architecture, OLP can use the same update scheme as proposed in [7]. By inserting write bubble, the pipeline memory can be updated without disrupting the on-going operations. 3.1.1.3 Results We conducted the experiments on the four representative routing tablesrrc00,rrc01, rrc08 and rrc11 collected from [69]. We set I = 12. Figure 3.4 shows the node distribution across the stages in the pipeline. Except for the first several stages, all of the stages have almost equal numbers of trie nodes. 3.1.2 Bidirectional Mapping In OLP, the first several stages may not be balanced, since the top levels of a trie have few nodes. But there are many nodes at the leaf level. Hence, we can invert some subtries so that their leaf nodes are mapped onto the first several stages. Accordingly, we proposed a bidirectional linear pipeline architecture in [47], with a mapping scheme as shown in Figure 3.5. 34 0 5 10 15 20 25 0 0.5 1 1.5 2 2.5 3 x 10 4 Pipeline stage ID # of tree nodes Node distribution over stages rrc00 rrc01 rrc08 rrc11 Figure 3.4: Node distribution after fine-grained mapping. 0 1 0 Stage 1 Stage 2 Stage 3 (a) Partition (c) Node-to-Stage Mapping P6 00 0 0 1 0 1 1 1 1 0 e c d f g h root 1 0 0 P2 P1 P6 P7 P8 P4 P5 P3 P3 null 01 10 11 Stage 4 P2 1 f P8 0 1 0 1 h (b) Invert 0 1 1 1 1 0 d f g h 0 0 P6 P4 P5 P3 P3 0 c 1 P2 P1 0 1 e P7 P8 P6 P5 P4 P3 g P5 d 0 c P1 P7 e Figure 3.5: Bidirectional fine-grained mapping for the trie in Figure 2.1. 3.1.2.1 Algorithms and Architecture Several heuristics are proposed to select the subtries that are to be inverted: 1. Largest leaf : The subtrie with the most leaves is preferred. This heuristic is straightforward, since we need enough nodes to be mapped onto the first several stages: 35 2. Least height: The subtrie of shortest height is preferred. Due to Constraint 1, a subtrie with a larger height has less flexibility for being mapped onto pipeline stages. 3. Largest leaf per height: This heuristic is a combination of the previous two, by dividing the number of leaves of a subtrie by its height. 4. Least average depth per leaf : Average depth per leaf is the ratio of the sum of the depth of all the leaves to the number of leaves. This heuristic prefers a more balanced subtrie. A balanced subtrie has many nodes not only at the leaf level but also at the lower levels, which can help balance not only the first stage but also the first several stages. Algorithm 3.6 finds the subtries to be inverted, whereIFR denotes the inversion factor. A larger inversion factor results in more subtries to be inverted. When the inversion factor is 0, no subtrie is inverted. When the inversion factor is close to the pipeline depth, all subtries are inverted. The complexity of this algorithm is O(K), whereK denotes the total number of subtries. Input: K subtries. Output: V subtries to be inverted. 1: N = total# of trie nodes of all subtries,H = # of pipeline stages,V = 0. 2: whileV <K <IFRdN=He do 3: Based on the chosen heuristic, select one subtrie from those not inverted. 4: V =V +1,K =K1+# of leaves of the selected subtrie. 5: end while Figure 3.6: Algorithm: Selecting the subtrie to be inverted Now, we have two sets of subtries. Those subtries that are mapped from roots are called the forward subtries, while the others are called the reverse subtries. 
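A minimal sketch of the subtrie-inversion selection of Figure 3.6. The stopping rule below (keep inverting until the projected first-stage load reaches IFR * ceil(N / H)) is one reading of that listing, and heuristic() stands for any of the four selection heuristics described above; both are illustrative assumptions.

```python
# Subtrie-inversion selection sketch (stopping rule and field names are assumptions).
from math import ceil

def select_inversions(subtries, H, IFR, heuristic):
    N = sum(s["size"] for s in subtries)        # total trie nodes across subtries
    inverted = []
    first_stage_load = len(subtries)            # one root per (forward) subtrie
    candidates = list(subtries)
    while candidates and first_stage_load < IFR * ceil(N / H):
        s = max(candidates, key=heuristic)      # pick the preferred subtrie
        candidates.remove(s)
        inverted.append(s)
        first_stage_load += s["leaves"] - 1     # root leaves stage 1, its leaves enter it
    return inverted

# Example heuristic: prefer the subtrie with the most leaves ("largest leaf").
largest_leaf = lambda s: s["leaves"]
```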
We use a bidirectional fine-grained mapping algorithm (Algorithm 3.7). The nodes are popped 36 out of the ReadyList in the decreasing order of their priority. The priority of a trie node is defined as its height if the node is in a forward subtrie, and its depth if in a reverse subtrie. The node whose priority is equal to the number of the remaining stages is regarded as a critical node. For the forward subtries, a node is pushed into theNextReadyList immediately after its parent is popped. For the reverse subtries, a node is not pushed into theNextReadyList until all of its children are popped. The complexity of this mapping algorithm isO(HN) whereH denotes the pipeline depth andN the total number of nodes. Input: K forward subtries. Input: V reverse subtries. Output: H stages with mapped nodes. 1: Create and initialize two lists:ReadyList = andNextReadyList =. 2: R n = # of remaining nodes,R h = # of remaining stages =H. 3: Push the roots of the forward subtries and the leaves of the reverse subtries into ReadyList. 4: fori = 1 toH do 5: M i = 0,Critical =FALSE. 6: Sort the nodes inReadyList in the decreasing order of the node priority. 7: whileCritical =TRUE or (M i <dR n =R h e andReadylist6=) do 8: Pop node fromReadyList and map into Stagei. 9: if The node is in forward subtries then 10: The popped node’s children are pushed intoNextReadyList. 11: else if All children of the popped node’s parent have been mapped then 12: The popped node’s parent is pushed intoNextReadyList. 13: end if 14: Critical =FALSE. 15: if There exists a nodeN c 2 ReadyList and the priority ofN c >= R h 1 then 16: Critical =TRUE. 17: end if 18: end while 19: R n =R n M i ,R h =R h 1. 20: Merge theNextReadyList to theReadyList. 21: end for Figure 3.7: Algorithm: Bidirectional fine-grained mapping 37 To enable the bidirectional fine-grained mapping scheme, we develop a bidirec- tional linear pipeline architecture based on dual-port SRAMs 1 , as shown in Figure 3.8. Packet 1 Pipeline Direction Index Table (DIT) Pointers to Search Results Dual-Port SRAM Distance Checker M U X M U X 0 From previous stage To next stage Distance Checker M U X M U X 0 From previous stage To next stage Figure 3.8: Block diagram of the basic architecture of BiOLP. One Direction Index Table (DIT) stores the relationship between the subtrees and their mapping directions: forward or reverse. For any arriving packet p, the initial bits of its IP address are used to lookup the DIT and retrieve information about its corresponding subtreeST(p). The information includes the distance to the stage where the root of ST(p) is stored, the memory address of the root of ST(p) in that stage, and the mapping direction ofST(p) that leads the packet to different entrance of the pipeline. For example, in Figure 3.8, if the mapping direction is forward, then the packet is sent to the leftmost stage of the pipeline. Otherwise, the packet is sent to the rightmost stage. Once its direction is known, the packet will go through the entire pipeline in that direction. The pipeline is configured as a dual-entrance bidirectional linear pipeline. At each stage, the memory has dual Read/Write ports, so that the packets from both directions can access the memory simultaneously. The content of each entry in the memory includes the memory address of the child node and the distance to the stage where the child node is stored. If the distance value is zero, the memory address of 1 Dual-port SRAMs have been standard components in many devices such as FPGAs [89]. 
38 its child node will be used to index the memory in the next stage to retrieve the child node content. Otherwise, the packet will pass that stage without any operation but decrement its distance value by one. We update the memory in the pipeline by inserting write bubbles [7]. The new content of the memory is computed offline. When an update is initiated, a write bubble is inserted into the pipeline. The direction of write bubble insertion is determined by the direction of the subtree that the write bubble is going to update. Each write bubble is assigned an ID. There is one write bubble table in each stage. It stores the update information associated with the write bubble ID. When it arrives at the stage prior to the stage to be updated, the write bubble uses its ID to lookup the write bubble table. Then, it retrieves the memory address to be updated in the next stage, the new content for that memory location, and a write enable bit. If the write enable bit is set, the write bubble will use the new content to update the memory location in the next stage. Since the subtrees mapped onto the two directions are disjoint, a write bubble in- serted from one direction will not contaminate the memory content for the search from the other direction. Also, since the pipeline is linear, all packets preceding or following the write bubble can perform their searches while the write bubble performs an update. 3.1.2.2 Results We conducted the experiments on the four representative routing tablesrrc00,rrc01, rrc08, and rrc11 that are collected from [69]. We used various inversion heuristics and inversion factor to evaluate their impacts. In these experiments, the number of initial bits used for partitioning the trie is 12. The value of the inversion factor is set to 1. According to Figure 3.9, the least average depth per leaf heuristic has the best performance. It shows that, when we have a choice, a balanced subtree should be inverted. This can be explained that a 39 0 5 10 15 20 25 0 0.5 1 1.5 2 2.5 x 10 4 Pipeline stage ID # of tree nodes Node distribution over stages rrc00 rrc01 rrc08 rrc11 (a) Largest leaf 0 5 10 15 20 25 0.4 0.6 0.8 1 1.2 1.4 1.6 1.8 2 2.2 2.4 x 10 4 Pipeline stage ID # of tree nodes Node distribution over stages rrc00 rrc01 rrc08 rrc11 (b) Least height 0 5 10 15 20 25 0.5 1 1.5 2 2.5 3 3.5 x 10 4 Pipeline stage ID # of tree nodes Node distribution over stages rrc00 rrc01 rrc08 rrc11 (c) Largest leaf per height 0 5 10 15 20 25 0.4 0.6 0.8 1 1.2 1.4 1.6 1.8 2 2.2 2.4 x 10 4 Pipeline stage ID # of tree nodes Node distribution over stages rrc00 rrc01 rrc08 rrc11 (d) Least average depth per leaf Figure 3.9: Bidirectional fine-grained mapping with different heuristics. (Inversion factor = 1) balanced subtree has many nodes not only at the leaf level but also at the lower levels, which can help balance not only the first stage but also the first several stages. Using the largest leaf heuristic, we changed the value of the inversion factor. The results are shown in Figure 3.10. When the inversion factor is 0, the bidirectional mapping becomes fine-grained forward mapping only. The mapping turns to be fine- grained reverse mapping when the inversion factor is close to the pipeline depth so that all subtrees are inverted. 
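Before moving on, a minimal sketch of the write-bubble update described earlier in this section: each stage keeps a small write-bubble table indexed by bubble ID, and when the bubble reaches the stage before the one to be updated, it fetches (address, new content, write enable) and applies the write in the next stage. The data-structure names are illustrative.

```python
# Write-bubble update sketch (illustrative structures).
def propagate_write_bubble(bubble_id, stage_memories, bubble_tables):
    # stage_memories[i]: local SRAM of stage i, modeled as a dict address -> node content.
    # bubble_tables[i]: maps bubble_id -> (next_stage_addr, new_content, write_enable).
    for i in range(len(stage_memories) - 1):
        entry = bubble_tables[i].get(bubble_id)
        if entry is None:
            continue                            # bubble passes this stage untouched
        addr, content, write_enable = entry
        if write_enable:
            stage_memories[i + 1][addr] = content   # update the memory of the next stage
```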
Figure 3.10: Bidirectional fine-grained mapping with various inversion factors (largest leaf heuristic).

3.2 Parallel Multi-Pipeline Architectures

The improvement in memory access speed is rather limited. Thus, it becomes necessary to employ multiple pipelines that can operate concurrently to speed IP lookup. Memory and traffic balancing among multiple pipelines become new problems. Similar to the above analysis of how the fattest stage affects the global performance of a pipeline, the pipeline which stores the largest number of trie nodes becomes a performance bottleneck of the multi-pipeline architecture. Unlike TCAM-based solutions, memory balancing is the primary challenge for SRAM-based pipeline solutions. On the other hand, similar to TCAM-based solutions, traffic balancing is needed to achieve multiplicative throughput improvement. Previous works on parallel TCAM-based IP lookup engines use either a learning algorithm to predict the future behavior of incoming traffic based on its current distribution [64, 94] or IP/prefix caching to utilize the locality of Internet traffic [49, 2].

Figure 3.11: Node distribution over subtries.

3.2.1 Memory Balancing Among Pipelines

The trie partitioning scheme in Section 3.1.1.1 may result in many subtries of various sizes. For example, we use I = 8 to partition the tries corresponding to the four routing tables shown in Table 3.1. We obtain the trie node distribution over the resulting subtries, as shown in Figure 3.11. The problem now is to map those subtries to multiple pipelines while keeping the memory requirement of the pipelines balanced. To formulate the problem, we use the following notations:

K denotes the number of subtries.
P denotes the number of pipelines.
T_i denotes the i-th subtrie, i = 1, 2, ..., K.
S_i denotes the set of subtries contained by the i-th pipeline, i = 1, 2, ..., P.
size(.) denotes the size, i.e., the number of nodes, of a subtrie or a set of subtries.

We seek to assign each subtrie to a pipeline so that all pipelines have an equal number of trie nodes. Hence, the problem can be formulated as Equation (3.4):

\min \max_{i=1,2,\dots,P} size(S_i)    (3.4)

with constraint (3.5):

\bigcup_{i=1,2,\dots,P} S_i = \{T_j \mid j = 1, 2, \dots, K\}    (3.5)

3.2.1.1 Approximation Algorithm

The above optimization problem is NP-complete, which can be proved by a reduction from the partition problem [41]. We use an approximation algorithm to solve it, as shown in Figure 3.12. According to [41], in the worst case, the resulting largest pipeline may have 1.5 times the number of nodes as in the optimal mapping.

Input: K subtries {T_i | i = 1, 2, ..., K}; and P empty pipelines.
Output: P pipelines, each of which contains a set of subtries S_i, i = 1, 2, ..., P.
1: Set S_i = ∅ for all pipelines, i = 1, 2, ..., P.
2: Sort {T_i} in the decreasing order of size(T_i), i = 1, 2, ..., K.
3: Assume that size(T_1) ≥ size(T_2) ≥ ... ≥ size(T_K).
4: for i = 1 to K do
5:   Find S_m such that size(S_m) = min_{j=1,2,...,P} size(S_j).
6:   Assign T_i to the m-th pipeline: S_m ← S_m ∪ T_i.
7: end for
Figure 3.12: Algorithm: Subtrie-to-pipeline mapping.
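A minimal sketch of the greedy mapping of Figure 3.12: subtries are processed in decreasing order of size and each is assigned to the currently least-loaded pipeline. The subtrie sizes in the example are hypothetical.

```python
# Subtrie-to-pipeline mapping sketch (largest-first, least-loaded assignment).
def map_subtries_to_pipelines(subtrie_sizes, P):
    loads = [0] * P
    assignment = [[] for _ in range(P)]
    for idx, size in sorted(enumerate(subtrie_sizes), key=lambda x: -x[1]):
        m = loads.index(min(loads))        # least-loaded pipeline
        assignment[m].append(idx)
        loads[m] += size
    return assignment, loads

assignment, loads = map_subtries_to_pipelines([900, 700, 650, 300, 280, 120, 90, 40], 3)
print(loads)   # nearly balanced node counts across the 3 pipelines
```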
To verify the effectiveness of the above algorithm, we executed it on the 16 routing tables given in Table 3.1. In these experiments, we set P = 8 and obtained the resulting size of each pipeline, as shown in Figure 3.13. For all 16 routing tables, our algorithm resulted in a nearly balanced memory distribution among the eight pipelines.

Figure 3.13: Node distribution over 8 pipelines (using the approximation algorithm).

3.2.1.2 Partitioning with Small TCAM

In the worst case, the number of memory accesses needed for an IP lookup is equal to the trie height. Since the pipeline depth is determined by the worst-case number of memory accesses, it is necessary to bound the trie height for all IP addresses while minimizing the overhead. We propose a holistic scheme to partition a full routing trie into many height-bounded subtries and map those subtries onto multiple pipelines so that each pipeline contains an equal number of trie nodes. Our scheme consists of two phases: prefix expansion and height-bounded split, as illustrated in Figure 3.14 (a) and (b), where the trie shown in Figure 2.1 (b) is partitioned and mapped onto 2 pipelines. The value of the height bound is B_H = 3.

Figure 3.14: Height-bounded partitioning and mapping.

We first sort the subtries obtained from prefix expansion in decreasing order of size. Then, we traverse those subtries one by one. Each subtrie is traversed in post-order. Once the height bound, denoted B_H, is reached, a new subtrie is split off. After the number of nodes mapped onto a pipeline exceeds the size bound of the pipeline, B_P, we map the rest of the nodes onto the next pipeline. Algorithm 3.15 shows a recursive implementation of the height-bounded split, where size(P_i) denotes the number of nodes mapped onto the i-th pipeline.

Input: n: a node;
Input: i: the ID of the pipeline for n to be mapped on.
Output: i: the ID of the pipeline for the next node to be mapped on.
1: if n == null then
2:   Return i.
3: end if
4: i = HBS(n.left_child, i)
5: i = HBS(n.right_child, i)
6: if size(P_i) < B_P then
7:   Map n onto P_i.
8: else
9:   Map n onto P_{i+1}; i = i + 1.
10: end if
11: if n.height >= B_H then
12:   Mark n as a subtrie root.
13: end if
14: Return i.
Figure 3.15: Algorithm: Height-bounded split: HBS(n, i)

After height-bounded splitting, the resulting subtries may be rooted at different depths of the original trie. For the subtries rooted at depth I, we use an index SRAM (called SRAM A). For the rest of the subtries, we need an index TCAM and an index SRAM (called SRAM B). The TCAM stores the prefixes that represent the subtries.
The two index SRAMs store the information associated with each subtrie: the mapped pipeline ID, the ID of the stage where the subtrie’s root is stored, and the address of the subtrie’s root in that stage. Figure 3.16 shows the index table that is used to achieve the mapping shown in Figure 3.14. An arriving input IP searches the index SRAM A and the index TCAM in parallel. I initial bits of the input IP are used to index the SRAM A. Meanwhile, the entire input IP searches the index TCAM and obtains the subtrie ID corresponding to the longest matched prefix. Then, the IP uses the subtrie ID to index the SRAM B to retrieve the associated information. The result obtained from the SRAM B has a higher priority 46 than that from the SRAM A. The number of entries in the SRAM A is 2 I , while the number of entries in the index TCAM and SRAM B is at most2 32B H . 010* Pipeline ID 1 TCAM SRAM_B IP [31:0] Matched Index Prefix IP [31:30] Priority Bit 1 1 0 Pipeline ID & Other Info Other Info ... ... ... Other Info ... 2 2 2 SRAM_A Pipeline ID 1 00 01 10 11 ... Figure 3.16: Index table for the height-bounded splitting. We mapped the 16 routing tables onto a 8-pipeline architecture. I = 8;P = 8;H = 25. Figure 3.17 shows that our partitioning scheme achieved a balanced mem- ory allocation among the eight pipelines. The number of entries in the index TCAM is no more than 64. Compared to Figure 3.13, using a small TCAM can achieve much more balanced memory distribution among pipelines. 3.2.2 Traffic Balancing Among Pipelines Both prefix caching and learning-based dynamic remapping are employed to balance the traffic among multiple pipelines. The former can benefit from the locality of traffic, while the latter can handle the long-term traffic bias. 3.2.2.1 IP Caching Caching is an efficient way to exploit Internet traffic locality for parallel IP lookup. In [34], we propose a parallel architecture with multiple memory-balanced linear pipelines, called the Parallel Optimized Linear Pipeline (POLP) architecture, as shown in Figure 47 1 2 3 4 5 6 7 8 4 5 6 7 8 9 10 11 x 10 4 Pipeline ID # of nodes Figure 3.17: Node distribution over 8 pipelines (with a small index TCAM). 3.18. The architecture consists ofP pipelines, of which each stores part of the entire routing trie. Figure 3.18 shows an architecture with P = 4. The trie is partitioned into disjoint subtries by using the initial bits of the prefixes. We use the approxima- tion algorithm to map the subtries to pipelines, while keeping the memory requirement over different pipelines balanced. Within each pipeline, a fine-grained node-to-stage mapping, similar to Figure 3.3, is employed to balance the trie node distribution across stages. We cache the popular prefixes in W small pipelines, called pipelined prefix caches (PPCs), to balance the traffic among the pipelines. The memory requirement across the stages in each PPC is balanced by using the same scheme as that to balance theP main pipelines that store the entire trie. Since the PPCs store a small portion of the trie, the output of PPCs is only a subset of the next-hop address table.W next-hop address translation (NAT) tables are used to translate the PPC’s outputting “next-hop addresses” to the actual next-hop addresses in the routing table. Similar to TCAM-based solutions, caching helps balance the traffic among multi- ple IP lookup engines. 
However, unlike TCAMs, pipeline solutions need several clock 48 Packet 1 Pipeline 1 Pipeline 2 Pipeline 3 PPC Queue 1 Queue 2 Queue 3 Pipeline 4 Queue 4 PPC PPC Packet 2 Packet 3 DIT DIT DIT NAT NAT NAT Pointer to Next-hop Address 000...00 000...01 Packet [initial_stride] 111...11 Pipeline Index Node Address 000...00 000...01 Next-hop Address Index (obtained from PPC) 111...11 Pointer to Next-hop Address Figure 3.18: POLP architecture (W = 3;P = 4). cycles to retrieve lookup results. The cache miss penalty may be quite high due to the large processing delay [85]. Deeper pipelining with larger delay even worsens it. In [30], we propose a new SRAM-based parallel architecture with multiple memory- balanced linear pipelines, as shown in Figure 3.19. It can be fundamentally divided into two parts based on the functions: lookup engines and load balancer. Packet 1 Pipeline 1 Pipeline 2 Queue 1 Queue 2 Pipeline 8 Queue 8 Packet 2 Packet 8 Scheduler Next-hop Info In- bound Flow Table Out- bound Flow Table Queue Length DIT DIT DIT Payload Buffer Scheduler Scheduler Figure 3.19: Block diagram of the architecture with pre-caching (P = 8). To relieve the cache miss penalty that is due to large pipeline delay, our architecture extends the idea of flow caching from Layer-4 switching, where only the first packet of a flow needs lookup, and the rest of the packets of the flow are cut-through routed via looking up the flow cache [85]. In [30], we define a sequence of packets with the same destination IP address as a flow 2 . We propose a scheme called flow pre-caching, 2 A flow is usually identified by the common fields of IP headers, e.g. typically the five tuple of the source and destination IP addresses, source and destination port numbers and the protocol number [85]. 49 which allows the destination IP address of a flow to be cached before its next-hop information is retrieved. When a new packet arrives, it compares its destination IP address with the cached IP addresses. If the arriving packet matches any of the cached flows, it is assigned the ID of that flow regardless of whether the next-hop information of that flow is available. In other words, the new packet pre-fetches its lookup result even if its flow has not retrieved the next-hop information. The Scheduler directs this packet to the pipeline with minimum load (whose queue has the fewest packets). Then, this packet goes through the pipeline without any operation. Otherwise (such as if the new packet does not match any of the cached flows), it is treated as the first packet of a new flow. The Scheduler directs it to the pipeline whose ID is obtained through indexing DITs. Then, this packet goes through the pipeline to perform the lookup. When a packet exits pipelines, it uses its flow ID to index the Outbound Flow Table to find its next-hop information. If there is no valid information in the Outbound Flow Table for it, then the next-hop information that is retrieved from pipelines will be used to update that flow entry. 3.2.2.2 Dynamic Subtrie-to-Pipeline Remapping Due to their finite sizes, the caches may not capture long-term traffic bias. The initial subtrie-to-pipeline mapping does not take into account traffic bias. Some pipelines may be busy, while others receive few packets. To handle this problem, we propose an exchange-based updating algorithm to periodically remap some subtrie that includes popular prefixes to the pipelines. 
In addition to the notations in the previous sections, we define the following notations to describe the algorithm shown in Figure 3.20. After each exchange of two subtries, the contents of the DITs in the architecture need to be updated as well.

p denotes a prefix.
SP(T) denotes the set of prefixes contained in the subtrie T.
PV(p) denotes the popularity value of a prefix p, i.e., the number of times p has been retrieved.
PV(T) denotes the popularity value of a subtrie T: PV(T) = \sum_{p \in SP(T)} PV(p).
PV(S_i) denotes the popularity value of the i-th pipeline, which contains the set of subtries S_i: PV(S_i) = \sum_{T \in S_i} PV(T).

Input: P pipelines, each of which contains a set of subtries.
Output: P pipelines with possibly different subtrie sets.
1: Find the c-th pipeline whose popularity value PV(S_c) = min_{i=1,...,P} PV(S_i).
2: Find the h-th pipeline whose popularity value PV(S_h) = max_{i=1,...,P} PV(S_i).
3: Find the subtrie T_cc ∈ S_c and the subtrie T_hh ∈ S_h such that F(T_cc, T_hh) = min_{T ∈ S_c, T' ∈ S_h} F(T, T').
4: if 0 < PV(T_hh) − PV(T_cc) < PV(S_h) − PV(S_c) then
5:   Exchange T_cc and T_hh between the c-th and the h-th pipelines.
6: end if
Figure 3.20: Algorithm: Subtrie-to-pipeline remapping.

In the above algorithm, F(T, T') is the evaluation function used to select the two subtries to exchange between the two pipelines. It is defined as F(T, T') = |size(T) − size(T')| + α / |PV(T) − PV(T')|, where α is a number in [0, 1] used to differentiate two subtries that have equal size. In our architecture, we set α = 0.5. The proposed evaluation function prefers a pair of subtries with a small difference in size and a wide gap in their popularity values. For each remapping, only two subtries are exchanged between two pipelines. The traffic distribution among pipelines is balanced by incremental updating rather than by reconstructing the entire routing table.
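A minimal sketch of one remapping step of Figure 3.20. The evaluation function below follows the reading given above (small size difference preferred, wide popularity gap preferred, weighted by alpha); treat its exact form, and the dictionary-based subtrie representation, as illustrative assumptions.

```python
# One exchange step of the subtrie-to-pipeline remapping (illustrative form of F).
def remap_once(pipelines, alpha=0.5):
    # pipelines: list of lists of subtries; each subtrie is a dict with
    # "size" (node count) and "pv" (popularity value).
    pv = lambda s: sum(t["pv"] for t in s)
    c = min(range(len(pipelines)), key=lambda i: pv(pipelines[i]))   # coldest pipeline
    h = max(range(len(pipelines)), key=lambda i: pv(pipelines[i]))   # hottest pipeline

    def F(t, u):
        gap = abs(t["pv"] - u["pv"]) or 1e-9          # avoid division by zero
        return abs(t["size"] - u["size"]) + alpha / gap

    t_c, t_h = min(((t, u) for t in pipelines[c] for u in pipelines[h]),
                   key=lambda pair: F(*pair))
    if 0 < t_h["pv"] - t_c["pv"] < pv(pipelines[h]) - pv(pipelines[c]):
        pipelines[c].remove(t_c); pipelines[h].remove(t_h)
        pipelines[c].append(t_h); pipelines[h].append(t_c)
```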
3.2.2.3 Results

Due to the unavailability of public IP traces associated with their corresponding routing tables, we generated the routing table based on the given traffic trace. We downloaded the real-life traffic trace AMP-1110523221-1 from [61], which has 769.1 K packets, and extracted the unique destination IP addresses from it to build the routing table. The resulting routing table has 17628 entries. In this experiment, P and P_c were increased while keeping P = P_c, with H = H_c = 25, Q = 2, and C = N/100, where N denotes the number of prefixes in the pipelines. We used different caching and remapping options to observe their effects on the scalability of the throughput speedup. The results are shown in Figure 3.21. When neither caching nor remapping was enabled, the throughput speedup exhibited poor scalability: the speedup was only 2.5 with 8 pipelines. With remapping enabled, the throughput speedup improved to over 5, and it reached over 7.5 when caching was also enabled. Figure 3.21 also reveals that prefix caching makes a larger contribution than dynamic remapping to the throughput speedup. However, dynamic remapping helps balance the global traffic distribution among pipelines, as shown in Table 3.2 (where C denotes caching and R remapping). In most cases, dynamic remapping can be treated as optional, since it has high overhead but little effect on the throughput improvement, provided that prefix caching is enabled.

Table 3.2: Traffic distribution over 8 pipelines
Pipeline ID               1     2     3     4     5     6     7     8
Traffic (wo/C wo/R): %   40.5   9.6   4.0   2.6   3.6  21.4  10.4   7.8
Traffic (wo/C w/R): %    13.0  12.4  12.8  13.6  12.0  12.5  12.0  11.8
Traffic (w/C wo/R): %    79.2   3.3   1.3   0.9   1.2   7.9   3.5   2.7
Traffic (w/C w/R): %     12.9  14.6  10.4  12.7  12.1  12.9  11.8  12.6

Figure 3.21: Throughput speedup with different numbers of pipelines (P = P_c = 1, 2, 4, 6, 8).

3.2.3 Sequence Preserving

Intra-flow packets may go out of order due to caching and queuing, which adversely affects some network applications [19, 88]. Thus, reorder buffers and logic are usually needed, which are expensive and complicated. Our work in [30] aims to eliminate them by exploiting the processing delay in the pipelines.

3.2.3.1 Early Caching

In our architecture (Figure 3.19), all packets are required to go through the pipelines from the first stage, whether they have a cache hit or a miss. The queued packets cannot catch up with their preceding packets that are already in the pipelines. Thus, the Scheduler can detect intra-flow out-of-order packets when sending packets to the queues. If an intra-flow out-of-order packet is detected, a task to exchange the payloads between the out-of-order packets is initiated. Since it takes multiple clock cycles for a packet to complete its lookup in the pipelines, the payload exchange has enough time to complete before the packets exit the pipelines. Thus, the intra-flow packet order can be preserved.

3.2.3.2 Output Delay

We propose an alternative way in [29] to preserve intra-flow packet order by delaying the outputs. The schedulers track the status of each queue (i.e., the number of packets waiting in the queue). When a packet has a cache hit, the scheduler dispatches it to the queue with the fewest packets, which we call the lightest loaded queue. Assume that there are L_min packets in the lightest loaded queue. The scheduler also checks the status of the queue corresponding to the subtrie onto which the packet is mapped; the mapping relationship is stored in the DIT. Assume that there are L_c packets in this queue. The scheduler attaches a delay value of L_c − L_min to the packet. A packet goes to an output delay queue when it exits the pipeline. Each output delay queue is built as a single-entrance multi-exit pipeline. If the delay value of the packet is 0, the packet is output immediately. Otherwise, the packet goes through the output delay queue and decrements its delay value by one at each stage, until its delay value becomes zero. By the above schemes, the intra-flow packet order is preserved.
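A minimal sketch of the output-delay scheme just described: a cache-hit packet dispatched to the lightest loaded queue carries a delay of L_c − L_min, and the output delay queue releases it only after that many stages have elapsed. The structures are illustrative.

```python
# Output-delay sequence-preserving sketch (illustrative structures).
def assign_delay(queue_lengths, subtrie_queue):
    l_min = min(queue_lengths)                # lightest loaded queue
    l_c = queue_lengths[subtrie_queue]        # queue of the packet's own subtrie
    return l_c - l_min

def drain_output_delay_queue(packets):
    # packets: list of (packet, delay); one output-delay stage is traversed per call.
    out, still_delayed = [], []
    for pkt, delay in packets:
        if delay == 0:
            out.append(pkt)                   # output immediately
        else:
            still_delayed.append((pkt, delay - 1))
    return out, still_delayed
```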
3.2.4 Overall Performance

Consider the largest routing table, rrc07, among the 16 routing tables shown in Table 3.1. According to Figure 3.22, each stage has fewer than 8K nodes. Thus, 13 address bits are enough to index a node in the local memory of a stage. The pipeline depth is 25, and thus we need 5 bits to specify the distance. Each node stored in the local memory needs 18 bits. The total memory needed to store the 248856 prefixes of rrc07 in a 25-stage 8-pipeline architecture is 18 × 2^13 × 25 × 8 ≈ 28 Mb = 3.6 MB, where each stage needs 18 KB of memory. Using CACTI 5.3 [8], we estimate the memory access time: an 18 KB SRAM using 65 nm technology needs 0.64 ns per access and dissipates 0.066 W of power. The maximum clock rate of the above architecture in an ASIC implementation can thus be 1.56 GHz.

Figure 3.22: Node distribution over all stages in the 8-pipeline 25-stage architecture.

As Figure 3.21 shows, the throughput speedup can be higher than 7.5. The overall throughput of the 8-pipeline architecture is 11.72 billion packets per second, i.e., 3.75 Tbps for packets with the minimum size of 40 bytes. The power consumption of the architecture is 0.066 × 25 = 1.65 W. The energy consumption for one IP lookup is 1.65 × 0.64 = 1.06 nJ. In contrast, according to the TCAM model [1], a 248856-row 32-bit TCAM using 65 nm technology and 128 sub-banks needs 3.42 ns per access and dissipates 4.33 W of power and 14.81 nJ of energy. In other words, our architecture achieves 2.6-fold and 14-fold reductions in power and energy consumption, respectively, compared with TCAM.

Chapter 4
Towards Green Routers: Power-Efficient IP Lookup

Some recent investigations [54, 11] show that power dissipation has become the major limiting factor for next-generation routers and predict that expensive liquid cooling may be needed in future routers. Recent analysis by researchers from Bell Labs [54] reveals that almost 2/3 of the power dissipation inside a core router is due to IP forwarding engines. Though SRAM-based pipeline architectures have been proposed as a promising alternative to power-hungry TCAMs for IP lookup engines in next-generation routers [9, 34], they may still suffer from high power/energy consumption, due to the large number of memory accesses for each IP lookup [92]. The overall power consumption for each IP lookup in SRAM-based engines can be expressed as in Equation (4.1):

Power_{overall} = \sum_{i=1}^{H} [P_m(S_i) + P_l(i)]    (4.1)

Here, H denotes the number of memory accesses, P_m(.) the power dissipation of a memory access as a function of the memory size (with which it is usually positively correlated), S_i the size of the i-th memory being accessed, and P_l(i) the power consumption of the logic associated with the i-th memory access. Since the logic dissipates much less power than the memories in memory-dominant architectures [44, 66], the main focus of our work is on reducing the power consumption of the memory accesses. Note that the power consumption of a single memory is affected by many other factors, such as the fabrication technology and sub-bank organization, which are beyond the scope of our work. This chapter presents our techniques for lowering the power/energy consumption in SRAM-based IP lookup engines. Since the energy consumed by each memory access is the product of the power consumption and the memory access time, the energy consumption per IP lookup equals its power consumption multiplied by the clock period. In this chapter, "power", if not specified, refers to the power/energy consumption per IP lookup.
4.1 Related Work Reducing the power consumption of network routers has been a topic of significant interest [20, 11, 59]. Most of the existing work focuses on system- and network-level optimizations. 4.1.1 Greening the Routers Chabarek et al. [11] enumerate the power demands of two widely used Cisco routers. The authors further use mixed integer optimization techniques to determine the optimal configuration at each router in their sample network for a given traffic matrix. Nede- vschi et al. [59] assume that the underlying hardware in network equipment supports sleeping and dynamic voltage and frequency scaling. The authors propose to shape the traffic into small bursts at edge routers to facilitate sleeping and rate adaptation. 57 4.1.2 Power-Efficient IP Lookup Engines Power-efficient IP lookup engines have been studied from various aspects. However, to the best of our knowledge, little work has been done on pipelined SRAM-based IP lookup engines. Some TCAM-based solutions [92, 94] propose various schemes to partition a routing table into several blocks and perform IP lookup on one of the blocks. Similar ideas can be applied for SRAM-based multi-pipeline architectures [31]. Those partitioning-based solutions for power-efficient SRAM-based IP lookup engines do not consider either the underlying data structure or the traffic characteristics and are orthogonal to the solutions proposed in this chapter. Kaxiras et al. [38] propose a SRAM-based approach called IPStash for power- efficient IP lookup. IPStash replaces the full associativity of TCAMs with set associa- tive SRAMs to reduce power consumption. However, the set associativity depends on the routing table size and thus may not be scalable. For large routing tables, the set associativity is still large, which results in low clock rate and high power consumption. Traffic rate variation has been exploited in some recent papers for reducing power consumption in multi-core processor based IP lookup engines. In [53], clock gating is used to turn off the clock of unneeded processing engines of multi-core network processors to save dynamic power when there is a low traffic workload. In [42], a more aggressive approach of turning off these processing engines is used to reduce both dynamic and static power consumption. Dynamic frequency and voltage scaling are used in [39] and [55], respectively, to reduce the power consumption of the processing engines. However, those schemes still consume large power in the worst case when the traffic rate is consistently high. Some of those schemes require large buffers to store the input packets so that they can determine or predict the traffic rate. But the large packet buffers result in high power consumption. Also, these schemes do not consider 58 the latency for the state transition, which can result in packet loss in case of bursty traffic. 4.2 Architecture-Aware Data Structure Optimization To the best of our knowledge, little work has been done on data structure optimization for power-efficient SRAM-based IP lookup engines. In this paper we focus on fixed- stride multi-bit tries where all nodes at the same level have the same stride. Fixed- stride multi-bit tries are attractive for hardware implementation due to their ease for route update [18]. 4.2.1 Problem Formulation We use the following notations. LetW denote the maximum prefix length. W = 32 for IPv4. LetS =fs 0 ;s 1 ; ;s k1 g denote the sequence of strides for building ak- level multi-bit trie. LetjSj denote the number of strides inS (jSj =k). 
\sum_{i=0}^{k-1} s_i = W. Considering the hardware implementation of tree-bitmap-coded multi-bit tries, we cap the stride lengths at s_i ≤ B_s, i = 1, 2, ..., k−1, where B_s is a predefined parameter called the stride bound.

4.2.1.1 Non-Pipelined and Pipelined Engines

A SRAM-based non-pipelined IP lookup engine stores the entire trie in a single memory. Any IP lookup may need to access the memory multiple times. Hence, the worst-case power consumption of a SRAM-based non-pipelined IP lookup engine can be modeled by Equation (4.2), where Power_memory and Power_logic denote the power consumption of the memory and of the logic, respectively:

Power = (Power_{memory} + Power_{logic}) \times k    (4.2)

The logic dissipates much less power than the memories in memory-intensive architectures [44, 24, 33]. For example, [33] shows that the memory dissipates almost an order of magnitude more power than the logic in an FPGA implementation of a pipelined IP lookup engine. Thus, we do not consider the power consumption of the logic. The optimal stride problem can be formulated as:

\min_{k=1,2,\dots,W} \min_{S(k)} P_m(M(S(k))) \times k    (4.3)

where M(S) denotes the memory requirement of the multi-bit trie built using S, and P_m(M) is the power function of a SRAM of size M.

For a SRAM-based pipelined IP lookup engine, the worst-case power consumption can be modeled by Equation (4.4), where H denotes the pipeline depth, i.e., the number of pipeline stages, and Power_memory(i) and Power_logic(i) denote the power consumption of the memory and of the logic in the i-th stage, respectively:

Power = \sum_{i=1}^{H} [Power_{memory}(i) + Power_{logic}(i)]    (4.4)

As for the non-pipelined engine, we omit the power consumption of the logic. Also, assuming that the memory distribution across the pipeline stages is balanced, the optimal stride problem can be formulated as:

\min_{S} [P_m(M(S)/H)] \times \max(|S|, H)    (4.5)

where M(S) denotes the memory requirement of the multi-bit trie built using S, and P_m(M) is the power function of a SRAM of size M. The number of memory accesses is determined by |S| and H. When H < |S|, multiple clock cycles are needed to access a stage. To achieve high throughput, we let H ≥ |S|. Since |S| = k ≤ W, we can rewrite (4.5) as:

\min_{k=1,2,\dots,W} \min_{S(k)} [P_m(M(S(k))/H)] \times H    (4.6)

To solve (4.3) and (4.6), we can first fix k and find the optimal S(k) that minimizes the power consumption for the given k. Then, we compare the power consumption for different k to obtain the overall optimal S.

4.2.1.2 Power Function of SRAM

Before we solve the above optimization problem, we need to determine the power function of SRAM with respect to its size M: P_m(M). There is some published work on comprehensive power models of SRAM [17, 48, 84], but these detailed "white box" models do not show a direct relationship between the power consumption and the memory size. We use the CACTI tool [84] to evaluate both the dynamic and the static power consumption of SRAMs of different sizes and then obtain the function parameters through curve fitting ("black box" modeling).

According to [17, 48], when the word width is constant, both the dynamic and the static power consumption of SRAMs can be approximately represented in the form:

P(M) = A \cdot M^B    (4.7)

where M is the memory size, and A and B are parameters whose values differ for dynamic and static power. We vary the SRAM size from 256 bytes to 8 Mbytes while keeping the word width at 8 bytes, and obtain the power consumption using the CACTI tool [84].
After curve fitting, we obtain A_dynamic = 2.07 × 10^{-4}, B_dynamic = 0.50, A_static = 1.57 × 10^{-6}, and B_static = 0.95. The results from CACTI and from curve fitting are both shown in Figure 4.1.

Figure 4.1: Power function of SRAM sizes (dynamic and static power, CACTI vs. curve fit).

Hence P_m(M) = A_dynamic \cdot M^{B_dynamic} + A_static \cdot M^{B_static} ≈ 10^{-6} \cdot (207 M^{0.5} + 1.6 M). Then, (4.3) and (4.6) become (4.8) and (4.9), respectively:

\min_{k=1,2,\dots,W} \min_{S(k)} (207 \cdot M(S(k))^{0.5} + 1.6 \cdot M(S(k))) \times k    (4.8)

\min_{k=1,2,\dots,W} \min_{S(k)} (207 \cdot M(S(k))^{0.5} \cdot H^{0.5} + 1.6 \cdot M(S(k)))    (4.9)

For a given k, when M(S(k)) is minimized, the power consumption is also minimized. Thus, the above problems reduce to finding the optimal strides that minimize the memory requirement.

4.2.2 Special Case: Uniform Stride

In the original tree bitmap paper [18], the authors suggest using the same stride for all the nodes except the root node. We call such a fixed-stride multi-bit trie a multi-bit trie with uniform stride. The stride used by the root, s_0, is called the initial stride. Given k, we can find the optimal S by exhaustive search over the different initial strides. In each iteration, s_i = (W − s_0)/(k − 1), i = 1, 2, ..., k − 1.
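A minimal sketch of the fitted SRAM power model and the outer search over k in (4.8) and (4.9). Here mem_requirement(k) stands in for the memory-minimizing stride search of the following section and is only a placeholder.

```python
# Power model and optimal-k search sketch (mem_requirement is an assumed callable).
def p_m(M):
    # Fitted model from Figure 4.1: P_m(M) ~ 1e-6 * (207 * M**0.5 + 1.6 * M), M in bytes.
    return 1e-6 * (207 * M ** 0.5 + 1.6 * M)

def best_k_non_pipelined(mem_requirement, W=32):
    # Equation (4.8): k sequential accesses to one memory of size M(S(k)).
    return min(range(1, W + 1), key=lambda k: p_m(mem_requirement(k)) * k)

def best_k_pipelined(mem_requirement, H, W=32):
    # Equation (4.9): H stages, each holding roughly M(S(k)) / H of the trie.
    return min(range(1, W + 1), key=lambda k: p_m(mem_requirement(k) / H) * H)
```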
For fixed-stride tree-bitmap-coded multi-bit tries, two stride types are considered. The first uses uniform strides, as described in Section 64 Input: O Output: T(W1;k),S(k) 1: // ComputeT(W1;k) 2: forj = 0 toW1 do 3: T(j;1) = 2 j+1 4: end for 5: forr = 2 tok do 6: forj =r1 toW1 do 7: minCost =MaxValue 8: form = max(r2;jB s ) toj1 do 9: cost =T(m;r1)+nNode(m+1)2 B s 2+nNode(j+1)+nPrefix(m+ 1;j) 10: ifcost<minCost then 11: T(j;r) =minCost 12: M(j;r) =m 13: end if 14: end for 15: end for 16: end for 17: // ComputeS(k) =fs i g,i = 0;1; ;k1 18: m =W1 19: forr =k to1 do 20: s r1 =mM(m;r) 21: m =M(m;r) 22: end for Figure 4.2: Algorithm:FixedStride(W;k). 4.2.2. The second uses optimal strides whose value is capped byB s , as discussed in Section 4.2.3. 4.2.4.1 Results for Non-Pipelined Architecture First, we setB s = 4 and examined the results for the non-pipelined architecture using uniform and optimal strides. The results are shown in Figure 4.3. In both cases, the power was minimized whenk = 5 andS = 16;4;4;4;4. Then we varied the stride bound (B s ). Figure 4.4 shows the results by using two different stride bounds:B s = 2 andB s = 6 for the architecture that uses optimal stride. 65 Table 4.1: Representative routing tables (snapshot on 2009/04/01) Routing table # of prefixes # of prefixes w/ length< 16 rrc00 300365 2366 (0.79%) rrc01 282852 2349 (0.83%) rrc02 272504 2135 (0.78%) rrc03 285149 2354 (0.83%) rrc04 294231 2381 (0.81%) rrc05 284283 2379 (0.84%) rrc06 283835 2337 (0.82%) rrc07 280786 2347 (0.84%) rrc08 83556 495 (0.59%) rrc09 132786 991 (0.75%) rrc10 283573 2347 (0.83%) rrc11 282761 2350 (0.83%) rrc12 284469 2350 (0.83%) rrc13 289849 2355 (0.81%) rrc14 278750 2302 (0.83%) rrc15 299211 2372 (0.79%) rrc16 288218 2356 (0.82%) For B s = 2, the power was minimized when k = 9 and S = 16;2;2;2;2;2;2;2;2. ForB s = 6, the minimal power was achieved whenk = 4 andS = 17;5;5;5. 4.2.4.2 Results for Pipelined Architecture Figure 4.5 shows the power consumption of the pipelined architecture using the two stride types. The pipeline depth (h) was set to be equal to k. The stride bound B s was 4. Both cases achieved the optimal power performance when k = 6 and S = 13;4;4;4;4;3. Figure 4.6 shows the results by using different stride bounds for the optimal stride. We set h = k. For B s = 2, the power was minimized when k = 9 and S = 16;2;2;2;2;2;2;2;2. For B s = 6, the minimal power was achieved when k = 5 andS = 16;4;4;4;4. 66 0 10 20 30 40 0 20 40 60 80 100 k Watts (a) Uniform Stride 0 10 20 30 40 0 20 40 60 80 100 120 140 160 k Watts (b) Optimal Stride (B s = 4) Figure 4.3: Power results of the non-pipelined architecture using (a) the uniform stride and (b) the optimal stride. 0 10 20 30 40 0 50 100 150 200 250 k Watts (a) Optimal Stride (B s = 2) 0 10 20 30 40 0 50 100 150 200 250 300 350 k Watts (b) Optimal Stride (B s = 6) Figure 4.4: Power results of the non-pipelined architecture using (a)B s = 2 and (b) B s = 6. We also conducted experiments using different pipeline depths. Both cases achieved the minimal power consumption whenk = 6 andS = 13;4;4;4;4;3. These are the same as the results forh = k (Figure 4.5(b)). This means that the pipeline depth has little impact on determining the optimal strides for pipelined architectures. 67 0 10 20 30 40 1 2 3 4 5 6 h Watts (a) Uniform Stride 0 10 20 30 40 1 2 3 4 5 6 7 8 h Watts (b) Optimal Stride (B s = 4) Figure 4.5: Power results of the pipelined architecture using (a) the uniform stride and (b) the optimal stride. 
0 10 20 30 40 0 5 10 15 20 h Watts (a) Optimal Stride (B s = 2) 0 10 20 30 40 0 5 10 15 20 h Watts (b) Optimal Stride (B s = 6) Figure 4.6: Power results of the pipelined architecture using (a)B s = 2 and (b)B s = 6. 4.3 Reducing Dynamic Power Dissipation This paper exploits several characteristics of Internet traffic and of the pipeline ar- chitecture, to reduce the dynamic power consumption of SRAM-based IP forwarding engines. First, as observed in [30], Internet traffic contains a large amount of locality, where most packets belong to few flows. By caching the recently forwarded IP ad- dresses, the number of memory accesses can be reduced so that power consumption is lowered. Unlike previous caching schemes, most of which need an external cache to be attached to the main forwarding engine, we integrate the caching function into the 68 20 30 40 50 1 2 3 4 5 6 7 8 9 h Watts (a) h = k+16 0 20 40 60 80 0 2 4 6 8 10 h Watts (b) h=k*2 Figure 4.7: Power results of the pipelined architecture using (a)h = k +16 and (b) h =k2. pipeline architecture itself. As a result, we do away with complicated cache replace- ment hardware and eliminate the power consumption of the “hot” cache [93]. Second, since the traffic rate varies from time to time, we freeze the logic when no packet is input. We propose a local clocking scheme where each stage is driven by an indepen- dent clock and is activated only under certain conditions. The local clocking scheme can also improve the caching performance. Third, we note that different packets may access different stages of the pipeline, which leads to a varying access frequency onto different stages. Thus we propose a fine-grained memory enabling scheme to make the memory in a stage sleep when the incoming packet is not accessing it. Our simulation results show that the proposed schemes can reduce the power consumption by up to fifteen-fold. We prototype our design on a commercial field programmable gate array (FPGA) device and show that the logic usage is low while the backbone throughput requirement (40 Gbps) is met. 69 4.3.1 Analysis and Motivation We obtained four backbone Internet traffic traces from the Cooperative Association for Internet Data Analysis (CAIDA) [14]. The trace information is shown in Table 4.2, where the numbers in the parenthesis are the ratio of the number of unique destination IP addresses to the total number of packets in each trace. Table 4.2: Real-life IP header traces Trace Date # of packets # of unique IPs equinix-chicago-A 20090219 460448 31923 (6.93%) equinix-chicago-B 20090219 2811616 182119 (6.48%) equinix-sanjose-A 20080717 3473762 233643 (6.73%) equinix-sanjose-B 20080717 2200188 115358 (5.24%) Traffic Locality According to Table 4.2, regardless of the length of the packet trace, the number of unique destination IP addresses is always much smaller than that of the packets. These results coincide with those of previous work on Internet traffic charac- terization [49]. Due to TCP burst, some destination IP addresses can be connected very frequently in a short time span. Hence, caching has been used effectively in exploiting such traffic locality to either improve the IP forwarding speed [49] or help balance the load among multiple forwarding engines [30]. This paper employs the caching scheme to reduce the number of memory accesses so that the power consumption can be lowered. Traffic Rate Variation We analyze the traffic rate in terms of the number of packets at different times. 
The results for the four traces are shown in Figure 4.8, where the X axis indicates the time intervals and the Y axis the number of packets within each time interval. As observed in other papers [39], the traffic rate varies from time to 70 time. Although the router capacity is designed for the maximum traffic rate, power consumption of the IP forwarding engine can be reduced by exploiting such traffic rate variation in real life. 0 20 40 60 80 100 120 200 400 600 equinix−chicago−A 0 200 400 600 800 1000 1200 1400 1600 1800 200 400 600 equinix−chicago−B 0 500 1000 1500 2000 2500 0 100 200 equinix−sanjose−A 0 200 400 600 800 1000 1200 1400 1600 1800 400 600 800 equinix−sanjose−B Figure 4.8: Traffic rate variation over the time. Access Frequency on Different Stages The unique feature of the SRAM-based pipelined IP forwarding engine is that different stages contain different sets of trie nodes. Given various input traffic, the access frequency to different stages can vary significantly. For example, we used a backbone routing table from the Routing Infor- mation Service (RIS) [69] to generate a trie, mapped the trie onto a 25-stage pipeline, and measured the total number of memory accesses on each stage for the four input traffic traces. The results are shown in Figure 4.9, where the access frequency of each stage is calculated by dividing the number of memory accesses on each stage by that on the first stage. The first stage is always accessed by all packets, while the last few stages are seldom accessed. According to this observation, we should disable the memory access in some stages when the packet is not accessing the memory in that stage. 71 0 5 10 15 20 25 0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1 Access frequency Stage ID equinix−chicago−A equinix−chicago−B equinix−sanjose−A equinix−sanjose−B Figure 4.9: Access frequency on each stage. 4.3.2 Architecture-Specific Techniques We propose a caching-enabled SRAM-based pipeline architecture for power-efficient IP forwarding, as shown in Figure 4.10. Let H denote the pipeline depth, i.e. the number of stages in the pipeline. TheseH stages store the mapped trie nodes. Every time the architecture receives a packet, the incoming packet compares its destination IP address with the packets that are already in the pipeline. It will be considered as “cache hit” if there is a match, even though the packet has not retrieved the next-hop information yet. To preserve the packet order, the packet that is having a cache hit still goes through the pipeline. However, no memory access is needed for this packet, so that the power consumption for this packet is reduced. 4.3.2.1 Inherent Caching Most existing caching schemes need to add an external cache to the forwarding en- gine. However, the cache itself can be power-intensive [93] and also needs extra logic 72 Packet Stage 0 = ? Stage 1 = ? Stage H-1 = ? Hit / Miss Hit / Miss Hit / Miss Cache hit Next-hop info IP + Next-hop info IP Figure 4.10: Pipeline with inherent caching to support cache replacement. The relatively long pipeline delay can also result in low cache hit rates in traditional caching schemes [30]. Our architecture implements the caching function without appending extra caches. As shown in Figure 4.10, the pipeline itself acts as a fully associative cache, where the existing packets in all the stages are matched with the arriving packet. 
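A first-order, trace-driven sketch of this effect is shown below: the pipeline is modeled as a sliding window holding the destination addresses of the last H packets, and an arriving packet scores a hit if its address is already in flight. The model and the toy trace are illustrative only, and the model ignores local clocking, which, as discussed in Section 4.3.2.2, affects how long packets stay in the pipeline and therefore the achievable hit rate.

    from collections import deque

    def inherent_cache_hit_rate(dst_addrs, H):
        """First-order model of the H-stage pipeline acting as a fully
        associative cache of the packets currently in flight."""
        in_flight = deque(maxlen=H)   # one entry per pipeline stage
        hits = 0
        for addr in dst_addrs:
            if addr in in_flight:     # compared against every packet in the pipeline
                hits += 1             # cache hit: no memory access is needed
            in_flight.append(addr)    # the packet enters the pipeline either way
        return hits / len(dst_addrs) if dst_addrs else 0.0

    # Illustrative use with a toy trace exhibiting TCP-burst-like locality;
    # H = 25 corresponds to the 25-stage pipeline used above.
    trace = ["10.0.0.1"] * 8 + ["10.0.0.2"] * 4 + ["10.0.0.1"] * 8
    print(inherent_cache_hit_rate(trace, H=25))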
If the arriving packet (denoted asPkt new ) matches a previous packet (denoted asPkt exist ) that is already existing in the pipeline, Pkt new has a cache hit even thoughPkt exist has not retrieved its next-hop information. ThenPkt new will go through the pipeline with the cache hit signal set to ‘1’. On the other hand, Pkt new does not obtain the next-hop information untilPkt exist exits the pipeline. As shown in Figure 4.10, the packet exiting the pipeline will forward its IP address and its retrieved next-hop information to all the previous stages. The packets in the previous stages compare with the forwarded IP address. The packet matching the forwarded IP address will take the forwarded next-hop information as its own and carry the retrieved next-hop information along when traversing the rest of the pipeline. 4.3.2.2 Local Clocking Most of the existing pipelined IP lookup engines are driven by a global clock. The logic in a stage is active even when there is no packet to be processed. This results 73 in unnecessary power consumption. Furthermore, since the pipeline keeps forwarding the packets from one stage to the next stage at the highest clock frequency, the pipeline will contain few packets if the traffic rate is low. Since the pipeline is built as a cache which is dynamic and sensitive to input traffic, few packets in the pipeline indicates a small number of cached entries, which results in low cache hit rate. To address this issue, we propose a local clocking scheme where each stage is driven by an individual clock. Only one constraint must be met to prevent any packet loss: Constraint 1: If the clock of the previous stage is active and there is a packet in the current stage, the clock of the current stage must be active. SRAM AND Stage i Stage (i+1) Clk_i Clk_(i+1) Valid Figure 4.11: Local clocking for one stage. Hence, we design the local clocking as shown in Figure 4.11. The clock of a stage will not be active until the stage contains a valid packet and its preceding stage is for- warding some data to the current stage. To prevent clock skew, some delay logic is added in the data path of the clock signal of the previous stage. In a real implementa- tion, we do not use the AND gate which may result in glitches. Instead, we use clock buffer primitives provided by Xilinx design tools [89]. 4.3.2.3 Fine-Grained Memory Enabling As discussed earlier, the access frequency to different stages within the pipeline varies. Current pipelined IP forwarding engines keep all memories active for all the packets, 74 which results in unnecessary power consumption. Our fine-grained memory enabling scheme is achieved by gating the clock signal with the read enable signal for the mem- ory in each stage. The read enable signal becomes active only when the packet goes to access the memory in the current stage. In other words, the read enable signal will remain inactive in any of the following four cases: no packet is arriving, the distance value of the arriving packet is larger than 0, the packet has already retrieved its next- hop information, or the cache hit signal carried by the packet is set to ‘1’. 4.3.3 Performance Evaluation We prototyped our design (denoted ’Proposed’) and the baseline pipeline (denoted ’Baseline’) that did not integrate the proposed schemes, respectively, on FPGA using Xilinx ISE 10.1 development tools. The target device was Xilinx Virtex-5 XC5VFX200T with -2 speed grade. Table 4.3 shows the post place and route results where ’Both’ denotes both designs 1 . 
Although our design used more logic resources than the baseline, it still consumed only a small fraction of the overall on-chip logic resources. Both designs achieved a clock frequency of 125 MHz while using the same amount of Block RAMs. Such a clock frequency results in a throughput of 40 Gbps for minimum size (40 bytes) packets, which meets the current backbone network rate.

Table 4.3: Resource utilization
  Resource               Design     Used   Available   Utilization
  Number of Slices       Baseline   569    30,720      1.85%
  Number of Slices       Proposed   748    30,720      2.43%
  Number of bonded IOBs  Both       73     960         7%
  Number of Block RAMs   Both       295    456         64%

1. Due to the limited size of on-chip memory, both designs supported 70K prefixes, which is about 1/4 of the current largest backbone routing table. However, our architecture can be extended by using external SRAMs.

The dynamic power consumption of a pipelined IP lookup engine can be modeled as Equation (4.12), where p denotes the packet to be looked up, H the pipeline depth, N(p) the number of packets, and Power_memory(i, p) and Power_logic(i, p) denote the power consumed by the memory and by the logic in the ith stage for p, respectively.

Power = ( sum_p sum_{i=1}^{H} [ Power_memory(i, p) + Power_logic(i, p) ] ) / N(p)    (4.12)

We profiled the power consumption of the memory and of the logic based on our FPGA implementation results. Using the XPower Analyzer tool provided by Xilinx, we obtained the power consumption for the baseline and for our design, as shown in Figure 4.12. As expected, the power consumed by memory dominated the overall power dissipation of the pipelined IP lookup engine.

Figure 4.12: Profiling of dynamic power consumption in a pipelined IP lookup engine (Logic: 0.29 mW baseline, 2.04 mW proposed; BRAM: 92.01 mW baseline, 92.7 mW proposed).

Based on this power profile of the architecture, we developed a cycle-accurate simulator for our pipelined IP lookup engine. We conducted the experiments using the four real-life backbone traffic traces given in Table 4.2, and evaluated the overall power consumption.

First, we examined the impact of the fine-grained memory enabling scheme. We disabled both inherent caching and local clocking, and then ran the simulation under two conditions: (1) without fine-grained memory enabling (denoted 'wo/ FME') and (2) with fine-grained memory enabling (denoted 'w/ FME'). Figure 4.13 compares the results, which are normalized by dividing by the results of condition (2). According to Figure 4.13, fine-grained memory enabling can achieve up to a 12-fold reduction in power consumption.

Figure 4.13: Power reduction with fine-grained memory enabling (normalized power without FME: 3.75, 11.95, 6.25, and 6.15 for equinix-chicago-A, equinix-chicago-B, equinix-sanjose-A, and equinix-sanjose-B, respectively).

Second, we evaluated the impact of the inherent caching and the local clocking schemes. We enabled the fine-grained memory enabling scheme, and then ran the simulation under three conditions: (1) without either inherent caching or local clocking (denoted 'Baseline'); (2) with inherent caching but without local clocking (denoted 'Cache only'); (3) with both schemes (denoted 'Cache + LC'). The results are shown in Figure 4.14, where the results without both schemes are set as the baseline. Without local clocking, the reduction in power consumption from caching was very small, because of the low cache hit rate (e.g., 1.65% for equinix-chicago-B). Local clocking improved the cache hit rate (e.g.,
the cache hit rate for equinix-chicago- B increased to 45.9%), which resulted in higher reduction in power consumption. Normalized Power 1 1 1 1 0.97 0.98 0.99 0.98 0.60 0.79 0.77 0.79 0 0.2 0.4 0.6 0.8 1 1.2 equinix-chicago-A equinix-chicago-B equinix-sanjose-A equinix-sanjose-B Baseline Cache only Cache + LC Figure 4.14: Power reduction with inherent caching and local clocking Overall, when all the three proposed schemes were enabled, the architecture achieved 6.3, 15.2, 8.1, and 7.8 -fold reduction in power consumption, for the four traffic traces, respectively. 78 Chapter 5 Large-Scale Wire-Speed Packet Classification To achieve high throughput, recent research in this area seeks to combine algorithmic and architectural approaches, most of which are based on ternary content addressable memories (TCAMs) [90, 46, 77] or a variety of hashing schemes such as Bloom Filters [16, 65, 60, 91]. However, as shown in Table 1.6, TCAMs are not scalable in terms of clock rate, power consumption, or circuit area, when compared to SRAMs. Most of TCAM-based solutions also suffer from range expansion when converting ranges into prefixes [46, 77]. Bloom Filters have become popular due to theirO(1) time per- formance and high memory efficiency [91]. However, a secondary module is always needed to resolve the false positives that are inherent in Bloom Filters, which may be slow and can limit the overall performance [79]. On the other hand, mapping decision- tree-based packet classification algorithms [22, 74] onto SRAM-based pipeline archi- tecture appears to be a promising alternative [5]. By pipelining the traversal of the decision tree, a high throughput of one packet per clock cycle (PPC) can be sustained. This chapter discusses our design on FPGA and compares with existing FPGA-based packet classification solutions. 79 5.1 Our Approach 5.1.1 Motivations Although multi-field packet classification is a saturated area of research, little work has been done on FPGAs. Most of the existing FPGA implementations of packet clas- sification engines are based on decomposition-based packet classification algorithms, such as BV [45] and DCFL [83]. Our work is mainly based on the HyperCuts algo- rithm which is considered to be the most scalable decision-tree-based algorithm for multi-field packet classification [82]. However, like other decision-tree-based packet classification algorithms, the HyperCuts algorithm suffers from memory explosion due to rule duplication. For example, as shown in Figure 2.2, rules R1, R2, and R4 are replicated into multiple child nodes in both HiCuts and HyperCuts trees. We identify that rule duplication when building the decision tree comes from two sources: overlapping between different rules, and evenly cutting on all fields. Taking the rule set in Figure 2.2 as an example, since R1 always overlaps with R3 and R5, R1 will be replicated into the nodes which contain R3 or R5, however the space is cut. Since each dimension is alway evenly cut, R2 and R4 are replicated though they do not overlap with any other rule. The second source of rule duplication exists only when cutting the port or the protocol fields of the packet header, since the prefix fields are evenly cut in nature. A prefix is matched from the most significant bit (MSB) to the least significant bit (LSB), which is equal to cutting the value space by half per step. Accordingly, we propose two optimization techniques, called rule overlap reduc- tion and precise range cutting, as shown in Figure 5.1. 
80 R1 R3 R4 R2 X Y R2 R4 R3 X: 2 cuts Y: 2 cuts R1 R5 R5 Figure 5.1: Motivating example Rule overlap reduction: We store the rules that will be replicated into child nodes in a list attached to each internal node. These rule lists are called internal rule lists, such as R1 shown in green in Figure 5.1. Precise range cutting: Assuming both X and Y in Figure 5.1 are port fields, we seek the cutting points which result in the minimum number of rule duplication, instead of deciding the number of cuts for this field. As shown in Figure 5.1, after applying the two optimizations, rule duplication is dra- matically reduced and the memory requirement becomes linear with the number of rules. Section 5.1.3.1 discusses the details for building the decision tree. The proposed rule overlap reduction technique is similar to the push common rule upwards heuristic proposed by the authors of HyperCuts [74], where rules common to all descendant leaves are processed at the common parent node instead of being duplicated in all children. However, the push common rule upwards heuristic can solve only a fraction of rule duplication that can be solved by our rule overlap reduction technique. Taking the HyperCuts tree in Figure 2.2(c) as an example, only R1 will be pushed upwards while our technique allows storing R2 and R4 in the internal nodes as well. Also, the push common rule upwards heuristic is applied after the decision tree is built, while our rule overlap reduction technique is integrated with the decision tree construction algorithm. 81 5.1.2 Architecture Overview Like the HyperCuts with the push common rule upwards heuristic enabled, our algo- rithm may reduce the memory consumption at the cost of increased search time, if the process to match the rules in the internal rule list of each tree node is placed in the same critical path of decision tree traversal. Any packet traversing the decision tree must match the rules in the internal rule list of the current node and branch to the child nodes in sequence. The number of memory accesses along the critical path can be very large in the worst cases. Although the throughput can be boosted by using a deep pipeline, the large delay of passing the packet classification engine requires that the router use a large buffer to store the payload of all packets that are being classi- fied. Moreover, since the search in the rule list and the traversal in the decision tree have different structures, a heterogeneous pipeline is needed, which complicates the hardware design. FPGAs provide massive parallelism and high-speed dual-port Block RAMs dis- tributed across the device. We exploit these features and propose a highly parallel architecture with localized routing paths, as shown in Figure 5.2. The design is based on the following considerations. 1. Regardless of internal rule lists, the traversal of the decision tree can be pipelined. Thus, we have a pipeline for traversing the decision tree, shown as light-color blocks in Figure 5.2. We call this pipeline Tree Pipeline. 2. Note that each tree node is attached to a list of rules, for both internal and leaf nodes. Analogous to internal rule list, the rule list attached to a leaf node is called a leaf-level rule list. Search in the rule lists can be pipelined as well. 82 3. When a packet reaches an internal tree node, the search in the internal rule list can be initiated when the branching decision is made by placing the rule list in a separate pipeline. 
We call such a pipeline a Rule Pipeline, as shown in shaded blocks in Figure 5.2. 4. For the tree nodes mapped onto the same stage of the Tree Pipeline, their rule lists are mapped onto the same Rule Pipeline. Thus, if the Tree Pipeline hasH stages, then there will beH +1 Rule Pipelines. One Rule Pipeline is dedicated for the internal rule list associated with the root node. 5. All Rule Pipelines have the same number of stages. The total number of clock cycles for a packet to pass the architecture isH +listSize, wherelistSize is the number of stage in a Rule Pipeline. 6. Consider two neighboring stages of Tree Pipeline, denoted as A and B, where Stage B follows Stage A. The Rule Pipeline that is attached to Stage A out- puts the matching results one clock cycle earlier than the Rule Pipeline that is attached to Stage B. Instead of waiting for all matching results from all Rule Pipelines and directing them to a single priority resolver, we exploit the one clock cycle gap between two neighboring Tree Pipeline stages, to perform the partial priority resolving for the two previous matching results. 7. The Block RAMs in FPGAs are dual-port in nature. Both Tree Pipeline and Rule Pipelines can exploit this feature to process two packets per clock cycle. In other words, by duplicating the pipeline structure (i.e. logic), the throughput is doubled, while the memories are shared by the dual pipelines. 83 Tree Pipeline Packet 1 Packet 2 Action ID 1 Action ID 2 Rule Pipeline(s) Priority Resolver Priority Resolver Priority Resolver Priority Resolver Figure 5.2: Block diagram of the two-dimensional linear dual-pipeline architecture As shown in Figure 5.2, all routing paths between blocks are localized. This can result in a high clock frequency even when the on-chip resources are heavily utilized. The FPGA implementation of our architecture is detailed in Section 5.2. 5.1.3 Algorithms 5.1.3.1 Decision Tree Construction To fit tens of thousands of unique rules in the on-chip memory of a single FPGA, we must reduce the memory requirement of the decision tree. In Section 5.1.1, we presented two optimization techniques, rule overlap reduction and precise range cut- ting, for the state-of-the-art decision-tree-based packet classification algorithm (Hy- perCuts). This section describes how to integrate the two optimization techniques into the decision tree construction algorithm. 84 Starting from the root node with the full rule set, we recursively cut the tree nodes until the number of rule in all the leaf nodes is smaller than a parameter named listSize. At each node, we need to figure out the set of fields to cut and the num- ber of cuts to be performed on each field. We restrict the maximum number of cuts at each node to 64. In other words, an internal node can have 2, 4, 8, 16, 32 or 64 children. For the port fields, we need to determine the precise cut points instead of the number of cuts. Since more bits are needed to store the cut points than to store the number of cuts, we restrict the number of cuts on port fields to be at most 2. For example, we can have 2 cuts on SA, 4 cuts on DA, 2 cuts on SP, and 2 cuts on DP. We do not cut on the protocol field, since the first four fields are normally enough to distinguish different rules in real life [43]. We use the same criteria as in HiCuts [22] and HyperCuts [74] to determine the set of fields to cut and the number of cuts to perform on SA and DA fields. Our algorithm differs from HiCuts and HyperCuts in two aspects. 
First, when the port fields are selected to cut, we seek the cut point which results in the least rule duplication. Second, after the cutting method is determined, we pick the rules whose duplication counts are the largest among all the rules covered by the current node and push them into the internal rule list of the current node until the internal rule list becomes full. Figure 5.3 shows the complete algorithm for building the decision tree, where n denotes a tree node, f a packet header field, and r a rule. Figure 5.4 shows the decision tree constructed for the rule set given in Table 1.3. In Figure 5.4, the values in parentheses represent the cut points on the port fields.. 5.1.3.2 Tree-to-Pipeline Mapping The size of the memory in each pipeline stage must be determined before FPGA imple- mentation. However, as shown in [5], when simply mapping each level of the decision 85 1: Initialize the root node and push it intonodeList. 2: whilenodeList6=null do 3: n Pop(nodeList) 4: ifn:numRules<listSize then 5: n is a leaf node. Continue. 6: end if 7: n:numCuts = 1 8: whilen:numCuts< 64 do 9: f ChooseField(n) 10: iff is SA or DA then 11: numCuts[f] OptNumCuts(n;f) 12: n:numCuts *=numCuts[f] 13: else iff is SP or DP then 14: cutPoint[f] OptCutPoint(n;f) 15: n:numCuts *=2 16: end if 17: Update the duplication counts of all r2 n:ruleSet: r:dupCount # of copies ofr after cutting. 18: whilen:internalList:numRules<listSize do 19: Find r m which has the largest duplication count among the rules in n:ruleSetnn:internalList. 20: Pushr m inton:internalList. 21: end while 22: if All child nodes contain less thanlistSize rules then 23: Break. 24: end if 25: end while 26: Push the child nodes intonodeList. 27: end while Figure 5.3: Algorithm: Building the decision tree 86 SA: 2 cuts DA: 2 cuts R1 R3 DP: 2 cuts (8|9) R6 R4 R5 DA: 2 cuts DP: 2 cuts (6|7) R8 R9 R7,R8,R9, R10 R4,R5,R6 R10 R7 Figure 5.4: Building the decision tree for the example rule set. tree onto a separate stage, the memory distribution across stages can vary widely. Allo- cating memory with the maximum size for each stage results in large memory wastage. Baboescu et al. [5] propose a Ring pipeline architecture that employs TCAMs to achieve balanced memory distribution at the cost of halving the throughput to one packet per two clock cycles, i.e. 0.5 PPC, due to its non-linear structure. Our task is to map the decision tree onto a linear pipeline (Tree Pipeline in our architecture) in order to achieve balanced memory distribution over stages while sus- taining a throughput of one packet per clock cycle (which can be further improved to 2 PPC by employing dual-port RAMs). The memory distribution across stages should be balanced not only for the Tree Pipeline, but also for all the Rule Pipelines. Note that the number of words in each stage of a Rule Pipelines depends on the number of tree nodes rather than on the number of words in the corresponding stage of Tree Pipeline, as shown in Figure 5.5, where H = 4, listSize = 2. The challenge comes from the various number of words needed for tree nodes. As a result, the tree-to-pipeline mapping scheme requires not only balanced memory distribution, but also balanced node distribution across stages. Moreover, to maximize the memory utilization in each stage, the sum of the number of words of all nodes in a stage should approach some 87 power of 2. Otherwise, for example, we need to allocate 2048 words for a stage con- suming only 1025 words. 
7 (a) Decision tree 1 3 8 11 9 10 6 3 2 6 7 4 5 8 11 9 10 1 4 5 2 1 1 2 2 3 3 4 4 5 5 7 7 6 6 8 8 9 9 10 10 11 11 7 7 8 8 9 9 10 10 11 11 6 6 4 4 5 5 3 3 2 2 1 1 (b) Mapping results Figure 5.5: Mapping a decision tree onto pipeline stages The above problem is a variant of bin packing problems and can be proved to be NP-complete. We use a heuristic similar to our previous study of trie-based IP lookup [30], which allows the nodes on the same level of the tree to be mapped onto different stages. This provides more flexibility to map the tree nodes and helps achieve 88 a balanced memory and node distribution across the stages in a pipeline, as shown in Figure 5.5. Only one constraint must be followed: Constraint1. If nodeA is an ancestor of nodeB in the tree, thenA must be mapped to a stage preceding the stage to whichB is mapped. We impose two bounds: B M andB N , for the memory and node distribution, re- spectively. The values of the bounds are some power of 2. The criteria to set the bounds is to minimize the number of pipeline stages while achieving balanced dis- tribution over stages. The complete tree-to-pipeline mapping algorithm is shown in Figure 5.6, wheren denotes a tree node,H the number of stages,S r the set of remain- ing nodes to be mapped onto stages,M i the number of words of theith stage, andN i the number of nodes mapped onto theith stage. We manage two lists:ReadyList and NextReadyList. The former stores the nodes that are available for filling the current stage, while the latter stores the nodes for filling the next stage. We start with mapping the nodes that are children of the root onto Stage 1. When filling a stage, the nodes in ReadyList are popped out and mapped onto the stage in the decreasing order of their heights 1 . If a node is assigned to a stage, then its children are pushed into the NextReadyList. When a stage is full orReadyList becomes empty, we move on to the next stage. At that time, theNextReadyList is merged intoReadyList. By these means, Constraint 1 can be met. The complexity of this mapping algorithm isO(N), whereN the total number of tree nodes. External SRAMs are usually needed to handle very large rule sets, while the num- ber of external SRAMs is constrained by the number of IO pins in our architecture. By assigning large values ofB M andB N for one or two specific stages, our mapping algorithm can be extended to allocate a large number of tree nodes onto few external SRAMs which consume controllable number of IO pins. 1 Height of a tree node is defined as the maximum directed distance from it to a leaf node. 89 Input: The treeT . Output: H stages with mapped nodes. 1: Initialization:ReadyList ,NextReadyList ,S r T ,H 0. 2: Push the children of the root intoReadyList. 3: whileS r 6= do 4: Sort the nodes inReadyList in the decreasing order of their heights. 5: whileM i <B M ANDN i <B N ANDReadylist6= do 6: Pop node fromReadyList. 7: Map the popped noden p onto StageH. 8: Push its children intoNextReadyList. 9: M i M i +size(n p ). UpdateS r . 10: end while 11: H H +1. 12: Merge theNextReadyList to theReadyList. 13: end while Figure 5.6: Algorithm: Mapping the decision tree onto a pipeline 5.2 Implementation 5.2.1 Pipeline for Decision Tree As shown in Figure 5.4, different internal nodes in a decision tree may have different numbers of cuts, which can come from different fields. A simple solution is hard- wiring the connections, which however cannot update the tree structure on-the-fly [52]. We propose a circuit design, as shown in Figure 5.7. 
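A behavioral sketch of what this cutting logic computes is given below: the per-node cut information stored in the stage memory selects which SA/DA bits and which port comparisons form the child index added to the node's base address. The exact datapath is that of Figure 5.7; the record layout, the stored shift amounts, and the comparison direction used here are illustrative assumptions.

    def child_index(pkt, node):
        """Behavioral sketch of the per-stage cutting logic (cf. Figure 5.7).

        pkt  -- dict with 32-bit 'sa', 'da' and 16-bit 'sp', 'dp' header values
        node -- dict with per-node cut information kept in stage memory:
                'sa_shift'/'sa_bits', 'da_shift'/'da_bits': which SA/DA bits to use
                'sp_en'/'sp_cut', 'dp_en'/'dp_cut': port cut enables and cut points
        Returns the child offset to add to the node's base address.
        """
        idx = (pkt['sa'] >> node['sa_shift']) & ((1 << node['sa_bits']) - 1)
        idx = (idx << node['da_bits']) | ((pkt['da'] >> node['da_shift']) & ((1 << node['da_bits']) - 1))
        if node['sp_en']:                      # at most one precise cut on SP
            idx = (idx << 1) | (1 if pkt['sp'] > node['sp_cut'] else 0)
        if node['dp_en']:                      # at most one precise cut on DP
            idx = (idx << 1) | (1 if pkt['dp'] > node['dp_cut'] else 0)
        return idx                             # child address = base_address + idx

    # Illustrative use: one cut bit on SA and DA, one precise cut on DP at 8.
    node = {'sa_shift': 31, 'sa_bits': 1, 'da_shift': 31, 'da_bits': 1,
            'sp_en': 0, 'sp_cut': 0, 'dp_en': 1, 'dp_cut': 8}
    pkt = {'sa': 0x80000000, 'da': 0x00000000, 'sp': 80, 'dp': 9}
    print(child_index(pkt, node))              # -> 5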
We can update the memory content to change the number of cut bits for SA and DA and the cut enable bits, which indicate whether to cut, for SP and DP. Our tree-to-pipeline mapping algorithm allows two nodes on the same tree level to be mapped to different stages. We implement this feature by using a simple method. Each node stored in the local memory of a pipeline stage has one extra field: the distance to the pipeline stage where the child node is stored. When a packet is passed through the pipeline, the distance value is decremented by 1 when it goes through a 90 stage. When the distance value becomes 0, the child node’s address is used to access the memory in that stage. 5.2.2 Pipeline for Rule Lists When a packet accesses the memory in a Tree Pipeline stage, it will obtain the pointer to the rule list that is associated with the current tree node being accessed. The packet uses this pointer to access all stages of the Rule Pipeline that is attached to the current Tree Pipeline stage. Each rule is stored as one word in a Rule Pipeline stage, benefiting from the large word width provided by FPGA. Within a stage of the Rule Pipeline, the packet uses the pointer to retrieve one rule and compare its header fields to find a match. We implement different match types for different fields of a rule, as shown in Figure 5.8. When a match is found in the current Rule Pipeline stage, the packet will carry the corresponding action information with the rule priority along the Rule Pipeline until it finds another match where the matching rule has higher priority than the one the packet is carrying. 5.2.3 Rule Update The dual-port memory in each stage enables only one write port to guarantee the data consistency. We update the memory in the pipeline by inserting write bubbles [7]. The new content of the memory is computed offline. When an update is initiated, a write bubble is inserted into the pipeline. Each write bubble is assigned an ID. There is one write bubble table in each stage, storing the update information associated with the write bubble ID. When a write bubble arrives at the stage prior to the stage to be updated, the write bubble uses its ID to look up the write bubble table and retrieves: the memory address to be updated in the next stage, the new content for that memory 91 location, and a write enable bit. If the write enable bit is set, the write bubble will use the new content to update the memory location in the next stage. Since the archi- tecture is linear, all packets preceding or following the write bubble can perform their operations while the write bubble performs an update. 5.3 Experimental Results 5.3.1 Algorithm Evaluation We evaluated the effectiveness of our optimized decision-tree-based packet classifi- cation algorithm by conducting experiments on four real-life rule sets of different sizes. Two performance metrics were measured: average memory size per rule and tree height. The former metric represents the scalability of our algorithm, and the lat- ter dictates the minimum number of stages needed in Tree Pipeline. The results are shown in Table 5.1. In these experiments, we set listSize = 8, which was optimal according to a series of tests, where we found that a largerlistSize resulted in lower memory requirement but deeper Rule Pipelines. The memory reduction became unre- markable whenlistSize> 8. 
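For reference, the per-field match types implemented in each Rule Pipeline stage (Section 5.2.2, Figure 5.8) correspond to the checks sketched below: prefix match on SA/DA, arbitrary range match on SP/DP, and a masked exact match on the protocol. The dictionary encodings of the packet and the rule are illustrative and do not reflect the actual rule word stored in each stage.

    def rule_matches(pkt, rule):
        """Per-field match types of a Rule Pipeline stage (cf. Figure 5.8, sketch)."""
        def prefix_match(addr, prefix, length):
            # A zero-length prefix acts as a wildcard.
            return length == 0 or (addr >> (32 - length)) == (prefix >> (32 - length))

        return (prefix_match(pkt['sa'], rule['sa'], rule['sa_len'])
                and prefix_match(pkt['da'], rule['da'], rule['da_len'])
                and rule['sp_lo'] <= pkt['sp'] <= rule['sp_hi']      # range match
                and rule['dp_lo'] <= pkt['dp'] <= rule['dp_hi']      # range match
                and (pkt['proto'] & rule['proto_mask']) == rule['proto'])  # mask 0 = wildcard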
According to Table 5.1, our algorithm kept the memory requirement linear with the number of rules and thus achieved much better scalability than the original HyperCuts algorithm. Also, the height of the decision tree that was generated using our algorithm was much smaller than that of HyperCuts, indicating a smaller delay for a packet to pass through the engine. According to Table 5.1, the Tree Pipeline needed at least 9 stages to map the deci- sion tree for ACL 10K. We conducted a series of experiments to find the optimal values for the memory and the node distribution bounds (B M andB N ). WhenB M 512 or B N 128, the Tree Pipeline needed more than 20 stages. When B M 2048 or 92 Table 5.1: Performance of algorithms for rule sets of various sizes Our algorithm Original HyperCuts Rule set # of Memory Tree Memory Tree rules (Bytes/rule) height (Bytes/rule) height ACL 100 98 29.31 8 52.16 23 ACL 1k 916 26.62 11 122.04 20 ACL 5k 4415 29.54 12 314.88 29 ACL 10k 9603 27.46 9 1727.28 29 B N 512, the memory and the node distribution over the stages were the same as the one that was using a static mapping scheme that mapped each tree level onto a stage. Only whenB M = 1024;B N = 256, both memory and node distribution were balanced, while the number of stages needed was increased slightly to 11 (H = 11). As Figure 5.9 shows, our mapping scheme outperformed the static mapping scheme with respect to both memory and node distribution. 5.3.2 FPGA Implementation Results Based on the mapping results, we initialized the parameters of the architecture for FPGA implementation. According to the previous section, to include the largest rule set ACL 10K, the architecture needed H = 11 stages in Tree Pipeline and 12 Rule Pipelines, of which each hadlistSize = 8 stages. Each stage of Tree Pipeline needed B M = 1024 words, of which each was 72 bits including base address of a node, cutting information, pointer to the rule list, and distance value. Each stage of Rule Pipeline needed B N = 256 words, of which each was 171 bits including all fields of a rule, priority and action information. We implemented our design, including write-bubble tables, in Verilog. We used Xilinx ISE 10.1 development tools. The target device was Xilinx Virtex-5 XC5VFX200T with -2 speed grade. Post place and route results showed that our design could achieve 93 a clock frequency of 125.4 MHz. The resource utilization is shown in Table 5.2. Among the allocated memory, 612 Kbytes was consumed for storing the decision tree and all rule lists. Table 5.2: Resource utilization of the packet classification engine on FPGA Used Available Utilization Number of Slices 10,307 30,720 33% Number of bonded IOBs 223 960 23% Number of Block RAMs 407 456 89% Table 5.3 compares our design with the state-of-the-art FPGA-based packet classi- fication engines. For fair comparison, the results of the compared work were scaled to Xilinx Virtex-5 platforms as based on the maximum clock frequency 2 . The values in parentheses were the original data that was reported in those papers. Considering the time-space trade-off, we used a new performance metric, named Efficiency, which was defined as the throughput divided by the average memory size per rule. Our design outperformed the others with respect to throughput and efficiency. Note that our work is the only design to achieve more than 40 Gbps throughput. 
Table 5.3: Performance comparison of FPGA-based packet classification engines Approaches # of Total memory Throughput Efficiency rules (Kbytes) (Gbps) (Gbps/KB) Our approach 9603 612 80.23 1358.9 Simplified HyperCuts [39] 10000 286 7.22 (3.41) 252.5 BV-TCAM [77] 222 16 10 (N/A) 138.8 2sBFCE [60] 4000 178 2.06 (1.88) 46.3 Memory-based DCFL [27] 128 221 24 (16) 13.9 2 The BV-TCAM paper [77] does not present the implementation result about the throughput. We use the predicted value given in [77]. 94 Add Left_Shift Shift_value Packet_SP Cmp Packet_DP Sub ‘32’ Right_Shift Shift_value Packet_SA Add Left_Shift Shift_value ‘32’ Right_Shift Shift_value Packet_DA Adder Left_Shift Shift_value Left_Shifter Shift_value Base_address Num_bits_SA Num_bits_DA Cut_point_SP Cut_point_DP OR Child_pointer (to next stage) 16 16 3 3 10 1 1 10 DP_en SP_en 1 1 A B A-B Sub A B A-B A B A>B Cmp A B A>B From Memory Figure 5.7: Implementation of updatable, varying number of branches at each node 95 Right_Shift Shift_value 32'b1 Prefix length Packet IP A B A=B Prefix (a) Prefix match (SA, DA) A B A≤B Port_low Port_high Packet port A B A≤B (b) Port match (SP, DP) A B A=B Protocol value Packet protocol Protocol mask (c) Protocol match 5 32 32 16 16 16 1 8 8 Figure 5.8: Implementation of rule matching 1 2 3 4 5 6 7 8 9 10 11 0 500 1000 1500 2000 Stage ID # of words Memory distribution 1 2 3 4 5 6 7 8 9 10 11 0 100 200 300 400 Stage ID # of nodes Node distribution Our mapping (B M =1024) Static mapping Our mapping (B N =256) Static mapping Figure 5.9: Distribution over Tree Pipeline stages for ACL 10K 96 Chapter 6 Scalable Architecture for Flexible Flow Matching Flexible forwarding hardware is recently proposed [10, 56]. Most of existing work fo- cuses on functionality rather than performance. Few efforts have been made in exploit- ing the power of state-of-the-art FPGA technology to achieve high-performance flexi- ble flow matching. To the best of our knowledge, no existing schemes for OpenFlow- like flexible flow matching can sustain a throughput above 10 Gbps in the worst case where packets are of minimum size (40 bytes). This chapter presents our initial attempt at addressing the performance challenges for flexible forwarding. 6.1 Our Approach Flexible flow matching can be viewed as an extension of the traditional 5-field packet classification. In OpenFlow, with more packet header fields to be matched, the total number of bits per packet for lookup increases from 104 to over 237 [63]. We adopt decision-tree-based algorithms that are considered among the most scalable packet classification algorithms [82, 32]. However, existing decision-tree-based packet clas- sification algorithms use all of the packet header fields to construct the tree. This results in large memory and resource requirements for flexible flow matching where 97 different flow rules in a table specify few but different header fields. Moreover, the depth of a decision tree can be very large for flexible flow matching due to the increase in the number of header fields to be matched. 6.1.1 Heuristic We observe that different complex rules in a flexible flow table may specify only a small number of fields, while leaving other fields to be wildcards. This phenomenon is fundamentally due to the concept of flexible forwarding hardware which was proposed to support various applications on the same substrate. For example, both IP routing and Ethernet forwarding can be implemented in OpenFlow. 
IP routing will specify only the destination IP address field, while Ethernet forwarding will use only the destination Ethernet address. 6.1.2 Motivation The memory explosion for decision-tree-based algorithms in the worst case has been identified as a result of rule duplication [32]. A less specified field usually tends to cause rule duplication. Consider the OpenFlow table shown in Table 1.5 as an ex- ample. If we consider only SA and DA fields, all the 10 rules can be represented geometrically on a 2-dimensional space shown in Figure 6.1. Decision tree -based al- gorithms (such as HyperCuts [74]) cut the space recursively based on the values from SA and DA fields. As shown in Figure 6.1, no matter how the space is cut, R14 will be duplicated to all children nodes. This is because their SA / DA fields are wildcards (i.e. not specified). Similarly, if we build the decision tree based on source / destination Ethernet addresses, no matter how the cutting is performed, R58 will be duplicated to all children nodes. We will see in Section 6.2 that the characteristics of flexible flow 98 rules (discussed in Section 6.1.1) cause severe memory explosion when the rule set becomes larger. R1, R2, R3, R4 SA DA R5 R6 R7 R8 R9 R10 SA: 2 cuts DA: 2 cuts R1 R2 R3 R4 R5 R1 R2 R3 R4 R5 R1 R2 R3 R4 R6 R7 R9 R10 R1 R2 R3 R4 R8 (a) Rule set (b) HyperCuts tree (1st level) Figure 6.1: Rule duplication in HyperCuts tree. Hence, an intuitive idea is to split a table of complex rules into different subsets. The rules within the same subset specify nearly the same set of header fields. For each rule subset, we build the decision tree based on the specified fields used by the rules within this subset. For instance, the example rule table can be partitioned into two subsets: one contains R14 and the other contains R510. We can use only source / destination Ethernet addresses to build the decision tree for the first subset while only SA / DA fields for the second subset. As a result, the rule duplication will be dramatically reduced. Meanwhile, after such partitioning, since each decision tree employs a much smaller number of fields than the single decision tree without partitioning, we can expect considerable resource savings in hardware implementation. 6.1.3 Algorithms We develop the decision forest construction algorithms to achieve the following goals: Reduce the overall memory requirement. 99 Bound the depth of each decision tree. Bound the number of decision trees. Rather than perform the rule set partitioning and the decision tree construction in two phases, we combine them efficiently, as shown in Figure 6.2. The rule set is partitioned dynamically during the construction of each decision tree. The function for building a decision tree i.e.BuildTree(:) is shown in Algorithm 6.3. Input: Rule setR. Input: Parameters:bucketSize,depthBound,P . Output: Decision forest:fT i ji = 0;1; ;P1g. 1: i 0,R i R andsplit TRUE. 2: whilei<P do 3: ifi ==P1 thenfThe last subset / treeg 4: split FALSE 5: end if 6: fT i ;R i+1 g BuildTree (R i ;split;bucketSize;depthBound) 7: i i+1 8: end while Figure 6.2: Algorithm: Building the decision forest The parameterP bounds the number of decision trees in a decision forest. We have the rule setR i to build theith tree whose construction process will split out the rule set R i+1 .i = 0;1; ;P1. In other words, the rules inR i R i+1 are actually used for building the data structure of theith tree. The parametersplit determines if the rest of the rule set will be partitioned. 
When building the last decision tree (i = P1), split is turned to be FALSE so that all the remaining rules are used to construct the last tree. Other parameters includedepthBound which bounds the depth of each decision tree, andbucketSize which is inherited from the the original HyperCuts algorithm to determine the maximum number of rules allowed to be contained in a leaf node. The algorithm shown in Figure 6.3 is based on the original HyperCuts algorithm, where Lines 810 and 1518 are the major changes. Lines 810 are used to bound 100 Input: Rule setR. Input: Parameters:split,bucketSize,depthBound. Output: Decision treeT and the split-out setR ex . 1: Initialize the root node:root:rules R. 2: Pushroot intonodeList. 3: whilenodeList6=null do 4: n Pop(nodeList) 5: ifn:numrules<bucketSize then 6: n is a leaf node. Continue. 7: end if 8: ifn:depth ==depthBound then 9: Assign ton thebucketSize most specified rules fromn:rules. Push remain- ing rules ofn:rules intoR ex .n is a leaf node. Continue. 10: end if 11: forf2OptFields(n) do 12: nCuts[f] OptNumCuts(n;f) 13: n:numCuts *=nCuts[f] 14: end for 15: ifsplit is TRUE then 16: r DuplicatedRule(n,nCuts) 17: Pushr intoR ex . 18: end if 19: fori 0 to2 n:numCuts 1 do 20: n i CreateNode(n;nCuts;i) 21: Pushn i intonodeList. 22: end for 23: end while Figure 6.3: Algorithm: Building the decision tree and the split-out set the depth of the tree. After determining the optimal cutting information (including the cutting fields and the number of cuts on these fields) for the current node (Lines 1114), we identify the rules that will be duplicated to all the children nodes (by the DuplicatedRule() function). These rules are then split out of the current rule set and pushed into the split-out rule setR ex . The split-out rule set will be used to build the next decision tree(s). The rule duplication in the firstP1 trees will thus be reduced. 101 6.1.4 Architecture To achieve line-rate throughput, we map the decision forest, including P trees, onto a parallel multi-pipeline architecture withP linear pipelines, as shown in Figure 6.4 where P = 2. The shaded blocks (rule stages) store the leaf-level rule lists while the tree nodes are mapped onto plain-color blocks (tree stages). Each pipeline is used for traversing a decision tree and matching the rule lists attached to the leaf nodes of that tree. The pipeline stages for tree traversal are called the tree stages, while those for rule list matching are called the rule stages. Each tree stage includes a memory block that stores the tree nodes and the cutting logic that generates the memory access address based on the input packet header values. At the end of tree traversal, the index of the corresponding leaf node is retrieved to access the rule stages. Since a leaf node contains a list ofbucketSize rules, we needbucketSize rule stages for matching these rules. All the leaf nodes of a tree have their rule lists mapped onto thesebucketSize rule stages. Each rule stage includes a memory block that stores the full content of rules and the matching logic that performs parallel matching on all header fields. Each incoming packet goes through all the P pipelines in parallel. A different subset of header fields of the packet may be used to traverse the trees in different pipelines. Each pipeline outputs the rule ID or its corresponding action. The priority Pipeline 1 Packet 1 Packet 2 Action ID 1 Action ID 2 Priority Resolver Pipeline 2 Figure 6.4: Multi-pipeline architecture for searching the decision forest (P = 2). 
102 resolver picks the result with the highest priority among the P outputs from the P pipelines. It takes H +bucketSize clock cycles for each packet to go through the architecture, whereH denotes the number of tree stages. We adopt a similar scheme as [32] to map tree nodes onto pipeline stages while managing the memory distribution across stages. The cutting logic is generated based on the cutting information obtained from the tree construction process (Figure 6.3). To further improve the throughput, we exploit the dual-port RAMs provided by state-of- the-art FPGAs so that two packets are processed every clock cycle. As in [32], our architecture supports dynamic rule updates by inserting write bub- bles into the pipeline. Since the architecture is linear, all packets preceding or follow- ing the write bubble can perform their operations while the write bubble performs an update. 6.2 Experimental Results We conducted extensive experiments to evaluate the performance of our decision forest -based schemes, including the algorithms and FPGA prototype of the architecture. 6.2.1 Experimental Setup Due to the lack of large-scale real-life flexible flow rules, we generated synthetic 12- tuple OpenFlow-like rules to examine the effectiveness of our decision forest -based schemes for flexible flow matching. Each rule was composed of twelve header fields that follow the current OpenFlow specification [63]. We used 6-bit field for the ingress port and randomly set each field value. Concretely, we generated each rule as follows: 103 1. Each field is randomly set as a wildcard. When the field is not set as a wildcard, the following steps are executed. 2. For source / destination IP address fields, the prefix length is set randomly from between 1 and 32, and then the value is set randomly from its possible values. 3. For other fields, the value is set randomly from its possible values. In this way, we generated four OpenFlow-like 12-tuple rule sets with 100, 200, 500, and 1K rules, of which each is independent of the others. Note that our generated rule sets include many impractical rules because each field value is set at random. However, we argue that the lower bound of the performance of the decision forest scheme is approximated by using such randomly generated rule sets which do not match well the heuristic (Section 6.1.1) observed in real-life flexible flow matching engines. Better performance can be expected by using the decision forest scheme for large sets of real-life flexible flow rules that will become available in the future. 6.2.2 Algorithm Evaluation To evaluate the performance of the algorithms, we use following performance metrics: Average memory requirement (bytes) per rule is computed as the total memory requirement of a decision forest divided by the number of rules for building the forest. Tree depth is defined as the maximum directed distance from the tree root to a leaf node. For a decision forest including multiple trees, we consider the maxi- mum tree depth among these trees. A smaller tree depth leads to shorter pipelines and thus lower latency. 104 Number of cutting fields (denotedN CF ) for building a decision forest is defined as the maximumN CF among the trees in the forest. Using a smaller number of cutting fields results in less hardware for implementing cutting logic and smaller memory for storing cutting formation of each node. We set bucketSize = 64, depthBound = 16, and varied the number of trees P = 1;2;3;4. 
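The generation procedure of Section 6.2.1 can be summarized by the sketch below. The field list follows the 12-tuple of the OpenFlow specification, with the 6-bit ingress port noted above; the field widths, the wildcard probability, and all names are illustrative assumptions rather than the exact generator used in our experiments.

    import random

    # Approximate widths (in bits) of the 12 OpenFlow match fields; a 6-bit
    # ingress port is used, as described above.
    FIELDS = {'in_port': 6, 'dl_src': 48, 'dl_dst': 48, 'dl_vlan': 12,
              'dl_vlan_pcp': 3, 'dl_type': 16, 'nw_tos': 6, 'nw_proto': 8,
              'nw_src': 32, 'nw_dst': 32, 'tp_src': 16, 'tp_dst': 16}

    def random_rule(wildcard_prob=0.5):
        """Generate one synthetic 12-tuple rule following steps 1-3 above."""
        rule = {}
        for name, width in FIELDS.items():
            if random.random() < wildcard_prob:        # step 1: wildcard the field
                rule[name] = '*'
            elif name in ('nw_src', 'nw_dst'):          # step 2: random IP prefix
                plen = random.randint(1, 32)
                value = random.getrandbits(plen) << (32 - plen)
                rule[name] = (value, plen)
            else:                                       # step 3: random exact value
                rule[name] = random.getrandbits(width)
        return rule

    rule_set = [random_rule() for _ in range(1000)]     # e.g., the 1K-rule set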
Figure 6.5 shows the average memory requirement per rule, where logarithmic plot is used for the Y axis. In the case ofP = 1, we observed memory explosion when the number of rules was increased from 100 to 1K. On the other hand, increasingP dramatically reduced the memory consumption, especially for the larger rule set. Almost 100-fold reduction in memory consumption was achieved for the 1K rules, when P was increased just from 1 to 2. With P 3, the average memory requirement per rule remained on the same order of magnitude for different sizes of rule sets. 1 2 3 4 100 rules 47.46 40.12 40.12 40.12 200 rules 138.88 38.62 38.62 38.62 500 rules 1051.192 99.164 54.428 54.428 1000 rules 10198.75 110.082 68.988 68.988 1 10 100 1000 10000 100000 Bytes / rule P: # of trees Memory requirement vs. P 100 rules 200 rules 500 rules 1000 rules Figure 6.5: Average memory requirement with increasingP . As shown in Figures 6.6 and 6.7, the tree depth and the number of cutting fields were also reduced by increasingP . WithP = 3 or4, six-fold and three-fold reductions 105 were achieved, respectively, in the tree depth and the number of cutting fields, when compared with using a single decision tree. 0 2 4 6 8 10 12 14 1 2 3 4 P: # of trees Tree depth vs. P 100 rules 200 rules 500 rules 1000 rules Figure 6.6: Tree depth with increasingP . 6.2.3 Implementation Results To implement the decision forest for 1K rules in hardware, we examined the perfor- mance results of each tree in a forest. Table 6.1 shows the breakdown with P = 4, bucketSize = 32,depthBound = 4. Table 6.1: Breakdown of aP = 4-tree decision forest Trees # of # of Tree Memory Tree # of Cutting Rules nodes (bytes/rule) depth fields Tree 1 712 545 78.70 2 3 Tree 2 184 265 84.70 2 5 Tree 3 65 17 41.78 1 2 Tree 4 39 9 45.23 1 2 Overall 1000 836 76.10 2 5 106 0 1 2 3 4 5 6 7 8 9 10 1 2 3 4 P: # of trees # of Cutting fields vs. P 100 rules 200 rules 500 rules 1000 rules Figure 6.7: Number of cutting fields with increasingP . We mapped the above decision forest onto the 4-pipeline architecture. Since Block RAMs were not used efficiently for blocks of less than 1K entries, we merged the rule lists of the first two pipelines and used distributed memory for the remaining rule lists. BRAM utilization was improved at the cost of degrading the throughput to one packet per clock cycle, while dual-port RAMs were used. We implemented our design on FPGA using Xilinx ISE 10.1 development tools. The target device was Virtex-5 XC5VFX200T with -2 speed grade. Post place and route results showed that our design achieved a clock frequency of 125 MHz. The resulting throughput was 40 Gbps for minimum size (40 bytes) packets. The resource utilization of the design is summarized in Table 6.2. Table 6.2: Resource utilization of the 4-tree decision forest on FPGA Available Used Utilization # of Slices 30,720 11,720 38% # of 36Kb Block RAMs 456 256 56% # of User I/Os 960 303 31% 107 Chapter 7 Conclusion Power consumption has emerged as a new challenge in the design of packet forward- ing engines for next generation networks. This thesis is among the first designs of packet forwarding engines to achieve both high throughput and low power consump- tion. Though TCAMs are de facto solutions for high-speed packet forwarding, they suffer from high power consumption. We propose mapping state-of-the-art algorith- mic solutions onto parallel architectures that are based on low-power memory such as SRAM. 
The contributions of our work are two-fold: Algorithm-oriented architecture design. We utilize pipelining and multi-processing to achieve a high degree of parallelism in architectures, so that the throughput is significantly improved. We address challenges in mapping state-of-the-art algo- rithmic solutions onto SRAM-based parallel architectures. Architecture-aware algorithm optimization. Customized architecture design pro- vides extra freedom in optimizing the existing algorithms for power and/or mem- ory efficiency. We propose optimized data structure for different packet for- warding problems to achieve significant reduction in power consumption and/or memory requirement. 108 7.1 Summary of Contributions We proposed two heuristics that offer more freedom in mapping trie nodes to SRAM- based linear pipeline architectures. First, any two nodes on the same level of the trie can be mapped to different pipeline stages. Second, we allow each pipeline to be tra- versed from two directions at the same time, by using dual-port SRAMs. Any subtrie can be mapped either from the root or from the leaves. As a result, memory distribution is balanced across the stages in a pipeline, while high throughput of one lookup per clock cycle is sustained. We proposed parallel SRAM-based multi-pipeline architec- tures to achieve even higher throughput. A two-level mapping scheme was proposed to balance the memory requirement among pipelines and across stages. We designed the IP caching schemes and proposed an exchange-based dynamic subtrie-to-pipeline remapping algorithm to balance the traffic among multiple pipelines. The proposed architecture with eight pipelines can store a core routing table with over 200K unique routing prefixes using 3.6 MB of memory and can achieve a high throughput of up to 11.72 billion packets per second, i.e. 3.75 Tbps for minimum size (40 bytes) packets. Compared with TCAM, our architecture achieved2:6-fold and fourteen-fold reduction in power and energy consumption, respectively. We exploited data structure optimization to reduce the power consumption. We formulated the problems by revisiting the conventional time-space trade-off in multi- bit tries. To minimize the worst-case power consumption for a given architecture, a dynamic programming framework was developed to determine the optimal strides for constructing tree bitmap coded multi-bit tries. Simulation using real-life backbone routing tables showed that careful design of the data structure, with awareness of the 109 underlying architecture, could achieve dramatic reduction in power consumption. Dif- ferent architectures could result in different optimal data structures with respect to power efficiency. We proposed several novel architecture-specific techniques to reduce the dynamic power dissipation in SRAM-based pipelined IP lookup engines. First, the pipeline was built as an inherent cache that exploited effectively the traffic locality with min- imum overhead. Second, a local clocking scheme was proposed to exploit the traffic rate variation and to improve the caching performance. Third, a fine-grained memory enabling scheme was used to eliminate unnecessary memory accesses for the input packets. Simulation using real-life traffic traces showed that our solution achieved up to fifteen-fold reduction in dynamic power dissipation. We presented a novel decision-tree-based linear pipeline architecture on FPGAs for wire-speed multi-field packet classification. 
We proposed several novel architecture-specific techniques to reduce the dynamic power dissipation in SRAM-based pipelined IP lookup engines. First, the pipeline was built as an inherent cache that effectively exploited the traffic locality with minimal overhead. Second, a local clocking scheme was proposed to exploit the traffic rate variation and to improve the caching performance. Third, a fine-grained memory enabling scheme was used to eliminate unnecessary memory accesses for the input packets. Simulation using real-life traffic traces showed that our solution achieved up to a fifteen-fold reduction in dynamic power dissipation.

We presented a novel decision-tree-based linear pipeline architecture on FPGAs for wire-speed multi-field packet classification. Several optimization techniques were proposed to reduce the memory requirement of the state-of-the-art decision-tree-based packet classification algorithm, so that 10K unique rules could fit in the on-chip memory of a single FPGA. Due to its linear, memory-based organization, the architecture also supported on-the-fly reconfiguration. To the best of our knowledge, our design was the first FPGA-based packet classification engine that achieved double wire speed while supporting 10K unique rules.

We proposed an FPGA-based parallel architecture, called decision forest, to address the performance challenges of flexible flow matching in next generation networks. We developed a framework to partition a set of complex flow rules into multiple subsets and to build each rule subset into a depth-bounded decision tree. The partitioning scheme was designed so that both the overall memory requirement and the number of packet header fields used for constructing the decision trees were reduced. Extensive simulation and FPGA implementation results demonstrate the effectiveness of our solution. The FPGA design supports 1K OpenFlow-like complex rules and sustains 40 Gbps throughput for minimum size (40 bytes) packets.

7.2 Future Work

It is generally believed that packet forwarding is well studied. The vast body of previous work has been a dauntingly high hurdle for new researchers in this crowded area. However, as the Internet keeps evolving, one thing is known for sure: we are still far from the point where we can label the packet forwarding problem as "solved".

7.2.1 From IPv4 to IPv6

As of August 2010, Geoff Huston's daily IPv4 Address Report predicts that the unallocated IANA pool will be exhausted by the end of May 2011 [25]. This prediction is derived from current trends and does not take into account any last-chance rush to acquire the last available addresses. Currently, IPv4 allocations are accelerating, which pushes the projected exhaustion to even earlier dates. By mid-2012, new devices and services on the Internet will have no choice but to use only IPv6 addresses. For the rest of the Internet to be able to communicate with them, older hosts must implement IPv6 as well, or they must rely on special translator gateway services.

While the current IPv6 table is still relatively small, the envisioned large-scale deployment of IPv6 will result in table sizes at least as large as those for IPv4 [75]. Compared to 32-bit addresses in IPv4, addresses in IPv6 are 128 bits long. Using tries for IPv6 will result in a large trie depth, which depends on the longest prefix length. Though high throughput can still be achieved by mapping a trie onto an SRAM-based pipeline architecture, the latency for an IP packet to traverse such a deep pipeline can be large. Such a deep pipeline also requires a large number of memory blocks, which consume large chip area and power. Hence, we must study new data structures for IPv6 that can be mapped to SRAM-based parallel architectures efficiently. Very little work has been done in this area. Some recent efforts develop novel data structures for IPv6 lookup to achieve better time-space trade-offs [78, 76]. However, these efforts do not consider architectural innovations and do not aim to address the performance challenges discussed in this thesis. One direction that we can pursue is applying our algorithm-architecture co-design framework to these newly proposed data structures to achieve both high throughput and low power consumption.
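To make the depth concern concrete, the following back-of-the-envelope sketch shows how the number of pipeline stages grows with the address width when one multi-bit trie level is mapped to one stage; the 4-bit stride is an arbitrary assumption for illustration, not a design choice from the thesis.

    import math

    def pipeline_stages(address_bits, stride_bits=4):
        # One pipeline stage per multi-bit trie level; stride_bits is assumed.
        return math.ceil(address_bits / stride_bits)

    print(pipeline_stages(32))    # IPv4:  8 stages
    print(pipeline_stages(128))   # IPv6: 32 stages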
7.2.2 Growing Table Size

As the size of forwarding tables keeps growing, increasingly large memory requirements are expected. However, on-chip memory is always limited. Interfacing with external memory is required to support large forwarding tables. A deep pipeline or a multi-pipeline architecture with balanced memory distribution may no longer be an attractive solution due to limited memory bandwidth and I/O pins. A practical pipeline architecture should contain a few large stages that are stored in external memory, while the remaining stages reside on-chip. The memory utilization among the large stages still needs to be balanced. Hence, we can extend our fine-grained mapping scheme to control the memory distribution by assigning large or small capacities to the stages that interface with external memory.

7.2.3 Dynamic Update

The forwarding table is frequently updated. The packet forwarding engine must support dynamic updates without much performance degradation. Route update (adding, changing or removing route entries in a routing table) in most of the existing pipelined IP lookup engines involves two phases: trie update and re-mapping the entire trie onto the pipeline. Each single update triggers the re-mapping of the entire trie, which results in a high update cost.

We can embed the trie-to-pipeline mapping procedure into the trie update (route insertion). Such a scheme can even enable online updates in the hardware architecture. When a new node is created and is to be inserted, we map this node onto the pipeline based on the memory distribution across the stages at that time. The mapping must consider the remaining bits in the prefix. For example, suppose the current bit under consideration is the c-th bit of the prefix, and there are m bits remaining in the prefix for trie extension. If we focus only on the current bit and map it to the stage with the lowest memory utilization, it is possible that the bit is mapped to the last stage, so that the remaining m bits cannot be mapped onto the pipeline. Hence, we find the m stages with the lowest memory utilization and map the current bit onto the first of these m stages.

Another issue that we must consider is the correlation between different prefixes. For example, consider two prefixes p1 and p2 where p1 is a prefix of p2. If p1 is inserted ahead of p2, it is possible that the last bit of p1 is mapped to the last stage. As a result, it is impossible to map the last few bits of p2 onto the pipeline without remapping the bits of p1. Hence, we develop two variants of the incremental mapping scheme. First, we consider the prefix length of the current prefix being inserted. If the prefix length is L < W, where W is the maximum prefix length (e.g., W = 32 for IPv4), we reserve the last W − L stages for other, longer prefixes and do not map any bit of the current prefix onto those stages. Such a method is conservative. The other method is more aggressive: we allow the bits of the current prefix to be mapped to all available stages. When the prefix correlation issue occurs, we remap the bits of the shorter prefix to preceding stages so that the bits of the longer prefix can be mapped onto the pipeline. It is expected that the aggressive method can achieve a more balanced memory distribution across the stages, at the cost of additional time for remapping.
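As a rough illustration of the conservative variant described above, the sketch below picks a pipeline stage for the bit currently being inserted. The function name, its arguments, and the stage_util list are our own illustrative assumptions rather than an interface from the thesis.

    def choose_stage(stage_util, bits_remaining, prefix_len, max_prefix_len, prev_stage=-1):
        # stage_util[i]: current memory utilization (e.g., nodes stored) of stage i.
        num_stages = len(stage_util)
        # Conservative variant: reserve the last (W - L) stages for longer prefixes.
        last_usable = num_stages - (max_prefix_len - prefix_len)
        # Candidate stages must lie after the stage holding the previous bit of this prefix.
        candidates = list(range(prev_stage + 1, last_usable))
        if not candidates:
            raise RuntimeError("no usable stage; fall back to remapping")
        # Take the m least-utilized candidates (m = bits still to be mapped) and map
        # the current bit onto the earliest of them, leaving the others for later bits.
        m = max(1, min(bits_remaining, len(candidates)))
        least_utilized = sorted(candidates, key=lambda s: stage_util[s])[:m]
        return min(least_utilized)

A caller would then increment stage_util at the returned index and pass that index back as prev_stage when mapping the next bit of the same prefix.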
7.2.4 Evolving with Packet Forwarding

The paradigm shifts in networking, such as network virtualization, take packet forwarding to a much broader and higher level. Packet forwarding in future networks tends to be highly flexible, and OpenFlow represents just one such effort; even OpenFlow itself is evolving. Our work on high-performance OpenFlow matching is preliminary, and the proposed decision forest algorithm is far from optimal. We will study optimal rule set partitioning by converting the problem into a graph partitioning or clustering problem.

Meanwhile, we lack a user-controlled benchmarking tool to generate flexible forwarding rules for evaluating various packet forwarding algorithms in future networks. We are developing a tool that allows the user to select the protocol fields and the distribution of each field. The user will have complete control over the structure and the size of the rule table. We believe that such a tool will be very useful for the networking research community.
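As a flavor of what such a tool could look like, the sketch below draws each field of a rule from a user-supplied distribution. The field names, the wildcard probability, and the overall interface are illustrative assumptions rather than the actual tool.

    import random

    def make_rule_table(num_rules, field_generators, wildcard_prob=0.3):
        # field_generators maps each user-selected field name to a function that
        # samples a value from the user-chosen distribution for that field.
        rules = []
        for _ in range(num_rules):
            rule = {field: ('*' if random.random() < wildcard_prob else gen())
                    for field, gen in field_generators.items()}
            rules.append(rule)
        return rules

    # Example: the user selects three header fields and their distributions.
    fields = {
        'ingress_port': lambda: random.randint(0, 47),
        'eth_type':     lambda: random.choice([0x0800, 0x86DD, 0x0806]),
        'ip_proto':     lambda: random.choice([1, 6, 17]),
    }
    rule_table = make_rule_table(1000, fields)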
References

[1] B. Agrawal and T. Sherwood. Ternary CAM power and delay model: Extensions and uses. IEEE Trans. VLSI Syst., 16(5):554–564, 2008.
[2] Mohammad J. Akhbarizadeh, Mehrdad Nourani, Rina Panigrahy, and Samar Sharma. A TCAM-based parallel architecture for high-speed packet forwarding. IEEE Trans. Comput., 56(1):58–72, 2007.
[3] Mohammad J. Akhbarizadeh, Mehrdad Nourani, Deepak S. Vijayasarathi, and T. Balsara. A non-redundant ternary CAM circuit for network search engines. IEEE Trans. VLSI Syst., 14(3):268–278, 2006.
[4] Gary Anthes. Data Centers Get A Makeover. http://www.computerworld.com/printthis/2004/0,4814,97021,00.html, November 2004.
[5] Florin Baboescu, Dean M. Tullsen, Grigore Rosu, and Sumeet Singh. A tree based router search engine architecture with single port memories. In Proc. ISCA, pages 123–133, 2005.
[6] Florin Baboescu and George Varghese. Scalable packet classification. In Proc. SIGCOMM, pages 199–210, 2001.
[7] Anindya Basu and Girija Narlikar. Fast incremental updates for pipelined forwarding engines. In Proc. INFOCOM, pages 64–74, 2003.
[8] CACTI 5.3. http://quid.hpl.hp.com:9081/cacti/.
[9] Lorenzo De Carli, Yi Pan, Amit Kumar, Cristian Estan, and Karthikeyan Sankaralingam. PLUG: flexible lookup modules for rapid deployment of new protocols in high-speed routers. In Proc. SIGCOMM, 2009.
[10] Martin Casado, Teemu Koponen, Daekyeong Moon, and Scott Shenker. Rethinking packet forwarding hardware. In Proc. of 7th ACM Workshop on Hot Topics in Networks (HotNets-VII), October 2008.
[11] Joseph Chabarek, Joel Sommers, Paul Barford, Cristian Estan, David Tsiang, and Stephen Wright. Power awareness in network design and routing. In Proc. INFOCOM, pages 457–465, 2008.
[12] Cisco ASR 1000 Series Aggregation Services Routers. http://www.cisco.com/en/US/prod/collateral/routers/ps9343/data sheet c78-447652.pdf.
[13] Cisco CRS-1 Carrier Routing System. www.cisco.com/web/go/crs.
[14] Colby Walsworth, Emile Aben, kc claffy, and Dan Andersen. The CAIDA anonymized 2009 Internet traces. http://www.caida.org/data/passive/passive 2009 dataset.xml.
[15] Cypress Sync SRAMs. http://www.cypress.com.
[16] Sarang Dharmapurikar, Haoyu Song, Jonathan S. Turner, and John W. Lockwood. Fast packet classification using bloom filters. In Proc. ANCS, pages 61–70, 2006.
[17] Minh Q. Do, Mindaugas Drazdziulis, Per Larsson-Edefors, and Lars Bengtsson. Parameterizable architecture-level SRAM power model using circuit-simulation backend for leakage calibration. In Proc. ISQED, pages 557–563, 2006.
[18] Will Eatherton, George Varghese, and Zubin Dittia. Tree bitmap: hardware/software IP lookups with incremental updates. SIGCOMM Comput. Commun. Rev., 34(2):97–122, 2004.
[19] S. Govind, R. Govindarajan, and Joy Kuri. Packet reordering in network processors. In Proc. IPDPS, pages 1–10, 2007.
[20] Maruti Gupta and Suresh Singh. Greening of the Internet. In Proc. SIGCOMM, pages 19–26, 2003.
[21] Pankaj Gupta, Steven Lin, and Nick McKeown. Routing lookups in hardware at memory access speeds. In Proc. INFOCOM, pages 1240–1247, 1998.
[22] Pankaj Gupta and Nick McKeown. Classifying packets with hierarchical intelligent cuttings. IEEE Micro, 20(1):34–41, 2000.
[23] Pankaj Gupta and Nick McKeown. Algorithms for packet classification. IEEE Network, 15(2):24–32, 2001.
[24] Jahangir Hasan and T. N. Vijaykumar. Dynamic pipelining: making IP-lookup truly scalable. In Proc. SIGCOMM, pages 205–216, 2005.
[25] Geoff Huston. IPv4 Address Report, daily generated. http://www.potaroo.net/tools/ipv4/index.html, August 2010.
[26] IDT Network Search Engines. http://www.idt.com/?catid=58522.
[27] Gajanan S. Jedhe, Arun Ramamoorthy, and Kuruvilla Varghese. A scalable high throughput firewall in FPGA. In Proc. FCCM, 2008.
[28] Weirong Jiang and Viktor K. Prasanna. A memory-balanced linear pipeline architecture for trie-based IP lookup. In Proc. Hot Interconnects (HotI '07), pages 83–90, 2007.
[29] Weirong Jiang and Viktor K. Prasanna. Multi-Terabit IP Lookup Using Parallel Bidirectional Pipelines. In Proc. Computing Frontiers (CF '08), 2008.
[30] Weirong Jiang and Viktor K. Prasanna. Parallel IP Lookup Using Multiple SRAM-based Pipelines. In Proc. IPDPS, 2008.
[31] Weirong Jiang and Viktor K. Prasanna. Towards green routers: Depth-bounded multi-pipeline architecture for power-efficient IP lookup. In Proc. IPCCC, pages 185–192, 2008.
[32] Weirong Jiang and Viktor K. Prasanna. Large-scale wire-speed packet classification on FPGAs. In Proc. FPGA, pages 219–228, 2009.
[33] Weirong Jiang and Viktor K. Prasanna. Reducing dynamic power dissipation in pipelined forwarding engines. In Proc. ICCD, 2009.
[34] Weirong Jiang, Qingbo Wang, and Viktor K. Prasanna. Beyond TCAMs: An SRAM-based parallel multi-pipeline architecture for terabit IP lookup. In Proc. INFOCOM, pages 1786–1794, 2008.
[35] Juniper Networks SRX 5000 Services Gateways. http://www.juniper.net/products/srx/dsheet/100254.pdf.
[36] Juniper Networks T-series Routing Platforms. http://www.juniper.net/products/tseries/100051.pdf.
[37] Juniper Networks T1600 Core Router. http://www.juniper.net.
[38] Stefanos Kaxiras and Georgios Keramidas. IPStash: a set-associative memory approach for efficient IP-lookup. In Proc. INFOCOM, pages 992–1001, 2005.
[39] Alan Kennedy, Xiaojun Wang, Zhen Liu, and Bin Liu. Low power architecture for high speed packet classification. In Proc. ANCS, pages 131–140, 2008.
[40] Kun Suk Kim and Sartaj Sahni. Efficient construction of pipelined multibit-trie router-tables. IEEE Trans. Comput., 56(1):32–43, 2007.
[41] Jon Kleinberg and Eva Tardos. Algorithm Design. Addison-Wesley Longman Publishing Co., Inc., 2005.
[42] Ravi Kokku, Upendra B. Shevade, Nishit S. Shah, Mike Dahlin, and Harrick M. Vin. Energy-Efficient Packet Processing. http://www.cs.utexas.edu/users/rkoku/RESEARCH/energy-tech.pdf, 2004.
[43] M. E. Kounavis, A. Kumar, R. Yavatkar, and H. Vin. Two stage packet classification using most specific filter matching and transport level sharing. Comput. Netw., 51(18):4951–4978, 2007.
[44] Sailesh Kumar, Michela Becchi, Patrick Crowley, and Jonathan Turner. CAMP: fast and efficient IP lookup architecture. In Proc. ANCS, pages 51–60, 2006.
[45] T. V. Lakshman and Dimitrios Stiliadis. High-speed policy-based packet forwarding using efficient multi-dimensional range matching. In Proc. SIGCOMM, pages 203–214, 1998.
[46] Karthik Lakshminarayanan, Anand Rangarajan, and Srinivasan Venkatachary. Algorithms for advanced packet classification with ternary CAMs. In Proc. SIGCOMM, pages 193–204, 2005.
[47] Hoang Le, Weirong Jiang, and Viktor K. Prasanna. A SRAM-based architecture for trie-based IP lookup. In Proc. FCCM, 2008.
[48] Xiaoyao Liang, Kerem Turgay, and David Brooks. Architectural power models for SRAM and CAM structures based on hybrid analytical/empirical techniques. In Proc. ICCAD, pages 824–830, 2007.
[49] Huan Liu. Routing prefix caching in network processor design. In Proc. ICCCN, pages 18–23, 2001.
[50] Wencheng Lu and Sartaj Sahni. Packet forwarding using pipelined multibit tries. In Proc. ISCC, 2006.
[51] Yan Luo, Pablo Cascon, Eric Murray, and Julio Ortega. Accelerating OpenFlow Switching with Network Processors. In Proc. ANCS. ACM, 2009.
[52] Yan Luo, Ke Xiang, and Sanping Li. Acceleration of decision tree searching for IP traffic classification. In Proc. ANCS, 2008.
[53] Yan Luo, Jia Yu, Jun Yang, and Laxmi N. Bhuyan. Conserving network processor power consumption by exploiting traffic variability. ACM Trans. Archit. Code Optim., 4(1):4, 2007.
[54] Alan M. Lyons, David T. Neilson, and Todd R. Salamon. Energy efficient strategies for high density telecom applications. Princeton University, Supelec, Ecole Centrale Paris and Alcatel-Lucent Bell Labs Workshop on Information, Energy and Environment, June 2008.
[55] Malcolm Mandviwalla and Nian-Feng Tzeng. Energy-efficient scheme for multiprocessor-based router linecards. In Proc. SAINT, pages 156–163, 2006.
[56] Nick McKeown, Tom Anderson, Hari Balakrishnan, Guru Parulkar, Larry Peterson, Jennifer Rexford, Scott Shenker, and Jonathan Turner. OpenFlow: enabling innovation in campus networks. SIGCOMM Comput. Commun. Rev., 38(2):69–74, 2008.
[57] Jeffrey C. Mogul, Praveen Yalagandula, Jean Tourrilhes, Rick McGeer, Sujata Banerjee, Tim Connors, and Puneet Sharma. API design challenges for open router platforms on proprietary hardware. In Proc. of 7th ACM Workshop on Hot Topics in Networks (HotNets-VII), October 2008.
[58] Jad Naous, David Erickson, G. Adam Covington, Guido Appenzeller, and Nick McKeown. Implementing an OpenFlow switch on the NetFPGA platform. In Proc. ANCS, pages 1–9. ACM, 2008.
[59] Sergiu Nedevschi, Lucian Popa, Gianluca Iannaccone, Sylvia Ratnasamy, and David Wetherall. Reducing network energy consumption via sleeping and rate-adaptation. In Proc. NSDI, pages 323–336, 2008.
[60] Antonis Nikitakis and Ioannis Papaefstathiou. A memory-efficient FPGA-based classification engine. In Proc. FCCM, 2008.
[61] NLANR network traffic packet header traces. http://pma.nlanr.net/traces/.
[62] OpenFlow Consortium. http://www.openflowswitch.org.
[63] OpenFlow Switch Specification, Version 1.0.0. http://www.openflowswitch.org/documents/openflow-spec-v1.0.0.pdf.
[64] Rina Panigrahy and Samar Sharma. Reducing TCAM power consumption and increasing throughput. In Proc. Hot Interconnects (HotI '02), pages 107–112, 2002.
[65] Ioannis Papaefstathiou and Vassilis Papaefstathiou. Memory-efficient 5D packet classification at 40 Gbps. In Proc. INFOCOM, pages 1370–1378, 2007.
[66] Lu Peng, Wencheng Lu, and Lide Duan. Power Efficient IP Lookup with Supernode Caching. In Proc. Globecom, 2007.
[67] Brad Reed. Sprint goes 40 Gbps on Tier 1 IP net. http://www.networkworld.com/news/2008/071508-sprint-40gps.html, July 2008.
[68] Brad Reed. Verizon moving to 100 Gbps network in '09. http://www.networkworld.com/news/2008/031008-verizon-100gpbs-network.html, March 2008.
[69] RIS Raw Data. http://data.ris.ripe.net.
[70] Miguel A. Ruiz-Sanchez, Ernst W. Biersack, and Walid Dabbous. Survey and taxonomy of IP address lookup algorithms. IEEE Network, 15(2):8–23, 2001.
[71] Sartaj Sahni and Kun Suk Kim. Efficient construction of multibit tries for IP lookup. IEEE/ACM Trans. Netw., 11(4):650–662, 2003.
[72] SAMSUNG High Speed SRAMs. http://www.samsung.com.
[73] Richard Sawyer. Calculating Total Power Requirements for Data Centers. White Paper #3, American Power Conversion (http://www.apcmedia.com/salestools/VAVR-5TDTEF R0 EN.pdf), 2004.
[74] Sumeet Singh, Florin Baboescu, George Varghese, and Jia Wang. Packet classification using multidimensional cutting. In Proc. SIGCOMM, pages 213–224, 2003.
[75] Haoyu Song, Fang Hao, Murali S. Kodialam, and T. V. Lakshman. IPv6 lookups using distributed and load balanced bloom filters for 100Gbps core router line cards. In Proc. INFOCOM, pages 2518–2526, 2009.
[76] Haoyu Song, Murali S. Kodialam, Fang Hao, and T. V. Lakshman. Scalable IP lookups using shape graphs. In Proc. ICNP, pages 73–82, 2009.
[77] Haoyu Song and John W. Lockwood. Efficient packet classification for network intrusion detection using FPGA. In Proc. FPGA, pages 238–245, 2005.
[78] Haoyu Song, Jonathan Turner, and John Lockwood. Shape shifting trie for faster IP router lookup. In Proc. ICNP, pages 358–367, 2005.
[79] Ioannis Sourdis. Designs & Algorithms for Packet and Content Inspection. PhD thesis, Delft University of Technology, 2007.
[80] V. Srinivasan and G. Varghese. Fast address lookups using controlled prefix expansion. ACM Trans. Comput. Syst., 17:1–40, 1999.
[81] Xuehong Sun, Sartaj K. Sahni, and Yiqiang Q. Zhao. Packet classification consuming small amount of memory. IEEE/ACM Trans. Netw., 13(5):1135–1145, 2005.
[82] David E. Taylor. Survey and taxonomy of packet classification techniques. ACM Comput. Surv., 37(3):238–275, 2005.
[83] David E. Taylor and Jonathan S. Turner. Scalable packet classification using distributed crossproducting of field labels. In Proc. INFOCOM, 2005.
[84] Shyamkumar Thoziyoor, Jung Ho Ahn, Matteo Monchiero, Jay B. Brockman, and Norman P. Jouppi. A comprehensive memory modeling tool and its application to the design and analysis of future memory hierarchies. In Proc. ISCA, pages 51–62, 2008.
[85] Ye Tung and Hao Che. Study of flow caching for layer-4 switching. In Proc. ICCCN, pages 135–140, 2000.
[86] Pi-Chung Wang. Scalable packet classification with controlled cross-producting. Computer Networks, 53(6):821–834, 2009.
[87] Thomas Y. C. Woo. A modular approach to packet classification: Algorithms and results. In Proc. INFOCOM, pages 1213–1222, 2000.
[88] Beibei Wu, Yang Xu, Hongbin Lu, and Bin Liu. A practical packet reordering mechanism with flow granularity for parallelism exploiting in network processors. In Proc. IPDPS, pages 133a–133a, 2005.
[89] Xilinx Virtex-5 FPGAs. http://www.xilinx.com/products/virtex5/.
[90] Fang Yu, Randy H. Katz, and T. V. Lakshman. Efficient multimatch packet classification and lookup with TCAM. IEEE Micro, 25(1):50–59, 2005.
[91] Heeyeol Yu and Rabi Mahapatra. A power- and throughput-efficient packet classifier with n bloom filters. IEEE Trans. Comput., to appear.
[92] Francis Zane, Girija J. Narlikar, and Anindya Basu. CoolCAMs: Power-efficient TCAMs for forwarding engines. In Proc. INFOCOM, pages 42–52, 2003.
[93] Chuanjun Zhang. A low power highly associative cache for embedded systems. In Proc. ICCD, 2006.
[94] Kai Zheng, Chengchen Hu, Hongbin Lu, and Bin Liu. A TCAM-based distributed parallel IP lookup scheme and performance analysis. IEEE/ACM Trans. Netw., 14(4):863–875, 2006.
[95] Kai Zheng, Zhiyong Liang, and Yi Ge. Parallel packet classification via policy table pre-partitioning. In Proc. Globecom, 2005.
Abstract
Packet forwarding has long been a performance bottleneck in Internet infrastructure, including routers and switches. While the throughput requirements continue to grow, power dissipation has emerged as an additional critical concern. Also, as the Internet continues to constantly evolve, packet forwarding engines must be flexible in order to enable future innovations. Although ternary content addressable memories (TCAMs) have been widely used for packet forwarding, they have high power consumption and are inflexible for adapting to new addressing and routing protocols.