HIGH PERFORMANCE CLASSIFICATION ENGINES ON PARALLEL ARCHITECTURES by Yun Rock Qu A Dissertation Presented to the FACULTY OF THE GRADUATE SCHOOL UNIVERSITY OF SOUTHERN CALIFORNIA In Partial Fulfillment of the Requirements for the Degree DOCTOR OF PHILOSOPHY (ELECTRICAL ENGINEERING) December 2015 Copyright 2015 Yun Rock Qu Acknowledgments First and foremost, I would like to express my deepest gratitude to my adviser, Prof. Viktor K. Prasanna. Over the past four years, he has kindly given me the opportunity and support for my thesis work, patiently trained me in various research activities, and wisely guided me through obstacles and challenges. His teaching on parallel architectures and algorithms and various advanced topics impacted every aspect of this thesis. Also, I would like to thank Prof. Cauligi (Raghu) Raghavendra and Prof. Minlan Yu for serving on my thesis committee and giving me valuable guidance. In addition, I would like to acknowledge my colleagues at USC, in particular Shijie Zhou, Da Tong, Sanmukh Rao Kuppannagari, Ren Chen, Andrea Sanny, and Shreyas Girish Singapura. I would like to give my special thanks to Diane Demetras and Kathryn Kassar for their extensive knowledge and assistance. Last but not the least, I would like to thank my family for their patience and sup- port: my father, whose professional excellence has always been a role model to me; my mother, whose wisdom has taught me kindness and perseverance; and especially my beloved Ruby Wang, who accompanied me through the hardest time of my Ph.D. study, believed in me and gave me the confidence to finish this work. ii Table of Contents Acknowledgments ii List of Tables vi List of Figures vii Abstract ix Chapter 1: Introduction 1 1.1 Internet and Routers . . . . . . . . . . . . . . . . . . . . . . . . . . 1 1.2 Internet Application Kernels . . . . . . . . . . . . . . . . . . . . . 3 1.3 Software Defined Networking . . . . . . . . . . . . . . . . . . . . 4 1.4 Thesis Contributions . . . . . . . . . . . . . . . . . . . . . . . . . 5 1.5 Thesis Organization . . . . . . . . . . . . . . . . . . . . . . . . . . 8 Chapter 2: Background 9 2.1 Multi-field Packet Classification . . . . . . . . . . . . . . . . . . . 10 2.1.1 Problem Definition . . . . . . . . . . . . . . . . . . . . . . 11 2.1.2 Prior Works . . . . . . . . . . . . . . . . . . . . . . . . . . 15 2.2 Internet Traffic Classification . . . . . . . . . . . . . . . . . . . . . 18 2.2.1 Problem Definition . . . . . . . . . . . . . . . . . . . . . . 19 2.2.2 Prior Works . . . . . . . . . . . . . . . . . . . . . . . . . . 21 2.3 SDN and New Challenges . . . . . . . . . . . . . . . . . . . . . . . 25 2.4 Platforms . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 28 2.4.1 FPGA . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 28 2.4.2 Multi-core Processor . . . . . . . . . . . . . . . . . . . . . 29 2.5 Research Hypothesis and Methodology . . . . . . . . . . . . . . . . 31 2.5.1 Pipelining . . . . . . . . . . . . . . . . . . . . . . . . . . . 31 2.5.2 Partitioning . . . . . . . . . . . . . . . . . . . . . . . . . . 32 2.5.3 Aggregation . . . . . . . . . . . . . . . . . . . . . . . . . . 32 2.5.4 Modular Composition . . . . . . . . . . . . . . . . . . . . . 32 iii Chapter 3: Multi-field Packet Classification 34 3.1 FPGA-based High-performance Updatable Engine . . . . . . . . . . 34 3.1.1 Related Work . . . . . . . . . . . . . . . . . . . . . . . . . 34 3.1.2 Modular PE . . . . . . . . . . . . . . . . . . . . . . . . . . 
36 3.1.3 2-dimensional Pipeline . . . . . . . . . . . . . . . . . . . . 37 3.1.4 Optimization Techniques . . . . . . . . . . . . . . . . . . . 39 3.1.5 Supporting Dynamic Updates . . . . . . . . . . . . . . . . . 42 3.1.6 Experimental Setup . . . . . . . . . . . . . . . . . . . . . . 52 3.1.7 Design Parameters and Performance Metrics . . . . . . . . . 52 3.1.8 Empirical Optimization of Parameters . . . . . . . . . . . . 53 3.1.9 Scalability of Throughput . . . . . . . . . . . . . . . . . . . 55 3.1.10 Updates and Sustained Throughput . . . . . . . . . . . . . . 56 3.1.11 Scalability of Latency . . . . . . . . . . . . . . . . . . . . . 57 3.1.12 Resource and Energy Efficiency . . . . . . . . . . . . . . . 58 3.2 Large-scale Classification on Multi-core Platforms . . . . . . . . . . 60 3.2.1 Related Work . . . . . . . . . . . . . . . . . . . . . . . . . 60 3.2.2 Preprocessing . . . . . . . . . . . . . . . . . . . . . . . . . 61 3.2.3 Searching . . . . . . . . . . . . . . . . . . . . . . . . . . . 68 3.2.4 Merging . . . . . . . . . . . . . . . . . . . . . . . . . . . . 69 3.2.5 Experimental Setup . . . . . . . . . . . . . . . . . . . . . . 71 3.2.6 Design Parameters and Performance Metrics . . . . . . . . . 73 3.2.7 Empirical Optimization of Parameters . . . . . . . . . . . . 73 3.2.8 Scalability of Throughput and Latency . . . . . . . . . . . . 76 3.3 Comparison of Packet Classification Approaches . . . . . . . . . . 77 3.3.1 Comparison between Various Platforms . . . . . . . . . . . 77 3.3.2 Comparison with Prior Works . . . . . . . . . . . . . . . . 78 Chapter 4: Internet Traffic Classification 81 4.1 High-throughput Traffic Classification on FPGA . . . . . . . . . . . 81 4.1.1 Related Work . . . . . . . . . . . . . . . . . . . . . . . . . 81 4.1.2 Converting a Decision-tree . . . . . . . . . . . . . . . . . . 82 4.1.3 Hardware Architecture . . . . . . . . . . . . . . . . . . . . 84 4.1.4 Enabling Virtualization . . . . . . . . . . . . . . . . . . . . 89 4.1.5 Experimental Setup . . . . . . . . . . . . . . . . . . . . . . 92 4.1.6 Performance Metrics . . . . . . . . . . . . . . . . . . . . . 94 4.1.7 Empirical Optimization of Parameters . . . . . . . . . . . . 95 4.1.8 Throughput and Latency . . . . . . . . . . . . . . . . . . . 95 4.1.9 Impact of Virtualization . . . . . . . . . . . . . . . . . . . . 97 4.1.10 Resource Consumption . . . . . . . . . . . . . . . . . . . . 97 4.2 Compact Hash Tables on Multi-core Platforms . . . . . . . . . . . . 99 4.2.1 Related Work . . . . . . . . . . . . . . . . . . . . . . . . . 99 4.2.2 Compact Hash Tables . . . . . . . . . . . . . . . . . . . . . 100 iv 4.2.3 Online Classification . . . . . . . . . . . . . . . . . . . . . 104 4.2.4 Experimental Setup . . . . . . . . . . . . . . . . . . . . . . 106 4.2.5 Performance Metrics and Design Parameters . . . . . . . . . 106 4.2.6 Data Sets and Traces . . . . . . . . . . . . . . . . . . . . . 108 4.2.7 Latency Improvement . . . . . . . . . . . . . . . . . . . . . 109 4.2.8 Scalability of Throughput . . . . . . . . . . . . . . . . . . . 110 4.2.9 Performance Analysis . . . . . . . . . . . . . . . . . . . . . 112 4.3 Comparison of Traffic Classification Approaches . . . . . . . . . . 115 4.3.1 Comparison between Various Platforms . . . . . . . . . . . 116 4.3.2 Comparison with Prior Works . . . . . . . . . . . . . . . . 117 Chapter 5: Conclusion 120 5.1 Summary of Contributions . . . . . . . . . . . . . . . . . . . . . . 121 5.2 Future Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 
122 5.2.1 Exploration on Packet Classification . . . . . . . . . . . . . 123 5.2.2 Exploration on Traffic Classification . . . . . . . . . . . . . 123 5.2.3 Beyond FPGA and Multi-core GPP . . . . . . . . . . . . . 124 Bibliography 126 v List of Tables 2.1 Example of the classic packet classification rule set . . . . . . . . . 11 2.2 Example of the OpenFlow packet classification rule set (10 rules, 15 fields) [23, 93] . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13 2.3 Classic vs. OpenFlow packet classification . . . . . . . . . . . . . . 14 2.4 Challenges . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 26 3.1 Update overhead (clock cycles) . . . . . . . . . . . . . . . . . . . . 50 3.2 Clock rates (MHz) of various designs . . . . . . . . . . . . . . . . . 53 3.3 Resource consumption (s = 4,n = 8 andL = 356) . . . . . . . . . 58 3.4 Approach overview . . . . . . . . . . . . . . . . . . . . . . . . . . 60 3.5 Notations for range-trees . . . . . . . . . . . . . . . . . . . . . . . 63 3.6 Notations for hash tables . . . . . . . . . . . . . . . . . . . . . . . 65 3.7 Notations in the merging phase . . . . . . . . . . . . . . . . . . . . 69 3.8 Throughput (MPPS) / latency (ms) with respect toT andP (AMD, N = 1 K) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 74 3.9 Latency breakdown per batch (Intel) . . . . . . . . . . . . . . . . . 76 3.10 Comparison between various platforms . . . . . . . . . . . . . . . . 77 3.11 Comparison with prior works . . . . . . . . . . . . . . . . . . . . . 79 4.1 Flow-level features tested . . . . . . . . . . . . . . . . . . . . . . . 93 4.2 Clock ratef (MHz) . . . . . . . . . . . . . . . . . . . . . . . . . . 95 4.3 Overall throughputT overall (MCPS) . . . . . . . . . . . . . . . . . . 96 4.4 Throughput and latency for typical decision-trees . . . . . . . . . . 96 4.5 Notations for converting the decision-tree to hash tables . . . . . . . 100 4.6 Notations for the searching and merge stages . . . . . . . . . . . . . 104 4.7 Statistics of a typical C4.5 decision-tree[103] . . . . . . . . . . . . 108 4.8 Comparison between various platforms . . . . . . . . . . . . . . . . 116 4.9 Comparison with prior works . . . . . . . . . . . . . . . . . . . . . 118 vi List of Figures 1.1 Traditional Internet routers . . . . . . . . . . . . . . . . . . . . . . 2 1.2 OpenFlow routers . . . . . . . . . . . . . . . . . . . . . . . . . . . 5 2.1 Producing the intersection of two sets . . . . . . . . . . . . . . . . 16 2.2 The basic architecture of the BV-based approaches . . . . . . . . . 17 2.3 A supervised traffic classification system . . . . . . . . . . . . . . . 19 2.4 A virtualized classification engine . . . . . . . . . . . . . . . . . . 21 2.5 C4.5 decision-tree-based approach . . . . . . . . . . . . . . . . . . 24 2.6 Internal organization of FPGA . . . . . . . . . . . . . . . . . . . . 27 2.7 An example of a multi-core processor . . . . . . . . . . . . . . . . 29 3.1 Splitting a 4-bit field into 2 subfields, each ofs = 2 bits . . . . . . . 35 3.2 Modular PE . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 37 3.3 Example: a 2-dimensional pipelined architecture (N = 4, L = 3) and priority encoders (PrEnc) . . . . . . . . . . . . . . . . . . . . . 38 3.4 A modular PE with striding, clustering, and power gating techniques; the data memory is dual-ported . . . . . . . . . . . . . . . . . . . . 40 3.5 Power gating . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 41 3.6 ModifyingR 2 . . . . . . . . . . . . . . . . . . . 
. . . . . . . . . . 45 3.7 Deleting an old ruleR 1 . . . . . . . . . . . . . . . . . . . . . . . . 46 3.8 Inserting a new ruleR asR 2 . . . . . . . . . . . . . . . . . . . . . 47 3.9 Example: inserting a new rule;d L s e + 1 = 3,d N n e = 4, s = n = 1 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 51 3.10 Scalability with respect toN andn . . . . . . . . . . . . . . . . . . 55 3.11 Scalability with respect toN andL . . . . . . . . . . . . . . . . . . 56 3.12 Sustained throughput . . . . . . . . . . . . . . . . . . . . . . . . . 57 3.13 Latency introduced by the 2-dimensional pipelined architecture (2d pipe) and the priority encoders (pri enc); for eachN, the 4 columns cor- respond to L = 89, 178, 267, and 356 from left to right, respec- tively . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 58 3.14 Energy efficiency (s = 4,n = 8,N = 1024) . . . . . . . . . . . . . 59 3.15 Constructing a range-tree from the unique values . . . . . . . . . . 62 3.16 Constructing a hash table, whereq (m) = 4 andZ = 7 . . . . . . . . 67 vii 3.17 Searching in an exact match field . . . . . . . . . . . . . . . . . . . 69 3.18 Varying the number of threadsT (AMD) . . . . . . . . . . . . . . . 74 3.19 Number of LLC misses and context switches per batch ofP packets (AMD) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 75 3.20 Throughput and latency on both platforms . . . . . . . . . . . . . . 76 4.1 Constructing a RST (4-bit SP, 8-bit APS) . . . . . . . . . . . . . . . 83 4.2 Basic PE . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 84 4.3 Modular PE (c = 2) . . . . . . . . . . . . . . . . . . . . . . . . . . 87 4.4 2-dimensional pipeline using the empty and modular PE; PE[j;i] indexes the PE in the j-th row and i-th column, j = 0; 1;:::;d J c e, andi = 0; 1;:::;d 2 P M1 m=0 Wm s e 1. . . . . . . . . . . . . . . . . . . 88 4.5 A top-down update for a column of PEs . . . . . . . . . . . . . . . 92 4.6 Percentage of resource consumption (J max = 64) . . . . . . . . . . 98 4.7 Converting a decision-treeT toM hash tables . . . . . . . . . . . . 103 4.8 Various shapes of decision-trees and their balance factors (B) . . . . 107 4.9 Latency on both platforms . . . . . . . . . . . . . . . . . . . . . . 109 4.10 Varying the number of concurrent classifiers (P ) . . . . . . . . . . . 110 4.11 Varying the number of leaf nodes (J) . . . . . . . . . . . . . . . . . 111 4.12 Varying the number of features (M) . . . . . . . . . . . . . . . . . 112 4.13 Latency breakdown on the AMD platform . . . . . . . . . . . . . . 113 4.14 Latency breakdown on the Intel platform . . . . . . . . . . . . . . . 113 4.15 Cache misses and context switches with respect toP . . . . . . . . 114 4.16 Cache misses and context switches with respect toJ . . . . . . . . . 114 viii Abstract The Internet backbone, including both core and edge routers, is becoming more flexible, scalable and programmable to enable future innovations in next generation Internet [24, 71]. While the functionality of Internet routers evolves, the performance remains a major concern for real-life deployment. In this thesis, we propose novel algorithms, constructions, and optimization tech- niques on two prominent classes of parallel architectures: Field-Programmable Gate Arrays (FPGAs), and multi-core General Purpose Processors (GPP). We focus on high- performance algorithmic solutions for two Internet application kernels: the multi-field packet classification, and the Internet traffic classification. 
For packet classification, we focus on algorithmic solutions to support high through- put and dynamic updates. We extend the decomposition-based packet classification approaches onto FPGA and multi-core processors. On FPGA, we present 2-dimensional pipelined architecture composed of fine-grained Processing Elements (PE). Efficient power optimization techniques are also proposed on this architecture. On multi-core processors, we use range-tree and hashing to search each field of the input packet header individually in parallel. The partial results from all the fields are merged to produce the final packet header match. Our implementations support very large rule sets consisting of many fields. ix For traffic classification, we present high-throughput and virtualized architectures for online traffic classification on FPGA. We provide a conversion from a decision-tree into a compact rule set table; we map the table to a 2-dimensional pipelined architecture. We develop a novel dynamic update mechanism; it requires small resource overhead and has little impact on the overall throughput. We also present a high-throughput and low- latency traffic classification engine on multi-core platforms. We convert the decision- tree used in the C4.5 algorithm into multiple hash tables. We search all the hash tables in parallel and merge the outcomes into the final classification result. High throughput can be sustained even if we scale up (1) the number of concurrent traffic classifiers, (2) the number of decision-tree leaves, and (3) the number of features examined during the classification process. For both applications, we compare the performance on various platforms with respect to throughput and latency. We vary the problem size to compare the scalability of our designs on FPGA and multi-core platforms. We also provide a detailed comparison between our approaches and existing solutions on both platforms. x Chapter 1 Introduction 1.1 Internet and Routers The Internet is a global network connecting millions of computers. More than 190 countries are linked into exchanges of data, news and opinions. According to Internet Live Stats [15], there were over 3 billion Internet users worldwide. The number of Internet users represents nearly 40 percent of the world’s population. The largest number of Internet users by country is China, followed by the United States and India. In September 2014, the total number of websites with a unique hostname online exceeded 1 billion. This is an increase from one website [10] in 1991. The first billion Internet users worldwide was reached in 2005. Unlike online services, which are centrally controlled, by design, the Internet is decentralized. Each Internet computer, called a host, is independent. Connections are made among all the Internet computers through Internet edge and backbone routers [57]. Operators, both human and machine, can choose which Internet services to use and which local services to make available to the global Internet community. Remarkably, this anarchy by design works exceedingly well. There are a variety of ways to access the Internet through a commercial Internet Service Provider (ISP) [38, 34]. As shown in Figure 1.1, a traditional router contains two main architectural com- ponents: (1) a routing engine, which maintains an IP routing table, and (2) a packet forwarding engine, which maintains a forwarding table. 
The routing engine on the con- trol plane processes routing protocols, receives inputs from network administrators, and 1 Routing Protocol Routing Engine Forwarding Engine Routing Table Forwarding Table Routing Protocol Packets Input Packets Forwarding Table Updates Output Packets Control Plane Data Plane Admin Input Admin Output Figure 1.1: Traditional Internet routers produces the forwarding table. The packet forwarding engine on the data plane receives packets, matches the header information of the packet against the forwarding table to identify the corresponding action, and applies the action for the packet. The routing engine and the forwarding engine perform their tasks independently, although they con- stantly communicate through high-throughput links [19]. As the Internet becomes even more pervasive, performance of the network backbone that supports the universal connectivity becomes critical with respect to various metrics: 1. Scalability: The real-world network traffic in the Internet backbone consists of millions of concurrent packet flows [11]. In addition to the large number of packet flows, a routing table maintained by an edge or core router can be very large [94]. 2 2. Throughput: Major ISPs such as Sprint [107] and Juniper [19] have been going for 40 100 Gbps links. A 40 G Ethernet link requires a processing rate of 125 Million Packets Per Second (MPPS) with minimum size (40 bytes) packets. 3. Latency: The per-hop packet forwarding latency requirement is usually in the order of 100s, whereas the updates of a routing table need to be performed in the range of 1 10 ms. 4. Power: Today’s terabit routers, consuming 10 15 kW per cabinet, are capacity- limited by the high power density. A hardware component consuming 50 100 W=ft 2 can require 1:3 to 2:3 more power for the cooling system [95]. Traditionally, by increasing the maximum network throughput, a backbone routing device can handle bursty traffic, during which an acceptable processing latency must also be enforced. However, as Internet evolves, both the research community and the industry have realized that the harsh truth about today’s network infrastructures is the scalability [71]; this harsh truth also implies that the future Internet must be adaptable to many potential incremental changes on the current Internet infrastructure. In addition, power efficiency is often sacrificed for obtaining higher performance through brute- force expansion. 1.2 Internet Application Kernels Similar as a real-time operating system kernel, the Internet application kernels are devel- oped as essential tasks to meet the requirements of the Internet applications. An Internet application kernel can be designed to be a multi-task preemptive kernel [48], or it can be designed for a particular application [94, 112]. 3 As an example, multi-field packet classification is one of the many well-known Inter- net application kernels. This Internet application kernel involves categorizing packets into “flows” in an Internet router. All the packets belonging to the same flow obey a predefined rule and are processed in a similar manner by the router. For example, based on a predefined rule, all the packets with the same source and destination IP addresses may be discarded. Packet classification is needed for non “best-effort” Internet applica- tions, such as firewalls and quality of service [7], or services that require the capability to distinguish and isolate traffic in different flows for suitable processing [23]. 
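To make the line-rate requirement of Section 1.1 concrete, the classification rate demanded by a fully loaded 40 Gbps link carrying minimum-size (40-byte) packets is simply

\[
\frac{40 \times 10^{9}\ \text{bits/s}}{40\ \text{bytes} \times 8\ \text{bits/byte}} = 125\ \text{MPPS},
\]

which matches the figure quoted above.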
Many Internet application kernels are in fact very challenging problems [100, 59]; this is because they have to be performed at line rate or meet specific performance requirements. Researchers have proposed a variety of algorithms [94, 100, 59] which, broadly speaking, can include geometric algorithms, heuristic algorithms, or hardware- specific search algorithms for Internet application kernels. Due to the rapid growth of the Internet [15], many of these algorithms [47, 97] are no longer suitable for today’s Internet infrastructure. 1.3 Software Defined Networking Software-Defined Networking (SDN) [20] is an emerging architecture that is dynamic, manageable, cost-effective, and adaptable, making it ideal for the dynamic nature of today’s applications. The basic idea of SDN is to decouple the network control plane and the data plane. Compared to the traditional Internet routing devices [6, 18], SDN leads to a new trend of modular devices with directly programmable network control and abstracted forwarding functions [17]. However, SDN also poses significant challenges with respect to the performance in the data plane; for instance, the control plane may require up to 40 fields of the packet headers to be examined at very high speed [36]. 4 Control Plane Data Plane Admin Input Admin Output OpenFlow Protocol Table Updates Features, metadata, etc. User-defined Routing Protocol Input Packets / Flows Output Packets / Flows Routing / Forwarding Engine OpenFlow Table Figure 1.2: OpenFlow routers In SDN, the communication between the separated control and data planes is real- ized by a flexible protocol: OpenFlow protocol [24, 22], as shown in Figure 1.2. Due to this innovation, both the Internet core and edge routers are being extended with more flexible features, including virtualization and dynamic classification [41, 22, 71]. Sup- porting these extensions require unprecedented computation complexity, memory size and memory bandwidth [74, 67]. 1.4 Thesis Contributions To sustain high performance for the current Internet infrastructure, as well as SDN, in this thesis, we propose many novel techniques and algorithms to optimize the network classification engines for two Internet application kernels. Our techniques include both algorithmic (software-based) and architectural (hardware-based) solutions. Our goal is 5 to investigate novel algorithms, data structures, and architectures that exploit the state- of-the-art parallel architectures. Specifically, our contributions include the following: Packet Classification Engines on FPGA We presented a 2-dimensional pipelined architecture for packet classification on FPGA, which achieved high throughput while supporting dynamic updates. We arranged fine-grained PEs in a 2-dimensional array; each PE were designed to access its designated memory locally, resulting in a scalable architecture. The entire array was both horizontally and vertically pipelined. As a result, high clock rates were sustained even for very long packet headers or large rule sets; the performance of the architecture did not depend on the rule set features such as the num- ber of unique values in each field. Dynamic updates modification, deletion and insertion of any rule in the rule set during run-time were supported on the self-reconfigurable PEs with very little impact on the sustained throughput. Experimental results showed that, for a 1 K 15-tuple rule set, a state-of-the-art FPGA sustained 800 MPPS throughput with 1 million updates per second. 
Our architecture demonstrated 4 energy efficiency while achieving 2 throughput compared to TCAM. Packet Classification Engines on Multi-core GPP We proposed a decomposition- based packet classification approach on multi-core processors; it supported large rule sets consisting of a large number of packet header fields. We exploited range-trees and hashing techniques to search each field of the input packet headers individually in parallel. The partial results from all the fields were stored in Rule ID (RID) sets; they were merged linearly to produce the final classification results. We implemented our approach on state-of-the-art multi-core processors, with a detail evaluation with respect to throughput and latency for rule set size ranging from 1 K to 32 K. Experimen- tal results showed that, for a 32 K rule set, our algorithms achieved 20 MPPS throughput and 22:1 ms processing latency for 320 K packets on a state-of-the-art 16-core platform. 6 Traffic Classification Engines on FPGA We presented a high-throughput and vir- tualized architecture for online Internet traffic classification. To explore massive paral- lelism, we provided a conversion from a generic decision-tree into a RST; the conversion technique did not depend on the shape, depth, or degree of the tree. We employed mod- ular PEs and mapped the RST to a 2-dimensional pipelined architecture. To support hardware virtualization, we developed a novel dynamic update mechanism; it required small resource overhead and had little impact on the overall throughput. To evaluate the performance of this architecture, we implemented an online traffic classification engine on a state-of-the-art FPGA. Post-place-and-route results showed that, our clas- sification engine achieved 533 MCPS throughput, a 4 improvement compared with existing dynamically updatable online traffic classification engines on FPGA. Traffic Classification Engines on Multi-core GPP We presented a high-throughput and low-latency traffic classification engine on multi-core platforms. We converted the C4.5 decision-tree into multiple compact tables. All the compact tables were searched in parallel; efficient hashing techniques were employed to reduce the processing latency. The outcomes from all the tables were merged into the final classification result. High throughputs were sustained even when we scaled up (1) the number of concurrent traf- fic classifiers, (2) the number of decision-tree leaves, and (3) the number of features examined during the classification process. We prototyped our design on state-of-the- art multi-core platforms. For a typical C4.5 decision-tree consisting of 92 leaf nodes and 7 flow-level features, we achieved 134 MCPS throughput and 239 ns processing latency per classification. Even for highly imbalanced decision-trees or large decision- trees consisting of up to 2 K leaf nodes, our traffic classification engine sustained high throughput and low latency without sacrificing classification accuracy ( 98:15%). We 7 achieved 2 throughput compared with the classic C4.5 decision-tree-based implemen- tations, and at least 12 throughput compared with the existing traffic classifiers on multi-core platforms. 1.5 Thesis Organization The rest of the thesis is organized as follows: Chapter 2 covers the background of the two application kernels, and states the new challenges in the context of SDN and emerging Internet infrastructures. Chapter 3 details the unique construction of our packet classification engines on FPGA and on multi-core processors. 
This chapter compares the performance of our classification engines on various platforms, and compares our designs with existing packet classification engines. Chapter 4 presents our novel traffic classification engines on FPGA and on multi- core processors. This chapter makes a thorough performance comparison among our designs and existing traffic classification engines. Chapter 5 concludes the thesis and presents the future research directions. 8 Chapter 2 Background The Internet continually evolves in scope and complexity, much faster than our ability to characterize, understand, control, or predict it. The basic element in a communications network is the network router [4, 11]. A wide variety of functions (i.e., Internet appli- cation kernels) need to be applied the network traffic, including the firewall processing, traffic monitoring, and heavy hitter detection [111]; the goals of these application ker- nels are to extract information and control the behavior of the packet flows. In this thesis, we focus on two Internet application kernels: the well-known multi-field packet classification [48], and the Internet traffic classification [16]. These two kernels are both classification problems, and they are widely used for the following Internet applications: Firewall processing: A firewall is a network security system that monitors and controls the incoming and outgoing network traffic based on predetermined secu- rity rules [80]. A firewall typically establishes a barrier between a trusted, secure internal network and another outside network, such as the Internet, that is assumed to not be secure or trusted. Network firewalls are a software appliance running on general purpose hardware, or hardware appliances that filter traffic between two or more networks. Network Address Translation (NAT): Routers that pass data between networks contain firewall components and can often perform basic routing functions as well. Among these routing functions, NAT [75] is a methodology of remapping one routing space into another by modifying network address information in the packet headers; this is done while the packets are in transit across a traffic routing 9 device. The technique was originally used for ease of rerouting traffic in IP net- works without renumbering every host. It has become a popular and essential tool in conserving global address space allocations after the IPv4 address exhaustion [9, 3]. Agent server: The network routers may also offer other functionality to the inter- nal network they protect such as acting as a DHCP, DRP, or VPN server [75] for that network. For instance, the VPN connections over an intermediate network can save the cost of long-distance phone service and hardware costs associated with using dial-up or leased line connections. A VPN solution includes advanced security technologies such as data encryption, authentication, authorization, and Network Access Control. Network Access Control (NAC): On routers and switches, an Access Control List (ACL) refers to rules that are applied to port numbers or IP Addresses, each with a list of hosts and / or networks permitted to use the service [1]. It is possible to configure the ACL based on network domain names; in that case, the device enforcing the ACL must separately resolve names to numeric addresses. Both individual servers as well as routers can have network ACLs. The ACL can gen- erally be configured to control both inbound and outbound traffic. 
Like firewalls, ACLs are subject to security regulations and standards. 2.1 Multi-field Packet Classification As one of the most important Internet application kernels, multi-field packet classifica- tion [48] enables routers to support firewall filtering, Quality of Service differentiation, policy routing, and other value added services. A packet can be classified by a set of prioritized rules, each consisting of many fields in the packet header. For example, the 10 Table 2.1: Example of the classic packet classification rule set Rule SA (8-bit) DA (8-bit) SP (4-bit) DP (4-bit) Prtl (2-bit) Priority R 0 1110* 01* [0001:0011) [0110:0111) 01 1 R 1 0000* 1101* [1010:1101) [0010:0110) 10 1 R 2 100* 00* [0010:0101) [0110:0111) 11 3 R 3 10* 0011* [0000:0011) [0110:0111) 01 2 R 4 111* 010* [0010:0101) [0110:0111) 00 2 R 5 001* 1* [0001:0011) [0110:0111) 01 4 R 6 0* 11* [0001:0011) [0110:0111) 10 0 R 7 * 0110* [0000:1111) [0000:0111) 11 5 fields can be source / destination IP addresses, or source / destination port numbers [89]. A packet is considered matching a rule only if its header matches all the fields of the rule. If a packet matches multiple rules, the matched rule with the highest priority is usually returned; alternatively, some applications require returning all the matched rules [110]. Although multi-field packet classification has been well studied over the past decade [47, 48, 97, 108], the packet classification engines face challenges of (1) supporting large rule sets, (2) sustaining high throughput, (3) reducing processing latency, and (4) supporting dynamic updates [48, 93]. 2.1.1 Problem Definition 2.1.1.1 Classic Packet Classification An individual predefined entry used for classifying a packet is denoted as a rule; a rule is associated with a unique rule ID (RID) and a priority. The data set consisting of all the rules is called rule set. 11 Definition 2.1. Packet classification: Given a packet header having M fields , and a rule set ofN rules, out of all the rules matching the packet header, report the RID of the rule whose priority is the highest. A packet header is considered to be matching a rule only if it matches all the fields of that rule. The classic packet classification [48] involves 5 fields in the packet header: the Source/Destination IP Addresses (SA/DA), Source/Destination Port numbers (SP/DP), and the transport layer protocol (Prtl). Note different fields in a rule can require different types of match criteria; thus, a rule can consist of prefixes, exact values, or / and ranges. Usually the prefix match is considered in the SA and DA fields, and the exact match is considered in all the other fields. We show an example in Table 2.1; in this example, the 5-field rule set containsN = 8 rules 1 . Definition 2.2. A field requiring the prefix match is a prefix match field. Similarly, the range match field, and the exact match field can be defined. A packet is considered to be matching a rule only if it matches all the fields of that rule. A packet may match multiple rules; out of all the matching rules, usually only the rule with the highest priority is used to take action [56]. 2.1.1.2 OpenFlow Packet Classification OpenFlow packet classification requires a larger number of packet header fields to be examined. For example, in the current specification of OpenFlow protocol [23], a total number of 15 fields consisting of 356 bits in the packet header have to be compared against all the rules in the rule set. We show an example rule set in Table 2.2. 
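To make the match criteria of Definition 2.2 concrete, the sketch below tests a single packet header field against one rule field; the field widths, encodings, and helper names are hypothetical and chosen only for illustration.

```cpp
#include <cstdint>

// Prefix match: the rule stores the prefix bits and a mask covering them,
// e.g. "1110*" on an 8-bit field becomes value = 0xE0, mask = 0xF0.
bool prefix_match(uint32_t header, uint32_t value, uint32_t mask) {
    return (header & mask) == value;
}

// Exact match: a wildcard "*" is modeled as a rule field that accepts anything.
bool exact_match(uint32_t header, uint32_t value, bool wildcard) {
    return wildcard || header == value;
}

// Range match over a half-open range [lo, hi), as used in the port fields.
bool range_match(uint32_t header, uint32_t lo, uint32_t hi) {
    return lo <= header && header < hi;
}
```

A packet header matches a rule only if every per-field test of this kind succeeds; among all matching rules, the one with the highest priority is reported (Definition 2.1).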
A “*” in a particular field of a rule indicates the corresponding rule can match any value in that field. 1 The field widths can vary in reality. 12 Table 2.2: Example of the OpenFlow packet classification rule set (10 rules, 15 fields) [23, 93] RID Priority Ingr Meta- Eth Eth Eth VLAN VLAN MPLS MPLS SA DA Prtl ToS SP DP data src dst type ID priority label tfc Field Length 32 64 48 48 16 12 3 20 3 32 32 8 6 16 16 Field Typey E E E E E E E E E P P E E E E R 0 2 5 * 00:13:A9:00:42:40 00:13:08:C6:54:06 0x0800 * 5 0 * 001* * TCP 0 * * R 1 1 * * 08:00:69:02:FC:07 00:FF:FF:FF:FF:FF * 100 7 16000 0 00* 1011* UDP * * * R 2 3 * * * 00:00:00:00:00:00 0x8100 4095 7 * * 1* 1011* * * 2 5 R 3 4 1 * 00:FF:FF:FF:FF:FF * * 4095 * * * 1* 1* * 0 7 5 R 4 0 4 * FF:FF:FF:FF:FF:FF * * 2041 * * * 110* 01* * 0 80 123 R 5 * 5 * 00:13:A9:00:42:40 00:13:08:C6:54:06 0x0800 * 5 0 * 001* * TCP 0 * * R 6 * * * * 00:FF:FF:FF:FF:FF 0x0100 100 7 16000 0 00* 1011* UDP * * * R 7 3 * * * 00:00:00:00:00:00 0x8100 4095 7 * * 1* 1011* * * 2 5 R 8 4 * * 00:FF:FF:FF:FF:FF * 0x0800 4095 * * * 1* 1* * 0 7 5 R 9 0 5 * FF:FF:FF:FF:FF:FF * 0x0100 1029 * * * 110* 01* * 0 80 123 y: “E” as exact match, and “P” as prefix match 13 Table 2.3: Classic vs. OpenFlow packet classification Type Classic OpenFlow No. of fields (M) 5 15 Pkt. header length (L) 104 bits [44] 356 bits [23] Update rate () Relatively static 10 K updates/s [104] The current OpenFlow specification [23] suggests that the prefix match only needs to be performed in the SA and DA fields out of the 15 fields, while all the other fields require the exact match. This specification can be changed in the future, and more generic range match 2 can be required in each field [92]. For example, a rule may check whether the port number of the input packet falls into [0; 128). Compared to the prefix match and the exact match, the generic range match can be more challenging due to the range expansion when converting ranges into prefixes [99]. As shown in Table 2.3, compared to the classic packet classification, it is more chal- lenging to achieve high performance for OpenFlow packet classification: 1. Large-scale: OpenFlow requires a large number of bits and packet header fields to be processed for each packet (largeM and largeL). 2. Dynamic updates: OpenFlow places high emphasis on dynamic updates (large), including rule modification, deletion, and insertion. 2 The prefix match and the range match can be both viewed as special cases of the generic range match. 14 2.1.2 Prior Works 2.1.2.1 Packet Classification Techniques Most of the packet classification algorithms used in hardware or software fall into two major categories: the decision-tree-based [47, 56] and the decomposition-based [48, 90] algorithms. The decision-tree-based approaches involve cutting the search space recursively into smaller subspaces based on the information from one or more fields in the rule. In [56], a decision tree is mapped onto a pipelined architecture on FPGA; for a rule set containing 10 K rules, a throughput of 80 Gbps is achieved for packets of minimum size (40 bytes). However, the performance of the decision-tree-based approaches is rule-set-dependent. A cut in one field can lead to duplicated rules in other fields (i.e., rule set expansion [47]). As a result, a decision-tree can use up toO(N M ) memory; this approach can be impractical. The decomposition-based approaches [87, 90] first search each packet header field individually. The partial results are merged to produce the final result. 
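A minimal sketch of this decompose-and-merge flow is shown below (anticipating the bit-vector representation discussed next); the rule count, field count, and precomputed per-field tables are hypothetical.

```cpp
#include <array>
#include <bitset>
#include <cstdint>
#include <vector>

constexpr int N = 8;  // number of rules (illustrative)
constexpr int M = 5;  // number of header fields (illustrative)

using RuleBV = std::bitset<N>;  // one bit per rule: bit i = rule i matches

// per_field[m][v] is built offline: the set of rules whose field m matches the
// (discretized) header value v. In practice each field is searched with its
// own structure, e.g. a trie, a range-tree, or a hash table.
RuleBV classify(const std::vector<std::vector<RuleBV>>& per_field,
                const std::array<uint32_t, M>& header) {
    RuleBV match;
    match.set();                           // start from "every rule matches"
    for (int m = 0; m < M; ++m)            // search each field independently
        match &= per_field[m][header[m]];  // merge partial results: bitwise AND
    return match;                          // set bits = rules matching all fields
}
```

The partial result per field could just as well be a list of rule IDs; the bit-vector form simply turns the merge into a bitwise AND, which is the idea exploited by the BV-based approaches below.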
To merge the par- tial results from all the fields, hash-based merging techniques [87] can be explored; how- ever, these approaches either require expensive external memory accesses, or rely on a second hardware module to solve hash collisions. For decomposition-based approaches, the complexity of searching a specific field is usually dependent on some rule set fea- tures, such as the number of unique rules in a field [90, 87]. 2.1.2.2 Bit Vector A Bit Vector (BV) is a specific representation for a set of numbers. A bit in a BV indicates whether or not the corresponding number exists in the set. For example, we show the BV representations for two setsf0; 1; 3g andf1; 3; 5; 6g in Figure 2.1. The BV 15 , , , , , 1 1 0 1 0 0 0 0 0 1 0 1 0 1 1 0 ∩ & → 0 1 0 1 0 0 0 0 → , Figure 2.1: Producing the intersection of two sets representation is hardware-friendly, especially for set operations such as intersection and union; the reason is that by using BVs, the set operations can be translated into simple bitwise operations. In Figure 2.1, the bitwise AND operation produces the intersection of the two sets:f1; 3g. As a decomposition-based packet classification approach, the BV approach [55] is a specific technique in which the lookup on each field returns anN-bit vector. Each bit in the bit vector corresponds to a rule. A bit is set to 1 only if the input matches the corresponding rule in this field. A bit-wise logical AND operation can be exploited to gather the matches from all the fields. The Field-Split BV (FSBV) approach [55] and its variants [44] split each field into multiple subfields of s-bits; a rule is mapped onto each subfield as a ternary string defined onf0; 1;g s . The lookup operations can be performed in all the subfields in a pipelined fashion; the partial result in each pipeline stage is represented by a BV of N bits. Logical AND operations are used to merge all the extracted BVs to generate the final match result on FPGA. We show the basic architecture [55, 44] of BV-based 16 Packet header Priority Encoder Match result Memory AND Registers Processing Element (PE) Processing Element (PE) … Packet header BV Figure 2.2: The basic architecture of the BV-based approaches approaches in Figure 2.2; FSBV approach can be viewed as a special case ofs = 1. In general, the BV-based approaches require rules to be represented in ternary strings; thus they suffer from the range expansion when converting ranges into prefixes [99]. Also, to access an N-bit data, wires of length O(N) are often used for the memory; as N increases, the clock rate of the entire pipeline deteriorates. 2.1.2.3 Dynamic Updates Dynamic updates for packet classification has been a well-defined problem [48], where the classification rule set is dynamically changed during run-time. In [48], two algo- rithms are proposed based on tree or trie structures to support dynamic updates; they require O(log M N) and O(log M+1 N) update time, respectively, for a M-field rule set consisting of N rules. They are too expensive for OpenFlow packet classification (M = 15). For the same reason, most of the decision-tree-based approaches cannot 17 easily support fast dynamic updates. Some of the decision-tree-based approaches [56] require the trees to be recomputed and remapped onto FPGA whenever the rule set needs to be updated; this is very expensive. Some of the decomposition-based approaches [87] explore external memory on FPGA; for each update, a number of external memory write accesses must be performed. This is also very expensive. 
In summary, we are not aware of any prior solution sup- porting high performance on hardware. While remapping an optimized architecture onto FPGA is quite time-consuming, it is, however, easy to use software directly for dynamic updates [42, 85]. For example, a software-based classification engine can be recompiled much faster, resulting in a very small update overhead. Since software-based dynamic updates are less challenging, we ignore the discussion on this topic in this thesis. 2.2 Internet Traffic Classification Internet traffic classification is a task classifying traffic based on features passively observed in the traffic, and according to specific classification goals [16]. A coarse- grained classification goal can be, for example, whether the traffic is transaction- oriented, bulk-transfer, or peer-to-peer file sharing. The classification result can also be finer-grained, i.e., to identify the exact application represented by the traffic. Traffic features include the port number, packet size, or statistics of the packet payload. Traffic classification [35] benefits many value-added services. Due to the rapid growth of the Internet, online traffic classification faces a major challenge: the increas- ing data rate of the network traffic. For example, the bandwidth of the current Internet has evolved to over hundreds of gigabits per second, while most of the existing traffic classifiers only support a few tens of gigabits per second throughput [68, 37]. 18 Feature Extraction Training Discretization Decision-making Unlabeled data Classification Engine Network traffic Discrete values Traffic class database Labeled data Online classification Figure 2.3: A supervised traffic classification system 2.2.1 Problem Definition 2.2.1.1 Classic Traffic Classification Traffic classification [33, 59] requires the network traffic to be categorized into various application classes; e.g., HTTP [31], FTP [32], P2P [62], etc. Definition 2.3. A feature is a characteristic of a single packet (a packet-level feature), or statistics of a set of packets (a flow-level feature). A feature can be extracted directly from a packet header, or calculated based on packet headers. There are two types of traffic classification problems: The supervised traffic classification are based on a set of features accurately labeled with traffic classes, while the unsupervised classification requires the data to be labeled first. The supervised traffic classification consists of 2 phases: training and online clas- sification, as shown in Figure 2.3. The traffic classification engine is configured in the 19 training phase [103, 49]. In the online classification phase, a packet flow is examined by the classification engine, and a final decision on the application class is made. Our research focus is on the online classification phase, since this is the performance bottle- neck of the entire system. 2.2.1.2 Virtualized Traffic Classification As the Internet evolves, there is a trend towards hardware virtualization [65]; i.e., to share the hardware platform among multiple users. For example, router virtualization can be described as the consolidation of multiple physical routers to a single shared hardware platform. The process of virtualization must be transparent to users such that the users should experience little difference in the service received. 
Virtualizing high-speed hardware-based traffic classification engine is challenging; it is not straightforward to accurately classify a large amount of traffic flows coming from many networks. As shown in Figure 2.4, packet flows p 0 ; p 1 ;:::; p S1 from S networks are time-multiplexed into one stream of data. Complications arise, however, when each network requires a different set of criteria to classify its traffic. For example, a specific network may require the traffic to be classified into 4 application classes while another network only cares about 3 application classes. Definition 2.4. Virtualized Traffic Classification: Given the incoming data stream time- multiplexed by S networks, build an online traffic classification engine shared by S networks without sacrificing classification accuracy and overall throughput. 20 Packet flows Classification Engine Results ⁞ − External controller ... Figure 2.4: A virtualized classification engine 2.2.2 Prior Works 2.2.2.1 Traffic Classification Approaches Depending on the underlying classification algorithms applied, we can categorize exist- ing traffic classification approaches into 4 major categories: 1. Port-number-based schemes classify traffic based on transport-layer port num- bers; they are no longer reliable since many applications today assign port num- bers dynamically [62]. 2. Deep Packet Inspection (DPI) compares the traffic payload with known signatures. DPI-based techniques [40] can achieve the highest accuracy, but they face the privacy issue. 3. Heuristic-based techniques classify traffic based on heuristic patterns; they usu- ally suffer from low accuracy, or cannot support the traffic volume in the Internet backbone routers. 21 4. Machine-Learning (ML) techniques [76] examine statistical properties of the traf- fic. The ML-based techniques demonstrate high classification accuracy, but real- izing high-performance ML-based traffic classifier is still challenging. We use this category of approaches for Internet traffic classification in this thesis. The ML-based traffic classification algorithms [35, 76, 59, 33] explore statistical prop- erties of traffic flows to classify network traffic. There have been large amounts of ML- based algorithms, including Naive Bayesian [73], K-means [66], and Support Vector Machine (SVM) [43]. In [33], a set of algorithms including Naive Bayesian, SVM, and C4.5 algorithm are evaluated. A total number of 22 features, including packet-level features and flow- level features are used to build the ML-based classifiers. Experimental results show that, C4.5 algorithm gives the best accuracy (> 83% accuracy for SSH traffic and> 97:8% accuracy for Skype traffic) among all the considered techniques. In [59], 9 packet- level and flow-level features are tested for traffic classification; packet size, port number and their related statistics are shown to achieve the highest accuracy. Discretization techniques are also explored in [59] for traffic classification to improve the classification accuracy. 2.2.2.2 Online Traffic Classification on FPGA Many existing works have used reconfigurable hardware for online traffic classification. In [68], an FPGA-based architecture is presented for the C4.5 algorithm; explicit range match is explored and memory accesses are parallelized to improve the performance. No post-place-and-route results are reported on FPGA. In [54], an FPGA-based architecture is proposed for multimedia traffic classification. 
The classifier is based onk-Nearest-Neighbor algorithm and achieves high accuracy for 22 large training data sets. This approach is restricted to classifiers with a small number of application classes. The C4.5 decision-tree [33, 103] is widely used in the online traffic classification phase. In Figure 2.5, we show a typical C4.5 decision-tree-based approach consisting of two steps: Discretization A complete range-tree [39]T m (m = 0; 1;:::;M 1) is built for each of theM features. In a specific range-treeT m , each non-leaf node stores a range bound- ary; each leaf node stores a unique number representing a non-overlapping range. In Figure 2.5,T 0 andT 1 (M = 2) are built for the Source Port number (SP) and the Aver- age Packet Size (APS) features, respectively. Supposea 3 = 0101 anda 5 = 0111, then the leaf nodea 4 = 0110 inT 0 represents the range [0101; 0111). Without loss of gen- erality, we use half-open ranges; in this example, a 4 can be any number in the range [0101; 0111). During discretization, each feature collected from the traffic is searched in its corresponding range-tree until a leaf node is reached; the outputs of the discretization process areM unique numbers. Decision-making A decision-tree T takes the outputs from all the T m ’s and makes the final decision on the traffic class. The decision-tree T is searched using the M outputs from the discretization process. Each non-leaf node in the decision-tree makes a decision based on the M outputs and the outcome of a “true/false” statement. In Figure 2.5, the root node inT represents the statement “SP = 0110”, anda 4 andb 2 are the outputs from discretization. Sincea 4 = 0110, and “SP = 0110” is true, we go to the left child of the root node. This process continues until a leaf node ofT is reached, where a final decision on the traffic class can be made. 23 Range-tree 3 1 5 0 2 4 6 P2PTV HTTP P2PTV MSN Skype MSN SP=0110? YES No Decision-tree 1 0 2 Range-tree unique number Figure 2.5: C4.5 decision-tree-based approach The range-trees and decision-tree can be mapped onto pipelined search engines on FPGA [103]: each tree level is mapped to one pipeline stage. We denote such an imple- mentation as the state-of-the-art implementation. This state-of-the-art implementation has the following disadvantages: (1) Two types of trees are built. If updates are to be performed, usually both types of trees have to be adjusted. (2) For a balanced decision- tree ofJ leaves, the length of the longest wires in their implementation isO(J). As the tree grows larger, both the clock rate and throughput degrade for large trees on FPGA. 2.2.2.3 Hardware and Software Virtualization A naive solution to the problem defined in Definition 2.4 is to constructS high-accuracy classifiers using designated hardware resources; the number of networks supported is constrained by the total amount of hardware resources available [72]. To serve more networks, more efficient solutions are still to be explored. In this thesis, our approach relies on an efficient dynamic update mechanism on the hardware architecture. With each packet flow labeled its network ID, only one packet 24 flow from a particular networkp s (s = 0; 1;:::; S 1) is served by the classification engine at a specific time. The classification engine is dynamically updated to serve all the packet flows. The major challenge for this approach is: frequent incremental updates usually have negative impact on the performance [54]. 
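A compact sketch of the per-flow lookup in Figure 2.5 (with hypothetical boundaries, node layout, and helper names) also shows why run-time updates are awkward for this approach: the range-trees used for discretization and the decision-tree must be kept mutually consistent.

```cpp
#include <algorithm>
#include <cstdint>
#include <vector>

// Discretization: a sorted boundary array stands in for the range-tree T_m;
// the returned index identifies the non-overlapping range the value falls in.
int discretize(const std::vector<uint32_t>& boundaries, uint32_t value) {
    return static_cast<int>(std::upper_bound(boundaries.begin(),
                                             boundaries.end(), value) -
                            boundaries.begin());
}

// Decision-making: each non-leaf node evaluates a true/false statement such as
// "SP = 0110?" on one discretized feature; each leaf stores a traffic class.
struct Node {
    int feature;  // index of the discretized feature tested here, -1 at a leaf
    int value;    // the statement is "disc[feature] == value?"
    int yes, no;  // child indices for the true / false outcomes
    int label;    // traffic class stored at a leaf
};

int decide(const std::vector<Node>& tree, const std::vector<int>& disc) {
    int i = 0;                                  // start at the root
    while (tree[i].feature >= 0)                // walk down until a leaf
        i = (disc[tree[i].feature] == tree[i].value) ? tree[i].yes : tree[i].no;
    return tree[i].label;
}
```

Changing a single split value touches both structures, which is exactly the reconstruction cost discussed next.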
For many existing tree- based implementations [103], it would be very expensive, if at all feasible, to reconstruct an optimal classification engine repeatedly. Note that many existing works on virtualization employ virtual machines on soft- ware [42, 85]. The software-based approaches offer flexibility and isolation among virtual instances; they are easy, inexpensive, and beyond the scope of this thesis. 2.3 SDN and New Challenges Traditional Internet routers and switches [5] are usually closed and integrated with both the control plane and the data plane [52]. These components are both expensive and lack flexibility; also, they are neither user-friendly nor easy to program [36]. To test new ideas in enterprise networks with sufficient scale and realism, a novel concept is presented in [71] where an open protocol is used to efficiently manage the hardware devices. It allows researchers to run experiments on heterogeneous switches in a uniform way and evaluate their ideas in real-world traffic settings. It also enables innovation in SDN [36, 60, 17, 25], a leading architecture in the next generation Internet. OpenFlow [71] has been proposed as an interface connecting hardware devices to the software-based control plane; it has already been developed from a conceptual research idea [21] into a commercial standard [20]. The current specification of the OpenFlow protocol [23] is capable of supporting various types of network applications, while it also leads to more challenges in the data plane design. For instance, initially, the previous OpenFlow protocol only required 12-field packet classification in the data plane. The 25 Table 2.4: Challenges Traditional SDN data plane data plane Fabric data plane Edge data plane No. of 16 million [11] 96 million [27] 10 K concurrent flows Granularity packet packet/flow/any [23] Throughput 150 Gbps [6, 18] > 400 Gbps Processing hundreds of a few latency milliseconds [64] milliseconds Power per switch 5 10 kW [6, 18] 70% less [24] Existing complex, simple, programmable, solutions expensive high latency Challenges sustain high throughput achieve scalability and low latency, and low power, support dynamic updates support large tables and virtualization, and dynamic updates exploit multi/many-core platforms current version of OpenFlow protocol allows up to 40-field packet classification [36] to be performed in the data plane. The additional fields to be matched in the packet header extend the original 12 fields to further support IPv6, Address Resolution Protocol (ARP) and more network functions. For large-scale problems in the data plane of SDN core routers (fabric data plane), the hardware limitation becomes the main performance bottleneck, which in turn lim- its the scalability of SDN [105, 58, 61, 83]. In the data plane of SDN edge routers / switches (edge data plane), the key challenge is to design data structures and algorithms to achieve both high throughput and low latency. With most of the complex control functions moved into the control plane, the exist- ing solutions for SDN data plane typically employ simple hardware. As a result, the performance of the fabric and edge data planes is usually not well-studied. In addi- tion, SDN usually requires the classification rule sets to be updated frequently during 26 Logic Cell Interconnect 0 1 1 1 Long wire Short wire k BRAM I/O Figure 2.6: Internal organization of FPGA run-time (dynamic updates) in both the fabric and edge data planes [23, 27]. 
We com- pare the data plane of the current Internet infrastructure (traditional data plane) with the SDN data plane in Table 2.4. 27 2.4 Platforms 2.4.1 FPGA State-of-the-art FPGA devices provide dense logic units, large on-chip memory, and high-bandwidth interfaces for various external memory technologies. As shown in Fig- ure 2.6, most modern FPGA devices are composed of Lookup Tables (LUTs) and on- chip block memory (BRAM) based on the Static Random Access Memory (SRAM) technology. The LUTs can be used to implement any combinational logic; each LUT is paired with one or more flip-flops. The BRAM provides high-bandwidth memory storage with configurable word width to the LUTs. In addition, an SRAM-controlled routing fabric directs signals to appropriate paths to produce the desired architecture. FPGA technology has become an attractive option for prototyping / implement- ing real-time network processing engines [98, 53, 77]. For instance, Xilinx Virtex-7 XC7VX1140T FPGA provides above 1 M logic cells, 68 Mb on-chip BRAM and 1100 user I/O pins [29]. FPGA technologies are widely used as commercial platforms (e.g. Xilinx Virtex-6 [28]) or research platforms (e.g. NetFPGA [8, 74]. FPGAs started out as prototyping device, allowing convenient and cost-effective development of glue logic connecting discrete ASIC components. As the gate density of FPGA increased, applications of FPGA shifted from glue logic to a wide variety of high-performance and data-intensive problems, where FPGA devices are deployed in the field as final but still flexible solutions. Because the functionality of an FPGA device is configured by the on-chip SRAM, it can be altered simply by changing the state of the memory bits. This can be useful in cases where the application requires software-like data-dependent processing with ASIC-level line-rate performance. 28 2.4.2 Multi-core Processor In recent years a whole spectrum of different multi-core processors have been designed, manufactured and made commercially available. At one end there are general-purpose Figure 2.7: An example of a multi-core processor 29 multi-core processors consisting of cache-coherent and powerful cores attached to either a switching fabric or a shared bus. Examples are AMD Opteron [2] and Intel Xeon processors [13]. At the other end there are highly parallel stream processors such as NVidia Tesla [78], where hundreds of simple compute units (cores) are grouped in Single-Instruction Multiple-Data (SIMD) structures controlled explicitly by software. The most popular and prominent class of multi-core processors is the (multi-core) General-Purpose Processor (GPP), whose programmability and broad applicability give them the economies of scale and allow them to take full advantage of the Moores Law. During the past decade, GPPs have scaled up not only in total number of cores but also in per-core performance. A state-of-the-art GPP usually has 8 16 cores, sev- eral mega-bytes of on-chip cache and 2 4 high-bandwidth memory channels. Each processor core is further equipped with 4 8 high-speed pipelines, shared by 1 8 hardware threads, and operating at multi-gigahertz clock rates. In addition, multiple processors can be connected at the chip level (namely, multi-socket implementation) to build a cache-coherent Non-Uniform Memory Access (cc-NUMA) system; this allows the multi-core platform to scale up to over a hundred cores in a single system image. We show an example of multi-core GPP in Figure 2.7. 
This multi-core processor [14] is made up of 4 sockets, each socket consisting of 10 independent processor cores. Each core has access to its designated 32 KB L1 cache and 256 KB L2 cache. All the cores in the same socket share a 24 MB L3 cache; they also have access to the large but much slower main memory. The communication among the sockets is realized by the Quick Path Interconnect (QPI) [12] technology.

2.5 Research Hypothesis and Methodology

Novel Internet infrastructures as well as the emerging class of applications pose new challenges for network routing devices. In these infrastructures (e.g., SDN [20]), the control programs directly express the desired network behavior. This requires the hardware routing devices to support all the possible forwarding behaviors needed in the control plane [60, 96]; it presents a new challenge for the research community, since all the engineering tasks have to be performance-aware. The goal of this thesis is to demonstrate that:

State-of-the-art parallel architectures can be exploited to achieve very high performance for multi-field packet classification and Internet traffic classification.

The parallel architectures that we target are state-of-the-art FPGA and multi-core GPP, since they are prevalent for network processing applications. The most important performance metrics we study include overall throughput and processing latency; for FPGA-based implementations, metrics such as update rate, power efficiency, and resource consumption are also investigated. While each of our classification engines is unique in its algorithmic optimization and architectural mapping, a number of common design methodologies are utilized throughout this thesis. These design methodologies, as well as the techniques with which they are applied, can be useful to other research problems, enabling a broader class of performance improvements.

2.5.1 Pipelining

We utilize a 2-dimensional pipelined architecture for both packet classification and traffic classification on FPGA. Although the overall architecture performs complex computations, the memory access in each stage is localized, resulting in guaranteed throughput performance. The 2-dimensional pipelined architecture also limits the routing length of the input and output signals between adjacent pipeline stages, which in turn improves the scalability of the entire design.

2.5.2 Partitioning

We exploit the partitioning technique to divide a problem into two or more heterogeneous parts, each optimized with a different set of techniques. (1) For packet classification, we search all the fields independently, either in a pipelined fashion or in parallel, and merge all the partial search results together. (2) For traffic classification, we examine all the traffic features independently, and combine the partial results into the final result.

2.5.3 Aggregation

This methodology is applied to operations or data structures with similar characteristics. By sharing functional blocks for similar operations or removing redundancy among data structures, a significant amount of resources can be saved, which usually results in increased performance. For instance, for traffic classification, we combine the range-trees and the C4.5 decision tree to produce a more compact representation for dynamic updates.

2.5.4 Modular Composition

Appropriate partitioning and aggregation can often reveal the modular structure inherent in a large and complex problem.
Respecting such modular structure when developing the solution can result in efficient mapping and implementation. For example, in our packet classification engines, modular structures are manifested as modular circuits on FPGA; this modular composition helps us reduce the design complexity and improve the overall performance of our packet classification engines.

Chapter 3 Multi-field Packet Classification

3.1 FPGA-based High-performance Updatable Engine

3.1.1 Related Work

For a packet header of L bits, we split all the fields into subfields of s bits (this is different from FSBV [55], where only two fields are split). Hence there are in total ⌈L/s⌉ subfields, indexed by j = 0, 1, ..., ⌈L/s⌉ - 1. Let k_j denote the input packet header bits in subfield j; therefore the length of k_j is s bits. A Bit Vector (BV) B_j^(k_j) is defined as the vector specifying the matching conditions between the input packet header bits and the corresponding bit locations of all the rules. We show an example in Figure 3.1, where the BVs are the column vectors indexed by k_j in subfield j; in this figure, for instance, B_1^(00) = 100. For a rule set of N rules, the length of each BV is N bits. We denote the data structure that stores all the BVs in a given subfield as a Bit Vector Array (BV array), as shown in Figure 3.1. In this figure, each 2-bit stride is associated with an N × 2^s = 3 × 4 BV array. An N × 2^s BV array requires a memory size of 2^s × N bits.

After we have constructed all the BVs in all the subfields, we use the input header bits k_j to address the corresponding BVs in the BV array. For subfield j, B_j^(k_j) is extracted for the input bits k_j. For example, in Figure 3.1, if the input packet header has k_0 = 10 in subfield j = 0, we extract the BV B_0^(10) = 010; this indicates that only rule R_1 matches the input in this subfield.

Figure 3.1: Splitting a 4-bit field into 2 subfields, each of s = 2 bits

As shown in Figure 2.2, in the pipelined architecture for the BV-based approaches on FPGA, each PE extracts a BV in a subfield; to produce the final match result, all the BVs are bitwise ANDed in a pipelined fashion. Excluding the priority encoder (if any), we have ⌈L/s⌉ PEs in this pipeline. We denote such an architecture as the basic pipelined architecture. In the basic pipelined architecture, a BV for N rules is N bits long. For distributed RAM (distRAM) or Block RAM (BRAM) modules of fixed size (e.g., 2^6 × 1-bit distRAM based on a 6-input Look Up Table (LUT), or 36 Kb SRAM-based BRAM [28, 29]), the number of memory modules required for each PE grows linearly with respect to N. This means the length of the longest wire connecting multiple memory modules in a PE also increases at an O(N) rate, which degrades the throughput of the pipeline for large N.

To address this problem, we propose a 2-dimensional pipelined architecture consisting of multiple modular PEs in this thesis. Our work is inspired by the observation that a long BV can be partitioned into multiple shorter subvectors to eliminate the long wires and improve the overall clock rate. Specifically, our contributions include:

Scalable architecture: A 2-dimensional pipelined architecture on FPGA, which sustains high throughput even if we scale up the lengths and the depths of the packet classification rule sets.
Novel optimization techniques: A couple of optimization techniques to improve the performance of our architecture, including the dual-port data memory and the power gating technique. Distributed update algorithms: A set of algorithms supporting dynamic updates, including modify, delete and insert operations on the rule set. The algorithms are performed distributively on self-reconfigurable PEs. The update operations have little impact on the sustained throughput. We evaluate the performance of our proposed architectures on state-of-the-art FPGAs later in this section, where a thorough comparison of our architecture with existing solu- tions for packet classification is also included. 3.1.2 Modular PE A modular PE is used to match a single packet header against one rule (N = 1) in a 1-bit subfield (s = 1). In order to minimize the number of I/O pins utilized, a modular PE is also responsible for propagating the same input packet headers to other PEs. A modular PE should be able to handle both prefix match and exact match. Let us consider the internal organization of a modular PE as shown in Figure 3.2. The main difference between the modular PE in Figure 3.2, and the PEs used in the basic pipelined architecture, is that the modular PE only produces the match result for exactly 1 rule at a time (2 1-bit data memory). The modular PE also integrates other components: 36 Data memory packet header bit -bit AND bvin Input Reg. Rule decoder bvout rst bvwr Output Reg. Figure 3.2: Modular PE 1. Registers: it is used to construct the horizontal and vertical pipelines (See Sec- tion 3.1.3). We denote the register buffering the input packet header bits as the input register; we denote the register after the AND gate as the output register. 2. Rule decoder: the logic-based rule decoder is mainly responsible for dynamic updates (See Section 3.1.5). In Figure 3.2, the packet header bit is used to address the data memory directly; the extracted BV is then ANDed with the BV output from the previous PE in the pipeline. 3.1.3 2-dimensional Pipeline To handle a larger number of rules and more input packet header bits, we use multiple modular PEs to construct a complete 2-dimensional pipelined architecture as shown in Figure 3.3. We use PE[l;j] to denote the modular PE located in thel-th row and thej-th column, wherel = 0; 1; 2; 3 andj = 0; 1; 2 in Figure 3.3. We use distRAM for the data 37 PrEnc packet header PE [0,2] PE [1,2] PE [2,2] PE [3,2] PE [0,1] PE [1,1] PE [2,1] PE [3,1] PE [0,0] PE [1,0] PE [2,0] PE [3,0] final result PrEnc PrEnc PrEnc Figure 3.3: Example: a 2-dimensional pipelined architecture (N = 4, L = 3) and priority encoders (PrEnc) memory in each PE, so that the overall architecture can be easily fit on FPGA and the memory access in each PE is localized. Horizontal: We define horizontal direction as the forward (right) or backward (left) direction along which the BVs are propagated. We use the output regis- ters of the modular PEs to construct horizontal pipelines (e.g., PE[0; 0], PE[0; 1], PE[0; 2]). The data propagated in the horizontal pipelines consist of the BVs. Vertical: We define vertical direction as the upward (up) or downward (down) direction along which the input packet headers are propagated. We use the input registers of the modular PEs to construct vertical pipelines (e.g., PE[0; 0], PE[1; 0], PE[2; 0], PE[3; 0]). The data propagated in the vertical pipelines of PEs consist of the input packet headers. 
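To make the dataflow concrete, the sketch below is a minimal software model of what one horizontal pipeline computes: each s-bit subfield of the packet header indexes a BV array, and the extracted BVs are bitwise ANDed to obtain the per-rule match result. It is only an illustration of the computation, not the FPGA implementation; the names (match_one_row, bv_arrays, header_bits) and the C++ setting are assumptions made for this example.

#include <bitset>
#include <cstdint>
#include <vector>

// Software model (illustration only) of one horizontal pipeline of PEs.
// Each PE stores a BV array with 2^s entries of n bits; the BVs extracted
// in all subfields are bitwise ANDed to obtain the per-rule match result.
constexpr int NRULES = 3;                 // rules handled by this pipeline (n)

using BV = std::bitset<NRULES>;
using BVArray = std::vector<BV>;          // 2^s entries, indexed by k_j

// header_bits[j] holds the s-bit subfield value k_j of the packet header.
BV match_one_row(const std::vector<BVArray>& bv_arrays,
                 const std::vector<uint32_t>& header_bits) {
    BV result;
    result.set();                         // start from the all-'1' vector
    for (size_t j = 0; j < bv_arrays.size(); ++j) {
        result &= bv_arrays[j][header_bits[j]];  // extract BV, bitwise AND
        if (result.none()) break;         // software analog of power gating
    }
    return result;                        // bit i set  =>  rule R_i matches
}

For instance, with the BV arrays of Figure 3.1 and an input header 0110 (k_1 = 01, k_0 = 10), such a model would return 010, i.e., only R_1 matches.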
We do not restrict the rules to be arranged in the rule set following any specific order; thus, we need a priority encoder [44] at the end of each horizontal pipeline to report the 38 highest-priority match. In our approach, the match results of all the horizontal pipelines are collected by a vertical pipeline of priority encoders. Since each modular PE can perform prefix or exact match for one rule in a 1-bit subfield, the architecture in Figure 3.3 (consisting of 4 rows and 3 columns of modular PEs) can match 4 rules, each rule having 3 1-bit subfields. Using more modular PEs, this architecture can be scaled up to handle a larger number of rules, and longer packet headers. For a rule set ofN rules, andL-bit packet headers, our 2-dimensional pipelined architecture requiresN rows andL columns of PEs to be concatenated in a pipelined fashion. 3.1.4 Optimization Techniques To explore the performance tradeoffs among various design parameters, we propose four optimization techniques in this thesis. 3.1.4.1 Striding The striding technique [44] can be applied to the modular PE, as shown in Figure 3.4. Suppose the modular PE only needs to perform packet header match against one rule, then the amount of memory required for prefix match in ans-bit subfield is 2 s 1 bits. The length of the input register now iss bits because we haves bits of the input packet headers when using striding technique. 3.1.4.2 Clustering Besides the striding technique, we also introduce a clustering technique for the modular PE. The basic idea is to build a PE which can handle multiple rules instead of a single rule. Let us consider the modular PE performing packet header match againstn rules as shown in Figure 3.4. We construct a BV array consisting of 2 s bit vectors, each of 39 Data memory Reg. packet header bits bv2in -bit AND bv1in Reg. Rule decoder -bit AND Reg. bv1out bv2out Reg. reset bvwr Reg. Reg. ≠ 0 ? ≠ 0 ? en1out en2out en2in en1in Figure 3.4: A modular PE with striding, clustering, and power gating techniques; the data memory is dual-ported lengthn; this requires a data memory of size 2 s n. The 1-bit AND gate in Figure 3.2 is adjusted to ann-bit logical AND gate in Figure 3.4. 3.1.4.3 Dual-port Data Memory We employ dual-port (read) data memory on FPGA. Two concurrent packets can be pro- cessed in each modular PE. We denote the input BVs for the two concurrently processed packets asbv1in andbv2in, respectively; we denote the output BVs for the two concur- rent packets asbv1out andbv2out, respectively. The throughput is twice the maximum clock rate achieved on FPGA. Assuming the same clock rate can be sustained, this tech- nique doubles the throughput achieved by the modular PE shown in Figure 3.2. We show a modular PE with dual-port data memory in Figure 3.4. For each of the two concurrent packets, the modular PE compares ans-bit subfield of the packet header 40 PE [1,2] PE [1,1] PE [1,0] … … 000 1 port of data memory deactivated PE [1,3] 101 110 100 100 000 100 000 Figure 3.5: Power gating againstn rules. The overall 2-dimensional pipelined architecture hasd N n e rows byd L s e columns of PEs; each row of PEs (i.e., a horizontal pipeline) is responsible for matching the packet headers against n rules, while each column (i.e., a vertical pipeline) is in charge of matching ans-bit subfield. 3.1.4.4 Power Gating For any incoming packet header, if PE[l;j] (j < L s 1) identifies that this packet header does not match any of the n rules, then there is no need to examine the remaining subfields for this packet. 
This is because the logical AND operations in PE[l;j + 1], PE[l;j + 2],:::, and PE[l; L s 1] can only produce all-“0” vectors anyway; the outputs from the data memory modules become irrelevant for these PEs. By deactivating the corresponding port of the data memory in PE[l;j + 1], PE[l;j + 2],:::, and PE[l; L s 1], a significant amount of power can be saved on the average. We show the modular PE with the power gating technique in Figure 3.4. The comparator after each AND gate generates an enable signal; this enable signal is set to high only if the corresponding output BV is not an all-“0” vector. The enable signal (high or low) is fed into the next PE in order to activate or deactivate the corresponding port of the data memory in the next PE. 41 For example in Figure 3.5, suppose bv1out and bv2out are “000” and “100” after PE[1; 1], respectively. For any modular PE after PE[1; 1] (namely PE[1; 2] or PE[1; 3]), the logical AND gates can only produce all-“0” vectors as theirbv1out signals. In this case, the data memory ports corresponding tobv1out in both PE[1; 2] and PE[1; 3] are deactivated. Note in this example, the data memory ports corresponding tobv2out are never deactivated. 3.1.5 Supporting Dynamic Updates OpenFlow packet classification requires the hardware to adapt to frequent incremental updates for the rule set during run-time. We propose a dynamic update scheme which supports fast incremental updates of the rule set without sacrificing the pipeline perfor- mance. Before we detail our update mechanism, we define the following terms: Definition 3.1. An old rule is the rule to be modified in the rule set. A new rule is the rule to appear in the rule set after the update. Definition 3.2. The data structure (e.g., BV , BV array, and valid bit) to be updated is denoted as outdated; the data structure that is already updated is denoted as up-to-date. Given a rule set consisting ofN rulesfR i ji = 0; 1;:::; N 1g, we define the problem for dynamic updates as three subproblems: Definition 3.3. Rule modification: Given a rule with RIDR, and all of its field values and priority, search RIDR in , locatei2f0; 1; :::; N 1g whereR i =R; change ruleR i in into ruleR. Definition 3.4. Rule deletion: Given a rule with RID R, search RID R in , locate i2f0; 1; :::; N 1g whereR i =R; delete ruleR i from . 42 Definition 3.5. Rule insertion: Given a rule with RIDR, and all of its field values and priority, search RIDR in ; if8i2f0; 1; :::; N 1g,R i 6=R, then insert ruleR into . For example, in Table 2.2, we can modify rule R 1 by changing its SA field from “00*” into “0*”. In this table, we can delete ruleR 0 by removing all of its fields and priority completely; we can also insert a new rule by adding a brand new rule asR 5 into this rule set. We assume all modular PEs are implemented with the striding and clustering tech- niques; hence each of the PEs stores ann 2 s BV array. Notice that the first step of all update operations is always a RID check process, which reports whether the RID of the new rule exists in the current rule set. The RID check only performs the exact match in a logN-bit field; the rule decoders in the first column of PEs are in charge of this process, since the results of the RID check need to be reported before any update operation. After the RID check is completed, to target the above three subproblems, we present our main ideas as follows: (Section 3.1.5.1) We update all the corresponding BVs in the BV arrays, or we update priority encoders (for priority). 
(Section 3.1.5.2) We keep a “valid” bit for each rule; we reset the bit to invalidate a rule. (Section 3.1.5.3) We check the valid bits of all the rules first; if a rule in the rule set is invalid, we modify this invalid rule into the new rule, and validate this new rule. 43 Section 3.1.5.4 presents the architectural support for the update mechanism. The result- ing overall architecture consists of multiple self-reconfigurable PEs, each PE configur- ing its memory contents in a distributed manner. Section 3.1.5.5 summarizes the update schedule. 3.1.5.1 Modification After the RID check, suppose RIDR exists in the rule set; hence9i2f0; 1;:::; N1g such thatR i = R. Rule modification can be performed as: Given a rule with RIDR, along with all of its field values and priority, compute the up-to-date BVs, and replace the outdated BVs in the BV arrays with the up-to-date BVs. In any subfield, we assume a rule is represented by a ternary stringf0; 1;g s . In reality, a rule is represented by two binary strings, the first string specifying the non-wildcard ternary digits while the second string specifying the wildcards 3 . Algorithm 1 Constructing BVs in subfieldj Input n ternary strings each ofs bits:T i;j , whereT i;j 2f0; 1;g s ,i = 0; 1;:::; n 1. Output 2 s BVs each ofn bits: B (k j ) j =b (k j ) j;0 b (k j ) j;1 :::b (k j ) j;n1 , whereb (k j ) j;i 2f0; 1g, k j = 0; 1;:::; 2 s 1, andi = 0; 1;:::; n 1. 1: fori = 0; 1;:::; n 1 do 2: fork j = 0; 1;:::; 2 s 1 do 3: ifk j matchesT i;j then 4: b (k j ) j;i 1 5: else 6: b (k j ) j;i 0 7: end if 8: end for 9: end for 3 e.g., a tenary string “01*” is represented by “010” and “001”. 44 00 01 10 11 0 0 0 1 1 1 1 1 1 1 0 0 modify 00 01 10 11 0 0 0 1 1 1 1 1 0 0 1 1 Figure 3.6: ModifyingR 2 Modifying Prefixes Let us first consider how to update a BV array. The first step for rule modification is to construct the up-to-date BVs for this subfield. Specifically, we use Algorithm 1 to construct all the 2 s up-to-date bit vectors, each of lengthn bits, in thiss-bit subfield. The correctness of Algorithm 1 can be easily proved [55]. Note Algorithm 1 is a distributed algorithm; if the modification of ruleR i requires multiple BV arrays to be updated, Algorithm 1 is performed in parallel by the PEs in the same horizontal pipeline whereR i can be located. In each PE, the logic-based rule decoder performs Algorithm 1 to update the memory content by itself (self-reconfiguration). As shown in Figure 3.1, the BVs are arranged in an orthogonal direction to the rules in the data memory. To modify a single rule, 2 s memory write accesses are required in the worst case 4 . As can be seen later, even in the worst case, no more than 2 s bits are modified in our approach. We show an example for rule modification in Figure 3.6. In this example, we modify the subfieldj = 0 of the ruleR 2 in Figure 3.1. In this subfield, based on Algorithm 1, R 2 is to be updated from “0*” to “1*”. A naive solution is to update the entire BV array. However, since we exploit distRAM for data memory, each bit of a BV is stored in one distRAM entry; this means every bit corresponding to a rule can be modified independently. Hence in Figure 3.6, to update the subfield j = 0 of R 2 , only 4 bits have to be modified (in 4 memory accesses). To avoid data conflicts, the memory write 4 This happens when the outdated BV is different from the up-to-date BV in every single bit location. 
Also, we assume each time there is only one rule to be updated; updating multiple rules at the same time is not supported by the rule decoder. 45 00 01 10 11 1 0 0 0 0 1 0 0 0 0 0 1 00 01 10 11 1 0 0 0 0 1 0 0 0 0 0 1 Valid bit 1 1 0 invalidate Valid bit 1 0 0 Figure 3.7: Deleting an old ruleR 1 accesses are configured to be single-ported. In any subfield, we always use 2 s clock cycles for 2 s memory write accesses (worst case) for simplicity. Modifying Priorities If the update process requires the priority of the old rule to be changed, i.e., the new rule and the old rule have different priorities, we update the prior- ity encoders based on a dynamic tree structure [109]. The time complexity to update the dynamic tree isO(logN). In general, if a prioritized rule set requires prefix match to be performed, the parallel time complexity for modifying a rule isO max[2 s ; logN] . 3.1.5.2 Deletion After the RID check, suppose RIDR exists in the rule set; hence9i2f0; 1;:::; N1g such thatR i = R. Rule deletion can be performed as: Given a RIDR, delete the rule with RIDR i from the rule set. i.e.,R i should no longer produce any matching result. To delete a rule, let us first consider all then rules handled by a particular horizontal pipeline consisting ofd L s e PEs. We propose to usen valid bits to keep track of all the n rules. A valid bit is a binary digit indicating the validity of a specific rule. A rule is valid only if its corresponding valid bit is set to “1”. 46 00 01 10 11 1 0 0 0 0 1 0 0 0 0 0 1 00 01 10 11 1 0 0 0 0 1 0 0 1 1 1 1 Valid bit 1 1 0 validate Valid bit 1 1 1 modify Figure 3.8: Inserting a new ruleR asR 2 For a rule to be deleted, we reset its corresponding valid bit to “0”. An invalid rule is not available for producing any match result. We show an example in Figure 3.7. In this example, initiallyR 0 andR 1 are valid;R 2 is invalid. R 1 is to be deleted from the rule set. During the deletion, the valid bit corresponding toR 1 is reset to “0”. Then valid bits are directly ANDed with then-bit BV propagated through the horizontal pipeline. As a result, if a rule is invalid, the corresponding position for this rule in the final AND result can only be “0”, indicating the input does no match this rule. 3.1.5.3 Insertion After the RID check, suppose RID R does not exist in the rule set; hence8i 2 0; 1;:::; N 1, R i 6= R. Rule insertion can be performed as: Given a rule with RIDR, along with all of its field values and priority, add the new rule with RIDR into the rule set. i.e., we need to check the valid bits, modify one of the invalid rules, and validate this new rule. To insert a rule, (1) we first check whether there is any invalid rule in the rule set; we denote this process as validity check. (2) Then we reuse the location of any invalid rule to insert the new rule: we modify one of the invalid rules into the new rule by following 47 the same algorithm presented in Section 3.1.5.1. (3) Finally, we validate this new rule by updating its corresponding valid bit. Figure 3.8 shows an example of rule insertion in a subfield. In this figure, initially ruleR 2 is invalid as indicated by the valid bit. During insertion, the location in the BV array corresponding toR 2 is reused by the new ruleR. We validate the new ruleR by setting its valid bit to “1”. 3.1.5.4 Architectural Support The main idea of self-reconfiguration is to give each PE the ability to reconfigure its memory contents by itself during an update. 
In this case, the BV arrays do not have to be fed into the data memory explicitly; instead, only the rules are provided to each PE. As a result, the dynamic updates are performed in a fine-grained distributed manner in parallel; no centralized controller is required, and the amount of data I/O operations is minimized. Storing Valid Bits The data memory of the modular PE in Figure 3.4 can also be used to store the valid bits. We use an extra column of PEs, each storing n valid bits for each horizontal pipeline. We place this column of PE as the first vertical pipeline on the left of the 2-dimensional pipelined architecture; this is because the valid bits have to be checked as early as possible before any insertion can be made. Valid bits are extracted during run-time and output to the next PE in the horizontal pipeline. The resulting overall architecture hasd N n e rows and (d L s e + 1) columns, where valid bits are stored and extracted in PE[l; 0],l = 0; 1;:::;d N n e 1 (the first column). Logic-based Rule Decoder To save I/O pins on FPGA, we use the pins for packet header (L pins) to input a new rule (in 2 cycles for ternary strings). In each PE[l; 0], 48 l = 0; 1;:::;d N n e 1, the RID of the rule that needs an update is provided to the rule decoders. A rule decoder is in charge of: 1. RID check (for all types of update operations, only in the PEs of the first column): The rule decoders check whether the RID of the new rule already exists in the rule set. This is implemented using log(N)-bit comparators. 2. Rule translation (for modification and insertion, in all the PEs except the first column): Based on the new rule, all the data to be written (2 s bits) to the data memory are generated by the rule decoder. The logic is simple (e.g., by enumera- tions) sinces is usually small. Rule translation requires 3 clock cycles in each PE since the new rule has to be provided in 2 cycles. 3. Validity check (for insertion, only in the PEs of the first column): The rule decoder reports the position of the first invalid bit in the data memory. The logic for this process is also simple. 4. Construction of valid bits (for insertion, only in the PEs of the first column): Similar to the rule translation, the rule decoder provides the up-to-date valid bits to the data memory. Besides the 4 major functions mentioned above, the rule decoder is also a distributed controller for each PE. For example, the rule decoder in a particular modular PE can insert pipeline bubbles by resetting the output register (using the signal reset in Fig- ure 3.4). 3.1.5.5 Schedule and Overhead When no update is performed, the modular PE is in a classification process, where the input packet headers are matched against the rules. Now let us investigate the overheads introduced by the dynamic update operations in Table 3.1. 49 Table 3.1: Update overhead (clock cycles) Update types Modify Delete Insert RID check 1 1 1 Rule translation 3 - 3 Validity check Overlapped with packet matching Updating BV array 2 s - 2 s Updating priority logN logN logN Updating valid bits - 1 1 Worst-case total overhead 4 + max[2 s ; logN] 1 + logN 4 + max[2 s ; logN] 1. For all the update operations, the RID of the new rule has to be checked against all the rules, which results in 1-cycle overhead. 2. The rule decoder generates the control signals based on the new rule provided on the s packet header input pins / wires; hence the rule translation cannot be overlapped with the classification process. This leads to a 3-cycle overhead. 3. 
We overlap the validity check process with the on-going classification process; the validity check results are reported every clock cycle to the rule decoder for all the PE[l; 0],l = 0; 1;:::;d N n e 1. 4. To update the BV array, the rule decoder initiates 2 s memory write accesses. Dur- ing the memory write accesses, the memory cannot be read as required by the classification process. 5. The priority encoders require logN cycles to modify the priority of a rule. How- ever, the process of updating the priority can be overlapped with the process of updating the BV array. 6. The process of updating the valid bits (one memory write per PE in the first col- umn) cannot be overlapped with the on-going classification process; however, it can be overlapped with the process of updating the BV arrays (2 s cycles). 50 ID M M M M M M M M M M M 1st cycle 4th cycle RT RT RT RT M ID M M RT RT + + 2 -th cycle M M M M M M M B RT M M M M M M RT V RT RT M M RT B V M ID + + 3 -th cycle Classification process RID check Rule translation Updating BV array Updating valid bits Bubble Figure 3.9: Example: inserting a new rule;d L s e + 1 = 3,d N n e = 4,s =n = 1 In the worst case, a single rule modification requires all the BV arrays stored in a hor- izontal pipeline to be updated. We show an example of the update schedule (4 3 PEs, excluding the priority encoders) in Figure 3.9. In this example, we assume the RID of the new rule exists in the last horizontal pipeline; this can be identified by the RID check. Therefore only the BV arrays in the PEs of the last row are to be overwritten. As can be seen, although (d L s e +d N n e) clock cycles are required to propagate the new rule across the entire PE array, this amount of time does not contribute to the total update overhead. This is because the update is performed in a distributed and pipelined manner. Assuming 2 s logN, the matching process in a particular PE has to be stalled for a total number of (2 s + 4) cycles to modify or insert a rule. In Figure 3.9, for instance, the classification process is stalled for (2 s + 4) = 6 clock cycles in each PE. Except the first column of PEs, all the other PEs neither perform RID check nor update valid bits. Thus, Table 3.1 lists the worst-case total overhead for any PE. As can be seen, the rule insertion introduces the most overhead among all three types of update operations. 51 3.1.6 Experimental Setup We conducted experiments using Xilinx ISE Design Suite 14.5, targeting the Virtex 6 XC6VLX760 FFG1760-2 FPGA [28]. This device has 118; 560 logic slices, 1200 I/O pins, 26 Mb BRAM (720 RAMB36 blocks), and can be configured to realize large amounts of distRAM (up to 8 Mb). A Configurable Logic Block (CLB) on this FPGA consists of 2 slices, each slice having 4 LUTs and 8 flip-flops. Clock rate and resource consumption are reported using the post-place-and-route results. In our approach as discussed in Section 3.1, the construction of BVs does not explore any rule set features 5 , the performance of our architecture is rule-set-independent. We use randomly generated BVs; we also generate random packet headers for both the classic (d = 5,L = 104) and OpenFlow (d = 15,L = 356) packet classification in order to prototype our design, although our architecture neither restricts the number of packet header fields (d) nor requires a specific length of the packet header (L). Considering the limited on-chip resources on FPGA, the number of rules in a rule set is chosen to be from 128 to 1 K [93]. 
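Since the BV construction does not depend on rule set features, the prototype is driven by randomly generated BV arrays and packet headers; the following is a hedged sketch of such a test-input generator. The struct and function names are illustrative only, and the layout (one n-bit vector packed into a 64-bit word) is an assumption of this example, not of the FPGA design.

#include <cstdint>
#include <random>
#include <vector>

// Illustrative generator for synthetic test inputs: a random BV array for
// every (row, subfield) PE and random packet headers. N, L, s, n follow the
// notation of the text; the generator is a host-side aid, not part of the design.
struct TestInputs {
    // bv_arrays[row][subfield][k] is one n-bit vector packed into a 64-bit word.
    std::vector<std::vector<std::vector<uint64_t>>> bv_arrays;
    std::vector<std::vector<uint32_t>> headers;    // headers[p][subfield] = k_j
};

TestInputs generate_inputs(int N, int L, int s, int n, int num_packets,
                           uint64_t seed = 1) {
    std::mt19937_64 rng(seed);
    const int rows = (N + n - 1) / n;              // ceil(N / n)
    const int cols = (L + s - 1) / s;              // ceil(L / s)
    const uint64_t bv_mask = (n < 64) ? ((1ULL << n) - 1) : ~0ULL;
    TestInputs t;
    t.bv_arrays.assign(rows,
        std::vector<std::vector<uint64_t>>(cols, std::vector<uint64_t>(1u << s)));
    for (auto& row : t.bv_arrays)
        for (auto& pe : row)
            for (auto& bv : pe) bv = rng() & bv_mask;
    t.headers.assign(num_packets, std::vector<uint32_t>(cols));
    for (auto& hdr : t.headers)
        for (auto& k : hdr) k = static_cast<uint32_t>(rng()) & ((1u << s) - 1);
    return t;
}

For OpenFlow packet classification (L = 356) with a stride of s = 4, for example, this corresponds to ⌈356/4⌉ = 89 columns of PEs per row.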
(5: e.g., the number of unique values in each field or subfield, the average length of prefixes, etc.)

3.1.7 Design Parameters and Performance Metrics

We vary several design parameters to optimize our architecture:

Size of the rule set (N): The total number of rules
Length of the BV (n): The number of bits in the BV produced by a single PE
Packet header length (L): The total number of bits of the input packet header
Stride (s): The number of bits in a subfield
Update rate (λ): The total number of update operations (modification, deletion, or insertion) applied to the rule set per unit time

We study the performance trade-offs with respect to the following metrics:

Peak throughput (T_peak): The maximum number of packets that can be processed per second without any update operation
Sustained throughput (T_sustain): The number of packets processed per second considering all update operations
Latency: The processing latency of a single packet when no dynamic update is performed
Resource consumption: The total amount of hardware resources (logic slices, I/O, etc.) consumed by the architecture on FPGA

Both the peak throughput and the sustained throughput are measured in Million Packets Per Second (MPPS) in this thesis.

3.1.8 Empirical Optimization of Parameters

To find the optimal values of n and s by experiments, we first fix the values of N = 128 and L = 356 for OpenFlow packet classification. The values of n and s achieving the best performance are used later for other values of N and L.

Table 3.2: Clock rates (MHz) of various designs
n \ s      1        2        3        4        5        6        7
4        225.48   204.42   339.79   346.14   364.56   379.65   339.79
8        210.08   254.97   352.86   389.86   364.30   380.47   257.47
16       257.40   279.96   373.00   370.10   373.00   363.77   289.10
32       259.40   239.69   342.35   344.83   355.11   315.26   262.67
64       201.01   244.26   315.76   317.56   336.36   299.67   260.28

We show the maximum clock rate achieved by various designs in Table 3.2. We choose s from 1 to 7 and n from 4 to 64, since for s > 7 or n > 64 the clock rate drops below 200 MHz. As can be seen, we achieve a very high clock rate (200-400 MHz) with small variations among the various designs. All the memory modules are configured to be dual-ported; hence we achieve 400-800 MPPS throughput for OpenFlow packet classification. We can observe that:

1. For s ≤ 2, the BV arrays are stored in 2^s-input "shallow" memories. This memory organization underutilizes the 6-input LUT-based distRAM modules on FPGA. Also, since we have a large number of PEs for s ≤ 2, the entire architecture consumes large amounts of registers; the complex routing between these registers also limits the maximum achievable clock rate.

2. For 3 ≤ s ≤ 6, the best performance is achieved for n = 8 or n = 16. There is fast interconnect within a slice, slightly slower interconnect between slices in a CLB, followed by the interconnect between CLBs. A PE with n = 8 uses exactly the 8 flip-flops of a slice to register a BV, while a PE with n = 16 uses exactly all 16 flip-flops in a CLB to register a BV. These two configurations introduce the least routing overhead.

3. For s > 6, the BV arrays are stored in 2^s-input "deep" memories. This organization requires multiple LUTs of different CLBs to be used for a single PE; the long wiring delay between CLBs results in clock rate deterioration.

4. The performance for s = 4 and n = 8 is the best.
This is because all the LUTs inside a single slice can be used as a 128-bit dual-ported distRAM; the configuration of n = 8 and s = 4 not only uses up all the 8 flip-flops in a slice, but also provides a memory organization that stores bit vectors of total size 2^s × n = 128 bits. In summary, for N = 128 and L = 356, the best performance is achieved when s = 4 and n = 8. Hence we use s = 4 and n = 8 to implement our architecture for other values of N and L. (The choice of s and n is not unique; e.g., latency can be used as a metric to choose s and n. However, in our experiments, we choose s and n such that they give the highest clock rate, since our goal is to achieve high throughput. Other choices are also possible, but they all achieve similar performance.)

Figure 3.10: Scalability with respect to N and n

3.1.9 Scalability of Throughput

Using s = 4 and n = 8, we vary N and L, respectively, to show the scalability of the throughput performance. Figure 3.10 shows the throughput of our architecture with respect to various values of N (L = 356). As can be seen, our architecture achieves a very high clock rate (324 MHz) and throughput (648 MPPS) even for N = 1024. We also show in the same figure the necessity of using modular PEs along with the clustering technique. Compared to the basic pipelined architecture (n = N), our architecture achieves better throughput (up to 2×) when the rule set is large; in our architecture, the clock rate tapers off much more slowly as N increases.

Figure 3.11: Scalability with respect to N and L

Figure 3.11 shows the throughput for both the classic packet classification (L = 104) and OpenFlow packet classification (L = 356). Our architecture achieves very high throughput for the classic 5-field packet classification. The OpenFlow packet classification consumes more resources and requires more complex routing; hence the performance degrades compared to the classic packet classification.

3.1.10 Updates and Sustained Throughput

As discussed in Section 3.1, rule insertion stalls the normal packet classification process for the largest number of clock cycles; for the worst-case analysis, we pessimistically assume that all the update operations are rule insertions. Based on Table 3.1, the sustained throughput can be calculated using the following equation:

T_sustain = T_peak · (f - 2λ(4 + max[2^s, log N])) / f    (3.1)

where f denotes the maximum clock rate achieved for a specific design and λ denotes the update rate. The factor of 2 comes from the fact that memory write accesses are single-ported.

Figure 3.12: Sustained throughput

We vary the value of λ and show the sustained throughput of our architecture in Figure 3.12, considering the worst-case scenario for all update operations. In this implementation, s = 4 and n = 8 are used for N = 1024. As can be seen, our architecture sustains a high throughput of 650 MPPS with 1 M updates/s, although 1 M updates/s is pessimistic considering real-world traffic.

3.1.11 Scalability of Latency

We show the latency performance with respect to various values of N and L in Figure 3.13. In the same figure, we also break down the latency introduced by the 2-dimensional pipelined architecture and the tree-based priority encoders.
As can be seen, more than 86% of the latency is introduced by the 2-dimensional pipeline: d L s e +d N n e cycles. The latency introduced by the priority encoders can be neglected; hence d L s e +d N n e . This means, for a specific configuration on s and n, and fixed values ofL (orN), is sublinear with respect toN (orL). 7 The typical update rates are 10K updates/s [104]. 8 Similar performance can be seen for other values ofN,L. 57 128 256 512 1024 0 200 400 600 800 1000 1 2 3 4 5 6 7 8 9 101112131415161718192021 Latency (ns) Number of rules pri_enc 2d_pipe Figure 3.13: Latency introduced by the 2-dimensional pipelined architecture (2d pipe) and the priority encoders (pri enc); for eachN, the 4 columns correspond toL = 89, 178, 267, and 356 from left to right, respectively Table 3.3: Resource consumption (s = 4,n = 8 andL = 356) No. of rulesN 128 256 512 1024 No. of logic slices 14773 29056 57209 112812 (% of total) (12%) (25%) (48%) (95%) No. of I/O pins 722 723 724 725 (% of total) (60%) (60%) (60%) (60%) No. of registers 48704 97502 195164 329690 (% of total) (5%) (10%) (20%) (34%) 3.1.12 Resource and Energy Efficiency We report the resource consumption for the OpenFlow packet classification in Table 3.3. The resources consumed by the architecture increases sublinearly with respect toN. We measure the energy efficiency with respect to the energy consumed for the clas- sification of each packet (J/packet); a small value of this metric is desirable. In Fig- ure 3.14, we show the energy efficiency without the power gating technique (w/o opt.) and with the power gating technique (w/ opt.) as discussed in Section 3.1.4, respectively. Three scenarios are tested with a 1 K OpenFlow rule set, including: 58 0 5 10 15 20 All-match Random No-match Energy/packet (nJ) w/o opt. L=89 w/ opt. L=89 w/o opt. L=178 w/ opt. L=178 w/o opt. L=267 w/ opt. L=267 w/o opt. L=356 w/ opt. L=356 Figure 3.14: Energy efficiency (s = 4,n = 8,N = 1024) 1. All-match: Every input packet header produces an “all-one” BV in any PE. This means every input packet header matches all the rules, which is too pessimistic. 2. Random: Packet headers are generated randomly. 3. No-match: Every input packet header is identified as not matching any of the 1 K rules in the first column of PEs. In this case, since the memory modules in all the other columns of PEs are deactivated, the energy efficiency of our design with power gating technique is optimistic. For each scenario, we vary the number of the horizontal pipeline stages to investigate the energy efficiency. The power gating technique is more effective for larger 2-dimensional pipelined architectures 9 ; this is because more data memory ports can be turned off if an early stage (close to the first column of PEs) reports no match. As can be seen in Figure 3.14, with the power gating technique, our design can save up to 67% energy; the actual energy saved depends on the pattern of the input packet headers. 9 The energy saving is also remarkable as we scale upN. 59 Table 3.4: Approach overview Field Type Prefix / range match Exact match Preprocessing Building range-tree Constructing hash table Searching Range-tree search Cuckoo hashing Merging Linear merging 3.2 Large-scale Classification on Multi-core Platforms 3.2.1 Related Work Definition 3.6. The projection of a rule on a prefix match field is called a prefix match rule. Similarly, the range match rule, and the exact match rule can be defined. 
Instead of a highly-customized architecture on FPGA, we rely on parallel soft- ware algorithms on state-of-the-art multi-core processors to achieve high performance for multi-field packet classification. In this thesis, we target the decomposition-based approach [87, 90], since it has better scalability than the decision-tree-based approaches [47]. The basic idea of the decomposition-based approaches is to split the rule set into multiple fields, while parallel searching operations can be performed on all the fields independently and efficiently. Given a rule set having a large number of fields, we present the procedure of our decomposition-based approach as three phases: Preprocessing: For each field, a data structure (e.g., range-tree [106, 113] or hash table [81]) is constructed for efficient search in that field. Searching: Each field of the incoming packet header is searched independently. Range-tree search or Cuckoo hashing is employed for each field; the partial matching result from each field is recorded in a RID set. 60 Merging: The partial results from all the fields are merged efficiently in a linear manner to compute the final result. Table 3.4 summarizes these three phases. Note: 1. A prefix match field or a range match field is more complex. Range-tree-based algorithms provide an efficient searching method. 2. In an exact match field, a hash-based algorithm requires a small (constant) num- ber of memory accesses regardless of the rule set size. This in turn reduces the searching latency and increases the throughput. 3. We choose to merge all the partial results linearly; this is based on the observation that the number of matches in each field is usually not large. We discuss the preprocessing of a given rule set in Section 3.2.2, including the construc- tion of the range-trees and the hash tables. In Section 3.2.3 and Section 3.2.4, we discuss the algorithms exploited in the searching and merging phases, respectively. Compare to the hardware-based dynamically updatable classification engine (Section 3.1.5), it is rel- atively easy to perform dynamic updates using the software-based solutions; hence the discussion of the software-based dynamic updates is beyond the scope of this thesis. 3.2.2 Preprocessing Definition 3.7. Multiple rules can be projected onto the same value in a specific field; such a value is defined as a unique value in this field. We denote the number of unique values in fieldm asq (m) (m = 0; 1;:::;M 1). Note a unique value in a particular field can also be a “*”.q (m) is usually much less than N [90]. For example, the rule set shown in Table 2.2 only has 3 unique values in the 61 , , Overlapping ranges Non-overlapping subranges Range-tree ( ) no match no match ( ) , , , , , no match , , , , , no match Figure 3.15: Constructing a range-tree from the unique values MPLS label field: “*”, “0”, and “16000”. Also, a unique value in a prefix match field is a prefix, while a unique value in a range match field is a range. 3.2.2.1 Constructing Range-trees Range-tree is a data structure widely used for prefix search in IP lookup [106, 113]. In this approach, the prefixes are first presented as overlapping ranges. Then, all the overlapping ranges are flattened to produce non-intersecting subranges as shown in Fig- ure 3.15. After this range-to-subrange translation, the range-tree can be constructed from the subrange boundaries. In this range-tree, each leaf node corresponds to a sub- range, where one or more RIDs can be linked with this subrange. 
In a prefix match field, the prefix rules need to be projected onto the unique val- ues first; then the unique values (prefixes) are converted to the overlapping ranges. We denote this preprocessing step as prefix-to-range conversion. After this conversion, Fig- ure 3.15 shows an example of constructing a range-tree from overlapping ranges. The notations are defined in Table 3.5. 62 Table 3.5: Notations for range-trees P i (i = 0; 1;:::;N 1) N Prefix/range match rules, each with a RIDR i V j (j = 0; 1;:::;q (m) 1) Range representations ofq (m) unique values x k (k = 0; 1;:::;K 1) K subrange boundaries fP i g SetfP 0 ;P 1 ;:::;P i ;:::g V j :low,V j :up, S m ’s lowerbound and upperbound,i:e:S m = [a, b) T (m) Range-tree in fieldm create tree (fx k g) Create complete tree from setfx k g add (fx k g,x) Insert a new elementx into setfx k g insert sort (fx k g,x) Insertx intofx k g and sortfx k g Algorithm 2 Constructing unique values SUBROUTINE:fV j g( map unique(fP i g) #Translate prefix/range match rulesfP i g to unique valuesfV j g 1: for allP i 2fP i g do 2: ifP i is a prefix match rule then 3: translateP i into a range ruleP 0 i 4: else 5: P 0 i P i #for a range match ruleP i 6: end if 7: ifi = 0 then 8: fV j g add(V j ,P 0 i ) #add the first element 9: else 10: ifP 0 i 62fV j g then 11: fV j g add(V j ,P 0 i ) #every element appears only once 12: end if 13: end if 14: end for In a range match field, the unique values in a range match field are already ranges; therefore the construction of the range tree does not require the prefix-to-range conver- sion. We show the pseudocode for the range-tree construction in Algorithm 2, Algo- rithm 3, and Algorithm 4. Notice that: 63 Algorithm 3 Getting subranges SUBROUTINE:fx k g( subrange(fV i g) #Translate unique valuesfV j g to subrange boundariesfx k g 1: for allV j 2fV j g do 2: ifj = 0 then 3: fx k g add(x k ,V j :low) #upon startup 4: fx k g add(x k ,V j :up) #fx k g records all the subrange boundaries 5: end if 6: ifi = 0 then 7: fV j g add(V j ,P 0 i ) #add the first element 8: else 9: ifP 0 i 62fV j g then 10: fV j g add(V j ,P 0 i ) #every element appears only once 11: end if 12: end if 13: end for Algorithm 4 Constructing a range-tree SUBROUTINE:T (m) ( range tree(fx k g,fP i g) #Construct a complete range tree in fieldm 1: T (m) 0 create tree(fx k g) 2: T (m) further create two children for each leaf node (with key) ofT (m) 0 : (1) left child for input<, (2) right child for input. 3: for allP i 2fP i g do 4: perform range query forP i inT (m) 5: storeR i in all the reached leaves ofT (m) 6: end for Each leaf node uses a set (see Section 3.2.2.2) to store the RIDs of the rules match- ing that subrange. This is also the main difference between our range-tree struc- ture and the prior work [113]. Translating overlapping ranges into non-overlapping subranges can lead to a larger number of subranges in a prefix or range match field. For example in Figure 3.15, we have 3 overlapping ranges initially: [x 0 , x 2 ), [x 1 , x 3 ), and [x 4 , 64 Table 3.6: Notations for hash tables E i (i = 0; 1;:::;N 1) N exact match rules, each with a RIDR i U j (j = 0; 1;:::;q (m) 1) q (m) unique values f 0 ,f 1 ;:::;f P1 A number ofP hash functions H (m) Hash table in the fieldm Y z Keys stored inH (m) ,z = 0; 1;:::;Z 1 H (m) (Y z ) Hash value stored inH (m) corresponding to the keyY z x 5 ); after the range-to-subrange conversion, we have in total 5 non-overlapping subranges: [x 0 ,x 1 ), [x 1 ,x 2 ), [x 2 ,x 3 ), [x 3 ,x 4 ), and [x 4 ,x 5 ). 
3.2.2.2 Constructing Hash Tables We reinterpret the searching phase for an exact match field as: given a set of exact values and an input integer, locate the exact value matching the input integer. We solve this problem by using perfect hashing; without loss of generality, we use Cuckoo hashing [81, 101] to reduce the number of memory accesses. Definition 3.8. A rule ID (RID) setR i is a set containing only the RIDsR i , for some i2f0; 1;:::;N 1g. Using the notations in Table 3.6, we show the pseudocode for the construction of a hash table in Algorithm 5. Based on Cuckoo hashing, we use P hash functions to construct the hash table. Each entry in the hash tableH (m) consists of a hash key, a hash value, and a RID set. In a specific entry ofH (m) ,Y z (z = 0; 1;:::;Z 1) denotes the hash key, andH (m) (Y z ) denotes the corresponding hash value; the RID setfR i g is the partial result stored in this table. Note the unique values are stored as hash keys, and the hash values are used directly as the memory index to access the partial results inH (m) . As shown in Algorithm 5, for a unique valueU j 6=, at mostP attempts are made to access the hash table: 65 Algorithm 5 Constructing a hash-table SUBROUTINE: fU j g ( hash construct(fE i g) #Construct Cuckoo hash table 1: for allE i 2fE i g do 2: ifE i 62fU j g then 3: fU j g add(U j ,E i ) #every element appears once 4: end if 5: end for 6: for allU j 2fU j g do 7: ifU j 6= then 8: forp = 0; 1;:::; P 1 do 9: iff p (U j ) =H (m) (Y z ) andY z = then 10: Y z U j #found an empty space to store the pair 11: H (m) (Y z ) f m (U j ) 12: for allE i =U j do 13: addR i in the memory location indexed byH (m) (Y z ) #store all the RIDs corresponding to this unique value 14: end for 15: break 16: else 17: ifp =P 1 then 18: enlarge the size ofH (m) #no available space for thisU j ; a collision have occurred 19: choose a new set ofP hash functions 20: go back to Step 6 21: end if 22: end if 23: end for 24: else 25: for allE i =U j do 26: storeR i in all the memory locations #deal with a wildcard 27: end for 28: end if 29: end for 1. If9p2f0; 1;:::;P 1g and9Y z = such thatf p (U j ) = H (m) (Y z ), then the entry indexed byH (m) (Y z ) is not occupied by any other unique value; this entry can be used to store the unique valueU j . 66 ( ) RID Set { } 0x8100 0 , , , , - 1 , , 0x0800 2 , , , , , - 3 , , 0x0100 4 , , , , - 5 , , - 6 , , = = ( ) Figure 3.16: Constructing a hash table, whereq (m) = 4 andZ = 7 2. If all the memory locations indexed by f p (U j ) are occupied (i.e., 8p = 0; 1;:::;P 1, and8f p (U j ) = H (m) (Y z ), we have Y z 6=), this means there is no available space for this unique valueU j . To resolve this problem, the hash table size has to be enlarged, theP hash functions have to be rechosen, and the hash tableH m needs to be reconstructed. Note for all the rules corresponding to the unique value “*”, their RIDs have to be stored in all the memory locations. This is because the wildcard “*” matches all the input packet headers in this field. In our implementation, we choose P = 2 since a small value of P leads to only a small number of memory accesses. For the exact match field m consisting of q (m) unique values, the hash table in our implementation has around 2q (m) entries. We show an example of constructing H m in Figure 3.16; this hash table is con- structed for the Eth type field of the rule set shown in Table 2.2. 
Suppose initially we have the entriesH m (Y 0 ) andH k (Y 4 ) occupied by the unique values “0x8100” and “0x0100”, respectively. We have added R 1 , R 3 and R 4 to all the RID sets since their corresponding unique value is “*”. We show how to add a new unique value “0x0800” usingP = 2 hash functions intoH (m) as follows: 67 1. We apply f 0 to this unique value. Suppose f 0 (0x0800) = 0, we note the Y z satisfyingH k (Y z ) = 0 is not null (Y 0 = 0x8100). This means the entry indexed by f 0 (0x0800) is already occupied by another unique value. Hence a second attempt has to be made. 2. We usef 1 (0x0800) = 2 as the index to access the hash tableH m in the second attemp. Since the correspondingY z =, we can store the unique value into this entry. The RIDs associated with this unique value “0x0800” are then added into the corresponding RID set. 3.2.3 Searching After the preprocess phase, we search all the fields of the input packet header indepen- dently. Parallel programs can be explored in the searching phase. In a prefix or range match field, we perform a binary search in the range-tree to check the corresponding field of the input packet header; once a leaf node is reached, the partial result for that field is extracted. In an exact match field, we show an example in Figure 3.17 using the same hash functions and hash table as in Figure 3.16. In this example: 1. A hash functionf 0 is first applied to the input “0x8100”. The hash value returned byf 0 (0x8100) = 4 indicates a possible location in the hash table where the partial result is stored for the unique value “0x8100”. However, the corresponding hash key “0x0100” does not match the input; this suggests more than one attempts were made when the unique value “0x8100” was stored as a hash key in the hash table. 2. A second hash table access provides a hash key matching the input:f 1 (0x8100) = 0; using indexH (m) (Y z ) =f 1 (0x8100) = 0, the corresponding keyY 0 = 0x8100. Therefore the partial result indexed by the hash valueH (m) (Y 0 ) = 0 is extracted. 68 Input: 0x8100 = = ( ) ( ) RID Set { } 0x8100 0 , , , , - 1 , , 0x0800 2 , , , , , - 3 , , 0x0100 4 , , , , - 5 , , - 6 , , Figure 3.17: Searching in an exact match field Table 3.7: Notations in the merging phase Arr[i] (i = 0; 1;:::;;N 1) thei-th elements of the arrayArr M Number of sets needed to be merged Set m (m = 0; 1;:::;M 1) M RID sets R (m) g (g = 0; 1;:::;g (m) 1) All the RIDs in the setSet m (g (m) N) List The linked list R h (h = 0; 1;:::;H 1) RIDs in theList Result The matching RIDs We repeat this process until a returned hash key matches the input. It is also possible thatP hash functions all return non-matching hash keys; in that case, depending on the rule set, the input can be treated as (1) it does not match any rule, or (2) it matches the wildcard rule “*” in this field. 3.2.4 Merging Instead of using BVs [63, 114, 93] to record the partial results, we use RID sets to store the partial results. 
After the searching phase, each field produces a RID set containing all the matching RIDs in that field; we exploit a linear merging technique to 69 Algorithm 6 Keeping all the matching RIDs SUBROUTINE: (Arr;List)( init(fSet m g) #Increment element values and populate the linked list 1: InitializeArr to all zeros 2: InitializeList to 3: form = 0;m<M;m =m + 1 do 4: for allR (k) g 2Set k do 5: Arr[R (k) g ] =Arr[R (k) g ] + 1 6: ifArr[R (k) p ] = 1 then 7: insertR (k) g intoList 8: end if 9: end for 10: end for Algorithm 7 Collecting the final result SUBROUTINE: Result( (Arr;List) #Find the final merge result, clear the array and linked list 1: for allR h 2List do 2: ifArr[R h ] =M then 3: addR h intoResult 4: end if 5: Arr[R h ] 0 #clear the array 6: R h #clear the linked-list 7: end for combine the RID sets in the merging phase. One advantage of using the RID set repre- sentation and linear merging is that the merging latency is proportional to the number of the matching rules in all the fields, rather than the total number of rules. We present the procedure of the linear merging technique as follows: Step 1 We maintain an N-element array where each element corresponds to a rule (RID). All the array elements are initialized to “0”. Step 2 For each RID in a RID set, we increment its corresponding array element by “1”. We also maintain a linked-list; we insert the RID into this linked-list only if 70 its corresponding array element is incremented from “0” to “1”. We repeat this for all the RID sets, using the sameN-element array and the linked-list. Step 3 After we have checked all the RID sets, the linked-list records all the RIDs that appear at least once in the partial results. Then we only examine the array elements whose corresponding RIDs are recorded in this linked-list. If an array element is equal to the number of fields to be merged (M), the corresponding rule matches the input packet header in all the fields. Step 4 For those rules matching all the fields of the input packet header, their corre- sponding RIDs are selected as the final result. Finally we set the elements in the array back to zero and clear the linked-list. Note we clear the array and the linked-list at the same time when we access the array and the linked-list in Step 3 of the above procedure. Using the notations in Table 3.7, we show the pseudocode for the linear merging technique in Algorithm 6 and Algorithm 7. 3.2.5 Experimental Setup For our packet classification engine on multi-core platforms, we conducted experiments on a 2 AMD Opteron 6278 processor [2] and a 2 Intel Xeon E5-2470 processor [13]. The AMD processor has 16 physical cores, each running at 2:4 GHz. Each core is integrated with a 16 KB L1 data cache, 16 KB L1 instruction cache, and a 2 MB L2 cache. A 6 MB L3 cache (Last-Level Cache, LLC) is shared among all the 16 cores; all the cores have access to 64 GB DDR3-1600 main memory. The AMD processor runs openSUSE 12.2 OS (64-bit 2.6.35 Linux Kernel, gcc version 4.7.1). The Intel processor also has 16 physical cores, each running at 2:3 GHz. Each core has a 32 KB L1 data cache, 32 KB L1 instruction cache, and a 256 KB L2 cache. All the 16 cores share a 20 MB L3 cache (Last-Level Cache, LLC), and they have access to 48 GB DDR3-1600 71 main memory. This processor runs openSUSE 12.3 OS (64-bit 3.7.10 Linux Kernel, gcc version 4.7.2). Both of the AMD and the Intel processors have 32 logical cores. We implemented our classification engine using OpenMP [79] on both the AMD and Intel processors. 
On each processor, the OS allocates hardware resources to each thread dynamically. We used perf, a performance analysis tool in Linux, to monitor the hardware and software events such as the number of cache misses and the number of context switches. We tested various rule set sizes by varying N from 1 K to 32 K, since N = 32 K is the largest rule set to the best of our knowledge [92]. Besides the rule set size, the performance of our classification engine on multi-core platforms also depends on the statistical features of the rule sets. To prototype our designs on these two platforms, we assumed8m;q (m) 6 0:4N 10 . Based on this assumption, we constructed synthetic rule sets and used random packet headers due to the lack of large-scale real-life rule sets 11 . Note that increasing the number of unique values has a similar effect as increasingN while keeping q (m) N constant. For the searching phase, we constructed (1) binary complete range-trees for the pre- fix and range match fields, and (2) cuckoo hash tables for the exact match fields. The performance was evaluated based on the average of 30 runs, each run generating a packet trace of 1 million 15-field random packet headers. All the packets were grouped into batches to amortize the I/O overheads on the multi-core platforms. 10 Recall thatq (m) stands for the number of unique values in fieldm,m = 0; 1;:::; M 1. 11 Our assumptions here are pessimistic considering real-world scenarios, because: (1) 0:4N is too large for any field; (2) caches are underutilized for random packet traces. 72 3.2.6 Design Parameters and Performance Metrics For the design parameters, we denote the number of program threads as T , while we denote the number of packets per batch asP in this section. The following metrics are used to measure the performance of our packet classifica- tion engine on multi-core platforms: Throughput: total number of packets classified per unit time, measured in MPPS. Latency: average processing time used for classifying a single packet, measured for an entire batch when packets are processed in batches. Since the preprocessing is done offline, the processing latency for any packet is equal to the sum of the time spent in the searching and merging phases. If packets are processed in batches, the processing latency for any single packet is the same as that for the entire batch. Theoretically, processing many packets in parallel can only improve the overall throughput, but not the processing latency of each packet. The above definitions are different from prior works [97, 69, 51, 86, 90], where processing latency is either not well defined, or averaged over all the packets in a batch. 3.2.7 Empirical Optimization of Parameters To minimize the data transfer overhead on the multi-core platforms, packets are pro- cessed in batches of P packets each. We show the performance improvement of our packet classification engine in Table 3.8. As can be seen in this table, for a fixed number of T threads, small values of P (fine-grained multi-threading) lead to lots of synchronization overheads among threads and poor throughput; on the other hand, for P > 10000, the processing latency increases dramatically. In this paper, we set P T = 10000; i.e., each thread processes 10 K packets sequentially. 
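A minimal sketch of the batching scheme just described, using OpenMP: one batch of P = 10000 * T packets is split statically so that each of the T threads classifies its own contiguous share of P/T = 10 K packets. Here classify_one is a placeholder for the per-packet searching and merging phases, and the 15-field packet layout is an assumption of this sketch rather than the thesis code.

```cpp
#include <omp.h>
#include <cstdio>
#include <vector>

struct Packet { unsigned fields[15]; };   // 15-field header (placeholder layout)
struct Result { int rid; };

// Placeholder for the per-packet searching + merging phases.
Result classify_one(const Packet& p) { return Result{static_cast<int>(p.fields[0] % 4)}; }

// Classify one batch with T threads; with schedule(static) and P = 10000 * T,
// each thread walks its contiguous share of 10 K packets sequentially.
void classify_batch(const std::vector<Packet>& batch, std::vector<Result>& out, int T) {
    out.resize(batch.size());
    #pragma omp parallel for num_threads(T) schedule(static)
    for (long i = 0; i < static_cast<long>(batch.size()); ++i)
        out[i] = classify_one(batch[i]);
}

int main() {
    const int T = 32;
    std::vector<Packet> batch(10000 * T);   // one batch: P = 10000 * T packets
    std::vector<Result> results;
    classify_batch(batch, results, T);
    std::printf("classified %zu packets\n", results.size());
}
```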
73 Table 3.8: Throughput (MPPS) / latency (ms) with respect toT andP (AMD,N = 1 K) T P=T 10 100 1000 10000 4 0.029 / 1.36 0.29 / 1.37 2.71 / 1.47 10.56 / 3.78 8 0.021 / 3.74 0.33 / 2.41 2.84 / 2.81 12.88 / 6.21 16 0.030 / 5.22 0.36 / 4.37 3.33 / 4.79 16.13 / 9.91 32 0.030 / 10.38 0.37 / 8.63 3.16 / 10.10 18.71 / 17.09 64 0.029 / 21.69 0.38 / 16.69 3.31 / 19.32 20.10 / 31.83 128 0.029 / 43.67 0.36 / 35.45 3.21 / 39.79 21.28 / 60.21 0 20 40 60 80 0 10 20 30 40 4 8 16 32 64 128 Latency (ms) Throughput (MPPS) No. of threads Throughput, N=1K Throughput, N=2K Throughput, N=4K Throughput, N=8K Latency, N=1K Latency, N=2K Latency, N=4K Latency, N=8K Figure 3.18: Varying the number of threadsT (AMD) We conduct a series of experiments on the AMD platform to determine the value ofT , as shown in Figure 3.18. As the value ofT grows, the throughput increases with a diminishing return while the latency increases almost linearly; note the latency is measured for the entire batch ofP = 10000T packets. ForT > 32, the throughput barely increases. We also see a similar trend on the Intel platform. To better understand the performance, we investigate the number of L3 cache (LLC) misses and the number of context switches per batch ofP packets. As can be seen in Figure 3.19: 74 0.0E+00 2.5E+05 5.0E+05 7.5E+05 1.0E+06 4 8 16 32 64 128 0 100 200 300 400 LLC misses Batch size ( × 10 K) Context switches LLC misses, N=1K LLC misses, N=2K LLC misses, N=4K LLC misses, N=8K Context switch, N=1K Context switch, N=8K Context switch, N=2K Context switch, N=4K Figure 3.19: Number of LLC misses and context switches per batch ofP packets (AMD) 1. The number of LLC misses increases as the size of the batch (P ) increases; similar trends can also be seen for other level of caches. This is because for large values of P , more (random) packet headers have to be brought into caches and more memory accesses are required to classify all the packets in the batch. 2. The number of context switches increases as the size of the batch (P ) increases due to resource contention among multiple threads. For T > 32, oversubscrip- tion 12 of parallel threads leads to a lot more context switches per batch and degrades the performance. These two factors lead to an increase of batch processing latency in Figure 3.18. In this thesis, we chooseT = 32 so that we have both relatively high throughput and low latency. 12 More threads than the number of logical cores are deployed. 75 0 10 20 30 40 0 5 10 15 20 1 2 4 8 16 32 Latency (ms) Throughput (MPPS) No. of rules (K) Throughput, AMD Throughput, Intel Latency, AMD Latency, Intel Figure 3.20: Throughput and latency on both platforms Table 3.9: Latency breakdown per batch (Intel) Rule set size Searching /ms Merging /ms 1 K 12.95 (63%) 7.60 (37%) 2 K 14.03 (68%) 6.61 (32%) 4 K 15.88 (72%) 6.17 (28%) 8 K 17.90 (73%) 6.62 (27%) 16 K 18.91 (76%) 5.97 (24%) 32 K 18.69 (74%) 6.57 (26%) 3.2.8 Scalability of Throughput and Latency UsingT = 32, we evaluate in Figure 3.20 the performance with respect to throughput and latency on both the AMD and Intel platforms. As can be seen, as the size of the rule set (N) increases, the throughput tapers while the latency increases. For 32 K rule sets, we achieve 14:7 MPPS throughput and 22:1 ms latency (for a batch of 320 K packets) on the AMD platform. For the same rule sets, the performance on the AMD platform is better than that on the Intel platform (12:7 MPPS throughput and 25:3 ms latency); note our Intel processor has a lower clock frequency. 
We also break down the latency of our designs (T = 32, P = 320 K, N = 32 K) on the Intel platform in Table 3.9. As can be seen in this table, the majority of the time is spent on the searching phase; this is consistent with the latency breakdown on the AMD platform. By exploiting more cores to speed up the searching phase, we expect even higher throughput; this matches our expectations as discussed in Section 3.2.1.

3.3 Comparison of Packet Classification Approaches
In this section, we first compare the performance on various platforms in Section 3.3.1. Then we compare our work with prior works in Section 3.3.2.

3.3.1 Comparison between Various Platforms
We summarize the performance of our packet classification engines on various platforms in Table 3.10.

Table 3.10: Comparison between various platforms
Platform: FPGA | Multi-core Platform
No. of rules: 1 K | 32 K
Throughput: 800 MPPS | 20 MPPS
Processing latency: 800 ns per pkt. | 22.1 ms for 320 K pkts.

We have the following observations:

FPGA: FPGA-based packet classification engines can achieve very high throughput and low per-packet latency. For small rule sets, the massive parallelism and localized memory accesses to small SRAM blocks allow FPGA to beat multi-core General Purpose Processor (GPP)-based platforms with respect to both throughput and latency. The drawback of our FPGA-based designs is that only small rule sets are supported using on-chip memory. Although off-chip memory can be exploited for large rule sets, the long access latency to off-chip memory often deteriorates the performance. For example, off-chip memory often requires a complex memory controller managing ranks
[44] Xilinx Virtex 6 Decomposition 0:5 K, 5-field, 1250 MPPS 750 ns XC6VLX760 -2L no range match This thesis [93] Xilinx Virtex 6 Decomposition 1 K, 15-field 800 MPPS 800 ns XC7VX1140t -2L Multi-core platforms Ma’s [69] 24-core Intel Xeon X5550 Decision-tree 9 K, 5-field 14 MPPS Not well-defined (2:7 GHz) + TCAM (96:5 ns/pkt.) Qu et al. [92] 28-core Intel Xeon E5-2650 Decomposition 32 K, 15-field 30:5 MPPS 22:5 ms (2:6 GHz) + NVIDIA Tesla K40 (274:7 ns/pkt.) Han’s [51] 24-core Intel Xeon X5550 Decomposition 32 K, 10-field, 58:6 MPPS Not measured (2:7 GHz) + 2NVIDIA GTX480 no range match This thesis [90] 28-core AMD Opteron 6278 Decomposition 32 K, 15-field 14:7 MPPS 22:1 ms (2:4 GHz) (69:0 ns/pkt.) 79 As shown in Section 3.2, for the decomposition-based approaches, the memory and time complexities both increase linearly with respect toM. As discussed in Section 2.1.2.1, for the decision-tree-based approaches, a rule set ofM fields andN rules may consume up toO(N M ) memory. We also conducted research on using other accelerators, such as GPGPU, to further improve the overall performance on multi-core platforms [92]. However, the data trans- fer time between the host (CPU) and the device (GPU) usually leads to considerably high communication overheads [92]. Therefore, in this thesis, we only list their perfor- mances on state-of-the-art multi-core platforms without any further discussion on this class of approaches [51, 69, 92]. 80 Chapter 4 Internet Traffic Classification 4.1 High-throughput Traffic Classification on FPGA 4.1.1 Related Work We target the C4.5 decision-tree-based approach for Internet traffic classification since it gives the best classification accuracy. The first motivation of our work is: We look for a combined data structure for the range-treesT m ’s and the decision-treeT in order to facilitate efficient dynamic updates. Hence, we combine T m ’s and T into a new deci- sion treeT M . This is done by replacing each “true/false” statement inT by a criterion checking a specific input feature. Moreover, if a criterion is expressed in the form of a range, we denote such a criterion as a range criterion. For instance, in Figure 2.5, the statement “SP = 0110” corresponding to the root node of T is replaced by the range criterion “SP2 [0101; 0111)” inT M . AfterT M is constructed, each node ofT M stores an explicit range. Another motivation of our work is: We look for an alternative representation of a tree to explore parallel searches on FPGA. The state-of-the-art implementation only exploits the parallelism between different tree levels. In fact, there are at least another two types of parallelism to be explored: Multiple leaf-to-node paths can be examined in parallel. Multiple features can be compared in parallel. 81 Thus, we construct a compact rule set table (RST) for efficient hardware implementa- tion. This new data structure allows us to exploit massive parallelism on state-of-the-art FPGA devices. 4.1.2 Converting a Decision-tree Definition 4.1. A rule set is a set ofJ rules, each rule havingM criteria defined onM features, respectively. The table storing such a rule set is a Rule Set Table (RST). We show an example of a RST in Figure 4.1. There are 3 rules (R0, R1, andR2) in this example; each rule has 2 range criteria on the SP feature and the APS feature, respectively. The RST is compact because the number of rows in each RST is no more than the number of the leaves of its corresponding decision-tree. 
In T M , suppose a leaf node j is reached starting from the root by going through a path consisting of k j range criteria (k j L, L denoting the depth of T M ). Let A (j) 0 ;A (j) 1 ;:::;A (j) (k j 1) denote the ranges specified by thek j range criteria, respectively. A range criterion onA (j) i (i = 0; 1;:::;k j 1) can check any of theM features. We denote the width of each feature asW m bits,m = 0; 1;:::; M 1. We convert T M into a RST as shown in Algorithm 8. The intuition behind this conversion is that both the rows and the columns of the RST can be searched in parallel. We show an example in Figure 4.1; the path indicated by the dotted red line corresponds to ruleR1. Notice three range criteria are checked along this path, butR1 only checks two features; this means one of the features is checked twice along this path. The resulting RST only has two columns, each corresponding to a feature. It is also possible that a feature is never checked along a path, leading to a “don’t care” criterion for this feature in the corresponding rule; in that case, we use a full range [0; 2 Wm ) to express the range criterion for featurem. For instance, the 8-bit APS feature is not checked along the path corresponding toR0, so we use the full range [0; 256) for the APS feature in 82 P2PTV P2PTV HTTP Decision-tree SP APS R0 [0101, 0110) [0, 256) R1 [0110, 1110) [40, 100) R2 [1110, 1111) [50, 120) RST Check SP Check APS Converting Figure 4.1: Constructing a RST (4-bit SP, 8-bit APS) R0. The correctness of this algorithm can be easily proved, as shown in Theorem 4.2. After the conversion, each rule in the RST has multiple feature criteria to be compared with the incoming traffic. The conversion in Algorithm 8 does not depend on the shape, depth, or degree of the decision-tree. The size of the RST is not directly related to the number of tree levelsL even for very imbalanced decision-trees. Theorem 4.2. (Compactness) For an arbitrary decision-tree havingJ leaves and using M features, Algorithm 8 produces a compact RST havingJ rows andM columns. Proof. Along a root-to-leaf path consisting of rangesA (j) 0 ;A (j) 1 ;:::;A (j) (k j 1) , suppose fea- turem is checkedk times byA (j) i 0 ;A (j) i 1 ;:::;A (j) i k1 , respectively, wherei 0 < i 1 < ::: < i k1 and 0<kk j . We haveA (s) i k1 :::A (j) i 1 A (j) i 0 , and T i k1 i=i 0 A (j) i =A (j) i k1 . As a result, no matter how long such a path is, the corresponding rule can specify at mostM ranges onM features; this rule can be fit in exactly one row of anM-column RST. 83 Algorithm 8 Constructing RST fromT M Input AT M ofJ leaves, usingM features. Output An arrayRST (j;m),j = 0; 1;:::;J 1,m = 0; 1;:::;M 1; it hasJ rows andM columns. Each cellRST (j;m) stores a range. 1: forj = 0 toJ 1 do 2: for leaf node j in T M , find the root-to-leaf path consisting of ranges A (j) 0 ;A (j) 1 ;:::;A (j) (k j 1) ; 3: fori = (k j 1) to 0 do 4: form = 0 to (M 1) do 5: ifRST (j;m) = then 6: RST (j;m) [0; 2 Wm ); 7: end if 8: ifA (j) i is specified for featurem then 9: RST (j;m) RST (j;m) T A (j) i ; 10: end if 11: end for 12: end for 13: end for Register f_in outputs -bit Comparator eql less y x CU Data memory wr_in Figure 4.2: Basic PE 4.1.3 Hardware Architecture We now develop a mapping from the RST to a distributed hardware architecture. The inputs to this hardware architecture are M input features extracted from the network traffic. 
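For reference, Algorithm 8 can be restated in a few lines of software: every leaf starts with the full ("don't care") range on every feature, and the range criteria encountered along its root-to-leaf path are intersected feature by feature. The path encoding (a list of (feature, range) checks per leaf) and the Figure 4.1-style values in main are assumptions of this sketch, not the thesis's internal data structures; feature widths below 32 bits are assumed.

```cpp
#include <algorithm>
#include <cstdint>
#include <cstdio>
#include <vector>

// A range criterion [lo, hi) on one feature.
struct Range { uint32_t lo, hi; };

// One step on a root-to-leaf path of T_M: which feature it checks and the range.
struct PathStep { int feature; Range r; };

// Build the RST (Algorithm 8): one row per leaf, one column per feature.
// Unchecked features default to the full range [0, 2^W_m); repeated checks of
// the same feature along a path are intersected into a single range.
std::vector<std::vector<Range>> build_rst(
        const std::vector<std::vector<PathStep>>& paths,   // J root-to-leaf paths
        const std::vector<int>& W) {                        // W[m]: width of feature m
    const size_t M = W.size();
    std::vector<std::vector<Range>> rst;
    for (const auto& path : paths) {
        std::vector<Range> row(M);
        for (size_t m = 0; m < M; ++m)
            row[m] = {0u, uint32_t(1) << W[m]};             // full range ("don't care")
        for (const PathStep& s : path) {                    // intersect along the path
            Range& cell = row[s.feature];
            cell.lo = std::max(cell.lo, s.r.lo);
            cell.hi = std::min(cell.hi, s.r.hi);
        }
        rst.push_back(row);
    }
    return rst;
}

int main() {
    // Figure 4.1-style toy example: 4-bit SP (feature 0), 8-bit APS (feature 1).
    std::vector<std::vector<PathStep>> paths = {
        {{0, {5, 6}}},                                   // R0: SP in [5, 6), APS unchecked
        {{0, {6, 15}}, {0, {6, 14}}, {1, {40, 100}}},    // R1: SP checked twice
    };
    auto rst = build_rst(paths, {4, 8});
    std::printf("R1 SP range: [%u, %u)\n",
                (unsigned)rst[1][0].lo, (unsigned)rst[1][0].hi);
}
```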
84 4.1.3.1 Basic Processing Element We show the organization of a basic Processing Element (PE) for aW m -bit featurem in Figure 4.2; this PE consists of a single-port 1W m data memory, aW m -bit comparator, and a 2-bit register buffering the comparison results: 1. The data memory stores oneW m -bit range boundary. The data memory can take in wr in to write a new data; the data stored in the data memory can also be read to the y port of the comparator. 2. The Control Unit (CU) provides the control signals, such as the write enable signal to the data memory. 3. TheW m -bit comparator outputs the comparison results eql and less between two numbersx andy.eql is high whenx =y;less is high whenx<y. 4. The register buffers the comparison results. The PE in Figure 4.2 is designed for comparing the input with aW m -bit range upper- bound; a similar PE can be constructed to compare the input with the lowerbound of a range. A pair of such PEs (one for lowerbound, the other for upperbound) can be used for each cell of the RST. In total, there are M input features to be examined against 2M range boundaries; the comparison results are combined logically. Note that some features may be long; the clock rate degrades for long features, because the wire length of the critical path in the comparator isO(W m ). 4.1.3.2 Modular Processing Element Now we introduce the organization of a modular PE; the length of the longest wires in this PE isO(max (s;c)) instead ofO(W m ) (s;c < W m ). We only discuss the modular 85 PE used for upperbound comparison, because the lowerbounds can be handled simi- larly. Compared to the basic PE in Section 4.1.3.1, the modular PE is improved in the following aspects: Striding: A W m -bit long input feature or range boundary is split into multiple strides ofs bits each. In Figure 4.1, for instance, the upperbound on the SP feature of R1 “1110” can be split into two 2-bit strides: “11” (higher-order) and “10” (lower-order). Clustering: An input feature is compared withc range boundaries in parallel. In Figure 4.1, for example, an input SP feature can be compared with 3 upperbounds “0110”, “1110”, and “1111” in parallel. Using the striding and clustering techniques, a modular PE takes in ans-bit stride, f in, from an input feature; f in is compared in parallel withc strides fromc upperbounds. We show an example in Figure 4.3. The data memory and the s-bit comparators are the same as in the basic PE; additional registers are added for vertical pipelining (see Section 4.1.3.3). We will introduce d in, d out, and CU later in Section 4.1.4. For c = 2 range upperbounds, eql in0, less in0, eql in1, and less in1 denote the comparison results corresponding to higher-order strides; eql out0, less out0, eql out1, and eql out1 denote the partial results of the comparison from the highest-order stride down to the current stride. We denote a group of a 1-bit AND and a 2-to-1 1-bit MUX as the combining logic, as shown in Figure 4.3. Without loss of generality, we focus on the upper combining logic in Figure 4.3; we also assume the signal sel = 1 in this section. Notice that eql out0 is high only if (1) eql in0 is high, and (2) f in equals to d 0. If eql in0 is high (all the higher-order strides are equal to all the corresponding strides of the range upperbound), then less out0 is determined by the comparison result of the current strides (f in and 86 Data mem. -bit Comp. less_1 Data mem. Reg. eql_0 sel eql_in0 less_in1 1 0 eql_out0 less_out1 CU AND 0 1 d_out0 wr_in ∙ Reg. 
wr_out x y AND 0 1 -bit Comp. x y f_in Reg. f_out d_in0 eql_in1 less_0 eql_out1 less_out0 less_in0 eql_1 Reg. 1 0 d_in1 d_out1 d_0 d_1 Combining logic Figure 4.3: Modular PE (c = 2) d 0). On the other hand, if eql in0 is low, then less out0 has already been determined by the comparison results from higher-order strides, less in0. The modular PE is parameterizable since boths andc can be adjusted. Hence, we use modular PEs in our classification engine. As can be seen later in Section 4.1.4, in the classification engine, we also need the empty PE, a simplified version of the modular PE. Compared to the modular PE, the empty PE does not have the comparators and the combining logic. 4.1.3.3 2-dimensional Pipelined Architecture We concatenate an array of multiple modular PEs and empty PEs into a 2-dimensional pipelined architecture, as shown in Figure 4.4. 87 PE[1,0] PE[2,0] PE[3,0] PE[1,1] PE[2,1] PE[3,1] Arbiter final result RES RES Empty PE[0,0] Empty PE[0,1] PE[1,2] PE[2,2] PE[3,2] PE[1,3] PE[2,3] PE[3,3] Empty PE[0,2] Empty PE[0,3] d_in0 / d_in1 / d_out0 / d_out1 f_in / f_out partial / final results wr_in / wr_out Figure 4.4: 2-dimensional pipeline using the empty and modular PE; PE[j;i] indexes the PE in thej-th row andi-th column,j = 0; 1;:::;d J c e, andi = 0; 1;:::;d 2 P M1 m=0 Wm s e 1. 1. Vertical (column) All the strides of the input features are propagated in vertical pipelines. Ans-bit stride is examined againstJ strides; a number ofd J c e (= 3 in Figure 4.4) modular PEs are required in each vertical pipeline. 2. Horizontal (row) Each range criteria for featurem consists of aW m -bit upper- bound and aW m -bit lowerbound. Thus, a number ofd 2 P M1 m=0 Wm s e (= 4 in Fig- ure 4.4) modular PEs are required in each horizontal pipeline. The empty PEs are placed in the first row; the functionality of the empty PEs will be introduced in Section 4.1.4. Multiple resolvers (RESs) and an arbiter are also added to gather the final classification results: The RES/arbiter collects the partial results provided by the last modular PE in the corresponding row. 88 The arbiter also outputs the final result based on the partial results collected by RESs. The organizations of the logic-based RES and the arbiter are relatively simple. This 2- dimensional pipelined architecture effectively eliminates long wires and employs local- ized connections between PEs. 4.1.3.4 Snake-like Pipeline To reconfigure the data memory content, wr in is provided to the write port of each data memory module. To feed the data memory modules of all the PEs, wr in is propagated along a snake-like pipeline. As shown in Figure 4.4, the directions of such a pipeline alternate between up and down from column to column; the data paths are also head-to- tail connected between columns. The snake-like pipeline employs localized connections between adjacent PEs yet utilizing few I/O pins. As a result, wr in is provided to all the PEs with a very small amount of resource overhead. To modify the content in a single data memory module, one write access (ofs bits) is required. 4.1.4 Enabling Virtualization To support hardware virtualization, we present an update strategy updating the RST stored in the classification engine. We denote the outdated data structures before any updates as the old data structures; we denote the data structures to be used after updates as the new data structures. Given a new decision-tree, a new RST can be constructed offline. 
An external controller in Figure 2.4 can be used to construct the new data struc- tures and synchronize the hardware architecture with the incoming packet flows. Three types of update operations are required: 89 1. Deletion: to invalidate a rule. This is required when the new decision-tree has less leaves than the old one. 2. Insertion: add a brand-new rule into the rule set, or validate an invalid rule. This is required when the new decision-tree has more leaves than the old one. 3. Modification: change an existing rule into another rule. Comparing the new RST with the old RST, overlapping rules do not need to be updated. In the worst case, however, the entire RST has to be updated. We assume pessimistically that the new RST and the old RST are totally different; therefore each dynamic update refreshes the entire RST. Deletion To delete a rule, a rule can be invalidated; an invalid rule should not produce any valid partial results. This is done by making the lowerbound of a range greater than the upperbound. Consequently, no input features can satisfy the range criteria of an invalid rule. Notice that rule modifications can be used to invalidate a rule. Insertion A rule can be inserted by modifying an invalid rule into a valid rule. There- fore rule modifications are required again. Since the value ofJ may fluctuate for dif- ferent networks, we use J max to denote the maximum number of decision-tree leaves during run-time. Thus, we deploy 1 empty PE and (d Jmax c e) modular PEs in each verti- cal pipeline. Modification To support rule modifications, a straightforward strategy [104] is to dou- ble the amount of memory footprint. While the PEs are accessing their designated data memories, extra data memories can be used to store the new data; a context switch is then performed to make all the PEs access the new data without stalling the pipelines 90 [104]. This strategy introduces too much resource overhead on hardware. For our 2- dimensional pipelined architecture, the total memory required by this straightforward strategy is 4J max P M1 m=0 W m . The main idea of our novel update mechanism is to utilize the data memory inside an empty PE to update the data for an entire column of PEs. The update process for a column of PEs is denoted as a ping-pong process, because data are shifted back-and- forth between adjacent PEs in the same column. For a column of PEs, a ping-pong process can start sequentially either from top to bottom (top-down) or from bottom to top (bottom-up). Figure 4.5 illustrates a top-down update on the first column of PEs in Figure 4.4, with the following steps: Step 1 Upon start of the top-down update, the content of the data memory in the empty PE[0; 0] is being updated; PE[1; 0] is still accessing its own data memory. Step 2 After the content of the data memory in PE[0; 0] have been updated, PE[1; 0] is immediately switched to use the data stored in PE[0; 0]. At the same time, the content of the data memory inside PE[1; 0] is being updated. Step 3 PE[1; 0] has finished its update. PE[2; 0] is switched to use the data provided by PE[1; 0]. The content of the data memory inside PE[2; 0] is being updated. Step 3 continues until all the PE[j;i] in this column is switched to use the data provided by PE[j 1;i]. Step 4 A top-down update completes. The data memory inside the bottom PE is left unused. 
After the top-down update is complete, if another update is to be performed again on this column of PEs, the data memory inside the bottom PE can be reconfigured to start a sim- ilar bottom-up update. The ping-pong process allows us to modify the data accessed by 91 PE [1,0] PE [2,0] PE [3,0] PE [0,0] PE [1,0] PE [2,0] PE [3,0] PE [0,0] PE [1,0] PE [2,0] PE [3,0] PE [0,0] PE [1,0] PE [2,0] PE [3,0] PE [0,0] updating its own data using its own data providing data to the PE below it Step 1 Step 2 Step 3 Step 4 Figure 4.5: A top-down update for a column of PEs the PEs without stalling the pipelines, yet using a small amount of memory and resource overhead. Compared to the straightforward strategy, the total memory consumption for the ping-pong process is reduced to 2(J max +c) P M1 m=0 W m . In Figure 4.3, d in0 and d in1 are the data provided by the PE above this modular PE; d out0 and d out1 are the data provided to the next PE below this PE. The CU uses a control signal sel to decide whether to use d in0 and d in1, or to use the data read from the data memory directly. Figure 4.5 only shows a top-down update for the first column (i = 0) in Figure 4.4. Since the control signals have to be propagated along horizontal and vertical pipelines, the top-down update for thei-th column is delayed byi clock cycles. As a result, for a large 2-dimensional PE array, the ping-pong process proceeds in a diagonal waveform- like manner. 4.1.5 Experimental Setup We conducted extensive experiments using Verilog on Xilinx Vivado Design Suite 2013.4, targeting the Virtex-7 XC7VX1140t FLG1930 -2L FPGA [29]. This device has 1100 I/O pins, 218800 logic slices, 67 Mb BRAM, and can be configured to realize 92 Table 4.1: Flow-level features tested Feature W m Description Transport-layer Protocol (Prtl) 8 - Source Port Number (SP) 16 - Destination Port Number (DP) 16 Average Packet Size (APS) 12 over the first Maximum Packet Size (MaxPS) 12 6 packets of a Minimum Packet Size (MinPS) 12 traffic flow large amounts of distributed RAM (distRAM, up to 17 Mb). A conservative timing con- straint of 250 MHz was used for place-and-route process. Maximum achievable clock rate and resource consumption were collected using post-place-and-route timing reports and resource utilization reports, respectively. For all the implementations on FPGA, we exploited distRAM or BRAM with dual- port read access; memory write accesses were single-ported in order to avoid writing two different data into the same location. We started from a typical C4.5 decision-tree with J max = 96 leaves and L = 42 levels [103]. Based on this decision-tree, we constructed synthetic decision-trees while keeping the ratio of Jmax L a constant in order to evaluate the effectiveness of our approach for large decision-trees. We used 6 flow-level features in Table 4.1; they demonstrate high classification accuracy with a reasonable hardware complexity [35, 59, 33]. Our classification engine classifies network traffic into 8 categories [46], including web, P2P download, direct download, streaming, game, mail, instant messaging, and distant con- trol. We used a publicly available traffic trace provided by Tstat [26]; note the through- put performance of our classification engine does not depend on the traffic trace. 93 4.1.6 Performance Metrics Classification accuracy is the average percentage of correct classifications over all the classifications performed. The classification accuracy is measured by WEKA [50], an existing ML software. 
We do not provide further discussions on the performance met- rics such as classification accuracy, false positive and false negative [76]; this is because these offline performance metrics are determined in the training phase. In addition, com- pared to the prior works [103], our classification engine neither improves these offline performance metrics nor degrades them. The performance metrics used in the online classification phase include: Overall Throughput (T overall ): The total number of classifications performed per unit time for all the virtual networks (in Million Classifications Per Second, MCPS) Concurrent Throughput (T concurrent ) The total number of classification per- formed per unit time for a traffic flow from a single virtual network (in MCPS) Processing Latency: The total processing latency introduced by the online clas- sification engine Resource consumption: The total amount of hardware resources used by the classification engine on FPGA Using the dual-port data memory, our 2-dimensional pipelined architecture classifies 2 packet flows concurrently; denoting the maximum clock rate achievable on FPGA asf, we haveT overall = 2f numerically. For hardware virtualization, the ping-pong mechanism does not require the pipelines to be stalled; also, since all theS virtual networks share (in a round-robin fashion) the same online classification engine, we haveT concurrent = T overall S . 94 Table 4.2: Clock ratef (MHz) s 2 3 4 5 c 2 279.11 320.15 294.33 284.33 3 310.34 290.67 286.51 275.25 4 300.75 287.48 275.80 250.82 4.1.7 Empirical Optimization of Parameters Remember s is the stride length while c is the cluster size in Section 4.1.3; thus, s determines the number of stages in each horizontal pipeline, while c determines the number of stages in each vertical pipeline. To improve T overall , we choose carefully the values of s and c in our implementa- tions. We setJ max = 128 and P M1 m=0 W m = 80 initially; we have observed consistent performance for other values ofJ max and P M1 m=0 W m as well. We show the experimental results in Table 4.2. In general, small values ofs andc lead to high clock rates, since the comparators in each PE can efficiently utilize the 6-input LUT on our FPGA. In this paper, we chooses = 3 andc = 2, unless stated otherwise. 4.1.8 Throughput and Latency We investigate the effect of the values ofJ max and P M1 m=0 W m on the overall through- put T overall , as shown in Table 4.3. In this table, we only show T overall for 80 P M1 m=0 W m 320 and 16 J max 96, although the similar trend can be seen for other combinations ofJ max and P M1 m=0 W m . As can be seen in this figure: Compared to the state-of-the-art implementation on the same FPGA, we achieve consistently betterT overall (up to 2). We have also observed consistent perfor- mance improvements for other values ofJ max and P M1 m=0 W m . 95 Table 4.3: Overall throughputT overall (MCPS) Parameters State-of-the-art [103] This work P W m 80 160 240 320 80 160 240 320 J max 16 684 538 496 455 770 726 694 687 32 512 493 458 421 759 715 701 679 48 498 476 452 396 738 696 687 659 64 437 425 401 374 705 682 653 621 80 429 386 367 358 661 631 611 598 96 399 382 354 305 645 620 609 588 Table 4.4: Throughput and latency for typical decision-trees No. of leaves Implementation State-of-the-art [103] This work 128 520 MCPS, 0:22s 533 MCPS, 1:59s 256 468 MCPS, 0:48s 530 MCPS, 3:22s 512 392 MCPS, 1:14s 523 MCPS, 6:53s 1024 304 MCPS, 2:95s 437 MCPS, 15:62s T overall tapers as we increaseJ max or P M1 m=0 W m . 
This is because large values of J max or P M1 m=0 W m lead to more resource consumption and less routing choices, which in turn results in slower clock rates on FPGA. Using localized connections between PEs, our 2-dimensional pipelined architec- ture is effective especially for improving the throughputs of very large decision- trees (largeJ max and P M1 m=0 W m ). To reduce the processing latency of our classification engine, we now choose the val- ues ofs andc to be 6 and 4, respectively. Also, we fix P M1 m=0 W m = 80 and varyJ max to examine the processing latency; similar trends can be seen if we fix J max and vary the other. We show in Table 4.4 the worst-case 1 processing latency performance. As can be seen, our classification engine incurs longer processing latency (approximately 5 1 The processing latency of the state-of-the-art implementation varies depending on the data traces. 96 compared with the state-of-the-art implementation). However, our pipelined architec- ture increases the throughput by upto 2 and supports hardware virtualization. 4.1.9 Impact of Virtualization As can be seen in Section 4.1.6, each of theS networks shares a portion of the overall throughput; thus,T concurrent = T overall S . The new data to be updated have to be loaded through the snake-like pipeline between two consecutive updates; under this constraint, the maximum update rate (measured in updates per second) supported by our classifica- tion engine is: f= d Jmax c ed 2 P M1 m=0 Wm s e (4.1) Note each update can refresh the entire RST. There is no theoretical upperbound on the number of virtual networks S, but a large value of S may lead to low T concurrent and require large buffers to be placed in front of our classification engine. 4.1.10 Resource Consumption We first only employ distRAM for the data memory modules; the resource consumption is shown only with respect to the number of occupied logic slices and the number of used I/O pins. A similar trend can be seen if we also employ BRAM for the data memory modules. We show the resource consumption with respect to the total feature width (30 P M1 m=0 W m 120) in Figure 4.6. We chooseJ max = 64 targeting> 250 MHz clock rate to demonstrate the scalability of our architecture. As a baseline for comparison, we also implemented a 2-dimensional PE array using the straightforward strategy for dynamic updates as discussed in Section 4.1.4: (1) the PE array in this baseline implementation 97 22% 35% 51% 61% 25% 46% 68% 90% 11% 18% 26% 31% 7% 13% 18% 23% 30 60 90 120 Total feature width Slices: baseline I/O pins: baseline Slices: this work I/O pins: this work Figure 4.6: Percentage of resource consumption (J max = 64) does not have the snake-like pipeline; (2) it does not support the ping-pong process, either. As can be seen in Figure 4.6: Compared to the baseline, our novel dynamic update mechanism reduces the resource consumption dramatically by 50% to 75%. Our 2-dimensional pipelined architecture supports dynamic updates with little resource overhead. As we increase the total feature width, the resource consumption increases sub- linearly. The values ofJ and P M1 m=0 W m can be adjusted accordingly for a given classification problem. Our 2-dimensional pipelined architecture can be scaled not only in the vertical direction with respect to the maximum number of rules, but also in the horizontal direction with respect to the total feature width. 
By adding more stages to each horizontal or vertical pipeline, as well as using BRAM for data memory, our 2-dimensional pipelined architecture can support even larger decision-trees with little throughput degradation. For traffic classification, using larger values ofJ max and P M1 m=0 W m implies a higher classification accuracy [50]. With BRAM-based data memory, the largest J max and 98 P M1 m=0 W m supported by our FPGA are 2048 and 320, respectively, constrained by the available resources on FPGA. 4.2 Compact Hash Tables on Multi-core Platforms 4.2.1 Related Work A new trend in network applications is to use software accelerators [84, 70]. State- of-the-art multi-core platforms have many advanced features [2, 13]. It is an attractive platform for high-performance network applications. However, to achieve high perfor- mance, efficient parallel algorithms have to be explored. High-accuracy traffic classifiers often require well-tuned discretizers [59, 103]; dis- cretizers can discretize the input traffic features into discrete values. We denote these discrete values as unique values. In Figure 2.5, for example, a 0 , a 2 , a 4 , a 6 , b 0 , andb 2 are all unique values. During the preprocessing phase, the discretizers are tuned, and a decision-treeT is constructed. In a decision-tree, each internal node stores a unique value from a specific feature. Our work is inspired by the following observations on the classic C4.5 decision-tree- based approach: Using the same notations as in Section 4.1, we can see the decision-treeT usually has a relatively small number of leaf nodes, butT can be very imbalanced. This means the searching operations inT can incur large processing latency. Denoting as the total number of nodes inT , a lookup on an imbalancedT may take O() time. However, it is almost impossible to balance T , since various decisions require various amounts of processing latency. 99 Table 4.5: Notations for converting the decision-tree to hash tables A n = (S 0 ; S 1 ;:::; S kn ;:::; S Kn1 ), J root-to-leaf paths inT , n = 0; 1;:::; J 1,k n = 0; 1;:::; K n 1 each path visitingK n edges parfS kn g, chifS kn g parent and child nodes connecting edgeS kn u (m) i ,m = 0; 1;:::; M 1, Unique values forM features, i = 0; 1;:::; q (m) 1 q (m) unique values for featurem B (m) i =b (m) i;0 b (m) i;1 :::b (m) i;n :::b (m) i;J1 , q (m) BVs for featurem, i = 0; 1;:::; q (m) 1,n = 0; 1;:::; J 1 each BV consisting ofJ bits f (m) 0 ; f (m) 1 ;:::; f (m) p ;:::; f (m) P1 P Hash functions for featurem H (m) Hash table for featurem [] m A set of data structures with differentm The searching process onT is a serial process incurring large latency. To reduce the processing latency, parallel data structures have to be explored. We tend to search all the M input unique values in parallel to reduce the processing latency. We formulate the following problem: Given a set of unique values and an input unique value, locate the unique value matching the input unique value. An efficient solution to this problem is perfect hashing; without loss of generality, we use Cuckoo hashing [81] to reduce the number of memory accesses. The searching outcomes for all theM unique values, treated asM partial results, are merged together to produce the final classification result. This final classification result indicates what the final decision should be if we use the M unique values to search the decision-treeT . 
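For reference, the per-flow lookup of the classic C4.5 classifier that the following subsections replace is a serial root-to-leaf walk; its latency grows with the depth of the path taken, which can approach the total number of nodes for an imbalanced tree. A minimal sketch, with an assumed array-of-nodes layout and equality tests against the stored unique values:

```cpp
#include <cstdio>
#include <vector>

// One node of the discretized C4.5 tree: an internal node asks whether a given
// feature equals the unique value it stores (YES/NO edges); a leaf stores a class.
struct TreeNode {
    int feature;         // which flow-level feature this node examines
    unsigned value;      // the unique value stored at this node
    int yes, no;         // child indices for the YES / NO edges (-1 if leaf)
    int leaf_class;      // application class at a leaf, -1 for internal nodes
};

// Serial root-to-leaf walk: the number of comparisons equals the depth of the
// path taken, so an imbalanced tree makes the worst-case latency large.
int classify_serial(const std::vector<TreeNode>& tree, const unsigned* features) {
    int n = 0;
    while (tree[n].leaf_class < 0)
        n = (features[tree[n].feature] == tree[n].value) ? tree[n].yes : tree[n].no;
    return tree[n].leaf_class;
}

int main() {
    // Tiny two-level tree: "feature 0 == 80?" with a leaf on each edge.
    std::vector<TreeNode> tree = {
        {0, 80, 1, 2, -1}, {0, 0, -1, -1, /*class*/ 3}, {0, 0, -1, -1, /*class*/ 5}};
    unsigned flow[1] = {80};
    std::printf("class = %d\n", classify_serial(tree, flow));   // prints 3
}
```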
4.2.2 Compact Hash Tables Before we present our algorithm, we revisit Bit Vector (BV) [90], a data structure used to efficiently merge the partial results. A BV consists of J bits; the n-th bit (n = 100 0; 1;:::; J1) is set to “1” if and only if the input matches or satisfies then-th condition. Note the number of leaf nodes (J) is usually small in T ; merging a set of short BVs can be very fast. This is the reason why we choose to use BV to represent the partial searching results. Hence, during the training phase, we convertT intoM compact hash tables, each constructed for a specific feature. Using the notations in Table 4.5, we show the pseudo- code in Algorithm 9. After the conversion, each hash table stores a number of q (m) entries, each consisting of a unique value and a BV . Note: Step 5: We set the corresponding bit of the BV to “1”, because the decision along pathA n is not dependent on featurem. Step 12 and Step 14: These steps can be overwritten by latter iterations (for deeper levels ofT ). Step 19: We construct a null unique value “” for each feature. “” is not stored inT , but it is required when none of the unique values stored inT equals to the input unique value. The total number of unique values, including “”, isq (m) for featurem. We show an example in Figure 4.7; in this figure, we convert the decision-treeT into M = 4 hash tables 2 . Each hash table contains a “”; each unique value is associated with a 6-bit BV (sinceT has a total number of 6 leaves). Suppose, for example, an input unique value (denoted as u (m) ) matches the unique value u (m) i ; then a “1” in the n-th position of B (m) i suggests: all the criteria related to feature m are satisfied along the n-th root-to-leaf path. 2 This is because only 4 features are examined byT in this example: SP, DP, APS, and MinPS. 101 Algorithm 9 Converting the decision-tree to compact hash tables SUBROUTINE: [H (m) ] m ( tree to hash (T ) 1: forn = 0; 1;:::; J 1fevery root-to-leaf pathg do 2: form = 0; 1;:::; M 1fevery featureg do 3: if pathA n does not check featurem then 4: fori = 0; 1;:::; q (m) 1 do 5: b (m) i;n 1 6: end for 7: else 8: fork n = 0; 1;:::; K n 1falong the pathg do 9: fori = 0; 1;:::; q (m) 2fexcludingg do 10: if parfS kn g stores a unique valueu (m) i then 11: ifS kn is a “YES-edge” then 12: b (m) i;n 1 13: else 14: b (m) i;n 0 15: end if 16: end if 17: end for 18: end for 19: constructu (m) q (m) 1 andB (m) q (m) 1 20: end if 21: end for 22: end for 23: form = 0; 1;:::; M 1fevery tableg do 24: fori = 0; 1;:::; q (m) 1fevery unique valueg do 25: p 0 26: if locationf p (u (m) i ) inH (m) is empty then 27: store (u (m) i ,B (m) i ) at locationf p (u (m) i ) 28: else 29: p p + 1 30: ifp =Pfalready tried all theP functionsg then 31: enlargeH (m) 32: reselect [f (m) p ] p 33: p 0 34: end if 35: go to Step 26 36: end if 37: end for 38: end for 102 0 3 4 5 1 2 P2PTV HTTP P2PTV MSN Skype MSN SP=80? Decision-tree DP=100? APS=500? MinPS=400? DP=400? YES NO Internal node Leaf ( = , , … , − ) Uniq. Val. BV 80 111100 ∅ 000011 SP Uniq. Val. BV 100 100011 400 010111 ∅ 001111 DP Uniq. Val. BV 500 111011 ∅ 100111 APS Uniq. Val. BV 400 111110 ∅ 111101 MinPS Converting = hash tables Figure 4.7: Converting a decision-treeT toM hash tables Denoting the number of bits used for featurem asW (m) , the memory consumption for the M hash tables is bounded by O P m q (m) [W (m) +J] . The actual mem- ory consumption in our implementations, considering the overhead when using Cuckoo hashing, is roughly 2 P m q (m) [W (m) +J] . 
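The conversion of Algorithm 9 can also be restated in software. The sketch below ignores the cuckoo placement (Steps 23-38) and simply keys an ordered map by the unique value; a 64-bit word stands in for the J-bit BV (so J <= 64 is assumed here), nullopt models the null value "∅", and the paths in main are toy examples loosely modeled on Figure 4.7 rather than the actual tree.

```cpp
#include <cstdint>
#include <iostream>
#include <map>
#include <optional>
#include <vector>

// One check along a root-to-leaf path: "is feature f equal to value v?",
// together with which edge (YES / NO) the path follows.
struct Check { int feature; unsigned value; bool yes_edge; };
using Path = std::vector<Check>;
using BV = uint64_t;                       // one bit per leaf; assumes J <= 64

// Algorithm 9 in software: for every feature, map each unique value (nullopt
// models "∅") to a J-bit vector whose n-th bit is 1 iff that value is
// consistent with all checks of this feature along the n-th root-to-leaf path
// (a path that never checks the feature leaves the bit set, as in Step 5).
std::vector<std::map<std::optional<unsigned>, BV>>
build_tables(const std::vector<Path>& paths,
             const std::vector<std::vector<std::optional<unsigned>>>& uniques) {
    const size_t M = uniques.size(), J = paths.size();
    std::vector<std::map<std::optional<unsigned>, BV>> tables(M);
    for (size_t m = 0; m < M; ++m)
        for (auto u : uniques[m]) {
            BV bv = 0;
            for (size_t n = 0; n < J; ++n) {
                bool ok = true;
                for (const Check& c : paths[n])
                    if (c.feature == static_cast<int>(m))
                        ok = ok && ((u.has_value() && *u == c.value) == c.yes_edge);
                if (ok) bv |= BV(1) << n;
            }
            tables[m][u] = bv;
        }
    return tables;
}

int main() {
    // Two toy paths (SP = feature 0, DP = feature 1), purely illustrative:
    // path 0: SP=80? YES -> DP=100? YES;  path 1: SP=80? YES -> DP=100? NO.
    std::vector<Path> paths = {
        {{0, 80, true}, {1, 100, true}},
        {{0, 80, true}, {1, 100, false}},
    };
    std::vector<std::vector<std::optional<unsigned>>> uniques = {
        {80u, std::nullopt}, {100u, 400u, std::nullopt}};
    auto t = build_tables(paths, uniques);
    std::cout << "BV(DP=400) = " << t[1][400u] << "\n";   // 2: only bit 1 is set
}
```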
As can be seen later in this thesis, this small memory footprint helps us improve the cache utilization. After the compact hash tables are constructed offline, we present the online classifi- cation phase in the following stages: 1. Discretizing the feature values extracted from the network traffic. 2. SearchingM input unique values in parallel in their corresponding hash tables. 3. Merging the partial results to produce the final classification result. 103 Table 4.6: Notations for the searching and merge stages u (m) i ,m = 0; 1;:::; M 1, Unique values forM features, i = 0; 1;:::; q (m) 1 q (m) unique values for featurem B (m) i =b (m) i;0 b (m) i;1 :::b (m) i;n :::b (m) i;J1 , q (m) BVs for featurem, i = 0; 1;:::; q (m) 1,n = 0; 1;:::; J 1 each BV consisting ofJ bits u (m) ,m = 0; 1;:::; M 1 M input unique values B (m) =b (m) 0 b (m) 1 :::b (m) n :::b (m) J1 partial searching result for featurem B =b 0 b 1 :::b n :::b J1 final classification result f (m) 0 ; f (m) 1 ;:::; f (m) p ;:::; f (m) P1 P Hash functions for featurem H (m) Hash table for featurem [] m A set of data structures with differentm 4.2.3 Online Classification The discretization stage remains the same as the classic C4.5 decision-tree-based classi- fier [103]; hence we only focus on the searching stage (Section 4.2.3.1) and the merging stage (Section 4.2.3.2) in the online classification phase. 4.2.3.1 Searching Compact Hash Tables Using the notations in Table 4.6, we show the pseudo-code for the searching stage in Algorithm 10. Because we use perfect hashing, there are no hash collisions. Also, the perfect hashing technique is only applied to the online classification phase, so our approach does not sacrifice classification accuracy for better performance [54]. Since we employP hash functions to efficiently access each of theM hash tables, the total number of memory accesses for the searching stage is at mostMP . The parallel time complexity of the searching stage is onlyO(P ). In our implementations,P is chosen to be 2, since a smaller value ofP indicates higher performance. 104 Algorithm 10 Searching the hash tables SUBROUTINE: [B (m) ] m ( hash search ([H (m) ] m , [u (m) ] m ) 1: form = 0; 1;:::; M 1 in parallel do 2: p 0 3: extract (u (m) i ,B (m) i ) at the locationf p (u (m) ) 4: ifu (m) =u (m) i fhash hitg then 5: B (m) B (m) i 6: else 7: p p + 1 8: go to Step 3 9: end if 10: end for 4.2.3.2 Merging Partial Results Algorithm 11 Merging the BVs SUBROUTINE:B( merge ([B (m) ] m ) 1: B (m) & M1 m=0 B (m) i f& denotes the bitwise AND operationg 2: useB (m) to report the application class We also show the pseudo-code for the merging stage in Algorithm 11. Note we merge all theM BVs sequentially. This is becauseM is usually very small; otherwise more efficient merging algorithms are also required. The time complexity of the merging stage isO(MJ). For example, in Figure 4.7, suppose the input unique values for the SP, DP, APS, and MinPS features are 80, 400, 500, and 1000, respectively. The partial results, after the searching stages, are 111100, 010111, 111011, and 111101 3 , respectively. Then we perform a bitwise AND operation on these 4 BVs. This leads to the final classification result B = 010000, indicating the leaf node “Skype” (indexed by n = 1) is the final decision for the incoming traffic. 3 In this example, the unique value “1000” for the MinPS feature matches “”. 105 4.2.4 Experimental Setup We conducted the experiments on a 2 AMD Opteron 6278 processor [2] and a 2 Intel Xeon E5-2470 processor [13]. 
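As an aside, the merging stage just described (Algorithm 11) amounts to a handful of bitwise AND operations; the sketch below reproduces the worked example above, with the BVs packed least-significant-bit first purely as a storage convention of this sketch (the thesis writes b_0 as the leftmost bit).

```cpp
#include <cstdint>
#include <cstdio>

// Algorithm 11: AND the M partial BVs; the surviving bit indexes the
// decision-tree leaf (application class) selected for this flow.
int main() {
    // Partial BVs of the worked example (b_0 stored as the least-significant bit):
    // 111100, 010111, 111011, 111101 in the thesis's left-to-right notation.
    uint8_t bv[4] = {0b001111, 0b111010, 0b110111, 0b101111};
    uint8_t result = 0b111111;
    for (uint8_t b : bv) result &= b;
    std::printf("final BV = 0x%02x\n", result);   // 0x02: only bit n = 1 ("Skype") set
}
```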
The AMD processor has 16 physical cores, each running at 2:4 GHz. Each core is integrated with a 16 KB L1 data cache and a 2 MB L2 cache. A 6 MB L3 cache is shared among all the 16 cores; all the cores have access to 64 GB DDR3-1600 main memory. The AMD processor runs openSUSE 12.2 OS (64-bit 2.6.35 Linux Kernel). The Intel processor also has 16 cores, each running at 2:3 GHz. Each core has a 32 KB L1 data cache and a 256 KB L2 cache. All the 16 cores share a 20 MB L3 cache, and they have access to 48 GB DDR3-1600 main memory. This processor runs openSUSE 12.3 OS (64-bit 3.7.10 Linux Kernel). We implemented our approach using Pthreads with gcc version of 4.7.1 and 4.7.2 on the AMD and Intel processors, respectively. We deployedM threads for the searching stage of our approach, and used a single thread for merging. For any input traffic flow, we bound all the searching and merging threads to a single core in order to mitigate the penalty of data movements among the cores [90]. We used perf, a performance analysis tool in Linux, to monitor the hardware and software events such as the number of cache misses and the number of context switches. 4.2.5 Performance Metrics and Design Parameters Let classification accuracy denote the number of correctly classified traffic flows over all the incoming flows. The performance of our classification engines with respect to the classification accuracy is determined in the training phase; our classification engines neither improve nor degrade the classification accuracy. Hence we ignore any further discussion on this performance metric. 106 Balance factor: 0.63 Balance factor: 0.50 0.75 0.50 0.50 0.50 0.50 Balance factor: 1.00 1 1 Figure 4.8: Various shapes of decision-trees and their balance factors (B) For online traffic classification, we use the following metrics to measure the perfor- mance: Overall Throughput (T overall ): The total number of classifications performed per unit time (in Million Classifications Per Second, MCPS) Processing Latency: The average processing time used for classifying one traffic flow Our classification engines on multi-core platforms are software-based, so in this sub- section, we do not introduce any performance metrics related to hardware virtualization 4.1.6. Also, to make a fair comparison with our classification engines on FPGA, we define the processing latency to be the average processing time of a single traffic flow, although many traffic flows are processed in batches 4 . In this subsection, we present a new design parameter in order to show the advan- tages of our approach on imbalanced decision-trees. This parameter is only used for illustration purpose; our techniques do not depend on the shape of the decision- tree. We define balance factor for a binary decision-tree T as follows: Let x , = 0; 1;:::; J 1, denote all the internal nodes of T 5 . Hence x is the root 4 We configure each batch to be 10 K packets, since this batch size gives the best performance [92]. 5 Recall is the total number of nodes inT ;J is the total number of leaf nodes. 107 Table 4.7: Statistics of a typical C4.5 decision-tree[103] SP DP MinPS MaxPS APS VPS Prtl Feature width (bits) 16 16 12 12 12 16 8 No. of unique values 19 17 19 14 7 9 2 for at most 2 subtrees. We useLeft(x ) to denote the number of nodes of the left sub- tree; similarly,Right(x ) can be defined. We defineB =avg n max[Left(x);Right(x)] Left(x)+Right(x) o . As shown in Figure 4.8, a perfectly balanced binary tree satisfies B = 0:50. 
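In other words, B is the average, over all internal nodes x, of max(Left(x), Right(x)) / (Left(x) + Right(x)). A short sketch computing it, with an assumed array-of-nodes tree layout (child index -1 meaning "no child"):

```cpp
#include <algorithm>
#include <cstdio>
#include <vector>

// Binary decision-tree node; children are indices into the node array (-1 = none).
struct Node { int left, right; };

// Count the nodes in the subtree rooted at n (0 if n is absent).
static int subtree_size(const std::vector<Node>& t, int n) {
    if (n < 0) return 0;
    return 1 + subtree_size(t, t[n].left) + subtree_size(t, t[n].right);
}

// Balance factor B: for every internal node, take max(Left, Right) / (Left + Right)
// over its subtrees, then average over all internal nodes. B = 0.50 is perfectly
// balanced; B close to 1.00 means the tree degenerates toward a chain.
double balance_factor(const std::vector<Node>& t) {
    double sum = 0;
    int internal = 0;
    for (size_t n = 0; n < t.size(); ++n) {
        int L = subtree_size(t, t[n].left), R = subtree_size(t, t[n].right);
        if (L + R == 0) continue;                       // leaf: not an internal node
        sum += static_cast<double>(std::max(L, R)) / (L + R);
        ++internal;
    }
    return internal ? sum / internal : 0.5;
}

int main() {
    // A 3-node chain (maximally skewed tree).
    std::vector<Node> chain = {{1, -1}, {2, -1}, {-1, -1}};
    std::printf("B = %.2f\n", balance_factor(chain));   // prints 1.00
}
```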
For the decision-treeT shown in Figure 4.7,B = 0:66. Note 0:50B 1:00. A typical C4.5 decision-tree usually has 0:91B 0:95. 4.2.6 Data Sets and Traces A major difference between our classification engines on FPGA and on multi-core plat- forms is: the performance of our classification engines on multi-core platforms are highly dependent on the statistics of the C4.5 decision-tree. Such statistics include (1) the number of leaf nodes (J), and (2) the number of unique values examined for each feature (q (m) ,m = 0; 1;:::;M 1). To estimate the typical number of leaf nodes for a C4.5 decision-tree, we collected around 100 decision trees whose accuracy is above 95% [103]. We observed that all of the decision-trees have around 100 leaf nodes. Hence, we prototyped our design exploiting a typical (real) C4.5 decision-tree [103] consisting of 92 leaf nodes. The statistics of this tree is shown in Table 4.7. We employed M = 7 flow-level features: Source Port (SP), Destination Port (DP), Minimum Size of the first 4 packets (MinPS), Maximum Size of the first 4 packets (MaxPS), Average Size of the first 4 packets (APS), Size Variance of the first 4 packets (VPS), and Protocol (Prtl). 108 0 250 500 750 1000 0.91 0.92 0.93 0.94 0.95 Latency (ns) Balance Factor AMD_classic Intel_classic AMD_proposed Intel_proposed Figure 4.9: Latency on both platforms Based on the features examined, all of our real decision-trees can achieve over 98:15% classification accuracy [50] for a publicly available traffic trace [26]. We con- sidered 8 application classes, including HTTP, MSN, P2PTV , QQ IM, Skype, Skype IM, Thunder, and Yahoo IM [103]. For large decision-trees (largerJ orq (m) ), we constructed synthetic C4.5 decision- trees. For simplicity, we keepB and the ratio of q (m) J constants while we scale up our synthetic decision-trees. 4.2.7 Latency Improvement As a baseline, we implemented the classic C4.5 decision-tree on the two multi-core platforms; throughout this subsection, we denote classic C4.5 decision-tree as the imple- mentation of the binary tree generated from the C4.5 algorithm. In Figure 4.9, we set M = 7, J = 92, and 0:91 B 0:95. We compare the latency performance of the classic C4.5 decision-tree and our approach on both the AMD and Intel platforms. Our classification engines demonstrate 2 speedup due to the efficient hashing techniques. 109 0 200 400 600 800 0 50 100 150 200 2 4 8 16 32 Latency (ns) Throughput (MCPS) No. of concurrent classifiers AMD_Throughput Intel_Throughput AMD_Latency Intel_Latency Figure 4.10: Varying the number of concurrent classifiers (P ) The speedup is also consistent with respect to various values ofB, because the process- ing latency in our approach does not depend onB. We also observed consistent speedup with respect to throughput. 4.2.8 Scalability of Throughput In this subsection, we show the scalability of our approach with respect to the num- ber of concurrent classifiers (Section 4.2.8.1), the number of decision-tree leaves (Sec- tion 4.2.8.2), and the number of traffic features examined (Section 4.2.8.3) for online classification. 4.2.8.1 Varying the Number of Concurrent Classifiers To efficiently reuse program threads, a number of P traffic flows are examined con- currently in our design. Hence we needP concurrent classifiers to processP flows in parallel in our designs. In Figure 4.10, we show the throughput of our approach with respect to various val- ues ofP . We choose the typical decision-tree withJ = 92 andM = 7 in this figure. 
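A minimal sketch of how the P concurrent classifiers are organized: P worker threads each classify a strided share of the batch. The thesis implementation uses Pthreads and pins each classifier's searching and merging threads to one core; this sketch uses std::thread, omits core pinning, and classify_flow is only a placeholder for the search-and-merge stages of Section 4.2.3.

```cpp
#include <cstdio>
#include <thread>
#include <vector>

struct Flow { unsigned features[7]; };                 // M = 7 discretized features

// Placeholder for the per-flow searching + merging stages.
static int classify_flow(const Flow& f) { return static_cast<int>(f.features[0] % 8); }

// Run P concurrent classifiers, each handling a strided share of the batch.
void classify_concurrently(const std::vector<Flow>& flows, std::vector<int>& out, int P) {
    out.resize(flows.size());
    std::vector<std::thread> workers;
    for (int p = 0; p < P; ++p)
        workers.emplace_back([&, p] {
            for (size_t i = p; i < flows.size(); i += P)   // strided share of the batch
                out[i] = classify_flow(flows[i]);
        });
    for (auto& w : workers) w.join();
}

int main() {
    std::vector<Flow> batch(10000);                     // one 10 K-flow batch
    std::vector<int> classes;
    classify_concurrently(batch, classes, 32);          // P = 32 concurrent classifiers
    std::printf("classified %zu flows\n", classes.size());
}
```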
As 110 0 200 400 600 800 0 50 100 150 200 128 256 512 1024 2048 Latency (ns) Throughput (MCPS) No. of decision-tree leaves AMD_Throughput Intel_Throughput AMD_Latency Intel_Latency Figure 4.11: Varying the number of leaf nodes (J) P increases, the throughput and latency both increase. When we haveP > 32 concur- rent classifiers, the throughput on both of the two platforms degrades dramatically due to resource contention among a large number of threads. However, we expect higher performance when more processor cores can be utilized; this means our approach can be extended to many-core platforms. In this thesis, we chooseP = 32 to achieve both high throughput and low latency. With this value, we achieve 134:15 MCPS throughput with 238:53 ns processing latency on the AMD platform; we achieve 102:07 MCPS throughput with 313:48 ns processing latency on the Intel platform. 4.2.8.2 Varying the Number of Leaf Nodes In Figure 4.11, we show the throughput of our approach with respect to various values of J. We chooseP = 32 andM = 7 in this figure, although we can see consistent perfor- mance for other values ofP andM as well. For large values ofJ, we created synthetic C4.5 decision-trees as discussed in Section 4.2.6. AsJ increases, the performance on both platforms tapers; we will discuss the reasons later in Section 4.2.9. 111 0 200 400 600 800 0 50 100 150 200 7 8 9 10 11 Latency (ns) Throughput (MCPS) No. of features AMD_Throughput Intel_Throughput AMD_Latency Intel_Latency Figure 4.12: Varying the number of features (M) 4.2.8.3 Varying the Number of Features In Figure 4.12, we show the throughput of our approach with respect to various values ofM. We chooseP = 32 andJ = 92 in this figure; consistent performance can be seen for other values ofP andJ. We added a few more synthetic features withq (m) t 0:2N. Note that asM increases, the performance on both platforms tapers; we will also discuss the reasons in Section 4.2.9. 4.2.9 Performance Analysis 4.2.9.1 Latency Breakdown In Section 4.2.8.1, we observe that the throughput improves while the latency deterio- rates as P increases. The intuition behind this phenomenon is that more traffic flows can be classified in parallel, leading to better throughput. However, we have to pay the penalty of parallelization as well, since a large number of parallel threads compete for a limited amount of hardware resources (caches, cores, etc.); as a result, the latency for classifying a single packet flow increases. 112 0 100 200 300 400 2 4 8 16 32 Latency (ns) No. of concurrent classifiers AMD_Merging AMD_Searching Figure 4.13: Latency breakdown on the AMD platform 0 100 200 300 400 2 4 8 16 32 Latency (ns) No. of concurrent classifiers Intel_Merging Intel_Searching Figure 4.14: Latency breakdown on the Intel platform We show the time breakdown of the processing latency on the AMD and Intel plat- forms in Figure 4.13 and Figure 4.14, respectively (M = 7 and J = 92). Note the searching time contributes 80% 90% of the total processing latency. The searching time of each lookup grows sublinearly with respect toP . Since the throughput is pro- portional to P and inversely proportional to the processing latency, we expect higher throughputs for larger values of P with diminishing returns. Also note the searching 113 0 20 40 60 80 0 1 2 3 4 2 4 8 16 32 Context Switches/M Classif. LLC misses/Classification No. 
4.2.9.2 Cache Misses and Context Switches

Figure 4.15: Cache misses and context switches with respect to P (LLC misses per classification and context switches per million classifications, AMD and Intel)

We show in Figure 4.15 the cache performance (AMD L3 and Intel L3) as well as the number of context switches (AMD CS and Intel CS). Recall that each L3 cache miss is followed by a main memory access; hence we measure the Last-Level Cache (LLC, namely the L3 cache) misses per classification, since this is also an indicator of the number of main memory accesses. In this figure, we have chosen J = 128 and M = 7 as an example. Note the following:

- The Intel processor is integrated with larger L3 caches; the number of L3 cache misses is significantly less than that on the AMD processor.
- The Intel processor incurs more context switches per million classifications; note that the Intel processor has smaller L2 caches and a slower clock frequency. These factors lead to inferior overall performance.

Section 4.2.8.2 and Section 4.2.8.3 show that large values of J or M can have negative effects on the performance of our approach. The reasons are:

1. As J or M increases, more calculations and more memory accesses are performed in a single classification.

2. As J or M increases, the total size of the hash tables increases; it is more difficult to fit all the data structures in the lower-level caches.

Figure 4.16: Cache misses and context switches with respect to J (LLC misses per classification and context switches per million classifications, AMD and Intel)

As an example, in Figure 4.16, we show the number of LLC misses and the number of context switches with respect to J. As J increases, the overall performance of our classification engines, as shown in Figure 4.11, degrades due to the increasing numbers of cache misses and context switches.
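Counts of this kind can be collected with hardware performance counters and OS accounting. The sketch below shows one way to do this on Linux; it is a minimal, illustrative example and is not the measurement code used in this thesis. It reads the generic cache-miss event through the perf_event interface (PERF_COUNT_HW_CACHE_MISSES approximates LLC misses on most processors) and the context-switch counts through getrusage.

```cpp
#include <linux/perf_event.h>
#include <sys/ioctl.h>
#include <sys/resource.h>
#include <sys/syscall.h>
#include <unistd.h>
#include <cstdio>
#include <cstring>

// Thin wrapper around the perf_event_open(2) syscall (glibc provides no wrapper).
static int open_counter(unsigned type, unsigned long long config) {
    perf_event_attr attr;
    std::memset(&attr, 0, sizeof(attr));
    attr.size = sizeof(attr);
    attr.type = type;
    attr.config = config;
    attr.disabled = 1;
    attr.exclude_kernel = 1;
    return (int)syscall(SYS_perf_event_open, &attr, 0 /* this process */, -1, -1, 0);
}

int main() {
    int fd = open_counter(PERF_TYPE_HARDWARE, PERF_COUNT_HW_CACHE_MISSES);
    if (fd < 0) { std::perror("perf_event_open"); return 1; }

    ioctl(fd, PERF_EVENT_IOC_RESET, 0);
    ioctl(fd, PERF_EVENT_IOC_ENABLE, 0);
    // ... run a batch of classifications here ...
    ioctl(fd, PERF_EVENT_IOC_DISABLE, 0);

    long long llc_misses = 0;
    read(fd, &llc_misses, sizeof(llc_misses));
    close(fd);

    rusage ru;                                        // voluntary + involuntary context switches
    getrusage(RUSAGE_SELF, &ru);

    std::printf("LLC misses: %lld, context switches: %ld\n",
                llc_misses, ru.ru_nvcsw + ru.ru_nivcsw);
    return 0;
}
```

Dividing both counters by the number of classifications executed in the measured region yields per-classification figures comparable to those plotted in Figure 4.15 and Figure 4.16.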
4.3 Comparison of Traffic Classification Approaches

In this section, we first compare the performance on various platforms in Section 4.3.1. Then we compare our work with prior works in Section 4.3.2.

4.3.1 Comparison between Various Platforms

We compare the implementations on FPGA and on multi-core processors in Table 4.8. In this table, we use two decision-trees as examples.

Table 4.8: Comparison between various platforms

                                   FPGA        Multi-core platform
Decision-tree 1
  No. of leaf nodes                92          92
  Total feature width (bits)       76          76
  Throughput (MPPS)                533         134
  Processing latency (per pkt.)    1.59 µs     239 ns
Decision-tree 2
  No. of leaf nodes                2 K         2 K
  Total feature width (bits)       320         320
  Throughput (MPPS)                365         71
  Processing latency (per pkt.)    37.41 µs    124 ns

As can be seen:

FPGA: FPGA-based traffic classification engines can achieve very high throughput by exploiting deeply pipelined architectures. The massive parallelism and the localized memory accesses to small SRAM blocks allow FPGA to beat multi-core General Purpose Processor (GPP)-based platforms with respect to throughput. On FPGA, however, there is a limit on the size of the decision-tree. The largest design that can be supported corresponds to 2048 leaf nodes and a 320-bit total feature width, limited by the available on-chip hardware resources. Although off-chip memory can be exploited for large trees, the long access latency to off-chip memory often deteriorates the performance.

Multi-core platform: Our decomposition-based implementation on multi-core platforms is scalable with respect to the number of processor cores; we expect even higher throughput when using more processor cores. Using an optimized memory hierarchy, our designs on multi-core platforms sustain lower per-packet latency compared to our FPGA-based designs; at the same time, we can support very large decision-trees if we are willing to compromise the performance.

The major drawbacks of multi-core platforms are: (1) although the per-packet latency is low, the processing latency of an entire batch can be very high, since each batch can contain 10 K packets; (2) the throughput, compared to the FPGA-based classification engines, is lower. While our FPGA-based implementations are application-specific, our software implementations are based on general-purpose machines.

As discussed in Section 3.3, it is possible to explore other accelerators, e.g., General Purpose Graphics Processing Units (GPGPU), on the multi-core platforms; however, these accelerators usually introduce long processing latency [82, 92]. We do not discuss these accelerators further.

4.3.2 Comparison with Prior Works

We compare our approaches with prior works in Table 4.9. As can be seen, the processing latency is not measured in many existing works [45, 54].

Table 4.9: Comparison with prior works

Multi-core GPP:
  Este et al. [43]       SVM    Dual Xeon, 2.6 GHz, 24 cores, 48 GB RAM    2 MCPS       191 ns
  Gringoli et al. [45]   SVM    Dual Xeon, 2.6 GHz, 12 cores, 24 GB RAM    7.44 MCPS    not measured
  This thesis [102]      C4.5   Dual Xeon, 2.6 GHz, 16 cores, 64 GB RAM    134 MCPS     239 ns

FPGA (hardware virtualization in parentheses):
  Groleat et al. [46]    SVM    synthesized on Virtex-5 (no)               290 MCPS     243.16 µs
  Tong et al. [103]      C4.5   reimplemented on Virtex-7 (no)             520 MCPS     0.22 µs
  Jiang et al. [54]      k-NN   place-and-routed on Virtex-5 (yes)         125 MCPS     not measured
  This thesis [91]       C4.5   place-and-routed on Virtex-7 (yes)         533 MCPS     1.59 µs

We make the following observations:

- Compared to Este et al. [43], we have achieved a 67x improvement with respect to throughput, at a cost of 25% more processing latency per packet. The reasons include: (1) we target a state-of-the-art multi-core GPP, and (2) the C4.5 decision-tree-based approach requires much less computation than the SVM-based approach.

- Compared to prior works on FPGA [46, 103], our classification engines on FPGA not only improve the throughput performance, but also support hardware virtualization.

- Compared to existing dynamically updatable classification engines [54], we have achieved 4x throughput by using our customized pipelined architectures.

In summary, our classification engines on FPGA and multi-core platforms have both advanced the state-of-the-art Internet traffic classification approaches. For any specific system requirements, users have sufficient flexibility to choose one over the other based on the performance tradeoffs between FPGA and multi-core processors.

Chapter 5: Conclusion

The network classification problems, including both multi-field packet classification and Internet traffic classification, have been the focus of much research in recent years. In this thesis, we studied two Internet application kernels in the context of evolving Internet infrastructures, optimized their solutions on state-of-the-art parallel architectures, and improved their performance with respect to overall throughput and processing latency.
We systematically explored, analyzed, and optimized network classification engines on FPGA and multi-core processors:

- We investigated novel solutions for multi-field packet classification, using pipelined architectures on FPGA and parallel algorithms on multi-core platforms.
- We exploited parallel data structures on both FPGA and multi-core processors for Internet traffic classification, based on a conversion technique from a decision-tree to a compact rule set table.
- We evaluated and compared the performance of our designs on FPGA and multi-core platforms to exploit flexible and optimal performance tradeoffs among various parameters.

The research presented in this thesis substantiates the following understandings:

- Algorithmic and architectural optimizations of this class of data-intensive problems on parallel architectures result in more significant performance improvement than data- or input-specific optimizations and brute-force solutions.
- Deeper understanding of the nature of a class of problems can be obtained through theoretical analysis and intelligent modeling, such as the extensive analysis we devised for both multi-field packet classification and Internet traffic classification.
- Various parallel design methodologies, including pipelining, field splitting, rule clustering, and modular composition, can be used jointly to arrive at novel solution designs.

5.1 Summary of Contributions

From the perspective of solving multi-field packet classification on both FPGA and multi-core architectures, our research leads to the following conclusions:

1. A 2-dimensional pipelined architecture constructed from (1) splitting the long fields and (2) clustering multiple rules achieves better performance than the single pipeline used in the Field-Split Bit Vector (FSBV) approach.

2. The modular construction of each Processing Element (PE) leads to automated circuit generation on FPGA with scalable throughput, high power efficiency, and support for dynamic updates.

3. Parallel decomposition-based algorithms on multi-core processors, including parallel searching and linear merging algorithms, demonstrate better scalability than decision-tree-based approaches.

4. Tradeoffs between throughput and latency can be made by deploying parallel program threads; oversubscription of program threads can result in inferior performance.

From the perspective of solving Internet traffic classification on FPGA and multi-core architectures, our research leads to the following conclusions:

1. An arbitrary decision-tree can be converted to a Rule Set Table (RST), leading to an efficient mapping from the data structures to a systolic array on FPGA (a sketch of this style of conversion follows this list).

2. A snake-like pipeline in the systolic array facilitates hardware virtualization by supporting dynamic updates with very little FPGA resource overhead.

3. A translation from a C4.5 decision-tree to multiple compact hash tables enables parallelization of the online classification process on multi-core processors.

4. Accessing multiple hash tables in parallel demonstrates better performance, with respect to both throughput and latency, than a classic tree-based search, especially for imbalanced decision-trees.
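To make the first conclusion concrete, the sketch below shows the kind of conversion it refers to: every root-to-leaf path of a binary decision-tree becomes one rule, i.e., a conjunction of per-feature ranges plus a class label. The node layout, the names, and the assumption of "less-than-or-equal / greater-than" splits are ours for illustration; the actual RST representation used in this thesis may differ in detail.

```cpp
#include <algorithm>
#include <cstdint>
#include <limits>
#include <memory>
#include <vector>

constexpr int M = 7;             // number of features

// Hypothetical binary decision-tree node: internal nodes test one feature
// against a threshold (left: value <= threshold, right: value > threshold);
// leaves carry an application class label. Internal nodes have both children.
struct Node {
    bool is_leaf = false;
    int feature = -1;            // valid for internal nodes
    uint32_t threshold = 0;      // valid for internal nodes
    int label = -1;              // valid for leaves
    std::unique_ptr<Node> left, right;
};

// One rule of the Rule Set Table: a [lo, hi] range per feature plus a label.
struct Rule {
    uint32_t lo[M], hi[M];
    int label;
};

// Collect one rule per leaf: every root-to-leaf path narrows the per-feature
// ranges at each internal node it passes through.
static void collect(const Node* n, Rule r, std::vector<Rule>& rst) {
    if (n->is_leaf) { r.label = n->label; rst.push_back(r); return; }
    Rule le = r, gt = r;
    le.hi[n->feature] = std::min(le.hi[n->feature], n->threshold);      // value <= threshold
    gt.lo[n->feature] = std::max(gt.lo[n->feature], n->threshold + 1);  // value >  threshold
    collect(n->left.get(), le, rst);
    collect(n->right.get(), gt, rst);
}

std::vector<Rule> tree_to_rst(const Node& root) {
    Rule all{};                  // unconstrained rule: full range on every feature
    for (int f = 0; f < M; ++f) {
        all.lo[f] = 0;
        all.hi[f] = std::numeric_limits<uint32_t>::max();
    }
    all.label = -1;
    std::vector<Rule> rst;
    collect(&root, all, rst);
    return rst;                  // J rules, one per leaf
}
```

The resulting table has exactly J rules, one per leaf, which is the flat form that can then be laid out row by row across an array of PEs or indexed per feature.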
5.2 Future Work

The two application kernels we have studied in this thesis, namely (1) multi-field packet classification and (2) Internet traffic classification, remain very challenging problems due to their requirements of high throughput, low latency, and scalability. The explosive growth of digital information in the form of web pages, network traffic, and scientific data is expected to continue in the foreseeable future, making large-scale packet or traffic classification even more challenging from both the computation and the memory bandwidth points of view. On the other hand, future FPGAs and multi-core processors are also expected to increase in performance with higher integration, improve in power efficiency, and expand with heterogeneous computing capabilities. These advances in parallel architectures present great opportunities for novel solutions that meet the emerging challenges.

5.2.1 Exploration on Packet Classification

Our 2-dimensional pipelined architecture can be further tailored for low-power operation on both FPGAs and ASICs; it can also be optimized to run on customized soft-cores on FPGA or on emerging many-core architectures. In all of the scenarios above, we believe studying the packet classification problem analytically greatly helps the development of novel solutions. Our extensive analysis also captures the intrinsic complexity of multi-field packet classification and can potentially be used to explore novel design choices on heterogeneous platforms.

For our decomposition-based packet classification on multi-core processors, the merging algorithm is performed linearly: all the RIDs stored in the RID sets have to be compared. In general, the merging phase is the performance bottleneck of our approach. One future direction is to design an efficient algorithm, e.g., based on hashing, that can identify the common RIDs in all the RID sets very quickly.

To better capture complex security threats, the state-of-the-art OpenFlow protocol uses adjustable rule sets. A rule set may not only expand or shrink along the vertical direction (e.g., insert or delete a rule), but also change along the horizontal direction (e.g., add or abandon a header field). Another future direction is to enable our architecture to handle these extensions.

5.2.2 Exploration on Traffic Classification

Our systolic array deployed for Internet traffic classification has the ability to self-reconfigure all of its PEs. This capability can be used not only for hardware virtualization, but also in the training phase of the ML-based traffic classification approach. For instance, by recording the input flow features and computing a threshold value, the entire classification engine can adapt to changes in the Internet traffic without the conversion between the C4.5 decision-tree and the systolic array. We expect this approach to reduce the design complexity of the entire classification engine.

Based on the design of our modular PEs, it may also be interesting to investigate a power-efficient architecture that schedules all the features in advance. For simplicity, assume each PE handles only one stride comparison (c = 1). For a specific row of PEs, any mismatched feature indicates that the input traffic flow does not match all the feature criteria in this row; hence, it makes little sense to continue the corresponding classification process in the following PEs of this row. We can turn off the data memory modules of the remaining PEs to save power. The PEs need to generate, combine, and propagate the power-gating signals to the remaining PEs of this row.

5.2.3 Beyond FPGA and Multi-core GPP

Future multi-core processors are integrating massively threaded computing resources such as General-Purpose Graphics Processing Units (GPGPU).
The transplantation of our approaches onto GPGPU will also lead to a class of novel algorithms and opti- mization techniques. For instance, we can implement soft-core PEs to match a large number of rules for packet classification, while each PE only matches a small subset of rules or a small number of fields; efficient algorithms are still required to reduce the data movement during the merging phase. Future FPGA devices are expected to integrate more logic and memory resources as well as GPP cores [30]. It will be interesting to combine and utilize both of our systolic array on hardware and parallel algorithms on software, each for matching a particu- lar type of rules, on future heterogeneous architectures. The communication overhead between the FPGA and processor cores has to be minimized in order to achieve high performance. 124 Besides the classification throughput and processing latency, it is also very chal- lenging to measure the power or energy on multi-core GPP, GPGPU, or heterogeneous platforms. The study of power-efficient software algorithms is an interesting topic, not only in the domain of Internet applications, but also in the context of large data, scientific computing, image / signal processing, communications, and computer security. 125 Bibliography [1] Amazon Virtual Private Cloud. http://docs.aws.amazon.com/AmazonVPC/latest/ UserGuide/VPC_ACLs.html. [2] AMD Opteron 6200 Series Processor. http://www.amd.com/us/products/server/processors/ 6000-series-platform/6200/Pages/6200-series- processors.aspx. [3] Available Pool of Unallocated IPv4 Internet Addresses Now Completely Emp- tied. https://www.icann.org/en/system/files/press- materials/release-03feb11-en.pdf. [4] Cisco 12816 Router. http://www.cisco.com. [5] Cisco ASR 1000 Series Aggregation Services Routers. http://www.cisco.com/en/US/prod/collateral/routers/ ps9343/data_sheet_c78-447652.pdf. [6] Cisco CSR-1 Carrier Routing System. http://www.cisco.com/en/US/prod/collateral/routers/ ps5763/prod_brochure0900aecd800f8118.pdf. [7] Cisco Security Appliance Command Line Configuration Guide. http://www.cisco.com/en/US/docs/security/asa/asa72/ configuration/guide/conf$_$gd.html. [8] DE4 NetFPGA. http://keb302.ecs.umass.edu/de4web/DE4_NetFPGA/. 126 [9] Free Pool of IPv4 Address Space Depleted. https://www.nro.net/news/ipv4-free-pool-depleted. [10] Home of The First Website. http://info.cern.ch/. [11] Huawei Technologies. http://www.huawei.com. [12] Intel Quick Path Interconnect. http://www.intel.com/content/www/us/en/io/quickpath- technology/quick-path-interconnect-introduction- paper.html. [13] Intel Xeon Processor E5-2470. http://ark.intel.com/products/64623/Intel-Xeon- Processor-E5-2470-20M-Cache-2_30-GHz-8_00-GTs-Intel- QPI. [14] Intel Xeon Processor X5570. http://ark.intel.com/products/37111/Intel-Xeon- Processor-X5570-8M-Cache-2_93-GHz-6_40-GTs-Intel-QPI. [15] Internet Live Stats. http://www.internetlivestats.com/. [16] Internet Traffic Classification. http://www.caida.org/research/traffic-analysis/ classification-overview/. [17] Introducing 6-pack: the First Open Hardware Modular Switch. https://code.facebook.com/posts/717010588413497/ introducing-6-pack-the-first-open-hardware-modular- switch/. [18] Juniper Networks T-series Routing Platforms. http://www.juniper.net/products/tseries/100051.pdf. [19] Juniper Networks T1600 Core Router. http://www.juniper.net. [20] Open Networking Foundation. https://www.opennetworking.org/. [21] Open Networking Research Center. http://onrc.stanford.edu/. 127 [22] OpenFlow Consortium. 
http://www.openflowswitch.org/. [23] OpenFlow Switch Specification. https://www.opennetworking.org/images/stories/ downloads/specification/openflow-spec-v1.3.0.pdf. [24] OpenFlow: The Next Generation in Networking Interoperability. http://www.bladenetwork.net/userfiles/file/OpenFlow- WP.pdf. [25] Software Defined Networking and Software-based Services with Intel Proces- sors. http://www.intel.com/content/dam/doc/white-paper/ communications-ia-software-defined-networking- paper.pdf. [26] TCP STatistic and Analysis Tool. http://tstat.tlc.polito.it/traces.shtml. [27] Technology Strategy Brief: Software Defined Networking (SDN) in the Enter- prise. http://www.enterasys.com/company/literature/SDN_ tsbrief.pdf. [28] Virtex-6 FPGA Family. http://www.xilinx.com/products/virtex6/index.htm. [29] Virtex-7 FPGA Family. http://www.xilinx.com/products/virtex7. [30] Zynq-7000 All Programmable SoC. http://www.xilinx.com/products/silicon-devices/soc/ zynq-7000/index.htm. [31] Hypertext Transfer Protocol HTTP/1.1. http://www.rfc-editor.org/info/rfc2616, June 1999. [32] FileZilla - The Free FTP Solution. https://filezilla-project.org/, June 2015. [33] R. Alshammari and A. N. Zincir-Heywood. Machine Learning Based Encrypted Traffic Classification: Identifying SSH and Skype. In IEEE Symposium on Com- putational Intelligence for Security and Defense Applications (CISDA), pages 289–296, 2009. 128 [34] AT&T. https://www.att.com/shop/internet.html. [35] L. Bernaille, R. Teixeira, I. Akodkenou, A. Soule, and K. Salamatian. Traffic Classification on the Fly. SIGCOMM Comput. Commun. Rev., 36(2):23–26, 2006. [36] G. Brebner. Softly Defined Networking. In Proc. of the 8th ACM/IEEE Symp. on Architectures for Networking and Communications Systems (ANCS), pages 1–2, 2012. [37] G. Brebner and W. Jiang. High-Speed Packet Processing using Reconfigurable Computing. IEEE Micro, 34(1):8–18, 2014. [38] Comcast. http://www.xfinity.com/cable-internet-packages.html? CMP=KNC-IQ_ID_64056331-VQ2-g-VQ6-49550278176-VQ16-c- pkw-comcast-pmt-e. [39] T. H. Cormen, C. E. Leiserson, R. L. Rivest, and C. Stein. Introduction to Algo- rithms. The MIT Press, 3rd edition, 2009. [40] S. Dharmapurikar, P. Krishnamurthy, T. Sproull, and J. Lockwood. Deep Packet Inspection using Parallel Bloom Filters. IEEE Micro, 24(1):52–61, 2004. [41] W. Eatherton, G. Varghese, and Z. Dittia. Tree Bitmap: Hardware/software IP Lookups with Incremental Updates. SIGCOMM Computer Communication Review, 34(2):97–122, 2004. [42] N. Egi, A. Greenhalgh, M. Handley, M. Hoerdt, F. Huici, and L. Mathy. Towards High Performance Virtual Routers on Commodity Hardware. In Proceedings of the 2008 ACM CoNEXT Conference (CoNEXT), pages 20:1–20:12, 2008. [43] A. Este, F. Gringoli, and L. Salgarelli. On-line SVM Traffic Classification. In Proc. of 7th International Wireless Communications and Mobile Computing Con- ference (IWCMC), pages 1778–1783, 2011. [44] T. Ganegedara and V . Prasanna. StrideBV: Single Chip 400G+ Packet Classifi- cation. In High Performance Switching and Routing (HPSR), 2012 IEEE 13th International Conference on, pages 1–6, 2012. [45] F. Gringoli, L. Nava, A. Este, and L. Salgarelli. MTCLASS: Enabling Statistical Traffic Classification of Multi-gigabit Aggregates on Inexpensive Hardware. In 8th International Wireless Communications and Mobile Computing Conference (IWCMC), pages 450–455, 2012. 129 [46] T. Groleat, M. Arzel, and S. Vaton. Hardware Acceleration of SVM-based Traf- fic Classification on FPGA. 
In 8th International Wireless Communications and Mobile Computing Conference (IWCMC), pages 443–449, Aug 2012. [47] P. Gupta and N. McKeown. Classifying Packets with Hierarchical Intelligent Cuttings. IEEE Micro, 20(1):34–41, 2000. [48] P. Gupta and N. McKeown. Algorithms for Packet Classification. IEEE Network, 15(2):24–32, 2001. [49] M. Halkidi and I. Koutsopoulos. Online Clustering of Distributed Streaming Data Using Belief Propagation Techniques. In Proc. of 12th IEEE International Conference on Mobile Data Management (MDM), pages 216–225, 2011. [50] M. Hall, E. Frank, G. Holmes, B. Pfahringer, P. Reutemann, and I. H. Witten. The WEKA Data Mining Software: An Update. SIGKDD Explor. Newsletter, 11(1):10–18, 2009. [51] S. Han, K. Jang, K. Park, and S. Moon. PacketShader: A GPU-accelerated Soft- ware Router. SIGCOMM Comput. Commun. Rev., 40(4):195–206, 2010. [52] E. N. Harris, S. L. Wasmundt, L. De Carli, K. Sankaralingam, and C. Estan. LEAP: Latency- Energy- and Area-optimized Lookup Pipeline. In Proc. of the eighth ACM/IEEE symposium on Architectures for Networking and Communica- tions Systems (ANCS), pages 175–186, 2012. [53] G. S. Jedhe, A. Ramamoorthy, and K. Varghese. A Scalable High Throughput Firewall in FPGA. In Proc. of IEEE Symposium on Field-Programmable Custom Computing Machines (FCCM), pages 802–807, 2008. [54] W. Jiang and M. Gokhale. Real-Time Classification of Multimedia Traffic Using FPGA. In Proc. of 2010 International Conference on Field Programmable Logic and Applications (FPL), pages 56–63, 2010. [55] W. Jiang and V . K. Prasanna. Field-split Parallel Architecture for High Perfor- mance Multi-match Packet Classification using FPGAs. In Proc. of the 21st Annual Symposium on Parallelism in Algorithms and Architectures (SPAA), pages 188–196, 2009. [56] W. Jiang and V . K. Prasanna. Scalable Packet Classification on FPGA. IEEE Trans. VLSI Syst., 20(9):1668–1680, 2012. [57] M. Kende. Digital Handshake: Connecting Internet Backbones. CommLaw Con- spectus, 11:45, 2003. 130 [58] A. Kennedy, X. Wang, Z. Liu, and B. Liu. Low Power Architecture for High Speed Packet Classification. In Proc. of 2008 Symposium on Architectures for Networking and Communications Systems (ANCS), pages 131–140, 2008. [59] H. Kim, K. Claffy, M. Fomenkov, D. Barman, M. Faloutsos, and K. Lee. Inter- net Traffic Classification Demystified: Myths, Caveats, and the Best Practices. In Proceedings of the 2008 ACM CoNEXT Conference (CoNEXT), pages 11:1– 11:12, 2008. [60] T. Koponen. Software is the Future of Networking. In Proc. of the 8th ACM/IEEE Symp. on Architectures for Networking and Communications Systems (ANCS), pages 135–136, 2012. [61] S. Kumar, M. Becchi, P. Crowley, and J. Turner. CAMP: Fast and Efficient IP Lookup Architecture. In Proc. of 2005 Symposium on Architectures for Network- ing and Communications Systems (ANCS), pages 51–60, 2006. [62] V . Kumar. What do P2P Applications do and How to block Peer to Peer Applications (P2P) using Symantec Endpoint Protection? http://www.symantec.com/connect/articles/what-do-p2p- applications-do-and-how-block-peer-peer-applications- p2p-using-symantec-endpoin, November 2009. [63] T. V . Lakshman and D. Stiliadis. High-Speed Policy-Based Packet Forwarding Using Efficient Multi-Dimensional Range Matching. In Proc. of the 1998 con- ference on Applications, technologies, architectures, and protocols for computer communications (SIGCOMM), pages 203–214, 1998. [64] B. Lantz, B. Heller, and N. McKeown. 
A Network in a Laptop: Rapid Prototyping for Software-defined Networks. In Proc. of the 9th ACM SIGCOMM Workshop on Hot Topics in Networks (Hotnets-IX), pages 19:1–19:6, 2010. [65] H. Le, T. Ganegedara, and V . K. Prasanna. Memory-efficient and Scalable Vir- tual Routers Using FPGA. In Proc. of ACM/SIGDA International Symposium on Field-Programmable Gate Arrays (FPGA), pages 257–266, 2011. [66] Z. Lin, C. Lo, and P. Chow. K-means Implementation on FPGA for High- dimensional Data using Triangle Inequality. In Proc. of 22nd International Con- ference on Field Programmable Logic and Applications (FPL), pages 437–442, 2012. [67] Y . Luo, P. Cascon, E. Murray, and J. Ortega. Accelerating OpenFlow Switch- ing with Network Processors. In Proc. of 2009 Symposium on Architectures for Networking and Communications Systems (ANCS), pages 70–71, 2009. 131 [68] Y . Luo, K. Xiang, and S. Li. Acceleration of Decision Tree Searching for IP Traf- fic Classification. In Proc. of 2008 Symposium on Architectures for Networking and Communications Systems (ANCS), pages 40–49, 2008. [69] Y . Ma, S. Banerjee, S. Lu, and C. Estan. Leveraging Parallelism for Multi- dimensional Packet Classification on Software Routers. SIGMETRICS Perform. Eval. Rev., 38(1):227–238, 2010. [70] L. F. Manfroi, M. Ferro, A. M. Yokoyama, A. R. Mury, and B. Schulze. A Walking Dwarf on the Clouds. In IEEE/ACM 6th International Conference on Utility and Cloud Computing (UCC), pages 399–404, 2013. [71] N. McKeown, T. Anderson, H. Balakrishnan, G. Parulkar, L. Peterson, J. Rexford, S. Shenker, and J. Turner. OpenFlow: Enabling Innovation in Campus Networks. SIGCOMM Comput. Commun. Rev., 38(2):69–74, 2008. [72] A. Mitra, W. Najjar, and L. Bhuyan. Compiling PCRE to FPGA for accelerat- ing SNORT IDS. In Proc. of 2007 ACM/IEEE Symposium on Architecture for Networking and Communications Systems (ANCS), pages 127–136, 2007. [73] A. W. Moore and D. Zuev. Internet Traffic Classification Using Bayesian Analysis Techniques. SIGMETRICS Perform. Eval. Rev., 33(1):50–60, 2005. [74] J. Naous, D. Erickson, G. A. Covington, G. Appenzeller, and N. McKeown. Implementing an OpenFlow Switch on the NetFPGA Platform. In Proc. of the 4th ACM/IEEE Symposium on Architectures for Networking and Communica- tions Systems (ANCS), pages 1–9, 2008. [75] M. G. Naugle. Network Protocol Handbook. McGraw-Hill, Inc., 1st edition, 1998. [76] T. Nguyen and G. Armitage. A Survey of Techniques for Internet Traffic Clas- sification using Machine Learning. IEEE Communications Surveys Tutorials, 10(4):56–76, 2008. [77] A. Nikitakis and I. Papaefstathiou. A Memory-Efficient FPGA-based Classifi- cation Engine. In Proc. of IEEE Symposium on Field-Programmable Custom Computing Machines (FCCM), pages 802–807, 2008. [78] Tesla GPU Accelerators for Servers. http://www.nvidia.com/object/tesla-servers.html. [79] OpenMP. http://openmp.org/wp/. [80] R. Oppliger. Internet Security: Firewalls and Beyond. Commun. ACM, 40(5):92– 102, 1997. 132 [81] R. Pagh and F. F. Rodler. Cuckoo Hashing. Journal of Algorithms, 51(2):122– 144, May 2004. [82] PCI-SIG. https://www.pcisig.com/home. [83] L. Peng, W. Lu, and L. Duan. Power Efficient IP Lookup with Supernode Caching. In IEEE Global Telecommunications Conference (GLOBECOM), pages 215 –219, nov. 2007. [84] I. Petri, M. Punceva, O. Rana, and G. Theodorakopoulos. Broker Emergence in Social Clouds. In IEEE 6th International Conference on Cloud Computing (CLOUD), pages 669–676, 2013. [85] B. Pfaff, J. Pettit, K. Amidon, M. Casado, T. Koponen, and S. Shenker. 
Extending Networking into the Virtualization Layer. In Proc. of workshop on Hot Topics in Networks (HotNets-VIII), 2009. [86] F. Pong, N.-F. Tzeng, and N.-F. Tzeng. HaRP: Rapid Packet Classification via Hashing Round-Down Prefixes. IEEE Transaction on Parallel and Distributed Systems, 22(7):1105 –1119, 2011. [87] V . Puˇ s and J. Korenek. Fast and Scalable Packet Classification Using Perfect Hash Functions. In Proceedings of the ACM/SIGDA International Symposium on Field Programmable Gate Arrays (FPGA), pages 229–236, 2009. [88] Y . Qi, J. Fong, W. Jiang, B. Xu, J. Li, and V . Prasanna. Multi-dimensional Packet Classification on FPGA: 100 Gbps and beyond. In International Conference on Field-Programmable Technology (FPT), pages 241–248, 2010. [89] Y . Qi, B. Xu, F. He, X. Zhou, J. Yu, and J. Li. Towards Optimized Packet Clas- sification Algorithms for Multi-Core Network Processors. In Proc. of 2007 Intl. Conf. on Parallel Processing (ICPP), page 2, 2007. [90] Y . Qu, S. Zhou, and V . K. Prasanna. Scalable Many-Field Packet Classification on Multi-core Processors. In Proceedings of the 2013 25th International Symposium on Computer Architecture and High Performance Computing (SBAC-PAD), pages 33–40, 2013. [91] Y . R. Qu and V . K. Prasanna. Enabling High Throughput and Virtualization for Traffic Classification on FPGA. In IEEE 23rd Annual International Symposium on Field-Programmable Custom Computing Machines (FCCM), pages 44–51, May 2015. [92] Y . R. Qu, H. H. Zhang, S. Zhou, and V . K. Prasanna. Optimizing Many-field Packet Classification on FPGA, Multi-core General Purpose Processor, and GPU. 133 In Proceedings of the Eleventh ACM/IEEE Symposium on Architectures for Net- working and Communications Systems (ANCS), pages 87–98, 2015. [93] Y . R. Qu, S. Zhou, and V . K. Prasanna. High-performance Architecture for Dynamically Updatable Packet Classification on FPGA. In Proceedings of the Ninth ACM/IEEE Symposium on Architectures for Networking and Communica- tions Systems (ANCS), pages 125–136, 2013. [94] M. A. Ruiz-sanchez, I. S. Antipolis, E. W. Biersack, S. Antipolis, W. Dabbous, and I. S. Antipolis. Survey and Taxonomy of IP Address Lookup Algorithms. IEEE Network, 15:8–23, 2001. [95] R. Sawyer. Calculating Total Power Requirements for Data Centers, White Paper #3, American Power Conversion. http://www.apcmedia.com/salestools/VAVR-5TDTEF_R0_EN. pdf, 2004. [96] S. Shenker, M. Casado, T. Koponen, and N. McKeown. The Future of Network- ing, and the Past of Protocols. opennetsummit.org/talks/shenker-tue.pdf, 2012. [97] S. Singh, F. Baboescu, G. Varghese, and J. Wang. Packet Classification Using Multidimensional Cutting. In Proc. of the 2003 conference on Applications, Technologies, Architectures, and Protocols for Computer Communications (SIG- COMM), pages 213–224, 2003. [98] H. Song and J. W. Lockwood. Efficient Packet Classification for Network Intru- sion Detection using FPGA. In Proc. of 13th International Symposium on Field Programmable Gate Arrays (FPGA), pages 238–245, 2005. [99] V . Srinivasan and G. Varghese. Fast Address Lookups using Controlled Prefix Expansion. ACM Trans. on Computer Systems, 17:1–40, 1999. [100] D. E. Taylor. Survey and Taxonomy of Packet Classification Techniques. ACM Computing Surveys, 37(3):238–275, 2005. [101] T. N. Thinh, S. Kittitornkun, and S. Tomiyama. Applying Cuckoo Hashing for FPGA-based Pattern Matching in NIDS/NIPS. In Proc. of IEEE International Conference on Field-Programmable Technology (FPT), pages 121 –128, 2007. [102] D. Tong, Y . Qu, and V . 
Prasanna. High-throughput Traffic Classification on Multi- core Processors. In IEEE 15th International Conference onHigh Performance Switching and Routing (HPSR), pages 138–145, July 2014. 134 [103] D. Tong, L. Sun, K. Matam, and V . Prasanna. High Throughput and Pro- grammable Online Trafficclassifier on FPGA. In Proceedings of the ACM/SIGDA International Symposium on Field Programmable Gate Arrays (FPGA), pages 255–264, 2013. [104] B. Vamanan and T. N. Vijaykumar. Treecam: Decoupling updates and lookups in packet classification. In Proceedings of the Seventh COnference on Emerging Networking EXperiments and Technologies (CoNEXT), pages 27:1–27:12, 2011. [105] H.-S. Wang, L.-S. Peh, and S. Malik. A Power Model for Routers: Modeling Alpha 21364 and InfiniBand Routers. IEEE Micro, 23(1):26–35, 2003. [106] P. Warkhede, S. Suri, and G. Varghese. Multiway Range Trees: Scalable IP Lookup with Fast Updates. Computer Networks, 44(3):289 – 303, 2004. [107] M. Weiss. Sprint to Scale Core Network to 40G/100G and later 400G with Ciena’s 6500 Packet-Optical Platform. http://community.comsoc.org/blogs/michaelweiss/ sprint-scale-core-network-40g100g-and-later-400g- cienas-6500-packet-optical-platf, June 2012. [108] T. Y . C. Woo. A Modular Approach to Packet Classification: Algorithms and Results. In Proc. of the 19th conference on Information Communications (INFO- COM), pages 1213–1222, 2000. [109] Y .-H. E. Yang and V . K. Prasanna. High Throughput and Large Capacity Pipelined Dynamic Search Tree on FPGA. In Proceedings of the 18th annual ACM/SIGDA international symposium on Field Programmable Gate Arrays (FPGA), pages 83–92, 2010. [110] F. Yu, R. H. Katz, and T. V . Lakshman. Efficient Multimatch Packet Classification and Lookup with TCAM. IEEE Micro, 25(1):50–59, 2005. [111] M. Yu, L. Jose, and R. Miao. Software Defined Traffic Measurement with OpenS- ketch. In the 10th USENIX Symposium on Networked Systems Design and Imple- mentation (NSDI), pages 29–42. USENIX, 2013. [112] Y . Zhang, S. Singh, S. Sen, N. Duffield, and C. Lund. Online Identification of Hierarchical Heavy Hitters: Algorithms, Evaluation, and Applications. In Proceedings of the 4th ACM SIGCOMM Conference on Internet Measurement (IMC), pages 101–114, 2004. [113] P. Zhong. An IPv6 Address Lookup Algorithm based on Recursive Balanced Multi-way Range Trees with Efficient Search and Update. In Proc. of Int Com- puter Science and Service System (CSSS) Conf, pages 2059–2063, 2011. 135 [114] S. Zhou, Y . R. Qu, and V . K. Prasanna. Multi-core Implementation of Decomposition-based Packet ClassificationAlgorithms. In Proc. of the 12th Inter- national Conference on Parallel Computing Technologies (PaCT), pages 105– 119, 2013. 136
Abstract
The Internet backbone, including both core and edge routers, is becoming more flexible, scalable, and programmable to enable future innovations in the next generation Internet. While the functionality of Internet routers evolves, performance remains a major concern for real-life deployment. In this thesis, we propose novel algorithms, constructions, and optimization techniques on two prominent classes of parallel architectures: Field-Programmable Gate Arrays (FPGAs) and multi-core General Purpose Processors (GPP). We focus on high-performance algorithmic solutions for two Internet application kernels: multi-field packet classification and Internet traffic classification.

For packet classification, we focus on algorithmic solutions to support high throughput and dynamic updates. We extend the decomposition-based packet classification approaches onto FPGA and multi-core processors. On FPGA, we present a 2-dimensional pipelined architecture composed of fine-grained Processing Elements (PE). Efficient power optimization techniques are also proposed for this architecture. On multi-core processors, we use range-trees and hashing to search each field of the input packet header individually in parallel. The partial results from all the fields are merged to produce the final packet header match. Our implementations support very large rule sets consisting of many fields.

For traffic classification, we present high-throughput and virtualized architectures for online traffic classification on FPGA. We provide a conversion from a decision-tree into a compact rule set table